distilabel download - distilabel source code download

distilabel

Other source code

1.4.1


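Synthesize data for AI and add feedback on the fly!

Distilabel is a framework for synthetic data and AI feedback, built for engineers who need fast, reliable and scalable pipelines based on verified research papers.

If you just want to get started, we recommend you check the documentation. Curious, and want to know more? Keep reading!

Why use distilabel?

Distilabel can be used to generate synthetic data and AI feedback for a wide variety of projects, including traditional predictive NLP (classification, extraction, etc.) and generative and large language model scenarios (instruction following, dialogue generation, judging, etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality datasets, using verified research methodologies for generating and judging data with AI feedback.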

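Improve your AI output quality through data quality

Compute is expensive and output quality is important. We help you focus on data quality, which tackles the root cause of both of these problems at once. Distilabel helps you synthesize and judge data, so you can spend your valuable time achieving and maintaining high quality standards for your data.

Take control of your data and models

Owning the data used to fine-tune your own LLMs is not easy, but Distilabel can help you get started. We integrate AI feedback from any LLM provider using one unified API.

Improve efficiency by quickly iterating on the right research and LLMs

Synthesize and judge data with the latest research papers while ensuring flexibility, scalability and fault tolerance, so you can focus on improving your data and training your models.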

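Community

We are an open-source, community-driven project and we love to hear from you. Here are some ways to get involved:

Community Meetup: listen in or present during one of our bi-weekly events.
Discord: get direct support from the community in #argilla-general and #argilla-help.
Roadmap: plans change, but we love to discuss them with our community, so feel encouraged to participate.

What do people build with distilabel?

The Argilla community uses distilabel to create amazing datasets and models.

The 1M OpenHermesPreference is a dataset of about 1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how distilabel can be used to synthesize data at an immense scale.
Our distilabeled Intel Orca DPO dataset and the improved OpenHermes model show how we improve model performance by filtering out 50% of the original dataset through AI feedback.
The haiku DPO data outlines how anyone can create a dataset for a specific task, and the latest research papers for improving dataset quality.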

Installation

pip install distilabel --upgrade

Requires Python 3.9+

In addition, the following extras are available:

LLMs

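anthropic: for using models available in the Anthropic API via the AnthropicLLM integration.
cohere: for using models available in Cohere via the CohereLLM integration.
argilla: for exporting the generated datasets to Argilla.
groq: for using models available in Groq, using the groq Python client, via the GroqLLM integration.
hf-inference-endpoints: for using Hugging Face Inference Endpoints via the InferenceEndpointsLLM integration.
hf-transformers: for using models available in the transformers package via the TransformersLLM integration.
litellm: for using LiteLLM to call any LLM in the OpenAI format via the LiteLLM integration.
llama-cpp: for using the llama-cpp-python Python bindings for llama.cpp via the LlamaCppLLM integration.
mistralai: for using models available in the Mistral AI API via the MistralAILLM integration.
ollama: for using Ollama and its available models via the OllamaLLM integration.
openai: for using OpenAI API models via the OpenAILLM integration, or the other integrations that are based on OpenAI and rely on its client, such as AnyscaleLLM, AzureOpenAILLM and TogetherLLM.
vertexai: for using Google Vertex AI proprietary models via the VertexAILLM integration.
vllm: for using the vLLM serving engine via the vLLM integration.
sentence-transformers: for generating sentence embeddings using sentence-transformers.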

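Structured generation

outlines: for using structured generation of LLMs with outlines.
instructor: for using structured generation of LLMs with Instructor.

Data processing

ray: for scaling and distributing a pipeline with Ray.
faiss-cpu and faiss-gpu: for generating sentence embeddings using faiss.
text-clustering: for text clustering with UMAP and scikit-learn.
minhash: for duplicate detection with MinHash, using datasketch and nltk.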
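
Several extras can be requested at once with pip's bracket syntax, for example `distilabel[argilla,openai]`. As a minimal sketch, the helper below assembles such a requirement string from extra names taken from the lists above. Note that `build_requirement` is a hypothetical function written for illustration and is not part of distilabel.

```python
def build_requirement(extras, package="distilabel"):
    """Return a pip requirement such as 'distilabel[argilla,openai]'.

    Hypothetical helper for illustration; it is not part of distilabel.
    """
    # Drop duplicates and sort so the result is stable across calls.
    names = sorted(set(extras))
    if not names:
        # No extras requested: the bare package name is enough.
        return package
    return f"{package}[{','.join(names)}]"


if __name__ == "__main__":
    # The printed string can be passed to: pip install "<requirement>" --upgrade
    print(build_requirement(["openai", "argilla"]))
```

Quoting the requirement on the command line keeps the shell from interpreting the square brackets.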

Example

To run the following example you must install distilabel with the hf-inference-endpoints extra:

pip install "distilabel[hf-inference-endpoints]" --upgrade

Then run:

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:
    # Load the dataset and rename its "prompt" column to "instruction".
    load_dataset = LoadDataFromHub(output_mappings={"prompt": "instruction"})

    # Generate a response for each instruction with a hosted Llama 3.1 model.
    text_generation = TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
        ),
    )

    # Connect the steps: loaded rows flow into the text generation task.
    load_dataset >> text_generation

if __name__ == "__main__":
    # Runtime parameters are keyed by step name.
    distiset = pipeline.run(
        parameters={
            load_dataset.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
    # Publish the generated dataset to the Hugging Face Hub.
    distiset.push_to_hub(repo_id="distilabel-example")
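
In the pipeline above, the `output_mappings` argument with `"prompt"` mapped to `"instruction"` renames the `prompt` column produced by the loading step to `instruction`, the column the text generation task reads in this example. The plain-Python sketch below only illustrates that renaming on a list of rows. `apply_output_mappings` is a hypothetical function, not distilabel's implementation, and the sample row is made up.

```python
def apply_output_mappings(rows, output_mappings):
    """Rename keys in each row according to `output_mappings` (old name -> new name).

    Hypothetical illustration only; distilabel performs this step internally.
    """
    renamed = []
    for row in rows:
        # Keys without a mapping are kept under their original name.
        renamed.append(
            {output_mappings.get(key, key): value for key, value in row.items()}
        )
    return renamed


if __name__ == "__main__":
    # Made-up sample row standing in for one record of the loaded dataset.
    rows = [{"prompt": "Write a haiku about data.", "source": "example"}]
    print(apply_output_mappings(rows, {"prompt": "instruction"}))
```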

Badges

If you build something cool with distilabel, consider adding one of these badges to your dataset or model card.

 [<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)

 [<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)

Contribute

To contribute directly to distilabel, check our good first issues or open a new issue.

Citation

@misc{distilabel-argilla-2024,
  author = {Álvaro Bartolomé Del Canto and Gabriel Martín Blázquez and Agustín Piqueres Lajarín and Daniel Vila Suero},
  title = {Distilabel: An AI Feedback (AIF) framework for building datasets with and for LLMs},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/argilla-io/distilabel}}
}


Additional information

Version: 1.4.1
Type: Other source code
Updated: 2025-02-28
Size: 6.48MB
Source: GitHub

