distilabel download - distilabel source code download

distilabel

1.4.1

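Synthesize data for AI and add feedback on the fly!

Distilabel is a framework for synthetic data and AI feedback, built for engineers who need fast, reliable, and scalable pipelines based on verified research papers.

If you just want to get started, we recommend checking the documentation. Curious and want to know more? Keep reading!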

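Why use distilabel?

Distilabel can be used to generate synthetic data and AI feedback for a wide variety of projects, including traditional predictive NLP (classification, extraction, etc.) and generative or large language model scenarios (instruction following, dialogue generation, judging, etc.). Distilabel's programmatic approach lets you build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly producing high-quality datasets, using verified research methods for generating and judging data with AI feedback.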

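Improve your AI output quality through data quality

Compute is expensive and output quality matters. We help you focus on data quality, which tackles the root cause of both problems at once. Distilabel helps you synthesize and judge data, so you can spend your valuable time achieving and maintaining high quality standards for your data.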

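Take control of your data and models

Owning the data used to fine-tune your own LLMs is not easy, but distilabel can help you get started. We integrate AI feedback from any LLM provider through one unified API.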

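Improve efficiency by quickly iterating on the right research and LLMs

Synthesize and judge data with the latest research papers while ensuring flexibility, scalability, and fault tolerance, so you can focus on improving your data and training your models.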

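Community

We are an open-source, community-driven project and we love to hear from you. Here are some ways to get involved:

Community Meetup: listen in or present during one of our bi-weekly events.
Discord: get direct support from the community in #argilla-general and #argilla-help.
Roadmap: plans change, but we love to discuss them with our community, so feel encouraged to participate.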

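What do people build with distilabel?

The Argilla community uses distilabel to create amazing datasets and models.

The 1M OpenHermesPreference is a dataset of about 1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how distilabel can be used to synthesize data at very large scale.
Our distilabeled Intel Orca DPO dataset and the improved OpenHermes model show how we improve model performance by filtering out 50% of the original dataset through AI feedback.
The haiku DPO data outlines how anyone can create a dataset for a specific task, and points to the latest research papers for improving dataset quality.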
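The filtering idea mentioned for the Intel Orca DPO dataset can be illustrated in plain Python. This is only a minimal sketch, not the procedure Argilla used: the record fields (`prompt`, `ai_score`), the score scale, and the median cutoff are all assumptions made for illustration.

```python
from statistics import median

# Hypothetical records: each pairs a prompt with a judge-assigned score.
# The field names and the 0-10 scale are assumptions for illustration.
records = [
    {"prompt": "Explain DPO.", "ai_score": 9.0},
    {"prompt": "Summarise this text.", "ai_score": 4.5},
    {"prompt": "Write a haiku.", "ai_score": 7.5},
    {"prompt": "Translate to French.", "ai_score": 2.0},
]


def keep_top_half(rows, score_key="ai_score"):
    """Keep rows whose score is strictly above the median score."""
    cutoff = median(row[score_key] for row in rows)
    return [row for row in rows if row[score_key] > cutoff]


filtered = keep_top_half(records)
print(len(records), "->", len(filtered))
```

With distinct scores, a median cutoff drops roughly half of the rows; a real pipeline would more likely pick a threshold from the judge's rubric.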

Installation

pip install distilabel --upgrade

Requires Python 3.9+
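If you want to verify the interpreter before installing, a small check like the following may help. It is a sketch using only the standard library; the helper name `python_is_supported` is hypothetical and not part of distilabel.

```python
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in the requirement above


def python_is_supported(version_info=sys.version_info):
    """Return True when the interpreter meets the minimum version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if not python_is_supported():
    print("distilabel requires Python 3.9 or newer")
```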

In addition, the following extras are available:

LLMs

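anthropic: for using models available in the Anthropic API via the AnthropicLLM integration.
cohere: for using models available in Cohere via the CohereLLM integration.
argilla: for exporting the generated datasets to Argilla.
groq: for using models available in Groq, using the groq Python client, via the GroqLLM integration.
hf-inference-endpoints: for using Hugging Face Inference Endpoints via the InferenceEndpointsLLM integration.
hf-transformers: for using models available in the transformers package via the TransformersLLM integration.
litellm: for using LiteLLM to call any LLM in the OpenAI format via the LiteLLM integration.
llama-cpp: for using the llama-cpp-python bindings for llama.cpp via the LlamaCppLLM integration.
mistralai: for using models available in the Mistral AI API via the MistralAILLM integration.
ollama: for using Ollama and its available models via the OllamaLLM integration.
openai: for using OpenAI API models via the OpenAILLM integration, or the other integrations that are based on OpenAI and rely on its client, such as AnyscaleLLM, AzureOpenAILLM, and TogetherLLM.
vertexai: for using Google Vertex AI proprietary models via the VertexAILLM integration.
vllm: for using the vLLM serving engine via the vLLM integration.
sentence-transformers: for generating sentence embeddings using sentence-transformers.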

Structured generation

outlines: for using structured generation of LLMs with outlines.
instructor: for using structured generation of LLMs with Instructor.

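Data processing

ray: for scaling and distributing a pipeline with Ray.
faiss-cpu and faiss-gpu: for generating sentence embeddings using faiss.
text-clustering: for text clustering with UMAP and scikit-learn.
minhash: for duplicate detection with MinHash, using datasketch and nltk.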
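To give a feel for what MinHash-based duplicate detection does, here is a minimal standard-library sketch. It does not use datasketch or nltk and is not distilabel's implementation; the function names, the word 3-gram shingles, and the choice of 64 hash seeds are assumptions for illustration.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a lower-cased text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(shingle, seed):
    """Hash a shingle to a 64-bit integer, salted by the seed."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "big")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text, num_perm=64):
    """Build a MinHash signature: for each seed, the minimum shingle hash."""
    items = shingles(text)
    return [min(_hash(s, seed) for s in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox jumps over the lazy cat"
c = "synthetic data pipelines need careful quality checks"

print("near-duplicate:", estimated_jaccard(minhash_signature(a), minhash_signature(b)))
print("unrelated:", estimated_jaccard(minhash_signature(a), minhash_signature(c)))
```

Pairs whose estimated similarity exceeds a chosen threshold would be treated as duplicates; libraries such as datasketch add locality-sensitive hashing so that not every pair has to be compared.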

Example

To run the following example, you must install distilabel with the hf-inference-endpoints extra:

pip install "distilabel[hf-inference-endpoints]" --upgrade

Then run:

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:
    # Load rows from the Hub and rename the "prompt" column to "instruction".
    load_dataset = LoadDataFromHub(output_mappings={"prompt": "instruction"})

    # Generate a response for each instruction with a hosted model.
    text_generation = TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
        ),
    )

    # Connect the steps: loaded rows flow into the generation task.
    load_dataset >> text_generation

if __name__ == "__main__":
    # Runtime parameters are keyed by step name.
    distiset = pipeline.run(
        parameters={
            load_dataset.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
    # Publish the generated dataset to the Hugging Face Hub.
    distiset.push_to_hub(repo_id="distilabel-example")
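The two less obvious parts of the example are the `>>` operator and `output_mappings`. The toy sketch below imitates both in plain Python so the data flow is easier to follow. `ToyStep`, `load_rows`, and `fake_generation` are hypothetical names, and this is not how distilabel is implemented internally; it only mirrors the observable behaviour of renaming an output column and chaining two steps.

```python
class ToyStep:
    """A toy stand-in for a pipeline step (not distilabel's Step class)."""

    def __init__(self, name, func, output_mappings=None):
        self.name = name
        self.func = func
        self.output_mappings = output_mappings or {}
        self.successors = []

    def __rshift__(self, other):
        # `a >> b` records b as a successor of a and returns b,
        # so chains like `a >> b >> c` read left to right.
        self.successors.append(other)
        return other

    def run(self, rows):
        """Run this step, rename its output columns, then feed any successor."""
        renamed = [
            {self.output_mappings.get(k, k): v for k, v in row.items()}
            for row in self.func(rows)
        ]
        # This sketch only supports a linear chain (at most one successor).
        if self.successors:
            return self.successors[0].run(renamed)
        return renamed


def load_rows(_rows):
    # Stands in for loading a dataset whose text column is called "prompt".
    return [{"prompt": "Write a haiku about data."}]


def fake_generation(rows):
    # Stands in for an LLM call; it expects an "instruction" column.
    return [{**row, "generation": f"(response to: {row['instruction']})"} for row in rows]


load_dataset = ToyStep("load_dataset", load_rows, output_mappings={"prompt": "instruction"})
text_generation = ToyStep("text_generation", fake_generation)

load_dataset >> text_generation
result = load_dataset.run([])
print(result)
```

Without the mapping, `fake_generation` would fail with a `KeyError` on `"instruction"`, which is the mismatch that `output_mappings={"prompt": "instruction"}` resolves in the real example.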

Badges

If you build something cool with distilabel, consider adding one of these badges to your dataset or model card.

 [<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)

 [<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)

Contribute

To contribute directly to distilabel, check our good first issues or open a new one.

Citation

@misc{distilabel-argilla-2024,
  author = {Álvaro Bartolomé Del Canto and Gabriel Martín Blázquez and Agustín Piqueres Lajarín and Daniel Vila Suero},
  title = {Distilabel: An AI Feedback (AIF) framework for building datasets with and for LLMs},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/argilla-io/distilabel}}
}

Additional information

Version: 1.4.1
Type: Other source code
Updated: 2025-02-28
Size: 6.48 MB
Source: GitHub
