Retrieval Augmented Generation for news

1.0.0

A fully open-source RAG (Retrieval Augmented Generation) application that provides summaries of related news articles. It is built with the ChromaDB vector database, the mixtral-8x7b-instruct-v0.1 LLM (through Replicate AI), a New York Times web scraper, the dhivyeshrk/bart-large-cnn-samsum fine-tuned model for text summarization, and sentence-transformers/sentence-t5-base embeddings from Hugging Face.

System Architecture

Data Collection

Data for the different categories of news articles was obtained from the following RSS feeds:

Technology: https://rss.nytimes.com/services/xml/rss/nyt/Technology.xml
Sports: https://rss.nytimes.com/services/xml/rss/nyt/Sports.xml
Science: https://rss.nytimes.com/services/xml/rss/nyt/science.xml
Health: https://rss.nytimes.com/services/xml/rss/nyt/Health.xml
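As an illustration of the collection step, here is a minimal sketch that reads one feed and keeps the headline, description and link of each item. It uses only the Python standard library; the function names `parse_feed` and `fetch_feed` and the dictionary keys are assumptions made for this example, not names taken from the project.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Feed URLs as listed above (only a subset is shown here).
FEEDS = {
    "technology": "https://rss.nytimes.com/services/xml/rss/nyt/Technology.xml",
    "sports": "https://rss.nytimes.com/services/xml/rss/nyt/Sports.xml",
    "science": "https://rss.nytimes.com/services/xml/rss/nyt/science.xml",
}


def parse_feed(xml_text, category):
    """Extract title, description and link from every <item> of an RSS document."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            # findtext returns None when the element is missing, hence the "or".
            "title": (item.findtext("title") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "category": category,
        })
    return articles


def fetch_feed(url, category, timeout=10):
    """Download one feed and parse it (hypothetical helper, requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_feed(response.read(), category)
```

A call such as `fetch_feed(FEEDS["science"], "science")` would then return a list of dictionaries, one per article in the feed.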

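The headline, description and domain of each news article are vectorized with sentence-t5-base embeddings and stored in a persistent ChromaDB client. The link to each article is stored in the metadata. In addition, the news for each domain is stored in a separate ChromaDB collection for efficient retrieval.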
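The storage scheme described above could look roughly like the following sketch. It assumes the `chromadb` and `sentence-transformers` packages are installed; the helper names, the `./chroma_store` path, the way the three fields are joined into one string, and the use of the article link as the record id are all assumptions for illustration and may differ from what the project does.

```python
def build_records(articles):
    """Turn article dicts into the parallel lists that a ChromaDB collection expects."""
    ids, documents, metadatas = [], [], []
    for article in articles:
        # Assumption: the article link is unique, so it doubles as a stable id.
        ids.append(article["link"])
        # Headline, description and domain are embedded together as one string.
        documents.append(
            f"{article['title']}. {article['description']} ({article['category']})"
        )
        metadatas.append({"link": article["link"]})
    return ids, documents, metadatas


def index_articles(articles, category, db_path="./chroma_store"):
    """Embed the articles of one category and store them in that category's collection."""
    # Imported lazily so build_records can be used without these packages.
    import chromadb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/sentence-t5-base")
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection(name=category)

    ids, documents, metadatas = build_records(articles)
    embeddings = model.encode(documents).tolist()
    # upsert avoids duplicate-id errors when a feed is indexed again.
    collection.upsert(
        ids=ids, embeddings=embeddings, documents=documents, metadatas=metadatas
    )
    return collection, model


def search(collection, model, query, k=5):
    """Return the k stored documents closest to the query, with their metadata."""
    result = collection.query(
        query_embeddings=model.encode([query]).tolist(), n_results=k
    )
    return list(zip(result["documents"][0], result["metadatas"][0]))
```

Keeping one collection per category means a query only has to be compared against articles from the relevant domain, which is the efficiency argument made above.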

Web Scraping

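Web scraping is done with the scraper provided by the NY Times API, which returns only about 40-60 words of each article. The paywall could easily be bypassed with BeautifulSoup4, but we are not sure about the legality of doing so.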

Data Format

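For prompt classification, we use the mixtral-8x7b-instruct-v0.1 model because of its strong capabilities, its cloud-based execution on Replicate AI, and the ease of preventing hallucinations. For text summarization, we use a fine-tuned version, hosted on Hugging Face, of the BART-large model originally proposed by Facebook. The model was trained on the CNN_DailyMail dataset and further fine-tuned on the SAMSum dataset, which yields a 103% improvement on the ROUGE-2 benchmark. It is a fairly light model, about 1.6 GB in size.

Links:
https://huggingface.co/dhivyeshrk/bart-large-cnn-samsum
https://replicate.com/mistralai/mixtral-8x7b-instruct-v0.1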
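To make the two model roles concrete, the sketch below classifies a user query into one of the news categories and summarizes retrieved text. It assumes the `replicate` and `transformers` packages are installed and that a `REPLICATE_API_TOKEN` environment variable is set. The prompt wording, the helper names and the length limits are assumptions for this example, not the project's actual prompt or settings.

```python
CATEGORIES = ["technology", "sports", "science", "health"]


def build_classification_prompt(query):
    """Build a prompt that restricts the model to the known categories."""
    options = ", ".join(CATEGORIES)
    return (
        "Classify the user query into exactly one of these news categories: "
        f"{options}. Reply with the category name only.\n\n"
        f"Query: {query}\nCategory:"
    )


def parse_category(reply):
    """Map the model reply onto a known category, or None if nothing matches."""
    text = reply.strip().lower()
    for category in CATEGORIES:
        if category in text:
            return category
    return None


def classify_query(query):
    """Ask Mixtral on Replicate which category the query belongs to."""
    import replicate  # imported lazily; reads REPLICATE_API_TOKEN from the environment

    output = replicate.run(
        "mistralai/mixtral-8x7b-instruct-v0.1",
        input={"prompt": build_classification_prompt(query)},
    )
    # Language models on Replicate stream their answer as chunks of text.
    return parse_category("".join(output))


def summarize(texts, max_length=120, min_length=30):
    """Summarize the retrieved article snippets with the fine-tuned BART model."""
    from transformers import pipeline

    summarizer = pipeline("summarization", model="dhivyeshrk/bart-large-cnn-samsum")
    result = summarizer(
        " ".join(texts), max_length=max_length, min_length=min_length, truncation=True
    )
    return result[0]["summary_text"]
```

Restricting the reply to a fixed list and discarding anything that does not match is one simple way to guard against hallucinated categories; the category returned by `classify_query` would then select which ChromaDB collection to search.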