A fully open-source RAG (Retrieval-Augmented Generation) application that provides summaries of related news articles. It is built with the ChromaDB vector database, the mixtral-8x7b-instruct-v0.1 LLM (through Replicate AI), a New York Times web scraper, the fine-tuned dhivyeshrk/bart-large-cnn-samsum model for text summarization, and sentence-transformers/sentence-t5-base embeddings from Hugging Face.
Data for the different news categories was obtained from the following RSS feeds:
Technology: https://rss.nytimes.com/services/xml/rss/nyt/Technology.xml
Sports: https://rss.nytimes.com/services/xml/rss/nyt/Sports.xml
Science: https://rss.nytimes.com/services/xml/rss/nyt/science.xml
Health: https://rss.nytimes.com/services/xml/rss/nyt/Health.xml
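As a rough illustration of how such a feed could be read, here is a minimal sketch that uses only the Python standard library. The function names `parse_feed` and `fetch_feed`, the `FEEDS` dictionary keys, and the shape of the returned dictionaries are assumptions made for this example and are not taken from the project's code.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# Feed URLs from the list above; the lowercase keys are an assumption of this sketch.
FEEDS = {
    "technology": "https://rss.nytimes.com/services/xml/rss/nyt/Technology.xml",
    "sports": "https://rss.nytimes.com/services/xml/rss/nyt/Sports.xml",
    "science": "https://rss.nytimes.com/services/xml/rss/nyt/science.xml",
}


def parse_feed(xml_text, category):
    """Turn RSS 2.0 XML (str or bytes) into a list of article dicts."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            # findtext returns None when the child element is missing
            "title": (item.findtext("title") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "category": category,
        })
    return articles


def fetch_feed(category, timeout=10):
    """Download one feed over HTTP and parse it (requires network access)."""
    with urlopen(FEEDS[category], timeout=timeout) as resp:
        return parse_feed(resp.read(), category)
```

Calling `fetch_feed("technology")` would return one dictionary per `<item>` in the feed. Other categories can be added to `FEEDS` in the same way.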
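The headline, description, and domain (category) of each news article are vectorized with sentence-t5-base embeddings and stored in a persistent ChromaDB client. The link to each article is stored in the metadata. News from each domain is kept in a separate ChromaDB collection for efficient retrieval.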
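The storage scheme described above could look roughly like the following sketch. It assumes the `chromadb` and `sentence-transformers` packages are installed. The helper names `group_records` and `store_articles`, the `./chroma_store` path, the use of the article link as the record ID, and the exact document string are all assumptions for illustration, not the project's actual code.

```python
def group_records(articles):
    """Group article dicts by category into (ids, documents, metadatas) lists."""
    grouped = {}
    for a in articles:
        ids, docs, metas = grouped.setdefault(a["category"], ([], [], []))
        ids.append(a["link"])  # assumption: the link is unique, so it serves as the ID
        # Headline, description and domain are embedded together as one document.
        docs.append(f"{a['title']}. {a['description']} ({a['category']})")
        metas.append({"link": a["link"]})
    return grouped


def store_articles(articles, db_path="./chroma_store"):
    """Write each category into its own persistent ChromaDB collection."""
    # Imported lazily so the pure helper above works without these packages.
    import chromadb
    from chromadb.utils import embedding_functions

    embed = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="sentence-transformers/sentence-t5-base"
    )
    client = chromadb.PersistentClient(path=db_path)
    for category, (ids, docs, metas) in group_records(articles).items():
        collection = client.get_or_create_collection(
            name=category, embedding_function=embed
        )
        # upsert avoids errors when the same article is ingested twice
        collection.upsert(ids=ids, documents=docs, metadatas=metas)
```

Keeping one collection per category means a query only searches vectors from the relevant domain, which is the retrieval benefit the paragraph above refers to.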
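Web scraping is done with the scraper provided by the NY Times API, which returns only about 40-60 words of each article. The paywall could easily be bypassed with BeautifulSoup4, but the legality of doing so is uncertain.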
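For prompt categorization, we use the mixtral-8x7b-instruct-v0.1 model because of its strong capabilities, its cloud-based execution on Replicate AI, and the ease of preventing hallucinations. For text summarization, we use a fine-tuned version, hosted on Hugging Face, of the BART-large model originally proposed by Facebook. The model was trained on the CNN_DailyMail dataset and further fine-tuned on the SAMSum dataset, which gives a 103% improvement on the ROUGE-2 benchmark. It is a fairly light model, about 1.6 GB in size.
Links:
https://huggingface.co/dhivyeshrk/bart-large-cnn-samsum
https://replicate.com/mistralai/mixtral-8x7b-instruct-v0.1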
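The two model steps could be wired together roughly as in the sketch below. It assumes the `replicate` and `transformers` packages are installed and that the Replicate token is available in the `REPLICATE_API_TOKEN` environment variable, which is where the `replicate` client looks for it. The prompt wording, the `CATEGORIES` tuple, the length limits, and all function names are assumptions for illustration. Restricting the answer to a fixed set of category names is one simple way to guard against hallucinated output, and may differ from what the project actually does.

```python
CATEGORIES = ("technology", "sports", "science", "health")


def build_category_prompt(user_query):
    """Ask the LLM to pick exactly one of the known categories."""
    options = ", ".join(CATEGORIES)
    return (
        f"Classify the following request into exactly one of these categories: "
        f"{options}. Reply with the category name only.\n\nRequest: {user_query}"
    )


def parse_category(model_output):
    """Return the first known category mentioned in the output, or None."""
    text = model_output.lower()
    hits = [(text.find(c), c) for c in CATEGORIES if c in text]
    return min(hits)[1] if hits else None


def categorize(user_query):
    """Call Mixtral on Replicate (network access and API token required)."""
    import replicate

    chunks = replicate.run(
        "mistralai/mixtral-8x7b-instruct-v0.1",
        input={"prompt": build_category_prompt(user_query)},
    )
    # Language models on Replicate stream their answer as string chunks.
    return parse_category("".join(chunks))


def summarize(texts, max_length=130, min_length=30):
    """Summarize retrieved article snippets with the fine-tuned BART model."""
    from transformers import pipeline

    summarizer = pipeline("summarization", model="dhivyeshrk/bart-large-cnn-samsum")
    result = summarizer(
        " ".join(texts),
        max_length=max_length,
        min_length=min_length,
        do_sample=False,
        truncation=True,  # keep the input within the model's context window
    )
    return result[0]["summary_text"]
```

In this arrangement, the category returned by `categorize` selects which ChromaDB collection to query, and the retrieved snippets are passed to `summarize`. When `parse_category` returns `None`, the caller can fall back to asking the user or searching every collection.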
Obtain API keys for the New York Times API and for Replicate AI, then set them in Web_scrape_nyt.py and categorize_prompt.py respectively. Then run main.py.