Recommendation, Advertising and Search Arsenal
Author: Yang Xi
NLP paper study notes: https://github.com/km1994/nlp_paper_study
Personal introduction: Hello everyone, my name is Yang Xi. This project collects what I have read, thought about, and learned while studying top-conference papers and reproducing classic papers. Corrections and feedback are very welcome.
NLP interview notes: https://github.com/km1994/NLP-Interview-Notes
Recommender system interview notes: https://github.com/km1994/RES-Interview-Notes
Recommendation, advertising and search arsenal: https://github.com/km1994/recommendation_advertisement_search
Follow the official account [Things you don't know about NLP] and join the [NLP && Recommendation Learning Group] to study together!
1. Project
1.1 Large models currently available for download
- chatgpt:
- https://openai.com/blog/chatgpt
- Try it: https://chat.openai.com/
- GLM-10B/130B
- Introduction: Bilingual (Chinese and English) Bidirectional Dense Model
- OPT-2.7B/13B/30B/66B:
- Introduction: Meta's open-source pre-trained language model
- github: https://github.com/facebookresearch/metaseq
- paper: https://arxiv.org/pdf/2205.01068.pdf
- LLaMA-7B/13B/30B/65B:
- Introduction: Meta's open-source foundation large language model
- github: https://github.com/facebookresearch/llama
- paper: https://arxiv.org/pdf/2302.13971v1.pdf
- Alpaca (LLaMA-7B):
- Introduction: A strong, reproducible instruction-following model from Stanford. The seed tasks and the collected data are all in English, so the trained model is not optimized for Chinese.
- github: https://github.com/tatsu-lab/stanford_alpaca
- Chinese-LLaMA-Alpaca github: https://github.com/ymcui/Chinese-LLaMA-Alpaca
- BELLE (BLOOMZ-7B/LLaMA-7B):
- Introduction: Based on Stanford Alpaca and optimized for Chinese; model tuning uses only data produced by ChatGPT (no other data).
- ChatGLM-6B:
- Introduction: Chinese and English bilingual dialogue language model
- github: https://github.com/THUDM/ChatGLM-6B/
- Bloom-7B/13B/176B:
- Introduction: Handles 46 languages, including French, Chinese, Vietnamese, Indonesian, Catalan, 13 Indic languages (such as Hindi) and 20 African languages. The Bloomz series is fine-tuned on the xP3 dataset and is recommended for English prompting; the Bloomz-mt series is fine-tuned on the xP3mt dataset and is recommended for non-English prompting.
- huggingface: https://huggingface.co/bigscience/bloom
- paper: https://arxiv.org/pdf/2211.05100.pdf
- Vicuna(7B/13B):
- Introduction: Vicuna-13B, created by researchers at UC Berkeley, CMU, Stanford, and UC San Diego, is obtained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Using GPT-4 for evaluation, Vicuna-13B was found to reach quality comparable to ChatGPT and Bard in more than 90% of cases, and to outperform other models such as LLaMA and Alpaca in 90% of cases. Training Vicuna-13B costs only about $300. The team also provides FastChat, an open platform for training, serving and evaluating chatbots based on large language models.
- Baize:
- Introduction: Baize is trained on LLaMA. Four English models are currently available, Baize-7B, 13B and 30B (general dialogue models), plus a domain-specific Baize-medical model, all for research/non-commercial use; a Chinese Baize model is planned. Baize's data processing, model training and demo code are all open source.
- LLMZoo:
- Introduction: A series of large models, such as Phoenix and Chimera, released by the team from the Chinese University of Hong Kong and the Shenzhen Research Institute of Big Data.
- MOSS:
- Introduction: The MOSS large language model released by the Fudan NLP team.
- Alpaca FastChat
- github: https://github.com/lm-sys/FastChat
- MiniGPT-4
- github: https://github.com/Vision-CAIR/MiniGPT-4
1.2 [LLMs hands-on tutorial series]
- [LLMs hands-on series] Tsinghua's open-source Chinese ChatGLM-6B model: learning and practice
- [LLMs hands-on series - 8] MiniGPT-4 model: learning and practice
1.3 NLP project arsenal
- 【Knowledge Graph Construction DeepKg】https://github.com/powercy/DeepKg
- Introduction: This project focuses on knowledge graph construction, building up its methods step by step, and aims to help more people.
1.4 Recommender system project arsenal
- 【fun-rec】https://github.com/datawhalechina/fun-rec
- Introduction: Aimed at students with a basic machine-learning foundation who want to land a recommendation-algorithm position. The tutorial covers recommendation-algorithm fundamentals, an introductory recommendation competition, a news-recommendation project and interview material, forming a complete path from basics to practice to interviews.
- 【RecSys】https://github.com/qcymkxyc/RecSys
- Introduction: A code implementation of Xiang Liang's book "Recommender System Practice"
1.5 Search engine project arsenal
- [Search Engine Project Open Source] https://github.com/zuo369301826/Search_Project
- Project introduction: A site search engine modeled on Baidu search. The project has two parts, an HTTP server and a search server: the HTTP server receives user requests, parses out the information needed and passes it to the search server; the search server retrieves and processes data based on that information and returns the results to the HTTP server, which renders them on the page. (A minimal inverted-index sketch appears at the end of this list.)
- Project features: 1) built with Google's open-source protobuf, gflags and glog; 2) the search server communicates over RPC, built on Baidu's open-source high-performance sofa-pbrpc framework; 3) retrieval combines a forward index and an inverted index; 4) the HTTP server uses the epoll model to improve concurrent response speed; 5) the HTTP server invokes the search client via CGI to complete the search.
- 【Elastic】https://www.elastic.co/cn/
- Introduction: Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data and helps you discover the expected and uncover the unexpected.
- 【Nutch】http://nutch.sourceforge.net/docs/zh/about.html
- Introduction: Nutch is a newly open-sourced web search engine; its homepage has detailed Chinese documentation.
- 【Lucene】http://jakarta.apache.org/lucene/docs/index.html
- Introduction: Apache Lucene is an open-source full-text search engine library that makes it easy to add full-text search to Java software. Lucene's core job is to index every word in a document; compared with traditional word-by-word scanning, indexing greatly improves search efficiency. Lucene provides a set of APIs for parsing, filtering and analyzing documents and for building and using indexes. It is not only efficient and simple; most importantly, it lets users customize its functionality as needed.
- 【Egothor】http://www.egothor.org/
- Introduction: Egothor is an efficient open-source full-text search engine written in Java. Thanks to Java's cross-platform nature, Egothor can run in any environment, either as a standalone search engine or embedded in your application as full-text search.
- 【Oxyus】http://oxyus.sourceforge.net/
- Introduction: A web search engine written in pure Java.
- 【BDDBot】http://www.twmacinta.com/bddbot/
- Introduction: BDDBot is a simple, easy-to-understand and easy-to-use search engine. It currently crawls the URLs listed in a text file (urls.txt) and stores the results in a database. It also includes a simple web server that accepts queries from a browser and returns results, and it can be easily integrated into your website.
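To make the forward-index / inverted-index idea above concrete, here is a minimal, hypothetical Python sketch (illustration only; the Search_Project repo itself is written in C++ with protobuf and sofa-pbrpc):

```python
# Toy forward + inverted index (hypothetical example, not code from Search_Project).
from collections import defaultdict

docs = {
    1: "baidu open source rpc framework sofa pbrpc",
    2: "inverted index speeds up full text search",
    3: "http server uses epoll for high concurrency",
}

# Forward index: doc id -> token list (used to rebuild snippets for display).
forward_index = {doc_id: text.split() for doc_id, text in docs.items()}

# Inverted index: token -> set of doc ids containing it (used for retrieval).
inverted_index = defaultdict(set)
for doc_id, tokens in forward_index.items():
    for token in tokens:
        inverted_index[token].add(doc_id)

def search(query):
    """Return ids of documents that contain every query token."""
    postings = [inverted_index.get(tok, set()) for tok in query.split()]
    return sorted(set.intersection(*postings)) if postings else []

print(search("inverted index"))  # -> [2]
print(search("epoll server"))    # -> [3]
```

A real system would add tokenization for Chinese text, ranking (e.g. TF-IDF or BM25) and on-disk storage, but the retrieval step is still an intersection of posting lists like the one above.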
1.6 Computational advertising project arsenal
- [Meituan DSP advertising strategy practice] https://tech.meituan.com/2017/05/05/mt-dsp.html
- [Introduction to Internet Advertising and Computational Advertising] http://web.stanford.edu/class/msande239/
2. AI tools
- ChatGPT AI tools
- 【AI tool】ChatGPT conversational AI https://999.weny66.cn/chat?bd_vid=11997231054327469370
- 【AI tool】GPT-4 online experience site chatmindai.cn
- 【AI tool】ChatGPT 3.5, free to log in, available in China https://chat23.yqcloud.top/
- Forefront Chat chat.forefront.ai
- Poe poe.com/GPT-4
- 3D AI tools
- Masterpiece Studio: https://xiaobot.net/p/SuperIndividual
- Masterpiece Studio: https://masterpiecestudio.com
- G3DAI {Jedi}: https://g3d.ai
- Ponzu: https://www.ponzu.gg
- PrometheanAI: https://www.prometheanai.com
- Leonardo.Ai: https://leonardo.ai
- Art AI tools
- Dream Up (Deviant Art): https://www.dreamup.com
- NightCafe Studio: https://creator.nightcafe.studio
- Midjourney: https://www.midjourney.com/home/
- Artbreeder: https://www.artbreeder.com
- Wombo: https://www.wombo.art
- Audio editing AI tools
- Podcastle: audio editing https://podcastle.ai
- Cleanvoice: audio editing https://cleanvoice.ai
- Code assistant AI tools
- CodeSquire https://codesquire.ai
- Buildt Code Assistant https://www.buildt.ai
- Hey, GitHub! Code Assistant https://githubnext.com/projects/hey-github
- Continuously updated
3. AI for beginners
3.1 Introduction to machine learning
- [Andrew Ng's Machine Learning course series] https://www.bilibili.com/video/BV164411b7dx?from=search&seid=18138466354258018449&spm_id_from=333.337.0.0
3.2 Getting started with NLP
- [2021 Andrew Ng Deep Learning - NLP Sequence Models] https://www.bilibili.com/video/BV1Co4y1279R?from=search&seid=17563746002586971760&spm_id_from=333.337.0.0
- 【Introduction to Knowledge Graph】
- Zhejiang University Knowledge Graph Lecture Notes | Lecture 1 - Introduction to Knowledge Graphs - Section 1 - Language and Knowledge
- Zhejiang University Knowledge Graph Lecture Notes | Lecture 1 - Introduction to Knowledge Graphs - Section 2 - Origins of the Knowledge Graph
- Knowledge Graph Lecture Notes | Lecture 1 - Section 3 - The Value of Knowledge Graphs
- Knowledge Graph Lecture Notes | Lecture 1 - Section 4 - Technical Substance of Knowledge Graphs
- Knowledge Graph Lecture Notes | Lecture 2 - Section 1 - What is Knowledge Representation
3.3 Getting started with computational advertising
- [Introduction to Internet Advertising and Computational Advertising] http://web.stanford.edu/class/msande239/
- Lecture 1: Introduction, Supplementary notes
- Lecture 2: Marketplace design, In class presentation, Supplementary notes
- Lecture 3: Sponsored search 1, In class presentation
- Lecture 4: Sponsored search 2, In class presentation
- Lecture 5: Display advertising 1, In class presentation
- Lecture 6: Display advertising 2, In class presentation
- Lecture 7: Targeting, In class presentation
- Lecture 8: Recommender systems, In class presentation 1, In class presentation 2
- Lecture 9: Mobile, video, and other emerging formats, In class presentation 1, In class presentation 2
- [Liu Peng – Computational Advertising (Recommended)] http://study.163.com/course/introduction.htm?courseId=321007
- Introduction: Liu Peng is currently the chief architect of commercial products at 360 and has rich practical experience in Internet advertising. His "Computational Advertising" course is easy to follow, moving from the history of advertising models to recent techniques, and is well suited to newcomers to the field.
- Basic knowledge of advertising
- Contract advertising system
- Audience Targeting
- Bidding advertising system
- Search Advertising and Advertising Network Demand Technology
- Advertising Market
- 【Baidu – Computational Advertising】http://openresearch.baidu.com/courses/1231.jhtml
- Overview of Computational Advertising
- Search engine advertising principles, technology and engineering practices
- Content matching advertising principles, techniques and practices
- [Wang Yongrui – Internet advertising algorithms and system practice] http://yuedu.baidu.com/ebook/3e31c551964bcf84b9d57bc0.html
- Introduction: Wang Yongrui leads Taobao's targeted advertising algorithms. His course draws on Taobao's advertising practice, moving from advertising theory to systems and engineering practice, and is well worth studying for engineers.
- Introduction to Internet advertising
- Search Ads
- Targeted advertising
- Real-time advertising bidding
- Advertising system architecture and challenges
- 【UCSC - Introduction to Computational Advertising】 http://classes.soe.ucsc.edu/ism293/Spring09/index_archivos/Page456.html
- Introduction and Overview
- Information Retrieval (IR) for Computational Advertising
- Marketplace design
- Machine Learning Techniques
- Sponsored Search I
- Sponsored Search II
- Graphical ads and guaranteed delivery
- Contextual Advertising I
- Contextual Advertising II
- Behavioral Targeting (BT)
4. Recommendation, advertising and search: paper study notes
- 【NLP Study Notes】
- 【Transformer】
- 【About Efficient Transformers: A Survey】Things you don't know
- 【Bert Model Compression】
- 【About self-training + pre-training = better natural language understanding model】Things you don't know
- 【About BERT to TextCNN】Things you don't know
- 【Named Entity Recognition】
- 【Biaffine about nested entity recognition】Things you don’t know
- paperShape by Biaffine
- paperShape's inventory of named entity recognition in recent years
- 【About Continual Learning for NER】Things you don’t know
- 【Relationship Extraction】
- 【About HBT Relationship Extraction】Things you don’t know
- Relation extraction, from scratch
- Relation extraction from scratch - distant supervision attacks
- [Document-level relationship extraction]
- 【About ATLOP】Things you don't know
- Paper summary | Document-level relationship extraction method (Part 1)
- Paper summary | Document-level relationship extraction method (Part 2)
- 【Text Match】
- 【About Sentence-BERT】Things you don't know
- Facebook Faiss: principles and applications of the library for million-scale vector similarity search (a minimal usage sketch appears at the end of this section)
- New Sentence Vector Solution CoSENT Practical Record
- 【Status Chain Reference】
- 【About GENER】Things you don’t know
- 【Text error correction】
- 【About GECToR】Things you don't know
- 【Q&A Robot】
- TopicShare sharing scene-based and search-based Q&A robot
- 【Dialogue System】
- "【Community Says】Let's talk about Rasa 3.0" Incomplete Notes
- (1) Overview of dialogue robots
- (2) Introduction to the RASA open-source engine
- (3) RASA NLU language models
- (4) RASA NLU tokenizers
- (5) RASA NLU featurizers
- (6) RASA NLU intent classifiers
- (7) RASA NLU entity extractors
- (9) RASA custom pipeline components
- (10) RASA CORE Policy
- (11) RASA CORE Action
- (12) RASA Domain
- (13) RASA training data
- (14) RASA story
- (15) Rasa Rules
- (16) RASA best practices
- (17) Start Chinese robot based on RASA
- (18) Start the Chinese robot implementation mechanism based on RASA
- (19) Question and Answer System Based on Knowledge Graph (KBQA)
- (20) A Q&A system based on reading comprehension
- DIET: Dual Intent and Entity Transformer—— RASA paper translation
- (21) FAQs on RASA Application
- (22) Hyperparameter optimization of RASA
- (23) Robot testing and evaluation
- (24) Create a context dialogue assistant using Rasa Forms
- 【KBQA】
- 【About Complex KBQA】Things you don’t know (Part 1)
- 【About Complex KBQA】Things you don’t know (Part 2)
- 【About Complex KBQA】Things you don’t know (Part 3)
- 【Event Extraction】
- 【About MLBiNet】Things you don't know
- 【Prompt Tuning】
- Prompt Tuning Introduction
- 【New Word Discovery】
- Build your own PTM! New word mining + pre-training
- 【Text to SQL】
- Text to SQL? Here is a Baseline analysis
- 【Recommender System Study Notes】
- Recommender system technology evolution: recall
- Recommender system technology evolution: ranking
- Recommender system technology evolution: re-ranking
- How does a recommender system find similar users?
- A long-form article on the logic and evolution of conversational recommender systems
- A summary of model-adaptation techniques in recommender systems
- 【GCN Study Notes】
- 【About GCN in NLP】Things you don't know
- [Computational advertising papers and data list github repo]
- Three perspectives on the advertising system as I see it
- [Recommender system papers and data list github repo]
- 【Search Engine】
- 【About PLM for Web-scale Retrieval in Baidu Search】Things you don’t know
- EMNLP 2021 | RocketQAv2: a joint training method for dense passage retrieval and passage re-ranking
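As a side note to the Faiss entry above, here is a minimal, hypothetical usage sketch of Faiss for large-scale vector similarity search (random vectors as placeholders; install with `pip install faiss-cpu`):

```python
# Minimal Faiss similarity-search sketch (hypothetical data, illustration only).
import numpy as np
import faiss

d = 128                                                 # vector dimension
xb = np.random.random((100_000, d)).astype("float32")   # database vectors
xq = np.random.random((5, d)).astype("float32")         # query vectors

index = faiss.IndexFlatL2(d)   # exact L2 search; IVF/HNSW indexes trade accuracy for speed
index.add(xb)                  # add the database vectors
distances, ids = index.search(xq, 4)   # top-4 nearest neighbours for each query
print(ids)
```

For million-scale or larger collections, Faiss also offers approximate indexes (IVF, PQ, HNSW) that keep search fast at the cost of a small amount of recall.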
5. Recommendation, advertising and search articles
- 【NLP interview Q&A】
- 【Machine Learning】
- 【About regularization】Things you don't know
- 【About Optimization Algorithm】Things you don't know
- 【About BatchNorm vs LayerNorm】Things you don't know
- 【About Normalization】Things you don't know
- 【About Overfitting and Underfitting】Things you don't know
- 【Deep Learning】
- 【About CNN】Things you don't know
- 【About Attention】Things you don’t know
- 【About Transformer】Things you don’t know (Part 1)
- 【About Transformer】Things you don’t know (Part 2)
- 【About Transformer】Things you don’t know (Part 3)
- 【NLP Tasks】
- 【Pretrained Model】
- 【About TF-idf】Things you don't know
- 【About Word2vec】Things you don't know
- 【About fastText】Things you don't know
- 【About Elmo】Things you don't know
- 【About Bert】Things you don’t know (Part 1)
- 【About Bert】Things you don’t know (Part 2)
- 【About Bert Source Code Analysis I's main body】Things you don't know
- 【About Bert Source Code Analysis II Pre-training Chapter】Things you don’t know
- 【About Bert Source Code Analysis III fine-tuning chapter】Things you don't know
- [About Bert source code analysis IV sentence vector generation article] Things you don't know
- 【About Bert’s bigger, the more refined sequence】Things you don’t know (I)
- 【About Bert’s bigger, the more refined sequence】Things you don’t know (II)
- 【About Bert’s bigger, the more refined sequence】Things you don’t know (III)
- 【New Word Discovery】
- 【About New Word Discovery】Things you don't know
- 【Keyword Extraction】
- 【About keyword extraction】Things you don’t know
- 【About KeyBERT】Things you don’t know
- 【Recommender system interview Q&A】
- to be continued
6. Frameworks
6.1 Pytorch Learning
- 【PyTorch English version official manual】https://pytorch.org/tutorials/
- Introduction: For readers comfortable with English, the official PyTorch tutorials are highly recommended; they take you step by step from getting started to mastery, covering the basics, how to build deep neural networks with PyTorch, PyTorch syntax, and a number of high-quality examples. (A minimal training-loop sketch appears at the end of this section.)
- [PyTorch Chinese official document] https://pytorch-cn.readthedocs.io/zh/latest/
- Introduction: If the English documentation above is hard going, there is also an official Chinese PyTorch document. It introduces each function in detail and serves well as a quick reference for PyTorch.
- [PyTorch code tutorial for practical algorithms] https://github.com/yunjey/pytorch-tutorial
- Introduction: A PyTorch code tutorial focused on practical algorithms, with a high star count on GitHub. It is recommended to work through the two basic PyTorch tutorials above before reading this one.
- 【Pytorch Open Source Books】https://github.com/zergtant/pytorch-handbook
- Introduction: An open-source book that aims to help readers who want to use PyTorch for deep-learning development and research get up to speed quickly. The book is not yet complete and is still being updated.
- ["Hand-On Deep Learning" pytorch] http://tangshusen.me/Dive-into-DL-PyTorch/#/
- 【Practical Tutorial on Pytorch Model Training】 https://github.com/km1994/PyTorch_Tutorial
- 【Pytorch Advanced NLP Practical Practice】https://github.com/km1994/NLP_pytorch_project
- 【ark-nlp NLP Tool Library】https://github.com/xiangking/ark-nlp
- Introduction: Wang Xiang's open-source arsenal, which collects and reproduces NLP models commonly used in research and industry.
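As a quick taste of what the tutorials above cover, here is a minimal, hypothetical PyTorch training-loop sketch (random data, not taken from any of the listed tutorials):

```python
# Minimal PyTorch training loop on random data (illustration only).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 10)           # dummy features
y = torch.randint(0, 2, (64,))    # dummy labels

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # forward pass + loss
    loss.backward()               # backpropagation
    optimizer.step()              # parameter update
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```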
6.2 tensorflow learning
- 【TensorFlow official website】https://www.tensorflow.org/tutorials
- Introduction: The official tutorials are always the best learning material.
- 【TensorFlow Examples】https://github.com/aymericdamien/TensorFlow-Examples
- Introduction: TensorFlow tutorials and code examples for beginners. The tutorial not only provides some classic datasets, but also works its way up from the simplest "Hello World", through classic machine-learning algorithms, to commonly used neural-network models, taking you step by step from getting started to mastery. It is one of the best tutorials for beginners learning TensorFlow. (A minimal TensorFlow 2 example appears at the end of this section.)
- 【TensorFlow Tutorials】https://github.com/pkmital/tensorflow_tutorials
- Introduction: From TensorFlow basics to interesting project applications. Also a tutorial for beginners, taking you from installation to project practice and teaching you to build your own neural network.
- 【Tensorflow Tutorials using Jupyter Notebook】https://github.com/sjchoi86/Tensorflow-101
- Introduction: A TensorFlow tutorial written in Python using Jupyter Notebook. Jupyter Notebook is a very useful interactive development tool: it supports more than 40 programming languages, can run code, share documents and visualize data in real time, supports markdown, and is well suited to machine learning, statistical modeling, data processing, feature extraction and similar work.
- 【TensorFlow_Exercises】https://github.com/terryum/TensorFlow_Exercises
- Introduction: TensorFlow code exercises ordered from easy to hard; a practice workbook well suited to anyone studying TensorFlow.
- 【Application of BERT and ALBERT in downstream tasks】 https://github.com/km1994/bert-for-task
- Introduction: BERT applied to downstream NLP tasks
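As a quick companion to the TensorFlow tutorials above, here is a minimal, hypothetical TensorFlow 2 sketch, from the classic "Hello World" constant to a tiny Keras classifier on random data:

```python
# Minimal TensorFlow 2 example on random data (illustration only).
import tensorflow as tf

print(tf.constant("Hello, TensorFlow!"))   # the classic "Hello World"

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

x = tf.random.normal((64, 10))                           # dummy features
y = tf.random.uniform((64,), maxval=2, dtype=tf.int32)   # dummy labels
model.fit(x, y, epochs=3, verbose=1)
```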
6.3 keras learning
- 【bert4keras】https://github.com/bojone/bert4keras
- Introduction: Jianlin Su's open-source arsenal, a reimplemented Keras library of Transformer models, aiming to combine Transformers and Keras with code that is as clean as possible.
6.4 Distributed training framework learning
- The first category: the distributed-training features built into deep-learning frameworks themselves, e.g. TensorFlow, PyTorch, MindSpore, OneFlow, PaddlePaddle. (A minimal PyTorch DDP sketch appears below.)
- The second category: scaling and optimization layers built on top of existing deep-learning frameworks (such as PyTorch or Flax) for distributed training, e.g. Megatron-LM (tensor parallelism), DeepSpeed (ZeRO-DP), Colossal-AI (high-dimensional model parallelism such as 2D, 2.5D and 3D), Alpa (automatic parallelization).
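For the first category, here is a minimal, hypothetical sketch of PyTorch's built-in DistributedDataParallel (DDP); launch it with something like `torchrun --nproc_per_node=2 ddp_sketch.py` (the file name is just a placeholder):

```python
# Minimal PyTorch DistributedDataParallel sketch (illustration only).
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="gloo")   # use "nccl" for multi-GPU training
    rank = dist.get_rank()

    model = nn.Linear(10, 2)                  # toy model
    ddp_model = DDP(model)                    # wraps the model; gradients are all-reduced
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    x, y = torch.randn(32, 10), torch.randn(32, 2)   # dummy data per process
    for step in range(3):
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x), y)
        loss.backward()                       # DDP synchronizes gradients here
        optimizer.step()
    print(f"rank {rank}: final loss = {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Frameworks in the second category (DeepSpeed, Megatron-LM, Colossal-AI, Alpa) build on this kind of data parallelism with memory partitioning and model parallelism for models too large for a single device.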
7. Competition
7.1 Domestic competitions
- [iFlytek Developer Competition] http://challenge.xfyun.cn/
- 【Ali Tianchi】https://tianchi.aliyun.com/
- 【biendata】https://www.biendata.xyz/
- 【datafountain】https://www.datafountain.cn/
- 【Baidu PaddlePaddle AI Studio】https://aistudio.baidu.com/
7.2 Competition official accounts
- 【Mapo Tofu AI】
- Introduction: Introduces recent competitions you can take part in
7.3 NLP competition arsenal
- [NLP Arsenal Tool Library] https://github.com/TingFree/NLPer-Arsenal
- Introduction: An NLP arsenal that mainly includes implementations of NLP competition solutions, tutorials for various tasks, experience write-ups, learning materials and conference dates.
- 【CHIP2021 Task 3: open-source solution for the clinical term standardization task】
- github source code
- Evaluation website: http://cips-chip.org.cn/2021/eval3
- All of the code is implemented with our open-source ark-nlp. The CHIP2021 clinical term standardization task has no public (A) leaderboard, so debugging was done on the clinical term standardization task of CBLUE, the Chinese medical information processing dataset on Tianchi.
- ark-nlp address: https://github.com/xiangking/ark-nlp
- Chinese medical information processing data set CBLUE: https://tianchi.aliyun.com/dataset/dataDetail?dataId=95414
- [CHIP2021 medical dialogue clinical findings positive/negative discrimination task: champion open-source solution]
- github source code
- Name: CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
- Evaluation task: CBLUE 1.0 is built from the datasets of previous CHIP-conference academic evaluations and Alibaba Quark's medical search business, and comprises 8 sub-tasks in total: medical text information extraction (entity recognition, relation extraction), medical term normalization, medical text classification, medical sentence-relation judgment, and medical QA.
- Task types: text classification, text similarity, named entity recognition, relation extraction and terminology standardization (which can be treated as an entity-linking task without context)
- Evaluation link: https://tianchi.aliyun.com/dataset/dataDetail?dataId=95414
- 【CBLUE - Ali Tianchi Chinese Medical NLP leaderboard baseline】https://github.com/DataArk/CBLUE-Baseline
- [Shandong Big Data Competition—Grid Event Intelligent Classification Baseline] https://github.com/xiangking/ShandongDataCompetition2021-grid-events-classification-baseline
- Evaluation task: based on grid event data, extract and analyze the event content in the grid, classify the events, and determine the government-affairs category to which each event belongs.
- Task Type: Text Classification
- Evaluation link: http://data.sd.gov.cn/cmpt/cmptDetail.html?id=67
8. Corpus
8.1 NLP corpora
- 【nlp_chinese_corpus 】https://github.com/brightmart/nlp_chinese_corpus
- Introduction: Large Scale Chinese Corpus for NLP
8.2 Recommender system datasets
- 【MovieLens】https://grouplens.org/datasets/movielens/
- Introduction: The MovieLens datasets are maintained by the GroupLens research group at the University of Minnesota. MovieLens is a collection of movie ratings and is available in several sizes, named 1M, 10M and 20M because they contain 1 million, 10 million and 20 million ratings respectively. The largest dataset draws on about 140,000 users and covers 27,000 movies. Besides ratings, MovieLens also includes genre information such as "Western" and user-applied tags such as "over the top" and "Arnold Schwarzenegger". These genres and tags are useful for building content vectors, which encode an item's attributes (color, shape, genre, or really any other property) in a form that a content-based recommendation algorithm can use. (A toy content-vector sketch appears after this list.)
- 【Book-Crossings】
- Introduction: Book-Crossings is a book-rating dataset compiled by Cai-Nicolas Ziegler from data on http://bookcrossing.com. It contains 1.1 million ratings of 270,000 books from 90,000 users. Ratings range from 1 to 10, and implicit ratings are also included.
- 【Last.fm】http://www2.informatik.uni-freiburg.de/~cziegler/BX/
- Introduction: Last.fm provides a dataset for music recommendation. For each user it includes a list of their most-listened-to artists together with play counts, as well as user-applied tags that can be used to build content vectors.
- 【Dating Agency】http://www2.informatik.uni-freiburg.de/~cziegler/BX/
- Introduction: This dataset contains 17,359,346 anonymous ratings of 168,791 profiles by 135,359 LibimSeTi users, exported on April 4, 2006.
- Others: https://zhuanlan.zhihu.com/p/258566760
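To make the "content vector" idea from the MovieLens entry concrete, here is a tiny, hypothetical content-based recommendation sketch (made-up items and genres, not actual MovieLens data):

```python
# Toy content-based recommendation: one-hot genre vectors + cosine similarity.
import numpy as np

genres = ["Western", "Action", "Comedy"]
items = {
    "Movie A": {"Western", "Action"},
    "Movie B": {"Comedy"},
    "Movie C": {"Western"},
}

def content_vector(item_genres):
    """One-hot encode an item's genres into a content vector."""
    return np.array([1.0 if g in item_genres else 0.0 for g in genres])

vectors = {name: content_vector(g) for name, g in items.items()}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Recommend the items whose content is most similar to "Movie A".
query = vectors["Movie A"]
scores = {name: cosine(query, vec) for name, vec in vectors.items() if name != "Movie A"}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

Real systems replace the one-hot genres with richer features (tags, text embeddings) but keep the same idea: represent each item as a vector and recommend by similarity.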
8.3 Labeling Tools
- Still worried about not finding an entity and relation annotation tool?
- https://labelstud.io/
- doccano
9. Official accounts
- Things you don't know about NLP
- Introduction: Things you don't know about NLP
- CS's humble room
- Introduction: Experience-sharing posts from the blogger known as Char Siu; when you run into problems, you may find unexpected answers here.
- DataArk
- Introduction: DataArk is data-driven and open source sharing-oriented, and is committed to data mining, algorithm innovation and the development of practical tools.
- Intelligent recommendation system
- Introduction: Focused on intelligent recommender systems, sharing the latest and most comprehensive personalized-recommendation algorithms and industry applications. Follow along to explore the world of recommendation together.
- DataFunTalk
- Introduction: Focused on sharing and discussing applications of big data and artificial intelligence, and committed to helping a million data scientists grow. It regularly organizes live technical talks and publishes articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, machine learning and deep learning.
- RUC AI Box
- Introduction: This official account focuses on research that uses artificial intelligence to solve natural language processing and social-media data mining problems. It shares AI frontiers and interprets trending papers.
- NewBeeNLP
- Introduction: Will introduce many excellent NLP notes
- Open Knowledge Graph
- Introduction: openKG: Openness promotes interconnection and links create value
- WeData365
- Introduction: A must-follow for anyone studying [search engines]; it shares a lot of practical search-engine material.
- Science Space
- Introduction: Jianlin Su's official account; he shares his research notes every Thursday.
- Lao Liu says NLP
- Introduction: Liu Huanyong of the 360 Artificial Intelligence Research Institute regularly publishes language resources, engineering practice and technical summaries.
- Data picker
- Introduction: A must-follow for anyone studying [advertising]; it shares a lot of practical advertising material.
- Functional model
- Introduction: Study notes shared by a Tencent engineer
- Computational advertising
- Introduction: A must-follow for anyone studying [advertising]; it shares a lot of practical computational-advertising material.
- Medicine Algorithm
- Introduction: A must-follow for anyone studying [search engines]; it shares a lot of practical search-engine material.
- Machine Learning Algorithms and Natural Language Processing
- Introduction: An enthusiastic official account, a gathering place for machine learning, natural language processing, algorithms and other knowledge. Looking forward to meeting you there.
- Wang Zhe's machine learning notes
- Introduction: Frontier progress in recommender systems, computational advertising and machine learning
- AINLP
- Introduction: Covers AI, NLP, machine learning, recommender systems, computational advertising and related technologies. Inside the account you can chat with a bilingual chatbot, try automatic couplets, poem generators, acrostic-poem generators, teasing and compliment bots and over-the-top praise generators, use Chinese-English translation, query similar words, and test NLP toolkits.
- Lee rumor
- Introduction: Study notes shared by Li rumor
- Xi Xiaoyao's cute house
- Introduction: Natural language processing, computer vision, information retrieval, recommendation system, machine learning
10. Study notes
- Science Space:
- Address: https://spaces.ac.cn/
- Introduction: Jianlin Su's blog, where he shares his study notes and research experience
- Chilia of the Magic Academy
- Address: https://www.zhihu.com/people/wang-zi-han-81-18/posts
- Direction: Recommendation System | Advertising | Search | NLP
- Brother Shui
- Address: https://www.zhihu.com/people/shui-ge-99
- Direction: Recommender systems
- JayJay
- Address: https://www.zhihu.com/people/lou-jie-9
- I've thought about a lot of things
- Address: https://www.zhihu.com/people/yuan-chao-yi-83
11. Deployment Notes
- A silky-smooth BERT + TensorRT deployment guide
References
- A summary of large-model practice