LLMs Paper Study Club
Author: Yang Xi
Introduction: This repository mainly collects study notes on top-conference papers relevant to LLM algorithm engineers (multimodal, PEFT, few-shot QA, RAG, LMMs interpretability, Agents, CoT)
LLMs Nine-story Demon Tower Address: https://github.com/km1994/LLMsNineStoryDemonTower
LLMs Thousand Faces Langjun (interview notes) address: https://github.com/km1994/LLMs_interview_notes
LLMs paper study notes: https://gitee.com/km601/llms_paper
NLP interview notes address: https://github.com/km1994/NLP-Interview-Notes
NLP paper study notes: https://github.com/km1994/nlp_paper_study
Recommender system interview notes address: https://github.com/km1994/RES-Interview-Notes
Recommender system paper study notes: https://github.com/km1994/RS_paper_study
Search engine interview notes address: https://github.com/km1994/search-engine-Interview-Notes [in progress]
GCN paper study notes: https://github.com/km1994/GCN_study
Recommendation, advertising, and search arsenal: https://github.com/km1994/recommendation_advertisement_search
For the mobile version of the notes, follow the official account [Things you don’t know about NLP] and join the [NLP && Recommendation Learning Group] to study together!

LLMs Thousand Faces Langjun Interview Exchange Group (Note: You can add editor wx: yzyykm666 to join the group!)
- LLMs Paper Study Club
- Multimodal
- PEFT series
- GPT Series
- RAG series
- RAG Trick
- RAG application field
- QA Q&A in the medical field
- QA Q&A in the Religious Field
- QA Q&A in the Common Sense Field
- QA Q&A in the Legal Field
- QA Q&A in the field of knowledge graph
- QA Q&A in Task-based Domain
- QA Q&A in the automotive field
- Prompt series
- LMMs Interpretability
- LLMs4KG
- LLMs Agents
- Attention
- Search
- How to build "query-doc" pairs with a large model?
- How to label positive and negative "query-doc" examples with a large model?
- How to rewrite "query-doc" pairs with a large model?
- How to combine PRF (pseudo-relevance feedback) and GRF (generative relevance feedback) with large models?
- How to do recall and ranking with a large model?
- What is recall?
- What are the problems with recall?
- How to use an encoder-based LLM retriever?
- How to use a generative LLM retriever?
- How to rank with large models?
- Fine-tune LLM for similarity calculation
- Prompt LLM
- CoT
- Fine-tuning data engineering
- Efficient large model inference
- Large model evaluation
- Pre-training of large models
- Robots
- Reinforcement learning
- Digital humans
- References
Multimodal
Gemini: A family of highly capable multimodal models
- Paper title: Gemini: A Family of Highly Capable Multimodal Models
- Paper address: https://arxiv.org/pdf/2312.11805
- Organization: Google
- Github address:
- Meeting:
- Paper Methods: This paper introduces Gemini, a new family of multimodal models with remarkable capabilities across image, audio, video and text understanding. The Gemini family comes in three sizes, Ultra, Pro and Nano, covering applications from complex reasoning tasks to memory-constrained on-device use cases.
- Paper Experiment Results: Across a wide range of benchmarks, the most capable model, Gemini Ultra, advances the state of the art in 30 of 32 benchmarks. Notably, it is the first model to reach human-expert performance on the well-studied exam benchmark MMLU, and it improves the state of the art on the 20 multimodal benchmarks examined in the paper. The authors argue that Gemini's new capabilities in cross-modal reasoning and language understanding will enable a wide variety of use cases, and they discuss their approach to deploying the models responsibly to users.
Evaluating the performance of GPT4-V on structured reasoning tasks
- Paper title: Assessing GPT4-V on Structured Reasoning Tasks
- Paper address: https://arxiv.org/pdf/2312.11524
- Organization: OpenAI
- Github address:
- Meeting:
- Paper Methods: This paper evaluates the latest language model GPT-4V and five other baseline models on structured reasoning tasks, including mathematical reasoning, visual data analysis, and code generation.
- The results show that multimodal LLMs prompted with visual Chain-of-Thought improve significantly over vanilla models. The paper also gives a categorized analysis of the scenarios where the models perform well and where they struggle, highlighting the challenges of multimodal reasoning.
ProTIP: Progressive tool retrieval improves planning
- Paper title: ProTIP: Progressive Tool Retrieval Improves Planning
- Paper address: https://arxiv.org/pdf/2312.10332
- Organization:
- Github address:
- Meeting:
- Paper Methods: This paper introduces ProTIP, a progressive tool retrieval framework for complex multi-step planning tasks. The framework implicitly decomposes tasks through contrastive learning while preserving subtask-tool atomicity.
- On the ToolBench dataset, ProTIP outperforms the ChatGPT-based task decomposition approach, with a 24% improvement in Recall@K=10 for tool retrieval (TR) and a 41% improvement in tool accuracy for plan generation.
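The Recall@K metric quoted above can be illustrated with a short sketch. The function name, the tool names, and the gold set below are hypothetical and only show how the metric is usually computed; they are not taken from ToolBench or the ProTIP code.

```python
def recall_at_k(ranked_tools, relevant_tools, k=10):
    """Fraction of the relevant tools that appear among the top-k retrieved tools."""
    if not relevant_tools:
        raise ValueError("relevant_tools must not be empty")
    top_k = set(ranked_tools[:k])
    hits = sum(1 for tool in relevant_tools if tool in top_k)
    return hits / len(relevant_tools)


# Hypothetical retrieval output for one query (tool names are made up).
ranked = ["weather_api", "calendar_api", "maps_api", "email_api"]
gold = {"maps_api", "flight_api"}

score = recall_at_k(ranked, gold, k=3)  # one of the two gold tools is in the top 3
```

A benchmark score is normally the average of this value over all queries.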
LLaVA: Classic multimodal large model
- Paper title: Visual Instruction Tuning
- Paper address: https://arxiv.org/abs/2304.08485
- Institutions: Microsoft Research Institute and Columbia University
- Github address: https://github.com/haotian-liu/LLaVA
- Meeting:
- Motivation: Large language models like ChatGPT only accept text input, so how can we make large language models receive image input?
- Paper method: LLaVA proposes the following approach:
- Use CLIP as the image encoder and add a linear projection layer after CLIP;
- Project the CLIP-encoded image features Zv into the language model's feature space to obtain the visual features Hv;
- Feed Hv into the language model together with the encoding of the text (the language instruction).
- Training method:
- Phase 1: Pre-training stage. Only the linear projection layer (Projection W) is trained, to learn the mapping from image space to the language model's word embedding space. The dataset used in this stage is CC3M;
- Phase 2: Fine-tuning stage. The visual encoder's parameters are frozen, and the linear projection layer and the large language model are trained. The datasets used in this stage are ScienceQA and GPT-4-generated instruction data.
- Experimental effect: The model shows image-text comprehension close to multimodal GPT-4 on some tasks, obtaining a relative score of 85.1% compared to GPT-4. When fine-tuned on ScienceQA, the synergy of LLaVA and GPT-4 achieves a new SoTA of 92.53% accuracy.
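The projection step and the two training phases described above can be sketched with NumPy. This is only an illustration: the tensor sizes are made up, random arrays stand in for the CLIP features and the text embeddings, and the `trainable` dictionary is a hypothetical way of writing down which parts are updated in each phase.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not the paper's configuration).
num_patches, d_vision, d_model, num_text_tokens = 4, 8, 16, 5

# Z_v: image features from the frozen vision encoder (random stand-in for CLIP).
Z_v = rng.normal(size=(num_patches, d_vision))

# W: trainable linear projection from vision space to the LLM embedding space.
W = rng.normal(scale=0.02, size=(d_vision, d_model))

# H_v = Z_v W: visual tokens expressed in the language model's embedding space.
H_v = Z_v @ W

# H_q: embeddings of the text instruction (random stand-in).
H_q = rng.normal(size=(num_text_tokens, d_model))

# The language model receives the visual tokens followed by the text tokens.
llm_input = np.concatenate([H_v, H_q], axis=0)

# Which parts are updated in each phase, following the notes above.
trainable = {
    "phase1_pretraining": {"vision_encoder": False, "projection_W": True, "llm": False},
    "phase2_finetuning": {"vision_encoder": False, "projection_W": True, "llm": True},
}
```

The design keeps the connector deliberately small: only a single matrix has to be learned before the language model itself is touched.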
LLaVAR: Enhanced visual instruction fine-tuning
- Paper title: LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
- Paper address: https://arxiv.org/pdf/2306.17107.pdf
- Organizations: Georgia Tech, Adobe and Stanford
- Github address: https://github.com/SALT-NLP/LLaVAR
- Meeting:
- motivation:
- Paper method: An OCR tool was used to collect 422K text-rich images from the LAION dataset, and the text recognized in each image together with the image caption was used as a prompt. Text-only GPT-4 then generated 16K conversations, each containing question-answer pairs about an image. Combining these collected data with LLaVA's conversation data, the authors train LLaVAR, a model that can understand the text and scenes in an image in detail.
- Model structure:
- Visual encoder V: CLIP-ViT-L/14 is used for 224x224 inputs and CLIP-ViT-L/14-336 for 336x336 inputs. The features output by the last Transformer layer are mapped into the word embedding space of the language decoder through a projection matrix W;
- Language decoder D: Vicuna-13B, which is based on LLaMA
- Training method:
- Pre-training: only the projection layer between the visual encoder and the LLM is trained (using the 595k image-text pairs that LLaVA filtered from CC3M plus the newly constructed 422k noisy data);
- Fine-tuning: both the projection layer and the LLM are trained (using LLaVA's 158k instruction data built on MSCOCO plus the newly constructed 16k instruction data to train the model's instruction-following ability);
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
- Paper title: Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
- Paper address: arxiv.org/abs/2312.06109
- motivation:
- PDF documents are hard because pictures, tables, titles, paragraphs and other content must all be fully restored to form a text version of the document.
- Problems with existing open-source multimodal large models
- Chinese support is poor, since most of the training data is in English.
- Document-level recognition is weak. Multimodal large models are not trained purely for OCR, so relevant training data may be lacking. When recognizing document images, content is easily missed, leading to hallucinated or inaccurate answers.
- Idea: Collect new data, train a new visual encoder, and merge it with the original visual encoder.
Instruct-Imagen: Image generation with multi-modal instruction
- Paper title: Instruct-Imagen: Image Generation with Multi-modal Instruction
- Organization: Google Research, Google DeepMind
- Related fields: instruction fine-tuning, multimodal
- Paper address: https://arxiv.org/pdf/2401.01952
- Author: Hexiang Hu, Kelvin CK Chan, Yu-Chuan Su
- Paper Method: The paper introduces Instruct-Imagen, a model that tackles heterogeneous image generation tasks and generalizes to unseen tasks. It proposes multi-modal instruction for image generation, a task representation that uses natural language to combine different modalities (e.g., text, edge, style, subject) so that diverse generation intents can be expressed in a uniform format. The authors build Instruct-Imagen by fine-tuning a pretrained text-to-image diffusion model in two stages. First, they use retrieval-augmented training so that the model learns to ground its generation in external multimodal context. Second, they fine-tune the adapted model on a variety of image generation tasks that require vision-language understanding (e.g., subject-driven generation), each paired with a multi-modal instruction that captures the essence of the task. Human evaluation on various image generation datasets shows that Instruct-Imagen matches or surpasses previous task-specific models in-domain and shows promising generalization to unseen and more complex tasks.
LLaVA-φ: Efficient multimodal assistant with a small language model
- Paper title: LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model
- Institutions: IDEA, East China Normal University
- Related fields: instruction fine-tuning, multimodal
- Paper address: arxiv.org/pdf/2401.02330
- Code: github.com/zhuyiche/llava-phi
- Author: Yichen Zhu, Minjie Zhu, Ning Liu
- Paper Methods: LLaVA-φ is an efficient multimodal assistant that uses the recently released small language model Phi-2 to support multimodal dialogue. It marks a notable advance for compact multimodal models, showing that even a language model with only 2.7B parameters can take part in complex dialogues mixing text and visual elements, provided it is trained on high-quality corpora. The model performs commendably on public benchmarks covering visual understanding, reasoning, and knowledge-based perception. Beyond multimodal dialogue, it opens new avenues for time-sensitive environments and systems that need real-time interaction, such as embodied agents, and highlights the potential of smaller language models to reach sophisticated understanding and interaction while remaining resource-efficient.
Mining fine-grained image-text alignment for zero-shot captioning via text-only training
- Paper title: Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training
- Institution: ShanghaiTech University
- Related fields: Multimodal
- Paper address: https://arxiv.org/pdf/2401.02347
- Code: https://github.com/Artanic30/MacCap
- Author: Longtian Qiu, Shan Ning, Xuming He
- Paper Method: Based on an analysis of the CLIP latent space, this paper proposes a framework for zero-shot image captioning that is trained on text only. By mining the visual features of image subregions and the information lost in text descriptions, and by introducing noise injection and re-ranking strategies, it reduces the modality gap and improves captioning performance.
Learning to prompt vision-language models with text-only supervision
- Paper title: Learning to Prompt with Text Only Supervision for Vision-Language Models
- Institutions: Google, ETH Zurich
- Related fields: pre-training, multimodal
- Paper address: https://arxiv.org/pdf/2401.02418
- Code: https://github.com/muzairkhattak/ProText
- Author: Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer
- Paper Method: This paper combines the strengths of visual information and large language models by learning prompts from language models using text data only. The approach enables zero-shot transfer to new categories and datasets and reduces the cost of LLM prompt engineering.
GPT4Video
- GPT4Video
- Paper title: GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
- Paper address: https://arxiv.org/abs/2311.16511
- Paper example: https://gpt4video.github.io/
- Paper background: Current multimodal large language models (MLLMs) have verified the effectiveness of multimodal data fusion, but little work has explored generating multimodal information;
- Paper framework:
- Video understanding module. A video feature extractor first extracts video features, which a video abstractor then aligns with the LLM;
- Large language model. Initialized from pretrained LLaMA parameters and fine-tuned with LoRA;
- Video generation module. The prompt output by the LLM is fed to a text-to-video model to obtain the generated video.
PEFT series
Prompt
- Paper title: Prompt Tuning
- Paper address: https://arxiv.org/pdf/2107.13586.pdf
- Github address:
- Meeting:
- Motivation: Customizing a pre-trained large language model separately for every task is very inefficient. Could the pre-trained language model instead act as a power supply, with different tasks as electrical appliances, so that only the socket needs to be chosen to fit each appliance (task)? For the model, this means plugging in different task-specific parameters to adapt it to each downstream task.
- Paper method: Give a clue/hint to the pre-trained language model to help it better understand human problems.
Instruction
- Paper title: Finetuned Language Models Are Zero-Shot Learners
- Paper address: https://arxiv.org/abs/2109.01652
- Github address: https://github.com/google-research/flan
- Meeting:
- Motivation: PLMs generally perform well in few-shot settings but only moderately in zero-shot settings. One likely reason is that the model struggles to follow prompts whose format differs from the pre-training data.
- Paper method: Elicit the language model's understanding ability by giving more explicit instructions, so that the model understands the task and takes the correct action.
self-instruct
- Paper title: Self-Instruct: Aligning Language Model with Self Generated Instructions
- Paper address: https://arxiv.org/abs/2212.10560
- Github address: https://github.com/yizhongw/self-instruct
- Meeting:
- Motivation: Instruction tuning gives a trained LLM an excellent ability to generalize its instruction understanding to new tasks in zero-shot settings. However, this approach relies heavily on large language models and on high-quality human-written instruction data, which costs a lot of manpower and material resources.
- Paper method: Improve the LLM's instruction-following ability by prompting a publicly accessible LLM interface to generate instructions by itself. This is an efficient distillation method of the LLM era: supervised data is obtained from a high-quality pre-trained LLM interface and used to tune the target model, distilling the large model's knowledge into it.
LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS
- Paper title: LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS
- Paper address:
- Github address: https://github.com/microsoft/LoRA
- Meeting:
- motivation:
- Adapters: the main problem is the extra computation and latency they add at inference time.
- Prompt optimization: prefix tuning is hard to optimize, and its performance does not change monotonically as the number of parameters grows.
- Paper method:
- Add a bypass next to the original model that approximates the parameter update through a low-rank decomposition (project down, then project up);
- During training, the original model is frozen and only the down-projection matrix A and the up-projection matrix B are trained;
- At inference, BA can be added to the original parameters, so no extra inference latency is introduced;
- Initialization: A uses a Gaussian initialization and B is initialized to all zeros, so the bypass is a zero matrix at the start of training;
- Pluggable task switching: if the current task uses W0+B1A1, subtracting the LoRA part and adding B2A2 switches to another task;
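The steps above can be sketched in NumPy. This is a minimal illustration, not the microsoft/LoRA implementation: the sizes are made up, the scaling factor used in practice is omitted, and the "trained" adapters for the two tasks are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 8, 2                    # illustrative sizes (assumptions)

W0 = rng.normal(size=(d_out, d_in))         # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # down-projection, Gaussian init
B = np.zeros((d_out, r))                    # up-projection, zero init


def lora_forward(x, W0, A, B):
    """Original path plus the low-rank bypass B(Ax)."""
    return W0 @ x + B @ (A @ x)


x = rng.normal(size=d_in)
y_start = lora_forward(x, W0, A, B)         # equals W0 @ x because B is zero

# Stand-ins for adapters obtained by training on two tasks (random here).
A1, B1 = rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r))
A2, B2 = rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r))

W_task1 = W0 + B1 @ A1                      # merge: no extra layer at inference
W_task2 = W_task1 - B1 @ A1 + B2 @ A2       # swap adapters to switch tasks
```

Merging trades flexibility for speed: a merged weight serves one task with no overhead, while keeping A and B separate lets several tasks share one copy of W0.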
DyLoRA: Parameter-efficient tuning of pretrained models using dynamic search-free low-rank adaptation
- Paper title: DyLoRA: Parameter-Efficient Tuning of Pretrained Models using Dynamic Search-Free Low Rank Adaptation
- Paper address: https://arxiv.org/pdf/2210.07558v2.pdf
- Github address: https://github.com/huawei-noah/KD-NLP/tree/main/DyLoRA
- Meeting:
- Motivation: Problems with LoRA:
- The value of the rank is fixed and cannot be modified after training is completed.
- Optimizing the value of a rank requires a lot of search and effort.
- Paper method: Dynamic low-rank adaptation (DyLoRA) is introduced. By sorting the representations learned by the adapter module at different ranks during training, LoRA blocks are trained for a range of ranks rather than a single rank.
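One way to picture training for a range of ranks is nested truncation: at each step a rank b is sampled and only the first b rows of A and the first b columns of B are used. The sketch below assumes that reading; the uniform sampling, the sizes, and the function name `dylora_forward` are illustrative assumptions rather than the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r_min, r_max = 6, 8, 1, 4      # illustrative sizes (assumptions)

W0 = rng.normal(size=(d_out, d_in))         # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r_max, d_in))
B = rng.normal(scale=0.01, size=(d_out, r_max))  # non-zero so truncation is visible


def dylora_forward(x, W0, A, B, b):
    """Forward pass that uses only the first b rank components of the adapter."""
    return W0 @ x + B[:, :b] @ (A[:b, :] @ x)


x = rng.normal(size=d_in)

# Training: sample a rank for this step (uniform sampling is an assumption).
b = int(rng.integers(r_min, r_max + 1))
y_train = dylora_forward(x, W0, A, B, b)

# Inference: any rank in the trained range can be chosen without retraining.
outputs = {rank: dylora_forward(x, W0, A, B, rank) for rank in range(r_min, r_max + 1)}
```

Because lower ranks are prefixes of higher ranks, a single trained adapter can be cut down to a smaller rank after training, which addresses the fixed-rank problem listed in the motivation.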
LOMO: Use limited resources to fine-tune the full parameters of large language models
- Paper title: FULL PARAMETER FINE-TUNING FOR LARGE LANGUAGE MODELS WITH LIMITED RESOURCES
- Paper address: https://arxiv.org/abs/2306.09782
- Github address: https://github.com/OpenLMLab/LOMO
- Meeting:
- Motivation:
- Large language models (LLMs) have transformed natural language processing (NLP), but training LLMs requires a lot of GPU resources;
- Existing methods focus on parameter-efficient fine-tuning, i.e. fine-tuning or adding a small number of parameters, but few address tuning all parameters of an LLM under limited resources, even though full-parameter fine-tuning is considered more powerful than parameter-efficient fine-tuning;
- Paper method: A new optimizer, LOw-Memory Optimization (LOMO), is proposed, which fuses gradient computation and parameter update into one step to reduce memory usage. Combined with existing memory-saving techniques, LOMO reduces memory usage to 10.8% of the standard approach (DeepSpeed solution). This enables full-parameter fine-tuning of a 65B model on a single machine with 8×RTX 3090 GPUs, each with 24GB of memory.
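The idea of fusing gradient computation and parameter update can be sketched on a toy stack of linear layers with manual backpropagation. This is an illustration under simplifying assumptions (plain SGD, no nonlinearity, squared-error loss, hypothetical function names), not the OpenLMLab/LOMO code: each layer is updated as soon as its gradient exists, so only one layer's gradient is held at a time, and the result matches a standard step that stores all gradients first.

```python
import numpy as np


def forward(weights, x):
    """Run a stack of linear layers, keeping each layer's input for backprop."""
    inputs = []
    h = x
    for W in weights:
        inputs.append(h)
        h = W @ h
    return h, inputs


def standard_sgd_step(weights, x, target, lr):
    """Baseline: store every layer's gradient, then update all layers."""
    y, inputs = forward(weights, x)
    grad_out = y - target                     # gradient of 0.5 * ||y - target||^2
    grads = [None] * len(weights)
    for i in reversed(range(len(weights))):
        grads[i] = np.outer(grad_out, inputs[i])
        grad_out = weights[i].T @ grad_out
    return [W - lr * g for W, g in zip(weights, grads)]


def fused_sgd_step(weights, x, target, lr):
    """Fused variant: update each layer right after its gradient is computed."""
    weights = [W.copy() for W in weights]
    y, inputs = forward(weights, x)
    grad_out = y - target
    for i in reversed(range(len(weights))):
        grad_W = np.outer(grad_out, inputs[i])
        grad_out = weights[i].T @ grad_out    # uses the pre-update weight
        weights[i] -= lr * grad_W             # update immediately
        del grad_W                            # only one layer's gradient is alive
    return weights


rng = np.random.default_rng(0)
layers = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5))]
x, target = rng.normal(size=4), rng.normal(size=3)

baseline = standard_sgd_step(layers, x, target, lr=0.01)
fused = fused_sgd_step(layers, x, target, lr=0.01)
```

The saving comes from never materializing the full list of gradients; the price is that optimizers needing global gradient statistics before the update do not fit this pattern directly.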
QLoRA
- Paper title: QLoRA: Efficient Finetuning of Quantized LLMs
- Paper address: https://arxiv.org/pdf/2305.14314.pdf
- Github address: https://github.com/artidoro/qlora
- Meeting:
- Motivation: There are three pain points in LoRA fine tuning:
- Small parameter space: LoRA trains few parameters, so the solution space is smaller and the results fall somewhat short of full fine-tuning;
- High cost of fine-tuning large models: for models with tens of billions of parameters, LoRA fine-tuning is still expensive;
- Accuracy loss: to address the second point, the base model's parameters can be compressed further with int8 or int4 quantization, but this causes accuracy loss and lowers model performance.
- Paper method:
- 4-bit NormalFloat: proposes a 4-bit quantization data type that is theoretically optimal for normally distributed weights and outperforms the commonly used FP4 and Int4;
- Double Quantization: saves more GPU memory than existing quantization methods. It saves about 0.37 bits per parameter on average, roughly 3GB of GPU memory for the 65B LLaMA model;
- Paged Optimizers: use NVIDIA unified memory to avoid memory spikes from gradient checkpointing when processing mini-batches with long sequences;
- Adding adapters: 4-bit NormalFloat and Double Quantization save a lot of space but cost some performance. The authors compensate by inserting more adapters. LoRA usually inserts adapters only at the query and value linear layers, whereas QLoRA inserts adapters at all linear layers, adding trainable parameters to make up for the accuracy lost to quantization.
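The Double Quantization figures above (about 0.37 bits per parameter, about 3GB for a 65B model) can be reproduced with a few lines of arithmetic. The block sizes and bit widths below follow the QLoRA paper's description rather than the notes above, so treat them as assumptions of this sketch.

```python
# Assumed setup: one 32-bit quantization constant per block of 64 weights;
# with Double Quantization those constants are stored in 8 bits and grouped
# in blocks of 256, each group keeping one extra 32-bit constant.
block_size = 64
second_block_size = 256

bits_without_dq = 32 / block_size
bits_with_dq = 8 / block_size + 32 / (block_size * second_block_size)
saved_bits_per_param = bits_without_dq - bits_with_dq

num_params = 65e9                       # 65B-parameter model
saved_gb = saved_bits_per_param * num_params / 8 / 1e9

print(f"overhead without DQ: {bits_without_dq:.3f} bits/param")
print(f"overhead with DQ:    {bits_with_dq:.3f} bits/param")
print(f"saved: {saved_bits_per_param:.3f} bits/param, about {saved_gb:.2f} GB")
```

The saving concerns only the quantization constants; the 4-bit weights themselves are unchanged.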
VeRA: A low-rank fine-tuning method with 10 times fewer trainable parameters than LoRA
- Paper title: VeRA: Vector-based Random Matrix Adaptation
- Paper address: https://arxiv.org/pdf/2310.11454.pdf
- Github address:
- Meeting:
- Motivation: Limitations of existing methods:
- LoRA: still requires a large number of trainable parameters. According to Aghajanyan et al., the upper bound of the intrinsic dimension is much smaller than the ranks commonly used with this method, so the parameter count can be reduced further.
- AdaLoRA: reduces the tuned parameters further by allocating them dynamically. However, the authors believe another approach can significantly reduce trainable parameters without hurting performance.
- Paper method:
- Reparameterization of the low-rank matrices. A pair of randomly initialized matrices is frozen and shared across all adapted layers, and trainable scaling vectors are introduced to adapt each layer. As in LoRA, the trained scaling vectors and low-rank matrices can be merged into the original weights, so no extra inference latency is added.
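The reparameterization above can be sketched in NumPy as a weight update of the form diag(b) B diag(d) A, where A and B are frozen random matrices and only the vectors d and b are trained. The sizes, the initial values of the vectors, and the function name `vera_delta` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 8, 4                 # illustrative sizes (assumptions)

W0 = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in))           # frozen random matrix, shared across layers
B = rng.normal(size=(d_out, r))          # frozen random matrix, shared across layers

d_vec = np.full(r, 0.1)                  # trainable per-layer scaling vector
b_vec = np.zeros(d_out)                  # trainable, zero init so the update starts at 0


def vera_delta(A, B, d_vec, b_vec):
    """Weight update diag(b) @ B @ diag(d) @ A, computed with broadcasting."""
    return (b_vec[:, None] * B) @ (d_vec[:, None] * A)


delta_start = vera_delta(A, B, d_vec, b_vec)

# Stand-in for a trained b vector, then merge into the original weight.
b_trained = rng.normal(size=d_out)
W_merged = W0 + vera_delta(A, B, d_vec, b_trained)

# Trainable parameters for this one layer.
lora_params = r * (d_in + d_out)         # LoRA trains both A and B
vera_params = r + d_out                  # VeRA trains only d and b
```

Since A and B are shared and frozen, each additional adapted layer costs only its two vectors.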
Multilingual instruction fine-tuning can be performed with only a small amount of multilingual data
- Paper title: Multilingual Instruction Tuning With Just a Pinch of Multilinguality
- Related fields: Instruction fine-tuning
- Institutions: Google Research, Tel Aviv University
- Author: Uri Shaham, Jonathan Herzig, Roee Aharoni
- Paper address: https://arxiv.org/pdf/2401.01854
- Github address:
- Meeting:
- Analysis: This paper studies how multilingual instruction tuning affects instruction following across languages in multilingual large language models (LLMs). It finds that even monolingual tuning transfers some instruction-following ability from many languages to other languages. Furthermore, adding only 40 multilingual examples to an English tuning set substantially improves multilingual instruction following, in both seen and unseen languages. Overall, models tuned on multilingual mixtures perform comparably to or better than monolingually tuned models in several languages, despite having 10 times fewer training examples in those languages. Finally, increasing the number of languages in the instruction tuning set from 1 to just 2, 3 or 4 increases cross-lingual generalization. The results suggest that massively multilingual instruction-tuned models can be built with only a very small set of multilingual instruction-response pairs.
GPT Series
Table analysis
- Few-shot QA: MINPROMPT
- Paper title: MINPROMPT: Graph-based Minimal Prompt Data Augmentation for Few-shot Question Answering
- Paper address: https://arxiv.org/pdf/2310.05007v1.pdf
- Thesis Github address:
- Meeting:
- Motivation: LLMs reading tables
- Question 1: Missing value recognition
- Question 2: Missing value recognition
- Question 3: Question answering over tables
- Paper method:
- Optimization strategy 1: Table optimization
- Optimization Strategy 2: Creating Data Sets: Synthesis Enhancement
RAG series
RAG Trick
Self-RAG: A retrieval-augmented generation strategy based on self-reflection
- Paper title: Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
- Paper address: https://arxiv.org/abs/2310.11511
- Thesis Github address:
- Meeting:
- motivation:
- Irrelevance between retrieved passages and the query: existing methods retrieve and incorporate a fixed number of passages indiscriminately, regardless of whether retrieval is needed or whether the passages are relevant. This reduces the versatility of LLMs or leads to poor generation quality (Shi et al., 2023);
- The generated output is not necessarily consistent with the retrieved passages (Gao et al., 2023), because these models are not explicitly trained to use and follow the facts in the passages provided;
- Paper method:
- Improve the LLM's generation quality, including factual accuracy, through on-demand retrieval and self-reflection, without compromising its versatility.
- The paper trains an arbitrary LLM end to end to reflect on its own generation process by generating both task output and intermittent special tokens (reflection tokens). Reflection tokens are divided into retrieval tokens and critique tokens, which indicate the need for retrieval and the quality of the generation, respectively.
Active RAG: A RAG strategy that actively decides whether retrieval is needed and retrieves only when it is
- Paper title: Active Retrieval Augmented Generation
- Paper address: https://arxiv.org/pdf/2305.06983.pdf
- Thesis Github address: https://github.com/jzbjyb/FLARE
- Meeting:
- Motivation: Retrieving at every generation step is clearly redundant.
- Paper method:
- Method 1: FLARE with Retrieval Instructions
- Method 2: Direct FLARE
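The notes above do not spell out Direct FLARE, so the following is a hedged sketch of the confidence-triggered idea as described in the FLARE paper: draft the next sentence, and if any token probability falls below a threshold, retrieve with the draft as the query and regenerate. The callables `generate` and `retrieve`, the threshold value, and the toy stubs are all hypothetical stand-ins, not the jzbjyb/FLARE API.

```python
def flare_step(context, generate, retrieve, threshold=0.5):
    """One generation step with confidence-triggered retrieval.

    generate(context, docs) -> (sentence, token_probs)   hypothetical callable
    retrieve(query) -> list of documents                  hypothetical callable
    Returns the accepted sentence and whether retrieval was triggered.
    """
    sentence, token_probs = generate(context, docs=[])
    if min(token_probs) >= threshold:
        return sentence, False            # confident draft: keep it, skip retrieval
    docs = retrieve(sentence)             # low confidence: use the draft as the query
    sentence, _ = generate(context, docs=docs)
    return sentence, True


# Toy stand-ins so the control flow can be exercised without a real model.
def toy_generate(context, docs):
    if docs:
        return "Paris is the capital of France.", [0.9, 0.9, 0.9]
    return "Lyon is the capital of France.", [0.3, 0.9, 0.9]


def toy_retrieve(query):
    return ["France's capital is Paris."]


sentence, retrieved = flare_step("Q: What is the capital of France?", toy_generate, toy_retrieve)
```

Repeating this step sentence by sentence gives retrieval only where the model is unsure, which is the redundancy the motivation points at.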
MemSum-DQA: Document QA
- Paper title: MemSum-DQA: Adapting an Efficient Long Document Extractive Summarizer for Document Question Answering
- Paper address: https://arxiv.org/pdf/2310.06436v1.pdf
- Thesis Github address: https://github.com/nianlonggu/MemSum-DQA
- Meeting: CIKM 2023
- motivation:
- Paper Method: The paper proposes "MemSum-DQA, an efficient document question answering (DQA) system". It builds on MemSum, a long-document extractive summarizer: the question and question type are prepended as a prefix to each text block of a parsed document, and MemSum then selectively extracts text blocks from the document as the answer.
PDFTriage: Q&A for long structured documents
- Paper title: PDFTriage: Question Answering over Long, Structured Documents
- Paper address: https://arxiv.org/pdf/2309.08872.pdf
- Thesis Github address:
- Meeting:
- Motivation: When a document does not fit in the LLM's limited context window, different strategies can be used to fetch the relevant context.
- Paper method:
- Generate document metadata: extract the document's structural elements and convert them into readable metadata;
- LLM-based triage: query the LLM to select the precise content (page, section, retrieved content) from the document;
- Answer using the retrieved content: generate the answer from the question and the retrieved content.
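The three steps above can be sketched as a small pipeline. Everything here is hypothetical scaffolding: the document layout, the function names, and the two callables that stand in for LLM calls (`select_section`, `generate_answer`) are assumptions for illustration, not the paper's code.

```python
import json


def build_metadata(document):
    """Step 1: expose the document structure as readable metadata."""
    entries = [{"section": s["title"], "page": s["page"]} for s in document["sections"]]
    return json.dumps(entries)


def answer_question(question, document, select_section, generate_answer):
    """Steps 2 and 3: let the model pick a section, then answer from its text."""
    metadata = build_metadata(document)
    title = select_section(question, metadata)             # LLM-based triage
    content = next(s["text"] for s in document["sections"] if s["title"] == title)
    return generate_answer(question, content)              # answer from retrieved text


# Toy document and stand-ins for the two LLM calls.
doc = {
    "sections": [
        {"title": "Introduction", "page": 1, "text": "This report covers 2023 sales."},
        {"title": "Results", "page": 2, "text": "Revenue grew 12% year over year."},
    ]
}


def toy_select(question, metadata):
    titles = [entry["section"] for entry in json.loads(metadata)]
    return "Results" if "revenue" in question.lower() else titles[0]


def toy_generate(question, content):
    return f"Based on the document: {content}"


reply = answer_question("How did revenue change?", doc, toy_select, toy_generate)
```

Only the metadata and the selected section reach the model, which is how a long document is kept within a limited context window.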
RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models
- Paper title: RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models
- Paper address: https://arxiv.org/pdf/2401.00396
- Related fields: model evaluation, dataset construction
- Github address:
- Meeting:
- Paper Methods: This paper introduces RAGTruth, a corpus for analyzing word-level hallucinations across domains and tasks within the standard RAG framework for LLM applications. RAGTruth contains nearly 18,000 naturally generated responses from different LLMs using RAG. The responses are meticulously annotated by hand, including ratings of hallucination intensity. The paper benchmarks hallucination frequencies across LLMs and critically evaluates several existing hallucination detection methods. It also shows that, with a high-quality dataset such as RAGTruth, a relatively small LLM can be fine-tuned to reach hallucination detection performance competitive with prompt-based approaches built on state-of-the-art LLMs such as GPT-4.
RAG application field
QA Q&A in the medical field
QA Q&A in the Religious Field
- QASiNa Religious Field QA Q&A
- Paper title: QASiNa: Religious Domain Question Answering using Sirah Nabawiyah
- Paper address: https://arxiv.org/pdf/2310.08102v1.pdf
- Motivation: With the development of large language models (LLMs), they can be applied to many fields, but applying them to the Islamic religious domain conflicts with the principles of information transmission. In Islam, both the sources of information and who may interpret them are strictly regulated. An LLM generating answers from its own interpretation resembles tafseer, yet an LLM is neither an Islamic expert nor a human, which Islam does not permit. Given the high influence of LLMs, the authors evaluate LLMs in the religious domain.
- Paper Methods: The paper proposes the Question Answering Sirah Nabawiyah (QASiNa) dataset, a novel dataset compiled from Indonesian-language Sirah Nabawiyah literature. The dataset is validated with mBERT, XLM-R, and IndoBERT, which are fine-tuned on the Indonesian translation of SQuAD v2.0.
QA Q&A in the Common Sense Field
- QADYNAMICS Common Sense QA Q&A
- Paper title: QADYNAMICS: Training Dynamics-Driven Synthetic QA Diagnostic for Zero-Shot Commonsense Question Answering
- Paper address: https://arxiv.org/pdf/2310.11303v1.pdf
- Thesis Github address: https://github.com/HKUST-KnowComp/QaDynamics
- Motivation: Zero-shot commonsense question answering (QA) requires a model to reason about general situations. The state-of-the-art approach builds QA pairs from commonsense knowledge bases (CSKBs) and fine-tunes a language model on them to give it more commonsense knowledge. However, the QA construction process may introduce noise from the CSKBs and produce ungrammatical or otherwise flawed QA pairs, which hurts the model's generalization.
- Paper Method: The paper proposes "QADYNAMICS, a training dynamics-driven framework for QA diagnosis and refinement". The method analyzes the training dynamics of each QA pair at both the question level and the option level, and cleans the synthetic training data by removing uninformative QA pairs, mislabeled pairs, and wrong options.
QA Q&A in the Legal Field
- Long-Form Legal Question Answering Legal QA Q&A
- Paper title: Interpretable Long-Form Legal Question Answering with Retrieval-Augmented Large Language Models
- Paper address: https://arxiv.org/pdf/2309.17050v1.pdf
- Thesis Github address: https://github.com/maastrichtlawtech/lleqa
- Meeting: CIKM 2023
- Motivation: Many people face legal disputes at some point in their lives, but their lack of understanding of how to resolve such complex problems often leaves them vulnerable. Advances in natural language processing have opened new ways to bridge this legal literacy gap through automated legal aid systems. However, existing legal question answering (LQA) methods tend to be narrow in scope, either confined to specific legal areas or limited to short, uninformative answers.
- Paper Method: The paper proposes an end-to-end approach that uses a "retrieve-then-read" pipeline to generate long-form answers to any statutory law question. To support this approach, the Long-form Legal Question Answering (LLeQA) dataset is introduced and released, containing 1,868 expert-annotated legal questions in French, with detailed answers grounded in the relevant legal provisions.
QA Q&A in the field of knowledge graph
- CHATKBQA: Knowledge Retrieval QA Q&A
- Paper title: CHATKBQA: A GENERATE-THEN-RETRIEVE FRAMEWORK FOR KNOWLEDGE BASE QUESTION ANSWERING WITH FINE-TUNED LARGE LANGUAGE MODELS
- Paper address: https://arxiv.org/pdf/2310.08975v1.pdf
- Thesis Github address: https://github.com/LHRLAB/ChatKBQA
- Meeting:
- motivation:
- Inefficient knowledge retrieval;
- Retrieval errors affect semantic parsing results;
- Complexity of previous KBQA methods.
- Paper method: The paper first uses a fine-tuned LLM to generate logical forms, then retrieves and replaces the entities and relations in them with an unsupervised retrieval method, which improves both generation and retrieval more directly.
QA Q&A in Task-based Domain
- InstructTODS: Knowledge Search QA Q&A
- Paper title: InstructTODS: Large Language Models for End-to-End Task-Oriented Dialogue Systems
- Paper address: https://arxiv.org/pdf/2310.08885v1.pdf
- Thesis Github address: https://github.com/WillyHC22/InstructTODS/
- Meeting:
- Motivation: Large language models (LLMs) are now used in a wide range of natural language processing (NLP) tasks, but their use in task-oriented dialogue systems (TODS), especially end-to-end TODS, remains underexplored.
- Paper Method: The paper proposes "InstructTODS", a zero-shot framework for end-to-end task-oriented dialogue systems that adapts to different domains without fine-tuning. By leveraging LLMs, InstructTODS generates a proxy belief state that seamlessly converts user intent into dynamic queries for efficient interaction with any knowledge base.
QA Q&A in the automotive field
- CarExpert: Automotive Retrieval Enhanced QA Q&A
- Paper title: CarExpert: Leveraging Large Language Models for In-Car Conversational Question Answering
- Paper address: https://arxiv.org/pdf/2310.09536v1.pdf
- Thesis Github address:
- Meeting:
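- Motivation: Large language models (LLMs) perform remarkably well by following natural-language instructions without fine-tuning on domain-specific tasks and data. However, using LLMs for domain-specific question answering often produces hallucinations. In addition, because they lack awareness of the domain and of the expected output, LLMs may generate wrong answers that are unsuitable for the target domain.
- Paper Method: The paper proposes "CarExpert", an in-car retrieval-augmented conversational question answering system that uses LLMs for different tasks. Specifically, CarExpert uses LLMs to control the input, to provide domain-specific documents to the extractive and generative answering components, and to control the output so that answers are safe and domain-specific.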
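Prompt series
- MINPROMPT: Few-shot QA Q&A
- Paper title: MINPROMPT: Graph-based Minimal Prompt Data Augmentation for Few-shot Question Answering
- Paper address: https://arxiv.org/pdf/2310.05007v1.pdf
- Paper GitHub address:
- Meeting:
- Motivation: Few-shot question answering (few-shot QA) aims to get satisfactory answers from a model with only a small number of training samples. Recent progress relies mainly on large language models (LLMs). Although pre-training already gives LLMs strong reasoning ability, they still need fine-tuning to adapt to a specific domain and reach the best results.
- Paper Method: The paper proposes "MinPrompt", a minimal data augmentation framework for open-domain QA based on an approximate graph algorithm and unsupervised question generation. The authors convert the raw text into a graph structure to connect different factual sentences, then apply graph algorithms to identify the minimal set of sentences needed to cover the most information in the raw text. QA pairs are then generated from the identified sentence subset, and the model is trained on the selected sentences to obtain the final model. Empirical results show that MinPrompt achieves results comparable to or better than the baselines with high efficiency.
LMMs Interpretability
LLMs4KG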
- ChatKBQA
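- Paper title: ChatKBQA: A Generate-then-Retrieve Framework for Knowledge Base Question Answering with Fine-tuned Large Language Models
- Paper address: https://arxiv.org/abs/2310.08975
- GitHub address: https://github.com/LHRLAB/ChatKBQA
- Meeting:
- Motivation: Use a fine-tuned open-source large model to convert natural-language questions into logical forms, then use unsupervised entity and relation retrieval to produce a graph database query, yielding a knowledge graph question answering framework for natural language.
- Paper Method: The paper proposes ChatKBQA, a novel generate-then-retrieve KBQA framework built on fine-tuned open-source LLMs such as Llama-2-7B, ChatGLM2-6B, and Baichuan2-7B;
- The fine-tuned model first generates the logical form, and the entities and relations in the generated logical form are then retrieved from the knowledge base's entity set and relation set respectively. This avoids the effect that retrieving first had on logical form generation in earlier methods and improves retrieval efficiency;
- In the generation stage, instruction tuning is used to fine-tune the open-source LLMs, giving them the ability to perceive and generate logical forms.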
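LLMs Agents
Role-Play
Attention
- System 2 Attention
- Paper title: System 2 Attention (is something you might need too)
- Paper link: https://arxiv.org/abs/2311.11829
- GitHub address:
- Motivation: Large language models (LLMs) are very capable, yet they still make simple mistakes that suggest weak reasoning. For example, irrelevant context, or a preference or opinion embedded in the input prompt, can lead them to wrong judgments. The latter case exhibits a problem known as sycophancy, where the model simply agrees with the input.
- Paper Method: The paper proposes a technique called System 2 Attention (S2A), which lets the LLM decide which parts of the input context matter in order to generate a good response. It works in two steps: first the LLM is prompted to regenerate the input context so that it contains only the relevant parts, and then the model attends to the regenerated context to produce the final response.
- The experiments show that S2A can successfully rewrite context that would otherwise degrade the final answer, so the method improves factuality and reduces sycophancy in the responses at the same time.
- There is still plenty of room for future work. The experiments use zero-shot prompting to implement S2A. Other approaches could optimize the method further through fine-tuning, reinforcement learning, or alternative prompting techniques. A successful S2A could also be distilled back into standard LLM generation, for example by fine-tuning with the original prompt as input and the final improved S2A response as the target.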
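The two-step procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `llm` callable and both prompt templates are hypothetical stand-ins for whatever model client and wording are actually used.

```python
from typing import Callable

# Hypothetical prompt wording; the paper's exact templates are not reproduced here.
REWRITE_TEMPLATE = (
    "Rewrite the following text, keeping only the parts that are relevant "
    "and unbiased for answering the question it contains.\n\n"
    "Text: {context}\n\nRewritten text:"
)
ANSWER_TEMPLATE = "{context}\n\nAnswer the question above."


def system2_attention(llm: Callable[[str], str], user_input: str) -> str:
    """Two-step S2A sketch: regenerate the context, then answer from it only."""
    # Step 1: ask the model to regenerate the context without distractors.
    regenerated = llm(REWRITE_TEMPLATE.format(context=user_input))
    # Step 2: answer from the regenerated context, not from the original input.
    return llm(ANSWER_TEMPLATE.format(context=regenerated))
```

The key design point is that the second call never sees the original input, so irrelevant or opinionated text removed in step 1 cannot influence the final answer.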
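Search
How to build a "query-doc" through a big model?
Explanation: Data augmentation for search data means obtaining more "query-doc" pairs. One approach generates pseudo docs from a query, and the other generates pseudo queries from a doc.
InPars: Data augmentation for information retrieval using large language models
- Paper title: InPars: Data Augmentation for Information Retrieval using Large Language Models
- Paper address: https://arxiv.org/abs/2202.05144
- Method: InPars uses the in-context learning ability of an LLM, together with a few given examples, to generate a large number of pseudo queries for docs, and then "filters" the results with a fine-tuned language model.
InPars-v2: Large language models as efficient dataset generators for information retrieval
- Paper title: InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval
- Paper address: https://arxiv.org/abs/2301.01820
- Method: A major change in InPars-v2 is that it filters the generated queries with a T5-3B model fine-tuned on a retrieval dataset, rather than simply filtering by probability, which makes the generated data more reliable.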
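The "generate, then filter" loop shared by InPars and InPars-v2 can be sketched as below. This is an illustrative outline under stated assumptions: `generate` (an LLM completion function), `score` (a relevance scorer such as a fine-tuned reranker), and the prompt layout are all hypothetical placeholders rather than the papers' actual components.

```python
from typing import Callable, List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], doc: str) -> str:
    """Lay out (document, query) examples followed by the new document."""
    parts = [f"Document: {d}\nQuery: {q}\n" for d, q in examples]
    parts.append(f"Document: {doc}\nQuery:")
    return "\n".join(parts)


def generate_and_filter(
    docs: List[str],
    examples: List[Tuple[str, str]],
    generate: Callable[[str], str],      # hypothetical LLM completion function
    score: Callable[[str, str], float],  # hypothetical (query, doc) relevance scorer
    keep: int,
) -> List[Tuple[str, str]]:
    """Generate one pseudo query per doc, then keep the `keep` best-scored pairs."""
    scored = []
    for doc in docs:
        query = generate(build_few_shot_prompt(examples, doc)).strip()
        scored.append((query, doc, score(query, doc)))
    # Filtering step: sort by score and keep only the top pairs.
    scored.sort(key=lambda item: item[2], reverse=True)
    return [(q, d) for q, d, _ in scored[:keep]]
```

Swapping the `score` function is what distinguishes the variants in this sketch: a generation-probability score corresponds to filtering by probability, while a fine-tuned reranker corresponds to the InPars-v2 style filter.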
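InPars-Light: Cost-effective unsupervised training of efficient rankers
- Paper title: InPars-Light: Cost-Effective Unsupervised Training of Efficient Rankers
- Paper address: https://arxiv.org/abs/2301.02998
- Method: The later InPars-Light version also slims down the "filter", reducing its parameters from 3 billion to 200 million.
Promptagator: Few-shot dense retrieval from 8 examples
- Paper title: Promptagator: Few-shot Dense Retrieval From 8 Examples
- Paper address: https://arxiv.org/abs/2301.02998
- Method: PROMPTAGATOR builds on the "generate-filter" process of InPars: it fine-tunes a retriever on the generated samples and then uses that retriever to filter the generated samples. These two steps are repeated until convergence to produce a high-quality training set.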
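The filtering half of that loop can be sketched as a round-trip check: a generated query is kept only if the retriever ranks its source document near the top. This is a minimal sketch under assumptions: the word-overlap scorer is a toy stand-in for the fine-tuned retriever, and `top_k` is an illustrative parameter.

```python
from typing import Callable, List, Tuple


def word_overlap(query: str, doc: str) -> float:
    """Toy relevance score (shared lowercase words); stands in for a trained retriever."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def round_trip_filter(
    pairs: List[Tuple[str, str]],
    corpus: List[str],
    score: Callable[[str, str], float] = word_overlap,
    top_k: int = 1,
) -> List[Tuple[str, str]]:
    """Keep (query, doc) pairs whose doc is among the top_k results for the query."""
    kept = []
    for query, doc in pairs:
        ranked = sorted(corpus, key=lambda d: score(query, d), reverse=True)
        if doc in ranked[:top_k]:
            kept.append((query, doc))
    return kept
```

In a full loop, the retriever behind `score` would be re-trained on the surviving pairs before the next filtering round.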
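UDAPDR: Unsupervised domain adaptation via LLM prompting and distillation of rerankers
- Paper title: UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers
- Paper address: https://arxiv.org/abs/2303.00807
- Motivation: With InPars-v2, researchers realized that calling the APIs of LLMs such as ChatGPT and GPT-4 for data augmentation is very costly, and began replacing API calls with open-source LLMs, which can lower the quality of the augmented data.
- Method: To address this, UDAPDR first uses a high-quality LLM to generate high-quality queries from docs, and then feeds the high-quality doc-query pairs to a low-cost LLM to scale up the quantity, balancing cost and quality.
How to label "query-doc" positive and negative examples through a big model?
The methods above can build "query-doc" pairs, but how do we tell good pairs from bad ones? Here an LLM can be used to obtain pseudo labels for a query and a doc, that is, to have the model judge whether a pair is a positive sample and with what probability.
ART: Questions are all you need to train a dense passage retriever
- Paper title: ART:Questions Are All You Need to Train a Dense Passage Retriever
- Paper address: https://arxiv.org/abs/2206.10658
- Method: The query is first encoded as a vector, a vector retriever then selects relevant documents, and the model scores the relevance of each document to the query. These scores serve as soft labels that are fed back to update the passage encoder and the question encoder.
ExaRanker: Explanation-Augmented Neural Ranker
- Paper title: ExaRanker:Explanation-Augmented Neural Ranker
- Paper address: https://arxiv.org/abs/2206.10658
- Method: ExaRanker uses GPT-3.5 to generate explanations for a retrieval dataset, and then trains a seq2seq ranking model to generate the relevance label together with the corresponding explanation for a given query-document pair.
ChatGPT-RetrievalQA: Generating synthetic documents for cross-encoder re-rankers, a comparative study of ChatGPT and human experts
- Paper title: ChatGPT-RetrievalQA:Generating Synthetic Documents for Cross-Encoder Re-Rankers: A Comparative Study of ChatGPT and Human Experts
- Paper address: https://arxiv.org/abs/2305.02320
- Method: The authors study how useful generative large language models (LLMs) are for producing training data for cross-encoder re-rankers, in a new direction: generating synthetic documents rather than synthetic queries. They introduce a new dataset, ChatGPT-RetrievalQA, and compare the effectiveness of models fine-tuned on LLM-generated and human-generated data. Data generated by generative LLMs can be used to augment training data, especially in domains with little labeled data. ChatGPT-RetrievalQA is built on an existing dataset, the Human ChatGPT Comparison Corpus (HC3), which consists of public question collections with human responses and answers from ChatGPT.
- Experimental results: The authors fine-tune a range of cross-encoder re-rankers on either human-generated or ChatGPT-generated data. Their evaluation on MS MARCO DEV, TREC DL'19, and TREC DL'20 shows that cross-encoder re-ranking models trained on ChatGPT responses are more effective than those trained on human responses. In a supervised setting, the human-trained re-rankers outperform the LLM-trained re-rankers. These findings suggest that generative LLMs have high potential for generating training data for neural retrieval models. Further work is needed to determine the effect of factually wrong information in the generated responses and to test how well the findings generalize to open-source LLMs. The authors release their data, code, and cross-encoder checkpoints for future work.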
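How to rewrite "query-doc" through a big model?
The LLM acts as a generative model: it writes a passage of text based on the user's query, and that passage is passed as the rewritten query to the downstream retrieval module to improve final retrieval quality.
How to comprehensively utilize PRF (pseudo-related feedback) + GRF (generate related feedback) through large models?
The studies above all use the LLM's generated output as the main content of the rewrite, which can be seen as generative relevance feedback (GRF). Many studies also add pseudo-relevance feedback (PRF) during model generation or result post-processing to improve the quality of the rewrite.
HyDE: Precise zero-shot dense retrieval without relevance labels
- Paper title: HyDE:Precise Zero-Shot Dense Retrieval without Relevance Labels
- Paper address: https://arxiv.org/abs/2212.10496
- Motivation: the LLM hallucination problem
- Method: HyDE encodes the text generated by the LLM and uses a vector retriever to match it for relevance against candidate documents in the real document collection, then uses the real documents as the rewritten result to support the query. In essence, the method recalls pseudo-relevant documents with the LLM's output instead of with the query.
- advantage:
- Compared with traditional PRF methods, it ensures the relevance of the pseudo-relevant documents from the first retrieval pass;
- Compared with methods such as Query2doc, combining PRF avoids the hallucinations an LLM may produce and keeps the results highly faithful to real documents.
- Similarly, LameR moves the PRF step to before the LLM input.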
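The HyDE flow described above can be sketched as follows. This is a toy illustration, not the paper's code: the bag-of-words `embed` function stands in for a real dense encoder, and `llm` and the prompt string are hypothetical placeholders.

```python
import math
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use a dense encoder here."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hyde_retrieve(query: str, corpus: List[str], llm: Callable[[str], str], top_k: int = 3) -> List[str]:
    """Embed an LLM-written hypothetical document and retrieve real documents with it."""
    # Step 1: the LLM writes a hypothetical answer passage (it may contain errors).
    hypothetical_doc = llm(f"Write a passage that answers the question: {query}")
    # Step 2: search with the hypothetical passage's vector instead of the query's.
    query_vec = embed(hypothetical_doc)
    ranked = sorted(corpus, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    # Step 3: only real documents are returned, which limits the impact of hallucination.
    return ranked[:top_k]
```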
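LameR: Large language models are strong zero-shot retrievers
- Paper title: LameR:Large Language Models are Strong Zero-Shot Retriever
- Paper address: https://arxiv.org/abs/2304.14233
- Motivation: the LLM hallucination problem
- method:
- advantage:
Rewrite-Retrieve-Read: Query rewriting for retrieval-augmented large language models
- Paper title: Rewrite-Retrieve-Read:Query Rewriting for Retrieval-Augmented Large Language Models
- Paper address: https://arxiv.org/abs/2305.14283
- Motivation: the LLM hallucination problem
- Method: Rewrite-Retrieve-Read uses rewriting to strengthen retrieval-augmented LLMs. The paper's figure compares three setups, from left to right: a retrieval-augmented LLM, a retrieval-augmented LLM with a rewriter, and a retrieval-augmented LLM with a rewriter trained by reinforcement learning. Rewrite-Retrieve-Read refers to the third. The method not only uses an LLM as a rewriter to improve retrieval augmentation, but also introduces reinforcement learning, using feedback from the final answer to train a high-quality LLM rewriter.
- advantage:
PRF+GRF: Generative and pseudo-relevant feedback for sparse, dense and learned sparse retrieval
- Paper title: PRF+GRF:Generative and Pseudo-Relevant Feedback for Sparse, Dense and Learned Sparse Retrieval
- Paper address: https://arxiv.org/abs/2305.07477
- Motivation: the LLM hallucination problem
- Method: PRF+GRF directly combines the PRF results with the LLM output and uses a weighted combination of the two as the rewritten result.
- advantage:
InteR: Knowledge refinement via interaction between search engines and large language models
- Paper title: InteR:Knowledge Refinement via Interaction Between Search Engines and Large Language Models
- Paper address: https://www.researchgate.net/publication/370763983_Knowledge_Refinement_via_Interaction_Between_Search_Engines_and_Large_Language_Models
- Motivation: the LLM hallucination problem
- Method: InteR is a multi-round interaction framework between a search system and an LLM. Through repeated rounds of PRF and LLM output, each process strengthens the other.
- advantage: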
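How to do recall and ranking with a big model?
What is recall?
Recall (retrieval) is a core module of a search system. It can be divided into sparse retrieval based on statistical algorithms (Sparse Retriever) and dense retrieval based on neural networks (Dense Retriever).
What problems does recall have?
- Queries are short and ambiguous
- Docs are long and noisy
- Labeling supervised data is expensive
- PLMs still have room for improvement
How do encoder-based LLM retrievers work?
An encoder-based retriever, in dense retrieval, uses the strong semantic ability of an LLM to obtain vector representations of the query or doc, and then retrieves with a vector retriever.
How do generative LLM retrievers work?
The studies above all aim to use the strong semantic encoding ability of LLMs to encode queries, docs, and similar content. But even before LLMs rose to prominence, many studies worked on end-to-end retrieval models, called generative retrievers. Instead of encoding first and retrieving afterwards, generative methods combine an encoder and a decoder to produce the identifier of the document to retrieve directly.
How to rank with a big model?
Fine-tuning an LLM for similarity scoring
Before very large models such as GPT-3 appeared, many studies used PLMs and treated ranking as a similarity task, computing a similarity score for each query and doc. RankT5 is one such model: built on T5, it directly computes a relevance score for each query-document pair and is fine-tuned with a pairwise or listwise ranking loss.
- RankT5: Fine-tuning T5 for text ranking with ranking losses
- Paper title: RankT5:Fine-Tuning T5 for Text Ranking with Ranking Losses
- Paper address: https://arxiv.org/abs/2202.06991
- motivation:
- Method: RankT5 has two ways of computing the score: one uses the encoder-decoder structure, and the other needs no decoding and derives the ranking score directly from the encoder output.
- The authors' experiments show that each structure wins in some cases, which indirectly suggests that the decoder contributes little and that operations such as distillation can target the encoder directly. There are many similar studies that simply swap the backbone for BERT, BART, GPT, and so on.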
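To make the pairwise and listwise ranking losses concrete, here is a small standard-library sketch of a pairwise logistic loss and a listwise softmax cross-entropy loss over the scores of one query's candidate list. The function names and the plain-list interface are illustrative choices, not RankT5's actual training code.

```python
import math
from typing import List


def pairwise_logistic_loss(scores: List[float], labels: List[float]) -> float:
    """Sum of log(1 + exp(s_j - s_i)) over pairs where doc i is more relevant than doc j."""
    loss = 0.0
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            if label_i > label_j:
                loss += math.log1p(math.exp(scores[j] - scores[i]))
    return loss


def listwise_softmax_loss(scores: List[float], labels: List[float]) -> float:
    """Cross-entropy between the labels and the softmax over the whole candidate list."""
    max_score = max(scores)  # subtract the max for numerical stability
    log_z = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return -sum(label * (s - log_z) for s, label in zip(scores, labels))
```

Both losses shrink when relevant documents receive higher scores than irrelevant ones; the pairwise form looks at one pair at a time, while the listwise form normalizes over the entire list.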
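Prompting LLMs
Fine-tuning very large LLMs is obviously expensive, so many studies instead use the prompting ability of an LLM to decide whether a query and a doc are similar.
UPR: Improving passage retrieval with zero-shot question generation
- Paper title: UPR:Improving Passage Retrieval with Zero-Shot Question Generation
- Paper address: https://aclanthology.org/2022.emnlp-main.249/
- Meeting: EMNLP 2022
- Motivation: Ranking is in essence a similarity computation between a query and a doc, and this score can also be viewed as the probability of obtaining the doc given the query.
- Method: UPR reverses this process. It prompts the LLM and, for each doc, computes the generation probability of every token in the query conditioned on that doc, and uses this probability as the similarity score between the query and the doc. Put simply, the LLM is prompted to generate a query for each doc, called a pseudo query, and the language model then scores the pseudo query against the original query to measure their "similarity". This similarity is not the familiar vector similarity but the probability that the pseudo query reconstructs the original query, following the paper's scoring formula. Finally, the docs are sorted by this score to obtain the final ranking.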
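The scoring step above can be sketched as the mean log-probability of the query tokens given each doc. This is a minimal sketch under assumptions: `query_token_logprobs` is a hypothetical function that returns one log-probability per query token from a language model prompted with the doc, and averaging over tokens is the normalization assumed here.

```python
from typing import Callable, List


def upr_rerank(
    query: str,
    docs: List[str],
    query_token_logprobs: Callable[[str, str], List[float]],  # hypothetical LM scorer
) -> List[str]:
    """Re-rank docs by how likely the language model is to generate the query from each doc."""

    def score(doc: str) -> float:
        # Assumes a non-empty query, so the list has at least one entry.
        logprobs = query_token_logprobs(query, doc)
        return sum(logprobs) / len(logprobs)  # mean log p(q_t | q_<t, doc)

    return sorted(docs, key=score, reverse=True)
```

No query text is actually sampled in this sketch: the model only scores the tokens of the existing query, which is why the method needs no fine-tuning.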
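RankGPT: Is ChatGPT good at search? Investigating large language models as re-ranking agents
- Paper title: RankGPT:Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agent
- Paper address: https://aclanthology.org/2023.emnlp-main.923/
- Meeting: EMNLP 2023
- motivation:
- Method: RankGPT and LLR both obtain the LLM's ranking in a listwise-style manner. Compared with pointwise ranking, in the listwise setting the LLM can attend to information from more docs and directly output a ranking of document ids, with no scoring model involved. To handle inputs that are too long in the listwise setting, RankGPT uses a sliding window, specifying a window of size k to obtain the final top-k ranking.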
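The sliding-window idea can be sketched as follows: the window moves from the end of the candidate list towards the front, and each window is re-ordered by a listwise ranker so that strong candidates bubble up. This is an illustrative sketch, not the paper's code: `rank_window` is a hypothetical callable standing in for the LLM prompt, and the `window` and `step` defaults are arbitrary.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def sliding_window_rerank(
    docs: List[T],
    rank_window: Callable[[List[T]], List[T]],  # hypothetical listwise LLM ranker
    window: int = 4,
    step: int = 2,
) -> List[T]:
    """Re-rank a long list by ranking overlapping windows from back to front."""
    ranked = list(docs)
    end = len(ranked)
    while True:
        start = max(0, end - window)
        # The ranker only ever sees `window` docs, which bounds the prompt length.
        ranked[start:end] = rank_window(ranked[start:end])
        if start == 0:
            break
        end -= step  # overlap of (window - step) docs carries winners forward
    return ranked
```

A single back-to-front pass only guarantees that the best few documents reach the top of the list, which matches the goal of producing a top-k result rather than a full ordering.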
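LLR: Zero-shot listwise document reranking with a large language model
- Paper title: LLR:Zero-Shot Listwise Document Reranking with a Large Language Model
- Paper address: https://aclanthology.org/2023.emnlp-main.923/
- Meeting: ACL 2023
- motivation:
- method:
PRP: Large language models are effective text rankers with pairwise ranking prompting
- Paper title: PRP:Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting
- Paper address: https://arxiv.org/pdf/2306.17563.pdf
- Meeting:
- motivation:
- Method: The PRP authors argue that, compared with the other two approaches (pointwise and listwise), LLMs are stronger at comparative judgments. The pairwise approach supports both generation-based and scoring-based models, and because it compares two items at a time, many sorting algorithms can be used, such as heap sort, bubble sort, and quicksort, which makes the overall approach quite flexible.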
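Because the pairwise approach reduces ranking to a comparison function, any comparison sort can drive it. The sketch below plugs a pairwise preference into Python's built-in sort. `prefers` is a hypothetical stand-in for the LLM prompt, and asking in both orders to damp position bias is an assumption of this sketch.

```python
from functools import cmp_to_key
from typing import Callable, List


def pairwise_rerank(
    query: str,
    docs: List[str],
    prefers: Callable[[str, str, str], bool],  # hypothetical: is doc `a` better than `b`?
) -> List[str]:
    """Sort docs with a comparison function backed by pairwise preference calls."""

    def compare(a: str, b: str) -> int:
        # Ask in both orders; disagreement between the two answers counts as a tie.
        a_wins = prefers(query, a, b)
        b_wins = prefers(query, b, a)
        if a_wins and not b_wins:
            return -1  # a should come first
        if b_wins and not a_wins:
            return 1
        return 0

    return sorted(docs, key=cmp_to_key(compare))
```

With a comparison sort the number of LLM calls grows as O(n log n) comparisons, each costing two calls in this sketch.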
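Co-Prompt: Discrete prompt optimization via constrained generation for zero-shot re-rankers
- Paper title: Co-Prompt:Discrete Prompt Optimization via Constrained Generation for Zero-shot Re-ranker
- Paper address: https://aclanthology.org/2023.findings-acl.61.pdf
- Meeting: ACL 2023
- motivation:
- Method: Co-Prompt applies conditional soft prompt generation to the pointwise LLM ranking task: a PLM acts as the generator that produces the soft prompt, and an LLM acts as the discriminator that evaluates it, so that the best prompt is generated conditionally. The same method can be applied to other LLM prompting tasks to improve prompt effectiveness.
CoT
- How to improve LLMs: Self-Prompted CoT
- Paper title: Self-prompted Chain-of-Thought on Large Language Models for Open-domain Multi-hop Reasoning
- Paper address: https://arxiv.org/pdf/2310.13552.pdf
- motivation:
- Limitations of open-domain multi-hop reasoning (ODMR): ODMR requires answering multi-hop questions through explicit reasoning steps without relying on any provided context. This is much harder than multi-hop QA with context, because the model cannot rely on retrieved relevant passages;
- Limitations of chain-of-thought (CoT):
- Paper framework: The paper proposes Self-prompted Chain-of-Thought (SP-CoT), an automated framework in which large language models (LLMs) themselves generate high-quality, diverse chains of thought for open-domain multi-hop reasoning (ODMR). The key ideas are:
- An automated pipeline that generates an ODMR dataset with multi-hop questions and reasoning chains
- Adaptive sampling that selects diverse, high-quality CoTs as demonstrations
- Learning self-guided reasoning from the generated CoTs through in-context learning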
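Fine-tuning Data Engineering
EMNLP'23: Data annotation in the era of large models (FreeAL)
- Paper title: FreeAL: Towards Human-Free Active Learning in the Era of Large Language Models
- Paper address: https://arxiv.org/pdf/2311.15614
- Approach:
- Data annotation still matters: fully supervised and weakly supervised small models beat (un-fine-tuned) large models in many scenarios;
- Annotating with an LLM is entirely feasible, and a small model can cooperate by filtering and refining the large model's labels;
- Weakly supervised learning and active learning, in my view, still have value as research areas.
From Quantity to Quality: How to pick data samples with the potential to improve LLM instruction tuning?
- Paper title: From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning
- Paper address: https://arxiv.org/pdf/2308.12032.pdf
- GitHub address: https://github.com/MingLiiii/Cherry_LLM
- Motivation: How to pick data samples with the potential to improve LLM instruction tuning?
- Approach:
- Learning from Brief Experience: select representative training data and train LLaMA on it;
- Evaluating Based on Experience: use the trained model to compute the IFD metric for every sample in the original data;
- Retraining from Self-Guided Experience: run the model in batch to get the IFD score of each sample, then select the samples with higher scores (samples whose prompts are hard), which the paper calls cherry samples, and retrain the model on them.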
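The selection step can be sketched as below, assuming IFD is the ratio between the model's average loss on the answer when conditioned on the instruction and its average loss on the answer alone. The per-token log-probability lists are assumed to come from the trained model; how they are obtained is outside this sketch, and the function names are illustrative.

```python
from typing import List, Tuple


def mean_nll(token_logprobs: List[float]) -> float:
    """Average negative log-likelihood over the answer tokens."""
    return -sum(token_logprobs) / len(token_logprobs)


def ifd_score(answer_given_instruction: List[float], answer_alone: List[float]) -> float:
    """Ratio of conditioned to unconditioned answer loss (assumed IFD definition)."""
    return mean_nll(answer_given_instruction) / mean_nll(answer_alone)


def select_cherry_samples(scored: List[Tuple[str, float]], keep: int) -> List[str]:
    """Return the ids of the `keep` samples with the highest IFD score."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return [sample_id for sample_id, _ in ranked[:keep]]
```

A higher ratio means the instruction helps the model less in producing the answer, which is why high-scoring samples are treated as the hard, informative ones.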
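Active Instruction Tuning: How to better choose a new task to improve model generalization?
- Paper title: Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks
- Paper address: https://arxiv.org/pdf/2311.00288.pdf
- GitHub address:
- Motivation: How to select high-quality data that suits the given LLM; in other words, "high quality" is tightly bound to the model.
- The paper proposes the idea of Prompt Uncertainty: given an original sample pair <prompt, response>, perturb the prompt to obtain prompt_v1, which must still preserve most of the prompt's semantics. Then pass prompt and prompt_v1 to the model separately, obtain the output for the response under each, and compute the difference in likelihood between the two. This value is the Prompt Uncertainty.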
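A minimal sketch of that computation follows, assuming the uncertainty is the average absolute change in the response likelihood across several perturbed prompts. The `likelihood` and `perturb` callables and the number of perturbations `k` are hypothetical placeholders, not the paper's exact procedure.

```python
from typing import Callable


def prompt_uncertainty(
    prompt: str,
    response: str,
    perturb: Callable[[str, int], str],       # hypothetical: returns the i-th perturbed prompt
    likelihood: Callable[[str, str], float],  # hypothetical: p(response | prompt) under the model
    k: int = 4,
) -> float:
    """Average absolute change in response likelihood under k prompt perturbations."""
    base = likelihood(prompt, response)
    diffs = [abs(base - likelihood(perturb(prompt, i), response)) for i in range(k)]
    return sum(diffs) / k
```

A value near zero means the model's behaviour is stable under rewording, while a large value marks a prompt-sensitive sample or task.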
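MoDS: How to automatically select high-quality data?
- Paper title: MoDS: Model-oriented Data Selection for Instruction Tuning
- Paper address: https://arxiv.org/pdf/2311.15653.pdf
- GitHub address: https://github.com/CASIA-LM/MoDS
- Motivation: How to select high-quality data that suits the given LLM; in other words, "high quality" is tightly bound to the model.
- What are the criteria for "high-quality" data?
- Quality: a high-quality prompt with a correspondingly high-quality response teaches the model to follow instructions well;
- Coverage: the diversity of the prompts; the more diverse, the better;
- Necessity: the same prompt differs in importance and necessity across base models. If the base model already produces a good response for a prompt, the model already follows that prompt well and needs no further training on it; otherwise, the prompt is one the model needs.
- How is "high-quality" data selected?
- Quality Evaluation: select high-quality SFT data based on model scoring;
- Diverse Data Selection for Seed Instructions: from this high-quality SFT dataset, further filter out a subset that is diverse enough to represent the whole dataset;
- Augmented Data Selection
Yao Fu: Stop competing on large-model training and compete on data instead!
- Paper title: An Initial Exploration of Theoretical Support for Language Model Data Engineering
- Paper address: https://yaofu.notion.site/An-Initial-Exploration-of-Theoretical-Support-for-Language-Model-Data-Engineering-Part-1-Pretraini-dc480d9bf7ff4659afd8c9fb738086eb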
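Traces of memorisation in large language models for code
- Paper title: Traces of Memorisation in Large Language Models for Code
- Paper address: https://arxiv.org/pdf/2312.11658
- GitHub address:
- Meeting:
- Paper Method: The paper studies how large language models memorise code and compares the memorisation rates of code models and natural-language models. The researchers build a benchmark for natural language and, by identifying vulnerable samples, a benchmark for code. They run both benchmarks against a variety of models and carry out data extraction attacks. They find that large language models for code are also at risk from data extraction attacks. From the training data identified as extractable, they successfully extracted 47% from the CodeGen-Mono-16B code completion model. They also find that models memorise more as the parameter count grows, and that the models' pre-training data is vulnerable to attack as well. Data carriers are memorised at a higher rate than regular code or documentation, and different model architectures memorise different samples. Because data leakage has serious consequences, the paper urges the research community to investigate the phenomenon further with a wider range of models and extraction techniques in order to build safeguards.
Avoiding data contamination in language model evaluation: Dynamic test construction with the latest materials
- Paper title: Avoiding Data Contamination in Language Model Evaluation: Dynamic Test Construction with Latest Materials
- Paper address: https://arxiv.org/pdf/2312.12343
- GitHub address:
- Meeting:
- Paper Method: The paper proposes LatestEval, an evaluation method that uses the most recent texts to create contamination-free reading comprehension evaluations. LatestEval avoids data contamination by using only texts published within a recent time window, which ensures no overlap with the training corpora of pre-trained language models. The paper develops an automated LatestEval pipeline that: 1) collects the latest texts; 2) identifies key information; 3) constructs questions while removing the existing answers from the context, which encourages the model to infer the answer from the remaining context rather than simply copy and paste.
- The experimental results show that, in contrast to earlier benchmarks, language models exhibit almost no memorisation behaviour on LatestEval, which indicates a much lower risk of data contamination and therefore a more reliable evaluation.
GeomVerse: A systematic evaluation of large models for geometric reasoning
- Paper title: GeomVerse: A Systematic Evaluation of Large Models for Geometric Reasoning
- Institution: Google Research, Google DeepMind
- Paper address: https://arxiv.org/pdf/2312.12241
- GitHub address:
- Meeting:
- Paper Method: The paper evaluates the reasoning ability of vision-language models (VLMs) along several dimensions through the lens of geometry problems.
- By building the benchmark at multiple depth levels, the experiments show that these models are less capable in geometry (and, more generally, in other subjects that require similar reasoning) than earlier benchmarks suggested. This is especially evident on higher-depth problems, because solving them requires long reasoning chains rather than additional memorised knowledge. The paper releases the dataset for further research in this area.
Beating full-data fine-tuning with only 1% of the data!
- Paper title: One Shot Learning as Instruction Data Prospector for Large Language Models
- Institution:
- Authors: Li, Yunshui and Hui, Binyuan and Xia, Xiaobo and Yang, Jiaxi and Yang, Min and Zhang, Lei and Si, Shuzheng and Liu, Junhao and Liu, Tongliang and Huang, Fei and others
- Paper address: arxiv.org/pdf/2312.10302.pdf
- Related fields: training data construction
- GitHub address: https://github.com/pldlgb/nuggets
- Meeting:
- Paper Method: The paper proposes a method called "Nuggets" that aims to mine golden data from mountains of instruction fine-tuning data. The method uses the large language model (LLM) itself as a data exploration tool and, through one-shot learning (in other words, in-context learning), picks out beneficial data from a huge instruction dataset. Intuitively, if an instruction helps few-shot learning on a particular task, it is worth training on. If it benefits many tasks, it should be a main focus of the data. In addition, studies have shown that in-context learning implicitly fine-tunes the model through demonstrations, which is equivalent to the language model performing gradient descent behind the scenes as a meta-optimizer. Using performance under in-context learning to predict the effect of instruction fine-tuning is therefore promising.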
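Efficient LLM Inference
Efficient large-model inference with limited memory
- Paper title: LLM in a flash: Efficient Large Language Model Inference with Limited Memory
- Paper address: https://arxiv.org/pdf/2312.11514
- GitHub address:
- Meeting:
- Paper Method: The paper addresses how to efficiently run large language models that exceed the available DRAM capacity. It stores the model parameters in flash memory and brings them into DRAM on demand according to flash memory behaviour. The paper builds an inference cost model aligned with flash memory behaviour, which guides optimization in two key areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. It introduces two main techniques: a windowing strategy that reduces data transfer, and row-column bundling that increases the size of the data chunks read from flash. Together these methods allow running models up to twice the size of the available DRAM, with inference speed increased by 4-5x on CPU and 20-25x on GPU compared with naive loading approaches. The paper's sparsity awareness, context-adaptive loading, and hardware-oriented design pave the way for efficient LLM inference on devices with limited memory.
ComplexityNet: Increasing LLM inference efficiency by learning task complexity
- Paper title: ComplexityNet: Increasing LLM Inference Efficiency by Learning Task Complexity
- Paper address: https://arxiv.org/pdf/2312.11511
- GitHub address:
- Meeting:
- Paper Method: The paper introduces ComplexityNet, a streamlined language model designed to assess task complexity. The model predicts the likelihood of accurate output from various language models of differing capability. The authors first apply it to the Mostly Basic Python Problems (MBPP) dataset, for which they create the first set of labels defining task complexity. ComplexityNet reaches 79% accuracy in determining task complexity, a significant improvement over the 34% accuracy of the original model. In addition, compared with always using the highest-complexity model, ComplexityNet reduces computational resource usage by 90% while sustaining a code generation accuracy of 86.7%. The study shows that fine-tuning a smaller model to classify tasks can give a more balanced trade-off between accuracy and efficiency when using large language models. The findings point to a promising direction for optimizing LLM applications, especially in resource-constrained environments.
Beyond Chinchilla-Optimal: Accounting for inference in language model scaling laws
- Paper title: Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
- Paper address: https://arxiv.org/pdf/2401.00448
- Related fields: model architecture improvement
- GitHub address:
- Meeting:
- Paper Method: The paper modifies the Chinchilla scaling laws to compute the optimal parameter count and pre-training data size for training and deploying a language model of a given quality under a given inference demand. It finds that LLM researchers who expect substantial inference demand (about 1 billion requests) should train models that are smaller and trained for longer than Chinchilla-optimal.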
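The intuition behind "smaller and longer" can be seen from a simple compute budget. The sketch below uses the common approximations of about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per inference token. These constants and the function name are assumptions of this sketch, and it does not model the loss, so it illustrates only the cost side of the trade-off.

```python
def total_flops(n_params: int, train_tokens: int, inference_tokens: int) -> int:
    """Approximate lifetime compute: training (6*N*D) plus inference (2*N*D_inf)."""
    training = 6 * n_params * train_tokens
    inference = 2 * n_params * inference_tokens
    return training + inference


# Illustrative numbers only: the same inference demand served by two model sizes.
demand = 10**12  # total tokens generated over the model's lifetime (assumed)
large = total_flops(n_params=70 * 10**9, train_tokens=14 * 10**11, inference_tokens=demand)
small = total_flops(n_params=35 * 10**9, train_tokens=28 * 10**11, inference_tokens=demand)
print(large > small)
```

Both configurations above spend the same training compute, but the smaller model pays half as much for every inference token, so the gap in total cost widens as inference demand grows.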
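Understanding LLMs: A comprehensive overview from training to inference
- Paper title: Understanding LLMs: A Comprehensive Overview from Training to Inference
- Paper address: https://arxiv.org/pdf/2401.02038
- Related fields: model architecture improvement, pre-training
- Authors: Yiheng Liu, Hao He, Tianle Han
- GitHub address:
- Meeting:
- Paper Method: The paper reviews the evolution of training techniques and inference deployment techniques for large language models (LLMs), and discusses future trends in low-cost training and deployment of LLMs. The training discussion covers data preprocessing, training architecture, pre-training tasks, parallel training, and topics related to model fine-tuning. On the inference side, the paper covers model compression, parallel computation, memory scheduling, and structural optimization. It also explores applications of LLMs and offers insights into their future development.
LLM Evaluation
LLM Pre-training
TeleChat: A collection of large language models with 3 billion, 7 billion, and 12 billion parameters
- Paper title: TeleChat Technical Report
- Institution:
- Authors: Zihan Wang, Xinzhang Liu, Shixuan Liu
- Paper address: arxiv.org/pdf/2401.03804
- Related fields: model architecture improvement, pre-training, instruction fine-tuning, model evaluation
- GitHub address:
- Meeting:
- Paper Method: TeleChat is a collection of large language models with 3 billion, 7 billion, and 12 billion parameters. It includes pre-trained language models and fine-tuned chat models aligned with human preferences. TeleChat is first pre-trained on an extensive corpus of diverse English and Chinese text containing trillions of tokens. The models are then fine-tuned to align with human preferences, following the detailed methodology described in the paper. The paper evaluates TeleChat on a variety of tasks, including language understanding, mathematics, reasoning, code generation, and knowledge-based question answering.
- Experimental results: TeleChat achieves performance comparable to other open-source models of similar size across a wide range of public benchmarks. To support future research and applications that use LLMs, the paper releases the fine-tuned model checkpoints of the TeleChat 7B and 12B variants to the public, along with code and part of the pre-training data.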
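Large language models aren't all that you need
- Paper title: Large Language Models aren't all that you need
- Institution: Indian Institute of Technology
- Authors: Kiran Voderhobli Holla, Chaithanya Kumar, Aryan Singh
- Paper address: arxiv.org/pdf/2401.00698
- Related fields: model architecture improvement, pre-training
- GitHub address:
- Meeting:
- Paper Method: The paper explores architectures and systems for SemEval 2023 Task 2: Multilingual Complex Named Entity Recognition. The authors evaluate and compare two approaches: a traditional CRF model and a large language model (LLM) fine-tuned with a customized head. The paper explores several novel ideas: 1) decaying auxiliary loss (with residual): training the model on the auxiliary task of coarse-grained named entity recognition and including that task as part of the loss function; 2) triplet token blending: exploring ways of blending the embeddings of neighbouring tokens in the final named entity recognition layer; 3) task-optimal heads: exploring a variety of custom heads and learning rates for the final layer of the LLM. The authors also tried multiple LLMs, including GPT-3, and experimented with various dropout and hyperparameter settings on the final model, obtaining micro and macro F1 scores of 0.67/0.61 on the test data. The results show that, although pre-trained LLMs bring a large performance gain over traditional models, improving the macro-F1 score further with the additional feature, loss, and model engineering techniques above is feasible.
TinyLlama: An open-source small language model
- Paper title: TinyLlama: An Open-Source Small Language Model
- Institution:
- Authors: Peiyuan Zhang, Guangtao Zeng, Tianduo Wang
- Paper address: arxiv.org/pdf/2401.02385
- Related fields: model architecture improvement, pre-training
- GitHub address: github.com/jzhang38/TinyLlama
- Meeting:
- Paper Method: TinyLlama is a compact 1.1B language model pre-trained on around 1 trillion tokens for approximately 3 epochs. Built on the architecture and tokenizer of Llama 2, TinyLlama uses various advances contributed by the open-source community (for example FlashAttention) to achieve better computational efficiency. Despite its relatively small size, TinyLlama performs remarkably well on a range of downstream tasks and clearly outperforms existing open-source language models of similar size. The model checkpoints and code are publicly available on GitHub at https://github.com/jzhang38/TinyLlama.
LLM Augmented LLMs: Expanding capabilities through composition
- Paper title: LLM Augmented LLMs: Expanding Capabilities through Composition
- Institution: Google Research, Google DeepMind
- Authors: Rachit Bansal, Bidisha Samanta, Siddharth Dalmia
- Paper address: arxiv.org/pdf/2401.02412
- Related fields: model architecture improvement, pre-training
- GitHub address:
- Meeting:
- Paper Method: The paper studies how to augment the capabilities of a large language model through composition. By introducing cross-attention, it composes an existing model with task-specific models to enable new capabilities. The proposed CALM method applies across several domains and settings. By composing PaLM2-S with a smaller model trained on low-resource languages, it achieves significant improvements on tasks such as translation and arithmetic reasoning.
LLaMA Pro: Progressive LLaMA with block expansion
- Paper title: LLaMA Pro: Progressive LLaMA with Block Expansion
- Institution: The University of Hong Kong, Shanghai Jiao Tong University, Tencent PCG Lab
- Authors: Chengyue Wu, Yukang Gan, Yixiao Ge
- Paper address: arxiv.org/pdf/2401.02415
- Related fields: model architecture improvement, pre-training
- GitHub address:
- Meeting:
- Paper Method: The paper introduces a new post-pretraining method that expands the Transformer blocks and tunes only the expanded blocks on a new corpus, which effectively improves the model's knowledge while avoiding catastrophic forgetting. The researchers experiment on code and math corpora and obtain LLaMA Pro-8.3B, a model initialized from LLaMA2-7B that performs well on general tasks, programming, and mathematics. LLaMA Pro and its instruction-following counterpart (LLaMA Pro-Instruct) achieve advanced performance across benchmarks, demonstrating their advantages within the LLaMA family and their reasoning ability across diverse tasks. The study offers valuable insights into integrating natural and programming languages and lays a solid foundation for developing advanced language models that operate effectively in different environments.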
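Generalizable vision-language pre-training for annotation-free pathology localization
- Paper title: Generalizable vision-language pre-training for annotation-free pathology localization
- Institution: The University of Hong Kong, Peng Cheng Laboratory, University of Chinese Academy of Sciences
- Authors: Hao Yang, Hong-Yu Zhou, Cheng Li
- Paper address: arxiv.org/pdf/2401.02044
- Related fields: pre-training
- GitHub address:
- Meeting:
- Paper Method: The paper introduces a generalizable vision-language pre-training model for annotation-free pathology localization. Its core strength is multi-level semantic structure-based contrastive learning that does not depend on image annotations and comprehensively aligns multi-granularity medical concepts from medical reports with rich image features, so that it adapts to the diverse expressions of both observed and newly emerging unseen pathologies. Experiments validate the model's generalization on 4 distinct external datasets, where it outperforms 6 state-of-the-art methods in localizing 5 different pathologies and even exceeds the human benchmark, indicating that it is suitable for complex clinical environments.
ChartAssistant: A universal chart multimodal language model via chart-to-table pre-training and multitask instruction tuning
- Paper title: ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning
- Institution: The University of Hong Kong, Nanjing University, Shanghai Jiao Tong University
- Authors: Fanqing Meng, Wenqi Shao, Quanfeng Lu
- Paper address: https://arxiv.org/pdf/2401.02384
- Related fields: pre-training, instruction fine-tuning
- GitHub address: https://github.com/OpenGVLab/ChartAst
- Meeting:
- Paper Method: The paper proposes ChartAssistant, a chart-based vision-language model that aims at universal chart comprehension and reasoning. Through pre-training on chart-to-table parsing followed by multitask instruction-following fine-tuning, ChartAssistant addresses the difficulties that general-purpose multimodal models face with generalization and task-specific fine-tuning. Experimental results show that ChartAssistant achieves significant gains over the state-of-the-art UniChart method on a variety of chart tasks and outperforms OpenAI's GPT-4V(ision) on real-world chart data. The paper mainly presents the design and training method of ChartAssistant and demonstrates its performance advantages on chart tasks.
DIALIGHT: Lightweight development and evaluation of task-oriented dialogue systems with large models
- Paper title: DIALIGHT: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models
- Institution: University of Cambridge
- Authors: Fanqing Meng, Wenqi Shao, Quanfeng Lu
- Paper address: https://arxiv.org/pdf/2401.02208
- Related fields: model architecture improvement, pre-training
- GitHub address: https://github.com/OpenGVLab/ChartAst
- Meeting:
- Paper Method: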
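Robotics
- Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation
- Paper title: Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
- Institution: Stanford University
- Authors: Zipeng Fu, Tony Z. Zhao, Chelsea Finn
- Paper address: https://arxiv.org/pdf/2401.02117
- Related fields: model architecture improvement, pre-training
- GitHub address:
- Meeting:
- Paper Method: The paper presents a system for learning mobile manipulation tasks that require bimanual coordination and whole-body control. Data is collected with the Mobile ALOHA system, and supervised behaviour cloning is co-trained with the existing static ALOHA datasets, which improves performance on mobile manipulation tasks and lets Mobile ALOHA complete complex mobile manipulation tasks autonomously. By extending the ALOHA system with a mobile base and a whole-body teleoperation interface, Mobile ALOHA provides a low-cost whole-body teleoperation system. The paper addresses the limitation of traditional robot learning, which focuses on table-top manipulation, and gives the robot the mobility and dexterity needed for a broader range of practical tasks.
Reinforcement Learning
Digital Humans
- From audio to photoreal embodiment: Synthesizing humans in conversations
- Paper title: From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
- Institution:
- Authors:
- Paper address: https://arxiv.org/pdf/2401.01885
- Related fields:
- GitHub address:
- Meeting:
- Paper Method: The paper proposes a framework for generating full-body photorealistic avatars that gesture according to the conversational dynamics of a two-person interaction. Given speech audio as input, it outputs multiple possible gestural motions for an individual, covering the face, body, and hands. The method combines the sample diversity of vector quantization with the high-frequency detail obtained through diffusion to generate more dynamic and expressive motion. The generated motion is visualized with highly photorealistic avatars that can express important nuances in gestures (for example sneers and smirks). To advance this line of research, the paper introduces a first-of-its-kind multi-view conversational dataset that allows photorealistic reconstruction. Experiments show that the model generates appropriate and diverse gestures and outperforms both diffusion-only and vector-quantization-only methods. In addition, the perceptual evaluation highlights the importance of photorealism (compared with meshes) for accurately assessing subtle motion details in conversational gestures. Code and the dataset are available online.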
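Long LLM
MoE
- Mixtral 8x7B: A sparse mixture-of-experts language model
- Title: Mixtral of Experts
- Related fields: model architecture improvement, instruction fine-tuning, Transformers
- Institution:
- Authors: Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux
- Publication date: 2023.09.23
- Paper address: arxiv.org/pdf/2401.04088
- GitHub address:
- Meeting:
- Paper Method: The paper introduces Mixtral 8x7B, a sparse mixture-of-experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, except that each layer is composed of 8 feed-forward blocks (the experts). For every token, at each layer, a router network selects two experts to process the current state and combines their outputs. Although each token only sees two experts, the selected experts can differ at each time step. As a result, each token has access to 47 billion parameters but uses only 13 billion active parameters during inference. Mixtral is trained with a context size of 32k tokens and outperforms or matches Llama 2 70B and GPT-3.5 on all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. The paper also provides a fine-tuned model, Mixtral 8x7B - Instruct, which surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat on human evaluation benchmarks. Both the base model and the instruct model are released under the Apache 2.0 license.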
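The routing step described above, where each token picks two experts and mixes their outputs, can be sketched in plain Python. This is an illustrative toy, not Mixtral's implementation: the experts are arbitrary callables, vectors are Python lists, and taking a softmax over only the two selected gate logits is an assumption of this sketch.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def top2_moe_layer(
    x: Vector,
    gate_logits: Sequence[float],            # one router logit per expert for this token
    experts: Sequence[Callable[[Vector], Vector]],
) -> Vector:
    """Route one token to its two highest-scoring experts and mix their outputs."""
    # Pick the indices of the two largest router logits.
    top2 = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:2]
    # Softmax over the selected logits only, so the two weights sum to 1.
    max_logit = max(gate_logits[i] for i in top2)
    exps = [math.exp(gate_logits[i] - max_logit) for i in top2]
    weights = [e / sum(exps) for e in exps]
    # Only the two selected experts are evaluated; the rest cost nothing.
    out = [0.0] * len(x)
    for weight, index in zip(weights, top2):
        expert_out = experts[index](x)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out
```

Because only two of the eight experts run per token, the compute per token tracks the active parameters rather than the total parameter count.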
mini LLMs
References
- A roundup of multimodal large models for the document domain: https://zhuanlan.zhihu.com/p/673470907