qa_match is a deep-learning-based question-answer matching tool that supports Q&A over both one-layer and two-layer structured knowledge bases. For a one-layer knowledge base, qa_match answers questions with an intent matching model; for a two-layer knowledge base, it fuses the results of a domain classification model and an intent matching model. qa_match also supports unsupervised pre-training: a lightweight pre-trained language model (SPTM, Simple Pre-trained Model) can improve the effectiveness of downstream tasks such as knowledge base Q&A.
In real scenarios, a knowledge base is generally built through manual summarization, annotation, machine mining, and so on. It contains a large number of standard questions, and each standard question has a standard answer and several extended questions. For a one-layer knowledge base, which contains only standard questions and their extended questions, we call each standard question an intent. For a two-layer knowledge base, each standard question and its extended questions also belong to a category, which we call a domain; one domain contains multiple intents.
The knowledge base structures supported by qa_match are as follows:

For an input question, qa_match can give one of three answer types based on the knowledge base: a unique answer, a list answer, or a rejection answer.
The usage of qa_match differs between the two knowledge base structures, as explained below:

For two-layer knowledge base Q&A, qa_match first runs domain classification and intent matching on the user's question, then fuses the two results to determine the user's true intent and respond accordingly (a unique answer, a list answer, or a rejection answer). For example, given the two-layer knowledge base in the structure diagram above, there are two domains, "information" and "account": the "information" domain contains the intents "how to publish information" and "how to delete information", and the "account" domain contains the intent "how to cancel the account". When the user enters the question "How do I publish a post?", qa_match performs the following Q&A logic:
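As a hedged illustration of this fusion, the sketch below shows one simplified decision flow in Python. The function, the threshold names, and the exact case analysis are assumptions made for illustration; qa_match's actual strategy distinguishes more cases than shown here.

```python
def two_layer_answer(domain_probs, intent_scores, intent_to_domain,
                     domain_threshold, x1, x2):
    """Simplified two-layer fusion sketch (illustrative only): combine a
    domain classifier's probabilities with intent matching scores."""
    best_intent = max(intent_scores, key=intent_scores.get)
    best_score = intent_scores[best_intent]
    # Reject if the domain classifier disagrees with the matched intent's domain.
    if domain_probs.get(intent_to_domain[best_intent], 0.0) < domain_threshold:
        return ("rejection", None)
    if best_score >= x1:                       # confident: unique answer
        return ("unique_answer", best_intent)
    if best_score >= x2:                       # ambiguous: list answer
        candidates = [i for i, s in intent_scores.items() if s >= x2]
        return ("list_answer", candidates)
    return ("rejection", None)                 # too low: rejection answer
```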
In real scenarios we also encounter one-layer knowledge base Q&A problems. Both the DSSM intent matching model and the SPTM lightweight pre-trained language model can solve this kind of problem. A comparison of the two:
| Model | Usage | Advantages | Disadvantages |
|---|---|---|---|
| DSSM intent matching model | Match directly with the DSSM matching model | ① Easy to use, small model footprint ② Fast training/prediction | Cannot exploit contextual information |
| SPTM lightweight pre-trained language model | Pre-train an LSTM/Transformer language model, then fine-tune an LSTM/Transformer matching model | ① Makes full use of unsupervised pre-training data to improve effectiveness ② The language model can be reused for other downstream tasks | ① Pre-training requires a large amount of unlabeled data ② More complicated to operate (two steps are needed to obtain the matching model) |

For one-layer knowledge base Q&A, you only need the DSSM intent matching model to score the input question, then compare the highest intent-matching score with the thresholds x1 and x2 in the figure above to determine the answer type (unique answer, list answer, or rejection).
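The threshold comparison can be pictured as the following minimal sketch, assuming x1 > x2 as in the figure; the function itself is illustrative, not the repository's code:

```python
def answer_type(match_scores, x1, x2):
    """Map the best DSSM matching score to an answer type (assumes x1 > x2)."""
    best = max(match_scores)
    if best >= x1:
        return "unique_answer"  # one intent is a confident match
    if best >= x2:
        return "list_answer"    # several plausible intents: return a list
    return "rejection"          # nothing close enough: reject the question
```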
Considering that real applications often have a large amount of unlabeled data, when labeled knowledge base data is limited an unsupervised pre-trained language model can be used to improve the matching model. Following the BERT pre-training process, we developed the SPTM model in May 2019. Compared with BERT, it makes three main changes: first, it removes NSP (Next Sentence Prediction), whose effect is insignificant; second, to improve online inference performance, it replaces the Transformer with an LSTM; third, to reduce the parameter count while preserving model quality, it also provides a Transformer variant whose parameters are shared across blocks. The model principle is as follows:
For pre-training, the training data is generated from unlabeled single sentences, and samples are constructed the same way as in BERT: each sentence is one sample, and 15% of its words participate in the prediction; of those, 80% are masked, 10% are replaced with a random word from the dictionary, and 10% are left unchanged.
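A minimal sketch of this sample construction is shown below. The mask symbol and function name are assumptions for illustration; the repository's actual data pipeline may differ.

```python
import random

MASK_TOKEN = "<MASK>"  # assumed mask symbol; the repository may use another

def build_mlm_sample(tokens, vocab, mask_rate=0.15):
    """Build one masked-LM sample from an unlabeled sentence, following the
    BERT-style 15% / 80-10-10 strategy described above."""
    inputs, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:   # 15% of words participate
            targets.append((i, tok))      # remember position and gold word
            r = random.random()
            if r < 0.8:                   # 80%: replace with the mask token
                inputs[i] = MASK_TOKEN
            elif r < 0.9:                 # 10%: replace with a random word
                inputs[i] = random.choice(vocab)
            # remaining 10%: keep the original word unchanged
    return inputs, targets
```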
The model structure of the pre-training stage is shown in the figure below:

To improve the model's expressive power and retain more shallow-layer information, a residual Bi-LSTM network (Residual LSTM) is used as the model body. The network sums the input of each Bi-LSTM layer with that layer's output and layer-normalizes the result, which becomes the input of the next layer. The output of the last Bi-LSTM layer is then fed into a fully connected layer; the sum of the two is normalized, and the result is the output of the whole network.
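A minimal TensorFlow 1.x sketch of this wiring follows; the function, scope names, and shapes are assumptions, not the repository's actual code. The input's last dimension is assumed to equal 2 * hidden_size so the residual sums are shape-compatible.

```python
import tensorflow as tf

def residual_bi_lstm(inputs, num_layers, hidden_size):
    """Residual Bi-LSTM body: each layer's input and output are summed and
    layer-normalized; the last layer feeds a dense layer whose output is
    again summed with its input and normalized.

    `inputs`: [batch, time, 2 * hidden_size] float tensor.
    """
    x = inputs
    for i in range(num_layers):
        with tf.variable_scope("bilstm_%d" % i):
            fw_cell = tf.nn.rnn_cell.LSTMCell(hidden_size)
            bw_cell = tf.nn.rnn_cell.LSTMCell(hidden_size)
            (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
                fw_cell, bw_cell, x, dtype=tf.float32)
            out = tf.concat([out_fw, out_bw], axis=-1)  # [batch, time, 2*hidden]
            # residual connection + layer normalization between layers
            x = tf.contrib.layers.layer_norm(x + out)
    # final dense layer, with its own residual sum and normalization
    dense_out = tf.layers.dense(x, 2 * hidden_size)
    return tf.contrib.layers.layer_norm(x + dense_out)
```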
An example of pre-training time costs is shown in the table below:
| Model structure | LSTM | Transformer (shared parameters) | Transformer (shared parameters) |
|---|---|---|---|
| Pre-training dataset size | 10 million | 10 million | 10 million |
| Pre-training resources | 10 × Nvidia K40 / 12 GB memory | 10 × Nvidia K40 / 12 GB memory | 10 × Nvidia K40 / 12 GB memory |
| Pre-training parameters | steps = 100,000 / batch size = 128 | steps = 100,000 / batch size = 128 / 1 layer / 12 heads | steps = 100,000 / batch size = 128 / 12 layers / 12 heads |
| Pre-training time | 8.9 hours | 13.5 hours | 32.9 hours |
| Pre-trained model size | 81 MB | 80.6 MB | 121 MB |

After introducing SPTM, for one-layer knowledge base Q&A the input question is first scored with the intent matching model fine-tuned from the language model, and the answer type (unique answer, list answer, or rejection) is then determined with the same strategy as the DSSM intent matching model.
The required data files (under the data_demo folder) have the following format. To avoid leaking data, we encoded the original text of the standard and extended questions; in real application scenarios, just prepare data in the same format.
Fields are separated by `\t`, and the tokens of an encoded question are separated by spaces. Note that in this project's data examples the original text is encoded, with each word replaced by a number; for example, `205 19 90 417 41 44` corresponds to the text 如何删除信息 ("how to delete information"). This encoding step is not needed in actual use. If the knowledge base has a one-layer structure, all category ids in the std_data file must be set to `__label__0`.
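As a purely illustrative example (the exact column order is an assumption; consult the files under data_demo for the authoritative layout), a std_data line for a one-layer knowledge base might look like the following, with tab-separated fields:

```
__label__0	0	205 19 90 417 41 44
```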
The semi-automatic knowledge base mining process is a set of semi-automatic mining solutions built on top of the qa_match automatic Q&A process (see the automatic Q&A based on a one-layer knowledge base structure above). It helps grow the knowledge base and improve its quality: on the one hand it improves online matching ability, and on the other it improves the quality of offline model training data and thus model performance. It can be used in two scenarios: cold-start mining and iterative mining after the model goes online. For details, see the knowledge base mining instructions.
See the operation instructions for details.
Note: `batch_size` must be >= `negitive_size`, otherwise the model cannot be trained effectively. Environment: TensorFlow version > r1.8 and < r2.0, Python 3.
v1.0: https://github.com/wuba/qa_match/tree/v1.0
v1.1: https://github.com/wuba/qa_match/tree/v1.1
v1.2: https://github.com/wuba/qa_match/tree/v1.2
v1.3: https://github.com/wuba/qa_match/tree/v1.3
In the future we will continue to optimize and expand qa_match's capabilities; our open-source plan is as follows:
We sincerely hope developers will offer valuable opinions and suggestions. You can send feedback and questions to us in any of the following ways: