
HanLP is a multilingual natural language processing toolkit for production environments, built on dual PyTorch and TensorFlow 2.x engines, with the goal of making state-of-the-art NLP technology widely accessible. HanLP offers complete functionality, high accuracy, efficient performance, up-to-date corpora, a clean architecture, and easy customization.
Backed by the world's largest multilingual corpus, HanLP 2.1 supports 10 joint tasks and many single tasks across 130 languages, including Simplified Chinese, Traditional Chinese, English, Japanese, Russian, French, and German. HanLP pre-trains dozens of models on more than a dozen tasks and continuously iterates on both corpora and models:
| Function | RESTful | Multi-task | Single task | Model | Annotation standards |
|---|---|---|---|---|---|
| Tokenization | Tutorial | Tutorial | Tutorial | tok | Coarse, fine |
| Part-of-speech tagging | Tutorial | Tutorial | Tutorial | pos | CTB, PKU, 863 |
| Named entity recognition | Tutorial | Tutorial | Tutorial | ner | PKU, MSRA, OntoNotes |
| Dependency parsing | Tutorial | Tutorial | Tutorial | dep | SD, UD, PMT |
| Constituency parsing | Tutorial | Tutorial | Tutorial | con | Chinese Tree Bank |
| Semantic dependency parsing | Tutorial | Tutorial | Tutorial | sdp | CSDP |
| Semantic role labeling | Tutorial | Tutorial | Tutorial | srl | Chinese Proposition Bank |
| Abstract Meaning Representation | Tutorial | None yet | Tutorial | amr | CAMR |
| Coreference resolution | Tutorial | None yet | None yet | None yet | OntoNotes |
| Semantic textual similarity | Tutorial | None yet | Tutorial | sts | None yet |
| Text style transfer | Tutorial | None yet | None yet | None yet | None yet |
| Keyphrase extraction | Tutorial | None yet | None yet | None yet | None yet |
| Extractive summarization | Tutorial | None yet | None yet | None yet | None yet |
| Abstractive summarization | Tutorial | None yet | None yet | None yet | None yet |
| Grammatical error correction | Tutorial | None yet | None yet | None yet | None yet |
| Text classification | Tutorial | None yet | None yet | None yet | None yet |
| Sentiment analysis | Tutorial | None yet | None yet | None yet | [-1,+1] |
| Language identification | Tutorial | None yet | Tutorial | None yet | ISO 639-1 codes |
To fit different needs, HanLP provides two APIs, RESTful and native, targeting lightweight and massive-scale scenarios respectively. Regardless of API or programming language, HanLP interfaces keep consistent semantics, and the code remains open source. If you have used HanLP in your research, please cite our EMNLP paper.

The RESTful API weighs only a few KB, suiting agile development, mobile apps, and similar scenarios. It is simple to use, installs in seconds, and needs no GPU. It offers larger corpora, bigger models, and higher accuracy, and is highly recommended. Since server GPU capacity is limited and the anonymous quota is small, applying for a free public API key (auth) is recommended.
Install via pip:

```
pip install hanlp_restful
```

Create a client and fill in the server address and API key:

```python
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')  # Leave auth as None for anonymous access; language: zh for Chinese, mul for multilingual
```

For Go, install with `go get -u github.com/hankcs/gohanlp@main`, then create a client and fill in the server address and API key:

```go
HanLP := hanlp.HanLPClient(hanlp.WithAuth(""), hanlp.WithLanguage("zh")) // Empty auth for anonymous access; zh for Chinese, mul for multilingual
```

For Java, add the dependency to pom.xml:
```xml
<dependency>
  <groupId>com.hankcs.hanlp.restful</groupId>
  <artifactId>hanlp-restful</artifactId>
  <version>0.0.12</version>
</dependency>
```

Create a client and fill in the server address and API key:

```java
HanLPClient HanLP = new HanLPClient("https://www.hanlp.com/api", null, "zh"); // Pass null auth for anonymous access; zh for Chinese, mul for multilingual
```

Whatever the development language, call the parse interface and pass in a document to obtain HanLP's analysis:
```python
HanLP.parse('2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。阿婆主来到北京立方庭参观自然语义科技公司。')
```

For more functions, please refer to the documentation and test cases.
The native API relies on deep learning frameworks such as PyTorch and TensorFlow, and suits professional NLP engineers, researchers, and local massive-data scenarios. It requires Python 3.6 to 3.10; Windows is supported, but *nix is recommended. It can run on CPU, though GPU/TPU is recommended. Install the PyTorch version:

```
pip install hanlp
```

HanLP releases two kinds of models: multi-task and single-task. Multi-task models are fast and memory-efficient; single-task models are more accurate and flexible.
HanLP's workflow is to load a model and then call it like a function, as with the following joint multi-task model:
```python
import hanlp
HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH)  # Trained on the world's largest Chinese corpus
HanLP(['2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。', '阿婆主来到北京立方庭参观自然语义科技公司。'])
```

The input unit of the native API is the sentence, so documents must first be split, either with a multilingual sentence-splitting model or with the rule-based sentence-splitting function. The RESTful and native APIs share identical semantic designs, so users can switch between them seamlessly. The simple interface also supports flexible parameters. Common techniques include:
Task scheduling: the fewer tasks, the faster the inference; see the tutorial for details. In memory-constrained scenarios, users can also delete unneeded tasks to slim down the model. According to our latest research, the advantages of multi-task learning lie in speed and GPU memory, but its accuracy often trails single-task models. HanLP therefore pre-trains many single-task models and designs an elegant pipeline mode to assemble them.
```python
import hanlp
HanLP = hanlp.pipeline() \
    .append(hanlp.utils.rules.split_sentence, output_key='sentences') \
    .append(hanlp.load('FINE_ELECTRA_SMALL_ZH'), output_key='tok') \
    .append(hanlp.load('CTB9_POS_ELECTRA_SMALL'), output_key='pos') \
    .append(hanlp.load('MSRA_NER_ELECTRA_SMALL_ZH'), output_key='ner', input_key='tok') \
    .append(hanlp.load('CTB9_DEP_ELECTRA_SMALL', conll=0), output_key='dep', input_key='tok') \
    .append(hanlp.load('CTB9_CON_ELECTRA_SMALL'), output_key='con', input_key='tok')
HanLP('2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。阿婆主来到北京立方庭参观自然语义科技公司。')
```

For more models and usage, please refer to the demo and documentation.
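The pipeline pattern itself is simple: each stage reads its input from an `input_key` (or the previous stage's output) on a shared document dict and writes its result under its `output_key`. The following framework-free sketch illustrates the idea under stated assumptions: the `Pipeline` class and the regex-based `split_sentence` below are illustrative stand-ins, not HanLP's implementations.

```python
import re

def split_sentence(text):
    # Naive stand-in for a rule-based splitter: break after sentence-final punctuation.
    parts = re.split(r'(?<=[。!?!?])', text)
    return [p for p in parts if p]

class Pipeline:
    def __init__(self):
        self.stages = []

    def append(self, func, output_key, input_key=None):
        self.stages.append((func, output_key, input_key))
        return self  # return self to allow method chaining, as in hanlp.pipeline()

    def __call__(self, text):
        doc = {'text': text}
        prev_key = 'text'
        for func, output_key, input_key in self.stages:
            # Each stage reads from its input_key, defaulting to the previous output.
            doc[output_key] = func(doc[input_key or prev_key])
            prev_key = output_key
        return doc

# Assemble a toy pipeline: sentence splitting, then a fake per-sentence "tokenizer".
pipe = Pipeline() \
    .append(split_sentence, output_key='sentences') \
    .append(lambda sents: [list(s) for s in sents], output_key='tok')
doc = pipe('HanLP是面向生产环境的NLP工具包。它支持多语种。')
```

The chaining works because `append` returns `self`; the same design lets HanLP pipelines be built fluently and then called like a function.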
No matter the API, development language, or natural language, HanLP's output is unified into a json-compatible Document, which is a subclass of dict:
```json
{
  "tok/fine": [
    ["2021年", "HanLPv2.1", "为", "生产", "环境", "带来", "次", "世代", "最", "先进", "的", "多", "语种", "NLP", "技术", "。"],
    ["阿婆主", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司", "。"]
  ],
  "tok/coarse": [
    ["2021年", "HanLPv2.1", "为", "生产", "环境", "带来", "次世代", "最", "先进", "的", "多语种", "NLP", "技术", "。"],
    ["阿婆主", "来到", "北京立方庭", "参观", "自然语义科技公司", "。"]
  ],
  "pos/ctb": [
    ["NT", "NR", "P", "NN", "NN", "VV", "JJ", "NN", "AD", "JJ", "DEG", "CD", "NN", "NR", "NN", "PU"],
    ["NN", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN", "PU"]
  ],
  "pos/pku": [
    ["t", "nx", "p", "vn", "n", "v", "b", "n", "d", "a", "u", "a", "n", "nx", "n", "w"],
    ["n", "v", "ns", "ns", "v", "n", "n", "n", "n", "w"]
  ],
  "pos/863": [
    ["nt", "w", "p", "v", "n", "v", "a", "nt", "d", "a", "u", "a", "n", "ws", "n", "w"],
    ["n", "v", "ns", "n", "v", "n", "n", "n", "n", "w"]
  ],
  "ner/pku": [
    [],
    [["北京立方庭", "ns", 2, 4], ["自然语义科技公司", "nt", 5, 9]]
  ],
  "ner/msra": [
    [["2021年", "DATE", 0, 1], ["HanLPv2.1", "ORGANIZATION", 1, 2]],
    [["北京", "LOCATION", 2, 3], ["立方庭", "LOCATION", 3, 4], ["自然语义科技公司", "ORGANIZATION", 5, 9]]
  ],
  "ner/ontonotes": [
    [["2021年", "DATE", 0, 1], ["HanLPv2.1", "ORG", 1, 2]],
    [["北京立方庭", "FAC", 2, 4], ["自然语义科技公司", "ORG", 5, 9]]
  ],
  "srl": [
    [[["2021年", "ARGM-TMP", 0, 1], ["HanLPv2.1", "ARG0", 1, 2], ["为生产环境", "ARG2", 2, 5], ["带来", "PRED", 5, 6], ["次世代最先进的多语种NLP技术", "ARG1", 6, 15]], [["最", "ARGM-ADV", 8, 9], ["先进", "PRED", 9, 10], ["技术", "ARG0", 14, 15]]],
    [[["阿婆主", "ARG0", 0, 1], ["来到", "PRED", 1, 2], ["北京立方庭", "ARG1", 2, 4]], [["阿婆主", "ARG0", 0, 1], ["参观", "PRED", 4, 5], ["自然语义科技公司", "ARG1", 5, 9]]]
  ],
  "dep": [
    [[6, "tmod"], [6, "nsubj"], [6, "prep"], [5, "nn"], [3, "pobj"], [0, "root"], [8, "amod"], [15, "nn"], [10, "advmod"], [15, "rcmod"], [10, "assm"], [13, "nummod"], [15, "nn"], [15, "nn"], [6, "dobj"], [6, "punct"]],
    [[2, "nsubj"], [0, "root"], [4, "nn"], [2, "dobj"], [2, "conj"], [9, "nn"], [9, "nn"], [9, "nn"], [5, "dobj"], [2, "punct"]]
  ],
  "sdp": [
    [[[6, "Time"]], [[6, "Exp"]], [[5, "mPrep"]], [[5, "Desc"]], [[6, "Datv"]], [[13, "dDesc"]], [[0, "Root"], [8, "Desc"], [13, "Desc"]], [[15, "Time"]], [[10, "mDegr"]], [[15, "Desc"]], [[10, "mAux"]], [[8, "Quan"], [13, "Quan"]], [[15, "Desc"]], [[15, "Nmod"]], [[6, "Pat"]], [[6, "mPunc"]]],
    [[[2, "Agt"], [5, "Agt"]], [[0, "Root"]], [[4, "Loc"]], [[2, "Lfin"]], [[2, "ePurp"]], [[8, "Nmod"]], [[9, "Nmod"]], [[9, "Nmod"]], [[5, "Datv"]], [[5, "mPunc"]]]
  ],
  "con": [
    ["TOP", [["IP", [["NP", [["NT", ["2021年"]]]], ["NP", [["NR", ["HanLPv2.1"]]]], ["VP", [["PP", [["P", ["为"]], ["NP", [["NN", ["生产"]], ["NN", ["环境"]]]]]], ["VP", [["VV", ["带来"]], ["NP", [["ADJP", [["NP", [["ADJP", [["JJ", ["次"]]]], ["NP", [["NN", ["世代"]]]]]], ["ADVP", [["AD", ["最"]]]], ["VP", [["JJ", ["先进"]]]]]], ["DEG", ["的"]], ["NP", [["QP", [["CD", ["多"]]]], ["NP", [["NN", ["语种"]]]]]], ["NP", [["NR", ["NLP"]], ["NN", ["技术"]]]]]]]]]], ["PU", ["。"]]]]]],
    ["TOP", [["IP", [["NP", [["NN", ["阿婆主"]]]], ["VP", [["VP", [["VV", ["来到"]], ["NP", [["NR", ["北京"]], ["NR", ["立方庭"]]]]]], ["VP", [["VV", ["参观"]], ["NP", [["NN", ["自然"]], ["NN", ["语义"]], ["NN", ["科技"]], ["NN", ["公司"]]]]]]]], ["PU", ["。"]]]]]]
  ]
}
```

In particular, the Python RESTful and native APIs support visualization based on monospace fonts, rendering linguistic structures directly in the console:
```python
HanLP(['2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。', '阿婆主来到北京立方庭参观自然语义科技公司。']).pretty_print()
```
Dep Tree Token Relati PoS Tok NER Type Tok SRL PA1 Tok SRL PA2 Tok PoS 3 4 5 6 7 8 9
──────────── ───────── ────── ─── ───────── ──────────────── ───────── ──────────── ───────── ──────────── ───────── ─────────────────────────────────────────────────────────
┌─────────► 2021年 tmod NT 2021年 ───► DATE 2021年 ───► ARGM - TMP 2021年 2021年 NT ───────────────────────────────────────────► NP ───┐
│┌────────► HanLPv2 . 1 nsubj NR HanLPv2 . 1 ───► ORGANIZATION HanLPv2 .1 ───► ARG0 HanLPv2 .1 HanLPv2 .1 NR ───────────────────────────────────────────► NP ────┤
││┌─►┌───── 为 prep P 为 为 ◄─┐ 为 为 P ───────────┐ │
│││ │ ┌─► 生产 nn NN 生产 生产 ├► ARG2 生产 生产 NN ──┐ ├────────────────────────► PP ───┐ │
│││ └─►└── 环境 pobj NN 环境 环境 ◄─┘ 环境 环境 NN ──┴► NP ───┘ │ │
┌┼┴┴──────── 带来 root VV 带来 带来 ╟──► PRED 带来 带来 VV ──────────────────────────────────┐ │ │
││ ┌─► 次 amod JJ 次 次 ◄─┐ 次 次 JJ ───► ADJP ──┐ │ ├► VP ────┤
││ ┌───►└── 世代 nn NN 世代 世代 │ 世代 世代 NN ───► NP ───┴► NP ───┐ │ │ │
││ │ ┌─► 最 advmod AD 最 最 │ 最 ───► ARGM - ADV 最 AD ───────────► ADVP ──┼► ADJP ──┐ ├► VP ───┘ ├► IP
││ │┌──►├── 先进 rcmod JJ 先进 先进 │ 先进 ╟──► PRED 先进 JJ ───────────► VP ───┘ │ │ │
││ ││ └─► 的 assm DEG 的 的 ├► ARG1 的 的 DEG ──────────────────────────┤ │ │
││ ││ ┌─► 多 nummod CD 多 多 │ 多 多 CD ───► QP ───┐ ├► NP ───┘ │
││ ││┌─►└── 语种 nn NN 语种 语种 │ 语种 语种 NN ───► NP ───┴────────► NP ────┤ │
││ │││ ┌─► NLP nn NR NLP NLP │ NLP NLP NR ──┐ │ │
│└─►└┴┴──┴── 技术 dobj NN 技术 技术 ◄─┘ 技术 ───► ARG0 技术 NN ──┴────────────────► NP ───┘ │
└──────────► 。 punct PU 。 。 。 。 PU ──────────────────────────────────────────────────┘
Dep Tree Tok Relat Po Tok NER Type Tok SRL PA1 Tok SRL PA2 Tok Po 3 4 5 6
──────────── ─── ───── ── ─── ──────────────── ─── ──────── ─── ──────── ─── ────────────────────────────────
┌─► 阿婆主 nsubj NN 阿婆主 阿婆主 ───► ARG0 阿婆主 ───► ARG0 阿婆主 NN ───────────────────► NP ───┐
┌┬────┬──┴── 来到 root VV 来到 来到 ╟──► PRED 来到 来到 VV ──────────┐ │
││ │ ┌─► 北京 nn NR 北京 ───► LOCATION 北京 ◄─┐ 北京 北京 NR ──┐ ├► VP ───┐ │
││ └─►└── 立方庭 dobj NR 立方庭 ───► LOCATION 立方庭 ◄─┴► ARG1 立方庭 立方庭 NR ──┴► NP ───┘ │ │
│└─►┌─────── 参观 conj VV 参观 参观 参观 ╟──► PRED 参观 VV ──────────┐ ├► VP ────┤
│ │ ┌───► 自然 nn NN 自然 ◄─┐ 自然 自然 ◄─┐ 自然 NN ──┐ │ │ ├► IP
│ │ │┌──► 语义 nn NN 语义 │ 语义 语义 │ 语义 NN │ ├► VP ───┘ │
│ │ ││┌─► 科技 nn NN 科技 ├► ORGANIZATION 科技 科技 ├► ARG1 科技 NN ├► NP ───┘ │
│ └─►└┴┴── 公司 dobj NN 公司 ◄─┘ 公司 公司 ◄─┘ 公司 NN ──┘ │
└──────────► 。 punct PU 。 。 。 。 PU ──────────────────────────┘

For the meaning of each tag set, please refer to the "Linguistic Annotation Specifications" and "Format Specifications". We have purchased, annotated, or adopted the world's largest and most diverse corpora for joint multilingual multi-task learning, so HanLP's tag sets are also the most comprehensive.
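Because a Document is a plain dict, the fields shown above can be combined with ordinary Python. A minimal sketch using a hand-copied fragment of the output above (in practice the dict is returned by HanLP itself); note that NER entries are `[form, type, begin, end)` spans over the fine-grained tokens:

```python
# A hand-copied fragment of the Document shown above (second sentence only).
doc = {
    'tok/fine': [['阿婆主', '来到', '北京', '立方庭', '参观', '自然', '语义', '科技', '公司', '。']],
    'pos/ctb': [['NN', 'VV', 'NR', 'NR', 'VV', 'NN', 'NN', 'NN', 'NN', 'PU']],
    'ner/msra': [[['北京', 'LOCATION', 2, 3], ['立方庭', 'LOCATION', 3, 4],
                  ['自然语义科技公司', 'ORGANIZATION', 5, 9]]],
}

# Pair each token with its CTB part-of-speech tag, sentence by sentence.
tagged = [list(zip(toks, tags)) for toks, tags in zip(doc['tok/fine'], doc['pos/ctb'])]

# Each NER span's [begin, end) offsets index into the fine-grained tokens.
for form, etype, begin, end in doc['ner/msra'][0]:
    assert ''.join(doc['tok/fine'][0][begin:end]) == form
```

The same offset convention applies to srl, dep, and sdp, which is what lets the tasks be cross-referenced against a single tokenization.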
Writing a deep learning model is not hard; the hard part is reproducing a competitive accuracy. The following code shows how to train, in about 6 minutes on the sighan2005 PKU corpus, a Chinese word segmentation model that surpasses published academic results.
```python
# Import paths follow HanLP 2.1's demo training scripts.
from hanlp.common.dataset import SortingSamplerBuilder
from hanlp.components.tokenizers.transformer import TransformerTaggingTokenizer
from hanlp.datasets.tokenization.sighan2005.pku import SIGHAN2005_PKU_TRAIN_ALL, SIGHAN2005_PKU_TEST

tokenizer = TransformerTaggingTokenizer()
save_dir = 'data/model/cws/sighan2005_pku_bert_base_96.73'
tokenizer.fit(
    SIGHAN2005_PKU_TRAIN_ALL,
    SIGHAN2005_PKU_TEST,  # Conventionally, no devset is used. See Tian et al. (2020).
    save_dir,
    'bert-base-chinese',
    max_seq_len=300,
    char_level=True,
    hard_constraint=True,
    sampler_builder=SortingSamplerBuilder(batch_size=32),
    epochs=3,
    adam_epsilon=1e-6,
    warmup_steps=0.1,
    weight_decay=0.01,
    word_dropout=0.1,
    seed=1660853059,
)
tokenizer.evaluate(SIGHAN2005_PKU_TEST, save_dir)
```

Because a random seed is specified, the result is guaranteed to be 96.73. Unlike falsely advertised academic papers or commercial projects, HanLP guarantees that all results are reproducible. If you find otherwise, we will treat it as a fatal bug of the highest priority.
Please refer to demo for more training scripts.
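The reproducibility guarantee above rests on fixing every source of randomness: with the same seed, shuffling and initialization repeat exactly across runs. The principle in miniature, using only Python's standard library (illustrative, not HanLP's actual seeding code):

```python
import random

def shuffled_batches(seed):
    # With a fixed seed, the shuffle order (e.g., of training batches) is deterministic.
    rng = random.Random(seed)
    data = list(range(10))
    rng.shuffle(data)
    return data

# Two runs with the same seed yield byte-for-byte identical orders.
run1 = shuffled_batches(1660853059)
run2 = shuffled_batches(1660853059)
assert run1 == run2
```

In a real training run, the framework seeds Python, NumPy, and the deep learning backend together, which is why the reported 96.73 can be replayed exactly.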
| lang | corpora | model | tok fine | tok coarse | pos ctb | pos pku | pos 863 | pos ud | ner pku | ner msra | ner ontonotes | dep | con | srl | sdp SemEval16 | sdp DM | sdp PAS | sdp PSD | lem | fea | amr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mul | UD2.7 OntoNotes5 | small | 98.62 | - | - | - | - | 93.23 | - | - | 74.42 | 79.10 | 76.85 | 70.63 | - | 91.19 | 93.67 | 85.34 | 87.71 | 84.51 | - |
| mul | UD2.7 OntoNotes5 | base | 98.97 | - | - | - | - | 90.32 | - | - | 80.32 | 78.74 | 71.23 | 73.63 | - | 92.60 | 96.04 | 81.19 | 85.08 | 82.13 | - |
| zh | open | small | 97.25 | - | 96.66 | - | - | - | - | - | 95.00 | 84.57 | 87.62 | 73.40 | 84.57 | - | - | - | - | - | - |
| zh | open | base | 97.50 | - | 97.07 | - | - | - | - | - | 96.04 | 87.11 | 89.84 | 77.78 | 87.11 | - | - | - | - | - | - |
| zh | close | small | 96.70 | 95.93 | 96.87 | 97.56 | 95.05 | - | 96.22 | 95.74 | 76.79 | 84.44 | 88.13 | 75.81 | 74.28 | - | - | - | - | - | - |
| zh | close | base | 97.52 | 96.44 | 96.99 | 97.59 | 95.29 | - | 96.48 | 95.72 | 77.77 | 85.29 | 88.57 | 76.52 | 73.76 | - | - | - | - | - | - |
| zh | close | ernie | 96.95 | 97.29 | 96.76 | 97.64 | 95.22 | - | 97.31 | 96.47 | 77.95 | 85.67 | 89.17 | 78.51 | 74.10 | - | - | - | - | - | - |
HanLP's data preprocessing and splits do not necessarily follow popular conventions. For example, HanLP adopts the full version of the MSRA named entity recognition corpus instead of the truncated version commonly used; it uses the Stanford Dependencies standard, with wider syntactic coverage, rather than the Zhang and Clark (2008) standard adopted in academia; and it proposes an even split of CTB instead of the academic split, which is uneven and missing 51 gold files. HanLP open-sources the complete corpus preprocessing scripts and the corresponding corpora, striving to make Chinese NLP more transparent.
In short, HanLP only does what we believe is correct and state of the art, not necessarily what is popular or authoritative.
If you use HanLP in your research, please cite it as follows:
```bibtex
@inproceedings{he-choi-2021-stem,
    title = "The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders",
    author = "He, Han and Choi, Jinho D.",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.451",
    pages = "5555--5577",
    abstract = "Multi-task learning with transformer encoders (MTL) has emerged as a powerful technique to improve performance on closely-related tasks for both accuracy and efficiency while a question still remains whether or not it would perform as well on tasks that are distinct in nature. We first present MTL results on five NLP tasks, POS, NER, DEP, CON, and SRL, and depict its deficiency over single-task learning. We then conduct an extensive pruning analysis to show that a certain set of attention heads get claimed by most tasks during MTL, who interfere with one another to fine-tune those heads for their own objectives. Based on this finding, we propose the Stem Cell Hypothesis to reveal the existence of attention heads naturally talented for many tasks that cannot be jointly trained to create adequate embeddings for all of those tasks. Finally, we design novel parameter-free probes to justify our hypothesis and demonstrate how attention heads are transformed across the five tasks during MTL through label analysis.",
}
```

HanLP's source code is licensed under the Apache License 2.0, which permits free commercial use. Please include a link to HanLP and the license in your product documentation. HanLP is protected by copyright law; infringement will be pursued.
Since v1.7, HanLP has operated independently, with Natural Semantics (Qingdao) Technology Co., Ltd. as the project's principal, leading the development of subsequent versions and holding their copyright.
HanLP v1.3 through v1.65 were developed with the support of Dakuai Search, which holds the copyright for those versions; they remain fully open source.
HanLP was supported in its early days by Shanghai Linyuan Company, which holds the copyright for v1.28 and earlier; those versions were also released on the company's website.
The licensing of machine learning models is not yet settled law, but in the spirit of respecting the original licenses of open-source corpora, unless otherwise stated, HanLP's multilingual models are licensed under CC BY-NC-SA 4.0, and its Chinese models are for research and teaching purposes only.
https://hanlp.hankcs.com/docs/references.html