tagged wiki2019zh
v1.0.0
The 2019 Chinese wiki corpus wiki2019zh.zip was segmented into words with the COARSE_ELECTRA_SMALL_ZH model from HanLP.
The segmentation results are labeled with the 4-tag BMES scheme (B = beginning of a word, M = middle, E = end, S = single-character word), one character and its tag per line.
For example, if the text to segment is 你好Tom。我喜欢吃羊肉串。, the labeling result is:
你 B
好 E
T B
o M
m E
。 S
SENTENCE END
我 S
喜 B
欢 E
吃 S
羊 B
肉 M
串 E
。 S
SENTENCE END
TEXT END
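As a rough illustration of the labeling shown above, the following Python sketch turns a list of already segmented words into character/tag pairs. The function name words_to_bmes is hypothetical and is not taken from process_wiki_data.py; the word list is assumed to come from a segmenter and is hard-coded here.

```python
def words_to_bmes(words):
    """Return (character, tag) pairs for a list of segmented words."""
    pairs = []
    for word in words:
        if not word:
            continue  # ignore empty strings
        if len(word) == 1:
            pairs.append((word, "S"))  # single-character word
            continue
        pairs.append((word[0], "B"))  # first character
        for ch in word[1:-1]:
            pairs.append((ch, "M"))  # inner characters
        pairs.append((word[-1], "E"))  # last character
    return pairs


# Assumed segmenter output for the first example sentence.
for ch, tag in words_to_bmes(["你好", "Tom", "。"]):
    print(ch, tag)
print("SENTENCE END")
```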
During use, pay attention to how embeddings and punctuation are handled, and to the SENTENCE END and TEXT END markers, which mark the end of a sentence and the end of the corpus text, respectively.
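A minimal sketch of reading this format back into words, sentences, and texts might look as follows. The function name read_tagged is hypothetical. The sketch assumes each data line is a character, a single space, and a tag, that the tags are well formed, and that every TEXT END is preceded by a SENTENCE END, as in the example above.

```python
def read_tagged(lines):
    """Yield each text as a list of sentences; a sentence is a list of words."""
    sentences, words, current = [], [], ""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "SENTENCE END":
            sentences.append(words)
            words = []
        elif line == "TEXT END":
            yield sentences
            sentences = []
        elif line:
            # Split on the last space so the character itself may be a space.
            char, _, tag = line.rpartition(" ")
            if tag in ("B", "M"):
                current += char  # word continues
            else:  # "E" or "S" closes the current word
                words.append(current + char)
                current = ""


sample = ["你 B", "好 E", "T B", "o M", "m E", "。 S", "SENTENCE END", "TEXT END"]
for text in read_tagged(sample):
    print(text)
```

Because the function only iterates over lines, an open file object can be passed in place of the list, which avoids loading the whole corpus into memory.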
The script used for segmentation is process_wiki_data.py.
Running this script takes a long time: