VnCoreNLP: A Vietnamese natural language processing toolkit

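VnCoreNLP is a fast and accurate NLP annotation pipeline for Vietnamese. It provides rich linguistic annotations through four key NLP components: word segmentation, POS tagging, named entity recognition (NER) and dependency parsing. Users do not have to install external dependencies, and the processing pipeline can be run either from the command line or through the API. The general architecture and experimental results of VnCoreNLP can be found in the following related papers: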

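1. Thanh Vu, Dat Quoc Nguyen, Dai Quoc Nguyen, Mark Dras and Mark Johnson. 2018. VnCoreNLP: A Vietnamese Natural Language Processing Toolkit. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, NAACL 2018, pages 56-60.
2. Dat Quoc Nguyen, Dai Quoc Nguyen, Thanh Vu, Mark Dras and Mark Johnson. 2018. A Fast and Accurate Vietnamese Word Segmenter. In Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC 2018, pages 2582-2587.
3. Dat Quoc Nguyen, Thanh Vu, Dai Quoc Nguyen, Mark Dras and Mark Johnson. 2017. From Word Segmentation to POS Tagging for Vietnamese. In Proceedings of the 15th Annual Workshop of the Australasian Language Technology Association, ALTA 2017, pages 108-113.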

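Please cite paper [1] whenever VnCoreNLP is used to produce published results or is incorporated into other software. If you work in depth on word segmentation or POS tagging, you are also encouraged to cite paper [2] or [3], respectively.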

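If you are looking for light-weight versions, VnCoreNLP's word segmentation and POS tagging components have also been released as the independent packages RDRsegmenter [2] and VnMarMoT [3].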

Installation

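Java 1.8+ (prerequisite)
The file VnCoreNLP-1.2.jar (27MB) and the folder models (115MB) must be placed in the same working folder.
Python 3.6+ if using the Python wrapper of VnCoreNLP. To install this wrapper, run the following command:
$ pip3 install py_vncorenlp
A special thanks goes to Linh The Nguyen for creating this wrapper!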
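
The layout requirement above (jar file and models folder side by side) can be checked before loading anything. The following is a minimal sketch, not part of VnCoreNLP or py_vncorenlp: the helper name `check_vncorenlp_folder` is hypothetical, and it only tests that the two entries exist, not that their contents are complete.

```python
from pathlib import Path


def check_vncorenlp_folder(work_dir, jar_name="VnCoreNLP-1.2.jar", models_name="models"):
    """Return a list of problems found in work_dir (hypothetical helper).

    An empty list means the jar file and the models folder are both present.
    """
    root = Path(work_dir)
    problems = []
    if not (root / jar_name).is_file():
        problems.append(f"missing file: {jar_name}")
    if not (root / models_name).is_dir():
        problems.append(f"missing folder: {models_name}")
    return problems


if __name__ == "__main__":
    # Placeholder path: replace it with your own working folder.
    for problem in check_vncorenlp_folder("/absolute/path/to/vncorenlp"):
        print(problem)
```

Returning a list rather than raising lets a caller report both missing pieces at once.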

Usage for Python users

import py_vncorenlp

# Automatically download VnCoreNLP components from the original repository
# and save them in some local working folder
py_vncorenlp.download_model(save_dir='/absolute/path/to/vncorenlp')

# Load VnCoreNLP from the local working folder that contains both `VnCoreNLP-1.2.jar` and `models`
model = py_vncorenlp.VnCoreNLP(save_dir='/absolute/path/to/vncorenlp')
# Equivalent to: model = py_vncorenlp.VnCoreNLP(annotators=["wseg", "pos", "ner", "parse"], save_dir='/absolute/path/to/vncorenlp')

# Annotate a raw corpus
model.annotate_file(input_file="/absolute/path/to/input/file", output_file="/absolute/path/to/output/file")

# Annotate a raw text
model.print_out(model.annotate_text("Ông Nguyễn Khắc Chúc  đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."))

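By default, the output is formatted with 6 columns representing the word index, word form, POS tag, NER label, head index of the current word, and its dependency relation type: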

1       Ông     Nc      O       4       sub
2       Nguyễn_Khắc_Chúc        Np      B-PER   1       nmod
3       đang    R       O       4       adv
4       làm_việc        V       O       0       root
5       tại     E       O       4       loc
6       Đại_học N       B-ORG   5       pob
...
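
To work with this output programmatically, each row can be read into a small record. The sketch below is an illustration rather than part of VnCoreNLP: the `Token` type and the `parse_rows` and `find_roots` helpers are hypothetical names. It assumes the six columns are separated by whitespace, which holds as long as multi-syllable words are joined with underscores as in the sample.

```python
from typing import NamedTuple


class Token(NamedTuple):
    index: int  # word index within the sentence, starting at 1
    form: str   # word form, syllables joined by "_"
    pos: str    # POS tag
    ner: str    # NER label
    head: int   # index of the head word, 0 for the root
    dep: str    # dependency relation type


def parse_rows(text):
    """Parse 6-column output rows into Token records, skipping other lines."""
    tokens = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # blank lines, "..." placeholders, anything incomplete
        index, form, pos, ner, head, dep = fields
        tokens.append(Token(int(index), form, pos, ner, int(head), dep))
    return tokens


def find_roots(tokens):
    """Return the tokens whose head index is 0."""
    return [token for token in tokens if token.head == 0]


# The six rows shown above, written with tab separators (an assumption).
SAMPLE = "\n".join([
    "1\tÔng\tNc\tO\t4\tsub",
    "2\tNguyễn_Khắc_Chúc\tNp\tB-PER\t1\tnmod",
    "3\tđang\tR\tO\t4\tadv",
    "4\tlàm_việc\tV\tO\t0\troot",
    "5\ttại\tE\tO\t4\tloc",
    "6\tĐại_học\tN\tB-ORG\t5\tpob",
])

if __name__ == "__main__":
    for token in find_roots(parse_rows(SAMPLE)):
        print(token.form)
```

Because the parser ignores any line that does not have exactly six fields, it can be fed output in which sentences are separated by blank lines, at the cost of losing the sentence boundaries.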

For users who use VnCoreNLP only for word segmentation:

rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir='/absolute/path/to/vncorenlp')
text = "Ông Nguyễn Khắc Chúc  đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."
output = rdrsegmenter.word_segment(text)
print(output)
# ['Ông Nguyễn_Khắc_Chúc đang làm_việc tại Đại_học Quốc_gia Hà_Nội .', 'Bà Lan , vợ ông Chúc , cũng làm_việc tại đây .']
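
The segmented output is a list of strings, one per sentence, with words separated by spaces and the syllables of a multi-syllable word joined by underscores. If a downstream step needs word lists with the original spacing, a conversion along the following lines may help. This is a sketch with a hypothetical helper name, `to_word_lists`; note that it would also alter any literal underscore present in the input text.

```python
def to_word_lists(segmented_sentences):
    """Split each segmented sentence into words and turn "_" back into spaces."""
    return [
        [word.replace("_", " ") for word in sentence.split()]
        for sentence in segmented_sentences
    ]


if __name__ == "__main__":
    # The segmenter output shown above, copied verbatim.
    output = [
        "Ông Nguyễn_Khắc_Chúc đang làm_việc tại Đại_học Quốc_gia Hà_Nội .",
        "Bà Lan , vợ ông Chúc , cũng làm_việc tại đây .",
    ]
    for words in to_word_lists(output):
        print(words)
```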

Usage for Java users

Using VnCoreNLP from the command line

You can run VnCoreNLP to annotate an input raw text corpus (e.g. a collection of news content) by using the following commands:

// To perform word segmentation, POS tagging, NER and then dependency parsing
$ java -Xmx2g -jar VnCoreNLP-1.2.jar -fin input.txt -fout output.txt
// To perform word segmentation, POS tagging and then NER
$ java -Xmx2g -jar VnCoreNLP-1.2.jar -fin input.txt -fout output.txt -annotators wseg,pos,ner
// To perform word segmentation and then POS tagging
$ java -Xmx2g -jar VnCoreNLP-1.2.jar -fin input.txt -fout output.txt -annotators wseg,pos
// To perform word segmentation
$ java -Xmx2g -jar VnCoreNLP-1.2.jar -fin input.txt -fout output.txt -annotators wseg
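
When these commands are launched from a script, it can be convenient to assemble the argument list in one place. The sketch below only builds the same command lines as shown above. The function name `build_command` and its parameters are hypothetical, and it uses only the flags that appear in this section (`-fin`, `-fout`, `-annotators`).

```python
def build_command(input_path, output_path, annotators=None,
                  jar="VnCoreNLP-1.2.jar", max_heap="2g"):
    """Build the java argument list for a VnCoreNLP command-line run.

    With annotators=None the -annotators flag is omitted, which corresponds
    to the first command above (the full pipeline).
    """
    command = ["java", f"-Xmx{max_heap}", "-jar", jar,
               "-fin", input_path, "-fout", output_path]
    if annotators:
        command += ["-annotators", ",".join(annotators)]
    return command


if __name__ == "__main__":
    # Print the command instead of running it.
    print(" ".join(build_command("input.txt", "output.txt", ["wseg", "pos"])))
```

To actually execute the command, the returned list can be passed to `subprocess.run(command, check=True)`, run from the working folder that holds the jar file and the models folder.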

Using VnCoreNLP from the API

The following code is a simple and complete example:

import vn.pipeline.*;
import java.io.*;

public class VnCoreNLPExample {
    public static void main(String[] args) throws IOException {

        // "wseg", "pos", "ner", and "parse" refer to word segmentation, POS tagging, NER and dependency parsing, respectively.
        String[] annotators = {"wseg", "pos", "ner", "parse"};
        VnCoreNLP pipeline = new VnCoreNLP(annotators);

        String str = "Ông Nguyễn Khắc Chúc  đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây.";

        Annotation annotation = new Annotation(str);
        pipeline.annotate(annotation);

        System.out.println(annotation.toString());
        // 1    Ông                 Nc  O       4   sub
        // 2    Nguyễn_Khắc_Chúc    Np  B-PER   1   nmod
        // 3    đang                R   O       4   adv
        // 4    làm_việc            V   O       0   root
        // ...

        // Write to file
        PrintStream outputPrinter = new PrintStream("output.txt");
        pipeline.printToFile(annotation, outputPrinter);

        // You can also get a single sentence to analyze individually
        Sentence firstSentence = annotation.getSentences().get(0);
        System.out.println(firstSentence.toString());
    }
}