SmoothNLP

Author	Email
Victor	[email protected]
Yinjun	[email protected]
海蜇	[email protected]

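SmoothNLP
- Install
- Knowledge Graph
  - Usage Example & Visualization
- Basic NLP Pipelines
  - 1. Tokenize (word segmentation)
  - 2. Postag (part-of-speech tagging)
  - 3. NER (named entity recognition)
  - 4. Financial entity recognition
  - 5. Dependency parsing
  - 6. Sentence splitting
  - 7. Multithreading support
  - 8. Logging
- Unsupervised Learning
  - New word discovery
  - Event clustering
- Supervised Learning
  - (News) event classification
- Tutorial
- Service Notes
  - Statement
  - Pro edition
  - FAQ
- Easter Eggs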

Install

Install via pip

pip install "smoothnlp>=0.4.0"

Install the latest version from source

git clone https://github.com/smoothnlp/SmoothNLP.git
cd SmoothNLP
python setup.py install

Knowledge Graph

Supported only in SmoothNLP V0.3.0 and later; the example below is for V0.4 and later:

Usage Example & Visualization

from smoothnlp.algorithm import kg
from kgexplore import visual
ngrams = kg.extract_ngram(["SmoothNLP在V0.3版本中正式推出知识抽取功能",
                           "SmoothNLP专注于可解释的NLP技术",
                           "SmoothNLP支持Python与Java",
                           "SmoothNLP将帮助工业界与学术界更加高效的构建知识图谱",
                           "SmoothNLP是上海文磨网络科技公司的开源项目",
                           "SmoothNLP在V0.4版本中推出对图谱节点的分类功能",
                           "KGExplore是SmoothNLP的一个子项目"])
visual.visualize(ngrams, width=12, height=10)

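[Figure: SmoothNLP_KG_Demo]

Feature Description

Edge types supported in V0.4: event trigger (事件触发), state description (状态描述), attribute description (属性描述), and numeric description (数值描述).
Node types supported in V0.4: product (产品), region (地区), company and brand (公司与品牌), goods (货品), organization (机构), person (人物), modifier phrase (修饰短语), and other (其他).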

Basic NLP Pipelines

1. Tokenize (word segmentation)

>>> import smoothnlp
>>> smoothnlp.segment('欢迎在Python中使用SmoothNLP')
['欢迎', '在', 'Python', '中', '使用', 'SmoothNLP']

2. Postag (part-of-speech tagging)

POS tag explanations: wiki

>>> smoothnlp.postag('欢迎在Python中使用SmoothNLP')
[{'token': '欢迎', 'postag': 'VV'},
 {'token': '在', 'postag': 'P'},
 {'token': 'Python', 'postag': 'NN'},
 {'token': '中', 'postag': 'LC'},
 {'token': '使用', 'postag': 'VV'},
 {'token': 'SmoothNLP', 'postag': 'NN'}]
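
Each entry in the sample output is a dict with `token` and `postag` keys, so downstream filtering is ordinary list processing. The following is a minimal sketch that uses the sample output above as literal data, so it runs without the SmoothNLP service; the helper name `tokens_with_tags` is hypothetical and not part of SmoothNLP.

```python
from typing import Dict, Iterable, List

# Sample output copied from the postag example above.
tagged = [
    {'token': '欢迎', 'postag': 'VV'},
    {'token': '在', 'postag': 'P'},
    {'token': 'Python', 'postag': 'NN'},
    {'token': '中', 'postag': 'LC'},
    {'token': '使用', 'postag': 'VV'},
    {'token': 'SmoothNLP', 'postag': 'NN'},
]


def tokens_with_tags(items: List[Dict[str, str]], wanted: Iterable[str]) -> List[str]:
    """Return tokens whose POS tag is in `wanted`, keeping sentence order."""
    wanted_set = set(wanted)
    return [item['token'] for item in items if item['postag'] in wanted_set]


nouns = tokens_with_tags(tagged, ['NN'])
verbs = tokens_with_tags(tagged, ['VV'])
print(nouns)
print(verbs)
```

In real use, `tagged` would be the return value of `smoothnlp.postag(...)` rather than a literal.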

3. NER (named entity recognition)

>>> smoothnlp.ner("中国平安2019年度长期服务计划于2019年5月7日至5月14日通过二级市场完成购股")
[{'charStart': 0, 'charEnd': 4, 'text': '中国平安', 'nerTag': 'COMPANY_NAME', 'sTokenList': {'1': {'token': '中国平安', 'postag': None}}, 'normalizedEntityValue': '中国平安'},
 {'charStart': 4, 'charEnd': 9, 'text': '2019年', 'nerTag': 'NUMBER', 'sTokenList': {'2': {'token': '2019年', 'postag': 'CD'}}, 'normalizedEntityValue': '2019年'},
 {'charStart': 17, 'charEnd': 26, 'text': '2019年5月7日', 'nerTag': 'DATETIME', 'sTokenList': {'8': {'token': '2019年5月', 'postag': None}, '9': {'token': '7日', 'postag': None}}, 'normalizedEntityValue': '2019年5月7日'},
 {'charStart': 27, 'charEnd': 32, 'text': '5月14日', 'nerTag': 'DATETIME', 'sTokenList': {'11': {'token': '5月', 'postag': None}, '12': {'token': '14日', 'postag': None}}, 'normalizedEntityValue': '5月14日'}]
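
In the sample output above, `charStart` and `charEnd` behave like a half-open Python slice over the input string. The sketch below checks that reading against the sample and groups entity text by `nerTag`. The entities are a trimmed copy of the sample output (only the fields used here), and `group_by_tag` is a hypothetical helper, not a SmoothNLP function.

```python
from collections import defaultdict
from typing import Dict, List

text = "中国平安2019年度长期服务计划于2019年5月7日至5月14日通过二级市场完成购股"

# Sample entities from the NER example above, reduced to the fields used here.
entities = [
    {'charStart': 0, 'charEnd': 4, 'text': '中国平安', 'nerTag': 'COMPANY_NAME'},
    {'charStart': 4, 'charEnd': 9, 'text': '2019年', 'nerTag': 'NUMBER'},
    {'charStart': 17, 'charEnd': 26, 'text': '2019年5月7日', 'nerTag': 'DATETIME'},
    {'charStart': 27, 'charEnd': 32, 'text': '5月14日', 'nerTag': 'DATETIME'},
]


def group_by_tag(ents: List[dict]) -> Dict[str, List[str]]:
    """Collect entity surface text under each nerTag, in order of appearance."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent['nerTag']].append(ent['text'])
    return dict(grouped)


# True when text[charStart:charEnd] reproduces the entity text for every entity.
spans_match = all(text[e['charStart']:e['charEnd']] == e['text'] for e in entities)
by_tag = group_by_tag(entities)
print(spans_match)
print(by_tag)
```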

4. Financial entity recognition

>>> smoothnlp.company_recognize("旷视科技预计将在今年9月在港IPO")
[{'charStart': 0,
  'charEnd': 4,
  'text': '旷视科技',
  'nerTag': 'COMPANY_NAME',
  'sTokenList': {'1': {'token': '旷视科技', 'postag': None}},
  'normalizedEntityValue': '旷视科技'}]

5. Dependency parsing

Note: index 0 in the output of smoothnlp.dep_parsing is a dummy root token.

Dependency label explanations: wiki

>>> smoothnlp.dep_parsing("特斯拉是全球最大的电动汽车制造商。")
[{'relationship': 'top', 'dependentIndex': 2, 'targetIndex': 1},
 {'relationship': 'root', 'dependentIndex': 0, 'targetIndex': 2},
 {'relationship': 'dep', 'dependentIndex': 5, 'targetIndex': 3},
 {'relationship': 'advmod', 'dependentIndex': 5, 'targetIndex': 4},
 {'relationship': 'ccomp', 'dependentIndex': 2, 'targetIndex': 5},
 {'relationship': 'cpm', 'dependentIndex': 5, 'targetIndex': 6},
 {'relationship': 'amod', 'dependentIndex': 8, 'targetIndex': 7},
 {'relationship': 'attr', 'dependentIndex': 2, 'targetIndex': 8},
 {'relationship': 'attr', 'dependentIndex': 2, 'targetIndex': 9},
 {'relationship': 'punct', 'dependentIndex': 2, 'targetIndex': 10}]
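
In the sample output, the `root` edge has `dependentIndex` 0 (the dummy root) and `targetIndex` 2, which suggests that `dependentIndex` holds the head of each edge and `targetIndex` the token attached to it. That reading is inferred from the sample, not from documented behavior. Under that assumption, the flat edge list can be turned into a head-to-children map as sketched below; `children_by_head` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Edge list copied from the dep_parsing example above.
edges = [
    {'relationship': 'top', 'dependentIndex': 2, 'targetIndex': 1},
    {'relationship': 'root', 'dependentIndex': 0, 'targetIndex': 2},
    {'relationship': 'dep', 'dependentIndex': 5, 'targetIndex': 3},
    {'relationship': 'advmod', 'dependentIndex': 5, 'targetIndex': 4},
    {'relationship': 'ccomp', 'dependentIndex': 2, 'targetIndex': 5},
    {'relationship': 'cpm', 'dependentIndex': 5, 'targetIndex': 6},
    {'relationship': 'amod', 'dependentIndex': 8, 'targetIndex': 7},
    {'relationship': 'attr', 'dependentIndex': 2, 'targetIndex': 8},
    {'relationship': 'attr', 'dependentIndex': 2, 'targetIndex': 9},
    {'relationship': 'punct', 'dependentIndex': 2, 'targetIndex': 10},
]


def children_by_head(edge_list: List[dict]) -> Dict[int, List[Tuple[int, str]]]:
    """Map each head index to its (child index, relation) pairs.

    Assumes dependentIndex is the head side of the edge, as the root edge
    in the sample suggests.
    """
    tree = defaultdict(list)
    for edge in edge_list:
        tree[edge['dependentIndex']].append((edge['targetIndex'], edge['relationship']))
    return dict(tree)


tree = children_by_head(edges)
# Index 0 is the dummy root, so its single child is the sentence root.
sentence_root = tree[0][0][0]
print(sentence_root)
print(tree[sentence_root])
```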

6. Sentence splitting

>>> smoothnlp.split2sentences("句子1!句子2!")
['句子1!', '句子2!']
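
Sentence splitting matters mostly as preprocessing, because the FAQ states that the basic pipeline functions accept strings of at most 200 characters. The sketch below shows one way to keep every piece within that limit. To stay runnable without the service it uses a regex-based stand-in splitter (`naive_split`, an assumption and not SmoothNLP's algorithm); both helper names are hypothetical.

```python
import re
from typing import Callable, List

MAX_CHARS = 200  # length limit stated in the FAQ


def naive_split(text: str) -> List[str]:
    """Stand-in splitter: cut after common sentence-final punctuation."""
    parts = re.split(r'(?<=[。！？!?])', text)
    return [part for part in parts if part]


def sentences_within_limit(
    text: str,
    splitter: Callable[[str], List[str]] = naive_split,
    max_chars: int = MAX_CHARS,
) -> List[str]:
    """Split text into sentences, hard-wrapping any sentence over max_chars."""
    pieces = []
    for sentence in splitter(text):
        for start in range(0, len(sentence), max_chars):
            pieces.append(sentence[start:start + max_chars])
    return pieces


pieces = sentences_within_limit("句子1!句子2!")
print(pieces)
```

With SmoothNLP installed, passing `splitter=smoothnlp.split2sentences` would use the library's own splitter in place of the stand-in.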

7. Multithreading support

By default, SmoothNLP uses 2 threads for service calls:

from smoothnlp import config
config.setNumThreads(2)

8. Logging

from smoothnlp import config
config.setLogLevel("DEBUG")  # set the log level

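Unsupervised Learning

New word discovery

Algorithm overview | Usage instructions

Event clustering

This feature is currently offered only through our commercial solutions and online service. For details, contact [email protected]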

Demo output

[
  {
    "url": "https://36kr.com/p/5167309",
    "title": "Facebook第三次数据泄露，可能导致680万用户私人照片泄露",
    "pub_ts": 1544832000
  },
  {
    "url": "https://www.pencilnews.cn/p/24038.html",
    "title": "热点 | Facebook将因为泄露700万用户个人照片 面临16亿美元罚款",
    "pub_ts": 1544832000
  },
  {
    "url": "https://finance.sina.com.cn/stock/usstock/c/2018-12-15/doc-ihmutuec9334184.shtml",
    "title": "Facebook再曝新数据泄露 6800万用户或受影响",
    "pub_ts": 1544844120
  }
]
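
Each item in the clustered output carries `url`, `title`, and `pub_ts`. The `pub_ts` values look like Unix timestamps in seconds; the source states neither the unit nor the time zone, so that is an assumption. Under it, a cluster can be ordered by publication time and rendered in UTC as sketched below, using the demo output as literal data.

```python
import json
from datetime import datetime, timezone

# Demo cluster copied from the output above.
cluster_json = """
[
  {"url": "https://36kr.com/p/5167309",
   "title": "Facebook第三次数据泄露，可能导致680万用户私人照片泄露",
   "pub_ts": 1544832000},
  {"url": "https://www.pencilnews.cn/p/24038.html",
   "title": "热点 | Facebook将因为泄露700万用户个人照片 面临16亿美元罚款",
   "pub_ts": 1544832000},
  {"url": "https://finance.sina.com.cn/stock/usstock/c/2018-12-15/doc-ihmutuec9334184.shtml",
   "title": "Facebook再曝新数据泄露 6800万用户或受影响",
   "pub_ts": 1544844120}
]
"""


def to_utc(ts: int) -> datetime:
    """Interpret ts as Unix seconds (assumed) and return an aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


cluster = json.loads(cluster_json)
# sorted() is stable, so items with equal pub_ts keep their original order.
ordered = sorted(cluster, key=lambda item: item["pub_ts"])
earliest = to_utc(ordered[0]["pub_ts"])
latest = to_utc(ordered[-1]["pub_ts"])

for item in ordered:
    print(to_utc(item["pub_ts"]).isoformat(), item["url"])
```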

Side note: the Sina editor got the figure wrong and overstated the facts; Facebook did not actually leak 68 million photos.

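Supervised Learning

(News) event classification

This feature is currently offered only through our commercial solutions and online service. For details, contact [email protected]; the online service supports output via API.

Results

Event type	AUC	Precision
Investment and M&A	0.996	0.982
Business cooperation	0.977	0.885
Directors, supervisors and senior executives	0.982	0.940
Revenue reports	0.994	0.960
Corporate contract signing	0.993	0.904
Business expansion	0.968	0.869
Product reports	0.977	0.911
Industry policy	0.990	0.879
Poor business performance	0.981	0.765
Regulatory violations and summons	0.951	0.890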

References

ASER
HanLP

Tutorial

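Multithreaded calls

Service Notes

Statement

SmoothNLP provides full REST text-parsing and related services through cloud microservices. For general users such as open-source enthusiasts, we currently support qps<=5; for commercial users, we offer unrestricted cloud accounts or on-premises deployment.
Basic NLP tasks, including tokenization, POS tagging, and dependency parsing, are implemented in Java under the smoothnlp_maven folder and can be compiled and packaged with Maven.
If you are looking for a commercial NLP or knowledge graph solution, email us at [email protected]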
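
Because the free tier is described as qps<=5, a client that loops over many texts may want to pace its own calls. The following is a minimal sketch of a minimum-interval throttle for single-threaded use. The `Throttle` class and `call_throttled` helper are hypothetical and not part of SmoothNLP, and the clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time
from typing import Callable, Iterable, List, Optional


class Throttle:
    """Space calls at least 1/max_qps seconds apart (single-threaded use)."""

    def __init__(
        self,
        max_qps: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / max_qps
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> None:
        """Block until the next call is allowed, then reserve the next slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._clock()
        self._next_allowed = now + self.min_interval


def call_throttled(func: Callable, items: Iterable, throttle: Throttle) -> List:
    """Apply func to each item, waiting on the throttle before every call."""
    results = []
    for item in items:
        throttle.wait()
        results.append(func(item))
    return results
```

A call such as `call_throttled(smoothnlp.segment, texts, Throttle(max_qps=5))` would then stay within the stated limit, provided no other thread or process is calling the service at the same time.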

Pro edition

SmoothNLP Pro offers stable, reliable support for enterprise users; see the usage documentation. For a trial or purchase, contact [email protected]

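FAQ

Note: since the change in version 0.2.20, the basic pipeline functions restrict only the input string length (at most 200 characters). To process a longer corpus, first split it into sentences with smoothnlp.split2sentences.
The knowledge graph visualization (before V0.4) uses the SimHei font by default. matplotlib does not support Chinese fonts in most environments, so we provide a download link for the font package; you can load the SimHei font into matplotlib's font library by running the following code:

import matplotlib.pyplot as plt
import matplotlib.font_manager as font_manager
# Register the SimHei font files with matplotlib
font_dirs = ['simhei/']
font_files = font_manager.findSystemFonts(fontpaths=font_dirs)
for font_file in font_files:
    # addfont (matplotlib >= 3.2) replaces createFontList, which was deprecated in 3.2 and later removed
    font_manager.fontManager.addfont(font_file)
plt.rcParams['font.family'] = "SimHei"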

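Easter Eggs

If you have suggestions for this project or would like to become a co-developer, feel free to open an issue or pull request; in return, we will share data or offer a free data trial of kgexplore.
If you are interested in NLP algorithms or application scenarios but lack the data to implement them, we provide free data support (download).
If you are a university student looking for research material on NLP or knowledge graphs, or even an internship, feel free to email [email protected]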
