spikex下載 - spikex源代碼下載

Spikex-用於知識提取的Spacy管道

Spikex是一系列準備插入Spacy管道的管道。它旨在幫助以幾乎零的精力來構建知識提取工具。

Spikex 0.5.0中的新功能

威克拉普從未如此快速閃電：

？性能月亮，由於採用了稀疏的鄰接矩陣來處理頁面圖，而不是使用igraph
內存優化，消費量減少了約40％，壓縮尺寸降低了約20％，引入了新的雙向字典來管理數據
新的API ，可更快，更輕鬆地使用和互動
？總體修復程序，以獲取更好的圖形和更好的頁面匹配

管道

wikipagex將wikipedia頁面鏈接到文本中的塊
clusterx在文本中挑選名詞塊，並根據重新訪問球映射算法，徑向球映射器的重新審視
Abbrx檢測到縮寫和首字母縮寫，將它們與長形式聯繫起來。它基於Scispacy的改進
Labelx採用圖案匹配表達式的標籤，並在文本中捕獲它們，解決疊加，縮寫和首字母縮寫詞
Phrasex基於自定義屬性名稱和短語模式創建Doc的下劃線擴展。示例是nounphrasex和verbphrasex ，分別提取名詞短語和動詞短語
Sentx基於splitta的文本中檢測句子

工具

帶有頁面的Wikigraph作為節點的類別鏈接
Matcher從Spacy的界面繼承了界面，但使用ROGEX製成的引擎構建，從而提高了其性能

安裝Spikex

一些要求是從Spacy繼承的：

Spacy版本：2.3+
操作系統：MacOS / OS X·Linux·Windows（Cygwin，Mingw，Visual Studio）
Python版本：Python 3.6+（只有64位）
包裝管理人員：PIP

一些依賴項使用Cython ，需要在Spikex之前安裝它：

pip install cython

請記住，始終建議使用虛擬環境，以避免修改系統狀態。

pip

此時，安裝Spikex通過PIP是一行命令：

pip install spikex

用法

預定

Spikex管與Spacy一起使用，因此需要安裝它的模型。在這裡遵循官方指示。支持全新的Spacy 3.0！

威克拉普

WikiGraph是由Wikipedia的一些關鍵組成部分開始構建的：頁面，類別及其之間的關係。

汽車

根據維基百科垃圾場的大小，創建WikiGraph可能需要時間。因此，我們提供準備使用的Wikigraphs：

日期	威克拉普	朗	尺寸（壓縮）	大小（內存）
2021-05-20	enwiki_core	en	1.3GB	8GB
2021-05-20	SimpleWiki_Core	en	20MB	130MB
2021-05-20	itwiki_core	它	208MB	1.2GB
更多...

Spikex提供了一個命令，可以快捷下下載和安裝WikiGraph （Linux或MacOS，Windows尚不支持）：

spikex download-wikigraph simplewiki_core

手動的

可以從命令行創建WikiGraph ，並指定Wikipedia轉儲要進行的Wikipedia轉儲以及在哪裡保存：

spikex create-wikigraph 
  < YOUR-OUTPUT-PATH > 
  --wiki < WIKI-NAME, default: en > 
  --version < DUMP-VERSION, default: latest > 
  --dumps-path < DUMPS-BACKUP-PATH >

然後需要包裝和安裝：

spikex package-wikigraph 
  < WIKIGRAPH-RAW-PATH > 
  < YOUR-OUTPUT-PATH >

按照包裝過程結束時的說明，並在虛擬環境中安裝分發包。現在，您準備按照您的意願使用Wikigraph：

 from spikex . wikigraph import load as wg_load

wg = wg_load ( "enwiki_core" )
page = "Natural_language_processing"
categories = wg . get_categories ( page , distance = 1 )
for category in categories :
    print ( category )

> >> Category : Speech_recognition
> >> Category : Artificial_intelligence
> >> Category : Natural_language_processing
> >> Category : Computational_linguistics

匹配器

匹配器與Spacy的相同，但是在一次處理許多模式時（成千上萬的訂單）時，請更快，因此請在此處遵循官方用法說明。

一個瑣碎的例子：

 from spikex . matcher import Matcher
from spacy import load as spacy_load

nlp = spacy_load ( "en_core_web_sm" )
matcher = Matcher ( nlp . vocab )
matcher . add ( "TEST" , [[{ "LOWER" : "nlp" }]])
doc = nlp ( "I love NLP" )
for _ , s , e in matcher ( doc ):
  print ( doc [ s : e ])

> >> NLP

Wikipagex

WikiPageX管使用WikiGraph ，以便在匹配Wikipedia頁面標題的文本中找到塊。

 from spacy import load as spacy_load
from spikex . wikigraph import load as wg_load
from spikex . pipes import WikiPageX

nlp = spacy_load ( "en_core_web_sm" )
doc = nlp ( "An apple a day keeps the doctor away" )
wg = wg_load ( "simplewiki_core" )
wpx = WikiPageX ( wg )
doc = wpx ( doc )
for span in doc . _ . wiki_spans :
  print ( span . _ . wiki_pages )

> >> [ 'An' ]
> >> [ 'Apple' , 'Apple_(disambiguation)' , 'Apple_(company)' , 'Apple_(tree)' ]
> >> [ 'A' , 'A_(musical_note)' , 'A_(New_York_City_Subway_service)' , 'A_(disambiguation)' , 'A_(Cyrillic)' )]
> >> [ 'Day' ]
> >> [ 'The_Doctor' , 'The_Doctor_(Doctor_Who)' , 'The_Doctor_(Star_Trek)' , 'The_Doctor_(disambiguation)' ]
> >> [ 'The' ]
> >> [ 'Doctor_(Doctor_Who)' , 'Doctor_(Star_Trek)' , 'Doctor' , 'Doctor_(title)' , 'Doctor_(disambiguation)' ]

clusterx

ClusterX管在文本中取用名詞塊，並使用徑向球映射器算法將其簇。

 from spacy import load as spacy_load
from spikex . pipes import ClusterX

nlp = spacy_load ( "en_core_web_sm" )
doc = nlp ( "Grab this juicy orange and watch a dog chasing a cat." )
clusterx = ClusterX ( min_score = 0.65 )
doc = clusterx ( doc )
for cluster in doc . _ . cluster_chunks :
  print ( cluster )

> >> [ this juicy orange ]
> >> [ a cat , a dog ]

abbrx

Abbrx管在文本中找到縮寫和首字母縮寫詞，將簡短和長的形式鏈接在一起：

 from spacy import load as spacy_load
from spikex . pipes import AbbrX

nlp = spacy_load ( "en_core_web_sm" )
doc = nlp ( "a little snippet with an abbreviation (abbr)" )
abbrx = AbbrX ( nlp . vocab )
doc = abbrx ( doc )
for abbr in doc . _ . abbrs :
  print ( abbr , "->" , abbr . _ . long_form )

> >> abbr - > abbreviation

 from spacy import load as spacy_load
from spikex . pipes import LabelX

nlp = spacy_load ( "en_core_web_sm" )
doc = nlp ( "looking for a computer system engineer" )
patterns = [
  [{ "LOWER" : "computer" }, { "LOWER" : "system" }],
  [{ "LOWER" : "system" }, { "LOWER" : "engineer" }],
]
labelx = LabelX ( nlp . vocab , [( "TEST" , patterns )], validate = True , only_longest = True )
doc = labelx ( doc )
for labeling in doc . _ . labelings :
  print ( labeling , f"[ { labeling . label_ } ]" )

> >> computer system engineer [ TEST ]

Phrasex

PhraseX管會創建一個自定義Doc的下劃線擴展，該擴展可以通過短語模式的匹配來滿足。

 from spacy import load as spacy_load
from spikex . pipes import PhraseX

nlp = spacy_load ( "en_core_web_sm" )
doc = nlp ( "I have Melrose and McIntosh apples, or Williams pears" )
patterns = [
  [{ "LOWER" : "mcintosh" }],
  [{ "LOWER" : "melrose" }],
]
phrasex = PhraseX ( nlp . vocab , "apples" , patterns )
doc = phrasex ( doc )
for apple in doc . _ . apples :
  print ( apple )

> >> Melrose
> >> McIntosh

Sentx

Sentx管將句子分成文本。它修改了令牌的IS_SENT_START屬性，因此必須在Spacy Pipeline中的解析器管道之前添加它：

 from spacy import load as spacy_load
from spikex . pipes import SentX
from spikex . defaults import spacy_version

if spacy_version >= 3 :
  from spacy . language import Language

  @ Language . factory ( "sentx" )
  def create_sentx ( nlp , name ):
      return SentX ()

nlp = spacy_load ( "en_core_web_sm" )
sentx_pipe = SentX () if spacy_version < 3 else "sentx"
nlp . add_pipe ( sentx_pipe , before = "parser" )
doc = nlp ( "A little sentence. Followed by another one." )
for sent in doc . sents :
  print ( sent )

> >> A little sentence .
> >> Followed by another one .

就是所有人

隨時貢獻並獲得樂趣！

展開

spikex

Spikex-用於知識提取的Spacy管道

Spikex 0.5.0中的新功能

管道

工具

安裝Spikex

pip

用法

預定

威克拉普

汽車

手動的

匹配器

Wikipagex

clusterx

abbrx

標籤

Phrasex

Sentx

就是所有人

Google Dorks

shepherd

mongo express

hidusbf

Free Algorithms Books

markdownpedia

chat.petals.dev

GPT Prompt Templates

GPTyped

Google Dorks

shepherd

mongo express

Google Dorks

shepherd

mongo express