SpikeX - SpaCy Pipes for Knowledge Extraction

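SpikeX is a collection of pipes ready to be plugged into a spaCy pipeline. It aims to help build knowledge extraction tools with almost zero effort.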

What's new in SpikeX 0.5.0

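WikiGraph has never been so lightning fast:

Performance boost, thanks to the adoption of a sparse adjacency matrix to handle the pages graph, instead of using igraph
Memory optimization, with consumption cut by about 40% and compressed size cut by about 20%, by introducing new bidirectional dictionaries to manage data
New APIs for faster and easier usage and interaction
Overall fixes, for a better graph and better page matching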
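
The release notes mention bidirectional dictionaries for managing data. As a rough illustration of the idea only (this is not SpikeX's implementation, and the class name `BiDict` and its methods are hypothetical), a two-way mapping lets a graph store compact integer ids while lookups still work from title to id and from id to title:

```python
class BiDict:
    """Toy two-way mapping between hashable keys and values (hypothetical helper)."""

    def __init__(self):
        self._forward = {}   # key -> value
        self._backward = {}  # value -> key

    def put(self, key, value):
        # Drop stale pairs first so both directions stay one-to-one.
        if key in self._forward:
            del self._backward[self._forward[key]]
        if value in self._backward:
            del self._forward[self._backward[value]]
        self._forward[key] = value
        self._backward[value] = key

    def get(self, key):
        return self._forward[key]

    def inverse(self, value):
        return self._backward[value]

    def __len__(self):
        return len(self._forward)


titles = BiDict()
for page_id, title in enumerate(["Apple", "Day", "Doctor"]):
    titles.put(title, page_id)

print(titles.get("Day"))   # id stored for a title
print(titles.inverse(2))   # title stored for an id
```

Keeping two plain dictionaries in sync trades some memory for constant-time lookups in both directions.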

Pipes

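WikiPageX links Wikipedia pages to chunks in a text
ClusterX picks noun chunks in a text and clusters them based on a revisiting of the Ball Mapper algorithm, Radial Ball Mapper
AbbrX detects abbreviations and acronyms, linking them to their long forms. It is based on scispacy's detector, with improvements
LabelX takes labelings of pattern matching expressions and catches them in a text, resolving overlaps, abbreviations and acronyms
PhraseX creates a Doc underscore extension based on a custom attribute name and phrase patterns. Examples are NounPhraseX and VerbPhraseX, which extract noun phrases and verb phrases, respectively
SentX detects sentences in a text, based on Splitta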

Tools

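WikiGraph, with pages as leaves linked to categories as nodes
Matcher, which inherits its interface from spaCy's Matcher but is built on a regex-based engine that boosts its performance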

Install SpikeX

Some requirements are inherited from spaCy:

spaCy version: 2.3+
Operating system: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
Python version: Python 3.6+ (64-bit only)
Package managers: pip
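
The Python requirements listed above can be checked before installing. The following is a minimal sketch using only the standard library; the function name `check_environment` is hypothetical and not part of SpikeX:

```python
import struct
import sys


def check_environment(min_python=(3, 6)):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d+ is required" % min_python)
    # A pointer is 8 bytes wide on a 64-bit interpreter.
    if struct.calcsize("P") * 8 != 64:
        problems.append("a 64-bit Python build is required")
    return problems


issues = check_environment()
print("environment OK" if not issues else "; ".join(issues))
```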

Some dependencies use Cython, which needs to be installed before SpikeX:

pip install cython

Remember that a virtual environment is always recommended, in order to avoid modifying system state.

pip

At this point, installing SpikeX via pip is a one-line command:

pip install spikex

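Usage

Prerequisites

SpikeX pipes work with spaCy, so a spaCy model needs to be installed. Follow the official spaCy instructions to install one. The brand new spaCy 3.0 is supported!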

WikiGraph

A WikiGraph is built starting from some key components of Wikipedia: pages, categories and the relations between them.
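
To make the structure concrete, here is a toy sketch of pages linked to categories, with categories linked to parent categories. It is not how SpikeX stores its graph: the data, the names `PAGE_CATEGORIES`, `CATEGORY_PARENTS` and `categories_within`, and the reading of `distance` as a number of hops are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical toy data: page -> categories, category -> parent categories.
PAGE_CATEGORIES = {
    "Natural_language_processing": ["Category:Computational_linguistics"],
}
CATEGORY_PARENTS = {
    "Category:Computational_linguistics": ["Category:Linguistics"],
    "Category:Linguistics": [],
}


def categories_within(page, distance):
    """Collect categories reachable from `page` in at most `distance` hops."""
    found = []
    seen = set()
    queue = deque((cat, 1) for cat in PAGE_CATEGORIES.get(page, []))
    while queue:
        category, hops = queue.popleft()
        if category in seen or hops > distance:
            continue
        seen.add(category)
        found.append(category)
        # Parent categories are one hop further away from the page.
        for parent in CATEGORY_PARENTS.get(category, []):
            queue.append((parent, hops + 1))
    return found


print(categories_within("Natural_language_processing", 1))
print(categories_within("Natural_language_processing", 2))
```

A breadth-first walk visits nearer categories first, so the result is ordered by distance from the page.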

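Auto

Creating a WikiGraph can take time, depending on how large its Wikipedia dump is. For this reason, we provide ready-to-use WikiGraphs:

Date	WikiGraph	Lang	Size (compressed)	Size (memory)
2021-05-20	enwiki_core	en	1.3GB	8GB
2021-05-20	simplewiki_core	en	20MB	130MB
2021-05-20	itwiki_core	it	208MB	1.2GB
More...

SpikeX provides a command to download and install a WikiGraph in one step (Linux or macOS; Windows is not supported yet):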

spikex download-wikigraph simplewiki_core

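Manual

A WikiGraph can be created from the command line, specifying which Wikipedia dump to use and where to save it:

spikex create-wikigraph \
  <YOUR-OUTPUT-PATH> \
  --wiki <WIKI-NAME, default: en> \
  --version <DUMP-VERSION, default: latest> \
  --dumps-path <DUMPS-BACKUP-PATH>

Then it needs to be packaged and installed:

spikex package-wikigraph \
  <WIKIGRAPH-RAW-PATH> \
  <YOUR-OUTPUT-PATH>

Follow the instructions shown at the end of the packaging process and install the distribution package in your virtual environment. You are now ready to use your WikiGraph as you wish: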

from spikex.wikigraph import load as wg_load

wg = wg_load("enwiki_core")
page = "Natural_language_processing"
categories = wg.get_categories(page, distance=1)
for category in categories:
    print(category)

>>> Category:Speech_recognition
>>> Category:Artificial_intelligence
>>> Category:Natural_language_processing
>>> Category:Computational_linguistics

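Matcher

The Matcher has the same interface as spaCy's, but it is faster when handling many patterns at once (on the order of thousands), so the official spaCy Matcher usage instructions apply.

A trivial example: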

from spikex.matcher import Matcher
from spacy import load as spacy_load

nlp = spacy_load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("TEST", [[{"LOWER": "nlp"}]])
doc = nlp("I love NLP")
for _, s, e in matcher(doc):
    print(doc[s:e])

>>> NLP
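
Why a regex engine can help with thousands of patterns can be illustrated with plain strings. The sketch below is not SpikeX's engine: it only shows, under simplifying assumptions (literal phrases, no token attributes), how many patterns can be compiled into a single alternation and matched in one pass over the text. The helper names `compile_phrases` and `find_phrases` are hypothetical.

```python
import re


def compile_phrases(phrases):
    """Build one case-insensitive regex that matches any of the phrases."""
    # Longer phrases first, so the alternation prefers the longest candidate.
    ordered = sorted(set(phrases), key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    return re.compile(r"\b(?:%s)\b" % alternation, re.IGNORECASE)


def find_phrases(regex, text):
    """Return (start, end, matched_text) for each non-overlapping match."""
    return [(m.start(), m.end(), m.group(0)) for m in regex.finditer(text)]


regex = compile_phrases(["nlp", "natural language processing", "natural language"])
matches = find_phrases(regex, "I love NLP and natural language processing")
print(matches)
```

The text is scanned once regardless of how many phrases are registered, instead of once per pattern.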

WikiPageX

The WikiPageX pipe uses a WikiGraph to find chunks in a text that match Wikipedia page titles.

from spacy import load as spacy_load
from spikex.wikigraph import load as wg_load
from spikex.pipes import WikiPageX

nlp = spacy_load("en_core_web_sm")
doc = nlp("An apple a day keeps the doctor away")
wg = wg_load("simplewiki_core")
wpx = WikiPageX(wg)
doc = wpx(doc)
for span in doc._.wiki_spans:
    print(span._.wiki_pages)

>>> ['An']
>>> ['Apple', 'Apple_(disambiguation)', 'Apple_(company)', 'Apple_(tree)']
>>> ['A', 'A_(musical_note)', 'A_(New_York_City_Subway_service)', 'A_(disambiguation)', 'A_(Cyrillic)']
>>> ['Day']
>>> ['The_Doctor', 'The_Doctor_(Doctor_Who)', 'The_Doctor_(Star_Trek)', 'The_Doctor_(disambiguation)']
>>> ['The']
>>> ['Doctor_(Doctor_Who)', 'Doctor_(Star_Trek)', 'Doctor', 'Doctor_(title)', 'Doctor_(disambiguation)']

ClusterX

The ClusterX pipe takes noun chunks in a text and clusters them using a Radial Ball Mapper algorithm.

from spacy import load as spacy_load
from spikex.pipes import ClusterX

nlp = spacy_load("en_core_web_sm")
doc = nlp("Grab this juicy orange and watch a dog chasing a cat.")
clusterx = ClusterX(min_score=0.65)
doc = clusterx(doc)
for cluster in doc._.cluster_chunks:
    print(cluster)

>>> [this juicy orange]
>>> [a cat, a dog]
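
For intuition about ball-style clustering, the sketch below groups items around greedily chosen landmarks, putting each item in the ball of every landmark it is similar enough to. It is a plain landmark cover, not the Radial Ball Mapper used by ClusterX. The 2-D vectors are invented toy data, and treating `min_score` as a cosine similarity threshold is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ball_clusters(vectors, min_score):
    """Pick landmarks greedily, then gather every item close to each landmark."""
    landmarks = []
    for name, vec in vectors.items():
        # An item becomes a landmark only if no existing landmark covers it.
        if not any(cosine(vec, vectors[lm]) >= min_score for lm in landmarks):
            landmarks.append(name)
    return [
        [name for name, vec in vectors.items() if cosine(vec, vectors[lm]) >= min_score]
        for lm in landmarks
    ]


# Invented toy vectors standing in for noun chunk embeddings.
vectors = {
    "this juicy orange": (1.0, 0.0),
    "a cat": (0.0, 1.0),
    "a dog": (0.1, 1.0),
}
clusters = ball_clusters(vectors, min_score=0.65)
print(clusters)
```

Because balls may overlap, an item can appear in more than one cluster when it is close to several landmarks.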

AbbrX

The AbbrX pipe finds abbreviations and acronyms in a text, linking short and long forms together:

from spacy import load as spacy_load
from spikex.pipes import AbbrX

nlp = spacy_load("en_core_web_sm")
doc = nlp("a little snippet with an abbreviation (abbr)")
abbrx = AbbrX(nlp.vocab)
doc = abbrx(doc)
for abbr in doc._.abbrs:
    print(abbr, "->", abbr._.long_form)

>>> abbr -> abbreviation

LabelX

The LabelX pipe matches and labels patterns in a text, resolving overlaps, abbreviations and acronyms.

from spacy import load as spacy_load
from spikex.pipes import LabelX

nlp = spacy_load("en_core_web_sm")
doc = nlp("looking for a computer system engineer")
patterns = [
    [{"LOWER": "computer"}, {"LOWER": "system"}],
    [{"LOWER": "system"}, {"LOWER": "engineer"}],
]
labelx = LabelX(nlp.vocab, [("TEST", patterns)], validate=True, only_longest=True)
doc = labelx(doc)
for labeling in doc._.labelings:
    print(labeling, f"[{labeling.label_}]")

>>> computer system engineer [TEST]
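
In the example above, two overlapping matches ("computer system" and "system engineer") end up as a single labeling. One plausible way to obtain that behaviour, shown here as a sketch and not as LabelX's actual logic, is to merge overlapping token spans that share a label. The function name `merge_overlapping` is hypothetical.

```python
def merge_overlapping(spans):
    """Merge (start, end) token spans that overlap; `end` is exclusive."""
    merged = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


tokens = "looking for a computer system engineer".split()
# Token spans for "computer system" and "system engineer".
spans = [(3, 5), (4, 6)]
for start, end in merge_overlapping(spans):
    print(" ".join(tokens[start:end]))
```

Spans that merely touch, such as (0, 1) and (1, 2), share no token and are left separate.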

PhraseX

The PhraseX pipe creates a custom Doc underscore extension, which it fills with matches from phrase patterns.

from spacy import load as spacy_load
from spikex.pipes import PhraseX

nlp = spacy_load("en_core_web_sm")
doc = nlp("I have Melrose and McIntosh apples, or Williams pears")
patterns = [
    [{"LOWER": "mcintosh"}],
    [{"LOWER": "melrose"}],
]
phrasex = PhraseX(nlp.vocab, "apples", patterns)
doc = phrasex(doc)
for apple in doc._.apples:
    print(apple)

>>> Melrose
>>> McIntosh

SentX

The SentX pipe splits a text into sentences. It modifies the tokens' is_sent_start attribute, so it must be added before the parser pipe in the spaCy pipeline:

from spacy import load as spacy_load
from spikex.pipes import SentX
from spikex.defaults import spacy_version

# spaCy 3 adds pipes by registered factory name, so register one for SentX.
if spacy_version >= 3:
    from spacy.language import Language

    @Language.factory("sentx")
    def create_sentx(nlp, name):
        return SentX()

nlp = spacy_load("en_core_web_sm")
# spaCy 2 takes the component object itself; spaCy 3 takes the factory name.
sentx_pipe = SentX() if spacy_version < 3 else "sentx"
nlp.add_pipe(sentx_pipe, before="parser")
doc = nlp("A little sentence. Followed by another one.")
for sent in doc.sents:
    print(sent)

>>> A little sentence.
>>> Followed by another one.

That's all folks

Feel free to contribute and have fun!
