SpikeX - spaCy Pipes for Knowledge Extraction

SpikeX is a collection of pipes ready to be plugged into a spaCy pipeline. It aims to help in building knowledge extraction tools with almost zero effort.

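What's new in SpikeX 0.5.0

WikiGraph has never been so lightning fast:

A large performance boost, thanks to the adoption of a sparse adjacency matrix to handle the page graph, instead of using igraph
Memory optimization, with consumption cut by ~40% and compressed size cut by ~20%, introducing new bidirectional dictionaries to manage data
New APIs for faster and easier usage and interaction
Overall fixes, for a better graph and better page matching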
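To make the first two points more concrete, here is a minimal sketch of a sparse adjacency structure (compressed sparse row layout) paired with a two-way title/id mapping. It uses only the standard library. The names `BiDict`, `build_csr` and `neighbors` are hypothetical and are not SpikeX APIs, and the toy links are invented for illustration.

```python
class BiDict:
    """Hypothetical two-way mapping between page titles and integer ids."""

    def __init__(self):
        self._id_of = {}      # title -> id
        self._title_of = []   # id -> title

    def __len__(self):
        return len(self._title_of)

    def add(self, title):
        """Register a title if it is new and return its id."""
        if title not in self._id_of:
            self._id_of[title] = len(self._title_of)
            self._title_of.append(title)
        return self._id_of[title]

    def id_of(self, title):
        return self._id_of[title]

    def title_of(self, idx):
        return self._title_of[idx]


def build_csr(num_nodes, edges):
    """Build CSR arrays (indptr, indices) from (source, target) id pairs."""
    rows = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        rows[src].append(dst)
    indptr, indices = [0], []
    for row in rows:
        indices.extend(sorted(set(row)))
        indptr.append(len(indices))
    return indptr, indices


def neighbors(indptr, indices, node):
    """Return the targets of `node` as a slice of the flat indices array."""
    return indices[indptr[node]:indptr[node + 1]]


# Toy data, invented for this sketch: pages and categories pointing to parents.
names = BiDict()
links = [
    ("Apple", "Category:Fruits"),
    ("Pear", "Category:Fruits"),
    ("Category:Fruits", "Category:Food"),
]
edges = [(names.add(src), names.add(dst)) for src, dst in links]
indptr, indices = build_csr(len(names), edges)
parents_of_apple = [
    names.title_of(i) for i in neighbors(indptr, indices, names.id_of("Apple"))
]
```

Storing only two flat integer arrays plus one dictionary and one list is what keeps this kind of layout compact compared with a general-purpose graph object.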

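Pipes

WikiPageX links Wikipedia pages to chunks in text
ClusterX picks noun chunks in a text and clusters them based on a revisiting of the Ball Mapper algorithm, Radial Ball Mapper
AbbrX detects abbreviations and acronyms, linking them to their long form. It is based on scispacy's detector, with improvements
LabelX takes labelings of pattern matching expressions and catches them in a text, resolving overlaps, abbreviations and acronyms
PhraseX creates a Doc underscore extension based on a custom attribute name and phrase patterns. Examples are NounPhraseX and VerbPhraseX, which extract noun phrases and verb phrases, respectively
SentX detects sentences in a text, based on Splitta with refinements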

Tools

WikiGraph, with pages as leaves linked to categories as nodes
Matcher, which inherits its interface from spaCy's Matcher but is built on a RegEx-based engine that boosts its performance

Install SpikeX

Some requirements are inherited from spaCy:

spaCy version: 2.3+
Operating system: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
Python version: Python 3.6+ (only 64 bit)
Package managers: pip
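The Python constraints in this list can be checked from the interpreter itself. A minimal sketch using only the standard library; the helper name `meets_python_requirements` is hypothetical and not part of SpikeX:

```python
import struct
import sys


def meets_python_requirements(
    version_info=sys.version_info,
    pointer_bits=struct.calcsize("P") * 8,
):
    """Return True for Python 3.6+ running as a 64-bit build."""
    return tuple(version_info[:2]) >= (3, 6) and pointer_bits == 64


if not meets_python_requirements():
    print("This interpreter does not satisfy: Python 3.6+ (only 64 bit)")
```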

Some dependencies use Cython, which needs to be installed before SpikeX:

pip install cython

Remember that a virtual environment is always recommended, in order to avoid modifying system state.

pip

At this point, installing SpikeX via pip is a one-line command:

pip install spikex

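Usage

Prerequirements

SpikeX pipes work with spaCy, hence a model needs to be installed. Follow the official spaCy instructions to install one. The brand new spaCy 3.0 is supported!

WikiGraph

A WikiGraph is built starting from some key components of Wikipedia: pages, categories and the relations between them.

Auto

Creating a WikiGraph can take time, depending on how large its Wikipedia dump is. For this reason, we provide wikigraphs ready to be used: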

Date	WikiGraph	Lang	Size (compressed)	Size (memory)
2021-05-20	enwiki_core	en	1.3GB	8GB
2021-05-20	simplewiki_core	en	20MB	130MB
2021-05-20	itwiki_core	it	208MB	1.2GB
More coming...

SpikeX provides a command to shortcut downloading and installing a WikiGraph (Linux or macOS, Windows not supported yet):

spikex download-wikigraph simplewiki_core

Manual

A WikiGraph can be created from the command line, specifying which Wikipedia dump to take and where to store files:

spikex create-wikigraph \
  <YOUR-OUTPUT-PATH> \
  --wiki <WIKI-NAME, default: en> \
  --version <DUMP-VERSION, default: latest> \
  --dumps-path <DUMPS-BACKUP-PATH>

Then it needs to be packed and installed:

spikex package-wikigraph \
  <WIKIGRAPH-RAW-PATH> \
  <YOUR-OUTPUT-PATH>

Follow the instructions at the end of the packing process and install the distribution package in your virtual environment. Now you are ready to use your WikiGraph as you wish:

from spikex.wikigraph import load as wg_load

wg = wg_load("enwiki_core")
page = "Natural_language_processing"
categories = wg.get_categories(page, distance=1)
for category in categories:
    print(category)

>>> Category:Speech_recognition
>>> Category:Artificial_intelligence
>>> Category:Natural_language_processing
>>> Category:Computational_linguistics
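To illustrate what a distance-bounded category lookup could look like, here is a minimal sketch over a plain dictionary of parent links. It assumes that `distance` counts upward hops from the page, which is a guess about the semantics rather than something the README states. The function `categories_within` and the toy `parents` mapping are hypothetical and not part of SpikeX.

```python
def categories_within(parents, page, distance=1):
    """Collect categories reachable from `page` in at most `distance` upward hops."""
    frontier, seen = {page}, set()
    for _ in range(distance):
        # Move one level up, skipping categories that were already collected.
        frontier = {
            category
            for node in frontier
            for category in parents.get(node, ())
            if category not in seen
        }
        seen |= frontier
    return sorted(seen)


# Toy parent links, invented for this sketch.
parents = {
    "Natural_language_processing": [
        "Category:Computational_linguistics",
        "Category:Artificial_intelligence",
    ],
    "Category:Computational_linguistics": ["Category:Linguistics"],
    "Category:Artificial_intelligence": ["Category:Computer_science"],
}

one_hop = categories_within(parents, "Natural_language_processing", distance=1)
two_hops = categories_within(parents, "Natural_language_processing", distance=2)
```

With these toy links, `one_hop` holds the two direct categories and `two_hops` additionally includes their parents.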

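Matcher

The Matcher is identical to spaCy's, but faster when it comes to handling many patterns at once (on the order of thousands), so follow the official spaCy usage instructions.

A trivial example: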

from spikex.matcher import Matcher
from spacy import load as spacy_load

nlp = spacy_load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("TEST", [[{"LOWER": "nlp"}]])
doc = nlp("I love NLP")
for _, s, e in matcher(doc):
    print(doc[s:e])

>>> NLP
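The idea of running token patterns through a regular expression engine can be sketched in a few lines. This is only an illustration of the general technique, not the SpikeX implementation: it supports `LOWER`-style matching only, and the helpers `encode`, `compile_pattern` and `find_matches` as well as the separator character are assumptions made for the example.

```python
import re

SEP = "\x1f"  # appended to every encoded token; assumed never to occur in a token


def encode(tokens):
    """Join lowercased tokens into one string, each followed by SEP."""
    return "".join(token.lower() + SEP for token in tokens)


def compile_pattern(lower_forms):
    """Compile a sequence of lowercase forms into a regex anchored at a token start."""
    body = "".join(re.escape(form) + SEP for form in lower_forms)
    return re.compile("(?:^|(?<=" + SEP + "))" + body)


def find_matches(tokens, patterns):
    """Return (label, start, end) token spans for every regex hit."""
    text = encode(tokens)
    matches = []
    for label, forms in patterns.items():
        for hit in compile_pattern(forms).finditer(text):
            # Counting separators converts character offsets to token indices.
            start = text.count(SEP, 0, hit.start())
            end = text.count(SEP, 0, hit.end())
            matches.append((label, start, end))
    return matches


tokens = "I love NLP".split()
matches = find_matches(tokens, {"TEST": ["nlp"]})
for label, start, end in matches:
    print(label, tokens[start:end])
```

Because the whole document is scanned as a single string, many patterns could also be combined into one alternation and checked in a single pass, which is one plausible source of speedup with thousands of patterns.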

WikiPageX

The WikiPageX pipe uses a WikiGraph to find chunks in a text that match Wikipedia page titles.

from spacy import load as spacy_load
from spikex.wikigraph import load as wg_load
from spikex.pipes import WikiPageX

nlp = spacy_load("en_core_web_sm")
doc = nlp("An apple a day keeps the doctor away")
wg = wg_load("simplewiki_core")
wpx = WikiPageX(wg)
doc = wpx(doc)
for span in doc._.wiki_spans:
    print(span._.wiki_pages)

>>> ['An']
>>> ['Apple', 'Apple_(disambiguation)', 'Apple_(company)', 'Apple_(tree)']
>>> ['A', 'A_(musical_note)', 'A_(New_York_City_Subway_service)', 'A_(disambiguation)', 'A_(Cyrillic)']
>>> ['Day']
>>> ['The_Doctor', 'The_Doctor_(Doctor_Who)', 'The_Doctor_(Star_Trek)', 'The_Doctor_(disambiguation)']
>>> ['The']
>>> ['Doctor_(Doctor_Who)', 'Doctor_(Star_Trek)', 'Doctor', 'Doctor_(title)', 'Doctor_(disambiguation)']
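Conceptually, this kind of linking amounts to looking up token n-grams in an index of page titles. The sketch below shows that lookup on a toy title list. It is not how WikiPageX is implemented, it ignores disambiguation and redirect pages, and `find_title_spans` with its `max_len` parameter is a hypothetical helper.

```python
def find_title_spans(tokens, titles, max_len=3):
    """Return (start, end, title) for token n-grams that equal a page title."""
    # Normalize titles the same way as the text: lowercase, underscores as spaces.
    index = {title.lower().replace("_", " "): title for title in titles}
    spans = []
    for start in range(len(tokens)):
        longest_end = min(start + max_len, len(tokens))
        for end in range(start + 1, longest_end + 1):
            key = " ".join(tokens[start:end]).lower()
            if key in index:
                spans.append((start, end, index[key]))
    return spans


tokens = "An apple a day keeps the doctor away".split()
titles = ["Apple", "Day", "The_Doctor", "Doctor"]  # toy subset of page titles
spans = find_title_spans(tokens, titles)
for start, end, title in spans:
    print(" ".join(tokens[start:end]), "->", title)
```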

ClusterX

The ClusterX pipe takes noun chunks in a text and clusters them using a Radial Ball Mapper algorithm.

from spacy import load as spacy_load
from spikex.pipes import ClusterX

nlp = spacy_load("en_core_web_sm")
doc = nlp("Grab this juicy orange and watch a dog chasing a cat.")
clusterx = ClusterX(min_score=0.65)
doc = clusterx(doc)
for cluster in doc._.cluster_chunks:
    print(cluster)

>>> [this juicy orange]
>>> [a cat, a dog]
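For intuition, here is a minimal sketch of a plain Ball Mapper-style cover using cosine similarity. It is not the Radial Ball Mapper variant used by ClusterX, whose details are not described here. The function `ball_cover`, the use of `min_score` as a similarity threshold, and the two-dimensional toy vectors are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ball_cover(vectors, min_score):
    """Greedily pick landmarks, then group points that are close to each landmark."""
    landmarks = []
    for i, vec in enumerate(vectors):
        # A point becomes a landmark only if no existing landmark covers it.
        if all(cosine(vec, vectors[lm]) < min_score for lm in landmarks):
            landmarks.append(i)
    return [
        [i for i, vec in enumerate(vectors) if cosine(vec, vectors[lm]) >= min_score]
        for lm in landmarks
    ]


# Toy embeddings, invented for this sketch.
chunks = ["this juicy orange", "a dog", "a cat"]
vectors = [(1.0, 0.0), (0.0, 1.0), (0.1, 1.0)]
balls = ball_cover(vectors, min_score=0.65)
clusters = [[chunks[i] for i in ball] for ball in balls]
```

In a real setting the vectors would come from the spaCy model's word vectors, and balls may overlap, in which case a further merging step is needed.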

AbbrX

The AbbrX pipe finds abbreviations and acronyms in the text, linking short and long forms together:

from spacy import load as spacy_load
from spikex.pipes import AbbrX

nlp = spacy_load("en_core_web_sm")
doc = nlp("a little snippet with an abbreviation (abbr)")
abbrx = AbbrX(nlp.vocab)
doc = abbrx(doc)
for abbr in doc._.abbrs:
    print(abbr, "->", abbr._.long_form)

>>> abbr -> abbreviation
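One well-known way to link a parenthesized short form to its long form is right-to-left character matching in the spirit of the Schwartz and Hearst heuristic. The sketch below is a simplified version of that idea on raw strings. Whether AbbrX works this way internally is an assumption, and `find_long_form` and `link_abbreviation` are hypothetical helpers, not SpikeX functions.

```python
import re


def find_long_form(short, candidate):
    """Match the characters of `short` right to left inside `candidate`."""
    s = len(short) - 1
    pos = len(candidate) - 1
    while s >= 0:
        char = short[s].lower()
        if not char.isalnum():
            s -= 1
            continue
        # Walk left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while (pos >= 0 and candidate[pos].lower() != char) or (
            s == 0 and pos > 0 and candidate[pos - 1].isalnum()
        ):
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
        s -= 1
    start = candidate.rfind(" ", 0, pos + 1) + 1
    return candidate[start:]


def link_abbreviation(text):
    """Return (short_form, long_form) for the first 'long form (short)' in text."""
    hit = re.search(r"\s*\(([^()\s]+)\)", text)
    if hit is None:
        return None
    short = hit.group(1)
    long_form = find_long_form(short, text[: hit.start()])
    return (short, long_form) if long_form else None


pair = link_abbreviation("a little snippet with an abbreviation (abbr)")
```

The original heuristic also limits how many words before the parenthesis are considered; that limit is left out here to keep the sketch short.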

LabelX

The LabelX pipe matches and labels patterns in text, resolving overlaps, abbreviations and acronyms.

from spacy import load as spacy_load
from spikex.pipes import LabelX

nlp = spacy_load("en_core_web_sm")
doc = nlp("looking for a computer system engineer")
patterns = [
    [{"LOWER": "computer"}, {"LOWER": "system"}],
    [{"LOWER": "system"}, {"LOWER": "engineer"}],
]
labelx = LabelX(nlp.vocab, [("TEST", patterns)], validate=True, only_longest=True)
doc = labelx(doc)
for labeling in doc._.labelings:
    print(labeling, f"[{labeling.label_}]")

>>> computer system engineer [TEST]
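The output above shows two overlapping matches ("computer system" and "system engineer") ending up as a single labeled span. One simple way to get that result is to merge overlapping token intervals, as sketched below. This is an assumption about the observable behavior, not a description of LabelX internals, and `merge_overlapping` is a hypothetical helper.

```python
def merge_overlapping(spans):
    """Merge (start, end) token spans that share at least one token."""
    merged = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


tokens = "looking for a computer system engineer".split()
raw_matches = [(3, 5), (4, 6)]  # "computer system" and "system engineer"
merged = merge_overlapping(raw_matches)
labeled = [" ".join(tokens[start:end]) for start, end in merged]
```

Spans that merely touch, such as (0, 1) and (1, 2), are kept separate because they share no token.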

PhraseX

The PhraseX pipe creates a custom Doc underscore extension and fills it with matches from phrase patterns.

from spacy import load as spacy_load
from spikex.pipes import PhraseX

nlp = spacy_load("en_core_web_sm")
doc = nlp("I have Melrose and McIntosh apples, or Williams pears")
patterns = [
    [{"LOWER": "mcintosh"}],
    [{"LOWER": "melrose"}],
]
phrasex = PhraseX(nlp.vocab, "apples", patterns)
doc = phrasex(doc)
for apple in doc._.apples:
    print(apple)

>>> Melrose
>>> McIntosh

SentX

The SentX pipe splits a text into sentences. It modifies the is_sent_start attribute of tokens, so it must be added before the parser pipe in the spaCy pipeline:

from spacy import load as spacy_load
from spikex.pipes import SentX
from spikex.defaults import spacy_version

if spacy_version >= 3:
    from spacy.language import Language

    @Language.factory("sentx")
    def create_sentx(nlp, name):
        return SentX()

nlp = spacy_load("en_core_web_sm")
sentx_pipe = SentX() if spacy_version < 3 else "sentx"
nlp.add_pipe(sentx_pipe, before="parser")
doc = nlp("A little sentence. Followed by another one.")
for sent in doc.sents:
    print(sent)

>>> A little sentence.
>>> Followed by another one.

That's all folks

Feel free to contribute and have fun!
