SpikeX - spaCy Pipes for Knowledge Extraction

SpikeX is a collection of pipes ready to be plugged into a spaCy pipeline. It aims to help build knowledge extraction tools with almost zero effort.

What's new in SpikeX 0.5.0

WikiGraph has never been so lightning fast:

- Performance boost, thanks to the adoption of a sparse adjacency matrix to handle the pages graph instead of using igraph
- Memory optimization, with memory consumption cut by ~40% and compressed size cut by ~20%, introducing new bidirectional dictionaries to manage data
- New APIs for faster and easier usage and interaction
- Overall fixes, for a better graph and better page matching
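
To make the first two points more concrete, here is a minimal pure-Python sketch of a bidirectional dictionary paired with a compressed sparse row (CSR) adjacency layout. It is not SpikeX's implementation: the names `BiDict` and `SparseGraph`, their methods, and the sample edges are all hypothetical and only illustrate the general idea of storing just the edges that exist and translating between names and integer ids.

```python
from typing import Dict, List, Tuple


class BiDict:
    """Maps names to integer ids and back (hypothetical helper)."""

    def __init__(self) -> None:
        self._id_of: Dict[str, int] = {}
        self._name_of: List[str] = []

    def __len__(self) -> int:
        return len(self._name_of)

    def add(self, name: str) -> int:
        # Reuse the existing id when the name is already known.
        if name not in self._id_of:
            self._id_of[name] = len(self._name_of)
            self._name_of.append(name)
        return self._id_of[name]

    def id_of(self, name: str) -> int:
        return self._id_of[name]

    def name_of(self, idx: int) -> str:
        return self._name_of[idx]


class SparseGraph:
    """Directed graph in CSR form: memory grows with edges, not nodes squared."""

    def __init__(self, n_nodes: int, edges: List[Tuple[int, int]]) -> None:
        # indptr[i]:indptr[i + 1] delimits the neighbors of node i in indices.
        self.indptr = [0] * (n_nodes + 1)
        for src, _ in edges:
            self.indptr[src + 1] += 1
        for i in range(n_nodes):
            self.indptr[i + 1] += self.indptr[i]
        self.indices = [0] * len(edges)
        cursor = list(self.indptr[:-1])
        for src, dst in edges:
            self.indices[cursor[src]] = dst
            cursor[src] += 1

    def neighbors(self, node: int) -> List[int]:
        return self.indices[self.indptr[node]:self.indptr[node + 1]]


# Toy edges, invented for illustration only.
names = BiDict()
raw_edges = [
    ("Natural_language_processing", "Category:Computational_linguistics"),
    ("Natural_language_processing", "Category:Artificial_intelligence"),
    ("Category:Computational_linguistics", "Category:Linguistics"),
]
edges = [(names.add(src), names.add(dst)) for src, dst in raw_edges]
graph = SparseGraph(len(names), edges)

page_id = names.id_of("Natural_language_processing")
linked = [names.name_of(i) for i in graph.neighbors(page_id)]
print(linked)
```

The integer ids keep the adjacency arrays compact, while the two-way mapping lets callers keep working with readable page and category names.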

Pipes

- WikiPageX links Wikipedia pages to chunks in text
- ClusterX picks noun chunks in a text and clusters them based on a revisiting of the Ball Mapper algorithm, Radial Ball Mapper
- AbbrX detects abbreviations and acronyms, linking them to their long form. It is based on scispacy's detector, with improvements
- LabelX takes labelings of pattern matching expressions and catches them in a text, resolving overlaps, abbreviations and acronyms
- PhraseX creates a Doc's underscore extension based on a custom attribute name and phrase patterns. Examples are NounPhraseX and VerbPhraseX, which extract noun phrases and verb phrases, respectively
- SentX detects sentences in a text, based on Splitta with refinements

Tools

- WikiGraph, with pages as leaves linked to categories as nodes
- Matcher, which inherits its interface from spaCy's Matcher but is built on a regex-based engine that boosts its performance
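
The idea of running token patterns through a regex engine can be sketched as follows. This is only an illustration under strong simplifying assumptions, not SpikeX's actual engine: patterns are reduced to sequences of lowercase token strings, labels must be valid regex group names, pattern sequences must be non-empty, and the helpers `compile_patterns` and `find_matches` are hypothetical. All patterns are compiled into one regular expression, the document is encoded as one string, and character offsets are mapped back to token indices.

```python
import re
from typing import Dict, List, Tuple

SEP = "\x1f"  # separator assumed never to occur inside a token


def compile_patterns(patterns: Dict[str, List[List[str]]]) -> "re.Pattern":
    """Compile every token sequence into one regex, one named group per label."""
    groups = []
    for label, sequences in patterns.items():
        encoded = [
            "".join(re.escape(token.lower()) + SEP for token in sequence)
            for sequence in sequences
        ]
        groups.append("(?P<{}>{})".format(label, "|".join(encoded)))
    # The lookbehind forces every match to begin at the start of a token.
    return re.compile("(?:^|(?<={}))(?:{})".format(SEP, "|".join(groups)))


def find_matches(regex: "re.Pattern", tokens: List[str]) -> List[Tuple[str, int, int]]:
    """Return (label, start_token, end_token) triples, end exclusive."""
    lowered = [token.lower() for token in tokens]
    text = "".join(token + SEP for token in lowered)
    # Map the character offset where each token starts to its token index.
    token_at = {}
    offset = 0
    for index, token in enumerate(lowered):
        token_at[offset] = index
        offset += len(token) + len(SEP)
    results = []
    for match in regex.finditer(text):
        start = token_at[match.start()]
        # Each matched token contributes exactly one separator.
        end = start + match.group().count(SEP)
        results.append((match.lastgroup, start, end))
    return results


regex = compile_patterns({"TEST": [["nlp"]]})
matches = find_matches(regex, ["I", "love", "NLP"])
print(matches)
```

Because `finditer` returns non-overlapping matches, this sketch reports at most one match per start position and skips matches that overlap an earlier one; a real matcher needs extra work to report them all.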

Install SpikeX

Some requirements are inherited from spaCy:

spaCy version : 2.3+
Operating system : macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
Python version : Python 3.6+ (only 64 bit)
Package manager : pip
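
Before installing, the Python-related requirements above can be checked with a few lines of standard-library code. This is just a convenience sketch; the function name `check_interpreter` and its messages are made up for illustration and are not part of SpikeX.

```python
import struct
import sys
from typing import List, Sequence

MIN_PYTHON = (3, 6)  # minimum version listed in the requirements above


def check_interpreter(
    version: Sequence[int] = tuple(sys.version_info[:2]),
    pointer_bits: int = struct.calcsize("P") * 8,
) -> List[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if tuple(version[:2]) < MIN_PYTHON:
        problems.append("Python 3.6 or newer is required")
    if pointer_bits != 64:
        problems.append("a 64-bit interpreter is required")
    return problems


# With no arguments, the running interpreter is inspected.
print(check_interpreter() or "interpreter looks compatible")
```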

Some dependencies use Cython, which needs to be installed before SpikeX:

pip install cython

Remember that a virtual environment is always recommended, in order to avoid modifying system state.

pip

At this point, installing SpikeX via pip is a one-line command:

pip install spikex

Usage

Prerequisites

SpikeX pipes work with spaCy, so a spaCy model needs to be installed. Follow the official spaCy instructions to install one. The brand new spaCy 3.0 is supported!

WikiGraph

A WikiGraph is built starting from some key components of Wikipedia: pages, categories and the relations between them.
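
As a rough mental model of that structure, the sketch below stores a handful of page-to-category and category-to-category relations and walks them breadth-first. The relations are invented for illustration, and the helper `categories_within` is hypothetical: it is not the WikiGraph API, and treating "distance" as a number of hops is an assumption.

```python
from collections import deque
from typing import Dict, List, Set

# Toy relations, invented for illustration: each key lists its parent categories.
PARENTS: Dict[str, List[str]] = {
    "Natural_language_processing": [
        "Category:Computational_linguistics",
        "Category:Artificial_intelligence",
    ],
    "Category:Computational_linguistics": ["Category:Linguistics"],
    "Category:Artificial_intelligence": ["Category:Computer_science"],
}


def categories_within(page: str, distance: int) -> List[str]:
    """Collect the categories reachable from a page in at most `distance` hops."""
    seen: Set[str] = {page}
    found: List[str] = []
    queue = deque([(page, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == distance:
            continue  # do not expand beyond the requested distance
        for parent in PARENTS.get(node, []):
            if parent not in seen:
                seen.add(parent)
                found.append(parent)
                queue.append((parent, depth + 1))
    return found


near = categories_within("Natural_language_processing", 1)
far = categories_within("Natural_language_processing", 2)
print(near)
print(far)
```

Pages sit at the bottom as leaves, so the walk only ever moves upward through categories.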

Auto

Creating a WikiGraph can take time, depending on how large its Wikipedia dump is. For this reason, we provide WikiGraphs ready to be used:

Date	WikiGraph	Lang	Size (compressed)	Size (memory)
2021-05-20	enwiki_core	en	1.3GB	8GB
2021-05-20	simplewiki_core	en	20MB	130MB
2021-05-20	itwiki_core	it	208MB	1.2GB
More coming...

SpikeX provides a command that downloads and installs a WikiGraph in one step (Linux or macOS; Windows is not supported yet):

spikex download-wikigraph simplewiki_core

Manual

A WikiGraph can be created from the command line, specifying which Wikipedia dump to take and where to save it:

spikex create-wikigraph \
  <YOUR-OUTPUT-PATH> \
  --wiki <WIKI-NAME, default: en> \
  --version <DUMP-VERSION, default: latest> \
  --dumps-path <DUMPS-BACKUP-PATH>

Then it needs to be packaged and installed:

spikex package-wikigraph \
  <WIKIGRAPH-RAW-PATH> \
  <YOUR-OUTPUT-PATH>

Follow the instructions at the end of the packaging process and install the distribution package in your virtual environment. Now you are ready to use your WikiGraph as you wish:

from spikex.wikigraph import load as wg_load

wg = wg_load("enwiki_core")
page = "Natural_language_processing"
categories = wg.get_categories(page, distance=1)
for category in categories:
    print(category)

>>> Category:Speech_recognition
>>> Category:Artificial_intelligence
>>> Category:Natural_language_processing
>>> Category:Computational_linguistics

Matcher

The Matcher is identical to spaCy's Matcher, but faster when handling many patterns at once (on the order of thousands), so the official spaCy Matcher usage instructions apply.

A trivial example:

from spikex.matcher import Matcher
from spacy import load as spacy_load

nlp = spacy_load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("TEST", [[{"LOWER": "nlp"}]])
doc = nlp("I love NLP")
for _, s, e in matcher(doc):
    print(doc[s:e])

>>> NLP

WikiPageX

The WikiPageX pipe uses a WikiGraph to find chunks in a text that match Wikipedia page titles.

from spacy import load as spacy_load
from spikex.wikigraph import load as wg_load
from spikex.pipes import WikiPageX

nlp = spacy_load("en_core_web_sm")
doc = nlp("An apple a day keeps the doctor away")
wg = wg_load("simplewiki_core")
wpx = WikiPageX(wg)
doc = wpx(doc)
for span in doc._.wiki_spans:
    print(span._.wiki_pages)

>>> ['An']
>>> ['Apple', 'Apple_(disambiguation)', 'Apple_(company)', 'Apple_(tree)']
>>> ['A', 'A_(musical_note)', 'A_(New_York_City_Subway_service)', 'A_(disambiguation)', 'A_(Cyrillic)']
>>> ['Day']
>>> ['The_Doctor', 'The_Doctor_(Doctor_Who)', 'The_Doctor_(Star_Trek)', 'The_Doctor_(disambiguation)']
>>> ['The']
>>> ['Doctor_(Doctor_Who)', 'Doctor_(Star_Trek)', 'Doctor', 'Doctor_(title)', 'Doctor_(disambiguation)']
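
The output above shows overlapping spans ("the doctor" and "doctor"), each mapped to several candidate pages. A toy version of this kind of title lookup might look like the sketch below. It is an assumption-laden illustration, not WikiPageX's algorithm: the title list is invented, and the helpers `normalize`, `build_index` and `find_spans` are hypothetical.

```python
from typing import Dict, List, Tuple

# Toy title list, invented for illustration.
TITLES = [
    "Apple",
    "Apple_(company)",
    "Day",
    "Doctor",
    "The_Doctor",
    "The_Doctor_(Doctor_Who)",
]


def normalize(title: str) -> str:
    """Drop a trailing "_(...)" disambiguation, then lowercase with spaces."""
    base = title.split("_(")[0]
    return base.replace("_", " ").lower()


def build_index(titles: List[str]) -> Dict[str, List[str]]:
    index: Dict[str, List[str]] = {}
    for title in titles:
        index.setdefault(normalize(title), []).append(title)
    return index


def find_spans(
    tokens: List[str], index: Dict[str, List[str]], max_len: int = 3
) -> List[Tuple[int, int, List[str]]]:
    """Return (start, end, pages) for every token n-gram that names a page."""
    spans = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            key = " ".join(tokens[start:end]).lower()
            if key in index:
                spans.append((start, end, index[key]))
    return spans


index = build_index(TITLES)
tokens = "An apple a day keeps the doctor away".split()
spans = find_spans(tokens, index)
for start, end, pages in spans:
    print(tokens[start:end], pages)
```

Grouping disambiguated titles under one normalized key is what lets a single chunk of text map to several candidate pages.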

ClusterX

The ClusterX pipe takes noun chunks in a text and clusters them using a Radial Ball Mapper algorithm.

from spacy import load as spacy_load
from spikex.pipes import ClusterX

nlp = spacy_load("en_core_web_sm")
doc = nlp("Grab this juicy orange and watch a dog chasing a cat.")
clusterx = ClusterX(min_score=0.65)
doc = clusterx(doc)
for cluster in doc._.cluster_chunks:
    print(cluster)

>>> [this juicy orange]
>>> [a cat, a dog]
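
For background, the plain Ball Mapper construction that ClusterX is said to revisit can be sketched in a few lines. Note the swap: this is the basic Ball Mapper on 2D points with Euclidean distance, not SpikeX's Radial variant, which works on noun chunks and whose details are not described here. The function name `ball_mapper` and the sample points are illustrative assumptions.

```python
import math
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float]


def distance(p: Point, q: Point) -> float:
    return math.hypot(p[0] - q[0], p[1] - q[1])


def ball_mapper(
    points: List[Point], eps: float
) -> Tuple[Dict[int, Set[int]], Set[Tuple[int, int]]]:
    """Cover the points with eps-balls and link balls that share points."""
    # Greedy eps-net: a point becomes a landmark if no landmark covers it yet.
    landmarks: List[int] = []
    for i, p in enumerate(points):
        if all(distance(p, points[c]) > eps for c in landmarks):
            landmarks.append(i)
    # Each ball holds every point within eps of its landmark.
    balls = {
        c: {i for i, p in enumerate(points) if distance(p, points[c]) <= eps}
        for c in landmarks
    }
    # Two landmarks are connected when their balls overlap.
    edges = {
        (a, b)
        for a in landmarks
        for b in landmarks
        if a < b and balls[a] & balls[b]
    }
    return balls, edges


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
balls, edges = ball_mapper(points, eps=1.5)
print(balls)
print(edges)
```

In the resulting graph, connected groups of balls can be read as clusters: here the first three points end up linked, while the last two form a separate group.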

AbbrX

The AbbrX pipe finds abbreviations and acronyms in the text, linking short and long forms together:

from spacy import load as spacy_load
from spikex.pipes import AbbrX

nlp = spacy_load("en_core_web_sm")
doc = nlp("a little snippet with an abbreviation (abbr)")
abbrx = AbbrX(nlp.vocab)
doc = abbrx(doc)
for abbr in doc._.abbrs:
    print(abbr, "->", abbr._.long_form)

>>> abbr -> abbreviation
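
AbbrX is described as based on scispacy's detector, which follows the Schwartz and Hearst (2003) algorithm for abbreviation definitions. The sketch below shows only the core long-form search of that algorithm on plain strings; it leaves out candidate extraction and whatever improvements SpikeX adds, and the function name `find_long_form` is chosen here for illustration.

```python
from typing import Optional


def find_long_form(short: str, candidate: str) -> Optional[str]:
    """Match the short form's characters right to left inside the candidate text."""
    si = len(short) - 1
    li = len(candidate) - 1
    while si >= 0:
        ch = short[si].lower()
        if not ch.isalnum():
            si -= 1  # ignore punctuation inside the short form
            continue
        # Move left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while (li >= 0 and candidate[li].lower() != ch) or (
            si == 0 and li > 0 and candidate[li - 1].isalnum()
        ):
            li -= 1
        if li < 0:
            return None  # ran out of text before matching every character
        li -= 1
        si -= 1
    # Extend the result back to the beginning of the word where matching stopped.
    start = candidate.rfind(" ", 0, li + 1) + 1
    return candidate[start:]


long_form = find_long_form("abbr", "a little snippet with an abbreviation")
print("abbr ->", long_form)
```

Scanning from the right keeps the match as close as possible to the parenthesized short form, which is why "abbreviation" is preferred over anything earlier in the sentence.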

LabelX

The LabelX pipe matches and labels patterns in text, resolving overlaps, abbreviations and acronyms.

from spacy import load as spacy_load
from spikex.pipes import LabelX

nlp = spacy_load("en_core_web_sm")
doc = nlp("looking for a computer system engineer")
patterns = [
    [{"LOWER": "computer"}, {"LOWER": "system"}],
    [{"LOWER": "system"}, {"LOWER": "engineer"}],
]
labelx = LabelX(nlp.vocab, [("TEST", patterns)], validate=True, only_longest=True)
doc = labelx(doc)
for labeling in doc._.labelings:
    print(labeling, f"[{labeling.label_}]")

>>> computer system engineer [TEST]
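
In the example above, the two overlapping matches "computer system" and "system engineer" come out as a single labeling. One simple way to get that effect is to merge overlapping token intervals that carry the same label, as sketched below. This is an assumption inferred from the printed output, not a description of LabelX's internals, and `merge_overlapping` is a hypothetical helper.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_token, end_token), end exclusive


def merge_overlapping(spans: List[Span]) -> List[Span]:
    """Fuse spans that share at least one token into a single longer span."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged


tokens = "looking for a computer system engineer".split()
raw_matches = [(3, 5), (4, 6)]  # "computer system" and "system engineer"
merged = merge_overlapping(raw_matches)
for start, end in merged:
    print(" ".join(tokens[start:end]))
```

Spans that merely touch, such as (0, 2) and (2, 4), share no token and are kept separate by this sketch.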

PhraseX

The PhraseX pipe creates a custom Doc underscore extension that is filled with the matches from phrase patterns.

from spacy import load as spacy_load
from spikex.pipes import PhraseX

nlp = spacy_load("en_core_web_sm")
doc = nlp("I have Melrose and McIntosh apples, or Williams pears")
patterns = [
    [{"LOWER": "mcintosh"}],
    [{"LOWER": "melrose"}],
]
phrasex = PhraseX(nlp.vocab, "apples", patterns)
doc = phrasex(doc)
for apple in doc._.apples:
    print(apple)

>>> Melrose
>>> McIntosh

SentX

The SentX pipe splits a text into sentences. It modifies the tokens' is_sent_start attribute, so it must be added before the parser pipe in the spaCy pipeline:

from spacy import load as spacy_load
from spikex.pipes import SentX
from spikex.defaults import spacy_version

if spacy_version >= 3:
    from spacy.language import Language

    @Language.factory("sentx")
    def create_sentx(nlp, name):
        return SentX()

nlp = spacy_load("en_core_web_sm")
sentx_pipe = SentX() if spacy_version < 3 else "sentx"
nlp.add_pipe(sentx_pipe, before="parser")
doc = nlp("A little sentence. Followed by another one.")
for sent in doc.sents:
    print(sent)

>>> A little sentence.
>>> Followed by another one.

That's all folks

Feel free to contribute and have fun!
