multi_rake

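Multilingual Rapid Automatic Keyword Extraction (RAKE) for Python

Features

Automatic keyword extraction from text written in any language
No need to know the language of the text beforehand
No need to have a list of stopwords
26 languages are currently available.
Just configure Rake, plug in your text, and get keywords (see Implementation details)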

Installation

pip install multi-rake

If installation fails because of cld errors about narrowing conversions, set the CFLAGS environment variable:

CFLAGS="-Wno-narrowing" pip install multi-rake

Examples

English text, without specifying an explicit language or a list of stopwords (the built-in list is used).

from multi_rake import Rake

text_en = (
    'Compatibility of systems of linear constraints over the set of '
    'natural numbers. Criteria of compatibility of a system of linear '
    'Diophantine equations, strict inequations, and nonstrict inequations '
    'are considered. Upper bounds for components of a minimal set of '
    'solutions and algorithms of construction of minimal generating sets '
    'of solutions for all types of systems are given. These criteria and '
    'the corresponding algorithms for constructing a minimal supporting '
    'set of solutions can be used in solving all the considered types of '
    'systems and systems of mixed types.'
)

rake = Rake()

keywords = rake.apply(text_en)

print(keywords[:10])

#  ('minimal generating sets', 8.666666666666666),
#  ('linear diophantine equations', 8.5),
#  ('minimal supporting set', 7.666666666666666),
#  ('minimal set', 4.666666666666666),
#  ('linear constraints', 4.5),
#  ('natural numbers', 4.0),
#  ('strict inequations', 4.0),
#  ('nonstrict inequations', 4.0),
#  ('upper bounds', 4.0),
#  ('mixed types', 3.666666666666667),

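Text written in Esperanto (an article about liberalism). There is no built-in list of stopwords for this language, so they are generated from the provided text.

text consists of the first three paragraphs of the introduction. text_for_stopwords is all the other text of the article.

text = (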
    'Liberalismo estas politika filozofio aŭ mondrigardo konstruita en '
    'ideoj de libereco kaj egaleco. Liberaluloj apogas larĝan aron de '
    'vidpunktoj depende de sia kompreno de tiuj principoj, sed ĝenerale '
    'ili apogas ideojn kiel ekzemple liberaj kaj justaj elektoj, '
    'civitanrajtoj, gazetara libereco, religia libereco, libera komerco, '
    'kaj privata posedrajto. Liberalismo unue iĝis klara politika movado '
    'dum la Klerismo, kiam ĝi iĝis populara inter filozofoj kaj '
    'ekonomikistoj en la okcidenta mondo. Liberalismo malaprobis heredajn '
    'privilegiojn, ŝtatan religion, absolutan monarkion kaj la Didevena '
    'Rajto de Reĝoj. La filozofo John Locke de la 17-a jarcento ofte '
    'estas meritigita pro fondado de liberalismo kiel klara filozofia '
    'tradicio. Locke argumentis ke ĉiu homo havas naturon rekte al vivo, '
    'libereco kaj posedrajto kaj laŭ la socia '
    'kontrakto, registaroj ne rajtas malobservi tiujn rajtojn. '
    'Liberaluloj kontraŭbatalis tradician konservativismon kaj serĉis '
    'anstataŭigi absolutismon en registaroj per reprezenta demokratio kaj '
    'la jura hegemonio.'
)

rake = Rake(max_words_unknown_lang=3)

# other_text holds the rest of the article (not shown here)
keywords = rake.apply(text, text_for_stopwords=other_text)

print(keywords)

# [('serĉis anstataŭigi absolutismon', 9.0),  # sought to replace absolutism
#  ('filozofo john locke', 8.5),  # philosopher John Locke
#  ('locke argumentis', 4.5),  # Locke argued
#  ('justaj elektoj', 4.0),  # fair elections
#  ('libera komerco', 4.0),  # free trade
#  ('okcidenta mondo', 4.0),  # western world
#  ('ŝtatan religion', 4.0),  # state religion
#  ('absolutan monarkion', 4.0),  # absolute monarchy
#  ('didevena rajto', 4.0),  # divine right
#  ('socia kontrakto', 4.0),  # social contract
#  ('jura hegemonio', 4.0),  # legal hegemony
#  ('mondrigardo konstruita', 4.0),  # worldview built
#  ('vidpunktoj depende', 4.0),  # views based
#  ('sia kompreno', 4.0),  # their understanding
#  ('tiuj principoj', 4.0),  # these principles
#  ('gazetara libereco', 3.5),  # freedom of press
#  ('religia libereco', 3.5),  # religious freedom
#  ('privata posedrajto', 3.5),  # private property
#  ('libereco', 1.5),  # liberty
#  ('posedrajto', 1.5)]  # property

So we are able to get decent results without an explicit set of stopwords.

Usage

Initialize the Rake object:

from multi_rake import Rake

rake = Rake(
    min_chars=3,
    max_words=3,
    min_freq=1,
    language_code=None,  # 'en'
    stopwords=None,  # {'and', 'of'}
    lang_detect_threshold=50,
    max_words_unknown_lang=2,
    generated_stopwords_percentile=80,
    generated_stopwords_max_len=3,
    generated_stopwords_min_freq=2,
)

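min_chars - a word is selected to be part of a keyword if its length is >= min_chars. Default: 3

max_words - maximum number of words in a phrase considered as a keyword. Default: 3

min_freq - minimum number of occurrences of a phrase for it to be considered a keyword. Default: 1

language_code - provide a language code as a string to use the built-in set of stopwords. See the list of currently available languages. If the language is not specified, the algorithm tries to determine it with cld2 and uses the corresponding built-in set of stopwords. Default: None

stopwords - provide your own collection of stopwords (preferably as a set, in lower case). If specified, it overrides language_code. Default: None

Keep language_code and stopwords as None and stopwords will be generated from the provided text.

lang_detect_threshold - threshold for the probability of the language detected by cld2 (0-100). Default: 50

max_words_unknown_lang - the same as max_words, but used when the language is unknown and stopwords are generated from the provided text. The best results are usually obtained with a specially crafted set of stopwords. When such a set is absent and generated stopwords are used instead, keywords may not be as clean, so it can make sense, for example, to produce 2-word keywords for unknown languages and 3-word keywords for languages with a predefined stopword set. Default: 2

generated_stopwords_percentile - to generate stopwords, a distribution of all words in the text by frequency is created. Words above this percentile (0-100) are considered candidates to become stopwords. Default: 80

generated_stopwords_max_len - maximum length, in characters, of a generated stopword. Default: 3

generated_stopwords_min_freq - minimum frequency of a generated stopword in the distribution. Default: 2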
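To make the three phrase-level parameters described above (min_chars, max_words, min_freq) more concrete, here is a minimal sketch of how such limits could be applied to a list of candidate phrases. The helper filter_candidates is hypothetical and not part of multi_rake. In particular, rejecting a whole phrase when one of its words is shorter than min_chars is an assumption made for illustration; the library may treat short words differently.

```python
from collections import Counter


def filter_candidates(phrases, min_chars=3, max_words=3, min_freq=1):
    """Keep candidate phrases that satisfy the three phrase-level limits.

    Hypothetical helper for illustration; not part of multi_rake.
    """
    counts = Counter(phrases)  # occurrences of each candidate phrase
    kept = []
    for phrase, count in counts.items():
        words = phrase.split()
        # Assumption: one word shorter than min_chars rejects the phrase.
        long_enough = all(len(word) >= min_chars for word in words)
        if long_enough and len(words) <= max_words and count >= min_freq:
            kept.append(phrase)
    return kept


candidates = [
    "linear constraints",
    "natural numbers",
    "a b",                 # words shorter than min_chars
    "one two three four",  # more than max_words words
    "linear constraints",
]
print(filter_candidates(candidates))
# ['linear constraints', 'natural numbers']
print(filter_candidates(candidates, min_freq=2))
# ['linear constraints']
```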

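Apply the Rake object to text:

keywords = rake.apply(
    text,
    text_for_stopwords=None,
)

text - string containing the text from which keywords are generated.

text_for_stopwords - string containing text that is used for stopword generation alongside text. For example, you have an article with an introduction and several subsections. You know that keywords from the introduction are enough for your purposes, you do not know the language of the text, and you do not have a list of stopwords. Stopwords can then be generated from the text itself, and the more text you have, the better. In that case you can specify text=introduction, text_for_stopwords=rest_of_your_text.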
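The introduction/rest split described above can be prepared with a few lines of plain Python. This is only a sketch: split_article is a hypothetical helper, and it assumes that paragraphs are separated by blank lines, which may not hold for your input.

```python
def split_article(article, intro_paragraphs=3):
    """Split an article into (introduction, rest) by paragraph count.

    Hypothetical helper; assumes paragraphs are separated by blank lines.
    """
    paragraphs = [p.strip() for p in article.split("\n\n") if p.strip()]
    introduction = "\n\n".join(paragraphs[:intro_paragraphs])
    rest = "\n\n".join(paragraphs[intro_paragraphs:])
    return introduction, rest


article = "First.\n\nSecond.\n\nThird.\n\nFourth.\n\nFifth."
introduction, rest_of_your_text = split_article(article)
print(introduction.count("\n\n") + 1)  # number of introduction paragraphs: 3

# With multi_rake installed, the two parts would then be passed as:
# keywords = rake.apply(introduction, text_for_stopwords=rest_of_your_text)
```

Keywords are extracted only from the first part, while the second part only contributes to the word frequencies used for stopword generation.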

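Implementation details

The RAKE algorithm is described in Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications. John Wiley & Sons.

This implementation differs from others in its multilingual support. Basically, you can provide text without knowing its language (it should be written in the Cyrillic or Latin alphabet). The best results, however, are achieved with a thoroughly constructed list of stopwords.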
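For readers unfamiliar with RAKE, the following is a minimal sketch of the classic scoring scheme from the cited paper: text is split into candidate phrases at punctuation and stopwords, each word gets the score degree(word) / frequency(word), and a phrase scores the sum of its word scores. It is an illustration under simplifying assumptions (naive whitespace tokenisation, a small punctuation set), not the code used by multi_rake, and rake_sketch is a hypothetical name.

```python
import re
from collections import defaultdict


def rake_sketch(text, stopwords):
    """Return (phrase, score) pairs sorted by descending score."""
    phrases = []
    # Split into fragments at punctuation, then into phrases at stopwords.
    for fragment in re.split(r"[.,;:!?()]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))

    freq = defaultdict(int)    # how often each word occurs in candidates
    degree = defaultdict(int)  # co-occurrence count, including the word itself
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in set(phrases)
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


sample = "Compatibility of systems of linear constraints over the set of natural numbers."
for phrase, score in rake_sketch(sample, {"of", "over", "the"}):
    print(phrase, score)
```

On this single sentence the two-word phrases score 4.0 and the single words 1.0. Scores on a full document differ because repeated words accumulate degree and frequency across all candidate phrases.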

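What is going on under the hood:

1. If stopwords are specified, they are used.
2. If a language is specified, the built-in stopwords for this language are used. If there are no built-in stopwords for it -> 4
3. If the language is not specified, cld2 tries to determine the language -> 2
4. Stopwords are generated from text and text_for_stopwords.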
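The decision flow listed above can be sketched as a small function. Everything here is hypothetical and only mirrors the listed steps: resolve_stopwords, the builtin mapping, and the detect and generate callables (stand-ins for cld2 detection and for stopword generation) are illustrative names, not the library's internals.

```python
def resolve_stopwords(stopwords, language_code, builtin, detect, generate):
    """Return (stopword_set, source) following the steps listed above.

    builtin  -- dict mapping a language code to a set of stopwords
    detect   -- callable returning a language code or None (stand-in for cld2)
    generate -- callable returning stopwords generated from the text
    """
    # Step 1: explicitly provided stopwords always win.
    if stopwords is not None:
        return set(stopwords), "explicit"
    # Step 3: no language given, so try to detect one, then go to step 2.
    if language_code is None:
        language_code = detect()
    # Step 2: use the built-in list if this language has one.
    if language_code in builtin:
        return builtin[language_code], "builtin"
    # Step 4: fall back to stopwords generated from the text.
    return generate(), "generated"


builtin_lists = {"en": {"the", "of", "and"}}

# Esperanto has no built-in list here, so generated stopwords are used.
print(resolve_stopwords(None, "eo", builtin_lists, lambda: None, lambda: {"la", "de"})[1])
# generated
```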

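We generate stopwords by creating a frequency distribution of the words in the text and filtering it with the parameters generated_stopwords_percentile, generated_stopwords_max_len, and generated_stopwords_min_freq. They cannot be generated perfectly, but articles and prepositions are fairly easy to find because they usually consist of 3-4 characters and appear frequently. These stopwords, coupled with punctuation delimiters, make it possible to get decent results for languages we do not understand.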
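A minimal sketch of this kind of stopword generation is shown below. It assumes whitespace tokenisation and a nearest-rank percentile over the per-word frequencies, and it treats words at or above the percentile as candidates. The function name generate_stopwords and these details are assumptions for illustration; the library's exact computation may differ.

```python
import math
from collections import Counter


def generate_stopwords(text, percentile=80, max_len=3, min_freq=2):
    """Guess stopwords: frequent, short words of the given text."""
    counts = Counter(text.lower().split())  # naive tokenisation
    if not counts:
        return set()
    # Nearest-rank percentile of the per-word frequency distribution.
    freqs = sorted(counts.values())
    rank = max(0, math.ceil(percentile / 100 * len(freqs)) - 1)
    threshold = freqs[rank]
    return {
        word
        for word, count in counts.items()
        if count >= threshold and count >= min_freq and len(word) <= max_len
    }


sample_eo = "la domo de la urbo de la lando estas granda"
print(sorted(generate_stopwords(sample_eo)))
# ['de', 'la']
```

Raising generated_stopwords_min_freq or lowering generated_stopwords_max_len in the real library should make the generated set stricter in the same way that min_freq and max_len do in this sketch.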

List of currently available languages

Only the language code should be used when initializing Rake.

bg - Bulgarian
cs - Czech
da - Danish
de - German
el - Greek
en - English
es - Spanish
fa - Persian
fi - Finnish
fr - French
ga - Irish
hr - Croatian
hu - Hungarian
id - Indonesian
it - Italian
lt - Lithuanian
lv - Latvian
nl - Dutch
no - Norwegian
pl - Polish
pt - Portuguese
ro - Romanian
ru - Russian
sk - Slovak
sv - Swedish
tr - Turkish
uk - Ukrainian

Development

The repository has a linter, tests, and coverage configured.

To use them, create a new virtual environment inside the multi_rake folder:

python3 -m venv env
source env/bin/activate

make install-dev  # install dependencies

make lint  # run linter

make test  # run tests and coverage

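References

RAKE algorithm: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications. John Wiley & Sons.

The basic RAKE implementation by fabianvf was used.

Stopwords: trec-kba, Ranks NL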
