postgres full text searchダウンロード-POSTGRES postgres full text searchソースコードダウンロード

Postgres-full-text-search

Postgresフルテキスト検索オプション（Tsearch、Trigram、ilike）の例。

DBを作成します
Simple ilikeを使用した全文検索
TRIGRAMインデックスでサポートされているilikeを使用した全文検索
tsearchフルテキスト検索のために、非デフォルト言語構成を作成します
保存されたインデックスなしのTsearchフルテキスト検索
保存された部分インデックスを使用したTsearchフルテキスト検索
tsearchフルテキストの部分的な単語の検索
tsearchフルテキスト検索結果のランキング
GIST対ジン
インスピレーションとヘルプ

DBを作成します

 >> CREATE DATABASE ftdb;

DBをデータセットの例（ dataset.txt 、100k行、それぞれ15語）でフィードするには、python init_db.pyスクリプトを使用しました。

Simple `ilike`を使用した全文検索

 >> EXPLAIN ANALYZE
   SELECT text , language
   FROM public . document
   WHERE
      text ilike ' %field% '
      AND text ilike ' %window% '
      AND text ilike ' %lamp% '
      AND text ilike ' %research% '
      AND language = ' en '
    LIMIT 1 ;
                                                                  QUERY PLAN
-- --------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost = 0 . 00 .. 3734 . 02 rows = 1 width = 105 ) (actual time = 87 . 473 .. 87 . 474 rows = 0 loops = 1 )
   - >  Seq Scan on document  (cost = 0 . 00 .. 3734 . 02 rows = 1 width = 105 ) (actual time = 87 . 466 .. 87 . 466 rows = 0 loops = 1 )
         Filter: (( text ~~ * ' %field% ' :: text ) AND ( text ~~ * ' %window% ' :: text ) AND ( text ~~ * ' %lamp% ' :: text ) AND ( text ~~ * ' %research% ' :: text ))
         Rows Removed by Filter: 100001
 Planning Time : 2 . 193 ms
 Execution Time : 87 . 500 ms

TRIGRAMインデックスでサポートされている`ilike`を使用した全文検索

トリグラムとは何ですか？この例を参照してください：

 >> CREATE EXTENSION pg_trgm;
CREATE EXTENSION
>> select show_trgm( ' fielded ' );
                show_trgm
-- ---------------------------------------
 { "  f " , " fi " ,ded, " ed " ,eld,fie,iel,lde}

gin_trgm_opsなど、Trigramインデックスを使用してilikeパフォーマンスを改善できます。

 >> CREATE INDEX  ix_document_text_trigram ON document USING gin ( text gin_trgm_ops) where language = ' en ' ;
CREATE INDEX

>> EXPLAIN ANALYZE SELECT text , language
   FROM public . document
   WHERE
      text ilike ' %field% '
      AND text ilike ' %window% '
      AND text ilike ' %lamp% '
      AND text ilike ' %research% '
      AND language = ' en '
    LIMIT 1 ;
                                                                                       QUERY PLAN
-- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost = 176 . 00 .. 180 . 02 rows = 1 width = 105 ) (actual time = 1 . 473 .. 1 . 474 rows = 0 loops = 1 )
   - >  Bitmap Heap Scan on document  (cost = 176 . 00 .. 180 . 02 rows = 1 width = 105 ) (actual time = 1 . 470 .. 1 . 471 rows = 0 loops = 1 )
         Recheck Cond: (( text ~~ * ' %field% ' :: text ) AND ( text ~~ * ' %window% ' :: text ) AND ( text ~~ * ' %lamp% ' :: text ) AND ( text ~~ * ' %research% ' :: text ) AND ((language):: text = ' en ' :: text ))
         - >  Bitmap Index Scan on ix_document_text_trigram  (cost = 0 . 00 .. 176 . 00 rows = 1 width = 0 ) (actual time = 1 . 466 .. 1 . 466 rows = 0 loops = 1 )
               Index Cond: (( text ~~ * ' %field% ' :: text ) AND ( text ~~ * ' %window% ' :: text ) AND ( text ~~ * ' %lamp% ' :: text ) AND ( text ~~ * ' %research% ' :: text ))
 Planning Time : 2 . 389 ms
 Execution Time : 1 . 524 ms

tsearchフルテキスト検索のために、非デフォルト言語構成を作成します

Postgresは、デフォルトでは多くの言語のサポートを提供しません。ただし、構成を非常に簡単にセットアップできます。追加の辞書ファイルが必要です。これがポーランド語の例です。ポーランドの辞書ファイルは、https：//github.com/judehunter/polish-tsearchからダウンロードできます。

Poling.affix、polish.stop、およびpolish.dictファイルは、postgresql sharedir tsearch_data locationにコピーする必要があります/usr/share/postgresql/13/tsearch_data Sharedirの場所を決定するには、 pg_config --sharedir使用できます

また、データベース内に構成（ドキュメントを参照）を作成する必要があります。

 >> DROP TEXT SEARCH DICTIONARY IF EXISTS polish_hunspell CASCADE;
   CREATE TEXT SEARCH DICTIONARY polish_hunspell (
    TEMPLATE  = ispell,
    DictFile  = polish,
    AffFile   = polish,
    StopWords = polish
  );
  CREATE TEXT SEARCH CONFIGURATION public . polish (
    COPY = pg_catalog . english
  );
  ALTER TEXT SEARCH CONFIGURATION polish
    ALTER MAPPING
    FOR
        asciiword, asciihword, hword_asciipart,  word, hword, hword_part
    WITH
        polish_hunspell, simple;

フルテキスト検索エンジンはlexemeを使用して最高の一致を見つけるためにレクセムを使用しているため、これらのファイルと構成が必要です（クエリパターンと保存されたテキストの両方が語られます）：

 >> SELECT to_tsquery( ' english ' , ' fielded ' ), to_tsvector( ' english ' , text )
   FROM document
   LIMIT 1 ;
 to_tsquery |                                                                    to_tsvector
-- ----------+----------------------------------------------------------------------------------------------------------------------------------------------------
 ' field '    | ' 19 ' : 16 ' bat ' : 12 ' dead ' : 8 ' degre ' : 1 ' depth ' : 5 ' field ' : 15 ' lamp ' : 13 ' men ' : 6 ' put ' : 14 ' ranch ' : 2 ' tall ' : 4 ' time ' : 3 ' underlin ' : 11 ' wast ' : 10 ' window ' : 9

辞書ファイルを提供できない場合は、フルテキストを「単純な」フォームで使用できます（lexemeへの変換なし）：

 >> SELECT to_tsquery( ' simple ' , ' fielded ' ), to_tsvector( ' simple ' , text )
   FROM document
   LIMIT 1 ;
 to_tsquery |                                                                             to_tsvector
-- ----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 ' fielded '  | ' 19 ' : 16 ' bat ' : 12 ' below ' : 7 ' dead ' : 8 ' degree ' : 1 ' depth ' : 5 ' field ' : 15 ' lamp ' : 13 ' men ' : 6 ' putting ' : 14 ' ranch ' : 2 ' tall ' : 4 ' time ' : 3 ' underline ' : 11 ' waste ' : 10 ' window ' : 9

保存されたインデックスなしのTsearchフルテキスト検索

 >> EXPLAIN ANALYZE SELECT text , language
   FROM public . document
   WHERE to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
   LIMIT 1 ;
                                                                                  QUERY PLAN
-- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost = 1000 . 00 .. 18298 . 49 rows = 1 width = 103 ) (actual time = 489 . 802 .. 491 . 352 rows = 0 loops = 1 )
   - >  Gather  (cost = 1000 . 00 .. 18298 . 49 rows = 1 width = 103 ) (actual time = 489 . 800 .. 491 . 349 rows = 0 loops = 1 )
         Workers Planned: 1
         Workers Launched: 1
         - >  Parallel Seq Scan on document  (cost = 0 . 00 .. 17298 . 39 rows = 1 width = 103 ) (actual time = 486 . 644 .. 486 . 644 rows = 0 loops = 2 )
               Filter: (((language):: text = ' en ' :: text ) AND (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery))
               Rows Removed by Filter: 50000
 Planning Time : 0 . 272 ms
 Execution Time : 491 . 376 ms
( 9 rows)

保存された部分インデックスを使用したTsearchフルテキスト検索

Partial Indexは、1つのテーブルを使用してレコードを異なる言語で保存し、効果的にクエリする可能性として提供されます。

 >> CREATE INDEX ix_en_document_tsvector_text ON public . document USING gin (to_tsvector( ' english ' ::regconfig, text )) WHERE language = ' en ' ;
CREATED INDEX
>> EXPLAIN ANALYZE SELECT text , language
   FROM public . document
   WHERE to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
   LIMIT 1 ;
                                                               QUERY PLAN
-- --------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost = 1000 . 00 .. 18151 . 43 rows = 1 width = 103 ) (actual time = 487 . 120 .. 488 . 569 rows = 0 loops = 1 )
   - >  Gather  (cost = 1000 . 00 .. 18151 . 43 rows = 1 width = 103 ) (actual time = 487 . 117 .. 488 . 567 rows = 0 loops = 1 )
         Workers Planned: 1
         Workers Launched: 1
         - >  Parallel Seq Scan on document  (cost = 0 . 00 .. 17151 . 33 rows = 1 width = 103 ) (actual time = 484 . 418 .. 484 . 419 rows = 0 loops = 2 )
               Filter: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery)
               Rows Removed by Filter: 50000
 Planning Time : 0 . 193 ms
 Execution Time : 488 . 596 ms

違いはありませんか？インデックスは使用されていません...なぜ機能していないのですか？ああ、部分的なインデックスドキュメントに目を向けます：

ただし、述語は、インデックスの恩恵を受けるはずのクエリで使用される条件と一致する必要があることに留意してください。正確には、システムが数学的に数学的にインデックスの述語を意味することをシステムが認識できる場合にのみ、クエリで部分インデックスを使用できます。 PostgreSQLには、さまざまな形式で記述された数学的に同等の式を認識できる洗練された定理を備えていません。（このような一般的な定理を作成するのが非常に難しいだけでなく、おそらく実際の使用には遅すぎるでしょう。）システムは、単純な不平等の意味を認識することができます。たとえば、「x <1」は「x <2」を意味します。それ以外の場合、述語条件は、クエリの条件またはインデックスが使用可能であると認識されない場合のクエリの一部を正確に一致させる必要があります。マッチングは、実行時ではなく、クエリ計画時間で行われます。その結果、パラメーター化されたクエリ条項は、部分インデックスでは機能しません。

部分的なインデックスを作成するために使用された条件を照会するために追加する必要があります： document.language = 'en' ：

 >> EXPLAIN ANALYZE SELECT text , language
   FROM public . document
   WHERE
      to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
      AND language = ' en '
   LIMIT 1 ;                                                                           QUERY PLAN
-- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost = 64 . 00 .. 68 . 27 rows = 1 width = 103 ) (actual time = 0 . 546 .. 0 . 548 rows = 0 loops = 1 )
   - >  Bitmap Heap Scan on document  (cost = 64 . 00 .. 68 . 27 rows = 1 width = 103 ) (actual time = 0 . 544 .. 0 . 545 rows = 0 loops = 1 )
         Recheck Cond: ((to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery) AND ((language):: text = ' en ' :: text ))
         - >  Bitmap Index Scan on ix_en_document_tsvector_text  (cost = 0 . 00 .. 64 . 00 rows = 1 width = 0 ) (actual time = 0 . 540 .. 0 . 540 rows = 0 loops = 1 )
               Index Cond: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery)
 Planning Time : 0 . 244 ms
 Execution Time : 0 . 590 ms

tsearchフルテキストの部分的な単語の検索

:*演算子はプレフィックス検索を有効にします。単語の入力中に全文検索を実行すると便利です。

 >> EXPLAIN ANALYZE SELECT text , language
   FROM public . document
   WHERE
      to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & l:* ' )
      AND language = ' en '
   LIMIT 1 ;
                                                                   QUERY PLAN
-- ----------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on document  (cost = 168 . 00 .. 172 . 27 rows = 1 width = 102 ) (actual time = 5 . 207 .. 5 . 210 rows = 4 loops = 1 )
   Recheck Cond: ((to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' l ' ' :* ' ::tsquery) AND ((language):: text = ' en ' :: text ))
   Heap Blocks: exact = 4
   - >  Bitmap Index Scan on ix_en_document_tsvector_text  (cost = 0 . 00 .. 168 . 00 rows = 1 width = 0 ) (actual time = 5 . 202 .. 5 . 202 rows = 4 loops = 1 )
         Index Cond: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' l ' ' :* ' ::tsquery)
 Planning Time : 0 . 240 ms
 Execution Time : 5 . 240 ms

>> SELECT id,  text
   FROM public . document
   WHERE
      to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & l:* ' )
      AND language = ' en '
   LIMIT 20 ;
  id   |                                                   text
-- -----+-----------------------------------------------------------------------------------------------------------
     1 | degree ranch time tall depth men below dead window waste underline bat lamp putting field               +
 20152 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual              +
 21478 | Dried symbol willing design managed shade window pick share faster education drive field land everybody  +
 30293 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic +

tsearchフルテキスト検索結果のランキング

Tsearchの結果をランク付けするには、2つの非常に類似した機能があります。

ts_rank 、それは一致する辞石の頻度に基づいてベクトルをランク付けします
ts_rank_cdは、「カバー密度」ランキングを計算します

詳細については、ドキュメントを参照してください

 >> SELECT
     id,
     ts_rank_cd(to_tsvector( ' english ' , text ), to_tsquery( ' english ' , ' fielded & wind:* ' )) rank,
     text
    FROM public . document
    WHERE to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & wind:* ' )
    ORDER BY rank DESC
    LIMIT 20 ;
   id   |    rank     |                                                   text
-- ------+-------------+-----------------------------------------------------------------------------------------------------------
 100002 |         0 . 1 | fielded window
   9376 |        0 . 05 | Own mouse girl effect surprise physical newspaper forgot eat upper field element window simply unhappy   +
  96597 |        0 . 05 | Opinion fastened pencil rear more theory size window heading field understanding farm up position attack +
  44626 | 0 . 033333335 | Symbol each halfway window swam spider field page shinning donkey chose until cow cabin congress         +
  80922 | 0 . 033333335 | Victory famous field shelter girl wind adventure he divide rear tip few studied ruler judge              +
  30293 |       0 . 025 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic +
      1 | 0 . 016666668 | degree ranch time tall depth men below dead window waste underline bat lamp putting field               +
  21478 | 0 . 016666668 | Dried symbol willing design managed shade window pick share faster education drive field land everybody  +
  60059 | 0 . 016666668 | However hungry make proud kids come willing field officer row above highest round wind mile              +
  26001 | 0 . 014285714 | Earth earlier pocket might sense window way frog fire court family mouth field somebody recognize        +
  20152 | 0 . 014285714 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual              +
  37470 |      0 . 0125 | Farm weight balloon buried wind water donkey grain pig week should damage field was he                   +
  49433 |        0 . 01 | Wind scientist leaving atom year bad child drink shore spirit field facing indicate wagon here           +
  37851 | 0 . 007142857 | Field cloud you wife rhythm upward applied weigh continued property replace ahead forgotten trip window  +

text='fielded window'レコードは、ベストマッチ結果を示すために手動で追加されました。

GIST対ジン

GINインデックスを作成しました。ただし、GISTインデックスオプションもあります。どちらが良いですか？場合によります...

 >> EXPLAIN ANALYZE SELECT text , language
   FROM public . document
   WHERE
      to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
      AND language = ' en '
   LIMIT 1 ;
                                                                  QUERY PLAN
-- ---------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost = 0 . 28 .. 8 . 30 rows = 1 width = 103 ) (actual time = 2 . 699 .. 2 . 700 rows = 0 loops = 1 )
   - >  Index Scan using ix_en_document_tsvector_text on document  (cost = 0 . 28 .. 8 . 30 rows = 1 width = 103 ) (actual time = 2 . 697 .. 2 . 697 rows = 0 loops = 1 )
         Index Cond: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery)
 Planning Time : 0 . 274 ms
 Execution Time : 2 . 730 ms

ジンは少し速いようです。ドキュメントよりもよく説明できるとは思わない：

使用するインデックスタイプ、GIST、またはジンを選択する際に、これらのパフォーマンスの違いを考慮してください。
ジンインデックスルックアップは、GISTの約3倍高速です
ジンインデックスは、GISTよりも約3倍時間がかかります。
GINインデックスはGISTインデックスよりも更新が適度に遅くなりますが、高速アップデートサポートが無効になった場合は約10倍遅いです（詳細については58.4.1セクションを参照）
GINインデックスは、GISTインデックスよりも2〜3倍大きくなっています

インスピレーションとヘルプ

https://about.gitlab.com/blog/2016/03/18/fast-search-using-postgresql-trigram-indexes/
http://rachbelaid.com/postgres-full-text-search-is-good-enough/
https://scoutapm.com/blog/how-to-make-text-searches-in-postgresql-with-trigram-similarity
https://stackoverflow.com/questions/27443950/make-postgres-full-text-search-search-tsvector-act-ilike-ilike-to-search-inside-words
https://stackoverflow.com/questions/46122175/fulltext-search-combined-with-fuzzysearch-in-postgresql
https://stackoverflow.com/questions/58651852/use-postgresql-full-text-search-to-fuzzy-match-all-search-terms
https://stackoverflow.com/questions/52140727/fuzzy-search-in-full-text-search
https://stackoverflow.com/questions/2513501/postgresql-full-text-search-how-to-search-partial-words
https://stackoverflow.com/questions/28975517/difference-between-gist and-gin-index
https://dba.stackexchange.com/questions/149765/postgresql-gin-index-not-used-when-ts-query-language-is-freted-from-a-column
https://dba.stackexchange.com/questions/251177/postgres-full-text-search-on-words-not-lexemes

拡大する

postgres full text search