postgres-full-text-search

Postgres full-text search options (tsearch, trigram, ilike) with examples.

Create DB
Full-text search using simple ilike
Full-text search using ilike supported by a trigram index
Setup
Tsearch full-text search without a stored index
Tsearch full-text search with a stored partial index
Tsearch full-text search for partial words
Tsearch full-text search results ranking
GiST vs GIN
Inspirations and help

Create DB

 >> CREATE DATABASE ftdb;

To feed the DB with an example dataset (dataset.txt, 100k lines, 15 words each), I used the Python script init_db.py.

Full-text search using simple `ilike`

 >> EXPLAIN ANALYZE
   SELECT text, language
   FROM public.document
   WHERE
      text ilike '%field%'
      AND text ilike '%window%'
      AND text ilike '%lamp%'
      AND text ilike '%research%'
      AND language = 'en'
    LIMIT 1;
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3734.02 rows=1 width=105) (actual time=87.473..87.474 rows=0 loops=1)
   ->  Seq Scan on document  (cost=0.00..3734.02 rows=1 width=105) (actual time=87.466..87.466 rows=0 loops=1)
         Filter: ((text ~~* '%field%'::text) AND (text ~~* '%window%'::text) AND (text ~~* '%lamp%'::text) AND (text ~~* '%research%'::text))
         Rows Removed by Filter: 100001
 Planning Time: 2.193 ms
 Execution Time: 87.500 ms

Full-text search using `ilike` supported by a trigram index

What is a trigram? See this example:

 >> CREATE EXTENSION pg_trgm;
CREATE EXTENSION
>> select show_trgm('fielded');
                show_trgm
-----------------------------------------
 {"  f"," fi",ded,"ed ",eld,fie,iel,lde}
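To make the output above less magical, here is a minimal Python sketch of how such a trigram set can be derived. It assumes plain ASCII input and only approximates pg_trgm (lower-casing, splitting on non-alphanumeric characters, padding each word with two leading spaces and one trailing space); the function merely borrows the name show_trgm and is not the extension's implementation.

```python
import re


def show_trgm(text: str) -> list[str]:
    """Approximate pg_trgm's show_trgm() for plain ASCII input."""
    trigrams = set()
    # Lower-case the input and treat every non-alphanumeric character as a word separator.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two spaces in front, one space behind
        # Slide a window of three characters over the padded word.
        for i in range(len(padded) - 2):
            trigrams.add(padded[i:i + 3])
    return sorted(trigrams)


print(show_trgm("fielded"))
```

A pattern such as '%field%' contains the trigrams fie, iel and eld, which is why an index over trigrams can narrow down the candidate rows before the rows themselves are rechecked.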

We can improve ilike performance with a trigram index, e.g. gin_trgm_ops.

 >> CREATE INDEX ix_document_text_trigram ON document USING gin (text gin_trgm_ops) WHERE language = 'en';
CREATE INDEX

>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      text ilike '%field%'
      AND text ilike '%window%'
      AND text ilike '%lamp%'
      AND text ilike '%research%'
      AND language = 'en'
    LIMIT 1;
                                                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=176.00..180.02 rows=1 width=105) (actual time=1.473..1.474 rows=0 loops=1)
   ->  Bitmap Heap Scan on document  (cost=176.00..180.02 rows=1 width=105) (actual time=1.470..1.471 rows=0 loops=1)
         Recheck Cond: ((text ~~* '%field%'::text) AND (text ~~* '%window%'::text) AND (text ~~* '%lamp%'::text) AND (text ~~* '%research%'::text) AND ((language)::text = 'en'::text))
         ->  Bitmap Index Scan on ix_document_text_trigram  (cost=0.00..176.00 rows=1 width=0) (actual time=1.466..1.466 rows=0 loops=1)
               Index Cond: ((text ~~* '%field%'::text) AND (text ~~* '%window%'::text) AND (text ~~* '%lamp%'::text) AND (text ~~* '%research%'::text))
 Planning Time: 2.389 ms
 Execution Time: 1.524 ms

Setup

Postgres supports many languages out of the box, and setting up a configuration for another one is easy: you only need additional dictionary files. Here is an example for Polish. The Polish dictionary files can be downloaded from: https://github.com/judehunter/polish-tsearch.

The polish.affix, polish.stop and polish.dict files should be copied to the tsearch_data directory under the PostgreSQL sharedir, e.g. /usr/share/postgresql/13/tsearch_data. To determine your sharedir location, you can use pg_config --sharedir.
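The copy step can also be scripted. The following is only a sketch under assumptions: the three file names are inferred from the dictionary definition used in this README (DictFile, AffFile and StopWords all set to polish), the helper names are made up for illustration, and writing into the sharedir usually requires elevated privileges.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed file names, derived from DictFile/AffFile/StopWords = polish.
DICTIONARY_FILES = ("polish.affix", "polish.dict", "polish.stop")


def find_sharedir() -> str:
    """Ask pg_config for the PostgreSQL share directory."""
    result = subprocess.run(
        ["pg_config", "--sharedir"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def tsearch_data_dir(sharedir: str) -> Path:
    """Directory in which PostgreSQL looks for text search dictionary files."""
    return Path(sharedir) / "tsearch_data"


def install_dictionary(source_dir: Path, sharedir: str) -> list[Path]:
    """Copy the dictionary files into tsearch_data and return the new paths."""
    target_dir = tsearch_data_dir(sharedir)
    copied = []
    for name in DICTIONARY_FILES:
        copied.append(Path(shutil.copy(source_dir / name, target_dir / name)))
    return copied
```

With a local checkout of the dictionary repository, a call might look like install_dictionary(Path("polish-tsearch"), find_sharedir()); the checkout directory name is hypothetical.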

A configuration must also be created inside the database (see the docs):

 >> DROP TEXT SEARCH DICTIONARY IF EXISTS polish_hunspell CASCADE;
   CREATE TEXT SEARCH DICTIONARY polish_hunspell (
    TEMPLATE  = ispell,
    DictFile  = polish,
    AffFile   = polish,
    StopWords = polish
  );
  CREATE TEXT SEARCH CONFIGURATION public.polish (
    COPY = pg_catalog.english
  );
  ALTER TEXT SEARCH CONFIGURATION polish
    ALTER MAPPING
    FOR
        asciiword, asciihword, hword_asciipart,  word, hword, hword_part
    WITH
        polish_hunspell, simple;

You need these files and the configuration because the full-text search engine uses lexemes to find the best matches (both the query pattern and the stored text are converted to lexemes):

 >> SELECT to_tsquery('english', 'fielded'), to_tsvector('english', text)
   FROM document
   LIMIT 1;
 to_tsquery |                                                                    to_tsvector
------------+----------------------------------------------------------------------------------------------------------------------------------------------------
 'field'    | '19':16 'bat':12 'dead':8 'degre':1 'depth':5 'field':15 'lamp':13 'men':6 'put':14 'ranch':2 'tall':4 'time':3 'underlin':11 'wast':10 'window':9

If you cannot provide dictionary files, you can use full-text search in "simple" form (without transformation to lexemes):

 >> SELECT to_tsquery('simple', 'fielded'), to_tsvector('simple', text)
   FROM document
   LIMIT 1;
 to_tsquery |                                                                             to_tsvector
------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 'fielded'  | '19':16 'bat':12 'below':7 'dead':8 'degree':1 'depth':5 'field':15 'lamp':13 'men':6 'putting':14 'ranch':2 'tall':4 'time':3 'underline':11 'waste':10 'window':9

Tsearch full-text search without a stored index

 >> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test')
   LIMIT 1;
                                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1000.00..18298.49 rows=1 width=103) (actual time=489.802..491.352 rows=0 loops=1)
   ->  Gather  (cost=1000.00..18298.49 rows=1 width=103) (actual time=489.800..491.349 rows=0 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Parallel Seq Scan on document  (cost=0.00..17298.39 rows=1 width=103) (actual time=486.644..486.644 rows=0 loops=2)
               Filter: (((language)::text = 'en'::text) AND (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery))
               Rows Removed by Filter: 50000
 Planning Time: 0.272 ms
 Execution Time: 491.376 ms
(9 rows)

Tsearch full-text search with a stored partial index

A partial index makes it possible to store records in different languages in a single table and still query them efficiently.

 >> CREATE INDEX ix_en_document_tsvector_text ON public.document USING gin (to_tsvector('english'::regconfig, text)) WHERE language = 'en';
CREATE INDEX
>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test')
   LIMIT 1;
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1000.00..18151.43 rows=1 width=103) (actual time=487.120..488.569 rows=0 loops=1)
   ->  Gather  (cost=1000.00..18151.43 rows=1 width=103) (actual time=487.117..488.567 rows=0 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Parallel Seq Scan on document  (cost=0.00..17151.33 rows=1 width=103) (actual time=484.418..484.419 rows=0 loops=2)
               Filter: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery)
               Rows Removed by Filter: 50000
 Planning Time: 0.193 ms
 Execution Time: 488.596 ms

No difference? The index was not used... Why doesn't it work? Ohh, look at the partial index docs:

However, keep in mind that the predicate must match the conditions used in the queries that are supposed to benefit from the index. To be precise, a partial index can be used in a query only if the system can recognize that the WHERE condition of the query mathematically implies the predicate of the index. PostgreSQL does not have a sophisticated theorem prover that can recognize mathematically equivalent expressions that are written in different forms. (Not only is such a general theorem prover extremely difficult to create, it would probably be too slow to be of any real use.) The system can recognize simple inequality implications, for example "x < 1" implies "x < 2"; otherwise the predicate condition must exactly match part of the query's WHERE condition or the index will not be recognized as usable. Matching takes place at query planning time, not at run time. As a result, parameterized query clauses do not work with a partial index.

We need to add to the query the condition that was used to create the partial index: document.language = 'en':

 >> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test')
      AND language = 'en'
   LIMIT 1;
                                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=64.00..68.27 rows=1 width=103) (actual time=0.546..0.548 rows=0 loops=1)
   ->  Bitmap Heap Scan on document  (cost=64.00..68.27 rows=1 width=103) (actual time=0.544..0.545 rows=0 loops=1)
         Recheck Cond: ((to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery) AND ((language)::text = 'en'::text))
         ->  Bitmap Index Scan on ix_en_document_tsvector_text  (cost=0.00..64.00 rows=1 width=0) (actual time=0.540..0.540 rows=0 loops=1)
               Index Cond: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery)
 Planning Time: 0.244 ms
 Execution Time: 0.590 ms
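Because the predicate is matched at planning time, application code that builds this query has to keep language = 'en' visible as a literal rather than passing it as a bind parameter. One possible way to do that from Python is sketched below. The LANGUAGE_CONFIGS mapping and the build_search_sql helper are hypothetical, and the %s placeholder assumes a psycopg-style driver; only the search terms remain a parameter.

```python
# Hypothetical whitelist: document.language value -> text search configuration.
LANGUAGE_CONFIGS = {"en": "english"}


def build_search_sql(language: str) -> str:
    """Return a query whose WHERE clause repeats the partial index predicate literally."""
    try:
        config = LANGUAGE_CONFIGS[language]
    except KeyError:
        raise ValueError(f"unsupported language: {language!r}") from None
    # Both values come from the whitelist above, never from user input,
    # so embedding them as literals is safe. The search terms stay a parameter.
    return (
        "SELECT text, language "
        "FROM public.document "
        f"WHERE to_tsvector('{config}', text) @@ to_tsquery('{config}', %s) "
        f"AND language = '{language}' "
        "LIMIT 1"
    )


print(build_search_sql("en"))
```

The whitelist keeps the literal safe to embed, and each additional language would need its own partial index with a matching predicate.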

Tsearch full-text search for partial words

The :* operator enables prefix search. It can be useful for running a full-text search while the user is still typing a word.

 >> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & l:*')
      AND language = 'en'
   LIMIT 1;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on document  (cost=168.00..172.27 rows=1 width=102) (actual time=5.207..5.210 rows=4 loops=1)
   Recheck Cond: ((to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''l'':*'::tsquery) AND ((language)::text = 'en'::text))
   Heap Blocks: exact=4
   ->  Bitmap Index Scan on ix_en_document_tsvector_text  (cost=0.00..168.00 rows=1 width=0) (actual time=5.202..5.202 rows=4 loops=1)
         Index Cond: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''l'':*'::tsquery)
 Planning Time: 0.240 ms
 Execution Time: 5.240 ms

>> SELECT id, text
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & l:*')
      AND language = 'en'
   LIMIT 20;
  id   |                                                   text
-------+-----------------------------------------------------------------------------------------------------------
     1 | degree ranch time tall depth men below dead window waste underline bat lamp putting field               +
 20152 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual              +
 21478 | Dried symbol willing design managed shade window pick share faster education drive field land everybody  +
 30293 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic +
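
For search-as-you-type, the tsquery string has to be assembled from whatever the user has typed so far. A minimal sketch, assuming that all words should be ANDed and that only the last, possibly unfinished, word gets the :* prefix operator (to_prefix_tsquery is a hypothetical helper, not part of this repository):

```python
import re


def to_prefix_tsquery(user_input: str) -> str:
    """Build a to_tsquery() string: all words ANDed, the last one as a prefix match."""
    # Keep only runs of letters and digits, so tsquery operators typed by the user
    # (&, |, !, :, parentheses) cannot break the query syntax.
    words = re.findall(r"[^\W_]+", user_input.lower())
    if not words:
        return ""
    words[-1] += ":*"
    return " & ".join(words)


print(to_prefix_tsquery("fielded window l"))  # fielded & window & l:*
```

An empty result means there is nothing to search for yet, so the caller would skip the query in that case.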

Tsearch full-text search results ranking

There are two quite similar functions for ranking tsearch results:

ts_rank, which ranks vectors based on the frequency of their matching lexemes
ts_rank_cd, which computes the "cover density" ranking

More information can be found in the docs.

 >> SELECT
     id,
     ts_rank_cd(to_tsvector('english', text), to_tsquery('english', 'fielded & wind:*')) rank,
     text
    FROM public.document
    WHERE to_tsvector('english', text) @@ to_tsquery('english', 'fielded & wind:*')
    ORDER BY rank DESC
    LIMIT 20;
   id   |    rank     |                                                   text
--------+-------------+-----------------------------------------------------------------------------------------------------------
 100002 |         0.1 | fielded window
   9376 |        0.05 | Own mouse girl effect surprise physical newspaper forgot eat upper field element window simply unhappy   +
  96597 |        0.05 | Opinion fastened pencil rear more theory size window heading field understanding farm up position attack +
  44626 | 0.033333335 | Symbol each halfway window swam spider field page shinning donkey chose until cow cabin congress         +
  80922 | 0.033333335 | Victory famous field shelter girl wind adventure he divide rear tip few studied ruler judge              +
  30293 |       0.025 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic +
      1 | 0.016666668 | degree ranch time tall depth men below dead window waste underline bat lamp putting field               +
  21478 | 0.016666668 | Dried symbol willing design managed shade window pick share faster education drive field land everybody  +
  60059 | 0.016666668 | However hungry make proud kids come willing field officer row above highest round wind mile              +
  26001 | 0.014285714 | Earth earlier pocket might sense window way frog fire court family mouth field somebody recognize        +
  20152 | 0.014285714 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual              +
  37470 |      0.0125 | Farm weight balloon buried wind water donkey grain pig week should damage field was he                   +
  49433 |        0.01 | Wind scientist leaving atom year bad child drink shore spirit field facing indicate wagon here           +
  37851 | 0.007142857 | Field cloud you wife rhythm upward applied weigh continued property replace ahead forgotten trip window  +

The record with text='fielded window' was added manually to show the best-match result.

GiST vs GIN

We created a GIN index, but there is also a GiST index option. Which one is better? It depends...

 >> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test')
      AND language = 'en'
   LIMIT 1;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.28..8.30 rows=1 width=103) (actual time=2.699..2.700 rows=0 loops=1)
   ->  Index Scan using ix_en_document_tsvector_text on document  (cost=0.28..8.30 rows=1 width=103) (actual time=2.697..2.697 rows=0 loops=1)
         Index Cond: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery)
 Planning Time: 0.274 ms
 Execution Time: 2.730 ms

GIN seems to be a bit faster. I don't think I could explain it better than the docs already do:

In choosing which index type to use, GiST or GIN, consider these performance differences:
GIN index lookups are about three times faster than GiST
GIN indexes take about three times longer to build than GiST
GIN indexes are moderately slower to update than GiST indexes, but about 10 times slower if fast-update support was disabled (see Section 58.4.1 for details)
GIN indexes are two-to-three times larger than GiST indexes

Inspirations and help

https://about.gitlab.com/blog/2016/03/18/fast-search-using-postgresql-trigram-indexes/
http://rachbelaid.com/postgres-full-text-search-is-good-enough/
https://scoutapm.com/blog/how-to-make-text-searches-in-postgresql-faster-with-trigram-simility
https://stackoverflow.com/questions/27443950/make-postgres-full-text-search-tsvector-act-like-ilike-tosearch-inside-words
https://stackoverflow.com/questions/46122175/fulltext-search-combined-with-fuzzysearch-in-postgresql
https://stackoverflow.com/questions/58651852/use-postgresql-full-text-search-to-fuzzy-match-all-search-terms
https://stackoverflow.com/questions/52140727/fuzzy-searchin-full-text-search
https://stackoverflow.com/questions/2513501/postgresql-full-text-search-how-to-search-partial-words
https://stackoverflow.com/questions/28975517/diffferenz-between-gist-gin-index
https://dba.stackexchange.com/questions/149765/postgresql-gin-index-not-used-when-ts-query-language-is-feched-from-a-Column
https://dba.stackexchange.com/questions/251177/postgres-full-text-search-on-words-not-lexemes
