postgres full text search download - postgres full text search Download do código fonte

PostGres-Full-Text-Search

Opções de pesquisa de texto completo do PostGres (exemplos de TSEARCH, TRIGRAM, ILIKE).

Crie DB
Pesquisa completa de texto usando ilike simples
Pesquisa completa de texto usando ilike suportado pelo TriGram Index
Crie configuração de idioma não-defensor para pesquisa de texto completo TSEARCH
Pesquisa de texto completo da TSearch sem índice armazenado
Pesquisa de texto completo da TSearch com índice parcial armazenado
Pesquisa completa de texto TSearch para palavras parciais
TSearch Full Text Pesquisa Resultados
GIST vs gin
Inspiração e ajuda

Crie DB

 >> CREATE DATABASE ftdb;

Para alimentar o banco de dados com um exemplo de conjunto de dados ( dataset.txt , 100k linhas, 15 palavras cada), usei o script python init_db.py .

Pesquisa completa de texto usando `ilike` simples

 >> EXPLAIN ANALYZE
   SELECT text , language
   FROM public . document
   WHERE
      text ilike ' %field% '
      AND text ilike ' %window% '
      AND text ilike ' %lamp% '
      AND text ilike ' %research% '
      AND language = ' en '
    LIMIT 1 ;
                                                                  QUERY PLAN
-- --------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost = 0 . 00 .. 3734 . 02 rows = 1 width = 105 ) (actual time = 87 . 473 .. 87 . 474 rows = 0 loops = 1 )
   - >  Seq Scan on document  (cost = 0 . 00 .. 3734 . 02 rows = 1 width = 105 ) (actual time = 87 . 466 .. 87 . 466 rows = 0 loops = 1 )
         Filter: (( text ~~ * ' %field% ' :: text ) AND ( text ~~ * ' %window% ' :: text ) AND ( text ~~ * ' %lamp% ' :: text ) AND ( text ~~ * ' %research% ' :: text ))
         Rows Removed by Filter: 100001
 Planning Time : 2 . 193 ms
 Execution Time : 87 . 500 ms

Pesquisa completa de texto usando `ilike` suportado pelo TriGram Index

O que é um trigrama? Veja este exemplo:

 >> CREATE EXTENSION pg_trgm;
CREATE EXTENSION
>> select show_trgm( ' fielded ' );
                show_trgm
-- ---------------------------------------
 { "  f " , " fi " ,ded, " ed " ,eld,fie,iel,lde}

Podemos melhorar o desempenho ilike usando o índice de trigrama, por exemplo, gin_trgm_ops .

 >> CREATE INDEX  ix_document_text_trigram ON document USING gin ( text gin_trgm_ops) where language = ' en ' ;
CREATE INDEX

>> EXPLAIN ANALYZE SELECT text , language
   FROM public . document
   WHERE
      text ilike ' %field% '
      AND text ilike ' %window% '
      AND text ilike ' %lamp% '
      AND text ilike ' %research% '
      AND language = ' en '
    LIMIT 1 ;
                                                                                       QUERY PLAN
-- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost = 176 . 00 .. 180 . 02 rows = 1 width = 105 ) (actual time = 1 . 473 .. 1 . 474 rows = 0 loops = 1 )
   - >  Bitmap Heap Scan on document  (cost = 176 . 00 .. 180 . 02 rows = 1 width = 105 ) (actual time = 1 . 470 .. 1 . 471 rows = 0 loops = 1 )
         Recheck Cond: (( text ~~ * ' %field% ' :: text ) AND ( text ~~ * ' %window% ' :: text ) AND ( text ~~ * ' %lamp% ' :: text ) AND ( text ~~ * ' %research% ' :: text ) AND ((language):: text = ' en ' :: text ))
         - >  Bitmap Index Scan on ix_document_text_trigram  (cost = 0 . 00 .. 176 . 00 rows = 1 width = 0 ) (actual time = 1 . 466 .. 1 . 466 rows = 0 loops = 1 )
               Index Cond: (( text ~~ * ' %field% ' :: text ) AND ( text ~~ * ' %window% ' :: text ) AND ( text ~~ * ' %lamp% ' :: text ) AND ( text ~~ * ' %research% ' :: text ))
 Planning Time : 2 . 389 ms
 Execution Time : 1 . 524 ms

Crie configuração de idioma não-defensor para pesquisa de texto completo TSEARCH

O Postgres não fornece suporte para muitos idiomas por padrão. No entanto, você pode configurar a configuração com bastante facilidade. Você só precisa de arquivos de dicionário adicionais. Aqui está um exemplo para a linguagem polonesa. Os arquivos de dicionário polonês podem ser baixados de: https://github.com/judehunter/polish-tsearch.

Os arquivos POLNI.Affix, Polh (POLNIP.Stop e Polh (PONLIP.DIG são copiados para PostgreSQL sharedir tsearch_data Localização, por exemplo /usr/share/postgresql/13/tsearch_data . Para determinar seu local Sharedir, você pode usar pg_config --sharedir

Também deve ser criado uma configuração (consulte os documentos) dentro do banco de dados:

 >> DROP TEXT SEARCH DICTIONARY IF EXISTS polish_hunspell CASCADE;
   CREATE TEXT SEARCH DICTIONARY polish_hunspell (
    TEMPLATE  = ispell,
    DictFile  = polish,
    AffFile   = polish,
    StopWords = polish
  );
  CREATE TEXT SEARCH CONFIGURATION public . polish (
    COPY = pg_catalog . english
  );
  ALTER TEXT SEARCH CONFIGURATION polish
    ALTER MAPPING
    FOR
        asciiword, asciihword, hword_asciipart,  word, hword, hword_part
    WITH
        polish_hunspell, simple;

Você precisa desses arquivos e configuração, porque o mecanismo de pesquisa de texto completo usa o Lexeme comparando para encontrar as melhores correspondências (o padrão de consulta e o texto armazenado são lexemizados):

 >> SELECT to_tsquery( ' english ' , ' fielded ' ), to_tsvector( ' english ' , text )
   FROM document
   LIMIT 1 ;
 to_tsquery |                                                                    to_tsvector
-- ----------+----------------------------------------------------------------------------------------------------------------------------------------------------
 ' field '    | ' 19 ' : 16 ' bat ' : 12 ' dead ' : 8 ' degre ' : 1 ' depth ' : 5 ' field ' : 15 ' lamp ' : 13 ' men ' : 6 ' put ' : 14 ' ranch ' : 2 ' tall ' : 4 ' time ' : 3 ' underlin ' : 11 ' wast ' : 10 ' window ' : 9

Se você não puder fornecer arquivos de dicionário, poderá usar o texto completo em forma "simples" (sem transformação em lexeme):

 >> SELECT to_tsquery( ' simple ' , ' fielded ' ), to_tsvector( ' simple ' , text )
   FROM document
   LIMIT 1 ;
 to_tsquery |                                                                             to_tsvector
-- ----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 ' fielded '  | ' 19 ' : 16 ' bat ' : 12 ' below ' : 7 ' dead ' : 8 ' degree ' : 1 ' depth ' : 5 ' field ' : 15 ' lamp ' : 13 ' men ' : 6 ' putting ' : 14 ' ranch ' : 2 ' tall ' : 4 ' time ' : 3 ' underline ' : 11 ' waste ' : 10 ' window ' : 9

Pesquisa de texto completo da TSearch sem índice armazenado

 >> EXPLAIN ANALYZE SELECT text , language
   FROM public . document
   WHERE to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
   LIMIT 1 ;
                                                                                  QUERY PLAN
-- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost = 1000 . 00 .. 18298 . 49 rows = 1 width = 103 ) (actual time = 489 . 802 .. 491 . 352 rows = 0 loops = 1 )
   - >  Gather  (cost = 1000 . 00 .. 18298 . 49 rows = 1 width = 103 ) (actual time = 489 . 800 .. 491 . 349 rows = 0 loops = 1 )
         Workers Planned: 1
         Workers Launched: 1
         - >  Parallel Seq Scan on document  (cost = 0 . 00 .. 17298 . 39 rows = 1 width = 103 ) (actual time = 486 . 644 .. 486 . 644 rows = 0 loops = 2 )
               Filter: (((language):: text = ' en ' :: text ) AND (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery))
               Rows Removed by Filter: 50000
 Planning Time : 0 . 272 ms
 Execution Time : 491 . 376 ms
( 9 rows)

Pesquisa de texto completo da TSearch com índice parcial armazenado

O índice parcial oferece como possibilidade de armazenar registros em diferentes idiomas usando uma tabela e consultá -los de maneira eficaz.

 >> CREATE INDEX ix_en_document_tsvector_text ON public . document USING gin (to_tsvector( ' english ' ::regconfig, text )) WHERE language = ' en ' ;
CREATED INDEX
>> EXPLAIN ANALYZE SELECT text , language
   FROM public . document
   WHERE to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
   LIMIT 1 ;
                                                               QUERY PLAN
-- --------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost = 1000 . 00 .. 18151 . 43 rows = 1 width = 103 ) (actual time = 487 . 120 .. 488 . 569 rows = 0 loops = 1 )
   - >  Gather  (cost = 1000 . 00 .. 18151 . 43 rows = 1 width = 103 ) (actual time = 487 . 117 .. 488 . 567 rows = 0 loops = 1 )
         Workers Planned: 1
         Workers Launched: 1
         - >  Parallel Seq Scan on document  (cost = 0 . 00 .. 17151 . 33 rows = 1 width = 103 ) (actual time = 484 . 418 .. 484 . 419 rows = 0 loops = 2 )
               Filter: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery)
               Rows Removed by Filter: 50000
 Planning Time : 0 . 193 ms
 Execution Time : 488 . 596 ms

Sem diferença? O índice não foi usado ... por que não está funcionando? Ohh, olha para os documentos de índice parcial:

No entanto, lembre -se de que o predicado deve corresponder às condições usadas nas consultas que devem se beneficiar do índice. Para ser preciso, um índice parcial pode ser usado em uma consulta apenas se o sistema puder reconhecer que a condição de onde a consulta implica matematicamente o predicado do índice. O PostgreSQL não possui um provérbio sofisticado do teorema que pode reconhecer expressões matematicamente equivalentes que são escritas de diferentes formas. (Não é apenas um prover do teorema geral que não é extremamente difícil de criar, mas provavelmente seria muito lento para ser de qualquer uso real.) O sistema pode reconhecer implicações simples de desigualdade, por exemplo "x <1" implica "x <2"; Caso contrário, a condição de predicado deve corresponder exatamente a parte da consulta onde a condição ou o índice não será reconhecido como utilizável. A correspondência ocorre no horário de planejamento da consulta, não no tempo de execução. Como resultado, as cláusulas de consulta parametrizadas não funcionam com um índice parcial.

Temos que adicionar à consulta uma condição usada para criar índice parcial: document.language = 'en' :

 >> EXPLAIN ANALYZE SELECT text , language
   FROM public . document
   WHERE
      to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
      AND language = ' en '
   LIMIT 1 ;                                                                           QUERY PLAN
-- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost = 64 . 00 .. 68 . 27 rows = 1 width = 103 ) (actual time = 0 . 546 .. 0 . 548 rows = 0 loops = 1 )
   - >  Bitmap Heap Scan on document  (cost = 64 . 00 .. 68 . 27 rows = 1 width = 103 ) (actual time = 0 . 544 .. 0 . 545 rows = 0 loops = 1 )
         Recheck Cond: ((to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery) AND ((language):: text = ' en ' :: text ))
         - >  Bitmap Index Scan on ix_en_document_tsvector_text  (cost = 0 . 00 .. 64 . 00 rows = 1 width = 0 ) (actual time = 0 . 540 .. 0 . 540 rows = 0 loops = 1 )
               Index Cond: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery)
 Planning Time : 0 . 244 ms
 Execution Time : 0 . 590 ms

Pesquisa completa de texto TSearch para palavras parciais

:* O operador permite a pesquisa de prefixo. Pode ser útil executar a pesquisa completa de texto durante a digitação de uma palavra.

 >> EXPLAIN ANALYZE SELECT text , language
   FROM public . document
   WHERE
      to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & l:* ' )
      AND language = ' en '
   LIMIT 1 ;
                                                                   QUERY PLAN
-- ----------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on document  (cost = 168 . 00 .. 172 . 27 rows = 1 width = 102 ) (actual time = 5 . 207 .. 5 . 210 rows = 4 loops = 1 )
   Recheck Cond: ((to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' l ' ' :* ' ::tsquery) AND ((language):: text = ' en ' :: text ))
   Heap Blocks: exact = 4
   - >  Bitmap Index Scan on ix_en_document_tsvector_text  (cost = 0 . 00 .. 168 . 00 rows = 1 width = 0 ) (actual time = 5 . 202 .. 5 . 202 rows = 4 loops = 1 )
         Index Cond: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' l ' ' :* ' ::tsquery)
 Planning Time : 0 . 240 ms
 Execution Time : 5 . 240 ms

>> SELECT id,  text
   FROM public . document
   WHERE
      to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & l:* ' )
      AND language = ' en '
   LIMIT 20 ;
  id   |                                                   text
-- -----+-----------------------------------------------------------------------------------------------------------
     1 | degree ranch time tall depth men below dead window waste underline bat lamp putting field               +
 20152 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual              +
 21478 | Dried symbol willing design managed shade window pick share faster education drive field land everybody  +
 30293 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic +

TSearch Full Text Pesquisa Resultados

Existem duas funções bastante semelhantes para classificar os resultados da TSearch:

ts_rank , que classifica os vetores com base na frequência de seus lexemes correspondentes
ts_rank_cd , que calcula a classificação da "densidade da capa"

Para mais informações, consulte os documentos

 >> SELECT
     id,
     ts_rank_cd(to_tsvector( ' english ' , text ), to_tsquery( ' english ' , ' fielded & wind:* ' )) rank,
     text
    FROM public . document
    WHERE to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & wind:* ' )
    ORDER BY rank DESC
    LIMIT 20 ;
   id   |    rank     |                                                   text
-- ------+-------------+-----------------------------------------------------------------------------------------------------------
 100002 |         0 . 1 | fielded window
   9376 |        0 . 05 | Own mouse girl effect surprise physical newspaper forgot eat upper field element window simply unhappy   +
  96597 |        0 . 05 | Opinion fastened pencil rear more theory size window heading field understanding farm up position attack +
  44626 | 0 . 033333335 | Symbol each halfway window swam spider field page shinning donkey chose until cow cabin congress         +
  80922 | 0 . 033333335 | Victory famous field shelter girl wind adventure he divide rear tip few studied ruler judge              +
  30293 |       0 . 025 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic +
      1 | 0 . 016666668 | degree ranch time tall depth men below dead window waste underline bat lamp putting field               +
  21478 | 0 . 016666668 | Dried symbol willing design managed shade window pick share faster education drive field land everybody  +
  60059 | 0 . 016666668 | However hungry make proud kids come willing field officer row above highest round wind mile              +
  26001 | 0 . 014285714 | Earth earlier pocket might sense window way frog fire court family mouth field somebody recognize        +
  20152 | 0 . 014285714 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual              +
  37470 |      0 . 0125 | Farm weight balloon buried wind water donkey grain pig week should damage field was he                   +
  49433 |        0 . 01 | Wind scientist leaving atom year bad child drink shore spirit field facing indicate wagon here           +
  37851 | 0 . 007142857 | Field cloud you wife rhythm upward applied weigh continued property replace ahead forgotten trip window  +

text='fielded window' O registro foi adicionado manualmente para mostrar o melhor resultado da partida.

GIST vs gin

Nós criamos o Gin Index. Mas também há opção de índice GIST. Qual é melhor? Depende ...

 >> EXPLAIN ANALYZE SELECT text , language
   FROM public . document
   WHERE
      to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
      AND language = ' en '
   LIMIT 1 ;
                                                                  QUERY PLAN
-- ---------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost = 0 . 28 .. 8 . 30 rows = 1 width = 103 ) (actual time = 2 . 699 .. 2 . 700 rows = 0 loops = 1 )
   - >  Index Scan using ix_en_document_tsvector_text on document  (cost = 0 . 28 .. 8 . 30 rows = 1 width = 103 ) (actual time = 2 . 697 .. 2 . 697 rows = 0 loops = 1 )
         Index Cond: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery)
 Planning Time : 0 . 274 ms
 Execution Time : 2 . 730 ms

Gin parece ser um pouco mais rápido. Acho que não poderia explicar melhor do que os documentos já fazem:

Ao escolher qual tipo de índice usar, GIST ou GIN, considere essas diferenças de desempenho:
As pesquisas de índice de gin são cerca de três vezes mais rápido que a essência
Os índices de gin levam cerca de três vezes mais para serem construídos do que a essência
Os índices de gin são moderadamente mais lentos para atualizar do que os índices GIST, mas cerca de 10 vezes mais lentamente se o suporte de atualização rápida foi desativada (consulte a Seção 58.4.1 para obter detalhes)
Os índices de gin são duas a três vezes maiores que os índices GIST

Inspiração e ajuda

https://about.gitlab.com/blog/2016/03/18/fast-search-using-postgresql-tigram-indexes/
http://rachbelaid.com/postgres-full-text-search-is-good-enough/
https://scoutapm.com/blog/how-to-make-text-searches-inposgresql-faster-with-tigram-similaridade
https://stackoverflow.com/questions/27443950/make-posgres-full-text-search-tsvector-act-tipo-ilike-to-search-side-words
https://stackoverflow.com/questions/46122175/fulltext-search-combined-with-fuzzysearch-inposgresql
https://stackoverflow.com/questions/58651852/useposgresql-full-text-search-to-fuzzy-satch-l-search-terms
https://stackoverflow.com/questions/52140727/fuzzy-search-in-full-text-search
https://stackoverflow.com/questions/2513501/postgresql-ffull-text-search-ow-to-search-partial-words
https://stackoverflow.com/questions/28975517/difference-betwen-gist-and-gin-index
https://dba.stackexchange.com/questions/149765/postgresql-gin-index-not-ususe-when-ts-query-language-is-fetched-from-a-coluna
https://dba.stackexchange.com/questions/251177/postgres-full-text-search-on-words-not-lelexemes

Expandir

postgres full text search