postgres full text search 다운로드 - postgres full text search 소스 코드 다운로드

Postgres-full-text search

Postgres 전문 검색 옵션 (Tsearch, Trigram, Ilike) 예제.

DB 생성
간단한 ilike 사용한 전문 검색
Trigram Index에서 지원하는 ilike 사용한 전체 텍스트 검색
TSearch Full Text 검색을위한 비 기본 언어 구성을 만듭니다
저장된 인덱스가없는 Tsearch Full Text 검색
저장된 부분 색인이있는 Tsearch Full 텍스트 검색
tsearch 부분 단어에 대한 전체 텍스트 검색
Tsearch Full 텍스트 검색 결과 순위
요지 대 진
영감과 도움

DB 생성

 >> CREATE DATABASE ftdb;

예제 DataSet ( dataset.txt , 100K 행, 각각 15 단어)으로 DB를 공급하려면 Python init_db.py 스크립트를 사용했습니다.

간단한 `ilike` 사용한 전문 검색

 >> EXPLAIN ANALYZE
   SELECT text , language
   FROM public . document
   WHERE
      text ilike ' %field% '
      AND text ilike ' %window% '
      AND text ilike ' %lamp% '
      AND text ilike ' %research% '
      AND language = ' en '
    LIMIT 1 ;
                                                                  QUERY PLAN
-- --------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost = 0 . 00 .. 3734 . 02 rows = 1 width = 105 ) (actual time = 87 . 473 .. 87 . 474 rows = 0 loops = 1 )
   - >  Seq Scan on document  (cost = 0 . 00 .. 3734 . 02 rows = 1 width = 105 ) (actual time = 87 . 466 .. 87 . 466 rows = 0 loops = 1 )
         Filter: (( text ~~ * ' %field% ' :: text ) AND ( text ~~ * ' %window% ' :: text ) AND ( text ~~ * ' %lamp% ' :: text ) AND ( text ~~ * ' %research% ' :: text ))
         Rows Removed by Filter: 100001
 Planning Time : 2 . 193 ms
 Execution Time : 87 . 500 ms

Trigram Index에서 지원하는 `ilike` 사용한 전체 텍스트 검색

트리 그램이란 무엇입니까? 이 예를 참조하십시오 :

 >> CREATE EXTENSION pg_trgm;
CREATE EXTENSION
>> select show_trgm( ' fielded ' );
                show_trgm
-- ---------------------------------------
 { "  f " , " fi " ,ded, " ed " ,eld,fie,iel,lde}

Trigram Index (예 : gin_trgm_ops 를 사용하여 ilike 성능을 향상시킬 수 있습니다.

 >> CREATE INDEX  ix_document_text_trigram ON document USING gin ( text gin_trgm_ops) where language = ' en ' ;
CREATE INDEX

>> EXPLAIN ANALYZE SELECT text , language
   FROM public . document
   WHERE
      text ilike ' %field% '
      AND text ilike ' %window% '
      AND text ilike ' %lamp% '
      AND text ilike ' %research% '
      AND language = ' en '
    LIMIT 1 ;
                                                                                       QUERY PLAN
-- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost = 176 . 00 .. 180 . 02 rows = 1 width = 105 ) (actual time = 1 . 473 .. 1 . 474 rows = 0 loops = 1 )
   - >  Bitmap Heap Scan on document  (cost = 176 . 00 .. 180 . 02 rows = 1 width = 105 ) (actual time = 1 . 470 .. 1 . 471 rows = 0 loops = 1 )
         Recheck Cond: (( text ~~ * ' %field% ' :: text ) AND ( text ~~ * ' %window% ' :: text ) AND ( text ~~ * ' %lamp% ' :: text ) AND ( text ~~ * ' %research% ' :: text ) AND ((language):: text = ' en ' :: text ))
         - >  Bitmap Index Scan on ix_document_text_trigram  (cost = 0 . 00 .. 176 . 00 rows = 1 width = 0 ) (actual time = 1 . 466 .. 1 . 466 rows = 0 loops = 1 )
               Index Cond: (( text ~~ * ' %field% ' :: text ) AND ( text ~~ * ' %window% ' :: text ) AND ( text ~~ * ' %lamp% ' :: text ) AND ( text ~~ * ' %research% ' :: text ))
 Planning Time : 2 . 389 ms
 Execution Time : 1 . 524 ms

TSearch Full Text 검색을위한 비 기본 언어 구성을 만듭니다

Postgres는 기본적으로 많은 언어를 지원하지 않습니다. 그러나 구성을 아주 쉽게 설정할 수 있습니다. 추가 사전 파일 만 있으면됩니다. 다음은 폴란드어 언어의 예입니다. 광택 사전 파일은 https://github.com/judehunter/polish-tearch에서 다운로드 할 수 있습니다.

Polish.affix, Polish.stop 및 Polish.dict 파일은 PostgreSQL Sharedir tsearch_data 위치 (예 : /usr/share/postgresql/13/tsearch_data 에 복사해야합니다. Sharedir 위치를 결정하려면 pg_config --sharedir 사용할 수 있습니다.

데이터베이스 내부에 구성 (문서 참조)도 작성해야합니다.

 >> DROP TEXT SEARCH DICTIONARY IF EXISTS polish_hunspell CASCADE;
   CREATE TEXT SEARCH DICTIONARY polish_hunspell (
    TEMPLATE  = ispell,
    DictFile  = polish,
    AffFile   = polish,
    StopWords = polish
  );
  CREATE TEXT SEARCH CONFIGURATION public . polish (
    COPY = pg_catalog . english
  );
  ALTER TEXT SEARCH CONFIGURATION polish
    ALTER MAPPING
    FOR
        asciiword, asciihword, hword_asciipart,  word, hword, hword_part
    WITH
        polish_hunspell, simple;

전체 텍스트 검색 엔진이 Lexeme을 사용하여 최상의 일치를 찾는 데 비교하기 때문에 이러한 파일과 구성이 필요합니다 (쿼리 패턴과 저장된 텍스트는 모두 Lexemized입니다.

 >> SELECT to_tsquery( ' english ' , ' fielded ' ), to_tsvector( ' english ' , text )
   FROM document
   LIMIT 1 ;
 to_tsquery |                                                                    to_tsvector
-- ----------+----------------------------------------------------------------------------------------------------------------------------------------------------
 ' field '    | ' 19 ' : 16 ' bat ' : 12 ' dead ' : 8 ' degre ' : 1 ' depth ' : 5 ' field ' : 15 ' lamp ' : 13 ' men ' : 6 ' put ' : 14 ' ranch ' : 2 ' tall ' : 4 ' time ' : 3 ' underlin ' : 11 ' wast ' : 10 ' window ' : 9

사전 파일을 제공 할 수없는 경우 "간단한"양식으로 전체 텍스트를 사용할 수 있습니다 (Lexeme으로 변환없이).

 >> SELECT to_tsquery( ' simple ' , ' fielded ' ), to_tsvector( ' simple ' , text )
   FROM document
   LIMIT 1 ;
 to_tsquery |                                                                             to_tsvector
-- ----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 ' fielded '  | ' 19 ' : 16 ' bat ' : 12 ' below ' : 7 ' dead ' : 8 ' degree ' : 1 ' depth ' : 5 ' field ' : 15 ' lamp ' : 13 ' men ' : 6 ' putting ' : 14 ' ranch ' : 2 ' tall ' : 4 ' time ' : 3 ' underline ' : 11 ' waste ' : 10 ' window ' : 9

저장된 인덱스가없는 Tsearch Full Text 검색

 >> EXPLAIN ANALYZE SELECT text , language
   FROM public . document
   WHERE to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
   LIMIT 1 ;
                                                                                  QUERY PLAN
-- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost = 1000 . 00 .. 18298 . 49 rows = 1 width = 103 ) (actual time = 489 . 802 .. 491 . 352 rows = 0 loops = 1 )
   - >  Gather  (cost = 1000 . 00 .. 18298 . 49 rows = 1 width = 103 ) (actual time = 489 . 800 .. 491 . 349 rows = 0 loops = 1 )
         Workers Planned: 1
         Workers Launched: 1
         - >  Parallel Seq Scan on document  (cost = 0 . 00 .. 17298 . 39 rows = 1 width = 103 ) (actual time = 486 . 644 .. 486 . 644 rows = 0 loops = 2 )
               Filter: (((language):: text = ' en ' :: text ) AND (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery))
               Rows Removed by Filter: 50000
 Planning Time : 0 . 272 ms
 Execution Time : 491 . 376 ms
( 9 rows)

저장된 부분 색인이있는 Tsearch Full 텍스트 검색

부분 색인은 한 테이블을 사용하여 다른 언어로 레코드를 저장하고 효과적으로 쿼리 할 수 있습니다.

 >> CREATE INDEX ix_en_document_tsvector_text ON public . document USING gin (to_tsvector( ' english ' ::regconfig, text )) WHERE language = ' en ' ;
CREATED INDEX
>> EXPLAIN ANALYZE SELECT text , language
   FROM public . document
   WHERE to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
   LIMIT 1 ;
                                                               QUERY PLAN
-- --------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost = 1000 . 00 .. 18151 . 43 rows = 1 width = 103 ) (actual time = 487 . 120 .. 488 . 569 rows = 0 loops = 1 )
   - >  Gather  (cost = 1000 . 00 .. 18151 . 43 rows = 1 width = 103 ) (actual time = 487 . 117 .. 488 . 567 rows = 0 loops = 1 )
         Workers Planned: 1
         Workers Launched: 1
         - >  Parallel Seq Scan on document  (cost = 0 . 00 .. 17151 . 33 rows = 1 width = 103 ) (actual time = 484 . 418 .. 484 . 419 rows = 0 loops = 2 )
               Filter: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery)
               Rows Removed by Filter: 50000
 Planning Time : 0 . 193 ms
 Execution Time : 488 . 596 ms

차이가 없습니까? 인덱스가 사용되지 않았습니다 ... 왜 작동하지 않습니까? 오, 부분 색인 문서를 살펴 봅니다.

그러나 술어는 색인의 혜택을받을 수있는 쿼리에 사용 된 조건과 일치해야합니다. 정확하게 말하면, 쿼리의 위치가 수학적으로 인덱스의 술어를 암시한다는 것을 인식 할 수있는 경우에만 쿼리에서 부분 색인을 사용할 수 있습니다. PostgreSQL에는 다른 형태로 작성된 수학적으로 동등한 표현을 인식 할 수있는 정교한 정리 속담이 없습니다. (이러한 일반적인 이론 속도는 생성하기가 매우 어려울뿐만 아니라 실제 사용하기에는 너무 느리게 아마도 느리게 될 것입니다.) 시스템은 "x <1"과 같은 간단한 불평등 함의를 인식 할 수 있습니다. 예를 들어 "x <2"를 의미합니다. 그렇지 않으면 술어 조건은 조건 또는 인덱스가 사용 가능한 것으로 인식되지 않는 쿼리의 일부와 정확히 일치해야합니다. 일치하는 것은 실행 시간이 아닌 쿼리 계획 시간에서 이루어집니다. 결과적으로, 매개 변수화 된 쿼리 클로는 부분 색인과 함께 작동하지 않습니다.

부분 색인을 생성하는 데 사용 된 조건을 쿼리하기 위해 추가해야합니다. document.language = 'en' :

 >> EXPLAIN ANALYZE SELECT text , language
   FROM public . document
   WHERE
      to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
      AND language = ' en '
   LIMIT 1 ;                                                                           QUERY PLAN
-- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost = 64 . 00 .. 68 . 27 rows = 1 width = 103 ) (actual time = 0 . 546 .. 0 . 548 rows = 0 loops = 1 )
   - >  Bitmap Heap Scan on document  (cost = 64 . 00 .. 68 . 27 rows = 1 width = 103 ) (actual time = 0 . 544 .. 0 . 545 rows = 0 loops = 1 )
         Recheck Cond: ((to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery) AND ((language):: text = ' en ' :: text ))
         - >  Bitmap Index Scan on ix_en_document_tsvector_text  (cost = 0 . 00 .. 64 . 00 rows = 1 width = 0 ) (actual time = 0 . 540 .. 0 . 540 rows = 0 loops = 1 )
               Index Cond: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery)
 Planning Time : 0 . 244 ms
 Execution Time : 0 . 590 ms

tsearch 부분 단어에 대한 전체 텍스트 검색

:* 운영자는 접두사 검색을 활성화합니다. 단어를 입력하는 동안 전체 텍스트 검색을 실행하는 것이 유용 할 수 있습니다.

 >> EXPLAIN ANALYZE SELECT text , language
   FROM public . document
   WHERE
      to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & l:* ' )
      AND language = ' en '
   LIMIT 1 ;
                                                                   QUERY PLAN
-- ----------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on document  (cost = 168 . 00 .. 172 . 27 rows = 1 width = 102 ) (actual time = 5 . 207 .. 5 . 210 rows = 4 loops = 1 )
   Recheck Cond: ((to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' l ' ' :* ' ::tsquery) AND ((language):: text = ' en ' :: text ))
   Heap Blocks: exact = 4
   - >  Bitmap Index Scan on ix_en_document_tsvector_text  (cost = 0 . 00 .. 168 . 00 rows = 1 width = 0 ) (actual time = 5 . 202 .. 5 . 202 rows = 4 loops = 1 )
         Index Cond: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' l ' ' :* ' ::tsquery)
 Planning Time : 0 . 240 ms
 Execution Time : 5 . 240 ms

>> SELECT id,  text
   FROM public . document
   WHERE
      to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & l:* ' )
      AND language = ' en '
   LIMIT 20 ;
  id   |                                                   text
-- -----+-----------------------------------------------------------------------------------------------------------
     1 | degree ranch time tall depth men below dead window waste underline bat lamp putting field               +
 20152 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual              +
 21478 | Dried symbol willing design managed shade window pick share faster education drive field land everybody  +
 30293 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic +

Tsearch Full 텍스트 검색 결과 순위

TSEARCH 결과 순위를 매기는 두 가지 매우 유사한 기능이 있습니다.

ts_rank , 일치하는 Lexemes의 빈도에 따라 벡터 순위
"커버 밀도"순위를 계산하는 ts_rank_cd

자세한 내용은 문서를 참조하십시오

 >> SELECT
     id,
     ts_rank_cd(to_tsvector( ' english ' , text ), to_tsquery( ' english ' , ' fielded & wind:* ' )) rank,
     text
    FROM public . document
    WHERE to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & wind:* ' )
    ORDER BY rank DESC
    LIMIT 20 ;
   id   |    rank     |                                                   text
-- ------+-------------+-----------------------------------------------------------------------------------------------------------
 100002 |         0 . 1 | fielded window
   9376 |        0 . 05 | Own mouse girl effect surprise physical newspaper forgot eat upper field element window simply unhappy   +
  96597 |        0 . 05 | Opinion fastened pencil rear more theory size window heading field understanding farm up position attack +
  44626 | 0 . 033333335 | Symbol each halfway window swam spider field page shinning donkey chose until cow cabin congress         +
  80922 | 0 . 033333335 | Victory famous field shelter girl wind adventure he divide rear tip few studied ruler judge              +
  30293 |       0 . 025 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic +
      1 | 0 . 016666668 | degree ranch time tall depth men below dead window waste underline bat lamp putting field               +
  21478 | 0 . 016666668 | Dried symbol willing design managed shade window pick share faster education drive field land everybody  +
  60059 | 0 . 016666668 | However hungry make proud kids come willing field officer row above highest round wind mile              +
  26001 | 0 . 014285714 | Earth earlier pocket might sense window way frog fire court family mouth field somebody recognize        +
  20152 | 0 . 014285714 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual              +
  37470 |      0 . 0125 | Farm weight balloon buried wind water donkey grain pig week should damage field was he                   +
  49433 |        0 . 01 | Wind scientist leaving atom year bad child drink shore spirit field facing indicate wagon here           +
  37851 | 0 . 007142857 | Field cloud you wife rhythm upward applied weigh continued property replace ahead forgotten trip window  +

text='fielded window' 레코드가 수동으로 추가되어 최상의 일치 결과를 보여줍니다.

요지 대 진

우리는 진 인덱스를 만들었습니다. 그러나 GIST 인덱스 옵션도 있습니다. 어느 것이 더 낫습니까? 의존합니다 ...

 >> EXPLAIN ANALYZE SELECT text , language
   FROM public . document
   WHERE
      to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
      AND language = ' en '
   LIMIT 1 ;
                                                                  QUERY PLAN
-- ---------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost = 0 . 28 .. 8 . 30 rows = 1 width = 103 ) (actual time = 2 . 699 .. 2 . 700 rows = 0 loops = 1 )
   - >  Index Scan using ix_en_document_tsvector_text on document  (cost = 0 . 28 .. 8 . 30 rows = 1 width = 103 ) (actual time = 2 . 697 .. 2 . 697 rows = 0 loops = 1 )
         Index Cond: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery)
 Planning Time : 0 . 274 ms
 Execution Time : 2 . 730 ms

진은 조금 더 빠른 것 같습니다. 나는 이미 문서보다 더 잘 설명 할 수 없다고 생각합니다.

사용할 인덱스 유형, GIST 또는 GIN을 선택할 때 이러한 성능 차이를 고려하십시오.
진 인덱스 조회는 GIST보다 약 3 배 빠릅니다.
진 인덱스는 GIST보다 구축하는 데 약 3 배 더 걸립니다.
GIN 인덱스는 GIST 인덱스보다 업데이트가 적당히 느려지지만 빠른 업데이트 지원이 비활성화 된 경우 약 10 배 느려집니다 (자세한 내용은 섹션 58.4.1 참조).
진 인덱스는 GIST 인덱스보다 2-3 배 더 큽니다

영감과 도움

https://about.gitlab.com/blog/2016/03/18/fast-search-using-postgresql-trigram-indexes/
http://rachbelaid.com/postgres-full-full-text-search-is-good-enough/
https://scoutapm.com/blog/how-to-make-text-searches-in-postgresql-with-tigram-similarity
https://stackoverflow.com/questions/27443950/make-postgres-full-text-search-tvector-like-like-search-inside-words
https://stackoverflow.com/questions/46122175/fulltext-search-combined-fuzzysearch-in-postgresql
https://stackoverflow.com/questions/58651852/use-postgresql-full-text-search-to-fuzzy-all-search-terms
https://stackoverflow.com/questions/52140727/fuzzy-search-in-full-text-search
https://stackoverflow.com/questions/2513501/postgresql-full-text-search-how-tearch-partial-words
https://stackoverflow.com/questions/28975517/difference-between-gist-and-gin-index
https://dba.stackexchange.com/questions/149765/postgresql-gin-index-not-od-us-when-ts-query-language-is-preched-from-a-column
https://dba.stackexchange.com/questions/251177/postgres-full-text-search-on-words-not-lexemes

확장하다

postgres full text search