postgres full text search

Postgres full-text search options (tsearch, trigram, ilike), with examples.

Create db
Full-text search using simple ilike
Full-text search using ilike supported by a trigram index
Create an immutable language configuration for tsearch full-text search
tsearch full-text search without a stored index
Full-text search with a stored partial index
tsearch full-text search for partial words
Ranking full-text search results
GiST vs GIN
Inspirations and help

Create db

>> CREATE DATABASE ftdb;

To feed the db with a sample dataset (dataset.txt, 100k rows, 15 words each), I used the Python script init_db.py.
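The loading script itself is not shown here, so the following is only a minimal sketch of one way to prepare the dataset for import. The table schema is an assumption (the queries in this README only reveal the id, text, and language columns), and `dataset_to_csv` is a hypothetical helper, not a function from init_db.py.

```python
import csv
import io

# Assumed schema: the README's queries only show that the table has
# id, text and language columns; the exact types here are a guess.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS document (
    id       serial PRIMARY KEY,
    text     text NOT NULL,
    language varchar(2) NOT NULL
);
"""


def dataset_to_csv(lines, language="en"):
    """Turn dataset lines into CSV rows (text, language), skipping blank lines."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in lines:
        text = line.strip()
        if text:
            writer.writerow([text, language])
    return buffer.getvalue()


# Tiny in-memory sample standing in for the lines of dataset.txt.
sample = ["degree ranch time tall depth\n", "\n", "Law pony follow memory star\n"]
print(dataset_to_csv(sample), end="")
```

The resulting CSV could then be loaded with psql, for example `\copy document (text, language) FROM 'dataset.csv' WITH (FORMAT csv)`, where dataset.csv is a hypothetical file name.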

Full-text search using simple ilike

>> EXPLAIN ANALYZE
   SELECT text, language
   FROM public.document
   WHERE
      text ilike '%field%'
      AND text ilike '%window%'
      AND text ilike '%lamp%'
      AND text ilike '%research%'
      AND language = 'en'
   LIMIT 1;
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3734.02 rows=1 width=105) (actual time=87.473..87.474 rows=0 loops=1)
   ->  Seq Scan on document  (cost=0.00..3734.02 rows=1 width=105) (actual time=87.466..87.466 rows=0 loops=1)
         Filter: ((text ~~* '%field%'::text) AND (text ~~* '%window%'::text) AND (text ~~* '%lamp%'::text) AND (text ~~* '%research%'::text))
         Rows Removed by Filter: 100001
 Planning Time: 2.193 ms
 Execution Time: 87.500 ms

Full-text search using ilike supported by a trigram index

What is a trigram? See this example:

>> CREATE EXTENSION pg_trgm;
CREATE EXTENSION
>> select show_trgm('fielded');
                show_trgm
-----------------------------------------
 {"  f"," fi",ded,"ed ",eld,fie,iel,lde}
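To make the output above less mysterious, here is a rough Python approximation of how pg_trgm derives trigrams: the input is lowercased, split into words on non-alphanumeric characters, and each word is padded with two spaces in front and one behind before every 3-character window is collected. This is a sketch for plain ASCII input only, not the extension's actual implementation.

```python
import re


def show_trgm(text):
    """Approximate pg_trgm's show_trgm for plain ASCII input."""
    trigrams = set()
    # Lowercase the input and treat non-alphanumeric characters as word breaks.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two spaces before, one after each word
        for i in range(len(padded) - 2):
            trigrams.add(padded[i:i + 3])
    return sorted(trigrams)


print(show_trgm("fielded"))
# ['  f', ' fi', 'ded', 'ed ', 'eld', 'fie', 'iel', 'lde']
```

A pattern such as '%field%' contains the trigrams fie, iel, and eld, which is what lets a trigram index narrow down candidate rows before the pattern is rechecked.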

We can improve ilike performance with a trigram index, e.g. gin_trgm_ops.

>> CREATE INDEX ix_document_text_trigram ON document USING gin (text gin_trgm_ops) where language = 'en';
CREATE INDEX

>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      text ilike '%field%'
      AND text ilike '%window%'
      AND text ilike '%lamp%'
      AND text ilike '%research%'
      AND language = 'en'
   LIMIT 1;
                                                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=176.00..180.02 rows=1 width=105) (actual time=1.473..1.474 rows=0 loops=1)
   ->  Bitmap Heap Scan on document  (cost=176.00..180.02 rows=1 width=105) (actual time=1.470..1.471 rows=0 loops=1)
         Recheck Cond: ((text ~~* '%field%'::text) AND (text ~~* '%window%'::text) AND (text ~~* '%lamp%'::text) AND (text ~~* '%research%'::text) AND ((language)::text = 'en'::text))
         ->  Bitmap Index Scan on ix_document_text_trigram  (cost=0.00..176.00 rows=1 width=0) (actual time=1.466..1.466 rows=0 loops=1)
               Index Cond: ((text ~~* '%field%'::text) AND (text ~~* '%window%'::text) AND (text ~~* '%lamp%'::text) AND (text ~~* '%research%'::text))
 Planning Time: 2.389 ms
 Execution Time: 1.524 ms

Create an immutable language configuration for tsearch full-text search

Postgres does not support many languages out of the box. However, setting up a configuration is fairly easy: you only need additional dictionary files. Here is an example for Polish. The Polish dictionary files can be downloaded from: https://github.com/judehunter/polish-tsearch

The polish.affix, polish.stop, and polish.dict files should be copied to the tsearch_data directory under your PostgreSQL sharedir, e.g. /usr/share/postgresql/13/tsearch_data. To find your sharedir location, you can use pg_config --sharedir.

The configuration (see the docs) must be created inside the database:

>> DROP TEXT SEARCH DICTIONARY IF EXISTS polish_hunspell CASCADE;
   CREATE TEXT SEARCH DICTIONARY polish_hunspell (
     TEMPLATE  = ispell,
     DictFile  = polish,
     AffFile   = polish,
     StopWords = polish
   );
   CREATE TEXT SEARCH CONFIGURATION public.polish (
     COPY = pg_catalog.english
   );
   ALTER TEXT SEARCH CONFIGURATION polish
     ALTER MAPPING
     FOR
         asciiword, asciihword, hword_asciipart, word, hword, hword_part
     WITH
         polish_hunspell, simple;

You need these files and this configuration because the full-text search engine compares lexemes to find the best matches (both the query pattern and the stored text are converted to lexemes):

>> SELECT to_tsquery('english', 'fielded'), to_tsvector('english', text)
   FROM document
   LIMIT 1;
 to_tsquery |                                                                    to_tsvector
------------+----------------------------------------------------------------------------------------------------------------------------------------------------
 'field'    | '19':16 'bat':12 'dead':8 'degre':1 'depth':5 'field':15 'lamp':13 'men':6 'put':14 'ranch':2 'tall':4 'time':3 'underlin':11 'wast':10 'window':9

If you cannot provide dictionary files, you can use full-text search in its "simple" form (without conversion to lexemes):

>> SELECT to_tsquery('simple', 'fielded'), to_tsvector('simple', text)
   FROM document
   LIMIT 1;
 to_tsquery |                                                                             to_tsvector
------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 'fielded'  | '19':16 'bat':12 'below':7 'dead':8 'degree':1 'depth':5 'field':15 'lamp':13 'men':6 'putting':14 'ranch':2 'tall':4 'time':3 'underline':11 'waste':10 'window':9

tsearch full-text search without a stored index

>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test')
   LIMIT 1;
                                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1000.00..18298.49 rows=1 width=103) (actual time=489.802..491.352 rows=0 loops=1)
   ->  Gather  (cost=1000.00..18298.49 rows=1 width=103) (actual time=489.800..491.349 rows=0 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Parallel Seq Scan on document  (cost=0.00..17298.39 rows=1 width=103) (actual time=486.644..486.644 rows=0 loops=2)
               Filter: (((language)::text = 'en'::text) AND (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery))
               Rows Removed by Filter: 50000
 Planning Time: 0.272 ms
 Execution Time: 491.376 ms
(9 rows)

Full-text search with a stored partial index

A partial index makes it possible to store records in different languages in a single table and still query them efficiently.

>> CREATE INDEX ix_en_document_tsvector_text ON public.document USING gin (to_tsvector('english'::regconfig, text)) WHERE language = 'en';
CREATE INDEX
>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test')
   LIMIT 1;
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1000.00..18151.43 rows=1 width=103) (actual time=487.120..488.569 rows=0 loops=1)
   ->  Gather  (cost=1000.00..18151.43 rows=1 width=103) (actual time=487.117..488.567 rows=0 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Parallel Seq Scan on document  (cost=0.00..17151.33 rows=1 width=103) (actual time=484.418..484.419 rows=0 loops=2)
               Filter: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery)
               Rows Removed by Filter: 50000
 Planning Time: 0.193 ms
 Execution Time: 488.596 ms

No difference? The index is still not used... Why doesn't it work? Oh, look at the partial index documentation:

However, keep in mind that the predicate must match the conditions used in the queries that are supposed to benefit from the index. To be precise, a partial index can be used in a query only if the system can recognize that the WHERE condition of the query mathematically implies the predicate of the index. PostgreSQL does not have a sophisticated theorem prover that can recognize mathematically equivalent expressions that are written in different forms. (Not only is such a general theorem prover extremely difficult to create, it would probably be too slow to be of any real use.) The system can recognize simple inequality implications, for example "x < 1" implies "x < 2"; otherwise the predicate condition must exactly match part of the query's WHERE condition or the index will not be recognized as usable. Matching takes place at query planning time, not at run time. As a result, parameterized query clauses do not work with a partial index.

We have to add the condition that was used to create the partial index, document.language = 'en':

>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test')
      AND language = 'en'
   LIMIT 1;
                                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=64.00..68.27 rows=1 width=103) (actual time=0.546..0.548 rows=0 loops=1)
   ->  Bitmap Heap Scan on document  (cost=64.00..68.27 rows=1 width=103) (actual time=0.544..0.545 rows=0 loops=1)
         Recheck Cond: ((to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery) AND ((language)::text = 'en'::text))
         ->  Bitmap Index Scan on ix_en_document_tsvector_text  (cost=0.00..64.00 rows=1 width=0) (actual time=0.540..0.540 rows=0 loops=1)
               Index Cond: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery)
 Planning Time: 0.244 ms
 Execution Time: 0.590 ms
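Because the planner matches the index predicate at planning time, application code should make sure the language condition appears as a literal in the query text. Depending on how a driver binds parameters, a parameterized language value may keep the planner from proving the predicate. Below is a minimal sketch of one way to do that. `build_search_query` and `LANGUAGE_CONFIGS` are hypothetical names, the `%s` placeholder assumes a DB-API style driver, and the 'pl' entry is an assumption based on the Polish configuration created earlier.

```python
# Whitelist of language codes and their text search configurations.
# 'en' -> 'english' matches the partial index above; 'pl' -> 'polish' is an
# assumed second entry based on the Polish configuration created earlier.
LANGUAGE_CONFIGS = {"en": "english", "pl": "polish"}


def build_search_query(language):
    """Return SQL whose language predicate is a literal, so the planner can
    match it against a partial index predicate such as language = 'en'."""
    if language not in LANGUAGE_CONFIGS:
        raise ValueError(f"unsupported language: {language!r}")
    config = LANGUAGE_CONFIGS[language]
    # Both values come from the whitelist, never from user input, so
    # embedding them in the SQL text is safe. Only the tsquery is a parameter.
    return (
        "SELECT text, language FROM public.document "
        f"WHERE to_tsvector('{config}', text) @@ to_tsquery('{config}', %s) "
        f"AND language = '{language}' "
        "LIMIT 1"
    )


print(build_search_query("en"))
```

Only the search terms stay parameterized; the configuration name and the language code are baked into the statement, which keeps the to_tsvector expression and the predicate identical to the ones in the index definition.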

tsearch full-text search for partial words

The :* operator enables prefix search. It is useful for running a full-text search while the user is still typing a word.

>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & l:*')
      AND language = 'en'
   LIMIT 1;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on document  (cost=168.00..172.27 rows=1 width=102) (actual time=5.207..5.210 rows=4 loops=1)
   Recheck Cond: ((to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''l'':*'::tsquery) AND ((language)::text = 'en'::text))
   Heap Blocks: exact=4
   ->  Bitmap Index Scan on ix_en_document_tsvector_text  (cost=0.00..168.00 rows=1 width=0) (actual time=5.202..5.202 rows=4 loops=1)
         Index Cond: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''l'':*'::tsquery)
 Planning Time: 0.240 ms
 Execution Time: 5.240 ms

>> SELECT id, text
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & l:*')
      AND language = 'en'
   LIMIT 20;
  id   |                                                   text
-------+-----------------------------------------------------------------------------------------------------------
     1 | degree ranch time tall depth men below dead window waste underline bat lamp putting field               +
 20152 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual              +
 21478 | Dried symbol willing design managed shade window pick share faster education drive field land everybody  +
 30293 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic +
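For search-as-you-type, the application has to turn raw user input into a tsquery string like the one used above. The following is a minimal sketch of that step. `to_prefix_tsquery` is a hypothetical helper, and treating only the last word as a prefix is a design assumption, not something this README prescribes.

```python
import re


def to_prefix_tsquery(user_input):
    """Build a to_tsquery string: all words ANDed, the last one matched as a prefix."""
    # Keep only letters and digits so tsquery operators typed by the user
    # (&, |, !, :, parentheses, quotes) cannot break the query syntax.
    words = re.findall(r"[^\W_]+", user_input.lower())
    if not words:
        return ""
    words[-1] += ":*"  # the word still being typed is matched as a prefix
    return " & ".join(words)


print(to_prefix_tsquery("fielded window l"))  # fielded & window & l:*
```

The returned string is meant to be passed as the parameter of to_tsquery('english', ...). An empty result means there is nothing to search for, so the caller would skip the query in that case.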

Ranking full-text search results

There are two fairly similar functions for ranking tsearch results:

ts_rank, which ranks vectors based on the frequency of their matching lexemes
ts_rank_cd, which computes the "cover density" ranking

For more information, see the docs.

>> SELECT
     id,
     ts_rank_cd(to_tsvector('english', text), to_tsquery('english', 'fielded & wind:*')) rank,
     text
   FROM public.document
   WHERE to_tsvector('english', text) @@ to_tsquery('english', 'fielded & wind:*')
   ORDER BY rank DESC
   LIMIT 20;
   id   |    rank     |                                                   text
--------+-------------+-----------------------------------------------------------------------------------------------------------
 100002 |         0.1 | fielded window
   9376 |        0.05 | Own mouse girl effect surprise physical newspaper forgot eat upper field element window simply unhappy   +
  96597 |        0.05 | Opinion fastened pencil rear more theory size window heading field understanding farm up position attack +
  44626 | 0.033333335 | Symbol each halfway window swam spider field page shinning donkey chose until cow cabin congress         +
  80922 | 0.033333335 | Victory famous field shelter girl wind adventure he divide rear tip few studied ruler judge              +
  30293 |       0.025 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic +
      1 | 0.016666668 | degree ranch time tall depth men below dead window waste underline bat lamp putting field               +
  21478 | 0.016666668 | Dried symbol willing design managed shade window pick share faster education drive field land everybody  +
  60059 | 0.016666668 | However hungry make proud kids come willing field officer row above highest round wind mile              +
  26001 | 0.014285714 | Earth earlier pocket might sense window way frog fire court family mouth field somebody recognize        +
  20152 | 0.014285714 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual              +
  37470 |      0.0125 | Farm weight balloon buried wind water donkey grain pig week should damage field was he                   +
  49433 |        0.01 | Wind scientist leaving atom year bad child drink shore spirit field facing indicate wagon here           +
  37851 | 0.007142857 | Field cloud you wife rhythm upward applied weigh continued property replace ahead forgotten trip window  +

The record with text='fielded window' was added manually to show the best possible match.

GiST vs GIN

We created a GIN index, but there is also the GiST index option. Which one is better? It depends...

>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test')
      AND language = 'en'
   LIMIT 1;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.28..8.30 rows=1 width=103) (actual time=2.699..2.700 rows=0 loops=1)
   ->  Index Scan using ix_en_document_tsvector_text on document  (cost=0.28..8.30 rows=1 width=103) (actual time=2.697..2.697 rows=0 loops=1)
         Index Cond: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery)
 Planning Time: 0.274 ms
 Execution Time: 2.730 ms

GIN seems to be slightly faster. I don't think I can explain it better than the docs already do:

In choosing which index type to use, GiST or GIN, consider these performance differences:
GIN index lookups are about three times faster than GiST
GIN indexes take about three times longer to build than GiST
GIN indexes are moderately slower to update than GiST indexes, but about 10 times slower if fast-update support was disabled (see Section 58.4.1 for details)
GIN indexes are two-to-three times larger than GiST indexes

Inspirations and help

https://about.gitlab.com/blog/2016/03/18/fast-search-using-postgresql-trigram-indexes/
http://rachbelaid.com/postgres-full-text-search-is-good-enough/
https://scoutapm.com/blog/how-to-make-text-searches-in-postgresql-faster-with-trigram-similarity
https://stackoverflow.com/questions/27443950/make-postgres-full-text-search-tsverector-act-like-ilike-to-search-inside-words
https://stackoverflow.com/questions/46122175/fulltext-search-combined-with-fuzzysearch-in-postgresql
https://stackoverflow.com/questions/58651852/use-postgresql-full-text-search-to-fuzzy-match-all-search-erms
https://stackoverflow.com/questions/52140727/fuzzy-search-in-full-text-search
https://stackoverflow.com/questions/2513501/postgresql-full-text-search-how-to-to-to-search-partial-words
https://stackoverflow.com/questions/28975517/difference-between-gist-and-gin-index
https://dba.stackexchange.com/questions/149765/postgresql-gin-index-not-used-wen-ts-query-language-is-fetched-from-a-column
https://dba.stackexchange.com/questions/251177/postgres-full-text-search-on-words-not-lexemes
