postgres full text search

Postgres full text search options (tsearch, trigram, ilike) with examples.

Create DB
Full text search using simple ilike
Full text search using ilike supported by a trigram index
Create a non-default language configuration for full text search
Tsearch full text search without a stored index
Tsearch full text search with a stored partial index
Tsearch full text search for partial words
Full text search results ranking
GiST vs GIN
Inspirations and helpful links

Create DB

 >> CREATE DATABASE ftdb;

To feed the DB with an example dataset (dataset.txt, 100k rows, 15 words each) I used the Python script init_db.py.

Full text search using simple `ilike`

 >> EXPLAIN ANALYZE
   SELECT text, language
   FROM public.document
   WHERE
      text ilike '%field%'
      AND text ilike '%window%'
      AND text ilike '%lamp%'
      AND text ilike '%research%'
      AND language = 'en'
    LIMIT 1;
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3734.02 rows=1 width=105) (actual time=87.473..87.474 rows=0 loops=1)
   ->  Seq Scan on document  (cost=0.00..3734.02 rows=1 width=105) (actual time=87.466..87.466 rows=0 loops=1)
         Filter: ((text ~~* '%field%'::text) AND (text ~~* '%window%'::text) AND (text ~~* '%lamp%'::text) AND (text ~~* '%research%'::text))
         Rows Removed by Filter: 100001
 Planning Time: 2.193 ms
 Execution Time: 87.500 ms

Full text search using `ilike` supported by a trigram index

What is a trigram? See this example:

 >> CREATE EXTENSION pg_trgm;
CREATE EXTENSION
>> select show_trgm('fielded');
                show_trgm
-----------------------------------------
 {"  f"," fi",ded,"ed ",eld,fie,iel,lde}
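The trigram set shown above can be approximated in a few lines of Python. This is only a simplified sketch of what pg_trgm does, assuming plain ASCII input: the text is lower-cased, split into alphanumeric words, each word is padded with two leading spaces and one trailing space, and every three-character window is collected. The function name `trigrams` is made up for this illustration.

```python
import re


def trigrams(text):
    """Approximate the trigram set pg_trgm extracts from plain ASCII text."""
    result = set()
    # Non-alphanumeric characters act as word separators.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # Each word is padded with two spaces in front and one behind.
        padded = "  " + word + " "
        # Slide a three-character window over the padded word.
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


print(sorted(trigrams("fielded")))
```

For 'fielded' this yields the same eight trigrams as the show_trgm call above.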

We can improve ilike performance with a trigram index, for example a GIN index built with the gin_trgm_ops operator class.

 >> CREATE INDEX ix_document_text_trigram ON document USING gin (text gin_trgm_ops) WHERE language = 'en';
CREATE INDEX

>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      text ilike '%field%'
      AND text ilike '%window%'
      AND text ilike '%lamp%'
      AND text ilike '%research%'
      AND language = 'en'
    LIMIT 1;
                                                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=176.00..180.02 rows=1 width=105) (actual time=1.473..1.474 rows=0 loops=1)
   ->  Bitmap Heap Scan on document  (cost=176.00..180.02 rows=1 width=105) (actual time=1.470..1.471 rows=0 loops=1)
         Recheck Cond: ((text ~~* '%field%'::text) AND (text ~~* '%window%'::text) AND (text ~~* '%lamp%'::text) AND (text ~~* '%research%'::text) AND ((language)::text = 'en'::text))
         ->  Bitmap Index Scan on ix_document_text_trigram  (cost=0.00..176.00 rows=1 width=0) (actual time=1.466..1.466 rows=0 loops=1)
               Index Cond: ((text ~~* '%field%'::text) AND (text ~~* '%window%'::text) AND (text ~~* '%lamp%'::text) AND (text ~~* '%research%'::text))
 Planning Time: 2.389 ms
 Execution Time: 1.524 ms

Create a non-default language configuration for full text search

Postgres does not support many languages out of the box. However, setting up a configuration is fairly easy; you only need additional dictionary files. Here is an example for the Polish language. The Polish dictionary files can be downloaded from: https://github.com/judehunter/polish-tsearch.

The files polish.affix, polish.stop and polish.dict must be copied to the PostgreSQL shared tsearch_data location, e.g. /usr/share/postgresql/13/tsearch_data. To find your shared location, you can use pg_config --sharedir.

A configuration must also be created (see the docs) inside the database:

 >> DROP TEXT SEARCH DICTIONARY IF EXISTS polish_hunspell CASCADE;
   CREATE TEXT SEARCH DICTIONARY polish_hunspell (
     TEMPLATE  = ispell,
     DictFile  = polish,
     AffFile   = polish,
     StopWords = polish
   );
   CREATE TEXT SEARCH CONFIGURATION public.polish (
     COPY = pg_catalog.english
   );
   ALTER TEXT SEARCH CONFIGURATION polish
     ALTER MAPPING
     FOR
         asciiword, asciihword, hword_asciipart, word, hword, hword_part
     WITH
         polish_hunspell, simple;

You need these files and the configuration because the full text search engine compares lexemes to find the best matches (both the query pattern and the stored text are converted to lexemes):

 >> SELECT to_tsquery('english', 'fielded'), to_tsvector('english', text)
   FROM document
   LIMIT 1;
 to_tsquery |                                                                    to_tsvector
------------+----------------------------------------------------------------------------------------------------------------------------------------------------
 'field'    | '19':16 'bat':12 'dead':8 'degre':1 'depth':5 'field':15 'lamp':13 'men':6 'put':14 'ranch':2 'tall':4 'time':3 'underlin':11 'wast':10 'window':9

If you cannot provide dictionary files, you can use full text search in 'simple' mode (without conversion to lexemes):

 >> SELECT to_tsquery('simple', 'fielded'), to_tsvector('simple', text)
   FROM document
   LIMIT 1;
 to_tsquery |                                                                             to_tsvector
------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 'fielded'  | '19':16 'bat':12 'below':7 'dead':8 'degree':1 'depth':5 'field':15 'lamp':13 'men':6 'putting':14 'ranch':2 'tall':4 'time':3 'underline':11 'waste':10 'window':9

Tsearch full text search without a stored index

 >> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test')
   LIMIT 1;
                                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1000.00..18298.49 rows=1 width=103) (actual time=489.802..491.352 rows=0 loops=1)
   ->  Gather  (cost=1000.00..18298.49 rows=1 width=103) (actual time=489.800..491.349 rows=0 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Parallel Seq Scan on document  (cost=0.00..17298.39 rows=1 width=103) (actual time=486.644..486.644 rows=0 loops=2)
               Filter: (((language)::text = 'en'::text) AND (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery))
               Rows Removed by Filter: 50000
 Planning Time: 0.272 ms
 Execution Time: 491.376 ms
(9 rows)

Tsearch full text search with a stored partial index

A partial index makes it possible to store records in different languages in a single table and still query them efficiently.

 >> CREATE INDEX ix_en_document_tsvector_text ON public.document USING gin (to_tsvector('english'::regconfig, text)) WHERE language = 'en';
CREATE INDEX
>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test')
   LIMIT 1;
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1000.00..18151.43 rows=1 width=103) (actual time=487.120..488.569 rows=0 loops=1)
   ->  Gather  (cost=1000.00..18151.43 rows=1 width=103) (actual time=487.117..488.567 rows=0 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Parallel Seq Scan on document  (cost=0.00..17151.33 rows=1 width=103) (actual time=484.418..484.419 rows=0 loops=2)
               Filter: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery)
               Rows Removed by Filter: 50000
 Planning Time: 0.193 ms
 Execution Time: 488.596 ms

No difference? The index was not used... Why doesn't it work? Well, a look at the partial index docs explains it:

However, keep in mind that the predicate must match the conditions used in the queries that are supposed to benefit from the index. To be precise, a partial index can be used in a query only if the system can recognize that the WHERE condition of the query mathematically implies the predicate of the index. PostgreSQL does not have a sophisticated theorem prover that can recognize mathematically equivalent expressions that are written in different forms. (Not only is such a general theorem prover extremely difficult to create, it would probably be too slow to be of any real use.) The system can recognize simple inequality implications, for example "x < 1" implies "x < 2"; otherwise the predicate condition must exactly match part of the query's WHERE condition or the index will not be recognized as usable. Matching takes place at query planning time, not at run time. As a result, parameterized query clauses do not work with a partial index.

We have to add to the query the condition that was used to create the partial index, document.language = 'en':

 >> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test')
      AND language = 'en'
   LIMIT 1;
                                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=64.00..68.27 rows=1 width=103) (actual time=0.546..0.548 rows=0 loops=1)
   ->  Bitmap Heap Scan on document  (cost=64.00..68.27 rows=1 width=103) (actual time=0.544..0.545 rows=0 loops=1)
         Recheck Cond: ((to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery) AND ((language)::text = 'en'::text))
         ->  Bitmap Index Scan on ix_en_document_tsvector_text  (cost=0.00..64.00 rows=1 width=0) (actual time=0.540..0.540 rows=0 loops=1)
               Index Cond: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery)
 Planning Time: 0.244 ms
 Execution Time: 0.590 ms
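
When such queries are built from application code, one way to keep the partial index usable is to write the language predicate and the text search configuration as literals, and to pass only the search terms as parameters. The sketch below is an illustration under assumptions, not code from this repository: the helper name `build_search_sql`, the `LANGUAGE_CONFIGS` mapping and the `%(terms)s` / `%(limit)s` placeholder style (as used by psycopg2) are all hypothetical choices.

```python
# Hypothetical mapping: value stored in document.language ->
# text search configuration used by the matching partial index.
LANGUAGE_CONFIGS = {"en": "english", "pl": "polish"}


def build_search_sql(language):
    """Return SQL whose language predicate and regconfig are literals.

    Both values come from the fixed mapping above, never from user input,
    so inlining them into the statement is safe. The search terms and the
    limit stay as driver-level placeholders.
    """
    config = LANGUAGE_CONFIGS[language]  # raises KeyError for unknown languages
    return (
        "SELECT text, language "
        "FROM public.document "
        f"WHERE to_tsvector('{config}', text) @@ to_tsquery('{config}', %(terms)s) "
        f"AND language = '{language}' "
        "LIMIT %(limit)s"
    )


print(build_search_sql("en"))
```

The statement produced for 'en' contains the same to_tsvector('english', text) expression and language = 'en' predicate that were used to create ix_en_document_tsvector_text.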

Tsearch full text search for partial words

The :* operator enables prefix search. It can be useful for running a full text search while the user is still typing a word.

 >> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & l:*')
      AND language = 'en'
   LIMIT 1;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on document  (cost=168.00..172.27 rows=1 width=102) (actual time=5.207..5.210 rows=4 loops=1)
   Recheck Cond: ((to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''l'':*'::tsquery) AND ((language)::text = 'en'::text))
   Heap Blocks: exact=4
   ->  Bitmap Index Scan on ix_en_document_tsvector_text  (cost=0.00..168.00 rows=1 width=0) (actual time=5.202..5.202 rows=4 loops=1)
         Index Cond: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''l'':*'::tsquery)
 Planning Time: 0.240 ms
 Execution Time: 5.240 ms

>> SELECT id, text
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & l:*')
      AND language = 'en'
   LIMIT 20;
  id   |                                                   text
-------+-----------------------------------------------------------------------------------------------------------
     1 | degree ranch time tall depth men below dead window waste underline bat lamp putting field               +
 20152 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual              +
 21478 | Dried symbol willing design managed shade window pick share faster education drive field land everybody  +
 30293 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic +
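
For search-as-you-type, the text typed so far has to be turned into a tsquery string such as 'fielded & window & l:*'. A minimal Python sketch of that step follows, assuming the words are combined with & and only the last (possibly unfinished) word is treated as a prefix. The helper name `build_prefix_tsquery` is hypothetical.

```python
import re


def build_prefix_tsquery(user_input):
    """Join the typed words with & and mark the last one as a prefix (:*)."""
    # Keep only word characters so tsquery operators typed by the user
    # (&, |, !, :, parentheses, quotes) never reach to_tsquery.
    words = re.findall(r"\w+", user_input.lower())
    if not words:
        return ""
    # The last word may still be incomplete, so it is matched as a prefix.
    words[-1] += ":*"
    return " & ".join(words)


print(build_prefix_tsquery("fielded window l"))
```

The result is meant to be passed as the second argument of to_tsquery. An empty result means there is nothing to search for yet, so the caller would skip the query.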

Full text search results ranking

There are two quite similar functions for ranking search results:

ts_rank, which ranks vectors based on the frequency of their matching lexemes
ts_rank_cd, which computes the "cover density" ranking

For more information, see the docs.

 >> SELECT
     id,
     ts_rank_cd(to_tsvector('english', text), to_tsquery('english', 'fielded & wind:*')) rank,
     text
    FROM public.document
    WHERE to_tsvector('english', text) @@ to_tsquery('english', 'fielded & wind:*')
    ORDER BY rank DESC
    LIMIT 20;
   id   |    rank     |                                                   text
--------+-------------+-----------------------------------------------------------------------------------------------------------
 100002 |         0.1 | fielded window
   9376 |        0.05 | Own mouse girl effect surprise physical newspaper forgot eat upper field element window simply unhappy   +
  96597 |        0.05 | Opinion fastened pencil rear more theory size window heading field understanding farm up position attack +
  44626 | 0.033333335 | Symbol each halfway window swam spider field page shinning donkey chose until cow cabin congress         +
  80922 | 0.033333335 | Victory famous field shelter girl wind adventure he divide rear tip few studied ruler judge              +
  30293 |       0.025 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic +
      1 | 0.016666668 | degree ranch time tall depth men below dead window waste underline bat lamp putting field               +
  21478 | 0.016666668 | Dried symbol willing design managed shade window pick share faster education drive field land everybody  +
  60059 | 0.016666668 | However hungry make proud kids come willing field officer row above highest round wind mile              +
  26001 | 0.014285714 | Earth earlier pocket might sense window way frog fire court family mouth field somebody recognize        +
  20152 | 0.014285714 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual              +
  37470 |      0.0125 | Farm weight balloon buried wind water donkey grain pig week should damage field was he                   +
  49433 |        0.01 | Wind scientist leaving atom year bad child drink shore spirit field facing indicate wagon here           +
  37851 | 0.007142857 | Field cloud you wife rhythm upward applied weigh continued property replace ahead forgotten trip window  +

The record with text='fielded window' was added manually to show the best-matching result.

GiST vs GIN

We created a GIN index, but there is also the GiST index option. Which one is better? It depends...

 >> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test')
      AND language = 'en'
   LIMIT 1;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.28..8.30 rows=1 width=103) (actual time=2.699..2.700 rows=0 loops=1)
   ->  Index Scan using ix_en_document_tsvector_text on document  (cost=0.28..8.30 rows=1 width=103) (actual time=2.697..2.697 rows=0 loops=1)
         Index Cond: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery)
 Planning Time: 0.274 ms
 Execution Time: 2.730 ms

GIN seems to be a bit faster here (0.590 ms for the GIN bitmap scan earlier versus 2.730 ms for this index scan). I don't think I can explain it better than the docs already do:

In choosing which index type to use, GiST or GIN, consider these performance differences:
GIN index lookups are about three times faster than GiST
GIN indexes take about three times longer to build than GiST
GIN indexes are moderately slower to update than GiST indexes, but about 10 times slower if fast-update support was disabled (see Section 58.4.1 for details)
GIN indexes are two-to-three times larger than GiST indexes

Inspirations and helpful links

https://about.gitlab.com/blog/2016/03/18/fast-search-using-postgresql-trigram-indexes/
http://rachbelaid.com/postgres-full-text-search-is-good-nough/
https://scoutapm.com/blog/how-to-to-text-text-searches-in-postgresql-faster-with-trigram-similarity
https://stackoverflow.com/questions/27443950/make-postgres-full-text-search-tsvector-act-like-ilike-to-search-inside-words
https://stackoverflow.com/questions/46122175/fulltext-search-combined-with-fuzzysearch-in-postgresql
https://stackoverflow.com/questions/58651852/use-postgresql-full-text-search-to-fuzzy-match-all-search-terms
https://stackoverflow.com/questions/52140727/fuzzy-search-in-full-text-search
https://stackoverflow.com/questions/2513501/postgresql-full-text-search-to-to-search-bartial-words
https://stackoverflow.com/questions/28975517/difference-between-gist-and-gin-index
https://dba.stackexchange.com/questions/149765/postgresql-gin-index-not-used-ten-ts-tier-language-is-fetched-frod-a-
https://dba.stackexchange.com/questions/251177/postgres-full-text-search-on-words-not-lexemes
