Contoh Pencarian Teks Lengkap Postgres (TSEARCH, TRIGRAM, ILIKE).
ilike Sederhanailike yang Didukung oleh Indeks Trigram >> CREATE DATABASE ftdb; Untuk memberi makan DB dengan contoh data ( dataset.txt , 100K baris, masing -masing 15 kata) Saya menggunakan skrip python init_db.py .
ilike Sederhana >> EXPLAIN ANALYZE
SELECT text , language
FROM public . document
WHERE
text ilike ' %field% '
AND text ilike ' %window% '
AND text ilike ' %lamp% '
AND text ilike ' %research% '
AND language = ' en '
LIMIT 1 ;
QUERY PLAN
-- --------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost = 0 . 00 .. 3734 . 02 rows = 1 width = 105 ) (actual time = 87 . 473 .. 87 . 474 rows = 0 loops = 1 )
- > Seq Scan on document (cost = 0 . 00 .. 3734 . 02 rows = 1 width = 105 ) (actual time = 87 . 466 .. 87 . 466 rows = 0 loops = 1 )
Filter: (( text ~~ * ' %field% ' :: text ) AND ( text ~~ * ' %window% ' :: text ) AND ( text ~~ * ' %lamp% ' :: text ) AND ( text ~~ * ' %research% ' :: text ))
Rows Removed by Filter: 100001
Planning Time : 2 . 193 ms
Execution Time : 87 . 500 msilike yang Didukung oleh Indeks TrigramApa itu Trigram? Lihat contoh ini:
>> CREATE EXTENSION pg_trgm;
CREATE EXTENSION
>> select show_trgm( ' fielded ' );
show_trgm
-- ---------------------------------------
{ " f " , " fi " ,ded, " ed " ,eld,fie,iel,lde} Kami dapat meningkatkan kinerja ilike menggunakan indeks Trigram, misalnya gin_trgm_ops .
>> CREATE INDEX ix_document_text_trigram ON document USING gin ( text gin_trgm_ops) where language = ' en ' ;
CREATE INDEX
>> EXPLAIN ANALYZE SELECT text , language
FROM public . document
WHERE
text ilike ' %field% '
AND text ilike ' %window% '
AND text ilike ' %lamp% '
AND text ilike ' %research% '
AND language = ' en '
LIMIT 1 ;
QUERY PLAN
-- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost = 176 . 00 .. 180 . 02 rows = 1 width = 105 ) (actual time = 1 . 473 .. 1 . 474 rows = 0 loops = 1 )
- > Bitmap Heap Scan on document (cost = 176 . 00 .. 180 . 02 rows = 1 width = 105 ) (actual time = 1 . 470 .. 1 . 471 rows = 0 loops = 1 )
Recheck Cond: (( text ~~ * ' %field% ' :: text ) AND ( text ~~ * ' %window% ' :: text ) AND ( text ~~ * ' %lamp% ' :: text ) AND ( text ~~ * ' %research% ' :: text ) AND ((language):: text = ' en ' :: text ))
- > Bitmap Index Scan on ix_document_text_trigram (cost = 0 . 00 .. 176 . 00 rows = 1 width = 0 ) (actual time = 1 . 466 .. 1 . 466 rows = 0 loops = 1 )
Index Cond: (( text ~~ * ' %field% ' :: text ) AND ( text ~~ * ' %window% ' :: text ) AND ( text ~~ * ' %lamp% ' :: text ) AND ( text ~~ * ' %research% ' :: text ))
Planning Time : 2 . 389 ms
Execution Time : 1 . 524 msPostgres tidak memberikan dukungan untuk banyak bahasa secara default. Namun, Anda dapat mengatur konfigurasi dengan cukup mudah. Anda hanya perlu file kamus tambahan. Berikut adalah contoh untuk bahasa Polandia. File Kamus Polandia dapat diunduh dari: https://github.com/judehunter/polish-tsearch.
File Polish.Affix, Polandia. Polandia dan Polandia harus disalin ke Lokasi PostgreSQL Sharedir tsearch_data , EG /usr/share/postgresql/13/tsearch_data . Untuk menentukan lokasi sharedir Anda, Anda dapat menggunakan pg_config --sharedir
Juga harus dibuat konfigurasi (lihat dokumen) di dalam database:
>> DROP TEXT SEARCH DICTIONARY IF EXISTS polish_hunspell CASCADE;
CREATE TEXT SEARCH DICTIONARY polish_hunspell (
TEMPLATE = ispell,
DictFile = polish,
AffFile = polish,
StopWords = polish
);
CREATE TEXT SEARCH CONFIGURATION public . polish (
COPY = pg_catalog . english
);
ALTER TEXT SEARCH CONFIGURATION polish
ALTER MAPPING
FOR
asciiword, asciihword, hword_asciipart, word, hword, hword_part
WITH
polish_hunspell, simple;
Anda memerlukan file dan konfigurasi ini karena mesin pencari teks lengkap menggunakan lexeme membandingkan untuk menemukan kecocokan terbaik (baik pola kueri dan teks tersimpan leksemisasi):
>> SELECT to_tsquery( ' english ' , ' fielded ' ), to_tsvector( ' english ' , text )
FROM document
LIMIT 1 ;
to_tsquery | to_tsvector
-- ----------+----------------------------------------------------------------------------------------------------------------------------------------------------
' field ' | ' 19 ' : 16 ' bat ' : 12 ' dead ' : 8 ' degre ' : 1 ' depth ' : 5 ' field ' : 15 ' lamp ' : 13 ' men ' : 6 ' put ' : 14 ' ranch ' : 2 ' tall ' : 4 ' time ' : 3 ' underlin ' : 11 ' wast ' : 10 ' window ' : 9Jika Anda tidak dapat memberikan file kamus, Anda dapat menggunakan teks lengkap dalam formulir "Sederhana" (tanpa transformasi ke Lexeme):
>> SELECT to_tsquery( ' simple ' , ' fielded ' ), to_tsvector( ' simple ' , text )
FROM document
LIMIT 1 ;
to_tsquery | to_tsvector
-- ----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------
' fielded ' | ' 19 ' : 16 ' bat ' : 12 ' below ' : 7 ' dead ' : 8 ' degree ' : 1 ' depth ' : 5 ' field ' : 15 ' lamp ' : 13 ' men ' : 6 ' putting ' : 14 ' ranch ' : 2 ' tall ' : 4 ' time ' : 3 ' underline ' : 11 ' waste ' : 10 ' window ' : 9 >> EXPLAIN ANALYZE SELECT text , language
FROM public . document
WHERE to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
LIMIT 1 ;
QUERY PLAN
-- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost = 1000 . 00 .. 18298 . 49 rows = 1 width = 103 ) (actual time = 489 . 802 .. 491 . 352 rows = 0 loops = 1 )
- > Gather (cost = 1000 . 00 .. 18298 . 49 rows = 1 width = 103 ) (actual time = 489 . 800 .. 491 . 349 rows = 0 loops = 1 )
Workers Planned: 1
Workers Launched: 1
- > Parallel Seq Scan on document (cost = 0 . 00 .. 17298 . 39 rows = 1 width = 103 ) (actual time = 486 . 644 .. 486 . 644 rows = 0 loops = 2 )
Filter: (((language):: text = ' en ' :: text ) AND (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery))
Rows Removed by Filter: 50000
Planning Time : 0 . 272 ms
Execution Time : 491 . 376 ms
( 9 rows)Indeks parsial memberikan sebagai kemungkinan untuk menyimpan catatan dalam berbagai bahasa menggunakan satu tabel dan meminta secara efektif.
>> CREATE INDEX ix_en_document_tsvector_text ON public . document USING gin (to_tsvector( ' english ' ::regconfig, text )) WHERE language = ' en ' ;
CREATED INDEX
>> EXPLAIN ANALYZE SELECT text , language
FROM public . document
WHERE to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
LIMIT 1 ;
QUERY PLAN
-- --------------------------------------------------------------------------------------------------------------------------------------
Limit (cost = 1000 . 00 .. 18151 . 43 rows = 1 width = 103 ) (actual time = 487 . 120 .. 488 . 569 rows = 0 loops = 1 )
- > Gather (cost = 1000 . 00 .. 18151 . 43 rows = 1 width = 103 ) (actual time = 487 . 117 .. 488 . 567 rows = 0 loops = 1 )
Workers Planned: 1
Workers Launched: 1
- > Parallel Seq Scan on document (cost = 0 . 00 .. 17151 . 33 rows = 1 width = 103 ) (actual time = 484 . 418 .. 484 . 419 rows = 0 loops = 2 )
Filter: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery)
Rows Removed by Filter: 50000
Planning Time : 0 . 193 ms
Execution Time : 488 . 596 msTidak ada bedanya? Indeks belum digunakan ... mengapa tidak berfungsi? Ohh, melihat ke dokumen indeks parsial:
Namun, perlu diingat bahwa predikat harus sesuai dengan kondisi yang digunakan dalam kueri yang seharusnya mendapat manfaat dari indeks. Lebih tepatnya, indeks parsial dapat digunakan dalam kueri hanya jika sistem dapat mengenali bahwa kondisi kueri di mana secara matematis menyiratkan predikat indeks. PostgreSQL tidak memiliki pepatah teorema yang canggih yang dapat mengenali ekspresi setara secara matematis yang ditulis dalam berbagai bentuk. ; Kalau tidak, kondisi predikat harus cocok dengan bagian dari kueri di mana kondisi atau indeks tidak akan diakui sebagai dapat digunakan. Pencocokan berlangsung pada waktu perencanaan kueri, bukan pada waktu lari. Akibatnya, klausa kueri parameter tidak berfungsi dengan indeks parsial.
Kami harus menambahkan ke kueri suatu kondisi yang digunakan untuk membuat indeks parsial: document.language = 'en' :
>> EXPLAIN ANALYZE SELECT text , language
FROM public . document
WHERE
to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
AND language = ' en '
LIMIT 1 ; QUERY PLAN
-- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost = 64 . 00 .. 68 . 27 rows = 1 width = 103 ) (actual time = 0 . 546 .. 0 . 548 rows = 0 loops = 1 )
- > Bitmap Heap Scan on document (cost = 64 . 00 .. 68 . 27 rows = 1 width = 103 ) (actual time = 0 . 544 .. 0 . 545 rows = 0 loops = 1 )
Recheck Cond: ((to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery) AND ((language):: text = ' en ' :: text ))
- > Bitmap Index Scan on ix_en_document_tsvector_text (cost = 0 . 00 .. 64 . 00 rows = 1 width = 0 ) (actual time = 0 . 540 .. 0 . 540 rows = 0 loops = 1 )
Index Cond: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery)
Planning Time : 0 . 244 ms
Execution Time : 0 . 590 ms :* Operator Mengaktifkan Pencarian Awalan. Bisa bermanfaat untuk menjalankan pencarian teks lengkap selama mengetik kata.
>> EXPLAIN ANALYZE SELECT text , language
FROM public . document
WHERE
to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & l:* ' )
AND language = ' en '
LIMIT 1 ;
QUERY PLAN
-- ----------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on document (cost = 168 . 00 .. 172 . 27 rows = 1 width = 102 ) (actual time = 5 . 207 .. 5 . 210 rows = 4 loops = 1 )
Recheck Cond: ((to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' l ' ' :* ' ::tsquery) AND ((language):: text = ' en ' :: text ))
Heap Blocks: exact = 4
- > Bitmap Index Scan on ix_en_document_tsvector_text (cost = 0 . 00 .. 168 . 00 rows = 1 width = 0 ) (actual time = 5 . 202 .. 5 . 202 rows = 4 loops = 1 )
Index Cond: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' l ' ' :* ' ::tsquery)
Planning Time : 0 . 240 ms
Execution Time : 5 . 240 ms
>> SELECT id, text
FROM public . document
WHERE
to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & l:* ' )
AND language = ' en '
LIMIT 20 ;
id | text
-- -----+-----------------------------------------------------------------------------------------------------------
1 | degree ranch time tall depth men below dead window waste underline bat lamp putting field +
20152 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual +
21478 | Dried symbol willing design managed shade window pick share faster education drive field land everybody +
30293 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic +
Ada dua fungsi yang sangat mirip untuk memberi peringkat hasil penelusuran:
ts_rank , yang menempati peringkat vektor berdasarkan frekuensi leksem yang cocokts_rank_cd , yang menghitung peringkat "kepadatan penutup"Untuk info lebih lanjut, lihat dokumen
>> SELECT
id,
ts_rank_cd(to_tsvector( ' english ' , text ), to_tsquery( ' english ' , ' fielded & wind:* ' )) rank,
text
FROM public . document
WHERE to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & wind:* ' )
ORDER BY rank DESC
LIMIT 20 ;
id | rank | text
-- ------+-------------+-----------------------------------------------------------------------------------------------------------
100002 | 0 . 1 | fielded window
9376 | 0 . 05 | Own mouse girl effect surprise physical newspaper forgot eat upper field element window simply unhappy +
96597 | 0 . 05 | Opinion fastened pencil rear more theory size window heading field understanding farm up position attack +
44626 | 0 . 033333335 | Symbol each halfway window swam spider field page shinning donkey chose until cow cabin congress +
80922 | 0 . 033333335 | Victory famous field shelter girl wind adventure he divide rear tip few studied ruler judge +
30293 | 0 . 025 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic +
1 | 0 . 016666668 | degree ranch time tall depth men below dead window waste underline bat lamp putting field +
21478 | 0 . 016666668 | Dried symbol willing design managed shade window pick share faster education drive field land everybody +
60059 | 0 . 016666668 | However hungry make proud kids come willing field officer row above highest round wind mile +
26001 | 0 . 014285714 | Earth earlier pocket might sense window way frog fire court family mouth field somebody recognize +
20152 | 0 . 014285714 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual +
37470 | 0 . 0125 | Farm weight balloon buried wind water donkey grain pig week should damage field was he +
49433 | 0 . 01 | Wind scientist leaving atom year bad child drink shore spirit field facing indicate wagon here +
37851 | 0 . 007142857 | Field cloud you wife rhythm upward applied weigh continued property replace ahead forgotten trip window +
Rekaman text='fielded window' ditambahkan secara manual untuk menunjukkan hasil pertandingan terbaik.
Kami telah membuat indeks gin. Tetapi ada juga opsi indeks inti. Mana yang lebih baik? Itu tergantung ...
>> EXPLAIN ANALYZE SELECT text , language
FROM public . document
WHERE
to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
AND language = ' en '
LIMIT 1 ;
QUERY PLAN
-- ---------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost = 0 . 28 .. 8 . 30 rows = 1 width = 103 ) (actual time = 2 . 699 .. 2 . 700 rows = 0 loops = 1 )
- > Index Scan using ix_en_document_tsvector_text on document (cost = 0 . 28 .. 8 . 30 rows = 1 width = 103 ) (actual time = 2 . 697 .. 2 . 697 rows = 0 loops = 1 )
Index Cond: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery)
Planning Time : 0 . 274 ms
Execution Time : 2 . 730 msGin tampaknya sedikit lebih cepat. Saya tidak berpikir saya bisa menjelaskannya lebih baik daripada yang sudah dilakukan oleh dokumen:
Dalam memilih jenis indeks mana yang akan digunakan, GIST atau GIN, pertimbangkan perbedaan kinerja ini:
- Pencarian indeks gin sekitar tiga kali lebih cepat dari inti
- Indeks gin membutuhkan waktu sekitar tiga kali lebih lama untuk dibangun daripada inti
- Indeks gin cukup lambat untuk diperbarui daripada indeks GIST, tetapi sekitar 10 kali lebih lambat jika dukungan pembaruan cepat dinonaktifkan (lihat Bagian 58.4.1 untuk detailnya)
- Indeks gin dua hingga tiga kali lebih besar dari indeks GIST