Postgres全文搜索選項(Tsearch,Trigram,Ilike)示例。
ilike全文搜索ilike全文搜索 >> CREATE DATABASE ftdb;要使用一個示例數據集( dataset.txt ,100k行,每個15個單詞)饋送DB,我使用了Python init_db.py腳本。
ilike全文搜索 >> EXPLAIN ANALYZE
SELECT text , language
FROM public . document
WHERE
text ilike ' %field% '
AND text ilike ' %window% '
AND text ilike ' %lamp% '
AND text ilike ' %research% '
AND language = ' en '
LIMIT 1 ;
QUERY PLAN
-- --------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost = 0 . 00 .. 3734 . 02 rows = 1 width = 105 ) (actual time = 87 . 473 .. 87 . 474 rows = 0 loops = 1 )
- > Seq Scan on document (cost = 0 . 00 .. 3734 . 02 rows = 1 width = 105 ) (actual time = 87 . 466 .. 87 . 466 rows = 0 loops = 1 )
Filter: (( text ~~ * ' %field% ' :: text ) AND ( text ~~ * ' %window% ' :: text ) AND ( text ~~ * ' %lamp% ' :: text ) AND ( text ~~ * ' %research% ' :: text ))
Rows Removed by Filter: 100001
Planning Time : 2 . 193 ms
Execution Time : 87 . 500 msilike全文搜索什麼是trigram?請參閱此示例:
>> CREATE EXTENSION pg_trgm;
CREATE EXTENSION
>> select show_trgm( ' fielded ' );
show_trgm
-- ---------------------------------------
{ " f " , " fi " ,ded, " ed " ,eld,fie,iel,lde}我們可以使用Trigram索引(例如gin_trgm_ops )提高ilike性能。
>> CREATE INDEX ix_document_text_trigram ON document USING gin ( text gin_trgm_ops) where language = ' en ' ;
CREATE INDEX
>> EXPLAIN ANALYZE SELECT text , language
FROM public . document
WHERE
text ilike ' %field% '
AND text ilike ' %window% '
AND text ilike ' %lamp% '
AND text ilike ' %research% '
AND language = ' en '
LIMIT 1 ;
QUERY PLAN
-- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost = 176 . 00 .. 180 . 02 rows = 1 width = 105 ) (actual time = 1 . 473 .. 1 . 474 rows = 0 loops = 1 )
- > Bitmap Heap Scan on document (cost = 176 . 00 .. 180 . 02 rows = 1 width = 105 ) (actual time = 1 . 470 .. 1 . 471 rows = 0 loops = 1 )
Recheck Cond: (( text ~~ * ' %field% ' :: text ) AND ( text ~~ * ' %window% ' :: text ) AND ( text ~~ * ' %lamp% ' :: text ) AND ( text ~~ * ' %research% ' :: text ) AND ((language):: text = ' en ' :: text ))
- > Bitmap Index Scan on ix_document_text_trigram (cost = 0 . 00 .. 176 . 00 rows = 1 width = 0 ) (actual time = 1 . 466 .. 1 . 466 rows = 0 loops = 1 )
Index Cond: (( text ~~ * ' %field% ' :: text ) AND ( text ~~ * ' %window% ' :: text ) AND ( text ~~ * ' %lamp% ' :: text ) AND ( text ~~ * ' %research% ' :: text ))
Planning Time : 2 . 389 ms
Execution Time : 1 . 524 msPostgres默認情況下不提供許多語言的支持。但是,您可以很容易地設置配置。您只需要其他字典文件。這是波蘭語言的示例。可以從:https://github.com/judehunter/polish-tsearch下載波蘭字典文件。
polish.Affix,polish.stop和polish.dict文件應複製到postgresql sharedir tsearch_data位置,例如/usr/share/postgresql/13/tsearch_data 。要確定您的共享位置,您可以使用pg_config --sharedir
還必須在數據庫中創建一個配置(請參閱文檔):
>> DROP TEXT SEARCH DICTIONARY IF EXISTS polish_hunspell CASCADE;
CREATE TEXT SEARCH DICTIONARY polish_hunspell (
TEMPLATE = ispell,
DictFile = polish,
AffFile = polish,
StopWords = polish
);
CREATE TEXT SEARCH CONFIGURATION public . polish (
COPY = pg_catalog . english
);
ALTER TEXT SEARCH CONFIGURATION polish
ALTER MAPPING
FOR
asciiword, asciihword, hword_asciipart, word, hword, hword_part
WITH
polish_hunspell, simple;
您需要這些文件和配置,因為全文搜索引擎使用Lexeme進行比較以查找最佳匹配(查詢模式和存儲的文本都是詞彙的):
>> SELECT to_tsquery( ' english ' , ' fielded ' ), to_tsvector( ' english ' , text )
FROM document
LIMIT 1 ;
to_tsquery | to_tsvector
-- ----------+----------------------------------------------------------------------------------------------------------------------------------------------------
' field ' | ' 19 ' : 16 ' bat ' : 12 ' dead ' : 8 ' degre ' : 1 ' depth ' : 5 ' field ' : 15 ' lamp ' : 13 ' men ' : 6 ' put ' : 14 ' ranch ' : 2 ' tall ' : 4 ' time ' : 3 ' underlin ' : 11 ' wast ' : 10 ' window ' : 9如果您無法提供字典文件,則可以使用“簡單”表單中的全文(無轉換為Lexeme):
>> SELECT to_tsquery( ' simple ' , ' fielded ' ), to_tsvector( ' simple ' , text )
FROM document
LIMIT 1 ;
to_tsquery | to_tsvector
-- ----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------
' fielded ' | ' 19 ' : 16 ' bat ' : 12 ' below ' : 7 ' dead ' : 8 ' degree ' : 1 ' depth ' : 5 ' field ' : 15 ' lamp ' : 13 ' men ' : 6 ' putting ' : 14 ' ranch ' : 2 ' tall ' : 4 ' time ' : 3 ' underline ' : 11 ' waste ' : 10 ' window ' : 9 >> EXPLAIN ANALYZE SELECT text , language
FROM public . document
WHERE to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
LIMIT 1 ;
QUERY PLAN
-- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost = 1000 . 00 .. 18298 . 49 rows = 1 width = 103 ) (actual time = 489 . 802 .. 491 . 352 rows = 0 loops = 1 )
- > Gather (cost = 1000 . 00 .. 18298 . 49 rows = 1 width = 103 ) (actual time = 489 . 800 .. 491 . 349 rows = 0 loops = 1 )
Workers Planned: 1
Workers Launched: 1
- > Parallel Seq Scan on document (cost = 0 . 00 .. 17298 . 39 rows = 1 width = 103 ) (actual time = 486 . 644 .. 486 . 644 rows = 0 loops = 2 )
Filter: (((language):: text = ' en ' :: text ) AND (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery))
Rows Removed by Filter: 50000
Planning Time : 0 . 272 ms
Execution Time : 491 . 376 ms
( 9 rows)部分索引可以使用一張表格以不同語言存儲記錄並有效地查詢它們。
>> CREATE INDEX ix_en_document_tsvector_text ON public . document USING gin (to_tsvector( ' english ' ::regconfig, text )) WHERE language = ' en ' ;
CREATED INDEX
>> EXPLAIN ANALYZE SELECT text , language
FROM public . document
WHERE to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
LIMIT 1 ;
QUERY PLAN
-- --------------------------------------------------------------------------------------------------------------------------------------
Limit (cost = 1000 . 00 .. 18151 . 43 rows = 1 width = 103 ) (actual time = 487 . 120 .. 488 . 569 rows = 0 loops = 1 )
- > Gather (cost = 1000 . 00 .. 18151 . 43 rows = 1 width = 103 ) (actual time = 487 . 117 .. 488 . 567 rows = 0 loops = 1 )
Workers Planned: 1
Workers Launched: 1
- > Parallel Seq Scan on document (cost = 0 . 00 .. 17151 . 33 rows = 1 width = 103 ) (actual time = 484 . 418 .. 484 . 419 rows = 0 loops = 2 )
Filter: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery)
Rows Removed by Filter: 50000
Planning Time : 0 . 193 ms
Execution Time : 488 . 596 ms沒有區別?索引尚未使用...為什麼不起作用?哦,看部分索引文檔:
但是,請記住,謂詞必須與應從索引中受益的查詢中使用的條件匹配。確切地說,只有在系統可以識別出數學的位置意味著索引的謂詞時,才能在查詢中使用部分索引。 PostgreSQL沒有具有復雜的定理供慣例,可以識別以不同形式編寫的數學等效表達式。 (不僅如此一般的定理宣傳者非常難以創建,而且可能太慢了,無法實現任何真實用途。)系統可以識別簡單的不平等影響,例如“ x <1”暗示“ x <2”;否則,謂詞條件必須與查詢的一部分完全匹配,該部分條件或索引將無法確認為可用。匹配是在查詢計劃時間進行的,而不是在運行時進行。結果,參數化查詢子句不適用於部分索引。
我們必須添加查詢用於創建部分索引的條件: document.language = 'en' :
>> EXPLAIN ANALYZE SELECT text , language
FROM public . document
WHERE
to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
AND language = ' en '
LIMIT 1 ; QUERY PLAN
-- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost = 64 . 00 .. 68 . 27 rows = 1 width = 103 ) (actual time = 0 . 546 .. 0 . 548 rows = 0 loops = 1 )
- > Bitmap Heap Scan on document (cost = 64 . 00 .. 68 . 27 rows = 1 width = 103 ) (actual time = 0 . 544 .. 0 . 545 rows = 0 loops = 1 )
Recheck Cond: ((to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery) AND ((language):: text = ' en ' :: text ))
- > Bitmap Index Scan on ix_en_document_tsvector_text (cost = 0 . 00 .. 64 . 00 rows = 1 width = 0 ) (actual time = 0 . 540 .. 0 . 540 rows = 0 loops = 1 )
Index Cond: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery)
Planning Time : 0 . 244 ms
Execution Time : 0 . 590 ms:*操作員啟用前綴搜索。在鍵入單詞時執行全文搜索可能很有用。
>> EXPLAIN ANALYZE SELECT text , language
FROM public . document
WHERE
to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & l:* ' )
AND language = ' en '
LIMIT 1 ;
QUERY PLAN
-- ----------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on document (cost = 168 . 00 .. 172 . 27 rows = 1 width = 102 ) (actual time = 5 . 207 .. 5 . 210 rows = 4 loops = 1 )
Recheck Cond: ((to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' l ' ' :* ' ::tsquery) AND ((language):: text = ' en ' :: text ))
Heap Blocks: exact = 4
- > Bitmap Index Scan on ix_en_document_tsvector_text (cost = 0 . 00 .. 168 . 00 rows = 1 width = 0 ) (actual time = 5 . 202 .. 5 . 202 rows = 4 loops = 1 )
Index Cond: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' l ' ' :* ' ::tsquery)
Planning Time : 0 . 240 ms
Execution Time : 5 . 240 ms
>> SELECT id, text
FROM public . document
WHERE
to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & l:* ' )
AND language = ' en '
LIMIT 20 ;
id | text
-- -----+-----------------------------------------------------------------------------------------------------------
1 | degree ranch time tall depth men below dead window waste underline bat lamp putting field +
20152 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual +
21478 | Dried symbol willing design managed shade window pick share faster education drive field land everybody +
30293 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic +
有兩個非常相似的功能來對tsearch結果進行排名:
ts_rank ,根據其匹配詞彙的頻率對向量進行排名ts_rank_cd ,計算“封面密度”排名有關更多信息,請參閱文檔
>> SELECT
id,
ts_rank_cd(to_tsvector( ' english ' , text ), to_tsquery( ' english ' , ' fielded & wind:* ' )) rank,
text
FROM public . document
WHERE to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & wind:* ' )
ORDER BY rank DESC
LIMIT 20 ;
id | rank | text
-- ------+-------------+-----------------------------------------------------------------------------------------------------------
100002 | 0 . 1 | fielded window
9376 | 0 . 05 | Own mouse girl effect surprise physical newspaper forgot eat upper field element window simply unhappy +
96597 | 0 . 05 | Opinion fastened pencil rear more theory size window heading field understanding farm up position attack +
44626 | 0 . 033333335 | Symbol each halfway window swam spider field page shinning donkey chose until cow cabin congress +
80922 | 0 . 033333335 | Victory famous field shelter girl wind adventure he divide rear tip few studied ruler judge +
30293 | 0 . 025 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic +
1 | 0 . 016666668 | degree ranch time tall depth men below dead window waste underline bat lamp putting field +
21478 | 0 . 016666668 | Dried symbol willing design managed shade window pick share faster education drive field land everybody +
60059 | 0 . 016666668 | However hungry make proud kids come willing field officer row above highest round wind mile +
26001 | 0 . 014285714 | Earth earlier pocket might sense window way frog fire court family mouth field somebody recognize +
20152 | 0 . 014285714 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual +
37470 | 0 . 0125 | Farm weight balloon buried wind water donkey grain pig week should damage field was he +
49433 | 0 . 01 | Wind scientist leaving atom year bad child drink shore spirit field facing indicate wagon here +
37851 | 0 . 007142857 | Field cloud you wife rhythm upward applied weigh continued property replace ahead forgotten trip window +
text='fielded window'記錄被手動添加以顯示最佳匹配結果。
我們創建了杜松子酒索引。但是也有GIST索引選項。哪一個更好?這取決於...
>> EXPLAIN ANALYZE SELECT text , language
FROM public . document
WHERE
to_tsvector( ' english ' , text ) @@ to_tsquery( ' english ' , ' fielded & window & lamp & depth & test ' )
AND language = ' en '
LIMIT 1 ;
QUERY PLAN
-- ---------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost = 0 . 28 .. 8 . 30 rows = 1 width = 103 ) (actual time = 2 . 699 .. 2 . 700 rows = 0 loops = 1 )
- > Index Scan using ix_en_document_tsvector_text on document (cost = 0 . 28 .. 8 . 30 rows = 1 width = 103 ) (actual time = 2 . 697 .. 2 . 697 rows = 0 loops = 1 )
Index Cond: (to_tsvector( ' english ' ::regconfig, text ) @@ ' ' ' field ' ' & ' ' window ' ' & ' ' lamp ' ' & ' ' depth ' ' & ' ' test ' ' ' ::tsquery)
Planning Time : 0 . 274 ms
Execution Time : 2 . 730 ms杜松子酒似乎更快一些。我認為我不能比文檔已經更好地解釋它:
在選擇使用哪種指數類型,要點或杜松子酒,請考慮以下性能差異:
- 杜松子酒索引查找的速度大約是要點的三倍
- 杜松子酒索引的建造時間大約需要三倍
- 杜松子酒索引的更新速度比GIST索引慢,但是如果禁用了快速的支持,大約要慢10倍(有關詳細信息,請參見第58.4.1節)
- 杜松子酒指數比GIST指數大兩到三倍