gin serves as the HTTP framework, gRPC as the RPC framework, and etcd handles service discovery.
The service is divided into a user module, a favorites module, an index platform, a text search engine module, and an image search engine module.
A distributed crawler crawls data and sends it to a Kafka cluster, where a consumer writes it into the database. (The crawler hasn't been written yet, but that doesn't stop me from sketching the grand plan...)
The text search engine module is deployed separately: it stores the index in boltdb, accelerates index building with mapreduce, and stores postings in roaring bitmaps.
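The mapreduce-style build can be pictured as a map phase that tokenizes documents in parallel and a reduce phase that merges partial posting lists. A minimal in-memory sketch with goroutines standing in for the real mapreduce framework (`buildIndex` and the whitespace tokenizer are illustrative, not the project's API):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// buildIndex tokenizes docs concurrently (map phase) and merges the
// partial term -> docID posting lists under a mutex (reduce phase).
func buildIndex(docs []string) map[string][]int {
	index := make(map[string][]int)
	var mu sync.Mutex
	var wg sync.WaitGroup
	for id, doc := range docs {
		wg.Add(1)
		go func(id int, doc string) {
			defer wg.Done()
			seen := make(map[string]bool)
			for _, term := range strings.Fields(strings.ToLower(doc)) {
				if !seen[term] {
					seen[term] = true
					mu.Lock()
					index[term] = append(index[term], id)
					mu.Unlock()
				}
			}
		}(id, doc)
	}
	wg.Wait()
	// Keep each posting list sorted so later merge/intersection is cheap.
	for term := range index {
		sort.Ints(index[term])
	}
	return index
}

func main() {
	idx := buildIndex([]string{"hello world", "hello search engine"})
	fmt.Println(idx["hello"]) // [0 1]
}
```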
A trie implements query suggestion (model-assisted suggestion is planned for later).
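Suggestion on a trie amounts to walking down to the prefix's node and collecting every completion beneath it. A small rune-keyed sketch (so CJK terms work too); the names here are hypothetical, not the project's actual trie:

```go
package main

import "fmt"

// node is one trie node; children are keyed by rune.
type node struct {
	children map[rune]*node
	end      bool // true if a stored term ends here
}

func newNode() *node { return &node{children: map[rune]*node{}} }

// Insert adds a term to the trie.
func (n *node) Insert(term string) {
	cur := n
	for _, r := range term {
		if cur.children[r] == nil {
			cur.children[r] = newNode()
		}
		cur = cur.children[r]
	}
	cur.end = true
}

// Suggest returns every stored term that starts with prefix.
func (n *node) Suggest(prefix string) []string {
	cur := n
	for _, r := range prefix {
		cur = cur.children[r]
		if cur == nil {
			return nil
		}
	}
	var out []string
	var walk func(p string, nd *node)
	walk = func(p string, nd *node) {
		if nd.end {
			out = append(out, p)
		}
		for r, c := range nd.children {
			walk(p+string(r), c)
		}
	}
	walk(prefix, cur)
	return out
}

func main() {
	root := newNode()
	root.Insert("search")
	root.Insert("seattle")
	root.Insert("query")
	fmt.Println(len(root.Suggest("sea"))) // 2
}
```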
Image search vectorizes the query with ResNet50 and searches a Milvus or Faiss vector database (just getting started... deep learning is hard...).
Supports multi-way recall: inverted-index recall runs in Go and vector recall in Python; the two are connected over gRPC and their results are fused.
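Fusion after multi-way recall can be as simple as a weighted sum of per-channel scores keyed by docID. A sketch, with made-up 0.7/0.3 weights standing in for whatever fusion policy the service actually uses:

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is one recalled document with a channel-local score.
type Hit struct {
	DocID int64
	Score float64
}

// Fuse merges inverted-index hits and vector hits by weighted sum;
// the 0.7/0.3 weights are illustrative, not the project's values.
func Fuse(inverted, vector []Hit) []Hit {
	merged := make(map[int64]float64)
	for _, h := range inverted {
		merged[h.DocID] += 0.7 * h.Score
	}
	for _, h := range vector {
		merged[h.DocID] += 0.3 * h.Score
	}
	out := make([]Hit, 0, len(merged))
	for id, s := range merged {
		out = append(out, Hit{id, s})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	return out
}

func main() {
	// doc 1: 0.7*1.0 = 0.70; doc 2: 0.7*0.5 + 0.3*1.0 = 0.65
	res := Fuse([]Hit{{1, 1.0}, {2, 0.5}}, []Hit{{2, 1.0}})
	fmt.Println(res[0].DocID) // 1
}
```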
Supports ranking with algorithms such as TF-IDF and BM25.
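BM25 scores a term by its in-document frequency, damped by document length and weighted by inverse document frequency. A sketch of the standard formula with the usual k1 = 1.2, b = 0.75 defaults (parameter names are generic, not the project's):

```go
package main

import (
	"fmt"
	"math"
)

// BM25 term score for one document.
// tf: term frequency in the doc; df: number of docs containing the term;
// n: total docs; dl: doc length; avgdl: average doc length.
func BM25(tf, df, n int, dl, avgdl float64) float64 {
	k1, b := 1.2, 0.75
	idf := math.Log(1 + (float64(n)-float64(df)+0.5)/(float64(df)+0.5))
	num := float64(tf) * (k1 + 1)
	den := float64(tf) + k1*(1-b+b*dl/avgdl)
	return idf * num / den
}

func main() {
	// Rarer terms (lower df) score higher, all else equal.
	fmt.Printf("%.3f\n", BM25(3, 10, 1000, 120, 100))
}
```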
Front-end address
Written entirely in React, still under development:
react-tangseng
Future planning
Architecture-related
Introduce service degradation and circuit breaking
Introduce Jaeger for full-link tracing (tracing from Go into Python)
Introduce SkyWalking or Prometheus for monitoring
Extract the init logic from the dao layer and fetch the relevant database instance by key
Separate hot and cold data (refer to the ES scheme; the hard part is defining the hot/cold criteria, which could perhaps live in middleware?)
MySQL is enough to store the forward index for now, but we may go straight to OLAP: a single StarRocks table can answer queries over hundreds of millions of rows in milliseconds, a scale at which MySQL would long since have needed sharding across databases and tables.
Feature-related
Index building is too slow; add concurrency to index creation
Index compression: for the inverted index table, switch to storing offsets and access it via mmap
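Storing gaps between sorted docIDs instead of absolute values, then varint-encoding the gaps, is the classic posting-list compression this item refers to. A stdlib-only sketch:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// compress delta-encodes a sorted docID posting list and varint-encodes
// the gaps, so dense lists shrink toward ~1 byte per entry.
func compress(docIDs []uint64) []byte {
	buf := make([]byte, 0, len(docIDs))
	tmp := make([]byte, binary.MaxVarintLen64)
	var prev uint64
	for _, id := range docIDs {
		n := binary.PutUvarint(tmp, id-prev)
		buf = append(buf, tmp[:n]...)
		prev = id
	}
	return buf
}

// decompress reverses compress: read varint gaps, accumulate docIDs.
func decompress(data []byte) []uint64 {
	var out []uint64
	var prev uint64
	for len(data) > 0 {
		gap, n := binary.Uvarint(data)
		data = data[n:]
		prev += gap
		out = append(out, prev)
	}
	return out
}

func main() {
	ids := []uint64{3, 7, 1000, 1001}
	enc := compress(ids)
	// Gaps 3, 4, 993, 1 encode in 1+1+2+1 = 5 bytes.
	fmt.Println(len(enc), decompress(enc)) // 5 [3 7 1000 1001]
}
```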
Factor relevance into scoring: TF-IDF, BM25
Use a prefix tree to store suggestion data
Compress the prefix tree with Huffman encoding
When creating an index, switch file transfer from passing an address to streaming the file
Introduce a BERT model on the Python side for word segmentation and suggested terms, exposed over a gRPC interface
Support consistent-hash sharded storage for the inverted index and the trie
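Consistent-hash sharding maps each term (or trie subtree) to the first storage node clockwise on a hash ring, so adding a node only moves a fraction of the keys. A minimal ring with virtual nodes (the node names are hypothetical):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a minimal consistent-hash ring with virtual nodes.
type Ring struct {
	keys  []uint32          // sorted virtual-node hashes
	nodes map[uint32]string // virtual-node hash -> physical node
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places vnodes virtual nodes per physical node on the ring.
func NewRing(nodes []string, vnodes int) *Ring {
	r := &Ring{nodes: map[uint32]string{}}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ {
			h := hash32(fmt.Sprintf("%s#%d", n, i))
			r.keys = append(r.keys, h)
			r.nodes[h] = n
		}
	}
	sort.Slice(r.keys, func(i, j int) bool { return r.keys[i] < r.keys[j] })
	return r
}

// Get returns the node owning key: the first virtual node clockwise.
func (r *Ring) Get(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.keys), func(i int) bool { return r.keys[i] >= h })
	if i == len(r.keys) {
		i = 0 // wrap around the ring
	}
	return r.nodes[r.keys[i]]
}

func main() {
	ring := NewRing([]string{"indexA", "indexB", "indexC"}, 100)
	fmt.Println(ring.Get("some-term") != "")
}
```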
Word vectors
PageRank
Separate the trie's build and recall processes
Switch word segmentation to the IK segmenter
Build an index platform: separate compute from storage, and indexing from recall
Set AND and difference operations (via bit operations)
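AND and difference over posting sets reduce to word-wise `&` and `&^` once docIDs live in bitmaps. Here a plain []uint64 bitset stands in for a roaring bitmap, which adds compression but exposes the same set algebra:

```go
package main

import "fmt"

// Bitset is an uncompressed stand-in for a roaring bitmap of docIDs.
type Bitset []uint64

// Set marks docID i, growing the bitset as needed.
func (b Bitset) Set(i uint) Bitset {
	for uint(len(b)) <= i/64 {
		b = append(b, 0)
	}
	b[i/64] |= 1 << (i % 64)
	return b
}

// Contains reports whether docID i is in the set.
func (b Bitset) Contains(i uint) bool {
	return i/64 < uint(len(b)) && b[i/64]&(1<<(i%64)) != 0
}

// And keeps docs present in both sets (word-wise &).
func And(a, b Bitset) Bitset {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	out := make(Bitset, n)
	for i := 0; i < n; i++ {
		out[i] = a[i] & b[i]
	}
	return out
}

// AndNot removes b's docs from a (word-wise &^).
func AndNot(a, b Bitset) Bitset {
	out := make(Bitset, len(a))
	for i := range a {
		out[i] = a[i]
		if i < len(b) {
			out[i] &^= b[i]
		}
	}
	return out
}

func main() {
	a := Bitset{}.Set(1).Set(200)
	c := Bitset{}.Set(200)
	fmt.Println(And(a, c).Contains(200), AndNot(a, c).Contains(1)) // true true
}
```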
Pagination
Sorting
Correct mistyped queries, e.g. fix a wrong character so the query becomes "Lu Jiazui"
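Query correction is typically done by picking the dictionary term with the smallest edit distance to the input. A Levenshtein sketch with a hypothetical dictionary (the romanized "lujiazui" example is illustrative):

```go
package main

import "fmt"

// editDistance is the classic two-row Levenshtein DP over runes.
func editDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	cur := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min(min(cur[j-1]+1, prev[j]+1), prev[j-1]+cost)
		}
		prev, cur = cur, prev
	}
	return prev[len(rb)]
}

func min(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// correct returns the closest dictionary term within distance 2,
// or the query unchanged if nothing is close enough.
func correct(query string, dict []string) string {
	best, bestD := query, 3
	for _, t := range dict {
		if d := editDistance(query, t); d < bestD {
			best, bestD = t, d
		}
	}
	return best
}

func main() {
	fmt.Println(correct("lujiazu", []string{"lujiazui", "pudong"})) // lujiazui
}
```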
Suggest completions as the user types, e.g. entering "Dongfang Ming" suggests "Dongfang Pearl"
Indexing is currently block-based; see whether it can be changed to distributed MapReduce index building (6.824 Lab 1)
On top of that, add dynamic indexing (not sure the previous item is even achievable...)
Rework the inverted index to store docIDs in roaring bitmaps (very difficult)
Implement the TF class
Fix a bug: after a search the cache is not cleared, so the next result set gets merged with the previous one
Optimize the sorter
Quick start
Start the environment:
make env-up
A small dataset is provided at source_data/movies_data.csv.
Start Python
Make sure Python is installed and the version is >= 3.9 (mine is 3.10.2):
python --version
Create the venv environment:
python -m venv venv
Activate the venv Python environment:
macOS:
source venv/bin/activate
Windows:
I'll add Windows support once I've cleaned up my C drive... it hasn't been run on Windows yet...
Install third-party dependencies:
pip install -r requirements.txt
Start Golang
Golang version >= 1.16 (mine is 1.18.6).
Download third-party dependency packages:
go mod tidy
Run in the project directory:
make run-xxx (user, favorite ...)
# e.g.:
# make run-user
# make run-favorite
# see the Makefile for details
Open-source contribution
Please read CONTRIBUTING_CN.md before submitting a PR.