simple search engine
1.0.0
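This mini project implements a simple search engine using the vector space model. The data is crawled from Vietnamese daily news sites such as VnExpress, Vietnamnet, Thanhnien, and Laodong.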
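To make the vector space model concrete, here is a minimal sketch of tf-idf weighting with cosine-similarity ranking. It is an illustration under stated assumptions, not the project's actual code: the function names (`build_tfidf`, `weigh`, `cosine`, `search`), the sample documents, the smoothed idf formula, and the whitespace tokenizer are all hypothetical. The real project would presumably tokenize Vietnamese text with underthesea instead.

```python
import math
from collections import Counter


def build_tfidf(docs):
    """Return (idf, vectors) for a dict of {doc_id: list of tokens}."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    # Smoothed idf (an assumption of this sketch); it is always positive.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0
           for term, count in df.items()}
    vectors = {doc_id: weigh(tokens, idf) for doc_id, tokens in docs.items()}
    return idf, vectors


def weigh(tokens, idf):
    """Turn tokens into a sparse tf-idf vector, skipping unknown terms."""
    tf = Counter(t for t in tokens if t in idf)
    return {term: (1 + math.log(count)) * idf[term]
            for term, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, idf, vectors):
    """Rank documents by similarity to the query, best match first."""
    # Whitespace tokenization is a placeholder for a real tokenizer.
    q = weigh(query.lower().split(), idf)
    scored = ((doc_id, cosine(q, vec)) for doc_id, vec in vectors.items())
    return sorted((p for p in scored if p[1] > 0),
                  key=lambda p: p[1], reverse=True)


# Hypothetical sample documents, already tokenized.
docs = {
    "doc-a": "trump cancels summit with north korea".split(),
    "doc-b": "north korea summit may still happen".split(),
    "doc-c": "football results from the weekend".split(),
}
idf, vectors = build_tfidf(docs)
for doc_id, score in search("trump summit", idf, vectors):
    print(doc_id, "-", score)
```

Each document and the query become a sparse vector of term weights, and the score is the cosine of the angle between them, so documents that share no terms with the query are dropped from the results.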
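Install Python 3.5+ and pip, if they are not already installed.
Use pip to install the following packages:
- requests (for making HTTP requests)
- underthesea (Vietnamese NLP toolkit)
- beautifulsoup4 (for parsing HTML and XML)
$ pip install requests underthesea beautifulsoup4
(Optional) Install pytest to run the unit tests: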
$ pip install pytest
$ cd /path/to/project
$ pytest
Install git and clone the project to your local machine:
$ git clone https://github.com/vancanhuit/simple-search-engine.git
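$ cd simple-search-engine
Note: if you run this project on Windows, you must check out the windows branch. This is due to a cross-platform issue with Python's shelve module (see this issue):
$ git checkout windows
Run the index.py script to index the data. The index data is created if it does not exist, or updated otherwise, and is stored in the db/ directory.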
$ python index.py
Run the search.py script and pass it a query string:
$ python search.py "Your query string here"
For example:
$ python search.py "Trump Trieu Tien"
https://vnexpress.net/tin-tuc/the-gioi/trump-noi-cuoc-gap-voi-kim-jong-un-van-co-the-dien-ra-vao-12-6-3754763.html - 0.32331036424704196
https://vnexpress.net/tin-tuc/the-gioi/trump-huy-cuoc-gap-voi-lanh-dao-trieu-tien-3754245.html - 0.3158077661308892
https://vnexpress.net/tin-tuc/the-gioi/trump-thuc-giuc-trung-quoc-that-chat-bien-gioi-voi-trieu-tien-3752746.html - 0.3017484484730665
https://vnexpress.net/tin-tuc/the-gioi/abe-noi-se-gap-trump-truoc-cuoc-hop-thuong-dinh-my-trieu-3755808.html - 0.30059730510834515
http://vietnamnet.vn/vn/the-gioi/binh-luan-quoc-te/nhung-nga-re-chop-nhoang-kho-luong-cua-thuong-dinh-trump-kim-453759.html - 0.2990576238183994
https://vnexpress.net/tin-tuc/the-gioi/ngoai-truong-my-giai-thich-ly-do-cuoc-gap-trump-kim-bi-huy-3754252.html - 0.2807074203562179
https://vnexpress.net/tin-tuc/the-gioi/han-quoc-hop-khan-sau-khi-trump-tuyen-bo-huy-gap-kim-jong-un-3754256.html - 0.24340889391647347
https://vnexpress.net/tin-tuc/the-gioi/my-canh-bao-trieu-tien-co-the-chiu-chung-so-phan-nhu-libya-3753226.html - 0.24232103427164864
...
The query results above may change because the index data can be updated. To get the updated index, run the git pull origin master command.
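As background on the db/ index and the shelve note above, the following is a minimal sketch of how an inverted index could be created or updated with Python's shelve module. It is an assumption for illustration only: the helper names (`update_index`, `lookup`), the file name, and the index layout (term mapped to a list of document ids) are hypothetical and are not taken from index.py. shelve delegates storage to whichever dbm backend is available on the platform, which is one plausible reason an index built on one operating system might not open on another.

```python
import os
import shelve
import tempfile


def update_index(db_path, doc_id, tokens):
    """Add one document's terms to a persistent inverted index."""
    # The default flag "c" creates the database if it does not exist yet.
    with shelve.open(db_path) as db:
        for term in set(tokens):
            postings = db.get(term, [])
            if doc_id not in postings:
                postings.append(doc_id)
            # Reassign so that the change is written back to disk.
            db[term] = postings


def lookup(db_path, term):
    """Return the list of document ids that contain the term."""
    with shelve.open(db_path, flag="r") as db:
        return db.get(term, [])


# Demo in a temporary directory so that nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index")
    update_index(path, "doc-a", ["trump", "summit"])
    update_index(path, "doc-b", ["summit"])
    print(lookup(path, "summit"))
```

Running `update_index` again for a document that is already indexed leaves its postings unchanged, which mirrors the create-or-update behavior described for index.py.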