simple search engine
1.0.0
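This mini project implements a simple search engine using the vector space model. The data is crawled from Vietnamese daily news sites such as Vnexpress, Vietnamnet, Thanhnien and Laodong.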
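As a rough illustration of the vector space model mentioned above, the sketch below weights terms with TF-IDF and ranks documents by cosine similarity to the query. It is not the project's actual code: the function names (`tokenize`, `build_index`, `search`), the smoothed IDF formula and the whitespace tokenizer are assumptions made for this example (the project itself lists underthesea for Vietnamese text processing).

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization, for illustration only.
    return text.lower().split()


def build_index(docs):
    """Build TF-IDF vectors for a dict mapping doc_id -> text."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # Document frequency: number of documents containing each term.
    df = Counter()
    for counts in term_counts.values():
        df.update(counts.keys())

    # Smoothed IDF so that terms present in every document keep a small weight.
    idf = {term: math.log(1 + n_docs / freq) for term, freq in df.items()}

    vectors = {
        doc_id: {term: count * idf[term] for term, count in counts.items()}
        for doc_id, counts in term_counts.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, vectors, idf):
    """Return (doc_id, score) pairs with a non-zero score, best match first."""
    counts = Counter(tokenize(query))
    # Query terms that never occur in the collection are ignored.
    query_vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in vectors.items()]
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = {
        "a": "trump meets kim",
        "b": "football match tonight",
        "c": "trump cancels meeting with kim",
    }
    vectors, idf = build_index(docs)
    for doc_id, score in search("trump kim", vectors, idf):
        print(doc_id, "-", score)
```

Storing the vectors as sparse dicts keeps the sketch short; a real index would more likely use an inverted index so that only documents sharing a term with the query are scored.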
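Install Python 3.5+, and pip if it is not already installed.
Use pip to install the following packages:
requests (for making HTTP requests).
underthesea (Vietnamese NLP toolkit).
beautifulsoup4 (for parsing HTML and XML).
$ pip install requests underthesea beautifulsoup4
(Optional) Install pytest to run the unit tests: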
$ pip install pytest
$ cd /path/to/project
$ pytest
Install git and clone the project to your local machine:
$ git clone https://github.com/vancanhuit/simple-search-engine.git
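$ cd simple-search-engine
Note: If you run this project on Windows, you must check out the windows branch. This is due to a cross-platform issue with Python's shelve module (see this issue):
$ git checkout windows
Run the index.py script to index the data. The index data is created if it does not exist, or updated otherwise, and stored in the db/ directory.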
$ python index.py
Run the search.py script and pass it a query string:
$ python search.py "Your query string here"
For example:
$ python search.py "Trump Trieu Tien"
https://vnexpress.net/tin-tuc/the-gioi/trump-noi-cuoc-gap-voi-kim-jong-un-van-co-the-dien-ra-vao-12-6-3754763.html - 0.32331036424704196
https://vnexpress.net/tin-tuc/the-gioi/trump-huy-cuoc-gap-voi-lanh-dao-trieu-tien-3754245.html - 0.3158077661308892
https://vnexpress.net/tin-tuc/the-gioi/trump-thuc-giuc-trung-quoc-that-chat-bien-gioi-voi-trieu-tien-3752746.html - 0.3017484484730665
https://vnexpress.net/tin-tuc/the-gioi/abe-noi-se-gap-trump-truoc-cuoc-hop-thuong-dinh-my-trieu-3755808.html - 0.30059730510834515
http://vietnamnet.vn/vn/the-gioi/binh-luan-quoc-te/nhung-nga-re-chop-nhoang-kho-luong-cua-thuong-dinh-trump-kim-453759.html - 0.2990576238183994
https://vnexpress.net/tin-tuc/the-gioi/ngoai-truong-my-giai-thich-ly-do-cuoc-gap-trump-kim-bi-huy-3754252.html - 0.2807074203562179
https://vnexpress.net/tin-tuc/the-gioi/han-quoc-hop-khan-sau-khi-trump-tuyen-bo-huy-gap-kim-jong-un-3754256.html - 0.24340889391647347
https://vnexpress.net/tin-tuc/the-gioi/my-canh-bao-trieu-tien-co-the-chiu-chung-so-phan-nhu-libya-3753226.html - 0.24232103427164864
...
The query results above may change as the index data is updated. To get the updated index, run:
$ git pull origin master
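For readers curious about the shelve-based storage mentioned in the indexing step and the Windows note, here is a minimal sketch of how an index could be persisted with shelve. The helper names (`save_index`, `load_postings`) and the term-to-URL-list layout are hypothetical and not taken from the project. shelve delegates to whichever dbm backend the Python installation provides, which is the likely reason files written on one platform may not open on another.

```python
import shelve


def save_index(path, index):
    """Create the shelf if it is missing, otherwise update it in place."""
    # The default flag "c" opens for reading and writing, creating if needed.
    with shelve.open(path) as db:
        for term, postings in index.items():
            db[term] = postings


def load_postings(path, term):
    """Return the stored postings for a term, or an empty list if absent."""
    with shelve.open(path, flag="r") as db:
        return db.get(term, [])


if __name__ == "__main__":
    import os
    import tempfile

    # A temporary directory stands in for the project's db/ directory.
    db_path = os.path.join(tempfile.mkdtemp(), "index")
    save_index(db_path, {"trump": ["url-1", "url-2"]})
    save_index(db_path, {"kim": ["url-3"]})  # a second run only adds or updates
    print(load_postings(db_path, "trump"))
```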