This mini project will implement a simple search engine using Vector Space Model. The data will be crawled from Vietnamese daily news such as VnExpress, VietnamNet, Thanhnien and Laodong.
Install Python 3.5+ and Pip if not installed.
Use pip to install following packages:
requests (for making HTTP requests).underthesea (Vietnamese NLP toolkit).beautifulsoup4 (for parsing HTML and XML).$ pip install requests underthesea beautifulsoup4(Optional) Install pytest to run unit tests:
$ pip install pytest
$ cd /path/to/project
$ pytestInstall git and clone this project into local machine:
$ git clone https://github.com/vancanhuit/simple-search-engine.git
$ cd simple-search-engineNote: If you run this project on Windows, you must checkout to
windowsbranch. This is due to cross-platform issues of shelve module in python (see this issue):
$ git checkout windowsRun index.py script to perform indexing data. The indexed data will be created (if not exists) or updated and stored in db/ directory.
$ python index.pyRun search.py script and pass a query string for it.
$ python search.py "Your query string here"For example:
$ python search.py "Trump Trieu Tien"
https://vnexpress.net/tin-tuc/the-gioi/trump-noi-cuoc-gap-voi-kim-jong-un-van-co-the-dien-ra-vao-12-6-3754763.html - 0.32331036424704196
https://vnexpress.net/tin-tuc/the-gioi/trump-huy-cuoc-gap-voi-lanh-dao-trieu-tien-3754245.html - 0.3158077661308892
https://vnexpress.net/tin-tuc/the-gioi/trump-thuc-giuc-trung-quoc-that-chat-bien-gioi-voi-trieu-tien-3752746.html - 0.3017484484730665
https://vnexpress.net/tin-tuc/the-gioi/abe-noi-se-gap-trump-truoc-cuoc-hop-thuong-dinh-my-trieu-3755808.html - 0.30059730510834515
http://vietnamnet.vn/vn/the-gioi/binh-luan-quoc-te/nhung-nga-re-chop-nhoang-kho-luong-cua-thuong-dinh-trump-kim-453759.html - 0.2990576238183994
https://vnexpress.net/tin-tuc/the-gioi/ngoai-truong-my-giai-thich-ly-do-cuoc-gap-trump-kim-bi-huy-3754252.html - 0.2807074203562179
https://vnexpress.net/tin-tuc/the-gioi/han-quoc-hop-khan-sau-khi-trump-tuyen-bo-huy-gap-kim-jong-un-3754256.html - 0.24340889391647347
https://vnexpress.net/tin-tuc/the-gioi/my-canh-bao-trieu-tien-co-the-chiu-chung-so-phan-nhu-libya-3753226.html - 0.24232103427164864
...The above query results may be changed because indexed data can be updated. To get updated index, run
git pull origin mastercommand.