Weibo Terminator Work Flow
This is the working version of Weibo Terminator, a restart of the previous project (see the previous project's address; the project will continue to be updated). This version makes several optimizations over the previous one. The ultimate goal is to build a crawled corpus together, for applications such as sentiment analysis, dialogue corpora, public-opinion risk control, and big-data analysis.
UPDATE 2017-5-16
Changes:
- Adjusted the initial cookie-acquisition logic: if the program does not detect cookies, it now exits immediately instead of crawling further and crashing (a minimal sketch of the idea follows this list);
- Added the WeiBoScraperM class, which is still under construction; PRs implementing it are welcome. This class crawls from Weibo's other domain, i.e. the mobile domain;
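The check itself is simple. Here is a minimal sketch of the exit-if-no-cookies idea; the file name and function below are illustrative, not the project's actual API:

```python
import os
import sys

COOKIES_FILE = 'cookies.pkl'  # illustrative path, not necessarily what the project uses

def ensure_cookies():
    """Exit early when no cookies are available, instead of crawling on and crashing later."""
    if not os.path.exists(COOKIES_FILE) or os.path.getsize(COOKIES_FILE) == 0:
        print('No cookies detected, please obtain your Weibo cookies first. Exiting.')
        sys.exit(1)
```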
You can pull the update.
UPDATE 2017-5-15
After a few small modifications and PRs from several contributors, the code has changed slightly, mostly bug fixes and logic improvements:
- Fixed the saving error; when pulling for the first time you need to re-clone the code;
- The error "WeiboScraper has no attribute weibo_content" has been fixed in the new code;
@Fence submitted a PR with the following changes:
- The fixed 30-second rest has been replaced with a random delay; the specific parameters can be configured by you (see the sketch after this list).
- Added big_v_ids_file to record celebrity ids whose fans have already been saved; it uses a txt format so contributors can manually add and delete entries.
- Both functions now resume crawling from page + 1, to avoid re-crawling the same page when continuing from a breakpoint.
- Changed saving from "after all of an id's Weibo posts and comments have been crawled" to "after each Weibo post and its comments are crawled".
- (Optional) Moved the file-saving code into its own function, since it is called from two or three places.
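A minimal sketch of the random-delay idea; the bounds and function name here are illustrative, and in the project the actual parameters are user-configurable:

```python
import random
import time

# Illustrative bounds; adjust them to taste.
MIN_DELAY = 20   # seconds
MAX_DELAY = 40   # seconds

def polite_sleep(min_delay=MIN_DELAY, max_delay=MAX_DELAY):
    """Sleep for a random interval instead of a fixed 30 s, which looks less bot-like."""
    time.sleep(random.uniform(min_delay, max_delay))
```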
You can run git pull origin master to get the newly updated version. You are also welcome to keep asking me for a uuid; I will publish the list regularly in contirbutor.txt. I have recently been merging the data, as well as cleaning and classifying it. Once the merge is done, I will distribute the large dataset to everyone.
Improvements
The following improvements were made to the previous version:
- No distractions, straight to the point: given an id, fetch all of the user's Weibo posts, the number of posts, the number of fans, and the full post and comment content;
- Unlike the previous version, the philosophy this time is to save all data as dictionaries into three pickle files, which makes resuming from a breakpoint straightforward (a sketch of this scheme follows this list);
- Ids that have already been crawled are not crawled again: the crawler remembers crawled ids, and once all of an id's content has been obtained, the id is marked as crawled;
- Weibo content and Weibo comments are stored separately; if crawling of Weibo content is interrupted, the finished part is not re-crawled the next time, and crawling resumes from the interrupted page number;
- Even more important!!! Each id is crawled independently of the others: you can pull any id's content you want directly from the pickle file and process it however you like!!
- In addition, the new anti-crawl strategy was tested; the adopted delay mechanism works well, although it is not completely foolproof.
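A minimal sketch of the pickle-based checkpointing described above; the file name, keys, and id below are illustrative, not the project's exact layout:

```python
import os
import pickle

CHECKPOINT_FILE = 'weibo_content.pkl'  # illustrative; the project keeps three such files

def load_checkpoint():
    """Load previously crawled data, keyed by user id."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, 'rb') as f:
            return pickle.load(f)
    return {}

def save_checkpoint(data):
    with open(CHECKPOINT_FILE, 'wb') as f:
        pickle.dump(data, f)

data = load_checkpoint()
user_id = '1234567890'  # illustrative Weibo id
record = data.setdefault(user_id, {'finished': False, 'last_page': 0, 'posts': []})

if not record['finished']:
    start_page = record['last_page'] + 1  # resume from the interrupted page
    # ... crawl pages from start_page onward, updating record['posts'] and record['last_page'] ...
    save_checkpoint(data)  # save after each post/page, not only at the very end
```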
Even more important!!! In this version the crawler has become much smarter: while crawling each id, it automatically collects all of that id's fan ids!! In other words, what I give you are seed ids, belonging to celebrities, companies, or big-V media accounts, and from those seeds you can obtain thousands of further ids!! If a celebrity has 34,000 fans, the first crawl already yields 34,000 ids; if each of those child ids has 100 fans, the second pass yields 3.4 million ids!!! Is that enough?!!! Of course not!!!
Our project will never stop!!! It will keep going until enough corpus has been harvested!!!
(Of course we can't actually collect every fan, but this is more than enough.)
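To make the expansion concrete, here is a toy sketch of the breadth-first growth from seed ids to fan ids; get_fan_ids stands in for the crawler's fan-fetching step and is a hypothetical placeholder:

```python
def expand_seeds(seed_ids, get_fan_ids, rounds=2):
    """Breadth-first expansion: each round collects the fans of the ids found in the previous round."""
    seen = set(seed_ids)
    frontier = list(seed_ids)
    for _ in range(rounds):
        next_frontier = []
        for uid in frontier:
            for fan_id in get_fan_ids(uid):  # hypothetical fan-fetching call
                if fan_id not in seen:
                    seen.add(fan_id)
                    next_frontier.append(fan_id)
        frontier = next_frontier
    return seen

# Rough numbers from the text: one celebrity with 34,000 fans gives 34,000 ids in round one;
# at ~100 fans per child id, round two gives on the order of 34,000 * 100 = 3,400,000 ids.
```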
Work Flow
This version is aimed at contributors, and the workflow is very simple:
- Get a uuid. Each uuid maps to 2-3 ids in distribution_ids.pkl; these are your seed ids. You could also fetch all the ids directly, but to avoid duplicated work it is recommended that you apply for a uuid from me and take responsibility only for your own. After crawling, send the resulting file back to me; once I have organized and de-duplicated everything, I will distribute the final large corpus to everyone.
- Run python3 main.py uuid. Note that the fan ids are collected only after the ids assigned to your uuid have been crawled (a small lookup sketch follows this list);
- Done!
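For orientation, a sketch of what the uuid-to-seed-ids lookup might look like; the exact structure of distribution_ids.pkl is an assumption here, not documented behavior:

```python
import pickle
import sys

# Assumption: distribution_ids.pkl maps each uuid to its 2-3 assigned seed ids.
with open('distribution_ids.pkl', 'rb') as f:
    distribution = pickle.load(f)

uuid = sys.argv[1]  # e.g. python3 show_ids.py <your-uuid>
seed_ids = distribution.get(uuid, [])
print('Seed ids assigned to {}: {}'.format(uuid, seed_ids))
```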
Discuss
I'm also posting the discussion groups here; everyone is welcome to join:
QQ
AI智能自然语言处理: 476464663
Tensorflow智能聊天Bot: 621970965
GitHub深度学习开源交流: 263018023
You can add me as a friend on WeChat: jintianiloveu
Copyright
(c) 2017 Jin Fagang & Tianmu Inc. & weibo_terminator authors. Licensed under Apache 2.0.