Weibo Terminator Work Flow
This is the working version of Weibo Terminator, a restart of the previous project (see the previous project's address; the project will continue to be updated). This version makes several optimizations over the previous one. The ultimate goal is to build a crawled corpus together, for applications such as sentiment analysis, dialogue corpora, public-opinion risk control, and big-data analysis.
UPDATE 2017-5-16
Changes:
- Adjusted the initial cookie-acquisition logic: if the program does not detect cookies, it now exits immediately instead of crawling further and crashing (a minimal sketch of the idea follows this list);
- Added the WeiBoScraperM class, which is still under construction; PRs implementing it are welcome. This class crawls from Weibo's other domain, i.e. the mobile domain;
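The check itself is simple. Here is a minimal sketch of the exit-if-no-cookies idea; the file name and function below are illustrative, not the project's actual API:

```python
import os
import sys

COOKIES_FILE = 'cookies.pkl'  # illustrative path, not necessarily what the project uses

def ensure_cookies():
    """Exit early when no cookies are available, instead of crawling on and crashing later."""
    if not os.path.exists(COOKIES_FILE) or os.path.getsize(COOKIES_FILE) == 0:
        print('No cookies detected, please obtain your Weibo cookies first. Exiting.')
        sys.exit(1)
```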
You can pull the update.
UPDATE 2017-5-15
After a few small modifications and PRs from several contributors, the code has changed slightly, mostly bug fixes and logic improvements:
- Fixed the saving error; when pulling for the first time you need to re-clone the code;
- The error "WeiboScraper has no attribute weibo_content" has been fixed in the new code;
@Fence submitted a PR with the following changes:
- The fixed 30-second rest has been replaced with a random delay; the specific parameters can be configured by you (see the sketch after this list).
- Added big_v_ids_file to record celebrity ids whose fans have already been saved; it uses a txt format so contributors can manually add and delete entries.
- Both functions now resume crawling from page + 1, to avoid re-crawling the same page when continuing from a breakpoint.
- Changed saving from "after all of an id's Weibo posts and comments have been crawled" to "after each Weibo post and its comments are crawled".
- (Optional) Moved the file-saving code into its own function, since it is called from two or three places.
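A minimal sketch of the random-delay idea; the bounds and function name here are illustrative, and in the project the actual parameters are user-configurable:

```python
import random
import time

# Illustrative bounds; adjust them to taste.
MIN_DELAY = 20   # seconds
MAX_DELAY = 40   # seconds

def polite_sleep(min_delay=MIN_DELAY, max_delay=MAX_DELAY):
    """Sleep for a random interval instead of a fixed 30 s, which looks less bot-like."""
    time.sleep(random.uniform(min_delay, max_delay))
```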
You can run git pull origin master to get the newly updated version. You are also welcome to keep asking me for a uuid; I will publish the list regularly in contirbutor.txt. I have recently been merging the data, as well as cleaning and classifying it. Once the merge is done, I will distribute the large dataset to everyone.
Improvements
The following improvements were made to the previous version:
- No distractions, straight to the point: given an id, fetch all of the user's Weibo posts, the number of posts, the number of fans, and the full post and comment content;
- Unlike the previous version, the philosophy this time is to save all data as dictionaries into three pickle files, which makes resuming from a breakpoint straightforward (a sketch of this scheme follows this list);
- Ids that have already been crawled are not crawled again: the crawler remembers crawled ids, and once all of an id's content has been obtained, the id is marked as crawled;
- Weibo content and Weibo comments are stored separately; if crawling of Weibo content is interrupted, the finished part is not re-crawled the next time, and crawling resumes from the interrupted page number;
- Even more important!!! Each id is crawled independently of the others: you can pull any id's content you want directly from the pickle file and process it however you like!!
- In addition, the new anti-crawl strategy was tested; the adopted delay mechanism works well, although it is not completely foolproof.
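A minimal sketch of the pickle-based checkpointing described above; the file name, keys, and id below are illustrative, not the project's exact layout:

```python
import os
import pickle

CHECKPOINT_FILE = 'weibo_content.pkl'  # illustrative; the project keeps three such files

def load_checkpoint():
    """Load previously crawled data, keyed by user id."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, 'rb') as f:
            return pickle.load(f)
    return {}

def save_checkpoint(data):
    with open(CHECKPOINT_FILE, 'wb') as f:
        pickle.dump(data, f)

data = load_checkpoint()
user_id = '1234567890'  # illustrative Weibo id
record = data.setdefault(user_id, {'finished': False, 'last_page': 0, 'posts': []})

if not record['finished']:
    start_page = record['last_page'] + 1  # resume from the interrupted page
    # ... crawl pages from start_page onward, updating record['posts'] and record['last_page'] ...
    save_checkpoint(data)  # save after each post/page, not only at the very end
```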
Even more important!!! In this version the crawler has become much smarter: while crawling each id, it automatically collects all of that id's fan ids!! In other words, what I give you are seed ids, belonging to celebrities, companies, or big-V media accounts, and from those seeds you can obtain thousands of further ids!! If a celebrity has 34,000 fans, the first crawl already yields 34,000 ids; if each of those child ids has 100 fans, the second pass yields 3.4 million ids!!! Is that enough?!!! Of course not!!!
Our project will never stop!!! It will keep going until enough corpus has been harvested!!!
(Of course we can't actually collect every fan, but this is more than enough.)
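To make the expansion concrete, here is a toy sketch of the breadth-first growth from seed ids to fan ids; get_fan_ids stands in for the crawler's fan-fetching step and is a hypothetical placeholder:

```python
def expand_seeds(seed_ids, get_fan_ids, rounds=2):
    """Breadth-first expansion: each round collects the fans of the ids found in the previous round."""
    seen = set(seed_ids)
    frontier = list(seed_ids)
    for _ in range(rounds):
        next_frontier = []
        for uid in frontier:
            for fan_id in get_fan_ids(uid):  # hypothetical fan-fetching call
                if fan_id not in seen:
                    seen.add(fan_id)
                    next_frontier.append(fan_id)
        frontier = next_frontier
    return seen

# Rough numbers from the text: one celebrity with 34,000 fans gives 34,000 ids in round one;
# at ~100 fans per child id, round two gives on the order of 34,000 * 100 = 3,400,000 ids.
```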
Work Flow
This version is aimed at contributors, and the workflow is very simple:
- Get a uuid. Each uuid maps to 2-3 ids in distribution_ids.pkl; these are your seed ids. You could also fetch all the ids directly, but to avoid duplicated work it is recommended that you apply for a uuid from me and take responsibility only for your own. After crawling, send the resulting file back to me; once I have organized and de-duplicated everything, I will distribute the final large corpus to everyone.
- Run python3 main.py uuid. Note that the fan ids are collected only after the ids assigned to your uuid have been crawled (a small lookup sketch follows this list);
- Done!
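For orientation, a sketch of what the uuid-to-seed-ids lookup might look like; the exact structure of distribution_ids.pkl is an assumption here, not documented behavior:

```python
import pickle
import sys

# Assumption: distribution_ids.pkl maps each uuid to its 2-3 assigned seed ids.
with open('distribution_ids.pkl', 'rb') as f:
    distribution = pickle.load(f)

uuid = sys.argv[1]  # e.g. python3 show_ids.py <your-uuid>
seed_ids = distribution.get(uuid, [])
print('Seed ids assigned to {}: {}'.format(uuid, seed_ids))
```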
Discuss
I'm also posting the discussion groups here; everyone is welcome to join:
QQ
AI智能自然语言处理: 476464663
Tensorflow智能聊天Bot: 621970965
GitHub深度学习开源交流: 263018023
You can add me as a friend on WeChat: jintianiloveu
Copyright
(c) 2017 Jin Fagang & Tianmu Inc. & weibo_terminator authors. Licensed under Apache 2.0.