Sina Weibo crawler link:
https://github.com/HUANGZHIHAO1994/weibospider-keyword
Weibo content data structure (JSON document exported from the MongoDB database)
content_example:
[
{'_id': '1177737142_H4PSVeZWD', 'keyword': 'A股', 'crawl_time': '2019-06-01 20:31:13', 'weibo_url': 'https://weibo.com/1177737142/H4PSVeZWD', 'user_id': '1177737142', 'created_at': '2018-11-29 03:02:30', 'tool': 'Android', 'like_num': {'$numberInt': '0'}, 'repost_num': {'$numberInt': '0'}, 'comment_num': {'$numberInt': '0'}, 'image_url': 'http://wx4.sinaimg.cn/wap180/4632d7b6ly1fxod61wktyj20u00m8ahf.jpg', 'content': '#a股观点# 鲍威尔主席或是因为被特朗普总统点名批评后萌生悔改之意,今晚一番讲话被市场解读为美联储或暂停加息步伐。美元指数应声下挫,美股及金属贵金属价格大幅上扬,A50表现也并不逊色太多。对明天A股或有积极影响,反弹或能得以延续。 [组图共2张]'},...
]
Weibo comment data structure (JSON document exported from the MongoDB database)
comment_example:
[
{'_id': 'C_4322161898716112', 'crawl_time': '2019-06-01 20:35:36', 'weibo_url': 'https://weibo.com/1896820725/H9inNf22b', 'comment_user_id': '6044625121', 'content': '没问题,', 'like_num': {'$numberInt': '0'}, 'created_at': '2018-12-28 11:19:21'},...
]
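Both exports wrap numeric fields in MongoDB extended-JSON objects such as {'$numberInt': '0'}. The sketch below is not part of the repo; it shows one way to load the dumps and flatten those wrappers before analysis, assuming each dump is a JSON array like the examples above (the file names content.json and comment.json match the dump names mentioned at the end of this README).

import json

def flatten_numbers(doc):
    # Replace {'$numberInt': '0'}-style wrappers with plain Python ints.
    for key, value in doc.items():
        if isinstance(value, dict) and '$numberInt' in value:
            doc[key] = int(value['$numberInt'])
    return doc

with open('content.json', encoding='utf-8') as f:
    contents = [flatten_numbers(d) for d in json.load(f)]
with open('comment.json', encoding='utf-8') as f:
    comments = [flatten_numbers(d) for d in json.load(f)]

print(contents[0]['like_num'], comments[0]['content'])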
prepro.py, pre_graph.py, senti_pre.py
To support the various analyses, the data must first be preprocessed. See these three .py files for the exact input file types they expect and the structure of the data they output.
PS:
When running prepro.py, modify the code at lines 123, 143, and 166 as needed.
When running pre_graph.py, modify the code at lines 127 and 140 as needed.
When running senti_pre.py, modify the code at line 119 as needed.
zh_wiki.py, langconv.py
These two .py files convert Traditional Chinese to Simplified Chinese and can be used without modification.
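Typical usage (the Converter API below is the one exposed by the widely circulated langconv.py, which reads its mapping tables from zh_wiki.py):

from langconv import Converter

def t2s(text):
    # 'zh-hans' selects the Traditional -> Simplified mapping.
    return Converter('zh-hans').convert(text)

print(t2s('金屬貴金屬價格大幅上揚'))  # -> 金属贵金属价格大幅上扬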
Word cloud: wc.py (requires running prepro.py first)
Modify the code at lines 3, 19, and 26 as needed.
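A minimal sketch of the word-cloud step, assuming the jieba and wordcloud packages and a hypothetical segmented text file produced by prepro.py (the actual input path is one of the things wc.py lets you modify):

import jieba
from wordcloud import WordCloud

with open('content_prepro.txt', encoding='utf-8') as f:  # hypothetical prepro.py output
    text = ' '.join(jieba.cut(f.read()))

wc = WordCloud(font_path='simhei.ttf',  # a Chinese font file is required
               background_color='white', width=800, height=600).generate(text)
wc.to_file('wordcloud.png')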
Popularity map: map.py (requires running prepro.py first)
Modify the code at line 8 as needed.
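For orientation only, a sketch of a province-level popularity map with pyecharts (whether map.py uses pyecharts, and which version, is an assumption; the data pairs are placeholders for the counts prepro.py produces):

from pyecharts import options as opts
from pyecharts.charts import Map

data = [('广东', 1024), ('北京', 896), ('上海', 770)]  # placeholder (province, count) pairs

m = (
    Map()
    .add('Weibo popularity', data, 'china')
    .set_global_opts(title_opts=opts.TitleOpts(title='Popularity by province'),
                     visualmap_opts=opts.VisualMapOpts(max_=1100))
)
m.render('map.html')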
Repost, comment, and like time series: line.py (requires running senti_pre.py and senti_analy.py first)
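A hedged sketch of such a time series, assuming the senti_analy.py output CSV carries created_at, repost_num, comment_num, and like_num columns (the column names are assumptions):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('Senti_Keyword_total.csv', parse_dates=['created_at'])
daily = (df.set_index('created_at')[['repost_num', 'comment_num', 'like_num']]
           .resample('D').sum())  # daily totals

daily.plot(figsize=(12, 5), title='Reposts / comments / likes per day')
plt.savefig('line.png')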
Weibo comment relationship diagram: graph.py (requires running pre_graph.py first; for reference)
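A minimal sketch of a comment relationship graph with networkx, assuming pre_graph.py yields (commenter, poster) user-id pairs; the edges here are placeholders:

import networkx as nx
import matplotlib.pyplot as plt

edges = [('6044625121', '1896820725'), ('6044625121', '1177737142')]  # placeholder pairs

G = nx.DiGraph()
G.add_edges_from(edges)  # edge: comment_user_id -> weibo_user_id

nx.draw(G, with_labels=True, node_size=300, font_size=8)
plt.savefig('graph.png')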
Text clustering: cluster_tfidf.py and cluster_w2v.py (requires running prepro.py first)
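The TF-IDF variant boils down to vectorizing the segmented texts and clustering them; a sketch (parameters illustrative, input file hypothetical):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

with open('content_prepro.txt', encoding='utf-8') as f:  # hypothetical prepro.py output,
    docs = [line.strip() for line in f if line.strip()]  # one segmented Weibo per line

X = TfidfVectorizer(max_features=5000).fit_transform(docs)
labels = KMeans(n_clusters=5, random_state=0).fit_predict(X)
print(labels[:10])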
LDA topic model analysis: LDA.py (requires running senti_pre.py first); tree.py (requires running senti_analy.py first)
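An illustrative gensim LDA sketch under the same assumption of one space-segmented document per line; num_topics and passes are arbitrary:

from gensim import corpora, models

with open('content_prepro.txt', encoding='utf-8') as f:  # hypothetical segmented corpus
    texts = [line.split() for line in f if line.strip()]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=5, passes=10)
for topic_id, words in lda.print_topics(num_topics=5, num_words=8):
    print(topic_id, words)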
Sentiment analysis (dictionary-based): senti_analy.py (requires running senti_pre.py first); 3Dbar.py and pie.py (require running senti_analy.py first)
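As a toy illustration of dictionary-based scoring (the word lists below are placeholders; senti_analy.py uses its own sentiment dictionaries):

import jieba

POS = {'反弹', '积极', '上扬'}  # placeholder positive words
NEG = {'下挫', '下跌', '批评'}  # placeholder negative words

def senti_score(text):
    # positive hits minus negative hits over the segmented text
    words = list(jieba.cut(text))
    return sum(w in POS for w in words) - sum(w in NEG for w in words)

print(senti_score('美元指数应声下挫,A股反弹或能得以延续'))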
Sentiment analysis (W2V+LSTM): senti_lstm.py in the Sentiment-Analysis-master directory (requires running senti_pre.py first)
Modify the code at line 250 as needed.
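A compressed, Keras 2-style sketch of the W2V+LSTM idea, not the repo's exact network: texts become padded index sequences, the Embedding layer can be initialized from pretrained word2vec vectors, and an LSTM performs binary sentiment classification. All hyperparameters are illustrative.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size, embed_dim, maxlen = 20000, 100, 50  # illustrative values

model = Sequential([
    Embedding(vocab_size, embed_dim, input_length=maxlen),  # optionally load w2v weights
    LSTM(64, dropout=0.2, recurrent_dropout=0.2),
    Dense(1, activation='sigmoid'),  # positive vs. negative
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()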
Some files are too large and are hosted on Baidu Netdisk:
Link: https://pan.baidu.com/s/1l447d3d6OSd_yAlsF7b_mA Extraction code: og9t
Text similarity analysis: similar.py (for reference only)
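One common way to implement such a similarity check (whether similar.py does exactly this is an assumption) is TF-IDF cosine similarity over jieba-segmented texts:

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a = '美联储或暂停加息步伐'
b = '美联储今晚讲话暗示暂停加息'

X = TfidfVectorizer().fit_transform([' '.join(jieba.cut(a)), ' '.join(jieba.cut(b))])
print(cosine_similarity(X[0], X[1])[0, 0])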
Other scripts available for reference: senti_analy_refer.py, Sentiment_lstm.py
About Senti_Keyword_total_id.csv:
Download item 8, Senti_Keyword_total_id.csv, from the Baidu Netdisk. This file is almost identical to Senti_Keyword_total.csv except for an additional weibo_id column. (The code that generates Senti_Keyword_total_id.csv is not provided here; it can be reproduced by rewriting senti_analy.py to output an extra weibo_id column.) Item 8 on the Netdisk contains both Senti_Keyword_total_id.csv and Senti_Keyword_total.csv, as well as all comments and all contents. Since line.py and the other scripts need the data for all keywords, run senti_analy.py directly on the full comment.json and content.json to generate Senti_Keyword_total.csv, or simply download Senti_Keyword_total_id.csv from the Netdisk and then run line.py, 3Dbar.py, and pie.py.