This article shares a Java-based Zhihu crawler that captures basic information about Zhihu users, built on HttpClient 4.5, for your reference. The details are as follows.
Details:
Crawled 900,000+ users' information (which covers essentially all active users)
General idea:
1. First, simulate logging in to Zhihu. After a successful login, serialize the cookies to disk so that later runs do not need to log in again (if you do not want to simulate the login, you can copy the cookies directly from the browser instead). A sketch of the cookie persistence helpers appears after this list.
2. Create two thread pools and one Storage. One pool crawls web pages: it executes the HTTP requests and saves the returned page content into Storage. The other pool parses pages: it takes page content out of Storage, parses out the user's information and saves it to the database, extracts the homepage URLs of the people that user follows, and submits those URLs back to the crawling pool. The two pools keep feeding each other this way; see the skeleton after this list.
3. For URL deduplication, I store the MD5 of each visited link in the database and check whether the link already exists there before each visit (an MD5 sketch also follows this list).
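For step 1, the HttpClientUtil.serializeObject helper used in the login code below is the project's own. As a rough idea of what such persistence helpers might look like, here is a minimal sketch using plain Java object serialization; the deserializeObject name and both signatures are assumptions, not the project's actual API:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class CookieUtil {
    // Write any serializable object (e.g. a BasicCookieStore) to disk.
    public static void serializeObject(Object obj, String path) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(obj);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Read the object back on the next run; returns null if no saved cookies exist yet.
    public static Object deserializeObject(String path) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            return null; // first run: fall back to the simulated login
        }
    }
}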
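For step 2, here is a minimal skeleton of the two-pool, one-Storage architecture. All names here (CrawlerSkeleton, the queue used as Storage, the pool sizes, the private helper methods) are illustrative assumptions, not the project's actual classes:

import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class CrawlerSkeleton {
    // "Storage": a bounded queue of downloaded pages waiting to be parsed.
    private final BlockingQueue<String> storage = new LinkedBlockingQueue<String>(1000);
    private final ExecutorService downloadPool = Executors.newFixedThreadPool(10);
    private final ExecutorService parsePool = Executors.newFixedThreadPool(5);

    // Download side: execute the request and put the page content into Storage.
    public void submitUrl(final String url) {
        downloadPool.execute(() -> {
            String html = fetch(url);
            if (html != null) {
                storage.offer(html);
            }
        });
    }

    // Parse side: take pages out of Storage, save the user's info to the
    // database, and feed the followees' homepage URLs back to the download pool.
    public void startParser() {
        parsePool.execute(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    String html = storage.take();
                    saveUserInfo(html);
                    for (String next : extractFolloweeUrls(html)) {
                        submitUrl(next); // keep the cycle going
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
    }

    private String fetch(String url) { return null; }          // HttpClient request goes here
    private void saveUserInfo(String html) { }                 // parse + insert into DB
    private List<String> extractFolloweeUrls(String html) { return Collections.emptyList(); }
}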
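For step 3, a minimal sketch of hashing a URL with MD5 before the database lookup; the table name and query in the comment are assumptions:

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class UrlDedup {
    // MD5 of a URL as a 32-character hex string; store this instead of the raw link.
    public static String md5(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is guaranteed to exist on every JVM
        }
    }
    // Before each visit: SELECT 1 FROM visited_url WHERE md5 = ? (schema assumed);
    // crawl only when the row is absent, then INSERT the hash after visiting.
}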
So far, 1,000,000 users have been captured and 2,200,000+ links have been visited. The users being crawled now are mostly inactive ones; the more active users should basically all have been caught already.
Project address: https://github.com/wycm/mycrawler
Implementation code:
Author: Wo Yan Chen Si. Source: https://www.zhihu.com/question/36909173/answer/97643000

import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.message.BasicNameValuePair;
import org.json.JSONObject;

// HttpClientUtil is the project's own helper class (see the GitHub repository above).

/**
 * Simulate logging in to Zhihu.
 * @param httpClient Http client
 * @param context Http context
 * @return true if the login succeeded
 */
public boolean login(CloseableHttpClient httpClient, HttpClientContext context){
    String yzm = null;
    String loginState = null;
    HttpGet getRequest = new HttpGet("https://www.zhihu.com/#signin");
    HttpClientUtil.getWebPage(httpClient, context, getRequest, "utf-8", false);
    HttpPost request = new HttpPost("https://www.zhihu.com/login/email");
    List<NameValuePair> formParams = new ArrayList<NameValuePair>();
    yzm = yzm(httpClient, context, "https://www.zhihu.com/captcha.gif?type=login"); // recognize the captcha by eye
    formParams.add(new BasicNameValuePair("captcha", yzm));
    formParams.add(new BasicNameValuePair("_xsrf", "")); // this parameter can be left empty
    formParams.add(new BasicNameValuePair("email", "email"));
    formParams.add(new BasicNameValuePair("password", "password"));
    formParams.add(new BasicNameValuePair("remember_me", "true"));
    UrlEncodedFormEntity entity = null;
    try {
        entity = new UrlEncodedFormEntity(formParams, "utf-8");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
    request.setEntity(entity);
    loginState = HttpClientUtil.getWebPage(httpClient, context, request, "utf-8", false); // log in
    JSONObject jo = new JSONObject(loginState);
    if(jo.get("r").toString().equals("0")){
        System.out.println("Login successfully");
        getRequest = new HttpGet("https://www.zhihu.com");
        HttpClientUtil.getWebPage(httpClient, context, getRequest, "utf-8", false); // visit the home page
        // Serialize the Zhihu cookies; the next run can log in directly through them.
        HttpClientUtil.serializeObject(context.getCookieStore(), "resources/zhihucookies");
        return true;
    }else{
        System.out.println("Login failed: " + loginState);
        return false;
    }
}

/**
 * Recognize the captcha with the naked eye.
 * @param httpClient Http client
 * @param context Http context
 * @param url captcha URL
 * @return the captcha text typed in by the user
 */
public String yzm(CloseableHttpClient httpClient, HttpClientContext context, String url){
    // Download the captcha image to disk, then read the answer from standard input.
    HttpClientUtil.downloadFile(httpClient, context, url, "d:/test/", "1.gif", true);
    Scanner sc = new Scanner(System.in);
    String yzm = sc.nextLine();
    return yzm;
}
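For reference, here is a hypothetical way the login and the serialized cookies could be wired together at startup. HttpClientUtil.deserializeObject and the ZhihuLogin class name are assumptions for illustration, not code from the project:

CloseableHttpClient httpClient = HttpClients.createDefault();
HttpClientContext context = HttpClientContext.create();
// Try to reuse the cookies serialized by a previous successful login.
CookieStore cookies = (CookieStore) HttpClientUtil.deserializeObject("resources/zhihucookies");
if (cookies != null) {
    context.setCookieStore(cookies);             // reuse the saved session
} else {
    new ZhihuLogin().login(httpClient, context); // first run: simulate the login
}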
That is all for this article. I hope it is helpful to everyone's learning.