crawler4j is good at crawling pages and extracting structured results. It parses with jsoup, whose jQuery-like selector syntax is familiar to any programmer who knows jQuery. However, crawler4j does not honor the response's character encoding when it reads the body, so parsed pages can come back as mojibake, which is very annoying. After running into this pain point, I stumbled on an old blog post that solves the problem by changing how contentData is decoded in Page.load(). That made life much more comfortable, and the problems that followed were all solved by this one trick.
The code is as follows:
// Imports needed in Page.java (Apache HttpClient 4.x):
// import java.nio.charset.Charset;
// import org.apache.http.Header;
// import org.apache.http.HttpEntity;
// import org.apache.http.entity.ContentType;
// import org.apache.http.util.EntityUtils;

public void load(HttpEntity entity) throws Exception {
    // Record the Content-Type header, if the server sent one.
    contentType = null;
    Header type = entity.getContentType();
    if (type != null) {
        contentType = type.getValue();
    }

    // Record the Content-Encoding header (e.g. gzip), if present.
    contentEncoding = null;
    Header encoding = entity.getContentEncoding();
    if (encoding != null) {
        contentEncoding = encoding.getValue();
    }

    // Use the charset declared in the Content-Type header, defaulting to UTF-8.
    Charset charset = ContentType.getOrDefault(entity).getCharset();
    if (charset != null) {
        contentCharset = charset.displayName();
    } else {
        contentCharset = "utf-8";
    }

    // Original code:
    // contentData = EntityUtils.toByteArray(entity);

    // Modified code: decode the raw body as GBK, then re-encode it with
    // contentCharset so the stored bytes agree with the charset recorded
    // above. (The original post called getBytes() with no argument, which
    // silently uses the platform default charset.)
    contentData = EntityUtils.toString(entity, Charset.forName("GBK"))
            .getBytes(Charset.forName(contentCharset));
}
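
Hard-coding GBK suits the mainland-Chinese sites the trick was written for, but it will mangle pages that really are UTF-8. A more general variant, sketched below against the same HttpClient 4.x API, decodes with whatever charset the response declares and falls back to GBK only when nothing is declared. The helper name pickCharset is my own for illustration, not part of crawler4j:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import org.apache.http.HttpEntity;
import org.apache.http.entity.ContentType;
import org.apache.http.util.EntityUtils;

// Hypothetical helper: prefer the charset the server declares,
// fall back to GBK only for responses that omit it.
private static Charset pickCharset(HttpEntity entity) {
    Charset declared = ContentType.getOrDefault(entity).getCharset();
    return declared != null ? declared : Charset.forName("GBK");
}

// Inside load(): decode with the chosen charset, store the body as UTF-8
// bytes, and record "UTF-8" so downstream parsing stays consistent.
String html = EntityUtils.toString(entity, pickCharset(entity));
contentData = html.getBytes(StandardCharsets.UTF_8);
contentCharset = "UTF-8";

Either way, the key point of the fix stands: the charset used to decode contentData must agree with the contentCharset that the parser later uses, otherwise the output is mojibake again.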