Fixing garbled text when parsing HTML with jsoup on pages crawled by crawler4j

Author: Eve Cole  Updated: 2025-03-06 05:32:01

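crawler4j handles pages that declare their encoding well. When the response does not specify an encoding, however, the parsed result comes out as garbled text, which is very annoying. After struggling with this for a while, I found an old blog post that solves the problem: modify how the contentData bytes are produced in Page.load(). That fixed it.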
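
As a minimal illustration of why the text ends up garbled (this sketch is not from the original post; the class name `MojibakeDemo`, the helper `decode`, and the sample string are assumptions chosen for the example), the snippet below encodes two Chinese characters as GBK and then decodes the same bytes once as UTF-8 and once as GBK. Only the decode that matches the original encoding recovers the text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Hypothetical helper: decode raw bytes with the given charset.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String original = "\u4e2d\u6587"; // two Chinese characters, written as escapes

        // Bytes as a server using GBK would send them, with no charset declared.
        byte[] raw = original.getBytes(gbk);

        // Guessing UTF-8 produces replacement characters instead of the text.
        System.out.println(original.equals(decode(raw, StandardCharsets.UTF_8))); // false

        // Decoding with the charset the bytes were written in recovers the text.
        System.out.println(original.equals(decode(raw, gbk))); // true
    }
}
```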

The code is as follows:

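public void load(HttpEntity entity) throws Exception {
    // Record the Content-Type header, if the response has one.
    contentType = null;
    Header type = entity.getContentType();
    if (type != null) {
        contentType = type.getValue();
    }

    // Record the Content-Encoding header, if the response has one.
    contentEncoding = null;
    Header encoding = entity.getContentEncoding();
    if (encoding != null) {
        contentEncoding = encoding.getValue();
    }

    // Use the declared charset when present; otherwise fall back to UTF-8.
    Charset charset = ContentType.getOrDefault(entity).getCharset();
    if (charset != null) {
        contentCharset = charset.displayName();
    } else {
        contentCharset = "UTF-8";
    }

    // Original code:
    // contentData = EntityUtils.toByteArray(entity);

    // Modified code: decode the body, using GBK only when the entity declares
    // no charset, then re-encode with contentCharset so the stored bytes match
    // the charset recorded above. The post used a bare getBytes(), which
    // depends on the platform default charset.
    contentData = EntityUtils.toString(entity, Charset.forName("gbk")).getBytes(contentCharset);
}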
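
The same idea can be sketched outside of crawler4j using only the JDK. This is a hedged illustration, not crawler4j's API: the class `CharsetFallback` and the method `toUtf8` are hypothetical names, and the choice of GBK as the fallback is an assumption that only suits sites known to serve GBK content.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFallback {
    // Hypothetical helper: decode with the declared charset when there is one,
    // otherwise with the fallback, and return the text re-encoded as UTF-8.
    static byte[] toUtf8(byte[] raw, Charset declared, Charset fallback) {
        Charset source = (declared != null) ? declared : fallback;
        return new String(raw, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        byte[] raw = "\u4e2d\u6587".getBytes(gbk); // GBK body, no charset declared

        // No declared charset (null), so the GBK fallback is used for decoding.
        byte[] utf8 = toUtf8(raw, null, gbk);

        // The result now decodes correctly as UTF-8.
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals("\u4e2d\u6587")); // true
    }
}
```

A hard-coded fallback is a pragmatic choice for crawling a known set of sites. For mixed content, the fallback would have to come from somewhere else, such as a charset declared in the page's meta tags.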