Extracting plain text from HTML in Java

Author: Eve Cole  Last updated: 2025-08-25 23:48:01

2. Code 1: using replaceAll

    // Extract plain text from HTML
    public static String stripHT(String strHtml) {
        // Strip the HTML tags
        String txtContent = strHtml.replaceAll("</?[^>]+>", "");
        // Remove whitespace from the string: spaces, carriage returns, line feeds, tabs
        txtContent = txtContent.replaceAll("\\s+", "");
        return txtContent;
    }
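As a quick check of how the tag-stripping pattern above behaves, here is a minimal sketch. The class name `StripDemo` and the helper `collapseWhitespace` are hypothetical and not part of the original article. It illustrates that deleting every whitespace character also joins adjacent words, and shows collapsing whitespace runs to a single space as one possible alternative when word boundaries matter.

```java
class StripDemo {
    // Remove anything that looks like a tag, e.g. <p> or </p>.
    static String stripTags(String html) {
        return html.replaceAll("</?[^>]+>", "");
    }

    // Hypothetical alternative: collapse whitespace runs instead of deleting them.
    static String collapseWhitespace(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>Hello,\n  <b>plain</b> text</p>";
        String noTags = stripTags(html);
        // Deleting all whitespace glues the words together.
        System.out.println(noTags.replaceAll("\\s+", ""));
        // Collapsing whitespace keeps the words separated.
        System.out.println(collapseWhitespace(noTags));
    }
}
```

Which variant is appropriate depends on whether the result is meant for display or for something like keyword matching, where whitespace may be irrelevant.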

3. Code 2: using regular expressions

    // Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;

    // Extract plain text from HTML
    public static String html2Text(String inputString) {
        String htmlStr = inputString; // string containing HTML tags
        String textStr = "";
        Pattern p_script;
        Matcher m_script;
        Pattern p_style;
        Matcher m_style;
        Pattern p_html;
        Matcher m_html;
        try {
            // Regular expression for script blocks (or: <script[^>]*?>[\\s\\S]*?</script>)
            String regEx_script = "<[\\s]*?script[^>]*?>[\\s\\S]*?<[\\s]*?/[\\s]*?script[\\s]*?>";
            // Regular expression for style blocks (or: <style[^>]*?>[\\s\\S]*?</style>)
            String regEx_style = "<[\\s]*?style[^>]*?>[\\s\\S]*?<[\\s]*?/[\\s]*?style[\\s]*?>";
            // Regular expression for any remaining HTML tag
            String regEx_html = "<[^>]+>";

            p_script = Pattern.compile(regEx_script, Pattern.CASE_INSENSITIVE);
            m_script = p_script.matcher(htmlStr);
            htmlStr = m_script.replaceAll(""); // filter out script blocks

            p_style = Pattern.compile(regEx_style, Pattern.CASE_INSENSITIVE);
            m_style = p_style.matcher(htmlStr);
            htmlStr = m_style.replaceAll(""); // filter out style blocks

            p_html = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE);
            m_html = p_html.matcher(htmlStr);
            htmlStr = m_html.replaceAll(""); // filter out HTML tags

            textStr = htmlStr;
        } catch (Exception e) {
            System.err.println("html2Text: " + e.getMessage());
        }
        // Collapse runs of spaces and drop blank lines
        textStr = textStr.replaceAll("[ ]+", " ");
        textStr = textStr.replaceAll("(?m)^\\s*$(\\n|\\r\\n)", "");
        return textStr; // return the text string
    }
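The reason for removing script and style blocks before applying the generic tag pattern can be seen in a small experiment. The sketch below is illustrative only: the class name `ScriptFirstDemo` and its two methods are hypothetical, and the script pattern is the shorter alternative form mentioned in the code comments rather than the full one.

```java
import java.util.regex.Pattern;

class ScriptFirstDemo {
    // Shorter form of the script pattern; assumes no whitespace after '<'.
    private static final Pattern SCRIPT =
            Pattern.compile("<script[^>]*?>[\\s\\S]*?</script\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Strips tags only: the JavaScript source between the script tags survives.
    static String tagsOnly(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    // Strips whole script blocks first, then any remaining tags.
    static String scriptThenTags(String html) {
        String withoutScript = SCRIPT.matcher(html).replaceAll("");
        return TAG.matcher(withoutScript).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hi</p><SCRIPT>var x = 1;</SCRIPT>";
        System.out.println(tagsOnly(html));       // script body leaks into the text
        System.out.println(scriptThenTags(html)); // only the visible text remains
    }
}
```

The same ordering argument applies to style blocks, whose CSS rules would otherwise end up in the extracted text.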

4. Code 3: using HTMLEditorKit.ParserCallback, a class that ships with Java

    package com.util;

    import java.io.*;
    import javax.swing.text.html.*;
    import javax.swing.text.html.parser.*;

    public class Html2Text extends HTMLEditorKit.ParserCallback {
        StringBuffer s;

        public Html2Text() {}

        public void parse(Reader in) throws IOException {
            s = new StringBuffer();
            ParserDelegator delegator = new ParserDelegator();
            // The third parameter is TRUE to ignore the charset directive
            delegator.parse(in, this, Boolean.TRUE);
        }

        // Called by the parser for each run of text between tags
        public void handleText(char[] text, int pos) {
            s.append(text);
        }

        public String getText() {
            return s.toString();
        }

        public static void main(String[] args) {
            try {
                // The HTML to convert
                // Reader in = new StringReader("string");
                FileReader in = new FileReader("java-new.html");
                Html2Text parser = new Html2Text();
                parser.parse(in);
                in.close();
                System.out.println(parser.getText());
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
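When the HTML is already in memory, the same callback approach can read from a `StringReader` instead of a file, as the commented-out line in the listing hints. The following is a minimal sketch under that assumption. The class name `StringHtml2Text` and the static helper `extract` are hypothetical, and the exact whitespace in the result depends on how the Swing parser reports text.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class StringHtml2Text extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    // Invoked by the parser for each run of text it encounters.
    @Override
    public void handleText(char[] data, int pos) {
        text.append(data);
    }

    // Hypothetical convenience wrapper: parse an in-memory HTML string.
    static String extract(String html) {
        StringHtml2Text callback = new StringHtml2Text();
        try (Reader in = new StringReader(html)) {
            // Third argument true: ignore any charset directive in the document.
            new ParserDelegator().parse(in, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

Unlike the regex-based versions, this delegates tag recognition to a real HTML parser, so malformed markup is handled by the parser rather than by a hand-written pattern.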

That is everything I have to share on extracting plain text from HTML in Java. I hope it serves as a useful reference, and I hope you will continue to support Wulin.com.