A Java method for extracting plain text from HTML text

Author: Eve Cole    Last updated: 2025-08-25 23:48:01

1. Use case: extract plain text from an HTML file or from a String that contains HTML, removing the web page's tags.

2. Code 1: using replaceAll

    // Extract plain text from HTML
    public static String stripHT(String strHtml) {
        // Strip every HTML tag, opening or closing
        String txtContent = strHtml.replaceAll("</?[^>]+>", "");
        // Remove whitespace in the string: spaces, carriage returns, line breaks, tabs.
        // Note that this also removes the spaces between words.
        txtContent = txtContent.replaceAll("\\s+", "");
        return txtContent;
    }
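Removing tags and whitespace outright can fuse adjacent words, which matters for space-delimited languages. As a minimal sketch of one possible variant (not the author's method), each tag can be replaced with a space and the resulting whitespace runs collapsed. The class name `StripTagsSketch` and the method name `stripTags` are hypothetical, chosen only for this illustration.

```java
final class StripTagsSketch {
    // Hypothetical helper: replace each tag with a space so neighbouring
    // words stay separated, then collapse whitespace runs and trim the ends.
    static String stripTags(String html) {
        String noTags = html.replaceAll("</?[^>]+>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>Hello,\n <b>plain</b>\ttext</p>";
        System.out.println(stripTags(html)); // prints "Hello, plain text"
    }
}
```

Like the code above it, this is still a purely textual substitution: it does not decode entities such as `&amp;` and it keeps the contents of `<script>` and `<style>` elements.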

3. Code 2: using regular expressions

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Extract plain text from HTML
    public static String html2Text(String inputString) {
        String htmlStr = inputString; // string containing HTML tags
        String textStr = "";
        try {
            // Regular expression for script blocks
            // (a simpler alternative: <script[^>]*?>[\\s\\S]*?<\\/script>)
            String regExScript = "<[\\s]*?script[^>]*?>[\\s\\S]*?<[\\s]*?\\/[\\s]*?script[\\s]*?>";
            // Regular expression for style blocks
            // (a simpler alternative: <style[^>]*?>[\\s\\S]*?<\\/style>)
            String regExStyle = "<[\\s]*?style[^>]*?>[\\s\\S]*?<[\\s]*?\\/[\\s]*?style[\\s]*?>";
            // Regular expression for any remaining HTML tag
            String regExHtml = "<[^>]+>";

            Pattern pScript = Pattern.compile(regExScript, Pattern.CASE_INSENSITIVE);
            Matcher mScript = pScript.matcher(htmlStr);
            htmlStr = mScript.replaceAll(""); // filter out script blocks

            Pattern pStyle = Pattern.compile(regExStyle, Pattern.CASE_INSENSITIVE);
            Matcher mStyle = pStyle.matcher(htmlStr);
            htmlStr = mStyle.replaceAll(""); // filter out style blocks

            Pattern pHtml = Pattern.compile(regExHtml, Pattern.CASE_INSENSITIVE);
            Matcher mHtml = pHtml.matcher(htmlStr);
            htmlStr = mHtml.replaceAll(""); // filter out HTML tags

            textStr = htmlStr;
        } catch (Exception e) {
            System.err.println("html2Text: " + e.getMessage());
        }
        // Collapse runs of spaces into a single space
        textStr = textStr.replaceAll("[ ]+", " ");
        // Delete blank lines
        textStr = textStr.replaceAll("(?m)^\\s*$(\\n|\\r\\n)", "");
        return textStr; // return the text string
    }
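To illustrate why script and style blocks are filtered before the generic tag pattern, the following sketch compares a tag-only pass with a script-aware pass on the same input. It is an illustration under assumptions rather than the article's code: the class name `ScriptAwareStripSketch`, the method names, and the slightly simplified patterns are hypothetical, and the patterns are compiled once as constants instead of on every call.

```java
import java.util.regex.Pattern;

final class ScriptAwareStripSketch {
    // Simplified, assumed patterns; compiled once and reused.
    private static final Pattern SCRIPT = Pattern.compile(
            "<\\s*script[^>]*>[\\s\\S]*?<\\s*/\\s*script\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE = Pattern.compile(
            "<\\s*style[^>]*>[\\s\\S]*?<\\s*/\\s*style\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Removes tags only; script and style *contents* survive as text.
    static String tagsOnly(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    // Removes whole script and style elements first, then remaining tags.
    static String scriptAware(String html) {
        String s = SCRIPT.matcher(html).replaceAll("");
        s = STYLE.matcher(s).replaceAll("");
        return TAG.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<style>p{color:red}</style><p>Hi</p><SCRIPT>alert(1);</SCRIPT>";
        System.out.println(tagsOnly(html));    // prints "p{color:red}Hialert(1);"
        System.out.println(scriptAware(html)); // prints "Hi"
    }
}
```

The order matters: once the generic tag pattern has run, the `<script>` and `<style>` delimiters are gone and their contents can no longer be distinguished from ordinary text.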

4. Code 3: using HTMLEditorKit.ParserCallback, a class that ships with Java itself

    package com.util;

    import java.io.*;
    import javax.swing.text.html.*;
    import javax.swing.text.html.parser.*;

    public class Html2Text extends HTMLEditorKit.ParserCallback {
        StringBuffer s;

        public Html2Text() {}

        public void parse(Reader in) throws IOException {
            s = new StringBuffer();
            ParserDelegator delegator = new ParserDelegator();
            // The third parameter is TRUE to ignore the charset directive
            delegator.parse(in, this, Boolean.TRUE);
        }

        // Called by the parser for each run of text between tags
        public void handleText(char[] text, int pos) {
            s.append(text);
        }

        public String getText() {
            return s.toString();
        }

        public static void main(String[] args) {
            try {
                // The HTML to convert
                // Reader in = new StringReader("string");
                FileReader in = new FileReader("Java-New.html");
                Html2Text parser = new Html2Text();
                parser.parse(in);
                in.close();
                System.out.println(parser.getText());
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
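The same callback approach also works on an in-memory String, as the commented-out `StringReader` line suggests. The sketch below assumes a hypothetical class `CallbackTextSketch` with a static `extract` helper. It inserts a space between text chunks so that words from adjacent elements do not run together, which is a design choice of this sketch and not something the code above does.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class CallbackTextSketch extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    // Invoked by the parser for each chunk of text it encounters.
    @Override
    public void handleText(char[] data, int pos) {
        if (text.length() > 0) {
            text.append(' '); // separate chunks (assumption of this sketch)
        }
        text.append(data);
    }

    // Hypothetical convenience wrapper: parse an HTML string, return its text.
    static String extract(String html) {
        CallbackTextSketch callback = new CallbackTextSketch();
        try (Reader in = new StringReader(html)) {
            // Third argument true: ignore any charset directive in the document.
            new ParserDelegator().parse(in, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

Because a real HTML parser does the work, this route tends to cope better with malformed markup than the regular-expression versions, at the cost of depending on the Swing classes in the `java.desktop` module.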

That is everything I have to share about extracting plain text from HTML text in Java. I hope it serves as a useful reference, and I hope you will continue to support wulin.com.