Extracting plain text from HTML in Java

Author: Eve Cole    Updated: 2025-08-25 23:48:01

1. Use case: extract the plain text from an HTML file or from a string that contains HTML, stripping out the web page's tags.

2. Code 1: using replaceAll

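    // Extract plain text from HTML
    public static String stripHT(String strHtml) {
        // Remove every opening and closing tag, e.g. <html> or </p>
        String txtContent = strHtml.replaceAll("</?[^>]+>", "");
        // Remove all whitespace: spaces, tabs, carriage returns and newlines.
        // Note that this also joins words that were separated by a space.
        txtContent = txtContent.replaceAll("\\s+", "");
        return txtContent;
    }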
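To make the effect of the two replaceAll calls concrete, here is a minimal sketch that runs them on a small input, next to a variant that keeps words apart. The class name TagStripper and both method names are made up for this illustration, and the second method is an alternative, not the code from this article.

```java
final class TagStripper {
    // Same idea as Code 1: drop the tags, then drop every whitespace character.
    static String stripAll(String html) {
        return html.replaceAll("</?[^>]+>", "").replaceAll("\\s+", "");
    }

    // Variant (an assumption, not the article's code): replace each tag with a
    // space, then collapse whitespace runs so that words stay separated.
    static String stripKeepingSpaces(String html) {
        return html.replaceAll("</?[^>]+>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripAll("<p>Hello world</p>"));
        // prints: Helloworld
        System.out.println(stripKeepingSpaces("<p>Hello <b>world</b></p>\n<p>Bye</p>"));
        // prints: Hello world Bye
    }
}
```

The variant has its own trade-off: a tag in the middle of a word, such as H<b>i</b>, comes out as "H i". Which behavior is preferable depends on the input you expect.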

3. Code 2: using regular expressions

    // Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;

    // Extract plain text from an HTML string
    public static String html2Text(String inputString) {
        String htmlStr = inputString; // string that still contains HTML tags
        String textStr = "";
        Pattern p_script;
        Matcher m_script;
        Pattern p_style;
        Matcher m_style;
        Pattern p_html;
        Matcher m_html;
        try {
            // Regex for <script ...> ... </script>, tolerating whitespace inside the tags
            String regEx_script = "<[\\s]*?script[^>]*?>[\\s\\S]*?<[\\s]*?\\/[\\s]*?script[\\s]*?>";
            // Regex for <style ...> ... </style>, tolerating whitespace inside the tags
            String regEx_style = "<[\\s]*?style[^>]*?>[\\s\\S]*?<[\\s]*?\\/[\\s]*?style[\\s]*?>";
            // Regex for any remaining HTML tag
            String regEx_html = "<[^>]+>";

            p_script = Pattern.compile(regEx_script, Pattern.CASE_INSENSITIVE);
            m_script = p_script.matcher(htmlStr);
            htmlStr = m_script.replaceAll(""); // filter out script blocks

            p_style = Pattern.compile(regEx_style, Pattern.CASE_INSENSITIVE);
            m_style = p_style.matcher(htmlStr);
            htmlStr = m_style.replaceAll(""); // filter out style blocks

            p_html = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE);
            m_html = p_html.matcher(htmlStr);
            htmlStr = m_html.replaceAll(""); // filter out the remaining HTML tags

            textStr = htmlStr;
        } catch (Exception e) {
            System.err.println("Html2Text: " + e.getMessage());
        }
        // Collapse runs of spaces into a single space
        textStr = textStr.replaceAll("[ ]+", " ");
        return textStr;
    }
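The same three-pass idea (script blocks, then style blocks, then all remaining tags) can be sketched with the patterns compiled once as constants, which avoids recompiling them on every call. This is only an illustration: the class name RegexHtmlText, the method toText and the simplified patterns are assumptions, not the article's exact code.

```java
import java.util.regex.Pattern;

final class RegexHtmlText {
    // Compiled once and reused; Pattern objects are immutable and thread-safe.
    private static final Pattern SCRIPT = Pattern.compile(
            "<\\s*script[^>]*>[\\s\\S]*?<\\s*/\\s*script\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE = Pattern.compile(
            "<\\s*style[^>]*>[\\s\\S]*?<\\s*/\\s*style\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern SPACES = Pattern.compile("[ \\t]+");

    static String toText(String html) {
        // Script and style go first so their bodies are removed with them.
        String s = SCRIPT.matcher(html).replaceAll("");
        s = STYLE.matcher(s).replaceAll("");
        s = TAG.matcher(s).replaceAll("");
        // Collapse runs of spaces and tabs, then trim both ends.
        return SPACES.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style>"
                + "<SCRIPT type=\"text/javascript\">var x = \"<b>\";</SCRIPT></head>"
                + "<body><p>Hello,   <b>world</b>!</p></body></html>";
        System.out.println(toText(html)); // prints: Hello, world!
    }
}
```

The order of the passes matters. If the generic tag pattern ran first, the JavaScript and CSS source between the script and style tags would be left behind as text.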

4. Code 3: using HTMLEditorKit.ParserCallback, a class that ships with Java

    package com.util;

    import java.io.*;
    import javax.swing.text.html.*;
    import javax.swing.text.html.parser.*;

    public class Html2Text extends HTMLEditorKit.ParserCallback {
        StringBuffer s;

        public Html2Text() {}

        public void parse(Reader in) throws IOException {
            s = new StringBuffer();
            ParserDelegator delegator = new ParserDelegator();
            // The third parameter is TRUE to ignore the charset directive
            delegator.parse(in, this, Boolean.TRUE);
        }

        // Called by the parser for each run of text between tags
        public void handleText(char[] text, int pos) {
            s.append(text);
        }

        public String getText() {
            return s.toString();
        }

        public static void main(String[] args) {
            // The HTML to convert; for an in-memory string use instead:
            // Reader in = new StringReader("string");
            try (FileReader in = new FileReader("java-new.html")) {
                Html2Text parser = new Html2Text();
                parser.parse(in);
                System.out.println(parser.getText());
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
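Because handleText only appends text, everything ends up on one line. As a hedged sketch of how the callback approach could be extended, the class below also overrides handleEndTag and handleSimpleTag to add a newline after each paragraph and at each <br>, and it parses an in-memory string through a StringReader. The class name ParagraphAwareText and the method extract are hypothetical; the exact whitespace the Swing parser hands to handleText may differ from what you expect, so treat the output as approximate.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class ParagraphAwareText extends HTMLEditorKit.ParserCallback {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void handleText(char[] data, int pos) {
        out.append(data); // text found between tags
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.P) {
            out.append('\n'); // end of a paragraph
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.BR) {
            out.append('\n'); // explicit line break
        }
    }

    // Parses an HTML string and returns the collected text.
    static String extract(String html) {
        ParagraphAwareText callback = new ParagraphAwareText();
        try (Reader in = new StringReader(html)) {
            // true: ignore any charset directive found in the document
            new ParserDelegator().parse(in, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "<html><body><p>First paragraph</p><p>Second<br>line</p></body></html>"));
    }
}
```

Compared with the regex versions, this leaves tokenizing to a real HTML parser, at the cost of depending on the Swing classes in the java.desktop module.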

That is everything I have to share on extracting plain text from HTML in Java. I hope it serves as a useful reference, and I hope you will keep supporting wulin.com.