sensitive-word is a high-performance sensitive word (profanity) filtering tool based on the DFA algorithm.
Online experience
If you run into any tricky problems, you can join the Technical Exchange Group.
sensitive-word-admin is the corresponding console application. Its features are still in early development; an MVP version is available.
Hello everyone, I am Lao Ma.
I had always wanted to build a simple and easy-to-use sensitive word tool, so I implemented this one and open-sourced it.
It is implemented with the DFA algorithm; the bundled sensitive word dictionary currently contains 60,000+ entries (distilled from a source file of 180,000+ entries).
The dictionary will be continuously optimized and extended, and the algorithm's performance will be further improved.
v0.24.0 started to support classifying and refining the sensitive words, but the workload is large, so omissions are inevitable.
PRs, GitHub issues, and chatting in the technical exchange group are all welcome!
- 60,000+ word dictionary, continuously optimized and updated
- Fluent API for elegant, concise usage
- DFA-based, 70,000+ QPS, transparent to the application
- Supports common operations such as checking, extracting, and masking sensitive words
- Supports common format conversions: full-width/half-width, upper/lower case, numeric variants, Traditional/Simplified Chinese, stylized English letters, ignoring repeated characters, etc.
- Supports sensitive word, email, number, URL, and IPv4 detection
- Supports custom replacement strategies
- Supports user-defined sensitive words and allowlists (whitelists)
- Supports dynamic, user-defined word data updates that take effect in real time
- Supports a tag interface plus built-in categories for sensitive words
- Supports skipping certain special characters to make matching more flexible
- Supports adding/removing individual denylist/allowlist entries without full re-initialization
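For intuition, the DFA-style matching at the heart of the tool can be sketched with a Trie in a few dozen lines. This is only an illustrative, self-contained sketch of the principle (longest match, no format conversions); the class and method names below are made up and are not the library's actual implementation.

```java
import java.util.*;

/** Minimal Trie-based matcher illustrating the DFA idea (illustrative only). */
public class TrieSketch {

    static class Node {
        Map<Character, Node> next = new HashMap<>();
        boolean end; // a sensitive word terminates at this node
    }

    private final Node root = new Node();

    public void addWord(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.next.computeIfAbsent(c, k -> new Node());
        }
        cur.end = true;
    }

    /** Walks the trie from every position, keeping the longest match. */
    public List<String> findAll(String text) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            Node cur = root;
            int lastEnd = -1;
            for (int j = i; j < text.length(); j++) {
                cur = cur.next.get(text.charAt(j));
                if (cur == null) {
                    break;
                }
                if (cur.end) {
                    lastEnd = j + 1; // remember the longest match so far
                }
            }
            if (lastEnd > 0) {
                result.add(text.substring(i, lastEnd));
                i = lastEnd - 1; // skip past the match
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TrieSketch trie = new TrieSketch();
        trie.addWord("五星红旗");
        trie.addWord("毛主席");
        System.out.println(trie.findAll("五星红旗迎风飘扬,毛主席的画像"));
    }
}
```

Because every character only advances one trie pointer, detection stays roughly linear in the text length regardless of dictionary size, which is what makes the high QPS figures above plausible.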
CHANGE_LOG.md
Sometimes a console for sensitive words makes configuration more flexible and convenient:
How to implement an out-of-the-box sensitive word console service?
A large number of sensitive word tag files have also been compiled, which makes working with sensitive words more convenient.
Both resources are covered in the article below:
v0.11.0 - New features of sensitive words and the corresponding tag files
Currently v0.24.0 has built-in word tags; upgrading to the latest version is recommended if you need them.
Open source is not easy. If this project helps you, you can buy Lao Ma a cup of milk tea.

JDK1.8+
Maven 3.x+
```xml
<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>sensitive-word</artifactId>
    <version>0.24.0</version>
</dependency>
```

SensitiveWordHelper is the utility class for sensitive words. The core methods are as follows:

Note: SensitiveWordHelper uses the default configuration. For flexible custom configuration, see the bootstrap class feature configuration.
| Method | Parameters | Return value | Description |
|---|---|---|---|
| contains(String) | the string to check | boolean | whether the string contains sensitive words |
| replace(String, ISensitiveWordReplace) | the string to process, plus a custom replacement strategy | String | returns the masked string |
| replace(String, char) | the string to process, plus the replacement character | String | returns the masked string |
| replace(String) | the string to process; sensitive words are replaced with `*` | String | returns the masked string |
| findAll(String) | the string to check | List of strings | returns all sensitive words in the string |
| findFirst(String) | the string to check | String | returns the first sensitive word in the string |
| findAll(String, IWordResultHandler) | the string to check, plus an IWordResultHandler result handler | List | returns all sensitive words in the string |
| findFirst(String, IWordResultHandler) | the string to check, plus an IWordResultHandler result handler | String | returns the first sensitive word in the string |
| tags(String) | the sensitive word | Set of tags | returns the tag list of the sensitive word |
```java
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";

Assert.assertTrue(SensitiveWordHelper.contains(text));
```

```java
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";

String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("五星红旗", word);
```

SensitiveWordHelper.findFirst(text) is equivalent to:

```java
String word = SensitiveWordHelper.findFirst(text, WordResultHandlers.word());
```

```java
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[五星红旗, 毛主席, 天安门]", wordList.toString());
```

Returning all sensitive words works much like SensitiveWordHelper.findFirst() and also supports specifying a result handler class. SensitiveWordHelper.findAll(text) is equivalent to:

```java
List<String> wordList = SensitiveWordHelper.findAll(text, WordResultHandlers.word());
```

WordResultHandlers.raw() retains the corresponding index information, and WordResultHandlers.wordTags() the category information:

```java
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";

// sensitive word tags are empty by default
List<WordTagsDto> wordList1 = SensitiveWordHelper.findAll(text, WordResultHandlers.wordTags());
Assert.assertEquals("[WordTagsDto{word='五星红旗', tags=[]}, WordTagsDto{word='毛主席', tags=[]}, WordTagsDto{word='天安门', tags=[]}]", wordList1.toString());
```

```java
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";

String result = SensitiveWordHelper.replace(text);
Assert.assertEquals("****迎风飘扬,***的画像屹立在***前。", result);
```

```java
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";

String result = SensitiveWordHelper.replace(text, '0');
Assert.assertEquals("0000迎风飘扬,000的画像屹立在000前。", result);
```

v0.2.0 supports this feature.
Scenario: sometimes we want different sensitive words to produce different replacements. For example, replace [Game] with [E-sports] and [Unemployment] with [Flexible Employment].
Admittedly, pre-processing the string with regular replacements also works, but the performance is mediocre.
Usage example:
```java
/**
 * Custom replacement strategy
 * @since 0.2.0
 */
@Test
public void defineReplaceTest() {
    final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";

    ISensitiveWordReplace replace = new MySensitiveWordReplace();
    String result = SensitiveWordHelper.replace(text, replace);

    Assert.assertEquals("国家旗帜迎风飘扬,教员的画像屹立在***前。", result);
}
```

Here MySensitiveWordReplace is our custom replacement strategy, implemented as follows:
```java
public class MyWordReplace implements IWordReplace {

    @Override
    public void replace(StringBuilder stringBuilder, final char[] rawChars, IWordResult wordResult, IWordContext wordContext) {
        String sensitiveWord = InnerWordCharUtils.getString(rawChars, wordResult);
        // per-word replacement strategies; the mapping could be read from a database, etc.
        if ("五星红旗".equals(sensitiveWord)) {
            stringBuilder.append("国家旗帜");
        } else if ("毛主席".equals(sensitiveWord)) {
            stringBuilder.append("教员");
        } else {
            // all other words are replaced with '*' by default
            int wordLength = wordResult.endIndex() - wordResult.startIndex();
            for (int i = 0; i < wordLength; i++) {
                stringBuilder.append('*');
            }
        }
    }

}
```

We define a fixed mapping for some of the words; everything else is replaced with `*` by default.
IWordResultHandler post-processes sensitive word results and can be customized by users.
See the WordResultHandlers utility class for the built-in implementations:
- word(): keeps only the sensitive word itself.
- raw(): keeps the sensitive word's related information, including its start and end indexes.
- wordTags(): keeps both the word and its corresponding tag information.
See SensitiveWordHelperTest for all test cases.
1) Basic examples

```java
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[五星红旗, 毛主席, 天安门]", wordList.toString());

List<String> wordList2 = SensitiveWordHelper.findAll(text, WordResultHandlers.word());
Assert.assertEquals("[五星红旗, 毛主席, 天安门]", wordList2.toString());

List<IWordResult> wordList3 = SensitiveWordHelper.findAll(text, WordResultHandlers.raw());
Assert.assertEquals("[WordResult{startIndex=0, endIndex=4}, WordResult{startIndex=9, endIndex=12}, WordResult{startIndex=18, endIndex=21}]", wordList3.toString());
```

We specify the tag information of the corresponding words in the dict_tag_test.txt file.
```java
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";

// sensitive word tags are empty by default
List<WordTagsDto> wordList1 = SensitiveWordHelper.findAll(text, WordResultHandlers.wordTags());
Assert.assertEquals("[WordTagsDto{word='五星红旗', tags=[]}, WordTagsDto{word='毛主席', tags=[]}, WordTagsDto{word='天安门', tags=[]}]", wordList1.toString());

List<WordTagsDto> wordList2 = SensitiveWordBs.newInstance()
        .wordTag(WordTags.file("dict_tag_test.txt"))
        .init()
        .findAll(text, WordResultHandlers.wordTags());
Assert.assertEquals("[WordTagsDto{word='五星红旗', tags=[政治, 国家]}, WordTagsDto{word='毛主席', tags=[政治, 伟人, 国家]}, WordTagsDto{word='天安门', tags=[政治, 国家, 地址]}]", wordList2.toString());
```

The features that follow mainly handle the various evasion tricks, to improve the hit rate of sensitive words as much as possible.
This is a long battle of offense and defense.
```java
final String text = "fuCK the bad words.";

String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("fuCK", word);
```

```java
final String text = "fuck the bad words.";

String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("fuck", word);
```

Conversion between common numeric forms is also implemented:
```java
final String text = "这个是我的微信:9⓿二肆⁹₈③⑸⒋➃㈤㊄";

List<String> wordList = SensitiveWordBs.newInstance().enableNumCheck(true).init().findAll(text);
Assert.assertEquals("[9⓿二肆⁹₈③⑸⒋➃㈤㊄]", wordList.toString());
```

```java
final String text = "我爱我的祖国和五星紅旗。";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[五星紅旗]", wordList.toString());
```

```java
final String text = "Ⓕⓤc⒦ the bad words";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[Ⓕⓤc⒦]", wordList.toString());
```

```java
final String text = "ⒻⒻⒻfⓤuⓤ⒰cⓒ⒦ the bad words";

List<String> wordList = SensitiveWordBs.newInstance()
        .ignoreRepeat(true)
        .init()
        .findAll(text);
Assert.assertEquals("[ⒻⒻⒻfⓤuⓤ⒰cⓒ⒦]", wordList.toString());
```

Personal information such as email addresses is not checked by default.

```java
final String text = "楼主好人,邮箱 [email protected]";

List<String> wordList = SensitiveWordBs.newInstance().enableEmailCheck(true).init().findAll(text);
Assert.assertEquals("[[email protected]]", wordList.toString());
```

Number detection is generally used to filter advertising such as mobile phone/QQ numbers, and is not enabled by default.
Since v0.2.1, the detected length can be configured via numCheckLen(int).
```java
final String text = "你懂得:12345678";

// 8 digits are detected by default
List<String> wordList = SensitiveWordBs.newInstance()
        .enableNumCheck(true)
        .init().findAll(text);
Assert.assertEquals("[12345678]", wordList.toString());

// specify the digit length to avoid false positives
List<String> wordList2 = SensitiveWordBs.newInstance()
        .enableNumCheck(true)
        .numCheckLen(9)
        .init()
        .findAll(text);
Assert.assertEquals("[]", wordList2.toString());
```

URL detection filters common URLs and is not enabled by default.
v0.18.0 makes URL detection stricter, reducing the false positive rate.
```java
final String text = "点击链接 https://www.baidu.com 查看答案";

final SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance().enableUrlCheck(true).init();

List<String> wordList = sensitiveWordBs.findAll(text);
Assert.assertEquals("[https://www.baidu.com]", wordList.toString());
Assert.assertEquals("点击链接 ********************* 查看答案", sensitiveWordBs.replace(text));
```

Supported since v0.17.0.
IPv4 detection prevents users from bypassing URL detection with raw IP addresses; it is not enabled by default.
```java
final String text = "个人网站,如果网址打不开可以访问 127.0.0.1。";

final SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance().enableIpv4Check(true).init();

List<String> wordList = sensitiveWordBs.findAll(text);
Assert.assertEquals("[127.0.0.1]", wordList.toString());
```

The features above all have default values, and sometimes the business needs to configure them flexibly.
So v0.0.14 opened up property configuration.
To make usage more elegant, configuration is defined uniformly through the fluent API.
Users can configure it via SensitiveWordBs as follows.
Note: after configuring, use the newly defined SensitiveWordBs object instead of the earlier utility methods; the utility methods always use the default configuration.
```java
SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .ignoreCase(true)
        .ignoreWidth(true)
        .ignoreNumStyle(true)
        .ignoreChineseStyle(true)
        .ignoreEnglishStyle(true)
        .ignoreRepeat(false)
        .enableNumCheck(false)
        .enableEmailCheck(false)
        .enableUrlCheck(false)
        .enableIpv4Check(false)
        .enableWordCheck(true)
        .numCheckLen(8)
        .wordTag(WordTags.none())
        .charIgnore(SensitiveWordCharIgnores.defaults())
        .wordResultCondition(WordResultConditions.alwaysTrue())
        .init();

final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";
Assert.assertTrue(wordBs.contains(text));
```

Each configuration option is described below:
| No. | Method | Description | Default |
|---|---|---|---|
| 1 | ignoreCase | ignore case | true |
| 2 | ignoreWidth | ignore full-width/half-width differences | true |
| 3 | ignoreNumStyle | ignore numeric style variants | true |
| 4 | ignoreChineseStyle | ignore Chinese style variants | true |
| 5 | ignoreEnglishStyle | ignore English style variants | true |
| 6 | ignoreRepeat | ignore repeated characters | false |
| 7 | enableNumCheck | whether to enable number detection | false |
| 8 | enableEmailCheck | whether to enable email detection | false |
| 9 | enableUrlCheck | whether to enable URL detection | false |
| 10 | enableIpv4Check | whether to enable IPv4 detection | false |
| 11 | enableWordCheck | whether to enable sensitive word detection | true |
| 12 | numCheckLen | digit length used by number detection | 8 |
| 13 | wordTag | the tags corresponding to words | none |
| 14 | charIgnore | characters to ignore | none |
| 15 | wordResultCondition | extra condition a matched word must satisfy, e.g. requiring full matches for English words | always true |
Supported since v0.16.1. Sometimes we need to free the memory, which can be done as follows (see also: about memory recycling issues):

```java
SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .init();

// later, after the corresponding data was removed, we want to release the memory
wordBs.destroy();
```

Usage scenario: after initialization, we want to add/remove individual words instead of re-initializing everything. This feature was built for that.
Supported version: v0.19.0
- addWord(word): adds sensitive words; supports a single word or a collection
- removeWord(word): removes sensitive words; supports a single word or a collection
```java
final String text = "测试一下新增敏感词,验证一下删除和新增对不对";

SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
        .wordAllow(WordAllows.empty())
        .wordDeny(WordDenys.empty())
        .init();

// initial state
Assert.assertEquals("[]", sensitiveWordBs.findAll(text).toString());

// add single words
sensitiveWordBs.addWord("测试");
sensitiveWordBs.addWord("新增");
Assert.assertEquals("[测试, 新增, 新增]", sensitiveWordBs.findAll(text).toString());

// remove single words
sensitiveWordBs.removeWord("新增");
Assert.assertEquals("[测试]", sensitiveWordBs.findAll(text).toString());
sensitiveWordBs.removeWord("测试");
Assert.assertEquals("[]", sensitiveWordBs.findAll(text).toString());

// add a collection
sensitiveWordBs.addWord(Arrays.asList("新增", "测试"));
Assert.assertEquals("[测试, 新增, 新增]", sensitiveWordBs.findAll(text).toString());

// remove a collection
sensitiveWordBs.removeWord(Arrays.asList("新增", "测试"));
Assert.assertEquals("[]", sensitiveWordBs.findAll(text).toString());

// add an array
sensitiveWordBs.addWord("新增", "测试");
Assert.assertEquals("[测试, 新增, 新增]", sensitiveWordBs.findAll(text).toString());

// remove an array
sensitiveWordBs.removeWord("新增", "测试");
Assert.assertEquals("[]", sensitiveWordBs.findAll(text).toString());
```

Usage scenario: after initialization, we want to add/remove individual allowlist entries instead of re-initializing everything. This feature was built for that.
Supported version: v0.21.0
- addWordAllow(word): adds allowlist entries; supports a single word or a collection
- removeWordAllow(word): removes allowlist entries; supports a single word or a collection
```java
final String text = "测试一下新增敏感词白名单,验证一下删除和新增对不对";

SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
        .wordAllow(WordAllows.empty())
        .wordDeny(new IWordDeny() {
            @Override
            public List<String> deny() {
                return Arrays.asList("测试", "新增");
            }
        })
        .init();

// initial state
Assert.assertEquals("[测试, 新增, 新增]", sensitiveWordBs.findAll(text).toString());

// add single entries
sensitiveWordBs.addWordAllow("测试");
sensitiveWordBs.addWordAllow("新增");
Assert.assertEquals("[]", sensitiveWordBs.findAll(text).toString());

// remove single entries
sensitiveWordBs.removeWordAllow("测试");
Assert.assertEquals("[测试]", sensitiveWordBs.findAll(text).toString());
sensitiveWordBs.removeWordAllow("新增");
Assert.assertEquals("[测试, 新增, 新增]", sensitiveWordBs.findAll(text).toString());

// add a collection
sensitiveWordBs.addWordAllow(Arrays.asList("新增", "测试"));
Assert.assertEquals("[]", sensitiveWordBs.findAll(text).toString());

// remove a collection
sensitiveWordBs.removeWordAllow(Arrays.asList("新增", "测试"));
Assert.assertEquals("[测试, 新增, 新增]", sensitiveWordBs.findAll(text).toString());

// add an array
sensitiveWordBs.addWordAllow("新增", "测试");
Assert.assertEquals("[]", sensitiveWordBs.findAll(text).toString());

// remove an array
sensitiveWordBs.removeWordAllow("新增", "测试");
Assert.assertEquals("[测试, 新增, 新增]", sensitiveWordBs.findAll(text).toString());
```

This method is deprecated. The incremental add/remove methods above are recommended, to avoid full reloads; the method is kept for compatibility.
Usage: calling sensitiveWordBs.init() rebuilds the sensitive word dictionary from IWordDeny + IWordAllow. Since initialization may take a while (on the order of seconds), init() is optimized so that the old dictionary keeps working until the rebuild completes, after which the new one takes over.
```java
@Component
public class SensitiveWordService {

    @Autowired
    private SensitiveWordBs sensitiveWordBs;

    /**
     * Refresh the word dictionary.
     *
     * Whenever the data in the database changes, first update the sensitive
     * word tables in the database, then call this method to make the change
     * take effect.
     *
     * Note: re-initialization does not affect use of the old dictionary.
     * Once initialization completes, the new dictionary takes over.
     */
    public void refresh() {
        // rebuild from the latest database data
        sensitiveWordBs.init();
    }

}
```

As shown above, when the database dictionary changes and needs to take effect, you can actively trigger re-initialization via sensitiveWordBs.init().
Everything else works as before, without restarting the app.
Supported version: v0.13.0
Sometimes we may want to further constrain matches. For example, although we define [av] as a sensitive word, we do not want [have] to be matched.
You can implement your own policy for the wordResultCondition interface.
The built-in WordResultConditions#alwaysTrue() always matches, while WordResultConditions#englishWordMatch() requires English words to match as whole words.
The WordResultConditions utility class provides the built-in matching policies:
| Implementation | Description | Supported version |
|---|---|---|
| alwaysTrue | always matches | |
| englishWordMatch | English whole-word matching | v0.13.0 |
| englishWordNumMatch | English word/number whole-word matching | v0.20.0 |
| wordTags | matches only words carrying specific tags, e.g. only the [advertising] tag | v0.23.0 |
| chains(IWordResultCondition ...conditions) | combines multiple conditions that must all be satisfied | v0.23.0 |
The original default behavior:

```java
final String text = "I have a nice day。";

List<String> wordList = SensitiveWordBs.newInstance()
        .wordDeny(new IWordDeny() {
            @Override
            public List<String> deny() {
                return Collections.singletonList("av");
            }
        })
        .wordResultCondition(WordResultConditions.alwaysTrue())
        .init()
        .findAll(text);
Assert.assertEquals("[av]", wordList.toString());
```

We can require that English words match only as whole words:
```java
final String text = "I have a nice day。";

List<String> wordList = SensitiveWordBs.newInstance()
        .wordDeny(new IWordDeny() {
            @Override
            public List<String> deny() {
                return Collections.singletonList("av");
            }
        })
        .wordResultCondition(WordResultConditions.englishWordMatch())
        .init()
        .findAll(text);
Assert.assertEquals("[]", wordList.toString());
```

Of course, more complex strategies can be implemented as needed.
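The idea behind English whole-word matching can be illustrated standalone: a match only counts when it is not flanked by other English letters. The following is a self-contained sketch of that check (our own class and method names, not the library's implementation):

```java
/** Illustrative whole-word check, independent of the library. */
public class EnglishWordMatchSketch {

    /** Returns true if text[start, end) is not flanked by letters. */
    static boolean isWholeWordMatch(String text, int start, int end) {
        boolean leftOk = start == 0 || !Character.isLetter(text.charAt(start - 1));
        boolean rightOk = end == text.length() || !Character.isLetter(text.charAt(end));
        return leftOk && rightOk;
    }

    public static void main(String[] args) {
        String text = "I have a nice day.";
        int idx = text.indexOf("av"); // the "av" inside "have"
        // false: the match is flanked by 'h' and 'e'
        System.out.println(isWholeWordMatch(text, idx, idx + 2));
        // true: "av" stands alone as a word
        System.out.println(isWholeWordMatch("an av clip", 3, 5));
    }
}
```

A condition like this runs once per candidate match, so it adds negligible cost on top of the trie walk.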
Supported version: v0.23.0
We can return only the sensitive words that carry a particular tag.
We specify two sensitive words: 商品 and AV.
MyWordTag is the sensitive word tag implementation we define:
```java
/**
 * Custom word tags
 * @since 0.23.0
 */
public class MyWordTag extends AbstractWordTag {

    private static Map<String, Set<String>> dataMap;

    static {
        dataMap = new HashMap<>();
        dataMap.put("商品", buildSet("广告", "中文"));
        dataMap.put("AV", buildSet("色情", "单词", "英文"));
    }

    private static Set<String> buildSet(String... tags) {
        Set<String> set = new HashSet<>();
        for (String tag : tags) {
            set.add(tag);
        }
        return set;
    }

    @Override
    protected Set<String> doGetTag(String word) {
        return dataMap.get(word);
    }

}
```

For example, we simulate two different instances, each focusing on a different word tag.
```java
// only care about pornography (色情)
SensitiveWordBs sensitiveWordBsYellow = SensitiveWordBs.newInstance()
        .wordDeny(new IWordDeny() {
            @Override
            public List<String> deny() {
                return Arrays.asList("商品", "AV");
            }
        })
        .wordAllow(WordAllows.empty())
        .wordTag(new MyWordTag())
        .wordResultCondition(WordResultConditions.wordTags(Arrays.asList("色情")))
        .init();

// only care about advertising (广告)
SensitiveWordBs sensitiveWordBsAd = SensitiveWordBs.newInstance()
        .wordDeny(new IWordDeny() {
            @Override
            public List<String> deny() {
                return Arrays.asList("商品", "AV");
            }
        })
        .wordAllow(WordAllows.empty())
        .wordTag(new MyWordTag())
        .wordResultCondition(WordResultConditions.wordTags(Arrays.asList("广告")))
        .init();

final String text = "这些 AV 商品什么价格?";

Assert.assertEquals("[AV]", sensitiveWordBsYellow.findAll(text).toString());
Assert.assertEquals("[商品]", sensitiveWordBsAd.findAll(text).toString());
```

Sensitive words are usually contiguous, for example [silly hat].
Then someone cleverly discovers that inserting a few characters in the middle, such as [silly!@#$hat], skips detection while the insult loses none of its punch.
So how do we handle scenarios like this?
We can specify a skip set of special characters and ignore these meaningless characters.
Supported since v0.11.0.
The character strategy behind charIgnore can be flexibly defined by users.
```java
final String text = "傻@冒,狗+东西";

// by default the special characters split the words, so nothing is recognized
List<String> wordList = SensitiveWordBs.newInstance().init().findAll(text);
Assert.assertEquals("[]", wordList.toString());

// specify the ignored-character strategy; you can also implement your own
List<String> wordList2 = SensitiveWordBs.newInstance()
        .charIgnore(SensitiveWordCharIgnores.specialChars())
        .init()
        .findAll(text);
Assert.assertEquals("[傻@冒, 狗+东西]", wordList2.toString());
```

Sometimes we want to attach category labels to sensitive words, such as politics-related, violence, etc.
That way, more features can be driven by the labels, such as handling only a certain category.
Supported version: v0.10.0
Main features supported since: v0.24.0
This is just an abstract interface; users can provide their own implementation, for example backed by database queries, file reads, or API calls.

```java
public interface IWordTag {

    /**
     * Query the tag set of a word.
     * @param word the (dirty) word
     * @return the tags
     */
    Set<String> getTag(String word);

}
```

To make most scenarios convenient, several strategies are implemented in the WordTags class:
| Method | Description | Remark |
|---|---|---|
| none() | empty implementation | v0.10.0 |
| file(String filePath) | specify a file path | v0.10.0 |
| file(String filePath, String wordSplit, String tagSplit) | specify a file path, plus the word and tag separators | v0.10.0 |
| map(final Map<String, Set<String>> wordTagMap) | initialize from a map | v0.24.0 |
| lines(Collection<String> lines) | a list of strings | v0.24.0 |
| lines(Collection<String> lines, String wordSplit, String tagSplit) | a list of strings, plus the word and tag separators | v0.24.0 |
| system() | built-in system files, integrating categories collected from the internet | v0.24.0 |
| defaults() | the default policy, currently system | v0.24.0 |
| chains(IWordTag... others) | chain method; lets users combine multiple policies | v0.24.0 |
The format of sensitive word tags: by default, each line is `敏感词 tag1,tag2`, meaning the tags of 敏感词 are tag1 and tag2.
For example:

```
五星红旗 政治,国家
```

This format is recommended for all file lines and string contents; if it does not fit your needs, just provide a custom implementation.
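The `word tag1,tag2` format is also easy to parse yourself if you need a custom source. A hypothetical parser sketch (the class and method names are ours, not the library's):

```java
import java.util.*;

/** Parses "word tag1,tag2" lines into a word -> tags map (illustrative sketch). */
public class WordTagParser {

    /** Default separators matching the documented format: space, then comma. */
    public static Map<String, Set<String>> parse(Collection<String> lines) {
        return parse(lines, " ", ",");
    }

    public static Map<String, Set<String>> parse(Collection<String> lines, String wordSplit, String tagSplit) {
        Map<String, Set<String>> map = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // split into "word" and "tag1,tag2" at the first separator
            String[] parts = trimmed.split(wordSplit, 2);
            Set<String> tags = new LinkedHashSet<>();
            if (parts.length > 1) {
                tags.addAll(Arrays.asList(parts[1].split(tagSplit)));
            }
            map.put(parts[0], tags);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> map = parse(Arrays.asList("五星红旗 政治,国家"));
        System.out.println(map.get("五星红旗"));
    }
}
```

A map built this way could feed the library's `WordTags.map(...)` entry point, though wiring that up is left to the reader.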
Starting with v0.24.0, the default word tag is WordTags.system().
Note: the current data is collected from the internet and has many omissions. Corrections and continued improvements are welcome.
```java
SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
        .wordTag(WordTags.system())
        .init();

Set<String> tagSet = sensitiveWordBs.tags("博彩");
Assert.assertEquals("[3]", tagSet.toString());
```

Here, to optimize the compressed size, the categories are represented by numbers.
The meanings of the numbers are as follows:

```
0 政治 (politics)
1 毒品 (drugs)
2 色情 (pornography)
3 赌博 (gambling)
4 违法 (other illegal content)
```
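Since the built-in tags are numeric codes, a small lookup table can translate them back into readable names. This helper is our own, not part of the library:

```java
import java.util.*;

/** Translates the built-in numeric tag codes into readable names (our own helper). */
public class TagNames {

    private static final Map<String, String> NAMES = new HashMap<>();

    static {
        NAMES.put("0", "政治"); // politics
        NAMES.put("1", "毒品"); // drugs
        NAMES.put("2", "色情"); // pornography
        NAMES.put("3", "赌博"); // gambling
        NAMES.put("4", "违法"); // other illegal content
    }

    public static Set<String> translate(Set<String> codes) {
        Set<String> result = new LinkedHashSet<>();
        for (String code : codes) {
            // unknown codes pass through unchanged
            result.add(NAMES.getOrDefault(code, code));
        }
        return result;
    }

    public static void main(String[] args) {
        // e.g. sensitiveWordBs.tags("博彩") returns [3]
        System.out.println(translate(new LinkedHashSet<>(Arrays.asList("3"))));
    }
}
```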
Here we take a file as an example to demonstrate usage.

```java
final String path = "~\test\resources\dict_tag_test.txt";

// the default method
IWordTag wordTag = WordTags.file(path);

SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
        .wordTag(wordTag)
        .init();

Set<String> tagSet = sensitiveWordBs.tags("零售");
Assert.assertEquals("[广告, 网络]", tagSet.toString());

// specifying the separators
IWordTag wordTag2 = WordTags.file(path, " ", ",");

SensitiveWordBs sensitiveWordBs2 = SensitiveWordBs.newInstance()
        .wordTag(wordTag2)
        .init();

Set<String> tagSet2 = sensitiveWordBs2.tags("零售");
Assert.assertEquals("[广告, 网络]", tagSet2.toString());
```

Where the custom content of dict_tag_test.txt is as follows:

```
零售 广告,网络
```
When extracting sensitive words, we can set the corresponding result handler to obtain the words' tag information as well.

```java
// custom test tags
IWordTag wordTag = WordTags.lines(Arrays.asList("天安门 政治,国家,地址"));

// initialize with it
SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
        .wordTag(wordTag)
        .init();

List<WordTagsDto> wordTagsDtoList1 = sensitiveWordBs.findAll("天安门", WordResultHandlers.wordTags());
Assert.assertEquals("[WordTagsDto{word='天安门', tags=[政治, 国家, 地址]}]", wordTagsDtoList1.toString());
```

We customize the tags of the keyword 天安门, then specify WordResultHandlers.wordTags() as the result handler of findAll, and we obtain the corresponding tag list along with each sensitive word.
Sometimes we want the loading of sensitive words to be dynamic, e.g. modified through a console and then effective in real time.
v0.0.13 supports this feature.
To implement this while staying compatible with previous functionality, we defined two interfaces.
The IWordDeny interface is as follows; you can provide your own implementation.
The returned list is treated as sensitive words.
```java
/**
 * Denied data: the returned contents are treated as sensitive words.
 * @author binbin.hou
 * @since 0.0.13
 */
public interface IWordDeny {

    /**
     * Get the result.
     * @return the result
     * @since 0.0.13
     */
    List<String> deny();

}
```

For example:
```java
public class MyWordDeny implements IWordDeny {

    @Override
    public List<String> deny() {
        return Arrays.asList("我的自定义敏感词");
    }

}
```

The IWordAllow interface is as follows; you can provide your own implementation.
The returned list is treated as not sensitive.
```java
/**
 * Allowed data: the returned contents are not treated as sensitive words.
 * @author binbin.hou
 * @since 0.0.13
 */
public interface IWordAllow {

    /**
     * Get the result.
     * @return the result
     * @since 0.0.13
     */
    List<String> allow();

}
```

For example:
```java
public class MyWordAllow implements IWordAllow {

    @Override
    public List<String> allow() {
        return Arrays.asList("五星红旗");
    }

}
```

After customizing the interfaces, they of course need to be registered to take effect.
To make usage more elegant, we designed the bootstrap class SensitiveWordBs.
You specify sensitive words through wordDeny(), non-sensitive words through wordAllow(), and initialize the sensitive word dictionary through init().
```java
SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .wordDeny(WordDenys.defaults())
        .wordAllow(WordAllows.defaults())
        .init();

final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";
Assert.assertTrue(wordBs.contains(text));
```

Note: init() builds the sensitive word DFA and is time-consuming. It is generally recommended to initialize once at application startup, not repeatedly!
We can test the custom implementations as follows:

```java
String text = "这是一个测试,我的自定义敏感词。";

SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .wordDeny(new MyWordDeny())
        .wordAllow(new MyWordAllow())
        .init();
Assert.assertEquals("[我的自定义敏感词]", wordBs.findAll(text).toString());
```

Here 我的自定义敏感词 is the only sensitive word, and 测试 is not a sensitive word.
Of course, these are entirely our own custom implementations. In general, it is recommended to combine the system defaults with custom configuration, using the following methods:
- WordDenys.chains() merges multiple implementations into a single IWordDeny.
- WordAllows.chains() merges multiple implementations into a single IWordAllow.
Example:
```java
String text = "这是一个测试。我的自定义敏感词。";

IWordDeny wordDeny = WordDenys.chains(WordDenys.defaults(), new MyWordDeny());
IWordAllow wordAllow = WordAllows.chains(WordAllows.defaults(), new MyWordAllow());

SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .wordDeny(wordDeny)
        .wordAllow(wordAllow)
        .init();
Assert.assertEquals("[我的自定义敏感词]", wordBs.findAll(text).toString());
```

This uses the system default configuration and the custom configuration at the same time.
Note: we initialized a new wordBs, so use that new wordBs for the checks, not the earlier SensitiveWordHelper utility methods; the utility methods always use the default configuration!
In actual use, you can, for example, modify the configuration on a console page and have it take effect in real time.
The data is stored in a database. Below is pseudo-code; you can also refer to SpringSensitiveWordConfig.java.
Requires version v0.0.15 or above.
The simplified pseudo-code is as follows; the data source is a database.
MyDdWordAllow and MyDdWordDeny are custom implementation classes backed by the database.
```java
@Configuration
public class SpringSensitiveWordConfig {

    @Autowired
    private MyDdWordAllow myDdWordAllow;

    @Autowired
    private MyDdWordDeny myDdWordDeny;

    /**
     * Initialize the bootstrap class.
     * @return the bootstrap class
     * @since 1.0.0
     */
    @Bean
    public SensitiveWordBs sensitiveWordBs() {
        SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
                .wordAllow(WordAllows.chains(WordAllows.defaults(), myDdWordAllow))
                .wordDeny(myDdWordDeny)
                // various other configuration options
                .init();
        return sensitiveWordBs;
    }

}
```

Initializing the sensitive word dictionary is time-consuming, so it is recommended to do the init once at program startup.
Since v0.6.0, a corresponding benchmark test has been added: BenchmarkTimesTest.
The test environment is an ordinary laptop:

```
CPU:    12th Gen Intel(R) Core(TM) i7-1260P, 2.10 GHz
RAM:    16.0 GB (15.7 GB usable)
System: 64-bit OS, x64 processor
```

Note: results vary between environments, but the ratio is basically stable.
Test data: a string of 100+ characters, looped 100,000 times.
| No. | Scenario | Time | Remark |
|---|---|---|---|
| 1 | sensitive word check only, no format conversion | 1470 ms, about 72,000 QPS | configure this way when pursuing extreme performance |
| 2 | sensitive word check with all format conversions enabled | 2744 ms, about 37,000 QPS | fits most scenarios |
- Remove single-Chinese-character sensitive words; Chinese should treat phrases as the matching unit, to reduce the false positive rate
- Support editing individual sensitive words (remove/add/edit)?
- Sensitive word tag interface support
- Tag support when processing sensitive words
- Memory usage comparison and optimization of wordData
- Let users specify custom phrases and obtain combinations of the specified phrases, for more flexibility
- FormatCombine/CheckCombine/AllowDenyCombine combination strategies, allowing user customization
- Optimize the word check strategy: unified traversal + conversion
- Add ThreadLocal and other performance optimizations
- sensitive-word-admin sensitive word console v1.2.0 open sourced
- How does the sensitive-word-admin v1.3.0 release support distributed deployment?
- 01 - Getting started with the open source sensitive word tool
- 02 - How to implement a sensitive word tool? Clarifying the idea behind a banned-words implementation
- 03 - StopWord support: stop word optimization and special symbols
- 04 - Slimming down the sensitive word dictionary
- 05 - A detailed explanation of the DFA algorithm for sensitive words (Trie tree algorithm)
- 06 - How can sensitive (dirty) words ignore meaningless characters? Achieving a better filtering effect
- v0.10.0 - Preliminary support for dirty word category tags
- v0.11.0 - New features: ignore meaningless characters, word tag dictionary
- v0.12.0 - Sensitive/dirty word tagging ability further enhanced
- v0.13.0 - Feature release: supports full-word matching of English words
- v0.16.1 - New feature: dictionary memory resource release
- v0.19.0 - New feature: sensitive words can be edited individually without repeated initialization
- v0.20.0 - New feature: numbers match in full, not partially
- v0.21.0 - New feature: whitelists support individual editing, fixing the case where whitelists contain blacklist entries
- pinyin: Chinese characters to Pinyin
- pinyin2hanzi: Pinyin to Chinese characters
- segment: high-performance Chinese word segmentation
- opencc4j: Traditional/Simplified Chinese conversion
- nlp-hanzi-similar: Chinese character similarity
- word-checker: spelling detection
- sensitive-word: sensitive word filtering