sensitive-word is a high-performance sensitive word (profanity) filtering tool based on the DFA algorithm.
Online experience
If you run into any tricky problems, you can join the Technical Exchange Group.
sensitive-word-admin is the corresponding console application. Its features are still in early development; an MVP version is available.
Hello everyone, I am Lao Ma.
I had always wanted to build a simple and easy-to-use sensitive word tool, so I implemented this one and open-sourced it.
It is implemented with the DFA algorithm; the bundled sensitive word dictionary currently contains 60,000+ entries (distilled from a source file of 180,000+ entries).
The dictionary will be continuously optimized and extended, and the algorithm's performance will be further improved.
v0.24.0 started to support classifying and refining the sensitive words, but the workload is large, so omissions are inevitable.
PRs, GitHub issues, and chatting in the technical exchange group are all welcome!
- 60,000+ word dictionary, continuously optimized and updated
- Fluent API for elegant, concise usage
- DFA-based, 70,000+ QPS, transparent to the application
- Supports common operations such as checking, extracting, and masking sensitive words
- Supports common format conversions: full-width/half-width, upper/lower case, numeric variants, Traditional/Simplified Chinese, stylized English letters, ignoring repeated characters, etc.
- Supports sensitive word, email, number, URL, and IPv4 detection
- Supports custom replacement strategies
- Supports user-defined sensitive words and allowlists (whitelists)
- Supports dynamic, user-defined word data updates that take effect in real time
- Supports a tag interface plus built-in categories for sensitive words
- Supports skipping certain special characters to make matching more flexible
- Supports adding/removing individual denylist/allowlist entries without full re-initialization
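For intuition, the DFA-style matching at the heart of the tool can be sketched with a Trie in a few dozen lines. This is only an illustrative, self-contained sketch of the principle (longest match, no format conversions); the class and method names below are made up and are not the library's actual implementation.

```java
import java.util.*;

/** Minimal Trie-based matcher illustrating the DFA idea (illustrative only). */
public class TrieSketch {

    static class Node {
        Map<Character, Node> next = new HashMap<>();
        boolean end; // a sensitive word terminates at this node
    }

    private final Node root = new Node();

    public void addWord(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.next.computeIfAbsent(c, k -> new Node());
        }
        cur.end = true;
    }

    /** Walks the trie from every position, keeping the longest match. */
    public List<String> findAll(String text) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            Node cur = root;
            int lastEnd = -1;
            for (int j = i; j < text.length(); j++) {
                cur = cur.next.get(text.charAt(j));
                if (cur == null) {
                    break;
                }
                if (cur.end) {
                    lastEnd = j + 1; // remember the longest match so far
                }
            }
            if (lastEnd > 0) {
                result.add(text.substring(i, lastEnd));
                i = lastEnd - 1; // skip past the match
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TrieSketch trie = new TrieSketch();
        trie.addWord("五星红旗");
        trie.addWord("毛主席");
        System.out.println(trie.findAll("五星红旗迎风飘扬,毛主席的画像"));
    }
}
```

Because every character only advances one trie pointer, detection stays roughly linear in the text length regardless of dictionary size, which is what makes the high QPS figures above plausible.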
CHANGE_LOG.md
Sometimes a console for sensitive words makes configuration more flexible and convenient:
How to implement an out-of-the-box sensitive word console service?
A large number of sensitive word tag files have also been compiled, which makes working with sensitive words more convenient.
Both resources are covered in the article below:
v0.11.0 - New features of sensitive words and the corresponding tag files
Currently v0.24.0 has built-in word tags; upgrading to the latest version is recommended if you need them.
Open source is not easy. If this project helps you, you can buy Lao Ma a cup of milk tea.

JDK1.8+
Maven 3.x+
```xml
<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>sensitive-word</artifactId>
    <version>0.24.0</version>
</dependency>
```

SensitiveWordHelper is the utility class for sensitive words. The core methods are as follows:

Note: SensitiveWordHelper uses the default configuration. For flexible custom configuration, see the bootstrap class feature configuration.
| Method | Parameters | Return value | Description |
|---|---|---|---|
| contains(String) | the string to check | boolean | whether the string contains sensitive words |
| replace(String, ISensitiveWordReplace) | the string to process, plus a custom replacement strategy | String | returns the masked string |
| replace(String, char) | the string to process, plus the replacement character | String | returns the masked string |
| replace(String) | the string to process; sensitive words are replaced with `*` | String | returns the masked string |
| findAll(String) | the string to check | List of strings | returns all sensitive words in the string |
| findFirst(String) | the string to check | String | returns the first sensitive word in the string |
| findAll(String, IWordResultHandler) | the string to check, plus an IWordResultHandler result handler | List | returns all sensitive words in the string |
| findFirst(String, IWordResultHandler) | the string to check, plus an IWordResultHandler result handler | String | returns the first sensitive word in the string |
| tags(String) | the sensitive word | Set of tags | returns the tag list of the sensitive word |
```java
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";

Assert.assertTrue(SensitiveWordHelper.contains(text));
```

```java
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";

String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("五星红旗", word);
```

SensitiveWordHelper.findFirst(text) is equivalent to:

```java
String word = SensitiveWordHelper.findFirst(text, WordResultHandlers.word());
```

```java
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[五星红旗, 毛主席, 天安门]", wordList.toString());
```

Returning all sensitive words works much like SensitiveWordHelper.findFirst() and also supports specifying a result handler class. SensitiveWordHelper.findAll(text) is equivalent to:

```java
List<String> wordList = SensitiveWordHelper.findAll(text, WordResultHandlers.word());
```

WordResultHandlers.raw() retains the corresponding index information, and WordResultHandlers.wordTags() the category information:

```java
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";

// sensitive word tags are empty by default
List<WordTagsDto> wordList1 = SensitiveWordHelper.findAll(text, WordResultHandlers.wordTags());
Assert.assertEquals("[WordTagsDto{word='五星红旗', tags=[]}, WordTagsDto{word='毛主席', tags=[]}, WordTagsDto{word='天安门', tags=[]}]", wordList1.toString());
```

```java
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";

String result = SensitiveWordHelper.replace(text);
Assert.assertEquals("****迎风飘扬,***的画像屹立在***前。", result);
```

```java
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";

String result = SensitiveWordHelper.replace(text, '0');
Assert.assertEquals("0000迎风飘扬,000的画像屹立在000前。", result);
```

v0.2.0 supports this feature.
Scenario: sometimes we want different sensitive words to produce different replacements. For example, replace [Game] with [E-sports] and [Unemployment] with [Flexible Employment].
Admittedly, pre-processing the string with regular replacements also works, but the performance is mediocre.
Usage example:
```java
/**
 * Custom replacement strategy
 * @since 0.2.0
 */
@Test
public void defineReplaceTest() {
    final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";

    ISensitiveWordReplace replace = new MySensitiveWordReplace();
    String result = SensitiveWordHelper.replace(text, replace);

    Assert.assertEquals("国家旗帜迎风飘扬,教员的画像屹立在***前。", result);
}
```

Here MySensitiveWordReplace is our custom replacement strategy, implemented as follows:
```java
public class MyWordReplace implements IWordReplace {

    @Override
    public void replace(StringBuilder stringBuilder, final char[] rawChars, IWordResult wordResult, IWordContext wordContext) {
        String sensitiveWord = InnerWordCharUtils.getString(rawChars, wordResult);
        // per-word replacement strategies; the mapping could be read from a database, etc.
        if ("五星红旗".equals(sensitiveWord)) {
            stringBuilder.append("国家旗帜");
        } else if ("毛主席".equals(sensitiveWord)) {
            stringBuilder.append("教员");
        } else {
            // all other words are replaced with '*' by default
            int wordLength = wordResult.endIndex() - wordResult.startIndex();
            for (int i = 0; i < wordLength; i++) {
                stringBuilder.append('*');
            }
        }
    }

}
```

We define a fixed mapping for some of the words; everything else is replaced with `*` by default.
IWordResultHandler post-processes sensitive word results and can be customized by users.
See the WordResultHandlers utility class for the built-in implementations:
- word(): keeps only the sensitive word itself.
- raw(): keeps the sensitive word's related information, including its start and end indexes.
- wordTags(): keeps both the word and its corresponding tag information.
See SensitiveWordHelperTest for all test cases.
1) Basic examples

```java
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[五星红旗, 毛主席, 天安门]", wordList.toString());

List<String> wordList2 = SensitiveWordHelper.findAll(text, WordResultHandlers.word());
Assert.assertEquals("[五星红旗, 毛主席, 天安门]", wordList2.toString());

List<IWordResult> wordList3 = SensitiveWordHelper.findAll(text, WordResultHandlers.raw());
Assert.assertEquals("[WordResult{startIndex=0, endIndex=4}, WordResult{startIndex=9, endIndex=12}, WordResult{startIndex=18, endIndex=21}]", wordList3.toString());
```

We specify the tag information of the corresponding words in the dict_tag_test.txt file.
```java
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";

// sensitive word tags are empty by default
List<WordTagsDto> wordList1 = SensitiveWordHelper.findAll(text, WordResultHandlers.wordTags());
Assert.assertEquals("[WordTagsDto{word='五星红旗', tags=[]}, WordTagsDto{word='毛主席', tags=[]}, WordTagsDto{word='天安门', tags=[]}]", wordList1.toString());

List<WordTagsDto> wordList2 = SensitiveWordBs.newInstance()
        .wordTag(WordTags.file("dict_tag_test.txt"))
        .init()
        .findAll(text, WordResultHandlers.wordTags());
Assert.assertEquals("[WordTagsDto{word='五星红旗', tags=[政治, 国家]}, WordTagsDto{word='毛主席', tags=[政治, 伟人, 国家]}, WordTagsDto{word='天安门', tags=[政治, 国家, 地址]}]", wordList2.toString());
```

The features that follow mainly handle the various evasion tricks, to improve the hit rate of sensitive words as much as possible.
This is a long battle of offense and defense.
```java
final String text = "fuCK the bad words.";

String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("fuCK", word);
```

```java
final String text = "fuck the bad words.";

String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("fuck", word);
```

Conversion between common numeric forms is also implemented:
```java
final String text = "这个是我的微信:9⓿二肆⁹₈③⑸⒋➃㈤㊄";

List<String> wordList = SensitiveWordBs.newInstance().enableNumCheck(true).init().findAll(text);
Assert.assertEquals("[9⓿二肆⁹₈③⑸⒋➃㈤㊄]", wordList.toString());
```

```java
final String text = "我爱我的祖国和五星紅旗。";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[五星紅旗]", wordList.toString());
```

```java
final String text = "Ⓕⓤc⒦ the bad words";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[Ⓕⓤc⒦]", wordList.toString());
```

```java
final String text = "ⒻⒻⒻfⓤuⓤ⒰cⓒ⒦ the bad words";

List<String> wordList = SensitiveWordBs.newInstance()
        .ignoreRepeat(true)
        .init()
        .findAll(text);
Assert.assertEquals("[ⒻⒻⒻfⓤuⓤ⒰cⓒ⒦]", wordList.toString());
```

Personal information such as email addresses is not checked by default.

```java
final String text = "楼主好人,邮箱 [email protected]";

List<String> wordList = SensitiveWordBs.newInstance().enableEmailCheck(true).init().findAll(text);
Assert.assertEquals("[[email protected]]", wordList.toString());
```

Number detection is generally used to filter advertising such as mobile phone/QQ numbers, and is not enabled by default.
Since v0.2.1, the detected length can be configured via numCheckLen(int).
```java
final String text = "你懂得:12345678";

// 8 digits are detected by default
List<String> wordList = SensitiveWordBs.newInstance()
        .enableNumCheck(true)
        .init().findAll(text);
Assert.assertEquals("[12345678]", wordList.toString());

// specify the digit length to avoid false positives
List<String> wordList2 = SensitiveWordBs.newInstance()
        .enableNumCheck(true)
        .numCheckLen(9)
        .init()
        .findAll(text);
Assert.assertEquals("[]", wordList2.toString());
```

URL detection filters common URLs and is not enabled by default.
v0.18.0 makes URL detection stricter, reducing the false positive rate.
```java
final String text = "点击链接 https://www.baidu.com 查看答案";

final SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance().enableUrlCheck(true).init();

List<String> wordList = sensitiveWordBs.findAll(text);
Assert.assertEquals("[https://www.baidu.com]", wordList.toString());
Assert.assertEquals("点击链接 ********************* 查看答案", sensitiveWordBs.replace(text));
```

Supported since v0.17.0.
IPv4 detection prevents users from bypassing URL detection with raw IP addresses; it is not enabled by default.
```java
final String text = "个人网站,如果网址打不开可以访问 127.0.0.1。";

final SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance().enableIpv4Check(true).init();

List<String> wordList = sensitiveWordBs.findAll(text);
Assert.assertEquals("[127.0.0.1]", wordList.toString());
```

The features above all have default values, and sometimes the business needs to configure them flexibly.
So v0.0.14 opened up property configuration.
To make usage more elegant, configuration is defined uniformly through the fluent API.
Users can configure it via SensitiveWordBs as follows.
Note: after configuring, use the newly defined SensitiveWordBs object instead of the earlier utility methods; the utility methods always use the default configuration.
```java
SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .ignoreCase(true)
        .ignoreWidth(true)
        .ignoreNumStyle(true)
        .ignoreChineseStyle(true)
        .ignoreEnglishStyle(true)
        .ignoreRepeat(false)
        .enableNumCheck(false)
        .enableEmailCheck(false)
        .enableUrlCheck(false)
        .enableIpv4Check(false)
        .enableWordCheck(true)
        .numCheckLen(8)
        .wordTag(WordTags.none())
        .charIgnore(SensitiveWordCharIgnores.defaults())
        .wordResultCondition(WordResultConditions.alwaysTrue())
        .init();

final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";
Assert.assertTrue(wordBs.contains(text));
```

Each configuration option is described below:
| No. | Method | Description | Default |
|---|---|---|---|
| 1 | ignoreCase | ignore case | true |
| 2 | ignoreWidth | ignore full-width/half-width differences | true |
| 3 | ignoreNumStyle | ignore numeric style variants | true |
| 4 | ignoreChineseStyle | ignore Chinese style variants | true |
| 5 | ignoreEnglishStyle | ignore English style variants | true |
| 6 | ignoreRepeat | ignore repeated characters | false |
| 7 | enableNumCheck | whether to enable number detection | false |
| 8 | enableEmailCheck | whether to enable email detection | false |
| 9 | enableUrlCheck | whether to enable URL detection | false |
| 10 | enableIpv4Check | whether to enable IPv4 detection | false |
| 11 | enableWordCheck | whether to enable sensitive word detection | true |
| 12 | numCheckLen | digit length used by number detection | 8 |
| 13 | wordTag | the tags corresponding to words | none |
| 14 | charIgnore | characters to ignore | none |
| 15 | wordResultCondition | extra condition a matched word must satisfy, e.g. requiring full matches for English words | always true |
Supported since v0.16.1. Sometimes we need to free the memory, which can be done as follows (see also: about memory recycling issues):

```java
SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .init();

// later, after the corresponding data was removed, we want to release the memory
wordBs.destroy();
```

Usage scenario: after initialization, we want to add/remove individual words instead of re-initializing everything. This feature was built for that.
Supported version: v0.19.0
- addWord(word): adds sensitive words; supports a single word or a collection
- removeWord(word): removes sensitive words; supports a single word or a collection
```java
final String text = "测试一下新增敏感词,验证一下删除和新增对不对";

SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
        .wordAllow(WordAllows.empty())
        .wordDeny(WordDenys.empty())
        .init();

// initial state
Assert.assertEquals("[]", sensitiveWordBs.findAll(text).toString());

// add single words
sensitiveWordBs.addWord("测试");
sensitiveWordBs.addWord("新增");
Assert.assertEquals("[测试, 新增, 新增]", sensitiveWordBs.findAll(text).toString());

// remove single words
sensitiveWordBs.removeWord("新增");
Assert.assertEquals("[测试]", sensitiveWordBs.findAll(text).toString());
sensitiveWordBs.removeWord("测试");
Assert.assertEquals("[]", sensitiveWordBs.findAll(text).toString());

// add a collection
sensitiveWordBs.addWord(Arrays.asList("新增", "测试"));
Assert.assertEquals("[测试, 新增, 新增]", sensitiveWordBs.findAll(text).toString());

// remove a collection
sensitiveWordBs.removeWord(Arrays.asList("新增", "测试"));
Assert.assertEquals("[]", sensitiveWordBs.findAll(text).toString());

// add an array
sensitiveWordBs.addWord("新增", "测试");
Assert.assertEquals("[测试, 新增, 新增]", sensitiveWordBs.findAll(text).toString());

// remove an array
sensitiveWordBs.removeWord("新增", "测试");
Assert.assertEquals("[]", sensitiveWordBs.findAll(text).toString());
```

Usage scenario: after initialization, we want to add/remove individual allowlist entries instead of re-initializing everything. This feature was built for that.
Supported version: v0.21.0
- addWordAllow(word): adds allowlist entries; supports a single word or a collection
- removeWordAllow(word): removes allowlist entries; supports a single word or a collection
```java
final String text = "测试一下新增敏感词白名单,验证一下删除和新增对不对";

SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
        .wordAllow(WordAllows.empty())
        .wordDeny(new IWordDeny() {
            @Override
            public List<String> deny() {
                return Arrays.asList("测试", "新增");
            }
        })
        .init();

// initial state
Assert.assertEquals("[测试, 新增, 新增]", sensitiveWordBs.findAll(text).toString());

// add single entries
sensitiveWordBs.addWordAllow("测试");
sensitiveWordBs.addWordAllow("新增");
Assert.assertEquals("[]", sensitiveWordBs.findAll(text).toString());

// remove single entries
sensitiveWordBs.removeWordAllow("测试");
Assert.assertEquals("[测试]", sensitiveWordBs.findAll(text).toString());
sensitiveWordBs.removeWordAllow("新增");
Assert.assertEquals("[测试, 新增, 新增]", sensitiveWordBs.findAll(text).toString());

// add a collection
sensitiveWordBs.addWordAllow(Arrays.asList("新增", "测试"));
Assert.assertEquals("[]", sensitiveWordBs.findAll(text).toString());

// remove a collection
sensitiveWordBs.removeWordAllow(Arrays.asList("新增", "测试"));
Assert.assertEquals("[测试, 新增, 新增]", sensitiveWordBs.findAll(text).toString());

// add an array
sensitiveWordBs.addWordAllow("新增", "测试");
Assert.assertEquals("[]", sensitiveWordBs.findAll(text).toString());

// remove an array
sensitiveWordBs.removeWordAllow("新增", "测试");
Assert.assertEquals("[测试, 新增, 新增]", sensitiveWordBs.findAll(text).toString());
```

This method is deprecated. The incremental add/remove methods above are recommended, to avoid full reloads; the method is kept for compatibility.
Usage: calling sensitiveWordBs.init() rebuilds the sensitive word dictionary from IWordDeny + IWordAllow. Since initialization may take a while (on the order of seconds), init() is optimized so that the old dictionary keeps working until the rebuild completes, after which the new one takes over.
```java
@Component
public class SensitiveWordService {

    @Autowired
    private SensitiveWordBs sensitiveWordBs;

    /**
     * Refresh the word dictionary.
     *
     * Whenever the data in the database changes, first update the sensitive
     * word tables in the database, then call this method to make the change
     * take effect.
     *
     * Note: re-initialization does not affect use of the old dictionary.
     * Once initialization completes, the new dictionary takes over.
     */
    public void refresh() {
        // rebuild from the latest database data
        sensitiveWordBs.init();
    }

}
```

As shown above, when the database dictionary changes and needs to take effect, you can actively trigger re-initialization via sensitiveWordBs.init().
Everything else works as before, without restarting the app.
Supported version: v0.13.0
Sometimes we may want to further constrain matches. For example, although we define [av] as a sensitive word, we do not want [have] to be matched.
You can implement your own policy for the wordResultCondition interface.
The built-in WordResultConditions#alwaysTrue() always matches, while WordResultConditions#englishWordMatch() requires English words to match as whole words.
The WordResultConditions utility class provides the built-in matching policies:
| Implementation | Description | Supported version |
|---|---|---|
| alwaysTrue | always matches | |
| englishWordMatch | English whole-word matching | v0.13.0 |
| englishWordNumMatch | English word/number whole-word matching | v0.20.0 |
| wordTags | matches only words carrying specific tags, e.g. only the [advertising] tag | v0.23.0 |
| chains(IWordResultCondition ...conditions) | combines multiple conditions that must all be satisfied | v0.23.0 |
The original default behavior:

```java
final String text = "I have a nice day。";

List<String> wordList = SensitiveWordBs.newInstance()
        .wordDeny(new IWordDeny() {
            @Override
            public List<String> deny() {
                return Collections.singletonList("av");
            }
        })
        .wordResultCondition(WordResultConditions.alwaysTrue())
        .init()
        .findAll(text);
Assert.assertEquals("[av]", wordList.toString());
```

We can require that English words match only as whole words:
```java
final String text = "I have a nice day。";

List<String> wordList = SensitiveWordBs.newInstance()
        .wordDeny(new IWordDeny() {
            @Override
            public List<String> deny() {
                return Collections.singletonList("av");
            }
        })
        .wordResultCondition(WordResultConditions.englishWordMatch())
        .init()
        .findAll(text);
Assert.assertEquals("[]", wordList.toString());
```

Of course, more complex strategies can be implemented as needed.
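The idea behind English whole-word matching can be illustrated standalone: a match only counts when it is not flanked by other English letters. The following is a self-contained sketch of that check (our own class and method names, not the library's implementation):

```java
/** Illustrative whole-word check, independent of the library. */
public class EnglishWordMatchSketch {

    /** Returns true if text[start, end) is not flanked by letters. */
    static boolean isWholeWordMatch(String text, int start, int end) {
        boolean leftOk = start == 0 || !Character.isLetter(text.charAt(start - 1));
        boolean rightOk = end == text.length() || !Character.isLetter(text.charAt(end));
        return leftOk && rightOk;
    }

    public static void main(String[] args) {
        String text = "I have a nice day.";
        int idx = text.indexOf("av"); // the "av" inside "have"
        // false: the match is flanked by 'h' and 'e'
        System.out.println(isWholeWordMatch(text, idx, idx + 2));
        // true: "av" stands alone as a word
        System.out.println(isWholeWordMatch("an av clip", 3, 5));
    }
}
```

A condition like this runs once per candidate match, so it adds negligible cost on top of the trie walk.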
Supported version: v0.23.0
We can return only the sensitive words that carry a particular tag.
We specify two sensitive words: 商品 and AV.
MyWordTag is the sensitive word tag implementation we define:
```java
/**
 * Custom word tags
 * @since 0.23.0
 */
public class MyWordTag extends AbstractWordTag {

    private static Map<String, Set<String>> dataMap;

    static {
        dataMap = new HashMap<>();
        dataMap.put("商品", buildSet("广告", "中文"));
        dataMap.put("AV", buildSet("色情", "单词", "英文"));
    }

    private static Set<String> buildSet(String... tags) {
        Set<String> set = new HashSet<>();
        for (String tag : tags) {
            set.add(tag);
        }
        return set;
    }

    @Override
    protected Set<String> doGetTag(String word) {
        return dataMap.get(word);
    }

}
```

For example, we simulate two different instances, each focusing on a different word tag.
```java
// only care about pornography (色情)
SensitiveWordBs sensitiveWordBsYellow = SensitiveWordBs.newInstance()
        .wordDeny(new IWordDeny() {
            @Override
            public List<String> deny() {
                return Arrays.asList("商品", "AV");
            }
        })
        .wordAllow(WordAllows.empty())
        .wordTag(new MyWordTag())
        .wordResultCondition(WordResultConditions.wordTags(Arrays.asList("色情")))
        .init();

// only care about advertising (广告)
SensitiveWordBs sensitiveWordBsAd = SensitiveWordBs.newInstance()
        .wordDeny(new IWordDeny() {
            @Override
            public List<String> deny() {
                return Arrays.asList("商品", "AV");
            }
        })
        .wordAllow(WordAllows.empty())
        .wordTag(new MyWordTag())
        .wordResultCondition(WordResultConditions.wordTags(Arrays.asList("广告")))
        .init();

final String text = "这些 AV 商品什么价格?";

Assert.assertEquals("[AV]", sensitiveWordBsYellow.findAll(text).toString());
Assert.assertEquals("[商品]", sensitiveWordBsAd.findAll(text).toString());
```

Sensitive words are usually contiguous, for example [silly hat].
Then someone cleverly discovers that inserting a few characters in the middle, such as [silly!@#$hat], skips detection while the insult loses none of its punch.
So how do we handle scenarios like this?
We can specify a skip set of special characters and ignore these meaningless characters.
Supported since v0.11.0.
The character strategy behind charIgnore can be flexibly defined by users.
```java
final String text = "傻@冒,狗+东西";

// by default the special characters split the words, so nothing is recognized
List<String> wordList = SensitiveWordBs.newInstance().init().findAll(text);
Assert.assertEquals("[]", wordList.toString());

// specify the ignored-character strategy; you can also implement your own
List<String> wordList2 = SensitiveWordBs.newInstance()
        .charIgnore(SensitiveWordCharIgnores.specialChars())
        .init()
        .findAll(text);
Assert.assertEquals("[傻@冒, 狗+东西]", wordList2.toString());
```

Sometimes we want to attach category labels to sensitive words, such as politics-related, violence, etc.
That way, more features can be driven by the labels, such as handling only a certain category.
Supported version: v0.10.0
Main features supported since: v0.24.0
This is just an abstract interface; users can provide their own implementation, for example backed by database queries, file reads, or API calls.

```java
public interface IWordTag {

    /**
     * Query the tag set of a word.
     * @param word the (dirty) word
     * @return the tags
     */
    Set<String> getTag(String word);

}
```

To make most scenarios convenient, several strategies are implemented in the WordTags class:
| Method | Description | Remark |
|---|---|---|
| none() | empty implementation | v0.10.0 |
| file(String filePath) | specify a file path | v0.10.0 |
| file(String filePath, String wordSplit, String tagSplit) | specify a file path, plus the word and tag separators | v0.10.0 |
| map(final Map<String, Set<String>> wordTagMap) | initialize from a map | v0.24.0 |
| lines(Collection<String> lines) | a list of strings | v0.24.0 |
| lines(Collection<String> lines, String wordSplit, String tagSplit) | a list of strings, plus the word and tag separators | v0.24.0 |
| system() | built-in system files, integrating categories collected from the internet | v0.24.0 |
| defaults() | the default policy, currently system | v0.24.0 |
| chains(IWordTag... others) | chain method; lets users combine multiple policies | v0.24.0 |
The format of sensitive word tags: by default, each line is `敏感词 tag1,tag2`, meaning the tags of 敏感词 are tag1 and tag2.
For example:

```
五星红旗 政治,国家
```

This format is recommended for all file lines and string contents; if it does not fit your needs, just provide a custom implementation.
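The `word tag1,tag2` format is also easy to parse yourself if you need a custom source. A hypothetical parser sketch (the class and method names are ours, not the library's):

```java
import java.util.*;

/** Parses "word tag1,tag2" lines into a word -> tags map (illustrative sketch). */
public class WordTagParser {

    /** Default separators matching the documented format: space, then comma. */
    public static Map<String, Set<String>> parse(Collection<String> lines) {
        return parse(lines, " ", ",");
    }

    public static Map<String, Set<String>> parse(Collection<String> lines, String wordSplit, String tagSplit) {
        Map<String, Set<String>> map = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // split into "word" and "tag1,tag2" at the first separator
            String[] parts = trimmed.split(wordSplit, 2);
            Set<String> tags = new LinkedHashSet<>();
            if (parts.length > 1) {
                tags.addAll(Arrays.asList(parts[1].split(tagSplit)));
            }
            map.put(parts[0], tags);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> map = parse(Arrays.asList("五星红旗 政治,国家"));
        System.out.println(map.get("五星红旗"));
    }
}
```

A map built this way could feed the library's `WordTags.map(...)` entry point, though wiring that up is left to the reader.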
Starting with v0.24.0, the default word tag is WordTags.system().
Note: the current data is collected from the internet and has many omissions. Corrections and continued improvements are welcome.
```java
SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
        .wordTag(WordTags.system())
        .init();

Set<String> tagSet = sensitiveWordBs.tags("博彩");
Assert.assertEquals("[3]", tagSet.toString());
```

Here, to optimize the compressed size, the categories are represented by numbers.
The meanings of the numbers are as follows:

```
0 政治 (politics)
1 毒品 (drugs)
2 色情 (pornography)
3 赌博 (gambling)
4 违法 (other illegal content)
```
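Since the built-in tags are numeric codes, a small lookup table can translate them back into readable names. This helper is our own, not part of the library:

```java
import java.util.*;

/** Translates the built-in numeric tag codes into readable names (our own helper). */
public class TagNames {

    private static final Map<String, String> NAMES = new HashMap<>();

    static {
        NAMES.put("0", "政治"); // politics
        NAMES.put("1", "毒品"); // drugs
        NAMES.put("2", "色情"); // pornography
        NAMES.put("3", "赌博"); // gambling
        NAMES.put("4", "违法"); // other illegal content
    }

    public static Set<String> translate(Set<String> codes) {
        Set<String> result = new LinkedHashSet<>();
        for (String code : codes) {
            // unknown codes pass through unchanged
            result.add(NAMES.getOrDefault(code, code));
        }
        return result;
    }

    public static void main(String[] args) {
        // e.g. sensitiveWordBs.tags("博彩") returns [3]
        System.out.println(translate(new LinkedHashSet<>(Arrays.asList("3"))));
    }
}
```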
Here we take a file as an example to demonstrate usage.

```java
final String path = "~\test\resources\dict_tag_test.txt";

// the default method
IWordTag wordTag = WordTags.file(path);

SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
        .wordTag(wordTag)
        .init();

Set<String> tagSet = sensitiveWordBs.tags("零售");
Assert.assertEquals("[广告, 网络]", tagSet.toString());

// specifying the separators
IWordTag wordTag2 = WordTags.file(path, " ", ",");

SensitiveWordBs sensitiveWordBs2 = SensitiveWordBs.newInstance()
        .wordTag(wordTag2)
        .init();

Set<String> tagSet2 = sensitiveWordBs2.tags("零售");
Assert.assertEquals("[广告, 网络]", tagSet2.toString());
```

Where the custom content of dict_tag_test.txt is as follows:

```
零售 广告,网络
```
When extracting sensitive words, we can set the corresponding result handler to obtain the words' tag information as well.

```java
// custom test tags
IWordTag wordTag = WordTags.lines(Arrays.asList("天安门 政治,国家,地址"));

// initialize with it
SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
        .wordTag(wordTag)
        .init();

List<WordTagsDto> wordTagsDtoList1 = sensitiveWordBs.findAll("天安门", WordResultHandlers.wordTags());
Assert.assertEquals("[WordTagsDto{word='天安门', tags=[政治, 国家, 地址]}]", wordTagsDtoList1.toString());
```

We customize the tags of the keyword 天安门, then specify WordResultHandlers.wordTags() as the result handler of findAll, and we obtain the corresponding tag list along with each sensitive word.
Sometimes we want the loading of sensitive words to be dynamic, e.g. modified through a console and then effective in real time.
v0.0.13 supports this feature.
To implement this while staying compatible with previous functionality, we defined two interfaces.
The IWordDeny interface is as follows; you can provide your own implementation.
The returned list is treated as sensitive words.
```java
/**
 * Denied data: the returned contents are treated as sensitive words.
 * @author binbin.hou
 * @since 0.0.13
 */
public interface IWordDeny {

    /**
     * Get the result.
     * @return the result
     * @since 0.0.13
     */
    List<String> deny();

}
```

For example:
```java
public class MyWordDeny implements IWordDeny {

    @Override
    public List<String> deny() {
        return Arrays.asList("我的自定义敏感词");
    }

}
```

The IWordAllow interface is as follows; you can provide your own implementation.
The returned list is treated as not sensitive.
```java
/**
 * Allowed data: the returned contents are not treated as sensitive words.
 * @author binbin.hou
 * @since 0.0.13
 */
public interface IWordAllow {

    /**
     * Get the result.
     * @return the result
     * @since 0.0.13
     */
    List<String> allow();

}
```

For example:
```java
public class MyWordAllow implements IWordAllow {

    @Override
    public List<String> allow() {
        return Arrays.asList("五星红旗");
    }

}
```

After customizing the interfaces, they of course need to be registered to take effect.
To make usage more elegant, we designed the bootstrap class SensitiveWordBs.
You specify sensitive words through wordDeny(), non-sensitive words through wordAllow(), and initialize the sensitive word dictionary through init().
```java
SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .wordDeny(WordDenys.defaults())
        .wordAllow(WordAllows.defaults())
        .init();

final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";
Assert.assertTrue(wordBs.contains(text));
```

Note: init() builds the sensitive word DFA and is time-consuming. It is generally recommended to initialize once at application startup, not repeatedly!
We can test the custom implementations as follows:

```java
String text = "这是一个测试,我的自定义敏感词。";

SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .wordDeny(new MyWordDeny())
        .wordAllow(new MyWordAllow())
        .init();
Assert.assertEquals("[我的自定义敏感词]", wordBs.findAll(text).toString());
```

Here 我的自定义敏感词 is the only sensitive word, and 测试 is not a sensitive word.
Of course, these are entirely our own custom implementations. In general, it is recommended to combine the system defaults with custom configuration, using the following methods:
- WordDenys.chains() merges multiple implementations into a single IWordDeny.
- WordAllows.chains() merges multiple implementations into a single IWordAllow.
Example:
```java
String text = "这是一个测试。我的自定义敏感词。";

IWordDeny wordDeny = WordDenys.chains(WordDenys.defaults(), new MyWordDeny());
IWordAllow wordAllow = WordAllows.chains(WordAllows.defaults(), new MyWordAllow());

SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .wordDeny(wordDeny)
        .wordAllow(wordAllow)
        .init();
Assert.assertEquals("[我的自定义敏感词]", wordBs.findAll(text).toString());
```

This uses the system default configuration and the custom configuration at the same time.
Note: we initialized a new wordBs, so use that new wordBs for the checks, not the earlier SensitiveWordHelper utility methods; the utility methods always use the default configuration!
In actual use, you can, for example, modify the configuration on a console page and have it take effect in real time.
The data is stored in a database. Below is pseudo-code; you can also refer to SpringSensitiveWordConfig.java.
Requires version v0.0.15 or above.
The simplified pseudo-code is as follows; the data source is a database.
MyDdWordAllow and MyDdWordDeny are custom implementation classes backed by the database.
```java
@Configuration
public class SpringSensitiveWordConfig {

    @Autowired
    private MyDdWordAllow myDdWordAllow;

    @Autowired
    private MyDdWordDeny myDdWordDeny;

    /**
     * Initialize the bootstrap class.
     * @return the bootstrap class
     * @since 1.0.0
     */
    @Bean
    public SensitiveWordBs sensitiveWordBs() {
        SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
                .wordAllow(WordAllows.chains(WordAllows.defaults(), myDdWordAllow))
                .wordDeny(myDdWordDeny)
                // various other configuration options
                .init();
        return sensitiveWordBs;
    }

}
```

Initializing the sensitive word dictionary is time-consuming, so it is recommended to do the init once at program startup.
Since v0.6.0, a corresponding benchmark test has been added: BenchmarkTimesTest.
The test environment is an ordinary laptop:

```
CPU:    12th Gen Intel(R) Core(TM) i7-1260P, 2.10 GHz
RAM:    16.0 GB (15.7 GB usable)
System: 64-bit OS, x64 processor
```

Note: results vary between environments, but the ratio is basically stable.
Test data: a string of 100+ characters, looped 100,000 times.
| No. | Scenario | Time | Remark |
|---|---|---|---|
| 1 | sensitive word check only, no format conversion | 1470 ms, about 72,000 QPS | configure this way when pursuing extreme performance |
| 2 | sensitive word check with all format conversions enabled | 2744 ms, about 37,000 QPS | fits most scenarios |
- Remove single-Chinese-character sensitive words; Chinese should treat phrases as the matching unit, to reduce the false positive rate
- Support editing individual sensitive words (remove/add/edit)?
- Sensitive word tag interface support
- Tag support when processing sensitive words
- Memory usage comparison and optimization of wordData
- Let users specify custom phrases and obtain combinations of the specified phrases, for more flexibility
- FormatCombine/CheckCombine/AllowDenyCombine combination strategies, allowing user customization
- Optimize the word check strategy: unified traversal + conversion
- Add ThreadLocal and other performance optimizations
- sensitive-word-admin sensitive word console v1.2.0 open sourced
- How does the sensitive-word-admin v1.3.0 release support distributed deployment?
- 01 - Getting started with the open source sensitive word tool
- 02 - How to implement a sensitive word tool? Clarifying the idea behind a banned-words implementation
- 03 - StopWord support: stop word optimization and special symbols
- 04 - Slimming down the sensitive word dictionary
- 05 - A detailed explanation of the DFA algorithm for sensitive words (Trie tree algorithm)
- 06 - How can sensitive (dirty) words ignore meaningless characters? Achieving a better filtering effect
- v0.10.0 - Preliminary support for dirty word category tags
- v0.11.0 - New features: ignore meaningless characters, word tag dictionary
- v0.12.0 - Sensitive/dirty word tagging ability further enhanced
- v0.13.0 - Feature release: supports full-word matching of English words
- v0.16.1 - New feature: dictionary memory resource release
- v0.19.0 - New feature: sensitive words can be edited individually without repeated initialization
- v0.20.0 - New feature: numbers match in full, not partially
- v0.21.0 - New feature: whitelists support individual editing, fixing the case where whitelists contain blacklist entries
- pinyin: Chinese characters to Pinyin
- pinyin2hanzi: Pinyin to Chinese characters
- segment: high-performance Chinese word segmentation
- opencc4j: Traditional/Simplified Chinese conversion
- nlp-hanzi-similar: Chinese character similarity
- word-checker: spelling detection
- sensitive-word: sensitive word filtering