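Nearly every website now needs sensitive-word filtering; it seems to have become standard equipment for a site. If your site lacks it, or you haven't handled it properly, beware: the relevant authorities may invite you in for tea.
I recently researched how to implement sensitive-word filtering for Java web sites. I gathered material online, verified it myself, and am writing up the results here for reference.
1. Sensitive-word filter utility class
Load the word list into an ArrayList, then use a two-level loop to find every substring that matches an entry in the list. Each hit is replaced with * characters, which yields the filtered string.
In my tests this approach matched fairly accurately and ran at a good speed.
Initializing the sensitive-word list: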
    // Initialize the sensitive-word list.
    // Requires java.io.BufferedReader, java.io.FileNotFoundException, java.io.IOException,
    // java.io.InputStream, java.io.InputStreamReader and java.util.ArrayList.
    public void InitializationWork()
    {
        // Pre-build a run of replacement characters; masks are sliced from it later
        replaceAll = new StringBuilder(replceSize);
        for (int x = 0; x < replceSize; x++)
        {
            replaceAll.append(replceStr);
        }
        // Load the word list from the classpath, one word per line
        arrayList = new ArrayList<String>();
        // try-with-resources closes the stream even if reading fails
        try (InputStream in = SensitiveWord.class.getClassLoader().getResourceAsStream(fileName)) {
            if (in == null) {
                throw new FileNotFoundException("Word list not found on classpath: " + fileName);
            }
            BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(in, encoding));
            for (String txt; (txt = bufferedReader.readLine()) != null; ) {
                // Skip blank lines (an empty word would match everywhere) and duplicates
                if (!txt.isEmpty() && !arrayList.contains(txt)) {
                    arrayList.add(txt);
                }
            }
        } catch (IOException e) { // also covers UnsupportedEncodingException
            e.printStackTrace();
        }
    }
Filtering sensitive words:
    // Mask every sensitive word in str and return the filtered string.
    // Requires java.util.ArrayList, java.util.HashMap, java.util.HashSet and java.util.Map.
    public String filterInfo(String str)
    {
        sensitiveWordSet = new HashSet<String>();
        sensitiveWordList = new ArrayList<>();
        StringBuilder buffer = new StringBuilder(str);
        // start index -> end index (exclusive) of the longest word found at that start
        Map<Integer, Integer> hash = new HashMap<Integer, Integer>(arrayList.size());
        for (String temp : arrayList)
        {
            if (temp.isEmpty()) continue; // indexOf("") never advances, so skip empty entries
            int findIndexSize = 0;
            for (int start; (start = buffer.indexOf(temp, findIndexSize)) > -1; )
            {
                findIndexSize = start + temp.length(); // resume searching after this hit
                Integer mapEnd = hash.get(start);      // end already recorded for this start, if any
                if (mapEnd == null || findIndexSize > mapEnd) // keep the longest match per start
                {
                    hash.put(start, findIndexSize);
                }
            }
        }
        for (Map.Entry<Integer, Integer> entry : hash.entrySet())
        {
            int startIndex = entry.getKey();
            int endIndex = entry.getValue();
            // Record the matched word so that hits can be counted later
            String sensitive = buffer.substring(startIndex, endIndex);
            if (!sensitive.contains("*")) { // skip spans already partly masked by an overlapping hit
                sensitiveWordSet.add(sensitive);
                sensitiveWordList.add(sensitive);
            }
            // Mask with the same number of characters; replaceAll must be at least as long as the longest word
            buffer.replace(startIndex, endIndex, replaceAll.substring(0, endIndex - startIndex));
        }
        return buffer.toString();
    }
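To make the idea behind this first approach easier to try in isolation, here is a minimal self-contained sketch. It is not the downloadable SensitiveWord class: the class name `NaiveFilterSketch`, the method `mask`, and the sample words are all made up for illustration, and it masks with a fixed `*` character instead of a configurable replacement string.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the list-plus-indexOf approach; not the author's class.
final class NaiveFilterSketch {

    /** Replaces every occurrence of each word with '*' characters of the same length. */
    static String mask(String text, List<String> words) {
        StringBuilder out = new StringBuilder(text);
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // an empty word would match at every position
            }
            // Search the original text, so earlier masking cannot hide overlapping hits
            for (int at = text.indexOf(word); at >= 0; at = text.indexOf(word, at + 1)) {
                for (int i = at; i < at + word.length(); i++) {
                    out.setCharAt(i, '*');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("bad", "badword", "ugly");
        System.out.println(mask("a badword and an ugly one", words));
    }
}
```

The cost grows with the number of words times the text length, which is usually acceptable for a small word list but is the main reason the tree-based approaches later in this post exist.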
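Download: SensitiveWord
Link: http://pan.baidu.com/s/1skMos8l Password: szqk
2. Java keyword filter
This approach uses regular-expression matching. It is slightly slower than the first approach, and its match accuracy is good.
Main code: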
    // Build the regular expression from keywords.properties.
    // Requires java.io.IOException, java.io.InputStream, java.util.Enumeration,
    // java.util.Properties, java.util.regex.Matcher and java.util.regex.Pattern.
    private static void initPattern() {
        StringBuilder patternBuffer = new StringBuilder();
        try (InputStream in = KeyWordFilter.class.getClassLoader().getResourceAsStream("keywords.properties")) {
            Properties property = new Properties();
            // load(InputStream) decodes the file as ISO-8859-1. If the file holds raw non-ASCII
            // keywords, re-decode each one after loading, e.g.
            // new String(s.getBytes("ISO-8859-1"), "UTF-8") on Unix, or "gb2312" on Windows
            property.load(in);
            Enumeration<?> enu = property.propertyNames();
            while (enu.hasMoreElements()) {
                String scontent = (String) enu.nextElement();
                if (patternBuffer.length() > 0) {
                    patternBuffer.append('|');
                }
                // Quote each keyword so regex metacharacters in it are matched literally
                patternBuffer.append(Pattern.quote(scontent));
                keywordsCount++;
            }
            // Resulting form: (word1|word2|...); leave pattern unset if there are no keywords
            if (patternBuffer.length() > 0) {
                pattern = Pattern.compile("(" + patternBuffer + ")");
            }
        } catch (IOException ioEx) {
            ioEx.printStackTrace();
        }
    }

    private static String doFilter(String str) {
        if (pattern == null) { // no keywords were loaded
            return str;
        }
        Matcher m = pattern.matcher(str);
        // Replace each matched keyword with a single "*". To inspect the matches
        // instead, loop with m.find() and read m.group()
        return m.replaceAll("*");
    }
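The regex approach can also be sketched without the properties file. The version below is an illustration only: the class `RegexFilterSketch` and its methods are hypothetical names, and it makes two choices that differ from the code above. It sorts keywords longest first, because Java's alternation takes the first alternative that matches rather than the longest, and it masks each hit with as many `*` characters as the keyword has.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of alternation-based filtering; not the KeyWordFilter class.
final class RegexFilterSketch {

    /** Builds one alternation pattern, longest keyword first, each keyword quoted literally. */
    static Pattern compile(List<String> keywords) {
        List<String> sorted = new ArrayList<>(keywords);
        sorted.sort((a, b) -> b.length() - a.length()); // longer alternatives win over their prefixes
        StringBuilder sb = new StringBuilder();
        for (String k : sorted) {
            if (k.isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(Pattern.quote(k));
        }
        if (sb.length() == 0) {
            throw new IllegalArgumentException("no keywords given");
        }
        return Pattern.compile(sb.toString());
    }

    /** Replaces each match with '*' repeated to the length of the match. */
    static String mask(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            StringBuilder stars = new StringBuilder();
            for (int i = 0; i < m.group().length(); i++) {
                stars.append('*');
            }
            m.appendReplacement(out, Matcher.quoteReplacement(stars.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern p = compile(Arrays.asList("bad", "badword", "a.b"));
        System.out.println(mask(p, "badword a.b axb"));
    }
}
```

Because the keywords are quoted, `a.b` matches only the literal text `a.b` and leaves `axb` alone.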
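Download: KeyWordFilter
Link: http://pan.baidu.com/s/1kVBl803 Password: xi24
3. Filtering with a DFA
This approach sounds sophisticated: it uses a DFA (deterministic finite automaton). I don't fully understand the algorithm myself. In my tests its match accuracy was poor, though its speed was good. It can probably be improved, and I'd welcome improvements from anyone more expert.
There are two main files: SensitivewordFilter.java and SensitiveWordInit.java
Main code: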
    // Returns the length of the sensitive word that starts at beginIndex in txt, or 0 if none does.
    // Requires java.util.Map.
    public int CheckSensitiveWord(String txt, int beginIndex, int matchType) {
        int matchFlag = 0;     // characters walked along the current branch
        int matchedLength = 0; // length at the most recent end-of-word node
        Map<?, ?> nowMap = sensitiveWordMap;
        for (int i = beginIndex; i < txt.length(); i++) {
            char word = txt.charAt(i);
            nowMap = (Map<?, ?>) nowMap.get(word); // child node for this character
            if (nowMap == null) { // no such branch: stop
                break;
            }
            matchFlag++; // found the key, so one more character matched
            if ("1".equals(nowMap.get("isEnd"))) { // a complete word ends at this node
                // Record the length only at end-of-word nodes, so characters walked along
                // a longer branch that later fails are not counted as part of the match
                matchedLength = matchFlag;
                if (SensitivewordFilter.minMatchTYpe == matchType) { // min match returns at once; max match keeps looking
                    break;
                }
            }
        }
        return matchedLength;
    }
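The post does not show how SensitiveWordInit.java builds the lookup structure, so here is a minimal sketch of how such a character tree could be built and walked. Everything in it is an assumption for illustration: the class `TrieFilterSketch`, the `Node` type, and the method names are hypothetical, and it uses a typed node class instead of nested maps with an "isEnd" key.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of tree-based matching; not the SensitivewordFilter class.
final class TrieFilterSketch {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isEnd; // true if a word ends at this node
    }

    private final Node root = new Node();

    TrieFilterSketch(List<String> words) {
        for (String w : words) {
            if (w.isEmpty()) {
                continue;
            }
            Node n = root;
            for (int i = 0; i < w.length(); i++) {
                n = n.next.computeIfAbsent(w.charAt(i), c -> new Node());
            }
            n.isEnd = true;
        }
    }

    /** Length of the longest word starting at begin, or 0 if no word starts there. */
    int matchLength(String txt, int begin) {
        Node n = root;
        int longest = 0;
        for (int i = begin; i < txt.length(); i++) {
            n = n.next.get(txt.charAt(i));
            if (n == null) {
                break;
            }
            if (n.isEnd) {
                longest = i - begin + 1; // remember only positions where a word really ends
            }
        }
        return longest;
    }

    /** Masks every longest match, scanning left to right. */
    String mask(String txt) {
        StringBuilder out = new StringBuilder(txt);
        int i = 0;
        while (i < txt.length()) {
            int len = matchLength(txt, i);
            if (len == 0) {
                i++;
                continue;
            }
            for (int j = i; j < i + len; j++) {
                out.setCharAt(j, '*');
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TrieFilterSketch filter = new TrieFilterSketch(Arrays.asList("ab", "abcd"));
        System.out.println(filter.mask("abcx abcd"));
    }
}
```

Each position in the text is checked by walking at most as many nodes as the longest word has characters, so the running time does not depend on how many words the list contains.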
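Download: SensitivewordFilter
Link: http://pan.baidu.com/s/1ccsa66 Password: mc1x
4. Multi-way tree search
This approach uses a multi-way tree search; for how the algorithm works, see any data-structures reference. It ships as a jar that you call directly to do the filtering.
In my tests this method matched well but ran slightly slower.
Usage: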
    // Filter sensitive words, masking them with '*'
    FilteredResult result = WordFilterUtil.filterText(str, '*');
    // Filtered content
    System.out.println("Filtered string:\n" + result.getFilteredContent());
    // Original string
    System.out.println("Original string:\n" + result.getOriginalContent());
    // Sensitive words that were replaced
    System.out.println("Replaced sensitive words:\n" + result.getBadWords());
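Download: WordFilterUtil
Link: http://pan.baidu.com/s/1nvftzeD Password: 5t2h
Those are the results of my research; I hope they help.
Finally, here is a download link for a large sensitive-word list:
Link: http://pan.baidu.com/s/1boWQvr5 Password: 4nyc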