
Four Ways to Implement Sensitive-Word Filtering in Python

Date: 2022-07-11 14:05:03

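In some everyday settings, sensitive words turn up that should not be there, and we usually mask them with *, for example: 尼玛 -> **. Abusive words and politically sensitive words should not appear in public places, so we need some way to mask them. Below I introduce a few simple approaches to masking sensitive words.

(I have put the profanity into images as far as I could; otherwise the article could not be published.)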

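Method 1: Filtering with replace

replace is the simplest form of string substitution. When a string may contain a sensitive word, we call the string's replace method to substitute * for that word.

Drawback:

This is acceptable when the text and the word list are small, but it becomes inefficient as either grows.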

    import datetime

    # filter_sentence is the text after replace() has masked the sensitive word.
    # Its definition was shown as an image in the original post and is not reproduced here.
    now = datetime.datetime.now()
    print(filter_sentence, ' | ', now)

If there are several sensitive words, put them in a list and replace them one by one:

    # dirty (the list of sensitive words) and speak (the text to filter)
    # were defined in an image in the original post.
    for i in dirty:
        speak = speak.replace(i, '*')
    print(speak, ' | ', now)
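To make the replace approach concrete from start to finish, here is a minimal sketch. The word list and sentence are neutral placeholders chosen for illustration (the originals appeared only as images), and `mask_with_replace` is a hypothetical helper name. Unlike the snippet above, it writes one * per character so the masked text keeps its length.

```python
def mask_with_replace(text, words, mask_char='*'):
    """Mask every occurrence of each word using str.replace."""
    for word in words:
        if word:  # skip empty strings, which would match at every position
            text = text.replace(word, mask_char * len(word))
    return text


# Placeholder data, not taken from the original article
dirty = ['darn', 'heck']
speak = 'Well darn, what the heck is this?'
print(mask_with_replace(speak, dirty))
```

Each word costs one full pass over the text, so the work grows with the number of words times the length of the text, which is the drawback mentioned above.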

Method 2: Filtering with regular expressions

Regular expressions are a solid matching tool. Almost every day-to-day search uses them, and so do our web crawlers. Here we mainly rely on "|", which matches any one of several alternative target strings. A simple example:

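    import re

    def sentence_filter(keywords, text):
        # Escape each keyword so that regex metacharacters in it are matched literally
        pattern = '|'.join(re.escape(k) for k in keywords)
        return re.sub(pattern, '***', text)

    # dirty and speak are the word list and text from Method 1
    print(sentence_filter(dirty, speak))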
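When the same word list is applied to many texts, it may help to compile the alternation once and reuse it. The sketch below is one way to do that, not the article's own code: `build_pattern` and `mask_with_regex` are hypothetical names, the keywords are placeholders, and sorting the keywords longest-first is my assumption about how to prefer a long word over its own prefix.

```python
import re


def build_pattern(keywords):
    """Compile one alternation pattern from a list of keywords."""
    # Longest keywords first, so a longer word wins over its own prefix
    ordered = sorted(set(keywords), key=len, reverse=True)
    return re.compile('|'.join(re.escape(k) for k in ordered))


def mask_with_regex(pattern, text, mask_char='*'):
    """Replace each match with mask characters of the same length."""
    return pattern.sub(lambda m: mask_char * len(m.group()), text)


# Placeholder keywords, not taken from the original article
pattern = build_pattern(['bad', 'badword', 'a.b'])
print(mask_with_regex(pattern, 'a badword and a.b but not axb'))
```

Because of `re.escape`, the keyword `a.b` only matches the literal text `a.b` and leaves `axb` alone.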

Method 3: The DFA filtering algorithm

DFA stands for Deterministic Finite Automaton. The basic idea is to detect sensitive words through state transitions, so that scanning the text once is enough to check it against every sensitive word. (See the code comments for the implementation.)
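Before the full implementation, a minimal sketch of the state-transition idea may help. It assumes the states are stored as nested dicts, with an `is_end` key marking an accepting state; the function names and the sample words are mine. For brevity it restarts the automaton at each text position and masks the longest match found there.

```python
def build_dfa(words):
    """Build a nested-dict state machine; each character is one transition."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['is_end'] = True  # accepting state: a complete word ends here
    return root


def longest_match(root, text, start):
    """Return the length of the longest word starting at text[start], or 0."""
    node, best = root, 0
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:  # no transition for this character
            break
        if node.get('is_end'):
            best = i - start + 1
    return best


def mask_with_dfa(root, text, mask_char='*'):
    """Mask every longest match, moving left to right through the text."""
    out, i = [], 0
    while i < len(text):
        n = longest_match(root, text, i)
        out.append(mask_char * n if n else text[i])
        i += n or 1
    return ''.join(out)


# Placeholder words, not taken from the original article
dfa_root = build_dfa(['ab', 'abc', 'bcd'])
print(mask_with_dfa(dfa_root, 'xabcdy'))
```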

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    # @Time: 2020/4/15 11:40
    # @Software: PyCharm
    # article_add: https://www.cnblogs.com/JentZhang/p/12718092.html
    __author__ = 'JentZhang'

    import json

    MinMatchType = 1  # shortest-match rule
    MaxMatchType = 2  # longest-match rule


    class DFAUtils(object):
        '''
        DFA algorithm
        '''

        def __init__(self, word_warehouse):
            '''
            Initialise the algorithm
            :param word_warehouse: the word list
            '''
            # Word dictionary (root of the nested-dict state machine)
            self.root = dict()
            # Meaningless characters to skip during detection (ideally maintained
            # in one place, e.g. a database or some other storage)
            self.skip_root = [' ', '&', '!', '！', '@', '#', '$', '￥', '*', '^', '%', '?', '？', '<', '>', '《', '》']
            # Load the word dictionary
            for word in word_warehouse:
                self.add_word(word)

        def add_word(self, word):
            '''
            Add a word to the dictionary
            :param word:
            :return:
            '''
            now_node = self.root
            word_count = len(word)
            for i in range(word_count):
                char_str = word[i]
                if char_str in now_node:
                    # Key exists: descend into it for the next iteration
                    now_node = now_node[char_str]
                else:
                    # Key missing: create a new node
                    new_node = dict()
                    new_node['is_end'] = False
                    now_node[char_str] = new_node
                    now_node = new_node
                if i == word_count - 1:
                    # Last character: mark the end of a word. Existing marks are never
                    # cleared, so a word and its own prefix can both be stored.
                    now_node['is_end'] = True

        def check_match_word(self, txt, begin_index, match_type=MinMatchType):
            '''
            Check whether a sensitive word starts at begin_index
            :param txt: the text to check
            :param begin_index: index in txt at which the candidate word starts
            :param match_type: 1 = shortest match, 2 = longest match
            :return: length of the match (skipped characters included), or 0 if none
            '''
            matched_length = 0  # length up to the last complete word found
            now_map = self.root
            tmp_flag = 0  # characters consumed so far, skipped ones included
            for i in range(begin_index, len(txt)):
                word = txt[i]
                # Skip special characters, but only after the start of a word has been found
                if word in self.skip_root and now_map is not self.root:
                    tmp_flag += 1
                    continue
                # Follow the transition for this character
                now_map = now_map.get(word)
                if now_map:
                    tmp_flag += 1
                    if now_map.get('is_end'):
                        # A complete word ends here
                        matched_length = tmp_flag
                        # Shortest match returns at once; longest match keeps looking
                        if match_type == MinMatchType:
                            break
                else:
                    # No transition: stop
                    break
            if matched_length < 2:
                # Single-character matches are not treated as words
                matched_length = 0
            return matched_length

        def get_match_word(self, txt, match_type=MinMatchType):
            '''
            Get the matched words
            :param txt: the text to check
            :param match_type: 1 = shortest match, 2 = longest match
            :return: list of matched words found in the text
            '''
            matched_word_list = list()
            for i in range(len(txt)):
                length = self.check_match_word(txt, i, match_type)
                if length > 0:
                    matched_word_list.append(txt[i:i + length])
            return matched_word_list

        def is_contain(self, txt, match_type=MinMatchType):
            '''
            Check whether the text contains a sensitive word
            :param txt: the text to check
            :param match_type: 1 = shortest match, 2 = longest match
            :return: True if it does, otherwise False
            '''
            for i in range(len(txt)):
                if self.check_match_word(txt, i, match_type) > 0:
                    return True
            return False

        def replace_match_word(self, txt, replace_char='*', match_type=MinMatchType):
            '''
            Replace the matched words
            :param txt: the text to check
            :param replace_char: replacement character, applied once per character of the match;
                e.g. text '你是大王八', sensitive word '王八', replace_char '*' gives '你是大**'
            :param match_type: 1 = shortest match, 2 = longest match
            :return: the text with sensitive words replaced (unchanged if none were found)
            '''
            for word in self.get_match_word(txt, match_type):
                txt = txt.replace(word, len(word) * replace_char)
            return txt


    if __name__ == '__main__':
        # The original word list and message were shown as an image; the values
        # below reuse the example from the replace_match_word docstring instead.
        word_warehouse = ['王八']
        msg = '你是大王八'
        dfa = DFAUtils(word_warehouse=word_warehouse)
        print('Dictionary structure:', json.dumps(dfa.root, ensure_ascii=False))
        print('Contains sensitive words:', dfa.is_contain(msg))
        print('Matched words:', dfa.get_match_word(msg))
        print('Text after replacement:', dfa.replace_match_word(msg))

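Method 4: The AC automaton

The AC (Aho-Corasick) automaton requires some background: the Trie. (In brief: also called a prefix tree or dictionary tree, it is a structure for string problems that makes it fast to look up information about the stored strings.)

For details, see: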

https://www.luogu.com.cn/blog/juruohyfhaha/trie-xue-xi-zong-jie
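As a quick illustration of the Trie described above, here is a minimal sketch. The class and method names are my own choices rather than anything from the linked article, and the sample words are placeholders.

```python
class Trie:
    """A prefix tree: each node maps a character to a child node."""

    def __init__(self):
        self.children = {}
        self.is_word = False  # True if a stored word ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def _walk(self, prefix):
        """Follow prefix from the root; return the node reached, or None."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        return self._walk(prefix) is not None


trie = Trie()
for w in ['foo', 'foobar', 'bar']:
    trie.insert(w)
print(trie.contains('foo'), trie.contains('fo'), trie.starts_with('fo'))
```

Words that share a prefix share the nodes for that prefix, which is why a lookup costs time proportional to the length of the query rather than the size of the word list.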

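The AC automaton adds a fail pointer to each node of the Trie. If matching fails at the current node, the search moves to the node that the fail pointer refers to, so it can keep matching straight through without backtracking.

I won't go into the detailed matching mechanism here; for more on the AC automaton, see this article: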

https://www.jb51.net/article/128711.htm

In Python, the ahocorasick module provides a quick implementation:

    # python3 -m pip install pyahocorasick
    import ahocorasick

    def build_actree(wordlist):
        actree = ahocorasick.Automaton()
        for index, word in enumerate(wordlist):
            actree.add_word(word, (index, word))
        actree.make_automaton()
        return actree

    if __name__ == '__main__':
        # wordlist (the sensitive words) and sent (the text to filter)
        # were defined in an image in the original post.
        actree = build_actree(wordlist=wordlist)
        sent_cp = sent
        # iter() yields (end_index, value); value is the (index, word) tuple stored above
        for i in actree.iter(sent):
            sent_cp = sent_cp.replace(i[1][1], '**')
            print('Masked word:', i[1][1])
        print('Masked result:', sent_cp)

Of course, we can also write an AC automaton by hand. For reference:

    from collections import deque


    class TrieNode(object):
        __slots__ = ['value', 'next', 'fail', 'emit']

        def __init__(self, value):
            self.value = value
            self.next = dict()   # child nodes keyed by character
            self.fail = None     # fail pointer
            self.emit = None     # set of words that end at this node


    class AhoCorasic(object):
        __slots__ = ['_root']

        def __init__(self, words):
            self._root = AhoCorasic._build_trie(words)

        @staticmethod
        def _build_trie(words):
            assert isinstance(words, list) and words
            # Step 1: build the Trie
            root = TrieNode('root')
            for word in words:
                node = root
                for c in word:
                    if c not in node.next:
                        node.next[c] = TrieNode(c)
                    node = node.next[c]
                if not node.emit:
                    node.emit = {word}
                else:
                    node.emit.add(word)
            # Step 2: breadth-first traversal to set the fail pointers
            queue = deque([(root, None)])
            while queue:
                curr, parent = queue.popleft()
                for sub in curr.next.values():
                    queue.append((sub, curr))
                if parent is None:
                    continue
                elif parent is root:
                    curr.fail = root
                else:
                    fail = parent.fail
                    while fail and curr.value not in fail.next:
                        fail = fail.fail
                    if fail:
                        curr.fail = fail.next[curr.value]
                    else:
                        curr.fail = root
            return root

        def search(self, s):
            seq_list = []
            node = self._root
            for i, c in enumerate(s):
                matched = True
                # Follow fail pointers until some node has a transition for c
                while c not in node.next:
                    if not node.fail:
                        matched = False
                        node = self._root
                        break
                    node = node.fail
                if not matched:
                    continue
                node = node.next[c]
                if node.emit:
                    for word in node.emit:
                        from_index = i + 1 - len(word)
                        seq_list.append((from_index, word))
                    # Restart from the root after a match, so overlapping
                    # matches are not reported
                    node = self._root
            return seq_list


    if __name__ == '__main__':
        aho = AhoCorasic(['foo', 'bar'])
        print(aho.search('barfoothefoobarman'))
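The `search` method above returns (from_index, word) tuples rather than masked text. A small helper can turn such a list into a filtered string by position, which avoids a second scan with `str.replace`. This is a sketch of my own: `mask_spans` is a hypothetical name, and the `matches` list is written out by hand, as traced from the sample call above, so that the block runs by itself.

```python
def mask_spans(text, matches, mask_char='*'):
    """Mask text at the (from_index, word) positions reported by a matcher."""
    chars = list(text)
    for from_index, word in matches:
        for i in range(from_index, from_index + len(word)):
            chars[i] = mask_char
    return ''.join(chars)


# Hand-written matches for 'barfoothefoobarman' with the words 'foo' and 'bar'
matches = [(0, 'bar'), (3, 'foo'), (9, 'foo'), (12, 'bar')]
print(mask_spans('barfoothefoobarman', matches))
```

Masking by index only touches the positions that were actually matched, and overlapping spans are handled naturally because a character is simply overwritten twice.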

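Those are four ways to implement sensitive-word filtering in Python. The first two are fairly simple. The last two lean on algorithms, so you need to understand how the algorithm works first; after that the code is easy to follow. (DFA is a commonly used filtering technique, so I recommend learning it.)

Finally, here is a sensitive-word list: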

https://github.com/qloog/sensitive_words

