Python容錯(cuò)的前綴樹(shù)實(shí)現(xiàn)中文糾錯(cuò)
本文使用 Python 實(shí)現(xiàn)了前綴樹(shù),并且支持編輯距離容錯(cuò)的查詢。文中的前綴樹(shù)只存儲(chǔ)了三個(gè)分詞,格式為 (分詞字符串,頻率) ,如:(’中海晉西園’, 2)、(’中海西園’, 24)、(’中南海’, 4),可以換成自己的文件進(jìn)行數(shù)據(jù)的替換。在查詢的時(shí)候要指定一個(gè)字符串和最大的容錯(cuò)編輯距離。
實(shí)現(xiàn)class Word: def __init__(self, word, freq):self.word = wordself.freq = freqclass Trie: def __init__(self):self.root = LetterNode(’’)self.START = 3 def insert(self, word, freq):self.root.insert(word, freq, 0) def findAll(self, query, maxDistance):suggestions = self.root.recommend(query, maxDistance, self.START)return sorted(set(suggestions), key=lambda x: x.freq)class LetterNode: def __init__(self, char):self.REMOVE = -1self.ADD = 1self.SAME = 0self.CHANGE = 2self.START = 3self.pointers = []self.char = charself.word = None def charIs(self, c):return self.char == c def insert(self, word, freq, depth):if ’ ’ in word: word = [i for i in word.split(’ ’)]if depth < len(word): c = word[depth].lower() for next in self.pointers:if next.charIs(c): return next.insert(word, freq, depth + 1) nextNode = LetterNode(c) self.pointers.append(nextNode) return nextNode.insert(word, freq, depth + 1)else: self.word = Word(word, freq) def recommend(self, query, movesLeft, lastAction):suggestions = []length = len(query)if length >= 0 and movesLeft - length >= 0 and self.word: suggestions.append(self.word)if movesLeft == 0 and length > 0: for next in self.pointers:if next.charIs(query[0]): suggestions += next.recommend(query[1:], movesLeft, self.SAME) breakelif movesLeft > 0: for next in self.pointers:if length > 0: if next.charIs(query[0]):suggestions += next.recommend(query[1:], movesLeft, self.SAME) else:suggestions += next.recommend(query[1:], movesLeft - 1, self.CHANGE)if lastAction != self.CHANGE and lastAction != self.REMOVE: suggestions += next.recommend(query, movesLeft - 1, self.ADD)if lastAction != self.ADD and lastAction != self.CHANGE: if length > 1 and next.charIs(query[1]):suggestions += next.recommend(query[2:], movesLeft - 1, self.REMOVE) elif length > 2 and next.charIs(query[2]) and movesLeft == 2:suggestions += next.recommend(query[3:], movesLeft - 2, self.REMOVE)else: if lastAction != self.CHANGE and lastAction != self.REMOVE:suggestions += next.recommend(query, movesLeft - 1, self.ADD)return suggestionsdef buildTrieFromFile(): trie = Trie() rows = [(’中海晉西園’, 2),(’中海西園’, 24),(’中南海’, 4)] for row in rows:trie.insert(row[0], int(row[1])) return triedef suggestor(trie, s, maxDistance): if ’ ’ in s:s = [x for x in s.split(’ ’)] suggestions = trie.findAll(s, maxDistance) return [str(x.word) for x in suggestions]if __name__ == '__main__': trie = buildTrieFromFile() r = suggestor(trie, ’中海晉西園’, 1) print(r)
分析
結(jié)果打印:[’中海晉西園’, ’中海西園’]
可以看出“中海晉西園”是和輸入完全相同的字符串,編輯距離為 0 ,所以符合最大編輯距離為 1 的要求,直接返回。
“中海西園”是“中海晉西園”去掉“晉”字之后的結(jié)果,編輯距離為 1, 所以符合最大編輯距離為 1 的要求,直接返回。
另外,“中南海”和“中海晉西園”的編輯距離為 4 ,不符合最大編輯距離為 1 的要求,所以結(jié)果中沒(méi)有出現(xiàn)。
參考https://github.com/leoRoss/AutoCorrectTrie
到此這篇關(guān)于Python容錯(cuò)的前綴樹(shù)實(shí)現(xiàn)中文糾錯(cuò)的文章就介紹到這了,更多相關(guān)Python 中文糾錯(cuò)內(nèi)容請(qǐng)搜索好吧啦網(wǎng)以前的文章或繼續(xù)瀏覽下面的相關(guān)文章希望大家以后多多支持好吧啦網(wǎng)!
相關(guān)文章:
1. asp在iis7報(bào)錯(cuò)行號(hào)不準(zhǔn)問(wèn)題的解決方法2. 三個(gè)不常見(jiàn)的 HTML5 實(shí)用新特性簡(jiǎn)介3. ASP中解決“對(duì)象關(guān)閉時(shí),不允許操作。”的詭異問(wèn)題……4. 原生js XMLhttprequest請(qǐng)求onreadystatechange執(zhí)行兩次的解決5. 刪除docker里建立容器的操作方法6. CSS代碼檢查工具stylelint的使用方法詳解7. msxml3.dll 錯(cuò)誤 800c0019 系統(tǒng)錯(cuò)誤:-2146697191解決方法8. jsp實(shí)現(xiàn)簡(jiǎn)單用戶7天內(nèi)免登錄9. CSS3實(shí)現(xiàn)動(dòng)態(tài)翻牌效果 仿百度貼吧3D翻牌一次動(dòng)畫(huà)特效10. 匹配模式 - XSL教程 - 4
