當(dāng)前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

Go语言的分词器（sego）

發(fā)布時間：2024/4/11 编程问答 25 豆豆

生活随笔收集整理的這篇文章主要介紹了 Go语言的分词器（sego）小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

今天，主要來介紹一個Go語言的中文分詞器，即sego。本分詞器是由陳輝寫的，他的微博在這里，github詳

見此處。由于之前他在Google，所以對Go語言特別熟悉。sego的介紹如下

?? sego是Go語言的中文分詞器，詞典用前綴樹實現(xiàn)，分詞器算法為基于詞頻的最短路徑加動態(tài)規(guī)劃。

?? 支持普通和搜索引擎兩種分詞模式，支持用戶詞典、詞性標(biāo)注，可運行JSON RPC服務(wù)。

?? 分詞速度單線程2.7MB/s，goroutines并發(fā)13MB/s, 處理器Core i7-3615QM 2.30GHz 8核。

接下來，以如下幾個方面來介紹sego

?? 1. sego的安裝

?? 2. sego的原理

?? 3. sego的使用

1. sego的安裝

???首先，在Go語言中，有很多第三方包，可以幫助我們實現(xiàn)某些特定的功能。比如這里。而sego的項目在這里。

?? 先把工程根目錄加入GOPATH，然后執(zhí)行如下命令

????

???然后在sego項目根目錄下就得到了src和pkg，這是一個Go語言項目，可以在src中創(chuàng)建文件進行使用。

?? sego的源文件如下

???

???其中詞庫在data目錄下，至此，sego就可以直接使用了。接下來會介紹sego的原理。

2. sego的原理

?? 之前，我用過中科院的分詞器ICTCLAS2014，它是一個非常優(yōu)秀的分詞器，支持C++，Java，C#，Hadoop等

?? 等，使用起來非常方便，分詞效果也不錯，但是它不是開源的。今天介紹的sego分詞器是開源Go語言分詞器，

??? 它的原理是：詞典用前綴樹實現(xiàn)，而分詞器算法為基于詞頻的最短路徑加動態(tài)規(guī)劃。具體參考這兩部分的代碼

???詞典：dictionary.go

package segoimport ("bytes" )// Dictionary結(jié)構(gòu)體實現(xiàn)了一個字串前綴樹，一個分詞可能出現(xiàn)在葉子節(jié)點也有可能出現(xiàn)在非葉節(jié)點 type Dictionary struct {root node // 根節(jié)點maxTokenLength int // 詞典中最長的分詞numTokens int // 詞典中分詞數(shù)目tokens []*Token // 詞典中所有的分詞，方便遍歷totalFrequency int64 // 詞典中所有分詞的頻率之和 }// 前綴樹節(jié)點 type node struct {word Text // 該節(jié)點對應(yīng)的字元token *Token // 當(dāng)此節(jié)點沒有對應(yīng)的分詞時值為nilchildren []*node // 該字元后繼的所有可能字元，當(dāng)為葉子節(jié)點時為空 }// 詞典中最長的分詞 func (dict *Dictionary) MaxTokenLength() int {return dict.maxTokenLength }// 詞典中分詞數(shù)目 func (dict *Dictionary) NumTokens() int {return dict.numTokens }// 詞典中所有分詞的頻率之和 func (dict *Dictionary) TotalFrequency() int64 {return dict.totalFrequency }// 向詞典中加入一個分詞 func (dict *Dictionary) addToken(token *Token) {current := &dict.rootfor _, word := range token.text {// 一邊向深處移動一邊添加節(jié)點（如果需要的話）current = upsert(¤t.children, word)}// 當(dāng)這個分詞不存在詞典中時添加此分詞，否則忽略if current.token == nil {current.token = tokenif len(token.text) > dict.maxTokenLength {dict.maxTokenLength = len(token.text)}dict.numTokens++dict.tokens = append(dict.tokens, token)dict.totalFrequency += int64(token.frequency)} }// 在詞典中查找和字元組words可以前綴匹配的所有分詞 // 返回值為找到的分詞數(shù) func (dict *Dictionary) lookupTokens(words []Text, tokens []*Token) int {// 特殊情況if len(words) == 0 {return 0}current := &dict.rootnumTokens := 0for _, word := range words {// 如果已經(jīng)抵達葉子節(jié)點則不再繼續(xù)尋找if len(current.children) == 0 {break}// 否則在該節(jié)點子節(jié)點中進行下個字元的匹配index, found := binarySearch(current.children, word)if !found {break}// 匹配成功，則跳入匹配的子節(jié)點中current = current.children[index]if current.token != nil {tokens[numTokens] = current.tokennumTokens++}}return numTokens }// 二分法查找字元在子節(jié)點中的位置 // 如果查找成功，第一個返回參數(shù)為找到的位置，第二個返回參數(shù)為true // 如果查找失敗，第一個返回參數(shù)為應(yīng)當(dāng)插入的位置，第二個返回參數(shù)false func binarySearch(nodes []*node, word Text) (int, bool) {start := 0end := len(nodes) - 1// 特例：if len(nodes) == 0 {// 當(dāng)slice為空時，插入第一位置return 0, false}compareWithFirstWord := bytes.Compare(word, nodes[0].word)if compareWithFirstWord < 0 {// 當(dāng)要查找的元素小于首元素時，插入第一位置return 0, false} else if compareWithFirstWord == 0 {// 當(dāng)首元素等于node時return 0, true}compareWithLastWord := bytes.Compare(word, nodes[end].word)if compareWithLastWord == 0 {// 當(dāng)尾元素等于node時return end, true} else if compareWithLastWord > 0 {// 當(dāng)尾元素小于node時return end + 1, false}// 二分current := end / 2for end-start > 1 {compareWithCurrentWord := bytes.Compare(word, nodes[current].word)if compareWithCurrentWord == 0 {return current, true} else if compareWithCurrentWord < 0 {end = currentcurrent = (start + current) / 2} else {start = currentcurrent = (current + end) / 2}}return end, false }// 將字元加入節(jié)點數(shù)組中，并返回插入的節(jié)點指針 // 如果字元已經(jīng)存在則返回存在的節(jié)點指針 func upsert(nodes *[]*node, word Text) *node {index, found := binarySearch(*nodes, word)if found {return (*nodes)[index]}*nodes = append(*nodes, nil)copy((*nodes)[index+1:], (*nodes)[index:])(*nodes)[index] = &node{word: word}return (*nodes)[index] }

分詞器算法：segmenter.go

//Go中文分詞 package segoimport ("bufio""fmt""log""math""os""strconv""strings""unicode""unicode/utf8" )const (minTokenFrequency = 2 // 僅從字典文件中讀取大于等于此頻率的分詞 )// 分詞器結(jié)構(gòu)體 type Segmenter struct {dict *Dictionary }// 該結(jié)構(gòu)體用于記錄Viterbi算法中某字元處的向前分詞跳轉(zhuǎn)信息 type jumper struct {minDistance float32token *Token }// 返回分詞器使用的詞典 func (seg *Segmenter) Dictionary() *Dictionary {return seg.dict }// 從文件中載入詞典 // // 可以載入多個詞典文件，文件名用","分隔，排在前面的詞典優(yōu)先載入分詞，比如 // "用戶詞典.txt,通用詞典.txt" // 當(dāng)一個分詞既出現(xiàn)在用戶詞典也出現(xiàn)在通用詞典中，則優(yōu)先使用用戶詞典。 // // 詞典的格式為（每個分詞一行）： // 分詞文本頻率詞性 func (seg *Segmenter) LoadDictionary(files string) {seg.dict = new(Dictionary)for _, file := range strings.Split(files, ",") {log.Printf("載入sego詞典 %s", file)dictFile, err := os.Open(file)defer dictFile.Close()if err != nil {log.Fatalf("無法載入字典文件 \"%s\" \n", file)}reader := bufio.NewReader(dictFile)var text stringvar freqText stringvar frequency intvar pos string// 逐行讀入分詞for {size, _ := fmt.Fscanln(reader, &text, &freqText, &pos)if size == 0 {// 文件結(jié)束break} else if size < 2 {// 無效行continue} else if size == 2 {// 沒有詞性標(biāo)注時設(shè)為空字符串pos = ""}// 解析詞頻var err errorfrequency, err = strconv.Atoi(freqText)if err != nil {continue}// 過濾頻率太小的詞if frequency < minTokenFrequency {continue}// 將分詞添加到字典中words := splitTextToWords([]byte(text))token := Token{text: words, frequency: frequency, pos: pos}seg.dict.addToken(&token)}}// 計算每個分詞的路徑值，路徑值含義見Token結(jié)構(gòu)體的注釋logTotalFrequency := float32(math.Log2(float64(seg.dict.totalFrequency)))for _, token := range seg.dict.tokens {token.distance = logTotalFrequency - float32(math.Log2(float64(token.frequency)))}// 對每個分詞進行細致劃分，用于搜索引擎模式，該模式用法見Token結(jié)構(gòu)體的注釋。for _, token := range seg.dict.tokens {segments := seg.segmentWords(token.text, true)// 計算需要添加的子分詞數(shù)目numTokensToAdd := 0for iToken := 0; iToken < len(segments); iToken++ {if len(segments[iToken].token.text) > 1 {// 略去字元長度為一的分詞// TODO: 這值得進一步推敲，特別是當(dāng)字典中有英文復(fù)合詞的時候numTokensToAdd++}}token.segments = make([]*Segment, numTokensToAdd)// 添加子分詞iSegmentsToAdd := 0for iToken := 0; iToken < len(segments); iToken++ {if len(segments[iToken].token.text) > 1 {token.segments[iSegmentsToAdd] = &segments[iSegmentsToAdd]iSegmentsToAdd++}}}log.Println("sego詞典載入完畢") }// 對文本分詞 // // 輸入?yún)?shù)： // bytes UTF8文本的字節(jié)數(shù)組 // // 輸出： // []Segment 劃分的分詞 func (seg *Segmenter) Segment(bytes []byte) []Segment {return seg.internalSegment(bytes, false) }func (seg *Segmenter) internalSegment(bytes []byte, searchMode bool) []Segment {// 處理特殊情況if len(bytes) == 0 {return []Segment{}}// 劃分字元text := splitTextToWords(bytes)return seg.segmentWords(text, searchMode) }func (seg *Segmenter) segmentWords(text []Text, searchMode bool) []Segment {// 搜索模式下該分詞已無繼續(xù)劃分可能的情況if searchMode && len(text) == 1 {return []Segment{}}// jumpers定義了每個字元處的向前跳轉(zhuǎn)信息，包括這個跳轉(zhuǎn)對應(yīng)的分詞，// 以及從文本段開始到該字元的最短路徑值jumpers := make([]jumper, len(text))tokens := make([]*Token, seg.dict.maxTokenLength)for current := 0; current < len(text); current++ {// 找到前一個字元處的最短路徑，以便計算后續(xù)路徑值var baseDistance float32if current == 0 {// 當(dāng)本字元在文本首部時，基礎(chǔ)距離應(yīng)該是零baseDistance = 0} else {baseDistance = jumpers[current-1].minDistance}// 尋找所有以當(dāng)前字元開頭的分詞numTokens := seg.dict.lookupTokens(text[current:minInt(current+seg.dict.maxTokenLength, len(text))], tokens)// 對所有可能的分詞，更新分詞結(jié)束字元處的跳轉(zhuǎn)信息for iToken := 0; iToken < numTokens; iToken++ {location := current + len(tokens[iToken].text) - 1if !searchMode || current != 0 || location != len(text)-1 {updateJumper(&jumpers[location], baseDistance, tokens[iToken])}}// 當(dāng)前字元沒有對應(yīng)分詞時補加一個偽分詞if numTokens == 0 || len(tokens[0].text) > 1 {updateJumper(&jumpers[current], baseDistance,&Token{text: []Text{text[current]}, frequency: 1, distance: 32, pos: "x"})}}// 從后向前掃描第一遍得到需要添加的分詞數(shù)目numSeg := 0for index := len(text) - 1; index >= 0; {location := index - len(jumpers[index].token.text) + 1numSeg++index = location - 1}// 從后向前掃描第二遍添加分詞到最終結(jié)果outputSegments := make([]Segment, numSeg)for index := len(text) - 1; index >= 0; {location := index - len(jumpers[index].token.text) + 1numSeg--outputSegments[numSeg].token = jumpers[index].tokenindex = location - 1}// 計算各個分詞的字節(jié)位置bytePosition := 0for iSeg := 0; iSeg < len(outputSegments); iSeg++ {outputSegments[iSeg].start = bytePositionbytePosition += textSliceByteLength(outputSegments[iSeg].token.text)outputSegments[iSeg].end = bytePosition}return outputSegments }// 更新跳轉(zhuǎn)信息: // 1. 當(dāng)該位置從未被訪問過時(jumper.minDistance為零的情況)，或者 // 2. 當(dāng)該位置的當(dāng)前最短路徑大于新的最短路徑時 // 將當(dāng)前位置的最短路徑值更新為baseDistance加上新分詞的概率 func updateJumper(jumper *jumper, baseDistance float32, token *Token) {newDistance := baseDistance + token.distanceif jumper.minDistance == 0 || jumper.minDistance > newDistance {jumper.minDistance = newDistancejumper.token = token} }// 取兩整數(shù)較小值 func minInt(a, b int) int {if a > b {return b}return a }// 取兩整數(shù)較大值 func maxInt(a, b int) int {if a > b {return a}return b }// 將文本劃分成字元 func splitTextToWords(text Text) []Text {output := make([]Text, len(text))current := 0currentWord := 0inAlphanumeric := truealphanumericStart := 0for current < len(text) {r, size := utf8.DecodeRune(text[current:])if size <= 2 && (unicode.IsLetter(r) || unicode.IsNumber(r)) {// 當(dāng)前是拉丁字母或數(shù)字（非中日韓文字）if !inAlphanumeric {alphanumericStart = currentinAlphanumeric = true}} else {if inAlphanumeric {inAlphanumeric = falseif current != 0 {output[currentWord] = toLower(text[alphanumericStart:current])currentWord++}}output[currentWord] = text[current : current+size]currentWord++}current += size}// 處理最后一個字元是英文的情況if inAlphanumeric {if current != 0 {output[currentWord] = toLower(text[alphanumericStart:current])currentWord++}}return output[:currentWord] }// 將英文詞轉(zhuǎn)化為小寫 func toLower(text []byte) []byte {output := make([]byte, len(text))for i, t := range text {if t >= 'A' && t <= 'Z' {output[i] = t - 'A' + 'a'} else {output[i] = t}}return output }

3. sego的使用

?? 接下來介紹sego的使用。首先介紹網(wǎng)頁版的分詞器，在源文件的server目錄下面，運行server.go，如下

????

???server.go是一個輕量級的web服務(wù)器，端口為8080，在瀏覽器打開后如下圖

????

接下來，對如下文章進行分詞

近年結(jié)識了一位警察朋友，好槍法。不單單在射擊場上百發(fā)百中，更在解救人質(zhì)的現(xiàn)場，次次百步穿楊。當(dāng)然了，這個“楊”不是楊樹的楊，而是匪徒的代稱。我向他請教射擊的要領(lǐng)。他說，很簡單，就是極端的平靜。我說這個要領(lǐng)所有打槍的人都知道，可是做不到。他說，記住，你要像煙灰一樣松散。只有放松，全部潛在的能量才會釋放出來，協(xié)同你達到完美。他的話我似懂非懂，但從此我開始注意以前忽略了的煙灰。煙灰，尤其是那些優(yōu)質(zhì)香煙燃燒后的煙灰，非常松散，幾乎沒有重量和形狀，真一個大象無形。它們懶洋洋地趴在那里，好像在冬眠。其實，在煙灰的內(nèi)部，棲息著高度警覺和機敏的鳥群，任何一陣微風(fēng)掠過，哪怕只是極輕微的嘆息，它們都會不失時機地騰空而起馭風(fēng)而行。它們的力量來自放松，來自一種飄揚的本能。松散的反面是緊張。幾乎每個人都有過由于緊張而慘敗的經(jīng)歷。比如，考試的時候，全身肌肉僵直，心跳得好像無數(shù)個小炸彈在身體的深淺部位依次爆破。手指發(fā)抖頭冒虛汗，原本記得滾瓜爛熟的知識，改頭換面潛藏起來，原本涇渭分明的答案變得似是而非，泥鰍一樣滑走……面試的時候，要么扭扭捏捏不夠大方，無法表現(xiàn)自己的真實實力，要么口若懸河躁動不安，拿捏不準(zhǔn)問題的實質(zhì)，只得用不停的述說掩飾自己的緊張，適得其反……相信每個人都儲存了一大堆這類不堪回首的往事。在最危急的時刻能保持極端的放松，不是一種技術(shù)，而是一種修養(yǎng)，是一種長期潛移默化修煉提升的結(jié)果。我們常說，某人勝就勝在心理上，或是說某人敗就敗在心理上。這其中的差池不是指在理性上，而是這種心靈張弛的韌性上。沒事的時候看看煙灰吧。他們曾經(jīng)是火焰，燃燒過，沸騰過，但它們此刻安靜了。它們毫不張揚地聚精會神地等待著下一次的乘風(fēng)而起，攜帶著全部的能量，抵達陽光能到的任何地方。

那么代碼如下

package mainimport ("os""fmt""github.com/huichen/sego" )func main() {// 載入詞典var segmenter sego.Segmentersegmenter.LoadDictionary("../src/github.com/huichen/sego/data/dictionary.txt")//讀取文件內(nèi)容到buf中filename := "../src/test/Text"fp, err := os.Open(filename)defer fp.Close()if err != nil {fmt.Println(filename, err)return}buf := make([]byte, 4096)n, _ := fp.Read(buf)if n == 0 {return}// 分詞segments := segmenter.Segment(buf)// 處理分詞結(jié)果// 支持普通模式和搜索模式兩種分詞，見utils.go代碼中SegmentsToString函數(shù)的注釋。// 如果需要詞性標(biāo)注，用SegmentsToString(segments, false)，更多參考utils.go文件output := sego.SegmentsToSlice(segments, false)//輸出分詞后的成語for _, v := range output {if len(v) == 12 {fmt.Println(v)}} }

運行結(jié)果如下

整個工程的結(jié)構(gòu)如下

??????

總結(jié)

以上是生活随笔為你收集整理的Go语言的分词器（sego）的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。

上一篇： SVD原理及其应用导论
下一篇： Logistic回归与梯度下降法