Basic Usage of the Python jieba Library

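1. Overview of the jieba Library

jieba is a widely used third-party library for Chinese word segmentation.

Chinese text must be segmented before individual words can be extracted.
jieba is a third-party library, so it has to be installed separately.
jieba provides three segmentation modes; the simplest use case needs only one function.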

2. Installing jieba

    pip install jieba

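3. How jieba Segmentation Works

jieba segmentation relies on a Chinese dictionary.

It uses the dictionary to estimate how likely adjacent Chinese characters are to belong together.
Characters with a high association probability are grouped into words, which form the segmentation result.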
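The idea of choosing the most probable grouping can be illustrated with a toy segmenter. The sketch below does not reproduce jieba's actual implementation. It assumes a tiny hypothetical dictionary, `WORD_FREQ`, with made-up frequencies, and uses dynamic programming to pick the split whose words have the highest combined probability.

```python
import math

# Hypothetical word frequencies, made up for illustration only.
WORD_FREQ = {
    "中国": 500, "是": 900, "一个": 600, "伟大": 300,
    "的": 1000, "国家": 400, "国是": 5, "一": 200, "个": 200,
}
TOTAL = sum(WORD_FREQ.values())
MAX_WORD_LEN = max(len(w) for w in WORD_FREQ)


def log_prob(word):
    """Log-probability of a word; unseen single characters get a small floor."""
    return math.log(WORD_FREQ.get(word, 0.5) / TOTAL)


def segment(text):
    """Return the split of `text` with the highest total log-probability."""
    n = len(text)
    # best[i] = (score of the best split of text[:i], start index of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            # Only dictionary words or single characters are candidates.
            if word not in WORD_FREQ and len(word) > 1:
                continue
            score = best[start][0] + log_prob(word)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk back through the stored start indices to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("中国是一个伟大的国家"))
# ['中国', '是', '一个', '伟大', '的', '国家']
```

With these made-up frequencies, "中国" followed by "是" scores far higher than "中" followed by "国是", so the rarer grouping is rejected.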

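4. The Three Segmentation Modes of jieba

Precise mode: splits the text exactly, with no redundant words (the most commonly used mode).
Full mode: scans out every possible word in the text, so the result contains redundancy.
Search engine mode: builds on precise mode and splits long words again.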

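5. Common jieba Functions

Function	Description
jieba.lcut(s)	Precise mode; returns the segmentation result as a list
jieba.lcut(s, cut_all=True)	Full mode; returns the segmentation result as a list, with redundancy
jieba.lcut_for_search(s)	Search engine mode; returns the segmentation result as a list, with redundancy
jieba.add_word(w)	Adds the new word w to the segmentation dictionary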

Examples:

    >>> import jieba
    >>> jieba.lcut("中国是一个伟大的国家")
    ['中国', '是', '一个', '伟大', '的', '国家']
    >>> jieba.lcut("中国是一个伟大的国家", cut_all=True)
    ['中国', '国是', '一个', '伟大', '的', '国家']
    >>> jieba.lcut_for_search("中华人民共和国是伟大的")
    ['中华', '华人', '人民', '共和', '共和国', '中华人民共和国', '是', '伟大', '的']

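6. Word-Frequency Examples

Problem analysis

English text: word frequencies in Hamlet

https://python123.io/resources/pye/hamlet.txt

Chinese text: characters in Romance of the Three Kingdoms (《三国演义》)

https://python123.io/resources/pye/threekingdoms.txt

The code for the Hamlet word count is as follows: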

    import string

    def getText():
        # Open hamlet.txt and read the whole file
        with open("hamlet.txt", "r") as f:
            txt = f.read()
        # Convert everything to lowercase so that case does not affect the counts
        txt = txt.lower()
        # Replace every ASCII punctuation character with a space
        for ch in string.punctuation:
            txt = txt.replace(ch, " ")
        # Return lowercase text whose words are separated by whitespace
        return txt

    hamletTxt = getText()
    # split() with no arguments splits on any run of whitespace
    words = hamletTxt.split()
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    items = list(counts.items())
    # Sort the (word, count) pairs by count, highest first
    items.sort(key=lambda x: x[1], reverse=True)
    for i in range(10):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))

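In the code above, the line

    items.sort(key=lambda x: x[1], reverse=True)

sorts the (word, count) pairs by count in descending order; the lambda function picks the count out of each pair. For more details, see:
https://www.runoob.com/python/att-list-sort.html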
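To see the sort key in isolation, here is a minimal sketch on a small hand-made word list (the list is an assumption for illustration, not taken from Hamlet). It also shows `collections.Counter.most_common` from the standard library, which gives the same top entries without an explicit sort.

```python
from collections import Counter

# A small hand-made word list standing in for the split text.
words = ["to", "be", "or", "not", "to", "be", "that", "is", "to"]

counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Each item is a (word, count) tuple; x[1] selects the count.
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
print(items[:2])  # [('to', 3), ('be', 2)]

# Counter.most_common(n) returns the n most frequent items, already sorted.
top_two = Counter(words).most_common(2)
print(top_two)  # [('to', 3), ('be', 2)]
```

The dictionary-plus-sort version makes each step visible, while `Counter` is the shorter choice once the idea is clear.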

Next, we use jieba to count how often each character appears in Romance of the Three Kingdoms:

    import jieba

    # Read the whole novel as UTF-8 text
    txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
    # Segment the text in precise mode
    words = jieba.lcut(txt)
    counts = {}
    for word in words:
        # Skip single-character tokens
        if len(word) == 1:
            continue
        else:
            counts[word] = counts.get(word, 0) + 1
    items = list(counts.items())
    # Sort the (word, count) pairs by count, highest first
    items.sort(key=lambda x: x[1], reverse=True)
    for i in range(15):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))

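The output differs somewhat from what we had in mind. For example:

"却说" ("it is said") and "二人" ("the two of them") have nothing to do with personal names.
"诸葛亮" and "孔明" refer to the same person.
"孔明" and "孔明曰" ("Kongming said") are counted as separate words, which does not suit our needs.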

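So the code above needs to be refined: starting from plain word-frequency counting, we adapt the program to the question actually being asked.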
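The two fixes this calls for, merging different names for the same person and dropping words that are not names, can be tried on a small hand-made token list first. This is a minimal sketch: the token list, the `ALIASES` table, and the `EXCLUDES` set are assumptions for illustration and stand in for real `jieba.lcut` output.

```python
# Hand-made tokens standing in for the output of jieba.lcut(txt).
tokens = ["孔明", "曰", "诸葛亮", "却说", "玄德", "刘备", "孔明曰", "却说", "二人"]

# Hypothetical alias table: every variant maps to one canonical name.
ALIASES = {"诸葛亮": "孔明", "孔明曰": "孔明", "玄德": "刘备"}
# Frequent words that are known not to be names.
EXCLUDES = {"却说", "二人"}

counts = {}
for token in tokens:
    if len(token) == 1 or token in EXCLUDES:
        continue  # skip single characters and excluded words
    name = ALIASES.get(token, token)
    counts[name] = counts.get(name, 0) + 1

print(counts)  # {'孔明': 3, '刘备': 2}
```

A lookup table keeps the alias rules in one place, so adding another variant is a one-line change instead of a new `elif` branch.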

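Below is an upgraded version of the character-count program for Romance of the Three Kingdoms. Words that are certainly not personal names are removed from the results even though they were counted. A set named excludes holds the words that are clearly not names but still rank near the top.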

    import jieba

    txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
    # Frequent words that are known not to be personal names
    excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}
    words = jieba.lcut(txt)
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        # Map different names for the same person onto one key
        elif word == "诸葛亮" or word == "孔明曰":
            rword = "孔明"
        elif word == "关公" or word == "云长":
            rword = "关羽"
        elif word == "玄德" or word == "玄德曰":
            rword = "刘备"
        elif word == "孟德" or word == "丞相":
            rword = "曹操"
        else:
            rword = word
        counts[rword] = counts.get(rword, 0) + 1
    # Remove the excluded words from the counts
    for word in excludes:
        counts.pop(word, None)
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    for i in range(15):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))
