1. Installing jieba
jieba is published on PyPI and can be installed with pip: pip install jieba
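2. Overview of jieba's features
Features:
- Three segmentation modes are supported:
  - Accurate mode: tries to cut the sentence as precisely as possible; suitable for text analysis
  - Full mode: scans out every word that can be formed from the sentence; very fast, but cannot resolve ambiguity
  - Search engine mode: building on accurate mode, splits long words again to improve recall; suitable for search engine segmentation
- Traditional Chinese segmentation is supported
- Custom dictionaries are supported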
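jieba.cut:
- The first argument is the string to be segmented
- The cut_all argument controls whether full mode is used
jieba.lcut converts the object returned by jieba.cut into a list and returns it.
jieba.cut_for_search:
- Takes the string to be segmented
This method is suited to segmentation for building a search engine's inverted index, and its granularity is fairly fine. jieba.lcut_for_search returns a list.
jieba.posseg.dt is the default part-of-speech tagging tokenizer. It tags the part of speech of each word after segmentation, using tags compatible with ictclas.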
3. Examples
3.1 Accurate mode

    import jieba

    list1 = jieba.lcut("中华人民共和国是一个伟大的国家")
    print(list1)
    print("Accurate mode: " + "/".join(list1))
3.2 Full mode

    import jieba

    list2 = jieba.lcut("中华人民共和国是一个伟大的国家", cut_all=True)
    print(list2, end=",")
    print("Full mode: " + "/".join(list2))
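To see why full mode produces overlapping words, here is a toy sketch that lists every substring of a sentence found in a small vocabulary. It is not jieba's implementation; the function name full_scan and the hand-picked vocabulary are assumptions made purely for illustration.

```python
def full_scan(sentence, vocab):
    """Return every substring of `sentence` that appears in `vocab`.

    Results are ordered by start position, then by length. This is a toy
    illustration of the idea behind full mode, not jieba's actual algorithm.
    """
    found = []
    for start in range(len(sentence)):
        for end in range(start + 1, len(sentence) + 1):
            piece = sentence[start:end]
            if piece in vocab:
                found.append(piece)
    return found


# Hypothetical mini vocabulary, chosen by hand for this example.
vocab = {"中华", "华人", "人民", "共和", "共和国", "中华人民共和国"}
print("/".join(full_scan("中华人民共和国", vocab)))
```

The result contains candidates that share characters, such as 中华 and 华人. Reporting every candidate instead of choosing one segmentation is what makes this style of scan fast but unable to resolve ambiguity.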
3.3 Search engine mode

    import jieba

    list3 = jieba.lcut_for_search("中华人民共和国是一个伟大的国家")
    print(list3)
    print("Search engine mode: " + " ".join(list3))
3.4 Modifying the dictionary

    import jieba

    text = "中信建投投资公司了一款游戏,中信也投资了一个游戏公司"
    word = jieba.lcut(text)
    print(word)
    # Add words
    jieba.add_word("中信建投")
    jieba.add_word("投资公司")
    word1 = jieba.lcut(text)
    print(word1)
    # Delete a word
    jieba.del_word("中信建投")
    word2 = jieba.lcut(text)
    print(word2)
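When many words need to be added, they can be kept in a custom dictionary file rather than added one call at a time. The sketch below writes such a file using only the standard library. The helper name write_userdict, the frequency value, and the POS tags are assumptions for illustration; the file layout (one entry per line with the word, an optional frequency, and an optional POS tag separated by spaces, saved as UTF-8) follows the format described in the jieba README.

```python
import os
import tempfile


def write_userdict(path, entries):
    """Write (word, freq, tag) entries to a jieba-style user dictionary file.

    `freq` and `tag` may be None; both are optional in the file format.
    """
    with open(path, "w", encoding="utf-8") as f:
        for word, freq, tag in entries:
            fields = [word]
            if freq is not None:
                fields.append(str(freq))
            if tag is not None:
                fields.append(tag)
            f.write(" ".join(fields) + "\n")


# Illustrative entries; the frequency 10 and the tags are arbitrary placeholders.
entries = [("中信建投", 10, "nt"), ("投资公司", None, "n")]
dict_path = os.path.join(tempfile.mkdtemp(), "userdict.txt")
write_userdict(dict_path, entries)

with open(dict_path, encoding="utf-8") as f:
    print(f.read())
```

The resulting file can then be passed to jieba.load_userdict(dict_path) before segmenting.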
3.5 Part-of-speech tagging

    import jieba.posseg as pseg

    words = pseg.cut("我爱北京天安门")
    for i in words:
        print(i.word, i.flag)
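A common next step after tagging is to keep only the words with certain tags. The sketch below filters (word, flag) pairs by tag prefix. The helper keep_by_flag is hypothetical, and the sample pairs are hand-written stand-ins for tagger output so that the block runs without jieba installed.

```python
def keep_by_flag(pairs, prefixes):
    """Return the words whose POS flag starts with one of `prefixes`."""
    prefixes = tuple(prefixes)
    return [word for word, flag in pairs if flag.startswith(prefixes)]


# Hand-written (word, flag) pairs standing in for tagger output;
# the flags are illustrative and were not produced by running jieba here.
sample = [("我", "r"), ("爱", "v"), ("北京", "ns"), ("天安门", "ns")]
print(keep_by_flag(sample, ["n"]))  # keeps flags beginning with "n"
```

Since the items yielded by pseg.cut can be unpacked into a word and a flag, calling keep_by_flag(pseg.cut(text), ["n"]) should work the same way on real tagger output.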
3.6 Counting how often characters appear in Romance of the Three Kingdoms (三国演义)

    import jieba

    with open("file path", "r", encoding="utf-8") as f:  # open and read the file
        txt = f.read()
    words = jieba.lcut(txt)  # segment the text in accurate mode
    counts = {}  # map each word to its number of occurrences
    for word in words:
        if len(word) == 1:  # single-character tokens are not counted
            continue
        counts[word] = counts.get(word, 0) + 1  # add 1 for every occurrence
    items = list(counts.items())  # convert the key-value pairs to a list
    items.sort(key=lambda x: x[1], reverse=True)  # sort by count, largest first
    for word, count in items[:15]:  # slicing avoids an IndexError on short texts
        print("{0:<10}{1:>5}".format(word, count))
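The counting loop above can also be written with collections.Counter from the standard library. This sketch takes an already segmented list of words so that it runs without jieba or the novel's text; the function name top_words and the sample tokens are made up for illustration.

```python
from collections import Counter


def top_words(words, n=15):
    """Return the n most frequent words, skipping single-character tokens."""
    counts = Counter(w for w in words if len(w) > 1)
    return counts.most_common(n)


# Made-up tokens for illustration; in practice pass the result of jieba.lcut(txt).
tokens = ["曹操", "的", "孔明", "曹操", "玄德", "孔明", "曹操"]
for word, count in top_words(tokens, n=2):
    print("{0:<10}{1:>5}".format(word, count))
```

most_common(n) simply returns fewer items when the text contains fewer than n distinct words, so no bounds check is needed.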
An improved version merges different tokens that refer to the same character and removes frequent words that are not character names:

    import jieba

    excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此", "如何"}
    with open("三国演义.txt", "r", encoding="utf-8") as f:
        txt = f.read()
    words = jieba.lcut(txt)
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        elif word == "诸葛亮" or word == "孔明曰":
            rword = "孔明"
        elif word == "关公" or word == "云长":
            rword = "关羽"
        elif word == "玄德" or word == "玄德曰":
            rword = "刘备"
        elif word == "孟德" or word == "丞相":
            rword = "曹操"
        else:
            rword = word
        counts[rword] = counts.get(rword, 0) + 1
    for word in excludes:
        counts.pop(word, None)  # unlike del, does not raise KeyError if the word is absent
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    for word, count in items[:10]:
        print("{0:<10}{1:>5}".format(word, count))
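As more aliases are added, the elif chain above grows quickly. One alternative is to keep the aliases in a dictionary and look each word up. The sketch below assumes that approach; the function name count_characters and the sample tokens are hypothetical, while the alias pairs and excluded words are copied from the example above.

```python
from collections import Counter

# Alias -> canonical name, copied from the elif chain in the example above.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
EXCLUDES = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此", "如何"}


def count_characters(words, aliases=ALIASES, excludes=EXCLUDES):
    """Count words longer than one character, merging aliases and
    dropping excluded words."""
    counts = Counter()
    for word in words:
        if len(word) == 1:
            continue
        name = aliases.get(word, word)  # fall back to the word itself
        if name not in excludes:
            counts[name] += 1
    return counts


# Made-up tokens for illustration; in practice pass jieba.lcut(txt).
tokens = ["玄德曰", "却说", "孔明", "诸葛亮", "丞相", "玄德", "曰"]
print(count_characters(tokens).most_common(3))
```

With this layout, adding a new alias or excluded word only means editing the two tables, and the counting logic stays unchanged.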
Original link: https://www.cnblogs.com/L-hua/p/15584823.html