Converting HTML to Plain Text in Python

2020-05-19 09:18 | 脚本之家 | Python

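This article shows, by example, how to convert HTML to plain text in Python. It walks through two commonly used approaches that are practical for everyday work and may serve as a reference for anyone facing the same task.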

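A project of mine needed HTML converted to plain text today. A quick search showed that Python offers many different ways to do it.

Below are the two approaches I tried myself, written down for anyone who needs them later: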

Method 1:

1. Install nltk, which is available from PyPI.

(Note: it depends on the following packages: numpy, PyYAML.)

2. Test code:

	>>> import nltk
	>>> aa = r'''
	<html>
	            <body>
	                <b>Project:</b> DeHTML<br>
	                <b>Description</b>:<br>
	                This small script is intended to allow conversion from HTML markup to 
	                plain text.
	            </body>
	        </html>
	        '''
	>>> aa
	'\n<html>\n            <body>\n                <b>Project:</b> DeHTML<br>\n                <b>Description</b>:<br>\n                This small script is intended to allow conversion from HTML markup to \n                plain text.\n            </body>\n        </html>\n        '
	>>> # Python 2 with NLTK 2.x. NLTK 3 removed clean_html(): calling it raises
	>>> # NotImplementedError and recommends BeautifulSoup's get_text() instead.
	>>> print nltk.clean_html(aa)
	Project: DeHTML
	     Description :
	    This small script is intended to allow conversion from HTML markup to
	    plain text.
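The output above has a space before the colon in "Description :", which is what you get when each tag is replaced by a space rather than simply deleted. The following is a minimal standard-library sketch of that idea. It is not NLTK's implementation; `strip_tags` and the `sample` string are hypothetical names used only for illustration, and regular expressions are a crude tool that can misfire on malformed markup.

```python
import re
from html import unescape


def strip_tags(html):
    """Hypothetical helper: crude regex-based tag stripping (not NLTK's code)."""
    # Drop <script> and <style> elements together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", "", html)
    # Drop HTML comments.
    text = re.sub(r"(?s)<!--.*?-->", "", text)
    # Replace every remaining tag with a single space.
    text = re.sub(r"(?s)<[^>]*>", " ", text)
    # Decode entities such as &amp; and &nbsp;.
    text = unescape(text)
    # Collapse runs of spaces, tabs and non-breaking spaces; keep line breaks.
    text = re.sub(r"[ \t\xa0]+", " ", text)
    # Trim each line and drop the lines that end up empty.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


sample = (
    "<html><body>\n"
    "<b>Project:</b> DeHTML<br>\n"
    "<b>Description</b>:<br>\n"
    "plain text.\n"
    "</body></html>"
)
print(strip_tags(sample))
```

Because `</b>` becomes a space, the colon that follows it ends up separated from "Description", just as in the session above.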

Method 2:

If nltk feels too heavyweight for such a small job, you can write the code yourself:

	# Python 3: HTMLParser lives in html.parser
	# (Python 2 used "from HTMLParser import HTMLParser").
	from html.parser import HTMLParser
	from re import sub
	from sys import stderr
	from traceback import print_exc

	class _DeHTMLParser(HTMLParser):
	    """Collects the text content of an HTML document."""

	    def __init__(self):
	        HTMLParser.__init__(self)
	        self.__text = []

	    def handle_data(self, data):
	        # Skip whitespace-only nodes; collapse whitespace runs to one space.
	        text = data.strip()
	        if len(text) > 0:
	            text = sub('[ \t\r\n]+', ' ', text)
	            self.__text.append(text + ' ')

	    def handle_starttag(self, tag, attrs):
	        # <p> starts a new paragraph, <br> starts a new line.
	        if tag == 'p':
	            self.__text.append('\n\n')
	        elif tag == 'br':
	            self.__text.append('\n')

	    def handle_startendtag(self, tag, attrs):
	        # A self-closing <br/> emits a blank line (two newlines).
	        if tag == 'br':
	            self.__text.append('\n\n')

	    def text(self):
	        return ''.join(self.__text).strip()

	def dehtml(text):
	    """Return text without HTML markup; on a parse error, return it unchanged."""
	    try:
	        parser = _DeHTMLParser()
	        parser.feed(text)
	        parser.close()
	        return parser.text()
	    except Exception:
	        print_exc(file=stderr)
	        return text

	def main():
	    text = r'''
	        <html>
	            <body>
	                <b>Project:</b> DeHTML<br>
	                <b>Description</b>:<br>
	                This small script is intended to allow conversion from HTML markup to
	                plain text.
	            </body>
	        </html>
	    '''
	    print(dehtml(text))

	if __name__ == '__main__':
	    main()

Output:

>>> ================================ RESTART ================================
>>>
Project: DeHTML
Description :
This small script is intended to allow conversion from HTML markup to plain text.
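One limitation of an `HTMLParser`-based approach like Method 2 is that the contents of `<script>` and `<style>` elements also arrive through `handle_data`, so they would end up in the extracted text. The sketch below is a hypothetical variant, not the original author's code, that ignores those elements. `TextExtractor`, `html_to_text` and the `page` string are names invented for this example, and it assumes Python 3.5 or later, where character references such as `&amp;` are decoded before `handle_data` is called.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical variant of the Method 2 parser that skips <script>/<style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "p":
            self._parts.append("\n\n")
        elif tag == "br":
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # text inside <script> or <style> is not content
        # Collapse whitespace runs, as Method 2 does.
        text = re.sub(r"\s+", " ", data.strip())
        if text:
            self._parts.append(text + " ")

    def text(self):
        return "".join(self._parts).strip()


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<p>Hello &amp; welcome</p>"
    "<script>var x = 1;</script>"
    "<style>p {color: red}</style>"
    "<p>Bye<br>now</p>"
)
print(html_to_text(page))
```

The depth counter is a design choice that keeps the parser in "skip" mode until the matching end tag arrives, so the remaining handlers stay as simple as in Method 2.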

I hope this article helps with your Python programming.