Installing goose, an HTML article-text extraction tool, with a simple usage demo

1.git clone https://github.com/grangier/python-goose.git

2.cd python-goose

3.sudo pip install -r requirements.txt
At this point pip reports an error while installing nltk. Install it separately with the following command:

sudo apt-get install python-nltk 

4.sudo python setup.py install

 

Installation is now complete!
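
A quick way to confirm the install is to try importing the package. The sketch below is not part of the original steps: `is_installed` is a hypothetical helper name, and it assumes the package is importable under the module name `goose`, as the demo's `from goose import Goose` suggests.

```python
import importlib


def is_installed(module_name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


# Assumption: python-goose installs under the module name "goose".
print("goose importable:", is_installed("goose"))
```

If this prints False, re-run steps 3 and 4 and check that pip and python refer to the same interpreter.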

---------------------------------------------------------

A simple usage demo follows:

import traceback

import chardet
from goose import Goose
from goose.text import StopWordsChinese


def goose_extraction(response):
    try:
        charset = chardet.detect(response.content)
        # Detected page encoding, e.g. gbk, gb2312, utf-8; chardet may return None
        coding = (charset.get('encoding') or 'utf-8').lower()
        if coding.startswith('gb'):
            # GB18030 is a superset of GBK and GB2312
            codeHtml = response.content.decode("GB18030").encode('utf-8')
        elif coding.startswith('utf'):
            codeHtml = response.content
        else:
            codeHtml = response.content.decode(coding, 'ignore')
        g = Goose({'stopwords_class': StopWordsChinese})  # use Chinese stop words
        article = g.extract(raw_html=codeHtml)
        content = article.cleaned_text
        html = '<div>' + ''.join(['<p>' + con + '</p>\n' for con in content.split('\n\n')]) + '</div>'
        return content, html
    except Exception:
        # Print the traceback; the function then returns None
        traceback.print_exc()
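
The two steps in the demo that do not depend on goose itself, normalizing the page encoding and wrapping the cleaned text in paragraph tags, can be pulled out and tried on their own. This is only a sketch using the standard library: `decode_page`, `escape_html` and `text_to_html` are hypothetical helper names, not part of goose or chardet, and the fallback to UTF-8 when no encoding is detected is an assumption. Unlike the demo, it always returns decoded text, escapes HTML special characters, and skips empty paragraphs.

```python
def decode_page(raw_bytes, detected_encoding):
    """Decode page bytes to text given a detected encoding name (may be None)."""
    # Assumption: fall back to UTF-8 when detection returned nothing.
    coding = (detected_encoding or "utf-8").lower()
    if coding.startswith("gb"):
        # GB18030 is a superset of GBK and GB2312.
        coding = "gb18030"
    try:
        return raw_bytes.decode(coding, "ignore")
    except LookupError:
        # Unknown codec name: fall back to UTF-8 as well.
        return raw_bytes.decode("utf-8", "ignore")


def escape_html(text):
    """Escape the characters that would otherwise be read as markup."""
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;"))


def text_to_html(cleaned_text):
    """Wrap each blank-line-separated paragraph in <p> tags inside one <div>."""
    paragraphs = [p.strip() for p in cleaned_text.split("\n\n") if p.strip()]
    body = "".join("<p>" + escape_html(p) + "</p>\n" for p in paragraphs)
    return "<div>" + body + "</div>"


if __name__ == "__main__":
    text = decode_page(b"First paragraph.\n\nSecond & last.", None)
    print(text_to_html(text))
```

In the demo, the encoding name would come from `chardet.detect(response.content)` and the text passed to `text_to_html` would be `article.cleaned_text`.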

 

posted @ 2019-07-31 18:10  python许三多