Chinese Text Sentence Segmentation

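Sentence segmentation can be as simple or as complicated as you make it. Most natural language processing tasks are not strict about it, and splitting on sentence-final punctuation is usually enough. Some industries that work specifically on text projects have higher requirements, though, and getting segmentation 100% right means accounting for much of the language's own grammar. What follows is a middle-of-the-road version. Take a passage from 《背影》 as an example: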

我心里暗笑他的迂；他们只认得钱，托他们只是白托!而且我这样大年纪的人，难道还不能料理自己么？唉，我现在想想，那时真是太聪明了!
我说道：“爸爸，你走吧。”他往车外看了看说：“我买几个橘子去。你就在此地，不要走动。”我看那边月台的栅栏外有几个卖东西的等着顾客。走到那边月台，须穿过铁道，须跳下去又爬上去。
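
For reference, the simple approach mentioned above, splitting on sentence-final punctuation alone, might look like the sketch below. The name naive_split is hypothetical and is not part of the original post. The sketch only recognises the full-width marks 。？！, and it shows the problem with dialogue: the closing quote is left stranded at the start of the next fragment.

```python
import re

def naive_split(paragraph):
    """Hypothetical baseline: cut after every full-width 。？！ and keep the mark."""
    # First alternative: a run of text ending in a terminator.
    # Second alternative: trailing text that has no terminator at all.
    parts = re.findall(r"[^。？！]*[。？！]|[^。？！]+$", paragraph)
    return [p.strip() for p in parts if p.strip()]

fragments = naive_split("我说道：“爸爸，你走吧。”他往车外看了看。")
for fragment in fragments:
    print(fragment)
```

The second fragment begins with the closing quote ”, which is the case the fuller implementation has to repair.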

Python implementation:

import re

def __merge_symmetry(sentences, symmetry=('“','”')):
    '''Merge fragments that fall between paired symbols such as double quotes.'''
    effective_ = []
    merged = True  # False while an opening symbol is still waiting for its closing symbol
    for sentence in sentences:
        has_open = symmetry[0] in sentence
        has_close = symmetry[1] in sentence
        if has_open and not has_close:
            # A quotation opens here but does not close: start a new sentence.
            merged = False
            effective_.append(sentence)
        elif has_close and not merged:
            # The pending quotation closes here: attach to the open sentence.
            merged = True
            effective_[-1] += sentence
        elif not has_open and not has_close and not merged:
            # Still inside the quotation: keep appending to the open sentence.
            effective_[-1] += sentence
        else:
            effective_.append(sentence)

    return [i.strip() for i in effective_ if len(i.strip()) > 0]

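def to_sentences(paragraph):
    """Split a paragraph into sentences."""
    # Only the full-width ？ 。 ！ and …… count as terminators here.
    # The capture group makes re.split keep each delimiter in the result.
    sentences = re.split(r"(？|。|！|\…\…)", paragraph)
    sentences.append("")  # pad so the trailing text pairs with an empty delimiter
    # Re-attach each delimiter to the text that precedes it.
    sentences = ["".join(i) for i in zip(sentences[0::2], sentences[1::2])]
    sentences = [i.strip() for i in sentences if len(i.strip()) > 0]

    # A closing quote right after a terminator belongs to the previous sentence.
    for j in range(1, len(sentences)):
        if sentences[j][0] == '”':
            sentences[j-1] = sentences[j-1] + '”'
            sentences[j] = sentences[j][1:]

    return __merge_symmetry(sentences)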

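The main concerns are that each sentence keeps its terminal punctuation after splitting, and that quoted dialogue stays in one piece. Note that the sample text uses the half-width "!", which is not in the delimiter pattern, so the code does not split on it, as the first line of the output shows. Result: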

我心里暗笑他的迂；他们只认得钱，托他们只是白托!而且我这样大年纪的人，难道还不能料理自己么？
唉，我现在想想，那时真是太聪明了!
我说道：“爸爸，你走吧。”
他往车外看了看说：“我买几个橘子去。你就在此地，不要走动。”
我看那边月台的栅栏外有几个卖东西的等着顾客。
走到那边月台，须穿过铁道，须跳下去又爬上去。
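
If splitting at the half-width "!" and "?" is wanted as well, one possible approach is to normalise them to their full-width forms before segmenting. The sketch below is an assumption layered on top of the post, not the author's method: normalize_terminators and HALF_TO_FULL are hypothetical names, and the split pattern is copied from to_sentences only to show the effect.

```python
import re

# Hypothetical mapping from half-width terminators to full-width ones.
HALF_TO_FULL = str.maketrans({"!": "！", "?": "？"})

def normalize_terminators(text):
    """Replace half-width ! and ? with full-width ！ and ？ (assumed preprocessing step)."""
    return text.translate(HALF_TO_FULL)

text = normalize_terminators("托他们只是白托!而且我这样大年纪的人，难道还不能料理自己么？")
# Same delimiter pattern as in to_sentences; the "!" is now matched as "！".
pieces = re.split(r"(？|。|！|\…\…)", text)
print(pieces)
```

This converts every "!" and "?" in the input, including any inside URLs or embedded English text, so it only suits input that is known to be plain Chinese prose.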

posted @ 2019-10-15 19:05 AloisWei
