For sentences or articles -- phrase-aware tokenization that keeps the original order
Sometimes the requirement is: split a sentence into tokens, where the segmentation follows a custom phrase dictionary, so multi-word phrases stay intact as single tokens.
(Of course, anyone familiar with NLP will know that jieba's user dictionary already covers this.) Here is a small wheel I rolled myself, shared below.
The task:
a = "i love you so much and want to do something for you, can you give me one chance or want to marry you."
require_list = ["want to", "so much", "one chance"]
## Expected result
Finish Result: ['i', 'love', 'you', 'so much', 'and', 'want to', 'do', 'something', 'for', 'you,', 'can', 'you', 'give', 'me', 'one chance', 'or', 'want to', 'marry', 'you.']
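
(As an aside, before the hand-rolled version: the same requirement can also be met with one regular expression built from the phrase list. This is a minimal alternative sketch of my own; phrase_tokenize is a made-up name and not part of the code below.)

import re

def phrase_tokenize(text, phrases):
    # longest phrase first, so a longer phrase beats a shorter prefix of it
    ordered = sorted(phrases, key=len, reverse=True)
    # a phrase only counts when followed by whitespace or end of string;
    # \S+ falls through to plain words, keeping punctuation attached
    pattern = "|".join(re.escape(p) + r"(?=\s|$)" for p in ordered) + r"|\S+"
    return re.findall(pattern, text)

a = "i love you so much and want to do something for you, can you give me one chance or want to marry you."
print(phrase_tokenize(a, ["want to", "so much", "one chance"]))
# prints the Finish Result list above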
The approach combines backward phrase-boundary checking with recursion: whenever a phrase is matched, the text in front of it is tokenized recursively first, so plain words and earlier phrases keep their original order.
# Order-preserving phrase tokenization
def tokenize(need_article1, rule_phrase, new_list):
    need_article = need_article1
    for word_join in rule_phrase:
        while word_join in need_article:
            length = len(word_join)
            index_first = need_article.find(word_join)
            end = index_first + length
            # the match must sit on word boundaries: start of string or a
            # space in front, end of string or a space behind
            if index_first > 0 and need_article[index_first - 1] != " ":
                break
            if end < len(need_article) and need_article[end] != " ":
                break
            specific_phrase = need_article[index_first:end]
            before_phrase = need_article[:index_first]
            if before_phrase:
                # tokenize the text in front of the phrase first, so its
                # plain words (and any earlier phrases) keep their order
                combine_list = tokenize(before_phrase, rule_phrase, new_list)
                if combine_list:
                    new_list.append(specific_phrase)
                    need_article = need_article[end:]
            else:
                new_list.append(specific_phrase)
                need_article = need_article[end:]
    # whatever is left contains no dictionary phrase: plain word split
    rest_words = need_article.split()
    new_list.extend(rest_words)
    if need_article1 == need_article:
        # nothing was combined: hand the original string back as a signal
        return need_article
    return new_list
def tokenize_foo(a, list1, new_list):
    new_list_order = tokenize(a, list1, new_list)
    if a == new_list_order:
        # the original string came back: no phrase was combined,
        # so fall back to a plain whitespace split
        new_list_order = new_list_order.split()
    return new_list_order
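
A quick check of the fallback path: when no dictionary phrase occurs at all, tokenize() hands the original string back and tokenize_foo() splits it plainly (the sentence here is my own test input):

print(tokenize_foo("nothing special here", ["want to"], []))
# -> ['nothing', 'special', 'here']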
Main program:
if __name__ == '__main__':
    a = "i love you so much and want to do something for you, can you give me one chance or want to marry you."
    list1 = ["want to", "one chance", "so much"]
    new_list = []
    new_list_order = tokenize_foo(a, list1, new_list)
    print(new_list_order)
Output:
['i', 'love', 'you', 'so much', 'and', 'want to', 'do', 'something', 'for', 'you,', 'can', 'you', 'give', 'me', 'one chance', 'or', 'want to', 'marry', 'you.']
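
One more sanity check of my own, exercising the end-of-string branch of the boundary test above: a phrase that ends the sentence is also kept intact.

print(tokenize_foo("i love you so much", ["so much"], []))
# -> ['i', 'love', 'you', 'so much']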