计算两个字符串相(或句子)似度的方法1 编辑距离2 余弦相似度3 FuzzyWuzzy
时间:2022-06-01
本文章向大家介绍计算两个字符串相(或句子)似度的方法1 编辑距离2 余弦相似度3 FuzzyWuzzy,主要内容包括其使用实例、应用技巧、基本知识点总结和需要注意事项,具有一定的参考价值,需要的朋友可以参考一下。
主要方法有:编辑距离、余弦相似度、模糊相似度百分比
1 编辑距离
def levenshtein(first, second):
''' 编辑距离算法(LevD)
Args: 两个字符串
returns: 两个字符串的编辑距离 int
'''
if len(first) > len(second):
first, second = second, first
if len(first) == 0:
return len(second)
if len(second) == 0:
return len(first)
first_length = len(first) + 1
second_length = len(second) + 1
distance_matrix = [list(range(second_length)) for x in range(first_length)]
# print distance_matrix
for i in range(1, first_length):
for j in range(1, second_length):
deletion = distance_matrix[i - 1][j] + 1
insertion = distance_matrix[i][j - 1] + 1
substitution = distance_matrix[i - 1][j - 1]
if first[i - 1] != second[j - 1]:
substitution += 1
distance_matrix[i][j] = min(insertion, deletion, substitution)
# print distance_matrix
return distance_matrix[first_length - 1][second_length - 1]
str1="hello,good moring"
str2="hi,good moring"
edit_distance=levenshtein(str1,str2)
edit_distance
4
2 余弦相似度
import math
import re
import datetime
import time
text1 = "This game is one of the very best. games ive played. the ;pictures? "
"cant descripe the real graphics in the game."
text2 = "this game have/ is3 one of the very best. games ive played. the ;pictures? "
"cant descriPe now the real graphics in the game."
text3 = "So in the picture i saw a nice size detailed metal puzzle. Eager to try since I enjoy 3d wood puzzles, i ordered it. Well to my disappointment I got in the mail a small square about 4 inches around. And to add more disappointment when I built it it was smaller than the palm of my hand. For the price it should of been much much larger. Don't be fooled. It's only worth $5.00.Update 4/15/2013I have bought and completed 13 of these MODELS from A.C. Moore for $5.99 a piece, so i stand by my comment that thiss one is overpriced. It was still fun to build just like all the others from the maker of this brand.Just be warned, They are small."
text4 = "I love it when an author can bring you into their made up world and make you feel like a friend, confidant, or family. Having a special child of my own I could relate to the teacher and her madcap class. I've also spent time in similar classrooms and enjoyed the uniqueness of each and every child. Her story drew me into their world and had me laughing so hard my family thought I had lost my mind, so I shared the passage so they could laugh with me. Read this book if you enjoy a book with strong women, you won't regret it."
def compute_cosine(text_a, text_b):
# 找单词及词频
words1 = text_a.split(' ')
words2 = text_b.split(' ')
# print(words1)
words1_dict = {}
words2_dict = {}
for word in words1:
# word = word.strip(",.?!;")
word = re.sub('[^a-zA-Z]', '', word)
word = word.lower()
# print(word)
if word != '' and word in words1_dict: # 这里改动了
num = words1_dict[word]
words1_dict[word] = num + 1
elif word != '':
words1_dict[word] = 1
else:
continue
for word in words2:
# word = word.strip(",.?!;")
word = re.sub('[^a-zA-Z]', '', word)
word = word.lower()
if word != '' and word in words2_dict:
num = words2_dict[word]
words2_dict[word] = num + 1
elif word != '':
words2_dict[word] = 1
else:
continue
print(words1_dict)
print(words2_dict)
# 排序
dic1 = sorted(words1_dict.items(), key=lambda asd: asd[1], reverse=True)
dic2 = sorted(words2_dict.items(), key=lambda asd: asd[1], reverse=True)
print(dic1)
print(dic2)
# 得到词向量
words_key = []
for i in range(len(dic1)):
words_key.append(dic1[i][0]) # 向数组中添加元素
for i in range(len(dic2)):
if dic2[i][0] in words_key:
# print 'has_key', dic2[i][0]
pass
else: # 合并
words_key.append(dic2[i][0])
# print(words_key)
vect1 = []
vect2 = []
for word in words_key:
if word in words1_dict:
vect1.append(words1_dict[word])
else:
vect1.append(0)
if word in words2_dict:
vect2.append(words2_dict[word])
else:
vect2.append(0)
print(vect1)
print(vect2)
# 计算余弦相似度
sum = 0
sq1 = 0
sq2 = 0
for i in range(len(vect1)):
sum += vect1[i] * vect2[i]
sq1 += pow(vect1[i], 2)
sq2 += pow(vect2[i], 2)
try:
result = round(float(sum) / (math.sqrt(sq1) * math.sqrt(sq2)), 2)
except ZeroDivisionError:
result = 0.0
# print(result)
return result
if __name__ == '__main__':
result=compute_cosine(text1, text2)
print(result)
{'this': 1, 'game': 2, 'is': 1, 'one': 1, 'of': 1, 'the': 4, 'very': 1, 'best': 1, 'games': 1, 'ive': 1, 'played': 1, 'pictures': 1, 'cant': 1, 'descripe': 1, 'real': 1, 'graphics': 1, 'in': 1}
{'this': 1, 'game': 2, 'have': 1, 'is': 1, 'one': 1, 'of': 1, 'the': 4, 'very': 1, 'best': 1, 'games': 1, 'ive': 1, 'played': 1, 'pictures': 1, 'cant': 1, 'descripe': 1, 'now': 1, 'real': 1, 'graphics': 1, 'in': 1}
[('the', 4), ('game', 2), ('this', 1), ('is', 1), ('one', 1), ('of', 1), ('very', 1), ('best', 1), ('games', 1), ('ive', 1), ('played', 1), ('pictures', 1), ('cant', 1), ('descripe', 1), ('real', 1), ('graphics', 1), ('in', 1)]
[('the', 4), ('game', 2), ('this', 1), ('have', 1), ('is', 1), ('one', 1), ('of', 1), ('very', 1), ('best', 1), ('games', 1), ('ive', 1), ('played', 1), ('pictures', 1), ('cant', 1), ('descripe', 1), ('now', 1), ('real', 1), ('graphics', 1), ('in', 1)]
[4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
[4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
0.97
3 FuzzyWuzzy
from fuzzywuzzy import fuzz
fuzz.ratio("this is a test", "this is a test!")
97
- JavaScript 教程
- JavaScript 编辑工具
- JavaScript 与HTML
- JavaScript 与Java
- JavaScript 数据结构
- JavaScript 基本数据类型
- JavaScript 特殊数据类型
- JavaScript 运算符
- JavaScript typeof 运算符
- JavaScript 表达式
- JavaScript 类型转换
- JavaScript 基本语法
- JavaScript 注释
- Javascript 基本处理流程
- Javascript 选择结构
- Javascript if 语句
- Javascript if 语句的嵌套
- Javascript switch 语句
- Javascript 循环结构
- Javascript 循环结构实例
- Javascript 跳转语句
- Javascript 控制语句总结
- Javascript 函数介绍
- Javascript 函数的定义
- Javascript 函数调用
- Javascript 几种特殊的函数
- JavaScript 内置函数简介
- Javascript eval() 函数
- Javascript isFinite() 函数
- Javascript isNaN() 函数
- parseInt() 与 parseFloat()
- escape() 与 unescape()
- Javascript 字符串介绍
- Javascript length属性
- javascript 字符串函数
- Javascript 日期对象简介
- Javascript 日期对象用途
- Date 对象属性和方法
- Javascript 数组是什么
- Javascript 创建数组
- Javascript 数组赋值与取值
- Javascript 数组属性和方法
- QGIS制图中面积小的区域不显示注记
- linux 编译 c或cpp 文件为动态库 so 文件(最简单直观的模板)
- 满眼只有React和Vue,却对前端数据层几乎一无所知
- 嵌入式linux之go语言开发(八)存储模块的封装(二)
- 招商银行校招题二
- 小程序工程化系列(一):文件依赖分析
- ucgui在windows上的移植,及为go语言打造简易跨平台GUI的想法
- React setState 是异步执行还是同步执行?
- sm2,sm3,sm4国密算法的纯c语言版本,使用于任何嵌入式平台
- sm2国密算法的纯c语言版本,使用于单片机平台(静态内存分配)
- 面试:mysql 事务和锁的解释
- 【STM32F407开发板用户手册】第35章 STM32F407的FSMC总线应用之驱动AD7606(8通道同步采样, 16bit, 正负10V)
- 玩转easyARM imax283A开发版(二),移植NES模拟器并增加按键驱动,让板子可以玩超级玛丽游戏
- 完了!TCP出了大事!
- redis高并发高可用