8_演员导演的知识图谱
第一步:确定演员导演知识图谱的领域和范围
演员导演知识图谱是一个专注于电影娱乐行业的专业知识图谱,旨在整合演员、导演、电影作品以及相关娱乐信息,形成完整的娱乐生态网络。
1. 核心实体类型
人员实体:
- Actor:演员,包括电影演员、电视剧演员等
- Director:导演,包括电影导演、电视剧导演等
- Producer:制片人
- Screenwriter:编剧
- Crew:其他演职人员(摄影师、美术指导等)
作品实体:
- Movie:电影作品
- TVSeries:电视剧作品
- Episode:剧集(电视剧单集)
- Award:奖项和荣誉
- Festival:电影节
机构实体:
- ProductionCompany:制作公司
- Studio:电影工作室
- Distributor:发行公司
2. 应用场景和目标
电影推荐系统:
- 基于演员相似度推荐电影
- 根据导演风格推荐作品
- 发现演员的转型作品
娱乐数据分析:
- 分析导演与演员的合作模式
- 追踪电影票房与演员阵容的关系
- 研究电影类型发展趋势
职业发展分析:
- 演员职业生涯轨迹分析
- 导演作品风格演变研究
- 电影产业人才流动分析
智能问答系统:
- "谁是《阿凡达》的导演?"
- "莱昂纳多·迪卡普里奥合作过哪些导演?"
- "获得奥斯卡最佳导演奖的人都有哪些作品?"
3. 数据范围界定
时间范围:
- 现代电影时代(1920年至今)
- 重点关注当代电影(2000年至今)
- 包含经典电影(视数据质量而定)
地域范围:
- 全球电影(好莱坞、欧洲电影、亚洲电影等)
- 重点关注华语电影和中国电影市场
- 包含国际电影节获奖作品
语言范围:
- 多语言支持(中文、英文、日语、韩语等)
- 重点关注中文电影内容
- 支持多语言电影信息
电影类型:
- 剧情、动作、喜剧、爱情、科幻、恐怖等
- 纪录片、动画、短片等
- 排除**电影和低质量电影
4. 数据质量目标
完整性目标:
- 覆盖主要演员和导演(知名度Top 1000)
- 包含重要电影作品(票房Top 1000或获奖作品)
- 建立演员合作网络(合作关系>1000)
准确性目标:
- 演员电影作品关联准确率 > 95%
- 导演作品归属准确率 > 98%
- 奖项信息准确率 > 90%
时效性目标:
- 新电影数据更新延迟 < 7天
- 演员最新作品更新延迟 < 30天
- 票房数据更新延迟 < 1天
5. 数据规模预期
实体数量:
- 演员:10,000+
- 导演:5,000+
- 电影作品:50,000+
- 奖项记录:10,000+
关系数量:
- 演员参演关系:100,000+
- 导演执导关系:20,000+
- 演员合作关系:50,000+
- 奖项获得关系:15,000+
6. 技术可行性评估
数据获取难度:
- 公开电影数据库(IMDb、Douban、TMDB)
- 电影网站和资讯平台
- 维基百科和百度百科
技术复杂度:
- 实体消歧(同名演员处理)
- 多语言内容整合
- 时效性数据维护
商业价值:
- 电影推荐和内容发现
- 娱乐资讯和数据服务
- 电影投资决策支持
通过明确这些范围和目标,我们可以构建一个专注于演员导演关系的专业知识图谱,为电影娱乐行业提供强大的知识支撑。
第二步:收集电影娱乐数据源
基于豆瓣电影数据的电影推荐系统需要从多个渠道合法合规地收集数据。以下是数据收集的详细策略:
1. 豆瓣官方数据源
豆瓣电影API(如果可用):
import requests
import json
from typing import Dict, List
class DoubanMovieAPI:
def __init__(self, api_key: str = None):
self.base_url = "https://api.douban.com/v2/movie"
self.headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
}
if api_key:
self.headers['apikey'] = api_key
def get_movie_info(self, movie_id: str) -> Dict:
"""获取电影详细信息"""
url = f"{self.base_url}/subject/{movie_id}"
try:
response = requests.get(url, headers=self.headers, timeout=10)
if response.status_code == 200:
return response.json()
except Exception as e:
print(f"获取电影信息失败: {e}")
return None
def search_movies(self, query: str, start: int = 0, count: int = 20) -> List[Dict]:
"""搜索电影"""
url = f"{self.base_url}/search"
params = {
'q': query,
'start': start,
'count': count
}
try:
response = requests.get(url, params=params, headers=self.headers, timeout=10)
if response.status_code == 200:
data = response.json()
return data.get('subjects', [])
except Exception as e:
print(f"搜索电影失败: {e}")
return []
注意:豆瓣API需要申请API密钥,且有使用限制。
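一个最小的调用示意如下(API 密钥、电影 ID 均为占位示例,实际可用性取决于豆瓣对 v2 接口的开放情况):
api = DoubanMovieAPI(api_key="your_api_key")  # 占位密钥
movie = api.get_movie_info("1292052")  # 示例条目ID
if movie:
    print(movie.get('title'), movie.get('rating'))
results = api.search_movies("阿凡达", count=5)
for item in results:
    print(item.get('title'))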
2. 网页抓取策略(合规爬取)
尊重robots.txt和网站政策:
import requests
from bs4 import BeautifulSoup
import time
import random
import re
from typing import Dict, List
class DoubanMovieScraper:
def __init__(self):
self.session = requests.Session()
self.session.headers.update({
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
'Accept-Encoding': 'gzip, deflate',
'Connection': 'keep-alive',
})
def get_movie_detail(self, movie_id: str) -> Dict:
"""获取电影详情页面信息"""
url = f"https://movie.douban.com/subject/{movie_id}/"
try:
response = self.session.get(url, timeout=15)
if response.status_code == 200:
return self._parse_movie_page(response.text)
except Exception as e:
print(f"抓取电影详情失败: {e}")
return None
def _parse_movie_page(self, html: str) -> Dict:
"""解析电影详情页面"""
soup = BeautifulSoup(html, 'html.parser')
movie_info = {}
try:
# 电影标题
title_elem = soup.find('span', property='v:itemreviewed')
movie_info['title'] = title_elem.text.strip() if title_elem else ''
# 导演信息
director_elem = soup.find('a', rel='v:directedBy')
movie_info['director'] = director_elem.text.strip() if director_elem else ''
# 演员信息
actors = []
actor_elems = soup.find_all('a', rel='v:starring')[:5] # 限制演员数量
for actor_elem in actor_elems:
actors.append(actor_elem.text.strip())
movie_info['actors'] = actors
# 电影类型
genre_elems = soup.find_all('span', property='v:genre')
movie_info['genres'] = [elem.text.strip() for elem in genre_elems]
# 上映年份
year_elem = soup.find('span', class_='year')
movie_info['year'] = year_elem.text.strip('()') if year_elem else ''
# 评分
rating_elem = soup.find('strong', property='v:average')
movie_info['rating'] = rating_elem.text.strip() if rating_elem else ''
# 简介
summary_elem = soup.find('span', property='v:summary')
movie_info['summary'] = summary_elem.text.strip() if summary_elem else ''
except Exception as e:
print(f"解析页面失败: {e}")
return movie_info
def get_top_movies(self, genre: str = '', start: int = 0, limit: int = 50) -> List[Dict]:
"""获取豆瓣热门电影"""
movies = []
# 豆瓣Top 250电影
if genre == 'top250':
url = "https://movie.douban.com/top250"
else:
# 按类型获取电影
url = f"https://movie.douban.com/tag/{genre}" if genre else "https://movie.douban.com/chart"
try:
response = self.session.get(url, timeout=15)
if response.status_code == 200:
movie_urls = self._extract_movie_urls(response.text)
# 限制抓取数量,避免过度请求
for movie_url in movie_urls[start:start+limit]:
movie_id = movie_url.split('/')[-2]
movie_info = self.get_movie_detail(movie_id)
if movie_info:
movies.append(movie_info)
# 添加延时,避免请求过快
time.sleep(random.uniform(1, 3))
except Exception as e:
print(f"获取电影列表失败: {e}")
return movies
def _extract_movie_urls(self, html: str) -> List[str]:
"""从列表页面提取电影链接"""
soup = BeautifulSoup(html, 'html.parser')
movie_links = []
# 查找电影链接
link_elems = soup.find_all('a', href=re.compile(r'/subject/\d+/'))
for link_elem in link_elems:
href = link_elem['href']
if href not in movie_links:
movie_links.append(href)
return movie_links
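在调用上面的抓取器之前,可以先用标准库 urllib.robotparser 检查目标路径是否被 robots.txt 允许,未被允许的路径直接跳过(一个最小示意,函数名为自拟):
from urllib import robotparser

def is_allowed(url: str, user_agent: str = "*") -> bool:
    """检查 robots.txt 是否允许抓取该 URL(示意)"""
    rp = robotparser.RobotFileParser()
    rp.set_url("https://movie.douban.com/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

if is_allowed("https://movie.douban.com/subject/1292052/"):  # 示例条目
    print("robots.txt 允许抓取")
else:
    print("robots.txt 不允许,跳过该页面")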
3. 批量数据收集策略
分批收集避免被限制:
class BatchDataCollector:
def __init__(self, scraper: DoubanMovieScraper):
self.scraper = scraper
self.collected_movies = []
def collect_comprehensive_data(self) -> Dict[str, List]:
"""收集全面的电影数据"""
data = {
'movies': [],
'actors': [],
'directors': [],
'genres': set()
}
# 定义收集的电影类型
genres = ['剧情', '喜剧', '动作', '爱情', '科幻', '动画', '悬疑', '惊悚', '恐怖', '犯罪']
for genre in genres:
print(f"正在收集 {genre} 类型电影...")
# 每个类型收集100部电影
movies = self.scraper.get_top_movies(genre, limit=100)
for movie in movies:
if movie and movie.get('title'):
# 添加电影信息
movie_data = {
'title': movie.get('title', ''),
'director': movie.get('director', ''),
'actors': movie.get('actors', []),
'genres': movie.get('genres', []),
'year': movie.get('year', ''),
'rating': movie.get('rating', ''),
'summary': movie.get('summary', ''),
'douban_id': movie.get('douban_id', '')
}
data['movies'].append(movie_data)
# 收集演员和导演
if movie.get('director'):
director_data = {
'name': movie['director'],
'type': 'director'
}
data['directors'].append(director_data)
for actor in movie.get('actors', []):
actor_data = {
'name': actor,
'type': 'actor'
}
data['actors'].append(actor_data)
# 收集电影类型
data['genres'].update(movie.get('genres', []))
# 添加延时,避免请求过快
time.sleep(random.uniform(5, 10))
# 数据去重
data['directors'] = self._deduplicate(data['directors'])
data['actors'] = self._deduplicate(data['actors'])
return data
def _deduplicate(self, items: List[Dict]) -> List[Dict]:
"""数据去重"""
seen = set()
unique_items = []
for item in items:
key = (item['name'], item.get('type', ''))
if key not in seen:
seen.add(key)
unique_items.append(item)
return unique_items
4. 数据质量控制
数据验证和清洗:
class DataQualityController:
def __init__(self):
self.validation_rules = {
'movie': self._validate_movie,
'actor': self._validate_person,
'director': self._validate_person
}
def validate_and_clean_data(self, data: Dict) -> Dict:
"""验证和清洗数据"""
cleaned_data = {
'movies': [],
'actors': [],
'directors': [],
'genres': set()
}
# 验证电影数据
for movie in data.get('movies', []):
if self._validate_movie(movie):
cleaned_data['movies'].append(movie)
cleaned_data['genres'].update(movie.get('genres', []))
# 验证演员和导演数据
for actor in data.get('actors', []):
if self._validate_person(actor):
cleaned_data['actors'].append(actor)
for director in data.get('directors', []):
if self._validate_person(director):
cleaned_data['directors'].append(director)
return cleaned_data
def _validate_movie(self, movie: Dict) -> bool:
"""验证电影数据"""
required_fields = ['title', 'director']
return all(movie.get(field) for field in required_fields)
def _validate_person(self, person: Dict) -> bool:
"""验证人员数据"""
return bool(person.get('name') and len(person['name'].strip()) > 0)
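把抓取、批量收集和质量控制串起来并落盘,大致流程如下(示意,输出文件名为自拟):
import json

scraper = DoubanMovieScraper()
collector = BatchDataCollector(scraper)
controller = DataQualityController()

raw_data = collector.collect_comprehensive_data()
clean_data = controller.validate_and_clean_data(raw_data)

# genres 是 set,序列化前先转成 list
clean_data['genres'] = sorted(clean_data['genres'])
with open('douban_movies.json', 'w', encoding='utf-8') as f:
    json.dump(clean_data, f, ensure_ascii=False, indent=2)
print(f"共保存 {len(clean_data['movies'])} 部电影")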
5. 法律合规性提醒
重要提醒:
- 遵守豆瓣的robots.txt协议
- 不要过度频繁请求,避免对豆瓣服务器造成压力
- 仅收集公开可用的电影信息
- 尊重知识产权和隐私权
- 考虑使用豆瓣官方API(如可用)
通过合法合规的数据收集,你可以获得高质量的豆瓣电影数据,为构建电影推荐知识图谱奠定坚实的基础。
第三步:设计电影本体和模式(Schema)
基于豆瓣电影数据的电影推荐系统需要设计专门针对电影娱乐领域的本体结构。以下是详细的本体设计方案:
1. 核心实体类型(Entity Types)
电影作品实体:
- Movie:电影作品。属性:title, year, duration, language, country, summary, rating, votes
- TVSeries:电视剧作品。属性:title, year, episodes, seasons, summary, rating, votes
- Person:人员(演员、导演等)。属性:name, birth_date, birth_place, biography, gender
分类和标签实体:
- Genre:电影类型。属性:name, description, parent_genre
- Tag:用户标签。属性:name, frequency, category
奖项和评价实体:
- Award:奖项。属性:name, category, year, organization
- Review:评论。属性:content, rating, author, publish_date, helpful_votes
机构实体:
- ProductionCompany:制作公司。属性:name, founded_year, headquarters, website
- FilmFestival:电影节。属性:name, location, established_year, frequency
2. 关系类型(Relationship Types)
创作关系:
- DIRECTED_BY:电影导演关系。属性:credit_order, role_description
- ACTED_IN:演员参演关系。属性:character_name, credit_order, role_type
- WRITTEN_BY:编剧关系。属性:credit_order, writer_type
- PRODUCED_BY:制片人关系。属性:credit_order, producer_type
归属关系:
- BELONGS_TO_GENRE:电影类型关系。属性:relevance_score
- HAS_TAG:标签关系。属性:confidence_score
- PRODUCED_BY_COMPANY:制作公司关系。属性:role_in_production
评价关系:
- RATED_BY:用户评分关系。属性:rating_score, rating_date
- REVIEWED_BY:评论关系。属性:review_date, helpful_count
- WON_AWARD:获奖关系。属性:award_year, award_category
社会关系:
- COLLABORATED_WITH:合作关系。属性:collaboration_count, first_collaboration_year
- SIMILAR_TO:相似关系。属性:similarity_score, similarity_reason
3. 本体定义示例(OWL格式)
@prefix : <http://example.org/movie-kg#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# 电影类定义
:Movie rdf:type owl:Class ;
rdfs:label "电影" ;
rdfs:comment "电影作品实体" .
:Person rdf:type owl:Class ;
rdfs:label "人员" ;
rdfs:comment "演员、导演等电影从业人员" .
:Genre rdf:type owl:Class ;
rdfs:label "电影类型" ;
rdfs:comment "电影分类类型" .
# 属性定义
:title rdf:type owl:DatatypeProperty ;
rdfs:domain :Movie ;
rdfs:range xsd:string ;
rdfs:label "电影标题" .
:rating rdf:type owl:DatatypeProperty ;
rdfs:domain :Movie ;
rdfs:range xsd:decimal ;
rdfs:label "豆瓣评分" .
:name rdf:type owl:DatatypeProperty ;
rdfs:domain :Person ;
rdfs:range xsd:string ;
rdfs:label "人员姓名" .
:genreName rdf:type owl:DatatypeProperty ;
rdfs:domain :Genre ;
rdfs:range xsd:string ;
rdfs:label "类型名称" .
# 关系定义
:directedBy rdf:type owl:ObjectProperty ;
rdfs:domain :Movie ;
rdfs:range :Person ;
rdfs:label "导演" .
:actedIn rdf:type owl:ObjectProperty ;
rdfs:domain :Person ;
rdfs:range :Movie ;
rdfs:label "参演" .
:belongsToGenre rdf:type owl:ObjectProperty ;
rdfs:domain :Movie ;
rdfs:range :Genre ;
rdfs:label "属于类型" .
# 反向关系
:isDirectorOf rdf:type owl:ObjectProperty ;
rdfs:domain :Person ;
rdfs:range :Movie ;
rdfs:label "导演作品" ;
owl:inverseOf :directedBy .
# 传递性关系
:collaboratedWith rdf:type owl:ObjectProperty ;
rdfs:domain :Person ;
rdfs:range :Person ;
rdfs:label "合作" ;
rdf:type owl:TransitiveProperty .
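上面的 OWL 片段可以先用 rdflib 做一次语法检查,确认 Turtle 能被正常解析(示意,假设片段已保存为 movie_ontology.ttl):
from rdflib import Graph
from rdflib.namespace import RDF, OWL

g = Graph()
g.parse("movie_ontology.ttl", format="turtle")  # 文件名为示例
print(f"共解析到 {len(g)} 条三元组")

# 列出本体中定义的类
for cls in g.subjects(RDF.type, OWL.Class):
    print("类:", cls)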
4. 豆瓣特定属性设计
针对豆瓣数据的特殊属性:
class DoubanMovieSchema:
"""豆瓣电影数据模式定义"""
# 电影属性映射
MOVIE_PROPERTIES = {
'title': '电影标题',
'original_title': '原片名',
'year': '上映年份',
'rating': '豆瓣评分',
'votes': '评分人数',
'duration': '片长(分钟)',
'language': '语言',
'country': '制片国家/地区',
'summary': '剧情简介',
'genres': '电影类型',
'tags': '用户标签',
'imdb_id': 'IMDb编号',
'douban_id': '豆瓣ID'
}
# 人员属性映射
PERSON_PROPERTIES = {
'name': '姓名',
'name_en': '英文名',
'gender': '性别',
'birth_date': '出生日期',
'birth_place': '出生地',
'biography': '简介',
'aka': '别名',
'imdb_id': 'IMDb编号',
'douban_id': '豆瓣ID'
}
# 关系属性映射
RELATIONSHIP_PROPERTIES = {
'DIRECTED_BY': {
'credit_order': '导演排序',
'role_description': '导演职责描述'
},
'ACTED_IN': {
'character_name': '角色名称',
'credit_order': '演员排序',
'role_type': '角色类型',
'is_leading_role': '是否主角'
},
'RATED_BY': {
'rating_score': '用户评分',
'rating_date': '评分日期',
'user_id': '用户ID'
}
}
5. 推荐算法相关属性
为电影推荐设计的特殊属性:
class RecommendationSchema:
"""推荐系统专用模式"""
# 相似性计算属性
SIMILARITY_PROPERTIES = {
'genre_similarity': '类型相似度',
'actor_similarity': '演员相似度',
'director_similarity': '导演相似度',
'rating_similarity': '评分相似度',
'overall_similarity': '综合相似度'
}
# 用户偏好属性
USER_PREFERENCE_PROPERTIES = {
'preferred_genres': '偏好类型',
'preferred_actors': '偏好演员',
'preferred_directors': '偏好导演',
'rating_history': '评分历史',
'viewing_history': '观看历史',
'demographics': '用户画像'
}
# 上下文属性
CONTEXT_PROPERTIES = {
'seasonal_trend': '季节趋势',
'time_based_preference': '时间偏好',
'mood_based_preference': '心情偏好',
'social_trend': '社交趋势'
}
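这些相似度维度最终通常会加权合成 overall_similarity,下面是一个最小的加权示意(权重为假设值,应结合离线实验调整):
def overall_similarity(sims: dict, weights: dict = None) -> float:
    """按权重合成综合相似度(示意)"""
    weights = weights or {
        'genre_similarity': 0.3,
        'actor_similarity': 0.3,
        'director_similarity': 0.2,
        'rating_similarity': 0.2,
    }
    return sum(sims.get(key, 0.0) * w for key, w in weights.items())

# 示例:两部电影在各维度上的相似度
print(overall_similarity({
    'genre_similarity': 0.8,
    'actor_similarity': 0.5,
    'director_similarity': 1.0,
    'rating_similarity': 0.9,
}))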
6. 数据质量属性
确保数据质量的元数据属性:
class QualityMetadata:
"""数据质量元数据"""
# 数据来源追踪
PROVENANCE_PROPERTIES = {
'source': '数据来源',
'collection_date': '采集日期',
'last_updated': '最后更新',
'confidence_score': '置信度',
'verification_status': '验证状态'
}
# 数据质量指标
QUALITY_METRICS = {
'completeness_score': '完整性得分',
'accuracy_score': '准确性得分',
'consistency_score': '一致性得分',
'timeliness_score': '时效性得分',
'overall_quality': '整体质量'
}
7. 本体可视化
本体结构图示:
电影领域本体结构:
├── 作品层
│ ├── Movie(电影)
│ │ ├── 属性:title, year, rating, summary...
│ │ └── 关系:directed_by, acted_in, belongs_to_genre...
│ └── TVSeries(电视剧)
│ ├── 属性:title, seasons, episodes...
│ └── 关系:directed_by, acted_in...
│
├── 人员层
│ ├── Person(人员)
│ │ ├── 属性:name, birth_date, biography...
│ │ └── 关系:directed, acted_in, collaborated_with...
│ ├── Actor(演员)
│ └── Director(导演)
│
├── 分类层
│ ├── Genre(类型)
│ │ ├── 属性:name, description...
│ │ └── 关系:parent_genre, sub_genre...
│ └── Tag(标签)
│ └── 属性:name, frequency...
│
└── 评价层
├── Award(奖项)
│ ├── 属性:name, year, category...
│ └── 关系:won_by, nominated_for...
└── Review(评论)
├── 属性:content, rating, author...
└── 关系:reviewed_by, helpful_to...
这个本体设计充分考虑了豆瓣电影数据的特点,为电影推荐系统提供了丰富的关系网络和属性信息。通过这种结构化表示,可以支持复杂的电影推荐算法和查询操作。
第五步:关系抽取(合作、导演等关系)
关系抽取是从电影数据中识别实体之间关联的过程。对于豆瓣电影推荐系统,你需要抽取以下类型的电影关系:
1. 导演-电影关系抽取
建立导演与电影之间的执导关系:
from typing import Dict, List, Optional

class DirectorMovieRelationExtractor:
def __init__(self):
self.relationship_cache = {}
def extract_director_movie_relations(self, movie_entity: Dict, person_entities: List[Dict]) -> List[Dict]:
"""抽取导演与电影的关系"""
relations = []
# 获取电影的导演列表
movie_directors = movie_entity.get('directors', [])
# 为每个导演创建关系
for director_name in movie_directors:
director_entity = self._find_person_entity(director_name, 'director', person_entities)
if director_entity:
relation = {
'relation_type': 'DIRECTED_BY',
'source_id': director_entity['id'], # 导演ID
'target_id': movie_entity['id'], # 电影ID
'source_type': 'Person',
'target_type': 'Movie',
'properties': {
'movie_title': movie_entity.get('title', ''),
'director_name': director_name,
'movie_year': movie_entity.get('year', ''),
'source': 'douban',
'extracted_at': movie_entity.get('extracted_at')
},
'id': self._generate_relation_id('DIRECTED_BY', director_entity['id'], movie_entity['id'])
}
relations.append(relation)
return relations
    def _find_person_entity(self, name: str, person_type: str, person_entities: List[Dict]) -> Optional[Dict]:
        """查找人员实体(按姓名和人员类型匹配)"""
        for person in person_entities:
            if (person.get('name') == name and
                    person.get('person_type') == person_type):
                return person
        return None
def _generate_relation_id(self, relation_type: str, source_id: str, target_id: str) -> str:
"""生成关系唯一标识符"""
import hashlib
content = f"{relation_type}_{source_id}_{target_id}"
return hashlib.md5(content.encode()).hexdigest()
2. 演员-电影关系抽取
建立演员与电影之间的参演关系:
class ActorMovieRelationExtractor:
def __init__(self):
self.relationship_cache = {}
def extract_actor_movie_relations(self, movie_entity: Dict, person_entities: List[Dict]) -> List[Dict]:
"""抽取演员与电影的关系"""
relations = []
# 获取电影的演员列表
movie_actors = movie_entity.get('actors', [])
# 为每个演员创建关系
for i, actor_name in enumerate(movie_actors):
actor_entity = self._find_person_entity(actor_name, 'actor', person_entities)
if actor_entity:
relation = {
'relation_type': 'ACTED_IN',
'source_id': actor_entity['id'], # 演员ID
'target_id': movie_entity['id'], # 电影ID
'source_type': 'Person',
'target_type': 'Movie',
'properties': {
'movie_title': movie_entity.get('title', ''),
'actor_name': actor_name,
'movie_year': movie_entity.get('year', ''),
'credit_order': i + 1, # 演员排序
'is_leading_role': i < 3, # 前3位通常是主角
'source': 'douban',
'extracted_at': movie_entity.get('extracted_at')
},
'id': self._generate_relation_id('ACTED_IN', actor_entity['id'], movie_entity['id'])
}
relations.append(relation)
return relations
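注意:ActorMovieRelationExtractor 以及后面的合作关系、类型关系抽取器都调用了 _find_person_entity 和 _generate_relation_id,但类内没有重复给出实现。一个简单的做法是把这两个辅助方法抽到公共基类中,各抽取器继承即可(示意,基类名为自拟):
import hashlib
from typing import Dict, List, Optional

class BaseRelationExtractor:
    """各关系抽取器共用的辅助方法(示意)"""

    def _find_person_entity(self, name: str, person_type: str,
                            person_entities: List[Dict]) -> Optional[Dict]:
        """按姓名和人员类型查找人员实体"""
        for person in person_entities:
            if (person.get('name') == name and
                    person.get('person_type') == person_type):
                return person
        return None

    def _generate_relation_id(self, relation_type: str, source_id: str, target_id: str) -> str:
        """生成关系唯一标识符"""
        content = f"{relation_type}_{source_id}_{target_id}"
        return hashlib.md5(content.encode()).hexdigest()

# 例如:class ActorMovieRelationExtractor(BaseRelationExtractor): ...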
3. 演员合作关系抽取
发现演员之间的合作关系:
class ActorCollaborationExtractor:
def __init__(self):
self.collaboration_network = {}
def extract_actor_collaborations(self, movie_entities: List[Dict], person_entities: List[Dict]) -> List[Dict]:
"""抽取演员合作关系"""
relations = []
# 为每部电影建立演员合作网络
for movie in movie_entities:
movie_actors = []
# 收集这部电影的所有演员
for person in person_entities:
if (person.get('person_type') == 'actor' and
movie['id'] in person.get('movies', [])):
movie_actors.append(person)
# 为演员两两建立合作关系
for i, actor1 in enumerate(movie_actors):
for actor2 in movie_actors[i+1:]:
# 检查是否已有合作关系
collaboration_key = self._get_collaboration_key(actor1['id'], actor2['id'])
if collaboration_key in self.collaboration_network:
# 更新已有合作关系
existing_relation = self.collaboration_network[collaboration_key]
existing_relation['properties']['collaboration_count'] += 1
existing_relation['properties']['movies'].append(movie['id'])
else:
# 创建新的合作关系
relation = {
'relation_type': 'COLLABORATED_WITH',
'source_id': actor1['id'],
'target_id': actor2['id'],
'source_type': 'Person',
'target_type': 'Person',
'properties': {
'collaboration_count': 1,
'first_collaboration_movie': movie['title'],
'first_collaboration_year': movie.get('year', ''),
'movies': [movie['id']],
'source': 'douban',
'extracted_at': movie.get('extracted_at')
},
'id': self._generate_relation_id('COLLABORATED_WITH', actor1['id'], actor2['id'])
}
relations.append(relation)
self.collaboration_network[collaboration_key] = relation
return relations
def _get_collaboration_key(self, actor1_id: str, actor2_id: str) -> str:
"""生成合作关系键(确保无向性)"""
if actor1_id < actor2_id:
return f"{actor1_id}_{actor2_id}"
else:
return f"{actor2_id}_{actor1_id}"
4. 电影类型关系抽取
建立电影与类型之间的归属关系:
class GenreRelationExtractor:
def __init__(self):
self.relationship_cache = {}
def extract_genre_relations(self, movie_entity: Dict, genre_entities: List[Dict]) -> List[Dict]:
"""抽取电影类型关系"""
relations = []
movie_genres = movie_entity.get('genres', [])
for genre_name in movie_genres:
genre_entity = self._find_genre_entity(genre_name, genre_entities)
if genre_entity:
relation = {
'relation_type': 'BELONGS_TO_GENRE',
'source_id': movie_entity['id'], # 电影ID
'target_id': genre_entity['id'], # 类型ID
'source_type': 'Movie',
'target_type': 'Genre',
'properties': {
'movie_title': movie_entity.get('title', ''),
'genre_name': genre_name,
'movie_year': movie_entity.get('year', ''),
'movie_rating': movie_entity.get('rating', ''),
'source': 'douban',
'extracted_at': movie_entity.get('extracted_at')
},
'id': self._generate_relation_id('BELONGS_TO_GENRE', movie_entity['id'], genre_entity['id'])
}
relations.append(relation)
return relations
def _find_genre_entity(self, name: str, genre_entities: List[Dict]) -> Optional[Dict]:
"""查找类型实体"""
for genre in genre_entities:
if genre.get('name') == name:
return genre
return None
5. 导演合作关系抽取
发现导演之间的合作关系:
class DirectorCollaborationExtractor:
def __init__(self):
self.collaboration_network = {}
def extract_director_collaborations(self, movie_entities: List[Dict], person_entities: List[Dict]) -> List[Dict]:
"""抽取导演合作关系"""
relations = []
# 为每部电影建立导演合作网络
for movie in movie_entities:
movie_directors = []
# 收集这部电影的所有导演
for person in person_entities:
if (person.get('person_type') == 'director' and
movie['id'] in person.get('movies', [])):
movie_directors.append(person)
# 为导演两两建立合作关系
for i, director1 in enumerate(movie_directors):
for director2 in movie_directors[i+1:]:
collaboration_key = self._get_collaboration_key(director1['id'], director2['id'])
if collaboration_key in self.collaboration_network:
# 更新已有合作关系
existing_relation = self.collaboration_network[collaboration_key]
existing_relation['properties']['collaboration_count'] += 1
existing_relation['properties']['movies'].append(movie['id'])
else:
# 创建新的合作关系
relation = {
'relation_type': 'DIRECTOR_COLLABORATION',
'source_id': director1['id'],
'target_id': director2['id'],
'source_type': 'Person',
'target_type': 'Person',
'properties': {
'collaboration_count': 1,
'first_collaboration_movie': movie['title'],
'first_collaboration_year': movie.get('year', ''),
'movies': [movie['id']],
'source': 'douban',
'extracted_at': movie.get('extracted_at')
},
'id': self._generate_relation_id('DIRECTOR_COLLABORATION', director1['id'], director2['id'])
}
relations.append(relation)
self.collaboration_network[collaboration_key] = relation
return relations
6. 综合关系抽取流程
整合所有关系抽取器:
class RelationshipExtractionPipeline:
def __init__(self):
self.extractors = {
'director_movie': DirectorMovieRelationExtractor(),
'actor_movie': ActorMovieRelationExtractor(),
'actor_collaboration': ActorCollaborationExtractor(),
'genre': GenreRelationExtractor(),
'director_collaboration': DirectorCollaborationExtractor()
}
def extract_all_relationships(self,
movie_entities: List[Dict],
person_entities: List[Dict],
genre_entities: List[Dict]) -> Dict[str, List]:
"""抽取所有类型的关系"""
all_relationships = {
'director_movie': [],
'actor_movie': [],
'actor_collaborations': [],
'genre_relations': [],
'director_collaborations': []
}
# 抽取导演-电影关系
for movie in movie_entities:
relations = self.extractors['director_movie'].extract_director_movie_relations(movie, person_entities)
all_relationships['director_movie'].extend(relations)
# 抽取演员-电影关系
for movie in movie_entities:
relations = self.extractors['actor_movie'].extract_actor_movie_relations(movie, person_entities)
all_relationships['actor_movie'].extend(relations)
# 抽取演员合作关系
actor_collaborations = self.extractors['actor_collaboration'].extract_actor_collaborations(
movie_entities, person_entities)
all_relationships['actor_collaborations'].extend(actor_collaborations)
# 抽取类型关系
for movie in movie_entities:
relations = self.extractors['genre'].extract_genre_relations(movie, genre_entities)
all_relationships['genre_relations'].extend(relations)
# 抽取导演合作关系
director_collaborations = self.extractors['director_collaboration'].extract_director_collaborations(
movie_entities, person_entities)
all_relationships['director_collaborations'].extend(director_collaborations)
# 去重处理
all_relationships = self._deduplicate_relationships(all_relationships)
return all_relationships
def _deduplicate_relationships(self, relationships: Dict[str, List]) -> Dict[str, List]:
"""关系去重"""
deduplicated = {}
for rel_type, rel_list in relationships.items():
seen_ids = set()
unique_relationships = []
for rel in rel_list:
rel_id = rel['id']
if rel_id not in seen_ids:
seen_ids.add(rel_id)
unique_relationships.append(rel)
deduplicated[rel_type] = unique_relationships
return deduplicated
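整条抽取流水线的调用方式大致如下(示意,movie_entities、person_entities、genre_entities 假设已由前面的实体抽取步骤得到,并结合下一小节的 RelationshipValidator 做校验):
pipeline = RelationshipExtractionPipeline()
relationships = pipeline.extract_all_relationships(
    movie_entities, person_entities, genre_entities)

validator = RelationshipValidator()
relationships = validator.validate_relationships(relationships)

for rel_type, rel_list in relationships.items():
    print(f"{rel_type}: {len(rel_list)} 条关系")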
7. 关系质量评估
评估抽取关系的质量:
class RelationshipValidator:
def __init__(self):
self.validation_rules = {
'DIRECTED_BY': self._validate_director_movie_relation,
'ACTED_IN': self._validate_actor_movie_relation,
'COLLABORATED_WITH': self._validate_collaboration_relation,
'BELONGS_TO_GENRE': self._validate_genre_relation
}
def validate_relationships(self, relationships: Dict[str, List]) -> Dict[str, List]:
"""验证所有关系"""
validated_relationships = {}
for rel_type, rel_list in relationships.items():
validator = self.validation_rules.get(rel_type)
if validator:
validated = [rel for rel in rel_list if validator(rel)]
validated_relationships[rel_type] = validated
else:
validated_relationships[rel_type] = rel_list
return validated_relationships
def _validate_director_movie_relation(self, relation: Dict) -> bool:
"""验证导演电影关系"""
required_fields = ['source_id', 'target_id', 'properties']
return all(relation.get(field) for field in required_fields)
def _validate_actor_movie_relation(self, relation: Dict) -> bool:
"""验证演员电影关系"""
required_fields = ['source_id', 'target_id', 'properties']
return all(relation.get(field) for field in required_fields)
def _validate_collaboration_relation(self, relation: Dict) -> bool:
"""验证合作关系"""
props = relation.get('properties', {})
return bool(props.get('collaboration_count', 0) > 0)
def _validate_genre_relation(self, relation: Dict) -> bool:
"""验证类型关系"""
required_fields = ['source_id', 'target_id', 'properties']
return all(relation.get(field) for field in required_fields)
通过这些关系抽取方法,你可以从豆瓣电影数据中建立丰富的实体关联网络,为电影推荐算法提供强大的图结构支持。
第六步:知识融合和消歧(同名演员处理等)
知识融合和消歧是确保知识图谱质量的关键步骤,主要解决实体重复、命名冲突和关系一致性问题。对于豆瓣电影数据,特别需要处理同名演员的消歧问题。
1. 同名演员消歧算法
基于电影作品和合作关系的演员消歧:
import re
from typing import Dict, List, Tuple, Set, Optional
from difflib import SequenceMatcher
import networkx as nx
class ActorDisambiguator:
def __init__(self):
self.name_clusters = {} # 姓名聚类
self.movie_signatures = {} # 电影签名用于消歧
self.collaboration_graph = nx.Graph()
def disambiguate_actors(self, person_entities: List[Dict], movie_entities: List[Dict]) -> List[Dict]:
"""演员消歧主函数"""
# 第一步:基于姓名聚类
self._build_name_clusters(person_entities)
# 第二步:构建合作网络
self._build_collaboration_graph(person_entities, movie_entities)
# 第三步:计算相似度进行精确消歧
disambiguated_actors = []
processed_ids = set()
for actor in person_entities:
if actor['person_type'] != 'actor':
disambiguated_actors.append(actor)
continue
actor_id = actor['id']
if actor_id in processed_ids:
continue
# 找到同名演员
same_name_actors = self._find_same_name_actors(actor, person_entities)
if len(same_name_actors) > 1:
# 对同名实体做消歧:仅合并被判定为同一人的记录,不同的人保持独立
merged_actors = self._merge_similar_actors(same_name_actors, movie_entities)
disambiguated_actors.extend(merged_actors)
processed_ids.update(a['id'] for a in same_name_actors)
else:
disambiguated_actors.append(actor)
processed_ids.add(actor_id)
return disambiguated_actors
def _build_name_clusters(self, person_entities: List[Dict]):
"""基于姓名构建聚类"""
for person in person_entities:
if person['person_type'] == 'actor':
name = person['name']
if name not in self.name_clusters:
self.name_clusters[name] = []
self.name_clusters[name].append(person)
def _build_collaboration_graph(self, person_entities: List[Dict], movie_entities: List[Dict]):
"""构建演员合作网络"""
# 为每部电影建立演员合作关系
for movie in movie_entities:
movie_actors = []
# 找到这部电影的所有演员
for person in person_entities:
if (person['person_type'] == 'actor' and
movie['id'] in person.get('movies', [])):
movie_actors.append(person['id'])
# 在合作图中添加边
for i in range(len(movie_actors)):
for j in range(i+1, len(movie_actors)):
actor1_id, actor2_id = movie_actors[i], movie_actors[j]
if self.collaboration_graph.has_edge(actor1_id, actor2_id):
self.collaboration_graph[actor1_id][actor2_id]['weight'] += 1
else:
self.collaboration_graph.add_edge(actor1_id, actor2_id, weight=1)
def _find_same_name_actors(self, actor: Dict, all_actors: List[Dict]) -> List[Dict]:
"""找到同名演员"""
same_name = []
actor_name = actor['name']
for other_actor in all_actors:
if (other_actor['person_type'] == 'actor' and
other_actor['name'] == actor_name and
other_actor['id'] != actor['id']):
same_name.append(other_actor)
same_name.append(actor) # 把自己也加进去
return same_name
def _merge_similar_actors(self, same_name_actors: List[Dict], movie_entities: List[Dict]) -> List[Dict]:
"""合并相似的同名演员"""
if len(same_name_actors) <= 1:
return same_name_actors
# 计算演员之间的相似度
similarity_matrix = self._calculate_similarity_matrix(same_name_actors, movie_entities)
# 使用聚类算法分组
clusters = self._cluster_similar_actors(same_name_actors, similarity_matrix)
merged_actors = []
for cluster in clusters:
if len(cluster) == 1:
merged_actors.append(cluster[0])
else:
merged_actor = self._merge_actor_cluster(cluster, movie_entities)
merged_actors.append(merged_actor)
return merged_actors
def _calculate_similarity_matrix(self, actors: List[Dict], movies: List[Dict]) -> List[List[float]]:
"""计算演员相似度矩阵"""
n = len(actors)
similarity_matrix = [[0.0 for _ in range(n)] for _ in range(n)]
for i in range(n):
for j in range(i+1, n):
similarity = self._calculate_actor_similarity(actors[i], actors[j], movies)
similarity_matrix[i][j] = similarity
similarity_matrix[j][i] = similarity
return similarity_matrix
def _calculate_actor_similarity(self, actor1: Dict, actor2: Dict, movies: List[Dict]) -> float:
"""计算两个演员的相似度"""
score = 0.0
# 1. 电影作品相似度(共同电影占比)
movies1 = set(actor1.get('movies', []))
movies2 = set(actor2.get('movies', []))
if movies1 or movies2:
intersection = len(movies1 & movies2)
union = len(movies1 | movies2)
jaccard_sim = intersection / union if union > 0 else 0
score += jaccard_sim * 0.4
# 2. 合作网络相似度
if self.collaboration_graph.has_node(actor1['id']) and self.collaboration_graph.has_node(actor2['id']):
common_neighbors = len(set(self.collaboration_graph.neighbors(actor1['id'])) &
set(self.collaboration_graph.neighbors(actor2['id'])))
total_neighbors = len(set(self.collaboration_graph.neighbors(actor1['id'])) |
set(self.collaboration_graph.neighbors(actor2['id'])))
neighbor_sim = common_neighbors / total_neighbors if total_neighbors > 0 else 0
score += neighbor_sim * 0.3
# 3. 合作强度相似度
if self.collaboration_graph.has_edge(actor1['id'], actor2['id']):
edge_weight = self.collaboration_graph[actor1['id']][actor2['id']]['weight']
# 如果两人合作过多,可能是同一个人
if edge_weight > 3:
score += 0.8
else:
score += edge_weight * 0.1
# 4. 时间重叠度
year1 = self._get_actor_active_years(actor1, movies)
year2 = self._get_actor_active_years(actor2, movies)
time_overlap = self._calculate_time_overlap(year1, year2)
score += time_overlap * 0.2
return min(score, 1.0)
def _get_actor_active_years(self, actor: Dict, movies: List[Dict]) -> Tuple[int, int]:
"""获取演员活跃年份范围"""
movie_ids = actor.get('movies', [])
years = []
for movie in movies:
if movie['id'] in movie_ids:
year = movie.get('year', '')
if year and year.isdigit():
years.append(int(year))
return (min(years), max(years)) if years else (0, 0)
def _calculate_time_overlap(self, years1: Tuple[int, int], years2: Tuple[int, int]) -> float:
"""计算时间重叠度"""
if years1 == (0, 0) or years2 == (0, 0):
return 0.0
start1, end1 = years1
start2, end2 = years2
overlap_start = max(start1, start2)
overlap_end = min(end1, end2)
if overlap_start >= overlap_end:
return 0.0
overlap_duration = overlap_end - overlap_start
total_duration = max(end1 - start1, end2 - start2)
return overlap_duration / total_duration if total_duration > 0 else 0.0
def _cluster_similar_actors(self, actors: List[Dict], similarity_matrix: List[List[float]],
threshold: float = 0.6) -> List[List[Dict]]:
"""聚类相似的演员"""
n = len(actors)
visited = [False] * n
clusters = []
for i in range(n):
if not visited[i]:
cluster = []
self._dfs_cluster(i, actors, similarity_matrix, visited, cluster, threshold)
if cluster:
clusters.append(cluster)
return clusters
def _dfs_cluster(self, idx: int, actors: List[Dict], similarity_matrix: List[List[float]],
visited: List[bool], cluster: List[Dict], threshold: float):
"""深度优先搜索聚类"""
visited[idx] = True
cluster.append(actors[idx])
for j in range(len(actors)):
if (not visited[j] and
similarity_matrix[idx][j] >= threshold and
actors[idx]['name'] == actors[j]['name']): # 同名约束
self._dfs_cluster(j, actors, similarity_matrix, visited, cluster, threshold)
def _merge_actor_cluster(self, cluster: List[Dict], movies: List[Dict]) -> Dict:
"""合并演员聚类"""
if len(cluster) == 1:
return cluster[0]
# 选择最完整的演员作为基础
base_actor = max(cluster, key=lambda x: self._calculate_actor_completeness(x, movies))
# 合并属性
merged = base_actor.copy()
merged['movies'] = list(set().union(*[a.get('movies', []) for a in cluster]))
merged['confidence'] = sum(a.get('confidence', 0.5) for a in cluster) / len(cluster)
merged['disambiguation_notes'] = f"合并了{len(cluster)}个同名演员"
return merged
def _calculate_actor_completeness(self, actor: Dict, movies: List[Dict]) -> int:
"""计算演员信息的完整度"""
completeness = 0
movie_count = len(actor.get('movies', []))
if movie_count > 0: completeness += movie_count * 2
if actor.get('name'): completeness += 1
if len(actor.get('movies', [])) > 5: completeness += 2 # 作品丰富的演员更可能是知名演员
return completeness
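消歧器的调用方式大致如下(示意,person_entities、movie_entities 沿用前面步骤的结果):
disambiguator = ActorDisambiguator()
clean_persons = disambiguator.disambiguate_actors(person_entities, movie_entities)

merged = [p for p in clean_persons if 'disambiguation_notes' in p]
print(f"消歧前 {len(person_entities)} 条人员记录,消歧后 {len(clean_persons)} 条,"
      f"其中 {len(merged)} 条由同名记录合并得到")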
2. 实体去重和标准化
统一实体表示和去重:
class EntityDeduplicator:
def __init__(self):
self.entity_index = {}
self.name_variants = {} # 存储姓名变体
def deduplicate_entities(self, entities: Dict[str, List]) -> Dict[str, List]:
"""实体去重"""
deduplicated = {}
for entity_type, entity_list in entities.items():
if entity_type == 'persons':
# 人员实体需要特殊处理(考虑同名消歧)
deduplicated[entity_type] = self._deduplicate_persons(entity_list)
else:
deduplicated[entity_type] = self._deduplicate_generic(entity_list)
return deduplicated
def _deduplicate_persons(self, persons: List[Dict]) -> List[Dict]:
"""人员实体去重(已消歧后)"""
unique_persons = []
seen_signatures = set()
for person in persons:
signature = self._create_person_signature(person)
if signature not in seen_signatures:
seen_signatures.add(signature)
unique_persons.append(person)
else:
# 合并重复的人员信息
existing_person = self._find_person_by_signature(unique_persons, signature)
if existing_person:
self._merge_person_info(existing_person, person)
return unique_persons
def _create_person_signature(self, person: Dict) -> str:
"""创建人员签名"""
# 使用姓名、类型和主要电影创建签名
name = person.get('name', '')
person_type = person.get('person_type', '')
movies = sorted(person.get('movies', []))
main_movie = movies[0] if movies else ''
signature = f"{name}_{person_type}_{main_movie}"
return signature
def _find_person_by_signature(self, persons: List[Dict], signature: str) -> Optional[Dict]:
"""通过签名查找人员"""
for person in persons:
if self._create_person_signature(person) == signature:
return person
return None
def _merge_person_info(self, existing: Dict, new_person: Dict):
"""合并人员信息"""
# 合并电影作品
existing_movies = set(existing.get('movies', []))
new_movies = set(new_person.get('movies', []))
existing['movies'] = list(existing_movies | new_movies)
# 更新置信度
existing['confidence'] = (existing.get('confidence', 0.5) + new_person.get('confidence', 0.5)) / 2
def _deduplicate_generic(self, entities: List[Dict]) -> List[Dict]:
"""通用实体去重"""
unique_entities = []
seen_ids = set()
for entity in entities:
entity_id = entity.get('id')
if entity_id and entity_id not in seen_ids:
seen_ids.add(entity_id)
unique_entities.append(entity)
return unique_entities
3. 关系冲突解决
处理重复和冲突的关系:
class RelationshipConflictResolver:
def __init__(self):
self.relationship_index = {}
def resolve_relationship_conflicts(self, relationships: Dict[str, List]) -> Dict[str, List]:
"""解决关系冲突"""
resolved = {}
for rel_type, rel_list in relationships.items():
if rel_type in ['actor_collaborations', 'director_collaborations']:
# 合作关系需要特殊处理
resolved[rel_type] = self._resolve_collaboration_conflicts(rel_list)
else:
resolved[rel_type] = self._resolve_generic_conflicts(rel_list)
return resolved
def _resolve_collaboration_conflicts(self, collaborations: List[Dict]) -> List[Dict]:
"""解决合作关系冲突"""
resolved_collaborations = []
seen_pairs = set()
for collab in collaborations:
# 创建合作对的标准化键
pair_key = self._create_collaboration_key(collab)
if pair_key not in seen_pairs:
seen_pairs.add(pair_key)
resolved_collaborations.append(collab)
else:
# 合并合作信息
existing_collab = self._find_collaboration_by_key(resolved_collaborations, pair_key)
if existing_collab:
self._merge_collaboration_info(existing_collab, collab)
return resolved_collaborations
def _create_collaboration_key(self, collaboration: Dict) -> str:
"""创建合作关系键"""
source_id = collaboration.get('source_id', '')
target_id = collaboration.get('target_id', '')
# 确保键的无向性
if source_id < target_id:
return f"{source_id}_{target_id}"
else:
return f"{target_id}_{source_id}"
def _find_collaboration_by_key(self, collaborations: List[Dict], key: str) -> Optional[Dict]:
"""通过键查找合作关系"""
for collab in collaborations:
if self._create_collaboration_key(collab) == key:
return collab
return None
def _merge_collaboration_info(self, existing: Dict, new_collab: Dict):
"""合并合作信息"""
existing_props = existing.get('properties', {})
new_props = new_collab.get('properties', {})
# 合并合作次数
existing_count = existing_props.get('collaboration_count', 1)
new_count = new_props.get('collaboration_count', 1)
existing_props['collaboration_count'] = existing_count + new_count
# 合并电影列表
existing_movies = set(existing_props.get('movies', []))
new_movies = set(new_props.get('movies', []))
existing_props['movies'] = list(existing_movies | new_movies)
def _resolve_generic_conflicts(self, relationships: List[Dict]) -> List[Dict]:
"""解决通用关系冲突"""
resolved_relationships = []
seen_signatures = set()
for rel in relationships:
signature = self._create_relationship_signature(rel)
if signature not in seen_signatures:
seen_signatures.add(signature)
resolved_relationships.append(rel)
return resolved_relationships
def _create_relationship_signature(self, relationship: Dict) -> str:
"""创建关系签名"""
rel_type = relationship.get('relation_type', '')
source_id = relationship.get('source_id', '')
target_id = relationship.get('target_id', '')
return f"{rel_type}_{source_id}_{target_id}"
4. 跨语言名称统一
处理中英文名称对应关系:
class CrossLanguageNameUnifier:
def __init__(self):
self.name_mappings = {} # 中英文名称映射
self.english_name_patterns = {} # 英文名模式匹配
def unify_cross_language_names(self, entities: Dict[str, List]) -> Dict[str, List]:
"""统一跨语言名称"""
unified_entities = {}
for entity_type, entity_list in entities.items():
unified = []
for entity in entity_list:
unified_entity = self._unify_entity_names(entity)
unified.append(unified_entity)
unified_entities[entity_type] = unified
return unified_entities
def _unify_entity_names(self, entity: Dict) -> Dict:
"""统一实体名称"""
unified = entity.copy()
if entity.get('entity_type') == 'Person':
# 处理人员名称统一
name = entity.get('name', '')
name_en = entity.get('name_en', '')
if name and name_en:
# 建立中英文映射
self.name_mappings[name] = name_en
self.name_mappings[name_en] = name
# 添加标准化名称
unified['standardized_name'] = self._standardize_person_name(name, name_en)
elif entity.get('entity_type') == 'Movie':
# 处理电影名称统一
title = entity.get('title', '')
original_title = entity.get('original_title', '')
unified['standardized_title'] = self._standardize_movie_title(title, original_title)
return unified
def _standardize_person_name(self, chinese_name: str, english_name: str) -> str:
"""标准化人员名称"""
if chinese_name and english_name:
return f"{chinese_name} ({english_name})"
elif chinese_name:
return chinese_name
elif english_name:
return english_name
else:
return "未知"
def _standardize_movie_title(self, chinese_title: str, original_title: str) -> str:
"""标准化电影标题"""
if chinese_title and original_title:
return f"{chinese_title} / {original_title}"
elif chinese_title:
return chinese_title
elif original_title:
return original_title
else:
return "未知电影"
通过这些知识融合和消歧技术,你可以确保豆瓣电影知识图谱中实体和关系的质量,为电影推荐系统提供准确可靠的数据基础。
第七步:知识存储和建模
知识存储和建模是将豆瓣电影数据存储到图数据库中的过程。对于电影推荐系统,你需要选择合适的图数据库并设计高效的数据模型。
1. 图数据库选型
电影推荐系统图数据库推荐:
| 数据库 | 优势 | 适用场景 | 推荐指数 |
| --- | --- | --- | --- |
| Neo4j | 查询性能优异,Cypher直观,支持ACID | 电影推荐核心功能,复杂关系查询 | ⭐⭐⭐⭐⭐ |
| Amazon Neptune | 云原生,完全托管,高可用性 | 云环境部署,需要企业级支持 | ⭐⭐⭐⭐ |
| JanusGraph | 分布式,可扩展,海量数据 | 大规模电影数据(百万级) | ⭐⭐⭐⭐ |
| ArangoDB | 多模型支持,灵活性高 | 需要文档数据库混合使用 | ⭐⭐⭐ |
推荐选择Neo4j,理由:
- 成熟的电影推荐案例丰富
- Cypher查询语言直观易用
- 性能优异,支持复杂图算法
- 社区活跃,资料丰富
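选定 Neo4j 后,可以先用官方 Python 驱动验证连通性,再进行建模和导入(连接地址、账号均为示例值):
from neo4j import GraphDatabase

uri = "bolt://localhost:7687"      # 示例地址
auth = ("neo4j", "your_password")  # 示例账号

driver = GraphDatabase.driver(uri, auth=auth)
driver.verify_connectivity()  # 较新版本驱动提供,连接失败会抛出异常
print("Neo4j 连接正常")
driver.close()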
2. Neo4j数据模型设计
电影知识图谱的节点和关系设计:
from neo4j import GraphDatabase
import json
from typing import Dict
class MovieKnowledgeGraph:
def __init__(self, uri: str, user: str, password: str):
self.driver = GraphDatabase.driver(uri, auth=(user, password))
def create_movie_node(self, movie_entity: Dict):
"""创建电影节点"""
query = """
MERGE (m:Movie {douban_id: $douban_id})
SET m.title = $title,
m.original_title = $original_title,
m.year = $year,
m.rating = $rating,
m.votes = $votes,
m.duration = $duration,
m.language = $language,
m.country = $country,
m.summary = $summary,
m.source = $source,
m.created_at = $created_at
RETURN m
"""
parameters = {
'douban_id': movie_entity.get('douban_id'),
'title': movie_entity.get('title'),
'original_title': movie_entity.get('original_title'),
'year': movie_entity.get('year'),
'rating': movie_entity.get('rating'),
'votes': movie_entity.get('votes'),
'duration': movie_entity.get('duration'),
'language': movie_entity.get('language'),
'country': movie_entity.get('country'),
'summary': movie_entity.get('summary'),
'source': movie_entity.get('source'),
'created_at': movie_entity.get('extracted_at')
}
with self.driver.session() as session:
result = session.run(query, parameters)
return result.single()[0]
def create_person_node(self, person_entity: Dict):
"""创建人员节点"""
query = """
MERGE (p:Person {name: $name})
SET p.person_type = $person_type,
p.birth_date = $birth_date,
p.birth_place = $birth_place,
p.gender = $gender,
p.biography = $biography,
p.source = $source,
p.created_at = $created_at,
p.movie_count = $movie_count
RETURN p
"""
parameters = {
'name': person_entity.get('name'),
'person_type': person_entity.get('person_type'),
'birth_date': person_entity.get('birth_date'),
'birth_place': person_entity.get('birth_place'),
'gender': person_entity.get('gender'),
'biography': person_entity.get('biography'),
'source': person_entity.get('source'),
'created_at': person_entity.get('extracted_at'),
'movie_count': len(person_entity.get('movies', []))
}
with self.driver.session() as session:
result = session.run(query, parameters)
return result.single()[0]
def create_genre_node(self, genre_entity: Dict):
"""创建类型节点"""
query = """
MERGE (g:Genre {name: $name})
SET g.movie_count = $movie_count,
g.source = $source,
g.created_at = $created_at
RETURN g
"""
parameters = {
'name': genre_entity.get('name'),
'movie_count': genre_entity.get('movie_count', 0),
'source': genre_entity.get('source'),
'created_at': genre_entity.get('extracted_at')
}
with self.driver.session() as session:
result = session.run(query, parameters)
return result.single()[0]
3. 关系创建和索引设计
创建电影关系和性能索引:
class MovieRelationshipManager:
def __init__(self, driver):
self.driver = driver
def create_director_relationship(self, director_name: str, movie_douban_id: str):
"""创建导演关系"""
query = """
MATCH (p:Person {name: $director_name, person_type: 'director'})
MATCH (m:Movie {douban_id: $movie_douban_id})
MERGE (p)-[r:DIRECTED_BY]->(m)
SET r.created_at = $created_at
RETURN r
"""
parameters = {
'director_name': director_name,
'movie_douban_id': movie_douban_id,
'created_at': self._get_current_timestamp()
}
with self.driver.session() as session:
session.run(query, parameters)
def create_actor_relationship(self, actor_name: str, movie_douban_id: str, character_name: str = None):
"""创建演员关系"""
query = """
MATCH (p:Person {name: $actor_name, person_type: 'actor'})
MATCH (m:Movie {douban_id: $movie_douban_id})
MERGE (p)-[r:ACTED_IN]->(m)
SET r.character_name = $character_name,
r.created_at = $created_at
RETURN r
"""
parameters = {
'actor_name': actor_name,
'movie_douban_id': movie_douban_id,
'character_name': character_name,
'created_at': self._get_current_timestamp()
}
with self.driver.session() as session:
session.run(query, parameters)
def create_genre_relationship(self, movie_douban_id: str, genre_name: str):
"""创建类型关系"""
query = """
MATCH (m:Movie {douban_id: $movie_douban_id})
MATCH (g:Genre {name: $genre_name})
MERGE (m)-[r:BELONGS_TO_GENRE]->(g)
SET r.created_at = $created_at
RETURN r
"""
parameters = {
'movie_douban_id': movie_douban_id,
'genre_name': genre_name,
'created_at': self._get_current_timestamp()
}
with self.driver.session() as session:
session.run(query, parameters)
def create_collaboration_relationship(self, actor1_name: str, actor2_name: str, collaboration_count: int):
"""创建演员合作关系"""
query = """
MATCH (p1:Person {name: $actor1_name, person_type: 'actor'})
MATCH (p2:Person {name: $actor2_name, person_type: 'actor'})
MERGE (p1)-[r:COLLABORATED_WITH]-(p2)
SET r.collaboration_count = $collaboration_count,
r.created_at = $created_at
RETURN r
"""
parameters = {
'actor1_name': actor1_name,
'actor2_name': actor2_name,
'collaboration_count': collaboration_count,
'created_at': self._get_current_timestamp()
}
with self.driver.session() as session:
session.run(query, parameters)
def create_indexes(self):
"""创建性能索引"""
index_queries = [
# 电影索引
"CREATE INDEX movie_douban_id_idx IF NOT EXISTS FOR (m:Movie) ON (m.douban_id)",
"CREATE INDEX movie_title_idx IF NOT EXISTS FOR (m:Movie) ON (m.title)",
"CREATE INDEX movie_year_idx IF NOT EXISTS FOR (m:Movie) ON (m.year)",
"CREATE INDEX movie_rating_idx IF NOT EXISTS FOR (m:Movie) ON (m.rating)",
            # 人员索引与唯一性约束(唯一性约束会自动创建 name 的后备索引,不必再单独建同名索引)
            "CREATE INDEX person_type_idx IF NOT EXISTS FOR (p:Person) ON (p.person_type)",
            "CREATE CONSTRAINT person_name_unique IF NOT EXISTS FOR (p:Person) REQUIRE p.name IS UNIQUE",
            # 类型唯一性约束
            "CREATE CONSTRAINT genre_name_unique IF NOT EXISTS FOR (g:Genre) REQUIRE g.name IS UNIQUE",
            # 关系属性索引(FOR 后的关系模式必须是无方向的)
            "CREATE INDEX directed_by_movie_idx IF NOT EXISTS FOR ()-[r:DIRECTED_BY]-() ON (r.created_at)",
            "CREATE INDEX acted_in_movie_idx IF NOT EXISTS FOR ()-[r:ACTED_IN]-() ON (r.created_at)",
            "CREATE INDEX collaboration_count_idx IF NOT EXISTS FOR ()-[r:COLLABORATED_WITH]-() ON (r.collaboration_count)"
]
with self.driver.session() as session:
for query in index_queries:
try:
session.run(query)
print(f"创建索引成功: {query}")
except Exception as e:
print(f"创建索引失败: {e}")
def _get_current_timestamp(self) -> str:
"""获取当前时间戳"""
from datetime import datetime
return datetime.now().isoformat()
4. 批量数据导入优化
高效的数据导入策略:
import time
from typing import List, Dict, Any
class BatchMovieImporter:
def __init__(self, driver, batch_size: int = 100):
self.driver = driver
self.batch_size = batch_size
def import_movies_batch(self, movie_entities: List[Dict]):
"""批量导入电影"""
def create_movies_batch(tx, movie_batch):
for movie in movie_batch:
query = """
MERGE (m:Movie {douban_id: $douban_id})
SET m.title = $title,
m.year = $year,
m.rating = $rating,
m.votes = $votes,
m.duration = $duration,
m.language = $language,
m.country = $country,
m.summary = $summary,
m.created_at = $created_at
"""
tx.run(query, **movie)
# 分批处理
for i in range(0, len(movie_entities), self.batch_size):
batch = movie_entities[i:i + self.batch_size]
with self.driver.session() as session:
session.execute_write(create_movies_batch, batch)
print(f"已导入电影批次: {i//self.batch_size + 1}")
time.sleep(0.1) # 避免过度压力
def import_persons_batch(self, person_entities: List[Dict]):
"""批量导入人员"""
def create_persons_batch(tx, person_batch):
for person in person_batch:
query = """
MERGE (p:Person {name: $name})
SET p.person_type = $person_type,
p.movie_count = $movie_count,
p.created_at = $created_at
"""
tx.run(query, **person)
# 分批处理
for i in range(0, len(person_entities), self.batch_size):
batch = person_entities[i:i + self.batch_size]
with self.driver.session() as session:
session.execute_write(create_persons_batch, batch)
print(f"已导入人员批次: {i//self.batch_size + 1}")
time.sleep(0.1)
def import_relationships_batch(self, relationships: List[Dict]):
"""批量导入关系"""
def create_relationships_batch(tx, rel_batch):
for rel in rel_batch:
if rel['relation_type'] == 'DIRECTED_BY':
query = """
MATCH (p:Person {name: $director_name, person_type: 'director'})
MATCH (m:Movie {douban_id: $movie_douban_id})
MERGE (p)-[r:DIRECTED_BY]->(m)
"""
elif rel['relation_type'] == 'ACTED_IN':
query = """
MATCH (p:Person {name: $actor_name, person_type: 'actor'})
MATCH (m:Movie {douban_id: $movie_douban_id})
MERGE (p)-[r:ACTED_IN]->(m)
"""
elif rel['relation_type'] == 'BELONGS_TO_GENRE':
query = """
MATCH (m:Movie {douban_id: $movie_douban_id})
MATCH (g:Genre {name: $genre_name})
MERGE (m)-[r:BELONGS_TO_GENRE]->(g)
"""
                else:
                    # 未识别的关系类型,跳过
                    continue
                # 注意:rel['properties'] 需要包含查询中引用的参数键
                # (如 director_name / actor_name / movie_douban_id / genre_name)
                tx.run(query, **rel['properties'])
# 分批处理
for i in range(0, len(relationships), self.batch_size):
batch = relationships[i:i + self.batch_size]
with self.driver.session() as session:
session.execute_write(create_relationships_batch, batch)
print(f"已导入关系批次: {i//self.batch_size + 1}")
time.sleep(0.1)
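上面的批量导入是在事务内逐条执行 Cypher。数据量更大时,常见的优化是用 UNWIND 把整批参数交给一条语句处理,减少语句解析和网络往返(示意,字段沿用前文的电影属性,函数名为自拟):
def import_movies_unwind(driver, movie_entities: List[Dict], batch_size: int = 500):
    """用 UNWIND 一次写入一批电影节点(示意)"""
    query = """
    UNWIND $rows AS row
    MERGE (m:Movie {douban_id: row.douban_id})
    SET m.title = row.title,
        m.year = row.year,
        m.rating = row.rating
    """
    with driver.session() as session:
        for i in range(0, len(movie_entities), batch_size):
            batch = movie_entities[i:i + batch_size]
            session.execute_write(lambda tx, rows=batch: tx.run(query, rows=rows))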
5. 电影推荐查询优化
为推荐算法设计的查询模板:
class MovieRecommendationQueries:
def __init__(self, driver):
self.driver = driver
def get_similar_movies_by_genre(self, movie_douban_id: str, limit: int = 10):
"""基于类型的相似电影推荐"""
query = """
MATCH (m1:Movie {douban_id: $movie_douban_id})-[:BELONGS_TO_GENRE]->(g:Genre)<-[:BELONGS_TO_GENRE]-(m2:Movie)
WHERE m1 <> m2
WITH m2, count(g) as common_genres
ORDER BY common_genres DESC, m2.rating DESC
LIMIT $limit
RETURN m2.title as title, m2.rating as rating, m2.year as year, common_genres
"""
with self.driver.session() as session:
result = session.run(query, movie_douban_id=movie_douban_id, limit=limit)
return [record.data() for record in result]
def get_movies_by_actor_collaborators(self, actor_name: str, limit: int = 10):
"""基于演员合作者的电影推荐"""
query = """
MATCH (a1:Person {name: $actor_name, person_type: 'actor'})-[:COLLABORATED_WITH]-(a2:Person)-[:ACTED_IN]->(m:Movie)
WHERE NOT (a1)-[:ACTED_IN]->(m)
WITH m, count(a2) as collaborator_count, avg(a2.movie_count) as avg_collaborator_movies
ORDER BY collaborator_count DESC, avg_collaborator_movies DESC, m.rating DESC
LIMIT $limit
RETURN m.title as title, m.rating as rating, collaborator_count, avg_collaborator_movies
"""
with self.driver.session() as session:
result = session.run(query, actor_name=actor_name, limit=limit)
return [record.data() for record in result]
def get_movies_by_director_style(self, director_name: str, limit: int = 10):
"""基于导演风格的电影推荐"""
query = """
MATCH (d:Person {name: $director_name, person_type: 'director'})-[:DIRECTED_BY]->(m1:Movie)
MATCH (m1)-[:BELONGS_TO_GENRE]->(g:Genre)<-[:BELONGS_TO_GENRE]-(m2:Movie)
MATCH (m1)-[:ACTED_IN]-(a:Person)-[:ACTED_IN]-(m2:Movie)
WHERE NOT (d)-[:DIRECTED_BY]->(m2)
WITH m2,
count(DISTINCT g) as common_genres,
count(DISTINCT a) as common_actors,
avg(m1.rating) as director_avg_rating
ORDER BY common_genres DESC, common_actors DESC, director_avg_rating DESC, m2.rating DESC
LIMIT $limit
RETURN m2.title as title,
m2.rating as rating,
common_genres,
common_actors,
director_avg_rating
"""
with self.driver.session() as session:
result = session.run(query, director_name=director_name, limit=limit)
return [record.data() for record in result]
def get_trending_movies_by_genre(self, genre_name: str, limit: int = 10):
"""获取类型热门电影"""
query = """
MATCH (m:Movie)-[:BELONGS_TO_GENRE]->(g:Genre {name: $genre_name})
WHERE m.year >= $min_year
RETURN m.title as title,
m.rating as rating,
m.votes as votes,
m.year as year
ORDER BY m.rating DESC, m.votes DESC
LIMIT $limit
"""
# 过去5年的电影
min_year = str(int(time.strftime("%Y")) - 5)
with self.driver.session() as session:
result = session.run(query, genre_name=genre_name, min_year=min_year, limit=limit)
return [record.data() for record in result]
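这些查询模板的调用方式示例(电影的豆瓣 ID、人名均为示例值):
queries = MovieRecommendationQueries(driver)

for rec in queries.get_similar_movies_by_genre("1292052", limit=5):  # 示例豆瓣ID
    print(rec['title'], rec['rating'], rec['common_genres'])

for rec in queries.get_movies_by_director_style("李安", limit=5):  # 示例导演名
    print(rec['title'], rec['rating'])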
6. 数据验证和维护
确保知识图谱数据质量:
class MovieGraphValidator:
def __init__(self, driver):
self.driver = driver
def validate_graph_integrity(self):
"""验证图数据完整性"""
validation_queries = [
# 检查电影节点完整性
"""
MATCH (m:Movie)
WHERE m.title IS NULL OR m.douban_id IS NULL OR m.year IS NULL
RETURN count(m) as incomplete_movies
""",
# 检查人员节点完整性
"""
MATCH (p:Person)
WHERE p.name IS NULL OR p.person_type IS NULL
RETURN count(p) as incomplete_persons
""",
# 检查孤立节点
"""
MATCH (m:Movie)
WHERE NOT (m)--()
RETURN count(m) as isolated_movies
""",
# 检查数据一致性
"""
MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
WHERE p.person_type <> 'actor'
RETURN count(r) as invalid_actor_relations
""",
# 检查导演关系一致性
"""
MATCH (p:Person)-[r:DIRECTED_BY]->(m:Movie)
WHERE p.person_type <> 'director'
RETURN count(r) as invalid_director_relations
"""
]
with self.driver.session() as session:
for i, query in enumerate(validation_queries):
result = session.run(query)
record = result.single()
if record and any(value > 0 for value in record.values()):
print(f"数据质量问题 {i+1}: {record.values()}")
def cleanup_orphaned_data(self):
"""清理孤立和无效数据"""
cleanup_queries = [
# 删除没有标题的电影
"MATCH (m:Movie) WHERE m.title IS NULL DETACH DELETE m",
# 删除没有姓名的演员
"MATCH (p:Person) WHERE p.name IS NULL DETACH DELETE p",
# 删除评分过低且投票数少的电影(可选项)
"""
MATCH (m:Movie)
WHERE m.rating < 3.0 AND m.votes < 100
DETACH DELETE m
""",
# 删除没有合作关系的演员合作边(如果有的话)
"""
MATCH (p1:Person)-[r:COLLABORATED_WITH]-(p2:Person)
WHERE r.collaboration_count < 1
DELETE r
"""
]
with self.driver.session() as session:
for query in cleanup_queries:
result = session.run(query)
summary = result.consume()
print(f"清理了 {summary.counters.nodes_deleted} 个节点, "
f"{summary.counters.relationships_deleted} 个关系")
def generate_statistics_report(self):
"""生成图谱统计报告"""
stat_queries = [
("总电影数", "MATCH (m:Movie) RETURN count(m) as count"),
("总人员数", "MATCH (p:Person) RETURN count(p) as count"),
("演员数量", "MATCH (p:Person {person_type: 'actor'}) RETURN count(p) as count"),
("导演数量", "MATCH (p:Person {person_type: 'director'}) RETURN count(p) as count"),
("电影类型数", "MATCH (g:Genre) RETURN count(g) as count"),
("导演关系数", "MATCH ()-[r:DIRECTED_BY]->() RETURN count(r) as count"),
("演员关系数", "MATCH ()-[r:ACTED_IN]->() RETURN count(r) as count"),
("合作关系数", "MATCH ()-[r:COLLABORATED_WITH]-() RETURN count(r) as count"),
("平均评分", "MATCH (m:Movie) WHERE m.rating IS NOT NULL RETURN avg(toFloat(m.rating)) as avg"),
("最高评分电影", "MATCH (m:Movie) WHERE m.rating IS NOT NULL RETURN m.title, m.rating ORDER BY m.rating DESC LIMIT 1")
]
report = ["=== 电影知识图谱统计报告 ==="]
with self.driver.session() as session:
for stat_name, query in stat_queries:
result = session.run(query)
record = result.single()
if record:
                        # 部分统计(如"最高评分电影")会返回多个字段,这里全部拼接输出
                        value = ", ".join(str(v) for v in record.values())
                        report.append(f"{stat_name}: {value}")
return "\n".join(report)
通过合理的数据存储和建模,你可以高效地存储和管理豆瓣电影知识图谱,为电影推荐系统提供强大的数据支撑。
第八步:质量评估和验证
质量评估和验证是确保豆瓣电影知识图谱可靠性的关键步骤。你需要从多个维度评估知识图谱的质量,为电影推荐系统提供高质量的数据支撑。
1. 数据完整性评估
评估实体和关系的覆盖率:
from typing import Dict

class CompletenessEvaluator:
def __init__(self, driver):
self.driver = driver
def evaluate_completeness(self) -> Dict[str, float]:
"""评估数据完整性"""
metrics = {}
with self.driver.session() as session:
# 基本统计
result = session.run("MATCH (n) RETURN count(n) as total_nodes")
metrics['total_nodes'] = result.single()['total_nodes']
result = session.run("MATCH ()-[r]->() RETURN count(r) as total_relationships")
metrics['total_relationships'] = result.single()['total_relationships']
# 实体类型分布
result = session.run("""
MATCH (n)
RETURN labels(n) as labels, count(n) as count
""")
entity_types = {record['labels'][0]: record['count'] for record in result}
metrics['entity_distribution'] = entity_types
# 关系类型分布
result = session.run("""
MATCH ()-[r]->()
RETURN type(r) as rel_type, count(r) as count
""")
rel_types = {record['rel_type']: record['count'] for record in result}
metrics['relationship_distribution'] = rel_types
# 计算密度指标
if metrics['total_nodes'] > 0:
metrics['relationship_per_node'] = metrics['total_relationships'] / metrics['total_nodes']
# 检查孤立节点比例
result = session.run("""
MATCH (n)
WHERE NOT (n)--()
RETURN count(n) as isolated_count
""")
isolated_count = result.single()['isolated_count']
metrics['isolated_node_ratio'] = isolated_count / metrics['total_nodes']
# 检查电影的完整性(必须属性)
result = session.run("""
MATCH (m:Movie)
WHERE m.title IS NOT NULL
AND m.year IS NOT NULL
AND m.rating IS NOT NULL
RETURN count(m) as complete_movies
""")
complete_movies = result.single()['complete_movies']
total_movies = entity_types.get('Movie', 0)
metrics['movie_completeness_ratio'] = complete_movies / total_movies if total_movies > 0 else 0
return metrics
def generate_completeness_report(self, metrics: Dict[str, float]) -> str:
"""生成完整性报告"""
report = []
report.append("=== 电影知识图谱完整性评估报告 ===")
report.append(f"总节点数: {metrics['total_nodes']}")
report.append(f"总关系数: {metrics['total_relationships']}")
report.append(f"平均每节点关系数: {metrics.get('relationship_per_node', 0)".2f"}")
report.append(f"孤立节点比例: {metrics.get('isolated_node_ratio', 0)".2%"}")
report.append(f"电影完整性比例: {metrics.get('movie_completeness_ratio', 0)".2%"}")
report.append("")
report.append("实体类型分布:")
for entity_type, count in metrics['entity_distribution'].items():
report.append(f" {entity_type}: {count}")
report.append("")
report.append("关系类型分布:")
for rel_type, count in metrics['relationship_distribution'].items():
report.append(f" {rel_type}: {count}")
return "\n".join(report)
2. 数据准确性验证
验证抽取结果的准确性:
from typing import Dict

class AccuracyValidator:
def __init__(self, driver, douban_scraper):
self.driver = driver
self.douban_scraper = douban_scraper
def validate_movie_accuracy(self, sample_size: int = 50) -> Dict[str, float]:
"""验证电影数据准确性"""
validation_results = {'correct': 0, 'incorrect': 0, 'total': 0}
with self.driver.session() as session:
# 随机采样电影进行验证
result = session.run(f"""
MATCH (m:Movie)
RETURN m.douban_id as douban_id,
m.title as title,
m.year as year,
m.rating as rating
ORDER BY rand()
LIMIT {sample_size}
""")
for record in result:
is_correct = self._validate_single_movie(record)
if is_correct:
validation_results['correct'] += 1
else:
validation_results['incorrect'] += 1
validation_results['total'] += 1
        # 计算准确率(样本为空时避免除零)
        total = validation_results['total']
        validation_results['accuracy'] = validation_results['correct'] / total if total > 0 else 0.0
return validation_results
def _validate_single_movie(self, movie_record) -> bool:
"""验证单个电影的准确性"""
try:
douban_id = movie_record['douban_id']
# 从豆瓣获取最新数据
html_content = self._get_movie_html_from_douban(douban_id)
if not html_content:
return False
# 解析豆瓣页面
douban_data = self._parse_douban_movie_page(html_content)
# 与图谱数据比较
return self._compare_movie_data(movie_record, douban_data)
except Exception as e:
print(f"验证电影失败 {movie_record['douban_id']}: {e}")
return False
def _get_movie_html_from_douban(self, douban_id: str) -> str:
"""从豆瓣获取电影页面HTML"""
# 这里应该调用豆瓣抓取器获取数据
# 注意遵守豆瓣的robots.txt和频率限制
pass
def _parse_douban_movie_page(self, html: str) -> Dict:
"""解析豆瓣电影页面"""
# 解析HTML获取电影信息
# 返回标准化的电影数据
pass
def _compare_movie_data(self, graph_data: Dict, douban_data: Dict) -> bool:
"""比较图谱数据与豆瓣数据的差异"""
# 比较关键字段
key_fields = ['title', 'year', 'rating']
for field in key_fields:
graph_value = str(graph_data.get(field, '')).strip()
douban_value = str(douban_data.get(field, '')).strip()
# 允许一定程度的差异(比如评分的小数点差异)
if field == 'rating':
try:
graph_rating = float(graph_value)
douban_rating = float(douban_value)
if abs(graph_rating - douban_rating) > 0.1: # 差异超过0.1分认为不准确
return False
                except (TypeError, ValueError):
if graph_value != douban_value:
return False
else:
if graph_value != douban_value:
return False
return True
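上面两个占位方法(_get_movie_html_from_douban 和 _parse_douban_movie_page)需要结合前文的抓取器来补全。下面给出一种参考写法(仅为示意:假设抓取器对象暴露了 requests 会话 session 和页面解析方法 _parse_movie_page,且访问频率、robots.txt 等合规要求已另行处理;若方法名不同,请按实际实现调整):
import time
import random
from typing import Dict

class DoubanAccuracyValidator(AccuracyValidator):
    """AccuracyValidator 的一种参考实现(示意)"""
    def _get_movie_html_from_douban(self, douban_id: str) -> str:
        url = f"https://movie.douban.com/subject/{douban_id}/"
        try:
            # 随机延时,降低对目标站点的访问压力(具体策略以站点政策为准)
            time.sleep(random.uniform(2, 5))
            response = self.douban_scraper.session.get(url, timeout=15)
            if response.status_code == 200:
                return response.text
        except Exception as e:
            print(f"获取页面失败 {douban_id}: {e}")
        return ''

    def _parse_douban_movie_page(self, html: str) -> Dict:
        # 直接委托给抓取器已有的解析方法(假设其返回包含 title/year/rating 的字典)
        return self.douban_scraper._parse_movie_page(html)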
3. 推荐效果评估
评估基于知识图谱的推荐质量:
from typing import Dict, List, Set

class RecommendationEvaluator:
def __init__(self, driver):
self.driver = driver
def evaluate_recommendation_quality(self, test_movies: List[str], top_k: int = 10) -> Dict[str, float]:
"""评估推荐质量"""
metrics = {
'precision': 0.0,
'recall': 0.0,
'f1_score': 0.0,
'coverage': 0.0,
'novelty': 0.0
}
total_precision = 0.0
total_recall = 0.0
all_recommended = set()
all_relevant = set()
for movie_douban_id in test_movies:
# 获取推荐结果
recommended = self._get_recommendations(movie_douban_id, top_k)
# 获取相关电影(基于类型和演员重叠度)
relevant = self._get_relevant_movies(movie_douban_id, top_k * 2)
# 计算精确率和召回率
recommended_set = set(recommended)
relevant_set = set(relevant)
precision = len(recommended_set & relevant_set) / len(recommended_set) if recommended_set else 0
recall = len(recommended_set & relevant_set) / len(relevant_set) if relevant_set else 0
total_precision += precision
total_recall += recall
all_recommended.update(recommended)
all_relevant.update(relevant)
# 平均精确率和召回率
n = len(test_movies)
avg_precision = total_precision / n
avg_recall = total_recall / n
metrics['precision'] = avg_precision
metrics['recall'] = avg_recall
metrics['f1_score'] = 2 * avg_precision * avg_recall / (avg_precision + avg_recall) if (avg_precision + avg_recall) > 0 else 0
# 覆盖率(推荐电影的多样性)
total_movies = self._get_total_movie_count()
metrics['coverage'] = len(all_recommended) / total_movies if total_movies > 0 else 0
# 新颖性(推荐冷门电影的比例)
metrics['novelty'] = self._calculate_novelty_score(all_recommended)
return metrics
def _get_recommendations(self, movie_douban_id: str, top_k: int) -> List[str]:
"""获取电影推荐"""
query = """
MATCH (m1:Movie {douban_id: $movie_douban_id})-[:BELONGS_TO_GENRE]->(g:Genre)<-[:BELONGS_TO_GENRE]-(m2:Movie)
WHERE m1 <> m2
WITH m2, count(g) as common_genres
ORDER BY common_genres DESC, m2.rating DESC
LIMIT $top_k
RETURN m2.douban_id as douban_id
"""
with self.driver.session() as session:
result = session.run(query, movie_douban_id=movie_douban_id, top_k=top_k)
return [record['douban_id'] for record in result]
def _get_relevant_movies(self, movie_douban_id: str, limit: int) -> List[str]:
"""获取相关电影(作为基准)"""
query = """
MATCH (m1:Movie {douban_id: $movie_douban_id})
MATCH (m1)-[:BELONGS_TO_GENRE]->(g:Genre)<-[:BELONGS_TO_GENRE]-(m2:Movie)
WHERE m1 <> m2 AND m2.rating > 7.0
RETURN m2.douban_id as douban_id
ORDER BY m2.rating DESC
LIMIT $limit
"""
with self.driver.session() as session:
result = session.run(query, movie_douban_id=movie_douban_id, limit=limit)
return [record['douban_id'] for record in result]
def _get_total_movie_count(self) -> int:
"""获取电影总数"""
with self.driver.session() as session:
result = session.run("MATCH (m:Movie) RETURN count(m) as count")
return result.single()['count']
def _calculate_novelty_score(self, recommended_movies: Set[str]) -> float:
"""计算新颖性得分"""
# 新颖性基于电影的冷门程度(评分人数少的电影更具新颖性)
query = """
MATCH (m:Movie)
WHERE m.douban_id IN $movie_ids
RETURN m.douban_id as douban_id,
CASE WHEN toInteger(m.votes) < 1000 THEN 1.0
WHEN toInteger(m.votes) < 10000 THEN 0.7
ELSE 0.4 END as novelty_score
"""
with self.driver.session() as session:
result = session.run(query, movie_ids=list(recommended_movies))
scores = [record['novelty_score'] for record in result]
return sum(scores) / len(scores) if scores else 0.0
def generate_recommendation_report(self, metrics: Dict[str, float]) -> str:
"""生成推荐质量报告"""
report = []
report.append("=== 电影推荐质量评估报告 ===")
report.append(f"精确率 (Precision): {metrics['precision']".3f"}")
report.append(f"召回率 (Recall): {metrics['recall']".3f"}")
report.append(f"F1得分: {metrics['f1_score']".3f"}")
report.append(f"覆盖率 (Coverage): {metrics['coverage']".3f"}")
report.append(f"新颖性 (Novelty): {metrics['novelty']".3f"}")
report.append("")
# 质量评级
if metrics['f1_score'] > 0.7:
report.append("推荐质量评级: 优秀")
elif metrics['f1_score'] > 0.5:
report.append("推荐质量评级: 良好")
elif metrics['f1_score'] > 0.3:
report.append("推荐质量评级: 一般")
else:
report.append("推荐质量评级: 需要改进")
return "\n".join(report)
4. 一致性检查
检查知识图谱内部的一致性:
from typing import Dict, List

class ConsistencyChecker:
def __init__(self, driver):
self.driver = driver
def check_consistency(self) -> Dict[str, List[str]]:
"""检查一致性问题"""
issues = {
'data_conflicts': [],
'logic_errors': [],
'missing_links': [],
'format_issues': []
}
with self.driver.session() as session:
# 检查数据冲突
data_conflicts = self._check_data_conflicts(session)
issues['data_conflicts'] = data_conflicts
# 检查逻辑错误
logic_errors = self._check_logic_errors(session)
issues['logic_errors'] = logic_errors
# 检查缺失链接
missing_links = self._check_missing_links(session)
issues['missing_links'] = missing_links
# 检查格式问题
format_issues = self._check_format_issues(session)
issues['format_issues'] = format_issues
return issues
def _check_data_conflicts(self, session) -> List[str]:
"""检查数据冲突"""
issues = []
# 检查电影重复标题但不同年份的情况
result = session.run("""
MATCH (m1:Movie), (m2:Movie)
WHERE m1.title = m2.title
AND m1 <> m2
AND m1.year <> m2.year
RETURN m1.title as title, m1.year as year1, m2.year as year2
""")
for record in result:
issues.append(f"电影同名不同年份: {record['title']} ({record['year1']} vs {record['year2']})")
return issues
def _check_logic_errors(self, session) -> List[str]:
"""检查逻辑错误"""
issues = []
        # 检查同一人在同一部电影中既是导演又是演员的情况(现实中可能属正常现象,应人工确认而非直接判错)
result = session.run("""
MATCH (p:Person)-[:DIRECTED_BY]->(m:Movie)
MATCH (p)-[:ACTED_IN]->(m)
RETURN p.name as person, m.title as movie
""")
for record in result:
issues.append(f"导演兼演员: {record['person']} 在 {record['movie']} 中既是导演又是演员")
return issues
def _check_missing_links(self, session) -> List[str]:
"""检查缺失链接"""
issues = []
# 检查有演员但没有导演的电影
result = session.run("""
MATCH (m:Movie)
WHERE NOT (m)<-[:DIRECTED_BY]-()
AND (m)<-[:ACTED_IN]-()
RETURN m.title as movie
LIMIT 10
""")
for record in result:
issues.append(f"缺少导演信息: {record['movie']}")
return issues
def _check_format_issues(self, session) -> List[str]:
"""检查格式问题"""
issues = []
# 检查年份格式
result = session.run("""
MATCH (m:Movie)
WHERE m.year IS NOT NULL
AND NOT m.year =~ '\\d{4}'
RETURN m.title as movie, m.year as year
""")
for record in result:
issues.append(f"年份格式错误: {record['movie']} - {record['year']}")
return issues
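一致性检查的结果是按问题类别分组的列表,可以像下面这样汇总输出(仅为示意):
# driver 为前文创建的 neo4j 驱动实例(假设已存在)
checker = ConsistencyChecker(driver)
issues = checker.check_consistency()

for category, items in issues.items():
    print(f"[{category}] 共 {len(items)} 条")
    for item in items[:5]:   # 每类只展示前 5 条,避免输出过长
        print("  -", item)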
5. 性能基准测试
评估查询性能:
import time
from typing import Dict

class PerformanceBenchmark:
def __init__(self, driver):
self.driver = driver
def run_benchmarks(self) -> Dict[str, float]:
"""运行性能基准测试"""
benchmarks = {}
# 测试各种查询性能
benchmarks['movie_lookup'] = self._benchmark_movie_lookup()
benchmarks['actor_movies'] = self._benchmark_actor_movies()
benchmarks['genre_recommendation'] = self._benchmark_genre_recommendation()
benchmarks['collaboration_query'] = self._benchmark_collaboration_query()
benchmarks['complex_recommendation'] = self._benchmark_complex_recommendation()
return benchmarks
def _benchmark_movie_lookup(self) -> float:
"""基准测试电影查找性能"""
        start_time = time.time()
with self.driver.session() as session:
result = session.run("""
MATCH (m:Movie {title: $title})
RETURN m
LIMIT 1
""", title="肖申克的救赎")
result.consume()
return time.time() - start_time
def _benchmark_actor_movies(self) -> float:
"""基准测试演员电影查询性能"""
start_time = time.time()
with self.driver.session() as session:
result = session.run("""
MATCH (p:Person {name: $actor_name})-[:ACTED_IN]->(m:Movie)
RETURN m.title as title, m.year as year
ORDER BY m.year DESC
""", actor_name="莱昂纳多·迪卡普里奥")
result.consume()
return time.time() - start_time
def _benchmark_genre_recommendation(self) -> float:
"""基准测试类型推荐性能"""
start_time = time.time()
with self.driver.session() as session:
result = session.run("""
MATCH (m1:Movie {douban_id: $movie_id})-[:BELONGS_TO_GENRE]->(g:Genre)<-[:BELONGS_TO_GENRE]-(m2:Movie)
WHERE m1 <> m2
WITH m2, count(g) as common_genres
ORDER BY common_genres DESC, m2.rating DESC
LIMIT 10
RETURN count(m2) as recommendation_count
""", movie_id="1292052")
result.consume()
return time.time() - start_time
def _benchmark_collaboration_query(self) -> float:
"""基准测试合作查询性能"""
start_time = time.time()
with self.driver.session() as session:
result = session.run("""
MATCH (a1:Person {name: $actor1})-[:COLLABORATED_WITH]-(a2:Person)-[:ACTED_IN]->(m:Movie)
WHERE NOT (a1)-[:ACTED_IN]->(m)
RETURN m.title as title, count(a2) as collaborator_count
LIMIT 10
""", actor1="张国荣")
result.consume()
return time.time() - start_time
def _benchmark_complex_recommendation(self) -> float:
"""基准测试复杂推荐性能"""
start_time = time.time()
with self.driver.session() as session:
result = session.run("""
MATCH (d:Person {name: $director, person_type: 'director'})-[:DIRECTED_BY]->(m1:Movie)
MATCH (m1)-[:BELONGS_TO_GENRE]->(g:Genre)<-[:BELONGS_TO_GENRE]-(m2:Movie)
MATCH (m1)-[:ACTED_IN]-(a:Person)-[:ACTED_IN]-(m2:Movie)
WHERE NOT (d)-[:DIRECTED_BY]->(m2)
WITH m2,
count(DISTINCT g) as common_genres,
count(DISTINCT a) as common_actors,
avg(m1.rating) as director_avg_rating
ORDER BY common_genres DESC, common_actors DESC, director_avg_rating DESC, m2.rating DESC
LIMIT 10
RETURN count(m2) as recommendation_count
""", director="克里斯托弗·诺兰")
result.consume()
return time.time() - start_time
def generate_performance_report(self, benchmarks: Dict[str, float]) -> str:
"""生成性能报告"""
report = []
report.append("=== 电影知识图谱性能基准测试报告 ===")
for test_name, duration in benchmarks.items():
report.append(f"{test_name}: {duration".4f"}秒")
# 计算平均查询时间
avg_time = sum(benchmarks.values()) / len(benchmarks)
report.append(f"平均查询时间: {avg_time".4f"}秒")
# 性能评级
if avg_time < 0.1:
report.append("性能评级: 优秀")
elif avg_time < 0.5:
report.append("性能评级: 良好")
elif avg_time < 1.0:
report.append("性能评级: 一般")
else:
report.append("性能评级: 需要优化")
return "\n".join(report)
通过全面的质量评估和验证,你可以确保豆瓣电影知识图谱的可靠性和性能,为电影推荐系统提供高质量的数据支撑。