Facebook User-Information Crawler: Technical Analysis and Implementation Walkthrough

2025-10-06 18:57 tlnshuju

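Preface

In today's data-driven world, collecting and analysing social media data matters more and more. This article walks through how a Facebook user-information crawler is implemented, covering user search, information extraction, and concurrent processing.

⚠️ Disclaimer: This article is for technical study and research only. Comply with applicable laws and regulations and with the platform's terms of service.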

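Architecture Overview

The crawler consists of two core classes:

FacebookUserInfoSpider - fetches the detailed information of a single user
FacebookSearchUserSpider - handles user search and batch retrieval

Core Techniques

1. Request Header Spoofing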

__HTML_HEADERS__ = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9...",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
    "sec-ch-ua": "\"Google Chrome\";v=\"131\"...",
    # ... more headers
}

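Key points:

Reproduces the full set of request headers sent by a real browser
Includes Chrome-specific identifying fields
Sets appropriate Accept and Language values
Adds the sec-ch-* client-hint headers

2. Cookie Management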

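def cookie_str_to_dict(self, cookie_str: str) -> dict:
    """Parse a "k1=v1; k2=v2" cookie string into a dict."""
    cookie_dict = {}
    # Split on ";" and strip, so both "a=1; b=2" and "a=1;b=2" are handled
    cookies = [i.strip() for i in cookie_str.split(';') if i.strip() != ""]
    for cookie in cookies:
        if '=' not in cookie:
            continue  # skip malformed fragments instead of raising ValueError
        key, value = cookie.split('=', 1)
        cookie_dict[key] = value
    return cookie_dict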

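Key cookie fields:

c_user: the logged-in user's ID
xs: session token
datr: browser identifier used for security and integrity checks
fr: advertising cookie (used for ad delivery and measurement, not an authentication token)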
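As a self-contained illustration of the parsing step above, the sketch below turns a cookie string into a dict. The function name `parse_cookie_header` and the sample values are placeholders invented for this example; they are not real credentials and not part of the original class.

```python
def parse_cookie_header(cookie_str: str) -> dict:
    """Split a "k1=v1; k2=v2" string into a {name: value} dict."""
    cookies = {}
    for part in cookie_str.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # ignore empty or malformed fragments
        key, value = part.split("=", 1)  # values may themselves contain "="
        cookies[key] = value
    return cookies


# Placeholder values only
sample = "c_user=100; xs=abc=def;datr=xyz; ; broken"
print(parse_cookie_header(sample))
```

Splitting on the first `=` only is what keeps values such as `abc=def` intact.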

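3. Fetching the Token Dynamically

def get_token(self):
    url = "https://www.facebook.com/"
    # ... request handling
    # Raw string, so that \[ and \{ reach the regex engine unchanged
    matches = re.findall(r'"DTSGInitialData",\[\],\{"token":"(.*?)"\},', html)
    if not matches:
        return None, "-2"  # token not found in the page
    return matches[0], "0"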

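How it works:

A regular expression extracts the DTSG token from the page source
The DTSG token provides CSRF protection for GraphQL API requests
Fetching it dynamically keeps the token valid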
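The extraction step can be tried offline. The sketch below applies the same pattern to a synthetic page fragment; `extract_dtsg` and the token value `FAKE_TOKEN_123` are made up for illustration, and the real page layout may differ.

```python
import re

# Raw string: \[ \] \{ \} are regex escapes, not Python string escapes.
DTSG_PATTERN = re.compile(r'"DTSGInitialData",\[\],\{"token":"(.*?)"\}')


def extract_dtsg(html: str):
    """Return the embedded token, or None when the pattern is absent."""
    match = DTSG_PATTERN.search(html)
    return match.group(1) if match else None


# Synthetic page fragment with a made-up token value
sample_html = '<script>["DTSGInitialData",[],{"token":"FAKE_TOKEN_123"},258]</script>'
print(extract_dtsg(sample_html))
print(extract_dtsg("<html></html>"))
```

Returning `None` on a miss lets the caller report an error status instead of crashing on an empty match list.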

4. GraphQL API Calls

The crawler retrieves data through Facebook's GraphQL endpoint:

def get_data(self, cursor, keyword, fb_dtsg):
    variables = {
        "count": 5,
        "allow_streaming": False,
        "args": {
            "callsite": "COMET_GLOBAL_SEARCH",
            "config": {"exact_match": False, "high_confidence_config": None,
                       "intercept_config": None, "sts_disambiguation": None},
            "context": {"bsid": "d5226c60-92d1-45fc-b2b5-274daabf2bcb"},
            "experience": {
                "client_defined_experiences": ["ADS_PARALLEL_FETCH"],
                "type": "PEOPLE_TAB"
            },
            "filters": [],
            "text": keyword
        },
        "cursor": cursor,
        "__relay_internal__pv__GHLShouldChangeAdIdFieldNamerelayprovider": False,
        # ... more internal parameters
    }

    data = {
        "av": self.cookies["c_user"],
        "__user": self.cookies["c_user"],
        "__req": "1b",
        "__hs": "20062.HYP:comet_pkg.2.1..2.1",
        "__rev": "1018642777",
        "fb_dtsg": fb_dtsg,
        "variables": json.dumps(variables, separators=(',', ":"), ensure_ascii=False),
        "doc_id": "28106583125599288"  # GraphQL query ID
    }

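Key parameters:

doc_id: unique identifier of the Facebook GraphQL query
__rev: client revision number, used for API compatibility
__hs: described in the original as a hash signature used for request validation; the parameter is undocumented
bsid: browser session ID
variables: the query parameters, serialized as JSON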
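The `variables` field is serialized with `separators=(',', ':')` and `ensure_ascii=False`. The short sketch below, using a made-up payload, shows what those two arguments change compared with the `json.dumps` defaults.

```python
import json

# Made-up payload; only the serialization options matter here
variables = {"count": 5, "args": {"text": "café", "filters": []}, "cursor": None}

default = json.dumps(variables)
compact = json.dumps(variables, separators=(",", ":"), ensure_ascii=False)

print(default)
print(compact)
```

The compact form drops the spaces after `,` and `:` and keeps non-ASCII characters as-is instead of emitting `\uXXXX` escapes; both forms decode to the same object.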

5. Parsing Complex Data

Extracting Data Embedded in JavaScript

Facebook embeds key data in the page's JavaScript, so it has to be extracted with a dedicated parsing strategy:

def parse_data(self, html, url_):
    # Initialize the result structure
    item = {"profile_url": url_, "user_id": "", "user_name": ""}  # ... more fields

    ele = etree.HTML(html)
    script_str_list = ele.xpath('//script/text()')

    # Stage 1: parse the profile_tile_section_type data
    for script_str in script_str_list:
        if "profile_tile_section_type" in script_str:
            data_json = json.loads(script_str)
            requires = data_json["require"][0][3][0]["__bbox"]["require"]

            # Locate the node that carries the actual data
            result = None
            for r in requires:
                if len(r) == 4 and r[1] == "next":
                    result = r[-1][1]["__bbox"]["result"]
                    break

            if result:
                # Extract the user ID and the section list
                try:
                    user = result["data"]["user"]
                    item["user_id"] = user['id']
                    edges = user["profile_tile_sections"]["edges"]
                except (KeyError, TypeError):
                    # Fallback data path
                    edges = result["data"]["profile_tile_sections"]["edges"]
                    item["user_id"] = result["data"]["id"]

                # Find the INTRO section; node_list stays empty if there is none
                node_list = []
                for e in edges:
                    if e['node']['profile_tile_section_type'] == "INTRO":
                        node_list = e['node']["profile_tile_views"]["nodes"]
                        break

                # Extract the details
                for n in node_list:
                    renderer = n["view_style_renderer"]
                    if renderer:
                        self._parse_renderer_data(renderer, item)

Parsing Renderer Data

def _parse_renderer_data(self, renderer, item):
    typename = renderer['__typename']

    if typename == "ProfileTileViewIntroBioRenderer":
        # Bio
        bio_text = renderer["view"]["profile_tile_items"]["nodes"][0]["node"]["profile_status_text"]["text"]
        item["description"] = bio_text

    elif typename == "ProfileTileViewContextListRenderer":
        # Context list entries
        nodes = renderer["view"]["profile_tile_items"]["nodes"]

        for node in nodes:
            context_item = node['node']['timeline_context_item']
            item_type = context_item['timeline_context_list_item_type']
            renderer_data = context_item['renderer']

            # The text lives in a different place depending on the entry type
            try:
                text = renderer_data["context_item"]["title"]['text']
            except (KeyError, TypeError):
                text = renderer_data.get('wa_number', '')

            # Store the value under the matching field
            self._categorize_user_info(item_type, text, item)

Categorizing the Information

def _categorize_user_info(self, item_type, text, item):
    """Store a value under the field that matches Facebook's internal type tag."""

    type_mapping = {
        "INTRO_CARD_INFLUENCER_CATEGORY": "category",
        "INTRO_CARD_CONFIRMED_OWNER_LABEL": "person_charge",
        "INTRO_CARD_PROFILE_EMAIL": "email",
        "INTRO_CARD_WEBSITE": "website",
        "INTRO_CARD_CURRENT_CITY": "city",
        "INTRO_CARD_HOMETOWN": "from",
        "INTRO_CARD_ADDRESS": "address",
        "INTRO_CARD_BUSINESS_HOURS": "business_hour",
        "INTRO_CARD_RATING": "comment_count",
        "INTRO_CARD_WORK": "work",
        "INTRO_CARD_FOLLOWERS": "follower_count",
        "INTRO_CARD_RELATIONSHIP": "relationship"
    }

    # Phone numbers (including WhatsApp) are merged into one field
    if item_type in ["INTRO_CARD_PROFILE_PHONE", "INTRO_CARD_PROFILE_WHATSAPP_NUMBER"]:
        if item.get("tel"):
            texts = item["tel"].split("; ")
            texts.append(text)
            item["tel"] = "; ".join(texts)
        else:
            item["tel"] = text

    # Fields that may hold several values
    elif item_type in ["INTRO_CARD_OTHER_ACCOUNT", "INTRO_CARD_EDUCATION"]:
        field_name = "other_account" if "OTHER_ACCOUNT" in item_type else "education"
        if item.get(field_name):
            existing_values = item[field_name].split("; ")
            existing_values.append(text)
            item[field_name] = "; ".join(existing_values)
        else:
            item[field_name] = text

    # Ordinary single-value fields
    elif item_type in type_mapping:
        item[type_mapping[item_type]] = text
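The two multi-value branches above repeat the same split, append, and join steps. One possible way to factor that out is a small helper; `append_value` is a hypothetical name introduced here, not a function from the original code.

```python
def append_value(item: dict, field: str, text: str, sep: str = "; ") -> None:
    """Store text under field, joining repeated values with sep."""
    if item.get(field):
        item[field] = sep.join([item[field], text])
    else:
        item[field] = text


item = {}
append_value(item, "education", "School A")
append_value(item, "education", "School B")
print(item)
```

Joining the existing string with the new value gives the same result as splitting, appending, and re-joining, with one less pass over the data.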

Parsing the Profile Header

# Stage 2: parse the XFBProfileEntityConvergenceHeaderRenderer data
if "XFBProfileEntityConvergenceHeaderRenderer" in script_str:
    data_json = json.loads(script_str)
    # ... same node-location logic as in stage 1

    if result:
        user = result["data"]["user"]["profile_header_renderer"]["user"]
        item["user_name"] = user.get('name', '')

        # Social statistics
        try:
            contents = user["profile_social_context"]["content"]
            for content in contents:
                uri = content.get('uri', '')
                text = content['text']['text']

                # The URI path tells which statistic this entry holds
                if "friends_likes" in uri:
                    item["like_count"] = text
                elif "followers" in uri:
                    item["follower_count"] = text
                elif "following" in uri:
                    item["following_count"] = text
                elif "friends" in uri and "like" not in uri:
                    item["friend_count"] = text
        except (KeyError, TypeError):
            pass
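Because the statistic type is decided by substring checks, the order of the branches matters: "friends_likes" contains "friends", so it has to be tested first. The sketch below isolates that logic; `classify_social_uri` and the sample paths are hypothetical and only illustrate the ordering.

```python
def classify_social_uri(uri: str):
    """Map a social-context link to the field name used in the article."""
    if "friends_likes" in uri:
        return "like_count"
    if "followers" in uri:
        return "follower_count"
    if "following" in uri:
        return "following_count"
    if "friends" in uri and "like" not in uri:
        return "friend_count"
    return None  # unknown link type


# Made-up paths for illustration
for uri in ["/example/friends_likes/", "/example/followers/",
            "/example/following/", "/example/friends/", "/example/photos/"]:
    print(uri, "->", classify_social_uri(uri))
```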

6. Concurrency and Timeout Control

def main(self, keyword, cookie_str, cursor=None, fb_dtsg=None, timeout=30):
    # ... initialization and data fetching

    resultList = self.parse_data(edges)  # list of profile URLs
    item_list = []

    # Key point: limit concurrency to 1 to avoid being detected
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        # Submit every task
        future_to_url = {
            executor.submit(fbis.run, url): url
            for url in resultList
        }

        # Wait on the futures in submission order. as_completed() only yields
        # futures that have already finished, so a timeout passed to result()
        # inside such a loop would never fire.
        for future in future_to_url:
            try:
                item, status_ = future.result(timeout=timeout)
                item_list.append(item)

                # Stop on a fatal error
                if status_ == "-2":
                    return {
                        "cursor": cursor,
                        "token": fb_dtsg,
                        "item_list": [],
                        "status": status_
                    }

            except (concurrent.futures.TimeoutError,
                    requests.exceptions.Timeout):
                continue  # skip tasks that timed out

    return {
        "cursor": cursor,
        "token": fb_dtsg,
        "item_list": item_list,
        "status": "0"
    }

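Concurrency notes:

max_workers=1: strictly limits concurrency to mimic manual browsing
Timeout control: a 30-second timeout prevents long blocking
Graceful degradation: timed-out tasks are skipped without affecting the overall run
Error propagation: a fatal error stops further processing immediately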
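The timeout behaviour is easy to get wrong, so here is a minimal offline sketch of it. `slow_task` and `collect` are hypothetical names, and `time.sleep` stands in for network I/O. It relies on one detail of `concurrent.futures`: `Future.result(timeout=...)` only enforces a timeout on a future that is still pending, which is why the loop waits on the futures directly instead of going through `as_completed()`.

```python
import concurrent.futures
import time


def slow_task(name: str, delay: float):
    """Stand-in for a per-URL fetch; sleeps instead of doing network I/O."""
    time.sleep(delay)
    return name, "0"


def collect(tasks, timeout: float):
    """Run tasks on one worker; skip any whose result is not ready in time."""
    results, skipped = [], []
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        futures = {executor.submit(slow_task, n, d): n for n, d in tasks}
        for future, name in futures.items():
            try:
                results.append(future.result(timeout=timeout))
            except concurrent.futures.TimeoutError:
                skipped.append(name)  # the task itself keeps running
    return results, skipped


results, skipped = collect([("a", 0.0), ("b", 0.6), ("c", 0.0)], timeout=0.4)
print(results, skipped)
```

Two caveats apply to this sketch. A timed-out task is not cancelled: it keeps occupying the single worker, so later tasks start late. Leaving the `with` block also waits for every submitted task to finish.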

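8. Parsing the Search Results

def parse_data(self, edges):
    """Extract the profile links from the search results."""
    resultList = []
    for edge in edges:
        try:
            # Walk the deeply nested structure
            vm = edge['relay_rendering_strategy']['view_model']
            profile_url = vm['profile']['profile_url']
            resultList.append(profile_url)
        except (KeyError, TypeError) as exc:
            # A separate name avoids shadowing the loop variable
            print("Parse error:", exc)
    return resultList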

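Data flow:

The GraphQL API returns the search results
Each user's view model is taken from the edges array
relay_rendering_strategy gives access to the rendering strategy
The profile URL is read from view_model.profile

9. Session State Management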

class FacebookUserInfoSpider:
    def __init__(self, cookie_str):
        self.session = requests.Session()  # reuse TCP connections
        self.cookies = self.cookie_str_to_dict(cookie_str)

    def get(self, url):
        response = self.session.get(url,
                                    headers=self.headers,
                                    cookies=self.cookies,
                                    allow_redirects=False,  # do not follow redirects
                                    timeout=(5, 10))        # (connect, read) timeouts in seconds

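Session management notes:

requests.Session() reuses TCP connections
Automatic redirects are disabled to avoid unexpected jumps
Timeouts are set to reasonable values
Cookie state stays consistent across requests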

def get(self, url):
    while True:
        try:
            response = self.session.get(url, headers=self.headers,
                                        cookies=self.cookies, timeout=(5, 10))
            # ... handle the response
        except (ProxyError, ReadTimeout, ConnectionError) as e:
            # logging needs a %s placeholder for each extra argument
            logger.info("Network error: %s", e)
            time.sleep(5)
        except Exception as e:
            logger.info("Error in get: %s", e)
            return None, "-2"

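Error-handling strategy:

Network errors are retried automatically
Retries are separated by a pause (5 seconds in the code above)
Different kinds of errors map to different status codes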
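The loop above retries network errors indefinitely. A bounded variant could look like the sketch below. `retry` and `flaky` are hypothetical names, and the built-in `ConnectionError` and `TimeoutError` stand in for the `requests` exception classes so that the example runs without a network.

```python
import time


def retry(func, attempts: int = 3, delay: float = 0.0,
          retry_on=(ConnectionError, TimeoutError)):
    """Call func(); retry on the listed exceptions, at most `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func(), "0"
        except retry_on:
            if attempt < attempts:
                time.sleep(delay)  # pause before the next attempt
    return None, "-2"  # every attempt failed


calls = {"n": 0}


def flaky():
    """Fails twice with a simulated network error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


print(retry(flaky))
```

Capping the attempts means a persistent outage ends with the "-2" status instead of blocking the caller forever.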

Additional Technical Details

1. Facebook's Internal Data Structures

Facebook stores page data in deeply nested JSON structures:

# Typical data paths
data_path = {
    "basic user info": "require[0][3][0].__bbox.require[n][-1][1].__bbox.result.data.user",
    "bio": "profile_tile_views.nodes[].view_style_renderer.view.profile_tile_items.nodes[].node.profile_status_text.text",
    "contact info": "timeline_context_item.renderer.context_item.title.text",
    "social stats": "profile_header_renderer.user.profile_social_context.content[].text.text"
}
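Paths this deep tend to break whenever one level is missing. A small helper that walks a path and returns a default on any miss is one way to keep the parsing code flat. `dig` is a hypothetical helper, and the payload below is synthetic and only loosely shaped like the paths above.

```python
def dig(obj, *path, default=None):
    """Follow keys/indices through nested dicts and lists; return default on a miss."""
    for key in path:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


# Synthetic payload
payload = {"result": {"data": {"user": {"id": "123", "sections": [{"type": "INTRO"}]}}}}

print(dig(payload, "result", "data", "user", "id"))
print(dig(payload, "result", "data", "user", "sections", 0, "type"))
print(dig(payload, "result", "data", "page", "id", default=""))
```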

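2. Key Request Parameters in Detail

# Key parameters of the GraphQL request
critical_params = {
    "doc_id": "28106583125599288",  # Facebook-internal query ID, may change
    "__rev": "1018642777",          # client revision, has to match
    "__hs": "20062.HYP:comet_pkg.2.1..2.1",  # described in the original as a hash signature
    "jazoest": "25179",             # check value generated by client-side JavaScript
    "__spin_r": "1018642777",       # spin revision (same value as __rev here)
    "__spin_b": "trunk",            # branch name
    "__spin_t": "1733391682"        # Unix timestamp
}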

3. Countering Anti-Crawler Measures

def get_realistic_headers():
    """Build a more realistic set of request headers."""
    return {
        # Key point: reproduce the full fingerprint of a real browser
        "sec-ch-ua-full-version-list": "\"Google Chrome\";v=\"131.0.6778.86\"",
        "sec-ch-ua-platform-version": "\"8.0.0\"",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "viewport-width": "848",  # viewport width
        "dpr": "0.9"              # device pixel ratio
    }


def handle_rate_limiting(response, retry_count, base_wait=5):
    """Handle rate limiting. base_wait is an assumed default; the original leaves it undefined."""
    if "Rate limit exceeded" in response.text:
        # Exponential backoff, capped at 300 seconds
        wait_time = min(300, base_wait * (2 ** retry_count))
        time.sleep(wait_time)
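To make the backoff formula concrete, the sketch below evaluates `min(300, base_wait * 2 ** retry_count)` for the first few retries. The 5-second `base_wait` is an assumption chosen for illustration; the original does not give a value.

```python
def backoff_delay(retry_count: int, base_wait: float = 5.0, cap: float = 300.0) -> float:
    """Exponential backoff: base_wait * 2**retry_count, capped at cap seconds."""
    return min(cap, base_wait * (2 ** retry_count))


print([backoff_delay(n) for n in range(8)])
# [5.0, 10.0, 20.0, 40.0, 80.0, 160.0, 300.0, 300.0]
```

With these values the cap takes over at the seventh attempt, where the uncapped delay would be 320 seconds.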

User Information Data Model

user_info_schema = {
    "profile_url": "profile page URL",
    "user_id": "numeric user ID",
    "user_name": "display name",
    "description": "bio",
    "category": "user category",
    "work": "work information",
    "education": "education",
    "city": "current city",
    "from": "hometown",
    "email": "email address",
    "tel": "phone number",
    "website": "personal website",
    "follower_count": "number of followers",
    "friend_count": "number of friends"
}

Status Codes

status_codes = {
    "0": "success",
    "-1": "unexpected HTTP status code",
    "-2": "system error or rate-limited"
}

Complete Workflow

Main Execution Flow

def main(self, keyword, cookie_str, cursor=None, fb_dtsg=None, timeout=30):
    """Full search and extraction flow."""

    # 1. Initialize the session and cookies
    self.cookies = self.cookie_str_to_dict(cookie_str)
    fbis = FacebookUserInfoSpider(cookie_str)

    try:
        # 2. Fetch the DTSG token if none was passed in
        if not fb_dtsg:
            fb_dtsg, _ = self.get_token()
            if not fb_dtsg:
                return {"cursor": cursor, "token": None,
                        "item_list": [], "status": "-2"}

        # 3. Run the GraphQL search request (get_data is the method from section 4)
        dataJson, _ = self.get_data(cursor, keyword, fb_dtsg)
        if not dataJson or not isinstance(dataJson, dict):
            return {"cursor": cursor, "token": fb_dtsg,
                    "item_list": [], "status": "-2"}

        # 4. Extract the search results and the paging cursor
        edges = dataJson["data"]["serpResponse"]["results"]["edges"]
        cursor = dataJson["data"]["serpResponse"]["results"]["page_info"]["end_cursor"]

        if not edges:
            return {"cursor": cursor, "token": fb_dtsg,
                    "item_list": [], "status": "0"}

        # 5. Parse the list of profile URLs
        resultList = self.parse_data(edges)

        # 6. Fetch the detailed user information
        item_list = []
        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
            ...  # submit the tasks and collect the results (covered in section 6)

        # 7. Return the result
        return {
            "cursor": cursor,        # cursor for the next page
            "token": fb_dtsg,        # reusable token
            "item_list": item_list,  # list of user records
            "status": "0"            # execution status
        }

    except Exception as e:
        print(f"Error occurred: {e}")
        return {"cursor": cursor, "token": fb_dtsg,
                "item_list": [], "status": "-2"}

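Data Flow Diagram

User search request
    ↓
Cookie validation + DTSG token retrieval
    ↓
GraphQL API call (search endpoint)
    ↓
Search result parsing (extract profile URLs)
    ↓
Concurrent user detail retrieval (profile page parsing)
    ↓
Structured data output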

# Initialize the crawler
fsus = FacebookSearchUserSpider()

# Set the cookie (a valid logged-in cookie is required)
cookie_str = 'sb=xxx; c_user=xxx; xs=xxx; ...'

# Run the search
cursor = None
fb_dtsg = None
page_count = 1

while page_count <= 5:  # cap the number of pages
    result = fsus.main("food", cookie_str, cursor, fb_dtsg)

    cursor = result['cursor']
    fb_dtsg = result['token']

    print(f"Page {page_count}: got {len(result['item_list'])} records")

    if result['status'] != '0':
        print("Error encountered, stopping")
        break

    page_count += 1
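The cursor-based paging loop can be exercised without any network access by swapping in a stub. In the sketch below, `StubSearchSpider` and `crawl` are hypothetical; the stub serves three fake pages and mirrors only the return shape of `main`. Stopping on an empty page is a design choice of this sketch, not something the loop above does.

```python
class StubSearchSpider:
    """Offline stand-in for the search spider; serves three fake pages."""

    PAGES = {None: (["u1", "u2"], "c1"), "c1": (["u3"], "c2"), "c2": ([], None)}

    def main(self, keyword, cookie_str, cursor=None, fb_dtsg=None):
        items, next_cursor = self.PAGES[cursor]
        return {"cursor": next_cursor, "token": fb_dtsg or "stub-token",
                "item_list": items, "status": "0"}


def crawl(spider, keyword, cookie_str, max_pages=5):
    """Follow cursors until a page is empty, an error occurs, or max_pages is hit."""
    cursor, token, collected = None, None, []
    for _ in range(max_pages):
        result = spider.main(keyword, cookie_str, cursor, token)
        if result["status"] != "0" or not result["item_list"]:
            break
        collected.extend(result["item_list"])
        cursor, token = result["cursor"], result["token"]
        if cursor is None:
            break  # no further pages
    return collected


print(crawl(StubSearchSpider(), "food", "placeholder"))
```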

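Notes and Recommendations

1. Compliant Use

Honor the site's robots.txt
Throttle requests to avoid putting load on the servers
Respect the platform's terms of service and user privacy

2. Suggested Technical Improvements

Add randomized delays
Spread requests across a proxy pool
Implement a smarter retry strategy
Add data deduplication

3. Risk Control

Monitor the request success rate
Set sensible concurrency limits
Recover from errors gracefully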

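Conclusion

This article analysed the core techniques behind a Facebook crawler, covering the full path from request construction to data parsing. Studying them gives a clearer picture of how modern web crawlers are designed and implemented.

In practice, always comply with applicable laws and regulations, use crawling techniques responsibly, and respect platform rules and user privacy. Technology should serve legitimate business needs and academic research, not malicious data theft.

This article is for technical study only. The author accepts no legal liability arising from the use of its content.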
