AI-Powered Information Extraction and Matchmaking

Original article: towardsdatascience.com/ai-powered-information-extraction-and-matchmaking-0408c93ec1b9/


As large language models (LLMs) become more capable, they are increasingly popular for extracting information from business documents such as legal contracts, invoices, financial reports, and CVs. The information extracted from multiple sources can then be used in matchmaking and recommendation systems.

Some applications of information extraction and matchmaking include:

  • Automatically generating requests for quotation (RFQs) by extracting information from customer requests

  • Extracting key usage patterns from customer data to provide product recommendations

  • Extracting key information from tenders and matching it against company profiles to find potential bidders

  • Extracting key information from a company's invoices and sales documents to generate sales and purchase reports

  • Extracting key information from purchase orders to facilitate inventory and supply-chain management

  • Matching individuals on dating or matrimonial platforms based on their profiles, and so on

During my AI advisory work in the FAIR EDIH project in Finland (FAIR EDIH project website), I came across several matchmaking use cases in which information could be extracted with LLMs and recommendations aligned with the extracted information could subsequently be provided. Some of these use cases include:

  • Matching users' preferences for buying cars and houses

  • Matching students' skills with career paths in a learning management system

  • Matching EU regulations and compliance requirements to tender proposals

  • Recommending experts to review research proposals

  • Providing peer recommendations to students to enhance the learning experience

  • Suggesting upskilling options based on employees' profiles

  • Matching job seekers to workplaces that fit their profiles, and so on

In this article, I will discuss a use case in which key information is extracted from a job seeker's CV or résumé, and jobs matching the applicant's profile are then recommended from a job database. The approach applies to both CVs and résumés; however, I will use only the term "CV" throughout. This use case is particularly useful for job-search platforms looking to integrate AI into their existing systems. Such platforms maintain a job database and allow users to create profiles and/or upload their CVs. The same approach can also be used to help recruiters find potential candidates matching a job advertisement.

In this article, we will develop an application with a simple GUI that analyzes an uploaded CV, extracts the applicant's profile (including education, skills, and professional experience), and subsequently recommends the best-matching jobs, with an explanation for each selection.

Note that this example use case can be extended to many other information extraction and matchmaking tasks.

This article covers the following topics:

  1. Using LlamaParse and Pydantic models to extract structured information from documents with an LLM.

  2. Applying this information extraction method to CVs to extract education, skills, and professional experience.

  3. Scoring the extracted skills based on how strongly each skill is represented in the CV (semantic scores).

  4. Creating a job vector database from a curated list of job advertisements.

  5. Retrieving the best-matching jobs based on the semantic similarity between the extracted profile and the jobs in the vector database.

  6. Generating the final job recommendations with an LLM, including an explanation for each recommendation.

  7. Developing a simple Streamlit application that allows a choice of multiple LLMs and embedding models (both OpenAI and open source).

The complete code with detailed instructions can be found in my GitHub repository.

The repository has two main folders: i) the code in the OpenAI models folder uses OpenAI's gpt-4o LLM and the text-embedding-3-large embedding model; ii) the code in the Multiple models folder offers a choice of OpenAI as well as open-source LLMs (e.g., gpt-4o, gpt-4o-mini, llama3:70b-instruct-q4_0, mistral:latest, llama3.3:latest) and embedding models (e.g., text-embedding-3-large, text-embedding-3-small, BAAI/bge-small-en-v1.5).

You need an OpenAI API key to run the code in the OpenAI models folder. However, if you have a powerful PC with a CUDA-enabled GPU, you can test the code in the Multiple models folder free of charge using the open-source models. You can run this code even without a CUDA-enabled GPU, but processing will be very slow. Both codebases are flexible enough to add more LLMs and/or embedding models for experimentation. For simplicity, I will refer only to the code in the OpenAI models folder in this article.

The following figure shows the overall process.

Overall process of extracting key information from a CV and recommending matching jobs from a job database (image by author)

Below is a snapshot of the Streamlit application.

Snapshot of the Streamlit application (image by author)

Parsing With LlamaParse, and Information Extraction and Validation With Pydantic Models

In the following article, I demonstrated information extraction from unstructured documents using an LLM. There, I used the python-docx library to extract text from AI advisory documents (MS Word) and sent each document's text directly to the LLM for information extraction.

LLM-Powered Parsing and Analysis of Semi-Structured & Structured Documents

In another article, I demonstrated a better parsing approach using LlamaParse for contextual, multimodal retrieval-augmented generation (RAG). LlamaParse is a genAI-native document parsing platform that parses and cleans data, ensuring that it is of good quality and in the right format before being passed to an LLM. See the article mentioned above for setting up LlamaParse and obtaining its free API key.

Integrating Multimodal Data into a Large Language Model

In this article, I will use LlamaParse to parse the data in a CV. However, instead of extracting the required information from the parsed content directly with an LLM, I will use Pydantic models to enforce a specific schema for information extraction and to validate that the extracted information conforms to that schema. This process ensures that the output generated by the LLM matches the expected types and formats. Pydantic validation also helps reduce LLM hallucinations.

Pydantic provides a concise way to define data models using Python classes. Before discussing Pydantic-guided information extraction from CVs, I will first demonstrate the process with an example that applies to any document. I will use the same example AI advisory document for a company as in the article mentioned above to extract key, structured information. Here is the example document.

This is the AI consultancy of the company Sagittarius Tech on the date 2024-09-12. This was a regular session facilitated by the expert Klaus Muller. Sagittarius Tech, based in Finland, is a forward-thinking, well-established company specializing in renewable energy solutions. They have a strong technical foundation in renewable energy systems, particularly in solar and wind energy, but their application of AI technology is still in its infancy, leading to a current AI maturity level that is considered low.
The company's objectives are well articulated and focus on optimizing the efficiency of their energy distribution networks. Specifically, Sagittarius Tech aims to implement AI-driven predictive maintenance for their solar farms and wind turbines. Their current approach to maintenance is largely reactive, with inspections carried out at regular intervals or when a failure is detected. This method, while functional, is neither cost-effective nor efficient, as it often leads to unexpected downtime and higher maintenance costs. By integrating AI into their maintenance operations, Sagittarius Tech hopes to predict and prevent equipment failures before they occur, thereby reducing downtime and extending the lifespan of their energy assets.
The idea of implementing predictive maintenance using AI is highly relevant and aligns with current industry trends. By predicting equipment failures before they happen, Sagittarius Tech can improve the reliability of their energy systems and offer more consistent service to their clients. The application of AI for this purpose is particularly advantageous, as it allows for the analysis of large datasets from sensors and monitoring equipment to identify patterns and anomalies that might indicate impending failures.
While the company's immediate goals are clear, their long-term strategy for AI integration is still under consideration. However, they have identified their target market as large-scale renewable energy operators and utility companies. In terms of data requirements, Sagittarius Tech has access to extensive datasets generated by the sensors installed on their solar panels and wind turbines. This data, which includes temperature readings, vibration analysis, and energy output metrics, is crucial for training and validating AI models for predictive maintenance. The data is continuously updated as part of their ongoing operations, providing a rich source of information for AI-driven insights.
The company has demonstrated strong technical expertise in renewable energy systems and in managing the associated data. They have a growing interest in AI, particularly in the area of predictive analytics, though their experience in this field is still developing. Sagittarius Tech is seeking technical assistance from FAIR Services to develop an AI proof-of-concept (POC) focused on predictive maintenance for their energy assets. During the consultation, it was noted that the company could benefit from targeted training in AI-based predictive maintenance techniques to further their capabilities.
The experts suggested that the challenge of implementing predictive maintenance could be approached through the use of machine learning models that are specifically designed to handle time-series data. Models such as LSTM (Long Short-Term Memory) networks, which are particularly effective in analyzing sequential data, can be applied to the sensor data collected by Sagittarius Tech. These models are capable of learning patterns over time and can provide early warnings of potential equipment failures. However, the experts noted that these models require a significant amount of data for training, so it may be beneficial to begin with a smaller pilot project before scaling up.
The experts further recommended exploring the integration of AI-driven predictive maintenance tools with the company's existing monitoring systems. This integration can be achieved through the use of custom APIs and middleware, allowing the AI models to continuously analyze incoming data and provide real-time alerts to the maintenance team. Additionally, the experts emphasized the importance of a hybrid approach, combining AI predictions with human expertise to ensure that maintenance decisions are both data-driven and informed by practical experience.
Starting with pre-trained models for time-series analysis was recommended, with the option to fine-tune these models based on the specific characteristics of Sagittarius Tech's equipment and operations. It was advised to avoid training models from scratch due to the computational complexity and resource requirements involved. Instead, a phased approach to AI integration was suggested, where the predictive maintenance system is gradually rolled out across different sites, allowing the models to be refined and validated in a controlled environment. This approach ensures that the AI system can be effectively integrated into the company's operations without disrupting existing processes.

We have hundreds of such unstructured documents, and the goal is to extract the following key information: company name, country, consultation date, experts, consultation type, area/domain, current solution, AI field, AI maturity level, technical expertise and capability, company type, aim, identified target market, data requirement assessment, FAIR services sought, and recommendations.

The following libraries must be installed before running the given code.

pip install openai pydantic[email] llama_parse llama-index python-dotenv streamlit

The following code defines a Pydantic model to enforce a schema for the extraction of specific data, validate the LLM's output, and convert certain fields into the expected format.

import os
import json
import openai
from datetime import datetime, date
from typing import List, Optional
from pydantic import BaseModel, Field, field_validator
from llama_parse import LlamaParse
from llama_index.llms.openai import OpenAI
from dotenv import load_dotenv
from llama_index.core import SimpleDirectoryReader

load_dotenv() #load the API keys from .env file

class AIconsultation(BaseModel):
    company_name: Optional[str] = Field(None, description="The name of the company seeking AI advisory")
    country: Optional[str] = Field(None, description="The company's country")
    consultation_date: Optional[str] = Field(None, description="The date of consultation")
    experts: Optional[List[str]] = Field(None, description="The experts providing AI consultancy")
    consultation_type: Optional[str] = Field(None, description="Type of consultation: Regular or pop-up")
    area_domain: Optional[str] = Field(None, description="The field of the company's operations (e.g., healthcare, logistics, etc.)")
    current_solution: Optional[str] = Field(None, description="A brief summary of the current solution (e.g., recommendation system, professional guidance system)")
    ai_field: Optional[List[str]] = Field(None, description="AI sub-fields in use or required (e.g., computer vision, generative AI)")
    ai_maturity_level: Optional[str] = Field(None, description="AI maturity level: low, moderate, high")
    technical_expertise_and_capability: Optional[str] = Field(None, description="Company's technical expertise: low, moderate, or high")
    company_type: Optional[str] = Field(None, description="Company type: startup or established company")
    aim: Optional[str] = Field(None, description="Main AI task the company is aiming for")
    identified_target_market: Optional[str] = Field(None, description="The targeted customers (e.g., healthcare professionals, construction firms)")
    data_requirement_assessment: Optional[str] = Field(None, description="Type of data required for AI integration with format/modality")
    fair_services_sought: Optional[str] = Field(None, description="Services expected from FAIR (e.g., technical advice, proof of concept)")
    recommendations: Optional[str] = Field(None, description="Key recommendations focusing on most important suggested actions")

    @field_validator("consultation_date", mode="before")
    def validate_and_convert_date(cls, raw_date):
        if raw_date is None:
            return None
        if isinstance(raw_date, str):
            # List of acceptable date formats
            date_formats = ['%d-%m-%Y', '%Y-%m-%d', '%d/%m/%Y', '%m-%d-%Y']
            for fmt in date_formats:
                try:
                    # Attempt to parse the date string with the current format
                    parsed_date = datetime.strptime(raw_date, fmt).date()
                    # Return the date in MM-DD-YYYY format as a string
                    return parsed_date.strftime('%m-%d-%Y')
                except ValueError:
                    continue  # Try the next format
            # If none of the formats match, raise an error
            raise ValueError(
                f"Invalid date format for 'consultation_date'. Expected one of: {', '.join(date_formats)}."
            )
        if isinstance(raw_date, date):
            # Convert date object to MM-DD-YYYY format
            return raw_date.strftime('%m-%d-%Y')

        raise ValueError(
            "Invalid type for 'consultation_date'. Must be a string or a date object."
        )

def extract_content(file_path):
    """Parse the document and extract its content as text."""
    #Initialize LlamaParse parser
    parser = LlamaParse(
        result_type="markdown",
        parsing_instructions="Extract each section separately based on the document structure.",
        auto_mode=True,
        api_key=os.getenv("LLAMA_API_KEY"),
        verbose=True
    )
    file_extractor = {".pdf": parser}
    # Load the document
    documents = SimpleDirectoryReader(
        input_files=[file_path], file_extractor=file_extractor
    ).load_data()
    text_content = "\n".join([doc.text for doc in documents])
    return text_content

def extract_information(document_text, llm_model):
    """Extract structured information and validate with Pydantic schema."""
    openai.api_key = os.getenv("OPENAI_API_KEY")
    llm = OpenAI(model=llm_model, temperature=0.0)
    prompt = f"""
    You are an expert in analyzing consultation documents. Use the following JSON schema to extract relevant information:
    ```json

    {json.dumps(AIconsultation.model_json_schema(), indent=2)}

    ```
    Extract the information from the following document and provide a structured JSON response strictly adhering to the schema above. 
    Please remove any ```json ``` characters from the output. Do not make up any information. If a field cannot be extracted, mark it as `n/a`.
    Document:
    ----------------
    {document_text}
    ----------------
    """
    response = llm.complete(prompt)
    if not response or not response.text:
        raise ValueError("Failed to get a response from LLM.")
    try:
        parsed_data = json.loads(response.text)  # Parse the response text to a Python dictionary
        return AIconsultation.model_validate(parsed_data)  # Validate the parsed data against the schema
    except Exception as e:
        raise ValueError(f"Validation failed: {e}")
if __name__ == "__main__":
    # Path to the document to analyze
    document_path = "Sagittarius.pdf"
    if not os.path.exists(document_path):
        raise FileNotFoundError(f"The file {document_path} does not exist.")
    try:
        print("Extracting content from the document...")
        document_content = extract_content(document_path)

        print("Parsing and extracting structured information...")
        consultation_info = extract_information(document_content, llm_model="gpt-4o")

        print("Extraction complete. Here is the structured information:")
        print(json.dumps(consultation_info.model_dump(), indent=2))
    except Exception as e:
        print(f"An error occurred: {e}")

The descriptions of all the fields in the AIconsultation class are self-explanatory. The field validator function validate_and_convert_date checks the format of the extracted consultation_date field and, if needed, converts it to the required format (MM-DD-YYYY). The extract_content() function parses the given AI advisory document using LlamaParse, and the extract_information() function extracts the required information from the document using the gpt-4o LLM, guided by the Pydantic model. The prompt in the extract_information function instructs the model to follow the Pydantic schema and to output the response in JSON format.
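The date-normalization logic inside validate_and_convert_date can be exercised in isolation; the following is a minimal sketch of the same parsing strategy as a plain function (outside Pydantic, for illustration only):

```python
from datetime import datetime

# Accepted input formats, as in the validate_and_convert_date validator
DATE_FORMATS = ('%d-%m-%Y', '%Y-%m-%d', '%d/%m/%Y', '%m-%d-%Y')

def normalize_date(raw_date: str) -> str:
    """Try each accepted format in turn and return the date as MM-DD-YYYY."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw_date, fmt).date().strftime('%m-%d-%Y')
        except ValueError:
            continue  # Try the next format
    raise ValueError(f"Invalid date format: {raw_date!r}")

print(normalize_date("2024-09-12"))  # 09-12-2024
print(normalize_date("12/09/2024"))  # 09-12-2024
```

Note that the format list is tried in order, so an ambiguous string such as "12-09-2024" is interpreted as day-month-year because '%d-%m-%Y' comes first.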

LlamaParse splits the document into several sub-documents based on the overall context. Guided by the instructions given to the parser (see parsing_instructions in the extract_content() function), the parser creates several sections and assigns a heading to each. The parser's output (the documents object in the extract_content() function) contains sub-document IDs, metadata for each sub-document, and the text with several headings.

The output of LlamaParse is contained in the documents object in the extract_content() function (image by author)

I select only the text (text_content in the extract_content() function) for information extraction. In the final output of the extract_content() function, the document has been split into several sections, each assigned a heading.

Finally, the extract_information() function extracts the required information (as defined in the Pydantic model) from the parsed content in a well-structured format. The consultation date has been validated and converted to the MM-DD-YYYY format. Note that the prompt does not need to specify which information to extract, because that is already specified in the Pydantic model.

Extracting content from the document...
Started parsing the file under job_id 0761bfee-922a-49a8-9e92-da1877aeea1a
Parsing and extracting structured information...
Extraction complete. Here is the structured information:
{
  "company_name": "Sagittarius Tech",
  "country": "Finland",
  "consultation_date": "09-12-2024",
  "experts": [
    "Klaus Muller"
  ],
  "consultation_type": "Regular",
  "area_domain": "Renewable energy",
  "current_solution": "Reactive maintenance for solar farms and wind turbines",
  "ai_field": [
    "Predictive maintenance",
    "Machine learning",
    "Time-series analysis"
  ],
  "ai_maturity_level": "Low",
  "technical_expertise_and_capability": "High",
  "company_type": "Established company",
  "aim": "Implement AI-driven predictive maintenance for solar farms and wind turbines",
  "identified_target_market": "Large-scale renewable energy operators and utility companies",
  "data_requirement_assessment": "Extensive datasets from sensors on solar panels and wind turbines, including temperature readings, vibration analysis, and energy output metrics",
  "fair_services_sought": "Technical advice, proof of concept for predictive maintenance",
  "recommendations": "Use machine learning models like LSTM for time-series data, start with pre-trained models, integrate AI with existing systems using custom APIs, and adopt a phased approach to AI integration"
}

Parsing CV Contents, and Information Extraction and Validation With Pydantic Models

Having demonstrated document parsing with LlamaParse and Pydantic-guided information extraction, let us now discuss parsing CVs, extracting key information from them with Pydantic models, and providing job recommendations matching the extracted profile. Here, the extracted educational qualifications, skills, and previous experience are considered sufficient information for providing matching job recommendations.

The GitHub code is structured in two .py files:

  1. CV_analyzer.py: defines the Pydantic models, configures the LLM and embedding models, parses the CV data, extracts the required information from the CV, assigns scores to the extracted skills, and retrieves matching jobs from the job vector database.

  2. job_recommender.py: initializes a Streamlit application, calls the functions in CV_analyzer.py in sequence, and displays the extracted information and the job recommendations.

The overall functioning of the code is depicted below.

Workflow of the job recommendation application: integrating CvAnalyzer and RAGStringQueryEngine for CV parsing, Pydantic-guided profile extraction with an LLM, skill scoring, and job recommendation with Streamlit output (image by author)

Some structure in the code is borrowed from this source, with significant improvements. Let us discuss all the classes and functions in the code one by one.

The following code in CV_analyzer.py shows the definition of the Pydantic models.

# Pydantic model for extracting education
class Education(BaseModel):
    institution: Optional[str] = Field(None, description="The name of the educational institution")
    degree: Optional[str] = Field(None, description="The degree or qualification earned")
    graduation_date: Optional[str] = Field(None, description="The graduation date (e.g., 'YYYY-MM')")
    details: Optional[List[str]] = Field(
        None, description="Additional details about the education (e.g., coursework, achievements)"
    )

    @field_validator('details', mode='before')
    def validate_details(cls, v):
        if isinstance(v, str) and v.lower() == 'n/a':
            return []
        elif not isinstance(v, list):
            return []
        return v

# Pydantic model for extracting experience
class Experience(BaseModel):
    company: Optional[str] = Field(None, description="The name of the company or organization")
    location: Optional[str] = Field(None, description="The location of the company or organization")
    role: Optional[str] = Field(None, description="The role or job title held by the candidate")
    start_date: Optional[str] = Field(None, description="The start date of the job (e.g., 'YYYY-MM')")
    end_date: Optional[str] = Field(None, description="The end date of the job or 'Present' if ongoing (e.g., 'MM-YYYY')")
    responsibilities: Optional[List[str]] = Field(
        None, description="A list of responsibilities and tasks handled during the job"
    )

    @field_validator('responsibilities', mode='before')
    def validate_responsibilities(cls, v):
        if isinstance(v, str) and v.lower() == 'n/a':
            return []
        elif not isinstance(v, list):
            return []
        return v
# Main Pydantic class encapsulating the education and experience classes with other information
class ApplicantProfile(BaseModel):
    name: Optional[str] = Field(None, description="The full name of the candidate")
    email: Optional[EmailStr] = Field(None, description="The email of the candidate")
    age: Optional[int] = Field(
        None,
        description="The age of the candidate."
    )
    skills: Optional[List[str]] = Field(
        None, description="A list of high-level skills possessed by the candidate."
    )
    experience: Optional[List[Experience]] = Field(
        None, description="A list of experiences detailing previous jobs, roles, and responsibilities"
    )
    education: Optional[List[Education]] = Field(
        None, description="A list of educational qualifications of the candidate including degrees, institutions studied in, and dates of start and end."
    )

    @root_validator(pre=True)
    def handle_invalid_values(cls, values):
        for key, value in values.items():
            if isinstance(value, str) and value.lower() in {'n/a', 'none', ''}:
                values[key] = None
        return values

The Education class in the Pydantic model defines the extraction of education details, including the name of the institution, the degree or qualification earned, the graduation date, and additional details such as coursework or achievements. The Experience class defines the extraction of professional experience details, including company name, location, role, start and end dates, and a list of responsibilities or tasks. The main class, ApplicantProfile, encapsulates the Education and Experience classes along with other candidate-specific information such as name, email, age, and skills. The field validators in each class convert invalid or irrelevant values (such as 'n/a' or 'none') or badly formatted inputs into a consistent data format.
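The pre-validation step that maps placeholder strings to None can be sketched as a plain function (a stand-in for the handle_invalid_values root validator, for illustration only):

```python
def handle_invalid_values(values: dict) -> dict:
    """Map placeholder strings such as 'n/a', 'none', or '' to None, as the
    ApplicantProfile pre-validator does; all other values pass through unchanged."""
    return {
        key: None if isinstance(value, str) and value.lower() in {'n/a', 'none', ''} else value
        for key, value in values.items()
    }

raw = {"name": "Jane Doe", "email": "n/a", "age": "NONE", "skills": ["Python", "SQL"]}
print(handle_invalid_values(raw))
# {'name': 'Jane Doe', 'email': None, 'age': None, 'skills': ['Python', 'SQL']}
```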

After defining the Pydantic models, CV_analyzer.py performs its various tasks through the CvAnalyzer class, whose structure is shown below.

# Class for analyzing the CV contents
class CvAnalyzer:
    def __init__(self, file_path, llm_option, embedding_option):
        """
        Initializes the CvAnalyzer with the given resume file path and model options.

        Parameters:
        - file_path: Path to the resume file.
        - llm_option: Name of the LLM to use.
        - embedding_option: Name of the embedding model to use.
        """
        pass

    def _model_settings(self):
        """
        Configures the large language model and embedding model based on the user-provided options.
        This ensures that the selected models are properly initialized and ready for use.
        """
        pass

    def extract_profile_info(self) -> ApplicantProfile:
        """
        Extracts structured information from the resume and converts it into an ApplicantProfile object.
        This includes parsing education, skills, and experience using a selected LLM.
        """
        pass

    def _get_embedding(self, texts: List[str], model: str) -> torch.Tensor:
        """
        Generates embeddings for a list of text inputs using the specified embedding model.
        This function is called by compute_skill_scores() function
        Parameters:
        - texts: List of strings to embed.
        - model: Name of the embedding model.

        Returns:
        - Tensor of embeddings.
        """
        pass

    def compute_skill_scores(self, skills: list[str]) -> dict:
        """
        Computes semantic similarity scores between skills and the resume content.

        Parameters:
        - skills: List of skills to evaluate.

        Returns:
        - A dictionary mapping each skill to its similarity score.
        """
        pass

    def _extract_resume_content(self) -> str:
        """
        Called by compute_skill_scores(), this function extracts and returns the raw textual content of the resume.
        """
        pass

    def _cosine_similarity(self, vec1: torch.Tensor, vec2: torch.Tensor) -> float:
        """
        Called by compute_skill_scores() function, calculates the cosine similarity between two vectors.

        Parameters:
        - vec1: First vector.
        - vec2: Second vector.

        Returns:
        - Cosine similarity score as a float.
        """
        pass

    def create_or_load_job_index(self, json_file: str, index_folder: str = "job_index_storage"):
        """
        Creates a new job vector index from a JSON dataset or loads an existing index from storage.

        Parameters:
        - json_file: Path to the job dataset JSON file.
        - index_folder: Folder to save or load the vector index.

        Returns:
        - VectorStoreIndex object for querying jobs.
        """
        pass

    def query_jobs(self, education, skills, experience, index, top_k=3):
        """
        Queries the job vector index to find the top-k matching jobs based on the provided profile.

        Parameters:
        - education: List of educational qualifications.
        - skills: List of skills.
        - experience: List of work experiences.
        - index: Job vector database index.
        - top_k: Number of top matching jobs to retrieve (default: 3).

        Returns:
        - List of job matches.
        """
        pass

The extract_profile_info function first parses the given CV using LlamaParse and splits it into sections (as in the example shown at the beginning of the article). It then sends the CV's content, self._resume_content, to the LLM together with the Pydantic schema and the information extraction instructions (see prompt). The LLM's response (response) is validated against the Pydantic schema.

It is worth noting that I extract the text data (self._resume_content) rather than the raw parsed content (documents) with its metadata and other information, and send that to the LLM for information extraction. This prevents the LLM from getting confused by information scattered across different nodes, which could cause parts of some required information to be missed.

def extract_profile_info(self) -> ApplicantProfile:
        """
        Extracts candidate data from the resume.
        """
        print(f"Extracting CV data. LLM: {self.llm_option}")
        output_schema = ApplicantProfile.model_json_schema()
        parser = LlamaParse(
            result_type="markdown",
            parsing_instructions="Extract each section separately based on the document structure.",
            auto_mode=True,
            api_key=os.getenv("LLAMA_API_KEY"),
            verbose=True
        )
        file_extractor = {".pdf": parser}

        # Load resume and parse
        documents = SimpleDirectoryReader(
            input_files=[self.file_path], file_extractor=file_extractor
        ).load_data()

        # Split into sections
        self._resume_content = "\n".join([doc.text for doc in documents])
        prompt = f"""
            You are an expert in analyzing resumes. Use the following JSON schema to extract relevant information:
            ```json

            {output_schema}

            ```
            Extract the information from the following document and provide a structured JSON response strictly adhering to the schema above. 
            Please remove any ```json ``` characters from the output. Do not make up any information. If a field cannot be extracted, mark it as `n/a`.
            Document:
            ----------------
            {self._resume_content}
            ----------------
            """
        try:
            response = self.llm.complete(prompt)
            if not response or not response.text:
                raise ValueError("Failed to get a response from LLM.")

            parsed_data = json.loads(response.text)
            return ApplicantProfile.model_validate(parsed_data)
        except Exception as e:
            print(f"Error parsing response: {str(e)}")
            raise ValueError("Failed to extract insights. Please ensure the resume and query engine are properly configured.")
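The flattening of parsed nodes into plain text amounts to joining the text attribute of each sub-document; a minimal sketch, where Doc is a hypothetical stand-in for a LlamaIndex Document node:

```python
class Doc:
    """Hypothetical stand-in for a LlamaIndex Document node."""
    def __init__(self, text: str):
        self.text = text

# Each parsed section becomes one node; only the text is kept for the LLM
documents = [Doc("# Education\nMSc, Data Science"), Doc("# Experience\nData Scientist, Acme")]
resume_content = "\n".join(doc.text for doc in documents)
print(resume_content)
```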

The compute_skill_scores function computes an embedding for each extracted skill and for the CV content. It then computes the cosine similarity score between each skill embedding and the CV embedding. The more prominent a skill is in the CV, the higher its cosine similarity score. Each skill's cosine similarity score is normalized to a value between 0 and 5 for display in a 5-star format.

def compute_skill_scores(self, skills: list[str]) -> dict:
        """
        Compute semantic weightage scores for each skill based on the resume content

        Parameters:
        - skills (list of str): A list of skills to evaluate.

        Returns:
        - dict: A dictionary mapping each skill to a score 
        """
        # Extract resume content and compute its embedding
        resume_content = self._extract_resume_content()

        # Compute embeddings for all skills at once
        skill_embeddings = self._get_embedding(skills, model=self.embedding_model.model_name)

        # Compute raw similarity scores and semantic frequency for each skill
        raw_scores = {}
        for skill, skill_embedding in zip(skills, skill_embeddings):
            # Compute semantic similarity with the entire resume
            similarity = self._cosine_similarity(
                self._get_embedding([resume_content], model=self.embedding_model.model_name)[0],
                skill_embedding
            )
            raw_scores[skill] = similarity
        return raw_scores

def _extract_resume_content(self) -> str:
        """
        Returns the CV contents previously extracted
        """
        if self._resume_content:
            return self._resume_content  # Use the pre-stored content
        else:
            raise ValueError("Resume content not available. Ensure `extract_profile_info` is called first.")

def _get_embedding(self, texts: List[str], model: str) -> torch.Tensor:
        """Computes embeddings based on the selected embedding model.
        These could be CV embeddings, skill embeddings, or job embeddings."""
        from openai import OpenAI
        client = OpenAI(api_key=openai.api_key)
        response = client.embeddings.create(input=texts, model=model)
        embeddings = [torch.tensor(item.embedding) for item in response.data]
        return torch.stack(embeddings)

def _cosine_similarity(self, vec1: torch.Tensor, vec2: torch.Tensor) -> float:
        """
        Compute cosine similarity between a skill and the CV content.
        """
        vec1, vec2 = vec1.to(self.device), vec2.to(self.device)
        return (torch.dot(vec1, vec2) / (torch.norm(vec1) * torch.norm(vec2))).item()
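As a sketch, the cosine similarity and the 0-to-5 star normalization can be illustrated with plain Python. The min-max scaling across skills shown here is an assumption for illustration; the repository may scale the raw scores differently:

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2)

def star_scores(raw_scores: dict, max_stars: int = 5) -> dict:
    """Min-max normalize raw cosine scores into the 0..max_stars range (assumed scheme)."""
    lo, hi = min(raw_scores.values()), max(raw_scores.values())
    span = (hi - lo) or 1.0
    return {skill: round(max_stars * (v - lo) / span, 1) for skill, v in raw_scores.items()}

raw = {"Python": 0.62, "SQL": 0.48, "Kubernetes": 0.31}
print(star_scores(raw))  # {'Python': 5.0, 'SQL': 2.7, 'Kubernetes': 0.0}
```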

The create_or_load_job_index function creates a new job vector database, or loads the index from an existing one (see the job_index_storage folder in the code repository).

def create_or_load_job_index(self, json_file: str, index_folder: str = "job_index_storage"):
        """
        Create or load a vector database for jobs using LlamaIndex.
        """
        if not os.path.exists(index_folder):
            print(f"Creating new job vector index with {self.embedding_model.model_name} model...")
            with open(json_file, "r") as f:
                job_data = json.load(f)
            # Convert job descriptions to Document objects by serializing all fields dynamically
            documents = []
            for job in job_data["jobs"]:
                job_text = "\n".join([f"{key.capitalize()}: {value}" for key, value in job.items()])
                documents.append(Document(text=job_text))
            # Create the vector index directly from documents
            index = VectorStoreIndex.from_documents(documents, embed_model=self.embedding_model)
            # Save index to disk
            index.storage_context.persist(persist_dir=index_folder)
            return index
        else:
            print(f"Loading existing job index from {index_folder}...")
            storage_context = StorageContext.from_defaults(persist_dir=index_folder)
            return load_index_from_storage(storage_context)

The job dataset from which the vector database is created resides in the sample_jobs.json file in the code repository. I curated this sample dataset by scraping 50 job advertisements in JSON format from different sources. Here is how the job advertisements are stored in this file.

{
  "jobs": [
   {
      "id": "2253637",
      "title": "Director of Customer Success",
      "company": "HEI Schools",
      "description": "HEI Schools is seeking an experienced Director of Customer Success to lead our account management, customer success, and project delivery functions. Responsibilities include overseeing seamless product and service delivery, ensuring high quality and customer satisfaction, and supervising a team of three customer success professionals. The role requires regular international travel and reports directly to the CEO.",
      "image": "n/a",
      "location": "Helsinki, Finland",
      "employmentType": "Full-time, Permanent",
      "datePosted": "December 10, 2024",
      "salaryRange": "n/a",
      "jobProvider": "Jobly",
      "url": "https://www.jobly.fi/en/job/director-customer-success-2253637"
    },
    {
      "id": "2258919",
      "title": "Service Specialist",
      "company": "Stora Enso",
      "description": "We are seeking an active and service-oriented Service Specialist for our forest owner services in the Helsinki metropolitan area. Responsibilities include supporting timber sales and service sales, marketing and communication within your area of responsibility, forest consulting, promoting digital solutions in customer management and service offerings, and stakeholder collaboration in the metropolitan area.",
      "image": "n/a",
      "location": "Helsinki, Finland",
      "employmentType": "Permanent, Full-time",
      "datePosted": "December 10, 2024",
      "salaryRange": "n/a",
      "jobProvider": "Jobly",
      "url": "https://www.jobly.fi/en/job/palveluasiantuntija-2258919"
    }
  ...

  ],
  "index": 0,
  "jobCount": 50,
  "hasError": false,
  "errors": []
}
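Before embedding, create_or_load_job_index flattens each JSON record into a single text block by serializing all fields dynamically; the serialization line can be tried on its own:

```python
job = {
    "id": "2253637",
    "title": "Director of Customer Success",
    "company": "HEI Schools",
    "location": "Helsinki, Finland",
}

# Serialize all fields into "Key: value" lines, one per field
job_text = "\n".join(f"{key.capitalize()}: {value}" for key, value in job.items())
print(job_text)
# Id: 2253637
# Title: Director of Customer Success
# Company: HEI Schools
# Location: Helsinki, Finland
```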

The query_jobs function retrieves the top_k matching job advertisements from the job vector database; these advertisements are then sent to the LLM for the final recommendations.

def query_jobs(self, education, skills, experience, index, top_k=3):
        """
        Query the vector database for jobs matching the extracted profile.
        """
        print(f"Fetching job suggestions.(LLM: {self.llm.model}, embed_model: {self.embedding_option})")
        query = f"Education: {', '.join(education)}; Skills: {', '.join(skills)}; Experience: {', '.join(experience)}"
        # Use retriever with appropriate model
        retriever = index.as_retriever(similarity_top_k=top_k)
        matches = retriever.retrieve(query)
        return matches
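For illustration, the flat query string that query_jobs builds from an extracted profile (the profile values below are hypothetical) looks like this:

```python
# Build the retrieval query the same way query_jobs does, from hypothetical
# profile values extracted from a resume.
education = ["M.Sc. Computer Science"]
skills = ["Python", "NLP"]
experience = ["Data Scientist"]
query = (
    f"Education: {', '.join(education)}; "
    f"Skills: {', '.join(skills)}; "
    f"Experience: {', '.join(experience)}"
)
print(query)
# → Education: M.Sc. Computer Science; Skills: Python, NLP; Experience: Data Scientist
```

This single string is embedded and compared against the job nodes, so the retrieval treats the whole profile as one semantic query rather than matching each field separately.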

The CvAnalyzer class and the methods above are initialized and invoked by job_recommender.py, which serves as the main application code. job_recommender.py uses the following custom query engine to produce the final job recommendations.

class RAGStringQueryEngine(BaseModel):
    """
    Custom Query Engine for Retrieval-Augmented Generation (fetching matching job recommendations).
    """
    retriever: BaseRetriever
    llm: OpenAI
    qa_prompt: PromptTemplate

    # Allow arbitrary types
    model_config = ConfigDict(arbitrary_types_allowed=True)
    def custom_query(self, candidate_details: str, retrieved_jobs: str):
        query_str = self.qa_prompt.format(
            query_str=candidate_details, context_str=retrieved_jobs
        )

        response = self.llm.complete(query_str)        
        return str(response)
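The engine's custom_query is plain string templating followed by a single LLM completion call. With a stub in place of the LLM (the StubLLM class and simplified template below are assumptions purely for illustration, not the repository's code), the mechanics can be exercised in isolation:

```python
# Exercise the prompt-composition logic of custom_query with a stub LLM.
# StubLLM and QA_TEMPLATE are illustrative assumptions, not the real code.
QA_TEMPLATE = "Candidate Details:\n{query_str}\n---\nJob Descriptions:\n{context_str}"

class StubLLM:
    def complete(self, prompt: str) -> str:
        # A real LLM would return job recommendations; the stub just echoes
        # that it received the composed prompt.
        return f"prompt of {len(prompt)} chars received"

def custom_query(llm, candidate_details: str, retrieved_jobs: str) -> str:
    # Same two steps as RAGStringQueryEngine.custom_query: format, then complete.
    query_str = QA_TEMPLATE.format(query_str=candidate_details, context_str=retrieved_jobs)
    return str(llm.complete(query_str))

result = custom_query(StubLLM(), "Skills: Python", "Job: Data Scientist at Acme")
```

Because the engine is this thin, swapping the LLM (OpenAI or an Ollama-served open-source model) only changes the object passed in; the prompt composition stays identical.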

The main function in job_recommender.py is as follows:

def main():
    #Streamlit messages
    st.set_page_config(page_title="CV Analyzer & Job Recommender", page_icon="🔍 ")
    st.title("CV Analyzer & Job Recommender")
    st.write("Upload a CV to extract key information.")
    uploaded_file = st.file_uploader("Select Your CV (PDF)", type="pdf", help="Choose a PDF file up to 5MB")
    #Define LLM and embedding model    
    llm_option = "gpt-4o"
    embedding_option = "text-embedding-3-large"
    #Following code is triggered after pressing 'Analyze' button
    if uploaded_file is not None:
        if st.button("Analyze"):
            with st.spinner("Parsing CV... This may take a moment."):
                try:
                    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
                        temp_file.write(uploaded_file.getvalue())
                        temp_file_path = temp_file.name
                    # Initialize CvAnalyzer with selected models
                    analyzer = CvAnalyzer(temp_file_path, llm_option, embedding_option)
                    print("Resume extractor initialized.")
                    # Extract insights from the resume
                    insights = analyzer.extract_profile_info()
                    print("Candidate data extracted.")
                    # Load or create job vector index
                    job_index = analyzer.create_or_load_job_index(json_file="sample_jobs.json", index_folder="job_index_storage")
                    # Extract education, skills, and experience fields from insights object
                    education = [edu.degree for edu in insights.education] if insights.education else []
                    skills = insights.skills or []
                    experience = [exp.role for exp in insights.experience] if insights.experience else []
                    #Retrieve the top_k matching jobs
                    matching_jobs = analyzer.query_jobs(education, skills, experience, job_index)
                    #combine the retrieved matching jobs
                    retrieved_context = "\n\n".join([match.node.get_content() for match in matching_jobs])
                    #combine the profile information
                    candidate_details = f"Education: {', '.join(education)}; Skills: {', '.join(skills)}; Experience: {', '.join(experience)}" 
                    #Initialize the custom query engine with the analyzer's LLM
                    rag_engine = RAGStringQueryEngine(
                        retriever=job_index.as_retriever(),
                        llm=analyzer.llm,  
                        qa_prompt=PromptTemplate(template="""
                            You are an expert in analyzing resumes. Based on the following candidate details and job descriptions:
                            Candidate Details:
                            ---------------------
                            {query_str}
                            ---------------------
                            Job Descriptions:
                            ---------------------
                            {context_str}
                            ---------------------
                            Provide a concise list of the matching jobs. For each matching job, mention job-related details such as 
                            company, brief job description, location, employment type, salary range, URL for each suggestion, and a brief explanation of why the job matches the candidate's profile.
                            Be critical when matching the profile with the jobs. Thoroughly analyze education, skills, and experience to match jobs.
                            Do not explain why the candidate's profile does not match with the other jobs. Do not include any summary. Order the jobs based on their relevance. 
                            Answer: 
                            """
                        ),
                    )

                    #send the profile details and the retrieved jobs to the LLM for final recommendation
                    llm_response = rag_engine.custom_query(
                        candidate_details=candidate_details,
                        retrieved_jobs=retrieved_context
                    )
                    # Display extracted information
                    st.subheader("Extracted Information")
                    st.write(f"**Name:** {insights.name}")
                    st.write(f"**Email:** {insights.email}")
                    st.write(f"**Age:** {insights.age}")
                    list_education(insights.education or [])
                    with st.spinner("Extracting skills..."):
                        list_skills(insights.skills or [], analyzer)
                    list_experience(insights.experience or [])
                    st.subheader("Top Matching Jobs with Explanation")
                    st.markdown(llm_response)
                    print("Done.")
                except Exception as e:
                    st.error(f"Failed to analyze the resume: {str(e)}")

The main function initializes the CvAnalyzer class with the selected models and calls the extract_profile_info function to extract the profile information. It then loads the job vector index and calls the query_jobs function to retrieve the jobs matching the extracted profile. Subsequently, it initializes the query engine (rag_engine) and sends the retrieved jobs (retrieved_context) and the profile information (candidate_details) to the LLM, along with instructions on which aspects to consider when generating the final job recommendations (see qa_prompt).

For completeness, the imports required by job_recommender.py are:

import torch
from transformers import AutoTokenizer, AutoModel
from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.ollama import Ollama
from typing import Union
import streamlit as st
import tempfile
import random
import os
from CV_analyzer import CvAnalyzer
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.retrievers import BaseRetriever
from llama_index.llms.openai import OpenAI
from llama_index.core.prompts import PromptTemplate
from pydantic import BaseModel, Field, ConfigDict


The following three functions display the educational qualifications, skills, and experience. The list_skills function calls the compute_skill_scores function to compute a cosine-similarity score for each skill and then converts each score into a 5-star rating.

def list_skills(skills: list[str], analyzer):
    """
    Display skills with their computed scores as large golden stars with full or partial coverage.
    """
    if not skills:
        st.warning("No skills found to display.")
        return
    st.subheader("Skills")
    # Custom CSS for large golden stars
    st.markdown(
        """
        <style>
        .star-container {
            display: inline-block;
            position: relative;
            font-size: 1.5rem;
            color: lightgray;
        }
        .star-container .filled {
            position: absolute;
            top: 0;
            left: 0;
            color: gold;
            overflow: hidden;
        }
        </style>
        """,
        unsafe_allow_html=True,
    )

    # Compute scores for all skills
    skill_scores = analyzer.compute_skill_scores(skills)
    # Display each skill with a star rating
    for skill in skills:
        score = skill_scores.get(skill, 0)  # Get the raw score
        max_score = max(skill_scores.values()) if skill_scores else 1  # Avoid division by zero
        # Normalize the score to a 5-star scale
        normalized_score = (score / max_score) * 5 if max_score > 0 else 0
        # Split into full stars and partial star percentage
        full_stars = int(normalized_score)
        if (normalized_score - full_stars) >= 0.40:
            partial_star_percentage = 50
        else:
            partial_star_percentage = 0

        # Generate the star display
        stars_html = ""
        for i in range(5):
            if i < full_stars:
                # Fully filled star
                stars_html += '<span class="star-container"><span class="filled">★</span>★</span>'
            elif i == full_stars:
                # Partially filled star
                stars_html += f'<span class="star-container"><span class="filled" style="width: {partial_star_percentage}%">★</span>★</span>'
            else:
                # Empty star
                stars_html += '<span class="star-container">★</span>'

        # Display skill name and star rating
        st.markdown(f"**{skill}**: {stars_html}", unsafe_allow_html=True)
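Extracted as a pure function, the normalization above (scores scaled against the strongest skill, a half star granted from 0.40 upward) behaves like this:

```python
def stars(score: float, max_score: float) -> tuple[int, int]:
    """Mirror list_skills' star logic: return (full_stars, partial_star_percentage)."""
    normalized = (score / max_score) * 5 if max_score > 0 else 0
    full = int(normalized)
    partial = 50 if (normalized - full) >= 0.40 else 0
    return full, partial

print(stars(0.9, 0.9))   # → (5, 0): the strongest skill always gets five full stars
print(stars(0.45, 0.9))  # → (2, 50): 2.5 on the 5-star scale = two full stars + a half star
```

Note that the rating is relative: the best-matching skill defines the 5-star ceiling, so the stars compare skills within one resume rather than against an absolute scale.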

def list_education(education_list):
    """
    Display a list of educational qualifications.
    """
    if education_list:
        st.subheader("Education")
        for education in education_list:
            #extract metrics for each education (degree) and display it
            institution = education.institution if education.institution else "Not found"
            degree = education.degree if education.degree else "Not found"
            year = education.graduation_date if education.graduation_date else "Not found"
            details = education.details if education.details else []
            formatted_details = ". ".join(details) if details else "No additional details provided."
            st.markdown(f"**{degree}**, {institution} ({year})")
            st.markdown(f"_Details_: {formatted_details}")

def list_experience(experience_list):
    """
    Display a single-level bulleted list of experiences.
    """
    if experience_list:
        st.subheader("Experience")
        for experience in experience_list:
            #extract metrics for each experience and display it
            job_title = experience.role if experience.role else "Not found"
            company_name = experience.company if experience.company else "Not found"
            location = experience.location if experience.location else "Not found"
            start_date = experience.start_date if experience.start_date else "Not found"
            end_date = experience.end_date if experience.end_date else "Not found"
            responsibilities = experience.responsibilities if experience.responsibilities else ["Not found"]
            brief_responsibilities = ", ".join(responsibilities)
            st.markdown(
                f"- Worked as **{job_title}** from {start_date} to {end_date} in *{company_name}*, {location}, "
                f"where responsibilities include {brief_responsibilities}."
            )
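The compute_skill_scores method referenced above is not shown here. Under the assumption that it embeds each skill and the full CV text and compares them by cosine similarity (as described earlier), a minimal self-contained sketch could look like this — the function names are mine, not the repository's, and toy_embed stands in for a real embedding model:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Standard cosine similarity with a zero-vector guard.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def compute_skill_scores(skills: list[str], cv_embedding: list[float], embed_fn) -> dict:
    """Score each skill by cosine similarity between its embedding and the CV's."""
    return {skill: cosine_similarity(embed_fn(skill), cv_embedding) for skill in skills}

# Toy embedding function standing in for a real embedding model (assumption).
def toy_embed(text: str) -> list[float]:
    return [1.0, 0.0] if "python" in text.lower() else [0.0, 1.0]

scores = compute_skill_scores(["Python", "Sales"], toy_embed("python developer cv"), toy_embed)
```

In the real application the embedding function would be the selected embedding model (e.g., text-embedding-3-large), so skills that the CV text supports strongly score closer to 1.0.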

Below are snapshots of the extracted profile information and the job recommendations from the Streamlit application. The sample resume (Sample CV.pdf) can be found in the code repository.

Sample resume (image by author)

Screenshot 1: gpt-4o and text-embedding-3-large models (image by author)

Screenshot 2: gpt-4o and text-embedding-3-large models (image by author)

If you have a powerful computer with a CUDA-capable GPU, the application lets you select open-source models. You can run the open-source models after setting up the required libraries by following the repository instructions. With a CPU-only machine, however, processing may be very slow. That said, the code automatically switches to CPU-based processing if no CUDA-capable GPU is found.

The following screenshots show the same resume processed with open-source models (the Llama3.1:latest LLM and the BAAI/bge-small-en-v1.5 embedding model).

Screenshot 1: Llama3.1:latest and BAAI/bge-small-en-v1.5 models

Screenshot 2: Llama3.1:latest and BAAI/bge-small-en-v1.5 models

The open-source models also do quite a good job. It is worth noting how Llama3.1 extracts and scores skills: it splits the skill groups mentioned in the resume into individual skills, which nevertheless remain relevant for producing matching job recommendations.

Below is a snapshot of the first page of my resume (4 pages in total), showing its structure.

First page of my resume (image by author)

Here are the results of analyzing my resume. They are quite accurate for my profile.

Screenshot 1: gpt-4o and text-embedding-3-large models (image by author)

Screenshot 2: gpt-4o and text-embedding-3-large models (image by author)

Screenshot 3: gpt-4o and text-embedding-3-large models (image by author)

Directions for Improvement and Extension

There is plenty of room to improve this application. Some potential improvements include the following directions.

  1. The current code uses a limited sample dataset. An efficient job-ad ingestion pipeline could be implemented to keep the job database up to date with the latest openings.

  2. I considered only educational qualifications, skills, and previous experience for the job recommendations. By extending the Pydantic models, more information could be extracted for better recommendations, such as location, most recent experience, publications, and other interests/activities.

  3. The skill-scoring method could be further improved and integrated into the final recommendations so that individual skill scores are taken into account.

  4. The application could be extended to extract key information from job ads and match it against job seekers' profiles to find suitable candidates.

  5. The application could be tested with resumes in different formats to check the parsing quality and improve the parsing accordingly.

  6. Based on an analysis of the resume content, users could be given suggestions for improving their resumes and skills, along with recommended upskilling packages.
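As a sketch of the first direction, a minimal ingestion step could merge newly scraped job ads into the JSON job database, de-duplicating by id. The top-level "jobs" key and the function name are assumptions based on the sample file's structure, not the repository's actual pipeline:

```python
import json
from pathlib import Path

def ingest_jobs(new_jobs: list[dict], db_path: str = "sample_jobs.json") -> int:
    """Merge newly fetched job ads into the JSON job database; skip known ids."""
    path = Path(db_path)
    # Assumed file layout: {"jobs": [...], "jobCount": N, ...} as in the sample file.
    db = json.loads(path.read_text()) if path.exists() else {"jobs": [], "jobCount": 0}
    known_ids = {job["id"] for job in db["jobs"]}
    added = [job for job in new_jobs if job["id"] not in known_ids]
    db["jobs"].extend(added)
    db["jobCount"] = len(db["jobs"])
    path.write_text(json.dumps(db, indent=2, ensure_ascii=False))
    return len(added)
```

After ingestion, the vector index would also need to be rebuilt (or incrementally updated) so that the newly added ads become retrievable.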

The complete code can be found in the following GitHub repository:

GitHub – umairalipathan1980/CV-Analyzer-Job-Recommender: CV Analyzer & Job Recommender


That's it! If you enjoyed this article, please clap for it (multiple times 👏), leave a comment, and follow me on Medium and LinkedIn.
