
How to pass multimodal data directly to models

https://python.langchain.com/v0.2/docs/how_to/multimodal_inputs/

Here we demonstrate how to pass multimodal input directly to models. We currently expect all input to be passed in the same format as OpenAI expects. For other model providers that support multimodal input, we have added logic inside the class to convert to the expected format.

In this example we will ask a model to describe an image.

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
 
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o")
 
API Reference: HumanMessage | ChatOpenAI

The most commonly supported way to pass in an image is as a base64-encoded byte string. This should work for most model integrations.

import base64

import httpx

image_data = base64.b64encode(httpx.get(image_url).content).decode("utf-8")
 
message = HumanMessage(
    content=[
        {"type": "text", "text": "describe the weather in this image"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
        },
    ],
)
response = model.invoke([message])
print(response.content)
 
The weather in the image appears to be clear and pleasant. The sky is mostly blue with scattered, light clouds, suggesting a sunny day with minimal cloud cover. There is no indication of rain or strong winds, and the overall scene looks bright and calm. The lush green grass and clear visibility further indicate good weather conditions.
 

We can feed the image URL directly in a content block of type "image_url". Note that only some model providers support this.
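For example, reusing the image_url and model defined above (this mirrors the base64 example, except the URL is passed through unchanged):

message = HumanMessage(
    content=[
        {"type": "text", "text": "describe the weather in this image"},
        {"type": "image_url", "image_url": {"url": image_url}},
    ],
)
response = model.invoke([message])
print(response.content)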

 

Multi-Vector Retriever for RAG on tables, text, and images

https://blog.langchain.dev/semi-structured-multi-modal-rag/

Summary

Seamless question-answering across diverse data types (images, text, tables) is one of the holy grails of RAG. We’re releasing three new cookbooks that showcase the multi-vector retriever for RAG on documents that contain a mixture of content types. These cookbooks also present a few ideas for pairing multimodal LLMs with the multi-vector retriever to unlock RAG on images.
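At its core, the multi-vector retriever indexes small, searchable representations (e.g. summaries of tables or images) in a vectorstore while keeping the raw content in a separate docstore keyed by ID. A minimal sketch of that pattern follows; the placeholder inputs, collection name, and embedding model are illustrative assumptions, not taken from the cookbooks:

import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Placeholder inputs: raw chunks (text, table markdown, image summaries, ...)
# and the short summaries that will actually be embedded and searched.
raw_docs = ["<raw table markdown>", "<raw text chunk>"]
summaries = ["one-line summary of the table", "one-line summary of the text chunk"]

# The vectorstore indexes the summaries; the docstore holds the raw content.
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    id_key=id_key,
)

# Link each summary to its raw document through a shared ID.
doc_ids = [str(uuid.uuid4()) for _ in raw_docs]
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, raw_docs)))

# A query is matched against the summaries but returns the raw documents.
docs = retriever.invoke("what does the table show?")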

 

https://developer.volcengine.com/articles/7387287884799148073

Vectorizing images

Below is a brief introduction to the key steps:

Step 1: Extract images from the PDF

Use the unstructured library to extract the PDF content and build lists of text and images. The extracted images need to be stored in a specific folder.

          
# Extract images, tables, and chunk text
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename="LCM_2020_1112.pdf",
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=path,
)
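The texts list referenced in step 2 is built from raw_pdf_elements. A minimal sketch of that categorization, where the element type-name checks are an assumption following the unstructured chunking convention:

# "by_title" chunking wraps text chunks in CompositeElement objects,
# while tables come back as Table elements.
texts = [str(el) for el in raw_pdf_elements if "CompositeElement" in str(type(el))]
tables = [str(el) for el in raw_pdf_elements if "Table" in str(type(el))]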
      

Step 2: Create the vector store

Prepare the vector store and add the image URIs and texts to it.

          
import os

# Imports assumed from the LangChain multi-modal RAG cookbook
from langchain_community.vectorstores import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings

# Create chroma
vectorstore = Chroma(
    collection_name="mm_rag_clip_photos", embedding_function=OpenCLIPEmbeddings()
)

# Get image URIs with .jpg extension only
image_uris = sorted(
    [
        os.path.join(path, image_name)
        for image_name in os.listdir(path)
        if image_name.endswith(".jpg")
    ]
)
print(image_uris)

# Add images
vectorstore.add_images(uris=image_uris)

# Add documents
vectorstore.add_texts(texts=texts)
      

 
