A quick trial of Lance with MinIO

This is a quick trial to learn a little about how Lance works internally and how to integrate it with MinIO.

Reference code

  • app.py
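import lance
import pyarrow as pa
import duckdb

# Initial rows for the dataset
new_table = pa.Table.from_pylist([{"name": "Bob", "age": 30},
                                  {"name": "Carla", "age": 37}])

# Connection settings for the local MinIO server. The key names follow the
# S3 storage options documented by Lance; the bucket is taken from the URI.
storage_options = {
    "endpoint": "http://localhost:9000",
    "access_key_id": "minio",
    "secret_access_key": "minio123",
    "allow_http": "true",
    "region": "us-east-1",
}

# Create (or overwrite) the dataset in the "lance" bucket
ds = lance.write_dataset(new_table, "s3://lance/mydemo", mode="overwrite",
                         storage_options=storage_options)

# Upsert keyed on "name": update rows that match, insert rows that do not
new_table_v2 = pa.Table.from_pylist([{"name": "Bob_v2", "age": 39},
                                     {"name": "Carla", "age": 35}])
(ds.merge_insert("name")
   .when_not_matched_insert_all()
   .when_matched_update_all()
   .execute(new_table_v2))

# DuckDB resolves the Python variable `ds` by name and scans the dataset
samples = duckdb.query("SELECT * FROM ds").to_df()

print(samples)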
  • Result: a warning about possible concurrency problems is shown, recommending the use of a locking mechanism
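
To make the effect of the merge_insert call easier to follow, here is a minimal pure-Python sketch of the same upsert idea, keyed on "name". The helper upsert_rows is a hypothetical name used only for illustration; it is not part of Lance, and it says nothing about how Lance performs the merge internally.

```python
def upsert_rows(target, source, key):
    """Upsert `source` into `target`, matching rows on `key` (illustrative helper)."""
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        # Matched key: replace every column. Unmatched key: insert the row.
        merged[row[key]] = dict(row)
    return list(merged.values())


target = [{"name": "Bob", "age": 30}, {"name": "Carla", "age": 37}]
source = [{"name": "Bob_v2", "age": 39}, {"name": "Carla", "age": 35}]

result = upsert_rows(target, source, "name")
for row in result:
    print(row)
```

With the sample data, "Carla" is matched and updated to age 35, "Bob_v2" matches nothing and is inserted, and "Bob" is left untouched, so three rows remain. The row order returned by a real dataset may differ from this sketch.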

Reference usage of a lock

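from contextlib import contextmanager

import lance

# `my_lock` is a placeholder for any lock object with acquire()/release();
# `data` is a placeholder for the table to write.
@contextmanager
def commit_lock(version: int):
    # Acquire the lock before the commit starts
    my_lock.acquire()
    try:
        yield
    finally:
        # Always release the lock; errors from the commit are not swallowed
        my_lock.release()

lance.write_dataset(data, "s3://bucket/path/", commit_lock=commit_lock)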
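
The pattern above can be wrapped so that any lock object works with it. The sketch below is only an illustration: make_commit_lock is a hypothetical helper, not a Lance API, and threading.Lock stands in for a real lock (it only protects threads inside one process). It uses try/finally so that the lock is always released while an error raised inside the with block still reaches the caller.

```python
import threading
from contextlib import contextmanager


def make_commit_lock(lock):
    """Wrap any object exposing acquire()/release() (hypothetical helper)."""

    @contextmanager
    def commit_lock(version: int):
        lock.acquire()
        try:
            yield
        finally:
            # Runs on success and on failure, so the lock never leaks
            lock.release()

    return commit_lock


local_lock = threading.Lock()
commit_lock = make_commit_lock(local_lock)

# Normal path: the lock is held only inside the block
with commit_lock(1):
    held_inside = local_lock.locked()

# Failure path: the error propagates and the lock is still released
try:
    with commit_lock(2):
        raise RuntimeError("simulated commit failure")
except RuntimeError as exc:
    error_seen = str(exc)

released_after_error = not local_lock.locked()
print(held_inside, error_seen, released_after_error)
```

The returned function has the same commit_lock(version) shape as the snippet above, so it could be passed wherever that snippet's commit_lock is used.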

Complete example

import lance
import pyarrow as pa
import duckdb
from contextlib import contextmanager
from filelock import FileLock

lock_path = "high_ground.txt.lock"

# File-based lock shared by processes on this machine; give up after 10 seconds
lock = FileLock(lock_path, timeout=10)

@contextmanager
def commit_lock(version: int):
    print("Acquiring lock")
    lock.acquire()
    try:
        yield
    finally:
        # Always release the lock; errors from the commit are not swallowed
        print("Releasing lock")
        lock.release()

new_table = pa.Table.from_pylist([{"name": "Bob", "age": 30},
                                  {"name": "Carla", "age": 37}])

# Connection settings for the local MinIO server. The key names follow the
# S3 storage options documented by Lance; the bucket is taken from the URI.
storage_options = {
    "endpoint": "http://localhost:9000",
    "access_key_id": "minio",
    "secret_access_key": "minio123",
    "allow_http": "true",
    "region": "us-east-1",
}

print("Dataset created, will write to S3, and need to acquire lock")
ds = lance.write_dataset(new_table, "s3://lance/mydemo", mode="overwrite",
                         storage_options=storage_options,
                         commit_lock=commit_lock)

new_table_v2 = pa.Table.from_pylist([{"name": "Bob_v2", "age": 39},
                                     {"name": "Carla", "age": 35}])
print("Dataset merge_insert, and need to acquire lock")
(ds.merge_insert("name")
   .when_not_matched_insert_all()
   .when_matched_update_all()
   .execute(new_table_v2))

# DuckDB resolves the Python variable `ds` by name and scans the dataset
samples = duckdb.query("SELECT * FROM ds").to_df()

print(samples)

Notes

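Note that the S3 integration cannot guarantee safe concurrent writes, so using a lock is recommended. The lock can be local, or distributed when several instances need to be coordinated. filelock is one option that is safe across multiple processes and is easy to integrate. For scenarios that need vector storage, Lance is well worth trying; LanceDB, which is built directly on Lance, is an all-in-one solution.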
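
As a rough illustration of why a lock file can coordinate several processes, the sketch below relies on exclusive file creation from the standard library. It is not how filelock is implemented and is not a replacement for it; exclusive_file_lock and LockTimeout are hypothetical names, and the timeout values are arbitrary.

```python
import os
import tempfile
import time
from contextlib import contextmanager


class LockTimeout(Exception):
    """Raised when the lock file could not be created in time."""


@contextmanager
def exclusive_file_lock(path, timeout=1.0, poll=0.05):
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only
            # one caller at a time can create it.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise LockTimeout(path)
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(path)


lock_path = os.path.join(tempfile.mkdtemp(), "demo.lock")

with exclusive_file_lock(lock_path):
    exists_while_held = os.path.exists(lock_path)
    try:
        # A second attempt while the lock is held should time out
        with exclusive_file_lock(lock_path, timeout=0.2):
            second_acquired = True
    except LockTimeout:
        second_acquired = False

exists_after_release = os.path.exists(lock_path)
print(exists_while_held, second_acquired, exists_after_release)
```

A lock file like this only helps writers that share the same filesystem, and a crashed process can leave a stale file behind. Writers on different machines would need a shared, distributed lock instead.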

References

https://lancedb.github.io/lance/read_and_write.html

https://py-filelock.readthedocs.io/en/latest/

posted on 2025-04-04 08:00 by 荣锋亮