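A quick trial of Lance with MinIO
This is a quick trial to learn how Lance handles data internally and how to integrate it with MinIO.
Reference code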
- app.py
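import lance
import pyarrow as pa
import duckdb

# Initial data: two rows keyed by name
new_table = pa.Table.from_pylist([{"name": "Bob", "age": 30},
                                  {"name": "Carla", "age": 37}])

# Connection settings for a local MinIO server, using the S3 option names
# documented for Lance. The bucket comes from the s3:// URI below.
storage_options = {
    "endpoint": "http://localhost:9000",
    "access_key_id": "minio",
    "secret_access_key": "minio123",
    "allow_http": "true",  # the endpoint is plain HTTP
    "region": "us-east-1",
}

# Create (or overwrite) the dataset in the "lance" bucket
ds = lance.write_dataset(new_table, "s3://lance/mydemo", mode="overwrite",
                         storage_options=storage_options)

# Upsert keyed on name: update matching rows, insert the rest
new_table_v2 = pa.Table.from_pylist([{"name": "Bob_v2", "age": 39},
                                     {"name": "Carla", "age": 35}])
ds.merge_insert("name") \
    .when_not_matched_insert_all() \
    .when_matched_update_all() \
    .execute(new_table_v2)

# DuckDB resolves the Python variable ds as a table
samples = duckdb.query("SELECT * FROM ds").to_df()
print(samples)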
- Result: Lance warns that concurrent writes may be unsafe and recommends using a lock
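To make the warning concrete, here is a toy model of why two writers can clash when the store only offers plain overwrites. It is a sketch under stated assumptions, not Lance's actual commit protocol: `ToyStore`, `unsafe_interleaving`, and `locked_commit` are hypothetical names invented for illustration.

```python
import threading


class ToyStore:
    """Hypothetical in-memory stand-in for a store with plain overwrites."""

    def __init__(self):
        self.manifests = {1: "initial"}

    def latest_version(self):
        return max(self.manifests)

    def put(self, version, payload):
        # Nothing stops a second writer from replacing an existing version.
        self.manifests[version] = payload


def unsafe_interleaving(store):
    # Both writers read the latest version before either one writes.
    v_a = store.latest_version() + 1
    v_b = store.latest_version() + 1
    store.put(v_a, "writer A")
    store.put(v_b, "writer B")  # silently replaces writer A's commit
    return v_a, v_b


def locked_commit(store, lock, payload):
    # Reading the version and writing the next one form one critical section.
    with lock:
        version = store.latest_version() + 1
        store.put(version, payload)
    return version


racy = ToyStore()
unsafe_interleaving(racy)
print(racy.manifests)  # {1: 'initial', 2: 'writer B'}

safe = ToyStore()
lock = threading.Lock()
threads = [threading.Thread(target=locked_commit, args=(safe, lock, name))
           for name in ("writer A", "writer B")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(safe.manifests))  # [1, 2, 3]
```

Without the lock, one commit is lost. With it, each writer gets its own version number.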
Reference lock usage
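from contextlib import contextmanager

# my_lock and data are placeholders for your own lock object and table

@contextmanager
def commit_lock(version: int):
    # Acquire the lock before the commit starts
    my_lock.acquire()
    try:
        yield
    finally:
        # Release the lock whether or not the commit succeeded
        my_lock.release()

lance.write_dataset(data, "s3://bucket/path/", commit_lock=commit_lock)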
Complete example
import lance
import pyarrow as pa
import duckdb
from contextlib import contextmanager
from filelock import FileLock

# File-based lock shared by every process that writes to the dataset
lock_path = "high_ground.txt.lock"
lock = FileLock(lock_path, timeout=10)

@contextmanager
def commit_lock(version: int):
    print("Acquiring lock")
    lock.acquire()
    try:
        yield
    finally:
        # Release the lock whether or not the commit succeeded
        print("Releasing lock")
        lock.release()

new_table = pa.Table.from_pylist([{"name": "Bob", "age": 30},
                                  {"name": "Carla", "age": 37}])

# Connection settings for a local MinIO server, using the S3 option names
# documented for Lance. The bucket comes from the s3:// URI below.
storage_options = {
    "endpoint": "http://localhost:9000",
    "access_key_id": "minio",
    "secret_access_key": "minio123",
    "allow_http": "true",  # the endpoint is plain HTTP
    "region": "us-east-1",
}

print("Dataset created, will write to S3, and need to acquire lock")
ds = lance.write_dataset(new_table, "s3://lance/mydemo", mode="overwrite",
                         storage_options=storage_options,
                         commit_lock=commit_lock)

new_table_v2 = pa.Table.from_pylist([{"name": "Bob_v2", "age": 39},
                                     {"name": "Carla", "age": 35}])
print("Dataset merge_insert, and need to acquire lock")
ds.merge_insert("name") \
    .when_not_matched_insert_all() \
    .when_matched_update_all() \
    .execute(new_table_v2)

# DuckDB resolves the Python variable ds as a table
samples = duckdb.query("SELECT * FROM ds").to_df()
print(samples)
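The merge_insert chain above acts as an upsert keyed on name: rows whose name already exists are updated, and rows with a new name are inserted. The plain-Python sketch below mimics that behaviour on the same sample rows so the expected result is easy to see. `upsert_by_key` is a hypothetical helper, not part of Lance, and the row order of a real Lance dataset may differ.

```python
def upsert_by_key(target, source, key):
    """Update rows whose key matches, insert rows whose key is new."""
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        # Matched key: replace all columns. Unmatched key: add a new row.
        merged[row[key]] = dict(row)
    return list(merged.values())


target = [{"name": "Bob", "age": 30}, {"name": "Carla", "age": 37}]
source = [{"name": "Bob_v2", "age": 39}, {"name": "Carla", "age": 35}]

result = upsert_by_key(target, source, "name")
for row in result:
    print(row)
# {'name': 'Bob', 'age': 30}
# {'name': 'Carla', 'age': 35}
# {'name': 'Bob_v2', 'age': 39}
```

Bob stays untouched because no incoming row has the name "Bob". Carla is updated to 35, and Bob_v2 is added as a third row.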
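Notes
Note that the S3 integration does not by itself guarantee safe concurrent writes, so using a lock is recommended. The lock can be local, or distributed when several instances have to be coordinated. filelock is one option that is safe across multiple processes and is easy to integrate. For scenarios that need vector storage, Lance is well worth trying, and LanceDB, which is built directly on Lance, is an all-in-one solution.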
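Since the lock may be local or distributed, one option is to keep the commit_lock wrapper independent of the lock implementation. The sketch below assumes only that the lock object exposes acquire() and release(). `make_commit_lock` and `RecordingLock` are hypothetical names, and `RecordingLock` merely stands in for a real distributed lock client.

```python
import threading
from contextlib import contextmanager


def make_commit_lock(lock_like):
    """Wrap any object with acquire() and release() as a commit_lock callable."""

    @contextmanager
    def commit_lock(version: int):
        lock_like.acquire()
        try:
            yield  # the caller performs the commit here
        finally:
            lock_like.release()  # always release, even if the commit raised

    return commit_lock


class RecordingLock:
    """Hypothetical stand-in for a distributed lock client; it logs each call."""

    def __init__(self):
        self._inner = threading.Lock()
        self.events = []

    def acquire(self):
        self._inner.acquire()
        self.events.append("acquire")

    def release(self):
        self.events.append("release")
        self._inner.release()


remote = RecordingLock()
commit_lock = make_commit_lock(remote)

with commit_lock(3):
    remote.events.append("commit v3")

print(remote.events)  # ['acquire', 'commit v3', 'release']
```

FileLock from filelock also exposes acquire() and release(), so it should fit the same wrapper. The resulting function would then be passed as commit_lock to lance.write_dataset, as in the complete example.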