Python Study Notes: Reading Multiple Files in a Loop into DataFrames and Merging Them

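This came out of a production need: a number of small, fragmented HDFS files had to be read with Python.

After repeated testing, it turned out that, apart from the unavoidable time spent in pd.read_csv itself, reading each file into a temporary variable and calling pd.concat inside the loop also adds considerable overhead, since every call copies all the rows accumulated so far.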

# Read the data
data = pd.DataFrame()
for file in final_path:
    tmp_data = pd.read_csv(file, header=None, encoding='utf8', sep='|', low_memory=False, nrows=10000) # nrows limits the number of rows read
    data = pd.concat([data, tmp_data], axis=0, ignore_index=True)

This raised the question of whether the files could first be read into separate DataFrames and then merged in a single step.
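
One way to sketch that idea is to collect the frames in a plain list and call pd.concat exactly once. The sample files, their contents, and the names tmp_dir, sample_paths and frames below are hypothetical stand-ins for the real HDFS extracts, not part of the original script.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample files standing in for the real pipe-separated extracts.
tmp_dir = tempfile.mkdtemp()
sample_paths = []
for name, text in [('a.txt', '1|x\n2|y\n'), ('b.txt', '3|z\n')]:
    file_path = os.path.join(tmp_dir, name)
    with open(file_path, 'w', encoding='utf8') as fh:
        fh.write(text)
    sample_paths.append(file_path)

# Read every file first, keeping each DataFrame in a list.
frames = [pd.read_csv(p, header=None, encoding='utf8', sep='|') for p in sample_paths]

# Concatenate once, so the accumulated rows are copied a single time.
data = pd.concat(frames, axis=0, ignore_index=True)
print(data)
```

A list is enough when the source of each row does not matter; a dictionary becomes useful when each frame should stay addressable by file name.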

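After a lot of searching and experimenting, I found that Python does not turn a string into a variable name this way, so the following attempt at dynamically named DataFrames does not work: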

# Read the data (wrong!!!)
for i, file in enumerate(final_path):
    tmp_data = f'tmp_data_{i}'
    print(tmp_data)
    tmp_data = pd.read_csv(file, header=None, encoding='utf8', sep='|', low_memory=False)
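
The reason this fails can be shown with a small sketch: the second assignment simply rebinds the same name tmp_data, so the string is discarded and no variable called tmp_data_0 ever exists. The in-memory frames below are hypothetical stand-ins for what pd.read_csv would return.

```python
import pandas as pd

# Hypothetical stand-ins for the frames that pd.read_csv would return.
frames = [pd.DataFrame({0: [1, 2]}), pd.DataFrame({0: [3, 4]})]

for i, frame in enumerate(frames):
    tmp_data = f'tmp_data_{i}'  # tmp_data is only a string at this point
    tmp_data = frame            # the same name is rebound; the string is discarded

# Only the last frame is still reachable through tmp_data.
last_is_kept = tmp_data is frames[-1]

# No names such as tmp_data_0 or tmp_data_1 were ever created.
dynamic_names_exist = 'tmp_data_0' in globals() or 'tmp_data_1' in globals()

print(last_is_kept, dynamic_names_exist)
```

Names can technically be created at run time by writing into globals(), but a container such as a list or dictionary is usually easier to loop over and to pass to pd.concat.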

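After a good deal more searching, I finally found that a dictionary can hold them: each file name becomes a key and each file's contents the corresponding value. The files are read one by one and merged at the end.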

import os
from os import walk
import pandas as pd
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_colwidth', 200)

# Set the path
path = r'C:\Users\Desktop\读取多个文件保存为多个dataframe'
os.chdir(path)

# Walk the directory and collect the file names and their full paths
res = []
final_path = []
for dir_path, dir_names, file_names in walk(path):
    res.extend(file_names)
    # Join with dir_path rather than path so files in subdirectories resolve correctly
    final_path.extend(os.path.join(dir_path, file_name) for file_name in file_names)
print(res) # ['a.txt', 'b.txt', 'c.txt']
print(final_path)

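# Store the DataFrames in a dictionary
dict_res = {}

# Read the data
for file in final_path:
    # splitext drops only the last extension, so 'a.b.txt' becomes 'a.b' rather than 'a'
    name = os.path.splitext(os.path.basename(file))[0]
    print(name)
    dict_res[name] = pd.read_csv(file, header=None, encoding='utf8', sep='|', low_memory=False)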

# Merge the data
data = pd.concat(list(dict_res.values()), axis=0, ignore_index=True)
print(data)
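
If the merged table should also record which file each row came from, pd.concat can take the dictionary itself, in which case the keys become an outer index level. The sketch below assumes small hypothetical frames in place of dict_res; the names dict_demo, labelled, 'source' and 'row' are illustrative only.

```python
import pandas as pd

# Hypothetical stand-ins for dict_res; in the script above the values come from pd.read_csv.
dict_demo = {
    'a': pd.DataFrame({0: [1, 2], 1: ['x', 'y']}),
    'b': pd.DataFrame({0: [3], 1: ['z']}),
}

# Passing the mapping makes pandas use its keys as the outer level of a MultiIndex.
labelled = pd.concat(dict_demo, names=['source', 'row'])

# Move the outer level into an ordinary column, then renumber the rows.
labelled = labelled.reset_index(level='source').reset_index(drop=True)
print(labelled)
```

This keeps the file name alongside every row, which can help when tracing a bad record back to the small file it came from.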

Reference: Python: Looping through directory and saving each file using filename as data frame name

posted @ 2022-08-01 17:22 Hider1214
