Python: Importing and Saving Files

Runoob tutorial: https://www.runoob.com/python/os-chdir.html
Official Python documentation: https://docs.python.org/3.9/library/os.html?highlight=os chdir#os.chdir
Datascience: https://betterprogramming.pub/the-top-10-file-handling-techniques-in-python-cf2330a16e7

Setting the path

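Python can work with data scraped from web pages as well as with data stored on the local machine. Python needs a configured environment; if you install it through Anaconda, Anaconda sets the environment up for you. Python then starts in a default working directory (typically the directory the interpreter or IDE was launched from), and relative file paths are resolved against it.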

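When you reuse someone else's Python code, keep in mind that the default working directory differs from machine to machine, so you need to tell Python where your local files are. Use os.chdir("path") for this, where path is the folder that contains the files. In a Windows path, \\ (an escaped backslash) separates directory levels; / works as well.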

import os
os.chdir("D:\\研究生课程\\研一小学期\\SDA\\Homework\\hw3")

The following commands check whether the working directory has actually changed.

sub_dir = os.getcwd()  # get the current working directory
print(sub_dir)
# Use sub_dir as cwd_dir if it exists; otherwise fall back to os.getcwd()
cwd_dir = sub_dir if os.path.exists(sub_dir) else os.getcwd()

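To create files in the new working directory, for example to save analysis results, use a command like the one below, where + concatenates the two strings.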

plt.savefig(cwd_dir + "\\wordcloud_abstract_PeiQi.png", dpi=500, transparent=True)
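Concatenating with + only works while the separator is hard-coded for one operating system. As a hedged alternative, os.path.join builds the same kind of path with the separator of the current OS. In the sketch below, build_output_path is a hypothetical helper name, and a temporary directory stands in for the working directory.

```python
import os
import tempfile


def build_output_path(directory, filename):
    """Join a directory and a file name using the OS-specific separator."""
    return os.path.join(directory, filename)


# A temporary directory stands in for the working directory (assumption).
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = build_output_path(tmp_dir, "wordcloud_abstract_PeiQi.png")
    print(out_path)
    # The result could then be passed on, e.g. plt.savefig(out_path, dpi=500).
```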

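If the approach above feels too cumbersome, the following offers a quicker way to pin down the current path. A module is a program, and a program is a module. To let Python know whether a module is being run as a program or imported into another program, use __name__: when the file is run as the main program, __name__ is '__main__'; when it is imported from another module, __name__ is the module's name.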

if __name__ == '__main__':
  pass  # placeholder
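To make the two cases concrete, here is a minimal sketch. The function name describe_run_mode is made up for illustration; it only inspects the value that Python stores in __name__.

```python
def describe_run_mode(module_name):
    """Report whether a module is being run directly or imported."""
    if module_name == "__main__":
        return "run as the main program"
    return "imported as module '{}'".format(module_name)


if __name__ == "__main__":
    # Only reached when this file is executed directly, not when it is imported.
    print(describe_run_mode(__name__))
```

Code placed under the guard is skipped when another script imports the file, so functions defined above it can be reused without side effects.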

Getting all files in a folder

import os
import numpy as np
import pandas as pd

def get_file_values(data_path, file):
    """get dataframe close values"""
    data = pd.read_csv(os.path.join(data_path, file))
    close_values = data.Close.values.reshape(-1, 1)
    return close_values

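def get_fname_num(path_files, data_path, csv_names=None, len_=None):
    """get file name and the number of observations"""
    # Create fresh lists on every call; mutable default arguments are shared between calls
    csv_names = [] if csv_names is None else csv_names
    len_ = [] if len_ is None else len_
    for file in path_files:
        if file.endswith('.csv'):
            close_values = get_file_values(data_path, file)
            len_file = close_values.shape[0]
            f_name = os.path.splitext(file)[0]
            csv_names.append(f_name)
            len_.append(len_file)

    dictionary = dict(zip(csv_names, len_))

    return dictionary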


if __name__ == '__main__':
    print('Current path: {}'.format(os.getcwd()))
    data_path = r'F:\example\data'  # raw string, so backslashes are not treated as escapes
    path_files = os.listdir(data_path)

    dict_file = get_fname_num(path_files, data_path)
    # The file with the most observations becomes the base table for the left joins
    max_key = max(dict_file, key=lambda x: dict_file[x])

    file = max_key + '.csv'
    data = pd.read_csv(os.path.join(data_path, file))
    data.rename(columns={'Close': max_key}, inplace=True)
    merge_data = data.copy()

    for key in dict_file.keys():
        if key != max_key:
            file = key + '.csv'
            data = pd.read_csv(os.path.join(data_path, file))
            data.rename(columns={'Close': key}, inplace=True)
            merge_data = merge_data.merge(data, how='left', on='Date')

    print('number of na \n', merge_data.isna().sum())
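If only the file names are needed, the listing step can also be sketched with pathlib from the standard library. The helper name list_csv_stems and the sample file names are assumptions for illustration; the demo runs on a throwaway folder rather than on the data folder used above.

```python
import tempfile
from pathlib import Path


def list_csv_stems(folder):
    """Return the sorted base names (without extension) of the CSV files in folder."""
    return sorted(p.stem for p in Path(folder).glob("*.csv"))


# Demo on a throwaway folder with made-up file names (assumption).
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("a.csv", "b.csv", "notes.txt"):
        Path(tmp_dir, name).write_text("Date,Close\n", encoding="utf-8")
    csv_stems = list_csv_stems(tmp_dir)
    print(csv_stems)
```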

Adding a constant column:

test_= pd.DataFrame({'Date': data.Date,
                     'Close': data['EMI01003837'],
                     'ID': 'EMI01003837'})

File import speed comparison:

We often need to load local files into a DataFrame, and the data usually comes as csv or xlsx files. Which format imports faster?

https://royseto.medium.com/python-faster-ways-to-import-data-files-887dec096546
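One way to check this on your own files is to time each reader. The sketch below is a minimal, standard-library-only illustration: time_call and read_csv_rows are hypothetical helper names, and the sample file is generated on the fly, so the measured time says nothing about real data.

```python
import csv
import os
import tempfile
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def read_csv_rows(path):
    """Read every row of a CSV file into a list of lists."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


# Write a small made-up CSV file, then time how long it takes to read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Date", "Close"])
        writer.writerows([[i, i * 0.5] for i in range(1000)])
    rows, elapsed = time_call(read_csv_rows, csv_path)
    print("rows read:", len(rows), "seconds:", elapsed)
```

The same helper can wrap pd.read_csv and pd.read_excel on the same data, for example time_call(pd.read_csv, path), to compare the two formats directly.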

os

Commonly used functions in the os and os.path modules for working with files and directories
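As a short illustration, the sketch below exercises a few of these functions inside a temporary folder. The folder and file names are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    data_dir = os.path.join(tmp_dir, "data")       # build a path
    os.makedirs(data_dir, exist_ok=True)           # create the folder (and parents)
    file_path = os.path.join(data_dir, "train.csv")
    with open(file_path, "w", encoding="utf-8") as f:
        f.write("Date,Close\n")

    exists = os.path.exists(file_path)             # does the path exist?
    is_file = os.path.isfile(file_path)            # is it a regular file?
    is_dir = os.path.isdir(data_dir)               # is it a directory?
    listing = os.listdir(data_dir)                 # names inside the folder
    base_name = os.path.basename(file_path)        # 'train.csv'
    stem, ext = os.path.splitext(base_name)        # ('train', '.csv')
    print(exists, is_file, is_dir, listing, stem, ext)
```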

Garbled text when importing a CSV with pandas

Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 158: invalid start byte

Solution: https://www.cnblogs.com/huangchenggener/p/10983812.html
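This error usually means the file was saved in an encoding other than UTF-8. The sketch below is one possible way to find a workable encoding, not necessarily the method of the linked post: guess_encoding is a hypothetical helper, and the candidate list (UTF-8, then GBK) is an assumption that suits files produced on Chinese-language Windows systems.

```python
def guess_encoding(raw_bytes, candidates=("utf-8", "gbk")):
    """Return the first candidate encoding that decodes raw_bytes, else None."""
    for encoding in candidates:
        try:
            raw_bytes.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None


# Made-up sample: a header line written in GBK, which is not valid UTF-8.
sample = "日期,收盘价\n".encode("gbk")
encoding = guess_encoding(sample)
print(encoding)
# The name can then be passed on, e.g. pd.read_csv(path, encoding=encoding).
```

A successful decode does not prove the guess is right, so it is worth checking the first few rows by eye after loading.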

Alternatively, you could do this:

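import time
import pandas as pd
start_time = time.time()
df_train_por = pd.read_excel('./Data_xlsx/train.xlsx')
# to_csv writes the file and returns None when given a path, so do not reassign its result
df_train_por.to_csv('./Data/train.csv', index=False)
print('Time to read and convert the data:', time.time() - start_time)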

Pausing a running program: https://blog.csdn.net/qq_33567641/article/details/82346941
- https://blog.csdn.net/qq_36998053/article/details/82949517
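The linked posts are not reproduced here. As one common option, time.sleep from the standard library pauses execution for a given number of seconds; the wrapper name pause below is made up for illustration.

```python
import time


def pause(seconds):
    """Block the current thread for roughly the given number of seconds."""
    start = time.perf_counter()
    time.sleep(seconds)
    return time.perf_counter() - start


waited = pause(0.1)
print("paused for about {:.2f} s".format(waited))
```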

Importing txt files

https://blog.csdn.net/weixin_40422121/article/details/102407166
Python: importing a txt file with pandas: https://blog.csdn.net/weixin_45845722/article/details/115293530
How to read a txt file in Python: https://www.php.cn/python-tutorials-424790.html
Removing \n when reading a file: https://blog.csdn.net/kevinjin2011/article/details/105550869

import json
import os

# Read multiple dictionaries from a txt file, one JSON object per line
texts_list = []

with open(os.path.join(data_path, '1641650263.6994636.txt')) as f:
    for line in f:
        line_dict = json.loads(line)
        texts_list.append(line_dict)
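Real files often contain blank lines or trailing newlines that are worth handling explicitly. The sketch below is a slightly more defensive variant under stated assumptions: read_json_lines is a hypothetical helper, the file is assumed to be UTF-8, and the sample content is made up.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Parse a text file that holds one JSON object per line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()      # drop the trailing "\n" and surrounding spaces
            if line:                 # skip blank lines
                records.append(json.loads(line))
    return records


# Made-up sample file with a blank line in the middle (assumption).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write('{"id": 1}\n\n{"id": 2}\n')
    records = read_json_lines(sample_path)
    print(records)
```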

%, format, and f-strings

Saving with % and format: https://blog.csdn.net/wjqhit/article/details/103095729
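For comparison, the three styles can produce the same string. The variable names and values below are made up for illustration.

```python
name = "train"
rows = 1200
seconds = 3.14159

# Old-style % formatting
by_percent = "%s.csv: %d rows in %.2f s" % (name, rows, seconds)
# str.format with positional placeholders
by_format = "{}.csv: {} rows in {:.2f} s".format(name, rows, seconds)
# f-string (Python 3.6+), with expressions written inline
by_fstring = f"{name}.csv: {rows} rows in {seconds:.2f} s"
print(by_fstring)
```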

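Saving JSON files

Each record is written on its own line as a JSON object, using join on a list.

Common Python methods (1) - the list join method: https://blog.csdn.net/dylan_young/article/details/112298983
Difference between dumps and dump: https://blog.csdn.net/KassadinSw/article/details/73912645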

import json

def save_event_json(event_text):
    """
    Save the data as JSON files, one JSON object per line.
    :param event_text: dict whose values are lists of records.
    :return: None
    """
    for k in event_text.keys():
        with open(f'./add_event_data/{k}_data_add.json', 'w', encoding='utf-8') as f:
            # json.dump(str(event_text['BondPrice']), f, ensure_ascii=False, indent=4)
            f.write("\n".join([json.dumps(r, ensure_ascii=False) for r in event_text[k]]))
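To make the difference between dumps and dump concrete, here is a small sketch using made-up records. An in-memory io.StringIO buffer stands in for an open file, which is an assumption made so that the example writes nothing to disk.

```python
import io
import json

record = {"name": "BondPrice", "values": [1, 2]}

# dumps returns a str; dump writes to a file-like object and returns None.
as_text = json.dumps(record, ensure_ascii=False)

buffer = io.StringIO()           # stands in for an open file (assumption)
json.dump(record, buffer, ensure_ascii=False)

# join puts one serialized record per line, with no trailing newline.
records = [{"id": 1}, {"id": 2}]
json_lines = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
print(json_lines)
```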

posted on 2021-07-08 20:27 RankFan
