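[Python Data Analysis] COVID-19 Data: Acquisition
This is the first article in the [Python Data Analysis] series. It fetches a dataset through the Python interface provided by Wind, applies some light processing, and saves the result locally.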
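Δ Conventions for this series: text after # is either an explanation of a step or a code comment, and each code listing is followed directly by the output it produces (as text or as a screenshot).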
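1.
# Import pandas and WindPy, Wind's Python interface.
# If you are using Jupyter and it reports that the WindPy module cannot be found, copying WindPy.pth into the Jupyter installation path (typically the site-packages directory of the Python environment Jupyter runs in) may resolve it.
import pandas as pd
from WindPy import w  # w is the only WindPy name used below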
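When Jupyter cannot find WindPy, it helps to confirm which interpreter the kernel is actually running and where that interpreter looks for packages, since that is where a .pth file needs to live. The following is a minimal sketch using only the standard library; the helper name `describe_environment` is made up for this illustration and is not part of WindPy.

```python
import importlib.util
import sys
import sysconfig


def describe_environment(module_name="WindPy"):
    """Report the running interpreter, its site-packages directory,
    and whether the named top-level module can be found."""
    return {
        "interpreter": sys.executable,
        "site_packages": sysconfig.get_paths()["purelib"],
        "module_found": importlib.util.find_spec(module_name) is not None,
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

Running this in a notebook cell and again in a plain Python prompt shows whether the two use different environments, which is a common reason for a module being importable in one and not the other.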
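2.
# Start WindPy and fetch the cumulative confirmed COVID-19 case counts for the year up to 2020-10-18, storing the result in Wdata.
# Each code beginning with S identifies one series: the national total followed by the 31 province-level regions.
w.start()
# Adjacent string literals are joined by Python, so no whitespace slips into the code list.
Wdata = w.edb(
    "S6274770,S6289292,S6274299,S6289303,S6274354,S6274437,S6289302,S6289313,S6274256,S6289297,S6289296,"
    "S6274810,S6289290,S6289294,S6289295,S6289291,S6275679,S6275447,S6275558,S6275394,S6289298,S6289301,S6289289,"
    "S6289293,S6275360,S6289300,S6275203,S6289299,S6274391,S6289316,S6274477,S6289310",
    "2019-10-19", "2020-10-18", "Fill=Previous")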
Welcome to use Wind Quant API for Python (WindPy)! COPYRIGHT (C) 2020 WIND INFORMATION CO., LTD. ALL RIGHTS RESERVED. IN NO CIRCUMSTANCE SHALL WIND BE RESPONSIBLE FOR ANY DAMAGES OR LOSSES CAUSED BY USING WIND QUANT API FOR Python.
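3.
# The welcome message from the Wind quant API confirms that it started. Next, print Wdata to inspect what was returned.
# In Wdata, .Codes holds the series codes, .Times holds the dates, and .Data holds the values.
print(Wdata)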
.ErrorCode=0
.Codes=[S6274770,S6289292,S6274299,S6289303,S6274354,S6274437,S6289302,S6289313,S6274256,S6289297,...]
.Fields=[CLOSE]
.Times=[20200116,20200117,20200118,20200119,20200120,20200121,20200122,20200123,20200124,20200125,...]
.Data=[[nan,nan,nan,nan,291.0,440.0,571.0,830.0,1287.0,1975.0,...],[nan,nan,nan,nan,nan,nan,nan,26.0,36.0,51.0,...],[nan,nan,nan,nan,nan,2.0,4.0,5.0,8.0,10.0,...],[nan,nan,nan,nan,nan,nan,1.0,2.0,8.0,13.0,...],[nan,nan,nan,nan,nan,nan,1.0,1.0,6.0,9.0,...],[nan,nan,nan,nan,nan,nan,nan,1.0,2.0,7.0,...],[nan,nan,nan,nan,nan,nan,2.0,4.0,12.0,19.0,...],[nan,nan,nan,nan,nan,nan,nan,3.0,4.0,4.0,...],[nan,nan,nan,nan,nan,nan,2.0,4.0,9.0,15.0,...],[nan,nan,nan,0.0,2.0,9.0,16.0,20.0,33.0,40.0,...],...]
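The printed result starts with .ErrorCode=0. Before building anything on top of the data, it may be worth checking that field explicitly. The sketch below assumes, as the output above suggests, that 0 signals success; `require_success` is a hypothetical helper, and the stand-in objects only mimic the attributes shown above so that the example runs without a Wind terminal.

```python
from types import SimpleNamespace


def require_success(result):
    """Return the result unchanged when ErrorCode is 0, otherwise raise."""
    if result.ErrorCode != 0:
        raise RuntimeError(f"Wind request failed with ErrorCode={result.ErrorCode}")
    return result


# Stand-ins for a Wind result; -1 is a placeholder, not a documented Wind error code.
ok = SimpleNamespace(ErrorCode=0, Codes=["S6274770"], Data=[[291.0, 440.0]])
failed = SimpleNamespace(ErrorCode=-1, Codes=[], Data=[])

checked = require_success(ok)

try:
    require_success(failed)
except RuntimeError as exc:
    message = str(exc)
    print(message)
```

With a real session the call would look like `require_success(Wdata)`, placed right after the request.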
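4.
# Wdata.Data holds one list per code, each with 276 values (one per date), so the DataFrame is first built with the codes as the row index and the dates as the columns. Having dates as rows and region codes as columns is the more conventional layout, so the frame is then transposed with the .T attribute.
covid_cases_cumsum = pd.DataFrame(Wdata.Data, index=Wdata.Codes, columns=Wdata.Times).T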
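The reshaping in this step can be tried on a small stand-in without a Wind connection. The sketch below uses two of the codes and three of the dates from the output above, and assumes the dates arrive as `datetime.date` objects; the variable names are illustrative. It also converts the index to a DatetimeIndex, an optional extra that makes date-based selection easier later.

```python
import datetime as dt

import pandas as pd

nan = float("nan")
codes = ["S6274770", "S6289292"]
times = [dt.date(2020, 1, 19), dt.date(2020, 1, 20), dt.date(2020, 1, 21)]
data = [
    [nan, 291.0, 440.0],  # one list per code, one value per date
    [nan, nan, nan],
]

# Codes on the rows, dates on the columns: the shape the raw lists impose.
wide = pd.DataFrame(data, index=codes, columns=times)

# Transpose so that each row is a date and each column is a code.
by_date = wide.T
by_date.index = pd.to_datetime(by_date.index)

print(by_date)
```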
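5.
# Print the first rows. The table of cumulative confirmed cases was built successfully, but the regions are identified only by their codes, which is hard to read.
print(covid_cases_cumsum.head())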
            S6274770  S6289292  S6274299  S6289303  S6274354  S6274437  \
2020-01-16       NaN       NaN       NaN       NaN       NaN       NaN
2020-01-17       NaN       NaN       NaN       NaN       NaN       NaN
2020-01-18       NaN       NaN       NaN       NaN       NaN       NaN
2020-01-19       NaN       NaN       NaN       NaN       NaN       NaN
2020-01-20     291.0       NaN       NaN       NaN       NaN       NaN

            S6289302  S6289313  S6274256  S6289297  ...  S6289289  S6289293  \
2020-01-16       NaN       NaN       NaN       NaN  ...       NaN       NaN
2020-01-17       NaN       NaN       NaN       NaN  ...       NaN       NaN
2020-01-18       NaN       NaN       NaN       NaN  ...       NaN       NaN
2020-01-19       NaN       NaN       NaN       0.0  ...       NaN       NaN
2020-01-20       NaN       NaN       NaN       2.0  ...       NaN       NaN

            S6275360  S6289300  S6275203  S6289299  S6274391  S6289316  \
2020-01-16       NaN       NaN       NaN       NaN       NaN       NaN
2020-01-17       NaN       NaN       NaN       NaN       NaN       NaN
2020-01-18       NaN       NaN       NaN       NaN       NaN       NaN
2020-01-19       NaN       NaN       NaN       NaN       NaN       NaN
2020-01-20       NaN       NaN       NaN       NaN       NaN       NaN

            S6274477  S6289310
2020-01-16       NaN       NaN
2020-01-17       NaN       NaN
2020-01-18       NaN       NaN
2020-01-19       NaN       0.0
2020-01-20       NaN       0.0

[5 rows x 32 columns]
6.
# Build a dictionary whose keys are the region codes and whose values are the region names ('全国' is the national total), then print it to check that it was built correctly.
code_list = ['S6274770', 'S6289292', 'S6274299', 'S6289303', 'S6274354', 'S6274437', 'S6289302', 'S6289313', 'S6274256', 'S6289297',
             'S6289296', 'S6274810', 'S6289290', 'S6289294', 'S6289295', 'S6289291', 'S6275679', 'S6275447', 'S6275558', 'S6275394',
             'S6289298', 'S6289301', 'S6289289', 'S6289293', 'S6275360', 'S6289300', 'S6275203', 'S6289299', 'S6274391', 'S6289316',
             'S6274477', 'S6289310']
area_list = ['全国', '北京', '天津', '河北', '山西', '内蒙古', '辽宁', '吉林', '黑龙江', '上海', '江苏', '浙江', '安徽', '福建', '江西', '山东', '河南',
             '湖北', '湖南', '广东', '广西', '海南', '重庆', '四川', '贵州', '云南', '西藏', '陕西', '甘肃', '青海', '宁夏', '新疆']
# Pair each code with the name at the same position.
code_to_area_dict = dict(zip(code_list, area_list))
print(code_to_area_dict)
{'S6274770': '全国', 'S6289292': '北京', 'S6274299': '天津', 'S6289303': '河北', 'S6274354': '山西', 'S6274437': '内蒙古', 'S6289302': '辽宁', 'S6289313': '吉林', 'S6274256': '黑龙江', 'S6289297': '上海', 'S6289296': '江苏', 'S6274810': '浙江', 'S6289290': '安徽', 'S6289294': '福建', 'S6289295': '江西', 'S6289291': '山东', 'S6275679': '河南', 'S6275447': '湖北', 'S6275558': '湖南', 'S6275394': '广东', 'S6289298': '广西', 'S6289301': '海南', 'S6289289': '重庆', 'S6289293': '四川', 'S6275360': '贵州', 'S6289300': '云南', 'S6275203': '西藏', 'S6289299': '陕西', 'S6274391': '甘肃', 'S6289316': '青海', 'S6274477': '宁夏', 'S6289310': '新疆'}
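Pairing two hand-typed lists by position is easy to get wrong: a single missing entry shifts every later name onto the wrong code without raising an error. One possible safeguard, sketched here on a three-entry sample, is to check the lengths and look for duplicate keys before building the dictionary. The helper `build_mapping` is hypothetical and not part of pandas or WindPy.

```python
def build_mapping(keys, values):
    """Pair keys with values by position, rejecting mismatched lengths
    and duplicate keys instead of silently dropping entries."""
    if len(keys) != len(values):
        raise ValueError(f"{len(keys)} keys but {len(values)} values")
    mapping = dict(zip(keys, values))
    if len(mapping) != len(keys):
        raise ValueError("duplicate keys would overwrite each other")
    return mapping


sample_codes = ["S6274770", "S6289292", "S6274299"]
sample_areas = ["全国", "北京", "天津"]
sample_mapping = build_mapping(sample_codes, sample_areas)
print(sample_mapping)
```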
7.
# Rename the columns with DataFrame.rename, passing code_to_area_dict as the columns mapping. Setting inplace=True modifies the original DataFrame directly.
covid_cases_cumsum.rename(columns=code_to_area_dict, inplace=True)
# info() prints its report itself and returns None, so it is called without print() to avoid a stray "None" line.
covid_cases_cumsum.info()
<class 'pandas.core.frame.DataFrame'>
Index: 276 entries, 2020-01-16 to 2020-10-17
Data columns (total 32 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   全国      272 non-null    float64
 1   北京      269 non-null    float64
 2   天津      271 non-null    float64
 3   河北      270 non-null    float64
 4   山西      270 non-null    float64
 5   内蒙古     269 non-null    float64
 6   辽宁      270 non-null    float64
 7   吉林      269 non-null    float64
 8   黑龙江     270 non-null    float64
 9   上海      273 non-null    float64
 10  江苏      270 non-null    float64
 11  浙江      269 non-null    float64
 12  安徽      270 non-null    float64
 13  福建      270 non-null    float64
 14  江西      271 non-null    float64
 15  山东      268 non-null    float64
 16  河南      269 non-null    float64
 17  湖北      276 non-null    float64
 18  湖南      271 non-null    float64
 19  广东      273 non-null    float64
 20  广西      270 non-null    float64
 21  海南      270 non-null    float64
 22  重庆      270 non-null    float64
 23  四川      270 non-null    float64
 24  贵州      270 non-null    float64
 25  云南      271 non-null    float64
 26  西藏      262 non-null    float64
 27  陕西      269 non-null    float64
 28  甘肃      269 non-null    float64
 29  青海      267 non-null    float64
 30  宁夏      270 non-null    float64
 31  新疆      273 non-null    float64
dtypes: float64(32)
memory usage: 71.2+ KB
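DataFrame.rename leaves any column that is missing from the mapping untouched, so a typo in a code goes unnoticed unless it is checked for. The sketch below shows that behaviour on a small frame, along with the non-inplace form that returns a renamed copy. The column `S0000000` is a made-up code included only to demonstrate an unmapped column.

```python
import pandas as pd

nan = float("nan")
frame = pd.DataFrame({
    "S6274770": [291.0, 440.0],
    "S6289292": [nan, nan],
    "S0000000": [1.0, 2.0],  # made-up code with no entry in the mapping
})
names = {"S6274770": "全国", "S6289292": "北京"}

# Columns that rename() would silently leave unchanged.
unmapped = [col for col in frame.columns if col not in names]

# Without inplace=True, rename() returns a new frame and keeps the original.
renamed = frame.rename(columns=names)

print("unmapped:", unmapped)
print(list(renamed.columns))
```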
8.
# Save covid_cases_cumsum to a local CSV file. The data has also been shared on Baidu Netdisk. Link: https://pan.baidu.com/s/100z9ZUpsI-xIcirhy3hkpA Extraction code: 4eo8
covid_cases_cumsum.to_csv(r'C:\Users\92342\Desktop\python\Python数据分析学习之路\1、疫情数据\covid_cases_cumsum.csv')
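To confirm that a saved file can be loaded back in the same shape, a round trip on a small frame is a quick check. This sketch writes to a temporary directory rather than the path above. The `utf-8-sig` encoding is an assumption added here, not part of the original call; it is often chosen so that spreadsheet programs detect UTF-8 when the headers contain Chinese text.

```python
import os
import tempfile

import pandas as pd

nan = float("nan")
cases = pd.DataFrame(
    {"全国": [291.0, 440.0], "北京": [nan, nan]},
    index=pd.to_datetime(["2020-01-20", "2020-01-21"]),
)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "covid_cases_cumsum.csv")
    cases.to_csv(path, encoding="utf-8-sig")
    # index_col=0 restores the date column as the index; parse_dates=True parses it as dates.
    restored = pd.read_csv(path, index_col=0, parse_dates=True, encoding="utf-8-sig")

print(restored)
```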