数据分析与机器学习二:数据处理pandas
pandas 是基于NumPy 的一种工具,该工具是为了解决数据分析任务而创建的。
pandas 纳入了大量库和一些标准的数据模型,提供了高效地操作大型数据集所需的工具。
pandas提供了大量能使我们快速便捷地处理数据的函数和方法。
一、生成dataFrame:也是矩阵结构的数据
单行或单列的数据类型为series,相当于numpy.ndarray中的一维矩阵;
dataframe数据类型,相当于numpy.ndarray的二维矩阵;
panel :三维的数组;
三者关系:
dataframe由一系列series组成,也可以说dataframe是series的容器;当然,panel为datafame的容器。
1.从csv文件,得到dataFrame:读取csv文件,使用read_csv函数
import pandas food_info = pandas.read_csv("food_info.csv") print(type(food_info)) # <class 'pandas.core.frame.DataFrame' print (food_info.dtypes) NDB_No int64 Shrt_Desc object # 字符串类型数据str,在pandas中叫object Water_(g) float64 Energ_Kcal int64 Protein_(g) float64 Lipid_Tot_(g) float64 Ash_(g) float64
read_csv的参数:
filepath_or_buffer:可以是stringIO,可以是可以是URL(http, ftp, s3),和文件。如,本地文件读取实例:“://localhost/path/to/table.csv”
sep:指定分隔符,默认为逗号。
delimiter :定界符,备选分隔符(如果指定该参数,则sep参数失效)
header :整数,指定行数用来作为列名,数据开始行数。如果文件中没有列名,则默认为0,否则设置为None
skipinitialspace :是否忽略分隔符号的空白,默认为False
skiprows :需要忽略的行数,如跳过第一行(0开始)
encoding:通常指定为"utf-8"
dtype:如下
# 在pandas中,dtypes说明: # object -- 对应string values # int --- 对应int values # float -- 对应float values # datetime -- 对应time values # bool -- 对应boolean values # print (help(pandas.read_csv))
其它参数,略....
food_info.head() # 默认取前5条数据,如需自定义所取数,传数字即可 food_info.tail(2) # 默认取后5条数据,如需自定义所取数,传数字即可 print(food_info.columns) # 获取数据dataFrame的列的标题 # Index(['num', 'title', 'price', 'jing'.........], dtype='object') print(food_info.shape) # 查看数组结构:(8618, 36)
2.使用pd.DataFrame(data),构造dataFrame数据
data = {'state':['Ohino','Ohino','Ohino','Nevada','Nevada'], 'year':[2000,2001,2002,2001,2002], 'pop':[1.5,1.7,3.6,2.4,2.9]} DF = pandas.DataFrame(data) print(DF) print(type(DF)) # <class 'pandas.core.frame.DataFrame'> print(DF.dtypes) # dtype: object字符类型的数据
二、读取dataFrame的数据:
1.使用索引读取数据:
同numpy一样,使用索引读取矩阵的数据;
但是pandas不能直接使用索引读取数据,
使用dataFrame.loc[索引],读取矩阵的行的数据,会将标题也读进来;
使用dataFrame[列名],读取矩阵的列的数据
# 读取行,数据类型为series或dataFrame: x = food_info.loc[0] # 取第一行数据 print(x) print(type(x)) # 取出单行的数据类型为:class 'pandas.core.series.Series'> y = food_info.loc[[0, 2, 3]] # 取到第0、2、3行的数据 print(y) print(type(y)) # 取出多行的数据类型为: <class 'pandas.core.frame.DataFrame'> print(food_info.loc[0:1]) # 取到第1行到第二行的数据
NDB_No 1001 Shrt_Desc BUTTER WITH SALT Water_(g) 15.87 Energ_Kcal 717 Protein_(g) 0.85 Lipid_Tot_(g) 81.11 Ash_(g) 2.11 Carbohydrt_(g) 0.06 Fiber_TD_(g) 0 .......... Name: 0, dtype: object NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g)....... 0 1001 BUTTER WITH SALT 15.87 717 0.85 ...... 2 1003 BUTTER OIL ANHYDROUS 0.24 876 0.28 .... 3 1004 CHEESE BLUE 42.41 353 21.40....
# 读取列,读取出来的数据类型为series或dataFrame: x = food_info['NDB_No'] # 通过列名,取出此列的数据print(x) print(type(x)) #取出的单列的数据类型为:<class 'pandas.core.series.Series'> y = food_info[['NDB_No', 'Shrt_Desc', 'Water_(g)']] # 通过列名,取出此列的数据
print(y)
print(type(y)) #取出多列的数据类型为:<class 'pandas.core.frame.DataFrame'>
0 1001
1 1002
2 1003
3 1004
4 1005
5 1006
6 1007
........
NDB_No Shrt_Desc Water_(g)
0 1001 BUTTER WITH SALT 15.87
1 1002 BUTTER WHIPPED WITH SALT 15.87
2 1003 BUTTER OIL ANHYDROUS 0.24
3 1004 CHEESE BLUE 42.41
2.根据索引查找
# 示例:找到标题以“g”结尾的列数据 # 首先拿到每列标题,并转为list col_list = food_info.columns.tolist() # print(col_list) # ['NDB_No', 'Shrt_Desc', 'Water_(g)',....... # 然后通过字符串匹配 titles_g = [] for col_title in col_list: if col_title.endswith("g"): titles_g.append(col_title) print(titles_g) # ['Water_(g)', 'Protein_(g)', ..... data_g = food_info[titles_g] print(type(data_g.head(3))) # <class 'pandas.core.frame.DataFrame'> print(data_g.head(3)) Water_(g) Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) ......... 0 15.87 0.85 81.11 2.11 0.06 1 15.87 0.85 81.11 2.11 0.06 2 0.24 0.28 99.48 0.00 0.00
# 计算:csv表中有一列数据单位是mg,我们需要将该列数据转为g为单位, # 那么这列数据除以1000即可,然后可以替换原表中数据 data_1000 = food_info["Calcium_(mg)"] / 1000 # print(data_1000) # 还可以再加一列 food_info["Calcium_(g)"] = data_1000 print(food_info.shape) # 原数据将增加一列,(8618, 37) # 取一列当中最大值,最小值同理 data_max = food_info["Calcium_(mg)"].max() print(data_max) # 7364.0 # 列与列相乘 # total_price = food_info['price'] * food_info['jing']
print(food_info[["Water_(g)", "Energ_Kcal"]][:3]) # 读取多个列时,要多加一个[] x = food_info["Water_(g)"] * food_info["Energ_Kcal"] # 乘法,第1个值与第1个值相乘,第2个与第2个.....第n个与第n个 print(type(x)) # 原数据dataFrame增加一列 print(food_info.shape) food_info['new1'] = x print(food_info.shape) print(x) Water_(g) Energ_Kcal 0 15.87 717 1 15.87 717 2 0.24 876 <class 'pandas.core.series.Series'> (8618, 38) (8618, 39) 0 11378.79 1 11378.79 2 210.24 3 14970.73 4 15251.81 5 16172.28 ...........
三、dataFrame或Series的操作
1.求最大值
# 求某列的最大值 food_info['Energ_Kcal'].max()
# 求dataFrame各列的最大值 food_info.max() NDB_No 93600 Shrt_Desc ZWIEBACK Water_(g) 100 Energ_Kcal 902 Protein_(g) 88.32 Lipid_Tot_(g) 100 Ash_(g) 99.8 ....
2.排序
print (food_info[["Sodium_(mg)"]]) food_info.sort_values("Sodium_(mg)", inplace=True) # inplace为True时,原地修改;为False时,返回新的dateFrame print (food_info["Sodium_(mg)"]) food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False) # 默认升序排序,ascending为False时为降序,None值始终排在最后; # print (food_info["Sodium_(mg)"]) Sodium_(mg) 276 38758.0 5814 27360.0 6192 26050.0 1242 26000.0 ...... 746 0.0 395 0.0 ... 8153 NaN 8155 NaN .............
四、示例:泰坦尼克号获救预测
1.生成dataFrame数据
#从csv文件读取数据,生成dataFrame数据 titanic_train = pandas.read_csv("train.csv") print(titanic_train.head())
PassengerId Survived Pclass .............. 0 1 0 3 1 2 1 1
船员数据
Survived:是否获救 Survive=1(获救) Survive=0(死亡)
Pclass:船舱等级 3等舱:便宜 --> 1等舱:贵
SibSp:兄弟姐妹个数
Parch:parent and child 携带 老人和孩子个数
Fare:船票价格
Embarked:上船地点 S C Q
2. 计算乘客的平均年龄:
age = titanic_survival["Age"] print(age.loc[0:10]) 0 22.0 1 38.0 2 26.0 3 35.0 4 35.0 5 NaN 6 54.0 7 2.0 ............
# 乘客的总年龄,除以乘客的人数,得到平均年龄 # 但是有些乘客的年龄为空,如果直接计算,因为有空值存在,最后的结果将是空值 mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"]) print (mean_age) # nan
因此,需要对缺失值进行处理,标记缺失值为True,得到缺失值的布尔列表:age_is_null
使用pd.isnull(dataFrame或series),求nan值
# 判断缺失值:使用pd.isnull(dataFrame或series) age_is_null = pd.isnull(age) print (age_is_null) # 将bool列表作为索引,读取age列的中bool值为True的值 age_null_true = age[age_is_null] print (age_null_true) age_null_count = len(age_null_true) print(age_null_count) # 177 0 False 1 False 2 False 3 False 4 False 5 True 6 False ........... 5 NaN 17 NaN 19 NaN 26 NaN 28 NaN 29 NaN 31 NaN ..........
处理缺失值,取出age列中值不是null的数据,再注均值
good_ages = titanic_survival["Age"][age_is_null == False] print(good_ages[:3]) correct_mean_age = sum(good_ages) / len(good_ages) print ('-----avg age------') print(correct_mean_age) 0 22.0 1 38.0 2 26.0 Name: Age, dtype: float64 -----avg age------ 29.69911764705882
使用pd.mean求均值:pandas中求均值的函数mean,此函数会只处理有值的数据
age_average_mean = age.mean() print(age_average_mean) # 29.69911764705882
定点取数据
# 求第60行,"Age"的值 row_index60_age = titanic_surival.loc[60, "Age"] print(row_index60_age) # 输出结果是:22.0
# 定位83行的age值 row_index_83_age = titanic_survival.loc[83,"Age"] row_index_1000_pclass = titanic_survival.loc[766,"Pclass"] print (row_index_83_age) print (row_index_1000_pclass) 28.0 1
3. 计算3种船仓等级的平均票价
passenger_classes = [1, 2, 3] fares_by_class = {} for this_class in passenger_classes: # 取出各等级船仓的乘客数据 pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class] print(type(pclass_rows)) # <class 'pandas.core.frame.DataFrame'> print(pclass_rows) # 取出以上数据中,乘客的票价 pclass_fares = pclass_rows["Fare"] # 求出以上票价数据的均值 fares_for_class = pclass_fares.mean() fares_by_class[this_class] = fares_for_class print(fares_by_class) # 结果:{1: 84.1546875, 2: 20.662183152173913, 3: 13.675550101832993} PassengerId Survived Pclass ............. 1 2 1 1 3 4 1 1 6 7 0 1
4.数据透视表:pd.pivot_table(index, values, aggfunc)
# 在pandas中,pivot_table可以轻松实现以上功能
# 需要统计的数据以index为基准,需要统计的值是value,aggfunc需要统计value的关系函数,默认是mean(平均值)
# 计算各等级舱的平均船价 passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Fare", aggfunc=np.mean) print (passenger_survival) Fare Pclass 1 84.154687 2 20.662183 3 13.675550
# 计算各等船仓等乘客的平均年龄,aggfunc默认是求均值numpy.mean函数 passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age") print(passenger_age) Age Pclass 1 38.233441 2 29.877630 3 25.140620
# 计算各船舱等级的获救率 passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean) print (passenger_survival) Survived Pclass 1 0.629630 2 0.472826 3 0.242363
# 计算各码头的票价总和、获求总人数 # 分别计算Fare与Survived的和,根据Embarked为基准 import numpy passenger_survival_2 = titanic_survival.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc=numpy.sum) print(passenger_survival_2) Fare Survived Embarked C 10072.2962 93 Q 1022.2543 30 S 17439.3988 217
5.滤除缺失值:dataFrame.dropna()或series.dropna()
# 将age和sec缺失值 的数据丢弃 new_titanic_survival = titanic_survival.dropna(axis=0, subset=["Age", "Sex"]) print (new_titanic_survival)
在排序时,过渡掉缺失值:dataFrame.reset_index(drop=True)或series.reset_index(drop=True)
new_titanic_survival = titanic_survival.sort_values("Age", ascending=False) print(new_titanic_survival[0:10]) # 将以上排序的数据,重新排序index,并过渡掉缺失值:reset_index(drop=True) new = new_titanic_survival.reset_index(drop=True) print(new[0:10])
填充缺失值:fill
6.自定义函数:dataFrame.apply(func_name)
# 自定义函数, 获取第100行数据 def hunderth_row(column): return column.loc[99] row_100 = titanic_train.apply(hunderth_row) print(row_100)
PassengerId 100 Survived 0 Pclass 2 Name Kantor, Mr. Sinai Sex male Age 34 SibSp 1 Parch 0 Ticket 244367 Fare 26 Cabin NaN Embarked S dtype: object
# 计算每列中,空值的数量 def not_null_count(column): column_null = pd.isnull(column) null = column[column_null] return len(null) # 将数据中Pclass列的值替换为相应的文字 column_null_count = titanic_survival.apply(not_null_count) print (column_null_count) PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 Age 177 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 2 dtype: int64
# 将数据中Pclass列的值替换为相应的文字 def which_class(row): pclass = row['Pclass'] if pd.isnull(pclass): return "Unknown" elif pclass == 1: return "First Class" elif pclass == 2: return "Second Class" elif pclass == 3: return "Third Class" classes = titanic_survival.apply(which_class, axis=1) print (classes) 0 Third Class 1 First Class 2 Third Class 3 First Class 4 Third Class 5 Third Class 6 First Class 7 Third Class 8 Third Class 9 Second Class 10 Third Class 11 First Class
# 判断是否成年人,并用文件替换年龄 def is_minor(row): if row["Age"] < 18: return True else: return False minors = titanic_survival.apply(is_minor, axis=1) # print (minors) def generate_age_label(row): age = row["Age"] if pd.isnull(age): return "unknown" elif age < 18: return "minor" else: return "adult" age_labels = titanic_survival.apply(generate_age_label, axis=1) print (age_labels)
0 adult 1 adult 2 adult 3 adult 4 adult 5 unknown 6 adult 7 minor
# 成年人与未成年人的获救率 titanic_survival['age_labels'] = age_labels age_group_survival = titanic_survival.pivot_table(index="age_labels", values="Survived") print (age_group_survival) Survived age_labels adult 0.381032 minor 0.539823 unknown 0.293785
五、
# dataframe由一系列series组成,datafram相当于读进来的矩阵,series相当于行或列,series.values是numpy.ndarray,series由numpy.ndarray组成
数据准备:
import pandas as pd fandango = pd.read_csv('fandango_score_comparison.csv') series_film = fandango['FILM'] #获取电影列表 print(series_film[0:5]) series_rt = fandango['RottenTomatoes'] # 获取评分列表 print (series_rt[0:5]) 0 Avengers: Age of Ultron (2015) 1 Cinderella (2015) 2 Ant-Man (2015) 3 Do You Believe? (2015) 4 Hot Tub Time Machine 2 (2015) Name: FILM, dtype: object 0 74 1 85 2 80 3 18 4 14 Name: RottenTomatoes, dtype: int64
1.自定义series
from pandas import Series print(type(fandango.values)) # dataFrame.values的数据类型为np.ndarray print(type(series_film)) # <class 'pandas.core.series.Series'> film_names = series_film.values print(type(film_names)) # series.values的数据类型为np.ndarray rt_scores = series_rt.values # <class 'numpy.ndarray'> # 自定义series,包括评分和电影两个字段:第一个参数为ndarray,index参数用来指定索引(即标题) series_custom = Series(rt_scores, index=film_names) print(series_custom) print (series_custom['Minions (2015)']) print(series_custom[['Minions (2015)', 'Leviathan (2015)']]) # 根据电影名,查找数据 Avengers: Age of Ultron (2015) 74 Cinderella (2015) 85 Ant-Man (2015) 80 Do You Believe? (2015) 18 Hot Tub Time Machine 2 (2015) 14 ........... 54 Minions (2015) 54.0 Leviathan (2015) NaN dtype: float64
fiveten = series_custom[5:10] print(fiveten) The Water Diviner (2015) 63 Irrational Man (2015) 42 Top Five (2014) 86 Shaun the Sheep Movie (2015) 99 Love & Mercy (2015) 89 dtype: int64
2.对series排序
# 将series转换为列表 original_index = series_custom.index.tolist() print (original_index) sorted_index = sorted(original_index) # 排序 sorted_by_index = series_custom.reindex(sorted_index) print (sorted_by_index) ['Avengers: Age of Ultron (2015)', 'Cinderella (2015)',...................] '71 (2015) 97 5 Flights Up (2015) 52 .................
3. series相加
import numpy print (numpy.add(series_custom, series_custom)) # series相加 print(numpy.max(series_custom)) # 对series求最大值 Avengers: Age of Ultron (2015) 148 Cinderella (2015) 170 .......... 100
# data alignment same index
print (series_custom > 50) series_greater_than_50 = series_custom[series_custom > 50] criteria_one = series_custom > 50 criteria_two = series_custom < 75 both_criteria = series_custom[criteria_one & criteria_two] print (both_criteria) Avengers: Age of Ultron (2015) True Cinderella (2015) True Ant-Man (2015) True ............ Avengers: Age of Ultron (2015) 74 The Water Diviner (2015) 63 ..............
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM']) rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM']) rt_mean = (rt_critics + rt_users)/2 print(rt_mean) FILM Avengers: Age of Ultron (2015) 80.0 Cinderella (2015) 82.5 Ant-Man (2015) 85.0 .............
指定dataframe索引
fandango.set_index('FILM', drop=False)
# 匿名函数
posted on 2018-09-09 19:46 myworldworld 阅读(727) 评论(0) 收藏 举报