数据分析与机器学习二:数据处理pandas

pandas 是基于NumPy 的一种工具,该工具是为了解决数据分析任务而创建的。

pandas 纳入了大量库和一些标准的数据模型,提供了高效地操作大型数据集所需的工具。

pandas提供了大量能使我们快速便捷地处理数据的函数和方法。


一、生成dataFrame:也是矩阵结构的数据

单行或单列的数据类型为series,相当于numpy.ndarray中的一维矩阵; 
dataframe数据类型,相当于numpy.ndarray的二维矩阵;
panel :三维的数组;
三者关系:
dataframe由一系列series组成,
也可以说dataframe是series的容器;当然,panel为datafame的容器。
1.从csv文件,得到dataFrame:读取csv文件,使用read_csv函数
import pandas
food_info = pandas.read_csv("food_info.csv")
print(type(food_info))  # <class 'pandas.core.frame.DataFrame'
print (food_info.dtypes)
NDB_No               int64
Shrt_Desc           object  # 字符串类型数据str,在pandas中叫object
Water_(g)          float64
Energ_Kcal           int64
Protein_(g)        float64
Lipid_Tot_(g)      float64
Ash_(g)            float64

 read_csv的参数:

filepath_or_buffer:可以是stringIO,可以是可以是URL(http, ftp, s3),和文件。如,本地文件读取实例:“://localhost/path/to/table.csv”

sep:指定分隔符,默认为逗号。

delimiter :定界符,备选分隔符(如果指定该参数,则sep参数失效)
header :整数,指定行数用来作为列名,数据开始行数。如果文件中没有列名,则默认为0,否则设置为None
skipinitialspace :是否忽略分隔符号的空白,默认为False
skiprows :需要忽略的行数,如跳过第一行(0开始)
encoding:通常指定为"utf-8"
dtype:如下
# 在pandas中,dtypes说明:
# object -- 对应string values
# int --- 对应int values
# float -- 对应float values
# datetime -- 对应time values
# bool -- 对应boolean values
# print (help(pandas.read_csv))

 其它参数,略....

food_info.head() # 默认取前5条数据,如需自定义所取数,传数字即可
food_info.tail(2)   # 默认取后5条数据,如需自定义所取数,传数字即可

print(food_info.columns) # 获取数据dataFrame的列的标题
# Index(['num', 'title', 'price', 'jing'.........], dtype='object')
print(food_info.shape) # 查看数组结构:(8618, 36)

 

2.使用pd.DataFrame(data),构造dataFrame数据
data = {'state':['Ohino','Ohino','Ohino','Nevada','Nevada'],
        'year':[2000,2001,2002,2001,2002],
        'pop':[1.5,1.7,3.6,2.4,2.9]}
DF = pandas.DataFrame(data)
print(DF)
print(type(DF))   # <class 'pandas.core.frame.DataFrame'>
print(DF.dtypes)  # dtype: object字符类型的数据

 

二、读取dataFrame的数据:
1.使用索引读取数据:
同numpy一样,使用索引读取矩阵的数据;
但是pandas不能直接使用索引读取数据,
使用dataFrame.loc[索引],读取矩阵的行的数据,会将标题也读进来;
使用dataFrame[列名],读取矩阵的列的数据
# 读取行,数据类型为series或dataFrame:
x = food_info.loc[0]  # 取第一行数据
print(x)
print(type(x))  # 取出单行的数据类型为:class 'pandas.core.series.Series'>
y = food_info.loc[[0, 2, 3]] # 取到第0、2、3行的数据
print(y)
print(type(y))  # 取出多行的数据类型为: <class 'pandas.core.frame.DataFrame'>
print(food_info.loc[0:1]) # 取到第1行到第二行的数据

 

NDB_No                         1001
Shrt_Desc          BUTTER WITH SALT
Water_(g)                     15.87
Energ_Kcal                      717
Protein_(g)                    0.85
Lipid_Tot_(g)                 81.11
Ash_(g)                        2.11
Carbohydrt_(g)                 0.06
Fiber_TD_(g)                      0
..........
Name: 0, dtype: object


NDB_No             Shrt_Desc  Water_(g)  Energ_Kcal  Protein_(g).......
0    1001      BUTTER WITH SALT      15.87         717         0.85   ......
2    1003  BUTTER OIL ANHYDROUS       0.24         876         0.28   ....
3    1004           CHEESE BLUE      42.41         353        21.40....

 

# 读取列,读取出来的数据类型为series或dataFrame:
x = food_info['NDB_No']  # 通过列名,取出此列的数据print(x)
print(type(x))   #取出的单列的数据类型为:<class 'pandas.core.series.Series'>
y = food_info[['NDB_No', 'Shrt_Desc', 'Water_(g)']]  # 通过列名,取出此列的数据
print(y)
print(type(y)) #取出多列的数据类型为:<class 'pandas.core.frame.DataFrame'>

 

0        1001
1        1002
2        1003
3        1004
4        1005
5        1006
6        1007
........


      NDB_No                                          Shrt_Desc  Water_(g)
0       1001                                   BUTTER WITH SALT      15.87
1       1002                           BUTTER WHIPPED WITH SALT      15.87
2       1003                               BUTTER OIL ANHYDROUS       0.24
3       1004                                        CHEESE BLUE      42.41

 

2.根据索引查找
# 示例:找到标题以“g”结尾的列数据
# 首先拿到每列标题,并转为list
col_list = food_info.columns.tolist()
# print(col_list)   # ['NDB_No', 'Shrt_Desc', 'Water_(g)',.......
# 然后通过字符串匹配
titles_g = []
for col_title in col_list:
        if col_title.endswith("g"):
                titles_g.append(col_title)
print(titles_g)     # ['Water_(g)', 'Protein_(g)', .....    
data_g = food_info[titles_g]
print(type(data_g.head(3)))   # <class 'pandas.core.frame.DataFrame'>
print(data_g.head(3))

   Water_(g)  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  .........
0      15.87         0.85          81.11     2.11            0.06   
1      15.87         0.85          81.11     2.11            0.06   
2       0.24         0.28          99.48     0.00            0.00 

 

 

# 计算:csv表中有一列数据单位是mg,我们需要将该列数据转为g为单位,
# 那么这列数据除以1000即可,然后可以替换原表中数据
data_1000 = food_info["Calcium_(mg)"] / 1000
# print(data_1000)
# 还可以再加一列
food_info["Calcium_(g)"] = data_1000
print(food_info.shape)  # 原数据将增加一列,(8618, 37)
# 取一列当中最大值,最小值同理
data_max = food_info["Calcium_(mg)"].max()
print(data_max)  # 7364.0
# 列与列相乘
# total_price = food_info['price'] * food_info['jing']
print(food_info[["Water_(g)", "Energ_Kcal"]][:3])  # 读取多个列时,要多加一个[]
x = food_info["Water_(g)"] * food_info["Energ_Kcal"]  # 乘法,第1个值与第1个值相乘,第2个与第2个.....第n个与第n个
print(type(x))
# 原数据dataFrame增加一列
print(food_info.shape)
food_info['new1'] = x
print(food_info.shape)
print(x)


   Water_(g)  Energ_Kcal
0      15.87         717
1      15.87         717
2       0.24         876
<class 'pandas.core.series.Series'>
(8618, 38)
(8618, 39)
0       11378.79
1       11378.79
2         210.24
3       14970.73
4       15251.81
5       16172.28
...........

 

三、dataFrame或Series的操作

1.求最大值
# 求某列的最大值
food_info['Energ_Kcal'].max()

 

# 求dataFrame各列的最大值
food_info.max()

NDB_No                93600
Shrt_Desc          ZWIEBACK
Water_(g)               100
Energ_Kcal              902
Protein_(g)           88.32
Lipid_Tot_(g)           100
Ash_(g)                99.8
....

 


2.排序
print (food_info[["Sodium_(mg)"]])
food_info.sort_values("Sodium_(mg)", inplace=True)  # inplace为True时,原地修改;为False时,返回新的dateFrame
print (food_info["Sodium_(mg)"])
food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False) # 默认升序排序,ascending为False时为降序,None值始终排在最后;
# print (food_info["Sodium_(mg)"])

      Sodium_(mg)
276       38758.0
5814      27360.0
6192      26050.0
1242      26000.0
......

746     0.0
395     0.0
       ... 
8153    NaN
8155    NaN
.............

 

四、示例:泰坦尼克号获救预测
1.生成dataFrame数据
#从csv文件读取数据,生成dataFrame数据
titanic_train = pandas.read_csv("train.csv")
print(titanic_train.head())
   PassengerId  Survived  Pclass  ..............
0            1         0       3   
1            2         1       1

船员数据

Survived:是否获救 Survive=1(获救) Survive=0(死亡)

Pclass:船舱等级 3等舱:便宜  --> 1等舱:贵

SibSp:兄弟姐妹个数

Parch:parent and child 携带 老人和孩子个数

Fare:船票价格

Embarked:上船地点 S C Q

2. 计算乘客的平均年龄:
age = titanic_survival["Age"]
print(age.loc[0:10])

0     22.0
1     38.0
2     26.0
3     35.0
4     35.0
5      NaN
6     54.0
7      2.0
............
# 乘客的总年龄,除以乘客的人数,得到平均年龄
# 但是有些乘客的年龄为空,如果直接计算,因为有空值存在,最后的结果将是空值
mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"])
print (mean_age)   # nan
因此,需要对缺失值进行处理,标记缺失值为True,得到缺失值的布尔列表:age_is_null
使用pd.isnull(dataFrame或series),求nan值
# 判断缺失值:使用pd.isnull(dataFrame或series)
age_is_null = pd.isnull(age)
print (age_is_null)
# 将bool列表作为索引,读取age列的中bool值为True的值
age_null_true = age[age_is_null]
print (age_null_true)
age_null_count = len(age_null_true)
print(age_null_count)  # 177


0      False
1      False
2      False
3      False
4      False
5       True
6      False
...........

5     NaN
17    NaN
19    NaN
26    NaN
28    NaN
29    NaN
31    NaN
..........
 处理缺失值,取出age列中值不是null的数据,再注均值
good_ages = titanic_survival["Age"][age_is_null == False]
print(good_ages[:3])
correct_mean_age = sum(good_ages) / len(good_ages)
print ('-----avg age------')
print(correct_mean_age)

0    22.0
1    38.0
2    26.0
Name: Age, dtype: float64
-----avg age------
29.69911764705882
 使用pd.mean求均值:pandas中求均值的函数mean,此函数会只处理有值的数据
age_average_mean = age.mean()
print(age_average_mean)   # 29.69911764705882

 

定点取数据
# 求第60行,"Age"的值
row_index60_age = titanic_surival.loc[60, "Age"]
print(row_index60_age)  # 输出结果是:22.0

 

# 定位83行的age值
row_index_83_age = titanic_survival.loc[83,"Age"]
row_index_1000_pclass = titanic_survival.loc[766,"Pclass"]
print (row_index_83_age)
print (row_index_1000_pclass)


28.0
1

 


3. 计算3种船仓等级的平均票价
passenger_classes = [1, 2, 3]
fares_by_class = {}
for this_class in passenger_classes:
        # 取出各等级船仓的乘客数据
        pclass_rows =  titanic_survival[titanic_survival["Pclass"] == this_class]
        print(type(pclass_rows))  # <class 
'pandas.core.frame.DataFrame'>
        print(pclass_rows)
        # 取出以上数据中,乘客的票价
        pclass_fares = pclass_rows["Fare"]
        # 求出以上票价数据的均值
        fares_for_class = pclass_fares.mean()
        fares_by_class[this_class] = fares_for_class
print(fares_by_class)  # 结果:{1: 84.1546875, 2: 20.662183152173913, 3: 13.675550101832993}


     PassengerId  Survived  Pclass  .............
1              2         1       1   
3              4         1       1   
6              7         0       1 

 

 

4.数据透视表:pd.pivot_table(index, values, aggfunc)

# 在pandas中,pivot_table可以轻松实现以上功能
# 需要统计的数据以index为基准,需要统计的值是value,aggfunc需要统计value的关系函数,默认是mean(平均值)
# 计算各等级舱的平均船价
passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Fare", aggfunc=np.mean)
print (passenger_survival)

             Fare
Pclass           
1       84.154687
2       20.662183
3       13.675550

 

# 计算各等船仓等乘客的平均年龄,aggfunc默认是求均值numpy.mean函数
passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")
print(passenger_age)
              Age
Pclass           
1       38.233441
2       29.877630
3       25.140620

 

# 计算各船舱等级的获救率
passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)
print (passenger_survival)

        Survived
Pclass          
1       0.629630
2       0.472826
3       0.242363

 

# 计算各码头的票价总和、获求总人数
# 分别计算Fare与Survived的和,根据Embarked为基准
import numpy
passenger_survival_2 = titanic_survival.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc=numpy.sum)
print(passenger_survival_2)

                Fare  Survived
Embarked                      
C         10072.2962        93
Q          1022.2543        30
S         17439.3988       217

 

5.滤除缺失值:dataFrame.dropna()或series.dropna()

# 将age和sec缺失值 的数据丢弃
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["Age", "Sex"])
print (new_titanic_survival)
在排序时,过渡掉缺失值:dataFrame.reset_index(drop=True)或series.reset_index(drop=True)
new_titanic_survival = titanic_survival.sort_values("Age", ascending=False)
print(new_titanic_survival[0:10])
# 将以上排序的数据,重新排序index,并过渡掉缺失值:reset_index(drop=True)
new = new_titanic_survival.reset_index(drop=True)
print(new[0:10])

 填充缺失值:fill


6.自定义函数:dataFrame.apply(func_name)
# 自定义函数, 获取第100行数据
def hunderth_row(column):
        return column.loc[99]
row_100 = titanic_train.apply(hunderth_row)
print(row_100)
PassengerId                  100
Survived                       0
Pclass                         2
Name           Kantor, Mr. Sinai
Sex                         male
Age                           34
SibSp                          1
Parch                          0
Ticket                    244367
Fare                          26
Cabin                        NaN
Embarked                       S
dtype: object
# 计算每列中,空值的数量
def not_null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)
# 将数据中Pclass列的值替换为相应的文字
column_null_count = titanic_survival.apply(not_null_count)
print (column_null_count)


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
# 将数据中Pclass列的值替换为相应的文字
def which_class(row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    elif pclass == 3:
        return "Third Class"

classes = titanic_survival.apply(which_class, axis=1)
print (classes)


0       Third Class
1       First Class
2       Third Class
3       First Class
4       Third Class
5       Third Class
6       First Class
7       Third Class
8       Third Class
9      Second Class
10      Third Class
11      First Class
# 判断是否成年人,并用文件替换年龄
def is_minor(row):
    if row["Age"] < 18:
        return True
    else:
        return False

minors = titanic_survival.apply(is_minor, axis=1)
# print (minors)
def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = titanic_survival.apply(generate_age_label, axis=1)
print (age_labels)
0        adult
1        adult
2        adult
3        adult
4        adult
5      unknown
6        adult
7        minor
# 成年人与未成年人的获救率
titanic_survival['age_labels'] = age_labels
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="Survived")
print (age_group_survival)

            Survived
age_labels          
adult       0.381032
minor       0.539823
unknown     0.293785

 

五、

# dataframe由一系列series组成,datafram相当于读进来的矩阵,series相当于行或列,series.values是numpy.ndarray,series由numpy.ndarray组成
数据准备:
import pandas as pd
fandango = pd.read_csv('fandango_score_comparison.csv')
series_film = fandango['FILM']  #获取电影列表
print(series_film[0:5])
series_rt = fandango['RottenTomatoes'] # 获取评分列表
print (series_rt[0:5])


0    Avengers: Age of Ultron (2015)
1                 Cinderella (2015)
2                    Ant-Man (2015)
3            Do You Believe? (2015)
4     Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object
0    74
1    85
2    80
3    18
4    14
Name: RottenTomatoes, dtype: int64

1.自定义series
from pandas import Series
print(type(fandango.values))  # dataFrame.values的数据类型为np.ndarray
print(type(series_film))  # <class 'pandas.core.series.Series'>
film_names = series_film.values
print(type(film_names))   # series.values的数据类型为np.ndarray
rt_scores = series_rt.values  # <class 'numpy.ndarray'>
# 自定义series,包括评分和电影两个字段:第一个参数为ndarray,index参数用来指定索引(即标题)
series_custom = Series(rt_scores, index=film_names)
print(series_custom)
print (series_custom['Minions (2015)'])
print(series_custom[['Minions (2015)', 'Leviathan (2015)']])  # 根据电影名,查找数据

Avengers: Age of Ultron (2015)                     74
Cinderella (2015)                                  85
Ant-Man (2015)                                     80
Do You Believe? (2015)                             18
Hot Tub Time Machine 2 (2015)                      14
...........

54

Minions (2015)      54.0
Leviathan (2015)     NaN
dtype: float64
fiveten = series_custom[5:10]
print(fiveten)

The Water Diviner (2015)        63
Irrational Man (2015)           42
Top Five (2014)                 86
Shaun the Sheep Movie (2015)    99
Love & Mercy (2015)             89
dtype: int64

 

 


2.对series排序
# 将series转换为列表
original_index = series_custom.index.tolist()
print (original_index)
sorted_index = sorted(original_index)
# 排序
sorted_by_index = series_custom.reindex(sorted_index)
print (sorted_by_index)

['Avengers: Age of Ultron (2015)', 'Cinderella (2015)',...................]

'71 (2015)                                         97
5 Flights Up (2015)                                52
.................

 

3. series相加
import numpy 
print (numpy.add(series_custom, series_custom)) # series相加
print(numpy.max(series_custom))  # 对series求最大值

Avengers: Age of Ultron (2015)                    148
Cinderella (2015)                                 170
..........

100

 



# data alignment same index
print (series_custom > 50)
series_greater_than_50 = series_custom[series_custom > 50]

criteria_one = series_custom > 50
criteria_two = series_custom < 75
both_criteria = series_custom[criteria_one & criteria_two]
print (both_criteria)


Avengers: Age of Ultron (2015)                     True
Cinderella (2015)                                  True
Ant-Man (2015)                                     True
............


Avengers: Age of Ultron (2015)                                            74
The Water Diviner (2015)                                                  63
..............
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users)/2
print(rt_mean)

FILM
Avengers: Age of Ultron (2015)                    80.0
Cinderella (2015)                                 82.5
Ant-Man (2015)                                    85.0
.............

 



指定dataframe索引
fandango.set_index('FILM', drop=False)

 


# 匿名函数


posted on 2018-09-09 19:46  myworldworld  阅读(727)  评论(0)    收藏  举报

导航