数据分析与机器学习二：数据处理pandas

pandas 是基于NumPy 的一种工具，该工具是为了解决数据分析任务而创建的。

pandas 纳入了大量库和一些标准的数据模型，提供了高效地操作大型数据集所需的工具。

pandas提供了大量能使我们快速便捷地处理数据的函数和方法。

一、生成dataFrame：也是矩阵结构的数据

单行或单列的数据类型为series，相当于numpy.ndarray中的一维矩阵； 
dataframe数据类型，相当于numpy.ndarray的二维矩阵；
panel ：三维的数组；
三者关系：
dataframe由一系列series组成，也可以说dataframe是series的容器；当然，panel为datafame的容器。

1.从csv文件，得到dataFrame：读取csv文件，使用read_csv函数

import pandas
food_info = pandas.read_csv("food_info.csv")
print(type(food_info))  # <class 'pandas.core.frame.DataFrame'
print (food_info.dtypes)
NDB_No               int64
Shrt_Desc           object  # 字符串类型数据str，在pandas中叫object
Water_(g)          float64
Energ_Kcal           int64
Protein_(g)        float64
Lipid_Tot_(g)      float64
Ash_(g)            float64

read_csv的参数：

filepath_or_buffer：可以是stringIO，可以是可以是URL(http, ftp, s3)，和文件。如，本地文件读取实例：“://localhost/path/to/table.csv”

sep：指定分隔符，默认为逗号。

delimiter :定界符，备选分隔符（如果指定该参数，则sep参数失效）

header :整数，指定行数用来作为列名，数据开始行数。如果文件中没有列名，则默认为0，否则设置为None

skipinitialspace :是否忽略分隔符号的空白，默认为False

skiprows :需要忽略的行数，如跳过第一行(0开始)

encoding：通常指定为"utf-8"

dtype：如下

# 在pandas中，dtypes说明：
# object -- 对应string values
# int --- 对应int values
# float -- 对应float values
# datetime -- 对应time values
# bool -- 对应boolean values
# print (help(pandas.read_csv))

其它参数，略....

food_info.head() # 默认取前5条数据，如需自定义所取数，传数字即可
food_info.tail(2)   # 默认取后5条数据，如需自定义所取数，传数字即可

print(food_info.columns) # 获取数据dataFrame的列的标题
# Index(['num', 'title', 'price', 'jing'.........], dtype='object')
print(food_info.shape) # 查看数组结构：(8618, 36)

2.使用pd.DataFrame(data)，构造dataFrame数据

data = {'state':['Ohino','Ohino','Ohino','Nevada','Nevada'],
        'year':[2000,2001,2002,2001,2002],
        'pop':[1.5,1.7,3.6,2.4,2.9]}
DF = pandas.DataFrame(data)
print(DF)
print(type(DF))   # <class 'pandas.core.frame.DataFrame'>
print(DF.dtypes)  # dtype: object字符类型的数据

二、读取dataFrame的数据：
1.使用索引读取数据：
同numpy一样，使用索引读取矩阵的数据；
但是pandas不能直接使用索引读取数据，
使用dataFrame.loc[索引]，读取矩阵的行的数据，会将标题也读进来；
使用dataFrame[列名]，读取矩阵的列的数据

# 读取行，数据类型为series或dataFrame：
x = food_info.loc[0]  # 取第一行数据
print(x)
print(type(x))  # 取出单行的数据类型为：class 'pandas.core.series.Series'>
y = food_info.loc[[0, 2, 3]] # 取到第0、2、3行的数据
print(y)
print(type(y))  # 取出多行的数据类型为: <class 'pandas.core.frame.DataFrame'>
print(food_info.loc[0:1]) # 取到第1行到第二行的数据

NDB_No                         1001
Shrt_Desc          BUTTER WITH SALT
Water_(g)                     15.87
Energ_Kcal                      717
Protein_(g)                    0.85
Lipid_Tot_(g)                 81.11
Ash_(g)                        2.11
Carbohydrt_(g)                 0.06
Fiber_TD_(g)                      0
..........
Name: 0, dtype: object


NDB_No             Shrt_Desc  Water_(g)  Energ_Kcal  Protein_(g).......
0    1001      BUTTER WITH SALT      15.87         717         0.85   ......
2    1003  BUTTER OIL ANHYDROUS       0.24         876         0.28   ....
3    1004           CHEESE BLUE      42.41         353        21.40....

# 读取列，读取出来的数据类型为series或dataFrame：
x = food_info['NDB_No']  # 通过列名，取出此列的数据print(x)
print(type(x))   #取出的单列的数据类型为：<class 'pandas.core.series.Series'>
y = food_info[['NDB_No', 'Shrt_Desc', 'Water_(g)']]  # 通过列名，取出此列的数据
print(y)

print(type(y))  #取出多列的数据类型为：<class 'pandas.core.frame.DataFrame'>

0        1001
1        1002
2        1003
3        1004
4        1005
5        1006
6        1007
........


      NDB_No                                          Shrt_Desc  Water_(g)
0       1001                                   BUTTER WITH SALT      15.87
1       1002                           BUTTER WHIPPED WITH SALT      15.87
2       1003                               BUTTER OIL ANHYDROUS       0.24
3       1004                                        CHEESE BLUE      42.41

2.根据索引查找

# 示例：找到标题以“g”结尾的列数据
# 首先拿到每列标题，并转为list
col_list = food_info.columns.tolist()
# print(col_list)   # ['NDB_No', 'Shrt_Desc', 'Water_(g)',.......
# 然后通过字符串匹配
titles_g = []
for col_title in col_list:
        if col_title.endswith("g"):
                titles_g.append(col_title)
print(titles_g)     # ['Water_(g)', 'Protein_(g)', .....    
data_g = food_info[titles_g]
print(type(data_g.head(3)))   # <class 'pandas.core.frame.DataFrame'>
print(data_g.head(3))

   Water_(g)  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  .........
0      15.87         0.85          81.11     2.11            0.06   
1      15.87         0.85          81.11     2.11            0.06   
2       0.24         0.28          99.48     0.00            0.00

# 计算：csv表中有一列数据单位是mg，我们需要将该列数据转为g为单位，
# 那么这列数据除以1000即可，然后可以替换原表中数据
data_1000 = food_info["Calcium_(mg)"] / 1000
# print(data_1000)
# 还可以再加一列
food_info["Calcium_(g)"] = data_1000
print(food_info.shape)  # 原数据将增加一列，(8618, 37)
# 取一列当中最大值，最小值同理
data_max = food_info["Calcium_(mg)"].max()
print(data_max)  # 7364.0
# 列与列相乘
# total_price = food_info['price'] * food_info['jing']

print(food_info[["Water_(g)", "Energ_Kcal"]][:3])  # 读取多个列时，要多加一个[]
x = food_info["Water_(g)"] * food_info["Energ_Kcal"]  # 乘法，第1个值与第1个值相乘，第2个与第2个.....第n个与第n个
print(type(x))
# 原数据dataFrame增加一列
print(food_info.shape)
food_info['new1'] = x
print(food_info.shape)
print(x)


   Water_(g)  Energ_Kcal
0      15.87         717
1      15.87         717
2       0.24         876
<class 'pandas.core.series.Series'>
(8618, 38)
(8618, 39)
0       11378.79
1       11378.79
2         210.24
3       14970.73
4       15251.81
5       16172.28
...........

三、dataFrame或Series的操作

1.求最大值

# 求某列的最大值
food_info['Energ_Kcal'].max()

# 求dataFrame各列的最大值
food_info.max()

NDB_No                93600
Shrt_Desc          ZWIEBACK
Water_(g)               100
Energ_Kcal              902
Protein_(g)           88.32
Lipid_Tot_(g)           100
Ash_(g)                99.8
....


2.排序

print (food_info[["Sodium_(mg)"]])
food_info.sort_values("Sodium_(mg)", inplace=True)  # inplace为True时，原地修改；为False时，返回新的dateFrame
print (food_info["Sodium_(mg)"])
food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False) # 默认升序排序，ascending为False时为降序，None值始终排在最后；
# print (food_info["Sodium_(mg)"])

      Sodium_(mg)
276       38758.0
5814      27360.0
6192      26050.0
1242      26000.0
......

746     0.0
395     0.0
       ... 
8153    NaN
8155    NaN
.............

四、示例：泰坦尼克号获救预测

1.生成dataFrame数据

#从csv文件读取数据，生成dataFrame数据
titanic_train = pandas.read_csv("train.csv")
print(titanic_train.head())

   PassengerId  Survived  Pclass  ..............
0            1         0       3   
1            2         1       1

船员数据

Survived：是否获救 Survive=1（获救） Survive=0（死亡）

Pclass：船舱等级 3等舱：便宜 --> 1等舱：贵

SibSp：兄弟姐妹个数

Parch：parent and child 携带老人和孩子个数

Fare：船票价格

Embarked：上船地点 S C Q

2. 计算乘客的平均年龄：

age = titanic_survival["Age"]
print(age.loc[0:10])

0     22.0
1     38.0
2     26.0
3     35.0
4     35.0
5      NaN
6     54.0
7      2.0
............

# 乘客的总年龄，除以乘客的人数，得到平均年龄
# 但是有些乘客的年龄为空，如果直接计算，因为有空值存在，最后的结果将是空值
mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"])
print (mean_age)   # nan

因此，需要对缺失值进行处理，标记缺失值为True，得到缺失值的布尔列表：age_is_null
使用pd.isnull(dataFrame或series)，求nan值

# 判断缺失值：使用pd.isnull(dataFrame或series)
age_is_null = pd.isnull(age)
print (age_is_null)
# 将bool列表作为索引，读取age列的中bool值为True的值
age_null_true = age[age_is_null]
print (age_null_true)
age_null_count = len(age_null_true)
print(age_null_count)  # 177


0      False
1      False
2      False
3      False
4      False
5       True
6      False
...........

5     NaN
17    NaN
19    NaN
26    NaN
28    NaN
29    NaN
31    NaN
..........

 处理缺失值，取出age列中值不是null的数据，再注均值

good_ages = titanic_survival["Age"][age_is_null == False]
print(good_ages[:3])
correct_mean_age = sum(good_ages) / len(good_ages)
print ('-----avg age------')
print(correct_mean_age)

0    22.0
1    38.0
2    26.0
Name: Age, dtype: float64
-----avg age------
29.69911764705882

 使用pd.mean求均值：pandas中求均值的函数mean，此函数会只处理有值的数据

age_average_mean = age.mean()
print(age_average_mean)   # 29.69911764705882

定点取数据

# 求第60行，"Age"的值
row_index60_age = titanic_surival.loc[60, "Age"]
print(row_index60_age)  # 输出结果是：22.0

# 定位83行的age值
row_index_83_age = titanic_survival.loc[83,"Age"]
row_index_1000_pclass = titanic_survival.loc[766,"Pclass"]
print (row_index_83_age)
print (row_index_1000_pclass)


28.0
1


3. 计算3种船仓等级的平均票价

passenger_classes = [1, 2, 3]
fares_by_class = {}
for this_class in passenger_classes:
        # 取出各等级船仓的乘客数据
        pclass_rows =  titanic_survival[titanic_survival["Pclass"] == this_class]
        print(type(pclass_rows))  # <class 
'pandas.core.frame.DataFrame'>
        print(pclass_rows)
        # 取出以上数据中，乘客的票价
        pclass_fares = pclass_rows["Fare"]
        # 求出以上票价数据的均值
        fares_for_class = pclass_fares.mean()
        fares_by_class[this_class] = fares_for_class
print(fares_by_class)  # 结果：{1: 84.1546875, 2: 20.662183152173913, 3: 13.675550101832993}


     PassengerId  Survived  Pclass  .............
1              2         1       1   
3              4         1       1   
6              7         0       1

4.数据透视表：pd.pivot_table(index, values, aggfunc)

# 在pandas中，pivot_table可以轻松实现以上功能

# 需要统计的数据以index为基准，需要统计的值是value，aggfunc需要统计value的关系函数，默认是mean（平均值）

# 计算各等级舱的平均船价
passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Fare", aggfunc=np.mean)
print (passenger_survival)

             Fare
Pclass           
1       84.154687
2       20.662183
3       13.675550

# 计算各等船仓等乘客的平均年龄，aggfunc默认是求均值numpy.mean函数
passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")
print(passenger_age)
              Age
Pclass           
1       38.233441
2       29.877630
3       25.140620

# 计算各船舱等级的获救率
passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)
print (passenger_survival)

        Survived
Pclass          
1       0.629630
2       0.472826
3       0.242363

# 计算各码头的票价总和、获求总人数
# 分别计算Fare与Survived的和，根据Embarked为基准
import numpy
passenger_survival_2 = titanic_survival.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc=numpy.sum)
print(passenger_survival_2)

                Fare  Survived
Embarked                      
C         10072.2962        93
Q          1022.2543        30
S         17439.3988       217

5.滤除缺失值：dataFrame.dropna()或series.dropna()

# 将age和sec缺失值 的数据丢弃
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["Age", "Sex"])
print (new_titanic_survival)

在排序时，过渡掉缺失值：dataFrame.reset_index(drop=True)或series.reset_index(drop=True)

new_titanic_survival = titanic_survival.sort_values("Age", ascending=False)
print(new_titanic_survival[0:10])
# 将以上排序的数据，重新排序index，并过渡掉缺失值:reset_index(drop=True)
new = new_titanic_survival.reset_index(drop=True)
print(new[0:10])

填充缺失值：fill


6.自定义函数：dataFrame.apply(func_name)

# 自定义函数, 获取第100行数据
def hunderth_row(column):
        return column.loc[99]
row_100 = titanic_train.apply(hunderth_row)
print(row_100)

PassengerId                  100
Survived                       0
Pclass                         2
Name           Kantor, Mr. Sinai
Sex                         male
Age                           34
SibSp                          1
Parch                          0
Ticket                    244367
Fare                          26
Cabin                        NaN
Embarked                       S
dtype: object

# 计算每列中，空值的数量
def not_null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)
# 将数据中Pclass列的值替换为相应的文字
column_null_count = titanic_survival.apply(not_null_count)
print (column_null_count)


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

# 将数据中Pclass列的值替换为相应的文字
def which_class(row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    elif pclass == 3:
        return "Third Class"

classes = titanic_survival.apply(which_class, axis=1)
print (classes)


0       Third Class
1       First Class
2       Third Class
3       First Class
4       Third Class
5       Third Class
6       First Class
7       Third Class
8       Third Class
9      Second Class
10      Third Class
11      First Class

# 判断是否成年人，并用文件替换年龄
def is_minor(row):
    if row["Age"] < 18:
        return True
    else:
        return False

minors = titanic_survival.apply(is_minor, axis=1)
# print (minors)
def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = titanic_survival.apply(generate_age_label, axis=1)
print (age_labels)

0        adult
1        adult
2        adult
3        adult
4        adult
5      unknown
6        adult
7        minor

# 成年人与未成年人的获救率
titanic_survival['age_labels'] = age_labels
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="Survived")
print (age_group_survival)

            Survived
age_labels          
adult       0.381032
minor       0.539823
unknown     0.293785

五、

# dataframe由一系列series组成，datafram相当于读进来的矩阵，series相当于行或列，series.values是numpy.ndarray，series由numpy.ndarray组成
数据准备：

import pandas as pd
fandango = pd.read_csv('fandango_score_comparison.csv')
series_film = fandango['FILM']  #获取电影列表
print(series_film[0:5])
series_rt = fandango['RottenTomatoes'] # 获取评分列表
print (series_rt[0:5])


0    Avengers: Age of Ultron (2015)
1                 Cinderella (2015)
2                    Ant-Man (2015)
3            Do You Believe? (2015)
4     Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object
0    74
1    85
2    80
3    18
4    14
Name: RottenTomatoes, dtype: int64


1.自定义series

from pandas import Series
print(type(fandango.values))  # dataFrame.values的数据类型为np.ndarray
print(type(series_film))  # <class 'pandas.core.series.Series'>
film_names = series_film.values
print(type(film_names))   # series.values的数据类型为np.ndarray
rt_scores = series_rt.values  # <class 'numpy.ndarray'>
# 自定义series，包括评分和电影两个字段：第一个参数为ndarray，index参数用来指定索引(即标题)
series_custom = Series(rt_scores, index=film_names)
print(series_custom)
print (series_custom['Minions (2015)'])
print(series_custom[['Minions (2015)', 'Leviathan (2015)']])  # 根据电影名，查找数据

Avengers: Age of Ultron (2015)                     74
Cinderella (2015)                                  85
Ant-Man (2015)                                     80
Do You Believe? (2015)                             18
Hot Tub Time Machine 2 (2015)                      14
...........

54

Minions (2015)      54.0
Leviathan (2015)     NaN
dtype: float64

fiveten = series_custom[5:10]
print(fiveten)

The Water Diviner (2015)        63
Irrational Man (2015)           42
Top Five (2014)                 86
Shaun the Sheep Movie (2015)    99
Love & Mercy (2015)             89
dtype: int64


2.对series排序

# 将series转换为列表
original_index = series_custom.index.tolist()
print (original_index)
sorted_index = sorted(original_index)
# 排序
sorted_by_index = series_custom.reindex(sorted_index)
print (sorted_by_index)

['Avengers: Age of Ultron (2015)', 'Cinderella (2015)',...................]

'71 (2015)                                         97
5 Flights Up (2015)                                52
.................

3. series相加

import numpy 
print (numpy.add(series_custom, series_custom)) # series相加
print(numpy.max(series_custom))  # 对series求最大值

Avengers: Age of Ultron (2015)                    148
Cinderella (2015)                                 170
..........

100



# data alignment same index

print (series_custom > 50)
series_greater_than_50 = series_custom[series_custom > 50]

criteria_one = series_custom > 50
criteria_two = series_custom < 75
both_criteria = series_custom[criteria_one & criteria_two]
print (both_criteria)


Avengers: Age of Ultron (2015)                     True
Cinderella (2015)                                  True
Ant-Man (2015)                                     True
............


Avengers: Age of Ultron (2015)                                            74
The Water Diviner (2015)                                                  63
..............

rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users)/2
print(rt_mean)

FILM
Avengers: Age of Ultron (2015)                    80.0
Cinderella (2015)                                 82.5
Ant-Man (2015)                                    85.0
.............



 指定dataframe索引

fandango.set_index('FILM', drop=False)


# 匿名函数

posted on 2018-09-09 19:46 myworldworld 阅读(727) 评论(0) 收藏举报

刷新页面返回顶部

myworldworld

数据分析与机器学习二：数据处理pandas

导航

公告