Pandas in Action, Chapter 3: Series Methods

Pandas in Action

Chapter 3: Series Methods

read_csv(): import a dataset

import pandas as pd

pd.read_csv(filepath_or_buffer="./file/chapter_03/pokemon.csv")
# or, passing the path positionally
pd.read_csv("./file/chapter_03/pokemon.csv")

         Pokemon            Type
0      Bulbasaur  Grass / Poison
1        Ivysaur  Grass / Poison
2       Venusaur  Grass / Poison
3     Charmander            Fire
4     Charmeleon            Fire
..           ...             ...
804    Stakataka    Rock / Steel
805  Blacephalon    Fire / Ghost
806      Zeraora        Electric
807       Meltan           Steel
808     Melmetal           Steel
[809 rows x 2 columns]

read_csv(): set the index column

Use the index_col parameter to choose the index column. Here "Pokemon" is passed to index_col.

pd.read_csv("./file/chapter_03/pokemon.csv", index_col="Pokemon")

                       Type
Pokemon
Bulbasaur    Grass / Poison
Ivysaur      Grass / Poison
Venusaur     Grass / Poison
Charmander             Fire
Charmeleon             Fire
                     ...
Stakataka      Rock / Steel
Blacephalon    Fire / Ghost
Zeraora            Electric
Meltan                Steel
Melmetal              Steel
[809 rows x 1 columns]

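read_csv(): convert the DataFrame to a Series

Even when there is only one column of data, Pandas imports it into a DataFrame by default. To get a Series, call the squeeze() method on the result.

Older pandas versions accepted squeeze=True as a read_csv() argument instead. That parameter was deprecated in pandas 1.4 and removed in 2.0, so call squeeze("columns") on the returned DataFrame.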
```
pd.read_csv("./file/chapter_03/pokemon.csv", index_col="Pokemon").squeeze("columns")
```
```
Pokemon
Bulbasaur      Grass / Poison
Ivysaur        Grass / Poison
Venusaur       Grass / Poison
Charmander               Fire
Charmeleon               Fire
                    ...
Stakataka        Rock / Steel
Blacephalon      Fire / Ghost
Zeraora              Electric
Meltan                  Steel
Melmetal                Steel
Name: Type, Length: 809, dtype: object
```
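The result is a Series whose index labels are the Pokemon names and whose values are the Pokemon types.
- Pandas has named the Series Type, taken from the name of the CSV column
- The Series has 809 values
- dtype: object indicates that the Series holds strings.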
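
The conversion can be tried without the book's data files. The following is a minimal sketch, assuming a hypothetical three-row stand-in for pokemon.csv held in a string (the `csv_text` name and its rows are illustrative only):

```python
import io

import pandas as pd

# Hypothetical stand-in for pokemon.csv, kept in memory for illustration
csv_text = "Pokemon,Type\nBulbasaur,Grass / Poison\nCharmander,Fire\nSquirtle,Water\n"

# With index_col set, one data column remains, but the result is still a DataFrame
df = pd.read_csv(io.StringIO(csv_text), index_col="Pokemon")

# squeeze("columns") collapses the single remaining column into a Series
series = df.squeeze("columns")

print(type(df).__name__)      # DataFrame
print(type(series).__name__)  # Series
print(series.name)            # Type
```

The Series keeps the column name as its name and the Pokemon column as its index.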
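
read_csv(): parse a column as dates

When importing data, Pandas infers the most suitable data type for each column, but to keep programs predictable it avoids making assumptions about the data. google_stocks.csv contains a Date column in YYYY-MM-DD format. Unless Pandas is told explicitly to treat those values as dates, it imports them as strings. The parse_dates parameter of read_csv() names the columns to convert to dates; it accepts a list of column names.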
```
pd.read_csv("./file/chapter_03/google_stocks.csv", parse_dates=["Date"]).head()
```
```
        Date  Close
0 2004-08-19  49.98
1 2004-08-20  53.95
2 2004-08-23  54.50
3 2004-08-24  52.24
4 2004-08-25  52.80
```
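
To see the difference parse_dates makes, here is a small sketch that reads the same text twice. It assumes a hypothetical two-row excerpt in the shape of google_stocks.csv; the variable names are illustrative.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt in the same shape as google_stocks.csv
csv_text = "Date,Close\n2004-08-19,49.98\n2004-08-20,53.95\n"

# Without parse_dates the Date column stays text
as_text = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the Date column becomes a datetime column
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(as_text["Date"].dtype)
print(as_dates["Date"].dtype)
```

Only the second read yields a column that supports date operations such as sorting chronologically or extracting the weekday.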

read_csv(): parse a column as dates, set it as the index, and convert the result to a Series

pd.read_csv("./file/chapter_03/google_stocks.csv", parse_dates=["Date"], index_col="Date").squeeze("columns")

Date
2004-08-19      49.98
2004-08-20      53.95
2004-08-23      54.50
2004-08-24      52.24
2004-08-25      52.80
               ...
2019-10-21    1246.15
2019-10-22    1242.80
2019-10-23    1259.13
2019-10-24    1260.99
2019-10-25    1265.13
Name: Close, Length: 3824, dtype: float64

squeeze() has no effect when the DataFrame has more than one column

pd.read_csv("./file/chapter_03/revolutionary_war.csv", parse_dates=["Start Date"], index_col="Start Date").squeeze("columns")

                                       Battle          State
Start Date
1774-09-01                       Powder Alarm  Massachusetts
1774-12-14  Storming of Fort William and Mary  New Hampshire
1775-04-19   Battles of Lexington and Concord  Massachusetts
1775-04-19                    Siege of Boston  Massachusetts
1775-04-20                 Gunpowder Incident       Virginia
                                       ...            ...
1782-09-11                Siege of Fort Henry       Virginia
1782-09-13         Grand Assault on Gibraltar            NaN
1782-10-18          Action of 18 October 1782            NaN
1782-12-06          Action of 6 December 1782            NaN
1783-01-22          Action of 22 January 1783       Virginia
[232 rows x 2 columns]
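
The same behaviour can be reproduced on a tiny frame. This is a sketch with a hypothetical two-row DataFrame, not the book's dataset:

```python
import pandas as pd

# Hypothetical two-column frame standing in for the battles data
two_cols = pd.DataFrame(
    {
        "Battle": ["Powder Alarm", "Siege of Boston"],
        "State": ["Massachusetts", "Massachusetts"],
    }
)

# Two columns: squeeze("columns") has nothing to collapse and returns a DataFrame
still_frame = two_cols.squeeze("columns")

# One column: squeeze("columns") collapses it into a Series
one_col = two_cols[["State"]]
now_series = one_col.squeeze("columns")

print(type(still_frame).__name__)  # DataFrame
print(type(now_series).__name__)   # Series
```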

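read_csv(): import only the index column and the value column, then convert to a Series

The usecols parameter of read_csv() accepts a list of the columns Pandas should import. Select Start Date and State, with Start Date as the index and State as the values. Once Start Date becomes the index, only one data column is left, so squeeze() can convert the result to a Series.

battles = pd.read_csv("./file/chapter_03/revolutionary_war.csv", parse_dates=["Start Date"], index_col="Start Date",
                      usecols=["Start Date", "State"]).squeeze("columns")
battles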

Start Date
1774-09-01    Massachusetts
1774-12-14    New Hampshire
1775-04-19    Massachusetts
1775-04-19    Massachusetts
1775-04-20         Virginia
                  ...
1782-09-11         Virginia
1782-09-13              NaN
1782-10-18              NaN
1782-12-06              NaN
1783-01-22         Virginia
Name: State, Length: 232, dtype: object

sort_values(): sort by value

sort_values() returns a new Series with the values sorted in ascending order.

google = pd.read_csv("./file/chapter_03/google_stocks.csv", parse_dates=["Date"], index_col="Date").squeeze("columns")
google.sort_values()

Date
2004-09-03      49.82
2004-09-01      49.94
2004-08-19      49.98
2004-09-02      50.57
2004-09-07      50.60
               ...
2019-04-23    1264.55
2019-10-25    1265.13
2018-07-26    1268.33
2019-04-26    1272.18
2019-04-29    1287.58
Name: Close, Length: 3824, dtype: float64

Sorting a Series of strings orders them alphabetically

pokemon = pd.read_csv("./file/chapter_03/pokemon.csv", index_col="Pokemon").squeeze("columns")
pokemon.sort_values()

Pokemon
Illumise                Bug
Silcoon                 Bug
Pinsir                  Bug
Burmy                   Bug
Wurmple                 Bug
                  ...
Tirtouga       Water / Rock
Relicanth      Water / Rock
Corsola        Water / Rock
Carracosta     Water / Rock
Empoleon      Water / Steel
Name: Type, Length: 809, dtype: object

Pandas sorts uppercase letters before lowercase letters

pd.Series(data= ['Adam','adam','Ben']).sort_values()

0    Adam
2     Ben
1    adam
dtype: object
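
The ordering follows character code points: "B" (66) comes before "a" (97). If a case-insensitive order is wanted, one option is the key argument of sort_values(), which is not covered in these notes. A brief sketch:

```python
import pandas as pd

names = pd.Series(["Adam", "adam", "Ben"])

# Default ordering compares code points, so every uppercase letter sorts first
default_order = names.sort_values().tolist()

# key receives the whole Series and returns the values to compare instead;
# lowercasing them makes the comparison ignore case
folded_order = names.sort_values(key=lambda s: s.str.lower()).tolist()

print(ord("B"), ord("a"))  # 66 97
print(default_order)       # ['Adam', 'Ben', 'adam']
print(folded_order)
```

With the key applied, "Ben" sorts after both spellings of "adam".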

Pass ascending=False to sort_values() to sort in descending order; the parameter defaults to True

google.sort_values(ascending=False)

Date
2019-04-29    1287.58
2019-04-26    1272.18
2018-07-26    1268.33
2019-10-25    1265.13
2019-04-23    1264.55
               ...
2004-09-07      50.60
2004-09-02      50.57
2004-08-19      49.98
2004-09-01      49.94
2004-09-03      49.82
Name: Close, Length: 3824, dtype: float64

A descending sort on strings orders the Series in reverse alphabetical order

pokemon.sort_values(ascending=False)

Pokemon
Empoleon      Water / Steel
Corsola        Water / Rock
Relicanth      Water / Rock
Carracosta     Water / Rock
Tirtouga       Water / Rock
                  ...
Kricketune              Bug
Cascoon                 Bug
Scatterbug              Bug
Kricketot               Bug
Grubbin                 Bug
Name: Type, Length: 809, dtype: object

The na_position parameter of sort_values() controls where NaN values are placed in the sorted result. It defaults to "last", which puts missing values at the end of the sorted Series.

battles.sort_values(na_position="last")

Start Date
1781-09-06    Connecticut
1779-07-05    Connecticut
1777-04-27    Connecticut
1777-09-03       Delaware
1777-05-17        Florida
                 ...
1782-08-08            NaN
1782-08-25            NaN
1782-09-13            NaN
1782-10-18            NaN
1782-12-06            NaN
Name: State, Length: 232, dtype: object

To show the missing values first, set na_position to "first"

battles.sort_values(na_position="first")

Start Date
1775-09-17         NaN
1775-12-31         NaN
1776-03-03         NaN
1776-03-25         NaN
1776-05-18         NaN
                ...
1781-07-06    Virginia
1781-07-01    Virginia
1781-06-26    Virginia
1781-04-25    Virginia
1783-01-22    Virginia
Name: State, Length: 232, dtype: object

dropna() returns a Series with all missing values removed. It looks only at NaN among the Series values, not in the index.

battles.dropna().sort_values()

Start Date
1781-09-06    Connecticut
1779-07-05    Connecticut
1777-04-27    Connecticut
1777-09-03       Delaware
1777-05-17        Florida
                 ...
1781-07-06       Virginia
1781-07-01       Virginia
1781-06-26       Virginia
1781-04-25       Virginia
1783-01-22       Virginia
Name: State, Length: 162, dtype: object

The new Series is shorter than the original because Pandas dropped 70 NaN values from battles (232 - 162 = 70).
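
The interplay of na_position and dropna() can be checked on a small example. This sketch uses a hypothetical five-value Series rather than the battles data:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing values
states = pd.Series(["Virginia", np.nan, "Delaware", np.nan, "Florida"])

nan_last = states.sort_values()                      # default: NaN at the end
nan_first = states.sort_values(na_position="first")  # NaN at the start
cleaned = states.dropna()                            # NaN removed entirely

print(nan_last.tolist()[:3])       # ['Delaware', 'Florida', 'Virginia']
print(len(states) - len(cleaned))  # 2
```

The length difference before and after dropna() equals the number of missing values, which is how the count of 70 above can be verified.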

sort_index(): sort by index; the ascending parameter defaults to True

sort_index() sorts a Series by its index. Each value moves together with its index label.

pokemon.sort_index()
# or
pokemon.sort_index(ascending=True)

Pokemon
Abomasnow        Grass / Ice
Abra                 Psychic
Absol                   Dark
Accelgor                 Bug
Aegislash      Steel / Ghost
                  ...
Zoroark                 Dark
Zorua                   Dark
Zubat        Poison / Flying
Zweilous       Dark / Dragon
Zygarde      Dragon / Ground
Name: Type, Length: 809, dtype: object

With a date index, the sort runs from the earliest date to the latest

battles.sort_index()

Start Date
1774-09-01    Massachusetts
1774-12-14    New Hampshire
1775-04-19    Massachusetts
1775-04-19    Massachusetts
1775-04-20         Virginia
                  ...
1783-01-22         Virginia
NaT              New Jersey
NaT                Virginia
NaT                     NaN
NaT                     NaN
Name: State, Length: 232, dtype: object

NaT (not a time) marks a missing date value.

To show NaT first in sort_index(), use the na_position parameter

battles.sort_index(na_position="first")

Start Date
NaT              New Jersey
NaT                Virginia
NaT                     NaN
NaT                     NaN
1774-09-01    Massachusetts
                  ...
1782-09-11         Virginia
1782-09-13              NaN
1782-10-18              NaN
1782-12-06              NaN
1783-01-22         Virginia
Name: State, Length: 232, dtype: object

sort_index(): sort dates from most recent to earliest

battles.sort_index(ascending=False)

Start Date
1783-01-22         Virginia
1782-12-06              NaN
1782-10-18              NaN
1782-09-13              NaN
1782-09-11         Virginia
                  ...
1774-09-01    Massachusetts
NaT              New Jersey
NaT                Virginia
NaT                     NaN
NaT                     NaN
Name: State, Length: 232, dtype: object
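
The three sort_index() variants above can be compared side by side. This is a minimal sketch on a hypothetical three-row Series whose index contains one missing date; the labels "a", "b" and "missing" are placeholders:

```python
import pandas as pd

# Hypothetical date index with one missing entry (None becomes NaT)
index = pd.to_datetime(["1775-04-19", None, "1774-09-01"])
events = pd.Series(["b", "missing", "a"], index=index)

oldest_first = events.sort_index()                   # NaT goes last by default
nat_first = events.sort_index(na_position="first")   # NaT moved to the front
newest_first = events.sort_index(ascending=False)    # NaT still goes last

print(oldest_first.tolist())  # ['a', 'b', 'missing']
print(nat_first.tolist())     # ['missing', 'a', 'b']
print(newest_first.tolist())  # ['b', 'a', 'missing']
```

Note that ascending=False reverses only the dated rows; the position of NaT is governed by na_position alone.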

nsmallest() returns the smallest values in a Series, sorted in ascending order. Its n argument defaults to 5. It does not work on a Series of strings.

google.nsmallest()

Date
2004-09-03    49.82
2004-09-01    49.94
2004-08-19    49.98
2004-09-02    50.57
2004-09-07    50.60
Name: Close, dtype: float64

nlargest() returns the largest values in a Series, sorted in descending order. Its n argument defaults to 5. It does not work on a Series of strings.

google.nlargest()

Date
2019-04-29    1287.58
2019-04-26    1272.18
2018-07-26    1268.33
2019-10-25    1265.13
2019-04-23    1264.55
Name: Close, dtype: float64
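
Both methods give the same rows as a full sort followed by head(). A short sketch on a hypothetical list of prices with no repeated values:

```python
import pandas as pd

# Hypothetical closing prices, all distinct
prices = pd.Series([52.24, 49.98, 54.50, 53.95, 52.80, 49.82, 50.57])

smallest = prices.nsmallest(3)
largest = prices.nlargest(3)

# Equivalent results via a full sort
smallest_via_sort = prices.sort_values().head(3)
largest_via_sort = prices.sort_values(ascending=False).head(3)

print(smallest.tolist())  # [49.82, 49.98, 50.57]
print(largest.tolist())   # [54.5, 53.95, 52.8]
```

nsmallest() and nlargest() express the intent more directly when only a handful of extreme values is needed.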

The inplace parameter: modify the original Series
```
battles.sort_values(inplace=True)
```
With inplace=True, the method modifies the existing object instead of creating a copy.
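
The difference between the two calling styles shows up when the original is inspected afterwards. A minimal sketch with a hypothetical three-value Series:

```python
import pandas as pd

scores = pd.Series([3, 1, 2], index=["c", "a", "b"])

# Default call: a new sorted Series is returned and scores keeps its order
sorted_copy = scores.sort_values()
order_after_copy = scores.tolist()

# inplace=True: scores itself is reordered
scores.sort_values(inplace=True)
order_after_inplace = scores.tolist()

print(order_after_copy)     # [3, 1, 2]
print(order_after_inplace)  # [1, 2, 3]
```

Reassigning the result (`scores = scores.sort_values()`) achieves the same end state and is a common alternative to inplace=True.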

value_counts(): count the occurrences of each value

By default the result is sorted by count in descending order.

pokemon.value_counts()

Type
Normal                65
Water                 61
Grass                 38
Psychic               35
Fire                  30
                      ..
Fire / Psychic         1
Normal / Ground        1
Psychic / Fighting     1
Dark / Ghost           1
Fire / Ghost           1
Name: count, Length: 159, dtype: int64

value_counts() returns a new Series object. Its index labels are the values of the pokemon Series, and its values are their respective counts.

nunique(): the number of unique values
```
pokemon.nunique()
```
```
159
```
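
The length of the value_counts() result matches nunique(), which is why both report 159 above. A small sketch with a hypothetical six-value Series:

```python
import pandas as pd

# Hypothetical Series of types
types = pd.Series(["Fire", "Water", "Fire", "Grass", "Water", "Fire"])

counts = types.value_counts()

print(counts["Fire"])     # 3
print(len(counts))        # 3
print(types.nunique())    # 3
print(int(counts.sum()))  # 6
```

The counts also add up to the number of non-missing values in the original Series.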

The ascending parameter of value_counts() controls the sort order.
It defaults to False, which sorts by count in descending order. To sort in ascending order, set ascending to True.

pokemon.value_counts(ascending=True)

Type
Fire / Ghost         1
Fighting / Dark      1
Fighting / Steel     1
Normal / Ground      1
Fire / Psychic       1
                    ..
Fire                30
Psychic             35
Grass               38
Water               61
Normal              65
Name: count, Length: 159, dtype: int64

The normalize parameter of value_counts() returns the relative frequency of each unique value

pokemon.value_counts(normalize=True)

Type
Normal                0.080346
Water                 0.075402
Grass                 0.046972
Psychic               0.043263
Fire                  0.037083
                        ...
Fire / Psychic        0.001236
Normal / Ground       0.001236
Psychic / Fighting    0.001236
Dark / Ghost          0.001236
Fire / Ghost          0.001236
Name: proportion, Length: 159, dtype: float64

Multiply the values in the Series by 100 to get percentages

pokemon.value_counts(normalize=True) * 100

Type
Normal                8.034611
Water                 7.540173
Grass                 4.697157
Psychic               4.326329
Fire                  3.708282
                        ...
Fire / Psychic        0.123609
Normal / Ground       0.123609
Psychic / Fighting    0.123609
Dark / Ghost          0.123609
Fire / Ghost          0.123609
Name: proportion, Length: 159, dtype: float64

round() sets the precision of the percentages

(pokemon.value_counts(normalize=True) * 100).round(2)

Type
Normal                8.03
Water                 7.54
Grass                 4.70
Psychic               4.33
Fire                  3.71
                      ...
Fire / Psychic        0.12
Normal / Ground       0.12
Psychic / Fighting    0.12
Dark / Ghost          0.12
Fire / Ghost          0.12
Name: proportion, Length: 159, dtype: float64
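
normalize=True is equivalent to dividing each count by the total number of values. The sketch below checks this on a hypothetical four-value Series with no missing values:

```python
import pandas as pd

# Hypothetical Series of types, no missing values
types = pd.Series(["Fire", "Water", "Fire", "Grass"])

frequencies = types.value_counts(normalize=True)

# Same numbers computed by hand: count divided by the number of values
manual = types.value_counts() / len(types)

# Percentages rounded to two decimal places
percentages = (frequencies * 100).round(2)

print(frequencies["Fire"])   # 0.5
print(percentages["Water"])  # 25.0
```

The frequencies sum to 1, so the percentages sum to 100 apart from rounding.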

max(): the largest value in the Series
```
google.max()
```
```
1287.58
```
min(): the smallest value in the Series
```
google.min()
```
```
49.82
```

The bins parameter of value_counts(): group values into intervals

buckets = [0, 200, 400, 600, 800, 1000, 1200, 1400]
google.value_counts(bins=buckets)

(200.0, 400.0]      1568
(-0.001, 200.0]      595
(400.0, 600.0]       575
(1000.0, 1200.0]     406
(600.0, 800.0]       380
(800.0, 1000.0]      207
(1200.0, 1400.0]      93
Name: count, dtype: int64

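A parenthesis means the endpoint is excluded from the interval.
A square bracket means the endpoint is included in the interval.
A closed interval includes both endpoints: [5, 10]
An open interval excludes both endpoints: (5, 10)
value_counts() with the bins parameter returns half-open intervals, which include one endpoint and exclude the other. In the output above the right endpoint is the included one, and the first interval starts at -0.001 so that the value 0 also falls inside it.
bins also accepts an integer. Pandas then computes the difference between the largest and smallest values in the Series and divides that range into the specified number of bins.

The returned Series is sorted by count in descending order.

To list the intervals in ascending order instead, sort by index or pass sort=False.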

google.value_counts(bins=buckets).sort_index()
# or
google.value_counts(bins=buckets, sort=False)

(-0.001, 200.0]      595
(200.0, 400.0]      1568
(400.0, 600.0]       575
(600.0, 800.0]       380
(800.0, 1000.0]      207
(1000.0, 1200.0]     406
(1200.0, 1400.0]      93
Name: count, dtype: int64
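
How values on a bin edge are assigned can be verified directly. This is a sketch with hypothetical prices chosen to sit on and around the edges; the bin list is shortened for illustration:

```python
import pandas as pd

# Hypothetical prices, including values exactly on the edges 200 and 400
prices = pd.Series([50, 200, 201, 399, 400, 500, 650])
buckets = [0, 200, 400, 600, 800]

# sort=False keeps the intervals in bin order rather than ordering by count
by_bucket = prices.value_counts(bins=buckets, sort=False)

first_interval = by_bucket.index[0]
second_interval = by_bucket.index[1]

print(by_bucket.tolist())      # [2, 3, 1, 1]
print(200 in first_interval)   # True: the right edge is included
print(200 in second_interval)  # False: the left edge is excluded

# An integer asks pandas to split the observed range into that many bins
three_bins = prices.value_counts(bins=3, sort=False)
print(len(three_bins))         # 3
```

The edge values 200 and 400 each land in the interval that ends at them, not the one that begins there.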

value_counts() excludes NaN values by default. To count them as well, pass dropna=False.

battles.value_counts(dropna=False)

State
NaN               70
South Carolina    31
New York          28
New Jersey        24
Virginia          21
Massachusetts     11
Pennsylvania      10
North Carolina     9
Florida            8
Georgia            6
Rhode Island       3
Connecticut        3
Vermont            3
New Hampshire      1
Delaware           1
Indiana            1
Louisiana          1
Ohio               1
Name: count, dtype: int64

value_counts() also works on the index of a Series

battles.index.value_counts()

Start Date
1781-04-25    2
1781-05-22    2
1780-08-18    2
1781-09-13    2
1782-03-16    2
             ..
1778-06-30    1
1778-07-03    1
1778-07-27    1
1778-08-21    1
1783-01-22    1
Name: count, Length: 217, dtype: int64

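apply(): call a function on every Series value

Functions are first-class objects in Python.

Anything you can do with a number, you can also do with a function:

Store a function in a list
Use a function as the value for a dictionary key
Pass a function as an argument to another function
Return a function from another function

A function is a sequence of instructions that produces an output; a function call is the actual execution of those instructions.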
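
The four capabilities listed above can be sketched in plain Python. All function names here (`shout`, `whisper`, `apply_style`, `make_repeater`) are hypothetical and exist only for illustration:

```python
def shout(text):
    return text.upper() + "!"


def whisper(text):
    return text.lower() + "..."


# 1. Store functions in a list
styles = [shout, whisper]
from_list = [style("Hi") for style in styles]

# 2. Use functions as dictionary values
by_name = {"loud": shout, "quiet": whisper}
from_dict = by_name["quiet"]("Hi")


# 3. Pass a function as an argument to another function
def apply_style(style, text):
    return style(text)


passed = apply_style(shout, "Hi")


# 4. Return a function from another function
def make_repeater(times):
    def repeat(text):
        return text * times

    return repeat


twice = make_repeater(2)
returned = twice("ab")

print(from_list)  # ['HI!', 'hi...']
print(from_dict)  # hi...
print(passed)     # HI!
print(returned)   # abab
```

Note that `shout` without parentheses refers to the function itself, while `shout("Hi")` calls it. apply() relies on the third capability: it receives the function object and calls it once per value.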

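Python's round() function rounds a value up when its fractional part is above 0.5 and down when it is below 0.5. A fractional part of exactly 0.5 rounds to the nearest even integer.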

google.apply(round)

Date
2004-08-19      50
2004-08-20      54
2004-08-23      54
2004-08-24      52
2004-08-25      53
              ...
2019-10-21    1246
2019-10-22    1243
2019-10-23    1259
2019-10-24    1261
2019-10-25    1265
Name: Close, Length: 3824, dtype: int64
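
The behaviour at exactly 0.5 is easy to overlook: Python's round() sends ties to the nearest even integer. A brief sketch on hypothetical values, with the vectorised Series.round() shown for comparison:

```python
import pandas as pd

# Hypothetical values, three of them exactly halfway between two integers
halves = pd.Series([0.5, 1.5, 2.5, 2.4, 2.6])

# apply() calls Python's built-in round() once per value and yields integers
via_apply = halves.apply(round).tolist()

# Series.round() rounds all values at once and keeps the float dtype
via_method = halves.round().tolist()

print(via_apply)   # [0, 2, 2, 2, 3]
print(via_method)  # [0.0, 2.0, 2.0, 2.0, 3.0]
```

Both approaches resolve ties the same way here; they differ in the resulting dtype.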

Define a function single_or_multi that returns "Multi" if the type contains a / and "Single" otherwise

def single_or_multi(pokemon_type):
    if '/' in pokemon_type:
        return "Multi"
    return "Single"

pokemon.apply(single_or_multi)

Pokemon
Bulbasaur       Multi
Ivysaur         Multi
Venusaur        Multi
Charmander     Single
Charmeleon     Single
                ...
Stakataka       Multi
Blacephalon     Multi
Zeraora        Single
Meltan         Single
Melmetal       Single
Name: Type, Length: 809, dtype: object

Pandas calls single_or_multi once for every value in the Series.
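
For a simple substring test like this one, the string accessor offers an alternative to apply(). The sketch below is not the approach these notes use; it assumes a hypothetical three-row Series and compares both routes:

```python
import pandas as pd

# Hypothetical three-row stand-in for the pokemon Series
types = pd.Series(
    ["Grass / Poison", "Fire", "Rock / Steel"],
    index=["Bulbasaur", "Charmander", "Stakataka"],
)


def single_or_multi(pokemon_type):
    if "/" in pokemon_type:
        return "Multi"
    return "Single"


# Route 1: call the function once per value
via_apply = types.apply(single_or_multi)

# Route 2: test all strings at once, then translate True/False into labels
has_slash = types.str.contains("/", regex=False)
via_str = has_slash.map({True: "Multi", False: "Single"})

print(via_apply.tolist())  # ['Multi', 'Single', 'Multi']
print(via_str.tolist())    # ['Multi', 'Single', 'Multi']
```

apply() remains the more general tool, since it accepts any function, while the accessor covers common string checks without a custom function.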

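Coding challenge

Determine which day of the week saw the most battles during the American Revolutionary War.

The final output should be a Series with the days of the week as index labels and the number of battles on each day as values.

Raw data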

pd.read_csv("./file/chapter_03/revolutionary_war.csv")

                                Battle  Start Date          State
0                         Powder Alarm    9/1/1774  Massachusetts
1    Storming of Fort William and Mary  12/14/1774  New Hampshire
2     Battles of Lexington and Concord   4/19/1775  Massachusetts
3                      Siege of Boston   4/19/1775  Massachusetts
4                   Gunpowder Incident   4/20/1775       Virginia
..                                 ...         ...            ...
227                Siege of Fort Henry   9/11/1782       Virginia
228         Grand Assault on Gibraltar   9/13/1782            NaN
229          Action of 18 October 1782  10/18/1782            NaN
230          Action of 6 December 1782   12/6/1782            NaN
231          Action of 22 January 1783   1/22/1783       Virginia
[232 rows x 3 columns]

Import only the Start Date column and parse it as dates. Because the result has a single column, squeeze() converts it to a Series.

war = pd.read_csv("./file/chapter_03/revolutionary_war.csv", usecols=["Start Date"],
                  parse_dates=["Start Date"]).squeeze("columns")
war

0     1774-09-01
1     1774-12-14
2     1775-04-19
3     1775-04-19
4     1775-04-20
         ...
227   1782-09-11
228   1782-09-13
229   1782-10-18
230   1782-12-06
231   1783-01-22
Name: Start Date, Length: 232, dtype: datetime64[ns]

Define a function that converts a date to its weekday name

def day_for_week(date):
    return date.strftime("%A")

Drop the NaT values, use apply() to call day_for_week() on every Series value, then count how often each weekday occurs.

war.dropna().apply(day_for_week).value_counts()

Start Date
Saturday     39
Friday       39
Wednesday    32
Thursday     31
Sunday       31
Tuesday      29
Monday       27
Name: count, dtype: int64
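
The same tally can also be produced with the datetime accessor instead of apply(). This is a sketch of that alternative on four hypothetical dates, one of them missing; it is not the solution used in these notes:

```python
import pandas as pd

# Hypothetical dates, including one missing value
dates = pd.Series(pd.to_datetime(["1775-04-19", "1775-04-19", "1776-07-04", None]))


def day_for_week(date):
    return date.strftime("%A")


# Route 1: drop NaT, then format each date with a function
via_apply = dates.dropna().apply(day_for_week).value_counts()

# Route 2: dt.day_name() handles the whole Series at once; the missing date
# yields a missing name, which value_counts() leaves out by default
via_accessor = dates.dt.day_name().value_counts()

print(via_apply.to_dict())
print(via_accessor.to_dict())
```

Both routes assume English weekday names, which is the default for `dt.day_name()` and the usual result of `%A` under the default locale.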

Posted 2023-12-24 21:03 by 熠然
