Pandas in Action, Chapter 3: Series Methods

Pandas in Action

Chapter 3: Series Methods

  • read_csv(): import a dataset

    import pandas as pd

    pd.read_csv(filepath_or_buffer="./file/chapter_03/pokemon.csv")
    # or, passing the path positionally
    pd.read_csv("./file/chapter_03/pokemon.csv")
    
             Pokemon            Type
    0      Bulbasaur  Grass / Poison
    1        Ivysaur  Grass / Poison
    2       Venusaur  Grass / Poison
    3     Charmander            Fire
    4     Charmeleon            Fire
    ..           ...             ...
    804    Stakataka    Rock / Steel
    805  Blacephalon    Fire / Ghost
    806      Zeraora        Electric
    807       Meltan           Steel
    808     Melmetal           Steel
    [809 rows x 2 columns]
    
  • read_csv(): set the index column

    Pass the column name "Pokemon" to the index_col parameter to use that column as the index.

    pd.read_csv("./file/chapter_03/pokemon.csv", index_col="Pokemon")
    
                           Type
    Pokemon
    Bulbasaur    Grass / Poison
    Ivysaur      Grass / Poison
    Venusaur     Grass / Poison
    Charmander             Fire
    Charmeleon             Fire
                         ...
    Stakataka      Rock / Steel
    Blacephalon    Fire / Ghost
    Zeraora            Electric
    Meltan                Steel
    Melmetal              Steel
    [809 rows x 1 columns]
    
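  • read_csv(): convert the DataFrame to a Series

    Even when the file holds a single column of data, Pandas imports it into a DataFrame by default. To get a Series, call the squeeze() method on the result.

    Before pandas 1.4, passing squeeze=True to read_csv() was enough. That parameter was deprecated in pandas 1.4 and removed in pandas 2.0, so chain .squeeze("columns") after read_csv() instead.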

    pd.read_csv("./file/chapter_03/pokemon.csv", index_col="Pokemon").squeeze("columns")
    
    Pokemon
    Bulbasaur      Grass / Poison
    Ivysaur        Grass / Poison
    Venusaur       Grass / Poison
    Charmander               Fire
    Charmeleon               Fire
                        ...
    Stakataka        Rock / Steel
    Blacephalon      Fire / Ghost
    Zeraora              Electric
    Meltan                  Steel
    Melmetal                Steel
    Name: Type, Length: 809, dtype: object
    

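    The result is a Series whose index labels are the Pokemon names and whose values are the Pokemon types.

    • Pandas named the Series Type, after the column name in the CSV file
    • The Series holds 809 values
    • dtype: object indicates that the Series stores strings.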
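
    The effect of squeeze("columns") can be checked without the book's data files. The sketch below is only an illustration: it reads a made-up three-row CSV from an in-memory string (the rows and the names csv_text, df, and types are assumptions, not part of pokemon.csv).

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for pokemon.csv.
csv_text = (
    "Pokemon,Type\n"
    "Bulbasaur,Grass / Poison\n"
    "Charmander,Fire\n"
    "Squirtle,Water\n"
)

# One data column plus an index column still yields a DataFrame...
df = pd.read_csv(io.StringIO(csv_text), index_col="Pokemon")

# ...until the single column is squeezed into a Series.
types = df.squeeze("columns")

print(type(df).__name__)     # DataFrame
print(type(types).__name__)  # Series
print(types.name)            # Type
```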
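  • read_csv(): parse a column as dates

    When importing data, Pandas infers the most suitable data type for each column, but to keep programs stable it avoids making assumptions about the data. google_stocks.csv contains a Date column in YYYY-MM-DD format; unless Pandas is told explicitly to treat it as dates, the values are imported as strings. The parse_dates parameter of read_csv() names the columns to convert to dates; it accepts a list of column-name strings.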

    pd.read_csv("./file/chapter_03/google_stocks.csv", parse_dates=["Date"]).head()
    
            Date  Close
    0 2004-08-19  49.98
    1 2004-08-20  53.95
    2 2004-08-23  54.50
    3 2004-08-24  52.24
    4 2004-08-25  52.80
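
    To see the difference parse_dates makes, the same text can be read twice and the resulting column types compared. This is a minimal sketch on a made-up two-row CSV (csv_text, as_text, and as_dates are illustrative names, not from the book).

```python
import io

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical two-row stand-in for google_stocks.csv.
csv_text = "Date,Close\n2004-08-19,49.98\n2004-08-20,53.95\n"

# Without parse_dates the Date column stays as strings.
as_text = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the Date column becomes datetime values.
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(is_datetime64_any_dtype(as_text["Date"]))   # False
print(is_datetime64_any_dtype(as_dates["Date"]))  # True
```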
    
  • read_csv(): parse a column as dates, set it as the index, and convert to a Series

    pd.read_csv("./file/chapter_03/google_stocks.csv", parse_dates=["Date"], index_col="Date").squeeze("columns")
    
    Date
    2004-08-19      49.98
    2004-08-20      53.95
    2004-08-23      54.50
    2004-08-24      52.24
    2004-08-25      52.80
                   ...
    2019-10-21    1246.15
    2019-10-22    1242.80
    2019-10-23    1259.13
    2019-10-24    1260.99
    2019-10-25    1265.13
    Name: Close, Length: 3824, dtype: float64
    
  • read_csv(): squeeze() has no effect when more than one column remains

    pd.read_csv("./file/chapter_03/revolutionary_war.csv", parse_dates=["Start Date"], index_col="Start Date").squeeze("columns")
    
                                           Battle          State
    Start Date
    1774-09-01                       Powder Alarm  Massachusetts
    1774-12-14  Storming of Fort William and Mary  New Hampshire
    1775-04-19   Battles of Lexington and Concord  Massachusetts
    1775-04-19                    Siege of Boston  Massachusetts
    1775-04-20                 Gunpowder Incident       Virginia
                                           ...            ...
    1782-09-11                Siege of Fort Henry       Virginia
    1782-09-13         Grand Assault on Gibraltar            NaN
    1782-10-18          Action of 18 October 1782            NaN
    1782-12-06          Action of 6 December 1782            NaN
    1783-01-22          Action of 22 January 1783       Virginia
    [232 rows x 2 columns]
    
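  • read_csv(): import only the index column and the value column, then convert to a Series

    The usecols parameter of read_csv() accepts the list of columns Pandas should import. Select Start Date and State, with Start Date as the index and State as the values. Once Start Date becomes the index, the DataFrame has a single column left and can be squeezed into a Series.

    pd.read_csv("./file/chapter_03/revolutionary_war.csv", parse_dates=["Start Date"], index_col="Start Date",
                           usecols=["Start Date", "State"]).squeeze("columns")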
    
    Start Date
    1774-09-01    Massachusetts
    1774-12-14    New Hampshire
    1775-04-19    Massachusetts
    1775-04-19    Massachusetts
    1775-04-20         Virginia
                      ...
    1782-09-11         Virginia
    1782-09-13              NaN
    1782-10-18              NaN
    1782-12-06              NaN
    1783-01-22         Virginia
    Name: State, Length: 232, dtype: object
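
    The contrast between the last two cases can be reproduced on a small scale. The sketch below uses a made-up two-row CSV in place of revolutionary_war.csv (the variable names are illustrative) and leaves date parsing out to keep the focus on squeeze() and usecols.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for revolutionary_war.csv.
csv_text = (
    "Battle,Start Date,State\n"
    "Powder Alarm,9/1/1774,Massachusetts\n"
    "Gunpowder Incident,4/20/1775,Virginia\n"
)

# Two data columns remain after the index is set, so squeeze() changes nothing.
two_columns = pd.read_csv(
    io.StringIO(csv_text), index_col="Start Date"
).squeeze("columns")

# usecols keeps one data column, which squeeze() can turn into a Series.
one_column = pd.read_csv(
    io.StringIO(csv_text), index_col="Start Date", usecols=["Start Date", "State"]
).squeeze("columns")

print(type(two_columns).__name__)  # DataFrame
print(type(one_column).__name__)   # Series
```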
    
  • sort_values(): sort by value

    sort_values() returns a new Series with the values sorted in ascending order.

    google = pd.read_csv("./file/chapter_03/google_stocks.csv", parse_dates=["Date"], index_col="Date").squeeze("columns")
    google.sort_values()
    
    Date
    2004-09-03      49.82
    2004-09-01      49.94
    2004-08-19      49.98
    2004-09-02      50.57
    2004-09-07      50.60
                   ...
    2019-04-23    1264.55
    2019-10-25    1265.13
    2018-07-26    1268.33
    2019-04-26    1272.18
    2019-04-29    1287.58
    Name: Close, Length: 3824, dtype: float64
    

    Strings in a Series are sorted alphabetically.

    pokemon = pd.read_csv("./file/chapter_03/pokemon.csv", index_col="Pokemon").squeeze("columns")
    pokemon.sort_values()
    
    Pokemon
    Illumise                Bug
    Silcoon                 Bug
    Pinsir                  Bug
    Burmy                   Bug
    Wurmple                 Bug
                      ...
    Tirtouga       Water / Rock
    Relicanth      Water / Rock
    Corsola        Water / Rock
    Carracosta     Water / Rock
    Empoleon      Water / Steel
    Name: Type, Length: 809, dtype: object
    

    Pandas sorts uppercase letters before lowercase letters.

    pd.Series(data= ['Adam','adam','Ben']).sort_values()
    
    0    Adam
    2     Ben
    1    adam
    dtype: object
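
    When the uppercase-first ordering is not wanted, one possible workaround (not covered in these notes) is the key parameter of sort_values(), which receives the Series and returns the values to sort by. The sketch below lowercases the strings for comparison only; kind="mergesort" is chosen here so that ties keep their original order.

```python
import pandas as pd

names = pd.Series(["Adam", "adam", "Ben"])

# Default ordering: uppercase letters sort before lowercase ones.
default_order = names.sort_values()

# Compare lowercased copies instead; the stored strings are unchanged.
case_insensitive = names.sort_values(
    key=lambda s: s.str.lower(), kind="mergesort"
)

print(default_order.tolist())     # ['Adam', 'Ben', 'adam']
print(case_insensitive.tolist())  # ['Adam', 'adam', 'Ben']
```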
    
  • sort_values(): pass ascending=False to sort in descending order; the default is True

    google.sort_values(ascending=False)
    
    Date
    2019-04-29    1287.58
    2019-04-26    1272.18
    2018-07-26    1268.33
    2019-10-25    1265.13
    2019-04-23    1264.55
                   ...
    2004-09-07      50.60
    2004-09-02      50.57
    2004-08-19      49.98
    2004-09-01      49.94
    2004-09-03      49.82
    Name: Close, Length: 3824, dtype: float64
    

    For strings, descending order means reverse alphabetical order.

    pokemon.sort_values(ascending=False)
    
    Pokemon
    Empoleon      Water / Steel
    Corsola        Water / Rock
    Relicanth      Water / Rock
    Carracosta     Water / Rock
    Tirtouga       Water / Rock
                      ...
    Kricketune              Bug
    Cascoon                 Bug
    Scatterbug              Bug
    Kricketot               Bug
    Grubbin                 Bug
    Name: Type, Length: 809, dtype: object
    
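  • sort_values(): the na_position parameter controls where records with NaN values appear in the sorted result. It defaults to "last", which places missing values at the end of the sorted Series.

    # battles is the Start Date / State Series built with usecols above
    battles = pd.read_csv("./file/chapter_03/revolutionary_war.csv", parse_dates=["Start Date"], index_col="Start Date",
                          usecols=["Start Date", "State"]).squeeze("columns")
    battles.sort_values(na_position="last")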
    
    Start Date
    1781-09-06    Connecticut
    1779-07-05    Connecticut
    1777-04-27    Connecticut
    1777-09-03       Delaware
    1777-05-17        Florida
                     ...
    1782-08-08            NaN
    1782-08-25            NaN
    1782-09-13            NaN
    1782-10-18            NaN
    1782-12-06            NaN
    Name: State, Length: 232, dtype: object
    

    To show the missing values first, set na_position to "first".

    battles.sort_values(na_position="first")
    
    Start Date
    1775-09-17         NaN
    1775-12-31         NaN
    1776-03-03         NaN
    1776-03-25         NaN
    1776-05-18         NaN
                    ...
    1781-07-06    Virginia
    1781-07-01    Virginia
    1781-06-26    Virginia
    1781-04-25    Virginia
    1783-01-22    Virginia
    Name: State, Length: 232, dtype: object
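
    The two na_position settings are easy to compare on a tiny Series. The values below are made up for illustration and are not taken from revolutionary_war.csv.

```python
import numpy as np
import pandas as pd

states = pd.Series(["Virginia", np.nan, "Delaware", np.nan])

# Default: missing values go to the end of the sorted result.
nan_last = states.sort_values()

# na_position="first" moves them to the front instead.
nan_first = states.sort_values(na_position="first")

print(nan_last.tolist())   # ['Delaware', 'Virginia', nan, nan]
print(nan_first.tolist())  # [nan, nan, 'Delaware', 'Virginia']
```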
    
  • dropna(): returns a Series with all missing values removed. The method looks only at NaN in the Series values, not in the index.

    battles.dropna().sort_values()
    
    Start Date
    1781-09-06    Connecticut
    1779-07-05    Connecticut
    1777-04-27    Connecticut
    1777-09-03       Delaware
    1777-05-17        Florida
                     ...
    1781-07-06       Virginia
    1781-07-01       Virginia
    1781-06-26       Virginia
    1781-04-25       Virginia
    1783-01-22       Virginia
    Name: State, Length: 162, dtype: object
    

    The new Series is shorter than battles: 162 values instead of 232, because the 70 NaN values were dropped.
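
    That dropna() checks the values and not the index can be illustrated with a small made-up Series whose index contains a missing date. The data and names below are assumptions for the example only.

```python
import numpy as np
import pandas as pd

# One missing value (second row) and one missing index label (third row).
states = pd.Series(
    ["Virginia", np.nan, "New Jersey"],
    index=pd.to_datetime(["1781-07-06", "1782-09-13", None]),
)

cleaned = states.dropna()

# The NaN value is gone, but the row labelled NaT is still there.
print(len(states), len(cleaned))    # 3 2
print(cleaned.index.isna().sum())   # 1

# Rows with a missing index label need an explicit filter.
fully_cleaned = cleaned[cleaned.index.notna()]
print(len(fully_cleaned))           # 1
```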

  • sort_index(): sort by index; the ascending parameter defaults to True

    sort_index() sorts a Series by its index; each value moves together with its index label.

    pokemon.sort_index()
    # or
    pokemon.sort_index(ascending=True)
    
    Pokemon
    Abomasnow        Grass / Ice
    Abra                 Psychic
    Absol                   Dark
    Accelgor                 Bug
    Aegislash      Steel / Ghost
                      ...
    Zoroark                 Dark
    Zorua                   Dark
    Zubat        Poison / Flying
    Zweilous       Dark / Dragon
    Zygarde      Dragon / Ground
    Name: Type, Length: 809, dtype: object
    

    With a date index, the sort runs from the earliest date to the latest.

    battles.sort_index()
    
    Start Date
    1774-09-01    Massachusetts
    1774-12-14    New Hampshire
    1775-04-19    Massachusetts
    1775-04-19    Massachusetts
    1775-04-20         Virginia
                      ...
    1783-01-22         Virginia
    NaT              New Jersey
    NaT                Virginia
    NaT                     NaN
    NaT                     NaN
    Name: State, Length: 232, dtype: object
    

    NaT (not a time) marks a missing date value.

  • sort_index(): use the na_position parameter to show NaT first

    battles.sort_index(na_position="first")
    
    Start Date
    NaT              New Jersey
    NaT                Virginia
    NaT                     NaN
    NaT                     NaN
    1774-09-01    Massachusetts
                      ...
    1782-09-11         Virginia
    1782-09-13              NaN
    1782-10-18              NaN
    1782-12-06              NaN
    1783-01-22         Virginia
    Name: State, Length: 232, dtype: object
    
  • sort_index(): sort dates from latest to earliest

    battles.sort_index(ascending=False)
    
    Start Date
    1783-01-22         Virginia
    1782-12-06              NaN
    1782-10-18              NaN
    1782-09-13              NaN
    1782-09-11         Virginia
                      ...
    1774-09-01    Massachusetts
    NaT              New Jersey
    NaT                Virginia
    NaT                     NaN
    NaT                     NaN
    Name: State, Length: 232, dtype: object
    
  • nsmallest(): returns the n smallest values of the Series, sorted in ascending order; n defaults to 5. Not suitable for a Series of strings.

    google.nsmallest()
    
    Date
    2004-09-03    49.82
    2004-09-01    49.94
    2004-08-19    49.98
    2004-09-02    50.57
    2004-09-07    50.60
    Name: Close, dtype: float64
    
  • nlargest(): returns the n largest values of the Series, sorted in descending order; n defaults to 5. Not suitable for a Series of strings.

    google.nlargest()
    
    Date
    2019-04-29    1287.58
    2019-04-26    1272.18
    2018-07-26    1268.33
    2019-10-25    1265.13
    2019-04-23    1264.55
    Name: Close, dtype: float64
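
    For distinct values, nlargest(n) gives the same result as sorting in descending order and taking the first n rows. The sketch below checks this on a few made-up prices (not the real google Series).

```python
import pandas as pd

prices = pd.Series([49.98, 53.95, 54.50, 52.24, 52.80, 50.12])

# Ask for the three largest values directly...
top_three = prices.nlargest(3)

# ...or sort everything and keep the first three rows.
sorted_head = prices.sort_values(ascending=False).head(3)

print(top_three.tolist())            # [54.5, 53.95, 52.8]
print(prices.nsmallest(2).tolist())  # [49.98, 50.12]
print(top_three.equals(sorted_head)) # True
```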
    
  • The inplace parameter: replace the original Series

    battles.sort_values(inplace=True)
    

    With inplace=True, the method modifies the existing object instead of creating a copy.
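
    A detail worth knowing is that a call with inplace=True returns None, so its result should not be assigned back to the variable. A minimal sketch on a made-up Series:

```python
import pandas as pd

numbers = pd.Series([3, 1, 2])

# Without inplace, a sorted copy is returned and the original is untouched.
sorted_copy = numbers.sort_values()
print(numbers.tolist())      # [3, 1, 2]
print(sorted_copy.tolist())  # [1, 2, 3]

# With inplace=True, the original is reordered and None is returned.
result = numbers.sort_values(inplace=True)
print(result)                # None
print(numbers.tolist())      # [1, 2, 3]
```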

  • value_counts(): count the occurrences of each value

    By default the result is sorted by count in descending order.

    pokemon.value_counts()
    
    Type
    Normal                65
    Water                 61
    Grass                 38
    Psychic               35
    Fire                  30
                          ..
    Fire / Psychic         1
    Normal / Ground        1
    Psychic / Fighting     1
    Dark / Ghost           1
    Fire / Ghost           1
    Name: count, Length: 159, dtype: int64
    

    value_counts() returns a new Series whose index labels are the values of the pokemon Series and whose values are their respective counts.

  • nunique(): number of unique values

    pokemon.nunique()
    
    159
    
  • value_counts(): the ascending parameter controls the sort order.
    It defaults to False, i.e. descending order. Set ascending=True to sort the counts in ascending order.

    pokemon.value_counts(ascending=True)
    
    Type
    Fire / Ghost         1
    Fighting / Dark      1
    Fighting / Steel     1
    Normal / Ground      1
    Fire / Psychic       1
                        ..
    Fire                30
    Psychic             35
    Grass               38
    Water               61
    Normal              65
    Name: count, Length: 159, dtype: int64
    
  • value_counts(): the normalize parameter returns the relative frequency of each unique value

    pokemon.value_counts(normalize=True)
    
    Type
    Normal                0.080346
    Water                 0.075402
    Grass                 0.046972
    Psychic               0.043263
    Fire                  0.037083
                            ...
    Fire / Psychic        0.001236
    Normal / Ground       0.001236
    Psychic / Fighting    0.001236
    Dark / Ghost          0.001236
    Fire / Ghost          0.001236
    Name: proportion, Length: 159, dtype: float64
    

    Multiply the values of the Series by 100 to get percentages.

    pokemon.value_counts(normalize=True) * 100
    
    Type
    Normal                8.034611
    Water                 7.540173
    Grass                 4.697157
    Psychic               4.326329
    Fire                  3.708282
                            ...
    Fire / Psychic        0.123609
    Normal / Ground       0.123609
    Psychic / Fighting    0.123609
    Dark / Ghost          0.123609
    Fire / Ghost          0.123609
    Name: proportion, Length: 159, dtype: float64
    
  • round(): set the precision of the percentages

    (pokemon.value_counts(normalize=True) * 100).round(2)
    
    Type
    Normal                8.03
    Water                 7.54
    Grass                 4.70
    Psychic               4.33
    Fire                  3.71
                          ...
    Fire / Psychic        0.12
    Normal / Ground       0.12
    Psychic / Fighting    0.12
    Dark / Ghost          0.12
    Fire / Ghost          0.12
    Name: proportion, Length: 159, dtype: float64
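
    The relationship between counts, relative frequencies, and rounded percentages can be verified on a handful of values. The five-element Series below is made up for the example.

```python
import pandas as pd

types = pd.Series(["Fire", "Water", "Fire", "Grass", "Fire"])

counts = types.value_counts()                # absolute counts
shares = types.value_counts(normalize=True)  # counts divided by the total
percent = (shares * 100).round(1)            # percentages with one decimal

print(counts["Fire"])   # 3
print(percent["Fire"])  # 60.0
```

    The relative frequencies always add up to 1, which is a quick sanity check when normalize=True is used.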
    
  • max(): Series method that returns the largest value

    google.max()
    
    1287.58
    
  • min(): Series method that returns the smallest value

    google.min()
    
    49.82
    
  • value_counts(): the bins parameter groups values into intervals

    buckets = [0, 200, 400, 600, 800, 1000, 1200, 1400]
    google.value_counts(bins=buckets)
    
    (200.0, 400.0]      1568
    (-0.001, 200.0]      595
    (400.0, 600.0]       575
    (1000.0, 1200.0]     406
    (600.0, 800.0]       380
    (800.0, 1000.0]      207
    (1200.0, 1400.0]      93
    Name: count, dtype: int64
    
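    • A parenthesis means the endpoint is excluded from the interval
    • A square bracket means the endpoint is included in the interval
    • A closed interval includes both endpoints, e.g. [5, 10]
    • An open interval excludes both endpoints, e.g. (5, 10)
    • value_counts() with the bins parameter returns half-open intervals, which include one endpoint and exclude the other
    • bins also accepts an integer: Pandas computes the difference between the largest and smallest values in the Series and divides that range into the given number of bins.

    The returned Series is sorted by count in descending order.

    To order the result by interval instead, sort the index in ascending order or pass sort=False.

    google.value_counts(bins=buckets).sort_index()
    # or
    google.value_counts(bins=buckets, sort=False)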
    
    (-0.001, 200.0]      595
    (200.0, 400.0]      1568
    (400.0, 600.0]       575
    (600.0, 800.0]       380
    (800.0, 1000.0]      207
    (1000.0, 1200.0]     406
    (1200.0, 1400.0]      93
    Name: count, dtype: int64
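
    The half-open behaviour is easiest to see with a value that sits exactly on a bin edge. In the sketch below (made-up prices, two bins), 200 is counted in the first interval because the right endpoint is included.

```python
import pandas as pd

prices = pd.Series([50, 150, 200, 250, 350])

# sort=False keeps the intervals in their natural order.
binned = prices.value_counts(bins=[0, 200, 400], sort=False)

first, second = binned.index[0], binned.index[1]

print(binned.tolist())  # [3, 2]
print(200 in first)     # True: the right endpoint belongs to the interval
print(200 in second)    # False: the left endpoint does not
```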
    
  • value_counts() excludes NaN values by default; to count them as well, pass dropna=False

    battles.value_counts(dropna=False)
    
    State
    NaN               70
    South Carolina    31
    New York          28
    New Jersey        24
    Virginia          21
    Massachusetts     11
    Pennsylvania      10
    North Carolina     9
    Florida            8
    Georgia            6
    Rhode Island       3
    Connecticut        3
    Vermont            3
    New Hampshire      1
    Delaware           1
    Indiana            1
    Louisiana          1
    Ohio               1
    Name: count, dtype: int64
    
  • value_counts() also works on the index of a Series

    battles.index.value_counts()
    
    Start Date
    1781-04-25    2
    1781-05-22    2
    1780-08-18    2
    1781-09-13    2
    1782-03-16    2
                 ..
    1778-06-30    1
    1778-07-03    1
    1778-07-27    1
    1778-08-21    1
    1783-01-22    1
    Name: count, Length: 217, dtype: int64
    
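  • apply(): call a function on every Series value

    Functions are first-class objects in Python.

    Anything you can do with a number, you can also do with a function:

    • Store a function in a list
    • Use a function as the value for a dictionary key
    • Pass a function as an argument to another function
    • Return a function from another function

    A function is a sequence of instructions that produces an output; a function call is the actual execution of those instructions.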
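
    The four points above can be sketched in plain Python. All function names here (shout, whisper, apply_to, make_repeater) are invented for the illustration.

```python
def shout(text):
    """Return the text in upper case."""
    return text.upper()


def whisper(text):
    """Return the text in lower case."""
    return text.lower()


# 1. Store functions in a list.
styles = [shout, whisper]

# 2. Use functions as dictionary values.
by_name = {"loud": shout, "quiet": whisper}


# 3. Pass a function as an argument to another function.
def apply_to(func, value):
    return func(value)


# 4. Return a function from another function.
def make_repeater(times):
    def repeat(text):
        return text * times
    return repeat


print([style("Hi") for style in styles])  # ['HI', 'hi']
print(by_name["loud"]("Hi"))              # HI
print(apply_to(whisper, "Hi"))            # hi
print(make_repeater(2)("Hi"))             # HiHi
```

    Note that shout names the function object, while shout("Hi") calls it; apply() expects the former.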

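    The built-in round() function rounds fractional parts above 0.5 up and those below 0.5 down; an exact .5 tie goes to the nearest even integer.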

    google.apply(round)
    
    Date
    2004-08-19      50
    2004-08-20      54
    2004-08-23      54
    2004-08-24      52
    2004-08-25      53
                  ...
    2019-10-21    1246
    2019-10-22    1243
    2019-10-23    1259
    2019-10-24    1261
    2019-10-25    1265
    Name: Close, Length: 3824, dtype: int64
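
    Python's built-in round() resolves exact .5 ties by rounding to the nearest even integer, which can be surprising. The sketch below shows this on plain floats and on a small made-up Series passed through apply().

```python
import pandas as pd

# Exact ties go to the nearest even integer.
print([round(x) for x in (0.5, 1.5, 2.5, 3.5)])  # [0, 2, 2, 4]

prices = pd.Series([49.98, 52.5, 53.5])

# apply() calls round once per value, so the same tie rule applies.
rounded = prices.apply(round)

print(rounded.tolist())  # [50, 52, 54]
```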
    

    Define a function single_or_multi that returns "Multi" when the type contains a / and "Single" otherwise.

    def single_or_multi(pokemon_type):
        if '/' in pokemon_type:
            return "Multi"
        return "Single"
    
    pokemon.apply(single_or_multi)
    
    Pokemon
    Bulbasaur       Multi
    Ivysaur         Multi
    Venusaur        Multi
    Charmander     Single
    Charmeleon     Single
                    ...
    Stakataka       Multi
    Blacephalon     Multi
    Zeraora        Single
    Meltan         Single
    Melmetal       Single
    Name: Type, Length: 809, dtype: object
    

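    Pandas calls single_or_multi once for each Series value.

  • Coding challenge

    Determine which day of the week saw the most battles during the American Revolutionary War.

    The final output should be a Series with the days of the week as index labels and the number of battles on each day as values.

    Raw data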

    pd.read_csv("./file/chapter_03/revolutionary_war.csv")
    
                                    Battle  Start Date          State
    0                         Powder Alarm    9/1/1774  Massachusetts
    1    Storming of Fort William and Mary  12/14/1774  New Hampshire
    2     Battles of Lexington and Concord   4/19/1775  Massachusetts
    3                      Siege of Boston   4/19/1775  Massachusetts
    4                   Gunpowder Incident   4/20/1775       Virginia
    ..                                 ...         ...            ...
    227                Siege of Fort Henry   9/11/1782       Virginia
    228         Grand Assault on Gibraltar   9/13/1782            NaN
    229          Action of 18 October 1782  10/18/1782            NaN
    230          Action of 6 December 1782   12/6/1782            NaN
    231          Action of 22 January 1783   1/22/1783       Virginia
    [232 rows x 3 columns]
    

    Import only the Start Date column and parse it as dates. Since the result has a single column, call squeeze() to convert it to a Series.

    war = pd.read_csv("./file/chapter_03/revolutionary_war.csv", usecols=["Start Date"],
                          parse_dates=["Start Date"]).squeeze("columns")
    
    0     1774-09-01
    1     1774-12-14
    2     1775-04-19
    3     1775-04-19
    4     1775-04-20
             ...
    227   1782-09-11
    228   1782-09-13
    229   1782-10-18
    230   1782-12-06
    231   1783-01-22
    Name: Start Date, Length: 232, dtype: datetime64[ns]
    

    Define a function that converts a date to the name of its weekday

    def day_for_week(date):
        return date.strftime("%A")
    

    Drop the NaT values, use apply() to call day_for_week() on every value of the Series, then count how often each unique value occurs.

    war.dropna().apply(day_for_week).value_counts()
    
    Start Date
    Saturday     39
    Friday       39
    Wednesday    32
    Thursday     31
    Sunday       31
    Tuesday      29
    Monday       27
    Name: count, dtype: int64
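
    As an alternative to apply(), datetime Series also expose the .dt accessor, whose day_name() method returns weekday names directly. This is not the approach used in these notes; the sketch below only compares the two on a few made-up dates.

```python
import pandas as pd

# Hypothetical stand-in for the war Series, including one missing date.
dates = pd.Series(pd.to_datetime(["1774-09-01", "1775-04-19", None, "1775-04-19"]))

# The apply()-based approach: drop NaT first, then format each date.
via_apply = dates.dropna().apply(lambda d: d.strftime("%A")).value_counts()

# The accessor-based approach: NaT becomes NaN, which value_counts() skips.
via_accessor = dates.dt.day_name().value_counts()

print(via_apply.tolist())     # [2, 1]
print(via_accessor.tolist())  # [2, 1]
```

    strftime("%A") follows the active locale, while day_name() returns English names unless a locale is passed, so the labels can differ on non-English systems.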
    
posted @ 2023-12-24 21:03  熠然