Pandas in Action, Chapter 2: The Series Object

Pandas in Action

Chapter 2: Series

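Series is one of the core data structures in pandas: a one-dimensional labeled array for homogeneous data. A Series can be given an index; if none is set, pandas assigns a default one, a sequence of integers starting at 0.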

Creating a Series object

import pandas as pd
pd.Series()

Series([], dtype: object)

Populate a Series with values. The data passed in can be a list, a dictionary, a tuple, or a NumPy ndarray.

ice_cream_flavors =[
        'Chocolate',
        'Vanilla',
        'Strawberry',
        'Rum Raisin'
]
print(pd.Series(ice_cream_flavors))

0     Chocolate
1       Vanilla
2    Strawberry
3    Rum Raisin
dtype: object

Series constructor parameters

data: the data; optional. index: the index; optional.

pandas.core.series.Series def __init__(self,
             data: Any = None,
             index: Any = None,
             dtype: ExtensionDtype | str | dtype | Type[str] | Type[complex] | Type[bool] | Type[object] | None = None,
             name: Any = None,
             copy: bool | None = None,
             fastpath: bool = False) -> None

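Customizing the Series index

Besides its index position, each Series value can be assigned an index label. Index labels can be of any immutable data type: strings, tuples, datetimes, and so on.

The second parameter of the Series constructor, index, sets the index labels. If it is not provided, pandas defaults to integers starting from 0.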

ice_cream_flavors = [
        'Chocolate',
        'Vanilla',
        'Strawberry',
        'Rum Raisin'
]
days_of_week = ('Monday', 'Wednesday', 'Friday', 'Saturday')
print(pd.Series(ice_cream_flavors, days_of_week))
print(pd.Series(data=ice_cream_flavors, index=days_of_week))

Monday        Chocolate
Wednesday       Vanilla
Friday       Strawberry
Saturday     Rum Raisin
dtype: object

pandas allows duplicate index labels, but it is best to avoid them.
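To see why duplicates are best avoided, the sketch below builds a small Series whose index repeats 'Monday'. The labels and values are made up for illustration and are not from the chapter. Looking up a repeated label returns a Series rather than a single value, which can surprise code that expects a scalar.

```python
import pandas as pd

# Hypothetical data: 'Monday' appears twice in the index.
flavors = pd.Series(
    data=["Chocolate", "Vanilla", "Strawberry"],
    index=["Monday", "Monday", "Friday"],
)

monday = flavors["Monday"]  # repeated label: a Series with two rows
friday = flavors["Friday"]  # unique label: a single string

print(type(monday).__name__, len(monday))
print(friday)
print(flavors.index.is_unique)  # reports whether all labels are distinct
```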

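Keyword arguments can be passed in any order. Positional arguments must be given in the order the constructor expects.

For Booleans, floats, and integers, pandas displays the specific type. For strings and more complex objects, it displays dtype: object.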

bunch_of_bools = [True,False,False]
print(pd.Series(bunch_of_bools))

0     True
1    False
2    False
dtype: bool

stock_prices = [985.32,950.44]
print(pd.Series(stock_prices))

0    985.32
1    950.44
dtype: float64

lucky_number = [4,8,15,16,23,42]
print(pd.Series(lucky_number))

0     4
1     8
2    15
3    16
4    23
5    42
dtype: int64

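pandas does its best to infer a suitable data type for the Series from the values of the data parameter. The constructor's dtype parameter forces the data to be converted to a different type.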

lucky_number = [4,8,15,16,23,42]
print(pd.Series(lucky_number, dtype="float"))

0     4.0
1     8.0
2    15.0
3    16.0
4    23.0
5    42.0
dtype: float64

Creating a Series with missing values
```
import numpy as np

temperatures = [94, 88, np.nan, 91]
print(pd.Series(temperatures))
```
```
0    94.0
1    88.0
2     NaN
3    91.0
dtype: float64
```
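The dtype of the Series is float64. When pandas finds a NaN value, it automatically converts the numbers from integers to floats, which lets it store numeric values and missing values in the same homogeneous Series.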
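A minimal sketch of that coercion, comparing the same readings with and without a gap. The variable names are illustrative, and isna() is used here only to count the missing entries.

```python
import numpy as np
import pandas as pd

without_gap = pd.Series([94, 88, 91])
with_gap = pd.Series([94, 88, np.nan, 91])

print(without_gap.dtype)      # an integer dtype
print(with_gap.dtype)         # float64, because NaN is a float
print(with_gap.isna().sum())  # number of missing values
```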

Creating a Series from a dictionary

calorie_info = {
        'Cereal': 125,
        'Chocolate Bar': 406,
        'Ice Cream Sundae': 342
}
print(pd.Series(calorie_info))

Cereal              125
Chocolate Bar       406
Ice Cream Sundae    342
dtype: int64

Creating a Series from a tuple

tep = ('Red','Green','Blue')
print(pd.Series(tep))

0      Red
1    Green
2     Blue
dtype: object

A Series that stores tuples

rgb_color = [(120,41,26),(196,165,45)]
print(pd.Series(rgb_color))

0     (120, 41, 26)
1    (196, 165, 45)
dtype: object

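Creating a Series from a set

A set is an unordered data structure, declared with a pair of braces {} much like a dictionary. Passing one to the Series constructor raises a TypeError, because a set has neither a concept of order (like a list) nor a concept of association (like a dictionary).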
```
my_set = {'Ricky', 'Bobby'}
print(pd.Series(my_set))
```
```
TypeError: 'set' type is unordered
```
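If you are working with a set, convert it to an ordered data structure such as a list first. The resulting element order is arbitrary.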
```
my_set = {'Ricky', 'Bobby'}
print(pd.Series(list(my_set)))
```
```
0    Ricky
1    Bobby
dtype: object
```
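Because list() on a set yields its elements in arbitrary order, one option when a reproducible order matters is to sort during the conversion. This is a small sketch of that idea, not something the chapter prescribes.

```python
import pandas as pd

my_set = {"Ricky", "Bobby"}

# sorted() returns a list, so the Series gets a deterministic order.
ordered = pd.Series(sorted(my_set))
print(ordered)
```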

Creating a Series from a NumPy array

import numpy as np
random_randint = np.random.randint(0, 101, 10)
print(random_randint)

[65 72 65 64 38  4  6 18 43 50]

print(pd.Series(random_randint))

0    65
1    72
2    65
3    64
4    38
5     4
6     6
7    18
8    43
9    50
dtype: int32

Series attributes
- values
  
  The values of a Series are stored in a NumPy ndarray object
```
calorie_info = {
        'Cereal': 125,
        'Chocolate Bar': 406,
        'Ice Cream Sundae': 342
}
series = pd.Series(calorie_info)
print(series.values)
print(type(series.values))
```
```
[125 406 342]
<class 'numpy.ndarray'>
```
- index
  
  The index of a Series is stored in pandas's built-in Index object
```
calorie_info = {
        'Cereal': 125,
        'Chocolate Bar': 406,
        'Ice Cream Sundae': 342
}
series = pd.Series(calorie_info)
print(series.index)
print(type(series.index))
```
```
Index(['Cereal', 'Chocolate Bar', 'Ice Cream Sundae'], dtype='object')
<class 'pandas.core.indexes.base.Index'>
```
- dtype
  
  Returns the data type of the Series values
```
series.dtype
```
```
int64
```
- size
  
  Returns the number of values in the Series
```
series.size
```
```
3
```
- shape
  
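  Returns a tuple with the dimensions of the pandas data structure. For a one-dimensional Series, the tuple's only value is the size of the Series. The comma after the 3 is Python's standard display for a one-element tuple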
```
series.shape
```
```
(3,)
```
- is_unique
  
  is_unique returns True if all Series values are unique, and False if the Series contains duplicates
```
series.is_unique
```
```
True
```
- is_monotonic_increasing
  
  is_monotonic_increasing returns True if each Series value is greater than or equal to the one before it
```
pd.Series(data=[1, 3, 6]).is_monotonic_increasing
```
```
True
```
- is_monotonic_decreasing
  is_monotonic_decreasing returns True if each element is less than or equal to the one before it
```
pd.Series(data=[6, 3, 1]).is_monotonic_decreasing
```
```
True
```
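The monotonic checks in pandas are non-strict, so repeated values still pass. The sketch below uses made-up values with a tie, and combines the check with is_unique as one possible way to test for a strictly increasing Series.

```python
import pandas as pd

with_ties = pd.Series([1, 3, 3, 6])

# Ties are allowed: 3 followed by 3 does not break monotonicity.
print(with_ties.is_monotonic_increasing)

# One way to require strictly increasing values: also demand uniqueness.
strictly_increasing = with_ties.is_monotonic_increasing and with_ties.is_unique
print(strictly_increasing)

descending = pd.Series([6, 3, 3, 1])
print(descending.is_monotonic_decreasing)
```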
Series methods

Statistical operations

Generate a Series of 100 values covering 0 to 500 (exclusive) in steps of 5
```
values = range(0, 500, 5)
nums = pd.Series(data=values)
```
```
0       0
1       5
2      10
3      15
4      20
     ...
95    475
96    480
97    485
98    490
99    495
Length: 100, dtype: int64
```
- head()
  
  Returns rows from the start of the data set, 5 by default. It accepts an argument: head(3), for example, returns the first 3 rows
```
nums.head()
```
```
0     0
1     5
2    10
3    15
4    20
dtype: int64
```
- tail()
  
  Returns rows from the end of the data set, 5 by default. It accepts an argument: tail(6), for example, returns the last 6 rows
```
nums.tail()
```
```
95    475
96    480
97    485
98    490
99    495
dtype: int64
```
- Create a Series from an ascending list of numbers with an np.nan value inserted in the middle. Because the values include NaN, pandas coerces the integers to floats
```
numbers = pd.Series([1, 2, 3, np.nan, 4, 5])
print(numbers)
```
```
0    1.0
1    2.0
2    3.0
3    NaN
4    4.0
5    5.0
dtype: float64
```
- count()
  
  Counts the number of non-null values
```
numbers.count()
```
```
5
```
- sum()
  
  Adds up the values of the Series, skipping missing values by default
```
numbers.sum()
```
```
15.0
```
- sum(skipna = False)
  
  Passing False to the skipna parameter forces missing values to be included, so the result is NaN
```
numbers.sum(skipna=False)
```
```
nan
```
- sum(min_count = 3)
  
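  The min_count parameter sets the minimum number of valid values. pandas computes the sum only if the Series contains at least that many valid values; otherwise the result is NaN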
```
numbers.sum(min_count=3)
```
```
15.0
```
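A short sketch of both sides of the min_count threshold, using the same numbers Series (which holds five valid values). The thresholds 5 and 6 are chosen here for illustration.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, 3, np.nan, 4, 5])

# Five valid values exist, so a threshold of 5 is met.
enough = numbers.sum(min_count=5)

# A threshold of 6 is not met, so pandas returns NaN instead of a sum.
too_few = numbers.sum(min_count=6)

print(enough)
print(too_few)
```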
- product()
  
  Multiplies all the Series values together
```
numbers.product()
```
```
120.0
```
- product(skipna = False)
  
  With skipna set to False, NaN values are not ignored
```
numbers.product(skipna=False)
```
```
nan
```
- product(min_count = 3)
  
  The min_count parameter sets the minimum number of valid values. pandas computes the product only if the Series contains at least that many valid values
```
numbers.product(min_count=3)
```
```
120.0
```
- cumsum()
  
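  Cumulative sum: returns a new Series holding a running total. Each index position stores the sum of all values up to and including that position. By default, a NaN stays NaN at its own position but is skipped in the running total. A cumulative sum helps determine which values contribute most to the total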
```
numbers.cumsum()
```
```
0     1.0
1     3.0
2     6.0
3     NaN
4    10.0
5    15.0
dtype: float64
```
- cumsum(skipna = False)
  
  With skipna=False, NaN values are not skipped: every position from the first NaN onward becomes NaN
```
numbers.cumsum(skipna=False)
```
```
0    1.0
1    3.0
2    6.0
3    NaN
4    NaN
5    NaN
dtype: float64
```
- pct_change()
  
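  Percent change: returns the percentage difference from one Series value to the next. At each index, pandas takes the difference between the current value and the previous value and divides it by the previous value. pandas can compute the percentage difference only when both positions hold valid values. By default, pct_change handles missing values with a forward-fill strategy: pandas replaces a NaN with the last valid value it encountered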
```
numbers.pct_change()
```
```
0         NaN
1    1.000000
2    0.500000
3    0.000000
4    0.333333
5    0.250000
dtype: float64
```
  Both the default behavior ('pad') and an explicit fill_method='pad' now produce a warning that this parameter is deprecated. Instead, call ffill() or bfill() first to choose how NaN values are replaced
```
# fill_method is deprecated
numbers.pct_change(fill_method='pad')
# or
numbers.pct_change(fill_method='ffill')
# new code
numbers.ffill().pct_change()
```
```
# fill_method is deprecated
numbers.pct_change(fill_method='bfill')
# or
numbers.pct_change(fill_method='backfill')
# new code
numbers.bfill().pct_change()
```
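The percent-change formula can also be checked by hand. The sketch below forward-fills first, then computes (current - previous) / previous with shift(); passing fill_method=None is assumed here as a way to ask pct_change to do no filling of its own. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, 3, np.nan, 4, 5])

# Forward-fill first, so the NaN at position 3 takes the previous value (3.0).
filled = numbers.ffill()

# Percent change by hand: (current - previous) / previous.
previous = filled.shift(1)
manual = (filled - previous) / previous

# fill_method=None: pct_change applies no filling of its own.
builtin = filled.pct_change(fill_method=None)

print(manual)
print(builtin)
```

The first position has no previous value, so both results start with NaN.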
- mean()
  
  Returns the mean of the values in the Series
```
numbers.mean()
```
```
3.0
```
- median()
  
  Returns the middle value of the sorted Series values. Half of the values fall below the median and half fall above it
```
numbers.median()
```
```
3.0
```
- std()
  
  Returns the standard deviation, a measure of how much the data varies (the sample standard deviation by default)
```
numbers.std()
```
```
1.5811388300841898
```
- max()
  
  Retrieves the largest value in the Series
```
numbers.max()
```
```
5.0
```
- min()
  
  Retrieves the smallest value in the Series
```
numbers.min()
```
```
1.0
```
- describe()
  
  Returns a summary of the Series, including the count, mean, standard deviation, minimum, quartiles, and maximum
```
numbers.describe()
```
```
count    5.000000
mean     3.000000
std      1.581139
min      1.000000
25%      2.000000
50%      3.000000
75%      4.000000
max      5.000000
dtype: float64
```
- sample()
  
  Randomly selects values from the Series. The order of the values in the new Series may differ from the original. Because numbers is stored as float64, the sample is a float64 Series whether or not the selection includes NaN
```
numbers.sample(3)
```
```
3    NaN
2    3.0
4    4.0
dtype: float64
```
- unique()
  
  Returns a NumPy ndarray containing the unique values in the Series.
```
authors = pd.Series(['Hemingway', 'Orwell', 'Dostoevsky', 'Fitzgerald', 'Orwell'])
authors.unique()
```
```
['Hemingway' 'Orwell' 'Dostoevsky' 'Fitzgerald']
```
- nunique()
  
  Returns the number of unique values in the Series
```
authors.nunique()
```
```
4
```

Series arithmetic operations

First, create a Series of whole numbers with a missing value (the NaN makes its dtype float64)

s1 = pd.Series(data = [5,np.nan,15],index=['A','B','C'])

A     5.0
B     NaN
C    15.0
dtype: float64

add()

Addition

s1 + 3
# or
s1.add(3)

A     8.0
B     NaN
C    18.0
dtype: float64

sub()

Subtraction

s1 - 5
# or
s1.sub(5)
# or
s1.subtract(5)

A     0.0
B     NaN
C    10.0
dtype: float64

mul()

Multiplication

s1 * 2
# or
s1.mul(2)
# or
s1.multiply(2)

A    10.0
B     NaN
C    30.0
dtype: float64

div()

Division

s1 / 2
# or
s1.div(2)
# or
s1.divide(2)

A    2.5
B    NaN
C    7.5
dtype: float64

floordiv()

Floor division: divides and drops everything after the decimal point in the result

s1 // 4
# or
s1.floordiv(4)

A    1.0
B    NaN
C    3.0
dtype: float64

mod()

Modulo: the remainder after division

s1 % 3
# or
s1.mod(3)

A    2.0
B    NaN
C    0.0
dtype: float64

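Series broadcasting

pandas stores the values of a Series in a NumPy ndarray. With syntax such as s1 + 3 or s1 - 5, pandas delegates the math to NumPy.

The NumPy documentation uses the term broadcasting to describe deriving one array of values from another. The syntax s1 + 3 means applying the same operation (adding 3) to every value in the Series.

When two Series share index labels, pandas aligns them on those labels before operating.

Where an index label appears in only one of the two Series, the result for that label is NaN.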
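A sketch of label alignment between two Series. The second Series, s2, and its labels are hypothetical and chosen so that only 'A' and 'C' overlap. The fill_value argument of add() is shown as one option for treating a label missing from one side as 0.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(data=[5, np.nan, 15], index=["A", "B", "C"])
s2 = pd.Series(data=[4, 8, 12], index=["A", "C", "D"])  # hypothetical

# Values are matched by label, not by position.
# 'B' and 'D' each lack a valid partner, so they come out as NaN.
total = s1 + s2
print(total)

# fill_value substitutes 0 when only one side is missing a value.
# 'B' is missing on both sides (NaN in s1, absent from s2), so it stays NaN.
filled = s1.add(s2, fill_value=0)
print(filled)
```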

Passing a Series to Python's built-in functions

Create a small Series of US cities

cities = pd.Series(
        data=['San Francisco', 'Los Angeles', 'Las Vegas', np.nan]
)

len()
```
len(cities)
```
```
4
```

type()

type(cities)

<class 'pandas.core.series.Series'>

dir()
```
dir(cities)
```

list()

list(cities)

['San Francisco', 'Los Angeles', 'Las Vegas', nan]

dict()

dict(cities)

{0: 'San Francisco', 1: 'Los Angeles', 2: 'Las Vegas', 3: nan}

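in

The in keyword checks the index labels of a Series, not its values. To search the values, use the values attribute.

'Las Vegas' in cities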

False

2 in cities

True

'Las Vegas' in cities.values

True

not in

100 not in cities

True

'Paris' not in cities.values

True

Coding challenge

Suppose you have two data structures

superheroes = [
        'Batman',
        'Superman',
        'Spider-Man',
        'Iron Man',
        'Captain America',
        'Wonder Woman']
strength_levels = (100, 120, 90, 95, 110, 120)

Here are the problems to solve:

Use the superheroes list to populate a new Series object

pd.Series(superheroes)

0             Batman
1           Superman
2         Spider-Man
3           Iron Man
4    Captain America
5       Wonder Woman
dtype: object

Use the strength_levels tuple to populate a new Series object

pd.Series(strength_levels)

0    100
1    120
2     90
3     95
4    110
5    120
dtype: int64

Create a Series with superheroes as the index labels and strength_levels as the values, and assign it to the heroes variable

heroes = pd.Series(data=strength_levels, index=superheroes)

Batman             100
Superman           120
Spider-Man          90
Iron Man            95
Captain America    110
Wonder Woman       120
dtype: int64

Extract the first two rows of the heroes Series

heroes.head(2)

Batman      100
Superman    120
dtype: int64

Extract the last four rows of the heroes Series

heroes.tail(4)

Spider-Man          90
Iron Man            95
Captain America    110
Wonder Woman       120
dtype: int64

Determine the number of unique values in the heroes Series
```
heroes.nunique()
```
```
5
```
Calculate the average strength of the superheroes
```
heroes.mean()
```
```
105.83333333333333
```
Calculate the maximum and minimum strength of the superheroes
```
heroes.max()
heroes.min()
```
```
120
90
```

Double the strength of every superhero

heroes * 2

Batman             200
Superman           240
Spider-Man         180
Iron Man           190
Captain America    220
Wonder Woman       240
dtype: int64

Convert the heroes Series to a Python dictionary

dict(heroes)

{'Batman': 100, 'Superman': 120, 'Spider-Man': 90, 'Iron Man': 95, 'Captain America': 110, 'Wonder Woman': 120}

Posted 2023-12-17 21:31 by 熠然