Python数据分析-数据计算以及数据格式化

1. 简介

　　Pandas提供了大量的数据计算函数，可以实现求和、求均值、求最大值、求最小值、求中位数、求众数、求方差、标准差等。

　　通过数据格式化，可以在图表中显示自己规定格式的数据，如设置小数位数、设置百分比、设置千位分隔符等。

2. 数据计算

2.1. 求和（sum()函数）

　　求和函数df.sum()语法如下：

DataFrame.sum(axis=0, skipna=True, numeric_only=False, min_count=0, **kwargs)

参数说明：

axis：{index (0), columns (1)}

　　Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

skipna：bool, default True

　　Exclude NA/null values when computing the result.

numeric_only：bool, default False

　　Include only float, int, boolean columns. Not implemented for Series.

min_count：int, default 0

　　The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

**kwargs

　　Additional keyword arguments to be passed to the function.

代码示例：

 1 idx = pd.MultiIndex.from_arrays([
 2     ['warm', 'warm', 'cold', 'cold'],
 3     ['dog', 'falcon', 'fish', 'spider']],
 4     names=['blooded', 'animal'])
 5 s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
 6 print(s)
 7 
 8 ### 结果
 9 # blooded  animal
10 # warm     dog       4
11 #          falcon    2
12 # cold     fish      0
13 #          spider    8
14 # Name: legs, dtype: int64
15 
16 re = s.sum()
17 print(re)
18 
19 ### 结果
20 # 14

　　缺省情况下，空或 all-NA Series的和为0

1 pd.Series([], dtype="float64").sum()
2 
3 ### 结果
4 # 0.0

　　空级数的和为NaN，则传递min_count=1

1 pd.Series([], dtype="float64").sum(min_count=1)
2 
3 ### 结果
4 # nan

　　skipna参数

1 pd.Series([np.nan]).sum()
2 ### 结果
3 # 0.0
4 
5 pd.Series([np.nan]).sum(min_count=1)
6 
7 ### 结果
8 # nan

2.2. 求均值（mean()函数）

　　求均值函数df.mean()语法如下：

DataFrame.mean(axis=0, skipna=True, numeric_only=False, **kwargs)

参数说明：

axis：{index (0), columns (1)}

　　Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

　　For DataFrames, specifying

skipna：bool, default True

　　Exclude NA/null values when computing the result.

numeric_only：bool, default False

　　Include only float, int, boolean columns. Not implemented for Series.

**kwargs

　　Additional keyword arguments to be passed to the function.

代码示例：

1 s = pd.Series([1, 2, 3])
2 s.mean()
3 
4 ### 结果
5 # 2.0

　　DataFrame数据求均值

 1 df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]}, index=['A', 'B'])
 2 print(df)
 3 
 4 ### 结果
 5 #    a  b
 6 # A  1  2
 7 # B  2  3
 8 
 9 re = df.mean()
10 print(re)
11 
12 ### 结果
13 # a    1.5
14 # b    2.5
15 # dtype: float64

　　使用axis参数

1 re = df.mean(axis=1)
2 print(re)
3 
4 ### 结果
5 # A    1.5
6 # B    2.5
7 # dtype: float64

　　特殊情况下，使用numeric_only参数

 1 df = pd.DataFrame({'a': [1, 2], 'b': ['T', 'Z']},
 2                   index=['A', 'B'])
 3 
 4 print(df)
 5 re = df.mean(numeric_only=True)
 6 print(re)
 7 
 8 ### 结果
 9 #    a  b
10 # A  1  T
11 # B  2  Z
12 
13 # a    1.5
14 # dtype: float64

2.3. 求最大值（max()函数）

　　求最大值函数df.max()语法如下：

DataFrame.max(axis=0, skipna=True, numeric_only=False, **kwargs)

参数说明：

axis：{index (0), columns (1)}

　　Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

　　For DataFrames, specifying

skipna：bool, default True

　　Exclude NA/null values when computing the result.

numeric_only：bool, default False

　　Include only float, int, boolean columns. Not implemented for Series.

**kwargs

　　Additional keyword arguments to be passed to the function.

代码示例：

 1 idx = pd.MultiIndex.from_arrays([
 2     ['warm', 'warm', 'cold', 'cold'],
 3     ['dog', 'falcon', 'fish', 'spider']],
 4     names=['blooded', 'animal'])
 5 s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
 6 print(s)
 7 
 8 ### 结果
 9 # blooded  animal
10 # warm     dog       4
11 #          falcon    2
12 # cold     fish      0
13 #          spider    8
14 # Name: legs, dtype: int64
15 
16 # 8

2.4. 求最小值（min()函数）

　　求最小值函数df.min()语法如下：

DataFrame.min(axis=0, skipna=True, numeric_only=False, **kwargs)

参数说明：

axis：{index (0), columns (1)}

　　Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

　　For DataFrames, specifying

skipna：bool, default True

　　Exclude NA/null values when computing the result.

numeric_only：bool, default False

　　Include only float, int, boolean columns. Not implemented for Series.

**kwargs

　　Additional keyword arguments to be passed to the function.

代码示例：

 1 idx = pd.MultiIndex.from_arrays([
 2     ['warm', 'warm', 'cold', 'cold'],
 3     ['dog', 'falcon', 'fish', 'spider']],
 4     names=['blooded', 'animal'])
 5 s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
 6 print(s)
 7 
 8 ### 结果
 9 # blooded  animal
10 # warm     dog       4
11 #          falcon    2
12 # cold     fish      0
13 #          spider    8
14 # Name: legs, dtype: int64
15 
16 # 0

2.5. 求中位数（median()函数）

　　求中位数函数df.median()语法如下：

DataFrame.median(axis=0, skipna=True, numeric_only=False, **kwargs)

参数说明：

axis：{index (0), columns (1)}

　　Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

　　For DataFrames, specifying axis=None will apply the aggregation across both axes.

skipna：bool, default True

　　Exclude NA/null values when computing the result.

numeric_only：bool, default False

　　Include only float, int, boolean columns. Not implemented for Series.

**kwargs

　　Additional keyword arguments to be passed to the function.

代码示例：

1 s = pd.Series([1, 2, 3])
2 re = s.median()
3 print(re)
4 
5 ### 结果
6 # 2.0

 1 df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]}, index=['A', 'B'])
 2 print(df)
 3 re = df.median()
 4 print(re)
 5 
 6 ### 结果
 7 #    a  b
 8 # A  1  2
 9 # B  2  3
10 
11 # a    1.5
12 # b    2.5
13 # dtype: float64

　　使用axis参数

1 re = df.median(axis=1)
2 print(re)
3 
4 ### 结果
5 # A    1.5
6 # B    2.5
7 # dtype: float64

　　特殊情况下，使用numeric_only参数

 1 df = pd.DataFrame({'a': [1, 2], 'b': ['T', 'Z']},
 2                   index=['A', 'B'])
 3 print(df)
 4 
 5 re = df.median(numeric_only=True)
 6 print(re)
 7 
 8 ### 结果
 9 #    a  b
10 # A  1  T
11 # B  2  Z
12 
13 # a    1.5
14 # dtype: float64

2.6. 求众数（mode()函数）

　　求众数函数df.mode()语法如下：

DataFrame.mode(axis=0, numeric_only=False, dropna=True)

参数说明：

axis：{0 or ‘index’, 1 or ‘columns’}, default 0

　　The axis to iterate over while searching for the mode:

0 or ‘index’ : get mode of each column
1 or ‘columns’ : get mode of each row.

numeric_only：bool, default False

　　If True, only apply to numeric columns.

dropna：bool, default True

　　Don’t consider counts of NaN/NaT.

代码示例：

 1 df = pd.DataFrame([('bird', 2, 2),
 2                    ('mammal', 4, np.nan),
 3                    ('arthropod', 8, 0),
 4                    ('bird', 2, np.nan)],
 5                   index=('falcon', 'horse', 'spider', 'ostrich'),
 6                   columns=('species', 'legs', 'wings'))
 7 print(df)
 8 
 9 ### 结果
10 #            species  legs  wings
11 # falcon        bird     2    2.0
12 # horse       mammal     4    NaN
13 # spider   arthropod     8    0.0
14 # ostrich       bird     2    NaN

　　默认参数

1 re = df.mode()
2 print(re)
3 
4 ### 结果
5 #   species  legs  wings
6 # 0    bird   2.0    0.0
7 # 1     NaN   NaN    2.0

　　设置dropna=False

1 re = df.mode(dropna=False)
2 print(re)
3 
4 ### 结果
5 #   species  legs  wings
6 # 0    bird     2    NaN

　　设置numeric_only=True

1 re = df.mode(numeric_only=True)
2 print(re)
3 
4 ### 结果
5 #    legs  wings
6 # 0   2.0    0.0
7 # 1   NaN    2.0

　　使用axis 参数

1 re = df.mode(axis='columns', numeric_only=True)
2 print(re)
3 
4 ### 结果
5 #            0    1
6 # falcon   2.0  NaN
7 # horse    4.0  NaN
8 # spider   0.0  8.0
9 # ostrich  2.0  NaN

2.7. 求方差（var()函数）

　　求方差函数df.var()语法如下：

DataFrame.var(axis=0, skipna=True, ddof=1, numeric_only=False, **kwargs)

参数说明：

axis：{index (0), columns (1)}

　　For Series this parameter is unused and defaults to 0.

skipna：bool, default True

　　Exclude NA/null values. If an entire row/column is NA, the result will be NA.

ddof：int, default 1

　　Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

numeric_only：bool, default False

　　Include only float, int, boolean columns. Not implemented for Series.

代码示例：

 1 df = pd.DataFrame({'person_id': [0, 1, 2, 3],
 2                    'age': [21, 25, 62, 43],
 3                    'height': [1.61, 1.87, 1.49, 2.01]}
 4                   ).set_index('person_id')
 5 print(df)
 6 
 7 ### 结果
 8 #            age  height
 9 # person_id             
10 # 0           21    1.61
11 # 1           25    1.87
12 # 2           62    1.49
13 # 3           43    2.01

1 # 默认参数
2 re = df.var()
3 print(re)
4 
5 ### 结果
6 # age       352.916667
7 # height      0.056367
8 # dtype: float64

1 # ddof=0可以设置为N而不是N-1进行规范化
2 re = df.var(ddof=0)
3 print(re)
4 
5 ### 结果
6 # age       264.687500
7 # height      0.042275
8 # dtype: float64

2.8. 标准差（数据标准化std()函数）

　　求标准差函数df.std()语法如下：

DataFrame.std(axis=0, skipna=True, ddof=1, numeric_only=False, **kwargs)

参数说明：

axis：{index (0), columns (1)}

　　For Series this parameter is unused and defaults to 0.

skipna：bool, default True

　　Exclude NA/null values. If an entire row/column is NA, the result will be NA.

ddof：int, default 1

　　Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

numeric_only：bool, default False

　　Include only float, int, boolean columns. Not implemented for Series.

代码示例：

 1 df = pd.DataFrame({'person_id': [0, 1, 2, 3],
 2                    'age': [21, 25, 62, 43],
 3                    'height': [1.61, 1.87, 1.49, 2.01]}
 4                   ).set_index('person_id')
 5 print(df)
 6 
 7 ### 结果
 8 #            age  height
 9 # person_id             
10 # 0           21    1.61
11 # 1           25    1.87
12 # 2           62    1.49
13 # 3           43    2.01

1 # 默认参数
2 re = df.std()
3 print(re)
4 
5 ### 结果
6 age       18.786076
7 height     0.237417
8 dtype: float64

1 # 默认参数
2 re = df.std()
3 print(re)
4 
5 ### 结果
6 age       18.786076
7 height     0.237417
8 dtype: float64

2.9. 求分位数（quantile()函数）

　　求分位数函数df.quantile()语法如下：

DataFrame.quantile(q=0.5, axis=0, numeric_only=False, interpolation='linear', method='single')

参数说明：

q：float or array-like, default 0.5 (50% quantile)

　　Value between 0 <= q <= 1, the quantile(s) to compute.

axis：{0 or ‘index’, 1 or ‘columns’}, default 0

　　Equals 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.

numeric_only：bool, default False

　　Include only float, int or boolean data.

interpolation：{‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}

　　This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:

linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
lower: i.
higher: j.
nearest: i or j whichever is nearest.
midpoint: (i + j) / 2.

method：{‘single’, ‘table’}, default ‘single’

　　Whether to compute quantiles per-column (‘single’) or over all columns (‘table’). When ‘table’, the only allowed interpolation methods are ‘nearest’, ‘lower’, and ‘higher’.

代码示例：

 1 df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]),
 2                   columns=['a', 'b'])
 3 print(df)
 4 
 5 ### 结果
 6    a    b
 7 0  1    1
 8 1  2   10
 9 2  3  100
10 3  4  100

 1 re1 = df.quantile(.1)
 2 print(re1)
 3 re2 = df.quantile([.1, .5])
 4 print(re2)
 5 
 6 ### 结果
 7 # a    1.3
 8 # b    3.7
 9 # Name: 0.1, dtype: float64
10 
11 #        a     b
12 # 0.1  1.3   3.7
13 # 0.5  2.5  55.0

　　指定method= ' table '将计算所有列的分位数

 1 re1 = df.quantile(.1, method="table", interpolation="nearest")
 2 print(re1)
 3 re2 = df.quantile([.1, .5], method="table", interpolation="nearest")
 4 print(re2)
 5 
 6 # ### 结果
 7 # a    1
 8 # b    1
 9 # Name: 0.1, dtype: int32
10 
11 #      a    b
12 # 0.1  1    1
13 # 0.5  3  100

　　指定numeric_only=False也将计算datetime和timedelta数据的分位数

 1 df = pd.DataFrame({'A': [1, 2],
 2                    'B': [pd.Timestamp('2010'),
 3                          pd.Timestamp('2011')],
 4                    'C': [pd.Timedelta('1 days'),
 5                          pd.Timedelta('2 days')]})
 6 print(df)
 7 df.quantile(0.5, numeric_only=False)
 8 
 9 re = df.quantile(0.5, numeric_only=False)
10 print(re)
11 
12 ### 结果
13 #    A          B      C
14 # 0  1 2010-01-01 1 days
15 # 1  2 2011-01-01 2 days
16 
17 # A                    1.5
18 # B    2010-07-02 12:00:00
19 # C        1 days 12:00:00
20 # Name: 0.5, dtype: object

3. 数据格式化

3.1. 设置小数位数

　　常用方法df.round()，语法如下：

DataFrame.round(decimals=0, *args, **kwargs)

参数说明：

decimals：int, dict, Series

　　Number of decimal places to round each column to. If an int is given, round each column to the same number of places. Otherwise dict and Series round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a Series. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored.

*args

　　Additional keywords have no effect but might be accepted for compatibility with numpy.

**kwargs

　　Additional keywords have no effect but might be accepted for compatibility with numpy.

代码示例：

 1 df = pd.DataFrame([(.21, .32), (.01, .67), (.66, .03), (.21, .18)],
 2                   columns=['A', 'B'])
 3 print(df)
 4 # 保留小数点后一位
 5 re = df.round(1)
 6 print(re)
 7 
 8 ### 结果
 9 #       A     B
10 # 0  0.21  0.32
11 # 1  0.01  0.67
12 # 2  0.66  0.03
13 # 3  0.21  0.18
14 
15 #      A    B
16 # 0  0.2  0.3
17 # 1  0.0  0.7
18 # 2  0.7  0.0
19 # 3  0.2  0.2

　　使用字典，可以使用列名作为键，小数位数作为值来指定特定列的位置数

1 re = df.round({'A': 1, 'B': 0})
2 print(re)
3 
4 ### 结果
5 #      A    B
6 # 0  0.2  0.0
7 # 1  0.0  1.0
8 # 2  0.7  0.0
9 # 3  0.2  0.0

　　使用Series，可以使用列名作为索引，小数位数作为值来指定特定列的位置数

 1 decimals = pd.Series([0, 1], index=['A', 'B'])
 2 re = df.round(decimals)
 3 print(re)
 4 
 5 ### 结果
 6 #      A    B
 7 # 0  0.0  0.3
 8 # 1  0.0  0.7
 9 # 2  1.0  0.0
10 # 3  0.0  0.2

3.2. 设置百分比、千分位

　　注意：这里会使用到匿名函数 lambda，相关用法会在之后提供讲解

　　设置百分位和千分位经常使用到函数df.apply()，并结合匿名函数。语法如下：

DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), by_row='compat', engine='python', engine_kwargs=None, **kwargs)

　　Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1). By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.

参数说明：

func：function

　　Function to apply to each column or row.

axis：{0 or ‘index’, 1 or ‘columns’}, default 0

　　Axis along which the function is applied:

0 or ‘index’: apply function to each column.
1 or ‘columns’: apply function to each row.

raw：bool, default False

　　Determines if row or column is passed as a Series or ndarray object:

False : passes each row or column as a Series to the function.
True : the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance.

result_type：{‘expand’, ‘reduce’, ‘broadcast’, None}, default None

　　These only act when axis=1 (columns):

‘expand’ : list-like results will be turned into columns.
‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.
‘broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.

　　The default behaviour (None) depends on the return value of the applied function: list-like results will be returned as a Series of those. However if the apply function returns a Series these are expanded to columns.

args：tuple

　　Positional arguments to pass to func in addition to the array/series.

by_row：False or “compat”, default “compat”

　　Only has an effect when funcis a listlike or dictlike of funcs and the func isn’t a string. If “compat”, will if possible first translate the func into pandas methods (e.g. Series().apply(np.sum)will be translated to Series().sum()). If that doesn’t work, will try call to apply again with by_row=True and if that fails, will call apply again with by_row=False (backward compatible). If False, the funcs will be passed the whole Series at once.

engine：{‘python’, ‘numba’}, default ‘python’

　　Choose between the python (default) engine or the numba engine in apply.

The numba engine will attempt to JIT compile the passed function, which may result in speedups for large DataFrames. It also supports the following engine_kwargs :

nopython (compile the function in nopython mode)
nogil (release the GIL inside the JIT compiled function)
parallel (try to apply the function in parallel over the DataFrame)Note: Due to limitations within numba/how pandas interfaces with numba, you should only use this if raw=True

engine_kwargs：dict

　　Pass keyword arguments to the engine. This is currently only used by the numba engine, see the documentation for the engine argument for more information.

**kwargs

　　Additional keyword arguments to pass as keywords arguments to func.

代码示例：

　　设置百分比——整列保留0位

 1 df = pd.DataFrame(np.random.random([5, 5]), columns=['A1', 'A2', 'A3', 'A4', 'A5'])
 2 df['百分比'] = df['A1'].apply(lambda x: format(x, '.0%'))  
 3 print(df) 
 4 
 5 ### 结果
 6 #          A1        A2        A3        A4        A5  百分比
 7 # 0  0.135617  0.211257  0.772651  0.921478  0.010945  14%
 8 # 1  0.676379  0.993900  0.134181  0.688656  0.554876  68%
 9 # 2  0.113347  0.728911  0.372836  0.378372  0.900831  11%
10 # 3  0.416929  0.519926  0.637486  0.089960  0.770882  42%
11 # 4  0.257964  0.576093  0.692108  0.310876  0.892253  26%
12 
13 df['百分比'] = df['A1'].apply(lambda x: '{:.0%}'.format(x))  
14 print(df) 
15 
16 ### 结果
17 #          A1        A2        A3        A4        A5 百分比
18 # 0  0.346965  0.865431  0.230436  0.930984  0.115680    35%
19 # 1  0.144931  0.560487  0.829222  0.181691  0.853662    14%
20 # 2  0.090430  0.799495  0.005978  0.058735  0.295530     9%
21 # 3  0.940128  0.002937  0.139490  0.966288  0.453443    94%
22 # 4  0.976460  0.155085  0.087421  0.011189  0.071188    98%

　　设置千分位

 1 data = [['零基础学Python', '1月', 49768889], ['零基础学Python', '2月', 11777775],
 2         ['零基础学Python', '3月', 13799990]]
 3 columns = ['图书', '月份', '码洋']
 4 df = pd.DataFrame(data=data, columns=columns)
 5 df['码洋'] = df['码洋'].apply(lambda x: format(int(x), ','))
 6 print(df)
 7 
 8 ### 结果
 9 #            图书  月份          码洋
10 # 0  零基础学Python  1月  49,768,889
11 # 1  零基础学Python  2月  11,777,775
12 # 2  零基础学Python  3月  13,799,990

时间：2024年2月5日

posted @ 2024-02-05 10:57 一路狂奔的乌龟阅读(113) 评论(0) 收藏举报

刷新页面返回顶部

一路狂奔的乌龟

别听世俗的耳语，去看自己喜欢的风景。

Python数据分析-数据计算以及数据格式化

1. 简介

2. 数据计算

2.1. 求和（sum()函数）

2.2. 求均值（mean()函数）

2.3. 求最大值（max()函数）

2.4. 求最小值（min()函数）

2.5. 求中位数（median()函数）

2.6. 求众数（mode()函数）

2.7. 求方差（var()函数）

2.8. 标准差（数据标准化std()函数）

2.9. 求分位数（quantile()函数）

3. 数据格式化

3.1. 设置小数位数

3.2. 设置百分比、千分位

公告