Python数据分析-数据计算以及数据格式化
1. 简介
Pandas提供了大量的数据计算函数,可以实现求和、求均值、求最大值、求最小值、求中位数、求众数、求方差、标准差等。
通过数据格式化,可以在图表中显示自己规定格式的数据,如设置小数位数、设置百分比、设置千位分隔符等。
2. 数据计算
2.1. 求和(sum()函数)
求和函数df.sum()语法如下:
DataFrame.sum(axis=0, skipna=True, numeric_only=False, min_count=0, **kwargs)
参数说明:
- axis:{index (0), columns (1)}
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
- skipna:bool, default True
Exclude NA/null values when computing the result.
- numeric_only:bool, default False
Include only float, int, boolean columns. Not implemented for Series.
- min_count:int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
- **kwargs
Additional keyword arguments to be passed to the function.
代码示例:
1 idx = pd.MultiIndex.from_arrays([ 2 ['warm', 'warm', 'cold', 'cold'], 3 ['dog', 'falcon', 'fish', 'spider']], 4 names=['blooded', 'animal']) 5 s = pd.Series([4, 2, 0, 8], name='legs', index=idx) 6 print(s) 7 8 ### 结果 9 # blooded animal 10 # warm dog 4 11 # falcon 2 12 # cold fish 0 13 # spider 8 14 # Name: legs, dtype: int64 15 16 re = s.sum() 17 print(re) 18 19 ### 结果 20 # 14
缺省情况下,空或 all-NA Series的和为0
1 pd.Series([], dtype="float64").sum() 2 3 ### 结果 4 # 0.0
空级数的和为NaN,则传递min_count=1
1 pd.Series([], dtype="float64").sum(min_count=1) 2 3 ### 结果 4 # nan
skipna参数
1 pd.Series([np.nan]).sum() 2 ### 结果 3 # 0.0 4 5 pd.Series([np.nan]).sum(min_count=1) 6 7 ### 结果 8 # nan
2.2. 求均值(mean()函数)
求均值函数df.mean()语法如下:
DataFrame.mean(axis=0, skipna=True, numeric_only=False, **kwargs)
参数说明:
- axis:{index (0), columns (1)}
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying
- skipna:bool, default True
Exclude NA/null values when computing the result.
- numeric_only:bool, default False
Include only float, int, boolean columns. Not implemented for Series.
- **kwargs
Additional keyword arguments to be passed to the function.
代码示例:
1 s = pd.Series([1, 2, 3]) 2 s.mean() 3 4 ### 结果 5 # 2.0
DataFrame数据求均值
1 df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]}, index=['A', 'B']) 2 print(df) 3 4 ### 结果 5 # a b 6 # A 1 2 7 # B 2 3 8 9 re = df.mean() 10 print(re) 11 12 ### 结果 13 # a 1.5 14 # b 2.5 15 # dtype: float64
使用axis参数
1 re = df.mean(axis=1) 2 print(re) 3 4 ### 结果 5 # A 1.5 6 # B 2.5 7 # dtype: float64
1 df = pd.DataFrame({'a': [1, 2], 'b': ['T', 'Z']}, 2 index=['A', 'B']) 3 4 print(df) 5 re = df.mean(numeric_only=True) 6 print(re) 7 8 ### 结果 9 # a b 10 # A 1 T 11 # B 2 Z 12 13 # a 1.5 14 # dtype: float64
2.3. 求最大值(max()函数)
求最大值函数df.max()语法如下:
DataFrame.max(axis=0, skipna=True, numeric_only=False, **kwargs)
参数说明:
- axis:{index (0), columns (1)}
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying
- skipna:bool, default True
Exclude NA/null values when computing the result.
- numeric_only:bool, default False
Include only float, int, boolean columns. Not implemented for Series.
- **kwargs
Additional keyword arguments to be passed to the function.
代码示例:
1 idx = pd.MultiIndex.from_arrays([ 2 ['warm', 'warm', 'cold', 'cold'], 3 ['dog', 'falcon', 'fish', 'spider']], 4 names=['blooded', 'animal']) 5 s = pd.Series([4, 2, 0, 8], name='legs', index=idx) 6 print(s) 7 8 ### 结果 9 # blooded animal 10 # warm dog 4 11 # falcon 2 12 # cold fish 0 13 # spider 8 14 # Name: legs, dtype: int64 15 16 # 8
2.4. 求最小值(min()函数)
求最小值函数df.min()语法如下:
DataFrame.min(axis=0, skipna=True, numeric_only=False, **kwargs)
参数说明:
- axis:{index (0), columns (1)}
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying
- skipna:bool, default True
Exclude NA/null values when computing the result.
- numeric_only:bool, default False
Include only float, int, boolean columns. Not implemented for Series.
- **kwargs
Additional keyword arguments to be passed to the function.
代码示例:
1 idx = pd.MultiIndex.from_arrays([ 2 ['warm', 'warm', 'cold', 'cold'], 3 ['dog', 'falcon', 'fish', 'spider']], 4 names=['blooded', 'animal']) 5 s = pd.Series([4, 2, 0, 8], name='legs', index=idx) 6 print(s) 7 8 ### 结果 9 # blooded animal 10 # warm dog 4 11 # falcon 2 12 # cold fish 0 13 # spider 8 14 # Name: legs, dtype: int64 15 16 # 0
2.5. 求中位数(median()函数)
求中位数函数df.median()语法如下:
DataFrame.median(axis=0, skipna=True, numeric_only=False, **kwargs)
参数说明:
- axis:{index (0), columns (1)}
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying axis=None will apply the aggregation across both axes.
- skipna:bool, default True
Exclude NA/null values when computing the result.
- numeric_only:bool, default False
Include only float, int, boolean columns. Not implemented for Series.
- **kwargs
Additional keyword arguments to be passed to the function.
代码示例:
1 s = pd.Series([1, 2, 3]) 2 re = s.median() 3 print(re) 4 5 ### 结果 6 # 2.0
1 df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]}, index=['A', 'B']) 2 print(df) 3 re = df.median() 4 print(re) 5 6 ### 结果 7 # a b 8 # A 1 2 9 # B 2 3 10 11 # a 1.5 12 # b 2.5 13 # dtype: float64
使用axis参数
1 re = df.median(axis=1) 2 print(re) 3 4 ### 结果 5 # A 1.5 6 # B 2.5 7 # dtype: float64
特殊情况下,使用numeric_only参数
1 df = pd.DataFrame({'a': [1, 2], 'b': ['T', 'Z']}, 2 index=['A', 'B']) 3 print(df) 4 5 re = df.median(numeric_only=True) 6 print(re) 7 8 ### 结果 9 # a b 10 # A 1 T 11 # B 2 Z 12 13 # a 1.5 14 # dtype: float64
2.6. 求众数(mode()函数)
求众数函数df.mode()语法如下:
DataFrame.mode(axis=0, numeric_only=False, dropna=True)
参数说明:
- axis:{0 or ‘index’, 1 or ‘columns’}, default 0
The axis to iterate over while searching for the mode:
- 0 or ‘index’ : get mode of each column
- 1 or ‘columns’ : get mode of each row.
- numeric_only:bool, default False
If True, only apply to numeric columns.
- dropna:bool, default True
Don’t consider counts of NaN/NaT.
代码示例:
1 df = pd.DataFrame([('bird', 2, 2), 2 ('mammal', 4, np.nan), 3 ('arthropod', 8, 0), 4 ('bird', 2, np.nan)], 5 index=('falcon', 'horse', 'spider', 'ostrich'), 6 columns=('species', 'legs', 'wings')) 7 print(df) 8 9 ### 结果 10 # species legs wings 11 # falcon bird 2 2.0 12 # horse mammal 4 NaN 13 # spider arthropod 8 0.0 14 # ostrich bird 2 NaN
默认参数
1 re = df.mode() 2 print(re) 3 4 ### 结果 5 # species legs wings 6 # 0 bird 2.0 0.0 7 # 1 NaN NaN 2.0
设置dropna=False
1 re = df.mode(dropna=False) 2 print(re) 3 4 ### 结果 5 # species legs wings 6 # 0 bird 2 NaN
设置numeric_only=True
1 re = df.mode(numeric_only=True) 2 print(re) 3 4 ### 结果 5 # legs wings 6 # 0 2.0 0.0 7 # 1 NaN 2.0
使用axis 参数
1 re = df.mode(axis='columns', numeric_only=True) 2 print(re) 3 4 ### 结果 5 # 0 1 6 # falcon 2.0 NaN 7 # horse 4.0 NaN 8 # spider 0.0 8.0 9 # ostrich 2.0 NaN
2.7. 求方差(var()函数)
求方差函数df.var()语法如下:
DataFrame.var(axis=0, skipna=True, ddof=1, numeric_only=False, **kwargs)
参数说明:
- axis:{index (0), columns (1)}
For Series this parameter is unused and defaults to 0.
- skipna:bool, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- ddof:int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
- numeric_only:bool, default False
Include only float, int, boolean columns. Not implemented for Series.
代码示例:
1 df = pd.DataFrame({'person_id': [0, 1, 2, 3], 2 'age': [21, 25, 62, 43], 3 'height': [1.61, 1.87, 1.49, 2.01]} 4 ).set_index('person_id') 5 print(df) 6 7 ### 结果 8 # age height 9 # person_id 10 # 0 21 1.61 11 # 1 25 1.87 12 # 2 62 1.49 13 # 3 43 2.01
1 # 默认参数 2 re = df.var() 3 print(re) 4 5 ### 结果 6 # age 352.916667 7 # height 0.056367 8 # dtype: float64
1 # ddof=0可以设置为N而不是N-1进行规范化 2 re = df.var(ddof=0) 3 print(re) 4 5 ### 结果 6 # age 264.687500 7 # height 0.042275 8 # dtype: float64
2.8. 标准差(数据标准化std()函数)
求标准差函数df.std()语法如下:
DataFrame.std(axis=0, skipna=True, ddof=1, numeric_only=False, **kwargs)
参数说明:
- axis:{index (0), columns (1)}
For Series this parameter is unused and defaults to 0.
- skipna:bool, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- ddof:int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
- numeric_only:bool, default False
Include only float, int, boolean columns. Not implemented for Series.
代码示例:
1 df = pd.DataFrame({'person_id': [0, 1, 2, 3], 2 'age': [21, 25, 62, 43], 3 'height': [1.61, 1.87, 1.49, 2.01]} 4 ).set_index('person_id') 5 print(df) 6 7 ### 结果 8 # age height 9 # person_id 10 # 0 21 1.61 11 # 1 25 1.87 12 # 2 62 1.49 13 # 3 43 2.01
1 # 默认参数 2 re = df.std() 3 print(re) 4 5 ### 结果 6 age 18.786076 7 height 0.237417 8 dtype: float64
1 # 默认参数 2 re = df.std() 3 print(re) 4 5 ### 结果 6 age 18.786076 7 height 0.237417 8 dtype: float64
2.9. 求分位数(quantile()函数)
求分位数函数df.quantile()语法如下:
DataFrame.quantile(q=0.5, axis=0, numeric_only=False, interpolation='linear', method='single')
参数说明:
- q:float or array-like, default 0.5 (50% quantile)
Value between 0 <= q <= 1, the quantile(s) to compute.
- axis:{0 or ‘index’, 1 or ‘columns’}, default 0
Equals 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
- numeric_only:bool, default False
Include only float, int or boolean data.
- interpolation:{‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}
This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:
- linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
- lower: i.
- higher: j.
- nearest: i or j whichever is nearest.
- midpoint: (i + j) / 2.
- method:{‘single’, ‘table’}, default ‘single’
Whether to compute quantiles per-column (‘single’) or over all columns (‘table’). When ‘table’, the only allowed interpolation methods are ‘nearest’, ‘lower’, and ‘higher’.
代码示例:
1 df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]), 2 columns=['a', 'b']) 3 print(df) 4 5 ### 结果 6 a b 7 0 1 1 8 1 2 10 9 2 3 100 10 3 4 100
1 re1 = df.quantile(.1) 2 print(re1) 3 re2 = df.quantile([.1, .5]) 4 print(re2) 5 6 ### 结果 7 # a 1.3 8 # b 3.7 9 # Name: 0.1, dtype: float64 10 11 # a b 12 # 0.1 1.3 3.7 13 # 0.5 2.5 55.0
指定method= ' table '将计算所有列的分位数
1 re1 = df.quantile(.1, method="table", interpolation="nearest") 2 print(re1) 3 re2 = df.quantile([.1, .5], method="table", interpolation="nearest") 4 print(re2) 5 6 # ### 结果 7 # a 1 8 # b 1 9 # Name: 0.1, dtype: int32 10 11 # a b 12 # 0.1 1 1 13 # 0.5 3 100
指定numeric_only=False也将计算datetime和timedelta数据的分位数
1 df = pd.DataFrame({'A': [1, 2], 2 'B': [pd.Timestamp('2010'), 3 pd.Timestamp('2011')], 4 'C': [pd.Timedelta('1 days'), 5 pd.Timedelta('2 days')]}) 6 print(df) 7 df.quantile(0.5, numeric_only=False) 8 9 re = df.quantile(0.5, numeric_only=False) 10 print(re) 11 12 ### 结果 13 # A B C 14 # 0 1 2010-01-01 1 days 15 # 1 2 2011-01-01 2 days 16 17 # A 1.5 18 # B 2010-07-02 12:00:00 19 # C 1 days 12:00:00 20 # Name: 0.5, dtype: object
3. 数据格式化
3.1. 设置小数位数
常用方法df.round(),语法如下:
DataFrame.round(decimals=0, *args, **kwargs)
参数说明:
- decimals:int, dict, Series
Number of decimal places to round each column to. If an int is given, round each column to the same number of places. Otherwise dict and Series round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a Series. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored.
- *args
Additional keywords have no effect but might be accepted for compatibility with numpy.
- **kwargs
Additional keywords have no effect but might be accepted for compatibility with numpy.
代码示例:
1 df = pd.DataFrame([(.21, .32), (.01, .67), (.66, .03), (.21, .18)], 2 columns=['A', 'B']) 3 print(df) 4 # 保留小数点后一位 5 re = df.round(1) 6 print(re) 7 8 ### 结果 9 # A B 10 # 0 0.21 0.32 11 # 1 0.01 0.67 12 # 2 0.66 0.03 13 # 3 0.21 0.18 14 15 # A B 16 # 0 0.2 0.3 17 # 1 0.0 0.7 18 # 2 0.7 0.0 19 # 3 0.2 0.2
使用字典,可以使用列名作为键,小数位数作为值来指定特定列的位置数
1 re = df.round({'A': 1, 'B': 0}) 2 print(re) 3 4 ### 结果 5 # A B 6 # 0 0.2 0.0 7 # 1 0.0 1.0 8 # 2 0.7 0.0 9 # 3 0.2 0.0
使用Series,可以使用列名作为索引,小数位数作为值来指定特定列的位置数
1 decimals = pd.Series([0, 1], index=['A', 'B']) 2 re = df.round(decimals) 3 print(re) 4 5 ### 结果 6 # A B 7 # 0 0.0 0.3 8 # 1 0.0 0.7 9 # 2 1.0 0.0 10 # 3 0.0 0.2
3.2. 设置百分比、千分位
注意:这里会使用到匿名函数 lambda,相关用法会在之后提供讲解
设置百分位和千分位经常使用到函数df.apply(),并结合匿名函数。语法如下:
DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), by_row='compat', engine='python', engine_kwargs=None, **kwargs)
Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1). By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.
参数说明:
- func:function
Function to apply to each column or row.
- axis:{0 or ‘index’, 1 or ‘columns’}, default 0
Axis along which the function is applied:
- 0 or ‘index’: apply function to each column.
- 1 or ‘columns’: apply function to each row.
- raw:bool, default False
Determines if row or column is passed as a Series or ndarray object:
- False : passes each row or column as a Series to the function.
- True : the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance.
- result_type:{‘expand’, ‘reduce’, ‘broadcast’, None}, default None
These only act when axis=1 (columns):
- ‘expand’ : list-like results will be turned into columns.
- ‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.
- ‘broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
The default behaviour (None) depends on the return value of the applied function: list-like results will be returned as a Series of those. However if the apply function returns a Series these are expanded to columns.
- args:tuple
Positional arguments to pass to func in addition to the array/series.
- by_row:False or “compat”, default “compat”
Only has an effect when funcis a listlike or dictlike of funcs and the func isn’t a string. If “compat”, will if possible first translate the func into pandas methods (e.g. Series().apply(np.sum)will be translated to Series().sum()). If that doesn’t work, will try call to apply again with by_row=True and if that fails, will call apply again with by_row=False (backward compatible). If False, the funcs will be passed the whole Series at once.
- engine:{‘python’, ‘numba’}, default ‘python’
Choose between the python (default) engine or the numba engine in apply.
The numba engine will attempt to JIT compile the passed function, which may result in speedups for large DataFrames. It also supports the following engine_kwargs :
- nopython (compile the function in nopython mode)
- nogil (release the GIL inside the JIT compiled function)
- parallel (try to apply the function in parallel over the DataFrame)Note: Due to limitations within numba/how pandas interfaces with numba, you should only use this if raw=True
- engine_kwargs:dict
Pass keyword arguments to the engine. This is currently only used by the numba engine, see the documentation for the engine argument for more information.
- **kwargs
Additional keyword arguments to pass as keywords arguments to func.
代码示例:
设置百分比——整列保留0位
1 df = pd.DataFrame(np.random.random([5, 5]), columns=['A1', 'A2', 'A3', 'A4', 'A5']) 2 df['百分比'] = df['A1'].apply(lambda x: format(x, '.0%')) 3 print(df) 4 5 ### 结果 6 # A1 A2 A3 A4 A5 百分比 7 # 0 0.135617 0.211257 0.772651 0.921478 0.010945 14% 8 # 1 0.676379 0.993900 0.134181 0.688656 0.554876 68% 9 # 2 0.113347 0.728911 0.372836 0.378372 0.900831 11% 10 # 3 0.416929 0.519926 0.637486 0.089960 0.770882 42% 11 # 4 0.257964 0.576093 0.692108 0.310876 0.892253 26% 12 13 df['百分比'] = df['A1'].apply(lambda x: '{:.0%}'.format(x)) 14 print(df) 15 16 ### 结果 17 # A1 A2 A3 A4 A5 百分比 18 # 0 0.346965 0.865431 0.230436 0.930984 0.115680 35% 19 # 1 0.144931 0.560487 0.829222 0.181691 0.853662 14% 20 # 2 0.090430 0.799495 0.005978 0.058735 0.295530 9% 21 # 3 0.940128 0.002937 0.139490 0.966288 0.453443 94% 22 # 4 0.976460 0.155085 0.087421 0.011189 0.071188 98%
设置千分位
1 data = [['零基础学Python', '1月', 49768889], ['零基础学Python', '2月', 11777775], 2 ['零基础学Python', '3月', 13799990]] 3 columns = ['图书', '月份', '码洋'] 4 df = pd.DataFrame(data=data, columns=columns) 5 df['码洋'] = df['码洋'].apply(lambda x: format(int(x), ',')) 6 print(df) 7 8 ### 结果 9 # 图书 月份 码洋 10 # 0 零基础学Python 1月 49,768,889 11 # 1 零基础学Python 2月 11,777,775 12 # 2 零基础学Python 3月 13,799,990
时间:2024年2月5日

Python数据分析-数据计算以及数据格式化
浙公网安备 33010602011771号