Python数据分析-数据清洗
1. 主要内容
- 缺失值的查看与处理
- 重复值处理
- 异常值的检测与处理
2. 缺失值的查看与处理
缺失值是指由于某种原因导致数据为空,这种情况一般有不处理、删除、填充/替换、插值(以均值/中位数/众数等填补)这4种处理方式。
2.1. 缺失值的查看
可以先输出数据表的数据,然后使用方法df.info()即可查看数据的缺失值,方法参数如下:
DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None)
- 该方法用于打印一个DataFrame的简明摘要。
- 该方法打印有关DataFrame的信息,包括索引类型和列、非空值和内存使用情况。
参数说明:
- verbose:bool, optional
Whether to print the full summary. By default, the setting in pandas.options.display.max_info_columnsis followed.
- buf:writable buffer, defaults to sys.stdout
Where to send the output. By default, the output is printed to sys.stdout. Pass a writable buffer if you need to further process the output.
- max_cols:int, optional
When to switch from the verbose to the truncated output. If the DataFrame has more than max_cols columns, the truncated output is used. By default, the setting in pandas.options.display.max_info_columnsis used.
- memory_usage:bool, str, optional
Specifies whether total memory usage of the DataFrame elements (including the index) should be displayed. By default, this follows the pandas.options.display.memory_usage setting.
True always show memory usage. False never shows memory usage. A value of ‘deep’ is equivalent to “True with deep introspection”. Memory usage is shown in human-readable units (base-2 representation). Without deep introspection a memory estimation is made based in column dtype and number of rows assuming values consume the same memory amount for corresponding dtypes. With deep memory introspection, a real memory usage calculation is performed at the cost of computational resources.
- show_counts:bool, optional
Whether to show the non-null counts. By default, this is shown only if the DataFrame is smaller than pandas.options.display.max_info_rows and pandas.options.display.max_info_columns. A value of True always shows the counts, and False never shows the counts.
代码示例:
1 int_values = [1, 2, 3, 4, 5] 2 text_values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon'] 3 float_values = [0.0, 0.25, 0.5, 0.75, 1.0] 4 df = pd.DataFrame({"int_col": int_values, "text_col": text_values, "float_col": float_values})
打印所有列的信息
1 # 打印所有列的信息 2 df.info(verbose=True) 3 4 ### 结果 5 # <class 'pandas.core.frame.DataFrame'> 6 # RangeIndex: 5 entries, 0 to 4 7 # Data columns (total 3 columns): 8 # # Column Non-Null Count Dtype 9 # --- ------ -------------- ----- 10 # 0 int_col 5 non-null int64 11 # 1 text_col 5 non-null object 12 # 2 float_col 5 non-null float64 13 # dtypes: float64(1), int64(1), object(1) 14 # memory usage: 248.0+ bytes
打印列计数及其dtypes的摘要,但不打印每个列的信息
1 # 打印列计数及其dtypes的摘要,但不打印每个列的信息 2 df.info(verbose=False) 3 4 ### 结果 5 # <class 'pandas.core.frame.DataFrame'> 6 # RangeIndex: 5 entries, 0 to 4 7 # Columns: 3 entries, int_col to float_col 8 # dtypes: float64(1), int64(1), object(1) 9 # memory usage: 248.0+ bytes
将DataFrame.info的输出管道输出到缓冲区而不是sys.stdout,获取缓冲区内容并写入文本文件
1 # 将DataFrame.info的输出管道输出到缓冲区而不是sys.stdout,获取缓冲区内容并写入文本文件 2 buffer = io.StringIO() 3 df.info(buf=buffer) 4 s = buffer.getvalue() 5 with open("df_info.txt", "w", 6 encoding="utf-8") as f: 7 f.write(s) 8 9 ### df_info.txt文件内容 10 # <class 'pandas.core.frame.DataFrame'> 11 # RangeIndex: 5 entries, 0 to 4 12 # Data columns (total 3 columns): 13 # # Column Non-Null Count Dtype 14 # --- ------ -------------- ----- 15 # 0 int_col 5 non-null int64 16 # 1 text_col 5 non-null object 17 # 2 float_col 5 non-null float64 18 # dtypes: float64(1), int64(1), object(1) 19 # memory usage: 248.0+ bytes
memory_usage参数缺省值状态
1 # memory_usage参数缺省值状态 2 random_strings_array = np.random.choice(['a', 'b', 'c'], 10 ** 6) 3 df = pd.DataFrame({ 4 'column_1': np.random.choice(['a', 'b', 'c'], 10 ** 6), 5 'column_2': np.random.choice(['a', 'b', 'c'], 10 ** 6), 6 'column_3': np.random.choice(['a', 'b', 'c'], 10 ** 6) 7 }) 8 print(df.info()) 9 10 ### 结果 11 # <class 'pandas.core.frame.DataFrame'> 12 # RangeIndex: 1000000 entries, 0 to 999999 13 # Data columns (total 3 columns): 14 # # Column Non-Null Count Dtype 15 # --- ------ -------------- ----- 16 # 0 column_1 1000000 non-null object 17 # 1 column_2 1000000 non-null object 18 # 2 column_3 1000000 non-null object 19 # dtypes: object(3) 20 # memory usage: 22.9+ MB
memory_usage参数为'deep'
1 df.info(memory_usage='deep') 2 3 ### 结果 4 # <class 'pandas.core.frame.DataFrame'> 5 # RangeIndex: 1000000 entries, 0 to 999999 6 # Data columns (total 3 columns): 7 # # Column Non-Null Count Dtype 8 # --- ------ -------------- ----- 9 # 0 column_1 1000000 non-null object 10 # 1 column_2 1000000 non-null object 11 # 2 column_3 1000000 non-null object 12 # dtypes: object(3) 13 # memory usage: 165.9 MB
2.2. 缺失值处理——删除
通过前面的判断得知数据缺失情况,下面将缺失值删除,主要使用dropna()方法,该方法用于删除含有缺失值的行,主要代码如下:
DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False, ignore_index=False)
参数说明:
- axis:{0 or ‘index’, 1 or ‘columns’}, default 0
Determine if rows or columns which contain missing values are removed.
- 0, or ‘index’ : Drop rows which contain missing values.
- 1, or ‘columns’ : Drop columns which contain missing value.
Only a single axis is allowed.
- how:{‘any’, ‘all’}, default ‘any’
Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
- ‘any’ : If any NA values are present, drop that row or column.
- ‘all’ : If all values are NA, drop that row or column.
- thresh:int, optional
Require that many non-NA values. Cannot be combined with how.
- subset:column label or sequence of labels, optional
Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.
- inplace:bool, default False
Whether to modify the DataFrame rather than creating a new one.
- ignore_index:bool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.
代码示例:
1 df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'], 2 "toy": [np.nan, 'Batmobile', 'Bullwhip'], 3 "born": [pd.NaT, pd.Timestamp("1940-04-25"), 4 pd.NaT]})
删除至少缺少一个元素的行
1 df.dropna() 2 3 ### 结果 4 # name toy born 5 # 1 Batman Batmobile 1940-04-25
删除至少缺少一个元素的列
1 df.dropna(axis='columns') 2 3 ### 结果 4 # name 5 # 0 Alfred 6 # 1 Batman 7 # 2 Catwoman
删除缺少所有元素的行
1 df.dropna(how='all') 2 3 ### 结果 4 # name toy born 5 # 0 Alfred NaN NaT 6 # 1 Batman Batmobile 1940-04-25 7 # 2 Catwoman Bullwhip NaT
只保留至少有2个非NA值的行
1 df.dropna(thresh=2) 2 3 ### 结果 4 # name toy born 5 # 1 Batman Batmobile 1940-04-25 6 # 2 Catwoman Bullwhip NaT
定义在哪些列中查找缺失的值
1 df.dropna(subset=['name', 'toy']) 2 3 ### 结果 4 # name toy born 5 # 1 Batman Batmobile 1940-04-25 6 # 2 Catwoman Bullwhip NaT
2.3. 缺失值处理——填充
主要使用fillna()方法,主要代码如下:
DataFrame.fillna(value=None, *, method=None, axis=None, inplace=False, limit=None, downcast=_NoDefault.no_default)
参数说明:
- value:scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
- method:{‘backfill’, ‘bfill’, ‘ffill’, None}, default None
Method to use for filling holes in reindexed Series:
- ffill: propagate last valid observation forward to next valid.
- backfill / bfill: use next valid observation to fill gap.
- axis:{0 or ‘index’} for Series, {0 or ‘index’, 1 or ‘columns’} for DataFrame
Axis along which to fill missing values. For Series this parameter is unused and defaults to 0.
- inplace:bool, default False
If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).
- limit:int, default None
If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
- downcast:dict, default is None
A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).
代码示例:
1 df = pd.DataFrame([[np.nan, 2, np.nan, 0], 2 [3, 4, np.nan, 1], 3 [np.nan, np.nan, np.nan, np.nan], 4 [np.nan, 3, np.nan, 4]], 5 columns=list("ABCD")) 6 print(df) 7 8 ### 结果 9 # A B C D 10 # 0 NaN 2.0 NaN 0.0 11 # 1 3.0 4.0 NaN 1.0 12 # 2 NaN NaN NaN NaN 13 # 3 NaN 3.0 NaN 4.0
将所有NaN元素替换为0
1 df.fillna(0) 2 3 ### 结果 4 # A B C D 5 # 0 0.0 2.0 0.0 0.0 6 # 1 3.0 4.0 0.0 1.0 7 # 2 0.0 0.0 0.0 0.0 8 # 3 0.0 3.0 0.0 4.0
用0,1,2,3替换列中的所有NaN元素
1 values = {"A": 0, "B": 1, "C": 2, "D": 3} 2 df1 = df.fillna(value=values) 3 4 ### 结果 5 # A B C D 6 # 0 0.0 2.0 2.0 0.0 7 # 1 3.0 4.0 2.0 1.0 8 # 2 0.0 1.0 2.0 3.0 9 # 3 0.0 3.0 2.0 4.0
只替换第一个NaN元素
1 values = {"A": 0, "B": 1, "C": 2, "D": 3} 2 df1 = df.fillna(value=values, limit=1) 3 print(df1) 4 5 ### 结果 6 # A B C D 7 # 0 0.0 2.0 2.0 0.0 8 # 1 3.0 4.0 NaN 1.0 9 # 2 NaN 1.0 NaN 3.0 10 # 3 NaN 3.0 NaN 4.0
当使用DataFrame填充时,会沿着相同的列名和相同的索引进行替换
1 df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE")) 2 df1 = df.fillna(df2) 3 print(df1) 4 5 ### 结果 6 # A B C D 7 # 0 0.0 2.0 0.0 0.0 8 # 1 3.0 4.0 0.0 1.0 9 # 2 0.0 0.0 0.0 NaN 10 # 3 0.0 3.0 0.0 4.0
3. 重复值的查看与处理
3.1. 重复值的查看
主要使用df.duplicated()方法,语法如下:
DataFrame.duplicated(subset=None, keep='first')
- 返回表示重复行的布尔系列。
- 考虑某些列是可选的。
参数说明:
- subset:column label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by default use all of the columns.
- keep:{‘first’, ‘last’, False }, default ‘first’
Determines which duplicates (if any) to mark.
- first : Mark duplicates as True except for the first occurrence.
- last : Mark duplicates as True except for the last occurrence.
- False : Mark all duplicates as True.
代码示例:
1 df = pd.DataFrame({ 2 'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'], 3 'style': ['cup', 'cup', 'cup', 'pack', 'pack'], 4 'rating': [4, 4, 3.5, 15, 5] 5 }) 6 print(df) 7 8 ### 结果 9 # brand style rating 10 # 0 Yum Yum cup 4.0 11 # 1 Yum Yum cup 4.0 12 # 2 Indomie cup 3.5 13 # 3 Indomie pack 15.0 14 # 4 Indomie pack 5.0
默认情况下,对于每组重复的值,第一个出现设置为False,其他所有出现设置为True
1 df.duplicated() 2 print(df1) 3 4 ### 结果 5 # 0 False 6 # 1 True 7 # 2 False 8 # 3 False 9 # 4 False 10 # dtype: bool
通过使用' last ',每组重复值的最后一次出现设置为False,所有其他的设置为True
1 df.duplicated(keep='last') 2 print(df1) 3 4 ### 结果 5 # 0 True 6 # 1 False 7 # 2 False 8 # 3 False 9 # 4 False 10 # dtype: bool
通过将keep设置为False,所有副重复值都为True
1 df.duplicated(keep=False) 2 print(df1) 3 4 ### 结果 5 # 0 True 6 # 1 True 7 # 2 False 8 # 3 False 9 # 4 False 10 # dtype: bool
使用子集的方式查找特定列上的重复项
1 df1 = df.duplicated(subset=['brand']) 2 print(df1) 3 4 ### 结果 5 # 0 False 6 # 1 True 7 # 2 False 8 # 3 True 9 # 4 True 10 # dtype: bool
3.2. 重复值处理——删除
DataFrame.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)
参数说明:
- subset:column label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by default use all of the columns.
- keep:{‘first’, ‘last’, False}, default ‘first’
Determines which duplicates (if any) to keep.
- ‘first’ : Drop duplicates except for the first occurrence.
- ‘last’ : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
- inplace:bool, default False
Whether to modify the DataFrame rather than creating a new one.
- ignore_index:bool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.
代码示例:
默认情况下,它会基于所有列删除重复行
1 df.drop_duplicates() 2 print(df1) 3 4 ### 结果 5 # brand style rating 6 # 0 Yum Yum cup 4.0 7 # 2 Indomie cup 3.5 8 # 3 Indomie pack 15.0 9 # 4 Indomie pack 5.0
使用子集删除特定列上的重复项
1 df1 = df.drop_duplicates(subset=['brand']) 2 print(df1) 3 4 ### 结果 5 # brand style rating 6 # 0 Yum Yum cup 4.0 7 # 2 Indomie cup 3.5
使用keep删除重复项并保留最近出现的内容
1 df1 = df.drop_duplicates(subset=['brand', 'style'], keep='last') 2 print(df1) 3 4 ### 结果 5 # brand style rating 6 # 1 Yum Yum cup 4.0 7 # 2 Indomie cup 3.5 8 # 4 Indomie pack 5.0
4. 异常值的检测与处理
主要包括以下3种处理方式:
- 最常用的方式是删除。
- 将异常值当缺失值处理,以某个值填充。
- 将异常值当特殊情况进行分析,研究异常值出现的原因。
2024年2月4日

Python数据分析-数据清洗
浙公网安备 33010602011771号