Python数据分析-数据清洗

1. 主要内容

缺失值的查看与处理
重复值处理
异常值的检测与处理

2. 缺失值的查看与处理

　　缺失值是指由于某种原因导致数据为空，这种情况一般有不处理、删除、填充/替换、插值（以均值/中位数/众数等填补）这4种处理方式。

2.1. 缺失值的查看

　　可以先输出数据表的数据，然后使用方法df.info()即可查看数据的缺失值，方法参数如下：

DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None)

该方法用于打印一个DataFrame的简明摘要。
该方法打印有关DataFrame的信息，包括索引类型和列、非空值和内存使用情况。

参数说明：

verbose：bool, optional

　　Whether to print the full summary. By default, the setting in pandas.options.display.max_info_columnsis followed.

buf：writable buffer, defaults to sys.stdout

　　Where to send the output. By default, the output is printed to sys.stdout. Pass a writable buffer if you need to further process the output.

max_cols：int, optional

　　When to switch from the verbose to the truncated output. If the DataFrame has more than max_cols columns, the truncated output is used. By default, the setting in pandas.options.display.max_info_columnsis used.

memory_usage：bool, str, optional

　　Specifies whether total memory usage of the DataFrame elements (including the index) should be displayed. By default, this follows the pandas.options.display.memory_usage setting.

　　True always show memory usage. False never shows memory usage. A value of ‘deep’ is equivalent to “True with deep introspection”. Memory usage is shown in human-readable units (base-2 representation). Without deep introspection a memory estimation is made based in column dtype and number of rows assuming values consume the same memory amount for corresponding dtypes. With deep memory introspection, a real memory usage calculation is performed at the cost of computational resources.

show_counts：bool, optional

　　Whether to show the non-null counts. By default, this is shown only if the DataFrame is smaller than pandas.options.display.max_info_rows and pandas.options.display.max_info_columns. A value of True always shows the counts, and False never shows the counts.

代码示例：

1 int_values = [1, 2, 3, 4, 5]
2 text_values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
3 float_values = [0.0, 0.25, 0.5, 0.75, 1.0]
4 df = pd.DataFrame({"int_col": int_values, "text_col": text_values, "float_col": float_values})

　　打印所有列的信息

 1 # 打印所有列的信息
 2 df.info(verbose=True)
 3 
 4 ### 结果
 5 # <class 'pandas.core.frame.DataFrame'>
 6 # RangeIndex: 5 entries, 0 to 4
 7 # Data columns (total 3 columns):
 8 #  #   Column     Non-Null Count  Dtype  
 9 # ---  ------     --------------  -----  
10 #  0   int_col    5 non-null      int64  
11 #  1   text_col   5 non-null      object 
12 #  2   float_col  5 non-null      float64
13 # dtypes: float64(1), int64(1), object(1)
14 # memory usage: 248.0+ bytes

　　打印列计数及其dtypes的摘要，但不打印每个列的信息

1 # 打印列计数及其dtypes的摘要，但不打印每个列的信息
2 df.info(verbose=False)
3 
4 ### 结果
5 # <class 'pandas.core.frame.DataFrame'>
6 # RangeIndex: 5 entries, 0 to 4
7 # Columns: 3 entries, int_col to float_col
8 # dtypes: float64(1), int64(1), object(1)
9 # memory usage: 248.0+ bytes

　　将DataFrame.info的输出管道输出到缓冲区而不是sys.stdout，获取缓冲区内容并写入文本文件

 1 # 将DataFrame.info的输出管道输出到缓冲区而不是sys.stdout，获取缓冲区内容并写入文本文件
 2 buffer = io.StringIO()
 3 df.info(buf=buffer)
 4 s = buffer.getvalue()
 5 with open("df_info.txt", "w",
 6           encoding="utf-8") as f:  
 7     f.write(s)
 8 
 9 ### df_info.txt文件内容
10 # <class 'pandas.core.frame.DataFrame'>
11 # RangeIndex: 5 entries, 0 to 4
12 # Data columns (total 3 columns):
13 #  #   Column     Non-Null Count  Dtype  
14 # ---  ------     --------------  -----  
15 #  0   int_col    5 non-null      int64  
16 #  1   text_col   5 non-null      object 
17 #  2   float_col  5 non-null      float64
18 # dtypes: float64(1), int64(1), object(1)
19 # memory usage: 248.0+ bytes

　　memory_usage参数缺省值状态

 1 # memory_usage参数缺省值状态
 2 random_strings_array = np.random.choice(['a', 'b', 'c'], 10 ** 6)
 3 df = pd.DataFrame({
 4     'column_1': np.random.choice(['a', 'b', 'c'], 10 ** 6),
 5     'column_2': np.random.choice(['a', 'b', 'c'], 10 ** 6),
 6     'column_3': np.random.choice(['a', 'b', 'c'], 10 ** 6)
 7 })
 8 print(df.info())
 9 
10 ### 结果
11 # <class 'pandas.core.frame.DataFrame'>
12 # RangeIndex: 1000000 entries, 0 to 999999
13 # Data columns (total 3 columns):
14 #  #   Column    Non-Null Count    Dtype 
15 # ---  ------    --------------    ----- 
16 #  0   column_1  1000000 non-null  object
17 #  1   column_2  1000000 non-null  object
18 #  2   column_3  1000000 non-null  object
19 # dtypes: object(3)
20 # memory usage: 22.9+ MB

　　memory_usage参数为'deep'

 1 df.info(memory_usage='deep')
 2 
 3 ### 结果
 4 # <class 'pandas.core.frame.DataFrame'>
 5 # RangeIndex: 1000000 entries, 0 to 999999
 6 # Data columns (total 3 columns):
 7 #  #   Column    Non-Null Count    Dtype 
 8 # ---  ------    --------------    ----- 
 9 #  0   column_1  1000000 non-null  object
10 #  1   column_2  1000000 non-null  object
11 #  2   column_3  1000000 non-null  object
12 # dtypes: object(3)
13 # memory usage: 165.9 MB

2.2. 缺失值处理——删除

　　通过前面的判断得知数据缺失情况，下面将缺失值删除，主要使用dropna()方法，该方法用于删除含有缺失值的行，主要代码如下：

DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False, ignore_index=False)

参数说明：

axis：{0 or ‘index’, 1 or ‘columns’}, default 0

　　Determine if rows or columns which contain missing values are removed.

0, or ‘index’ : Drop rows which contain missing values.
1, or ‘columns’ : Drop columns which contain missing value.

　　Only a single axis is allowed.

how：{‘any’, ‘all’}, default ‘any’

　　Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.

‘any’ : If any NA values are present, drop that row or column.
‘all’ : If all values are NA, drop that row or column.

thresh：int, optional

　　Require that many non-NA values. Cannot be combined with how.

subset：column label or sequence of labels, optional

　　Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.

inplace：bool, default False

　　Whether to modify the DataFrame rather than creating a new one.

ignore_index：bool, default False

　　If True, the resulting axis will be labeled 0, 1, …, n - 1.

代码示例：

1 df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
2                    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
3                    "born": [pd.NaT, pd.Timestamp("1940-04-25"),
4                             pd.NaT]})

　　删除至少缺少一个元素的行

1 df.dropna()
2 
3 ### 结果
4 #      name        toy       born
5 # 1  Batman  Batmobile 1940-04-25

　　删除至少缺少一个元素的列

1 df.dropna(axis='columns')
2 
3 ### 结果
4 #        name
5 # 0    Alfred
6 # 1    Batman
7 # 2  Catwoman

　　删除缺少所有元素的行

1 df.dropna(how='all')
2 
3 ### 结果
4 #        name        toy       born
5 # 0    Alfred        NaN        NaT
6 # 1    Batman  Batmobile 1940-04-25
7 # 2  Catwoman   Bullwhip        NaT

　　只保留至少有2个非NA值的行

1 df.dropna(thresh=2)
2 
3 ### 结果
4 #        name        toy       born
5 # 1    Batman  Batmobile 1940-04-25
6 # 2  Catwoman   Bullwhip        NaT

　　定义在哪些列中查找缺失的值

1 df.dropna(subset=['name', 'toy'])
2 
3 ### 结果
4 #        name        toy       born
5 # 1    Batman  Batmobile 1940-04-25
6 # 2  Catwoman   Bullwhip        NaT

2.3. 缺失值处理——填充

　　主要使用fillna()方法，主要代码如下：

DataFrame.fillna(value=None, *, method=None, axis=None, inplace=False, limit=None, downcast=_NoDefault.no_default)

参数说明：

value：scalar, dict, Series, or DataFrame

　　Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.

method：{‘backfill’, ‘bfill’, ‘ffill’, None}, default None

　　Method to use for filling holes in reindexed Series:

ffill: propagate last valid observation forward to next valid.
backfill / bfill: use next valid observation to fill gap.

axis：{0 or ‘index’} for Series, {0 or ‘index’, 1 or ‘columns’} for DataFrame

　　Axis along which to fill missing values. For Series this parameter is unused and defaults to 0.

inplace：bool, default False

　　If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).

limit：int, default None

　　If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

downcast：dict, default is None

　　A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).

代码示例：

 1 df = pd.DataFrame([[np.nan, 2, np.nan, 0],
 2                    [3, 4, np.nan, 1],
 3                    [np.nan, np.nan, np.nan, np.nan],
 4                    [np.nan, 3, np.nan, 4]],
 5                   columns=list("ABCD"))
 6 print(df)
 7 
 8 ### 结果
 9 #      A    B   C    D
10 # 0  NaN  2.0 NaN  0.0
11 # 1  3.0  4.0 NaN  1.0
12 # 2  NaN  NaN NaN  NaN
13 # 3  NaN  3.0 NaN  4.0

　　将所有NaN元素替换为0

1 df.fillna(0)
2 
3 ### 结果
4 #      A    B    C    D
5 # 0  0.0  2.0  0.0  0.0
6 # 1  3.0  4.0  0.0  1.0
7 # 2  0.0  0.0  0.0  0.0
8 # 3  0.0  3.0  0.0  4.0

　　用0，1，2，3替换列中的所有NaN元素

1 values = {"A": 0, "B": 1, "C": 2, "D": 3}
2 df1 = df.fillna(value=values)
3 
4 ### 结果
5 # A    B    C    D
6 # 0  0.0  2.0  2.0  0.0
7 # 1  3.0  4.0  2.0  1.0
8 # 2  0.0  1.0  2.0  3.0
9 # 3  0.0  3.0  2.0  4.0

　　只替换第一个NaN元素

 1 values = {"A": 0, "B": 1, "C": 2, "D": 3}
 2 df1 = df.fillna(value=values, limit=1)
 3 print(df1)
 4 
 5 ### 结果
 6 #      A    B    C    D
 7 # 0  0.0  2.0  2.0  0.0
 8 # 1  3.0  4.0  NaN  1.0
 9 # 2  NaN  1.0  NaN  3.0
10 # 3  NaN  3.0  NaN  4.0

　　当使用DataFrame填充时，会沿着相同的列名和相同的索引进行替换

 1 df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE"))
 2 df1 = df.fillna(df2)
 3 print(df1)
 4 
 5 ### 结果
 6 #      A    B    C    D
 7 # 0  0.0  2.0  0.0  0.0
 8 # 1  3.0  4.0  0.0  1.0
 9 # 2  0.0  0.0  0.0  NaN
10 # 3  0.0  3.0  0.0  4.0

3. 重复值的查看与处理

3.1. 重复值的查看

　　主要使用df.duplicated()方法，语法如下：

DataFrame.duplicated(subset=None, keep='first')

返回表示重复行的布尔系列。
考虑某些列是可选的。

参数说明：

subset：column label or sequence of labels, optional

　　Only consider certain columns for identifying duplicates, by default use all of the columns.

keep：{‘first’, ‘last’, False }, default ‘first’

　　Determines which duplicates (if any) to mark.

first : Mark duplicates as True except for the first occurrence.
last : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.

代码示例：

 1 df = pd.DataFrame({
 2     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
 3     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
 4     'rating': [4, 4, 3.5, 15, 5]
 5 })
 6 print(df)
 7 
 8 ### 结果
 9 #      brand style  rating
10 # 0  Yum Yum   cup     4.0
11 # 1  Yum Yum   cup     4.0
12 # 2  Indomie   cup     3.5
13 # 3  Indomie  pack    15.0
14 # 4  Indomie  pack     5.0

　　默认情况下，对于每组重复的值，第一个出现设置为False，其他所有出现设置为True

 1 df.duplicated()
 2 print(df1)
 3 
 4 ### 结果
 5 # 0    False
 6 # 1     True
 7 # 2    False
 8 # 3    False
 9 # 4    False
10 # dtype: bool

　　通过使用' last '，每组重复值的最后一次出现设置为False，所有其他的设置为True

 1 df.duplicated(keep='last')
 2 print(df1)
 3 
 4 ### 结果
 5 # 0     True
 6 # 1    False
 7 # 2    False
 8 # 3    False
 9 # 4    False
10 # dtype: bool

　　通过将keep设置为False，所有副重复值都为True

 1 df.duplicated(keep=False)
 2 print(df1)
 3 
 4 ### 结果
 5 # 0     True
 6 # 1     True
 7 # 2    False
 8 # 3    False
 9 # 4    False
10 # dtype: bool

　　使用子集的方式查找特定列上的重复项

 1 df1 = df.duplicated(subset=['brand'])
 2 print(df1)
 3 
 4 ### 结果
 5 # 0    False
 6 # 1     True
 7 # 2    False
 8 # 3     True
 9 # 4     True
10 # dtype: bool

3.2. 重复值处理——删除

DataFrame.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)

参数说明：

subset：column label or sequence of labels, optional

　　Only consider certain columns for identifying duplicates, by default use all of the columns.

keep：{‘first’, ‘last’, False}, default ‘first’

　　Determines which duplicates (if any) to keep.

‘first’ : Drop duplicates except for the first occurrence.
‘last’ : Drop duplicates except for the last occurrence.
False : Drop all duplicates.

inplace：bool, default False

　　Whether to modify the DataFrame rather than creating a new one.

ignore_index：bool, default False

　　If True, the resulting axis will be labeled 0, 1, …, n - 1.

代码示例：

　　默认情况下，它会基于所有列删除重复行

1 df.drop_duplicates()
2 print(df1)
3 
4 ### 结果
5 #      brand style  rating
6 # 0  Yum Yum   cup     4.0
7 # 2  Indomie   cup     3.5
8 # 3  Indomie  pack    15.0
9 # 4  Indomie  pack     5.0

　　使用子集删除特定列上的重复项

1 df1 = df.drop_duplicates(subset=['brand'])
2 print(df1)
3 
4 ### 结果
5 #      brand style  rating
6 # 0  Yum Yum   cup     4.0
7 # 2  Indomie   cup     3.5

　　　　使用keep删除重复项并保留最近出现的内容

1 df1 = df.drop_duplicates(subset=['brand', 'style'], keep='last')
2 print(df1)
3 
4 ### 结果
5 #      brand style  rating
6 # 1  Yum Yum   cup     4.0
7 # 2  Indomie   cup     3.5
8 # 4  Indomie  pack     5.0

4. 异常值的检测与处理

主要包括以下3种处理方式：

最常用的方式是删除。
将异常值当缺失值处理，以某个值填充。
将异常值当特殊情况进行分析，研究异常值出现的原因。

2024年2月4日

posted @ 2024-02-04 11:13 一路狂奔的乌龟阅读(45) 评论(0) 收藏举报

刷新页面返回顶部

一路狂奔的乌龟

别听世俗的耳语，去看自己喜欢的风景。

Python数据分析-数据清洗

1. 主要内容

2. 缺失值的查看与处理

2.1. 缺失值的查看

2.2. 缺失值处理——删除

2.3. 缺失值处理——填充

3. 重复值的查看与处理

3.1. 重复值的查看

3.2. 重复值处理——删除

4. 异常值的检测与处理

公告