pandas Data Analysis

In [1]:

import numpy as np
import pandas as pd

I. Data Cleaning

Read the file "豆瓣电影数据.xlsx" (Douban movie data) and clean it for analysis.

In [2]:

df1 = pd.read_excel('豆瓣电影数据.xlsx', index_col=0)
df1.head()

Out[2]:

	名字	投票人数	类型	产地	上映时间	时长	年代	评分	首映地点
0	肖申克的救赎	692795.0	剧情/犯罪	美国	1994-09-10 00:00:00	142	1994	9.6	多伦多电影节
1	控方证人	42995.0	剧情/悬疑/犯罪	美国	1957-12-17 00:00:00	116	1957	9.5	美国
2	美丽人生	327855.0	剧情/喜剧/爱情	意大利	1997-12-20 00:00:00	116	1997	9.5	意大利
3	阿甘正传	580897.0	剧情/爱情	美国	1994-06-23 00:00:00	142	1994	9.4	洛杉矶首映
4	霸王别姬	478523.0	剧情/爱情/同性	中国大陆	1993-01-01 00:00:00	171	1993	9.4	香港

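1.1 Handling missing values and outliers

Method	Description
dropna	Filters rows or columns by label, dropping those that contain missing values
fillna	Fills in missing values
isnull	Returns a boolean object marking which values are missing
notnull	The negation of isnull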
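The four methods in the table can be tried on a small frame before touching the real data. The following is a minimal sketch; the frame `toy`, its column names, and the fill values are made up for illustration and are not part of the Douban file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Douban file.
toy = pd.DataFrame({
    "name": ["A", None, "C"],
    "votes": [10.0, 20.0, np.nan],
})

missing_mask = toy.isnull()              # True where a value is missing
present_mask = toy.notnull()             # element-wise negation of isnull
missing_per_column = toy.isnull().sum()  # number of missing values per column

dropped = toy.dropna()                   # drops every row with any missing value
filled = toy.fillna({"name": "unknown", "votes": 0})  # per-column fill values

print(missing_per_column)
print(dropped)
print(filled)
```

Summing the boolean mask is a quick way to see which columns need attention, much like reading the non-null counts from `info()`.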

1.1.1 Handling missing values

Discard
- drop
Fill missing values
- fillna

In [3]:

df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38738 entries, 0 to 38737
Data columns (total 9 columns):
名字      38178 non-null object
投票人数    38738 non-null float64
类型      38738 non-null object
产地      38738 non-null object
上映时间    38736 non-null object
时长      38738 non-null object
年代      38738 non-null object
评分      38738 non-null float64
首映地点    38737 non-null object
dtypes: float64(2), object(7)
memory usage: 3.0+ MB

In [4]:

df1.describe()

Out[4]:

	投票人数	评分
count	38738.000000	38738.000000
mean	6185.833702	6.935704
std	26143.518786	1.270101
min	-118.000000	2.000000
25%	98.000000	6.300000
50%	341.000000	7.100000
75%	1739.750000	7.800000
max	692795.000000	9.900000

1. Inspect the rows whose 名字 (name) value is missing

In [5]:

df1[df1.名字.isnull()].head()
 

Out[5]:

	名字	投票人数	类型	产地	上映时间	时长	年代	评分	首映地点
231	NaN	144.0	纪录片/音乐	韩国	2011-02-02 00:00:00	90	2011	9.7	美国
361	NaN	80.0	短片	其他	1905-05-17 00:00:00	4	1964	5.7	美国
369	NaN	5315.0	剧情	日本	2004-07-10 00:00:00	111	2004	7.5	日本
372	NaN	263.0	短片/音乐	英国	1998-06-30 00:00:00	34	1998	9.2	美国
374	NaN	47.0	短片	其他	1905-05-17 00:00:00	3	1964	6.7	美国

In [7]:

tmp1 = df1[df1.名字.isnull()].index
tmp1

Out[7]:

Int64Index([  231,   361,   369,   372,   374,   375,   411,   432,   441,
              448,
            ...
            38250, 38316, 38342, 38361, 38508, 38523, 38555, 38560, 38643,
            38656],
           dtype='int64', length=560)

In [8]:

df1.drop(tmp1,inplace=True) 

In [9]:

df1[df1.名字.isnull()]

Out[9]:

	名字	投票人数	类型	产地	上映时间	时长	年代	评分	首映地点
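The cells above collect the index of the rows with a missing name and then pass it to `drop`. As a sketch of an alternative, `dropna` with the `subset` parameter should give the same result in one step; the toy frame below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that reuses two of the notebook's column names.
toy = pd.DataFrame(
    {"名字": ["甲", np.nan, "丙"], "评分": [9.0, 8.0, np.nan]},
    index=[0, 1, 2],
)

# Two-step approach, as in the cells above: find the index, then drop it.
idx = toy[toy["名字"].isnull()].index
two_step = toy.drop(idx)

# One-step approach: only the 名字 column is checked for missing values.
one_step = toy.dropna(subset=["名字"])

print(one_step)
```

Because `subset` restricts the check to 名字, the row whose 评分 is missing is kept in both versions.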

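1.1.2 Handling outliers

Examples: a vote count (投票人数) that is negative, or one that is not a whole number.
When removing such rows does not change the overall distribution of the data, they can simply be deleted.
For anomalies in other columns, the main fix is format conversion.

1. Rows where the vote count is negative or fractional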

In [12]:

idx1 = df1[(df1.投票人数 < 0) | (df1.投票人数 % 1 != 0)].index
df1.drop(idx1, inplace=True)

2. Verify the cleanup

In [13]:

df1[(df1.投票人数 < 0) | (df1.投票人数 % 1 != 0)]
 

Out[13]:

	名字	投票人数	类型	产地	上映时间	时长	年代	评分	首映地点
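The filter used above combines two boolean masks with `|`. The sketch below splits them apart on invented values so that each condition can be inspected on its own; the column name `votes` is a placeholder.

```python
import numpy as np
import pandas as pd

# Hypothetical vote counts: valid, negative, fractional, valid, missing.
toy = pd.DataFrame({"votes": [100.0, -5.0, 12.5, 30.0, np.nan]})

negative = toy["votes"] < 0          # True for -5.0
fractional = toy["votes"] % 1 != 0   # True for 12.5, and also for NaN
bad = negative | fractional

cleaned = toy[~bad]                  # keep only rows that pass both checks
print(cleaned)
```

Note that a missing value also fails the `% 1 != 0` test, because NaN compares unequal to everything, so this filter removes rows with a missing vote count as well.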

In [14]:

df1.head()

Out[14]:

	名字	投票人数	类型	产地	上映时间	时长	年代	评分	首映地点
0	肖申克的救赎	692795.0	剧情/犯罪	美国	1994-09-10 00:00:00	142	1994	9.6	多伦多电影节
1	控方证人	42995.0	剧情/悬疑/犯罪	美国	1957-12-17 00:00:00	116	1957	9.5	美国
2	美丽人生	327855.0	剧情/喜剧/爱情	意大利	1997-12-20 00:00:00	116	1997	9.5	意大利
3	阿甘正传	580897.0	剧情/爱情	美国	1994-06-23 00:00:00	142	1994	9.4	洛杉矶首映
4	霸王别姬	478523.0	剧情/爱情/同性	中国大陆	1993-01-01 00:00:00	171	1993	9.4	香港

In [18]:

df1.iloc[3, 7] = 9.4
 

In [19]:

df1.head()

Out[19]:

	名字	投票人数	类型	产地	上映时间	时长	年代	评分	首映地点
0	肖申克的救赎	692795.0	剧情/犯罪	美国	1994-09-10 00:00:00	142	1994	9.6	多伦多电影节
1	控方证人	42995.0	剧情/悬疑/犯罪	美国	1957-12-17 00:00:00	116	1957	9.5	美国
2	美丽人生	327855.0	剧情/喜剧/爱情	意大利	1997-12-20 00:00:00	116	1997	9.5	意大利
3	阿甘正传	580897.0	剧情/爱情	美国	1994-06-23 00:00:00	142	1994	9.4	洛杉矶首映
4	霸王别姬	478523.0	剧情/爱情/同性	中国大陆	1993-01-01 00:00:00	171	1993	9.4	香港

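1.2 Data format conversion

To make later statistics and processing easier, the data types of some columns need to be adjusted

1. Check the format of the 投票人数 (vote count) column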

In [20]:

df1.投票人数.dtype

Out[20]:

dtype('float64')

Use astype to convert the format

In [21]:

df1.投票人数 = df1.投票人数.astype(np.int32)
df1.head()

Out[21]:

	名字	投票人数	类型	产地	上映时间	时长	年代	评分	首映地点
0	肖申克的救赎	692795	剧情/犯罪	美国	1994-09-10 00:00:00	142	1994	9.6	多伦多电影节
1	控方证人	42995	剧情/悬疑/犯罪	美国	1957-12-17 00:00:00	116	1957	9.5	美国
2	美丽人生	327855	剧情/喜剧/爱情	意大利	1997-12-20 00:00:00	116	1997	9.5	意大利
3	阿甘正传	580897	剧情/爱情	美国	1994-06-23 00:00:00	142	1994	9.4	洛杉矶首映
4	霸王别姬	478523	剧情/爱情/同性	中国大陆	1993-01-01 00:00:00	171	1993	9.4	香港
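The conversion above succeeds because the rows with invalid vote counts were already removed. As a small sketch on invented values, `astype` to a NumPy integer type refuses a column that still contains NaN, while the nullable `"Int64"` dtype is one possible way to keep the gap.

```python
import numpy as np
import pandas as pd

# Clean float column: converts without trouble.
votes = pd.Series([692795.0, 42995.0, 327855.0])
as_int = votes.astype(np.int32)

# Column with a missing value: NumPy integers cannot represent NaN.
with_gap = pd.Series([1.0, np.nan])
try:
    with_gap.astype(np.int32)
    failed = False
except (ValueError, TypeError):
    failed = True

# The nullable integer dtype stores the gap as <NA> instead.
nullable = with_gap.astype("Int64")

print(as_int.dtype, failed)
print(nullable)
```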

2. Check the format of the 年代 (year) column

The 年代 column is stored as strings and needs to be converted to a numeric type

In [23]:

df1.年代.dtype

Out[23]:

dtype('O')

In [24]:

df1.年代.astype(np.int32)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-24-7461c5d7b8cb> in <module>
----> 1 df1.年代.astype(np.int32)

f:\mywork\conda\envs\mydata\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
   5880             # else, only a single dtype is given
   5881             new_data = self._data.astype(
-> 5882                 dtype=dtype, copy=copy, errors=errors, **kwargs
   5883             )
   5884             return self._constructor(new_data).__finalize__(self)

f:\mywork\conda\envs\mydata\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, **kwargs)
    579 
    580     def astype(self, dtype, **kwargs):
--> 581         return self.apply("astype", dtype=dtype, **kwargs)
    582 
    583     def convert(self, **kwargs):

f:\mywork\conda\envs\mydata\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    436                     kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
    437 
--> 438             applied = getattr(b, f)(**kwargs)
    439             result_blocks = _extend_blocks(applied, result_blocks)
    440 

f:\mywork\conda\envs\mydata\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
    557 
    558     def astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):
--> 559         return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
    560 
    561     def _astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):

f:\mywork\conda\envs\mydata\lib\site-packages\pandas\core\internals\blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
    641                     # _astype_nansafe works fine with 1-d only
    642                     vals1d = values.ravel()
--> 643                     values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
    644 
    645                 # TODO(extension)

f:\mywork\conda\envs\mydata\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy, skipna)
    705         # work around NumPy brokenness, #1987
    706         if np.issubdtype(dtype.type, np.integer):
--> 707             return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    708 
    709         # if we have a datetime/timedelta array of objects

pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()

ValueError: invalid literal for int() with base 10: '2008\u200e'

In [25]:

df1[df1.年代 == '2008\u200e']
 

Out[25]:

	名字	投票人数	类型	产地	上映时间	时长	年代	评分	首映地点
15205	狂蟒惊魂	544	恐怖	中国大陆	2008-04-08 00:00:00	93	2008‎	2.7	美国

In [27]:

df1.loc[15205, "年代"] = 2008

In [28]:

df1["年代"] = df1.年代.astype(np.int32)

In [30]:

df1.年代.dtype

Out[30]:

dtype('int32')
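The failing value above is '2008' followed by U+200E, an invisible left-to-right mark. The notebook fixes that single cell by hand; a sketch of a more general approach, assuming the only problems are this character and stray whitespace, is to strip them from the whole column before converting. The sample strings are invented.

```python
import pandas as pd

# Hypothetical year strings: clean, with a trailing U+200E, with spaces.
years = pd.Series(["1994", "2008\u200e", " 1957 "])

cleaned = (
    years.str.replace("\u200e", "", regex=False)  # remove the invisible mark
         .str.strip()                             # remove surrounding whitespace
)
as_int = pd.to_numeric(cleaned).astype("int32")

print(as_int.tolist())
```

Removing only the known character is deliberately narrow: values that are wrong for other reasons still raise an error instead of being silently altered.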

In [31]:

df1.describe()

Out[31]:

	投票人数	年代	评分
count	38171.000000	38171.000000	38171.000000
mean	6264.254801	1998.802651	6.922273
std	26290.206068	255.052234	1.263766
min	21.000000	1888.000000	2.000000
25%	101.000000	1990.000000	6.300000
50%	354.000000	2005.000000	7.100000
75%	1798.000000	2010.000000	7.800000
max	692795.000000	39180.000000	9.900000

3. Check the format of the 时长 (duration) column

In [32]:

df1.时长.dtype

Out[32]:

dtype('O')

In [33]:

df1.时长.astype(np.int32)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-33-f91a79d0a481> in <module>
----> 1 df1.时长.astype(np.int32)

f:\mywork\conda\envs\mydata\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
   5880             # else, only a single dtype is given
   5881             new_data = self._data.astype(
-> 5882                 dtype=dtype, copy=copy, errors=errors, **kwargs
   5883             )
   5884             return self._constructor(new_data).__finalize__(self)

f:\mywork\conda\envs\mydata\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, **kwargs)
    579 
    580     def astype(self, dtype, **kwargs):
--> 581         return self.apply("astype", dtype=dtype, **kwargs)
    582 
    583     def convert(self, **kwargs):

f:\mywork\conda\envs\mydata\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    436                     kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
    437 
--> 438             applied = getattr(b, f)(**kwargs)
    439             result_blocks = _extend_blocks(applied, result_blocks)
    440 

f:\mywork\conda\envs\mydata\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
    557 
    558     def astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):
--> 559         return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
    560 
    561     def _astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):

f:\mywork\conda\envs\mydata\lib\site-packages\pandas\core\internals\blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
    641                     # _astype_nansafe works fine with 1-d only
    642                     vals1d = values.ravel()
--> 643                     values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
    644 
    645                 # TODO(extension)

f:\mywork\conda\envs\mydata\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy, skipna)
    705         # work around NumPy brokenness, #1987
    706         if np.issubdtype(dtype.type, np.integer):
--> 707             return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    708 
    709         # if we have a datetime/timedelta array of objects

pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()

ValueError: invalid literal for int() with base 10: '8U'

In [34]:

df1[df1.时长 == '8U']

Out[34]:

	名字	投票人数	类型	产地	上映时间	时长	年代	评分	首映地点
31644	一个被隔绝的世界	46	纪录片/短片	瑞典	2001-10-25 00:00:00	8U	1948	7.8	美国

In [35]:

df1.drop([31644], inplace=True)

In [36]:

df1.时长.astype(np.int32)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-f91a79d0a481> in <module>
----> 1 df1.时长.astype(np.int32)

f:\mywork\conda\envs\mydata\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
   5880             # else, only a single dtype is given
   5881             new_data = self._data.astype(
-> 5882                 dtype=dtype, copy=copy, errors=errors, **kwargs
   5883             )
   5884             return self._constructor(new_data).__finalize__(self)

f:\mywork\conda\envs\mydata\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, **kwargs)
    579 
    580     def astype(self, dtype, **kwargs):
--> 581         return self.apply("astype", dtype=dtype, **kwargs)
    582 
    583     def convert(self, **kwargs):

f:\mywork\conda\envs\mydata\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    436                     kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
    437 
--> 438             applied = getattr(b, f)(**kwargs)
    439             result_blocks = _extend_blocks(applied, result_blocks)
    440 

f:\mywork\conda\envs\mydata\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
    557 
    558     def astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):
--> 559         return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
    560 
    561     def _astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):

f:\mywork\conda\envs\mydata\lib\site-packages\pandas\core\internals\blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
    641                     # _astype_nansafe works fine with 1-d only
    642                     vals1d = values.ravel()
--> 643                     values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
    644 
    645                 # TODO(extension)

f:\mywork\conda\envs\mydata\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy, skipna)
    705         # work around NumPy brokenness, #1987
    706         if np.issubdtype(dtype.type, np.integer):
--> 707             return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    708 
    709         # if we have a datetime/timedelta array of objects

pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()

ValueError: invalid literal for int() with base 10: '12J'

In [37]:

df1[df1.时长=='12J']
 

Out[37]:

	名字	投票人数	类型	产地	上映时间	时长	年代	评分	首映地点
32949	渔业危机	41	纪录片	英国	2009-06-19 00:00:00	12J	2008	8.2	USA

In [38]:

df1.drop([32949], inplace=True)

In [39]:

df1["时长"] = df1.时长.astype(np.int32)

In [40]:

df1.head()

Out[40]:

	名字	投票人数	类型	产地	上映时间	时长	年代	评分	首映地点
0	肖申克的救赎	692795	剧情/犯罪	美国	1994-09-10 00:00:00	142	1994	9.6	多伦多电影节
1	控方证人	42995	剧情/悬疑/犯罪	美国	1957-12-17 00:00:00	116	1957	9.5	美国
2	美丽人生	327855	剧情/喜剧/爱情	意大利	1997-12-20 00:00:00	116	1997	9.5	意大利
3	阿甘正传	580897	剧情/爱情	美国	1994-06-23 00:00:00	142	1994	9.4	洛杉矶首映
4	霸王别姬	478523	剧情/爱情/同性	中国大陆	1993-01-01 00:00:00	171	1993	9.4	香港
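Calling `astype` repeatedly reveals only one bad value per attempt. A possible shortcut, sketched here on a toy column that reuses the two bad strings seen above, is `pd.to_numeric` with `errors="coerce"`, which turns every unparseable entry into NaN so that all of them can be listed at once. The index labels are placeholders.

```python
import pandas as pd

# Hypothetical duration column containing the two bad strings from above.
toy = pd.DataFrame({"时长": ["142", "8U", "116", "12J"]}, index=[0, 1, 2, 3])

# Unparseable strings become NaN instead of raising an error.
parsed = pd.to_numeric(toy["时长"], errors="coerce")

# Rows that had a value but failed to parse.
bad_rows = toy[parsed.isna() & toy["时长"].notna()]
print(bad_rows)

# Drop them all in one call, then convert the remaining values.
cleaned = toy.drop(bad_rows.index)
cleaned["时长"] = pd.to_numeric(cleaned["时长"]).astype("int32")
print(cleaned)
```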

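1.3 Finding outliers with basic statistics

Descriptive statistics reveal some outliers, for example an implausible min or max; many others have to be uncovered step by step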

In [41]:

df1.describe()

Out[41]:

	投票人数	时长	年代	评分
count	38169.000000	38169.000000	38169.000000	38169.000000
mean	6264.580759	89.471037	1998.803741	6.922217
std	26290.856296	83.762406	255.058780	1.263775
min	21.000000	1.000000	1888.000000	2.000000
25%	101.000000	60.000000	1990.000000	6.300000
50%	354.000000	93.000000	2005.000000	7.100000
75%	1798.000000	106.000000	2010.000000	7.800000
max	692795.000000	11500.000000	39180.000000	9.900000

In [42]:

df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38169 entries, 0 to 38737
Data columns (total 9 columns):
名字      38169 non-null object
投票人数    38169 non-null int32
类型      38169 non-null object
产地      38169 non-null object
上映时间    38167 non-null object
时长      38169 non-null int32
年代      38169 non-null int32
评分      38169 non-null float64
首映地点    38168 non-null object
dtypes: float64(1), int32(3), object(5)
memory usage: 2.5+ MB

In [43]:

df1[df1.年代 > 2019]

Out[43]:

	名字	投票人数	类型	产地	上映时间	时长	年代	评分	首映地点
13882	武之舞	128	纪录片	中国大陆	1997-02-01 00:00:00	60	34943	9.9	美国
17115	妈妈回来吧-中国打工村的孩子	49	纪录片	日本	2007-04-08 00:00:00	109	39180	8.9	美国

In [45]:

df1[df1.时长 > 1000]

Out[45]:

	名字	投票人数	类型	产地	上映时间	时长	年代	评分	首映地点
19690	怒海余生	54	剧情/家庭/冒险	美国	1937-09-01 00:00:00	11500	1937	7.9	美国
38730	喧闹村的孩子们	36	家庭	瑞典	1986-12-06 00:00:00	9200	1986	8.7	瑞典

In [46]:

df1.drop(df1[df1.年代>2019].index, inplace=True)
df1.drop(df1[df1.时长>1000].index, inplace=True)

In [47]:

df1.describe()

Out[47]:

	投票人数	时长	年代	评分
count	38165.000000	38165.000000	38165.000000	38165.000000
mean	6265.230342	88.933604	1996.968269	6.922015
std	26292.157467	37.816503	19.907381	1.263666
min	21.000000	1.000000	1888.000000	2.000000
25%	101.000000	60.000000	1990.000000	6.300000
50%	354.000000	93.000000	2005.000000	7.100000
75%	1798.000000	106.000000	2010.000000	7.800000
max	692795.000000	958.000000	2016.000000	9.900000
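The two filters in this section amount to range checks on 年代 and 时长. One way to express them together is `Series.between`, sketched below on invented rows. The upper bounds 2019 and 1000 are the thresholds used above; the lower bounds 1888 and 1 are assumptions taken from the minimum values in the `describe()` output.

```python
import pandas as pd

# Hypothetical rows: two plausible ones and three with out-of-range values.
toy = pd.DataFrame({
    "年代": [1994, 34943, 1957, 39180, 1993],
    "时长": [142, 60, 11500, 109, 171],
})

# between() is inclusive on both ends by default.
year_ok = toy["年代"].between(1888, 2019)
length_ok = toy["时长"].between(1, 1000)

cleaned = toy[year_ok & length_ok]
print(cleaned)
```

Keeping the bounds in one place makes it easier to rerun `describe()` afterwards and adjust them if new outliers show up.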

1.4 Reassigning the index after cleaning

In [48]:

df1[230:235]

Out[48]:

	名字	投票人数	类型	产地	上映时间	时长	年代	评分	首映地点
230	世界奇妙物语 2009春之特别篇世にも奇妙な物語豪華キャストで	5802	悬疑/恐怖	日本	2009-03-30 00:00:00	60	2009	7.8	日本
232	睡美人	5785	喜剧/动画/短片	加拿大	2007-06-11 00:00:00	60	2007	8.0	美国
233	2010年青少年选择奖	313	家庭	美国	2010-08-09 00:00:00	120	2010	6.8	美国
234	准时	5773	剧情/短片	德国	2008-02-14 00:00:00	60	2008	7.8	柏林电影节
235	超级杯奶爸	5742	喜剧/家庭/运动	美国	2007-09-28 00:00:00	60	2007	7.2	美国

In [49]:

df1.index = range(len(df1))
 

In [50]:

df1[230:235]

Out[50]:

	名字	投票人数	类型	产地	上映时间	时长	年代	评分	首映地点
230	世界奇妙物语 2009春之特别篇世にも奇妙な物語豪華キャストで	5802	悬疑/恐怖	日本	2009-03-30 00:00:00	60	2009	7.8	日本
231	睡美人	5785	喜剧/动画/短片	加拿大	2007-06-11 00:00:00	60	2007	8.0	美国
232	2010年青少年选择奖	313	家庭	美国	2010-08-09 00:00:00	120	2010	6.8	美国
233	准时	5773	剧情/短片	德国	2008-02-14 00:00:00	60	2008	7.8	柏林电影节
234	超级杯奶爸	5742	喜剧/家庭/运动	美国	2007-09-28 00:00:00	60	2007	7.2	美国
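Assigning `range(len(df1))` closes the gaps left by the dropped rows. As a sketch on a toy frame, `reset_index(drop=True)` should produce the same consecutive index without mutating the original object; `drop=True` discards the old labels instead of keeping them as a column.

```python
import pandas as pd

# Hypothetical frame with one row removed, leaving a gap at label 1.
toy = pd.DataFrame({"名字": ["a", "b", "c", "d"]}).drop([1])

# Approach used above: overwrite the index in place.
by_range = toy.copy()
by_range.index = range(len(by_range))

# Alternative: build a new frame with a fresh 0..n-1 index.
by_reset = toy.reset_index(drop=True)

print(by_reset)
```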

1.5 Modifying data values

Get the unique values of 产地 (place of origin)

In [51]:

df1.产地.unique()

Out[51]:

array(['美国', '意大利', '中国大陆', '日本', '法国', '英国', '韩国', '中国香港', '阿根廷', '德国',
       '印度', '其他', '加拿大', '波兰', '泰国', '澳大利亚', '西班牙', '俄罗斯', '中国台湾', '荷兰',
       '丹麦', '比利时', 'USA', '苏联', '巴西', '瑞典', '西德', '墨西哥'], dtype=object)

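The 产地 column contains several values that refer to the same country, for example USA and 美国 (United States), or 西德 (West Germany) and 德国 (Germany)
- We can use value replacement to merge the records of films from the same country.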

In [52]:

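# Assign the result back; inplace=True on a column selected from df1 may not update df1 in newer pandas
df1["产地"] = df1["产地"].replace("USA", "美国")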

In [53]:

# Replace several values at once: the two lists are matched position by position
df1["产地"] = df1["产地"].replace(["USA", "西德", "苏联"], ["美国", "德国", "俄罗斯"])

In [54]:

df1.产地.unique()

Out[54]:

array(['美国', '意大利', '中国大陆', '日本', '法国', '英国', '韩国', '中国香港', '阿根廷', '德国',
       '印度', '其他', '加拿大', '波兰', '泰国', '澳大利亚', '西班牙', '俄罗斯', '中国台湾', '荷兰',
       '丹麦', '比利时', '巴西', '瑞典', '墨西哥'], dtype=object)
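The two-list form of `replace` pairs values by position, which is easy to misalign as the lists grow. A dictionary keeps each old value next to its replacement; the sketch below applies the same three substitutions to a small invented Series.

```python
import pandas as pd

# Hypothetical sample of 产地 values.
origin = pd.Series(["美国", "USA", "西德", "苏联", "德国"])

# Old value -> new value, the same pairs used in the cell above.
mapping = {"USA": "美国", "西德": "德国", "苏联": "俄罗斯"}
merged = origin.replace(mapping)

print(merged.unique())
```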

1.6 Saving the final result

In [58]:

df1.to_excel('movie.xlsx')
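By default the row index is written as the first column of the output, which is why the notebook read the original file with `index_col=0`. The sketch below illustrates that round trip with `to_csv` and an in-memory buffer, chosen here only to avoid depending on an Excel engine; the index handling of `to_excel` and `read_excel` is analogous. The toy frame is invented.

```python
import io
import pandas as pd

# Hypothetical two-row frame.
toy = pd.DataFrame({"名字": ["a", "b"], "评分": [9.6, 9.5]})

buf = io.StringIO()
toy.to_csv(buf)            # the index is written as an unnamed first column

buf.seek(0)
naive = pd.read_csv(buf)   # the index comes back as an extra "Unnamed: 0" column

buf.seek(0)
restored = pd.read_csv(buf, index_col=0)  # first column becomes the index again

print(naive.columns.tolist())
print(restored.columns.tolist())
```

If the index carries no information, passing `index=False` when writing avoids the extra column altogether.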

水晶bingbing
