Python Data Analysis Examples (2), Day 3
Movie Data Analysis
The required files were downloaded on Day 2. The formats of the files used below are as follows:
```
users.dat file format
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
ratings.dat file format
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
movies.dat file format
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
```
Use pandas.read_table to load each table into its own pandas DataFrame:
```python
import pandas as pd
import os
path = 'E:\\Enthought\\book\\ch02\\movielens'
os.chdir(path)  # change the current working directory to path
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('users.dat', sep='::', header=None, names=unames)  # split each record on '::'
-c:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ratings.dat', sep='::', header=None, names=rnames, engine='python')  # passing engine='python' avoids the warning above
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('movies.dat', sep='::', header=None, names=mnames, engine='python')
```
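To try the parsing step without the data files, the following is a minimal sketch that reads a few rows in the users.dat layout from an in-memory string. The names `sample` and `users_sample` are made up for illustration, and the `dtype` argument assumes a reasonably recent pandas release in which the Python engine accepts it.

```python
import io
import pandas as pd

# Four sample rows in the users.dat layout (user_id::gender::age::occupation::zip).
sample = (
    "1::F::1::10::48067\n"
    "2::M::56::16::70072\n"
    "3::M::25::15::55117\n"
    "4::M::45::7::02460\n"
)

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A multi-character separator is treated as a regular expression,
# which only the Python parsing engine supports.
# dtype={'zip': str} keeps leading zeros in zip codes such as 02460.
users_sample = pd.read_csv(io.StringIO(sample), sep='::', header=None,
                           names=unames, engine='python',
                           dtype={'zip': str})
print(users_sample)
```

Passing `engine='python'` explicitly is what suppresses the ParserWarning; the separator itself is unchanged.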
Inspect each DataFrame:
```python
users[:5]
Out[11]:
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455
ratings[:5]
Out[12]:
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291
movies[:5]
Out[13]:
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
```
Note that age and occupation are given as codes; see the README for what each code means.
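As an illustration of decoding such codes, the sketch below maps age codes to readable labels. The mapping is the age-group table as I understand it from the MovieLens 1M README, so please check it against your own copy; `age_labels` and `users_sample` are names invented for this example.

```python
import pandas as pd

# Age-group codes, assumed to match the MovieLens 1M README (verify against your copy).
age_labels = {
    1: 'Under 18', 18: '18-24', 25: '25-34', 35: '35-44',
    45: '45-49', 50: '50-55', 56: '56+',
}

# A tiny stand-in for the users table.
users_sample = pd.DataFrame({'user_id': [1, 2, 3], 'age': [1, 56, 25]})

# Series.map looks each code up in the dictionary; unknown codes become NaN.
users_sample['age_group'] = users_sample['age'].map(age_labels)
print(users_sample)
```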
Next, let's analyze data that is spread across the three tables. Suppose we want to compute a movie's average rating by gender and age. This is much easier once all the data is merged into a single table. We first use the pandas merge function to **merge** ratings with users, and then merge movies in as well. pandas infers which columns to use as merge (or join) keys from the overlapping column names:
```python
data = pd.merge(pd.merge(ratings, users), movies)
data[:5]  # the next two outputs differ from the book, possibly because the merge strategy changed
Out[16]:
   user_id  movie_id  rating  timestamp gender  age  occupation    zip  \
0        1      1193       5  978300760      F    1          10  48067
1        2      1193       5  978298413      M   56          16  70072
2       12      1193       4  978220179      M   25          12  32793
3       15      1193       4  978199279      M   25           7  22903
4       17      1193       5  978158471      M   50           1  95350
                                    title genres
0  One Flew Over the Cuckoo's Nest (1975)  Drama
1  One Flew Over the Cuckoo's Nest (1975)  Drama
2  One Flew Over the Cuckoo's Nest (1975)  Drama
3  One Flew Over the Cuckoo's Nest (1975)  Drama
4  One Flew Over the Cuckoo's Nest (1975)  Drama
data.iloc[0]  # show the first record (.ix is deprecated; .iloc selects by position)
Out[17]:
user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object
```
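The key inference described above can be checked on toy data. The sketch below, using made-up frames `ratings_s`, `users_s`, and `movies_s`, compares the implicit merge with one that names the keys explicitly; spelling out `on=` and `how=` is a common way to guard against an unintended shared column becoming a join key.

```python
import pandas as pd

# Tiny stand-ins for the three tables (values are invented).
ratings_s = pd.DataFrame({'user_id': [1, 1, 2],
                          'movie_id': [10, 20, 10],
                          'rating': [5, 3, 4]})
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_s = pd.DataFrame({'movie_id': [10, 20],
                         'title': ['A (1990)', 'B (1991)']})

# Implicit keys: pandas joins on the column names the two frames share.
implicit = pd.merge(pd.merge(ratings_s, users_s), movies_s)

# Explicit keys: same result here, but the intent is spelled out.
explicit = (ratings_s
            .merge(users_s, on='user_id', how='inner')
            .merge(movies_s, on='movie_id', how='inner'))

print(implicit.equals(explicit))
```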
We can now **aggregate** the ratings by any number of user or movie attributes. To compute each movie's average rating by gender, use the pivot_table method:
```python
mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')  # unlike the book, the parameters are now index and columns (formerly rows and cols)
mean_ratings[:5]
Out[26]:
gender                                F         M
title
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024
```
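To make the aggregation concrete, here is a small sketch on invented data (`data_s`) that builds the same kind of table two ways: with pivot_table, and with groupby followed by unstack. The second form is only an equivalent way to think about what pivot_table does in this case, not something the original text uses.

```python
import pandas as pd

# Invented ratings: movie A has two F ratings and one M rating, and so on.
data_s = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 5, 3, 2, 4],
})

# One row per title, one column per gender, cells hold the mean rating.
pivoted = data_s.pivot_table(values='rating', index='title',
                             columns='gender', aggfunc='mean')

# The same table built by grouping on both keys and moving gender to the columns.
grouped = data_s.groupby(['title', 'gender'])['rating'].mean().unstack('gender')

print(pivoted)
```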
This produces a DataFrame of average ratings, with movie titles as row labels and gender as column labels. Now let's filter out movies that have fewer than 250 ratings. First group by title, then use size() to get a Series holding the size of each title's group:
```python
ratings_by_title = data.groupby('title').size()
ratings_by_title[:10]
Out[28]:
title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64
active_titles = ratings_by_title.index[ratings_by_title >= 250]
active_titles
Out[31]:
Index([u"'burbs, The (1989)", u'10 Things I Hate About You (1999)',
       u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
       u'12 Angry Men (1957)', u'13th Warrior, The (1999)',
       u'2 Days in the Valley (1996)', u'20,000 Leagues Under the Sea (1954)',
       u'2001: A Space Odyssey (1968)', u'2010 (1984)',
       ...
       u'X-Men (2000)', u'Year of Living Dangerously (1982)',
       u'Yellow Submarine (1968)', u"You've Got Mail (1998)",
       u'Young Frankenstein (1974)', u'Young Guns (1988)',
       u'Young Guns II (1990)', u'Young Sherlock Holmes (1985)',
       u'Zero Effect (1998)', u'eXistenZ (1999)'],
      dtype='object', name=u'title', length=1216)
```
This index holds the titles of movies with at least 250 ratings, and we can use it to **select** the rows we need from mean_ratings:
```python
mean_ratings = mean_ratings.loc[active_titles]  # .ix is deprecated; .loc selects by label
mean_ratings[:5]  # differs from the book here
Out[34]:
gender                                    F         M
title
'burbs, The (1989)                 2.793478  2.962085
10 Things I Hate About You (1999)  3.646552  3.311966
101 Dalmatians (1961)              3.791444  3.500000
101 Dalmatians (1996)              3.240000  2.911215
12 Angry Men (1957)                4.184397  4.328421
```
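The count, filter, and select steps can be sketched end to end on toy data. Everything here is invented for illustration: `data_s` stands in for the merged table and `min_ratings = 2` stands in for the threshold of 250.

```python
import pandas as pd

# Invented ratings: A has 3 ratings, B has 1, C has 2.
data_s = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'C', 'C'],
    'gender': ['F', 'M', 'M', 'F', 'F', 'M'],
    'rating': [4, 2, 4, 5, 3, 1],
})
min_ratings = 2  # hypothetical threshold, playing the role of 250

# Number of ratings per title.
counts = data_s.groupby('title').size()

# Boolean mask on the Series selects the index labels that pass the threshold.
active = counts.index[counts >= min_ratings]

means = data_s.pivot_table(values='rating', index='title',
                           columns='gender', aggfunc='mean')

# Label-based selection keeps only the rows for the active titles.
filtered = means.loc[active]
print(filtered)
```

Note that the comparison is `>=`, so a title with exactly the threshold number of ratings is kept.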
To see which movies female viewers rated highest, sort by the F column in descending order:
```python
top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)
-c:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
# A warning appears here: in pandas 0.18.1 the by argument of sort_index is deprecated; use sort_values instead (the two functions are compared below)
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
top_female_ratings[:10]
Out[38]:
gender                                                     F         M
title
Close Shave, A (1995)                               4.644444  4.473795
Wrong Trousers, The (1993)                          4.588235  4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)       4.572650  4.464589
Wallace & Gromit: The Best of Aardman Animation...  4.563107  4.385075
Schindler's List (1993)                             4.562602  4.491415
Shawshank Redemption, The (1994)                    4.539075  4.560625
Grand Day Out, A (1992)                             4.537879  4.293255
To Kill a Mockingbird (1962)                        4.536667  4.372611
Creature Comforts (1990)                            4.513889  4.272277
Usual Suspects, The (1995)                          4.513317  4.518248
```
Comparison of the two functions named in the warning (pandas 0.18.1):
> **pandas.DataFrame.sort_index()**
> **Parameters:**
> **_axis_** : index, columns to direct sorting
> **_level_** : int or level name or list of ints or list of level names
> if not None, sort on values in specified index level(s)
> **_ascending_** : boolean, default True
> Sort ascending vs. descending
> **_inplace_** : bool, if True, perform operation in-place
> **_kind_** : {quicksort, mergesort, heapsort}
> Choice of sorting algorithm. See also ndarray.np.sort for more information.
> mergesort is the only stable algorithm. For DataFrames, this option is only
> applied when sorting on a single column or label.
> **_na_position_** : {‘first’, ‘last’}
> first puts NaNs at the beginning, last puts NaNs at the end
> **_sort_remaining_** : bool
> if true and sorting by level and index is multilevel, sort by other levels
> too (in order) after sorting by specified level
> **Returns:**
> **_sorted_obj_** : DataFrame
>
> **pandas.DataFrame.sort_values()**
> **Parameters:**
> **_by_** : string name or list of names which refer to the axis items
> **_axis_** : index, columns to direct sorting
> **_ascending_** : bool or list of bool
> Sort ascending vs. descending. Specify list for multiple sort orders. If
> this is a list of bools, must match the length of the by.
> **_inplace_** : bool
> if True, perform operation in-place
> **_kind_** : {quicksort, mergesort, heapsort}
> Choice of sorting algorithm. See also ndarray.np.sort for more information.
> mergesort is the only stable algorithm. For DataFrames, this option is only
> applied when sorting on a single column or label.
> **_na_position_** : {‘first’, ‘last’}
> first puts NaNs at the beginning, last puts NaNs at the end
> **Returns:**
> **_sorted_obj_** : DataFrame
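
The practical difference between the two functions is what they sort on: sort_index orders rows by their labels, while sort_values orders them by the contents of one or more columns. A minimal sketch on an invented frame (`means` and its values are made up):

```python
import pandas as pd

# Invented mean ratings with deliberately unsorted row labels.
means = pd.DataFrame({'F': [3.5, 4.6, 2.8], 'M': [3.9, 4.4, 3.0]},
                     index=['b', 'c', 'a'])

# Sort rows by their index labels (alphabetical here).
by_label = means.sort_index()

# Sort rows by the values in column F, highest first.
by_value = means.sort_values(by='F', ascending=False)

print(by_label)
print(by_value)
```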