Python Data Analysis Examples (2): Day 3

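Note: this article is a study log on data processing with Python. Most of the material comes from the book 《利用Python进行数据分析》 (Python for Data Analysis) by Wes McKinney, published by 机械工业出版社 (China Machine Press).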

Movie data analysis

The required files were downloaded in Day 2. The formats of the files used below are as follows:

```text
users.dat format
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117

ratings.dat format
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968

movies.dat format
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
```

Load each table into its own pandas DataFrame object with pandas.read_table:

```python
import os
import pandas as pd

path = 'E:\\Enthought\\book\\ch02\\movielens'
os.chdir(path)  # change the current working directory to path

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('users.dat', sep='::', header=None, names=unames)  # split each record on '::'
# Without engine='python' the call above still works, but emits this warning:
#   -c:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ratings.dat', sep='::', header=None, names=rnames, engine='python')  # engine='python' suppresses the warning

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('movies.dat', sep='::', header=None, names=mnames, engine='python')
```
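To try the parsing step without the data files, here is a minimal sketch that reads a few `::`-separated records from an in-memory string. The records are taken from the users.dat excerpts shown in this article; the name `users_demo` and the choice to force `zip` to text with `dtype` are assumptions made for illustration, not something the book does.

```python
import io
import pandas as pd

# Records copied from the users.dat excerpts in this article (users 1, 2 and 4).
sample = "1::F::1::10::48067\n2::M::56::16::70072\n4::M::45::7::02460\n"

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users_demo = pd.read_csv(
    io.StringIO(sample),
    sep='::',            # separators longer than one character are parsed as regexes
    header=None,
    names=unames,
    engine='python',     # request the Python engine explicitly to avoid the ParserWarning
    dtype={'zip': str},  # assumption: keep zip codes as text so a leading zero is preserved
)
print(users_demo)
```

In this small sample every zip code is purely numeric, so without the `dtype` hint pandas would infer an integer column and `02460` would lose its leading zero.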

Inspect each DataFrame object:

```python
users[:5]
Out[11]:
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455

ratings[:5]
Out[12]:
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291

movies[:5]
Out[13]:
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
```

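Age and occupation are given as codes; see the README for what each code means.  
Next, we try to analyze data that is spread across the three tables. Suppose we want to compute the average rating of a movie by gender and age. This is much simpler once all the data has been merged into a single table. We first use the pandas merge function to **merge** ratings with users, and then merge movies in as well. pandas infers which columns to use as merge (or join) keys from the overlapping column names: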

```python
data = pd.merge(pd.merge(ratings, users), movies)
data[:5]  # possibly because the merge strategy changed, the next two outputs differ from the book
Out[16]:
   user_id  movie_id  rating  timestamp gender  age  occupation    zip  \
0        1      1193       5  978300760      F    1          10  48067
1        2      1193       5  978298413      M   56          16  70072
2       12      1193       4  978220179      M   25          12  32793
3       15      1193       4  978199279      M   25           7  22903
4       17      1193       5  978158471      M   50           1  95350

                                    title genres
0  One Flew Over the Cuckoo's Nest (1975)  Drama
1  One Flew Over the Cuckoo's Nest (1975)  Drama
2  One Flew Over the Cuckoo's Nest (1975)  Drama
3  One Flew Over the Cuckoo's Nest (1975)  Drama
4  One Flew Over the Cuckoo's Nest (1975)  Drama

data.iloc[0]  # show the first record (the original session used data.ix[0]; .ix was removed in pandas 1.0)
Out[17]:
user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object
```
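Relying on pandas to infer the join keys works here because each pair of tables shares exactly one column name. The sketch below, on made-up miniature tables (all names and values are hypothetical), shows that naming the keys explicitly with `on=` gives the same result and makes the intent easier to read.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
ratings_demo = pd.DataFrame({'user_id': [1, 1, 2],
                             'movie_id': [10, 20, 10],
                             'rating': [5, 3, 4]})
users_demo = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_demo = pd.DataFrame({'movie_id': [10, 20],
                            'title': ['Movie A (1990)', 'Movie B (2000)']})

# Keys inferred from overlapping column names, as in the article.
implicit = pd.merge(pd.merge(ratings_demo, users_demo), movies_demo)

# The same joins with the keys and join type spelled out.
explicit = (ratings_demo
            .merge(users_demo, on='user_id', how='inner')
            .merge(movies_demo, on='movie_id', how='inner'))

print(explicit)
```

Spelling out `on=` also guards against an accidental join on an unrelated column that happens to share a name in both tables.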

We can now **aggregate** the ratings by any number of user or movie attributes. To compute the mean rating of each movie by gender, use the pivot_table method:

```python
# the parameter names differ from the book: rows -> index, cols -> columns
mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
mean_ratings[:5]
Out[26]:
gender                                F         M
title
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024
```
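As a rough illustration of what pivot_table does here, the sketch below builds the same kind of table on a tiny made-up data set (hypothetical titles and ratings) and compares it with an equivalent groupby followed by unstack.

```python
import pandas as pd

# Hypothetical ratings: two movies, rated by female (F) and male (M) viewers.
demo = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 2, 5, 3, 1],
})

# One row per title, one column per gender, each cell holding the mean rating.
pivoted = demo.pivot_table('rating', index='title', columns='gender', aggfunc='mean')

# The same numbers via groupby: mean per (title, gender), then gender moved to columns.
grouped = demo.groupby(['title', 'gender'])['rating'].mean().unstack('gender')

print(pivoted)
```

Movie A has female ratings 4 and 2, so its F cell is their mean, 3.0.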

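This produces a DataFrame of mean movie ratings, with movie titles as row labels and gender as column labels. Next, filter out movies with fewer than 250 ratings. First group by title, then use size() to get a Series holding the size of each movie's group: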

```python
ratings_by_title = data.groupby('title').size()
ratings_by_title[:10]
Out[28]:
title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64

active_titles = ratings_by_title.index[ratings_by_title >= 250]
active_titles
Out[31]:
Index([u"'burbs, The (1989)", u'10 Things I Hate About You (1999)',
       u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
       u'12 Angry Men (1957)', u'13th Warrior, The (1999)',
       u'2 Days in the Valley (1996)', u'20,000 Leagues Under the Sea (1954)',
       u'2001: A Space Odyssey (1968)', u'2010 (1984)',
       ...
       u'X-Men (2000)', u'Year of Living Dangerously (1982)',
       u'Yellow Submarine (1968)', u"You've Got Mail (1998)",
       u'Young Frankenstein (1974)', u'Young Guns (1988)',
       u'Young Guns II (1990)', u'Young Sherlock Holmes (1985)',
       u'Zero Effect (1998)', u'eXistenZ (1999)'],
      dtype='object', name=u'title', length=1216)
```
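The count-then-filter step can be sketched on a small made-up table. The threshold name `min_ratings` and the data are hypothetical; the article uses 250 on the full data set.

```python
import pandas as pd

# Hypothetical ratings: A has 3 ratings, B has 1, C has 2.
demo = pd.DataFrame({'title':  ['A', 'A', 'A', 'B', 'C', 'C'],
                     'rating': [5, 4, 3, 2, 4, 5]})

# Number of rows (ratings) per title.
counts = demo.groupby('title').size()

# Boolean mask on the counts, applied to the index to keep only the qualifying titles.
min_ratings = 2
active = counts.index[counts >= min_ratings]

print(counts)
print(list(active))
```

Because the comparison is `>=`, a title with exactly `min_ratings` ratings is kept.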

This index holds the titles of movies with at least 250 ratings. We can use it to **select** the required rows from the mean_ratings table computed earlier:

```python
mean_ratings = mean_ratings.loc[active_titles]  # the original session used .ix, which was removed in pandas 1.0
mean_ratings[:5]  # differs from the book
Out[34]:
gender                                    F         M
title
'burbs, The (1989)                 2.793478  2.962085
10 Things I Hate About You (1999)  3.646552  3.311966
101 Dalmatians (1961)              3.791444  3.500000
101 Dalmatians (1996)              3.240000  2.911215
12 Angry Men (1957)                4.184397  4.328421
```
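Label-based row selection with an Index can be sketched as follows. The table and the list of active titles are made up for illustration.

```python
import pandas as pd

# Hypothetical mean ratings for three titles.
mean_demo = pd.DataFrame({'F': [3.0, 4.0, 2.5], 'M': [3.5, 3.0, 2.0]},
                         index=pd.Index(['A', 'B', 'C'], name='title'))

# Hypothetical set of titles that passed the rating-count filter.
active = pd.Index(['A', 'C'], name='title')

# .loc selects rows by label and returns them in the order given by `active`.
selected = mean_demo.loc[active]

print(selected)
```

Every label passed to `.loc` must exist in the table's index, which holds here because the active titles were derived from the same data.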

To see which movies female viewers liked most, sort by the F column in descending order:

```python
# The book's call, mean_ratings.sort_index(by='F', ascending=False), triggers a warning:
#   -c:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
# The pandas 0.18.1 documentation for sort_index lists no by parameter (see below), so use sort_values:
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)

top_female_ratings[:10]
Out[38]:
gender                                                     F         M
title
Close Shave, A (1995)                               4.644444  4.473795
Wrong Trousers, The (1993)                          4.588235  4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)       4.572650  4.464589
Wallace & Gromit: The Best of Aardman Animation...  4.563107  4.385075
Schindler's List (1993)                             4.562602  4.491415
Shawshank Redemption, The (1994)                    4.539075  4.560625
Grand Day Out, A (1992)                             4.537879  4.293255
To Kill a Mockingbird (1962)                        4.536667  4.372611
Creature Comforts (1990)                            4.513889  4.272277
Usual Suspects, The (1995)                          4.513317  4.518248
```

Comparison of the two functions named in the warning, as documented for pandas 0.18.1:

> **pandas.DataFrame.sort_index()**  
> **Parameters:**  
> **_axis_** : index, columns to direct sorting  
> **_level_** : int or level name or list of ints or list of level names  
> if not None, sort on values in specified index level(s)  
> **_ascending_** : boolean, default True  
> Sort ascending vs. descending  
> **_inplace_** : bool, if True, perform operation in-place  
> **_kind_** : {quicksort, mergesort, heapsort}  
> Choice of sorting algorithm. See also ndarray.np.sort for more information.
> mergesort is the only stable algorithm. For DataFrames, this option is only
> applied when sorting on a single column or label.  
> **_na_position_** : {'first', 'last'}  
> first puts NaNs at the beginning, last puts NaNs at the end  
> **_sort_remaining_** : bool  
> if true and sorting by level and index is multilevel, sort by other levels
> too (in order) after sorting by specified level  
> **Returns:**  
> **_sorted_obj_** : DataFrame
>
> **pandas.DataFrame.sort_values()**  
> **Parameters:**  
> **_by_** : string name or list of names which refer to the axis items  
> **_axis_** : index, columns to direct sorting  
> **_ascending_** : bool or list of bool  
> Sort ascending vs. descending. Specify list for multiple sort orders. If
> this is a list of bools, must match the length of the by.  
> **_inplace_** : bool  
> if True, perform operation in-place  
> **_kind_** : {quicksort, mergesort, heapsort}  
> Choice of sorting algorithm. See also ndarray.np.sort for more information.
> mergesort is the only stable algorithm. For DataFrames, this option is only
> applied when sorting on a single column or label.  
> **_na_position_** : {'first', 'last'}  
> first puts NaNs at the beginning, last puts NaNs at the end  
> **Returns:**  
> **_sorted_obj_** : DataFrame
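The practical difference between the two functions is what they sort by: sort_index orders rows by their labels, while sort_values orders them by the contents of one or more columns. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical mean ratings, deliberately stored in a scrambled label order.
demo = pd.DataFrame({'F': [3.0, 4.5, 2.5], 'M': [3.5, 3.0, 2.0]},
                    index=pd.Index(['B', 'C', 'A'], name='title'))

# Sort by the row labels (the titles), ascending by default.
by_label = demo.sort_index()

# Sort by the values in column F, highest first.
by_value = demo.sort_values(by='F', ascending=False)

print(by_label)
print(by_value)
```

Here `by_label` is ordered A, B, C, while `by_value` is ordered C, B, A because C has the highest F rating.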

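Measuring rating disagreement  
Suppose we want to find the movies on which male and female viewers disagree most. One way is to add a column diff to mean_ratings that holds the difference between the two mean ratings, and then sort by it. Because diff is computed as M minus F, sorting in ascending order puts first the movies with the largest gap that female viewers preferred: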

```python
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
sort_by_diff = mean_ratings.sort_values(by='diff')
sort_by_diff[:5]
Out[41]:
gender                            F         M      diff
title
Dirty Dancing (1987)       3.790378  2.959596 -0.830782
Jumpin' Jack Flash (1986)  3.254717  2.578358 -0.676359
Grease (1978)              3.975265  3.367041 -0.608224
Little Women (1994)        3.870588  3.321739 -0.548849
Steel Magnolias (1989)     3.901734  3.365957 -0.535777
```

Reversing the sorted result and taking the first 5 rows gives the movies that male viewers preferred:

```python
sort_by_diff[::-1][:5]
Out[43]:
gender                                         F         M      diff
title
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
Cable Guy, The (1996)                   2.250000  2.863787  0.613787
```
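The sign convention of diff is easy to mix up, so here is a small sketch on made-up numbers (titles X, Y, Z are hypothetical). With diff defined as M minus F, the most negative value marks the movie female viewers preferred most, and the most positive value marks the one male viewers preferred most.

```python
import pandas as pd

# Hypothetical mean ratings by gender.
demo = pd.DataFrame({'F': [3.8, 3.5, 2.7], 'M': [3.0, 4.2, 3.3]},
                    index=pd.Index(['X', 'Y', 'Z'], name='title'))

# diff > 0: rated higher by men; diff < 0: rated higher by women.
demo['diff'] = demo['M'] - demo['F']

sorted_by_diff = demo.sort_values(by='diff')   # ascending: most negative first

female_pref = sorted_by_diff.index[0]          # top of the ascending sort
male_pref = sorted_by_diff[::-1].index[0]      # top after reversing the row order

print(sorted_by_diff)
print(female_pref, male_pref)
```

Slicing with `[::-1]` only reverses the existing row order; it does not sort again, which is why it must be applied to the already sorted frame.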

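If we only want the movies with the most disagreement (regardless of gender), we can compute the variance or standard deviation of the ratings: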

```python
# standard deviation of the ratings, grouped by title
rating_std_by_title = data.groupby('title')['rating'].std()
# keep only the movies with at least 250 ratings (the original session used .ix, which was removed in pandas 1.0)
rating_std_by_title = rating_std_by_title.loc[active_titles]

# The book's call, rating_std_by_title.order(ascending=False)[:5], still returned a result but warned:
#   -c:1: FutureWarning: order is deprecated, use sort_values(...)
rating_std_by_title.sort_values(ascending=False)[:5]
Out[50]:
title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Name: rating, dtype: float64
```
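One detail worth knowing when reading these numbers: the pandas std method computes the sample standard deviation (ddof=1) by default. The sketch below, on made-up ratings, shows the per-title computation and how the default differs from the population version.

```python
import math
import pandas as pd

# Hypothetical ratings: viewers split sharply on A and agree completely on B.
demo = pd.DataFrame({'title':  ['A', 'A', 'B', 'B', 'B'],
                     'rating': [1, 5, 3, 3, 3]})

# Sample standard deviation per title (pandas default, ddof=1).
std_by_title = demo.groupby('title')['rating'].std()

# Population standard deviation per title, for comparison (ddof=0).
pop_std_by_title = demo.groupby('title')['rating'].std(ddof=0)

# Most divisive titles first.
print(std_by_title.sort_values(ascending=False))
```

For title A the squared deviations from the mean of 3 sum to 8, so the sample value is the square root of 8 divided by 1, while the population value is the square root of 8 divided by 2.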



 

posted on 2021-07-07 15:40 by BabyGo000