Python Data Analysis Example (2), Day 3

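Note: this article is a study log on data processing with Python. Most of the content comes from the book Python for Data Analysis (《利用Python进行数据分析》) by Wes McKinney, published by China Machine Press.
Movie Data Analysis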

The required files were downloaded on Day 2. The formats of the files used below are as follows:

```code
    users.dat file format
    1::F::1::10::48067
    2::M::56::16::70072
    3::M::25::15::55117
    
    ratings.dat file format
    1::1193::5::978300760
    1::661::3::978302109
    1::914::3::978301968
    
    movies.dat file format
    1::Toy Story (1995)::Animation|Children's|Comedy
    2::Jumanji (1995)::Adventure|Children's|Fantasy
    3::Grumpier Old Men (1995)::Comedy|Romance
```

Use pandas.read_table to read each table into its own pandas DataFrame object:

```code
    import pandas as pd
    import os
    path='E:\\Enthought\\book\\ch02\\movielens'
    os.chdir(path) # change the current working directory to path
    
    unames = ['user_id','gender','age','occupation','zip']
    users = pd.read_table('users.dat',sep='::',header=None,names=unames) # split each record on '::'
    -c:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
    
    rnames = ['user_id','movie_id','rating','timestamp']
    ratings = pd.read_table('ratings.dat',sep='::',header=None,names=rnames,engine='python') # passing engine='python' avoids the warning above
    
    mnames = ['movie_id','title','genres']
    movies = pd.read_table('movies.dat',sep='::',header=None,names=mnames,engine='python')
```
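The parsing step can be tried without the data files. The following is a minimal sketch, assuming a recent pandas: it reads three of the user records shown in this article from an in-memory string, and uses `pd.read_csv` in place of `read_table` (the separator is given explicitly either way). The name `users_sample` and the `dtype` setting are illustrative additions, not part of the book's code.

```python
import io
import pandas as pd

# Three user records in the same '::'-separated layout as users.dat.
sample = io.StringIO(
    "1::F::1::10::48067\n"
    "2::M::56::16::70072\n"
    "4::M::45::7::02460\n"
)

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users_sample = pd.read_csv(
    sample,
    sep='::',            # multi-character separator, handled as a regex
    header=None,         # the file has no header row
    names=unames,
    engine='python',     # the C engine does not support regex separators
    dtype={'zip': str},  # assumption: keep zip as text so '02460' keeps its leading zero
)
print(users_sample)
```

Passing `engine='python'` up front is what avoids the ParserWarning seen in the session above.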

Inspect each DataFrame object:

```code
    users[:5]
    Out[11]: 
       user_id gender  age  occupation    zip
    0        1      F    1          10  48067
    1        2      M   56          16  70072
    2        3      M   25          15  55117
    3        4      M   45           7  02460
    4        5      M   25          20  55455
    
    ratings[:5]
    Out[12]: 
       user_id  movie_id  rating  timestamp
    0        1      1193       5  978300760
    1        1       661       3  978302109
    2        1       914       3  978301968
    3        1      3408       4  978300275
    4        1      2355       5  978824291
    
    movies[:5]
    Out[13]: 
       movie_id                               title                        genres
    0         1                    Toy Story (1995)   Animation|Children's|Comedy
    1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
    2         3             Grumpier Old Men (1995)                Comedy|Romance
    3         4            Waiting to Exhale (1995)                  Comedy|Drama
    4         5  Father of the Bride Part II (1995)                        Comedy
```

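Age and occupation are given as codes; see the README for their meanings.  
Next, we try to analyze the data spread across the three tables. Suppose we want to compute the average rating of a movie by gender and age. This is much easier once all the data has been merged into a single table. We first use the pandas merge function to **merge** ratings with users, and then merge movies into the result as well. pandas infers which columns to use as the merge (or join) keys from the overlapping column names: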

```code
    data = pd.merge(pd.merge(ratings,users),movies)
    data[:5] # possibly because the merge strategy has changed, the next two outputs differ from the book
    Out[16]: 
       user_id  movie_id  rating  timestamp gender  age  occupation    zip  \
    0        1      1193       5  978300760      F    1          10  48067   
    1        2      1193       5  978298413      M   56          16  70072   
    2       12      1193       4  978220179      M   25          12  32793   
    3       15      1193       4  978199279      M   25           7  22903   
    4       17      1193       5  978158471      M   50           1  95350   
    
                                        title genres  
    0  One Flew Over the Cuckoo's Nest (1975)  Drama  
    1  One Flew Over the Cuckoo's Nest (1975)  Drama  
    2  One Flew Over the Cuckoo's Nest (1975)  Drama  
    3  One Flew Over the Cuckoo's Nest (1975)  Drama  
    4  One Flew Over the Cuckoo's Nest (1975)  Drama  
    
    data.loc[0] # output the first record
    Out[17]: 
    user_id                                            1
    movie_id                                        1193
    rating                                             5
    timestamp                                  978300760
    gender                                             F
    age                                                1
    occupation                                        10
    zip                                            48067
    title         One Flew Over the Cuckoo's Nest (1975)
    genres                                         Drama
    Name: 0, dtype: object
```
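To see how the key inference works, here is a small sketch on made-up frames (the `*_s` names, the titles and the values are hypothetical). It compares the implicit merge used above with one that names the keys through `on=`, which may be easier to read when tables share several column names.

```python
import pandas as pd

# Tiny stand-ins for the three MovieLens tables (values are made up).
ratings_s = pd.DataFrame({'user_id': [1, 1, 2],
                          'movie_id': [10, 20, 10],
                          'rating': [5, 3, 4]})
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_s = pd.DataFrame({'movie_id': [10, 20], 'title': ['Movie A', 'Movie B']})

# Implicit keys: pandas joins on the column names the two frames share.
implicit = pd.merge(pd.merge(ratings_s, users_s), movies_s)

# Explicit keys: the same joins with the key columns spelled out.
explicit = ratings_s.merge(users_s, on='user_id').merge(movies_s, on='movie_id')

print(explicit)
```

Both calls are inner joins by default, so a rating whose user or movie is missing from the other tables would be dropped.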

We can now **aggregate** the rating data by any number of user or movie attributes. To compute the average rating of each movie by gender, use the pivot_table method:

```code
    mean_ratings = data.pivot_table('rating',index='title',columns='gender',aggfunc='mean') # the parameters were renamed (rows -> index, cols -> columns), so this differs from the book
    mean_ratings[:5]
    Out[26]: 
    gender                                F         M
    title                                            
    $1,000,000 Duck (1971)         3.375000  2.761905
    'Night Mother (1986)           3.388889  3.352941
    'Til There Was You (1997)      2.675676  2.733333
    'burbs, The (1989)             2.793478  2.962085
    ...And Justice for All (1979)  3.828571  3.689024
```
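As a rough illustration of what pivot_table computes here, the sketch below builds a five-row table with made-up titles and ratings and produces the same title-by-gender means in two ways. The `groupby(...).unstack()` route is shown only for comparison; it is not taken from the book.

```python
import pandas as pd

# Made-up ratings: two movies rated by female (F) and male (M) users.
toy = pd.DataFrame({
    'title':  ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 5, 3, 2, 4],
})

# One row per title, one column per gender, each cell the mean rating.
by_pivot = toy.pivot_table('rating', index='title', columns='gender', aggfunc='mean')

# The same table built by grouping on both keys and unstacking gender.
by_groupby = toy.groupby(['title', 'gender'])['rating'].mean().unstack()

print(by_pivot)
```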

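This produces a DataFrame of average movie ratings, with movie titles as the row labels and gender as the column labels. Next, we filter out the movies that received fewer than 250 ratings. First group the data by title, then use size() to get a Series holding the size of each title group: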

```code
    ratings_by_title = data.groupby('title').size()
    ratings_by_title[:10]
    Out[28]: 
    title
    $1,000,000 Duck (1971)                37
    'Night Mother (1986)                  70
    'Til There Was You (1997)             52
    'burbs, The (1989)                   303
    ...And Justice for All (1979)        199
    1-900 (1994)                           2
    10 Things I Hate About You (1999)    700
    101 Dalmatians (1961)                565
    101 Dalmatians (1996)                364
    12 Angry Men (1957)                  616
    dtype: int64
    
    active_titles = ratings_by_title.index[ratings_by_title>=250]
    active_titles
    Out[31]: 
    Index([u"'burbs, The (1989)", u'10 Things I Hate About You (1999)',
           u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
           u'12 Angry Men (1957)', u'13th Warrior, The (1999)',
           u'2 Days in the Valley (1996)', u'20,000 Leagues Under the Sea (1954)',
           u'2001: A Space Odyssey (1968)', u'2010 (1984)',
           ...
           u'X-Men (2000)', u'Year of Living Dangerously (1982)',
           u'Yellow Submarine (1968)', u"You've Got Mail (1998)",
           u'Young Frankenstein (1974)', u'Young Guns (1988)',
           u'Young Guns II (1990)', u'Young Sherlock Holmes (1985)',
           u'Zero Effect (1998)', u'eXistenZ (1999)'],
          dtype='object', name=u'title', length=1216)
```

This index holds the titles of the movies with at least 250 ratings. It can then be used to **select** the required rows from mean_ratings:

```code
    mean_ratings = mean_ratings.loc[active_titles]
    mean_ratings[:5] # this differs from the book
    Out[34]: 
    gender                                    F         M
    title                                                
    'burbs, The (1989)                 2.793478  2.962085
    10 Things I Hate About You (1999)  3.646552  3.311966
    101 Dalmatians (1961)              3.791444  3.500000
    101 Dalmatians (1996)              3.240000  2.911215
    12 Angry Men (1957)                4.184397  4.328421
```
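The count-then-filter pattern can be sketched on a toy table. The titles, ratings and the threshold `min_ratings = 2` below are hypothetical (the article uses 250). Selection goes through `.loc`, since the older `.ix` indexer was removed in pandas 1.0.

```python
import pandas as pd

# Made-up ratings for three movies.
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'C', 'C'],
    'rating': [4, 5, 3, 2, 4, 1],
})

# Number of ratings per title.
counts = toy.groupby('title').size()

# Hypothetical threshold for this sketch; the article uses 250.
min_ratings = 2
active = counts.index[counts >= min_ratings]

# Mean rating per title, restricted to the sufficiently rated titles.
means = toy.groupby('title')['rating'].mean()
print(means.loc[active])
```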

To find the movies that female viewers liked most, sort by the F column in descending order:

```code
    top_female_ratings = mean_ratings.sort_index(by='F',ascending=False)
    -c:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
    # a warning appears here: in pandas 0.18.1 the by argument of sort_index is deprecated and no longer documented; use sort_values instead, as below
    top_female_ratings = mean_ratings.sort_values(by='F',ascending=False)
    
    top_female_ratings[:10]
    Out[38]: 
    gender                                                     F         M
    title                                                                 
    Close Shave, A (1995)                               4.644444  4.473795
    Wrong Trousers, The (1993)                          4.588235  4.478261
    Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)       4.572650  4.464589
    Wallace & Gromit: The Best of Aardman Animation...  4.563107  4.385075
    Schindler's List (1993)                             4.562602  4.491415
    Shawshank Redemption, The (1994)                    4.539075  4.560625
    Grand Day Out, A (1992)                             4.537879  4.293255
    To Kill a Mockingbird (1962)                        4.536667  4.372611
    Creature Comforts (1990)                            4.513889  4.272277
    Usual Suspects, The (1995)                          4.513317  4.518248
```

Comparison of the two functions named in the warning, pandas version 0.18.1:

> **pandas.DataFrame.sort_index()**  
> **Parameters:**  
> **_axis_**: index, columns to direct sorting  
> **_level_**: int or level name or list of ints or list of level names  
> if not None, sort on values in specified index level(s)  
> **_ascending_**: boolean, default True  
> Sort ascending vs. descending  
> **_inplace_**: bool, if True, perform operation in-place  
> **_kind_**: {quicksort, mergesort, heapsort}  
> Choice of sorting algorithm. See also ndarray.np.sort for more information.
> mergesort is the only stable algorithm. For DataFrames, this option is only
> applied when sorting on a single column or label.  
> **_na_position_**: {'first', 'last'}  
> first puts NaNs at the beginning, last puts NaNs at the end  
> **_sort_remaining_**: bool  
> if true and sorting by level and index is multilevel, sort by other levels
> too (in order) after sorting by specified level  
> **Returns:**  
> **_sorted_obj_**: DataFrame
>
> **pandas.DataFrame.sort_values()**  
> **Parameters:**  
> **_by_**: string name or list of names which refer to the axis items  
> **_axis_**: index, columns to direct sorting  
> **_ascending_**: bool or list of bool  
> Sort ascending vs. descending. Specify list for multiple sort orders. If
> this is a list of bools, must match the length of the by.  
> **_inplace_**: bool  
> if True, perform operation in-place  
> **_kind_**: {quicksort, mergesort, heapsort}  
> Choice of sorting algorithm. See also ndarray.np.sort for more information.
> mergesort is the only stable algorithm. For DataFrames, this option is only
> applied when sorting on a single column or label.  
> **_na_position_**: {'first', 'last'}  
> first puts NaNs at the beginning, last puts NaNs at the end  
> **Returns:**  
> **_sorted_obj_**: DataFrame
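The difference between the two methods is easiest to see side by side. This is a small sketch with made-up scores, where the row labels are deliberately out of alphabetical order: `sort_values` orders rows by the contents of a column, while `sort_index` orders them by their labels.

```python
import pandas as pd

# Made-up mean ratings; the index is deliberately unsorted.
scores = pd.DataFrame(
    {'F': [4.8, 3.0, 4.0], 'M': [3.5, 4.0, 4.2]},
    index=['Movie C', 'Movie A', 'Movie B'],
)

# Order rows by the values in column F, highest first.
by_value = scores.sort_values(by='F', ascending=False)

# Order rows by their index labels (alphabetical here).
by_label = scores.sort_index()

print(by_value)
print(by_label)
```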

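Measuring rating disagreement  
Suppose we want to find the movies on which male and female viewers disagree most. One approach is to add a column diff to mean_ratings holding the difference between the two average ratings. Sorting by that column then gives the movies with the largest disagreement that female viewers preferred: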

```code
    mean_ratings['diff'] = mean_ratings['M']-mean_ratings['F']
    sort_by_diff = mean_ratings.sort_values(by='diff')
    sort_by_diff[:5]
    Out[41]: 
    gender                            F         M      diff
    title                                                  
    Dirty Dancing (1987)       3.790378  2.959596 -0.830782
    Jumpin' Jack Flash (1986)  3.254717  2.578358 -0.676359
    Grease (1978)              3.975265  3.367041 -0.608224
    Little Women (1994)        3.870588  3.321739 -0.548849
    Steel Magnolias (1989)     3.901734  3.365957 -0.535777
```

Reversing the sorted result and taking the first 5 rows gives the movies that male viewers preferred:

```code
    sort_by_diff[::-1][:5]
    Out[43]: 
    gender                                         F         M      diff
    title                                                               
    Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
    Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
    Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
    Longest Day, The (1962)                 3.411765  4.031447  0.619682
    Cable Guy, The (1996)                   2.250000  2.863787  0.613787
```
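The sign convention can be checked on a toy table (titles and values are made up). Since diff is M minus F, the most negative values sit at the top after an ascending sort and mark the movies female viewers rated higher. The reversed view puts the movies male viewers rated higher first.

```python
import pandas as pd

# Made-up mean ratings per gender.
means = pd.DataFrame(
    {'F': [3.8, 2.7, 4.5], 'M': [3.0, 3.3, 4.5]},
    index=['Movie A', 'Movie B', 'Movie C'],
)

# Negative diff: rated higher by women; positive diff: rated higher by men.
means['diff'] = means['M'] - means['F']
sorted_by_diff = means.sort_values(by='diff')

female_pref = sorted_by_diff[:1]        # most negative diff
male_pref = sorted_by_diff[::-1][:1]    # most positive diff

print(sorted_by_diff)
```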

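If we only want the movies with the most disagreement, regardless of gender, we can compute the variance or the standard deviation of the ratings: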

```code
    # group by title and compute the standard deviation of the ratings
    rating_std_by_title = data.groupby('title')['rating'].std()
    # keep only the movies with at least 250 ratings
    rating_std_by_title = rating_std_by_title.loc[active_titles]
    
    rating_std_by_title.order(ascending=False)[:5]
    -c:1: FutureWarning: order is deprecated, use sort_values(...) # despite the warning, the result is still produced
    rating_std_by_title.sort_values(ascending=False)[:5]
    Out[50]: 
    title
    Dumb & Dumber (1994)                     1.321333
    Blair Witch Project, The (1999)          1.316368
    Natural Born Killers (1994)              1.307198
    Tank Girl (1995)                         1.277695
    Rocky Horror Picture Show, The (1975)    1.260177
    Name: rating, dtype: float64
```
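A hand-checkable sketch of the same calculation, on made-up ratings: movie A gets 1, 3 and 5 (mean 3, squared deviations summing to 8) and movie B gets 3 three times. pandas computes the sample standard deviation by default (ddof=1), so A comes out as sqrt(8 / 2) = 2.0 and B as 0.0.

```python
import pandas as pd

# Made-up ratings: A is divisive, B is rated identically by everyone.
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B', 'B'],
    'rating': [1, 3, 5, 3, 3, 3],
})

# Sample standard deviation (ddof=1) of the ratings per title.
std_by_title = toy.groupby('title')['rating'].std()

# Most divisive titles first.
most_divisive = std_by_title.sort_values(ascending=False)
print(most_divisive)
```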


posted on 2021-07-07 15:40 BabyGo000