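Kaggle TMDB Box Office Prediction

Data Analysis - Kaggle TMDB Box Office Prediction

- Environment setup
- Dataset
- Walkthrough
- Data preprocessing
- Exploratory data analysis
- Modeling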

Environment setup

Environment used:

Windows 10
Python 3.7.2
Jupyter Notebook (all code was tested successfully here)

Dataset

https://www.kaggle.com/c/tmdb-box-office-prediction/data

Walkthrough

Before starting, import the required libraries:

    import pandas as pd
    # Use the full option key: the short 'max_columns' pattern matches several options in newer pandas
    pd.set_option('display.max_columns',None)
    import matplotlib.pyplot as plt
    import seaborn as sns
    import plotly.graph_objs as go
    import plotly.offline as py
    from wordcloud import WordCloud
    plt.style.use('ggplot')
    import ast
    from collections import Counter
    import numpy as np
    from sklearn.preprocessing import LabelEncoder
    # text mining
    from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
    from sklearn.linear_model import LinearRegression
    # modeling
    from sklearn.model_selection import train_test_split
    import lightgbm as lgb

Load the data:

    train=pd.read_csv('dataset/train.csv')
    test=pd.read_csv('dataset/test.csv')

Take a quick look at the data:

    train.head()

|  | id | belongs_to_collection | budget | genres | homepage | imdb_id | original_language | original_title | overview | popularity | poster_path | production_companies | production_countries | release_date | runtime | spoken_languages | status | tagline | title | Keywords | cast | crew | revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | [{'id': 313576, 'name': 'Hot Tub Time Machine … | 14000000 | [{'id': 35, 'name': 'Comedy'}] | NaN | tt2637294 | en | Hot Tub Time Machine 2 | When Lou, who has become the "father of the In… | 6.575393 | /tQtWuwvMf0hCc2QR2tkolwl7c3c.jpg | [{'name': 'Paramount Pictures', 'id': 4}, {'na… | [{'iso_3166_1': 'US', 'name': 'United States o… | 2/20/15 | 93.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | The Laws of Space and Time are About to be Vio… | Hot Tub Time Machine 2 | [{'id': 4379, 'name': 'time travel'}, {'id': 9… | [{'cast_id': 4, 'character': 'Lou', 'credit_id… | [{'credit_id': '59ac067c92514107af02c8c8', 'de… | 12314651 |
| 1 | 2 | [{'id': 107674, 'name': 'The Princess Diaries … | 40000000 | [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam… | NaN | tt0368933 | en | The Princess Diaries 2: Royal Engagement | Mia Thermopolis is now a college graduate and … | 8.248895 | /w9Z7A0GHEhIp7etpj0vyKOeU1Wx.jpg | [{'name': 'Walt Disney Pictures', 'id': 2}] | [{'iso_3166_1': 'US', 'name': 'United States o… | 8/6/04 | 113.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | It can take a lifetime to find true love; she’… | The Princess Diaries 2: Royal Engagement | [{'id': 2505, 'name': 'coronation'}, {'id': 42… | [{'cast_id': 1, 'character': 'Mia Thermopolis'… | [{'credit_id': '52fe43fe9251416c7502563d', 'de… | 95149435 |
| 2 | 3 | NaN | 3300000 | [{'id': 18, 'name': 'Drama'}] | http://sonyclassics.com/whiplash/ | tt2582802 | en | Whiplash | Under the direction of a ruthless instructor, … | 64.299990 | /lIv1QinFqz4dlp5U4lQ6HaiskOZ.jpg | [{'name': 'Bold Films', 'id': 2266}, {'name': … | [{'iso_3166_1': 'US', 'name': 'United States o… | 10/10/14 | 105.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | The road to greatness can take you to the edge. | Whiplash | [{'id': 1416, 'name': 'jazz'}, {'id': 1523, 'n… | [{'cast_id': 5, 'character': 'Andrew Neimann',… | [{'credit_id': '54d5356ec3a3683ba0000039', 'de… | 13092000 |
| 3 | 4 | NaN | 1200000 | [{'id': 53, 'name': 'Thriller'}, {'id': 18, 'n… | http://kahaanithefilm.com/ | tt1821480 | hi | Kahaani | Vidya Bagchi (Vidya Balan) arrives in Kolkata … | 3.174936 | /aTXRaPrWSinhcmCrcfJK17urp3F.jpg | NaN | [{'iso_3166_1': 'IN', 'name': 'India'}] | 3/9/12 | 122.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso… | Released | NaN | Kahaani | [{'id': 10092, 'name': 'mystery'}, {'id': 1054… | [{'cast_id': 1, 'character': 'Vidya Bagchi', '… | [{'credit_id': '52fe48779251416c9108d6eb', 'de… | 16000000 |
| 4 | 5 | NaN | 0 | [{'id': 28, 'name': 'Action'}, {'id': 53, 'nam… | NaN | tt1380152 | ko | 마린보이 | Marine Boy is the story of a former national s… | 1.148070 | /m22s7zvkVFDU9ir56PiiqIEWFdT.jpg | NaN | [{'iso_3166_1': 'KR', 'name': 'South Korea'}] | 2/5/09 | 118.0 | [{'iso_639_1': 'ko', 'name': '한국어/조선말'}] | Released | NaN | Marine Boy | NaN | [{'cast_id': 3, 'character': 'Chun-soo', 'cred… | [{'credit_id': '52fe464b9251416c75073b43', 'de… | 3923970 |

Dataset size: the dataset is quite small.

    print(train.shape)
    print(test.shape)

(3000, 23)
(4398, 22)

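Data preprocessing

The preview above shows that several columns hold JSON-like data stored as strings, which has to be converted into a workable format. Strictly speaking, these strings are Python literals rather than valid JSON (they use single quotes), so json.loads would reject them. ast.literal_eval parses them safely into lists of dictionaries, and that is the method I use here: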

    dict_columns = ['belongs_to_collection', 'genres', 'production_companies',
                    'production_countries', 'spoken_languages', 'Keywords', 'cast', 'crew']
    
    def json_to_dict(df):
        for column in dict_columns:
            df[column] = df[column].apply(lambda x: {} if pd.isna(x) else ast.literal_eval(x) )
        return df
    
    train = json_to_dict(train)
    test = json_to_dict(test)
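
To see why ast.literal_eval is used here rather than a JSON parser, here is a small standalone sketch. The sample string is copied from the genres column in the preview; the variable names are only for illustration.

```python
import ast
import json

# One cell as it appears in the CSV (taken from the genres column preview).
raw = "[{'id': 35, 'name': 'Comedy'}]"

# json.loads rejects it: JSON requires double-quoted strings.
try:
    json.loads(raw)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval only accepts Python literals (no names, no function calls),
# so it can parse this text without the risks of eval().
genres = ast.literal_eval(raw)
names = [g["name"] for g in genres]
print(parsed_as_json, names)
```

Note that json_to_dict above fills missing cells with an empty dict, which is why the later lambdas compare against {} even though parsed cells are lists.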

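Next, turn these irregular columns into features. There are two kinds: label extraction and encoding (key cast members, genres, keywords, collection, production companies, and so on) and label counts (number of genres, number of cast members, number of keywords, and so on). One caveat: because the dataset is small, only the top labels should be encoded, to keep the model from overfitting: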

    # collections
    train['collection_name'] = train['belongs_to_collection'].apply(lambda x: x[0]['name'] if x != {} else 0)
    train['has_collection'] = train['belongs_to_collection'].apply(lambda x: len(x) if x != {} else 0)
    
    test['collection_name'] = test['belongs_to_collection'].apply(lambda x: x[0]['name'] if x != {} else 0)
    test['has_collection'] = test['belongs_to_collection'].apply(lambda x: len(x) if x != {} else 0)
    
    train = train.drop(['belongs_to_collection'], axis=1)
    test = test.drop(['belongs_to_collection'], axis=1)
    
    # genres
    list_of_genres = list(train['genres'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
    train['num_genres'] = train['genres'].apply(lambda x: len(x) if x != {} else 0)
    train['all_genres'] = train['genres'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
    top_genres = [m[0] for m in Counter([i for j in list_of_genres for i in j]).most_common(15)]
    for g in top_genres:
        train['genre_' + g] = train['all_genres'].apply(lambda x: 1 if g in x else 0)
        
    test['num_genres'] = test['genres'].apply(lambda x: len(x) if x != {} else 0)
    test['all_genres'] = test['genres'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
    for g in top_genres:
        test['genre_' + g] = test['all_genres'].apply(lambda x: 1 if g in x else 0)
    
    train = train.drop(['genres'], axis=1)
    test = test.drop(['genres'], axis=1)
    
    # production companies
    list_of_companies = list(train['production_companies'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
    
    train['num_companies'] = train['production_companies'].apply(lambda x: len(x) if x != {} else 0)
    train['all_production_companies'] = train['production_companies'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
    top_companies = [m[0] for m in Counter([i for j in list_of_companies for i in j]).most_common(30)]
    for g in top_companies:
        train['production_company_' + g] = train['all_production_companies'].apply(lambda x: 1 if g in x else 0)
        
    test['num_companies'] = test['production_companies'].apply(lambda x: len(x) if x != {} else 0)
    test['all_production_companies'] = test['production_companies'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
    for g in top_companies:
        test['production_company_' + g] = test['all_production_companies'].apply(lambda x: 1 if g in x else 0)
    
    train = train.drop(['production_companies', 'all_production_companies'], axis=1)
    test = test.drop(['production_companies', 'all_production_companies'], axis=1)
    
    # production countries
    list_of_countries = list(train['production_countries'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
    train['num_countries'] = train['production_countries'].apply(lambda x: len(x) if x != {} else 0)
    train['all_countries'] = train['production_countries'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
    top_countries = [m[0] for m in Counter([i for j in list_of_countries for i in j]).most_common(25)]
    for g in top_countries:
        train['production_country_' + g] = train['all_countries'].apply(lambda x: 1 if g in x else 0)
        
    test['num_countries'] = test['production_countries'].apply(lambda x: len(x) if x != {} else 0)
    test['all_countries'] = test['production_countries'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
    for g in top_countries:
        test['production_country_' + g] = test['all_countries'].apply(lambda x: 1 if g in x else 0)
    
    train = train.drop(['production_countries', 'all_countries'], axis=1)
    test = test.drop(['production_countries', 'all_countries'], axis=1)
    
    # spoken languages
    list_of_languages = list(train['spoken_languages'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
    train['num_languages'] = train['spoken_languages'].apply(lambda x: len(x) if x != {} else 0)
    train['all_languages'] = train['spoken_languages'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
    top_languages = [m[0] for m in Counter([i for j in list_of_languages for i in j]).most_common(30)]
    for g in top_languages:
        train['language_' + g] = train['all_languages'].apply(lambda x: 1 if g in x else 0)
        
    test['num_languages'] = test['spoken_languages'].apply(lambda x: len(x) if x != {} else 0)
    test['all_languages'] = test['spoken_languages'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
    for g in top_languages:
        test['language_' + g] = test['all_languages'].apply(lambda x: 1 if g in x else 0)
    
    train = train.drop(['spoken_languages', 'all_languages'], axis=1)
    test = test.drop(['spoken_languages', 'all_languages'], axis=1)
    
    # keywords
    list_of_keywords = list(train['Keywords'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
    train['num_Keywords'] = train['Keywords'].apply(lambda x: len(x) if x != {} else 0)
    train['all_Keywords'] = train['Keywords'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
    top_keywords = [m[0] for m in Counter([i for j in list_of_keywords for i in j]).most_common(30)]
    for g in top_keywords:
        train['keyword_' + g] = train['all_Keywords'].apply(lambda x: 1 if g in x else 0)
        
    test['num_Keywords'] = test['Keywords'].apply(lambda x: len(x) if x != {} else 0)
    test['all_Keywords'] = test['Keywords'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
    for g in top_keywords:
        test['keyword_' + g] = test['all_Keywords'].apply(lambda x: 1 if g in x else 0)
    
    train = train.drop(['Keywords', 'all_Keywords'], axis=1)
    test = test.drop(['Keywords', 'all_Keywords'], axis=1)
    
    # cast
    list_of_cast_names = list(train['cast'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
    list_of_cast_genders = list(train['cast'].apply(lambda x: [i['gender'] for i in x] if x != {} else []).values)
    list_of_cast_characters = list(train['cast'].apply(lambda x: [i['character'] for i in x] if x != {} else []).values)
    train['num_cast'] = train['cast'].apply(lambda x: len(x) if x != {} else 0)
    top_cast_names = [m[0] for m in Counter([i for j in list_of_cast_names for i in j]).most_common(15)]
    for g in top_cast_names:
        train['cast_name_' + g] = train['cast'].apply(lambda x: 1 if g in str(x) else 0)
    train['genders_0_cast'] = train['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
    train['genders_1_cast'] = train['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
    train['genders_2_cast'] = train['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
    top_cast_characters = [m[0] for m in Counter([i for j in list_of_cast_characters for i in j]).most_common(15)]
    for g in top_cast_characters:
        train['cast_character_' + g] = train['cast'].apply(lambda x: 1 if g in str(x) else 0)
        
    test['num_cast'] = test['cast'].apply(lambda x: len(x) if x != {} else 0)
    for g in top_cast_names:
        test['cast_name_' + g] = test['cast'].apply(lambda x: 1 if g in str(x) else 0)
    test['genders_0_cast'] = test['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
    test['genders_1_cast'] = test['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
    test['genders_2_cast'] = test['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
    for g in top_cast_characters:
        test['cast_character_' + g] = test['cast'].apply(lambda x: 1 if g in str(x) else 0)
    
    train = train.drop(['cast'], axis=1)
    test = test.drop(['cast'], axis=1)
    
    # crew
    list_of_crew_names = list(train['crew'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
    list_of_crew_jobs = list(train['crew'].apply(lambda x: [i['job'] for i in x] if x != {} else []).values)
    list_of_crew_genders = list(train['crew'].apply(lambda x: [i['gender'] for i in x] if x != {} else []).values)
    list_of_crew_departments = list(train['crew'].apply(lambda x: [i['department'] for i in x] if x != {} else []).values)
    train['num_crew'] = train['crew'].apply(lambda x: len(x) if x != {} else 0)
    top_crew_names = [m[0] for m in Counter([i for j in list_of_crew_names for i in j]).most_common(15)]
    for g in top_crew_names:
        train['crew_name_' + g] = train['crew'].apply(lambda x: 1 if g in str(x) else 0)
    train['genders_0_crew'] = train['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
    train['genders_1_crew'] = train['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
    train['genders_2_crew'] = train['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
    top_crew_jobs = [m[0] for m in Counter([i for j in list_of_crew_jobs for i in j]).most_common(15)]
    for j in top_crew_jobs:
        train['jobs_' + j] = train['crew'].apply(lambda x: sum([1 for i in x if i['job'] == j]))
    top_crew_departments = [m[0] for m in Counter([i for j in list_of_crew_departments for i in j]).most_common(15)]
    for j in top_crew_departments:
        train['departments_' + j] = train['crew'].apply(lambda x: sum([1 for i in x if i['department'] == j])) 
        
    test['num_crew'] = test['crew'].apply(lambda x: len(x) if x != {} else 0)
    for g in top_crew_names:
        test['crew_name_' + g] = test['crew'].apply(lambda x: 1 if g in str(x) else 0)
    test['genders_0_crew'] = test['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
    test['genders_1_crew'] = test['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
    test['genders_2_crew'] = test['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
    for j in top_crew_jobs:
        test['jobs_' + j] = test['crew'].apply(lambda x: sum([1 for i in x if i['job'] == j]))
    for j in top_crew_departments:
        test['departments_' + j] = test['crew'].apply(lambda x: sum([1 for i in x if i['department'] == j])) 
    
    train = train.drop(['crew'], axis=1)
    test = test.drop(['crew'], axis=1)
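
The same pattern repeats for every column above: count labels on the training set, keep the top N, then emit one 0/1 column per label. Below is a minimal plain-Python sketch of that pattern. The helper name encode_top_labels and the sample keyword lists are made up for illustration. It differs from the code above in one deliberate way: it tests exact membership instead of substring containment. With a substring test such as 1 if g in x else 0, the keyword nudity also matches female nudity, and an empty label matches every row (the language_ and cast_character_ columns in the output come from such empty labels).

```python
from collections import Counter

def encode_top_labels(rows, top_n):
    """rows holds one list of label strings per film.

    Returns the top_n most frequent non-empty labels and, for each of them,
    a 0/1 column telling whether the film carries exactly that label.
    """
    counts = Counter(label for labels in rows for label in labels if label)
    top_labels = [label for label, _ in counts.most_common(top_n)]
    columns = {
        label: [1 if label in labels else 0 for labels in rows]
        for label in top_labels
    }
    return top_labels, columns

# Made-up keyword lists for four films (illustration only).
keywords = [
    ["female nudity", "revenge"],
    ["nudity", "revenge"],
    ["nudity"],
    [""],
]
top, cols = encode_top_labels(keywords, top_n=2)

# The substring test used on the joined string flags film 0 for "nudity" ...
substring_hit = 1 if "nudity" in " ".join(keywords[0]) else 0
# ... while exact membership does not.
exact_hit = cols["nudity"][0]
print(top, cols, substring_hit, exact_hit)
```

To use it with the DataFrames above, one would compute top on train only and assign each cols[label] list to a column such as train['keyword_' + label], then reuse the same top labels for test.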

Preview the result after processing:

    train.head()

| id | budget | homepage | imdb_id | original_language |
original_title | overview | popularity | poster_path | release_date |
runtime | status | tagline | title | revenue | collection_name |
has_collection | num_genres | all_genres | genre_Drama | genre_Comedy
| genre_Thriller | genre_Action | genre_Romance | genre_Crime |
genre_Adventure | genre_Horror | genre_Science Fiction | genre_Family |
genre_Fantasy | genre_Mystery | genre_Animation | genre_History |
genre_Music | num_companies | production_company_Warner Bros. |
production_company_Universal Pictures | production_company_Paramount
Pictures | production_company_Twentieth Century Fox Film Corporation |
production_company_Columbia Pictures | production_company_Metro-Goldwyn-
Mayer (MGM) | production_company_New Line Cinema |
production_company_Touchstone Pictures | production_company_Walt Disney
Pictures | production_company_Columbia Pictures Corporation |
production_company_TriStar Pictures | production_company_Relativity Media |
production_company_Canal+ | production_company_United Artists |
production_company_Miramax Films | production_company_Village Roadshow
Pictures | production_company_Regency Enterprises | production_company_BBC
Films | production_company_Dune Entertainment | production_company_Working
Title Films | production_company_Fox Searchlight Pictures |
production_company_StudioCanal | production_company_Lionsgate |
production_company_DreamWorks SKG | production_company_Fox 2000 Pictures |
production_company_Summit Entertainment | production_company_Hollywood
Pictures | production_company_Orion Pictures | production_company_Amblin
Entertainment | production_company_Dimension Films | num_countries |
production_country_United States of America | production_country_United
Kingdom | production_country_France | production_country_Germany |
production_country_Canada | production_country_India |
production_country_Italy | production_country_Japan |
production_country_Australia | production_country_Russia |
production_country_Spain | production_country_China |
production_country_Hong Kong | production_country_Ireland |
production_country_Belgium | production_country_South Korea |
production_country_Mexico | production_country_Sweden |
production_country_New Zealand | production_country_Netherlands |
production_country_Czech Republic | production_country_Denmark |
production_country_Brazil | production_country_Luxembourg |
production_country_South Africa | num_languages | language_English |
language_Français | language_Español | language_Deutsch |
language_Pусский | language_Italiano | language_日本語 | language_普通话 |
language_हिन्दी | language_ | language_Português | language_العربية |
language_한국어/조선말 | language_广州话 / 廣州話 | language_தமிழ் | language_Polski
| language_Magyar | language_Latin | language_svenska |
language_ภาษาไทย | language_Český | language_עִבְרִית |
language_ελληνικά | language_Türkçe | language_Dansk |
language_Nederlands | language_فارسی | language_Tiếng Việt |
language_اردو | language_Română | num_Keywords | keyword_woman director
| keyword_independent film | keyword_duringcreditsstinger |
keyword_murder | keyword_based on novel | keyword_violence |
keyword_sport | keyword_biography | keyword_aftercreditsstinger |
keyword_dystopia | keyword_revenge | keyword_friendship | keyword_sex |
keyword_suspense | keyword_sequel | keyword_love | keyword_police |
keyword_teenager | keyword_nudity | keyword_female nudity | keyword_drug
| keyword_prison | keyword_musical | keyword_high school | keyword_los
angeles | keyword_new york | keyword_family | keyword_father son
relationship | keyword_kidnapping | keyword_investigation | num_cast |
cast_name_Samuel L. Jackson | cast_name_Robert De Niro | cast_name_Morgan
Freeman | cast_name_J.K. Simmons | cast_name_Bruce Willis |
cast_name_Liam Neeson | cast_name_Susan Sarandon | cast_name_Bruce McGill
| cast_name_John Turturro | cast_name_Forest Whitaker | cast_name_Willem
Dafoe | cast_name_Bill Murray | cast_name_Owen Wilson |
cast_name_Nicolas Cage | cast_name_Sylvester Stallone | genders_0_cast |
genders_1_cast | genders_2_cast | cast_character_ |
cast_character_Himself | cast_character_Herself | cast_character_Dancer |
cast_character_Additional Voices (voice) | cast_character_Doctor |
cast_character_Reporter | cast_character_Waitress | cast_character_Nurse
| cast_character_Bartender | cast_character_Jack |
cast_character_Debutante | cast_character_Security Guard |
cast_character_Paul | cast_character_Frank | num_crew | crew_name_Avy
Kaufman | crew_name_Robert Rodriguez | crew_name_Deborah Aquila |
crew_name_James Newton Howard | crew_name_Mary Vernieu | crew_name_Steven
Spielberg | crew_name_Luc Besson | crew_name_Jerry Goldsmith |
crew_name_Francine Maisler | crew_name_Tricia Wood | crew_name_James
Horner | crew_name_Kerry Barden | crew_name_Bob Weinstein |
crew_name_Harvey Weinstein | crew_name_Janet Hirshenson | genders_0_crew
| genders_1_crew | genders_2_crew | jobs_Producer | jobs_Executive
Producer | jobs_Director | jobs_Screenplay | jobs_Editor |
jobs_Casting | jobs_Director of Photography | jobs_Original Music Composer
| jobs_Art Direction | jobs_Production Design | jobs_Costume Design |
jobs_Writer | jobs_Set Decoration | jobs_Makeup Artist | jobs_Sound Re-
Recording Mixer | departments_Production | departments_Sound |
departments_Art | departments_Crew | departments_Writing |
departments_Costume & Make-Up | departments_Camera | departments_Directing
| departments_Editing | departments_Visual Effects | departments_Lighting
| departments_Actors
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 14000000 | NaN | tt2637294 | en | Hot Tub Time Machine 2
| When Lou, who has become the "father of the In… | 6.575393 |
/tQtWuwvMf0hCc2QR2tkolwl7c3c.jpg | 2/20/15 | 93.0 | Released | The
Laws of Space and Time are About to be Vio… | Hot Tub Time Machine 2 |
12314651 | Hot Tub Time Machine Collection | 1 | 1 | Comedy | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 3 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 |
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 24 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 6 | 8 | 10 | 1 | 1 | 1 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 72 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 59 | 0 | 13 | 1 | 3 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
1 | 1 | 1 | 1 | 4 | 2 | 9 | 10 | 12 | 4 | 2 | 13 | 8
| 4 | 2 | 4 | 4 | 0
1 | 2 | 40000000 | NaN | tt0368933 | en | The Princess Diaries 2:
Royal Engagement | Mia Thermopolis is now a college graduate and … |
8.248895 | /w9Z7A0GHEhIp7etpj0vyKOeU1Wx.jpg | 8/6/04 | 113.0 |
Released | It can take a lifetime to find true love; she’… | The Princess
Diaries 2: Royal Engagement | 95149435 | The Princess Diaries Collection
| 1 | 4 | Comedy Drama Family Romance | 1 | 1 | 0 | 0 | 1 | 0
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 10 | 10 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 1 | 0 | 9 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 4 | 3
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 4 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0
2 | 3 | 3300000 | http://sonyclassics.com/whiplash/ | tt2582802 | en
| Whiplash | Under the direction of a ruthless instructor, … | 64.299990
| /lIv1QinFqz4dlp5U4lQ6HaiskOZ.jpg | 10/10/14 | 105.0 | Released |
The road to greatness can take you to the edge. | Whiplash | 13092000 |
0 | 0 | 1 | Drama | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 12 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 51 | 0 | 0 | 0 | 1
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 7
| 13 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0
| 0 | 1 | 1 | 64 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 49 | 4 | 11 | 4 | 4 | 1 | 1
| 1 | 2 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 2 | 18 | 9
| 5 | 9 | 1 | 5 | 4 | 3 | 6 | 3 | 1 | 0
3 | 4 | 1200000 | http://kahaanithefilm.com/ | tt1821480 | hi |
Kahaani | Vidya Bagchi (Vidya Balan) arrives in Kolkata … | 3.174936 |
/aTXRaPrWSinhcmCrcfJK17urp3F.jpg | 3/9/12 | 122.0 | Released | NaN |
Kahaani | 16000000 | 0 | 0 | 2 | Drama Thriller | 1 | 0 | 1 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 4 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 0
4 | 5 | 0 | NaN | tt1380152 | ko | 마린보이 | Marine Boy is the
story of a former national s… | 1.148070 |
/m22s7zvkVFDU9ir56PiiqIEWFdT.jpg | 2/5/09 | 118.0 | Released | NaN |
Marine Boy | 3923970 | 0 | 0 | 2 | Action Thriller | 0 | 0 | 1
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 4 | 1 | 0 | 0 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0
| 0 | 0 | 0

That is the expected result, but the release date is still a string with a two-digit year, so convert it to a standard date format:

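    def fix_date(x):
        """
        Expand the two-digit year of an m/d/yy string:
        00-19 becomes 20xx, everything else becomes 19xx.
        """
        year = x.split('/')[2]
        if int(year) <= 19:
            return x[:-2] + '20' + year
        else:
            return x[:-2] + '19' + year
    # Fill missing release dates in the test set with a placeholder before parsing
    test.loc[test['release_date'].isnull(), 'release_date'] = '01/01/98'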
    train['release_date'] = train['release_date'].apply(lambda x: fix_date(x))
    test['release_date'] = test['release_date'].apply(lambda x: fix_date(x))
    train['release_date'] = pd.to_datetime(train['release_date'])
    test['release_date'] = pd.to_datetime(test['release_date'])
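
A quick check of the two-digit-year rule, and of why the century is expanded by hand rather than left to a parser. The sketch repeats fix_date so that it runs on its own. The first three dates come from the data above; '5/1/30' is a made-up example of an old film. Python's standard %y directive uses a fixed pivot (69-99 map to 1969-1999, 00-68 to 2000-2068), which would place a 1930 release in the future.

```python
from datetime import datetime

def fix_date(x):
    # Same rule as above: years 00-19 become 20xx, everything else 19xx.
    year = x.split('/')[2]
    century = '20' if int(year) <= 19 else '19'
    return x[:-2] + century + year

fixed = [fix_date(d) for d in ['2/20/15', '8/6/04', '01/01/98', '5/1/30']]

# Let the standard library guess the century from the two-digit year ...
stdlib_year = datetime.strptime('5/1/30', '%m/%d/%y').year
# ... versus expanding it first and parsing a four-digit year.
custom_year = datetime.strptime(fix_date('5/1/30'), '%m/%d/%Y').year
print(fixed, stdlib_year, custom_year)
```

The cutoff of 19 is tied to when the dataset was built; the rule assumes no film in it was released before 1920.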

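Exploratory data analysis

Start with the budget distribution. Most values are small and the distribution is heavily right-skewed, so a log transform is appropriate: it spreads out the small values and makes them easier to tell apart.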

    plt.hist(train['budget'])
    plt.title('budget distribution')

![Histogram of budget](https://img-blog.csdnimg.cn/20190813010908708.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzM0MjIwOA==,size_16,color_FFFFFF,t_70)

    plt.hist(np.log1p(train['budget']))
    plt.title('log1p budget distribution')

![Histogram of log1p(budget)](https://img-blog.csdnimg.cn/20190813011216144.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzM0MjIwOA==,size_16,color_FFFFFF,t_70)

Revenue clearly calls for the same treatment:

    # log_budget: log(1 + budget), which compresses the long right tail
    train['log_budget'] = np.log1p(train['budget'])
    test['log_budget'] = np.log1p(test['budget'])
    # log_revenue: the same transform for the target (train only; test has no revenue)
    train['log_revenue'] = np.log1p(train['revenue'])
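
To make the effect of the transform concrete, here is a small sketch using the five budgets from the preview rows. It uses math.log1p, which computes the same function as np.log1p for a single number. The point of log1p over a plain logarithm is that it is defined at zero, and one of the preview rows does have a budget of 0.

```python
import math

# Budgets of the five preview rows shown earlier.
budgets = [14000000, 40000000, 3300000, 1200000, 0]

log_budgets = [math.log1p(b) for b in budgets]   # log(1 + b), defined at b = 0
restored = [math.expm1(v) for v in log_budgets]  # inverse transform: exp(v) - 1

for b, v in zip(budgets, log_budgets):
    print(f"{b:>10} -> {v:.3f}")
```

The raw values span seven orders of magnitude, while the transformed ones all fall below 18. Since the target is transformed too, a model trained on log_revenue predicts on the log scale, and its output has to go through np.expm1 to get back to revenue.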

Next, homepage. Convert it to a 0/1 flag, since having a homepage is arguably a sign of a well-resourced production:

    # 1 if the film has a homepage URL, 0 if the field is missing
    train['has_homepage'] = train['homepage'].notna().astype(int)
    test['has_homepage'] = test['homepage'].notna().astype(int)

Revenue with and without a homepage; films with a homepage tend to earn more:

    sns.catplot(x='has_homepage', y='revenue', data=train);
    plt.title('Revenue for film with and without homepage');

![Revenue for films with and without a homepage](https://img-blog.csdnimg.cn/20190813011620487.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzM0MjIwOA==,size_16,color_FFFFFF,t_70)

Revenue by original language (ten most frequent languages):

    sns.boxplot(x='original_language',y='log_revenue',
                data=train.loc[train['original_language'].isin(train['original_language'].value_counts().head(10).index)])
    plt.title("Log_Revenue VS Original_language")

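![Log revenue by original language](https://img-blog.csdnimg.cn/2019081301180524.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzM0MjIwOA==,size_16,color_FFFFFF,t_70)

English has many hits and many flops; other languages have successful films too, and overall the differences are small.

The overview column calls for text mining. As a simple first pass, combine the commonly used TF-IDF representation with a linear regression: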

    vectorizer=TfidfVectorizer(sublinear_tf=True,analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1,2),min_df=5)
    overview_text=vectorizer.fit_transform(train['overview'].fillna(''))
    linreg=LinearRegression()
    linreg.fit(overview_text,train['log_revenue'])
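
For readers who have not met TF-IDF before, the following plain-Python sketch shows how one weight is computed. It follows the formulas scikit-learn documents for sublinear_tf=True together with the defaults smooth_idf=True and norm='l2'. It is a simplification of the call above: it splits on whitespace and ignores the bigrams and the min_df=5 cutoff, and the three documents are made up.

```python
import math
from collections import Counter

# Three made-up overviews (illustration only).
docs = [
    "a spy saves the world",
    "a spy falls in love",
    "love love love",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each token occurs.
df = Counter(tok for toks in tokenized for tok in set(toks))

def tfidf(tokens):
    counts = Counter(tokens)
    weights = {}
    for tok, tf in counts.items():
        sub_tf = 1.0 + math.log(tf)                       # sublinear_tf=True
        idf = math.log((1 + n_docs) / (1 + df[tok])) + 1  # smooth_idf=True
        weights[tok] = sub_tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))  # norm='l2'
    return {tok: w / norm for tok, w in weights.items()}

vectors = [tfidf(toks) for toks in tokenized]
print(vectors[0])
```

In the first document, world (which appears in one document) gets a larger weight than a (which appears in two). Repeating love three times in the last document does not triple its weight, because of the logarithm and the normalization.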

Use eli5 to visualize how much each token contributes to the predicted log_revenue:

    import eli5
    print('Target value:', train['log_revenue'][5])
    eli5.show_prediction(linreg,doc=train['overview'][5],vec=vectorizer)
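
What such an explanation shows for a linear model can be sketched by hand: the prediction is the intercept plus, for each token, the model coefficient times the token's TF-IDF weight in the document. The numbers below are invented for illustration and do not come from the fitted linreg; the names bias, coef and doc_vector are likewise hypothetical.

```python
# Hypothetical values; none of these numbers come from the fitted model.
bias = 15.0
coef = {"spy": 1.2, "love": -0.4, "world": 0.8}
doc_vector = {"spy": 0.5, "world": 0.6, "saves": 0.3}  # TF-IDF weights of one overview

# Contribution of each token = coefficient * feature value (0 if the model has no weight for it).
contributions = {tok: coef.get(tok, 0.0) * value for tok, value in doc_vector.items()}
prediction = bias + sum(contributions.values())

# Rank tokens by absolute impact, largest first.
ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
for tok, c in ranked:
    print(f"{tok:>6}: {c:+.2f}")
print("prediction:", round(prediction, 2))
```

With the real model, the coefficients would come from linreg.coef_ and the intercept from linreg.intercept_.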

A raw date is too coarse to be a useful feature, so derive weekday, month, quarter, year and similar features from it:

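    def process_date(df):
        date_parts=['year','weekday','month','day','quarter']
        for part in date_parts:
            df["release_date_"+part]=getattr(df["release_date"].dt,part).astype(int)
        # Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week (pandas >= 1.1) gives the ISO week
        df["release_date_weekofyear"]=df["release_date"].dt.isocalendar().week.astype(int)
        return df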
    train=process_date(train)
    test=process_date(test)
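
As a sanity check on what these parts mean, here is the same decomposition for a single date using only the standard library. The date is the release date of the first training row (2/20/15). The weekday convention matches pandas: Monday is 0 and Sunday is 6.

```python
from datetime import date

d = date(2015, 2, 20)  # release date of the first training row

parts = {
    "year": d.year,
    "weekday": d.weekday(),            # Monday=0 ... Sunday=6
    "month": d.month,
    "weekofyear": d.isocalendar()[1],  # ISO 8601 week number
    "day": d.day,
    "quarter": (d.month - 1) // 3 + 1,
}
print(parts)
```

So a weekday of 4 in the later plots means a Friday release.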

First, the number of films released per year, plotted with the interactive visualization library plotly:

    d1=train['release_date_year'].value_counts().sort_index()
    d2=test['release_date_year'].value_counts().sort_index()
    py.init_notebook_mode(connected=True)
    data=[go.Scatter(x=d1.index,y=d1.values,name='train'),go.Scatter(x=d2.index,y=d2.values,name='test')]
    layout=go.Layout(dict(title='Number of films per year',xaxis=dict(title='year'),yaxis=dict(title='Count')),legend=dict(orientation='v'))
    py.iplot(dict(data=data,layout=layout))

![Number of films per year](https://img-blog.csdnimg.cn/2019081301371847.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzM0MjIwOA==,size_16,color_FFFFFF,t_70)

Number of films versus total revenue per year:

    d1=train['release_date_year'].value_counts().sort_index()
    d2=train.groupby(['release_date_year'])['revenue'].sum()
    data=[go.Scatter(x=d1.index,y=d1.values,name='Count'),go.Scatter(x=d2.index,y=d2.values,name='overall_revenue',yaxis='y2')]
    layout=go.Layout(dict(title= "Number of films and total revenue per year",xaxis=dict(title='year'),yaxis=dict(title='Count'),yaxis2=dict(title='Total revenue', overlaying='y', side='right')),
                     legend=dict(orientation='v'))
    py.iplot(dict(data=data,layout=layout))

![Number of films and total revenue per year](https://img-blog.csdnimg.cn/20190813013921489.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzM0MjIwOA==,size_16,color_FFFFFF,t_70)

Trend of the number of releases and average revenue (average revenue appears to level off after 2000):

    d1=train['release_date_year'].value_counts().sort_index()
    d2=train.groupby(['release_date_year'])['revenue'].mean()
    data=[go.Scatter(x=d1.index,y=d1.values,name='Count'),go.Scatter(x=d2.index,y=d2.values,name='Average revenue',yaxis='y2')]
    layout=go.Layout(dict(title="Number of films and average revenue per year",xaxis=dict(title='year'),yaxis=dict(title='Count'),
                         yaxis2=dict(title='Average revenue',overlaying='y',side='right')),legend=dict(orientation='v'))
    py.iplot(dict(data=data,layout=layout))

![Number of films and average revenue per year](https://img-blog.csdnimg.cn/2019081301404894.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzM0MjIwOA==,size_16,color_FFFFFF,t_70)
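
The per-year plots above all draw on the same aggregates: count, total, and mean of revenue by year. As a sketch, they can be computed in one `groupby` call; the toy frame below stands in for `train` and its numbers are invented.

```python
import pandas as pd

# Toy stand-in for train; values are invented for illustration
toy = pd.DataFrame({
    "release_date_year": [1999, 1999, 2000, 2001, 2001, 2001],
    "revenue": [100, 300, 50, 10, 20, 30],
})

# One pass over the data yields all three per-year series
per_year = (
    toy.groupby("release_date_year")["revenue"]
       .agg(["count", "sum", "mean"])
       .sort_index()
)
print(per_year)
```

Here `count` counts non-null revenue values, so it matches `value_counts()` on the year column only when revenue has no missing values.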

Is the release weekday related to revenue?

    sns.catplot(x='release_date_weekday',y='revenue',data=train)
    plt.title('Revenue on different days of week of release')

![Revenue on different days of week of release](https://img-blog.csdnimg.cn/2019081301431287.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzM0MjIwOA==,size_16,color_FFFFFF,t_70)

Now the box plot:

    sns.boxplot(x='release_date_weekday',y='log_revenue',data=train)
    plt.title('Revenue on different days of week of release')

![Box plot of log_revenue by release weekday](https://img-blog.csdnimg.cn/20190813014556974.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzM0MjIwOA==,size_16,color_FFFFFF,t_70)

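Finding: many of the films released Monday through Wednesday are high-revenue films, while films released on Saturday have much lower revenue.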
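
A numeric summary can back up what the plots suggest. The sketch below groups a toy frame by weekday and reports the count and median of log revenue; the revenue figures are invented, and the weekday labels rely on pandas numbering Monday as 0.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train; revenue values are invented for illustration
toy = pd.DataFrame({
    "release_date_weekday": [0, 0, 2, 4, 4, 5, 5],
    "revenue": [5e7, 9e7, 1.2e8, 3e7, 6e7, 1e5, 4e5],
})
toy["log_revenue"] = np.log1p(toy["revenue"])

# Count and median of log revenue per release weekday
by_day = toy.groupby("release_date_weekday")["log_revenue"].agg(["count", "median"])

# Replace the numeric weekday (0 = Monday) with a readable label
day_names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
by_day.index = [day_names[d] for d in by_day.index]
print(by_day)
```

Including the count alongside the median helps judge whether a weekday with few releases is driving an apparent difference.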

Analyze the tagline column to find the most frequent words:

    plt.figure(figsize = (12, 12))
    text_tagline=' '.join(train['tagline'].fillna(''))
    wordcloud_tagline=WordCloud(max_font_size=None,background_color='white',width=1200,height=1000).generate_from_text(text_tagline)
    plt.imshow(wordcloud_tagline)
    plt.title('Top words in tagline')
    plt.axis("off")
    plt.show()
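
A word cloud shows frequency only through font size. For exact counts, a `Counter` over the same text is a simple complement. This sketch uses a few sample taglines and a small hand-picked stop-word set (both are assumptions for illustration); WordCloud applies its own stop-word list, so its ranking may differ.

```python
import re
from collections import Counter

# Sample taglines; None stands for a missing value
taglines = [
    "The Laws of Space and Time are About to be Violated",
    None,
    "One love. One life.",
    "Love is in the air",
]

# Join the non-missing taglines and split into lowercase words
text = ' '.join(t for t in taglines if isinstance(t, str)).lower()
words = re.findall(r"[a-z']+", text)

# Hypothetical minimal stop-word set, chosen only for this example
stop_words = {"the", "of", "and", "are", "to", "be", "is", "in", "about"}
counts = Counter(w for w in words if w not in stop_words)
print(counts.most_common(5))
```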

![Top words in tagline](https://img-blog.csdnimg.cn/20190813014848549.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzM0MjIwOA==,size_16,color_FFFFFF,t_70)

Effect of has_collection (whether the film belongs to a series) on revenue:

    sns.boxplot(x='has_collection',y='log_revenue',data=train)

![Box plot of log_revenue by has_collection](https://img-blog.csdnimg.cn/20190813015110601.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzM0MjIwOA==,size_16,color_FFFFFF,t_70)

Films that belong to a series have higher revenue on average.

Relationship between the number of genres and revenue:

    train['num_genres'].value_counts()
    sns.catplot(x='num_genres',y='revenue',data=train)

![Revenue by num_genres](https://img-blog.csdnimg.cn/2019081301525729.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzM0MjIwOA==,size_16,color_FFFFFF,t_70)

Films with 3 genres tend to have higher revenue, and having more genres than that actually does worse.

Finally, look at the relationship between production company and revenue, drawing a distribution plot for each company:

    f,axes=plt.subplots(6,5,figsize=(24,32))
    plt.suptitle('Violin of revenue vs production company')
    for i,e in enumerate([i for i in train.columns if 'production_company_' in i]):
        sns.violinplot(x=e,y='revenue',data=train,ax=axes[i//5][i%5])

![Violin of revenue vs production company](https://img-blog.csdnimg.cn/20190813015504604.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzM0MjIwOA==,size_16,color_FFFFFF,t_70)

Modeling

First drop some irrelevant features:

    train = train.drop(['homepage', 'imdb_id', 'poster_path', 'release_date', 'status', 'log_revenue'], axis=1)
    test = test.drop(['homepage', 'imdb_id', 'poster_path', 'release_date', 'status'], axis=1)

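Then drop features that have only one unique value:

    for col in train.columns:
        if train[col].nunique()==1:
            print(col)
            # drop() returns a new DataFrame, so assign the result back
            train=train.drop([col],axis=1)
            # remove the same column from test; ignore it if test lacks the column
            test=test.drop([col],axis=1,errors='ignore')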
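
`DataFrame.drop` returns a new frame rather than modifying the original, so the result has to be assigned back for the column to actually disappear. A minimal sketch on toy frames (names and values are invented) that collects the constant columns first and then drops them from both sets:

```python
import pandas as pd

# Toy frames; 'status' is constant in train but not in test
train_toy = pd.DataFrame({"budget": [1, 2, 3], "status": ["Released"] * 3})
test_toy = pd.DataFrame({"budget": [4, 5], "status": ["Released", "Rumored"]})

# Decide which columns to drop based on the training set only
constant_cols = [c for c in train_toy.columns if train_toy[c].nunique() == 1]

# Assign the result back; errors='ignore' skips columns missing from test
train_toy = train_toy.drop(columns=constant_cols)
test_toy = test_toy.drop(columns=constant_cols, errors="ignore")
print(constant_cols, list(train_toy.columns), list(test_toy.columns))
```

Collecting the names before dropping keeps train and test aligned on the same set of columns.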

Encode the categorical labels:

    for col in ['original_language','collection_name','all_genres']:
        le=LabelEncoder()
        le.fit(list(train[col].fillna(''))+list(test[col].fillna('')))
        train[col]=le.transform(train[col].fillna('').astype(str))
        test[col]=le.transform(test[col].fillna('').astype(str))
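
The encoder is fitted on train and test together so that a category appearing only in the test set still gets a code. A small sketch with made-up language values shows the effect:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up values; 'ja' appears only in the test set
train_lang = pd.Series(["en", "fr", None, "en"])
test_lang = pd.Series(["ja", "en"])

le = LabelEncoder()
# Fit on the union of both sets, with missing values mapped to ''
le.fit(pd.concat([train_lang, test_lang]).fillna('').astype(str))

train_codes = le.transform(train_lang.fillna('').astype(str))
test_codes = le.transform(test_lang.fillna('').astype(str))
print(list(le.classes_), list(train_codes), list(test_codes))
```

Had the encoder been fitted on the training values alone, transforming `'ja'` would raise an error for an unseen label. The integer codes follow the sorted order of the classes and carry no meaningful ordering, which tree-based models tolerate better than linear ones.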

Convert text into features:

    train_texts = train[['title', 'tagline', 'overview', 'original_title']]
    test_texts = test[['title', 'tagline', 'overview', 'original_title']]
    for col in ['title','tagline','overview','original_title']:
        train['len_'+col]=train[col].fillna('').apply(lambda x: len(str(x)))
        train['words_'+col]=train[col].fillna('').apply(lambda x: len(str(x).split(' ')))
        test['len_'+col]=test[col].fillna('').apply(lambda x: len(str(x)))
        test['words_'+col]=test[col].fillna('').apply(lambda x: len(str(x).split(' ')))
        train=train.drop(col,axis=1)
        test=test.drop(col,axis=1)
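
To see what these length features look like, here is a sketch on one real title from the sample row and one missing value. It also shows a detail worth knowing: `split(' ')` returns one empty token for an empty string, so a missing text gets a word count of 1, whereas `split()` with no argument would give 0.

```python
import pandas as pd

titles = pd.Series(["Hot Tub Time Machine 2", None])

# Character length and word count, computed the same way as in the loop above
len_title = titles.fillna('').apply(lambda x: len(str(x)))
words_title = titles.fillna('').apply(lambda x: len(str(x).split(' ')))

# Alternative word count: split() without an argument ignores empty strings
words_title_ws = titles.fillna('').apply(lambda x: len(str(x).split()))
print(list(len_title), list(words_title), list(words_title_ws))
```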

Training data and test data:

    X=train.drop(['id','revenue'],axis=1)
    y=np.log1p(train['revenue'])
    X_test=test.drop(['id'],axis=1)
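
Because the target is `log1p(revenue)`, the model predicts in log space, and predictions need `np.expm1` to get back to revenue. RMSE computed on the log target is the same quantity as the root mean squared logarithmic error on the original scale. A sketch with hypothetical predictions (the first revenue is from the sample row, the rest of the numbers are invented):

```python
import numpy as np

revenue = np.array([12314651.0, 0.0, 95149435.0])
y_log = np.log1p(revenue)                      # training target

# Hypothetical model output in log space
pred_log = y_log + np.array([0.1, 0.05, -0.2])

# RMSE in log space
rmse_log = float(np.sqrt(np.mean((pred_log - y_log) ** 2)))

# Map predictions back to revenue, then compute RMSLE on the original scale
pred_revenue = np.expm1(pred_log)
rmsle = float(np.sqrt(np.mean((np.log1p(pred_revenue) - np.log1p(revenue)) ** 2)))
print(rmse_log, rmsle)
```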

Model training:

    # Hold out 10% of the training data for validation; X_test stays reserved for the Kaggle test set
    X_train,X_valid,y_train,y_valid=train_test_split(X,y,test_size=0.1)
    # rmse: root mean square error, (sum(d^2)/N)^0.5
    params = {'num_leaves': 30,
             'min_data_in_leaf': 20,
             'objective': 'regression',
             'max_depth': 5,
             'learning_rate': 0.01,
             "boosting": "gbdt",
             "feature_fraction": 0.9,
             "bagging_freq": 1,
             "bagging_fraction": 0.9,
             "bagging_seed": 11,
             "metric": 'rmse',
             "lambda_l1": 0.2,
             "verbosity": -1}
    # nthread sets the number of worker threads
    model1=lgb.LGBMRegressor(**params,n_estimators=20000,nthread=4)
    # Report RMSE every 1000 rounds and stop once validation RMSE has not improved for 200 rounds
    model1.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_valid, y_valid)], eval_metric='rmse',verbose=1000, early_stopping_rounds=200)

    Training until validation scores don't improve for 200 rounds.
    [1000]  training's rmse: 1.42756  valid_1's rmse: 2.07259
    Early stopping, best iteration is:
    [1118]  training's rmse: 1.38621  valid_1's rmse: 2.06726

After training, the feature weights:

posted @ 2021-07-01 20:41 老酱
