Introduction to Data Science with Python

Posted on 2024-09-10 09:11 n24j7b

FINAL ASSESSMENT. Introduction to Data Science with Python

General Instructions

This is the final assessment for the course. You need to download the datasets provided to answer the questions.

The 5 datasets named 'World_Happiness_Report' (there is one for each year of data) are used for Part A of this assessment, which corresponds to Modules 3 to 5 and accounts for 50% of the grade.

The dataset 'Air Passengers' is used in the final part of the assessment, which corresponds to Module 6 and accounts for 20% of the grade.

Modules 1 and 2 are implicitly evaluated as part of the assessment, as you are expected to use Python and Jupyter notebooks for data analysis and presentation. The correctness, quality and clarity of your code will account for 20% of the grade.

You are expected to complete the assessment using a Jupyter Notebook. Use the notebook to load the data, perform the calculations, and present the results using Markdown or text cells. The clarity and presentation of the notebook will also be part of the grading criteria (10% of the grade).

When you finish the assessment, please submit the .ipynb notebook. Make sure that all cells work correctly, as the professor will re-run all cells before evaluating the assessment. To simplify the correction process, please place the data files in the same directory as the Jupyter notebook. This way, the cells loading the data files will work directly in the professor's environment without the need to change the file path.

IMPORTANT

Put your name and your student code (if you know it) at the beginning of your Jupyter notebook and, if possible, also in the title of the notebook. For instance:

Final_Assesment_Planas_Bielsa_240784.ipynb

==================================================

PART A - Understanding the Sources of Happiness

(datasets used: all 5 CSV files - World Happiness Report_yyyy.csv)

Context:

The 'World Happiness Report' dataset is a comprehensive collection of data that measures happiness levels in different countries worldwide. It provides valuable insights into the factors that contribute to happiness and offers a global perspective on well-being. The dataset includes various variables that assess economic indicators, social support, health, freedom, trust, and generosity, among other factors. With observations spanning multiple years, it allows for temporal analysis and identification of happiness trends across different countries.

Each entry in the dataset represents a specific country and contains a wealth of information for comparative analysis. Researchers, policymakers, and individuals interested in understanding happiness levels can leverage this dataset to explore the underlying factors and their variations across different regions. By analysing the World Happiness Report dataset, one can gain insights into the determinants of happiness and uncover potential strategies to promote well-being and enhance the quality of life worldwide.

To perform the analysis, you should load all five CSV files corresponding to the years 2015 to 2019 into your Jupyter Notebook. These datasets will provide the necessary data to answer the questions and gain a comprehensive understanding of happiness trends and influencing factors across different years.
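The loading step described above could be sketched as follows. This is only an illustration under stated assumptions: the file pattern is taken from the Part A header ("World Happiness Report_yyyy.csv"), the helper names load_happiness_reports and combine_reports are hypothetical, and the real column names may differ between years and need harmonising before the frames are compared.

```python
from pathlib import Path

import pandas as pd

# Assumption: the files follow the pattern named in the Part A header,
# "World Happiness Report_<year>.csv", and sit next to the notebook.
FILE_PATTERN = "World Happiness Report_{year}.csv"
YEARS = range(2015, 2020)  # 2015 to 2019 inclusive


def load_happiness_reports(data_dir="."):
    """Read one CSV per year and return a dict {year: DataFrame}."""
    data_dir = Path(data_dir)
    reports = {}
    for year in YEARS:
        frame = pd.read_csv(data_dir / FILE_PATTERN.format(year=year))
        frame["year"] = year  # remember which file each row came from
        reports[year] = frame
    return reports


def combine_reports(reports):
    """Stack the yearly frames; columns absent in a year become NaN."""
    return pd.concat(reports.values(), ignore_index=True)
```

Calling load_happiness_reports() with no argument reads from the current directory, which matches the instruction to keep the data files next to the notebook.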

QUESTIONS PART A

Descriptive Statistics

1) Identify and describe the types of variables present in the World Happiness Report dataset. Categorize each variable as discrete, categorical, ordinal, or continuous.

2) Calculate the mean, median, variance, and standard deviation for each continuous variable in the dataset for each year from 2015 to 2019.

3) Create a separate line plot for each continuous variable, showing the temporal evolution of the mean from 2015 to 2019. Each plot should have the year on the x-axis and the mean on the y-axis.

4) Generate two histograms in the same plt.figure, one for the Happiness Score in 2015 and another for the Happiness Score in 2019. Discuss any observed differences or similarities.

5) Calculate the correlation coefficients between the Happiness_Index and the continuous variables (economy, family, health, freedom, trust, generosity). For each year, which variable has the highest positive correlation with happiness, and which variable has the highest negative correlation with happiness?
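As a rough illustration of the per-year statistics and correlations asked for above, the sketch below uses synthetic data. The column names (economy, freedom, happiness_score, year) are assumptions made for the example and are not the real headers of the World Happiness Report files.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names are illustrative assumptions.
rng = np.random.default_rng(0)
rows = []
for year in range(2015, 2020):
    economy = rng.uniform(0.0, 1.5, size=50)
    freedom = rng.uniform(0.0, 0.7, size=50)
    score = 3.0 + 2.0 * economy + 1.0 * freedom + rng.normal(0.0, 0.3, size=50)
    rows.append(pd.DataFrame({"year": year, "economy": economy,
                              "freedom": freedom, "happiness_score": score}))
df = pd.concat(rows, ignore_index=True)

continuous = ["economy", "freedom", "happiness_score"]
factors = ["economy", "freedom"]

# Mean, median, variance and standard deviation per year and per variable.
# pandas computes the sample (ddof=1) variance and standard deviation.
summary = df.groupby("year")[continuous].agg(["mean", "median", "var", "std"])

# Pearson correlation of each factor with the happiness score, year by year.
correlations = pd.DataFrame({
    year: group[factors].corrwith(group["happiness_score"])
    for year, group in df.groupby("year")
}).T

# Factor with the largest correlation in each year.
strongest = correlations.idxmax(axis=1)
```

The table of yearly means, df.groupby("year")[continuous].mean(), has the year as its index, so it can be passed to a plotting call to obtain one line per variable.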

Modelling (use the LinearRegression model from the scikit-learn library)

1) For each year in the dataset, create a unidimensional linear model where the independent variable is one of the factors (economy, family, health, freedom, trust, generosity), and the dependent variable is the happiness score. Report the coefficients and intercept for each linear model and calculate the coefficient of determination (R^2) for each model.

2) Identify which variable explains the happiness score better by itself for each year. Is this variable the same every year, or does it vary?

3) Build a multidimensional linear model using all the variables (economy, family, health, freedom, trust, generosity) as independent variables and the happiness score as the dependent variable for each year. Then, compare the accuracy of the multidimensional model with the unidimensional models. Does the inclusion of all variables improve the accuracy of the model?
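The modelling steps above might look roughly like this for a single year. The data is synthetic, the column names are assumptions, and fit_and_report is a hypothetical helper; only LinearRegression, its coef_ and intercept_ attributes and its score method come from scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic single-year data; column names are illustrative assumptions.
rng = np.random.default_rng(1)
n = 150
data = pd.DataFrame({
    "economy": rng.uniform(0.0, 1.5, n),
    "health": rng.uniform(0.0, 1.0, n),
    "freedom": rng.uniform(0.0, 0.7, n),
})
data["happiness_score"] = (2.5 + 1.8 * data["economy"] + 1.0 * data["health"]
                           + 0.8 * data["freedom"] + rng.normal(0.0, 0.3, n))

factors = ["economy", "health", "freedom"]
y = data["happiness_score"]


def fit_and_report(columns):
    """Fit happiness_score ~ columns; return coefficients, intercept, R^2."""
    X = data[columns]  # 2-D input, even when there is a single factor
    model = LinearRegression().fit(X, y)
    return {
        "coefficients": dict(zip(columns, model.coef_)),
        "intercept": model.intercept_,
        "r2": model.score(X, y),  # R^2 on the data used for fitting
    }


# One unidimensional model per factor, then one model with all factors.
single = {name: fit_and_report([name]) for name in factors}
multi = fit_and_report(factors)
best_single = max(single, key=lambda name: single[name]["r2"])
```

Note that the in-sample R^2 of an ordinary least squares fit cannot decrease when variables are added, so a comparison on held-out data or with adjusted R^2 says more about whether the extra variables really help.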

Inference and Hypothesis Testing

1) Consider the population to consist of all the countries listed. For the 2019 data, randomly select a sample of 30 countries from the population. Compute a 95% confidence interval for the mean Happiness Score based on this sample.

2) When constructing the dataset, various variables were considered, including the annual sunshine duration in hours for each country. Although the data was not ultimately included, we know that the standard deviation of the values was 200 hours. The newspaper claimed that the population average was 2500 hours or less. To investigate this claim, a new measurement was conducted last year on a small sample of 50 randomly selected countries, and the sample average was found to be 2580 hours. Formulate the null and alternative hypotheses to test the claim and determine whether this should be a one-sided or two-sided test. Based on the sampled data, state whether the null hypothesis should be rejected or not.
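The two inference questions above rest on standard formulas, sketched here under stated assumptions. The confidence interval uses a Student t critical value because the standard deviation of the score is estimated from the sample, and it runs on synthetic scores rather than the 2019 file. The test uses the numbers given in the question with a significance level of 0.05, which the assessment does not specify.

```python
import math

import numpy as np
from scipy import stats

# --- 95% confidence interval for a mean from a sample of 30 ---------------
# Synthetic "population" of scores (size and parameters are arbitrary).
rng = np.random.default_rng(42)
population = rng.normal(loc=5.4, scale=1.1, size=150)
sample = rng.choice(population, size=30, replace=False)

n = sample.size
mean = sample.mean()
sem = sample.std(ddof=1) / math.sqrt(n)   # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)      # two-sided 95% critical value
ci_low, ci_high = mean - t_crit * sem, mean + t_crit * sem

# --- One-sided z-test with known standard deviation -----------------------
# H0: mu <= 2500 (the newspaper's claim), H1: mu > 2500 (right-tailed).
mu0, sigma, n_obs, x_bar = 2500.0, 200.0, 50, 2580.0
z = (x_bar - mu0) / (sigma / math.sqrt(n_obs))
p_value = stats.norm.sf(z)                 # P(Z >= z) under H0
alpha = 0.05                               # assumed significance level
reject_h0 = p_value < alpha
```

With real data, pandas' DataFrame.sample(n=30, random_state=...) is one way to draw the 30 countries reproducibly, so that the professor's re-run gives the same interval.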

==================================================

PART B - Forecasting air passengers.

(dataset used: AirPassengers.csv)

Context:

Part B of this assessment focuses on the famous 'Air Passengers' dataset, which provides monthly data on the number of air passengers from 1949 to 1960. This dataset is commonly used in time series analysis and forecasting tasks, and it presents a valuable opportunity to study the trends and patterns in air travel over a specific time period, allowing for the exploration of various time series analysis techniques. This dataset is particularly useful in understanding the growth, seasonality, and long-term trends in air travel demand.

By analysing the 'Air Passengers' dataset, we can gain insights into the factors that influence air travel, identify patterns in passenger behaviour, and make predictions about future passenger numbers. This dataset is widely used in the field of transportation analysis, helping airlines, airports, and policymakers make informed decisions about capacity planning, marketing strategies, and resource allocation.

QUESTIONS PART B

Time Series

1) Load the dataset and plot the time series. Which decomposition method do you think will be more appropriate for this dataset, additive or multiplicative? Justify your answer.

2) Perform a seasonal decomposition of the 'Air Passengers' dataset using the method that you selected as more appropriate in your previous answer. Visualize the trend, seasonality, and residual components. Make a plot (or several plots) to show all the different components of the time series.

3) Can you estimate a prediction for the number of passengers in December 1961?

================ END ASSESSMENT ================

Acknowledgement of data sources.

The 'World Happiness Report' dataset is an open-source dataset freely available to the public. The World Happiness Report is an annual publication by the Sustainable Development Solutions Network, and the dataset can be accessed from their official website or platforms like Kaggle.

The 'Air Passengers' dataset is a publicly available dataset widely used in time series analysis. It provides monthly data on the number of air passengers from 1949 to 1960 and is often used as a benchmark dataset for forecasting and analysing time series data. The dataset can be accessed from various sources, including the 'datasets' package in R, Python's seaborn library (as the 'flights' dataset), or other online data repositories.
