[Course] Introduction to Data Science in Python, Week 3

Merging Dataframes

When we want to join the DataFrames together, we have some choices to make.
outer join: a union. In the Venn diagram, it represents everyone in any circle.

pd.merge(dataframe1, dataframe2, how='outer', left_index=True, right_index=True)

inner join: the intersection. In the Venn diagram, this is the overlapping part of the two circles.

pd.merge(df1, df2, how='inner', left_index=True, right_index=True)
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liasion'},
                         {'Name': 'James', 'Role': 'Grader'}])
staff_df = staff_df.set_index('Name')
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])
student_df = student_df.set_index('Name')
print(staff_df.head())
print()
print(student_df.head())
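With staff_df and student_df defined and indexed by Name, the outer and inner joins above can be run on them directly (a quick sketch; the results follow from the sample rows above):

# Union of the two indexes: James, Kelly, Mike, Sally (missing fields become NaN)
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)

# Intersection: only Sally and James appear in both frames
pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True)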

left join: we want to get a list of all staff regardless of whether they were students or not. But if they were students, we would want to get their student details as well.

pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)

right join: we want a list of all of the students and their roles if they were also staff.

pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)

In the merged table, a column suffixed with _x comes from the left DataFrame and one suffixed with _y comes from the right DataFrame.
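A minimal sketch of where these suffixes come from (the frames and the shared 'Location' column are made up for illustration): when both inputs carry a non-key column with the same name, pandas appends _x and _y, and the suffixes parameter lets you choose your own.

import pandas as pd

staff_df = pd.DataFrame([{'Name': 'Kelly', 'Location': 'State Street'},
                         {'Name': 'Sally', 'Location': 'Washington Avenue'}])
student_df = pd.DataFrame([{'Name': 'Kelly', 'Location': '512 Wilson Crescent'},
                           {'Name': 'Sally', 'Location': '123 Main Street'}])

# The overlapping non-key column 'Location' becomes Location_x (left) and Location_y (right)
print(pd.merge(staff_df, student_df, how='inner', on='Name'))

# Custom suffixes instead of the default _x / _y
print(pd.merge(staff_df, student_df, how='inner', on='Name', suffixes=('_staff', '_student')))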

Eg. Here are two DataFrames, products and invoices. The product DataFrame has an identifier and a sticker price. The invoices DataFrame lists the people, product identifiers, and quantity. Assuming that we want to generate totals, how do we join these two DataFrames together so that we have one which lists all of the information we need?

print(pd.merge(products, invoices, left_index=True, right_on='Product ID'))
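The products and invoices DataFrames are not shown in these notes; a minimal sketch with invented IDs, prices, and quantities shows how the merge lines up and how a total could then be computed:

import pandas as pd

products = pd.DataFrame({'Price': [4.0, 3.5, 8.0]}, index=[4109, 1412, 8931])
products.index.name = 'Product ID'
invoices = pd.DataFrame({'Customer': ['Ali', 'Eric', 'Ande', 'Sam'],
                         'Product ID': [4109, 1412, 8931, 4109],
                         'Quantity': [1, 5, 3, 3]})

# Product ID is the index of the left frame and a column of the right frame
merged = pd.merge(products, invoices, left_index=True, right_on='Product ID')
merged['Total'] = merged['Price'] * merged['Quantity']
print(merged)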
staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 'Role': 'Director of HR'},
                         {'First Name': 'Sally', 'Last Name': 'Brooks', 'Role': 'Course liasion'},
                         {'First Name': 'James', 'Last Name': 'Wilde', 'Role': 'Grader'}])
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 'School': 'Business'},
                           {'First Name': 'Mike', 'Last Name': 'Smith', 'School': 'Law'},
                           {'First Name': 'Sally', 'Last Name': 'Brooks', 'School': 'Engineering'}])
staff_df
student_df
pd.merge(staff_df, student_df, how='inner', left_on=['First Name','Last Name'], right_on=['First Name','Last Name'])

left_on: column(s) or index level(s) in the left DataFrame to use as join keys. Each key can be a column name, an index level name, or an array whose length equals the length of the DataFrame.
left_index: if True, use the index (row labels) of the left DataFrame as its join key.
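A small sketch contrasting the two parameters (the frames are made up; the staff side is indexed by name, the student side keeps Name as a plain column):

import pandas as pd

staff_df = pd.DataFrame({'Role': ['Director of HR', 'Grader']},
                        index=['Kelly', 'James'])           # key lives in the index
student_df = pd.DataFrame({'Name': ['James', 'Mike'],
                           'School': ['Business', 'Law']})  # key lives in a column

# Left key taken from the index, right key from the 'Name' column
print(pd.merge(staff_df, student_df, how='left', left_index=True, right_on='Name'))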

Pandas Idioms

Pandorable

(df.where(df['SUMLEV']==50)
    .dropna()
    .set_index(['STNAME','CTYNAME'])
    .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))

Eg. Suppose we are working on a DataFrame that holds information on our equipment for an upcoming backpacking trip. Can you use method chaining to modify the DataFrame df in one statement to drop any entries where ‘Quantity’ is 0 and rename the column ‘Weight’ to ‘Weight (oz.)’?

print(df.where(df['Quantity'] != 0)
        .dropna()
        .rename(columns={'Weight': 'Weight (oz.)'}))
# OR
print(df.drop(df[df['Quantity'] == 0].index).rename(columns={'Weight': 'Weight (oz.)'}))

Group by

Split

df = df.set_index('STNAME')

def fun(item):
    # Bucket each row by the first letter of STNAME (the index label)
    if item[0] < 'M':
        return 0
    if item[0] < 'Q':
        return 1
    return 2

for group, frame in df.groupby(fun):
    print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing.')

Apply

df.groupby('STNAME').agg({'CENSUS2010POP': np.average})
print(df.groupby('Category').apply(lambda df,a,b: sum(df[a] * df[b]), 'Weight (oz.)', 'Quantity'))
Or alternatively, without using a lambda:
# def totalweight(df, w, q):
#     return sum(df[w] * df[q])
#
# print(df.groupby('Category').apply(totalweight, 'Weight (oz.)', 'Quantity'))

agg applies a function to one column (a Series) at a time within each group and returns a scalar; that is, each call receives a single column of group data and aggregates it down to one value.
apply is more general: it implements the full split-apply-combine pattern. The whole sub-DataFrame of each group is passed in at once, and the function may return multi-dimensional output.
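A small sketch of the difference, using a made-up equipment DataFrame (the column names and values are assumptions for illustration):

import pandas as pd

df = pd.DataFrame({'Category': ['Pack', 'Pack', 'Tent', 'Tent'],
                   'Weight (oz.)': [20.0, 12.0, 35.0, 40.0],
                   'Quantity': [1, 2, 1, 1]})

# agg: each selected column is reduced to a scalar per group
print(df.groupby('Category')['Weight (oz.)'].agg('mean'))

# apply: the whole sub-DataFrame of each group is passed in,
# so several columns can be combined in one step
print(df.groupby('Category').apply(lambda g: (g['Weight (oz.)'] * g['Quantity']).sum()))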

A good page that explains groupby in more depth: "Using Pandas Groupby".

Scales

Four scales


  • Ratio scale: units are equally spaced and there is a true zero, so all four arithmetic operations are meaningful. E.g. height and weight.
  • Interval scale: units are equally spaced, but there is no true zero, so multiplication and division are not meaningful. E.g. temperature in Celsius or Fahrenheit.
  • Ordinal scale: the order of the units is important, but they are not evenly spaced. Letter grades such as A+ and A are a good example.
  • Nominal scale: categories of data with no order with respect to one another. E.g. teams of a sport.
df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                  index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good',
                         'ok', 'ok', 'ok', 'poor', 'poor'])
df.rename(columns={0:'Grades'},inplace = True)
df
# Newer pandas: pass a CategoricalDtype (astype('category', categories=...) has been removed)
grades = df['Grades'].astype(pd.CategoricalDtype(
    categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
    ordered=True))
grades.head()

The categories are ordered from lowest to highest. Without ordered=True there is no meaningful ranking; if the grades were compared as plain strings (lexicographic order), both C- and C+ would come out greater than C.

grades > 'C'

E.g. Try casting this series to categorical with the ordering Low < Medium < High.

s = pd.Series(['Low', 'Low', 'High', 'Medium', 'Low', 'High', 'Low'])

# Your code here
s.astype(pd.CategoricalDtype(categories=['Low', 'Medium', 'High'], ordered=True))
df = pd.read_csv('census.csv')
df = df[df['SUMLEV'] == 50]
df = df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg(avg=np.average)
pd.cut(df['avg'], 10)   # bin the state average populations into 10 equal-width intervals

Build categories

E.g. Suppose we have a series that holds height data for jacket wearers. Use pd.cut to bin this data into 3 bins.

s = pd.Series([168, 180, 174, 190, 170, 185, 179, 181, 175, 169, 182, 177, 180, 171])

# Your code here
pd.cut(s,3)

# You can also add labels for the sizes [Small < Medium < Large].
pd.cut(s, 3, labels=['Small', 'Medium', 'Large'])

Pivot Tables

df=pd.read_csv('cars.csv')
df.head()
df.pivot_table(values='(kW)',index='YEAR',columns='Make',aggfunc=np.mean)

E.g. Suppose we have a DataFrame with price and ratings for different bikes, broken down by manufacturer and type of bicycle. Create a pivot table that shows the mean price and mean rating for every ‘Manufacturer’ / ‘Bike Type’ combination.

#print(Bikes)
# Your code here
import numpy as np
print(Bikes.pivot_table(index=['Manufacturer', 'Bike Type'], aggfunc=np.mean))
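The Bikes DataFrame itself is not shown in the notes; a minimal made-up version (all values invented) that the call above could run on:

import pandas as pd

Bikes = pd.DataFrame({'Manufacturer': ['Trek', 'Trek', 'Giant', 'Giant'],
                      'Bike Type': ['Road', 'Mountain', 'Road', 'Mountain'],
                      'Price': [2000, 1500, 1800, 1300],
                      'Rating': [4.5, 4.0, 4.2, 3.9]})

# Mean price and mean rating for every Manufacturer / Bike Type combination
print(Bikes.pivot_table(index=['Manufacturer', 'Bike Type'], aggfunc='mean'))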

If we set the parameter margins=True, each aggregation function gains an additional 'All' category, which shows the overall mean and overall min for a given year and a given vendor.

df.pivot_table(values='(kW)', index='YEAR', columns='Make', aggfunc=[np.mean, np.min], margins=True)

Date Functionality

Pandas has four main time-related classes: Timestamp, DatetimeIndex, Period, and PeriodIndex.

Timestamp

pd.Timestamp('9/1/2016 10:05AM')
>>> Timestamp('2016-09-01 10:05:00')

Period

Period represents a span of time (a month, a day, etc.), not a point in time.

pd.Period('1/2016')
>>> Period('2016-01', 'M')     # month granularity
pd.Period('3/5/2016')
>>> Period('2016-03-05', 'D')  # day granularity

DatetimeIndex

t1=pd.Series(list('abc'),[pd.Timestamp('2016-09-01'),pd.Timestamp('2016-09-02'),pd.Timestamp('2016-09-03')])
t1
>>> the three dates map to 'a', 'b', 'c' respectively; the index of t1 is a DatetimeIndex

PeriodIndex

t2 = pd.Series(list('def'), [pd.Period('2016-09'), pd.Period('2016-10'), pd.Period('2016-11')])
t2
>>> the three monthly periods map to 'd', 'e', 'f' respectively; the index of t2 is a PeriodIndex

Converting to Datetime

pd.to_datetime()   # parse a string, list, or Series of dates into Timestamps
dayfirst=True      # interpret ambiguous dates as day-first (European) format
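A short sketch of both points (the date strings are made up for illustration):

import pandas as pd

# Parse a Series of date strings into Timestamps
s = pd.Series(['June 2, 2013', 'June 3, 2013', 'June 4, 2013'])
print(pd.to_datetime(s))

# Day-first (European) interpretation: '4.7.12' is read as 4 July 2012
print(pd.to_datetime('4.7.12', dayfirst=True))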

Timedeltas

pd.Timestamp('9/3/2016') - pd.Timestamp('9/1/2016')
>>> Timedelta('2 days 00:00:00')
pd.Timestamp('9/2/2016 8:10AM') + pd.Timedelta('12D 3H')
>>> Timestamp('2016-09-14 11:10:00')

Working with Dates in a DataFrame

import numpy as np

dates = pd.date_range('10-01-2016', periods=9, freq='2W-SUN')
df = pd.DataFrame({'Count 1': 100 + np.random.randint(-5, 10, 9).cumsum(),
                   'Count 2': 120 + np.random.randint(-5, 10, 9)}, index=dates)
df.index.day_name()       # weekday name of each date (weekday_name was removed in newer pandas)
df.diff()                 # difference between consecutive rows
df.resample('M').mean()   # monthly mean of each count
df.loc['2017']            # partial string indexing: every row in 2017
df.loc['2016-12']         # every row in December 2016
df.loc['2016-12':]        # rows from December 2016 onward

import matplotlib.pyplot as plt
%matplotlib inline
df.plot() #visualize count 1 and count 2