Python Tutorial: More Advanced Data Wrangling

2017-12-12 15:18 nuswgg

Drop observations with missing information.

import numpy as np
import pandas as pd

# Notice the use of the fish data set because it has some missing
# observations
fish = pd.read_csv('/Users/fish.csv')
# First sort by Weight, requesting those with NA for Weight first 
fish = fish.sort_values(by='Weight', kind='mergesort', na_position='first')
print(fish.head())

new_fish = fish.dropna()
print(new_fish.head())

pandas.DataFrame.dropna
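Called with no arguments, dropna() removes every row that has a missing value in any column, which can discard more than intended. The following is a minimal sketch of how the `subset` and `how` parameters narrow that behaviour. The `toy` frame and its values are made up for illustration and are not taken from fish.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the fish data; values are illustrative only
toy = pd.DataFrame({
    "Species": ["Bream", "Roach", "Pike", "Perch"],
    "Weight": [242.0, np.nan, 200.0, np.nan],
    "Length": [23.2, 12.9, np.nan, np.nan],
})

# Default: drop a row if ANY column is missing
any_missing = toy.dropna()

# Drop a row only if Weight is missing
weight_only = toy.dropna(subset=["Weight"])

# Drop a row only if ALL of the listed columns are missing
all_missing = toy.dropna(subset=["Weight", "Length"], how="all")

print(len(any_missing), len(weight_only), len(all_missing))
```

Which variant is appropriate depends on which columns the later analysis actually needs.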

Merge two data sets together on a common variable.

# Notice the use of the student data set again; however, we want to reload it
# without the changes we've made previously
student = pd.read_csv('/Users/class.csv')

a) First, select specific columns of a data set to create two smaller data sets.

student1 = pd.concat([student["Name"], student["Sex"], student["Age"]], axis = 1)
print(student1.head())

student2 = pd.concat([student["Name"], student["Height"], student["Weight"]], axis = 1)
print(student2.head())

b) Second, we want to merge the two smaller data sets on the common variable.

new = pd.merge(student1, student2, on="Name")
print(new.head())

c) Finally, we want to check to see if the merged data set is the same as the original data set.

print(student.equals(new))
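The equality check above can only succeed when every Name appears in both pieces, because pd.merge() defaults to an inner join and keeps only matching keys. As a rough sketch of what happens otherwise, the example below uses two small made-up frames (`left` and `right` are hypothetical, not the class data) and the `how` and `indicator` parameters of pd.merge():

```python
import pandas as pd

# Hypothetical frames whose keys only partly overlap
left = pd.DataFrame({"Name": ["Alfred", "Alice", "Barbara"],
                     "Age": [14, 13, 13]})
right = pd.DataFrame({"Name": ["Alice", "Barbara", "Carol"],
                      "Height": [56.5, 65.3, 62.8]})

# Default how="inner": only names present in both frames survive
inner = pd.merge(left, right, on="Name")

# how="outer" keeps every name; indicator=True adds a _merge column
# recording where each row came from
outer = pd.merge(left, right, on="Name", how="outer", indicator=True)

print(inner)
print(outer)
```

Inspecting the `_merge` column is one way to spot unmatched rows before comparing the result with the original.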

Merge two data sets together by index number only.

a) First, select specific columns of a data set to create two smaller data sets.

newstudent1 = pd.concat([student["Name"], student["Sex"], student["Age"]], axis = 1)
print(newstudent1.head())

newstudent2 = pd.concat([student["Height"], student["Weight"]], axis = 1)
print(newstudent2.head())

b) Second, we want to join the two smaller data sets.

new2 = newstudent1.join(newstudent2)

print(new2.head())

c) Finally, we want to check to see if the joined data set is the same as the original data set.

print(student.equals(new2))
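Note that join() lines rows up by index label rather than by physical position. The two coincide above because both pieces keep the original index, but they can diverge after sorting or filtering. A small sketch with made-up data (the frames and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frames sharing the default index 0, 1, 2
left = pd.DataFrame({"Name": ["Alfred", "Alice", "Barbara"]})
right = pd.DataFrame({"Height": [69.0, 56.5, 65.3]})

# Sorting reorders the rows but keeps the original index labels
shuffled = right.sort_values("Height")

# join() aligns on index labels, so each name still gets its own height
joined = left.join(shuffled)

# Resetting the index first forces alignment by position instead,
# which pairs names with the wrong heights here
by_position = pd.concat([left, shuffled.reset_index(drop=True)], axis=1)

print(joined)
print(by_position)
```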

Create a pivot table to summarize information about a data set.

# Notice we are using a new data set that needs to be read into the
# environment
price = pd.read_csv('/Users/price.csv')
# The following code is used to remove the "," and "$" characters from
# the ACTUAL column so that the values can be summed
from re import sub
from decimal import Decimal
def trim_money(money):
    return float(Decimal(sub(r'[^\d.]', '', money)))
price["REVENUE"] = price["ACTUAL"].apply(trim_money)
# aggfunc="sum" gives the same totals as np.sum and avoids the
# deprecation warning raised by newer pandas versions
table = pd.pivot_table(price, index=["COUNTRY", "STATE", "PRODTYPE", "PRODUCT"], values="REVENUE",
                       aggfunc="sum")
print(table.head())

pd.pivot_table() pd.pivot()
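To make the two steps above easier to follow without price.csv, here is a minimal sketch on a made-up frame. The `sales` data is hypothetical. The vectorized `.str.replace(...)` call is an alternative to the trim_money() helper rather than the method used above, and the groupby line is included only to show that the pivot table is a grouped sum.

```python
import pandas as pd

# Hypothetical data shaped like the price data set; values are made up
sales = pd.DataFrame({
    "COUNTRY": ["U.S.A.", "U.S.A.", "U.S.A.", "Canada"],
    "STATE": ["California", "California", "Texas", "Ontario"],
    "ACTUAL": ["$1,000.00", "$500.50", "$70.00", "$30.00"],
})

# Strip everything except digits and the decimal point, then convert
sales["REVENUE"] = (
    sales["ACTUAL"].str.replace(r"[^\d.]", "", regex=True).astype(float)
)

# Sum REVENUE for each COUNTRY/STATE combination
by_state = pd.pivot_table(sales, index=["COUNTRY", "STATE"],
                          values="REVENUE", aggfunc="sum")

# The same totals expressed as a groupby
by_group = sales.groupby(["COUNTRY", "STATE"])["REVENUE"].sum()

print(by_state)
```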

Return all unique values from a text variable.

print(np.unique(price["STATE"]))

np.unique()
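np.unique() returns the distinct values in sorted order, whereas the pandas method Series.unique() keeps them in order of first appearance. A short sketch on a made-up Series (`states` is hypothetical, not the price data) also shows the `return_counts` option:

```python
import numpy as np
import pandas as pd

# Hypothetical text column
states = pd.Series(["Texas", "California", "Texas", "Ontario"])

# NumPy: distinct values, sorted
sorted_unique = np.unique(states)

# pandas: distinct values, in order of first appearance
in_order = states.unique()

# return_counts=True also reports how often each value occurs
values, counts = np.unique(states, return_counts=True)

print(list(sorted_unique))
print(list(in_order))
print(dict(zip(values, counts)))
```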
