Python Tutorial: More Advanced Data Wrangling
2017-12-12 15:18 nuswgg
- Drop observations with missing information.
import numpy as np
import pandas as pd
# Notice the use of the fish data set because it has some missing
# observations
fish = pd.read_csv('/Users/fish.csv')
# First sort by Weight, requesting those with NA for Weight first
fish = fish.sort_values(by='Weight', kind='mergesort', na_position='first')
print(fish.head())
new_fish = fish.dropna()
print(new_fish.head())
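Since fish.csv may not be at hand, here is a minimal sketch of the same idea on a small made-up data frame. The name `toy` and all of its values are hypothetical. It shows that `dropna()` with no arguments drops a row if any column is missing, while `subset` limits the check to the columns you name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the fish data; the values are made up.
toy = pd.DataFrame({
    "Species": ["Bream", "Perch", "Pike", "Smelt"],
    "Weight": [242.0, np.nan, 200.0, 6.7],
    "Height": [11.5, 6.2, np.nan, 1.7],
})

# Rows with a missing Weight sort to the top, as in the tutorial code.
toy_sorted = toy.sort_values(by="Weight", kind="mergesort", na_position="first")

# Default behaviour: drop every row that has at least one missing value.
complete_rows = toy_sorted.dropna()

# Narrower: drop a row only when Weight itself is missing.
has_weight = toy_sorted.dropna(subset=["Weight"])

print(complete_rows)
print(has_weight)
```

If only Weight matters for the analysis, the `subset` form keeps rows that a plain `dropna()` would throw away.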
- Merge two data sets together on a common variable.
a) First, select specific columns of a data set to create two smaller data sets.
# Notice the use of the student data set again, however we want to reload it
# without the changes we've made previously
student = pd.read_csv('/Users/class.csv')
student1 = pd.concat([student["Name"], student["Sex"], student["Age"]], axis = 1)
print(student1.head())
student2 = pd.concat([student["Name"], student["Height"], student["Weight"]], axis = 1)
print(student2.head())
b) Second, we want to merge the two smaller data sets on the common variable.
new = pd.merge(student1, student2, on="Name")
print(new.head())
Finally, we want to check to see if the merged data set is the same as the original data set.
print(student.equals(new))
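The split, merge, and check steps can be tried without class.csv. The sketch below uses a small made-up data frame (`people` and its values are hypothetical). It also passes `validate="one_to_one"`, which is not in the original code, so that the merge fails loudly if a Name is duplicated, and it shows that `equals()` is sensitive to column order.

```python
import pandas as pd

# Hypothetical stand-in for the class data; the values are made up.
people = pd.DataFrame({
    "Name": ["Alfred", "Alice", "Barbara"],
    "Sex": ["M", "F", "F"],
    "Age": [14, 13, 13],
    "Height": [69.0, 56.5, 65.3],
})

# Two smaller data sets that share the Name column.
left = people[["Name", "Sex", "Age"]]
right = people[["Name", "Height"]]

# validate="one_to_one" raises an error if Name is duplicated on either side.
merged = pd.merge(left, right, on="Name", validate="one_to_one")

# equals() compares values, dtypes, and the order of the columns.
same = people.equals(merged)

# The same columns in a different order do not count as equal.
reordered = merged[["Name", "Height", "Sex", "Age"]]
same_after_reorder = people.equals(reordered)

print(same, same_after_reorder)
```

If `equals()` returns False on real data, checking the column order and dtypes of both data sets is a reasonable first step.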
- Merge two data sets together by index number only.
a) First, select specific columns of a data set to create two smaller data sets.
newstudent1 = pd.concat([student["Name"], student["Sex"], student["Age"]], axis = 1)
print(newstudent1.head())
newstudent2 = pd.concat([student["Height"], student["Weight"]], axis = 1)
print(newstudent2.head())
b) Second, we want to join the two smaller data sets.
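# join() matches rows by index label (assumed here; the join line is missing from the source)
new2 = newstudent1.join(newstudent2)
print(student.equals(new2))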
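Joining by index can be sketched on made-up data as follows (`people`, `left`, and `right` are hypothetical names). It assumes `DataFrame.join()` for the join and shows the equivalent `pd.merge()` call with `left_index` and `right_index`. Note that rows are matched by index label, not by position, so shuffling one side does not change the result.

```python
import pandas as pd

# Hypothetical stand-in for the class data; the values are made up.
people = pd.DataFrame({
    "Name": ["Alfred", "Alice", "Barbara"],
    "Age": [14, 13, 13],
    "Height": [69.0, 56.5, 65.3],
})

# Two smaller data sets with no column in common, only the row index.
left = people[["Name", "Age"]]
right = people[["Height"]]

# join() aligns rows on the index labels.
joined = left.join(right)

# The same result using merge with the index as the key on both sides.
merged = pd.merge(left, right, left_index=True, right_index=True)

# Reverse the row order of one side: the labels still line up.
shuffled = right.iloc[::-1]
joined_shuffled = left.join(shuffled)

print(people.equals(joined), people.equals(merged), people.equals(joined_shuffled))
```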
- Create a pivot table to summarize information about a data set.
# Notice we are using a new data set that needs to be read into the
# environment
price = pd.read_csv('/Users/price.csv')
# The following code is used to remove the "," and "$" characters from
# the ACTUAL column so that the values can be summed
from re import sub
from decimal import Decimal
def trim_money(money):
    return float(Decimal(sub(r'[^\d.]', '', money)))
price["REVENUE"] = price["ACTUAL"].apply(trim_money)
table = pd.pivot_table(price, index=["COUNTRY", "STATE", "PRODTYPE", "PRODUCT"], values="REVENUE",
                       aggfunc=np.sum)
print(table.head())
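The same clean-then-pivot steps can be sketched on made-up data (`sales` and its values are hypothetical). Instead of applying a helper function row by row, this version swaps in the vectorized `Series.str.replace` with `regex=True`, and it passes `aggfunc="sum"` as a string. Treat it as one possible variant, not the tutorial's own method.

```python
import pandas as pd

# Hypothetical stand-in for the price data; the values are made up.
sales = pd.DataFrame({
    "COUNTRY": ["Canada", "Canada", "U.S.A.", "U.S.A."],
    "STATE": ["Ontario", "Ontario", "Texas", "Texas"],
    "PRODUCT": ["SOFA", "BED", "SOFA", "SOFA"],
    "ACTUAL": ["$1,000.50", "$250.00", "$300.00", "$700.25"],
})

# Strip everything except digits and the decimal point, then convert to float.
sales["REVENUE"] = (
    sales["ACTUAL"].str.replace(r"[^\d.]", "", regex=True).astype(float)
)

# One row per COUNTRY/STATE/PRODUCT combination, with REVENUE summed.
table = pd.pivot_table(
    sales,
    index=["COUNTRY", "STATE", "PRODUCT"],
    values="REVENUE",
    aggfunc="sum",
)

print(table)
```

One caveat: the pattern `[^\d.]` also removes a leading minus sign, so negative amounts would need a different pattern.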
- Return all unique values from a text variable.
print(np.unique(price["STATE"]))
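As a small sketch on made-up data (`states` is a hypothetical Series), `np.unique` returns the unique values in sorted order, while the pandas method `Series.unique()` keeps them in order of first appearance. `value_counts()` is shown as well, since it also reports how often each value occurs.

```python
import numpy as np
import pandas as pd

# Hypothetical text column; the values are made up.
states = pd.Series(["Texas", "Ontario", "Texas", "Florida", "Ontario"])

# NumPy: unique values, sorted.
sorted_unique = np.unique(states)

# pandas: unique values in the order they first appear.
first_seen = states.unique()

# Unique values together with how many times each occurs.
counts = states.value_counts()

print(sorted_unique)
print(first_seen)
print(counts)
```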