alex_bn_lee

导航

【843】dataframe相关操作 (match merge)

Reference:


29-April-2024

  • use to_string() to print the entire DataFrame.

  • Check the number of maximum returned rows: print(pd.options.display.max_rows)ref: https://www.w3schools.com/python/pandas/pandas_csv.asp

  • Increase the maximum number of rows to display the entire DataFrame: pd.options.display.max_rows = 9999ref: https://www.w3schools.com/python/pandas/pandas_csv.asp

  • The DataFrames object has a method called info(), that gives you more information about the data set.

  • Remove all rows with NULL values: dropna()

  • The fillna() method allows us to replace empty cells with a value 

  • Pandas uses the mean() median() and mode() methods to calculate the respective values for a specified column

    • Mean = the average value (the sum of all values divided by number of values).

    • Median = the value in the middle, after you have sorted all values ascending.

    • Mode = the value that appears most frequently.

  • Pandas has a to_datetime() method to convert all cells in the 'Date' column into dates: df['Date'] = pd.to_datetime(df['Date'])

  • Remove rows with a NULL value in the "Date" column: df.dropna(subset=['Date'], inplace = True)


pandas.DataFrame.merge (match columns)

DataFrame.merge(righthow='inner'on=Noneleft_on=Noneright_on=Noneleft_index=Falseright_index=Falsesort=Falsesuffixes=('_x', '_y')copy=Noneindicator=Falsevalidate=None)[source]

Merge DataFrame or named Series objects with a database-style join.

Parameters:

how{‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’

Type of merge to be performed.

  • left: use only keys from left frame, similar to a SQL left outer join; preserve key order. (Keep all left dataframe's records, and no matched cells will be set with "NaN")

  • right: use only keys from right frame, similar to a SQL right outer join; preserve key order. (Keep all right dataframe's records, and no matched cells will be set with "NaN")

  • outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically. (Keep both of the records, and no matched cells will be set with "NaN")

  • inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys. (Only keep the matched records)

  • cross: creates the cartesian product from both frames, preserves the order of the left keys.

import pandas as pd

# Create two sample DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'E', 'F'],
                    'value1': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'key': ['A', 'B', 'C', 'G'],
                    'value2': [5, 6, 7, 8]})

print("Left DataFrame:")
print(df1)
print()
print("Right DataFrame:")
print(df2)
print() 

# Merge the DataFrames based on the 'key' column
merged_df1 = pd.merge(df1, df2, on='key', how='left')

# Display the merged DataFrame
print("how='left': Keep all left dataframe's records, and no matched cells will be set with \"NaN\"")
print(merged_df1)
print() 

# Merge the DataFrames based on the 'key' column
merged_df2 = pd.merge(df1, df2, on='key', how='right')

# Display the merged DataFrame
print("how='right': Keep all right dataframe's records, and no matched cells will be set with \"NaN\"")
print(merged_df2)
print()

# Merge the DataFrames based on the 'key' column
merged_df3 = pd.merge(df1, df2, on='key', how='outer')

# Display the merged DataFrame
print("how='outer': Keep both of the records, and no matched cells will be set with \"NaN\"")
print(merged_df3)
print()

# Merge the DataFrames based on the 'key' column
merged_df4 = pd.merge(df1, df2, on='key', how='inner')

# Display the merged DataFrame
print("how='inner': Only keep the matched records")
print(merged_df4)

Output:

Left DataFrame:
  key  value1
0   A       1
1   B       2
2   C       3
3   D       4
4   E       5
5   F       6

Right DataFrame:
  key  value2
0   A       5
1   B       6
2   C       7
3   G       8

how='left': Keep all left dataframe's records, and no matched cells will be set with "NaN"
  key  value1  value2
0   A       1     5.0
1   B       2     6.0
2   C       3     7.0
3   D       4     NaN
4   E       5     NaN
5   F       6     NaN

how='right': Keep all right dataframe's records, and no matched cells will be set with "NaN"
  key  value1  value2
0   A     1.0       5
1   B     2.0       6
2   C     3.0       7
3   G     NaN       8

how='outer': Keep both of the records, and no matched cells will be set with "NaN"
  key  value1  value2
0   A     1.0     5.0
1   B     2.0     6.0
2   C     3.0     7.0
3   D     4.0     NaN
4   E     5.0     NaN
5   F     6.0     NaN
6   G     NaN     8.0

how='inner': Only keep the matched records
  key  value1  value2
0   A       1       5
1   B       2       6
2   C       3       7

With two columns:

# Create two sample DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'E', 'F'],
                    'value': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'key': ['A', 'B', 'C', 'G'],
                    'value': [1, 6, 3, 8]})

print("Left DataFrame:")
print(df1)
print()
print("Right DataFrame:")
print(df2)
print() 

# Merge the DataFrames based on the 'key' column
merged_df4 = pd.merge(df1, df2, on=['key', 'value'], how='inner')

# Display the merged DataFrame
print("how='inner': Only keep the matched records for both columns of 'key' and 'value'")
print(merged_df4)

Output: 

Left DataFrame:
  key  value
0   A      1
1   B      2
2   C      3
3   D      4
4   E      5
5   F      6

Right DataFrame:
  key  value
0   A      1
1   B      6
2   C      3
3   G      8

how='inner': Only keep the matched records for both columns of 'key' and 'value'
  key  value
0   A      1
1   C      3

pandas与SQL对比:Comparison with SQL

按照列排序:pandas.DataFrame.sort_values

创建dataframe:pandas读取字典(dict)数据

数据表join,生成新列:PANDAS 数据合并与重塑(join/merge篇)

  • pandas的merge方法提供了一种类似于SQL的内存链接操作,官网文档提到它的性能会比其他开源语言的数据操作(例如R)要高效。和SQL语句的对比可以看这里
  • merge的列需要有同样的名称
  • merge的参数
    • on:列名,join用来对齐的那一列的名字,用到这个参数的时候一定要保证左表和右表用来对齐的那一列都有相同的列名。
    • left_on:左表对齐的列,可以是列名,也可以是和dataframe同样长度的arrays。
    • right_on:右表对齐的列,可以是列名,也可以是和dataframe同样长度的arrays。
    • left_index/ right_index: 如果是True的haunted以index作为对齐的key。
    • how:数据融合的方法。
    • sort:根据dataframe合并的keys按字典顺序排序,默认是,如果置false可以提高表现。

Example:

pd.merge(gdf_africa, df_af_mp, how='left', on='country') 

 

posted on 2023-06-15 18:31  McDelfino  阅读(47)  评论(0)    收藏  举报