【843】dataframe相关操作 (match merge)

Reference:

29-April-2024

use to_string() to print the entire DataFrame.
Check the number of maximum returned rows: print(pd.options.display.max_rows)ref: https://www.w3schools.com/python/pandas/pandas_csv.asp
Increase the maximum number of rows to display the entire DataFrame: pd.options.display.max_rows = 9999ref: https://www.w3schools.com/python/pandas/pandas_csv.asp
The DataFrames object has a method called info(), that gives you more information about the data set.
Remove all rows with NULL values: dropna()
The fillna() method allows us to replace empty cells with a value
Pandas uses the mean() median() and mode() methods to calculate the respective values for a specified column
- Mean = the average value (the sum of all values divided by number of values).
- Median = the value in the middle, after you have sorted all values ascending.
- Mode = the value that appears most frequently.
Pandas has a to_datetime() method to convert all cells in the 'Date' column into dates: df['Date'] = pd.to_datetime(df['Date'])
Remove rows with a NULL value in the "Date" column: df.dropna(subset=['Date'], inplace = True)

pandas.DataFrame.merge (match columns)

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=None, indicator=False, validate=None)[source]
Merge DataFrame or named Series objects with a database-style join.

Parameters:

how{‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’

Type of merge to be performed.

left: use only keys from left frame, similar to a SQL left outer join; preserve key order. (Keep all left dataframe's records, and no matched cells will be set with "NaN")

right: use only keys from right frame, similar to a SQL right outer join; preserve key order. (Keep all right dataframe's records, and no matched cells will be set with "NaN")

outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically. (Keep both of the records, and no matched cells will be set with "NaN")

inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys. (Only keep the matched records)

cross: creates the cartesian product from both frames, preserves the order of the left keys.

import pandas as pd # Create two sample DataFrames df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'E', 'F'], 'value1': [1, 2, 3, 4, 5, 6]}) df2 = pd.DataFrame({'key': ['A', 'B', 'C', 'G'], 'value2': [5, 6, 7, 8]}) print("Left DataFrame:") print(df1) print() print("Right DataFrame:") print(df2) print() # Merge the DataFrames based on the 'key' column merged_df1 = pd.merge(df1, df2, on='key', how='left') # Display the merged DataFrame print("how='left': Keep all left dataframe's records, and no matched cells will be set with \"NaN\"") print(merged_df1) print() # Merge the DataFrames based on the 'key' column merged_df2 = pd.merge(df1, df2, on='key', how='right') # Display the merged DataFrame print("how='right': Keep all right dataframe's records, and no matched cells will be set with \"NaN\"") print(merged_df2) print() # Merge the DataFrames based on the 'key' column merged_df3 = pd.merge(df1, df2, on='key', how='outer') # Display the merged DataFrame print("how='outer': Keep both of the records, and no matched cells will be set with \"NaN\"") print(merged_df3) print() # Merge the DataFrames based on the 'key' column merged_df4 = pd.merge(df1, df2, on='key', how='inner') # Display the merged DataFrame print("how='inner': Only keep the matched records") print(merged_df4)

Output:

Left DataFrame: key value1 0 A 1 1 B 2 2 C 3 3 D 4 4 E 5 5 F 6 Right DataFrame: key value2 0 A 5 1 B 6 2 C 7 3 G 8 how='left': Keep all left dataframe's records, and no matched cells will be set with "NaN" key value1 value2 0 A 1 5.0 1 B 2 6.0 2 C 3 7.0 3 D 4 NaN 4 E 5 NaN 5 F 6 NaN how='right': Keep all right dataframe's records, and no matched cells will be set with "NaN" key value1 value2 0 A 1.0 5 1 B 2.0 6 2 C 3.0 7 3 G NaN 8 how='outer': Keep both of the records, and no matched cells will be set with "NaN" key value1 value2 0 A 1.0 5.0 1 B 2.0 6.0 2 C 3.0 7.0 3 D 4.0 NaN 4 E 5.0 NaN 5 F 6.0 NaN 6 G NaN 8.0 how='inner': Only keep the matched records key value1 value2 0 A 1 5 1 B 2 6 2 C 3 7

With two columns:

# Create two sample DataFrames df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'E', 'F'], 'value': [1, 2, 3, 4, 5, 6]}) df2 = pd.DataFrame({'key': ['A', 'B', 'C', 'G'], 'value': [1, 6, 3, 8]}) print("Left DataFrame:") print(df1) print() print("Right DataFrame:") print(df2) print() # Merge the DataFrames based on the 'key' column merged_df4 = pd.merge(df1, df2, on=['key', 'value'], how='inner') # Display the merged DataFrame print("how='inner': Only keep the matched records for both columns of 'key' and 'value'") print(merged_df4)

Output:

Left DataFrame: key value 0 A 1 1 B 2 2 C 3 3 D 4 4 E 5 5 F 6 Right DataFrame: key value 0 A 1 1 B 6 2 C 3 3 G 8 how='inner': Only keep the matched records for both columns of 'key' and 'value' key value 0 A 1 1 C 3

pandas与SQL对比：Comparison with SQL

按照列排序：pandas.DataFrame.sort_values

创建dataframe：pandas读取字典（dict）数据

数据表join，生成新列：PANDAS 数据合并与重塑（join/merge篇）

pandas的merge方法提供了一种类似于SQL的内存链接操作，官网文档提到它的性能会比其他开源语言的数据操作（例如R）要高效。和SQL语句的对比可以看这里
merge的列需要有同样的名称
merge的参数
- on：列名，join用来对齐的那一列的名字，用到这个参数的时候一定要保证左表和右表用来对齐的那一列都有相同的列名。
- left_on：左表对齐的列，可以是列名，也可以是和dataframe同样长度的arrays。
- right_on：右表对齐的列，可以是列名，也可以是和dataframe同样长度的arrays。
- left_index/ right_index: 如果是True的haunted以index作为对齐的key。
- how：数据融合的方法。
- sort：根据dataframe合并的keys按字典顺序排序，默认是，如果置false可以提高表现。

Example:

pd.merge(gdf_africa, df_af_mp, how='left', on='country')

posted on 2023-06-15 18:31 McDelfino 阅读(47) 评论(0) 收藏举报

刷新页面返回顶部

alex_bn_lee

导航

公告

【843】dataframe相关操作 (match merge)

pandas.DataFrame.merge (match columns)