大数据中台-Pandas与SQL对比
与SQL比较 大多数示例将使用tipspandas测试中找到的数据集。我们将数据读入名为tips的DataFrame中,并假设我们有一个具有相同名称和结构的数据库表。 In [3]: url = ('https://raw.github.com/pandas-dev' ...: '/pandas/master/pandas/tests/data/tips.csv') ...: In [4]: tips = pd.read_csv(url) In [5]: tips.head() Out[5]: total_bill tip sex smoker day time size 0 16.99 1.01 Female No Sun Dinner 2 1 10.34 1.66 Male No Sun Dinner 3 2 21.01 3.50 Male No Sun Dinner 3 3 23.68 3.31 Male No Sun Dinner 2 4 24.59 3.61 Female No Sun Dinner 4 SELECT 在SQL中,使用您要选择的以逗号分隔的列列表(或* 选择所有列)来完成选择: SELECT total_bill, tip, smoker, time FROM tips LIMIT 5; 使用pandas,通过将列名列表传递给DataFrame来完成列选择: In [6]: tips[['total_bill', 'tip', 'smoker', 'time']].head(5) Out[6]: total_bill tip smoker time 0 16.99 1.01 No Dinner 1 10.34 1.66 No Dinner 2 21.01 3.50 No Dinner 3 23.68 3.31 No Dinner 4 24.59 3.61 No Dinner 在没有列名列表的情况下调用DataFrame将显示所有列(类似于SQL*)。 WHERE SQL中的过滤是通过WHERE子句完成的。 SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5; DataFrame可以通过多种方式进行过滤; 最直观的是使用 布尔索引。 In [7]: tips[tips['time'] == 'Dinner'].head(5) Out[7]: total_bill tip sex smoker day time size 0 16.99 1.01 Female No Sun Dinner 2 1 10.34 1.66 Male No Sun Dinner 3 2 21.01 3.50 Male No Sun Dinner 3 3 23.68 3.31 Male No Sun Dinner 2 4 24.59 3.61 Female No Sun Dinner 4 上面的语句只是将一个 Series 的 True / False 对象传递给 DataFrame,返回所有带有True的行。 In [8]: is_dinner = tips['time'] == 'Dinner' In [9]: is_dinner.value_counts() Out[9]: True 176 False 68 Name: time, dtype: int64 In [10]: tips[is_dinner].head(5) Out[10]: total_bill tip sex smoker day time size 0 16.99 1.01 Female No Sun Dinner 2 1 10.34 1.66 Male No Sun Dinner 3 2 21.01 3.50 Male No Sun Dinner 3 3 23.68 3.31 Male No Sun Dinner 2 4 24.59 3.61 Female No Sun Dinner 4 就像SQL的OR和AND一样,可以使用|将多个条件传递给DataFrame (OR)和&(AND)。 -- tips of more than $5.00 at Dinner meals SELECT * FROM tips WHERE time = 'Dinner' AND tip > 5.00; # tips of more than $5.00 at Dinner meals In [11]: tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)] Out[11]: total_bill tip sex smoker day time size 23 39.42 7.58 Male No Sat Dinner 4 44 30.40 5.60 Male No Sun Dinner 4 47 32.40 6.00 Male No Sun Dinner 4 52 34.81 5.20 Female No Sun Dinner 4 59 48.27 6.73 Male No Sat Dinner 4 116 29.93 5.07 Male No Sun Dinner 4 155 29.85 5.14 Female No Sun Dinner 5 170 50.81 10.00 Male Yes Sat Dinner 3 172 7.25 5.15 Male Yes Sun Dinner 2 181 23.33 5.65 Male Yes Sun Dinner 2 183 23.17 6.50 Male Yes Sun Dinner 4 211 25.89 5.16 Male Yes Sat Dinner 4 212 48.33 9.00 Male No Sat Dinner 4 214 28.17 6.50 Female Yes Sat Dinner 3 239 29.03 5.92 Male No Sat Dinner 3 -- tips by parties of at least 5 diners OR bill total was more than $45 SELECT * FROM tips WHERE size >= 5 OR total_bill > 45; # tips by parties of at least 5 diners OR bill total was more than $45 In [12]: tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)] Out[12]: total_bill tip sex smoker day time size 59 48.27 6.73 Male No Sat Dinner 4 125 29.80 4.20 Female No Thur Lunch 6 141 34.30 6.70 Male No Thur Lunch 6 142 41.19 5.00 Male No Thur Lunch 5 143 27.05 5.00 Female No Thur Lunch 6 155 29.85 5.14 Female No Sun Dinner 5 156 48.17 5.00 Male No Sun Dinner 6 170 50.81 10.00 Male Yes Sat Dinner 3 182 45.35 3.50 Male Yes Sun Dinner 3 185 20.69 5.00 Male No Sun Dinner 5 187 30.46 2.00 Male Yes Sun Dinner 5 212 48.33 9.00 Male No Sat Dinner 4 216 28.15 3.00 Male Yes Sat Dinner 5 使用notna()和isna() 方法完成NULL检查。 In [13]: frame = pd.DataFrame({'col1': ['A', 'B', np.NaN, 'C', 'D'], ....: 'col2': ['F', np.NaN, 'G', 'H', 'I']}) ....: In [14]: frame Out[14]: col1 col2 0 A F 1 B NaN 2 NaN G 3 C H 4 D I 假设我们有一个与上面的DataFrame结构相同的表。我们只能col2通过以下查询看到IS NULL 的记录: SELECT * FROM frame WHERE col2 IS NULL; In [15]: frame[frame['col2'].isna()] Out[15]: col1 col2 1 B NaN 获取col1IS NOT NULL的项目可以完成notna()。 SELECT * FROM frame WHERE col1 IS NOT NULL; In [16]: frame[frame['col1'].notna()] Out[16]: col1 col2 0 A F 1 B NaN 3 C H 4 D I 如果要选择某列等于多个数值或者字符串时,要用到.isin() 平时使用最多的筛选应该是字符串的模糊筛选,在SQL语句里用的是like,在pandas里我们可以用.str.contains()来实现 GROUP BY 在pandas中,SQL的GROUP BY操作使用类似命名的 groupby()方法执行。groupby()通常是指我们想要将数据集拆分成组,应用某些功能(通常是聚合),然后将这些组合在一起的过程。 常见的SQL操作是获取整个数据集中每个组中的记录数。例如,有一个需要向我们提供提示中的性别的数量的查询语句: SELECT sex, count(*) FROM tips GROUP BY sex; /* Female 87 Male 157 */ 在 pandas 中可以这样: In [17]: tips.groupby('sex').size() Out[17]: sex Female 87 Male 157 dtype: int64 请注意,在我们使用的pandas代码中size(),没有 count()。这是因为 count()将函数应用于每个列,返回每个列中的记录数。not null In [18]: tips.groupby('sex').count() Out[18]: total_bill tip smoker day time size sex Female 87 87 87 87 87 87 Male 157 157 157 157 157 157 或者,我们可以将该count()方法应用于单个列: In [19]: tips.groupby('sex')['total_bill'].count() Out[19]: sex Female 87 Male 157 Name: total_bill, dtype: int64 也可以一次应用多个功能。例如,假设我们希望查看提示量与星期几的不同之处 - agg()允许您将字典传递给分组的DataFrame,指示要应用于特定列的函数。 SELECT day, AVG(tip), COUNT(*) FROM tips GROUP BY day; /* Fri 2.734737 19 Sat 2.993103 87 Sun 3.255132 76 Thur 2.771452 62 */ In [20]: tips.groupby('day').agg({'tip': np.mean, 'day': np.size}) Out[20]: tip day day Fri 2.734737 19 Sat 2.993103 87 Sun 3.255132 76 Thur 2.771452 62 通过将列列表传递给groupby()方法来完成多个列的分组 。 SELECT smoker, day, COUNT(*), AVG(tip) FROM tips GROUP BY smoker, day; /* smoker day No Fri 4 2.812500 Sat 45 3.102889 Sun 57 3.167895 Thur 45 2.673778 Yes Fri 15 2.714000 Sat 42 2.875476 Sun 19 3.516842 Thur 17 3.030000 */ In [21]: tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]}) Out[21]: tip size mean smoker day No Fri 4.0 2.812500 Sat 45.0 3.102889 Sun 57.0 3.167895 Thur 45.0 2.673778 Yes Fri 15.0 2.714000 Sat 42.0 2.875476 Sun 19.0 3.516842 Thur 17.0 3.030000 JOIN 可以使用join()或执行JOIN merge()。默认情况下, join()将在其索引上加入DataFrame。每个方法都有参数,允许您指定要执行的连接类型(LEFT,RIGHT,INNER,FULL)或要连接的列(列名称或索引)。 In [22]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], ....: 'value': np.random.randn(4)}) ....: In [23]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'], ....: 'value': np.random.randn(4)}) ....: 假设我们有两个与DataFrames名称和结构相同的数据库表。 现在让我们来看看各种类型的JOIN。 INNER JOIN SELECT * FROM df1 INNER JOIN df2 ON df1.key = df2.key; # merge performs an INNER JOIN by default In [24]: pd.merge(df1, df2, on='key') Out[24]: key value_x value_y 0 B -0.282863 1.212112 1 D -1.135632 -0.173215 2 D -1.135632 0.119209 merge() 当您想要将一个DataFrame列与另一个DataFrame索引连接时,还会为这些情况提供参数。 In [25]: indexed_df2 = df2.set_index('key') In [26]: pd.merge(df1, indexed_df2, left_on='key', right_index=True) Out[26]: key value_x value_y 1 B -0.282863 1.212112 3 D -1.135632 -0.173215 3 D -1.135632 0.119209 LEFT OUTER JOIN -- show all records from df1 SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.key = df2.key; # show all records from df1 In [27]: pd.merge(df1, df2, on='key', how='left') Out[27]: key value_x value_y 0 A 0.469112 NaN 1 B -0.282863 1.212112 2 C -1.509059 NaN 3 D -1.135632 -0.173215 4 D -1.135632 0.119209 RIGHT JOIN -- show all records from df2 SELECT * FROM df1 RIGHT OUTER JOIN df2 ON df1.key = df2.key; # show all records from df2 In [28]: pd.merge(df1, df2, on='key', how='right') Out[28]: key value_x value_y 0 B -0.282863 1.212112 1 D -1.135632 -0.173215 2 D -1.135632 0.119209 3 E NaN -1.044236 FULL JOIN pandas还允许显示数据集两侧的FULL JOIN,无论连接列是否找到匹配项。在编写时,所有RDBMS(MySQL)都不支持FULL JOIN。 -- show all records from both tables SELECT * FROM df1 FULL OUTER JOIN df2 ON df1.key = df2.key; # show all records from both frames In [29]: pd.merge(df1, df2, on='key', how='outer') Out[29]: key value_x value_y 0 A 0.469112 NaN 1 B -0.282863 1.212112 2 C -1.509059 NaN 3 D -1.135632 -0.173215 4 D -1.135632 0.119209 5 E NaN -1.044236 UNION UNION ALL可以使用concat()。 In [30]: df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'], ....: 'rank': range(1, 4)}) ....: In [31]: df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'], ....: 'rank': [1, 4, 5]}) ....: SELECT city, rank FROM df1 UNION ALL SELECT city, rank FROM df2; /* city rank Chicago 1 San Francisco 2 New York City 3 Chicago 1 Boston 4 Los Angeles 5 */ In [32]: pd.concat([df1, df2]) Out[32]: city rank 0 Chicago 1 1 San Francisco 2 2 New York City 3 0 Chicago 1 1 Boston 4 2 Los Angeles 5 SQL的UNION类似于UNION ALL,但是UNION将删除重复的行。 SELECT city, rank FROM df1 UNION SELECT city, rank FROM df2; -- notice that there is only one Chicago record this time /* city rank Chicago 1 San Francisco 2 New York City 3 Boston 4 Los Angeles 5 */ 在 pandas 中,您可以concat()结合使用 drop_duplicates()。 In [33]: pd.concat([df1, df2]).drop_duplicates() Out[33]: city rank 0 Chicago 1 1 San Francisco 2 2 New York City 3 1 Boston 4 2 Los Angeles 5 Pandas等同于某些SQL分析和聚合函数 带有偏移量的前N行 -- MySQL SELECT * FROM tips ORDER BY tip DESC LIMIT 10 OFFSET 5; In [34]: tips.nlargest(10 + 5, columns='tip').tail(10) Out[34]: total_bill tip sex smoker day time size 183 23.17 6.50 Male Yes Sun Dinner 4 214 28.17 6.50 Female Yes Sat Dinner 3 47 32.40 6.00 Male No Sun Dinner 4 239 29.03 5.92 Male No Sat Dinner 3 88 24.71 5.85 Male No Thur Lunch 2 181 23.33 5.65 Male Yes Sun Dinner 2 44 30.40 5.60 Male No Sun Dinner 4 52 34.81 5.20 Female No Sun Dinner 4 85 34.83 5.17 Female No Thur Lunch 4 211 25.89 5.16 Male Yes Sat Dinner 4 每组前N行 -- Oracle's ROW_NUMBER() analytic function SELECT * FROM ( SELECT t.*, ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn FROM tips t ) WHERE rn < 3 ORDER BY day, rn; In [35]: (tips.assign(rn=tips.sort_values(['total_bill'], ascending=False) ....: .groupby(['day']) ....: .cumcount() + 1) ....: .query('rn < 3') ....: .sort_values(['day', 'rn'])) ....: Out[35]: total_bill tip sex smoker day time size rn 95 40.17 4.73 Male Yes Fri Dinner 4 1 90 28.97 3.00 Male Yes Fri Dinner 2 2 170 50.81 10.00 Male Yes Sat Dinner 3 1 212 48.33 9.00 Male No Sat Dinner 4 2 156 48.17 5.00 Male No Sun Dinner 6 1 182 45.35 3.50 Male Yes Sun Dinner 3 2 197 43.11 5.00 Female Yes Thur Lunch 4 1 142 41.19 5.00 Male No Thur Lunch 5 2 同样使用 rank (method ='first') 函数 In [36]: (tips.assign(rnk=tips.groupby(['day'])['total_bill'] ....: .rank(method='first', ascending=False)) ....: .query('rnk < 3') ....: .sort_values(['day', 'rnk'])) ....: Out[36]: total_bill tip sex smoker day time size rnk 95 40.17 4.73 Male Yes Fri Dinner 4 1.0 90 28.97 3.00 Male Yes Fri Dinner 2 2.0 170 50.81 10.00 Male Yes Sat Dinner 3 1.0 212 48.33 9.00 Male No Sat Dinner 4 2.0 156 48.17 5.00 Male No Sun Dinner 6 1.0 182 45.35 3.50 Male Yes Sun Dinner 3 2.0 197 43.11 5.00 Female Yes Thur Lunch 4 1.0 142 41.19 5.00 Male No Thur Lunch 5 2.0 -- Oracle's RANK() analytic function SELECT * FROM ( SELECT t.*, RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk FROM tips t WHERE tip < 2 ) WHERE rnk < 3 ORDER BY sex, rnk; 让我们找到每个性别组(等级<3)的提示(提示<2)。请注意,使用rank(method='min')函数时 rnk_min对于相同的提示保持不变 (如Oracle的RANK()函数) In [37]: (tips[tips['tip'] < 2] ....: .assign(rnk_min=tips.groupby(['sex'])['tip'] ....: .rank(method='min')) ....: .query('rnk_min < 3') ....: .sort_values(['sex', 'rnk_min'])) ....: Out[37]: total_bill tip sex smoker day time size rnk_min 67 3.07 1.00 Female Yes Sat Dinner 1 1.0 92 5.75 1.00 Female Yes Fri Dinner 2 1.0 111 7.25 1.00 Female No Sat Dinner 1 1.0 236 12.60 1.00 Male Yes Sat Dinner 2 1.0 237 32.83 1.17 Male Yes Sat Dinner 2 2.0 更新(UPDATE) UPDATE tips SET tip = tip*2 WHERE tip < 2; In [38]: tips.loc[tips['tip'] < 2, 'tip'] *= 2 删除(DELETE) DELETE FROM tips WHERE tip > 9; 在pandas中,我们选择应保留的行,而不是删除它们 In [39]: tips = tips.loc[tips['tip'] <= 9]
huidaoli版权所有:转载请注明出处,谢谢合作!

浙公网安备 33010602011771号