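Using cross-tabulations with a DataFrame
1. When to use it: a cross-tabulation (crosstab for short) is a widely used summary table for categorical data. It is a special kind of pivot table that computes group frequencies, and its main value is that it shows how variables relate to one another. The two (or more) variables may be categorical or quantitative, but the case where all of them are categorical is the most common.
2. The function in Python: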
pd.crosstab(
    index,
    columns,
    values=None,
    rownames=None,
    colnames=None,
    aggfunc=None,
    margins=False,
    margins_name='All',
    dropna=True,
    normalize=False,
)
Docstring:
Compute a simple cross tabulation of two (or more) factors. By default
computes a frequency table of the factors unless an array of values and an
aggregation function are passed.
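Purpose: builds a simple cross-tabulation of two (or more) variables (factors). By default it counts how often each combination of factor values occurs;
if a values array and an aggregation function are passed, it aggregates those values per combination instead. See the parameter descriptions below for details.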
Parameters
----------
index : array-like, Series, or list of arrays/Series
    Values to group by in the rows.
columns : array-like, Series, or list of arrays/Series
    Values to group by in the columns.
values : array-like, optional
    Array of values to aggregate according to the factors.
    Requires `aggfunc` be specified.
rownames : sequence, default None
    If passed, must match number of row arrays passed.
colnames : sequence, default None
    If passed, must match number of column arrays passed.
aggfunc : function, optional
    If specified, requires `values` be specified as well.
margins : bool, default False
    Add row/column margins (subtotals).
margins_name : str, default 'All'
    Name of the row/column that will contain the totals
    when margins is True.
    .. versionadded:: 0.21.0
dropna : bool, default True
    Do not include columns whose entries are all NaN.
normalize : bool, {'all', 'index', 'columns'}, or {0,1}, default False
    Normalize by dividing all values by the sum of values.
    - If passed 'all' or `True`, will normalize over all values.
    - If passed 'index' will normalize over each row.
    - If passed 'columns' will normalize over each column.
    - If margins is `True`, will also normalize margin values.
    .. versionadded:: 0.18.1
Returns
-------
DataFrame
    Cross tabulation of the data.
See Also
--------
DataFrame.pivot : Reshape data based on column values.
pivot_table : Create a pivot table as a DataFrame.
Notes
-----
Any Series passed will have their name attributes used unless row or column
names for the cross-tabulation are specified.
Any input passed containing Categorical data will have **all** of its
categories included in the cross-tabulation, even if the actual data does
not contain any instances of a particular category.
In the event that there aren't overlapping indexes an empty DataFrame will
be returned.
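The note about Categorical inputs is easier to follow with a concrete case. The sketch below is modeled on the example in the pandas documentation; the variable names `foo`, `bar` and `full` are only illustrative. It passes `dropna=False` so that categories with no observations stay in the table; with the default `dropna=True`, the all-empty entries may be dropped from the output.

```python
import pandas as pd

# Two categoricals that each declare one category ('c' and 'f')
# that never actually occurs in the data.
foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])

# dropna=False keeps the unobserved categories as all-zero rows/columns.
full = pd.crosstab(foo, bar, dropna=False)
print(full)
```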
3. Examples
--------
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 8, 8, 8, 1],
                   'B': [6, 6, 4, 4, 4],
                   'C': [1, 1, 2, 1, 1]})
print('df:\n', df)
df:
    A  B  C
0  1  6  1
1  8  6  1
2  8  4  2
3  8  4  1
4  1  4  1
data1=pd.crosstab(df['A'],df['B'])
print('data1:\n',data1)
data1:
 B  4  6
A      
1  1  1
8  2  1
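As a cross-check on what this frequency table counts, the following sketch rebuilds the same numbers with `groupby`. It only illustrates the equivalence for this small frame and is not a description of how `pd.crosstab` is implemented internally; the names `by_crosstab` and `by_groupby` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 8, 8, 8, 1],
                   'B': [6, 6, 4, 4, 4],
                   'C': [1, 1, 2, 1, 1]})

# Frequency table via crosstab: rows are values of A, columns are values of B.
by_crosstab = pd.crosstab(df['A'], df['B'])

# The same counts built by hand: count the rows for each (A, B) pair,
# then move B into the columns and fill missing pairs with 0.
by_groupby = df.groupby(['A', 'B']).size().unstack(fill_value=0)

print(by_crosstab)
print(by_groupby)
```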
# normalize=True reports each cell's relative frequency (its share of the grand total)
data2=pd.crosstab(df['A'],df['B'],normalize=True)
print("data2:\n",data2)
data2:
 B    4    6
A          
1 0.20 0.20
8 0.40 0.20
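Besides `True`, the `normalize` parameter also accepts 'index' and 'columns', as listed in the parameter description. A minimal sketch on the same data, with `row_share` and `col_share` as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 8, 8, 8, 1],
                   'B': [6, 6, 4, 4, 4],
                   'C': [1, 1, 2, 1, 1]})

# normalize='index': each row sums to 1
# (the distribution of B within each value of A).
row_share = pd.crosstab(df['A'], df['B'], normalize='index')

# normalize='columns': each column sums to 1
# (the distribution of A within each value of B).
col_share = pd.crosstab(df['A'], df['B'], normalize='columns')

print(row_share)
print(col_share)
```

Row-wise shares answer "given A, how is B distributed?", while column-wise shares answer the reverse question, so the choice depends on which variable is treated as the condition.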
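# values: array of values to aggregate according to the factors
# aggfunc: aggregation applied to values; if values is not passed, a frequency table is computed instead
data3 = pd.crosstab(df['A'], df['B'], values=df['C'], aggfunc='sum')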
print('data3:\n',data3)
data3:
 B  4  6
A      
1  1  1
8  3  1
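Any other aggregation can be used in place of the sum. The sketch below uses the mean and, to show the link to the pivot table mentioned under "See Also", computes the same table with `DataFrame.pivot_table`. The names `mean_c` and `same_via_pivot` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 8, 8, 8, 1],
                   'B': [6, 6, 4, 4, 4],
                   'C': [1, 1, 2, 1, 1]})

# Mean of C for each (A, B) combination.
mean_c = pd.crosstab(df['A'], df['B'], values=df['C'], aggfunc='mean')

# The same aggregation expressed as a pivot table on the frame itself.
same_via_pivot = df.pivot_table(index='A', columns='B', values='C', aggfunc='mean')

print(mean_c)
print(same_via_pivot)
```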
# margins: bool, default False. When True, adds row/column margins (subtotals).
# The name of the totals row/column can be set with margins_name (default 'All').
data4 = pd.crosstab(df['A'], df['B'], values=df['C'], aggfunc='sum', margins=True)
print('data4:\n',data4)
data4:
 B    4  6  All
A             
1    1  1    2
8    3  1    4
All  4  2    6
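The `margins_name` parameter mentioned in the comment is not exercised above, so here is a small sketch of it on a plain frequency table. The label 'Total' and the variable name `totals` are arbitrary choices for the illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 8, 8, 8, 1],
                   'B': [6, 6, 4, 4, 4],
                   'C': [1, 1, 2, 1, 1]})

# Frequency table with a totals row and column labelled 'Total' instead of 'All'.
totals = pd.crosstab(df['A'], df['B'], margins=True, margins_name='Total')

print(totals)
# The bottom-right cell is the overall number of rows in df.
print(totals.loc['Total', 'Total'])
```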
# Hierarchical cross-tabulation: the index and columns arguments of crosstab() also accept lists, which produces a multi-level table
data5=pd.crosstab([df['A'],df['B']],df['C'])
print('data5:\n',data5)
data5:
 C    1  2
A B      
1 4  1  0
  6  1  0
8 4  1  1
  6  1  0
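When the inputs are plain arrays rather than named Series, the levels of a hierarchical table can be labelled with `rownames` and `colnames`. A minimal sketch, assuming the same data as above; the labels 'group_a', 'group_b' and 'flag' are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 8, 8, 8, 1],
                   'B': [6, 6, 4, 4, 4],
                   'C': [1, 1, 2, 1, 1]})

# Plain NumPy arrays carry no names, so the level labels are supplied explicitly.
# rownames needs one entry per row array, colnames one per column array.
table = pd.crosstab([df['A'].to_numpy(), df['B'].to_numpy()],
                    df['C'].to_numpy(),
                    rownames=['group_a', 'group_b'],
                    colnames=['flag'])

print(table)
print(table.index.names)
```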