pandas team study: Task 9

I. The cat object

Attributes of the cat object

Use astype to convert an ordinary Series into a categorical one, for example:

s = pd.Series(['man','woman','child','man','child'])
s = s.astype('category')
s
Out[49]: 
0      man
1    woman
2    child
3      man
4    child
dtype: category
Categories (3, object): ['child', 'man', 'woman']

Use cat.categories to view the categories:

s.cat.categories
Out[50]: Index(['child', 'man', 'woman'], dtype='object')

cat.ordered tells whether the categories are ordered:

s.cat.ordered
Out[59]: False

The values can also be encoded as integers; each code is the position of the value in categories:

s.cat.codes
Out[60]: 
0    1
1    2
2    0
3    1
4    0
dtype: int8
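
To make the link between codes and categories concrete, here is a small sketch that rebuilds the original values from the two pieces. The variable names codes and rebuilt are only illustrative:

```python
import pandas as pd

s = pd.Series(['man', 'woman', 'child', 'man', 'child']).astype('category')

# Each code is the position of the value inside s.cat.categories.
codes = s.cat.codes
# Looking the codes up in the categories gives the original values back.
rebuilt = s.cat.categories.take(codes.to_numpy())

print(codes.tolist())   # [1, 2, 0, 1, 0]
print(list(rebuilt))    # ['man', 'woman', 'child', 'man', 'child']
```

Internally a categorical stores only the small integer codes plus one copy of the categories.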

Adding, removing, and modifying categories

Adding:

Use add_categories to add a category:

s.cat.add_categories('oldman') 
Out[61]: 
0      man
1    woman
2    child
3      man
4    child
dtype: category
Categories (4, object): ['child', 'man', 'woman', 'oldman']

Removing:

Use remove_categories to delete a category; values in the Series that belonged to it become missing:

s.cat.remove_categories('child')
Out[64]: 
0      man
1    woman
2      NaN
3      man
4      NaN
dtype: category
Categories (2, object): ['man', 'woman']

Use remove_unused_categories to delete the categories that do not occur in the Series:

s = s.cat.add_categories('oldman') 
s = s.cat.remove_unused_categories()
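
The two lines above show no output, so here is a minimal sketch of the same steps that prints the categories before and after. The names with_extra and cleaned are made up for the example:

```python
import pandas as pd

s = pd.Series(['man', 'woman', 'child', 'man', 'child']).astype('category')

# 'oldman' is added as a category but no value uses it.
with_extra = s.cat.add_categories('oldman')
print(list(with_extra.cat.categories))   # ['child', 'man', 'woman', 'oldman']

# remove_unused_categories drops exactly those unused categories.
cleaned = with_extra.cat.remove_unused_categories()
print(list(cleaned.cat.categories))      # ['child', 'man', 'woman']
```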

Use set_categories to set the categories of the Series directly; values whose original category is not in the new set become missing:

s.cat.set_categories(['Sophomore','PhD']) 
Out[65]: 
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
dtype: category
Categories (2, object): ['Sophomore', 'PhD']
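
In the example above none of the old categories survive. As a complementary sketch, assuming a new list that overlaps with the old one, only the values whose category was dropped turn into NaN:

```python
import pandas as pd

s = pd.Series(['man', 'woman', 'child', 'man', 'child']).astype('category')

# 'man' and 'woman' are kept, 'child' is dropped, 'oldman' is new and unused.
partial = s.cat.set_categories(['man', 'woman', 'oldman'])

print(partial.isna().tolist())        # [False, False, True, False, True]
print(list(partial.cat.categories))   # ['man', 'woman', 'oldman']
```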

Modifying:

Use rename_categories:

s.cat.rename_categories({'child':'old'})
Out[66]: 
0      man
1    woman
2      old
3      man
4      old
dtype: category
Categories (3, object): ['old', 'man', 'woman']

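II. Ordered categories

Establishing an order

Use reorder_categories to turn an unordered categorical into an ordered one. The list passed in must contain exactly the existing categories, with none added and none missing, and ordered=True must be specified: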

s = pd.Series(['man','woman','child','man','child'])
s = s.astype('category')
s = s.cat.reorder_categories(['child', 'woman','man'],ordered=True)
s
Out[68]: 
0      man
1    woman
2    child
3      man
4    child
dtype: category
Categories (3, object): ['child' < 'woman' < 'man']

Use as_unordered to turn an ordered categorical back into an unordered one:

s.cat.as_unordered()
Out[69]: 
0      man
1    woman
2    child
3      man
4    child
dtype: category
Categories (3, object): ['child', 'woman', 'man']

Sorting and comparison

Sorting:

Once the Series has been made ordered with reorder_categories, sort_values sorts the values by the category order:

s.sort_values()
Out[71]: 
2    child
4    child
1    woman
0      man
3      man
dtype: category
Categories (3, object): ['child' < 'woman' < 'man']

Use sort_index to sort by the index:

s.sort_index()
Out[72]: 
0      man
1    woman
2    child
3      man
4    child
dtype: category
Categories (3, object): ['child' < 'woman' < 'man']

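Comparison:

An ordered categorical can be compared with ==, !=, >, < and so on (== and != also work on unordered categoricals; the order comparisons require ordered categories), for example:

s == 'child'			# comparison with ==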
Out[73]: 
0    False
1    False
2     True
3    False
4     True
dtype: bool
	
s > 'child'				# comparison with >
Out[74]: 
0     True
1     True
2    False
3     True
4    False
dtype: bool
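
To show why the order matters for > and <, here is a short sketch contrasting an unordered and an ordered categorical. The names unordered, ordered and raised are only for illustration:

```python
import pandas as pd

unordered = pd.Series(['man', 'woman', 'child']).astype('category')

# Equality works without any order.
print((unordered == 'child').tolist())   # [False, False, True]

# An order comparison on an unordered categorical raises TypeError.
try:
    unordered > 'child'
    raised = False
except TypeError:
    raised = True
print(raised)                            # True

# After an order is established the same comparison is allowed.
ordered = unordered.cat.reorder_categories(['child', 'woman', 'man'], ordered=True)
print((ordered > 'child').tolist())      # [True, True, False]
```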

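III. Interval categories

Building intervals with cut and qcut

Numeric values can be binned into intervals, mainly with the cut and qcut functions.

1) cut

The first argument is the Series to be binned
bins: how to split, given either as an integer (the number of equal-width bins) or as a list of break points
right: defaults to True, meaning each interval is open on the left and closed on the right
labels: the names to give the bins
retbins: whether to also return the break points (not returned by default)

When bins is an integer: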

s = pd.Series([1,2])
pd.cut(s, bins=2)
Out[75]: 
0    (0.999, 1.5]
1      (1.5, 2.0]
dtype: category
Categories (2, interval[float64]): [(0.999, 1.5] < (1.5, 2.0]]

bins can also be a list of break points:

pd.cut(s, bins=[-1,1.5,2,3])
Out[77]: 
0    (-1.0, 1.5]
1     (1.5, 2.0]
dtype: category
Categories (3, interval[float64]): [(-1.0, 1.5] < (1.5, 2.0] < (2.0, 3.0]]

Returning the bin labels together with the break points:

res = pd.cut(s, bins=2, labels=['small', 'big'], retbins=True)

res[0]
Out[79]: 
0    small
1      big
dtype: category
Categories (2, object): ['small' < 'big']

res[1] 
Out[80]: array([0.999, 1.5  , 2.   ])
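
The right parameter is described above but not demonstrated. The following sketch, with break points chosen only for illustration, shows how right=False switches the bins to closed on the left and open on the right, so a value sitting on a break point lands in the bin that starts there:

```python
import pandas as pd

s = pd.Series([1, 2])

# right=False: bins are [1, 2) and [2, 3)
left_closed = pd.cut(s, bins=[1, 2, 3], right=False)
print(left_closed.cat.categories.closed)   # left
print(left_closed.iloc[1].left)            # 2, so the value 2 falls in the bin starting at 2

# Default right=True: bins are (0, 1] and (1, 2]
right_closed = pd.cut(s, bins=[0, 1, 2])
print(right_closed.cat.categories.closed)  # right
print(right_closed.iloc[1].right)          # 2, so the value 2 falls in the bin ending at 2
```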

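2) qcut

qcut simply replaces the bins parameter with q. When q is an integer n, the data is binned at the n-quantiles; a list of floats can also be passed to give the quantile break points directly.

q as an integer: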

s = pd.Series([1,2,3,4,5,6])
pd.qcut(s,q=2)
Out[84]: 
0    (0.999, 3.5]
1    (0.999, 3.5]
2    (0.999, 3.5]
3      (3.5, 6.0]
4      (3.5, 6.0]
5      (3.5, 6.0]
dtype: category
Categories (2, interval[float64]): [(0.999, 3.5] < (3.5, 6.0]]

q as a list. The list should run from 0 to 1; otherwise values outside the covered quantile range become missing:

pd.qcut(s,q=[0.1,0.5])
Out[86]: 
0             NaN
1    (1.499, 3.5]
2    (1.499, 3.5]
3             NaN
4             NaN
5             NaN
dtype: category
Categories (1, interval[float64]): [(1.499, 3.5]]

pd.qcut(s,q=[0,0.1,0.5,1])
Out[87]: 
0    (0.999, 1.5]
1      (1.5, 3.5]
2      (1.5, 3.5]
3      (3.5, 6.0]
4      (3.5, 6.0]
5      (3.5, 6.0]
dtype: category
Categories (3, interval[float64]): [(0.999, 1.5] < (1.5, 3.5] < (3.5, 6.0]]
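
The difference between the two functions is easiest to see on skewed data. This is a minimal sketch on a made-up Series with one outlier: cut splits the value range into equal widths, while qcut puts roughly the same number of points in each bin:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])

# Equal-width bins: almost everything ends up in the first bin.
by_width = pd.cut(s, bins=2).value_counts().sort_index().tolist()
# Quantile bins: the split point is the median, 3.
by_quantile = pd.qcut(s, q=2).value_counts().sort_index().tolist()

print(by_width)      # [4, 1]
print(by_quantile)   # [3, 2]
```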

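Building general intervals

An interval can be built with Interval. It has three ingredients: the left endpoint, the right endpoint, and which endpoints are closed, given as one of right, left, both, neither: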

my_interval = pd.Interval(0, 1, 'right')

In [50]: my_interval
Out[50]: Interval(0, 1, closed='right')

A pd.IntervalIndex can be created in four ways: from_breaks, from_arrays, from_tuples, and interval_range, each suited to a different situation:

from_breaks: pass the break points directly:

pd.IntervalIndex.from_breaks([1,3,6,10], closed='both')
Out[54]: 
IntervalIndex([[1, 3], [3, 6], [6, 10]],
              closed='both',
              dtype='interval[int64]')

from_arrays: pass the lists of left endpoints and right endpoints separately:

pd.IntervalIndex.from_arrays(left = [1,3,6,10], right = [5,4,9,11], closed ='neither')
Out[55]: 
IntervalIndex([(1, 5), (3, 4), (6, 9), (10, 11)],
              closed='neither',
              dtype='interval[int64]')

from_tuples: pass a list of (start, end) tuples:

pd.IntervalIndex.from_tuples([(1,5),(3,4),(6,9),(10,11)],closed='neither')
Out[56]: 
IntervalIndex([(1, 5), (3, 4), (6, 9), (10, 11)],
              closed='neither',
              dtype='interval[int64]')

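interval_range takes start, end, periods, freq (the start, the end, the number of intervals, and the interval length); three of the four are enough to determine the index:

Passing the number of intervals: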

pd.interval_range(start=1,end=5,periods=8)
Out[57]: 
IntervalIndex([(1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5], (4.5, 5.0]],
              closed='right',
              dtype='interval[float64]')

Passing the interval length:

pd.interval_range(end=5,periods=8,freq=0.5)
Out[58]: 
IntervalIndex([(1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5], (4.5, 5.0]],
              closed='right',
              dtype='interval[float64]')

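Attributes and methods of intervals

IntervalIndex has several commonly used attributes: left, right, mid, length, giving the left and right endpoints, the midpoint of the two endpoints, and the interval length.
IntervalIndex also has two commonly used methods, contains and overlaps, which check element-wise whether each interval contains a given value and whether it overlaps a pd.Interval object.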
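
As a quick illustration of these attributes and methods, here is a sketch on a small index built with from_breaks. The break points and the test interval are arbitrary choices for the example:

```python
import pandas as pd

# Three right-closed intervals: (1, 3], (3, 6], (6, 10]
idx = pd.IntervalIndex.from_breaks([1, 3, 6, 10], closed='right')

print(idx.left.tolist())     # [1, 3, 6]
print(idx.right.tolist())    # [3, 6, 10]
print(idx.mid.tolist())      # [2.0, 4.5, 8.0]
print(idx.length.tolist())   # [2, 3, 4]

# 3 belongs to (1, 3] only, because (3, 6] is open on the left.
print(idx.contains(3).tolist())                   # [True, False, False]
# (5, 7] shares points with (3, 6] and (6, 10].
print(idx.overlaps(pd.Interval(5, 7)).tolist())   # [False, True, True]
```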

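Ex1: Counting categories that do not appear

My answer:

dropna defaults to True, in which case categories that never appear are not shown; when it is set to False they are shown as well.

The idea is to collect the categories of the row index and of the column index, count the elements that belong to each pair of categories, and finally, when dropna is True, drop the rows and columns that are all zero.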

def my_crosstab(A, B, dropna=True):
    A = A.astype('category')
    B = B.astype('category')
    index1 = A.cat.categories
    index2 = B.cat.categories
    n = len(index1)
    m = len(index2)							# number of categories on each axis
    data = np.zeros([n, m])
    for i in range(n):
        for j in range(m):
            # count the elements that belong to both categories at once
            data[i][j] = ((A == index1[i]) & (B == index2[j])).sum()
    df = pd.DataFrame(data, index=index1, columns=index2)
    if dropna:								# drop the rows and columns that are all zero
        df = df.drop(df.index[(df == 0).all(axis=1)])
        df = df.drop(df.columns[(df == 0).all(axis=0)], axis=1)
    return df

Test with dropna=True:

import pandas as pd
import numpy as np
df = pd.DataFrame({'A':['a','b','c','a'],'B':['cat','cat','dog','cat']})
res = my_crosstab(df.A,df.B)
res
Out[188]: 
   cat  dog
a  2.0  0.0
b  1.0  0.0
c  0.0  1.0

Test with dropna=False:

df = pd.DataFrame({'A':['a','b','c','a'],'B':['cat','cat','dog','cat']})
df.B = df.B.astype('category').cat.add_categories('sheep')
res = my_crosstab(df.A, df.B, dropna=False)
res
Out[191]: 
   cat  dog  sheep
a  2.0  0.0    0.0
b  1.0  0.0    0.0
c  0.0  1.0    0.0
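
For comparison, the same table can be sketched without the explicit double loop. This is an alternative approach, not the answer above: it assumes both columns are categorical and relies on groupby with observed=False, which keeps category combinations that never occur and gives them a count of 0:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'a'], 'B': ['cat', 'cat', 'dog', 'cat']})
df['A'] = df['A'].astype('category')
df['B'] = df['B'].astype('category').cat.add_categories('sheep')

# size() counts the rows in every (A, B) group; unstack moves B into the columns.
counts = df.groupby(['A', 'B'], observed=False).size().unstack(fill_value=0)
print(counts)
```

The counts come out as integers here, while the loop version fills a float array.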

Ex2: The diamonds dataset

Just compare the two directly:

s_object =  df.cut
s_category = df.cut.astype('category')
timeit -n 30 s_object.nunique()
timeit -n 30 s_category.nunique()
2.7 ms ± 67.5 µs per loop (mean ± std. dev. of 7 runs, 30 loops each)
668 µs ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 30 loops each)

The timings show that nunique is faster on the category dtype.
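
The diamonds file itself is not reproduced here, so the following is only a rough sketch of the same comparison on synthetic data. The Series contents and the size of 50,000 are assumptions, and the standard timeit module stands in for the IPython magic. Actual timings depend on the machine and the pandas version:

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the cut column: five grades drawn at random.
rng = np.random.default_rng(0)
grades = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
s_object = pd.Series(rng.choice(grades, size=50_000)).astype(object)
s_category = s_object.astype('category')

# Both dtypes must agree on the answer before timing them.
n_object = s_object.nunique()
n_category = s_category.nunique()

t_object = timeit.timeit(s_object.nunique, number=30)
t_category = timeit.timeit(s_category.nunique, number=30)
print(f'object: {t_object:.4f}s, category: {t_category:.4f}s')
```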

First convert the unordered categories to ordered ones, then sort by them:

df.cut = df.cut.astype('category').cat.reorder_categories(['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'],ordered=True)
df.clarity = df.clarity.astype('category').cat.reorder_categories(['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'], ordered=True)
res = df.sort_values(['cut','clarity'], ascending=[False, True])

	carat	cut	clarity	price
315	0.96	Ideal	I1	2801
535	0.96	Ideal	I1	2826
551	0.97	Ideal	I1	2830
653	1.01	Ideal	I1	2844
718	0.97	Ideal	I1	2856
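
The same sort can be tried on a four-row toy table to see how the two keys interact. The rows below are invented, and pd.CategoricalDtype is used here instead of reorder_categories only because the toy data does not contain every grade:

```python
import pandas as pd

toy = pd.DataFrame({
    'cut': ['Good', 'Ideal', 'Ideal', 'Fair'],
    'clarity': ['IF', 'SI2', 'I1', 'I1'],
})

# Ordered dtypes, both listed from worst to best.
cut_dtype = pd.CategoricalDtype(
    ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'], ordered=True)
clarity_dtype = pd.CategoricalDtype(
    ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'], ordered=True)
toy['cut'] = toy['cut'].astype(cut_dtype)
toy['clarity'] = toy['clarity'].astype(clarity_dtype)

# Best cut first; within the same cut, worst clarity first.
res = toy.sort_values(['cut', 'clarity'], ascending=[False, True])
print(res.index.tolist())   # [2, 1, 0, 3]
```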

Starting from the worst-to-best order set in the previous step, first reverse the categories with reorder_categories, then encode them with cat.codes, so that 0 stands for the best grade:

df.cut = df.cut.astype('category')
df.clarity = df.clarity.astype('category')
df.cut = df.cut.cat.reorder_categories(df.cut.cat.categories[::-1])
df.clarity = df.clarity.cat.reorder_categories(df.clarity.cat.categories[::-1])
df.cut = df.cut.cat.codes 				# encode with cat.codes
df.clarity = df.clarity.cat.codes

	carat	cut	clarity	price
0	0.23	0	6	326
1	0.21	1	5	326
2	0.23	3	3	327
3	0.29	1	4	334
4	0.31	3	6	335

Using the cut and qcut functions:

avg = df.price / df.carat
df['avg'] = avg
df['avg_cut'] = pd.cut(avg, bins=[0, 1000, 3500, 5500, 18000, np.inf], labels=['Very Low', 'Low', 'Mid', 'High', 'Very High'])
df['avg_qcut'] = pd.qcut(avg, q=[0, 0.2, 0.4, 0.6, 0.8, 1], labels=['Very Low', 'Low', 'Mid', 'High', 'Very High'])

	carat	cut	clarity	price	avg	avg_cut	avg_qcut
0	0.23	Ideal	SI2	326	1417.39	Low	Very Low
1	0.21	Premium	SI1	326	1552.38	Low	Very Low
2	0.23	Good	VS1	327	1421.74	Low	Very Low
3	0.29	Premium	VS2	334	1151.72	Low	Very Low
4	0.31	Good	SI2	335	1080.65	Low	Very Low

First check which categories actually appear, then remove the ones that do not:

df.avg_cut.unique()
Out[224]: 
['Low', 'Mid', 'High']
Categories (3, object): ['Low' < 'Mid' < 'High']
df.avg_cut = df.avg_cut.cat.remove_categories(['Very Low', 'Very High'])

posted @ 2021-01-07 23:42 爱睡觉的皮卡丘