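pandas Team Study: Task 8

1. The str object

The pandas str object treats every string in a Series as a sequence, so [] can be used to take the element at a given position.

For example, s.str[0] returns the character at that position in each string, or NaN when the position does not exist: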

import pandas as pd
s = pd.Series(['abcd', 'efg', 'hi'])
s.str[0]
Out[10]: 
0    a
1    e
2    h
dtype: object
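
The example above never hits the missing-value case, so here is a minimal sketch of it, using the same Series. The variable names `third` and `first_two` are only illustrative.

```python
import pandas as pd

s = pd.Series(['abcd', 'efg', 'hi'])

# 'hi' has no character at position 2, so that row becomes missing.
third = s.str[2]
print(third)

# Slices are accepted too: the first two characters of every string.
first_two = s.str[:2]
print(first_two)
```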

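Differences between the object and string dtypes:

With the object dtype, [] is applied to each element as it is stored. With the string dtype, each element is first converted to a string. For example: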

s = pd.Series([{1: 'temp_1', 2: 'temp_2'}, ['a', 'b'], 0.5, 'my_string'])
s.str[1]
Out[426]: 
0    temp_1
1         b
2       NaN
3         y
dtype: object
s.astype('string').str[1]
Out[427]: 
0    1
1    '
2    .
3    y
dtype: string

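With the object dtype, the first element is a dict, so [1] looks up key 1 and returns temp_1. With the string dtype, the first element is first converted to the string "{1: 'temp_1', 2: 'temp_2'}", whose character at position 1 is 1.

For a string Series, str methods that return an integer or boolean Series produce the nullable dtypes Int64 and boolean. For an object Series, the same methods return int/float and bool/object respectively, depending on whether missing values are present. String comparisons behave in a similar way: string returns a nullable dtype, object does not.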
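
A small sketch of the dtype difference described above. The dtypes are set explicitly because the default inferred dtype for strings can differ between pandas versions; `values`, `s_obj`, and `s_str` are illustrative names.

```python
import pandas as pd

values = ['a', 'bc', None]
s_obj = pd.Series(values, dtype='object')
s_str = pd.Series(values, dtype='string')

# Integer-valued method: object falls back to float64 because of the
# missing value, string keeps a nullable integer dtype.
print(s_obj.str.len().dtype)
print(s_str.str.len().dtype)

# Boolean-valued method and comparison on the string dtype.
print(s_str.str.startswith('a').dtype)
print((s_str == 'a').dtype)

# The same comparison on the object dtype gives a plain bool Series.
print((s_obj == 'a').dtype)
```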

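2. Regular expression basics

Matching literal characters

Use the findall function of the re module: the first argument is the regular expression and the second is the string to search.

For example: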
```
In [28]: import re

In [29]: re.findall(r'Apple', 'Apple! This Is an Apple!')
Out[29]: ['Apple', 'Apple']
```
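
As a short follow-up sketch, the same call also accepts a character class such as the ones used in the pandas examples in these notes. Matching is case-sensitive by default. The variable names are illustrative.

```python
import re

text = 'Apple! This Is an Apple!'

# Literal match, as in the example above.
literal = re.findall(r'Apple', text)
print(literal)    # ['Apple', 'Apple']

# Case matters: the lowercase pattern finds nothing.
lowercase = re.findall(r'apple', text)
print(lowercase)  # []

# A character class matches any single character listed in the brackets.
capitals = re.findall(r'[A-Z]', text)
print(capitals)   # ['A', 'T', 'I', 'A']
```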

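3. Five types of text-processing operations

Splitting

str.split: the first argument is a regular expression, n is the maximum number of splits, and expand controls whether the result is expanded into multiple columns.

For example: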
```
s = pd.Series(['上海市黄浦区方浜中路249号', '上海市宝山区密山路5号'])
s.str.split('[市区路]', n=2, expand=True)
Out[44]: 
    0   1         2
0  上海  黄浦  方浜中路249号
1  上海  宝山     密山路5号
```
Because at most 2 splits are made, the final 路 is not used as a split point.
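
To make the effect of n visible, the sketch below runs the same split with and without the limit. The Series `addr` is reconstructed from the output shown above, and the names `limited` and `full` are illustrative.

```python
import pandas as pd

addr = pd.Series(['上海市黄浦区方浜中路249号', '上海市宝山区密山路5号'])

# At most two splits: the text after the second delimiter stays in one piece.
limited = addr.str.split('[市区路]', n=2, expand=True)
print(limited)

# No limit: 路 is also used as a delimiter, giving a fourth column.
full = addr.str.split('[市区路]', expand=True)
print(full)
```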

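Joining

str.join: joins the elements of each list in the Series with the given separator. If a list contains any non-string element, the result for that row is NaN.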

s = pd.Series([['a','b'], [1, 'a'], [['a', 'b'], 'c']])
In [47]: s.str.join('-')
Out[47]: 
0    a-b
1    NaN
2    NaN
dtype: object

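str.cat: concatenates two Series element by element. Its main parameters are:

sep: the separator
join: the join type; the default is a left join keyed on the index
na_rep: the replacement for missing values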

s1 = pd.Series(['a','b'])
s2 = pd.Series(['cat','dog'])
s2.index = [1, 2]
In [52]: s1.str.cat(s2, sep='-', na_rep='?', join='outer')
Out[52]: 
0      a-?
1    b-cat
2    ?-dog
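
For comparison, a minimal sketch of the default left join next to the outer join used above. With the left join, only the index labels of s1 survive. The names `left` and `outer` are illustrative.

```python
import pandas as pd

s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['cat', 'dog'], index=[1, 2])

# Default join='left': keep the index of s1 (0 and 1) only.
left = s1.str.cat(s2, sep='-', na_rep='?')
print(left)

# join='outer': keep the union of both indexes (0, 1 and 2).
outer = s1.str.cat(s2, sep='-', na_rep='?', join='outer')
print(outer)
```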

Matching

str.contains: returns whether each string contains a match for the given regular expression.

s = pd.Series(['my cat', 'he is fat', 'railway station'])
In [54]: s.str.contains(r'\s\wat')
Out[54]: 
0     True
1     True
2    False
dtype: bool

str.startswith and str.endswith: return a boolean Series indicating whether each string starts or ends with the given pattern. Neither supports regular expressions:

s.str.startswith('my')
Out[55]: 
0     True
1    False
2    False
dtype: bool

s.str.endswith('t')
Out[56]: 
0     True
1     True
2    False
dtype: bool

str.match: returns a boolean Series indicating whether the start of each string matches the given regular expression.

s.str.match('m|h')
Out[57]: 
0     True
1     True
2    False
dtype: bool
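
The difference between str.contains and str.match is only where the match may begin. The sketch below shows this on the same Series; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['my cat', 'he is fat', 'railway station'])

# contains: the pattern may match anywhere in the string.
anywhere = s.str.contains(r'at')
print(anywhere.tolist())   # [True, True, True]

# match: the pattern must match at the start of the string.
at_start = s.str.match(r'at')
print(at_start.tolist())   # [False, False, False]

# Anchoring contains with ^ gives the same answer as match.
anchored = s.str.contains(r'^(?:m|h)')
print(anchored.tolist())   # [True, True, False]
```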

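str.find and str.rfind: return the index of the first match when searching from the left and from the right, respectively, or -1 if nothing is found. Neither supports regular expressions.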

s = pd.Series(['This is an apple. That is not an apple.'])

s.str.find('apple')
Out[62]: 
0    11
dtype: int64

s.str.rfind('apple')
Out[63]: 
0    33
dtype: int64
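
A brief sketch of the two remaining cases: a substring that is absent, and a substring that looks like a regular expression but is searched for literally. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['This is an apple. That is not an apple.'])

# A substring that does not occur gives -1.
missing = s.str.find('pear')
print(missing.tolist())   # [-1]

# 'a.' is taken literally (an 'a' followed by a dot), not as "a plus any
# character", so it is not found either.
literal_dot = s.str.find('a.')
print(literal_dot.tolist())   # [-1]

# As a regular expression, the same pattern does match (for example 'an').
as_regex = s.str.contains('a.')
print(as_regex.tolist())   # [True]
```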

Replacing

str.replace: the first argument is the pattern to replace; regex=True means the pattern is interpreted as a regular expression.

s = pd.Series(['a_1_b','c_?'])

In [65]: s.str.replace(r'\d|\?', 'new', regex=True)
Out[65]: 
0    a_new_b
1      c_new
dtype: object
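
The following sketch contrasts regex=False with regex=True on the same data. With regex=False the pattern is a plain substring, so characters such as ? need no escaping. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['a_1_b', 'c_?'])

# Literal replacement: only the '?' character itself is replaced.
literal = s.str.replace('?', 'new', regex=False)
print(literal.tolist())   # ['a_1_b', 'c_new']

# Regular-expression replacement: a digit or an escaped '?'.
pattern = s.str.replace(r'\d|\?', 'new', regex=True)
print(pattern.tolist())   # ['a_new_b', 'c_new']
```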

Extracting

str.extract: extracts the capture groups of a regular expression, one column per group.

s = pd.Series(['上海市黄浦区方浜中路249号', '上海市宝山区密山路5号', '北京市昌平区北农路2号'])
pat = r'(\w+市)(\w+区)(\w+路)(\d+号)'
s.str.extract(pat)
Out[77]: 
     0    1     2     3
0  上海市  黄浦区  方浜中路  249号
1  上海市  宝山区   密山路    5号
2  北京市  昌平区   北农路    2号
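
If the groups are named, the names become the column labels. The sketch below assumes the same three addresses as in the output above. The group names city, district, road, and number are my own choices for illustration and do not appear in the original notes.

```python
import pandas as pd

s = pd.Series(['上海市黄浦区方浜中路249号',
               '上海市宝山区密山路5号',
               '北京市昌平区北农路2号'])

# Named groups (?P<name>...) label the resulting columns.
pat = r'(?P<city>\w+市)(?P<district>\w+区)(?P<road>\w+路)(?P<number>\d+号)'
result = s.str.extract(pat)
print(result)
```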

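4. Common string functions

Letter-case functions

upper: converts to uppercase

lower: converts to lowercase

title: capitalizes the first letter of every word

capitalize: capitalizes the first letter of the string and lowercases the rest

swapcase: swaps uppercase and lowercase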
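
A quick sketch of all five functions on one sample string, which makes the difference between title and capitalize easy to see. The sample text is made up for illustration.

```python
import pandas as pd

s = pd.Series(['hello WORLD'])

upper = s.str.upper().iloc[0]
lower = s.str.lower().iloc[0]
title = s.str.title().iloc[0]
capitalized = s.str.capitalize().iloc[0]
swapped = s.str.swapcase().iloc[0]

print(upper)        # HELLO WORLD
print(lower)        # hello world
print(title)        # Hello World
print(capitalized)  # Hello world
print(swapped)      # HELLO world
```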

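Numeric functions

pd.to_numeric: converts a Series of strings to numeric values. Its main parameters are:

errors: raise, coerce, and ignore respectively raise an error, set unparseable values to missing (NaN), and return the original strings unchanged. Note that errors='ignore' is deprecated as of pandas 2.2.
downcast: the smaller numeric type to downcast the result to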

s = pd.Series(['1', '2.2', '2e', '??', '-2.1', '0'])

pd.to_numeric(s, errors='ignore')
Out[93]: 
0       1
1     2.2
2      2e
3      ??
4    -2.1
5       0
dtype: object

pd.to_numeric(s, errors='coerce')
Out[94]: 
0    1.0
1    2.2
2    NaN
3    NaN
4   -2.1
5    0.0
dtype: float64
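
One use of errors='coerce' is to locate the values that cannot be parsed, by selecting the rows that became NaN. The sketch below also shows downcast on a separate small Series. The names `converted`, `bad`, and `small` are illustrative.

```python
import pandas as pd

s = pd.Series(['1', '2.2', '2e', '??', '-2.1', '0'])

# Unparseable strings become NaN ...
converted = pd.to_numeric(s, errors='coerce')

# ... so the NaN mask picks out the offending original values.
bad = s[converted.isna()]
print(bad.tolist())   # ['2e', '??']

# downcast='integer' chooses the smallest signed integer dtype that fits.
small = pd.to_numeric(pd.Series(['1', '2', '3']), downcast='integer')
print(small.dtype)    # int8
```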

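Formatting functions

Whitespace removal: strip, rstrip, and lstrip remove whitespace from both sides, the right side, and the left side, respectively.

my_index = pd.Index([' col1', 'col2 ', ' col3 '])
my_index.str.strip()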
Out[436]: Index(['col1', 'col2', 'col3'], dtype='object')

my_index.str.rstrip()
Out[437]: Index([' col1', 'col2', ' col3'], dtype='object')

my_index.str.lstrip()
Out[439]: Index(['col1', 'col2 ', 'col3 '], dtype='object')

Padding: pad lets you choose the fill character and the side to pad.

s = pd.Series(['a','b','c'])

s.str.pad(5,'left','*')
Out[104]: 
0    ****a
1    ****b
2    ****c
dtype: object

s.str.pad(5,'right','*')
Out[105]: 
0    a****
1    b****
2    c****
dtype: object
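
Besides 'left' and 'right', pad also accepts 'both'. The sketch below shows that case, along with str.rjust, which is not mentioned in the original notes and is included here only as an equivalent spelling of left padding.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'])

# Pad on both sides up to a total width of 5.
both = s.str.pad(5, 'both', '*')
print(both.tolist())   # ['**a**', '**b**', '**c**']

# rjust pads on the left, like pad(..., 'left', ...).
right_justified = s.str.rjust(5, '*')
print(right_justified.tolist())   # ['****a', '****b', '****c']
```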

Posted 2021-01-06 23:16 by 爱睡觉的皮卡丘