Python Study Notes

Built-in String Methods

1. Adjusting string alignment

str.center(width[, fillchar]) # center the string within width
str.ljust(width[, fillchar]) # left-justify
str.rjust(width[, fillchar]) # right-justify
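
As a quick illustration of the three alignment methods, the sketch below pads a short string to a fixed width. The variable names are made up for the example.

```python
# Pad a short label to a fixed width of 10 characters.
label = "menu"

centered = label.center(10, "*")   # fill character on both sides
left = label.ljust(10, ".")        # text on the left, padding on the right
right = label.rjust(10)            # the default fill character is a space

print(centered)     # ***menu***
print(left)         # menu......
print(repr(right))  # '      menu'

# If width is not larger than the string, the original string is returned.
unchanged = label.center(2)
print(unchanged)    # menu
```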

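2. Whitespace characters (space, \f, \n, \r, \t, \v, etc.)

str.isspace() # True if the string is non-empty and every character is whitespace
str.lstrip([chars]) # strip leading characters (whitespace by default); returns a new string rather than modifying the original
str.rstrip([chars]) # same, for trailing characters
str.strip([chars]) # same, for both ends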
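
A point that is easy to miss is that chars is treated as a set of characters, not as a prefix or suffix. The following sketch shows this, along with the fact that the original string is left untouched; the variable names are only for illustration.

```python
raw = "  \t hello \n"

stripped = raw.strip()        # whitespace removed from both ends
left_only = raw.lstrip()      # only the leading whitespace removed
print(repr(stripped))         # 'hello'
print(repr(left_only))        # 'hello \n'
print(repr(raw))              # the original string is unchanged

# chars is a set of characters to remove, not a literal prefix or suffix.
domain = "www.example.com".strip("cmowz.")
rest = "Arthur: three!".lstrip("Arthur: ")
print(domain)                 # example
print(rest)                   # ee!

# isspace() is False for the empty string.
print(" \t\n".isspace(), "".isspace())  # True False
```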

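3. Checking for digits and letters

str.isalpha() # True if all characters are letters
str.isdecimal() # True if all characters are decimal digits
str.isdigit()
str.isnumeric()

Differences between isdecimal, isdigit, and isnumeric (see the documentation):

In terms of the characters accepted, isdecimal() < isdigit() < isnumeric(). A broader range is not necessarily better; choose according to the actual need. To check whether a string consists of the digits 0-9, isdecimal() is usually enough (note that it also accepts decimal digits from other scripts, such as fullwidth digits).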

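isdecimal():

True: Unicode decimal digits, fullwidth (double-byte) digits
False: Roman numerals, Chinese numerals, decimal fractions, currency amounts with commas, circled digits (④)
Error: bytes digits (single-byte); bytes has no isdecimal() method, so the call raises AttributeError

isdigit():

True: Unicode decimal digits, fullwidth (double-byte) digits, circled digits (④), bytes digits (single-byte)
False: Chinese numerals, Roman numerals, decimal fractions, currency amounts with commas
Error: none

isnumeric():

True: Unicode decimal digits, fullwidth (double-byte) digits, circled digits (④), Roman numerals (Ⅵ, not the letters VI), Chinese numerals (十, 亿, 兆, etc.)
False: decimal fractions, currency amounts with commas
Error: bytes digits (single-byte); bytes has no isnumeric() method, so the call raises AttributeError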
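
The classification above can be checked directly. The sketch below runs the three methods over one sample string per category; the samples dictionary and its labels are chosen here purely for illustration.

```python
# One sample string per category from the lists above.
samples = {
    "ASCII digits": "123",
    "fullwidth digits": "１２３",
    "circled digit": "④",
    "Roman numeral": "Ⅵ",
    "Chinese numeral": "十",
    "decimal fraction": "1.5",
}

# Each result is the tuple (isdecimal, isdigit, isnumeric).
results = {
    label: (s.isdecimal(), s.isdigit(), s.isnumeric())
    for label, s in samples.items()
}

for label, flags in results.items():
    print(f"{label:18} {flags}")

# bytes objects only provide isdigit().
bytes_isdigit = b"123".isdigit()
bytes_has_isdecimal = hasattr(b"123", "isdecimal")
bytes_has_isnumeric = hasattr(b"123", "isnumeric")
print(bytes_isdigit, bytes_has_isdecimal, bytes_has_isnumeric)  # True False False
```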

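4. Letter case

str.lower()
str.upper()
str.title() # capitalize the first letter of each word and lowercase the rest
str.swapcase() # swap the case of every letter. Note that s.swapcase().swapcase() == s does not always hold: some characters outside the 26 English letters have no one-to-one case mapping (for example, a character may have several different lowercase forms). Look up the details if you need them.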
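
One concrete case where the swapcase() round trip fails is the German sharp s, whose uppercase form is two letters. The sketch below shows it, together with the word-boundary behaviour of title().

```python
s = "ß"                    # German sharp s, a lowercase letter
once = s.swapcase()        # uppercased to two letters
twice = once.swapcase()    # lowercased again, but not back to the original
print(once, twice, twice == s)   # SS ss False

# title() treats every run of letters as a word, so apostrophes split words.
simple = "hello wORLD".title()
with_apostrophes = "they're bill's".title()
print(simple)              # Hello World
print(with_apostrophes)    # They'Re Bill'S
```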

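5. Miscellaneous

str.count(sub[, start[, end]]) # count non-overlapping occurrences of sub
str.encode(encoding="utf-8", errors="strict") # encode to bytes
bytes.decode(encoding="utf-8", errors="strict") # decode to str; in Python 3 this is a bytes method, str has no decode()
str.startswith(prefix[, start[, end]])
str.endswith(suffix[, start[, end]])
str.find(sub[, start[, end]]) # return the lowest index at which sub is found within the slice str[start:end], or -1 if sub is not found
str.rfind(sub[, start[, end]]) # return the highest (rightmost) index at which sub is found within the slice str[start:end], or -1 if sub is not found
str.format(*args, **kwargs) # Important! For the format string syntax, see the documentation: https://docs.python.org/zh-cn/3.7/library/string.html#formatstrings
str.join(iterable) # concatenate the strings in iterable, using str as the separator
static str.maketrans(x[, y[, z]]) # Important!
    # This static method returns a translation table usable by str.translate().
    # With one argument, it must be a dictionary mapping Unicode ordinals (integers) or characters (strings of length 1) to Unicode ordinals, strings (of arbitrary length), or None. Character keys are converted to ordinals.
    # With two arguments, they must be strings of equal length, and in the resulting dictionary each character in x is mapped to the character at the same position in y.
    # If there is a third argument, it must be a string whose characters are mapped to None in the result.
str.translate(table) # Important! Return a copy of the string in which each character has been mapped through the given translation table.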
    # One argument
    d = {'a':'1','b':'2','c':'3','d':'4','e':'5','s':'6'}
    trantab = str.maketrans(d)
    st = 'just do it'
    print(st.translate(trantab)) # ju6t 4o it

    # Two arguments
    x = 'abcdefs'
    y = '1234567'
    st = 'just do it'
    trantab = str.maketrans(x, y)
    print(st.translate(trantab)) # ju7t 4o it

    # Three arguments: the third argument z must be a string; its characters are mapped to None, i.e. deleted. If a character in z also appears in x, it is still deleted in the final result. In other words, whenever z is given, every character in z is deleted, duplicated or not.
    >>> x = 'abcdefs'
    >>> y = '1234567'
    >>> z = 'ot'
    >>> st = 'just do it'
    >>> trantab = str.maketrans(x, y, z)
    >>> print(st.translate(trantab)) # ju7 4 i

    >>> x = 'abst'
    >>> y = '1234'
    >>> z = 's'
    >>> st = 'just do it'
    >>> trantab = str.maketrans(x, y, z)
    >>> print(st.translate(trantab)) # ju4 do i4

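str.splitlines([keepends]) # split at line boundaries; with keepends=True the line breaks are kept at the end of each item
    >>> 'ab c\n\nde fg\rkl\r\n'.splitlines() # ['ab c', '', 'de fg', 'kl']
    >>> 'ab c\n\nde fg\rkl\r\n'.splitlines(keepends=True) # ['ab c\n', '\n', 'de fg\r', 'kl\r\n']
    # Unlike split() when a delimiter string sep is given, this method returns an empty list for the empty string, and a trailing line break does not add an extra line to the result:
    >>> "".splitlines() # []
    >>> "One line\n".splitlines() # ['One line']
    # For comparison, split('\n') gives:
    >>> ''.split('\n') # ['']
    >>> 'Two lines\n'.split('\n') # ['Two lines', '']
str.zfill(width) # return a copy of the string left-filled with ASCII '0' digits to make a string of length width. A leading sign prefix ('+'/'-') is handled by inserting the padding after the sign character rather than before it.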
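
Several of the methods in this list have no example above, so here is a small sketch that exercises them. All sample strings and variable names are invented for illustration.

```python
import string

word = "banana"
n_an = word.count("an")        # non-overlapping occurrences
first_an = word.find("an")     # lowest index
last_an = word.rfind("an")     # highest index
missing = word.find("xy")      # -1 when sub is not found
print(n_an, first_an, last_an, missing)  # 2 1 3 -1

# encode() produces bytes; decode() is called on bytes and returns str.
encoded = "你好".encode("utf-8")
decoded = encoded.decode("utf-8")
print(encoded, decoded)

# startswith()/endswith() also accept a tuple of candidates.
is_table = "report.csv".endswith((".csv", ".tsv"))

joined = ", ".join(["a", "b", "c"])            # 'a, b, c'
formatted = "{:>6.2f}".format(3.14159)         # '  3.14'
by_name = "{name} is {age}".format(name="Ann", age=3)
padded = "-42".zfill(6)                        # '-00042'

# maketrans('', '', z) only deletes characters: here, all ASCII punctuation.
no_punct = "Hello, world!".translate(str.maketrans("", "", string.punctuation))
print(joined, formatted, by_name, padded, no_punct, sep=" | ")
```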

Regular Expressions

1. In a regular expression, \num (where num is a positive integer) has the following meanings:

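(1) \num (num is a positive integer with 1 <= num <= 99) is a backreference to the text matched by group number num. If num is greater than the number of groups in the pattern, an error is raised.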

>>> re.findall(r'(s)(a)\1', 'sasdf')
[('s', 'a')]
>>> re.findall(r'(s)(a)\1', 'sadf')
[]
>>> re.findall(r'(s)(a)\2', 'sasdf')
[]
>>> re.findall(r'(s)(a)\2', 'saadf')
[('s', 'a')]
>>> re.findall(r'(s)(a)\3', 'saadf')
Traceback (most recent call last):
…
re.error: invalid group reference 3 at position 7

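(2) If \num starts with 0, or is three octal digits long, the Python documentation states that it is not interpreted as a group reference but as the character with octal value num.

>>> re.search(r'\100', '@') # octal 100 = decimal 64; the ASCII character with code 64 is @
<re.Match object; span=(0, 1), match='@'>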
>>> re.search(r'\112', 'J')
<re.Match object; span=(0, 1), match='J'>
>>> re.search(r'\053', '+')
<re.Match object; span=(0, 1), match='+'>
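
To see both readings of \num side by side, the sketch below uses a backreference to find a doubled word and an octal escape to match a single character. The sample sentence and variable names are made up for the example.

```python
import re

# Backreference: \1 must repeat exactly what group 1 matched.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
print(doubled.group(0), "/", doubled.group(1))  # is is / is

# Octal escape: three octal digits name a character code, not a group.
octal_code = int("101", 8)                 # 65, the code of 'A'
octal_match = re.fullmatch(r"\101", "A")
print(octal_code, octal_match is not None)  # 65 True

# Referring to a group that does not exist is a compile-time error.
try:
    re.compile(r"(a)\2")
    bad_reference_rejected = False
except re.error:
    bad_reference_rejected = True
print(bad_reference_rejected)              # True
```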

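2. The difference between re.match and re.search:

re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string.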

>>> re.match(r"c", "abcdef")    # No match
>>> re.search(r"c", "abcdef")   # Match
<re.Match object; span=(2, 3), match='c'>

Even with MULTILINE, match() cannot match at the start of a later line, whereas search() combined with "^" can.

>>> re.match(r'X', 'A\nB\nX', re.MULTILINE)  # No match
>>> re.search(r'^X', 'A\nB\nX', re.MULTILINE)  # Match
<re.Match object; span=(4, 5), match='X'>
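
The same contrast can be written as a short script. This sketch also lists every line start that "^" can anchor to under MULTILINE; the variable names are only for illustration.

```python
import re

text = "A\nB\nX"

# match() is anchored at index 0, even with MULTILINE and an explicit ^.
at_start = re.match(r"^X", text, re.MULTILINE)
print(at_start)                 # None

# search() with ^ can stop at the start of any line.
anywhere = re.search(r"^X", text, re.MULTILINE)
print(anywhere.span())          # (4, 5)

# Every position where ^ followed by a word character matches.
line_starts = [m.start() for m in re.finditer(r"^\w", text, re.MULTILINE)]
print(line_starts)              # [0, 2, 4]
```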

3. Make good use of re.split(pattern, string, maxsplit=0, flags=0)

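Splits string by the occurrences of pattern. maxsplit is the maximum number of splits, and flags takes members of the re.RegexFlag enumeration, listed below (the exact names printed vary between Python versions).

>>> for a in re.RegexFlag:
...     print(a, a.value)
...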
RegexFlag.ASCII 256
RegexFlag.IGNORECASE 2
RegexFlag.LOCALE 4
RegexFlag.UNICODE 32
RegexFlag.MULTILINE 8
RegexFlag.DOTALL 16
RegexFlag.VERBOSE 64
RegexFlag.TEMPLATE 1
RegexFlag.DEBUG 128

>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way

Heather Albrecht: 548.326.4584 919 Park Place"""
>>> entries = re.split("\n+", text)
>>> entries
['Ross McFluff: 834.345.1254 155 Elm Street',
'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
'Frank Burger: 925.541.7625 662 South Dogwood Way',
'Heather Albrecht: 548.326.4584 919 Park Place']
>>> [re.split(':? ', each, maxsplit=4) for each in entries]
[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
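
The parameters of re.split() are easier to compare on one short input. The sketch below uses the sample strings from the re module documentation and varies one parameter at a time.

```python
import re

sentence = "Words, words, words."

plain = re.split(r"\W+", sentence)
print(plain)        # ['Words', 'words', 'words', '']

# A capturing group in the pattern keeps the separators in the result.
with_separators = re.split(r"(\W+)", sentence)
print(with_separators)  # ['Words', ', ', 'words', ', ', 'words', '.', '']

# maxsplit limits the number of splits; the remainder stays in one piece.
limited = re.split(r"\W+", sentence, maxsplit=1)
print(limited)      # ['Words', 'words, words.']

# flags changes how the pattern matches, here ignoring case.
ignoring_case = re.split("[a-f]+", "0a3B9", flags=re.IGNORECASE)
print(ignoring_case)  # ['0', '3', '9']
```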

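4. Likewise, make good use of re.sub(pattern, repl, string, count=0, flags=0)

pattern: the regular expression for the text to be replaced;

repl: the replacement string, or a function;

string: the full string to operate on.

Replaces whatever pattern matches in string with repl, much like the regular-expression find-and-replace in Sublime Text and VS Code.

repl can be a function: it is called with the match object of each match, and its return value is used as the replacement string.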

>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files') # {1,2} is greedy
'pro--gram files'
>>> re.sub('-{1,2}?', dashrepl, 'pro----gram-files') # {1,2}? is non-greedy
'pro    gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'
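
As a further sketch of a function used as repl, the example below doubles every number in a string, then uses count to limit how many matches are replaced. The helper name double and the sample string are hypothetical.

```python
import re

def double(match):
    """Return the matched number multiplied by two, as a string."""
    return str(int(match.group(0)) * 2)

source = "a1 b22 c333"

# The function is called once per match with the match object.
doubled = re.sub(r"\d+", double, source)
print(doubled)   # a2 b44 c666

# count limits the number of replacements; later matches are left alone.
limited = re.sub(r"\d+", "#", source, count=2)
print(limited)   # a# b# c333
```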

__getattribute__, __getattr__, and __setattr__ in Detail

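1. __getattribute__

This method is called unconditionally to implement attribute access for instances of the class (the documentation's wording). Whether an attribute is reached as t.attr or through the t.__dict__[attr] dictionary, __getattribute__() is called (in the second case, to look up __dict__ itself). This includes class attributes read through an instance, such as t.n. Reading an attribute on the class itself (Test.n), however, calls neither the class's __getattribute__() nor the __getattr__() described later. For this reason, never use self.__dict__[attr] inside a custom __getattribute__(), since that causes infinite recursion; call the base class's __getattribute__() instead.

class Test():
    n = 100
    def __init__(self):
        print("__init__ called")
        self.a = 'a'
    def __getattribute__(self, attr):
        print("__getattribute__ called")
        # return self.__dict__[attr] # wrong: recurses forever
        return super().__getattribute__(attr) # correct

>>> t = Test()
__init__ called
>>> t.a
__getattribute__ called
'a'
>>> t.n
__getattribute__ called
100
>>> Test.n
100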

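Question

If the base class is not object, its __getattribute__ can itself only keep calling super().__getattribute__(attr), so the chain always ends at object.__getattribute__(self, attr). How, then, is __getattribute__ implemented in object? All I know so far is that instances of object have no __dict__. (In CPython it is implemented in C, as PyObject_GenericGetAttr, so it does not go back through Python-level attribute access.)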

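2. __getattr__

Only when __getattribute__() raises AttributeError does the interpreter catch the exception and call __getattr__(). If neither the class nor any of its base classes defines __getattr__(), the AttributeError simply propagates to the caller (object itself does not define __getattr__()).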

class Test():
    n = 100
    def __init__(self):
        print("__init__ called")
        self.a = 'a'
    def __getattribute__(self, attr):
        print("__getattribute__ called")
        return super().__getattribute__(attr)
    def __getattr__(self, attr):
        print("__getattr__ called")
        try:
            return super().__getattr__(attr) # raises AttributeError unless a base class defines __getattr__
        except AttributeError:
            return "No such attribute"

>>> t = Test()
__init__ called
>>> t.a
__getattribute__ called
'a'
>>> t.b
__getattribute__ called
__getattr__ called
'No such attribute'
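
The fallback behaviour can be seen in isolation with two small classes, one without and one with __getattr__. The class names Plain and WithFallback are hypothetical and exist only for this sketch.

```python
class Plain:
    pass


class WithFallback:
    def __init__(self):
        self.a = "a"

    def __getattr__(self, name):
        # Reached only after normal lookup has raised AttributeError.
        return f"<no attribute {name!r}>"


# Without __getattr__, the AttributeError reaches the caller.
try:
    Plain().b
    plain_error = None
except AttributeError as exc:
    plain_error = type(exc).__name__

w = WithFallback()
existing = w.a    # found normally, __getattr__ is not involved
missing = w.b     # normal lookup fails, __getattr__ supplies the value
print(plain_error, existing, missing)
```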

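3. __setattr__

Whenever an attribute is set or modified on an instance, __setattr__(self, attr, value) is called. Inside it there are two ways to actually store the attribute: super().__setattr__(attr, value), or self.__dict__[attr] = value. Note that the second way has to access __dict__ and therefore calls __getattribute__(). Never write self.attr = value inside __setattr__(), because that causes infinite recursion.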

class Test():
    n = 100
    def __init__(self):
        print("__init__ called")
        self.a = 'a'
    def __getattribute__(self, attr):
        print("__getattribute__ called")
        return super().__getattribute__(attr)
    def __getattr__(self, attr):
        print("__getattr__ called")
        try:
            return super().__getattr__(attr) # check whether a base class handles missing attributes
        except AttributeError:
            return "No such attribute"
    def __setattr__(self, attr, value):
        print("__setattr__ called")
        super().__setattr__(attr, value)
        print("--------------------------------------")
        self.__dict__[attr] = value

>>> t = Test()
__init__ called
__setattr__ called
--------------------------------------
__getattribute__ called
# Attribute assignments made during initialization also call __setattr__; assignments to class attributes do not.
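
A common use of __setattr__ is validating values before storing them. The sketch below assumes a made-up class Positive that rejects negative numbers. It also shows that assigning on the class itself bypasses the method.

```python
class Positive:
    log = []  # names that went through __setattr__

    def __setattr__(self, name, value):
        if isinstance(value, (int, float)) and value < 0:
            raise ValueError(f"{name} must not be negative")
        Positive.log.append(name)
        # Delegate the actual storage; self.name = value would recurse.
        super().__setattr__(name, value)


p = Positive()
p.x = 1          # goes through __setattr__
Positive.y = 5   # class attribute assignment: __setattr__ is not called

try:
    p.x = -1
    rejected = False
except ValueError:
    rejected = True

print(Positive.log, p.x, p.y, rejected)   # ['x'] 1 5 True
```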

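String Similarity

1. difflib.SequenceMatcher

class difflib.SequenceMatcher(isjunk=None, a='', b='', autojunk=True)

The optional argument isjunk must be None (the default) or a one-argument function that takes a sequence element and returns true if and only if the element is "junk" and should be ignored. Passing None for isjunk is equivalent to passing lambda x: False; in other words, no elements are ignored.
The optional arguments a and b are the sequences to be compared; both default to empty strings. The elements of both sequences must be hashable.
The optional argument autojunk controls the automatic junk heuristic (enabled by default; pass False to disable it).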

>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(lambda x: x == " ", # advantage: junk characters can be selectively excluded from the comparison
...                     "private Thread currentThread;",
...                     "private volatile Thread currentThread")
>>> print(round(s.ratio(), 3))
0.848
>>> print(round(s.quick_ratio(), 3))
0.848
>>> print(round(s.real_quick_ratio(), 3))
0.879
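
For everyday use it can be convenient to wrap SequenceMatcher in a small helper. The function name similarity is an assumption made for this sketch; the sample strings "abcd" and "bcde" are the ones used in the difflib documentation.

```python
from difflib import SequenceMatcher

def similarity(a, b, isjunk=None):
    """Return the SequenceMatcher ratio of a and b, between 0.0 and 1.0."""
    return SequenceMatcher(isjunk, a, b).ratio()

partial = similarity("abcd", "bcde")     # 2 * 3 matching chars / 8 total
identical = similarity("same", "same")
print(partial, identical)                # 0.75 1.0

# get_matching_blocks() shows which parts were considered equal.
blocks = SequenceMatcher(None, "abcd", "bcde").get_matching_blocks()
first_block = tuple(blocks[0])           # (start in a, start in b, length)
print(first_block)                       # (1, 0, 3)
```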

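2. fuzzywuzzy

fuzzywuzzy is based on edit distance. Installing python-Levenshtein alongside it is recommended for speed; without it, fuzzywuzzy falls back to a slower pure-Python matcher and prints a warning.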

pip install python-Levenshtein
pip install fuzzywuzzy

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("private Thread currentThread;", "private volatile Thread currentThread") # computes the similarity between s1 and s2 directly; the result ranges from 0 to 100, where 100 means identical
85
>>> fuzz.token_sort_ratio("private Thread currentThread;", "private volatile Thread currentThread") # compares the words only, ignoring their order
86
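
The idea behind ignoring word order can be sketched with the standard library alone: sort the words of each string, then compare. This is not fuzzywuzzy's actual implementation, and its scores will not necessarily agree with fuzz.token_sort_ratio; the function name token_sort_sketch is hypothetical.

```python
from difflib import SequenceMatcher

def token_sort_sketch(s1, s2):
    """Score 0-100 after lowercasing and sorting the words of each string."""
    def normalize(s):
        return " ".join(sorted(s.lower().split()))
    ratio = SequenceMatcher(None, normalize(s1), normalize(s2)).ratio()
    return round(100 * ratio)

# The same words in a different order compare as identical.
reordered = token_sort_sketch("new york mets", "mets york new")
print(reordered)   # 100

other = token_sort_sketch("private Thread currentThread;",
                          "private volatile Thread currentThread")
print(other)
```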

textwrap

1. textwrap.wrap(text, width=70, **kwargs)

Wraps the single paragraph in text (a string) so that every line is at most width characters long. Returns a list of output lines, without trailing newlines.

>>> import textwrap
>>> a = 'a'*100
>>> textwrap.wrap(a)
['aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa']

2. textwrap.fill(text, width=70, **kwargs)

Wraps the single paragraph in text and returns a single string containing the wrapped paragraph.

>>> textwrap.fill(a, 50)
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\naaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
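
On ordinary prose, wrap() breaks at word boundaries, and fill() gives the same lines joined with newlines. The sketch below checks that relationship; the sample sentence is made up for the example.

```python
import textwrap

text = "The quick brown fox jumps over the lazy dog. " * 3

lines = textwrap.wrap(text, width=30)    # list of lines
filled = textwrap.fill(text, width=30)   # one string with newlines

# No line exceeds the requested width.
longest = max(len(line) for line in lines)
print(longest <= 30)                     # True

# fill() is shorthand for joining the result of wrap() with newlines.
print(filled == "\n".join(lines))        # True
print(filled)
```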

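3. textwrap.shorten(text, width, **kwargs)

Collapses and truncates the given text to fit in the given width.
First the whitespace in text is collapsed (every run of whitespace is replaced by a single space). If the result fits in width, it is returned. Otherwise, enough words are dropped from the end so that the remaining words plus the placeholder fit within width.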

>>> textwrap.shorten("Hello  world!", width=12)
'Hello world!'
>>> textwrap.shorten("Hello  world!", width=11)
'Hello [...]'
>>> textwrap.shorten("Hello world", width=10, placeholder="...")
'Hello...'

Posted 2020-10-18 11:25 by yury757
