Python Study Notes [Part 4]: File Operations in Python (Files, Regular Expressions, json, pickle)

1. File operations

1.1 Workflow

1) Open the file

2) Operate on the file (read/write)

3) Close the file

1.2 Introduction to open()

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
Opens file and returns a corresponding file object. If the file cannot be opened, an OSError is raised.
 # In Python 2 there were two ways to open a file: open(...) and file(...). open() internally delegated to file(), and open() was the recommended form. In Python 3 the file type is gone, so always use open().
1.2.1 file (important)

file is either a string or bytes object giving the pathname (absolute, or relative to the current working directory) of the file to be opened, or an integer file descriptor of the file to be wrapped.

1.2.2 mode (important)
    mode is an optional string that specifies the mode in which the file is opened. It defaults to 'r', which means open for reading in text mode.
    The available modes are:
    Character   Meaning
    'r'     open for reading (default)
    'w'     open for writing (truncates the file first)
    'x'     open for exclusive creation (fails if the file already exists)
    'a'     open for writing (appends to the end if the file exists)
    'b'     binary mode
    't'     text mode (default)
    '+'     open a disk file for updating (reading and writing)
    'U'     universal newlines mode (deprecated)
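The difference between 'w', 'x' and 'a' can be sketched in a short session (a throwaway temp file is used so nothing real is overwritten):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "x") as f:      # exclusive create: succeeds only if absent
    f.write("line1\n")

try:
    open(path, "x")             # file now exists -> FileExistsError
except FileExistsError:
    print("exclusive create failed, file exists")

with open(path, "a") as f:      # append: keeps existing content
    f.write("line2\n")

with open(path, "w") as f:      # write: truncates the file first
    f.write("only\n")

print(open(path).read())        # only 'only\n' survives the 'w' open
```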

  Python distinguishes between binary and text I/O. Files opened in binary mode (with 'b' in the mode argument) return their contents as bytes objects without any decoding. In text mode (the default, or when 't' is in the mode argument), the contents are returned as str, the bytes having first been decoded using a platform-dependent encoding or the given encoding.

Note: Python does not depend on the underlying operating system's notion of text files; all the processing is done by Python itself, and is therefore platform-independent.

1.2.3 buffering

  buffering is an optional integer used to set the buffering policy.
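A minimal sketch of the two most common non-default policies (buffering=0 is only legal in binary mode; buffering=1 selects line buffering in text mode):

```python
import os
import tempfile

base = tempfile.mkdtemp()

# buffering=0 (unbuffered) is only allowed in binary mode:
# every write is handed to the OS immediately.
f = open(os.path.join(base, "raw.bin"), "wb", buffering=0)
f.write(b"abc")
f.close()

# buffering=1 selects line buffering (text mode only):
# the buffer is flushed each time '\n' is written.
f = open(os.path.join(base, "lines.txt"), "w", buffering=1)
f.write("one line\n")
f.close()
```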

1.2.4 encoding (important)

  encoding is the name of the encoding used to decode or encode the file. It should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), and any encoding supported by Python may be used. See the codecs module for the list of supported encodings.

>>> import locale
>>> locale.getpreferredencoding()
'cp936'

1.2.5 errors

  errors is an optional string that specifies how encoding and decoding errors are to be handled; it cannot be used in binary mode.
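A small sketch of the errors parameter, writing deliberately invalid UTF-8 bytes first:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "bad.txt")
with open(path, "wb") as f:
    f.write(b"caf\xff")            # \xff is not valid UTF-8

try:
    open(path, encoding="utf-8").read()   # errors='strict' is the default
except UnicodeDecodeError:
    print("strict decoding failed")

# 'replace' substitutes U+FFFD for undecodable bytes; 'ignore' drops them.
print(open(path, encoding="utf-8", errors="replace").read())
print(open(path, encoding="utf-8", errors="ignore").read())
```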

1.2.6 newline

  newline controls how universal newlines mode works (it only applies to text mode).
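A short demonstration of the newline parameter (newline='' disables translation on write, so the bytes on disk are exactly what was written; a default-mode read translates '\r\n' and '\r' to '\n'):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "nl.txt")

# newline='' on write: '\n' is NOT translated to os.linesep.
with open(path, "w", newline="") as f:
    f.write("a\r\nb\n")

with open(path, "rb") as f:
    print(f.read())       # b'a\r\nb\n' on every platform

# Default (universal newlines) read: '\r\n' becomes '\n'.
with open(path, "r") as f:
    print(f.read())       # 'a\nb\n'
```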

1.2.7 closefd

  If closefd is False and a file descriptor rather than a filename was given, the underlying file descriptor is kept open when the file object is closed.

1.2.8 opener

  A custom opener can be used by passing a callable as opener; the file is then opened by calling opener(file, flags), which must return an open file descriptor.
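Both parameters can be sketched together; the audit_opener name below is just an illustrative choice:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "fd.txt")

# closefd=False: wrap an existing fd without taking ownership of it.
fd = os.open(path, os.O_RDWR | os.O_CREAT)
f = open(fd, "w", closefd=False)
f.write("hello")
f.close()            # flushes, but leaves fd open
os.close(fd)         # we still have to close the fd ourselves

# opener: a callable (path, flags) -> fd used instead of the default os.open.
def audit_opener(p, flags):
    print("opening", p)
    return os.open(p, flags)

with open(path, "r", opener=audit_opener) as f:
    print(f.read())
```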

1.3 File operation examples

Reading file contents in text mode

File contents:
12344
aaaaaacccccdddd

Session:
>>> f = open('db','r', encoding="utf-8")
>>> data = f.read()
>>> print(data, type(data))
12344
aaaaaacccccdddd
 <class 'str'>
>>> f.close()

Writing a file in text mode

>>> f = open('db','a', encoding="utf-8")
>>> data = f.write("asdfasdf")
>>> f.close()

Reading file contents in binary mode

>>> f.close()
>>> f = open('db','rb')
>>> data = f.read()
>>> print(data,type(data))
b'12344\r\naaaaaacccccdddd\r\n' <class 'bytes'>

Writing a file in binary mode

Original file contents:
123

>>> f = open("db", 'ab')
>>> f.write(bytes("李杰", encoding="utf-8"))
6
>>> f.close()

File contents afterwards:
123李杰

1.4 File object methods

class file(object):
  
    def close(self): # real signature unknown; restored from __doc__
        Close the file
        """
        close() -> None or (perhaps) an integer.  Close the file.
         
        Sets data attribute .closed to True.  A closed file cannot be used for
        further I/O operations.  close() may be called more than once without
        error.  Some kinds of file objects (for example, opened by popen())
        may return an exit status upon closing.
        """
 
    def fileno(self): # real signature unknown; restored from __doc__
        File descriptor
         """
        fileno() -> integer "file descriptor".
         
        This is needed for lower-level file interfaces, such os.read().
        """
        return 0    
 
    def flush(self): # real signature unknown; restored from __doc__
        Flush the file's internal buffer
        """ flush() -> None.  Flush the internal I/O buffer. """
        pass
 
 
    def isatty(self): # real signature unknown; restored from __doc__
        Check whether the file is connected to a tty device
        """ isatty() -> true or false.  True if the file is connected to a tty device. """
        return False
 
 
    def next(self): # real signature unknown; restored from __doc__
        Get the next line; raises StopIteration if there is none.  Removed in Python 3.x
        """ x.next() -> the next value, or raise StopIteration """
        pass
 
    def read(self, size=None): # real signature unknown; restored from __doc__
        Read at most the given number of bytes
        """
        read([size]) -> read at most size bytes, returned as a string.
         
        If the size argument is negative or omitted, read until EOF is reached.
        Notice that when in non-blocking mode, less data than what was requested
        may be returned, even if no size parameter was given.
        """
        pass
 
    def readinto(self): # real signature unknown; restored from __doc__
        Read into a buffer; do not use, it may go away.  Removed in Python 3.x
        """ readinto() -> Undocumented.  Don't use this; it may go away. """
        pass
 
    def readline(self, size=None): # real signature unknown; restored from __doc__
        Read a single line
        """
        readline([size]) -> next line from the file, as a string.
         
        Retain newline.  A non-negative size argument limits the maximum
        number of bytes to return (an incomplete line may be returned then).
        Return an empty string at EOF.
        """
        pass
 
    def readlines(self, size=None): # real signature unknown; restored from __doc__
        Read all the data and return it as a list of lines
        """
        readlines([size]) -> list of strings, each a line from the file.
         
        Call readline() repeatedly and return a list of the lines so read.
        The optional size argument, if given, is an approximate bound on the
        total number of bytes in the lines returned.
        """
        return []
 
    def seek(self, offset, whence=None): # real signature unknown; restored from __doc__
        Move the file pointer to the given position
        """
        seek(offset[, whence]) -> None.  Move to new file position.
         
        Argument offset is a byte count.  Optional argument whence defaults to
        0 (offset from start of file, offset should be >= 0); other values are 1
        (move relative to current position, positive or negative), and 2 (move
        relative to end of file, usually negative, although many platforms allow
        seeking beyond the end of a file).  If the file is opened in text mode,
        only offsets returned by tell() are legal.  Use of other offsets causes
        undefined behavior.
        Note that not all file objects are seekable.
        """
        pass
 
    def tell(self): # real signature unknown; restored from __doc__
        Return the current file position
        """ tell() -> current file position, an integer (may be a long integer). """
        pass
 
    def truncate(self, size=None): # real signature unknown; restored from __doc__
        Truncate the file, keeping only the data before the given size
        """
        truncate([size]) -> None.  Truncate the file to at most size bytes.
         
        Size defaults to the current file position, as returned by tell().
        """
        pass
 
    def write(self, p_str): # real signature unknown; restored from __doc__
        Write a string to the file
        """
        write(str) -> None.  Write string str to file.
         
        Note that due to buffering, flush() or close() may be needed before
        the file on disk reflects the data written.
        """
        pass
 
    def writelines(self, sequence_of_strings): # real signature unknown; restored from __doc__
        Write a list of strings to the file
        """
        writelines(sequence_of_strings) -> None.  Write the strings to the file.
         
        Note that newlines are not added.  The sequence can be any iterable object
        producing strings. This is equivalent to calling write() for each string.
        """
        pass
 
    def xreadlines(self): # real signature unknown; restored from __doc__
        Iterate over the file line by line instead of reading it all.  Removed in Python 3.x
        """
        xreadlines() -> returns self.
         
        For backward compatibility. File objects now include the performance
        optimizations previously implemented in the xreadlines module.
        """
        pass

Practice:

fileno
>>> f = open('python.txt','r')
>>> f.fileno()
3

flush
isatty
>>> f = open('python.txt.new','r')
>>> f.isatty()
False

line_buffering
mode
name
newlines
read
>>> f = open('python.txt.new','r')
>>> f.read()
'模块\npickle\n\n认证信息写入列表,写入文件\n'
>>> f.close()

>>> f = open('python.txt.new','r')
>>> f.read(10)
'模块\npickle\n'
>>> f.close()

readable
readline
>>> f = open('python.txt.new','r')
>>> f.readline()
'模块\n'
>>> f.readline()
'pickle\n'
>>> f.readline()
'\n'
>>> f.readline()
'认证信息写入列表,写入文件\n'
>>> f.close()

readlines
>>> f = open('python.txt.new','r')
>>> f.readlines()
['模块\n', 'pickle\n', '\n', '认证信息写入列表,写入文件\n']
>>> f.close()

tell
>>> f = open('python.txt.new','r')
>>> f.tell()
0
>>> f.read()
'模块\npickle\n\n认证信息写入列表,写入文件\n'
>>> f.tell()
44
>>> f.close()
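A companion sketch for seek() itself (binary mode, so arbitrary byte offsets are legal):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "seek.txt")
with open(path, "wb") as f:
    f.write(b"0123456789")

f = open(path, "rb")
f.seek(4)            # whence 0 (default): offset from the start of the file
print(f.read(2))     # b'45'
f.seek(-3, 2)        # whence 2: offset relative to the end of the file
print(f.read())      # b'789'
f.seek(0)
f.seek(2, 1)         # whence 1: offset relative to the current position
print(f.tell())      # 2
f.close()
```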

1.5 The with statement

To avoid forgetting to close a file after opening it, manage it as a context (this is the recommended way to open files):

with open('xb') as f:
    pass

With this form, the file is automatically closed and its resources released when the with block finishes
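A quick way to confirm this is to look at the closed attribute after the block:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "xb")
with open(path, "w") as f:
    f.write("data")

with open(path) as f:
    print(f.closed)      # False while inside the with block

print(f.closed)          # True: f was closed on leaving the block
```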

Since Python 2.7, with can also manage the contexts of several files at once:

with open('db1', 'r', encoding="utf-8") as f1, open("db2", 'w',encoding="utf-8") as f2:     # read from f1, write to f2
    times = 0
    for line in f1:
        times += 1
        if times <=10:
            f2.write(line)
        else:
            break

with open('db1', 'r', encoding="utf-8") as f1, open("db2", 'w',encoding="utf-8") as f2:
    for line in f1:
        new_str = line.replace("alex", 'st')
        f2.write(new_str)

Example: backing up a file

>>> with open('python.txt','r') as obj1,open('python.txt.new','w') as obj2:
...     for i in obj1.readlines():
...         i = i.strip()
...         print(i)
...         obj2.write(i)
...         obj2.write('\n')
...

In fact, because of the processing (strip() followed by writing an explicit '\n'), the two files end up not byte-for-byte identical

Comparing checksums:

File: C:\Users\Administrator\python.txt.new                                        # the file rewritten by Python
Size: 44 bytes
Modified: 2016年5月23日 星期一, 22:13:47
MD5: FA84E60379990C979DF1D2EE778E6821
SHA1: 82627034A574A59D5266C9BBC1B427625E38104D
CRC32: 368DCCC0

File: C:\Users\Administrator\python.txt                                               # the original file
Size: 42 bytes
Modified: 2016年5月9日 星期一, 19:49:35
MD5: 8D6BE3607B6F6E0C8696BE90270FC0DD
SHA1: 522B719CB211C55E10E9EC9BADD453BE14C5EAC4
CRC32: 5D3A3BFF

File: C:\Users\Administrator\python - 副本.txt                                     # a copy made directly in Windows
Size: 42 bytes
Modified: 2016年5月9日 星期一, 19:49:35
MD5: 8D6BE3607B6F6E0C8696BE90270FC0DD
SHA1: 522B719CB211C55E10E9EC9BADD453BE14C5EAC4
CRC32: 5D3A3BFF
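Such checksums can also be computed from Python itself with the standard hashlib module; md5_of below is a hypothetical helper, demonstrated on a throwaway file rather than the files listed above:

```python
import hashlib
import os
import tempfile

def md5_of(path):
    """Return the uppercase hex MD5 digest of a file's bytes."""
    h = hashlib.md5()
    with open(path, "rb") as f:          # binary mode: hash the raw bytes
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest().upper()

# demo on a throwaway file
p = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(p, "wb") as f:
    f.write(b"123")
print(md5_of(p))   # MD5 of b'123'
```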

1.6 File object attributes

f.closed  # True if the file has been closed, False otherwise
f.mode    # the access mode the file was opened with
f.name    # the name of the file

Example:
>>> fo = open("test.txt", "wb")
>>> print("File name: ", fo.name)
File name:  test.txt
>>> print("Closed: ", fo.closed)
Closed:  False
>>> print("Access mode: ", fo.mode)
Access mode:  wb

2. File operations in the os module

Python's os module provides methods for file-handling operations such as renaming and deleting files. To use it, import the module first, then call the relevant functions.

os.rename(current_file_name, new_file_name)  # rename a file
os.remove(file_name)                         # delete a file
os.mkdir("newdir")                           # create a directory
os.chdir("newdir")                           # change the current directory
os.getcwd()                                  # show the current directory
os.rmdir('dirname')                          # delete a directory (must be empty)
os.removedirs(path)                          # delete directories recursively
os.stat(path)                                # show file or directory status info
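A minimal end-to-end session with these functions, run inside a scratch directory:

```python
import os
import tempfile

base = tempfile.mkdtemp()
os.chdir(base)                    # change into a scratch directory

os.mkdir("newdir")                # create a directory
os.rename("newdir", "renamed")    # rename it
print(os.getcwd())                # show the current directory

with open(os.path.join("renamed", "f.txt"), "w") as f:
    f.write("x")
print(os.stat(os.path.join("renamed", "f.txt")).st_size)   # 1 byte

os.remove(os.path.join("renamed", "f.txt"))   # delete the file
os.rmdir("renamed")                           # delete the now-empty directory
```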

3. Regular expressions

3.1 Introduction to regular expressions in Python

  A regular expression is a special sequence of characters that lets you conveniently check whether a string matches a pattern. Python has included the re module since version 1.5; it provides Perl-style regular expression patterns and gives the language the full power of regular expressions. The compile function builds a regular expression object from a pattern string and optional flags; that object has a set of methods for matching and substitution. The re module also provides functions with exactly the same functionality as those methods, taking a pattern string as their first argument.

3.2 re module methods

#
# Secret Labs' Regular Expression Engine
#
# re-compatible interface for the sre matching engine
#
# Copyright (c) 1998-2001 by Secret Labs AB.  All rights reserved.
#
# This version of the SRE library can be redistributed under CNRI's
# Python 1.6 license.  For any other use, please contact Secret Labs
# AB (info@pythonware.com).
#
# Portions of this engine have been developed in cooperation with
# CNRI.  Hewlett-Packard provided funding for 1.6 integration and
# other compatibility work.
#

r"""Support for regular expressions (RE).

This module provides regular expression matching operations similar to
those found in Perl.  It supports both 8-bit and Unicode strings; both
the pattern and the strings being processed can contain null bytes and
characters outside the US ASCII range.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like "A", "a", or "0", are the simplest
regular expressions; they simply match themselves.  You can
concatenate ordinary characters, so last matches the string 'last'.

The special characters are:
    "."      Matches any character except a newline.               
    "^"      Matches the start of the string.                    
    "$"      Matches the end of the string or just before the newline at    
             the end of the string.
    "*"      Matches 0 or more (greedy) repetitions of the preceding RE.       
             Greedy means that it will match as many repetitions as possible.
    "+"      Matches 1 or more (greedy) repetitions of the preceding RE.     
    "?"      Matches 0 or 1 (greedy) of the preceding RE.                       
    *?,+?,?? Non-greedy versions of the previous three special characters.
    {m,n}    Matches from m to n repetitions of the preceding RE.
    {m,n}?   Non-greedy version of the above.
    "\\"     Either escapes special characters or signals a special sequence.
    []       Indicates a set of characters.
             A "^" as the first character indicates a complementing set.
    "|"      A|B, creates an RE that will match either A or B.
    (...)    Matches the RE inside the parentheses.
             The contents can be retrieved or matched later in the string.
    (?aiLmsux) Set the A, I, L, M, S, U, or X flag for the RE (see below).
    (?:...)  Non-grouping version of regular parentheses.
    (?P<name>...) The substring matched by the group is accessible by name.
    (?P=name)     Matches the text matched earlier by the group named name.
    (?#...)  A comment; ignored.
    (?=...)  Matches if ... matches next, but doesn't consume the string.
    (?!...)  Matches if ... doesn't match next.
    (?<=...) Matches if preceded by ... (must be fixed length).
    (?<!...) Matches if not preceded by ... (must be fixed length).
    (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                       the (optional) no pattern otherwise.

The special sequences consist of "\\" and a character from the list
below.  If the ordinary character is not on the list, then the
resulting RE will match the second character.
    \number  Matches the contents of the group of the same number.
    \A       Matches only at the start of the string.
    \Z       Matches only at the end of the string.
    \b       Matches the empty string, but only at the start or end of a word.
    \B       Matches the empty string, but not at the start or end of a word.
    \d       Matches any decimal digit; equivalent to the set [0-9] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode digits.
    \D       Matches any non-digit character; equivalent to [^\d].
    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode whitespace characters.
    \S       Matches any non-whitespace character; equivalent to [^\s].
    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
             in bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the
             range of Unicode alphanumeric characters (letters plus digits
             plus underscore).
             With LOCALE, it will match the set [0-9_] plus characters defined
             as letters for the current locale.
    \W       Matches the complement of \w.
    \\       Matches a literal backslash.

This module exports the following functions:
    match     Match a regular expression pattern to the beginning of a string.
    fullmatch Match a regular expression pattern to all of a string.
    search    Search a string for the presence of a pattern.
    sub       Substitute occurrences of a pattern found in a string.
    subn      Same as sub, but also return the number of substitutions made.
    split     Split a string by the occurrences of a pattern.
    findall   Find all occurrences of a pattern in a string.
    finditer  Return an iterator yielding a match object for each match.
    compile   Compile a pattern into a RegexObject.
    purge     Clear the regular expression cache.
    escape    Backslash all non-alphanumerics in a string.

Some of the functions in this module takes flags as optional parameters:
    A  ASCII       For string patterns, make \w, \W, \b, \B, \d, \D
                   match the corresponding ASCII character categories
                   (rather than the whole Unicode categories, which is the
                   default).
                   For bytes patterns, this flag is the only available
                   behaviour and needn't be specified.
    I  IGNORECASE  Perform case-insensitive matching.
    L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.
    M  MULTILINE   "^" matches the beginning of lines (after a newline)
                   as well as the string.
                   "$" matches the end of lines (before a newline) as well
                   as the end of the string.
    S  DOTALL      "." matches any character at all, including the newline.
    X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
    U  UNICODE     For compatibility only. Ignored for string patterns (it
                   is the default), and forbidden for bytes patterns.

This module also defines an exception 'error'.

"""

import sys
import sre_compile
import sre_parse
try:
    import _locale
except ImportError:
    _locale = None

# public symbols
__all__ = [
    "match", "fullmatch", "search", "sub", "subn", "split",
    "findall", "finditer", "compile", "purge", "template", "escape",
    "error", "A", "I", "L", "M", "S", "X", "U",
    "ASCII", "IGNORECASE", "LOCALE", "MULTILINE", "DOTALL", "VERBOSE",
    "UNICODE",
]

__version__ = "2.2.1"

# flags
A = ASCII = sre_compile.SRE_FLAG_ASCII # assume ascii "locale"
I = IGNORECASE = sre_compile.SRE_FLAG_IGNORECASE # ignore case
L = LOCALE = sre_compile.SRE_FLAG_LOCALE # assume current 8-bit locale
U = UNICODE = sre_compile.SRE_FLAG_UNICODE # assume unicode "locale"
M = MULTILINE = sre_compile.SRE_FLAG_MULTILINE # make anchors look for newline
S = DOTALL = sre_compile.SRE_FLAG_DOTALL # make dot match newline
X = VERBOSE = sre_compile.SRE_FLAG_VERBOSE # ignore whitespace and comments

# sre extensions (experimental, don't rely on these)
T = TEMPLATE = sre_compile.SRE_FLAG_TEMPLATE # disable backtracking
DEBUG = sre_compile.SRE_FLAG_DEBUG # dump pattern after compilation

# sre exception
error = sre_compile.error

# --------------------------------------------------------------------
# public interface

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)
    '''
    Try to match the pattern at the start of the string; if the beginning of the
    string does not match, match() returns None.
    pattern: the regular expression to match
    string:  the string to be matched
    flags:   flag bits that control how the match is performed, e.g. case
             sensitivity, multi-line matching, and so on.
    '''
def fullmatch(pattern, string, flags=0):
    """Try to apply the pattern to all of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).fullmatch(string)

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)
    '''
    Scan the whole string and return the first successful match.
    The parameters mean the same as for match().
    '''

def sub(pattern, repl, string, count=0, flags=0):
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the match object and must return
    a replacement string to be used."""
    return _compile(pattern, flags).sub(repl, string, count)
    '''
    Replace the matches found in a string.
    '''

def subn(pattern, repl, string, count=0, flags=0):
    """Return a 2-tuple containing (new_string, number).
    new_string is the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in the source
    string by the replacement repl.  number is the number of
    substitutions that were made. repl can be either a string or a
    callable; if a string, backslash escapes in it are processed.
    If it is a callable, it's passed the match object and must
    return a replacement string to be used."""
    return _compile(pattern, flags).subn(repl, string, count)

def split(pattern, string, maxsplit=0, flags=0):
    """Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list."""
    return _compile(pattern, flags).split(string, maxsplit)

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)
    '''
    match/search return at most one match; to find every substring of the
    string that matches, use findall.
    '''

def finditer(pattern, string, flags=0):
    """Return an iterator over all non-overlapping matches in the
    string.  For each match, the iterator returns a match object.

    Empty matches are included in the result."""
    return _compile(pattern, flags).finditer(string)

def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a pattern object."
    return _compile(pattern, flags)

def purge():
    "Clear the regular expression caches"
    _cache.clear()
    _cache_repl.clear()

def template(pattern, flags=0):
    "Compile a template pattern, returning a pattern object"
    return _compile(pattern, flags|T)

# (The remaining module internals -- escape(), the pattern cache
#  (_compile, _cache), the pickling hook, and the experimental
#  Scanner class -- are omitted here.)

3.3 re usage examples



# re.match()
>>> import re                         # import the re module
>>> print(re.match('www', 'www.runoob.com').span())   # matches at the start
(0, 3)
>>> print(re.match('com', 'www.runoob.com'))          # does not match at the start
None

# another example
>>> import re
>>>
>>> obj = re.match(r'\d+', '123uuasf')
>>> if obj:
...     print(obj.group())
...
123


# group()
>>> import re
>>>
>>> line = "Cats are smarter than dogs"
>>> matchObj = re.match( r'(.*) are (.*?) .*', line, re.M|re.I)
>>>
>>> if matchObj:
...    print ("matchObj.group() : ", matchObj.group())
...    print ("matchObj.group(1) : ", matchObj.group(1))
...    print ("matchObj.group(2) : ", matchObj.group(2))
... else:
...    print ("No match!!")
...
matchObj.group() :  Cats are smarter than dogs
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter

# re.search()
>>> import re
>>> print(re.search('www', 'www.runoob.com').span())      # matches at the start
(0, 3)
>>> print(re.search('com', 'www.runoob.com').span())         # also matches later in the string
(11, 14)

# another example
>>> obj = re.search(r'\d+', 'u123uu888asf')
>>> if obj:
...     print(obj.group())
...
123

# re.findall()
>>> import re
>>>
>>> obj = re.findall(r'\d+', 'fa123uu888asf')
>>> print(obj)
['123', '888']


The difference between re.match and re.search:

re.match only matches at the beginning of the string: if the beginning does not match the pattern, the match fails and the function returns None. re.search scans the whole string and returns the first match it finds.

import re

line = "Cats are smarter than dogs"

matchObj = re.match( r'dogs', line, re.M|re.I)
if matchObj:
   print("match --> matchObj.group() : ", matchObj.group())
else:
   print("No match!!")

matchObj = re.search( r'dogs', line, re.M|re.I)
if matchObj:
   print("search --> matchObj.group() : ", matchObj.group())
else:
   print("No match!!")

# re.sub()
>>> import re
>>>
>>> phone = "2004-959-559 # This is Phone Number"
>>>
>>> # Delete Python-style comments
... num = re.sub(r'#.*$', "", phone)
>>> print ("Phone Num : ", num)
Phone Num :  2004-959-559
>>>
>>> # Remove anything other than digits
... num = re.sub(r'\D', "", phone)
>>> print ("Phone Num : ", num)
Phone Num :  2004959559
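The replacement may also be a callable, and re.subn additionally reports how many substitutions were made:

```python
import re

phone = "2004-959-559"

# repl can be a callable: it receives the match object and
# returns the replacement text.
masked = re.sub(r"\d", lambda m: "*", phone)
print(masked)                         # ****-***-***

# re.subn returns (new_string, number_of_substitutions).
text, n = re.subn(r"\d", "*", phone)
print(text, n)                        # ****-***-*** 10
```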

 

3.4 Optional flags (modifiers)

Regular expressions may take optional flags that modify how matching is performed. Multiple flags can be combined with the bitwise OR operator (|); for example, re.I | re.M sets both the I and the M flag:
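For example, combining re.I and re.M (re.I makes matching case-insensitive; re.M lets ^ and $ also match at each embedded newline):

```python
import re

text = "first line\nSecond LINE"

# Without re.M, ^ would only match at the very start of the string;
# without re.I, 'LINE' would not match the lowercase pattern.
print(re.findall(r"^\w+ line$", text, re.I | re.M))
# ['first line', 'Second LINE']
```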

3.5 Regular expression patterns

A pattern string uses a special syntax to represent a regular expression:

Letters and digits match themselves. Letters and digits in a pattern match the same characters in the string.

Most letters and digits take on a different meaning when preceded by a backslash.

Punctuation characters match themselves only when escaped; otherwise they have special meaning.

The backslash itself must be escaped with another backslash.

Since regular expressions typically contain backslashes, it is best to write them as raw strings: a pattern element such as r'\t' (equivalent to '\\t') matches the corresponding special character.

The table below lists the special elements of regular expression pattern syntax. Note that the meaning of some elements changes when optional flags are supplied along with the pattern.

Practice:

# Character matching (ordinary characters and metacharacters)
# Ordinary characters: most characters and letters simply match themselves
>>> import re
>>> re.findall('alex','yuanaleSxalexwupeiqi')
['alex']
>>> re.findall('yuan','yuanaleSxalexwupeiqi')
['yuan']

# Metacharacters (.   ^   $   *   +   ?   { }   [ ]   |   ( ) \)
# Note: [] is special; most metacharacters lose their special meaning inside the brackets
# . : matches any single character
>>> re.findall('al.x','yuanaleSxalexwupeiqialax')        # a letter
['alex', 'alax']
>>> re.findall('al.x','yuanaleSxalexwupeiqialaxal4x')    # a digit
['alex', 'alax', 'al4x']
>>> re.findall('al.x','yuanaleSxalexwupeiqialaxal!x')    # the ! character
['alex', 'alax', 'al!x']
>>> re.findall('al.x','yuanaleSxalexwupeiqialaxal@x')    # the @ character
['alex', 'alax', 'al@x']
# ^ : matches at the start of the string; $ : matches at the end of the string
>>> re.findall('^yuan','yuanaleSxalexwupeiqialaxalx')
['yuan']
>>> re.findall('alx$','yuanaleSxalexwupeiqialaxalx')
['alx']

# * : matches the preceding character 0 or more times
>>> re.findall('al*x','yuanaleSxalexwuallllxpeiqialaxalx')
['allllx', 'ax', 'alx']

# + : matches the preceding character 1 or more times (compare with the previous example)
>>> re.findall('al+x','yuanaleSxalexwuallllxpeiqialaxalx')
['allllx', 'alx']

# ? : matches the preceding character 0 or 1 times (compare with the two examples above)
>>> re.findall('al?x','yuanaleSxalexwuallllxpeiqialaxalx')
['ax', 'alx']

# {m} : matches the preceding character exactly m times; {m,} : m or more times; {m,n} : between m and n times
>>> re.findall('al{4}x','yuanaleSxalexwuallllxpeiqialaxalx')
['allllx']
>>> re.findall('al{4,}x','yuanaleSxalexwuallllxpalllllllllxeiqialaxalx')
['allllx', 'alllllllllx']
>>> re.findall('al{3,4}x','yuanaleSxalexwuallllxpeiqialaxalx')
['allllx']
# [] : character set; the match position may be any character in the set, e.g. a-z A-Z 0-9.
#      Inside [], only the characters ^ - ] and \ keep a special meaning: \ still escapes,
#      - defines a character range, and ^ as the first character negates the set.
>>> re.findall('al[a-z]x','yuanaleSxalexwuallllxpeiqialaxalx')
['alex', 'alax']

# | : matches either of the expressions on its two sides; here: 'ale' or 'xx'
>>> re.findall('ale|xx','yuanaleSxalexwuallllxpeiqialaxalxx')
['ale', 'ale', 'xx']

# () : a parenthesized expression forms a group

# \ : escape character; it gives an ordinary character a special meaning,
#     or strips the special meaning from a metacharacter
>>> re.findall('\alex','yuanaleSxalexwuallllxpeiqialaxalxx')   # \a has a special meaning (bell character)
[]
>>> re.findall('a\lex','yuanaleSxalexwuallllxpeiqialaxalxx')   # \l has no special meaning (newer Pythons reject such unknown escapes)
['alex']
>>> re.findall('al\ex','yuanaleSxalexwuallllxpeiqialaxalxx')   # \e has no special meaning
['alex']
# Character classes
# \d  matches any decimal digit; equivalent to the class [0-9].
>>> re.findall('al.x','yuanaleSxalexwuallllxpeiqialaxal3x')  # for comparison: . matches anything
['alex', 'alax', 'al3x']
>>> re.findall('al\dx','yuanaleSxalexwuallllxpeiqialaxal3x')
['al3x']

# \D  matches any non-digit character; equivalent to the class [^0-9].
>>> re.findall('al\Dx','yuanaleSxalexwuallllxpeiqialaxal3x')
['alex', 'alax']

# \s  matches any whitespace character; equivalent to the class [ \t\n\r\f\v].
# \S  matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v].
# \w  matches any alphanumeric character; equivalent to the class [a-zA-Z0-9_].
# \W  matches any non-alphanumeric character; equivalent to the class [^a-zA-Z0-9_].
# \b: matches a word boundary, i.e. the position between a word and whitespace. A
    "word" here is a consecutive run of letters, digits and underscores. Note that
    \b is defined as the junction of \w and \W; it is a zero-width assertion that
    matches only at the start or end of a word. Since a word is a sequence of
    alphanumeric characters, its end is marked by whitespace or a non-alphanumeric
    character.
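A small \b demonstration:

```python
import re

s = "class classify subclass class"

# \b only matches at word boundaries, so occurrences embedded
# inside longer words are skipped.
print(re.findall(r"\bclass\b", s))   # ['class', 'class']
print(re.findall(r"class", s))       # also matches inside other words
```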

Grouping

# re.compile()
>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
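Groups can also be named with (?P&lt;name&gt;...); a short sketch (the pattern here is an illustrative, simplified email matcher):

```python
import re

# (?P<name>...) makes a group addressable by name as well as by index.
m = re.match(r"(?P<user>\w+)@(?P<host>[\w.]+)", "630571017@qq.com")
print(m.group("user"))     # 630571017
print(m.group(2))          # qq.com -- groups can still be used by number
print(m.groupdict())       # {'user': '630571017', 'host': 'qq.com'}
```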

Greedy matching vs. lazy matching
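The difference can be sketched with a pair of findall calls (greedy quantifiers grab as much as possible, lazy ones as little as possible):

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy (default): .+ runs from the first '<' to the last '>'.
print(re.findall(r"<.+>", html))
# ['<b>bold</b> and <i>italic</i>']

# Lazy (append ?): .+? stops at the first possible '>'.
print(re.findall(r"<.+?>", html))
# ['<b>', '</b>', '<i>', '</i>']
```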

3.6 Common regex applications

# Match a mobile phone number (11 digits starting with 1)
>>> import re
>>> phone_num = '13001000000'
>>> a = re.compile(r"^1\d{10}$")
>>> b = a.match(phone_num)
>>> print(b.group())
13001000000

# Match an IP address
ip = '192.168.1.1'
a = re.compile(r"(((1?[0-9]?[0-9])|(2[0-4][0-9])|(25[0-5]))\.){3}((1?[0-9]?[0-9])|(2[0-4][0-9])|(25[0-5]))$")
b = a.search(ip)
print(b)

# Match an email address
email = '630571017@qq.com'
a = re.compile(r"[\w.]{1,26}@\w{1,20}\.\w{1,8}")
b = a.search(email)
print(b.group())

4. json & pickle

These are the standard serialization modules for file-related work; for detailed examples see: http://www.cnblogs.com/madsnotes/articles/5537947.html
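The linked article has the details; as a minimal sketch of the two modules (the auth records below are made up for illustration):

```python
import json
import os
import pickle
import tempfile

# Hypothetical auth records, written to a list and then to files.
auth = [{"user": "alex", "password": "123"}]

base = tempfile.mkdtemp()
json_path = os.path.join(base, "auth.json")
pkl_path = os.path.join(base, "auth.pkl")

# json: text-based, human readable, interoperable with other languages.
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(auth, f)
with open(json_path, encoding="utf-8") as f:
    print(json.load(f) == auth)      # True

# pickle: binary, Python-only, can serialize most Python objects.
with open(pkl_path, "wb") as f:
    pickle.dump(auth, f)
with open(pkl_path, "rb") as f:
    print(pickle.load(f) == auth)    # True
```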

 

posted @ 2016-05-23 21:49  每天进步一点点!!!