Chapter 4: Text Processing

1. String Literals

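1). Strings are immutable ordered collections

Python does not distinguish between characters and strings

A single character is simply a string of length 1

Strings are immutable

A string is an ordered collection of characters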
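A minimal sketch of these properties. The variable names are illustrative only and do not come from the original notes.

```python
s = "hello"

# A single character is just a string of length 1
first = s[0]
first_type = type(first).__name__

# Strings are ordered, so indexing and slicing work
tail = s[1:]

# Strings are immutable: item assignment raises TypeError
try:
    s[0] = "J"
    mutated = True
except TypeError:
    mutated = False

# "Changing" a string really means building a new one
t = "J" + s[1:]

print(first, first_type, tail, mutated, t)
```

The original string `s` is left untouched; `t` is a separate object.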

2). String functions

General operations

Case-related methods

Predicate methods

The string methods startswith and endswith

Search functions

String manipulation methods
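The categories above can be illustrated with one or two built-in methods each. This is only a sampling, and the sample strings (including the example.com URL) are made up for illustration.

```python
s = "  Hello, World  "

# General operations: length and membership
length = len(s)
has_hello = "Hello" in s

# Case-related methods
upper = s.strip().upper()
lower = s.strip().lower()

# Predicate methods
is_digit = "2019".isdigit()
is_alpha = "abc1".isalpha()

# startswith and endswith accept a single string or a tuple of strings
is_log = "access.log".endswith((".log", ".txt"))
is_http = "https://example.com".startswith("http")

# Search: find returns -1 when the substring is missing
pos = "access.log".find(".")
missing = "access.log".find("#")

# Manipulation: split, join, replace
parts = "a,b,c".split(",")
joined = "-".join(parts)
replaced = "a,b,c".replace(",", ";")

print(length, upper, lower, pos, missing, joined, replaced)
```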

3). Analyzing Apache access logs with Python

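from __future__ import print_function

# Collect the client IP (first field) of every request in the log
ips = []
with open('access.log') as f:
    for line in f:
        ips.append(line.split()[0])

print(ips)
# PV (page views) is the total number of requests;
# UV (unique visitors) is the number of distinct client IPs
print("PV is {0}".format(len(ips)))
print("UV is {0}".format(len(set(ips))))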

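from __future__ import print_function
from collections import Counter

# Count requests per resource; field 6 is the requested path
c = Counter()
with open('access.log') as f:
    for line in f:
        c[line.split()[6]] += 1

print(c)
# most_common(10) returns the ten most requested resources
print("Popular resources : {0}".format(c.most_common(10)))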

from __future__ import print_function

# Count requests per HTTP status code; field 8 is the status code
d = {}
with open('access.log') as f:
    for line in f:
        key = line.split()[8]
        d.setdefault(key, 0)
        d[key] += 1

sum_requests = 0
error_requests = 0

# Status codes of 400 and above (4xx and 5xx) count as errors
for key, val in d.items():
    if int(key) >= 400:
        error_requests += val
    sum_requests += val

print("error rate: {0:.2f}%".format(error_requests * 100.0 / sum_requests))
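The three scripts above rely on the positions 0, 6 and 8 of a whitespace-split log line. The sketch below shows where those positions come from, assuming the log uses Apache's combined log format. The sample line is hypothetical, not taken from a real access.log.

```python
# Hypothetical log line, assuming Apache's "combined" log format
line = ('127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/4.08"')

fields = line.split()

ip = fields[0]        # client IP address
resource = fields[6]  # requested path
status = fields[8]    # HTTP status code, still a string at this point

print(ip, resource, status)
```

If the server uses a different LogFormat, the indexes shift and have to be adjusted accordingly.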

4). String formatting

Python has two classic ways to format strings: % expressions and the format function

Of the two, format is the way forward for string formatting
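A short side-by-side sketch of the two styles. The names and values are illustrative only.

```python
name, score = "Tom", 95.678

# % expression
by_percent = "%s scored %.2f" % (name, score)

# format with positional indexes
by_format = "{0} scored {1:.2f}".format(name, score)

# format with keyword arguments
by_keyword = "{name} scored {score:.1f}".format(name=name, score=score)

# format with alignment: right-align and left-align in a width of 6
aligned = "{0:>6}|{1:<6}|".format("ab", "cd")

print(by_percent)
print(by_format)
print(by_keyword)
print(aligned)
```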

"{} is better than {}.".format('Beautiful', 'ugly')

2. Regular Expressions

1). Handling regular expressions with the re library

Version without a compiled regular expression:
import re

def main():
    pattern = "[0-9]+"
    with open('data.txt') as f:
        for line in f:
            re.findall(pattern, line)

if __name__ == '__main__':
    main()

Version with a compiled regular expression:
import re

def main():
    pattern = "[0-9]+"
    re_obj = re.compile(pattern)
    with open('data.txt') as f:
        for line in f:
            re_obj.findall(line)

if __name__ == '__main__':
    main()

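The search and match functions are used in almost the same way. The difference is that search looks for a match anywhere in the string, while match only matches at the beginning of the string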
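A minimal sketch of the difference, using a made-up input string:

```python
import re

text = "error code 404"

# match only tries at the start of the string, so digits are not found
m1 = re.match(r"[0-9]+", text)

# search scans the whole string and finds the digits
m2 = re.search(r"[0-9]+", text)

# match succeeds when the pattern fits the beginning of the string
m3 = re.match(r"[a-z]+", text)

print(m1, m2.group(), m3.group())
```

Both functions return None when nothing matches, so the result should be checked before calling group().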

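2). Commonly used re methods

Matching functions: the simplest function in the re module is findall

Modifying functions: the re module's sub function is similar to the string replace method, except that sub supports regular expressions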
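A small sketch of findall and sub next to the plain string replace method. The sample text is invented for illustration.

```python
import re

text = "call 555-1234 or 555-9876"

# findall returns every non-overlapping match as a list of strings
numbers = re.findall(r"[0-9]+", text)

# With groups in the pattern, findall returns a list of tuples
groups = re.findall(r"([0-9]{3})-([0-9]{4})", text)

# sub replaces every match of a pattern
masked = re.sub(r"[0-9]{4}", "****", text)

# str.replace only handles literal substrings
plain = text.replace("555", "XXX")

print(numbers)
print(groups)
print(masked)
print(plain)
```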

3). Case study: extracting all hyperlinks from an HTML page

import re
import requests

r = requests.get('https://news.ycombinator.com/')
mydata = re.findall('"(https?://.*?)"', r.content.decode('utf-8'))
print(mydata)

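3. Character Set Encoding

1). UTF-8 encoding

2). Unicode in Python 2 and Python 3

There are many ways to represent Unicode characters as binary data; the most common encoding is UTF-8

Unicode is the representation form, UTF-8 is a storage form

UTF-8 is the most widely used encoding, but it is only one of the storage forms of Unicode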
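The distinction between Unicode text and its encoded bytes can be sketched as follows. This assumes Python 3, where str is Unicode text and bytes is encoded data; the sample string is chosen only for illustration.

```python
# Unicode text: two code points (the Chinese word for "Chinese language")
text = "中文"

# Encoding turns the text into bytes for storage or transmission
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# Decoding with the same encoding recovers the original text
back = utf8_bytes.decode("utf-8")

print(len(text), len(utf8_bytes), len(utf16_bytes), back == text)
```

The same two characters take 6 bytes in UTF-8 and 4 bytes in UTF-16, which is why the encoding has to be known before bytes can be decoded.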

from __future__ import unicode_literals

4. Jinja2 Templates

1). Syntax blocks

Jinja2 has three kinds of syntax:
Control structures {% %}
Variable substitution {{ }}
Comments {# #}
{% if users %}
    <ul>
    {% for user in users %}
        <li>{{ user.username }}</li>
    {% endfor %}
    </ul>
{% endif %}

2). Jinja2 inheritance and the super function

The base.html page
<html lang="en">
<head>
    {% block head %}
    <link rel="stylesheet" href="style.css" />
    <title>{% block title %}{% endblock %} - My Webpage</title>
    {% endblock %}
</head>
<body>
<div id="content">
  {% block content %}{% endblock %}
</div>
</body>
</html>

The index.html page
{% extends "base.html" %}
{% block title %}index{% endblock %}
{% block head %}
{{ super() }}
<style type="text/css">
.important { color: #336699; }
</style>
{% endblock %}
{% block content %}
<h1>index</h1>
<p class="important">Welcome on my awesome homepage.</p>
{% endblock %}
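To see what inheritance and super() produce, the two templates can be rendered from Python. The sketch below assumes the jinja2 package is installed and uses shortened, in-memory versions of the templates (loaded with DictLoader) rather than the exact files above.

```python
from jinja2 import Environment, DictLoader

# Shortened stand-ins for base.html and index.html, kept in memory
templates = {
    "base.html": (
        "<head>{% block head %}<title>{% block title %}{% endblock %}"
        " - My Webpage</title>{% endblock %}</head>"
        "<body>{% block content %}{% endblock %}</body>"
    ),
    "index.html": (
        '{% extends "base.html" %}'
        "{% block title %}index{% endblock %}"
        "{% block head %}{{ super() }}<style></style>{% endblock %}"
        "{% block content %}<h1>index</h1>{% endblock %}"
    ),
}

env = Environment(loader=DictLoader(templates))

# Rendering the child pulls in the parent's layout; super() keeps the
# parent's head block and the child appends its own <style> after it
html = env.get_template("index.html").render()
print(html)
```

Without the super() call, the child's head block would replace the parent's entirely and the title element would be lost.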

posted @ 2019-08-09 11:51 AllenHU320
