Python Web Services (15), continuously updated

Web scraping

from urllib import urlopen
import re
p = re.compile('<h3><a .*?><a .*? href="(.*?)">(.*?)</a>')
text = urlopen('http://python.org/community/jobs').read()
for url, name in p.findall(text):
    print '%s (%s)' % (name, url)

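This program has the following drawbacks:

The regular expression is not easy to read.
The program does not handle HTML features such as CDATA sections and character entities (for example, &amp;).
The regular expression is tied to the details of the HTML source, so small changes to the page markup can break it.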
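
To make the last two drawbacks concrete, here is a minimal sketch in Python 3 (the post's own code is Python 2). The pattern and the sample markup are simplified, hypothetical stand-ins, not the real python.org page.

```python
import re
from html import unescape

# Hypothetical, simplified pattern: assumes href is the first attribute of <a>.
pattern = re.compile(r'<h3><a href="(.*?)">(.*?)</a>')

sample = '<h3><a href="/jobs/1">Research &amp; Development</a></h3>'
matches = pattern.findall(sample)

# The regex returns the raw source text, so the entity is still encoded.
raw_name = matches[0][1]
clean_name = unescape(raw_name)

# Adding one attribute before href is enough to make the pattern miss the link.
reordered = '<h3><a class="job" href="/jobs/1">Research &amp; Development</a></h3>'
missed = pattern.findall(reordered)

print(raw_name)
print(clean_name)
print(missed)
```

The entity has to be decoded in a separate step, and a harmless change in the markup silently produces no matches at all.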

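There are two ways around this:

Have the program call Tidy (through a Python library) and parse the resulting XHTML.
Use the Beautiful Soup library, which is designed specifically for scraping web pages.

There are other tools as well, such as scrape.py.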

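Tidy and XHTML parsing

Tidy is a tool for fixing ill-formed, sloppy HTML documents. It cannot fix every problem in an HTML file, but it does make sure the file is well-formed, that is, that all elements are nested correctly.

Tidy-related projects

tidy        the original Tidy, written in C
utidy       a Python wrapper library; fairly old
mxtidy      a Python wrapper library; fairly old, supports only up to Python 2.5
jtidy       Tidy written in Java
tidy-html5  Tidy written in C, with HTML5 support
npp-tidy2   a Tidy plugin for the Notepad++ editor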

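Source: http://xpenxpen.iteye.com/blog/2160269

Installing on Windows: download the zip archive from http://binaries.html-tidy.org/, unpack it, and move tidy.exe into the same directory as your program.

Python can then run the tidy program with the Popen class from the subprocess module.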

messy.html

<h1>Pet Shop
<h2>Complaints</h3>

<p>There is <b>no <i>way</b> at all</i> we can accetp returned parrots.

<h1><i>Dead pets</h1>

<p>Our pets may tend to rest at times. but rarely die within the warranty period.

<i><h2>News</h2></i>

<p>We have just received <b>a really nice parrot.

<p>It's really nice</b>  <h3><hr />The Norwegian Blue</h3>  <h4>Plumage and
        <hr />pining behavior</h4>
    <a href="#norwegian-blue">More information<a>
        <p>Features: <body> <li>Beautiful plumage

tidy_test.py

from subprocess import Popen, PIPE

text = open('messy.html').read()
tidy = Popen('tidy', stdin=PIPE, stdout=PIPE, stderr=PIPE)

tidy.stdin.write(text)
tidy.stdin.close()

print tidy.stdout.read()
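
The script above writes to stdin and then reads stdout, while the stderr pipe is never read. With a very messy document, the warnings could fill that pipe and stall the child process. As a hedged alternative, here is a minimal Python 3 sketch that lets subprocess.run feed the input and collect both streams. The helper name run_filter is hypothetical, and a small Python one-liner stands in for tidy so the sketch runs even where tidy is not installed.

```python
import subprocess
import sys

def run_filter(command, text):
    """Send text to the command's stdin and return (stdout, stderr) as strings."""
    result = subprocess.run(
        command,
        input=text,           # written to the child's stdin, then stdin is closed
        capture_output=True,  # collect stdout and stderr without blocking
        text=True,            # work with str instead of bytes
    )
    return result.stdout, result.stderr

# Stand-in filter (an assumption for this demo): it upper-cases its input.
# With tidy on the PATH, the command would be ["tidy"] instead.
upper_filter = [
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

out, err = run_filter(upper_filter, "<h1>Pet Shop")
print(out)
```

With tidy itself, the cleaned-up document would arrive in out and the warnings in err.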

Output

<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Windows version 5.2.0">
<title></title>
</head>
<body>
<h1>Pet Shop</h1>
<h2>Complaints</h2>
<p>There is <b>no <i>way</i></b> <i>at all</i> we can accetp
returned parrots.</p>
<h1><i>Dead pets</i></h1>
<p><i>Our pets may tend to rest at times. but rarely die within the
warranty period. </i></p>
<h2><i>News</i></h2>
<p>We have just received <b>a really nice parrot.</b></p>
<p><b>It's really nice</b></p>
<hr>
<h3>The Norwegian Blue</h3>
<h4>Plumage and</h4>
<hr>
<h4>pining behavior</h4>
<a href="#norwegian-blue">More information</a>
<p>Features:</p>
<ul>
<li>Beautiful plumage</li>
</ul>
</body>
</html>

This is effectively the same as running the command in a shell:

D:\python_basic_course\course_15>tidy messy.html
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 1 - Warning: inserting implicit <body>
line 1 column 1 - Warning: missing </h1> before <h2>
line 2 column 1 - Warning: missing </h2> before </h3>
line 2 column 15 - Warning: discarding unexpected </h3>
line 4 column 19 - Warning: replacing unexpected b with </b>
line 4 column 29 - Warning: inserting implicit <i>
line 6 column 5 - Warning: missing </i> before </h1>
line 8 column 4 - Warning: inserting implicit <i>
line 10 column 1 - Warning: missing </i> before <h2>
line 8 column 4 - Warning: missing </i> before <h2>
line 10 column 8 - Warning: inserting implicit <i>
line 10 column 17 - Warning: discarding unexpected </i>
line 12 column 26 - Warning: missing </b> before <p>
line 14 column 4 - Warning: inserting implicit <b>
line 14 column 30 - Warning: <hr> isn't allowed in <h3> elements
line 14 column 26 - Info: <h3> previously mentioned
line 15 column 9 - Warning: <hr> isn't allowed in <h4> elements
line 14 column 61 - Info: <h4> previously mentioned
line 16 column 5 - Warning: <a> is probably intended as </a>
line 17 column 22 - Warning: discarding unexpected <body>
line 17 column 29 - Warning: <li> isn't allowed in <body> elements
line 1 column 1 - Info: <body> previously mentioned
line 17 column 29 - Warning: inserting implicit <ul>
line 17 column 29 - Warning: missing </ul>
line 1 column 1 - Warning: inserting missing 'title' element
line 10 column 1 - Warning: trimming empty <i>
Info: Document content looks like HTML5
Tidy found 24 warnings and 0 errors!

<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Windows version 5.2.0">
<title></title>
</head>
<body>
<h1>Pet Shop</h1>
<h2>Complaints</h2>
<p>There is <b>no <i>way</i></b> <i>at all</i> we can accetp
returned parrots.</p>
<h1><i>Dead pets</i></h1>
<p><i>Our pets may tend to rest at times. but rarely die within the
warranty period. </i></p>
<h2><i>News</i></h2>
<p>We have just received <b>a really nice parrot.</b></p>
<p><b>It's really nice</b></p>
<hr>
<h3>The Norwegian Blue</h3>
<h4>Plumage and</h4>
<hr>
<h4>pining behavior</h4>
<a href="#norwegian-blue">More information</a>
<p>Features:</p>
<ul>
<li>Beautiful plumage</li>
</ul>
</body>
</html>

About HTML Tidy: https://github.com/htacg/tidy-html5
Bug reports and comments: https://github.com/htacg/tidy-html5/issues
Official mailing list: https://lists.w3.org/Archives/Public/public-htacg/
Latest HTML specification: http://dev.w3.org/html5/spec-author-view/
Validate your HTML documents: http://validator.w3.org/nu/
Lobby your company to join the W3C: http://www.w3.org/Consortium

Do you speak a language other than English, or a different variant of
English? Consider helping us to localize HTML Tidy. For details please see
https://github.com/htacg/tidy-html5/blob/master/README/LOCALIZE.md

D:\python_basic_course\course_15>

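XHTML

The main difference between XHTML and older forms of HTML is that XHTML is stricter about explicitly closing all elements.

For example, in HTML you might end one paragraph and start the next with nothing more than an opening <p> tag, whereas in XHTML you must explicitly close the current paragraph with </p> first.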
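
Because XHTML is XML, a strict XML parser rejects the unclosed form. The following Python 3 sketch illustrates this with xml.etree.ElementTree from the standard library. The helper name is_well_formed and both fragments are made up for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True

# HTML-style: the first paragraph is never closed.
html_style = "<div><p>one<p>two</div>"
# XHTML-style: every paragraph is closed explicitly.
xhtml_style = "<div><p>one</p><p>two</p></div>"

print(is_well_formed(html_style))
print(is_well_formed(xhtml_style))
```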

The XHTML produced by Tidy can then be parsed with HTMLParser.

HTMLParser

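Using HTMLParser means subclassing it and overriding event-handling methods such as handle_starttag or handle_data.

HTMLParser callback methods

handle_starttag(tag, attrs)        Called when a start tag is found. attrs is a sequence of (name, value) pairs
handle_startendtag(tag, attrs)     Called for empty tags. By default, handles the start tag and end tag separately
handle_endtag(tag)                 Called when an end tag is found
handle_data(data)                  Called for text data
handle_charref(ref)                Called for character references of the form &#ref;
handle_entityref(name)             Called for entity references of the form &name;
handle_comment(data)               Called for comments; receives only the comment contents
handle_decl(decl)                  Called for declarations of the form <!...>
handle_pi(data)                    Called for processing instructions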
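
To see when some of these callbacks fire, the following Python 3 sketch records each event for a tiny fragment. It rests on a few assumptions: in Python 3 the class lives in html.parser rather than the Python 2 HTMLParser module used in this post, and convert_charrefs=False is passed so that handle_entityref is actually called (by default, Python 3 converts references into ordinary text data). The class name EventLogger and the sample fragment are made up for this illustration.

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Record parser callbacks in the order they are called."""

    def __init__(self):
        # convert_charrefs=False keeps entity references as separate events.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_comment(self, data):
        self.events.append(("comment", data))

logger = EventLogger()
logger.feed('<p class="x">a &amp; b<!-- note --></p>')
logger.close()

for event in logger.events:
    print(event)
```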

htmlparser_test.py

# coding: utf-8

from urllib import urlopen
from HTMLParser import HTMLParser

class Scraper(HTMLParser):

    in_h3 = False
    in_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h3':
            self.in_h3 = True

        if tag == 'a' and 'href' in attrs:
            self.in_link = True
            self.chunks = []
            self.url = attrs['href']


    def handle_data(self, data):
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'h3':
            self.in_h3 = False
        if tag == 'a':
            if self.in_h3 and self.in_link:
                string = ''.join(self.chunks)

                if isinstance(string, unicode):
                    string = string.encode("utf8")
                else:
                    string = unicode(string, "gb2312")
                    string = string.encode("utf8")

                print '%s (%s)' % (string, self.url)
            self.in_link = False

try:
    response = urlopen("http://www.qq.com/")
except Exception as e:
    # Stop here: without a response there is nothing to parse.
    raise SystemExit("Error: problem downloading the page: " + str(e))

if response.code != 200:
    print "Error: expected status code 200 but got " + str(response.code)

text = response.read()

parser = Scraper()
parser.feed(text)
parser.close()

Output

台风“妮妲”登陆广东 大树被连根拔起 (http://news.qq.com/a/20160802/003497.htm)
航拍张家界玻璃栈道 绝壁凌空令人头晕目眩 (http://news.qq.com/a/20160802/006341.htm#p=1)
疯狂敛财10亿的“心灵培训班”，别无人监管 (http://view.news.qq.com/original/intouchtoday/n3605.html)
中国高铁盈利地图：东部赚翻 中西部巨亏 (http://finance.qq.com/a/20160802/005982.htm)
20万多买奥迪Q3 讴歌CDX竞品SUV最高降6万 (http://auto.qq.com/a/20160802/004725.htm)
美国男篮44分狂胜尼日利亚 奥运热身5战净胜215分 (http://sports.qq.com/nba/)
央视：房价未来怎么走？看完这个就明白了 (http://news.house.qq.com/)
韦德遭热火怠慢詹皇抱不平 称韦德是热火科比 (http://sports.qq.com/a/20160802/005100.htm)
霍建华林心如返台准备归宁宴 走商务通道避媒体 (http://ent.qq.com/a/20160802/006231.htm#p=1)
企鹅智酷：魏则西事件后，网民如何看网上就医？ (http://tech.qq.com/a/20160503/006393.htm#p=1)
iPhone指纹扫描弱爆了，LG把指纹做到了屏幕里 (http://tech.qq.com/a/20160503/002791.htm)
这才是中国的奢侈品，惊艳上千年！ (http://cul.qq.com/a/20160802/004627.htm#p=1)
全都输范冰冰？但比腿我站张馨予 (http://fashion.qq.com/visual/photo.shtml)
张檬又变脸了，这次真的认不出！ (http://fashion.qq.com/a/20160802/008502.htm#p=1)
遭遇恐怖袭击怎么办？一个听天由命者的视角 (http://dajia.qq.com/)
星运365 8月2日12星座运势 哪个星座运势最差 (http://astro.fashion.qq.com)
星座控：从南北交点探寻你的前世今生（上） (http://astro.fashion.qq.com/original/constellationControl/NBJD.html)
点赞！河南双腿瘫痪高考生被武大录取 (http://edu.qq.com/photo/)
暑期充电助你变身学霸 (http://edu.qq.com/class/onecourse/shujiaxuexi.htm)
两只考拉树上打架 一只被打哭哈哈哈！ (http://v.qq.com/cover/a/aekiwhvmdhhwa23/k00206jkl9t.html)
中国海军和平方舟医院船凯旋而归 (http://news.qq.com/a/20160127/011493.htm#p=1)
CCTV称直播行业烧钱 曝小智1.2亿被挖！ (http://games.qq.com/a/20160802/000757.htm)
Sky李晓峰晒魔兽选手聚会 网友看哭：都是青春 (http://games.qq.com/a/20160802/001398.htm)
压力山大的现代人 你可以试着用佛法减压 (http://foxue.qq.com/)
净慧长老：《心经》里的一个“心”字 奥义无穷 (http://rufodao.qq.com/a/20160801/023724.htm)
存在：84岁老爹和他13岁的娃 (http://gongyi.qq.com/original/exist/oldfatherinfamily.html)

Beautiful Soup

For an introduction, see http://cuiqingcai.com/1319.html

beautifulsoup_test.py

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

text = urlopen("http://www.qq.com/").read()
soup = BeautifulSoup(text)

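jobs = set()
for header in soup('h3'):
    # A string as the second argument restricts the search to <a> tags
    # whose CSS class is "reference"; headers without such a link are skipped.
    links = header('a', 'reference')
    if not links:
        continue
    link = links[0]
    jobs.add('%s (%s)' % (link.string, link['href']))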

print '\n'.join(sorted(jobs, key=lambda s: s.lower()))

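Creating dynamic web pages with CGI

CGI stands for Common Gateway Interface.

Step 1: Prepare the web server

CGI programs must be placed in a directory that is accessible over the web, and they must be identified as CGI scripts so that the web server does not serve the plain source code as a web page. There are two typical ways to do this:

Put the script in a subdirectory called cgi-bin.
Give the script file the .cgi extension.

If you use Apache, you need to turn on the ExecCGI option for the directory.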

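Step 2: Add the pound bang line

Once the script is in the right place, add a pound bang (#!) line at the very beginning of the script. Without it, the web server does not know how to execute the script.

(Scripts can also be written in other languages, such as Perl or Ruby.) For Python, add this at the start of the script: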

#!/usr/bin/env python

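Note that it must be the very first line, with no blank lines before it. If it does not work, look up the exact location of the Python executable and use it in full: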

#!/usr/bin/python

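If it still does not work, make sure the line ends with \n rather than \r\n, and that the file is saved as a UNIX-style text file.

On Windows, use the full path to the Python executable: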

#!C:\Python22\python.exe
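
Both problems with the first line (a missing #! line and Windows-style line endings) are invisible in most editors, so a small check can help. The following Python 3 sketch is only an illustration. The helper name check_shebang and the temporary demo file are hypothetical.

```python
import os
import tempfile

def check_shebang(path):
    """Return a list of problems found in the first line of a script."""
    with open(path, "rb") as handle:  # binary mode keeps \r\n visible
        first_line = handle.readline()
    problems = []
    if not first_line.startswith(b"#!"):
        problems.append("first line is not a #! line")
    if first_line.endswith(b"\r\n"):
        problems.append("line ends with \\r\\n instead of \\n")
    return problems

# Demo: a script saved with Windows line endings.
with tempfile.NamedTemporaryFile("wb", suffix=".cgi", delete=False) as tmp:
    tmp.write(b"#!/usr/bin/env python\r\nprint('hi')\r\n")

problems = check_shebang(tmp.name)
os.remove(tmp.name)
print(problems)
```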

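Step 3: Set the file permissions

Make sure that everyone can read and execute the script file, and that only you can write to it.

Sometimes, if you edit the script on Windows while it is stored on a UNIX disk server (accessed through Samba or FTP, for example), the file permissions can get scrambled after you change the file. So if the script will not run, check that the file permissions are still correct.

The UNIX command for changing file permissions (or the file mode) is chmod. Simply run the following command: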

chmod 755 somescript.cgi

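Usage: chmod XXX filename

XXX stands for three digits, one each for the owner, the group, and other users. Each digit is the sum of:

4  read permission
2  write permission
1  execute permission

Commonly used permission commands:

sudo chmod 600 filename  (only the owner can read and write)
sudo chmod 644 filename  (the owner can read and write; group and other users can only read)
sudo chmod 700 filename  (only the owner can read, write, and execute)
sudo chmod 666 filename  (everyone can read and write)
sudo chmod 777 filename  (everyone can read, write, and execute)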
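
The digit arithmetic above can be sketched in a few lines of Python 3. The function name describe_mode is hypothetical. It simply splits a mode such as 755 into its three digits and tests the 4, 2, and 1 bits of each.

```python
def describe_mode(mode):
    """Return the rwx string for a three-digit octal mode such as 0o755."""
    parts = []
    for shift in (6, 3, 0):  # owner, group, other users
        digit = (mode >> shift) & 0o7
        parts.append(
            ("r" if digit & 4 else "-")
            + ("w" if digit & 2 else "-")
            + ("x" if digit & 1 else "-")
        )
    return "".join(parts)

print(describe_mode(0o755))  # 7 = 4+2+1, 5 = 4+1, 5 = 4+1
print(describe_mode(0o644))  # 6 = 4+2, 4, 4
```

So 755 means the owner can read, write, and execute, while everyone else can read and execute, which is what Step 3 asks for.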

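For more Linux background, see

http://linux.ximizi.com/linux/4/linux30939.htm

If it is still unclear how to set up the server, the following articles may help:

http://koda.iteye.com/blog/556393

http://8796902.blog.51cto.com/8786902/1560549

http://www.111cn.net/sys/Windows/63254.htm

Once the server is set up, the script can be tested directly at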

http://localhost/cgi-bin/test.py

test.py

#!D:\Python27\python.exe

print 'Content-type: text/html'
print # Prints an empty line, to end the headers

print 'Hello, world2222!'

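Note that the line print 'Content-type: text/html' must be followed by a blank line before the body of the page, so the header section ends with two consecutive newlines. The empty print in the example above is therefore required; without it the server reports an error.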

http://soige.blog.51cto.com/512568/325409
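
The layout of what a CGI script writes to standard output can be sketched as follows (Python 3). The function name build_response is hypothetical. The only point is that the header section and the body are separated by one empty line.

```python
def build_response(body, content_type="text/html"):
    """Return the text a CGI script prints: headers, an empty line, then the body."""
    headers = "Content-type: %s\n" % content_type
    return headers + "\n" + body

response = build_response("Hello, world!")

# Splitting on the first empty line recovers the two parts.
head, _, page = response.partition("\n\n")
print(repr(response))
```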

Debugging with cgitb

#!D:\Python27\python.exe

import cgitb

cgitb.enable()

print 'Content-type: text/html'

print

print 1/0

print 'Hello,Python'

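Note that cgitb should be turned off once development is finished, because the traceback page is not meant for ordinary users of the program.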

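Using the cgi module

The key-value pairs, or fields, that an HTML form supplies to a CGI script are retrieved with the FieldStorage class. When you create a FieldStorage instance (you should create only one), it fetches the input variables from the request and presents them to your program through a dictionary-like interface.

If you know the request contains a value named name, you should not simply do this: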

form = cgi.FieldStorage()
name = form['name']

Instead, do this:

form = cgi.FieldStorage()
name = form['name'].value

Or this, which also supplies a default value:

form.getvalue('name', 'Unknown')

Example

#!D:\Python27\python.exe

import cgi, cgitb
cgitb.enable()

form = cgi.FieldStorage()
name = form.getvalue('name', 'Python')

print 'Content-type: text/html'
print

print 'Hello,%s!' % name

Example call: http://localhost/cgi-bin/test.py?name=Java&age=12

A query string like this can be built with urllib.urlencode:

>>> import urllib
>>> urllib.urlencode({'name':'c++','age':'23'}
... )
'age=23&name=c%2B%2B'
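
The cgi module belongs to the Python 2 era of this post and was removed from the standard library in Python 3.13. As a rough analogue of form.getvalue with a default, here is a Python 3 sketch built on urllib.parse. The helper name get_value is hypothetical, and this is not how FieldStorage is implemented.

```python
from urllib.parse import parse_qs, urlencode

def get_value(query_string, key, default=None):
    """Return the first value for key in the query string, or default."""
    values = parse_qs(query_string).get(key)
    return values[0] if values else default

# Encoding escapes characters such as "+" so they survive the round trip.
query = urlencode({"name": "c++", "age": "23"})

name = get_value(query, "name", "Unknown")
missing = get_value(query, "language", "Unknown")
print(query)
print(name, missing)
```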

A greeting script with an HTML form

#!D:\Python27\python.exe

import cgi, cgitb
cgitb.enable()

form = cgi.FieldStorage()
name = form.getvalue('name', 'Python')

print 'Content-type: text/html'
print

print '''<html>
<head>
<title>Greeting Page</title>
</head>
<body>
<h1>Hello,%s!</h1>

<form action='test.py'>
Change name <input type='text' name='name' />
<input type='submit'/>
</form>
</body>
</html>
''' % name

Here test.py could also be test.cgi.
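
One caveat with the script above: the name comes straight from the request and is placed into the page unchanged, so a value containing HTML markup would be interpreted by the browser. A minimal Python 3 sketch of escaping the value first is shown below. The function name greeting_html is hypothetical, and in Python 2 the comparable helper is cgi.escape.

```python
from html import escape

def greeting_html(name):
    """Return the greeting heading with the user-supplied name escaped."""
    return "<h1>Hello, %s!</h1>" % escape(name)

safe = greeting_html("<script>alert(1)</script>")
plain = greeting_html("Python")
print(safe)
```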

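mod_python

mod_python is an extension for the Apache web server that makes the Python interpreter part of Apache itself. It lets you write Apache handlers in Python rather than in C, which is the norm. The mod_python handler framework gives access to a rich API that reaches deep into Apache internals.

mod_python includes the following standard handlers:

The CGI handler, which runs CGI scripts with the mod_python interpreter and speeds up execution considerably.
The PSP handler, which lets you create executable web pages, or Python Server Pages, by mixing HTML and Python code.
The publisher handler, which lets you call Python functions through URLs.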

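CGI handler

The CGI handler emulates the environment that a program runs in under CGI. The program is run by mod_python, but you can still write it as a CGI script, using the cgi and cgitb modules.

The main reason to use the CGI handler instead of plain CGI is performance. According to a simple test in the mod_python documentation, it can improve performance by at least an order of magnitude. The publisher handler is faster still, and a handler you write yourself is faster again, possibly up to three times as fast as the CGI handler.

To use the CGI handler, put the following in an .htaccess file in the directory that holds your CGI scripts: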

SetHandler mod_python
PythonHandler mod_python.cgihandler

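Make sure there are no conflicting definitions in Apache's global configuration, because the .htaccess file will not override them.

To run a CGI script, the script file must end in .py, even though you access it with a URL ending in .cgi. mod_python converts .cgi to .py when it looks for the file that satisfies the request.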

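PSP (Python Server Pages)

PSP is similar to PHP and ASP. A PSP document is a mix of HTML and Python code, with the Python code enclosed in special-purpose tags. Any HTML is converted into calls to an output function.

To set up Apache to serve PSP pages, put the following in an .htaccess file:

AddHandler mod_python .psp
PythonHandler mod_python.psp

The server will then treat files with the .psp extension as PSP files.

While developing PSP pages, you can use the PythonDebug On directive, so that any error in a PSP page produces an exception traceback that includes the source code. Turn it off once development is done: letting users see your code through error messages is unlikely to help them and may be a security risk.

PSP tags come in two kinds: one for statements and one for expressions. The value of the expression inside an expression tag is placed directly in the output document.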

<%
from random import choice
adjectives = ['beautiful', 'cruel']
%>
<html>
    <head>
        <title>Hello</title>
    </head>
    <body>
        <p>Hello,<%=choice(adjectives)%> world.My name is Mr. Gumby</p>
    </body>
</html>

Plain output, statements, and expressions can be mixed freely. Comments can be written <%-- like this --%>.
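
To give a feel for what an expression tag does, here is a toy Python 3 sketch that replaces each <%= ... %> with the value of the enclosed expression. It is only an illustration under stated assumptions: the function name render and the template are made up, statement tags are not handled, and this is not how mod_python implements PSP. Since it relies on eval, it should never be fed untrusted templates.

```python
import re

def render(template, namespace):
    """Replace each <%= expr %> with the value of expr evaluated in namespace."""
    def substitute(match):
        # Evaluate the expression with the given names available to it.
        return str(eval(match.group(1), {}, namespace))
    return re.sub(r"<%=\s*(.*?)\s*%>", substitute, template)

page = render(
    "<p>Hello, <%= adjective.upper() %> world.</p>",
    {"adjective": "cruel"},
)
print(page)
```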

Web application frameworks

Albatross, CherryPy, Django, Plone, Pylons, Quixote, Spyce, TurboGears, web.py, Webware, Zope

Web services

XML-RPC, SOAP

posted @ 2016-08-02 10:08 笨重的石头
