Python Practice: Google’s Python Class

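First, an introduction to regular expressions:

1) Python provides the re module for regular expression support, so the first step is import re

2) Several commonly used methods: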

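match = re.search(pat, str)
Notes: 1. match is a match object; use match.group() to get the matched text. If nothing matches, search returns None
       2. search scans str from the beginning and stops at the first match
       3. The whole pattern must be matched, but not the whole string
       4. It first finds the leftmost position where the pattern matches, then extends the match as far to the right as possible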
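
The four notes on re.search can be checked with a short sketch. The sample strings and patterns here are made up for illustration and are not from the class material.

```python
import re

text = 'an example word:cat!!'  # hypothetical sample text

# search() scans from the start and stops at the first match.
match = re.search(r'word:\w\w\w', text)
if match:
    found = match.group()  # the matched text
else:
    found = None

# No match: search() returns None, so test the result before calling group().
missing = re.search(r'word:\d+', text)

# Leftmost first, then as far right as possible: the first run of 'p'
# wins even though a longer run comes later, and + takes all of that run.
greedy = re.search(r'p+', 'pp pppp').group()

print(found, missing, greedy)
```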
 
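list = re.findall(pat, str) finds all matches and returns them as a list
Notes: 1. f.read() can be used to hand the entire text of a file to findall
       2. When the pattern contains two or more () groups, the result is a list of tuples; with a single group it is a list of that group's strings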
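
A small sketch of how groups change what findall returns. The sample text and the e-mail pattern are assumptions chosen for illustration.

```python
import re

# Hypothetical sample text containing two e-mail addresses.
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# No groups: a list of the full matches.
emails = re.findall(r'[\w.-]+@[\w.-]+', text)

# Two groups: a list of (user, host) tuples.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)

# One group: a list of strings, one per match, holding only that group.
users = re.findall(r'([\w.-]+)@[\w.-]+', text)

print(emails)
print(pairs)
print(users)
```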
 
re.sub(pat, replacement, str) finds all matches and replaces them; the replacement string can include \1, \2 to refer to the text of group(1), group(2)
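
A minimal sketch of backreferences in the replacement string. The sample text and the replacement host name are illustrative assumptions.

```python
import re

# Hypothetical sample text containing two e-mail addresses.
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# \1 re-inserts group 1 (the user name); the host part is replaced.
result = re.sub(r'([\w.-]+)@([\w.-]+)', r'\1@yo-yo-dyne.com', text)

# \2 and \1 together swap the two captured parts.
swapped = re.sub(r'(\w+)@(\w+)', r'\2@\1', 'alice@google bob@abc')

print(result)
print(swapped)
```

The replacement is written as a raw string so that \1 reaches re.sub as a backreference instead of being read as an escape by Python.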
 
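3) Basic patterns
Ordinary characters match themselves exactly; metacharacters have special meanings: . ^ $ * + ? { [ ] \ | ( )
. matches any single character except \n
\w matches a single word character [a-zA-Z0-9_]
\W matches any single non-word character
\b the boundary between a word character and a non-word character
\s matches a single whitespace character [ \n\r\t\f]
\S matches a single non-whitespace character
\t, \n, \r   tab, newline, carriage return
\d a decimal digit
^ start of the string, $ end of the string
\ escapes the character that follows, e.g. \. matches a literal dot
[] denotes a character set; note that inside it . is just a literal dot, and [^...] negates the set
() groups; the group feature lets you extract parts of the matched text
Repetition:
+ one or more occurrences
* zero or more occurrences
? zero or one occurrence; adding ? after a repetition operator (*? or +?) turns off greedy matching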
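
A few of these patterns in action, as a sketch. All sample strings are made up for illustration.

```python
import re

# \d and \s each match one character of their class; * repeats the \s.
digits = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx').group()

# ^ anchors the match at the start of the string.
anchored = re.search(r'^b\w+', 'foobar')           # no match at the start
unanchored = re.search(r'b\w+', 'foobar').group()

# Inside [] a dot is a literal dot, and a leading ^ negates the set.
dots = re.findall(r'[.]', 'a.b')
not_abc = re.findall(r'[^abc]', 'abcd.')

# .* is greedy and runs to the last '>'; .*? stops at the first one.
html = '<b>foo</b> and <i>so on</i>'
greedy = re.search(r'<.*>', html).group()
lazy = re.search(r'<.*?>', html).group()

print(digits, anchored, unanchored, dots, not_abc)
print(greedy)
print(lazy)
```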

 

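BUG Fixed:

WIN7+MINGW:

Python 2's commands.getstatusoutput() wraps cmd in '{ ' and '; }', which the Windows shell cannot interpret, so the function has to be corrected to skip the wrapping on win32: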

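import os
import sys


def getstatusoutput(cmd):
    """Return (status, output) of executing cmd in a shell."""
    mswindows = (sys.platform == "win32")

    # The '{ ...; }' grouping is POSIX shell syntax, so skip it on Windows.
    if not mswindows:
        cmd = '{ ' + cmd + '; }'

    pipe = os.popen(cmd + ' 2>&1', 'r')
    text = pipe.read()
    sts = pipe.close()
    # close() returns None when the command exits successfully.
    if sts is None:
        sts = 0
    # Drop a single trailing newline from the output.
    if text[-1:] == '\n':
        text = text[:-1]
    return sts, text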
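
The commands module no longer exists in Python 3; the standard library's subprocess.getstatusoutput() plays the same role there. The following is a minimal sketch, assuming Python 3 and a shell that has an echo command. It uses the standard-library function, not the patched function above.

```python
import subprocess

# Runs the command through the shell and returns (exit status, output);
# one trailing newline is stripped from the output.
status, output = subprocess.getstatusoutput('echo hello')
print(status, output)
```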

 

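Google’s Class covers the basics: strings, lists, sorting, dicts and files, regular expressions, and some utility modules.

The exercises cover strings and lists; regular expressions and files; and the utility modules. Reference solutions are provided.

The last exercise is especially worthwhile: it extracts image URLs from a file, downloads the images, and generates an HTML file. With small changes it can be used to follow a website's content, so it is good practice for beginners.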

 

Here is the code (it scrapes a specific section of the Sina photo page):

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Copyright 2010 Google Inc.
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0

# Google's Python Class
# http://code.google.com/edu/languages/google-python-class/

import os
import re
import sys
import urllib.request


"""Logpuzzle exercise
Given an apache logfile, find the puzzle urls and download the images.

Here's what a puzzle url looks like:
10.254.254.28 - - [06/Aug/2007:00:13:48 -0700] "GET /~foo/puzzle-bar-aaab.jpg HTTP/1.0" 302 528 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6"
"""
def grep_file(url):
  """Downloads the page at url and saves it as test.html."""
  filename = 'test.html'
  abspath = os.path.abspath(filename)
  urllib.request.urlretrieve(url, abspath)

def read_urls(filename):
  """Returns the image urls found between the '<!--写真 start-->' and
  '<!--写真 end-->' markers of the saved page."""
  # +++your code here+++
  blocks = []
  piclist = []
  # Solution to the original logpuzzle exercise, disabled:
  # firstpart=re.search(r'(.*)_(.*)',filename)
  # if firstpart:
  #   first=firstpart.group(2)

  try:
    # The saved page is decoded as GBK.
    with open(filename, encoding='gbk') as f:
      # for line in f:
      #   urline=re.search(r'GET\s(.*\.jpg)\sHTTP/1.0',line)
      #   if urline:
      #     urlpart=urline.group(1)
      #     str='http://'+first+urlpart
      #     if str not in url:
      #       url.append(str)
      blocks = re.findall(r'<!--写真 start-->([\w\W]*?)<!--写真 end-->', f.read())
  except IOError as e:
    sys.stderr.write("I/O error({0}): {1}\n".format(e.errno, e.strerror))
  # Sort key from the original logpuzzle exercise, disabled:
  # def MyFn(name):
  #   base=os.path.basename(name)
  #   set=re.findall(r'(.*?)[-.]',base)
  #   if set:
  #     #print set[0],set[1],set[2]
  #     return set[2]
  #   else:
  #     return base
  # url=sorted(url,key=MyFn)
  # Collect the image urls from every marked block, not only the last one.
  for block in blocks:
    piclist.extend(re.findall(r'<img src="(.*?)"', block))
  return piclist


def download_images(img_urls, dest_dir):
  """Given the urls already in the correct order, downloads
  each image into the given directory.
  Gives the images local filenames img0, img1, and so on.
  Creates an index.html in the directory
  with an img tag to show each local image file.
  Creates the directory if necessary.
  """
  # +++your code here+++
  abspath = os.path.abspath(dest_dir)
  if not os.path.exists(abspath):
    os.mkdir(abspath)

  count = 0
  for img_url in img_urls:
    # os.path.join picks the right path separator on every platform.
    fn = os.path.join(abspath, 'img' + str(count))
    print('Retrieving...' + fn)
    urllib.request.urlretrieve(img_url, fn)
    count += 1

  # Create index.html with one img tag per downloaded file.
  toshow = ''
  for i in range(count):
    toshow += '<img src="img' + str(i) + '">'
  htmlpath = os.path.join(abspath, 'index.html')
  with open(htmlpath, 'w') as f:
    f.write(toshow)


def main():
  args = sys.argv[1:]

  if len(args) > 0 and args[0] == '-h':
    print('usage: [--todir dir]')
    sys.exit(1)

  todir = ''
  if len(args) > 1 and args[0] == '--todir':
    todir = args[1]
    del args[0:2]

  url = 'http://ent.sina.com.cn/photo/'
  grep_file(url)

  img_urls = read_urls('test.html')

  if todir:
    download_images(img_urls, todir)
  else:
    print('\n'.join(img_urls))

if __name__ == '__main__':
  main()
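
The extraction step can be tried without any network access. The sketch below applies the same two regular expressions to a made-up page fragment; the example.com URLs and the fragment itself are assumptions for illustration, not real Sina markup.

```python
import re

# Hypothetical page fragment: one image outside the marked section,
# two images inside it.
page = (
    '<img src="http://example.com/banner.jpg">'
    '<!--写真 start-->'
    '<a href="#"><img src="http://example.com/a.jpg"></a>\n'
    '<img src="http://example.com/b.jpg">'
    '<!--写真 end-->'
)

# Step 1: cut out the text between the start and end markers.
# [\w\W] matches any character including newlines; *? keeps it non-greedy.
blocks = re.findall(r'<!--写真 start-->([\w\W]*?)<!--写真 end-->', page)

# Step 2: pull every img src out of each block.
img_urls = []
for block in blocks:
    img_urls.extend(re.findall(r'<img src="(.*?)"', block))

# The index page refers to the local names img0, img1, ...
index_html = ''.join('<img src="img%d">' % i for i in range(len(img_urls)))

print(img_urls)
print(index_html)
```

Only the images inside the marked section are returned, which is what limits the scrape to one part of the page.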
posted @ 2012-05-27 22:06  Orcus