Visualize This (《鲜活的数据:数据可视化指南》), Chapter 2 "Collecting Data": Python 3.3 source code

 Download of all source code and data files: for reference only

The Python code in Visualize This (《鲜活的数据:数据可视化指南》) is written for Python 2.x. While reading the book I converted part of the source code to Python 3.3.

  • Environment: Windows 7, Python 3.3

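(1) get-weather-data.py: scrape weather information from web pages (a small crawler of sorts)

  • Notes: the main changes are the replacement of urllib2 (now urllib.request), the timestamp format, the location the data is collected for, and the position of the data in the page.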
# -*- coding: utf-8 -*-
__author__ = 'hillfree'

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Create/open a file called wunder-data.txt (a comma-delimited file)
f = open('wunder-data.txt', 'a')

# Iterate through months and days
for month in range(1, 13):
    for day in range(1, 32):

        # Check if already gone through month
        # (2012 is a leap year, so February has 29 days)
        if (month == 2 and day > 29):
            break
        elif (month in [4, 6, 9, 11] and day > 30):
            break

        # Open wunderground.com url
        url = "http://www.wunderground.com/history/airport/ZBAA/2012/{0}/{1}/DailyHistory.html".format(month, day)
        page = urlopen(url)

        # Get temperature from page (name the parser explicitly)
        soup = BeautifulSoup(page, "html.parser")
        # Get the maximum temperature
        max_temp = soup.find_all(attrs={"class": "nobr"})[3].span.string

        # Build day record with timestamp
        record = "2012{0:02d}{1:02d}, {2}\n".format(month, day, max_temp)
        print(record)
        # Write timestamp and temperature to file
        f.write(record)

# Done getting data! Close file.
f.close()
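
One way to avoid hand-written month-length checks altogether might be to let the datetime module walk through the calendar. The following is only a sketch: dates_in_year and build_url are hypothetical helper names that are not part of the original script, and the URL pattern is simply copied from the script above. Nothing is downloaded here; the sketch only builds the dates and URLs.

```python
from datetime import date, timedelta

# URL pattern copied from the script above; the helpers below are hypothetical
URL_TEMPLATE = ("http://www.wunderground.com/history/airport/ZBAA/"
                "{0}/{1}/{2}/DailyHistory.html")


def dates_in_year(year):
    """Yield every calendar date of the given year, leap day included."""
    current = date(year, 1, 1)
    while current.year == year:
        yield current
        current += timedelta(days=1)


def build_url(d):
    """Build the history URL for one date."""
    return URL_TEMPLATE.format(d.year, d.month, d.day)


days_2012 = list(dates_in_year(2012))
print(len(days_2012))             # number of days the loop would visit
print(build_url(days_2012[59]))   # URL for the 60th day of the year
```

Because the dates come from the standard library, February 29 shows up in leap years without any special case in the loop.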

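(2) get-weather-data-full.py: scrape weather information from web pages, extended version

  • Notes: builds on get-weather-data.py by adding a loop over years, a leap-year check, and so on.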
# -*- coding: utf-8 -*-
__author__ = 'hillfree'


from urllib.request import urlopen
from bs4 import BeautifulSoup

# Create/open a file called wunder-data.txt (a comma-delimited file)
f = open('wunder-data.txt', 'a')

# Iterate through years, months and days
for year in range(2013, 2014):  # widen this range to collect more years

    # Check if leap year (once per year, not once per day)
    if year % 400 == 0:
        leap = True
    elif year % 100 == 0:
        leap = False
    elif year % 4 == 0:
        leap = True
    else:
        leap = False

    for month in range(1, 13):
        for day in range(1, 32):

            # Check if already gone through month
            if (month == 2 and day > (29 if leap else 28)):
                break
            elif (month in [4, 6, 9, 11] and day > 30):
                break

            # Open wunderground.com url
            url = "http://www.wunderground.com/history/airport/ZBAA/{0}/{1}/{2}/DailyHistory.html".format(year, month, day)
            page = urlopen(url)

            # Get temperature from page (name the parser explicitly)
            soup = BeautifulSoup(page, "html.parser")
            # Get the maximum temperature
            max_temp = soup.find_all(attrs={"class": "nobr"})[3].span.string

            # Build day record with timestamp
            record = "{0:04d}{1:02d}{2:02d}, {3}\n".format(year, month, day, max_temp)
            print(record)
            # Write timestamp and temperature to file
            f.write(record)

# Done getting data! Close file.
f.close()
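
The leap-year rule used above can be checked against the standard library. This is a small sketch, assuming only the calendar module; is_leap and days_in_month are hypothetical names introduced for illustration, and is_leap follows the same 400/100/4 rule as the script.

```python
import calendar


def is_leap(year):
    """Gregorian leap-year rule, same order of checks as in the script above."""
    if year % 400 == 0:
        return True
    elif year % 100 == 0:
        return False
    return year % 4 == 0


def days_in_month(year, month):
    """Day count of a month; monthrange returns (first weekday, day count)."""
    return calendar.monthrange(year, month)[1]


# The hand-written rule agrees with calendar.isleap for these sample years
for y in (1900, 2000, 2012, 2013):
    assert is_leap(y) == calendar.isleap(y)

print(days_in_month(2012, 2))
print(days_in_month(2013, 2))
```

With days_in_month available, the inner loop could probably be written as range(1, days_in_month(year, month) + 1), which would make the end-of-month checks unnecessary.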

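(3) add-csv-flag.py: add a flag column to the CSV file content

  • Notes: reads the CSV file generated earlier, tests each record, and adds an is_freezing flag. The result is output with print.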
# -*- coding: utf-8 -*-
__author__ = 'hillfree'

import csv

# The with block closes the data file when the loop is done
with open('wunder-data.txt', 'r', newline='') as data_file:
    reader = csv.reader(data_file, delimiter=",")

    for row in reader:
        if int(row[1]) < 0:
            is_freezing = '1'
            print("{0}, {1}, {2}".format(row[0], row[1], is_freezing))  # list the freezing days
        else:
            is_freezing = '0'

        # print("{0}, {1}, {2}".format(row[0], row[1], is_freezing))  # could be written to a file
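
If the flagged records should go to a file instead of the console, csv.writer can do the formatting. A minimal sketch under a few assumptions: add_freezing_flag is a hypothetical helper, the two sample rows are made up, and an in-memory io.StringIO stands in for the real input and output files so the example runs on its own.

```python
import csv
import io


def add_freezing_flag(rows):
    """Yield [date, temperature, is_freezing] for each [date, temperature] row."""
    for row in rows:
        temperature = int(row[1])  # int() tolerates the leading space
        is_freezing = '1' if temperature < 0 else '0'
        yield [row[0], str(temperature), is_freezing]


# Made-up sample in the same "date, temperature" layout as wunder-data.txt
sample = io.StringIO("20130101, -3\n20130102, 2\n")
# Stand-in for a real output file opened with open(..., 'w', newline='')
result = io.StringIO()

writer = csv.writer(result, lineterminator="\n")
writer.writerows(add_freezing_flag(csv.reader(sample)))
print(result.getvalue())
```

Every row is written, not only the freezing days, so the output stays a complete CSV with one extra column.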

(4) csv-to-xml.py: convert the CSV file to XML format

  • Notes: in addition to the print output, the result is also written to the file "wunder-data.xml".
# -*- coding: utf-8 -*-

"""
source: <Visualize This> by Nathan Yau
name: csv-to-xml.py
python: v3.3
description: Convert CSV file to XML format
"""
__author__ = 'hillfree'

import csv

# Both files are closed automatically at the end of the with block
with open('wunder-data.txt', 'r', newline='') as data_file, \
        open("wunder-data.xml", 'w') as output:
    reader = csv.reader(data_file, delimiter=",")

    print('<weather_data>')
    output.write('<weather_data>\n')

    for row in reader:
        print('<observation>')
        print('<date>' + row[0] + '</date>')
        print('<max_temperature>' + row[1] + '</max_temperature>')
        print('</observation>')

        output.write('<observation>\n')
        output.write('<date>' + row[0] + '</date>\n')
        output.write('<max_temperature>' + row[1] + '</max_temperature>\n')
        output.write('</observation>\n')

    print('</weather_data>')
    output.write('</weather_data>\n')
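
Concatenating tags by hand works for this data, but it does no escaping. As an alternative sketch, the same structure could be built with xml.etree.ElementTree from the standard library. rows_to_xml is a hypothetical helper, the sample rows are made up, and the element names are the ones used in the script above.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_xml(rows):
    """Build a <weather_data> tree with one <observation> per CSV row."""
    root = ET.Element("weather_data")
    for row in rows:
        observation = ET.SubElement(root, "observation")
        ET.SubElement(observation, "date").text = row[0].strip()
        ET.SubElement(observation, "max_temperature").text = row[1].strip()
    return root


# Made-up sample in the same layout as wunder-data.txt
sample = io.StringIO("20130101, -3\n20130102, 2\n")
root = rows_to_xml(csv.reader(sample))

# encoding="unicode" makes tostring return a str instead of bytes
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

To write a file instead of a string, ET.ElementTree(root).write("wunder-data.xml", encoding="utf-8") should do the job. Note that the output is not indented.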

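(5) csv-to-json.py: convert the CSV file to JSON format

  • Notes: the original file reads the data with the csv module and prints the output. However, it relies on a fixed count of 365 records to detect the end of the file, which is too rigid. This version does not use the csv module. It reads the file with plain file operations and applies some simple formatting to the output.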
# -*- coding: utf-8 -*-

"""
source: <Visualize This> by Nathan Yau
name: csv-to-json.py
python: v3.3
description: Convert CSV file to JSON format, also write to file.
use plain file operations instead of the csv module
"""
__author__ = 'hillfree'

with open('wunder-data.txt', 'r') as data_file:
    lines = data_file.readlines()

with open("wunder-data.json", 'w') as output:
    output.write('{"observations": [\n')

    total = len(lines)  # renamed from max to avoid shadowing the built-in
    count = 0
    for line in lines:
        count += 1
        row = line.split(',')
        # strip() drops the leading space and the trailing newline of the field
        record = '\t{\n\t\t"date": "%s", \n\t\t"temperature": %s\n' % (row[0], row[1].strip())

        # The last record closes the array and the object instead of adding a comma
        if count < total:
            record += '\t},\n'
        else:
            record += '\t}]\n}'

        output.write(record)
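
The comma handling for the last record goes away if the json module does the serialization. A minimal sketch, assuming every temperature is an integer: lines_to_observations is a hypothetical helper and the two sample lines are made up.

```python
import json


def lines_to_observations(lines):
    """Turn "date, temperature" lines into a dict ready for json.dumps."""
    observations = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        date_text, temperature = line.split(',')
        observations.append({
            "date": date_text.strip(),
            "temperature": int(temperature),  # assumes integer temperatures
        })
    return {"observations": observations}


# Made-up sample in the same layout as wunder-data.txt
sample_lines = ["20130101, -3\n", "20130102, 2\n"]
json_text = json.dumps(lines_to_observations(sample_lines), indent=4)
print(json_text)
```

json.dump(data, output_file, indent=4) writes the same structure straight to an open file.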

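(6) xml-to-csv.py: convert the XML file to CSV format

  • Notes: the original file uses BeautifulStoneSoup from the BeautifulSoup module to process the XML. Under Python 3.3, however, the corresponding lxml could not be used (?). So I fell back on the xml.dom.minidom module that ships with Python 3.3. Using it for the first time felt awkward, especially the handling of child nodes, and it took quite a few checks to bring the date and the maximum temperature together. Suggestions for a better approach are welcome.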
# -*- coding: utf-8 -*-

"""
source: <Visualize This> by Nathan Yau
name: xml-to-csv.py
python: v3.3
description: Convert xml file to csv format and print the result.
lxml could not be used under Python 3.3, so this uses minidom,
which ships with Python. It feels awkward; how could it be improved?
"""
__author__ = 'hillfree'

from xml.dom.minidom import parse, Node

xml_file = parse("wunder-data.xml")

date = ""
max_temperature = ""
for node1 in xml_file.getElementsByTagName("observation"):
    for node2 in node1.childNodes:

        # Whitespace between tags shows up as text nodes without children
        for node3 in node2.childNodes:
            if node3.nodeType == Node.TEXT_NODE and node2.nodeName == "date":
                if node3.nodeValue != "":
                    date = node3.nodeValue
            if node3.nodeType == Node.TEXT_NODE and node2.nodeName == "max_temperature":
                if node3.nodeValue != "":
                    max_temperature = node3.nodeValue.lstrip()

    # Print once per observation, after both fields have been read
    print(date + "," + max_temperature)
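
On the question of a better approach: one option might be xml.etree.ElementTree, which also ships with Python 3.3 and lets each field be looked up by tag name. This is only a sketch. xml_to_csv_lines is a hypothetical helper, and SAMPLE_XML is a made-up stand-in with the same layout as the wunder-data.xml produced by csv-to-xml.py.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in with the same layout as wunder-data.xml
SAMPLE_XML = """<weather_data>
<observation>
<date>20130101</date>
<max_temperature> -3</max_temperature>
</observation>
<observation>
<date>20130102</date>
<max_temperature> 2</max_temperature>
</observation>
</weather_data>"""


def xml_to_csv_lines(root):
    """Return one "date,max_temperature" string per <observation> element."""
    lines = []
    for observation in root.iter("observation"):
        # findtext returns the text of the first matching child element
        date_text = observation.findtext("date", default="").strip()
        max_temperature = observation.findtext("max_temperature", default="").strip()
        lines.append(date_text + "," + max_temperature)
    return lines


root = ET.fromstring(SAMPLE_XML)
csv_lines = xml_to_csv_lines(root)
print("\n".join(csv_lines))
```

For the real file, ET.parse("wunder-data.xml").getroot() would replace ET.fromstring(SAMPLE_XML). No node-type checks are needed because findtext skips the whitespace between tags.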

posted @ 2013-03-18 14:14 by hgdfr