Visualize This (Chinese edition: 《鲜活的数据：数据可视化指南》), Chapter 2 "Collecting Data": Python 3.3 Source Code

Download of all source code and data files: for reference only.

The Python code in Visualize This is written for Python 2.x. While reading the book, I converted part of the source code to Python 3.3.

Environment: Windows 7, Python 3.3

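(1) get-weather-data.py: scrape weather data from web pages (a small crawler of sorts)

Notes: the main changes are the replacement of urllib2, the timestamp format, the location the data is collected for, and the position of the data within the page.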


# coding: utf-8
__author__ = 'hillfree'

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Create/open wunder-data.txt (a comma-delimited file) in append mode
with open('wunder-data.txt', 'a') as f:

    # Iterate through months and days
    for month in range(1, 13):
        for day in range(1, 32):

            # Stop once past the last day of the month (2012 is a leap year)
            if month == 2 and day > 29:
                break
            elif month in [4, 6, 9, 11] and day > 30:
                break

            # Open the wunderground.com URL for airport code ZBAA
            url = "http://www.wunderground.com/history/airport/ZBAA/2012/{0}/{1}/DailyHistory.html".format(month, day)
            page = urlopen(url)

            # Parse the page with the standard-library HTML parser
            soup = BeautifulSoup(page, "html.parser")
            # Get the maximum temperature
            max_temp = soup.find_all(attrs={"class": "nobr"})[3].span.string

            # Build the day's record with a timestamp
            record = "2012{0:02d}{1:02d}, {2}\n".format(month, day, max_temp)
            print(record)
            # Write timestamp and temperature to file
            f.write(record)

# Done getting data! The with block closes the file.
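
The hard-coded month-length checks in the script can also be delegated to the standard library. The following is a minimal sketch rather than the author's method: `iter_days` and `build_record` are hypothetical helper names, and `calendar.monthrange` supplies the number of days in each month so that February needs no special case.

```python
import calendar


def iter_days(year):
    """Yield (month, day) for every calendar day of the given year."""
    for month in range(1, 13):
        # monthrange returns (weekday of the 1st, number of days in the month)
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            yield month, day


def build_record(year, month, day, max_temp):
    """Format one line of the comma-delimited file (timestamp, temperature)."""
    return "{0:04d}{1:02d}{2:02d}, {3}\n".format(year, month, day, max_temp)


if __name__ == "__main__":
    # Hypothetical temperature value, used only to show the record format
    print(build_record(2012, 2, 29, 5), end="")
    print(len(list(iter_days(2012))), "days in 2012")
```

With a generator like this, the scraping loop would only need one `for month, day in iter_days(2012):` line in place of the two nested loops and the `break` checks.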

(2) get-weather-data-full.py: scrape weather data from web pages, extended version

Notes: builds on get-weather-data.py by adding a loop over years and a leap-year check.


# coding: utf-8
__author__ = 'hillfree'

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Create/open wunder-data.txt (a comma-delimited file) in append mode
with open('wunder-data.txt', 'a') as f:

    # Iterate through years, months and days.
    # As written, the ranges cover only January 2013; widen them to fetch more.
    for year in range(2013, 2014):

        # Check if leap year (once per year is enough)
        if year % 400 == 0:
            leap = True
        elif year % 100 == 0:
            leap = False
        elif year % 4 == 0:
            leap = True
        else:
            leap = False

        for month in range(1, 2):
            for day in range(1, 32):

                # Stop once past the last day of the month
                if month == 2 and day > (29 if leap else 28):
                    break
                elif month in [4, 6, 9, 11] and day > 30:
                    break

                # Open the wunderground.com URL for airport code ZBAA
                url = "http://www.wunderground.com/history/airport/ZBAA/{0}/{1}/{2}/DailyHistory.html".format(year, month, day)
                page = urlopen(url)

                # Parse the page with the standard-library HTML parser
                soup = BeautifulSoup(page, "html.parser")
                # Get the maximum temperature
                max_temp = soup.find_all(attrs={"class": "nobr"})[3].span.string

                # Build the day's record with a timestamp
                record = "{0:04d}{1:02d}{2:02d}, {3}\n".format(year, month, day, max_temp)
                print(record)
                # Write timestamp and temperature to file
                f.write(record)

# Done getting data! The with block closes the file.
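
The leap-year rule used in this script can be checked against the standard library, and the year/month/day loops can be replaced by stepping through real dates. The sketch below is one possible way to do that, not the book's or the author's approach; `is_leap`, `daily_urls` and `URL_TEMPLATE` are hypothetical names, and the URL pattern is simply copied from the script.

```python
import calendar
from datetime import date, timedelta

# URL pattern copied from the script above
URL_TEMPLATE = ("http://www.wunderground.com/history/airport/ZBAA/"
                "{0}/{1}/{2}/DailyHistory.html")


def is_leap(year):
    """Gregorian leap-year rule, written as a single expression."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def daily_urls(start, end):
    """Yield (date, url) for each day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current, URL_TEMPLATE.format(current.year, current.month, current.day)
        current += timedelta(days=1)


if __name__ == "__main__":
    # The hand-written rule agrees with calendar.isleap for these years
    print(all(is_leap(y) == calendar.isleap(y) for y in range(1900, 2101)))
    for day, url in daily_urls(date(2012, 2, 28), date(2012, 3, 1)):
        print(day.strftime("%Y%m%d"), url)
```

Because `datetime.date` arithmetic already knows the calendar, month lengths and leap years need no explicit checks in the loop.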

(3) add-csv-flag.py: add a flag column to the CSV content

Notes: reads the CSV file generated earlier, tests each row, and adds an is_freezing flag. The output is produced with print.


# coding: utf-8
__author__ = 'hillfree'

import csv

with open('wunder-data.txt', 'r', newline='') as f:
    reader = csv.reader(f, delimiter=",")

    for row in reader:
        # row[0] is the date; row[1] is the maximum temperature
        temperature = row[1].strip()
        if int(temperature) < 0:
            is_freezing = '1'
            print("{0}, {1}, {2}".format(row[0], temperature, is_freezing))  # list freezing days
        else:
            is_freezing = '0'

        # print("{0}, {1}, {2}".format(row[0], temperature, is_freezing))  # could be written to a file instead
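
The commented-out line in the script hints that the flagged rows could be written to a file. A minimal sketch of that idea follows, assuming the two-column layout produced by the scraper and integer temperatures; `add_freezing_flag` is a hypothetical helper name, and in-memory `io.StringIO` objects stand in for real files.

```python
import csv
import io


def add_freezing_flag(src, dst):
    """Copy rows from src to dst, appending '1' if the temperature is below 0, else '0'."""
    # skipinitialspace drops the space that follows each comma in the input
    reader = csv.reader(src, delimiter=",", skipinitialspace=True)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        if not row:
            continue  # skip blank lines
        date, temperature = row[0], row[1]
        is_freezing = "1" if int(temperature) < 0 else "0"
        writer.writerow([date, temperature, is_freezing])


if __name__ == "__main__":
    # Made-up sample rows, used only to demonstrate the output format
    sample = io.StringIO("20120101, -3\n20120102, 4\n")
    result = io.StringIO()
    add_freezing_flag(sample, result)
    print(result.getvalue(), end="")
```

To work on real files, the two `StringIO` objects would be replaced by files opened with `newline=''`, as the csv module documentation recommends.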

(4) csv-to-xml.py: convert the CSV file to XML

Notes: in addition to the print output, the script now also writes the result to the file "wunder-data.xml".


# coding: utf-8

"""
source: <Visualize This> by Nathan Yau
name: csv-to-xml.py
python: v3.3
description: Convert CSV file to XML format
"""
__author__ = 'hillfree'

import csv


def emit(line, out):
    # Echo each line to the console and write it to the XML file
    print(line)
    out.write(line + '\n')


with open('wunder-data.txt', 'r', newline='') as src, \
        open('wunder-data.xml', 'w') as output:

    emit('<weather_data>', output)

    for row in csv.reader(src, delimiter=","):
        emit('<observation>', output)
        emit('<date>' + row[0] + '</date>', output)
        # strip() drops the space that follows the comma in the CSV
        emit('<max_temperature>' + row[1].strip() + '</max_temperature>', output)
        emit('</observation>', output)

    emit('</weather_data>', output)
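
Building XML by string concatenation works for this data, but it does not escape characters such as `&` or `<`. As a hedged alternative, the sketch below builds the same element structure with the standard-library `xml.etree.ElementTree` module; `csv_to_xml` is a hypothetical helper name, and the sample rows are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(src):
    """Return an XML string with one <observation> per CSV row in src."""
    root = ET.Element("weather_data")
    for row in csv.reader(src, delimiter=",", skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        observation = ET.SubElement(root, "observation")
        ET.SubElement(observation, "date").text = row[0]
        ET.SubElement(observation, "max_temperature").text = row[1]
    # encoding="unicode" makes tostring return a str instead of bytes
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = io.StringIO("20120101, -3\n20120102, 4\n")
    print(csv_to_xml(sample))
```

The result is a single line without indentation; it can be written to "wunder-data.xml" with an ordinary `write` call.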

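(5) csv-to-json.py: convert the CSV file to JSON

Notes: the original script reads the file with the csv module and prints the result, but it relies on a hard-coded count of 365 records to detect the end of the file, which is too rigid. This version does not use the csv module; it uses plain file operations and applies some simple formatting.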


# coding: utf-8

"""
source: <Visualize This> by Nathan Yau
name: csv-to-json.py
python: v3.3
description: Convert CSV file to JSON format and write it to a file.
Uses plain file operations instead of the csv module.
"""
__author__ = 'hillfree'

# Read all non-blank lines up front so the last record can be detected
with open('wunder-data.txt', 'r') as src:
    lines = [line for line in src if line.strip()]

total = len(lines)  # named total to avoid shadowing the built-in max()

with open('wunder-data.json', 'w') as output:
    output.write('{"observations": [\n')

    for count, line in enumerate(lines, start=1):
        row = line.split(',')
        record = '\t{\n\t\t"date": "%s", \n\t\t"temperature": %s\n' % (row[0], row[1].strip())

        # Every record except the last is followed by a comma
        if count < total:
            record += '\t},\n'
        else:
            record += '\t}]\n}'

        output.write(record)
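
Instead of assembling the JSON text by hand, the standard-library `json` module can serialize an ordinary Python structure, which removes the need to track the last record. This is a minimal sketch under the assumption that every temperature is an integer; `csv_to_json` is a hypothetical helper name.

```python
import io
import json


def csv_to_json(src):
    """Return a JSON string holding one observation per non-blank line of src."""
    observations = []
    for line in src:
        if not line.strip():
            continue  # skip blank lines
        date, temperature = line.split(",")
        # int() ignores surrounding whitespace, including the trailing newline
        observations.append({"date": date.strip(), "temperature": int(temperature)})
    return json.dumps({"observations": observations}, indent=2)


if __name__ == "__main__":
    # Made-up sample rows, used only to demonstrate the output
    sample = io.StringIO("20120101, -3\n20120102, 4\n")
    print(csv_to_json(sample))
```

An empty input file then produces a valid document with an empty "observations" list.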

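(6) xml-to-csv.py: convert the XML file to CSV

Notes: the original script uses BeautifulStoneSoup from the BeautifulSoup module to handle XML. Under Python 3.3, however, the corresponding lxml parser did not seem to be usable (?), so I fell back on the xml.dom.minidom module that ships with Python 3.3. Using it for the first time felt awkward, especially when working with child nodes, and it took quite a few checks to pair each date with its maximum temperature. Suggestions for a better approach are welcome.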


# coding: utf-8

"""
source: <Visualize This> by Nathan Yau
name: xml-to-csv.py
python: v3.3
description: Convert XML file to CSV format and print the result.
lxml could not be used under Python 3.3, so this uses minidom from the
standard library. It feels awkward to use; how could it be improved?
"""
__author__ = 'hillfree'

from xml.dom.minidom import parse, Node

xml_file = parse("wunder-data.xml")

for node1 in xml_file.getElementsByTagName("observation"):
    # Reset both fields for each observation
    date = ""
    max_temperature = ""
    for node2 in node1.childNodes:

        # Text lives in the child nodes of <date> and <max_temperature>
        for node3 in node2.childNodes:
            if node3.nodeType == Node.TEXT_NODE and node2.nodeName == "date":
                if node3.nodeValue != "":
                    date = node3.nodeValue
            if node3.nodeType == Node.TEXT_NODE and node2.nodeName == "max_temperature":
                if node3.nodeValue != "":
                    max_temperature = node3.nodeValue.lstrip()

    # Print once per observation, after both fields have been collected
    print(date + "," + max_temperature)
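
On the question of a less awkward approach: one possible alternative is `xml.etree.ElementTree`, which is also part of the standard library and lets each field be looked up by tag name instead of walking child nodes. The sketch below assumes the element names written by csv-to-xml.py; `xml_to_csv` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def xml_to_csv(xml_text):
    """Return one 'date,max_temperature' string per <observation> element."""
    root = ET.fromstring(xml_text)
    lines = []
    for observation in root.iter("observation"):
        # findtext returns the text of the first matching child element
        date = observation.findtext("date", default="").strip()
        max_temperature = observation.findtext("max_temperature", default="").strip()
        lines.append(date + "," + max_temperature)
    return lines


if __name__ == "__main__":
    # Made-up sample document in the layout produced by csv-to-xml.py
    sample = (
        "<weather_data>\n"
        "<observation>\n<date>20120101</date>\n"
        "<max_temperature>-3</max_temperature>\n</observation>\n"
        "</weather_data>\n"
    )
    for line in xml_to_csv(sample):
        print(line)
```

For a file on disk, `ET.parse("wunder-data.xml").getroot()` would take the place of `ET.fromstring`.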

Posted 2013-03-18 14:14 by hgdfr
