Scraping a novel site (Day 1)
URL:
http://www.aqtxt.com/xiazai/txt/21984.htm
The number in the URL can be swapped for another id [0:].
Code:
#coding:utf-8
import requests
import re
# import sys
# reload(sys)
# sys.setdefaultencoding('utf-8')  # Python 2 encoding workaround, see note 1 below

def url(n):
    # iterate instead of recursing: starting from 21984, recursion would
    # blow past Python's default recursion limit (about 1000 frames)
    while n >= 0:
        xiaoshuo_url = "http://www.aqtxt.com/xiazai/txt/" + str(n) + ".htm"
        html = requests.get(xiaoshuo_url)
        html.encoding = 'gb2312'  # the page is gb2312-encoded, see note 4 below
        name = re.findall('<h1>(.*?)<span id="author">', html.text, re.S)[0]
        xiaoshuo_name = name.encode('utf-8')
        print "Title: " + xiaoshuo_name
        author = re.findall('<span id="author">(.*?)</span>', html.text, re.S)[0]
        xiaoshuo_author = author.encode('utf-8')
        print xiaoshuo_author
        xiaoshuo_down_urls = re.findall('<li class="bd"><a href="(.*?)" title="', html.text, re.S)[0]
        print xiaoshuo_down_urls
        con_list = [xiaoshuo_name, xiaoshuo_author, xiaoshuo_down_urls]
        f = open('xiaoshuo.txt', 'a')  # 'file' shadows a builtin, so use 'f'
        for item in con_list:
            f.write(item)
            f.write('\n')
        f.close()
        n -= 1

url(21984)  # start id; any valid page id works
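The regex extraction in the script above can be exercised offline. The HTML fragment below is invented to mimic the assumed structure of an aqtxt.com page (the real markup may differ), written as a Python 3 sketch:

```python
import re

# Invented HTML fragment mimicking the assumed page structure
sample = ('<h1>SomeNovel<span id="author">kingrain</span></h1>'
          '<li class="bd"><a href="/down/21983.txt" title="dl">TXT</a></li>')

# Same three patterns as the scraper; [0] takes the first match
name = re.findall('<h1>(.*?)<span id="author">', sample, re.S)[0]
author = re.findall('<span id="author">(.*?)</span>', sample, re.S)[0]
down_url = re.findall('<li class="bd"><a href="(.*?)" title="', sample, re.S)[0]
print(name, author, down_url)
```

Testing the patterns against a saved copy of the page like this avoids hammering the site while debugging.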
1. Encoding problems
http://againinput4.blog.163.com/blog/static/1727994912011111011432810/
2. Regular expressions
Read the page source carefully; it is easiest to inspect the markup in the browser's "Edit as HTML" view.
import re
author = re.findall('<span id="author">(.*?)</span>', html.text, re.S)[0]
re.findall returns a list, so the result has to be indexed with [i] to get a string.
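To make the list point concrete, here is a Python 3 snippet (the sample string is made up) showing what re.findall actually returns:

```python
import re

html = '<span id="author">kingrain</span> <span id="author">guest</span>'
matches = re.findall('<span id="author">(.*?)</span>', html, re.S)
print(matches)     # a list of every captured group
print(matches[0])  # indexing gives a plain string, hence the [0] in the scraper
```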
3. Create the file, write the text, and separate entries with spaces
con_list = []
con_list.extend([xiaoshuo_name, xiaoshuo_author, xiaoshuo_down_urls])
f = open('xiaoshuo.txt', 'a')  # 'file' shadows a builtin name, so use 'f'
for item in con_list:
    f.write(item + " ")
f.write('\n')
f.close()
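In Python 3 the same write-out reads a little safer with a with block, which closes the file even if an error occurs part-way (the values below are placeholders):

```python
con_list = ['SomeNovel', 'kingrain', '/down/21983.txt']  # placeholder values

# append one space-separated record per line; the file is closed automatically
with open('xiaoshuo.txt', 'a', encoding='utf-8') as f:
    f.write(' '.join(con_list) + '\n')
```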
4. Transcoding
The page is encoded as gb2312.
Printing Chinese text goes wrong otherwise — e.g. the author string from the regex in note 2 comes out as mojibake unless it is re-encoded:
xiaoshuo_author = author.encode('utf-8')
After this transcoding step the Chinese prints correctly.
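In Python 3 the bytes/str split makes the same issue easier to see: gb2312 bytes are decoded to a str once, and re-encoded only when writing out. A minimal round-trip with an arbitrary sample string:

```python
# bytes as they would arrive from a gb2312-encoded page
raw = '名称:三体'.encode('gb2312')

text = raw.decode('gb2312')   # decode to a Python str first
out = text.encode('utf-8')    # re-encode only for the utf-8 output file
print(text)
```

With requests, setting response.encoding = 'gb2312' before reading response.text performs the decode step for you.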
A word from kingrain to the readers: even when everything isn't going smoothly, your efforts will pay off!
