Scraping Novel Files (Day 1)

URL:

http://www.aqtxt.com/xiazai/txt/21984.htm

 

The number in the URL can be swapped for any value from 0 upward.

Code:

# coding: utf-8
import requests
import re
# reload(sys); sys.setdefaultencoding('utf-8')  # Python 2 encoding workaround, see note 1
# from bs4 import BeautifulSoup

def url(n):
    # Walk from n down to 0 with a loop; the original recursive version
    # (return url(n-1)) exceeds Python's recursion limit for large n.
    while n >= 0:
        xiaoshuo_url = "http://www.aqtxt.com/xiazai/txt/" + str(n) + ".htm"
        html = requests.get(xiaoshuo_url)
        html.encoding = 'gb2312'  # the site serves gb2312, see note 4

        # re.findall returns a list, so take element [0]
        name = re.findall('<h1>(.*?)<span id="author">', html.text, re.S)[0]
        xiaoshuo_name = name.encode('utf-8')
        print "Title: " + xiaoshuo_name

        author = re.findall('<span id="author">(.*?)</span>', html.text, re.S)[0]
        xiaoshuo_author = author.encode('utf-8')
        print xiaoshuo_author

        xiaoshuo_down_urls = re.findall('<li class="bd"><a href="(.*?)" title="', html.text, re.S)[0]
        print xiaoshuo_down_urls

        # Append title, author and download link as one record
        f = open('xiaoshuo.txt', 'a')
        for item in [xiaoshuo_name, xiaoshuo_author, xiaoshuo_down_urls]:
            f.write(item)
        f.write('\n')
        f.close()

        n -= 1

url(21984)  # any starting value works
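A side note on why descending recursively through so many pages is risky: each nested call adds one stack frame, and CPython caps the call stack (sys.getrecursionlimit(), typically 1000), so tens of thousands of nested calls blow the limit before reaching page 0. A minimal sketch (Python 3, where the failure surfaces as RecursionError):

```python
import sys

def countdown(n):
    # Mirrors the crawler's tail: one nested call per page
    if n < 0:
        return 0
    return countdown(n - 1)

print(sys.getrecursionlimit())  # default cap, typically 1000
try:
    countdown(21984)
except RecursionError:
    print("recursion limit exceeded")
```

Rewriting the descent as a while or for loop removes the stack depth entirely.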

  

1. Encoding issues

http://againinput4.blog.163.com/blog/static/1727994912011111011432810/

2. Regular expressions

Read the page source carefully; it is best inspected with the browser's "Edit as HTML" view.

import re
author = re.findall('<span id="author">(.*?)</span>', html.text, re.S)[0]

re.findall returns a list, so you have to index into it with [i] (here [0]).
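To see the list-vs-string point in isolation, here is a self-contained sketch against a made-up HTML fragment (not the real page source):

```python
import re

# A fabricated fragment that mimics the page's structure
html_text = '<h1>BookTitle<span id="author">SomeAuthor</span></h1>'

matches = re.findall('<span id="author">(.*?)</span>', html_text, re.S)
print(matches)     # re.findall gives a list: ['SomeAuthor']
print(matches[0])  # indexing with [0] yields the string itself
```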

3. Creating a file, writing the text, and adding spacing

con_list = []
con_list.extend([xiaoshuo_name, xiaoshuo_author, xiaoshuo_down_urls])
f = open('xiaoshuo.txt', 'a')   # f avoids shadowing the built-in name file
for item in con_list:
    f.write(item + "     ")
f.write('\n')
f.close()
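The same append-and-separate pattern in isolation, with placeholder values standing in for the scraped fields (the filename demo_xiaoshuo.txt is arbitrary):

```python
# Placeholder values; in the crawler these come from the regexes above
con_list = ['Example Name', 'Example Author', 'http://example.com/1.txt']

f = open('demo_xiaoshuo.txt', 'a')  # 'a' appends, so each run adds a record
for item in con_list:
    f.write(item + "     ")         # trailing spaces separate the fields
f.write('\n')                       # a newline ends the record
f.close()
```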
 
4. Transcoding

The page is encoded as gb2312. Printing Chinese text directly came out garbled, e.g. the author string extracted by the regex in note 2, so it has to be transcoded with

xiaoshuo_author = author.encode('utf-8')

after which the Chinese prints correctly.
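The round trip can be seen in isolation with an arbitrary sample string; Python ships a gb2312 codec, so bytes in the page encoding can be decoded to Unicode and re-encoded as UTF-8:

```python
text = u'大家好'                  # Unicode text, as requests exposes via html.text
gb = text.encode('gb2312')       # bytes the way the server encodes the page
utf8 = gb.decode('gb2312').encode('utf-8')  # gb2312 -> Unicode -> utf-8
print(utf8.decode('utf-8'))      # prints 大家好
```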
posted @ 2016-01-19 22:25  kingrain