Scraping Novel Files (Day 1)

URL:

http://www.aqtxt.com/xiazai/txt/21984.htm

 

The number in the URL can be swapped for any value from 0 upward.

Code:

# coding: utf-8
import requests
import re
# reload(sys); sys.setdefaultencoding('utf-8')  # Python 2 encoding workaround, see note 1
# from bs4 import BeautifulSoup

def url(n):
    # Walk from n down to 0 with a loop; the original recursive version
    # (return url(n-1)) exceeds Python's recursion limit for large n.
    while n >= 0:
        xiaoshuo_url = "http://www.aqtxt.com/xiazai/txt/" + str(n) + ".htm"
        html = requests.get(xiaoshuo_url)
        html.encoding = 'gb2312'  # the site serves gb2312, see note 4

        # re.findall returns a list, so take element [0]
        name = re.findall('<h1>(.*?)<span id="author">', html.text, re.S)[0]
        xiaoshuo_name = name.encode('utf-8')
        print "Title: " + xiaoshuo_name

        author = re.findall('<span id="author">(.*?)</span>', html.text, re.S)[0]
        xiaoshuo_author = author.encode('utf-8')
        print xiaoshuo_author

        xiaoshuo_down_urls = re.findall('<li class="bd"><a href="(.*?)" title="', html.text, re.S)[0]
        print xiaoshuo_down_urls

        # Append title, author and download link as one record
        f = open('xiaoshuo.txt', 'a')
        for item in [xiaoshuo_name, xiaoshuo_author, xiaoshuo_down_urls]:
            f.write(item)
        f.write('\n')
        f.close()

        n -= 1

url(21984)  # any starting value works
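A side note on why descending recursively through so many pages is risky: each nested call adds one stack frame, and CPython caps the call stack (sys.getrecursionlimit(), typically 1000), so tens of thousands of nested calls blow the limit before reaching page 0. A minimal sketch (Python 3, where the failure surfaces as RecursionError):

```python
import sys

def countdown(n):
    # Mirrors the crawler's tail: one nested call per page
    if n < 0:
        return 0
    return countdown(n - 1)

print(sys.getrecursionlimit())  # default cap, typically 1000
try:
    countdown(21984)
except RecursionError:
    print("recursion limit exceeded")
```

Rewriting the descent as a while or for loop removes the stack depth entirely.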

  

1. Encoding issues

http://againinput4.blog.163.com/blog/static/1727994912011111011432810/

2. Regular expressions

Read the page source carefully; it is best inspected with the browser's "Edit as HTML" view.

import re
author = re.findall('<span id="author">(.*?)</span>', html.text, re.S)[0]

re.findall returns a list, so you have to index into it with [i] (here [0]).
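To see the list-vs-string point in isolation, here is a self-contained sketch against a made-up HTML fragment (not the real page source):

```python
import re

# A fabricated fragment that mimics the page's structure
html_text = '<h1>BookTitle<span id="author">SomeAuthor</span></h1>'

matches = re.findall('<span id="author">(.*?)</span>', html_text, re.S)
print(matches)     # re.findall gives a list: ['SomeAuthor']
print(matches[0])  # indexing with [0] yields the string itself
```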

3. Creating a file, writing the text, and adding spacing

con_list = []
con_list.extend([xiaoshuo_name, xiaoshuo_author, xiaoshuo_down_urls])
f = open('xiaoshuo.txt', 'a')   # f avoids shadowing the built-in name file
for item in con_list:
    f.write(item + "     ")
f.write('\n')
f.close()
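The same append-and-separate pattern in isolation, with placeholder values standing in for the scraped fields (the filename demo_xiaoshuo.txt is arbitrary):

```python
# Placeholder values; in the crawler these come from the regexes above
con_list = ['Example Name', 'Example Author', 'http://example.com/1.txt']

f = open('demo_xiaoshuo.txt', 'a')  # 'a' appends, so each run adds a record
for item in con_list:
    f.write(item + "     ")         # trailing spaces separate the fields
f.write('\n')                       # a newline ends the record
f.close()
```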
 
4. Transcoding

The page is encoded as gb2312. Printing Chinese text directly came out garbled, e.g. the author string extracted by the regex in note 2, so it has to be transcoded with

xiaoshuo_author = author.encode('utf-8')

after which the Chinese prints correctly.
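The round trip can be seen in isolation with an arbitrary sample string; Python ships a gb2312 codec, so bytes in the page encoding can be decoded to Unicode and re-encoded as UTF-8:

```python
text = u'大家好'                  # Unicode text, as requests exposes via html.text
gb = text.encode('gb2312')       # bytes the way the server encodes the page
utf8 = gb.decode('gb2312').encode('utf-8')  # gb2312 -> Unicode -> utf-8
print(utf8.decode('utf-8'))      # prints 大家好
```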
posted @ 2016-01-19 22:25  kingrain