Writing a Simple Novel Crawler with Python requests + lxml

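I saw on my timetable that we will be learning Python next semester, which surprised me a little: I did not expect a computer science program to schedule this course. Still, since it is on the schedule, I will have to learn it. It took me about a week just to get through the basic syntax = =. Python really is a lot more concise than other languages, although going without semicolons feels awkward. Enough chit-chat. Below is a simple crawler I put together after learning the syntax plus some "CSDN-oriented programming". I am writing it up as a keepsake; it should be fun to look back on later.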

File download:

Lanzou Cloud (蓝奏云)

Usage:

Unzip the archive and run 下载器.exe (the downloader).

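I won't paste the UI code. There is not much point, since you can piece it together from the documentation. Instead, I'll write up the approach and implementation of the core methods. Most of the variables belong to the UI (tkinter), so only the important ones are listed here.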

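import threading

thread_end = False  # whether the user has asked to stop crawling
thread_num = 3  # number of crawler threads
is_runing = False  # whether a crawl is currently in progress
lock = threading.Lock()  # guards the shared chapter set so no two threads take the same chapter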

class Book:
    name = None
    words = None
    author = None
    url = None

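Thread method

Each thread keeps taking chapter links from the shared set and calls the download function on them, until every chapter has been crawled or the user interrupts the crawl.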

def myThread():
    global catalogues_url, main_text, is_runing
    name = threading.current_thread().name
    while not thread_end:
        # Check for remaining work and pop under the same lock, so two threads
        # can never race for the last link and pop from an empty set.
        with lock:
            if not catalogues_url:
                break
            url = catalogues_url.pop()
        down_catalogue(url)
    if thread_end:
        main_text.insert(END, f"Crawling was stopped by the user; {name} has stopped working.\n")
    else:
        main_text.insert(END, f"Crawling finished; {name} has stopped working.\n")
    main_text.yview_moveto(1)
    is_runing = False
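
To make the pattern easier to try without the tkinter UI, here is a minimal, self-contained sketch of the same worker loop. The names `run_workers`, `fake_download` and `stop_event` are made up for this illustration and are not part of the downloader; a `threading.Event` stands in for the `thread_end` flag, and the "download" only records which link it was given.

```python
import threading


def run_workers(urls, handle, thread_num=3, stop_event=None):
    """Drain a set of links with several threads, calling handle(url) on each."""
    pending = set(urls)  # work on a copy so the caller's set is untouched
    lock = threading.Lock()
    stop_event = stop_event or threading.Event()

    def worker():
        while not stop_event.is_set():
            # Test for emptiness and pop inside one critical section.
            with lock:
                if not pending:
                    return
                url = pending.pop()
            handle(url)  # the slow part runs outside the lock

    threads = [threading.Thread(target=worker, name=f"worker-{i}")
               for i in range(thread_num)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pending  # links left over if the crawl was stopped early


done = []
done_lock = threading.Lock()


def fake_download(url):
    # Stand-in for the real download: just remember the link.
    with done_lock:
        done.append(url)


leftover = run_workers({f"chapter-{i}" for i in range(20)}, fake_download)
```

Because the emptiness check and the `pop()` happen under the same lock, each link is handed to exactly one thread, and a thread that arrives after the set has been emptied simply returns.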

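Download method

The url here is a single chapter link that was prepared earlier; the links are kept in a set. Once the page has been fetched, it is parsed with lxml and the chapter text is saved to disk.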

import requests
from lxml import etree

def down_catalogue(url):
    global main_progressbar_value
    # One short-lived session per chapter; the with-block closes it for us.
    with requests.Session() as session:
        # session.proxies = {"https": "106.14.255.124:80", "http": "58.246.58.150:9002", }
        book_content_html = session.get(url)
    # Parse the page once and reuse the tree for all three XPath queries.
    tree = etree.HTML(book_content_html.text)
    # xpath() returns a list of text nodes, so join them into plain strings.
    book_name = "".join(tree.xpath("/html/body/div[2]/div[3]/div[2]/a[3]/text()")).strip()
    book_catalogue_name = "".join(tree.xpath(
        "/html/body/div[2]/div[3]/div[3]/div/div[1]/div[2]/div[2]/text()")).strip()
    book_content = "".join(line + "\n" for line in tree.xpath(
        "/html/body/div[2]/div[3]/div[3]/div/div[1]/div[5]/p/text()"))
    # Write as UTF-8 so the output does not depend on the system default encoding.
    path = f"{down_local_path_str}\\{book_name}{book_catalogue_name}.txt"
    with open(path, "a+", encoding="utf-8") as file:
        file.write(book_content)
    main_progressbar_value += 1
    main_progressbar['value'] = main_progressbar_value
    main_win.update()
    print(book_content)
    main_text.insert(END, f"{threading.current_thread().name} finished downloading {book_name}{book_catalogue_name}\n")
    main_text.yview_moveto(1)
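
One detail worth seeing in isolation is that lxml's `xpath()` returns a list of text nodes, not a string, so the results need to be indexed or joined before they go into a file name. The sketch below runs on a small inline HTML sample rather than the real site; the markup, the class names and the `extract_chapter` helper are all assumptions made up for this example.

```python
from lxml import etree

# A made-up page; the real site's structure is different.
SAMPLE_HTML = """
<html><body>
  <div class="title">Chapter 1</div>
  <div class="content"><p>First line.</p><p>Second line.</p></div>
</body></html>
"""


def extract_chapter(html):
    """Return (title, body) extracted from a chapter page."""
    tree = etree.HTML(html)  # parse once, query many times
    title_nodes = tree.xpath('//div[@class="title"]/text()')
    paragraphs = tree.xpath('//div[@class="content"]/p/text()')
    # Both results are lists, so pick or join the pieces explicitly.
    title = title_nodes[0].strip() if title_nodes else "untitled"
    body = "\n".join(p.strip() for p in paragraphs)
    return title, body


title, body = extract_chapter(SAMPLE_HTML)
```

Matching on attributes such as `class` is usually less brittle than an absolute path like `/html/body/div[2]/...`, which breaks as soon as the site adds or removes a wrapper element.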

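Those are the two main methods. The rest is a pile of odds and ends; how much to optimize or polish is up to you.

A main function of more than a hundred lines.... that is, quite frankly, xx.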
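
The step that fills the set with chapter links is not shown in this post. As a rough sketch of how it might look, the snippet below pulls the `href` of every chapter link out of a table-of-contents page and turns it into an absolute URL. The `collect_chapter_links` helper, the `ul#chapters` markup and the example.com address are all hypothetical and chosen only for illustration.

```python
from urllib.parse import urljoin

from lxml import etree

# A made-up table-of-contents page with one duplicated link.
SAMPLE = (
    '<html><body><ul id="chapters">'
    '<li><a href="/book/1/1.html">Chapter 1</a></li>'
    '<li><a href="/book/1/2.html">Chapter 2</a></li>'
    '<li><a href="/book/1/1.html">Chapter 1 (latest)</a></li>'
    '</ul></body></html>'
)


def collect_chapter_links(catalogue_html, base_url):
    """Return the set of absolute chapter URLs found on a catalogue page."""
    tree = etree.HTML(catalogue_html)
    hrefs = tree.xpath('//ul[@id="chapters"]/li/a/@href')
    # A set drops duplicate links automatically.
    return {urljoin(base_url, href) for href in hrefs}


links = collect_chapter_links(SAMPLE, "https://example.com/book/1/")
```

A set removes duplicates for free, but it is unordered, so `pop()` hands out chapters in arbitrary order. That is fine when every chapter goes into its own file, as it does here.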

posted @ 2023-02-07 16:19 *RavE
