20191118 孙源: Python Programming, Lab 4 Report

Lab Report

Course: Python Programming

Lab: Lab 4

Date: June 1, 2020

Student ID: 20191108

Name: 孙源

Instructor: 王志强

Grade:

Comments:


Objectives and Requirements

Use a Python web crawler to scrape web page content.

Design and Implementation

import requests
import bs4
import re

code_class = {"sh_cpp": ".cpp",
              "sh_c": ".c", "": ".py",
              "sh_pascal": ".pas",
              "sh_java": ".java"}
mylog = {"redirectUrl": "http://openjudge.cn/",
         "password": "",
         "email": ""}
session_requests = requests.session()
own_url = "http://openjudge.cn/"


# Download every accepted solution listed on the profile page at url; if there is a next page, keep going
def download_url(url):
    global session_requests
    global own_url
    global code_class
    ans = session_requests.get(url)
    s = str(ans.content, encoding="utf-8")
    soup = bs4.BeautifulSoup(s, "html.parser")
    blocks = soup.find_all("a", class_="result-right")
    # blocks holds the links to the accepted solution pages; visit each one on this page
    if blocks:
        for i in blocks:
            solution_url = i["href"]
            solution = session_requests.get(solution_url)
            ss = str(solution.content, encoding="utf-8")
            s_soup = bs4.BeautifulSoup(ss, "html.parser")
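            # Determine which language the solution is written in
            for class_name in code_class:
                block = s_soup.find("pre", class_=class_name)
                if block is None:
                    continue
                try:
                    # The second <h3> holds the problem title; drop its last 5 characters
                    name = s_soup.find_all("h3")[1].text[:-5]
                except IndexError:
                    # Without a title there is no file name to save under, so skip
                    print("Failed to get the problem name!")
                    continue
                # Drop the problem number in front of the first ':'
                index = name.find(':')
                if index != -1:
                    name = name[index + 1:]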
                # Replace characters that are illegal in file names and trim surrounding spaces
                name = re.sub(r"[\\/:*?#\"<>|]", " ", name).strip()
                path = "C:/tmp/" + name + code_class[class_name]
                try:
                    # Mode "x" creates the file and fails if a file with this name already exists
                    with open(path, "x", encoding="utf-8") as f:
                        print("downloading your correct code " + name)
                        f.write(block.text)
                except FileExistsError:
                    print(name + " has already been downloaded")
                except OSError as e:
                    print(name + " can't be downloaded correctly")
                    print(e)
    # next_link holds the relative path of the next page, if there is one
    next_link = soup.find("a", class_="nextprev", rel="next")
    if next_link is not None:
        download_url(own_url + next_link["href"])


def spider():
    global code_class
    global mylog, session_requests, own_url
    mylog["email"] = input("Enter the email address you use to log in to openjudge:\n")
    mylog["password"] = input("Enter your password:\n")
    login_url = "http://openjudge.cn/api/auth/login/"
    result = session_requests.post(  # send the login form to the server as a POST request
        login_url,
        data=mylog,
        headers=dict(referer=login_url),
    )
    result = session_requests.get("http://openjudge.cn/")
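    # Use a regular expression to find the URL of the personal home page
    # ("个人首页" is the site's link text for "personal home page")
    pt = r"<a href=\"(http://[^\"]*)\">个人首页</a>"
    match = re.search(pt, result.text)
    if match is None:
        print("The account does not exist or the password is wrong! Please try again!")
        spider()
        return
    own_url = match.group(1)
    print("This is your home page: " + own_url)
    download_url(own_url)
    print("All of your accepted programs have been downloaded to the c:\\tmp folder!")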


spider()
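
The file-name cleanup and the "skip if already downloaded" check from the script can be tried on their own, without logging in. The following is a minimal sketch rather than part of the original program: `clean_title` and `save_once` are hypothetical helper names, and the target folder is passed in as a parameter (an assumption) instead of being fixed to `C:/tmp/`.

```python
import re
from pathlib import Path

# Characters that the script replaces with a space before using a title as a file name.
ILLEGAL_CHARS = re.compile(r'[\\/:*?#"<>|]')


def clean_title(raw_title: str) -> str:
    """Drop the number in front of the first ':' and replace illegal characters."""
    index = raw_title.find(":")
    if index != -1:
        raw_title = raw_title[index + 1:]
    return ILLEGAL_CHARS.sub(" ", raw_title).strip()


def save_once(folder: Path, file_name: str, source_code: str) -> bool:
    """Write source_code to folder/file_name unless that file already exists.

    Returns True if a new file was written and False if it was already there.
    """
    try:
        # Mode "x" creates the file and raises FileExistsError if it exists.
        with open(folder / file_name, "x", encoding="utf-8") as f:
            f.write(source_code)
    except FileExistsError:
        return False
    return True
```

For example, `clean_title("01: A+B Problem")` gives `"A+B Problem"`, and calling `save_once` twice with the same file name writes the file only the first time.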

 

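Course Reflections

At the start of the course I knew little about programming and did not review properly, so I was often lost during lectures. Only after working through the 云班课 videos and following online tutorials was I gradually able to keep up with the teacher's pace. I chose this course hoping to learn a new skill, and learning Python has indeed been very rewarding; it has also helped a great deal with learning C.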

## References:

-  [Python Crawler Examples](https://www.jianshu.com/p/757d8981fdda)

-  [Python Network Programming](https://www.runoob.com/python/python-socket.html)

## Gitee repository link:

[Lab 4](https://gitee.com/sunyuan1118/python-test-2020)

 

Posted 2020-07-06 14:23 by 孙源1118