Data Collection: Assignment 5
I. Assignment Content
Assignment 1:
Requirements:
Become proficient with using Selenium to locate HTML elements, crawl Ajax-loaded page data, and wait for HTML elements. Use the Selenium framework to crawl product information and images for a chosen product category on JD.com.
Candidate site: http://www.jd.com/
Keyword: chosen freely by the student
Output: the MySQL output format is shown below

Steps:
① Create the table and submit the search query
# Imports required by the spider (gathered here for completeness)
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
import pymysql
import os
import time
import threading
import urllib.request

def startUp(self, url, key):
    # Initialize the Chrome browser in headless mode
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    self.driver = webdriver.Chrome(chrome_options=chrome_options)
    # Initialize variables
    self.threads = []
    self.No = 0
    self.imgNo = 0
    # Initialize the database
    try:
        self.conn = pymysql.connect(host='localhost', port=3306, user='root',
                                    password='cyz2952178079', db='phones', charset='utf8')
        self.cursor = self.conn.cursor()
        try:
            # Drop the table if it already exists
            self.cursor.execute("drop table phones")
        except:
            pass
        try:
            # Create a fresh table
            sql = ("create table phones (mNo varchar(32) primary key, mMark varchar(256), "
                   "mPrice varchar(32), mNote varchar(1024), mFile varchar(256))")
            self.cursor.execute(sql)
        except:
            pass
    except Exception as err:
        print(err)
    # Initialize (and empty) the images folder
    try:
        if not os.path.exists(MySpider.imagePath):
            os.mkdir(MySpider.imagePath)
        images = os.listdir(MySpider.imagePath)
        for img in images:
            s = os.path.join(MySpider.imagePath, img)
            os.remove(s)
    except Exception as err:
        print(err)
    # Open the site and submit the search keyword
    self.driver.get(url)
    keyInput = self.driver.find_element_by_id("key")
    keyInput.send_keys(key)
    keyInput.send_keys(Keys.ENTER)
② Locate the XPath for each page element and insert the data into the database
def processSpider(self):
    try:
        time.sleep(1)
        print(self.driver.current_url)
        lis = self.driver.find_elements_by_xpath("//div[@id='J_goodsList']//li[@class='gl-item']")
        for li in lis:
            # The image URL is found either in the src or in the data-lazy-img attribute
            try:
                src1 = li.find_element_by_xpath(".//div[@class='p-img']//a//img").get_attribute("src")
            except:
                src1 = ""
            try:
                src2 = li.find_element_by_xpath(".//div[@class='p-img']//a//img").get_attribute("data-lazy-img")
            except:
                src2 = ""
            try:
                price = li.find_element_by_xpath(".//div[@class='p-price']//i").text
            except:
                price = "0"
            try:
                note = li.find_element_by_xpath(".//div[@class='p-name p-name-type-2']//em").text
                mark = note.split(" ")[0]
                # Strip the "爱心东东" promo badge and commas from brand and description
                mark = mark.replace("爱心东东\n", "")
                mark = mark.replace(",", "")
                note = note.replace("爱心东东\n", "")
                note = note.replace(",", "")
            except:
                note = ""
                mark = ""
            # Build a zero-padded six-digit serial number
            self.No = self.No + 1
            no = str(self.No)
            while len(no) < 6:
                no = "0" + no
            print(no, mark, price)
            if src1:
                src1 = urllib.request.urljoin(self.driver.current_url, src1)
                p = src1.rfind(".")
                mFile = no + src1[p:]
            elif src2:
                src2 = urllib.request.urljoin(self.driver.current_url, src2)
                p = src2.rfind(".")
                mFile = no + src2[p:]
            if src1 or src2:
                # Download each image on its own thread
                T = threading.Thread(target=self.download, args=(src1, src2, mFile))
                T.setDaemon(False)
                T.start()
                self.threads.append(T)
            else:
                mFile = ""
            self.insertDB(no, mark, price, note, mFile)
        # The pagination code shown in step ③ follows here
    except Exception as err:
        print(err)
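The loop above calls self.insertDB, which the write-up does not list. A minimal sketch of what it might look like, assuming the phones table created in startUp:

def insertDB(self, mNo, mMark, mPrice, mNote, mFile):
    # Hedged sketch: the original insertDB is not shown in the write-up
    try:
        sql = "insert into phones (mNo, mMark, mPrice, mNote, mFile) values (%s, %s, %s, %s, %s)"
        self.cursor.execute(sql, (mNo, mMark, mPrice, mNote, mFile))
        self.conn.commit()
    except Exception as err:
        print(err)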
③ Implement pagination
First inspect the relevant tags, then write the pagination code (it runs at the end of processSpider):
try:
    # If the "next page" button carries the disabled class, this is the last page
    self.driver.find_element_by_xpath("//span[@class='p-num']//a[@class='pn-next disabled']")
except:
    nextPage = self.driver.find_element_by_xpath("//span[@class='p-num']//a[@class='pn-next']")
    time.sleep(10)
    nextPage.click()
    self.processSpider()
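The pagination above pauses with a fixed time.sleep(10). Since the assignment also covers waiting for HTML elements, an explicit wait could replace the fixed delay; a sketch using Selenium's WebDriverWait with the same locator as above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the "next page" button to become clickable,
# rather than always sleeping for the full 10 seconds
nextPage = WebDriverWait(self.driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, "//span[@class='p-num']//a[@class='pn-next']")))
nextPage.click()
self.processSpider()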
④ Download the images
def download(self, src1, src2, mFile):
    data = None
    # Try the src URL first, then fall back to data-lazy-img
    if src1:
        try:
            req = urllib.request.Request(src1, headers=MySpider.headers)
            resp = urllib.request.urlopen(req, timeout=10)
            data = resp.read()
        except:
            pass
    if not data and src2:
        try:
            req = urllib.request.Request(src2, headers=MySpider.headers)
            resp = urllib.request.urlopen(req, timeout=10)
            data = resp.read()
        except:
            pass
    if data:
        print("download begin", mFile)
        fobj = open(MySpider.imagePath + "\\" + mFile, "wb")
        fobj.write(data)
        fobj.close()
        print("download finish", mFile)
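Because each image is downloaded on its own thread, the spider should join the threads and release its resources before exiting. The write-up omits this teardown; a minimal sketch of a possible closeUp:

def closeUp(self):
    # Hedged sketch: wait for all download threads, then commit and close everything
    try:
        for t in self.threads:
            t.join()
        self.conn.commit()
        self.conn.close()
        self.driver.close()
    except Exception as err:
        print(err)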
⑤ Check the results
Assignment 2:
Requirements:
Become proficient with using Selenium to locate HTML elements, simulate user login, crawl Ajax-loaded page data, and wait for HTML elements.
Use the Selenium framework plus MySQL to crawl course information from the China MOOC site (course number, course name, teaching progress, course status, and course image URL), and save each course image into the imgs folder under the project root, naming each image after its course.
Candidate site: China MOOC (icourse163): https://www.icourse163.org
Output: MySQL database storage and output format
Column headers should use English names, e.g. course number: Id, course name: cCourse, ..., to be defined by each student.
Table schema:

Steps:
① Set up the driver
from selenium import webdriver
import time
import urllib.request

# Use the Chrome driver
options = webdriver.ChromeOptions()
# options.add_experimental_option('prefs', {'profile.managed_default_content_settings.images': 2})
browser = webdriver.Chrome(options=options)
url = 'https://www.icourse163.org/'
data_list = []
② Log in
browser.get(url)
# Locate and click the login button
start = browser.find_element_by_css_selector('a.f-f0.navLoginBtn')
start.click()
# Pause 20 seconds to scan the QR code and log in manually
time.sleep(20)
person = browser.find_element_by_xpath(
    '/html/body/div[4]/div[1]/div/div/div/div/div[7]/div[3]/div/div/a/span')
person.click()
print("Entering the personal center")
③ Retrieve course information
print("慕课的课程总数:", len(courses)) for course in courses: try: name = course.find_element_by_xpath('./div[1]/a/div[2]/div[1]/div[1]/div/span[2]').text school = course.find_element_by_xpath('./div[1]/a/div[2]/div[1]/div[2]/a').text schedule = course.find_element_by_xpath('./div[1]/a/div[2]/div[2]/div[1]/div[1]/div[1]/a/span').text status = course.find_element_by_xpath('./div[1]/a/div[2]/div[2]/div[2]').text imgUrl = course.find_element_by_xpath('./div[1]/a/div[1]//img').get_attribute("src") filename = "./images/"+str(name)+".jpg" urllib.request.urlretrieve(imgUrl, filename) data_list.append([name, school, schedule, status, imgUrl]) except Exception as err: print("error:", err)
④ Check the results
Assignment 3:
Requirements:
Understand the Flume architecture and its key features, and master using Flume to complete log-collection tasks.
Complete the Flume log-collection experiment, which includes the following tasks:
Task 1: Enable the MapReduce service

Task 2: Generate test data with a Python script
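The generator script appears only as a screenshot in the original. A minimal sketch of what such a script might look like (the output path, record layout, and rate are assumptions):

import random
import time

def generate(path="/tmp/flume_test/test.log", n=100):
    # Hypothetical generator: append one fake access-log record per second
    with open(path, "a") as f:
        for i in range(n):
            f.write("{},{},user{}\n".format(int(time.time()), random.randint(100, 999), i))
            f.flush()
            time.sleep(1)

if __name__ == "__main__":
    generate()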

Task 3: Configure Kafka
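Configuring Kafka usually begins with creating the topic that Flume will write into. A hedged example using the standard Kafka CLI (the topic name and broker address are placeholders; older Kafka releases take --zookeeper instead of --bootstrap-server):

kafka-topics.sh --create --topic <topic-name> --partitions 1 --replication-factor 1 --bootstrap-server <broker-ip>:9092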

Task 4: Install the Flume client
Click to download

Install the Flume environment
Install the Flume client
Installation complete
Task 5: Configure Flume to collect data
Edit the file
properties.properties
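The edited file appears only as a screenshot in the original. A minimal sketch of a properties.properties that forwards the generated logs into Kafka (the agent name client follows the MRS Flume client convention; the directory, topic, and broker address are placeholders):

client.sources = s1
client.channels = c1
client.sinks = k1

# Watch the directory the test-data script writes into (path assumed)
client.sources.s1.type = spooldir
client.sources.s1.spoolDir = /tmp/flume_spooldir
client.sources.s1.channels = c1

client.channels.c1.type = memory
client.channels.c1.capacity = 10000
client.channels.c1.transactionCapacity = 1000

# Forward events to the Kafka topic created in Task 3 (topic/broker assumed)
client.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
client.sinks.k1.channel = c1
client.sinks.k1.kafka.topic = <topic-name>
client.sinks.k1.kafka.bootstrap.servers = <broker-ip>:9092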

Create a consumer to read the data from Kafka and check the results
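A hedged example of the console consumer used for this check (topic and broker are the same placeholders as above):

kafka-console-consumer.sh --topic <topic-name> --bootstrap-server <broker-ip>:9092 --from-beginning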
Takeaways:
Through this assignment, I became more familiar with using Selenium to locate HTML elements and simulate user login, and I also gained a deeper understanding of the Flume architecture.