117李智濠

Hadoop Final Project

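Summary: 1. Use Hive to run a word-frequency count on the text file produced by the crawler final project (or on the long English novel downloaded for the English word-frequency exercise). a. Start b. View c. Save the crawler final project's results to a txt file and store it in HDFS. This is the data scraped by the crawler. Write the data obtained from the crawler final project into the newly created MuKeData.txt, then store it in /user/hadoop/webi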
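A word-frequency count in Hive is usually expressed as splitting each line into words, exploding them into rows, and grouping by word. The sketch below mirrors those three steps in plain Python to show what such a query computes. The function name and the single-space delimiter are assumptions for illustration; the post's actual query is not shown in the summary.

```python
from collections import Counter


def hive_style_word_count(lines):
    """Mimic split -> explode -> GROUP BY word -> ORDER BY word.

    Hypothetical helper: splitting on a single space is an assumption,
    not something taken from the original post.
    """
    # split(line, ' ') followed by explode(): one word per "row";
    # empty strings from repeated spaces are dropped
    words = (w for line in lines for w in line.split(" ") if w)
    # GROUP BY word with COUNT, then ORDER BY word
    return sorted(Counter(words).items())


if __name__ == "__main__":
    for word, count in hive_style_word_count(["hello hive", "hello hdfs"]):
        print(word, count)
```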

posted @ 2018-05-24 11:58 117李智濠 Views (346) Comments (0) Recommended (0)

Understanding MapReduce

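Summary: 1. Write a WordCount program in Python and submit the job 2. Change its permissions accordingly 3. Test the code on the local machine 2. Use MapReduce to process a weather dataset: write a program that finds the daily maximum and minimum temperatures and the maximum and minimum temperatures over a range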
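Both tasks follow the same map, shuffle, reduce pattern, which can be sketched in memory before submitting anything to a cluster. The code below is a minimal simulation, not the post's program: `run_mapreduce` and the mapper and reducer names are hypothetical, and the weather record format (`YYYYMMDD temperature`) is an assumption, since the summary does not describe the dataset layout. A real Python job on Hadoop would typically read from standard input instead.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer):
    """Simulate map -> shuffle/sort -> reduce in memory (hypothetical helper)."""
    # Map phase: every record may emit any number of (key, value) pairs
    pairs = [kv for record in records for kv in mapper(record)]
    # Shuffle phase: sort by key so equal keys become adjacent
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


def wc_mapper(line):
    # Emit (word, 1) for each whitespace-separated word
    for word in line.split():
        yield word.lower(), 1


def wc_reducer(word, counts):
    return word, sum(counts)


def temp_mapper(line):
    # Assumed record format: "YYYYMMDD temperature"
    day, temp = line.split()
    yield day, float(temp)


def temp_reducer(day, temps):
    # Daily maximum and minimum temperature
    return day, max(temps), min(temps)


if __name__ == "__main__":
    print(run_mapreduce(["a b a", "B"], wc_mapper, wc_reducer))
    print(run_mapreduce(["20180501 21.5", "20180501 30.0"], temp_mapper, temp_reducer))
```

The range-wide maximum and minimum can reuse the same reducer by having the mapper emit a single constant key instead of the day.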

posted @ 2018-05-10 21:44 117李智濠 Views (107) Comments (0) Recommended (0)

Getting Familiar with Common HBase Operations

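Summary: 1. Given the following relational database tables and data, convert them into tables suitable for HBase storage and insert the data: the Student table (Student) (excluding the last column) 2. Complete the same tasks using the HBase Shell commands provided by Hadoop: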
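Converting a relational row for HBase generally means choosing one column as the row key and storing each remaining column as a `family:qualifier` cell. The sketch below generates HBase shell `put` command strings from such rows. The helper name, the `info` column family, and the column names and values are all hypothetical, because the summary does not list the Student table's columns.

```python
def to_hbase_puts(table, family, key_field, rows):
    """Turn relational rows (dicts) into HBase shell `put` command strings.

    Hypothetical helper for illustration; it only builds text and does
    not connect to HBase.
    """
    commands = []
    for row in rows:
        row_key = row[key_field]
        for column, value in row.items():
            if column == key_field:
                continue  # the key column becomes the row key, not a cell
            commands.append(
                f"put '{table}', '{row_key}', '{family}:{column}', '{value}'"
            )
    return commands


if __name__ == "__main__":
    # Assumed sample data; not taken from the original post
    students = [{"S_No": "2015001", "S_Name": "Zhangsan", "S_Age": 23}]
    for command in to_hbase_puts("Student", "info", "S_No", students):
        print(command)
```

The table itself would need to exist first, for example via `create 'Student', 'info'` in the HBase shell.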

posted @ 2018-05-04 20:06 117李智濠 Views (139) Comments (0) Recommended (0)

Chapter 3: Getting Familiar with Common HDFS Operations

Summary: I. Complete the same tasks using the Shell commands provided by Hadoop: II.

posted @ 2018-04-27 19:32 117李智濠 Views (119) Comments (0) Recommended (0)

Structuring Data

Summary:
    # -*- coding: utf-8 -*-
    import requests
    import re
    import pandas
    from bs4 import BeautifulSoup
    from datetime import datetime

    def getPageN(pageUrl):
        res

posted @ 2018-04-16 11:49 117李智濠 Views (167) Comments (0) Recommended (0)

Crawler Final Project

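Summary: Starting from a configured entry URL, find the links to content pages on the home page and the maximum page number on the home page; iterate over page numbers and content-page links with nested loops to crawl to a depth of 3; generate item objects with yield, and output the top 20 words by count after the word-frequency statistics. This function defines the keys of the Scrapy item used to pass values. This function sets the header information and the delay time. Using jieb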
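The crawl loop and the top-20 step described above can be sketched without Scrapy. In this minimal version, `iter_items`, `links_for_page`, and `top_words` are hypothetical names, the example URLs are placeholders, and the tokens are assumed to have been produced already by a word segmenter; none of this is the post's actual spider.

```python
from collections import Counter


def iter_items(max_page, links_for_page):
    """Nested loop over list pages and their content links, yielding items.

    `links_for_page` is an assumed callable that returns the content-page
    links found on a given list page.
    """
    for page in range(1, max_page + 1):
        for link in links_for_page(page):
            # Each item is produced lazily, as a Scrapy spider would with yield
            yield {"page": page, "url": link}


def top_words(tokens, n=20):
    """Return the n most frequent non-blank tokens with their counts."""
    return Counter(t for t in tokens if t.strip()).most_common(n)


if __name__ == "__main__":
    # Placeholder link generator; a real crawler would parse each list page
    fake_links = lambda page: [f"http://example.com/{page}/a", f"http://example.com/{page}/b"]
    for item in iter_items(2, fake_links):
        print(item)
    print(top_words(["data", "crawler", "data", " "], n=1))
```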

posted @ 2018-04-16 11:48 117李智濠 Views (305) Comments (0) Recommended (0)

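Summary:
    import re

    a = "123456789@qq.com"
    b = '020-88770099'
    # Raw strings keep the backslashes intact; the dot is escaped so it
    # matches only a literal "." in the domain
    mail = re.search(r'\d{6,12}@[a-zA-Z0-9]+\.[a-zA-Z0-9]+', a).group(0)
    tele_num = re.search(r'\d{3,4}-\d{6,8}', b).group(0)
    print(mail + '\n' + tele_num)
    ...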

posted @ 2018-04-09 11:39 117李智濠 Views (159) Comments (0) Recommended (0)

Crawling the News on the Campus News Home Page

Summary:
    import requests
    from bs4 import BeautifulSoup
    from datetime import datetime

    url = "http://news.gzcc.cn/html/xiaoyuanxinwen/"
    res = requests.get(url)
    res.encoding = "utf-8"
    soup = BeautifulSoup(res.t...

posted @ 2018-04-02 11:43 117李智濠 Views (247) Comments (0) Recommended (0)

Basic Web Crawler Exercises

Summary:
    import requests
    from bs4 import BeautifulSoup as bs

    url = "http://news.gzcc.cn/html/2018/xiaoyuanxinwen_0329/9122.html"
    res = requests.get(url).content.decode('utf-8')
    soup = bs(res, 'html.parser')
    ...

posted @ 2018-03-29 11:45 117李智濠 Views (151) Comments (0) Recommended (0)

Comprehensive Exercise

Summary:
    f = open('C:\\Users\\Administrator\\Desktop\\s.txt', 'r', encoding='utf-8')
    a = f.read()
    d = {}
    h = '''.'!?:,'''
    danci = ['the', 'and', 'a']
    for j in
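The excerpt is cut off before its loop, so the rest of the author's code is unknown. As a separate illustration of the technique its variables suggest (a punctuation string, a stop-word list, and a count dictionary), the sketch below strips punctuation, skips stop words, and counts what remains. The function name and the lowercase normalization are assumptions.

```python
def count_words(text, punctuation=".'!?:,", stop_words=("the", "and", "a")):
    """Count words after removing punctuation and skipping stop words.

    Hypothetical helper; the defaults reuse the punctuation characters and
    stop words visible in the excerpt.
    """
    # Replace each punctuation character with a space so words stay separated
    for ch in punctuation:
        text = text.replace(ch, " ")
    counts = {}
    for word in text.lower().split():
        if word in stop_words:
            continue  # ignore common words that carry little meaning
        counts[word] = counts.get(word, 0) + 1
    return counts


if __name__ == "__main__":
    print(count_words("The cat and the hat. A cat!"))
```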

posted @ 2018-03-26 11:24 117李智濠 Views (71) Comments (0) Recommended (0)