184陈楚浩

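Hadoop Comprehensive Final Project

Summary: 1. Use Hive to compute word frequencies for the text file produced by the crawler project (or for the English novel downloaded for the English word-frequency exercise). I downloaded the English novel The Souls of Black Folk, which has 69,100 words in total, and renamed it Black.txt for convenience. Run start-all.sh. Create a folder on HDFS: hdf...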

posted @ 2018-05-25 21:46 184陈楚浩 Views (90) Comments (0) Recommended (0)

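Basic Hive Operations and Applications

Summary: Complete WordCount with Hive on Hadoop. Start Hadoop. Create a folder on HDFS and upload the file to HDFS. Start Hive. Create the raw document table, load the file contents into the table docs, and view them. load data inpath '/user/hadoop/tese1/try.txt' overwrite int...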
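A WordCount of this kind boils down to three steps: split each line into words, flatten them to one word per row, and group by word. Below is a minimal Python sketch of that logic. It is not the Hive query from the post; the helper name `word_count` and the sample lines are assumptions made up for illustration.

```python
from collections import defaultdict


def word_count(lines):
    """Count words the way a split / explode / GROUP BY pipeline would."""
    counts = defaultdict(int)
    for line in lines:              # one row per line, like the docs table
        for word in line.split():   # split on whitespace, one word per row
            counts[word] += 1       # group by word and count
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up sample input standing in for the contents of try.txt.
sample_lines = ["hello hadoop", "hello hive"]
result = word_count(sample_lines)
print(result)
```

Sorting on the pair (negative count, word) keeps ties in a predictable order, which makes the output easier to compare between runs.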

posted @ 2018-05-16 21:57 184陈楚浩 Views (114) Comments (0) Recommended (0)

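Processing a Weather Dataset with MapReduce

Summary: Process a weather dataset with MapReduce: write a program that finds the daily maximum and minimum temperatures, and the maximum and minimum temperatures over a date range.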
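The map and reduce steps for this task can be sketched in plain Python. This is only an illustration of the idea, not the program from the post: the record format ("YYYYMMDD temperature"), the readings, and the function names `mapper` and `reducer` are all assumptions.

```python
from collections import defaultdict


def mapper(record):
    """Emit a (date, temperature) pair from one record (format assumed)."""
    date, temp = record.split()
    return date, int(temp)


def reducer(pairs):
    """Group temperatures by date and keep (max, min) for each day."""
    grouped = defaultdict(list)
    for date, temp in pairs:
        grouped[date].append(temp)
    return {date: (max(temps), min(temps)) for date, temps in grouped.items()}


# Made-up readings: "YYYYMMDD temperature", several per day.
records = ["20180501 21", "20180501 30", "20180502 18", "20180502 25"]
daily = reducer(mapper(r) for r in records)

# Max and min over the whole date range, derived from the daily results.
range_max = max(high for high, _ in daily.values())
range_min = min(low for _, low in daily.values())
print(daily, range_max, range_min)
```

The range-wide extremes are computed from the per-day results, so the second pass only touches one pair per day instead of every reading.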

posted @ 2018-05-09 21:50 184陈楚浩 Views (106) Comments (0) Recommended (0)

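Getting Familiar with Common HBase Operations

Summary: 1. Convert the following relational database tables and data into tables suitable for HBase storage, and insert the data. Student table (Student), excluding the last column: student number (S_No), name (S_Name), sex (S_Sex), age (S_Age), course (course). 2015001 Zhangsan male 23 2015003...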
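One way to picture the conversion is to treat the primary key as the HBase row key and each remaining column as a `family:qualifier` cell. The sketch below models that mapping with plain Python dictionaries and does not talk to HBase. The column family name `info` and the helper `to_hbase_row` are assumptions for illustration, not the design used in the post.

```python
def to_hbase_row(record, family="info"):
    """Turn one relational row into (row_key, {"family:qualifier": value})."""
    row_key = record["S_No"]
    cells = {
        f"{family}:{column}": str(value)   # HBase stores values as bytes/strings
        for column, value in record.items()
        if column != "S_No"                # the key is stored once, as the row key
    }
    return row_key, cells


# First student row from the summary above (course column left out).
student = {"S_No": "2015001", "S_Name": "Zhangsan", "S_Sex": "male", "S_Age": 23}
row_key, cells = to_hbase_row(student)
print(row_key, cells)
```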

posted @ 2018-05-08 20:55 184陈楚浩 Views (72) Comments (0) Recommended (0)

Crawler Final Project

Summary:
    import requests
    from bs4 import BeautifulSoup
    from datetime import datetime
    import re
    import jieba

    def getNewsDetail(newsurl):  # fetch the news details
        resd=requests.get(newsurl)
        resd.encoding='utf-8'
        sou...

posted @ 2018-04-30 19:37 184陈楚浩 Views (110) Comments (0) Recommended (0)

Summary:
    import requests
    from bs4 import BeautifulSoup
    from datetime import datetime
    import re
    import pandas

    # Get the click count
    def getClickCount(newsUrl):
        newId=re.search(r'\_(.*).html',newsUrl).group(1).split('/')...
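The regular-expression step in `getClickCount` can be tried on its own, without any network access. The sketch below is only an illustration: the URL is a hypothetical example, and since the excerpt is cut off after `split('/')`, the sketch just prints both pieces and does not guess which one the post uses.

```python
import re

# Hypothetical URL; the real URL layout is an assumption.
news_url = "http://news.gzcc.cn/html/2018/xiaoyuanxinwen_0404/9183.html"

# Capture everything between the underscore and the ".html" suffix.
match = re.search(r"_(.*)\.html", news_url)

# Guard against URLs that do not match instead of calling .group() on None.
parts = match.group(1).split("/") if match else []
print(parts)
```

Checking `match` before calling `.group(1)` avoids an `AttributeError` when a URL does not follow the expected pattern.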

posted @ 2018-04-12 20:43 184陈楚浩 Views (119) Comments (0) Recommended (0)

Crawling the News on the Campus News Front Page

Summary:
    import requests
    from bs4 import BeautifulSoup
    from datetime import datetime
    import re
    res = requests.get('http://news.gzcc.cn/html/xiaoyuanxinwen/')
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.te...

posted @ 2018-04-09 20:22 184陈楚浩 Views (157) Comments (0) Recommended (0)

Crawling Campus News

Summary:
    import requests
    re=requests.get('http://news.gzcc.cn/html/xiaoyuanxinwen/')
    re.encoding='utf-8'
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(re.text,'html.parser')
    for news in soup.select('li')...

posted @ 2018-04-04 15:22 184陈楚浩 Views (117) Comments (0) Recommended (0)

Basic Web Crawler Exercises

Summary:
    import requests
    from bs4 import BeautifulSoup
    res = requests.get('http://www.qq.com/')
    res.encoding = 'UTF-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    # Extract the text of the h1 tags
    for h1 in soup.find_all('h1'):...

posted @ 2018-03-29 20:49 184陈楚浩 Views (66) Comments (0) Recommended (0)

Chinese Word Frequency Statistics

Summary:
    import jieba
    file=open('sanguoyanyi','r',encoding = 'utf-8')
    wordList=list(jieba.cut(file.read()))
    wordDict={}
    for word in wordList:
        if(len(word)==1):
            continue
        wordDict[word]= wordL...
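The counting step can also be written with `collections.Counter`, which tallies every token in a single pass. This is a sketch of an alternative, not the code from the post (the excerpt is cut off before the counting expression). The token list is made up and stands in for the output of `jieba.cut`, and `count_words` is a hypothetical helper.

```python
from collections import Counter


def count_words(tokens):
    """Count tokens, skipping single-character ones as the excerpt does."""
    return Counter(t for t in tokens if len(t) > 1)


# Made-up tokens standing in for the output of jieba.cut(...).
tokens = ["曹操", "的", "刘备", "曹操", "曰", "孔明", "曹操", "刘备"]
freq = count_words(tokens)
print(freq.most_common(2))
```

`most_common(n)` returns the n most frequent tokens already sorted by count, so no separate sorting step is needed.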

posted @ 2018-03-28 21:54 184陈楚浩 Views (87) Comments (0) Recommended (0)