Big Data (bigdata) Practice Exercises

 

1. Under the root of the HDFS file system, recursively create the directory "1daoyun/file", upload the attached BigDataSkills.txt file into the 1daoyun/file directory, and use the appropriate command to view the file listing of the 1daoyun/file directory.

Answer:

[root@master MapReduce]# hadoop fs -mkdir -p /1daoyun/file

[root@master MapReduce]# hadoop fs -put BigDataSkills.txt /1daoyun/file

[root@master MapReduce]# hadoop fs -ls /1daoyun/file

Found 1 items

-rw-r--r--   3 root hdfs       1175 2018-02-12 08:01 /1daoyun/file/BigDataSkills.txt

 

 

2. Under the root of the HDFS file system, recursively create the directory "1daoyun/file" and upload the attached BigDataSkills.txt file into the 1daoyun/file directory, specifying a replication factor of 2 for BigDataSkills.txt during the upload; then use the fsck tool to check the replica count of its storage blocks.

Answer:

[root@master MapReduce]# hadoop fs -mkdir -p /1daoyun/file

[root@master MapReduce]# hadoop fs -D dfs.replication=2 -put BigDataSkills.txt /1daoyun/file

[root@master MapReduce]# hadoop fsck /1daoyun/file/BigDataSkills.txt

DEPRECATED: Use of this script to execute hdfs command is deprecated.

Instead use the hdfs command for it.

 

Connecting to namenode via http://master.hadoop:50070/fsck?ugi=root&path=%2F1daoyun%2Ffile%2FBigDataSkills.txt

FSCK started by root (auth:SIMPLE) from /10.0.6.123 for path /1daoyun/file/BigDataSkills.txt at Mon Feb 12 08:11:47 UTC 2018

.

/1daoyun/file/BigDataSkills.txt:  Under replicated BP-297530755-10.0.6.123-1518056860260:blk_1073746590_5766. Target Replicas is 2 but found 1 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).

Status: HEALTHY

 Total size: 1175 B

 Total dirs: 0

 Total files: 1

 Total symlinks: 0

 Total blocks (validated): 1 (avg. block size 1175 B)

 Minimally replicated blocks: 1 (100.0 %)

 Over-replicated blocks: 0 (0.0 %)

 Under-replicated blocks: 1 (100.0 %)

 Mis-replicated blocks: 0 (0.0 %)

 Default replication factor: 3

 Average block replication: 1.0

 Corrupt blocks: 0

 Missing replicas: 1 (50.0 %)

 Number of data-nodes: 1

 Number of racks: 1

FSCK ended at Mon Feb 12 08:11:47 UTC 2018 in 1 milliseconds

 

 

The filesystem under path '/1daoyun/file/BigDataSkills.txt' is HEALTHY
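Note: if the file had already been uploaded with the default replication factor, the factor can also be changed after the fact; a minimal sketch (the -w flag waits for the replication change to complete):

[root@master MapReduce]# hadoop fs -setrep -w 2 /1daoyun/file/BigDataSkills.txt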

 

 

3. There is a directory /apps under the root of the HDFS file system. Enable the snapshot feature on this directory, create a snapshot of it named apps_1daoyun, and use the appropriate command to view the listing of the snapshot.

Answer:

[hdfs@master ~]# hadoop dfsadmin -allowSnapshot /apps

Allowing snaphot on /apps succeeded

[hdfs@master ~]# hadoop fs -createSnapshot /apps apps_1daoyun

Created snapshot /apps/.snapshot/apps_1daoyun

[hdfs@master ~]# hadoop fs -ls /apps/.snapshot

Found 1 items

drwxrwxrwx   - hdfs hdfs          0 2017-05-07 09:48 /apps/.snapshot/apps_1daoyun
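If the snapshot later needs to be removed and the feature turned off again, a sketch of the reverse operations:

[hdfs@master ~]$ hadoop fs -deleteSnapshot /apps apps_1daoyun

[hdfs@master ~]$ hadoop dfsadmin -disallowSnapshot /apps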

 

 

4. To keep operators from deleting files by mistake, the HDFS file system provides a trash (recycle bin) feature, but too many trashed files take up a large amount of storage space. In the Linux shell, use the "vi" command to edit the relevant configuration file and parameter so that the trash feature is disabled, then restart the corresponding services.

Answer:

[root@master ~]# vi /etc/hadoop/2.6.1.0-129/0/hdfs-site.xml

 <property>

      <name>fs.trash.interval</name>

      <value>0</value>

    </property>

 

[root@master ~]# su - hdfs

Last login: Mon May  8 09:31:52 UTC 2017

[hdfs@master ~]$ /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf stop namenode

[hdfs@master ~]$ /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf start namenode

[hdfs@master ~]$ /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf stop datanode

[hdfs@master ~]$ /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf start datanode
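To confirm that the trash feature is disabled after the restart, one option (a sketch, which should print 0) is to query the effective value of the parameter from the client configuration:

[hdfs@master ~]$ hdfs getconf -confKey fs.trash.interval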

 

5. Use a command to view the number of directories, the number of files, and the total file size under the /tmp directory of the HDFS file system.

Answer:

[root@master ~]# hadoop fs -count  /tmp

          21            6               4336 /tmp
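The columns of hadoop fs -count are, in order, the directory count, the file count, the content size in bytes, and the path name. On newer releases the -h flag can additionally be passed to print the size in a human-readable unit, e.g.:

[root@master ~]# hadoop fs -count -h /tmp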

 

6. The directory /usr/hdp/2.6.1.0-129/hadoop-mapreduce/ on the cluster nodes contains an example JAR package, hadoop-mapreduce-examples.jar. Run the wordcount program from this JAR to count the words in the file /1daoyun/file/BigDataSkills.txt, write the result to the /1daoyun/output directory, and use the appropriate command to query the word-count result.

Answer:

[root@master ~]# hadoop jar /usr/hdp/2.6.1.0-129/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.1.0-129.jar wordcount /1daoyun/file/BigDataSkills.txt /1daoyun/output

 

[root@master ~]# hadoop fs -cat /1daoyun/output/part-r-00000

"duiya  1

hello   1

nisibusisha     1

wosha"  1

zsh     1

 

 

 

 

 

 

7. The directory /usr/hdp/2.6.1.0-129/hadoop-mapreduce/ on the cluster nodes contains an example JAR package, hadoop-mapreduce-examples.jar. Run the sudoku program from this JAR to solve the Sudoku puzzle shown in the grid below (empty cells are shown as "?").

8 ? ? ? ? ? ? ? ?

? ? 3 6 ? ? ? ? ?

? 7 ? ? 9 ? 2 ? ?

? 5 ? ? ? 7 ? ? ?

? ? ? ? 4 5 7 ? ?

? ? ? 1 ? ? ? 3 ?

? ? 1 ? ? ? ? 6 8

? ? 8 5 ? ? ? 1 ?

? 9 ? ? ? ? 4 ? ?

Answer:

[root@master ~]# cat puzzle1.dta

8 ? ? ? ? ? ? ? ?

? ? 3 6 ? ? ? ? ?

? 7 ? ? 9 ? 2 ? ?

? 5 ? ? ? 7 ? ? ?

? ? ? ? 4 5 7 ? ?

? ? ? 1 ? ? ? 3 ?

? ? 1 ? ? ? ? 6 8

? ? 8 5 ? ? ? 1 ?

? 9 ? ? ? ? 4 ? ?

 

[root@master hadoop-mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.1.2.4.3.0-227.jar sudoku /root/puzzle1.dta

WARNING: Use "yarn jar" to launch YARN applications.

Solving /root/puzzle1.dta

8 1 2 7 5 3 6 4 9

9 4 3 6 8 2 1 7 5

6 7 5 4 9 1 2 8 3

1 5 4 2 3 7 8 9 6

3 6 9 8 4 5 7 2 1

2 8 7 1 6 9 5 3 4

5 2 1 9 7 4 3 6 8

4 3 8 5 2 6 9 1 7

7 9 6 3 1 8 4 5 2

 

Found 1 solutions

 

8. The directory /usr/hdp/2.6.1.0-129/hadoop-mapreduce/ on the cluster nodes contains an example JAR package, hadoop-mapreduce-examples.jar. Run the grep program from this JAR to count how many times "Hadoop" appears in the file /1daoyun/file/BigDataSkills.txt, and query the result after the count has finished.

Answer:

[root@master hadoop-mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.1.2.4.3.0-227.jar grep /1daoyun/file/BigDataSkills.txt /output hadoop

 

 

[root@master hadoop-mapreduce]# hadoop fs -cat /output/part-r-00000

2       hadoop
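The third argument to the grep example is a regular expression, so the match above is case-sensitive. To count the capitalized form "Hadoop" exactly as the exercise states, the same job can be run against a fresh output directory (a sketch; /output_Hadoop is just an illustrative name):

[root@master hadoop-mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.1.2.4.3.0-227.jar grep /1daoyun/file/BigDataSkills.txt /output_Hadoop Hadoop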

 

 

9. Start the HBase database of the XianDian big data platform, using the RegionServer on the master node. Start the HBase shell from the Linux shell and check which system user is currently logged into the HBase shell. (Please write all database commands in lowercase.)

Answer:

hbase(main):003:0> whoami

root (auth:SIMPLE)

    groups: root

 

 

 

10. Enable HBase authorization. In the HBase shell, grant the root user read, write, and execute permissions on the table xiandian_user; once set, use the appropriate command to view the permission information.

Answer:

Parameter: Enable Authorization

Parameter value: native

 

hbase(main):002:0> grant 'root','RWX','xiandian_user'

0 row(s) in 0.4800 seconds

 

hbase(main):003:0> user_permission 'xiandian_user'

User                                             Namespace,Table,Family,Qualifier:Permission                                                                                                 

 root                                            default,xiandian_user,,: [Permission: actions=READ,WRITE,EXEC]                                                                              

1 row(s) in 0.1180 seconds
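Should the grant ever need to be withdrawn, the HBase shell provides a matching revoke command (sketch):

hbase(main):004:0> revoke 'root','xiandian_user'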

 

 

 

 

11. Log in to the HBase database and create a table named member with the column families 'address' and 'info'. After creating it, insert the following data into the table:

'xiandianA','info:age','24'

'xiandianA','info:birthday','1990-07-17'

'xiandianA','info:company','alibaba'

'xiandianA','address:contry','china'

'xiandianA','address:province','zhejiang'

'xiandianA','address:city','hangzhou'

 

After the inserts, use a command to query all of the info data for xiandianA in the member table, then change xiandianA's age to 99 and query only the info:age data.

Answer:

hbase(main):001:0> create 'member','address','info'

0 row(s) in 1.5730 seconds

 

=> Hbase::Table - member

hbase(main):002:0> list

TABLE                                                                                      

emp                                                                                        

member                                                                                     

2 row(s) in 0.0240 seconds

hbase(main):007:0> put'member','xiandianA','info:age','24'

0 row(s) in 0.1000 seconds

 

hbase(main):008:0> put'member','xiandianA','info:birthday','1990-07-17'

0 row(s) in 0.0130 seconds

 

hbase(main):010:0> put'member','xiandianA','info:company','alibaba'

0 row(s) in 0.0080 seconds

 

hbase(main):011:0> put'member','xiandianA','address:contry','china'

0 row(s) in 0.0080 seconds

 

hbase(main):012:0> put'member','xiandianA','address:province','zhejiang'

0 row(s) in 0.0070 seconds

 

hbase(main):013:0> put'member','xiandianA','address:city','hangzhou'

0 row(s) in 0.0090 seconds

 

hbase(main):014:0> get 'member','xiandianA','info'

COLUMN                  CELL                                                               

 info:age               timestamp=1522140592336, value=24                                  

 info:birthday          timestamp=1522140643072, value=1990-07-17                          

 info:company           timestamp=1522140745172, value=alibaba                             

3 row(s) in 0.0170 seconds

 

hbase(main):015:0>

hbase(main):016:0* put 'member','xiandianA','info:age','99'

0 row(s) in 0.0080 seconds

hbase(main):018:0> get 'member','xiandianA','info:age'

COLUMN                  CELL                                                               

 info:age               timestamp=1522141564423, value=99                                  

1 row(s) in 0.0140 seconds

 

 

 

12. In a relational database system, a namespace is a logical grouping of tables; tables in the same group serve similar purposes. Log in to the HBase database, create a namespace named newspace and check it with list, then create a table member in this namespace with the column families 'address' and 'info'. After creating it, insert the following data into the table:

'xiandianA','info:age','24'

'xiandianA','info:birthday','1990-07-17'

'xiandianA','info:company','alibaba'

'xiandianA','address:contry','china'

'xiandianA','address:province','zhejiang'

'xiandianA','address:city','hangzhou'

 

After the inserts, use the scan command to query only the info:age data of the table, specifying the startrow as xiandianA.

Answer:

hbase(main):022:0> create_namespace 'newspace'

0 row(s) in 0.1130 seconds

hbase(main):024:0> list

TABLE                                                                                                                                                                                        

emp                                                                                                                                                                                          

member                                                                                                                                                                                       

newspace:member                                                                                                                                                                              

3 row(s) in 0.0100 seconds

 

=> ["emp", "member", "newspace:member"]

hbase(main):023:0> create 'newspace:member','address','info'

0 row(s) in 1.5270 seconds

hbase(main):033:0> put 'newspace:member','xiandianA','info:age','24'

0 row(s) in 0.0620 seconds

hbase(main):037:0> put 'newspace:member','xiandianA','info:birthday','1990-07-17'

0 row(s) in 0.0110 seconds

 

hbase(main):038:0> put 'newspace:member','xiandianA','info:company','alibaba'

0 row(s) in 0.0130 seconds

 

hbase(main):039:0> put 'newspace:member','xiandianA','address:contry','china'

0 row(s) in 0.0070 seconds

 

hbase(main):040:0> put 'newspace:member','xiandianA','address:province','zhejiang'

0 row(s) in 0.0070 seconds

 

hbase(main):041:0> put 'newspace:member','xiandianA','address:city','hangzhou'

0 row(s) in 0.0070 seconds

hbase(main):044:0> scan 'newspace:member', {COLUMNS => ['info:age'],STARTROW => 'xiandianA'}

ROW                                              COLUMN+CELL                                                                                                                                 

 xiandianA                                       column=info:age, timestamp=1522214952401, value=24                                                                                          

1 row(s) in 0.0160 seconds

 

 

13. Log in to the master node and create a local file named hbasetest.txt whose contents create a table 'test' with the column family 'cf' and then batch-insert the following data into that table:

'row1', 'cf:a', 'value1'  

'row2', 'cf:b', 'value2'  

'row3', 'cf:c', 'value3'  

'row4', 'cf:d', 'value4'  

 

After the data has been inserted, query the table contents with the scan command, then query only row1 with the get command, and finally exit the HBase shell. Run hbasetest.txt with a command, and submit both the contents of hbasetest.txt and the output returned by running it.

Answer:

[root@exam1 ~]# cat hbasetest.txt

create 'test', 'cf'  

list 'test'  

put 'test', 'row1', 'cf:a', 'value1'  

put 'test', 'row2', 'cf:b', 'value2'  

put 'test', 'row3', 'cf:c', 'value3'  

put 'test', 'row4', 'cf:d', 'value4'  

scan 'test'  

get 'test', 'row1'  

exit

 

[root@exam1 ~]# hbase shell hbasetest.txt

0 row(s) in 1.5010 seconds

 

TABLE                                                                                      

test                                                                                       

1 row(s) in 0.0120 seconds

 

0 row(s) in 0.1380 seconds

 

0 row(s) in 0.0090 seconds

 

0 row(s) in 0.0050 seconds

 

0 row(s) in 0.0050 seconds

 

ROW                     COLUMN+CELL                                                        

 row1                   column=cf:a, timestamp=1522314428726, value=value1                 

 row2                   column=cf:b, timestamp=1522314428746, value=value2                 

 row3                   column=cf:c, timestamp=1522314428752, value=value3                 

 row4                   column=cf:d, timestamp=1522314428758, value=value4                 

4 row(s) in 0.0350 seconds

 

COLUMN                  CELL                                                               

 cf:a                   timestamp=1522314428726, value=value1                              

1 row(s) in 0.0190 seconds

 

 

 

14. Use the Hive tool to create the table xd_phy_course, define it as an external table with the external storage location /1daoyun/data/hive, and load phy_course_xd.txt into it; the structure of xd_phy_course is shown in the table below. After loading, query the structure of the xd_phy_course table in Hive. (Please write all database commands in lowercase.)

stname(string)

stID(int)

class(string)

opt_cour(string)

Answer:

hive> create external table xd_phy_course (stname string,stID int,class string,opt_cour string) row format delimited fields terminated by '\t' lines terminated by '\n' location '/1daoyun/data/hive';

OK

Time taken: 1.197 seconds

 

hive> load data local inpath '/root/phy_course_xd.txt' into table xd_phy_course;

Loading data to table default.xd_phy_course

Table default.xd_phy_course stats: [numFiles=1, totalSize=89444]

OK

Time taken: 0.96 seconds

 

hive> desc xd_phy_course;

OK

stname                  string                                      

stid                    int                                         

class                   string                                      

opt_cour                string                                      

Time taken: 0.588 seconds, Fetched: 4 row(s)
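To additionally confirm that the table is EXTERNAL and that its location is /1daoyun/data/hive, a formatted describe can be used (sketch, output omitted):

hive> desc formatted xd_phy_course;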

 

 

15. Use the Hive tool to count, from the file phy_course_xd.txt, the total number of students at a university who signed up for each elective physical-education subject. The structure of phy_course_xd.txt is shown in the table below, and the elective-subject field is opt_cour. Load the statistics into the table phy_opt_count and query the contents of phy_opt_count with a SELECT statement. (Please write all database commands in lowercase.)

stname(string)

stID(int)

class(string)

opt_cour(string)

Answer:

hive> create table xd_phy_course (stname string,stID int,class string,opt_cour string) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 4.067 seconds

 

hive> load data local inpath '/root/phy_course_xd.txt' into table xd_phy_course;

Loading data to table default.xd_phy_course

Table default.xd_phy_course stats: [numFiles=1, totalSize=89444]

OK

Time taken: 1.422 seconds

 

hive> create table phy_opt_count (opt_cour string,cour_count int) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 1.625 seconds

 

hive> insert overwrite table phy_opt_count select xd_phy_course.opt_cour,count(distinct xd_phy_course.stID) from xd_phy_course group by xd_phy_course.opt_cour;

Query ID = root_20170507125642_6af22d21-ae88-4daf-a346-4b1cbcd7d9fe

Total jobs = 1

Launching Job 1 out of 1

Tez session was closed. Reopening...

Session re-established.

 

 

Status: Running (Executing on YARN cluster with App id application_1494149668396_0004)

 

--------------------------------------------------------------------------------

        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

--------------------------------------------------------------------------------

Map 1 ..........   SUCCEEDED      1          1        0        0       0       0

Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0

--------------------------------------------------------------------------------

VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 4.51 s     

--------------------------------------------------------------------------------

Loading data to table default.phy_opt_count

Table default.phy_opt_count stats: [numFiles=1, numRows=10, totalSize=138, rawDataSize=128]

OK

Time taken: 13.634 seconds

 

hive> select * from phy_opt_count;

OK

badminton       234

basketball      224

football        206

gymnastics      220

opt_cour        0

swimming        234

table tennis    277

taekwondo       222

tennis  223

volleyball      209

Time taken: 0.065 seconds, Fetched: 10 row(s)

 

 

16. Use the Hive tool to compute, from the file phy_course_score_xd.txt, the average physical-education score of each class at a university, using the round function to keep two decimal places. The structure of phy_course_score_xd.txt is shown in the table below; the class field is class and the score field is score. (Please write all database commands in lowercase.)

stname(string)

stID(int)

class(string)

opt_cour(string)

score(float)

Answer:

hive> create table phy_course_score_xd (stname string,stID int,class string,opt_cour string,score float) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 0.339 seconds

 

hive> load data local inpath '/root/phy_course_score_xd.txt' into table phy_course_score_xd;

Loading data to table default.phy_course_score_xd

Table default.phy_course_score_xd stats: [numFiles=1, totalSize=1910]

OK

Time taken: 1.061 seconds

hive> select class,round(avg(score)) from phy_course_score_xd group by class;

Query ID = root_20170507131823_0bfb1faf-3bfb-42a5-b7eb-3a6a284081ae

Total jobs = 1

Launching Job 1 out of 1

 

 

Status: Running (Executing on YARN cluster with App id application_1494149668396_0005)

 

--------------------------------------------------------------------------------

        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

--------------------------------------------------------------------------------

Map 1 ..........   SUCCEEDED      1          1        0        0       0       0

Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0

--------------------------------------------------------------------------------

VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 26.68 s    

--------------------------------------------------------------------------------

OK

Network_1401    73.0

Software_1403   72.0

class   NULL

Time taken: 27.553 seconds, Fetched: 3 row(s)
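Hive's round() also accepts a precision argument; to keep two decimal places as the exercise asks, the query can be written as follows (sketch, output omitted):

hive> select class,round(avg(score),2) from phy_course_score_xd group by class;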

 

17. Use the Hive tool to compute, from the file phy_course_score_xd.txt, the highest physical-education score of each class at a university. The structure of phy_course_score_xd.txt is shown in the table below; the class field is class and the score field is score. (Please write all database commands in lowercase.)

stname(string)

stID(int)

class(string)

opt_cour(string)

score(float)

Answer:

hive> create table phy_course_score_xd (stname string,stID int,class string,opt_cour string,score float) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 0.339 seconds

 

hive> load data local inpath '/root/phy_course_score_xd.txt' into table phy_course_score_xd;

Loading data to table default.phy_course_score_xd

Table default.phy_course_score_xd stats: [numFiles=1, totalSize=1910]

OK

Time taken: 1.061 seconds

 

hive> select class,max(score) from phy_course_score_xd group by class;

Query ID = root_20170507131942_86a2bf55-49ac-4c2e-b18b-8f63191ce349

Total jobs = 1

Launching Job 1 out of 1

 

 

Status: Running (Executing on YARN cluster with App id application_1494149668396_0005)

 

--------------------------------------------------------------------------------

        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

--------------------------------------------------------------------------------

Map 1 ..........   SUCCEEDED      1          1        0        0       0       0

Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0

--------------------------------------------------------------------------------

VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 5.08 s     

--------------------------------------------------------------------------------

OK

Network_1401    95.0

Software_1403   100.0

class   NULL

Time taken: 144.035 seconds, Fetched: 3 row(s)

 

 

18. In the Hive data warehouse, merge the separate request_date and request_time fields of the web log weblog_entries.txt into a single value joined by an underscore "_". The structure of weblog_entries.txt is shown in the table below. (Please write all database commands in lowercase.)

 

md5(STRING)

url(STRING)

request_date (STRING)

request_time (STRING)

ip(STRING)

Answer:

hive> create table weblog_entries (md5 string,url string,request_date string,request_time string,ip string) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 0.502 seconds

 

hive> load data local inpath '/root/weblog_entries.txt' into table weblog_entries;

Loading data to table default.weblog_entries

Table default.weblog_entries stats: [numFiles=1, totalSize=251130]

OK

Time taken: 1.203 seconds

 

hive> select concat_ws('_', request_date, request_time) from weblog_entries;

2012-05-10_21:29:01

2012-05-10_21:13:47

2012-05-10_21:12:37

2012-05-10_21:34:20

2012-05-10_21:27:00

2012-05-10_21:33:53

2012-05-10_21:10:19

2012-05-10_21:12:05

2012-05-10_21:25:58

2012-05-10_21:34:28

Time taken: 0.265 seconds, Fetched: 3000 row(s)  
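An equivalent form uses concat with an explicit separator string; limiting the output keeps the check short (sketch):

hive> select concat(request_date, '_', request_time) from weblog_entries limit 5;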

 

 

19. Use Hive to create a table dynamically from a query over the web log weblog_entries.txt. Create a new table named weblog_entries_url_length that defines three fields of the new web-log database, namely url, request_date, and request_time, and in addition define a new field named "url_length" that holds the length of the url string. The structure of weblog_entries.txt is shown in the table below. When finished, query the contents of the weblog_entries_url_length table. (Please write all database commands in lowercase.)

md5(STRING)

url(STRING)

request_date (STRING)

request_time (STRING)

ip(STRING)

Answer:

hive> create table weblog_entries (md5 string,url string,request_date string,request_time string,ip string) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 0.502 seconds

 

hive> load data local inpath '/root/weblog_entries.txt' into table weblog_entries;

Loading data to table default.weblog_entries

Table default.weblog_entries stats: [numFiles=1, totalSize=251130]

OK

Time taken: 1.203 seconds

hive> create table weblog_entries_url_length as select url, request_date, request_time, length(url) as url_length from weblog_entries;

Query ID = root_20170507065123_e3105d8b-84b6-417f-ab58-21ea15723e0a

Total jobs = 1

Launching Job 1 out of 1

 

 

Status: Running (Executing on YARN cluster with App id application_1494136863427_0002)

 

--------------------------------------------------------------------------------

        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

--------------------------------------------------------------------------------

Map 1 ..........   SUCCEEDED      1          1        0        0       0       0

--------------------------------------------------------------------------------

VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 4.10 s     

--------------------------------------------------------------------------------

Moving data to: hdfs://master:8020/apps/hive/warehouse/weblog_entries_url_length

Table default.weblog_entries_url_length stats: [numFiles=1, numRows=3000, totalSize=121379, rawDataSize=118379]

OK

Time taken: 5.874 seconds

 

hive> select * from weblog_entries_url_length;

/qnrxlxqacgiudbtfggcg.html      2012-05-10      21:29:01        26

/sbbiuot.html   2012-05-10      21:13:47        13

/ofxi.html      2012-05-10      21:12:37        10

/hjmdhaoogwqhp.html     2012-05-10      21:34:20        19

/angjbmea.html  2012-05-10      21:27:00        14

/mmdttqsnjfifkihcvqu.html       2012-05-10      21:33:53        25

/eorxuryjadhkiwsf.html  2012-05-10      21:10:19        22

/e.html 2012-05-10      21:12:05        7

/khvc.html      2012-05-10      21:25:58        10

/c.html 2012-05-10      21:34:28        7

Time taken: 0.08 seconds, Fetched: 3000 row(s)

 

20. Install the Sqoop client on the master and slave nodes; when finished, check the Sqoop version information on the master node.

Answer:

[root@master ~]# sqoop version

Warning: /usr/hdp/2.4.3.0-227/accumulo does not exist! Accumulo imports will fail.

Please set $ACCUMULO_HOME to the root of your Accumulo installation.

17/05/07 06:56:25 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.3.0-227

Sqoop 1.4.6.2.4.3.0-227

git commit id d296ad374bd38a1c594ef0f5a2d565d71e798aa6

Compiled by jenkins on Sat Sep 10 00:58:52 UTC 2016

 

 

21. Use the Sqoop tool to list all tables in the ambari database of MySQL on the master node.

Answer:

[root@master ~]# sqoop list-tables --connect jdbc:mysql://localhost/ambari --username root --password bigdata  

Warning: /usr/hdp/2.4.3.0-227/accumulo does not exist! Accumulo imports will fail.

Please set $ACCUMULO_HOME to the root of your Accumulo installation.

17/05/07 07:07:01 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.3.0-227

17/05/07 07:07:01 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

17/05/07 07:07:02 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

ClusterHostMapping

QRTZ_BLOB_TRIGGERS

QRTZ_CALENDARS

QRTZ_CRON_TRIGGERS

QRTZ_FIRED_TRIGGERS

QRTZ_JOB_DETAILS

QRTZ_LOCKS

QRTZ_PAUSED_TRIGGER_GRPS

QRTZ_SCHEDULER_STATE

QRTZ_SIMPLE_TRIGGERS

QRTZ_SIMPROP_TRIGGERS

QRTZ_TRIGGERS

adminpermission

adminprincipal

adminprincipaltype

adminprivilege

adminresource

adminresourcetype

alert_current

alert_definition

alert_group

alert_group_target

alert_grouping

alert_history

alert_notice

alert_target

alert_target_states

ambari_sequences

artifact

blueprint

blueprint_configuration

clusterEvent

cluster_version

clusterconfig

clusterconfigmapping

clusters

clusterservices

clusterstate

confgroupclusterconfigmapping

configgroup

configgrouphostmapping

execution_command

groups

hdfsEvent

host_role_command

host_version

hostcomponentdesiredstate

hostcomponentstate

hostconfigmapping

hostgroup

hostgroup_component

hostgroup_configuration

hosts

hoststate

job

kerberos_descriptor

kerberos_principal

kerberos_principal_host

key_value_store

mapreduceEvent

members

metainfo

repo_version

request

requestoperationlevel

requestresourcefilter

requestschedule

requestschedulebatchrequest

role_success_criteria

servicecomponentdesiredstate

serviceconfig

serviceconfighosts

serviceconfigmapping

servicedesiredstate

stack

stage

task

taskAttempt

topology_host_info

topology_host_request

topology_host_task

topology_hostgroup

topology_logical_request

topology_logical_task

topology_request

upgrade

upgrade_group

upgrade_item

users

viewentity

viewinstance

viewinstancedata

viewinstanceproperty

viewmain

viewparameter

viewresource

widget

widget_layout

widget_layout_user_widget

workflow
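Sqoop itself warns that putting the password on the command line is insecure; the -P flag prompts for it interactively instead, for example when first listing the available databases (sketch):

[root@master ~]# sqoop list-databases --connect jdbc:mysql://localhost --username root -P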

 

22. Create a database named xiandian in MySQL and, inside it, create the table xd_phy_course with the structure shown in Table 1. Use the Hive tool to create a table xd_phy_course with the structure shown in Table 2 and load phy_course_xd.txt into it. Then use the Sqoop tool to export the xd_phy_course table from the Hive data warehouse to the xd_phy_course table of the xiandian database in MySQL on the master node.

Table 1

stname VARCHAR(20)

stID INT(1)

class VARCHAR(20)

opt_cour VARCHAR(20)

Table 2

stname(string)

stID(int)

class(string)

opt_cour(string)

Answer:

[root@master ~]# mysql -uroot -pbigdata

Welcome to the MariaDB monitor.  Commands end with ; or \g.

Your MariaDB connection id is 37

Server version: 5.5.44-MariaDB MariaDB Server

 

Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others.

 

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

 

MariaDB [(none)]> create database xiandian;

Query OK, 1 row affected (0.00 sec)

 

MariaDB [(none)]> use xiandian;

Database changed

MariaDB [xiandian]> create table xd_phy_course(stname varchar(20),stID int(1),class varchar(20),opt_cour varchar(20));

Query OK, 0 rows affected (0.20 sec)

 

hive> create table xd_phy_course (stname string,stID int,class string,opt_cour string) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 3.136 seconds

 

hive> load data local inpath '/root/phy_course_xd.txt' into table xd_phy_course;

Loading data to table default.xd_phy_course

Table default.xd_phy_course stats: [numFiles=1, totalSize=89444]

OK

Time taken: 1.129 seconds

 

 [root@master ~]# sqoop export --connect jdbc:mysql://localhost:3306/xiandian --username root --password bigdata --table xd_phy_course  --hcatalog-table xd_phy_course  

Warning: /usr/hdp/2.4.3.0-227/accumulo does not exist! Accumulo imports will fail.

Please set $ACCUMULO_HOME to the root of your Accumulo installation.

17/05/07 07:29:48 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.3.0-227

17/05/07 07:29:48 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

17/05/07 07:29:48 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

17/05/07 07:29:48 INFO tool.CodeGenTool: Beginning code generation

17/05/07 07:29:48 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `xd_phy_course` AS t LIMIT 1

17/05/07 07:29:48 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `xd_phy_course` AS t LIMIT 1

17/05/07 07:29:48 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.3.0-227/hadoop-mapreduce

Note: /tmp/sqoop-root/compile/35d4b31b4d93274ba6bde54b3e56a821/xd_phy_course.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

17/05/07 07:29:50 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/35d4b31b4d93274ba6bde54b3e56a821/xd_phy_course.jar

17/05/07 07:29:50 INFO mapreduce.ExportJobBase: Beginning export of xd_phy_course

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.3.0-227/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
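To verify that the export reached MySQL, a quick row count from the MariaDB prompt is enough (sketch, output omitted):

MariaDB [xiandian]> select count(*) from xd_phy_course;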

 

 

 

 

23. Use the Pig tool in local mode to compute the number of hits per IP address in the system log access-log.txt. Group by IP with a GROUP BY statement, iterate over the columns of the relation with the FOREACH operator to count the total number of rows in each group, and finally view the statistics with the DUMP statement.

Answer:

grunt> copyFromLocal /root/Pig/access-log.txt /user/root/input/log1.txt

grunt> A =LOAD '/user/root/input/log1.txt' USING PigStorage (' ') AS (ip,others);

grunt> group_ip =group A by ip;

grunt> result =foreach group_ip generate group,COUNT(A);

grunt> dump result;

2018-02-13 08:13:36,520 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY

HadoopVersion PigVersion UserId StartedAt FinishedAt Features

2.7.3.2.6.1.0-129 0.16.0.2.6.1.0-129 root 2018-02-13 08:13:37 2018-02-13 08:13:41 GROUP_BY

 

Success!

 

Job Stats (time in seconds):

JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs

job_local963723433_0001 1 1 n/a n/a n/a n/a n/a n/a n/a n/a A,group_ip,result GROUP_BY,COMBINER file:/tmp/temp-1479363025/tmp133834330,

 

Input(s):

Successfully read 62991 records from: "/user/root/input/log1.txt"

 

Output(s):

Successfully stored 182 records in: "file:/tmp/temp-1479363025/tmp133834330"

 

(220.181.108.186,1)

(222.171.234.225,142)

(http://www.1daoyun.com/course/toregeister",1)

 

 

24. Use the Pig tool to compute the annual maximum temperature in the weather dataset temperature.txt. Group by year with a GROUP BY statement, iterate over the columns of the relation with the FOREACH operator to take the maximum of each group, and finally view the result with the DUMP statement.

Answer:

grunt> copyFromLocal /root/Pig/temperature.txt /user/root/temp.txt

grunt> A = LOAD '/user/root/temp.txt' USING PigStorage(' ')AS (year:int,temperature:int);

grunt> B = GROUP A BY year;

grunt> C = FOREACH B GENERATE group,MAX(A.temperature);

grunt> dump C;

2018-02-13 08:18:52,107 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY

(2012,40)

(2013,36)

(2014,37)

(2015,39)

 

 

25. Use the Pig tool to count the number of IP addresses per country in the dataset ip_to_country. Group by country with a GROUP BY statement, iterate over the columns of the relation with the FOREACH operator to count the IP addresses in each group, then save the statistics to the /data/pig/output directory and view the resulting data.

Answer:

grunt> copyFromLocal /root/Pig/ip_to_country.txt /user/root/ip_to_country.txt

grunt> ip_countries = LOAD '/user/root/ip_to_country.txt' AS (ip: chararray, country:chararray);

grunt> country_grpd = GROUP ip_countries BY country;

grunt> country_counts = FOREACH country_grpd GENERATE FLATTEN(group),COUNT(ip_countries) as counts;

grunt> STORE country_counts INTO '/data/pig/output';

2018-02-13 08:23:35,621 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY

Moldova, Republic of 1

Syrian Arab Republic 1

United Arab Emirates 2

Bosnia and Herzegovina 1

Iran, Islamic Republic of 2

Tanzania, United Republic of 1
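The stored result can also be read back from within the Grunt shell, for example with the cat command (sketch):

grunt> cat /data/pig/output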

 

 

26. Install the Mahout client on the master node, then open a Linux shell and run the mahout command to view the example programs that ship with Mahout.

Answer:

[root@master ~]# mahout

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.

Running on hadoop, using /usr/hdp/2.6.1.0-129/hadoop/bin/hadoop and HADOOP_CONF_DIR=/usr/hdp/2.6.1.0-129/hadoop/conf

MAHOUT-JOB: /usr/hdp/2.6.1.0-129/mahout/mahout-examples-0.9.0.2.6.1.0-129-job.jar

An example program must be given as the first argument.

Valid program names are:

  arff.vector: : Generate Vectors from an ARFF file or directory

  baumwelch: : Baum-Welch algorithm for unsupervised HMM training

  buildforest: : Build the random forest classifier

  canopy: : Canopy clustering

  cat: : Print a file or resource as the logistic regression models would see it

  cleansvd: : Cleanup and verification of SVD output

  clusterdump: : Dump cluster output to text

  clusterpp: : Groups Clustering Output In Clusters

  cmdump: : Dump confusion matrix in HTML or text formats

  concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix

  cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)

  cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.

  describe: : Describe the fields and target variable in a data set

  evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes

  fkmeans: : Fuzzy K-means clustering

  hmmpredict: : Generate random sequence of observations by given HMM

  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering

  kmeans: : K-means clustering

  lucene.vector: : Generate Vectors from a Lucene index

  lucene2seq: : Generate Text SequenceFiles from a Lucene index

  matrixdump: : Dump matrix in CSV format

  matrixmult: : Take the product of two matrices

  parallelALS: : ALS-WR factorization of a rating matrix

  qualcluster: : Runs clustering experiments and summarizes results in a CSV

  recommendfactorized: : Compute recommendations using the factorization of a rating matrix

  recommenditembased: : Compute recommendations using item-based collaborative filtering

  regexconverter: : Convert text files on a per line basis based on regular expressions

  resplit: : Splits a set of SequenceFiles into a number of equal splits

  rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}

  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix

  runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model

  runlogistic: : Run a logistic regression model against CSV data

  seq2encoded: : Encoded Sparse Vector generation from Text sequence files

  seq2sparse: : Sparse Vector generation from Text sequence files

  seqdirectory: : Generate sequence files (of Text) from a directory

  seqdumper: : Generic Sequence File dumper

  seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives

  seqwiki: : Wikipedia xml dump to sequence file

  spectralkmeans: : Spectral k-means clustering

  split: : Split Input data into test and train sets

  splitDataset: : split a rating dataset into training and probe parts

  ssvd: : Stochastic SVD

  streamingkmeans: : Streaming k-means clustering

  svd: : Lanczos Singular Value Decomposition

  testforest: : Test the random forest classifier

  testnb: : Test the Vector-based Bayes classifier

  trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model

  trainlogistic: : Train a logistic regression using stochastic gradient descent

  trainnb: : Train the Vector-based Bayes classifier

  transpose: : Take the transpose of a matrix

  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set

  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors

  vectordump: : Dump vectors from a sequence file to text

  viterbi: : Viterbi decoding of hidden states from given output states sequence

 

 

 

27. Use the Mahout mining tool to produce item recommendations from the dataset user-item-score.txt (user, item, score). Use the item-based collaborative filtering algorithm with the Euclidean distance similarity measure, recommend 3 items per user, use non-Boolean data, set the maximum preference value to 4 and the minimum preference value to 1, save the recommendation output to the output directory, and use the -cat command to query the contents of part-r-00000.

Answer:

[hdfs@master ~]$ hadoop fs -mkdir -p /data/mahout/project

[hdfs@master ~]$ hadoop fs -put user-item-score.txt /data/mahout/project

[hdfs@master ~]$ mahout recommenditembased -i /data/mahout/project/user-item-score.txt -o /data/mahout/project/output -n 3 -b false -s SIMILARITY_EUCLIDEAN_DISTANCE --maxPrefsPerUser 4 --minPrefsPerUser 1 --maxPrefsInItemSimilarity 4 --tempDir /data/mahout/project/temp

                                                

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.

Running on hadoop, using /usr/hdp/2.4.3.0-227/hadoop/bin/hadoop and

17/05/15 19:37:25 INFO driver.MahoutDriver: Program took 259068 ms (Minutes: 4.3178)

 

[hdfs@master ~]$ hadoop fs -cat /data/mahout/project/output/part-r-00000

1       [105:3.5941463,104:3.4639049]

2       [106:3.5,105:2.714964,107:2.0]

3       [103:3.59246,102:3.458911]

4       [107:4.7381864,105:4.2794304,102:4.170158]

5       [103:3.8962872,102:3.8564017,107:3.7692602]

 

28. Install and start the Flume component on the master node, then open a Linux shell and run the flume-ng help command to view the usage information of flume-ng.

Answer:

[root@master ~]# flume-ng help

Usage: /usr/hdp/2.6.1.0-129/flume/bin/flume-ng.distro <command> [options]...

 

commands:

  help                  display this help text

  agent                 run a Flume agent

  avro-client           run an avro Flume client

  password              create a password file for use in flume config

  version               show Flume version info

 

global options:

  --conf,-c <conf>      use configs in <conf> directory

  --classpath,-C <cp>   append to the classpath

  --dryrun,-d           do not actually start Flume, just print the command

  --plugins-path <dirs> colon-separated list of plugins.d directories. See the

                        plugins.d section in the user guide for more details.

                        Default: $FLUME_HOME/plugins.d

  -Dproperty=value      sets a Java system property value

  -Xproperty=value      sets a Java -X option

 

agent options:

  --conf-file,-f <file> specify a config file (required)

  --name,-n <name>      the name of this agent (required)

  --help,-h             display help text

 

avro-client options:

  --rpcProps,-P <file>   RPC client properties file with server connection params

  --host,-H <host>       hostname to which events will be sent

  --port,-p <port>       port of the avro source

  --dirname <dir>        directory to stream to avro source

  --filename,-F <file>   text file to stream to avro source (default: std input)

  --headerFile,-R <file> File containing event headers as key/value pairs on each new line

  --help,-h              display help text

 

  Either --rpcProps or both --host and --port must be specified.

 

password options:

  --outfile              The file in which encoded password is stored

 

Note that if <conf> directory is specified, then it is always included first

in the classpath.

 

 

29. Based on the provided template file hdfs-example.conf, use the Flume NG tool to make the system path /opt/xiandian/ on the master node the directory whose files are uploaded to the HDFS file system in real time, set the HDFS storage path to /data/flume/, keep the uploaded file names unchanged, set the file type to DataStream, and then start the flume-ng agent.

Answer:

[root@master ~]# flume-ng agent --conf-file hdfs-example.conf --name master -Dflume.root.logger=INFO,console

Warning: No configuration directory set! Use --conf <dir> to override.

Info: Including Hadoop libraries found via (/bin/hadoop) for HDFS access

Info: Excluding /usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-api-1.7.10.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-log4j12-1.7.10.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/tez/lib/slf4j-api-1.7.5.jar from classpath

Info: Including HBASE libraries found via (/bin/hbase) for HBASE access

Info: Excluding /usr/hdp/2.4.3.0-227/hbase/lib/slf4j-api-1.7.7.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-api-1.7.10.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-log4j12-1.7.10.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/tez/lib/slf4j-api-1.7.5.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-api-1.7.10.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-log4j12-1.7.10.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/zookeeper/lib/slf4j-api-1.6.1.jar from classpath

Info: Excluding /usr/hdp/2.4.3.0-227/zookeeper/lib/slf4j-log4j12-1.6.1.jar from classpath

Info: Including Hive libraries found via () for Hive access

 

[root@master ~]# cat hdfs-example.conf

# example.conf: A single-node Flume configuration

# Name the components on this agent

master.sources = webmagic

master.sinks = k1

master.channels = c1

# Describe/configure the source

master.sources.webmagic.type = spooldir

master.sources.webmagic.fileHeader = true

master.sources.webmagic.fileHeaderKey = fileName

master.sources.webmagic.fileSuffix = .COMPLETED

master.sources.webmagic.deletePolicy = never

master.sources.webmagic.spoolDir = /opt/xiandian/

master.sources.webmagic.ignorePattern = ^$

master.sources.webmagic.consumeOrder = oldest

master.sources.webmagic.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder

master.sources.webmagic.batchsize = 5

master.sources.webmagic.channels = c1

# Use a channel which buffers events in memory

master.channels.c1.type = memory

# Describe the sink

master.sinks.k1.type = hdfs

master.sinks.k1.channel = c1

master.sinks.k1.hdfs.path = hdfs://master:8020/data/flume/%{dicName}

master.sinks.k1.hdfs.filePrefix = %{fileName}

master.sinks.k1.hdfs.fileType = DataStream
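To check that the agent is picking files up, one can drop a test file into the spooling directory and then list the HDFS target path; a sketch (test.txt is just a hypothetical file name):

[root@master ~]# cp /root/test.txt /opt/xiandian/

[root@master ~]# hadoop fs -ls /data/flume/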

30. Deploy the Spark service component on the XianDian big data platform, open a Linux shell and start the spark-shell terminal, and submit the process information of the started program.

Answer:

[root@master ~]# spark-shell

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

Spark context Web UI available at http://172.24.2.110:4040

Spark context available as 'sc' (master = local[*], app id = local-1519375873795).

Spark session available as 'spark'.

Welcome to

      ____              __

     / __/__  ___ _____/ /__

    _\ \/ _ \/ _ `/ __/  '_/

   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1.2.6.1.0-129

      /_/

         

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)

Type in expressions to have them evaluated.

Type :help for more information.

 

scala>

 

 

31. Log in to spark-shell, define i as 1 and sum as 0, use a while loop to add the numbers from 1 to 100, and finally print sum with Scala's standard output function.

Answer:

scala> var i=1

i: Int = 1

scala> var sum=0

sum: Int = 0

scala> while(i<=100){

     | sum+=i

     | i=i+1

     | }

 

scala> println(sum)

5050
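For comparison, the same sum can be written without an explicit loop by using a Scala range (a minimal sketch):

scala> println((1 to 100).sum)

5050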

 

32. Log in to spark-shell, define a list as (1,2,3,4,5,6,7,8,9), then use the map function to multiply each element of the list by 2.

Answer:

scala> import scala.math._

import scala.math._

 

scala> val nums=List(1,2,3,4,5,6,7,8,9)

nums: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9)

 

scala> nums.map(x=>x*2)

res18: List[Int] = List(2, 4, 6, 8, 10, 12, 14, 16, 18)

 

 

 

 

33. Log in to spark-shell, define a list as ("Hadoop","Java","Spark"), then use the flatMap function to split the list into individual letters and convert them to uppercase.

Answer:

scala> val data = List("Hadoop","Java","Spark")

data: List[String] = List(Hadoop, Java, Spark)

 

scala> println(data.flatMap(_.toUpperCase))

List(H, A, D, O, O, P, J, A, V, A, S, P, A, R, K)

 

 

 

 

34. Log in to the master node of the big data cloud host and create a new file abc.txt under the root directory with the following content:

hadoop  hive  

solr    redis  

kafka   hadoop  

storm   flume  

sqoop   docker  

spark   spark  

hadoop  spark  

elasticsearch   hbase  

hadoop  hive  

spark   hive  

hadoop  spark

 

Then log in to spark-shell. First use a command to count the number of lines in abc.txt, then count the words in the abc.txt document and sort them in ascending order by the first letter of each word, and finally count the number of rows in the result.

Answer:

scala> val words=sc.textFile("file:///root/abc.txt").count

words: Long = 11

scala> val words=sc.textFile("file:///root/abc.txt").flatMap(_.split("\\W+")).map(x=>(x,1)).reduceByKey(_+_).sortByKey().collect

words: Array[(String, Int)] = Array((docker,1), (elasticsearch,1), (flume,1), (hadoop,5), (hbase,1), (hive,3), (kafka,1), (redis,1), (solr,1), (spark,5), (sqoop,1), (storm,1))

scala> val words=sc.textFile("file:///root/abc.txt").flatMap(_.split("\\W+")).map(x=>(x,1)).reduceByKey(_+_).count

words: Long = 12

 

35. Log in to spark-shell, define a List(1,2,3,3,4,4,5,5,6,6,6,8,9), and use the built-in function to remove duplicates from this list.

Answer:

scala> val l = List(1,2,3,3,4,4,5,5,6,6,6,8,9)

l: List[Int] = List(1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 8, 9)

 

scala> l.distinct

res1: List[Int] = List(1, 2, 3, 4, 5, 6, 8, 9)

 
