Connecting IDEA to a Hadoop cluster on local virtual machines to run WordCount

To set up a Hadoop cluster on virtual machines, see:

 https://www.cnblogs.com/HusterX/p/14125543.html

Environment:

1. Hadoop 2.7.0

2. Java 1.8.0

3. Windows 10

4. VMware Workstation Pro 16

5. CentOS 7

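Installing Hadoop on Windows:

1. Extract the hadoop.tar.gz archive.

2. Add "Hadoop install path"\bin to the PATH environment variable.

3. Put hadoop.dll in C:\Windows\System32 and put winutils.exe in "Hadoop install path"\bin.

    PS: Download the hadoop.dll and winutils.exe that match your Hadoop version (if no exact match is available, prefer files built for a newer version over an older one). The files can be found on GitHub.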
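A quick way to confirm step 3 is to check that both native helper files are where Hadoop expects them. The following is a minimal sketch using only the JDK; the class name `HadoopWindowsCheck` and the default install path `C:\hadoop-2.7.0` are hypothetical placeholders, not something from the original setup.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Reports which of the native helper files from step 3 are missing. */
final class HadoopWindowsCheck {

    /**
     * @param hadoopHome the Hadoop install directory (winutils.exe is expected in its bin folder)
     * @param system32   the directory expected to hold hadoop.dll
     * @return the expected files that do not exist
     */
    static List<Path> missingFiles(Path hadoopHome, Path system32) {
        List<Path> expected = Arrays.asList(
                hadoopHome.resolve("bin").resolve("winutils.exe"),
                system32.resolve("hadoop.dll"));
        List<Path> missing = new ArrayList<>();
        for (Path p : expected) {
            if (!Files.isRegularFile(p)) {
                missing.add(p);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Both defaults are placeholders; pass your own paths as arguments.
        Path home = Paths.get(args.length > 0 ? args[0] : "C:\\hadoop-2.7.0");
        Path sys32 = Paths.get(args.length > 1 ? args[1] : "C:\\Windows\\System32");
        System.out.println("Missing files: " + missingFiles(home, sys32));
    }
}
```

An empty list means both files are in place; it does not verify that they match the Hadoop version in use.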

 

Edit the hosts file on Windows

Path: C:\Windows\System32\drivers\etc\hosts

# Copyright (c) 1993-2009 Microsoft Corp.
#
# This is a sample HOSTS file used by Microsoft TCP/IP for Windows.
#
# This file contains the mappings of IP addresses to host names. Each
# entry should be kept on an individual line. The IP address should
# be placed in the first column followed by the corresponding host name.
# The IP address and the host name should be separated by at least one
# space.
#
# Additionally, comments (such as these) may be inserted on individual
# lines or following the machine name denoted by a '#' symbol.
#
# For example:
#
#      102.54.94.97     rhino.acme.com          # source server
#       38.25.63.10     x.acme.com              # x client host

# localhost name resolution is handled within DNS itself.
#    127.0.0.1       localhost
#    ::1             localhost
127.0.0.1       activate.navicat.com
# The three entries below are the IP addresses and hostnames of the virtual machines
192.168.47.131  master
192.168.47.132  slave1
192.168.47.130  slave2
hosts
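To make the format of the hosts entries concrete, here is a small sketch that parses hosts-file text into a hostname-to-IP map, skipping comments and blank lines. The class `HostsFile` is a hypothetical helper written for illustration; it is not how Windows itself resolves names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative parser for hosts-file text. */
final class HostsFile {

    /** Parses hosts-file text into a hostname -> IP map, skipping comments and blank lines. */
    static Map<String, String> parse(String text) {
        Map<String, String> byName = new LinkedHashMap<>();
        for (String raw : text.split("\\R")) {
            // Everything after '#' is a comment.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split("\\s+");
            // fields[0] is the IP address; every following field is a name for it.
            for (int i = 1; i < fields.length; i++) {
                byName.put(fields[i], fields[0]);
            }
        }
        return byName;
    }
}
```

With the three cluster entries above, `parse(...).get("master")` would return `192.168.47.131`, which is the address the later configuration files use.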

 

Create a new Maven project in IDEA

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>hadoop</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <hadoop.version>2.7.0</hadoop.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>commons-cli</groupId>
            <artifactId>commons-cli</artifactId>
            <version>1.3.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

    </dependencies>
</project>
pom.xml

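Copy the cluster's core-site.xml and hdfs-site.xml into the project's resources directory (src/main/resources).

PS: The two files below contain modifications. Keep the copies on Windows and in the virtual machines consistent to avoid errors.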

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/hadoop/hdfs/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <!-- Using the master's IP address as the value is recommended; this is also the prefix of the file system paths accessed when the program runs -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.47.131:9000</value>
    </property>

    <property>
        <name>hadoop.proxyuser.hadoop.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hadoop.groups</name>
        <value>*</value>
    </property>

</configuration>
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- Use the master's IP address here as well -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>192.168.47.131:50090</value>
    </property>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>192.168.47.131:50070</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/hadoop/hdfs/name</value>
    </property>
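    <!-- Disable permission checking. The author ran into permission checks when running from Windows; disabling them is convenient, though there are other approaches you can look up. -->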
    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>

</configuration>
hdfs-site.xml
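Because the Windows copies and the cluster copies of these files need to agree, it can help to compare them mechanically. The sketch below reads the name/value pairs from two Hadoop-style configuration XML strings with the JDK's DOM parser and reports properties that are present in both but have different values. `SiteXmlDiff` is a hypothetical helper, and it only approximates how Hadoop reads these files (for example, it ignores `final` flags and variable expansion).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Compares two Hadoop-style configuration files given as XML text. */
final class SiteXmlDiff {

    /** Reads the name/value pair of every property element. */
    static Map<String, String> properties(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> props = new TreeMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element property = (Element) nodes.item(i);
            String name = childText(property, "name");
            if (name != null) {
                props.put(name, childText(property, "value"));
            }
        }
        return props;
    }

    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    /** Returns name -> "local | cluster" for properties defined in both files with different values. */
    static Map<String, String> differences(String localXml, String clusterXml) throws Exception {
        Map<String, String> local = properties(localXml);
        Map<String, String> cluster = properties(clusterXml);
        Map<String, String> diff = new TreeMap<>();
        for (Map.Entry<String, String> e : local.entrySet()) {
            String name = e.getKey();
            if (cluster.containsKey(name) && !Objects.equals(e.getValue(), cluster.get(name))) {
                diff.put(name, e.getValue() + " | " + cluster.get(name));
            }
        }
        return diff;
    }
}
```

Properties that exist in only one file are deliberately not reported, since the Windows copies here carry a few extra settings on purpose.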

Sample configuration files from the author's cluster, for reference

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/hadoop/hdfs/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.47.131:9000</value>
    </property>
</configuration>
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/hadoop/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/opt/hadoop/hdfs/data</value>
    </property>
    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>192.168.47.131</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>192.168.47.131:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>192.168.47.131:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>192.168.47.131:8031</value>
    </property>
</configuration>
yarn-site.xml
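The addresses in these files only help if the Windows machine can actually open connections to them. As a rough sanity check, the sketch below tries a plain TCP connection to each port; the class `PortCheck` and the 2-second timeout are assumptions made for illustration, and a successful connection only shows that something is listening, not that the Hadoop service behind it is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Checks whether a TCP port accepts connections. */
final class PortCheck {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or the host could not be resolved.
            return false;
        }
    }

    public static void main(String[] args) {
        // Address and ports taken from fs.defaultFS and yarn-site.xml above.
        String master = "192.168.47.131";
        int[] ports = {9000, 8030, 8031, 8032};
        for (int port : ports) {
            System.out.println(master + ":" + port + " reachable: " + isReachable(master, port, 2000));
        }
    }
}
```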

 

Running WordCount

1. Test code

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.log4j.BasicConfigurator;

import java.io.IOException;


public class HdfsTest {

    public static void main(String[] args) {
        // Quickly set up a default Log4j environment.
        BasicConfigurator.configure();
        try {
            String filename = "hdfs://192.168.47.131:9000/words.txt";
            // Picks up fs.defaultFS from the core-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            if (fs.exists(new Path(filename))) {
                System.out.println("the file exists");
            } else {
                System.out.println("the file does not exist");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}
HdfsTest
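The file name in the test carries the same scheme, host and port as fs.defaultFS, which is why the default file system can serve it. The sketch below uses only java.net.URI to show how such a path breaks down and to compare its authority with a configured default; `HdfsUriParts` is a hypothetical helper, and Hadoop's own path checking may differ in its details.

```java
import java.net.URI;

/** Breaks an HDFS-style URI into parts and compares it with a default file system URI. */
final class HdfsUriParts {

    /** Returns true if fileUri has the same scheme, host and port as defaultFs. */
    static boolean sameFileSystem(String defaultFs, String fileUri) {
        URI fs = URI.create(defaultFs);
        URI file = URI.create(fileUri);
        return fs.getScheme() != null
                && fs.getHost() != null
                && fs.getScheme().equalsIgnoreCase(file.getScheme())
                && fs.getHost().equalsIgnoreCase(file.getHost())
                && fs.getPort() == file.getPort();
    }

    public static void main(String[] args) {
        URI file = URI.create("hdfs://192.168.47.131:9000/words.txt");
        System.out.println("scheme = " + file.getScheme());
        System.out.println("host   = " + file.getHost());
        System.out.println("port   = " + file.getPort());
        System.out.println("path   = " + file.getPath());
    }
}
```

Note that this comparison is purely textual: `hdfs://master:9000` and `hdfs://192.168.47.131:9000` count as different here even though the hosts file maps one to the other, which is one reason to use the same form consistently.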

2. WordCount example code

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.log4j.BasicConfigurator;

/**
 * Word count MapReduce
 */
public class WordCount {

    /**
     * Mapper class
     */
    public static class WordCountMapper extends MapReduceBase implements Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        /**
         * The map method reads the file,
         * uses each word in it as the key with a value of 1,
         * and emits that key/value pair as the map output, i.e. the reduce input.
         */
        @Override
        public void map(Object key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            /**
             * StringTokenizer: a string tokenizing class (java.util.StringTokenizer);
             * a surprisingly handy utility.
             * 1. StringTokenizer(String str):
             *     builds a tokenizer for str using the default delimiters:
             *     space, tab ('\t'), newline ('\n'), carriage return ('\r') and form feed ('\f').
             * 2. StringTokenizer(String str, String delim):
             *     builds a tokenizer for str with the given delimiters.
             * 3. StringTokenizer(String str, String delim, boolean returnDelims):
             *     builds a tokenizer for str with the given delimiters and specifies
             *     whether the delimiters themselves are returned as tokens.
             */
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
            }
        }
    }

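    /**
     * The reduce input is the map output: the values for each identical word are summed
     * to get that word's count. The word is then emitted as the key and its count
     * as the value, and saved to the configured output file.
     */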
    public static class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            result.set(sum);
            output.collect(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
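        // Quick way to enable log4j logging
        BasicConfigurator.configure();
        // Input data path; replace the address with that of your own Hadoop cluster
        String input = "hdfs://192.168.47.131:9000/words.txt";
        /**
         * The output path is the out directory under the HDFS root.
         * Note: this directory must not already exist, or the job fails.
         */
        String output = "hdfs://192.168.47.131:9000/out";

        JobConf conf = new JobConf(WordCount.class);
        // Set the user that submits the job
        conf.setUser("root");
        /**
         * The map-reduce job needs our custom mapper and reducer classes,
         * so the project has to be exported as a jar,
         * and the location of that jar is set here.
         */
        conf.setJar("D:\\ejar\\hadoop.jar");
        // Set the job name
        conf.setJobName("wordcount");
        /**
         * Declare cross-platform job submission
         */
        conf.set("mapreduce.app-submission.cross-platform","true");
        // Important declaration
        conf.setJarByClass(WordCount.class);
        // Output key type: the word string
        conf.setOutputKeyClass(Text.class);
        // Output value type: the word's count, an int
        conf.setOutputValueClass(IntWritable.class);
        // Set the mapper class
        conf.setMapperClass(WordCountMapper.class);
        /**
         * Set the combiner. The combiner's output becomes the reducer's input,
         * which improves performance by reducing the amount of data
         * transferred between map and reduce.
         * A combiner must not be used blindly; it depends on the business logic.
         * This job counts words, so using a combiner does not change the result.
         * When a combiner would change the result, it cannot be used.
         * For example, when computing an average, a combiner cannot be used,
         * because it would affect the final result.
         */
        conf.setCombinerClass(WordCountReducer.class);
        // Set the reducer class
        conf.setReducerClass(WordCountReducer.class);
        /**
         * Set the input format. TextInputFormat is the default input format,
         * so this line can be omitted.
         * Its key type is LongWritable (the byte offset at which each line starts in the file)
         * and its value type is Text.
         */
        conf.setInputFormat(TextInputFormat.class);
        /**
         * Set the output format. TextOutputFormat is the default output format.
         * Each record is written as a line of text. Keys and values can be of any type;
         * toString() is called on them and the resulting strings are written out.
         * By default the key and value are separated by a tab.
         */
        conf.setOutputFormat(TextOutputFormat.class);
        // Set the input data path
        FileInputFormat.setInputPaths(conf, new Path(input));
        // Set the output data path (it must not already exist, or an exception is thrown)
        FileOutputFormat.setOutputPath(conf, new Path(output));
        // Start the MapReduce job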
        JobClient.runJob(conf);
        System.exit(0);
    }

}
WordCount.java
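To see what the mapper, combiner and reducer above do without a cluster, the following sketch replays the same logic in plain Java. It is not Hadoop code: `LocalWordCount` is a hypothetical class, each input line is treated as its own split, and there is no shuffle or sorting beyond what a TreeMap provides. It illustrates why summing is safe to use as a combiner: the result is the same with or without it.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Plain-Java replay of the WordCount map, combine and reduce steps. */
final class LocalWordCount {

    /** Map step: emit (word, 1) for every whitespace-separated token in a line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<String, Integer>(itr.nextToken(), 1));
        }
        return out;
    }

    /** Reduce step (also usable as the combiner): sum the values for each key. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    /** Map every line, then reduce all pairs at once (no combiner). */
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) {
            all.addAll(map(line));
        }
        return reduce(all);
    }

    /** Map each line, combine per line into partial sums, then reduce the partial sums. */
    static Map<String, Integer> runWithCombiner(List<String> lines) {
        List<Map.Entry<String, Integer>> partial = new ArrayList<>();
        for (String line : lines) {
            partial.addAll(reduce(map(line)).entrySet());
        }
        return reduce(partial);
    }
}
```

Calling `run` and `runWithCombiner` on the same lines yields equal maps because addition is associative and commutative. An average does not have that property, which is the case the combiner comment in WordCount warns about.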

3. Export the project as a jar

   See: https://www.cnblogs.com/ffaiss/p/10908483.html

4. Run WordCount

5. On the Hadoop cluster, run the following commands to view the results

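List the files under /
hadoop fs -ls /

List the files in the /out directory
The author's WordCount program sets /out as the output directory, so that is the directory listed here; adjust this to your own setup
hadoop fs -ls /out

A successful run usually produces the following two files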
[root@master ~]# hadoop fs -ls /out
Found 2 items
-rw-r--r--   2 root supergroup          0 2020-12-19 22:45 /out/_SUCCESS
-rw-r--r--   2 root supergroup         85 2020-12-19 22:45 /out/part-00000

View the contents of part-00000 (the program's result)
[root@master ~]# hadoop fs -cat /out/part-00000
CAJViewer    1
a    1
free    1
function    1
is    3
it    1
main    1
paper.And    1
software.Its    1
view    1
Order
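If the result needs to be processed further on the client side, the part file can be read back as key/count pairs. The sketch below assumes the default TextOutputFormat layout of one record per line with a tab between key and value; `PartFileParser` is a hypothetical helper that works on text already fetched from HDFS, for example via `hadoop fs -cat`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Parses WordCount output lines of the form word, tab, count. */
final class PartFileParser {

    /** Returns word -> count in file order; lines without a tab are skipped. */
    static Map<String, Integer> parse(String content) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : content.split("\\R")) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // blank line or not a key/value record
            }
            String word = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }
}
```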

 

 
posted @ 2020-12-20 16:26  徐春晖