Hadoop WordCount Example Code

Author: Eve Cole  Updated: 2025-07-07 04:32:01

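A simple example is enough to explain what MapReduce is.

Suppose we want to count how many times each word appears in a large file. Because the file is too large, we split it into smaller files and assign several people to count them. This step is the "map". We then merge the counts each person produced, and this step is the "reduce".

To run the example above on MapReduce, you create a job that splits the file into independent data blocks and distributes them to different machine nodes. Map tasks scattered across those nodes then process the blocks in a completely parallel manner. MapReduce collects the map output and sends it to reduce, which is the next processing step.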
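The split, map, and merge steps described above can be sketched in plain Java without any Hadoop classes. This is only an in-memory illustration: the class name MapReduceSketch, its method names, and the sample chunks are assumptions made for this example and are not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java illustration of the map and reduce phases; no Hadoop classes are used.
class MapReduceSketch {

    // "Map": turn one chunk of text into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(chunk);
        while (itr.hasMoreTokens()) {
            pairs.add(Map.entry(itr.nextToken(), 1));
        }
        return pairs;
    }

    // "Reduce": sum the counts emitted for each word across all chunks.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Run map over every chunk, then reduce the combined output.
    static Map<String, Integer> wordCount(List<String> chunks) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String chunk : chunks) {
            all.addAll(map(chunk));
        }
        return reduce(all);
    }

    public static void main(String[] args) {
        // Hypothetical input: one "large file" already split into two chunks.
        List<String> chunks = List.of("hello hadoop hello", "hadoop map reduce");
        System.out.println(wordCount(chunks));
    }
}
```

In a real cluster each call to map would run on a different node and the framework would group the pairs by key before reduce sees them; here a single list stands in for that shuffle.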

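As for how a job actually executes, a process called the JobTracker coordinates all tasks while MapReduce runs. Several TaskTracker processes run the individual map tasks and continuously report their progress to the JobTracker. If a TaskTracker reports that a task failed, or does not report on its task for a long time, the JobTracker starts another TaskTracker to re-run that map task.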
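The reporting rule above can be modeled with a small toy class. This is not Hadoop's JobTracker code: the class HeartbeatSketch, the tracker names, and the 10-second timeout are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the coordination described above; it is not Hadoop's real JobTracker code.
class HeartbeatSketch {
    // Last time (in milliseconds) each tracker reported, keyed by a hypothetical tracker name.
    private final Map<String, Long> lastReport = new LinkedHashMap<>();
    // Trackers that explicitly reported a failed task.
    private final List<String> failed = new ArrayList<>();
    private final long timeoutMillis;

    HeartbeatSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record one status report from a tracker.
    void report(String tracker, long nowMillis, boolean taskFailed) {
        lastReport.put(tracker, nowMillis);
        if (taskFailed && !failed.contains(tracker)) {
            failed.add(tracker);
        }
    }

    // Trackers whose map task should be re-run elsewhere: failed, or silent for too long.
    List<String> trackersToReplace(long nowMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastReport.entrySet()) {
            boolean silent = nowMillis - e.getValue() > timeoutMillis;
            if (silent || failed.contains(e.getKey())) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        HeartbeatSketch jobTracker = new HeartbeatSketch(10_000);
        jobTracker.report("tracker-a", 0, false);
        jobTracker.report("tracker-b", 0, false);
        jobTracker.report("tracker-c", 0, false);
        jobTracker.report("tracker-a", 9_000, false); // still healthy
        jobTracker.report("tracker-b", 9_500, true);  // reports a failed task
        // tracker-c never reports again
        System.out.println(jobTracker.trackersToReplace(12_000));
    }
}
```

At time 12000 the sketch flags tracker-b (explicit failure) and tracker-c (silent past the timeout), which are the two cases in which the text says a map task is re-run.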

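The concrete code implementation follows:

1. Write the WordCount job

(1) Create a Maven project in Eclipse that depends on the following JAR packages (you can also refer to the POM configuration of the hadoop-mapreduce-examples project in the Hadoop source package).

Note: configure the Maven plugin maven-jar-plugin and specify the main class.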

    <dependencies>
      <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
      </dependency>
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.5.2</version>
      </dependency>
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.5.2</version>
      </dependency>
    </dependencies>
    <build>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-jar-plugin</artifactId>
          <configuration>
            <archive>
              <manifest>
                <mainClass>com.xxx.demo.hadoop.wordcount.WordCount</mainClass>
              </manifest>
            </archive>
          </configuration>
        </plugin>
      </plugins>
    </build>

(2) Given how MapReduce works, a job needs at least three classes to cover three things: the map logic, the reduce logic, and job scheduling.

The map code can extend the org.apache.hadoop.mapreduce.Mapper class:

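    // Imports (place at the top of WordCount.java):
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        // Reused for every emitted pair to avoid allocating a new object per word
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // This example does not use the key parameter, so the key type is simply declared as Object
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // emit (word, 1)
            }
        }
    }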
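It helps to see exactly which tokens the mapper emits, because StringTokenizer with its default delimiters splits only on whitespace. The sketch below reuses the same tokenization outside Hadoop; the class name TokenizeSketch and the sample line are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Shows which tokens the mapper's StringTokenizer loop would produce for one line.
class TokenizeSketch {

    // Same whitespace-based tokenization the mapper applies to each input line.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical input line; punctuation and case are deliberately mixed.
        for (String token : tokens("Hello, world! hello\tWorld")) {
            System.out.println(token + "\t1"); // the (word, 1) pair the mapper would write
        }
    }
}
```

Punctuation stays attached and case is preserved, so "Hello," and "hello" are counted as different words. If that is not wanted, the line would need to be normalized before tokenizing.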

The reduce code can extend the org.apache.hadoop.mapreduce.Reducer class:

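    // Imports (place at the top of WordCount.java):
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Declared static, like the mapper, so Hadoop can instantiate it when it is nested inside WordCount
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Add up every count received for this word
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // emit (word, total)
        }
    }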

Write the main method that schedules the job:

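    // Imports (place at the top of WordCount.java):
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer doubles as the combiner, pre-summing counts on the map side
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] is the input path, args[1] is the output path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }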
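The main method registers the same reducer class as the combiner. That works here because addition gives the same total whether the counts are summed all at once or first summed per mapper and then summed again. The plain-Java sketch below checks this on made-up data; the class CombinerSketch and its method names are assumptions for illustration and do not use the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrates why a summing reducer can also serve as a combiner.
class CombinerSketch {

    // The same summation the reducer performs over the values of one key.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Without a combiner: the reducer receives every individual count.
    static int withoutCombiner(List<List<Integer>> perMapper) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> counts : perMapper) {
            all.addAll(counts);
        }
        return sum(all);
    }

    // With a combiner: each mapper's counts are pre-summed, then the reducer sums the partial sums.
    static int withCombiner(List<List<Integer>> perMapper) {
        List<Integer> partials = new ArrayList<>();
        for (List<Integer> counts : perMapper) {
            partials.add(sum(counts));
        }
        return sum(partials);
    }

    public static void main(String[] args) {
        // Hypothetical counts for one word, emitted by two different mappers.
        List<List<Integer>> perMapper = List.of(List.of(1, 1, 1), List.of(1, 1));
        System.out.println(withoutCombiner(perMapper) + " == " + withCombiner(perMapper));
    }
}
```

The result is the same either way, but with the combiner only one partial sum per mapper and word crosses the network instead of one record per occurrence. An operation that is not associative and commutative, such as an average, could not be reused as a combiner this directly.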

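2. Upload the data file to the Hadoop cluster environment

Run mvn install to package the project into a JAR file and upload it to the Linux cluster environment. Use the hdfs dfs -mkdir command to create the corresponding directory in the HDFS file system, then use hdfs dfs -put to upload the data file to be processed to HDFS. For example: hdfs dfs -put ${linux_path/data file} ${hdfs_path}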

3. Run the job

Run this command in the cluster environment: hadoop jar ${linux_path}/wordcount.jar ${hdfs_input_path} ${hdfs_output_path}

4. View the results

hdfs dfs -cat ${hdfs_output_path}/<output file name>
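The job above does not set an output format, so the result should be plain text with one word and its count per line, separated by a tab (Hadoop's default text output). Under that assumption, the lines printed by the cat command can be read back as in the sketch below; the class OutputParseSketch and the sample lines are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Parses WordCount result lines of the assumed form "word<TAB>count".
class OutputParseSketch {

    // Turn result lines into a map, keeping the order in which they appear.
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("no tab separator in line: " + line);
            }
            String word = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical result lines, as they might be printed by hdfs dfs -cat.
        List<String> lines = List.of("hadoop\t2", "hello\t2", "map\t1");
        parse(lines).forEach((word, count) -> System.out.println(word + " appears " + count + " time(s)"));
    }
}
```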

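If the Hadoop cluster environment has not been started, the approach above runs in local mode, in which neither HDFS nor YARN is involved. The following describes what needs to be done to run a MapReduce job in pseudo-distributed mode. The steps are excerpted from the official website.

Configure the host name

# vi /etc/sysconfig/network

For example: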

    NETWORKING=yes
    HOSTNAME=master

# vi /etc/hosts

Add the following content:

127.0.0.1 localhost

Configure passwordless SSH login

# ssh-keygen -t rsa
# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Configure the core-site.xml file (located in ${HADOOP_HOME}/etc/hadoop/)

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

Configure the hdfs-site.xml file

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

The following commands run a MapReduce job in single-node pseudo-distributed mode.

1. Format the filesystem:
$ bin/hdfs namenode -format
2. Start the NameNode daemon and the DataNode daemon:
$ sbin/start-dfs.sh
3. The Hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).
4. Browse the web interface for the NameNode. By default it is available at:
NameNode - http://localhost:50070/
Create the HDFS directories required to run MapReduce jobs:
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>
5. Copy the input files into the distributed filesystem:
$ bin/hdfs dfs -put etc/hadoop input
6. Run one of the provided examples:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar grep input output 'dfs[a-z.]+'
7. Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hdfs dfs -get output output
$ cat output/*
or
View the output files directly on the distributed filesystem:
$ bin/hdfs dfs -cat output/*
8. When you are done, stop the daemons with:
$ sbin/stop-dfs.sh
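The grep example in step 6 is driven by the regular expression 'dfs[a-z.]+'. To get a feel for what that pattern selects from the configuration files copied in step 5, the sketch below applies it with java.util.regex and counts each distinct match. It only approximates the idea of the example job; the class GrepSketch and the sample lines are assumptions for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Counts every distinct substring matching a regular expression across a list of lines.
class GrepSketch {

    static Map<String, Integer> countMatches(String regex, List<String> lines) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                counts.merge(matcher.group(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical lines resembling the XML configuration files used as input.
        List<String> lines = List.of(
                "<name>dfs.replication</name>",
                "<name>fs.defaultFS</name>",
                "<value>hdfs://localhost:9000</value>",
                "<name>dfs.namenode.name.dir</name>",
                "dfs.replication is set to 1");
        System.out.println(countMatches("dfs[a-z.]+", lines));
    }
}
```

The pattern requires at least one lowercase letter or dot after "dfs", so "hdfs://localhost:9000" does not match (the next character is a colon) and "fs.defaultFS" does not contain "dfs" at all. Only the dfs.* property names are counted.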

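Summary

That is the full content of this article on the WordCount example code. I hope it is helpful. Interested readers can continue with the other related topics on this site. If anything is missing or wrong, please leave a message. Thank you for supporting this site!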