Many Hadoop beginners are probably in the same situation: without enough machine resources, the best you can do is install a pseudo-distributed Hadoop inside a virtual machine, and then write and test code with Eclipse or IntelliJ IDEA on Win7. So the question is: how do you submit map/reduce jobs from Win7 to the remote Hadoop cluster and debug them with breakpoints?
1. Preparation
1.1 On Win7, pick a directory and unpack hadoop-2.6.0. This article uses D:/yangjm/Code/study/hadoop/hadoop-2.6.0 (referred to below as $HADOOP_HOME).
1.2 Add a few environment variables on Win7:
HADOOP_HOME=D:/yangjm/Code/study/hadoop/hadoop-2.6.0
HADOOP_BIN_PATH=%HADOOP_HOME%/bin
HADOOP_PREFIX=D:/yangjm/Code/study/hadoop/hadoop-2.6.0
In addition, append %HADOOP_HOME%/bin to the end of the PATH variable.
2. Eclipse remote debugging
2.1 Download the hadoop-eclipse-plugin
hadoop-eclipse-plugin is a Hadoop plugin built specifically for Eclipse; it lets you browse HDFS directories and file contents directly in the IDE. Its source code is hosted on GitHub at https://github.com/winghc/hadoop2x-eclipse-plugin
If you are interested, you can download the source and compile it yourself; a quick search turns up plenty of articles on how. If you just want to use it, various pre-built versions are provided under https://github.com/winghc/hadoop2x-eclipse-plugin/tree/master/release and can be used directly. Copy the downloaded hadoop-eclipse-plugin-2.6.0.jar into the eclipse/plugins directory and restart Eclipse, and you're done.
2.2 Download the Hadoop 2.6 plugin package for the Windows 64-bit platform (hadoop.dll, winutils.exe)
In the hadoop 2.6.0 source code, under hadoop-common-project/hadoop-common/src/main/winutils, there is a VS.NET project; compiling it produces this set of output files.
hadoop.dll and winutils.exe are the most important ones. Copy winutils.exe into the $HADOOP_HOME/bin directory and hadoop.dll into the %windir%/system32 directory (mainly to keep the plugin from throwing all sorts of inexplicable errors, such as null reference exceptions).
Note: if you don't want to compile it yourself, you can download the pre-built archive hadoop2.6(x64)V0.2.rar directly.
2.3 Configuring the hadoop-eclipse-plugin
Start Eclipse, then Window -> Show View -> Other
Window -> Preferences -> Hadoop Map/Reduce: specify the root directory of Hadoop on Win7 (i.e. $HADOOP_HOME)
Then, in the Map/Reduce Locations panel, click the little elephant icon
Add a Location
This dialog is very important, so let me explain several of the parameters:
Location name: just a label; call it whatever you like.
Map/Reduce (V2) Master: the Host is the IP address of the Hadoop master in the virtual machine; the Port below it corresponds to the port specified by the dfs.datanode.ipc.address property in hdfs-site.xml (see the sketch after this list).
DFS Master: the Port here corresponds to the port specified by fs.defaultFS in core-site.xml.
User name: this should be the same user that runs Hadoop in the virtual machine. I installed and run hadoop 2.6.0 as the user hadoop, so I fill in hadoop here; if you installed it as root, change it to root accordingly.
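For reference, here is a minimal sketch of the two properties involved, assuming Hadoop's default datanode IPC port (50020) and the fs.defaultFS address used later in this article; check your own hdfs-site.xml and core-site.xml for the actual values.

In hdfs-site.xml (determines the Map/Reduce (V2) Master port):

<property>
    <name>dfs.datanode.ipc.address</name>
    <value>0.0.0.0:50020</value>
</property>

In core-site.xml (determines the DFS Master port):

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://172.28.20.xxx:9000</value>
</property>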
Once these parameters are specified, click Finish and Eclipse knows how to connect to Hadoop. If everything goes well, you can see the directories and files of HDFS in the Project Explorer panel.
You can right-click a file and choose Delete to try it out. Usually the first attempt fails with a long error message whose gist is that permissions are insufficient; the reason is that the current Win7 login user is not the user running Hadoop in the virtual machine. There are many solutions. For example, you could create a new hadoop administrator user on Win7, log in to Win7 as that user, and then develop with Eclipse. But that is too much hassle; the easiest way is:
Add the following to hdfs-site.xml:

<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>
Then, in the virtual machine, run hadoop dfsadmin -safemode leave
To be safe, also run hadoop fs -chmod 777 /
In short, this completely turns off Hadoop's permission checking (there is no need for it at the learning stage, but don't do this in production). Finally restart Hadoop, go back to Eclipse, repeat the delete-file operation from just now, and it should work this time.
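Another workaround worth knowing (a sketch only, not part of the setup above) is to tell the Hadoop client which user name to act as. When the cluster runs with simple authentication, the client's UserGroupInformation honors the HADOOP_USER_NAME setting, so adding a line like the following at the very start of the driver (before any Configuration or FileSystem call) makes the job run as the hadoop user instead of the Win7 login user:

// Sketch: impersonate the VM's hadoop user from the Windows client (simple authentication only)
System.setProperty("HADOOP_USER_NAME", "hadoop");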
2.4 Creating a WordCount sample project
Create a new project and select Map/Reduce Project.
Just keep clicking Next, then create WordCount.java with the code below:
package yjmyzz;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Mapper: splits each line into tokens and emits (word, 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Then add a log4j.properties with the following content (so that it is convenient to see the various outputs while running):
log4j.rootLogger=INFO, stdout

#log4j.logger.org.springframework=INFO
#log4j.logger.org.apache.activemq=INFO
#log4j.logger.org.apache.activemq.spring=WARN
#log4j.logger.org.apache.activemq.store.journal=INFO
#log4j.logger.org.apache.activemq.org.activeio.journal=INFO

log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ABSOLUTE} | %-5.5p | %-16.16t | %-32.32c{1} | %-32.32C %4L | %m%n

The final directory structure is as follows:
Now you can run it. Of course it will not succeed, because WordCount has not been given any parameters; refer to the following figure:
2.5 Set the run parameters
Because WordCount reads an input file, counts the words, and writes the result to another folder, it needs two parameters. As shown in the figure above, enter the following in Program arguments:
hdfs://172.28.20.xxx:9000/jimmy/input/README.txt
hdfs://172.28.20.xxx:9000/jimmy/output/
Adjust these for your own environment (mainly, replace the IP with your virtual machine's IP). Note that if /jimmy/input/README.txt does not exist yet, upload it manually first, and /jimmy/output/ must not already exist, otherwise the program will run to the end, find that the target directory exists, and report an error; both can be done from the command line as sketched below. After that, set a breakpoint in a suitable spot and you can finally debug.
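For example, something along these lines on the virtual machine (a sketch using the paths above; adjust to your own files):

hadoop fs -put README.txt /jimmy/input/
hadoop fs -rm -r /jimmy/output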
3. IntelliJ IDEA remote debugging
3.1 Create a Maven WordCount project
The pom file is as follows:
<?xml version="1.0" encoding="UTF-8"?><project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> <modelVersion>4.0.0</modelVersion> <groupId>yjmyzz</groupId> <artifactId>mapreduce-helloworld</artifactId> <version>1.0-SNAPSHOT</version> <dependencies> <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-common</artifactId> <version>2.6.0</version> </dependency> <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-mapreduce-client-jobclient</artifactId> <version>2.6.0</version> </dependency> <dependency> <groupId>commons-cli</groupId> <artifactId>commons-cli</artifactId> <version>1.2</version> </dependency> </dependencies> <build> <finalName>${project.artifactId}</finalName> </build></project>The project structure is as follows:
Right-click the project -> Open Module Settings, or press F12, to open the module properties.
Add dependent Library references:
Then import all the corresponding jar packages under $HADOOP_HOME.
The imported library can be given a name, such as hadoop2.6; the directories the jars typically live in are listed below.
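In a standard Hadoop 2.6.0 binary distribution, the jars usually sit under the following directories (a sketch; adjust if your layout differs):

$HADOOP_HOME/share/hadoop/common and $HADOOP_HOME/share/hadoop/common/lib
$HADOOP_HOME/share/hadoop/hdfs and $HADOOP_HOME/share/hadoop/hdfs/lib
$HADOOP_HOME/share/hadoop/mapreduce and $HADOOP_HOME/share/hadoop/mapreduce/lib
$HADOOP_HOME/share/hadoop/yarn and $HADOOP_HOME/share/hadoop/yarn/lib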
3.2 Set the run parameters
Note two places:
1. Program arguments, which, just as in Eclipse, specifies the input file(s) and the output folder.
2. Working directory, i.e. the working directory, which should be set to the $HADOOP_HOME directory.
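For example, with the same HDFS paths as in the Eclipse section, the two settings would look roughly like this (replace the IP with your own):

Program arguments: hdfs://172.28.20.xxx:9000/jimmy/input/README.txt hdfs://172.28.20.xxx:9000/jimmy/output/
Working directory: D:/yangjm/Code/study/hadoop/hadoop-2.6.0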
Then you can debug
The only annoyance under IntelliJ is that there is no Hadoop plugin like the Eclipse one, so every time you run WordCount you have to delete the output directory manually on the command line before debugging. To solve this, you can improve the WordCount code so that it deletes the output directory itself before running; see the following code:
package yjmyzz;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Mapper: splits each line into tokens and emits (word, 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    /**
     * Delete the specified directory (recursively) if it exists.
     *
     * @param conf    the Hadoop configuration
     * @param dirPath the directory to delete
     * @throws IOException
     */
    private static void deleteDir(Configuration conf, String dirPath) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path targetPath = new Path(dirPath);
        if (fs.exists(targetPath)) {
            boolean delResult = fs.delete(targetPath, true);
            if (delResult) {
                System.out.println(targetPath + " has been deleted successfully.");
            } else {
                System.out.println(targetPath + " deletion failed.");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }

        // Delete the output directory first
        deleteDir(conf, otherArgs[otherArgs.length - 1]);

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

But this is not enough. When running in the IDE, the program needs to know which HDFS instance to connect to (much as in database development, where you specify a DataSource in a configuration file), so copy core-site.xml from $HADOOP_HOME/etc/hadoop into the project's resources directory, similar to the following:
The contents are as follows:
<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="configuration.xsl"?><configuration> <property> <name>fs.defaultFS</name> <value>hdfs://172.28.20.***:9000</value> </property></configuration>
Just replace the IP above with the IP in the virtual machine.
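Alternatively (a sketch only; copying core-site.xml as above is what this article does), the same address can be set directly in code before the Job is created:

Configuration conf = new Configuration();
// Assumption: the address must match the fs.defaultFS of the cluster in the VM
conf.set("fs.defaultFS", "hdfs://172.28.20.***:9000");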