Classification of Spark operators
From a high-level perspective, Spark operators fall into two broad categories:
1) Transformation operators: these do not trigger job submission; they carry out the intermediate processing steps of a job.
Transformations are lazily evaluated, meaning that an operation producing a new RDD from an existing RDD is not executed immediately. It waits until an Action operation is invoked, and only then is the whole chain triggered.
2) Action operators: this type of operator triggers SparkContext to submit a job.
An Action operator causes Spark to submit a job (Job) and writes the resulting data out of the Spark system, for example back to the driver or to storage.
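The lazy-evaluation behavior described above can be illustrated without a cluster. Java Streams follow the same pattern: intermediate operations (like Transformations) do nothing until a terminal operation (like an Action) runs. A plain-Java sketch of the idea, not Spark code:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyDemo {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Intermediate operation (analogous to a Transformation): nothing executes yet
        Stream<Integer> mapped = Stream.of(1, 2, 3)
                .map(x -> { calls.incrementAndGet(); return x * 2; });
        System.out.println("after map: " + calls.get());      // prints 0, nothing ran
        // Terminal operation (analogous to an Action): triggers the pipeline
        List<Integer> result = mapped.collect(Collectors.toList());
        System.out.println("after collect: " + calls.get());  // prints 3
        System.out.println(result);                            // prints [2, 4, 6]
    }
}
```

Spark's laziness works the same way conceptually, with the added benefit that the scheduler can optimize the whole chain before executing it.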
From a finer-grained perspective, Spark operators can be divided into the following three categories:
1) Transformation operators for Value data types. These do not trigger job submission; the data items processed are single values.
2) Transformation operators for Key-Value data types. These do not trigger job submission; the data items processed are Key-Value pairs.
3) Action operators, which trigger SparkContext to submit a job.
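The Value vs. Key-Value distinction can be sketched in plain Java (illustrative only, not Spark code): a Value-type operator such as map transforms each element independently, while a Key-Value operator such as Spark's reduceByKey aggregates per key, similar to groupingBy here:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class OperatorKinds {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "java", "spark");
        // Value-type transformation (like Spark's map): element -> element
        List<Integer> lengths = words.stream()
                .map(String::length)
                .collect(Collectors.toList());
        // Key-Value transformation (like Spark's reduceByKey): aggregate per key
        Map<String, Long> counts = words.stream()
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        System.out.println(lengths);             // prints [5, 4, 5]
        System.out.println(counts.get("spark")); // prints 2
    }
}
```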
Introduction
Scala is usually the more convenient language for writing Spark programs; after all, Spark's source code is written in Scala. In practice, however, many developers work in Java, especially for data integration and online services, so it is worth mastering some ways of using Spark from Java.
1. map
map is one of the most frequently used operators for processing and transforming data.
Before using map, first define a transformation function, in the following format:
Function<String, LabeledPoint> transform = new Function<String, LabeledPoint>() {
    // String is the input type of a row; LabeledPoint is the output type after conversion
    @Override
    public LabeledPoint call(String row) throws Exception { // override the call method
        String[] rowArr = row.split(",");
        int rowSize = rowArr.length;
        double[] doubleArr = new double[rowSize - 1];
        // Apart from the first element (the label), parse the rest into doubles
        for (int i = 1; i < rowSize; i++) {
            doubleArr[i - 1] = Double.parseDouble(rowArr[i]);
        }
        // Turn the parsed values into a vector
        Vector feature = Vectors.dense(doubleArr);
        double label = Double.parseDouble(rowArr[0]);
        // Build the LabeledPoint format used for classification training
        LabeledPoint point = new LabeledPoint(label, feature);
        return point;
    }
};
Special attention should be paid to the following:
1. The input type of the call method should be the type of a data row before conversion; the return type should be the type of a row after processing.
2. If a custom class is used inside the transformation function, that class must be serializable, for example:
public class TreeEnsemble implements Serializable {}
3. If the transformation function references outside objects, such as an external parameter or a numerical preprocessing model (standardization, normalization, etc.), those objects need to be declared final.
Then call the transformation function where appropriate:
JavaRDD<LabeledPoint> rdd = oriData.toJavaRDD().map(transform);
This requires converting the ordinary RDD to a JavaRDD first. The conversion to JavaRDD is not expensive, so there is no need to worry about it.
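The parsing logic inside call can also be unit-tested on its own, outside Spark. A minimal standalone sketch (the class and method names here are illustrative):

```java
import java.util.Arrays;

public class RowParser {
    // Parses "label,f1,f2,..." rows, the same per-row logic the map function applies
    static double label(String row) {
        return Double.parseDouble(row.split(",")[0]);
    }

    static double[] features(String row) {
        String[] parts = row.split(",");
        double[] f = new double[parts.length - 1];
        // Skip the label at index 0; shift features down by one
        for (int i = 1; i < parts.length; i++) {
            f[i - 1] = Double.parseDouble(parts[i]);
        }
        return f;
    }

    public static void main(String[] args) {
        System.out.println(label("1.0,2.5,3.5"));                     // prints 1.0
        System.out.println(Arrays.toString(features("1.0,2.5,3.5"))); // prints [2.5, 3.5]
    }
}
```

Keeping the parsing separate from the Spark wiring makes the index arithmetic (features shifted down by one relative to the raw row) easy to verify.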
2. filter
filter is also very commonly used, for example to drop null values and zeros from the data; it serves the same role as WHERE in SQL.
First define a function that, given a data row, returns a Boolean; the effect is that rows for which it returns true are kept.
Function<String, Boolean> boolFilter = new Function<String, Boolean>() {
    // String is the input type of a row; the Boolean output decides whether the row is kept
    @Override
    public Boolean call(String row) throws Exception { // override the call method
        boolean flag = row != null;
        return flag;
    }
};
In practice, usually the only thing to change in this function is the input type of the data row. Unlike the transformation function above, the return type of call here is always Boolean.
Then call it:
JavaRDD<String> rdd = oriData.toJavaRDD().filter(boolFilter);
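The predicate works the same way on plain Java collections. A small standalone sketch of the null/empty check (illustrative only, not Spark code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FilterDemo {
    public static void main(String[] args) {
        List<String> rows = Arrays.asList("1.0,2.0", null, "", "3.0,4.0");
        // Same idea as the Spark filter: keep only rows where the predicate returns true
        List<String> kept = rows.stream()
                .filter(r -> r != null && !r.isEmpty())
                .collect(Collectors.toList());
        System.out.println(kept); // prints [1.0,2.0, 3.0,4.0]
    }
}
```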
3. mapToPair
This method is similar to map in that it also transforms the data, but its output for each input row is a tuple (Tuple2). It is most commonly used for cross-validation and for computing statistics such as error rate, recall, and AUC.
As before, first define a conversion function:
PairFunction<LabeledPoint, Object, Object> transformer = new PairFunction<LabeledPoint, Object, Object>() {
    // LabeledPoint is the input type; the two Objects are the tuple's element types
    @Override
    public Tuple2<Object, Object> call(LabeledPoint row) throws Exception {
        // Overriding call usually only requires changing the input and output handling
        double prediction = thismodel.predict(row.features());
        double label = row.label();
        return new Tuple2<>(prediction, label);
    }
};
The requirements for referenced classes and objects are the same as before: the classes must be serializable, and external objects (such as thismodel here) must be declared final.
The corresponding call is as follows:
JavaPairRDD<Object, Object> predictionsAndLabels = rdd.mapToPair(transformer);
How to use predictionsAndLabels to compute accuracy, precision, recall, and AUC will be covered in the next post. Stay tuned.
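To see what the resulting pairs look like, here is a plain-Java stand-in (illustrative only): AbstractMap.SimpleEntry plays the role of Tuple2, holding one (prediction, label) pair per row.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map.Entry;
import java.util.stream.Collectors;

public class PairDemo {
    public static void main(String[] args) {
        // Stand-in for per-row (prediction, label) values that mapToPair would produce
        List<double[]> rows = Arrays.asList(new double[]{1.0, 1.0}, new double[]{0.0, 1.0});
        List<Entry<Double, Double>> pairs = rows.stream()
                .map(r -> new SimpleEntry<>(r[0], r[1]))
                .collect(Collectors.toList());
        // Count the rows where the prediction matches the label
        long correct = pairs.stream()
                .filter(p -> p.getKey().equals(p.getValue()))
                .count();
        System.out.println(correct); // prints 1
    }
}
```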
Summary
That is the entire content of this article. I hope it provides some reference value for your study or work. If you have any questions, feel free to leave a comment. Thank you for your support of Wulin.com.