With deep learning so popular now, it is important to keep the spirit of learning alive: programmers, and architects especially, should always pay attention to the core technologies and key algorithms, and be able to write and master them when necessary. Don't worry about when you will get to use them. Whether to use a technology is a political question; whether you can write it is a technical one, just as soldiers concern themselves not with whether to fight, but with how to win.
How programmers learn machine learning
For programmers, machine learning has a certain barrier to entry (that barrier is also its core competitiveness). I believe many people get a headache from English papers full of mathematical formulas when they start learning machine learning, and may even give up. But in fact, writing a program that implements a machine learning algorithm is not hard. Below is a multi-layer back-propagation (BP) neural network, the foundation of deep learning, implemented in 70 lines of code. And it is not just neural networks: most machine learning algorithms, such as logistic regression, decision trees (C4.5/ID3), random forests, Bayesian methods, collaborative filtering, graph computation, K-means, and PageRank, can each be implemented in about 100 lines of stand-alone code (to be covered later).
The real difficulty of machine learning lies in why it computes the way it does, what the mathematical principles behind it are, and how the formulas are derived. Most material online introduces this theoretical side, but rarely tells you how the algorithm's calculation process and program implementation actually work. For programmers, what you need is engineering application, not proving a new mathematical method. In practice, most machine learning engineers use open source packages or tools written by others, feed in data, and tune the coefficients to train a result; they rarely implement the algorithm themselves. Even so, mastering each algorithm's calculation process is still very important, so that you understand what the algorithm does to the data and what effect it is trying to achieve.
This article focuses on the single-machine implementation of a back-propagation neural network. For multi-machine parallelization of neural networks, Fourinone provides a very flexible and complete parallel computing framework, but we can only conceive and design a distributed parallel solution once we understand the stand-alone implementation; without understanding the algorithm's calculation process, none of those ideas can develop. There are also convolutional neural networks, which are mainly a dimensionality-reduction idea used for image processing; they fall outside the scope of this article.
Description of the neural network process
First of all, it is important to be clear that a neural network performs prediction tasks. You probably remember the least squares method from high school; we can use it for a loose but intuitive analogy:
1. We start with a dataset and its labels (in least squares, a set of x and y values).
2. The algorithm fits function parameters that express this dataset, based on the data and its labels (in least squares, the formulas that compute a and b; in a neural network there is no such directly available formula).
3. We obtain the fitted function (in least squares, the fitted line y^ = a*x + b).
4. Bringing in new data then produces the corresponding prediction y^ (in least squares, substituting into y^ = a*x + b; a neural network works the same way, except the fitted function is far more complex). A minimal least squares sketch follows this list.
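To ground the analogy, here is a minimal least squares sketch. The class and method names are mine, for illustration only; this is not part of the neural network program below.

public class LeastSquaresSketch{
    //Fits y = a*x + b by the closed-form least squares solution and returns {a, b}
    public static double[] fitLine(double[] x, double[] y){
        double xMean = 0, yMean = 0;
        for(int i=0;i<x.length;i++){ xMean += x[i]; yMean += y[i]; }
        xMean /= x.length; yMean /= x.length;
        double num = 0, den = 0;
        for(int i=0;i<x.length;i++){
            num += (x[i]-xMean)*(y[i]-yMean);
            den += (x[i]-xMean)*(x[i]-xMean);
        }
        double a = num/den;
        double b = yMean - a*xMean;
        return new double[]{a, b};
    }
    public static void main(String[] args){
        double[] ab = fitLine(new double[]{1,2,3}, new double[]{2,4,6});
        System.out.println("a="+ab[0]+", b="+ab[1]);//prints a=2.0, b=0.0 for this perfectly linear data
    }
}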
The calculation process of neural networks
The structure of a neural network is as follows: the leftmost column of nodes is the input layer, the rightmost is the output layer, and in between lie one or more hidden layers. Each node of the hidden and output layers is computed as the weighted sum of the previous layer's nodes; the circle marked "+1" in the usual diagrams is the intercept term b. For every node outside the input layer: Y = w0*x0 + w1*x1 + … + wn*xn + b. From this we can see that a neural network is equivalent to a multi-layer logistic regression structure.
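As a sketch of a single node's computation (names like nodeValue are illustrative and not taken from the program below):

public class NodeSketch{
    //One node: weighted sum of the previous layer plus the intercept b,
    //i.e. Y = w0*x0 + w1*x1 + ... + wn*xn + b, then squashed by the S function
    static double nodeValue(double[] prev, double[] w, double b){
        double z = b;
        for(int i=0;i<prev.length;i++)
            z += w[i]*prev[i];
        return 1/(1+Math.exp(-z));
    }
    public static void main(String[] args){
        double y = nodeValue(new double[]{1,2}, new double[]{0.5,-0.3}, 0.1);
        System.out.println(y);//a value between 0 and 1
    }
}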
The algorithm's calculation process: starting from the input layer, compute forward, layer by layer, until the output layer produces a result. If the result differs from the target value, compute backward from right to left, working out each node's error layer by layer and adjusting every node's weights. After the backward pass reaches the input layer, compute forward again, and iterate these steps until all the weight parameters converge to reasonable values. Because a computer program solves for equation parameters differently from pencil-and-paper mathematics, it usually picks random parameters first and then keeps adjusting them to shrink the error until they approach the correct values; most machine learning works by this kind of iterative training. Let's look at this process more closely in the program.
Algorithm implementation of the neural network
The algorithm implementation of a neural network divides into three stages: initialization, forward computation of the result, and backward adjustment of the weights.
1. Initialization process
Since this is an n-layer neural network, we use a two-dimensional array, layer, to record the node values: the first dimension is the layer index, the second is the node's position within the layer, and the array value is the node's value. The node errors, layerErr, are recorded the same way. A three-dimensional array, layer_weight, records the weights of each node: the first dimension is the layer index, the second is the node's position in that layer, the third is the position of the node in the next layer, and the array value is the weight from the one node to the other, initialized to a random number between 0 and 1. To speed up convergence, the momentum method is used for weight adjustment, which requires remembering the previous adjustment amount; this is recorded in the three-dimensional array layer_weight_delta. Intercept handling: the program fixes the intercept's input value at 1, so only its weight needs to be computed. A sketch of this scheme follows.
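As a sketch of this initialization for a hypothetical {2,10,2} network (matching the test program later; this deliberately duplicates what the constructor of BpDeep does below):

import java.util.Random;

public class InitSketch{
    public static void main(String[] args){
        int[] layernum = {2,10,2};
        double[][][] layer_weight = new double[layernum.length][][];
        Random random = new Random();
        for(int l=0;l+1<layernum.length;l++){
            //one extra row for the intercept term, whose input value is fixed at 1
            layer_weight[l] = new double[layernum[l]+1][layernum[l+1]];
            for(int j=0;j<layer_weight[l].length;j++)
                for(int i=0;i<layernum[l+1];i++)
                    layer_weight[l][j][i] = random.nextDouble();//random initial weight in (0,1)
        }
        //layer_weight[l][j][i]: weight from node j of layer l to node i of layer l+1
        System.out.println("rows in layer 0 weights: "+layer_weight[0].length);//3 = 2 nodes + intercept
    }
}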
2. Calculate the results forward
The S function 1/(1+Math.exp(-z)) squashes each node's value into the range 0 to 1, and the computation proceeds layer by layer until the output layer. Strictly speaking, the output layer does not need the S function, but since we treat the output as a probability between 0 and 1, applying it there as well also keeps the program uniform.
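A sketch of the S function and of the derivative the backward pass will rely on (method names are mine):

public class SigmoidSketch{
    //The S function squashes any z into (0,1)
    static double sigmoid(double z){
        return 1/(1+Math.exp(-z));
    }
    //Its derivative, expressed in terms of an already-computed output s, is s*(1-s):
    //this is the layer[l][j]*(1-layer[l][j]) factor that appears in the backward pass below
    static double sigmoidDerivative(double s){
        return s*(1-s);
    }
    public static void main(String[] args){
        double s = sigmoid(0.5);
        System.out.println(s+" "+sigmoidDerivative(s));
    }
}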
3. Modify the weights in reverse
Neural networks generally measure the error with the squared error function E:

E = ((tar1-out1)^2 + (tar2-out2)^2 + … + (tarn-outn)^2) / 2

That is, the squared differences between each output and its corresponding target value are summed and divided by 2; this is in fact the same error function as logistic regression. As for why this function measures the error, what its mathematical justification is, and how it is derived, a programmer who has no ambition to become a mathematician need not dig into it. What we need to do is minimize E, which requires differentiating it. If you have some calculus, you can try to derive the following from the derivative of E with respect to the weights:

for an output node: Err = out*(1-out)*(tar-out)
for a hidden node: Err = out*(1-out)*(Err1*w1 + Err2*w2 + … + Errm*wm), summed over the m nodes of the next layer

It doesn't matter if you can't derive them; we only need the resulting formulas. In the program, layerErr records exactly this per-node error obtained from differentiating E, and the weights are then adjusted according to it.
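A sketch of the two error rules with made-up sample numbers (illustrative only; the real computation is in updateWeight below):

public class ErrSketch{
    public static void main(String[] args){
        //Output node: Err = out*(1-out)*(tar-out)
        double out = 0.7, tar = 1.0;
        double outErr = out*(1-out)*(tar-out);
        //Hidden node: Err = out*(1-out)*(sum of next-layer errors times connecting weights)
        double[] nextErr = {outErr};
        double[] w = {0.5};//weight from this hidden node to the next-layer node
        double sum = 0;
        for(int i=0;i<nextErr.length;i++)
            sum += nextErr[i]*w[i];
        double hiddenOut = 0.6;
        double hiddenErr = hiddenOut*(1-hiddenOut)*sum;
        System.out.println(outErr+" "+hiddenErr);
    }
}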
Note that the momentum method is used here: the adjustment takes the previous adjustment into account, which helps avoid getting stuck in a local minimum. Below, k is the iteration number, mobp is the momentum coefficient, and rate is the learning step:
Δw(k+1) = mobp*Δw(k)+rate*Err*Layer
Many variants of this formula are also in use, and the difference in effect is not large, for example:

Δw(k+1) = mobp*Δw(k) + (1-mobp)*rate*Err*Layer
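A sketch of one momentum-based update step with sample numbers (variable names here are mine; in the program these are the layer_weight_delta lines):

public class MomentumSketch{
    public static void main(String[] args){
        double mobp = 0.8, rate = 0.15;
        double delta = 0.02;   //Δw(k): the previous adjustment
        double err = 0.1;      //error of the downstream node
        double layerOut = 0.6; //output of the upstream node
        //Δw(k+1) = mobp*Δw(k) + rate*Err*Layer
        delta = mobp*delta + rate*err*layerOut;
        double weight = 0.4;
        weight += delta;       //apply the adjustment to the weight
        System.out.println("Δw(k+1)="+delta+", new weight="+weight);
    }
}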
For performance, note that the program computes the errors and adjusts the weights inside a single while loop. It starts at the second-to-last layer (the last hidden layer) and works backward layer by layer: the weights of layer L are adjusted using the errors already computed for layer L+1, and layer L's own errors are computed in the same pass so the next loop iteration can use them, continuing until the first layer (the input layer) is reached.
Summary
Throughout the calculation, the node values change on every pass and do not need to be saved, while the weight parameters and error parameters must be saved, since they support the next iteration. So if you set out to design a distributed multi-machine parallel computing solution, you can see why other frameworks have the concept of a Parameter Server.
Complete program implementation of multi-layer neural network
The following implementation, BpDeep.java, can be used directly, and it is also easy to port to any other language such as C, C#, or Python, since it uses only basic statements and no Java libraries beyond Random.
import java.util.Random;

public class BpDeep{
    public double[][] layer;//node values of each layer
    public double[][] layerErr;//node errors of each layer
    public double[][][] layer_weight;//weights between layer nodes
    public double[][][] layer_weight_delta;//last weight adjustments, for the momentum term
    public double mobp;//momentum coefficient
    public double rate;//learning rate

    public BpDeep(int[] layernum, double rate, double mobp){
        this.mobp = mobp;
        this.rate = rate;
        layer = new double[layernum.length][];
        layerErr = new double[layernum.length][];
        layer_weight = new double[layernum.length][][];
        layer_weight_delta = new double[layernum.length][][];
        Random random = new Random();
        for(int l=0;l<layernum.length;l++){
            layer[l]=new double[layernum[l]];
            layerErr[l]=new double[layernum[l]];
            if(l+1<layernum.length){
                //one extra weight row for the intercept term
                layer_weight[l]=new double[layernum[l]+1][layernum[l+1]];
                layer_weight_delta[l]=new double[layernum[l]+1][layernum[l+1]];
                for(int j=0;j<layernum[l]+1;j++)
                    for(int i=0;i<layernum[l+1];i++)
                        layer_weight[l][j][i]=random.nextDouble();//random initial weight
            }
        }
    }

    //Compute the output layer by layer
    public double[] computeOut(double[] in){
        for(int l=1;l<layer.length;l++){
            for(int j=0;j<layer[l].length;j++){
                double z=layer_weight[l-1][layer[l-1].length][j];//start with the intercept weight
                for(int i=0;i<layer[l-1].length;i++){
                    layer[l-1][i]=l==1?in[i]:layer[l-1][i];//load the input into layer 0 on the first pass
                    z+=layer_weight[l-1][i][j]*layer[l-1][i];
                }
                layer[l][j]=1/(1+Math.exp(-z));//S function
            }
        }
        return layer[layer.length-1];
    }

    //Compute the errors backward layer by layer and adjust the weights
    public void updateWeight(double[] tar){
        int l=layer.length-1;
        for(int j=0;j<layerErr[l].length;j++)
            layerErr[l][j]=layer[l][j]*(1-layer[l][j])*(tar[j]-layer[l][j]);//output layer error

        while(l-->0){
            for(int j=0;j<layerErr[l].length;j++){
                double z = 0.0;
                for(int i=0;i<layerErr[l+1].length;i++){
                    z=z+(l>0?layerErr[l+1][i]*layer_weight[l][j][i]:0);//accumulate error only above the input layer
                    layer_weight_delta[l][j][i]= mobp*layer_weight_delta[l][j][i]+rate*layerErr[l+1][i]*layer[l][j];//hidden layer momentum adjustment
                    layer_weight[l][j][i]+=layer_weight_delta[l][j][i];//hidden layer weight adjustment
                    if(j==layerErr[l].length-1){
                        layer_weight_delta[l][j+1][i]= mobp*layer_weight_delta[l][j+1][i]+rate*layerErr[l+1][i];//intercept momentum adjustment
                        layer_weight[l][j+1][i]+=layer_weight_delta[l][j+1][i];//intercept weight adjustment
                    }
                }
                layerErr[l][j]=z*layer[l][j]*(1-layer[l][j]);//record this layer's error
            }
        }
    }

    public void train(double[] in, double[] tar){
        double[] out = computeOut(in);
        updateWeight(tar);
    }
}

An example of using neural networks
Finally, let's look at a simple example to see the network in action. To make the data distribution easy to observe, we use two-dimensional coordinates. There are 4 data points below: a square marks data of type 1 and a triangle marks data of type 0. The square-type points are (1,2) and (2,1), and the triangle-type points are (1,1) and (2,2). The problem is to separate these 4 points into types 1 and 0 on the plane, and then use the result to predict the type of new data.
We could apply logistic regression to this classification problem, but logistic regression produces a single straight line as the dividing boundary, and no matter where that line is placed, some sample always ends up on the wrong side: this layout is essentially the classic XOR pattern, which one straight line cannot classify correctly. With a neural network we do get a correct classification, which amounts to finding a union of several straight lines to partition the space, giving higher accuracy.
Here is the source code of this test program BpDeepTest.java:
import java.util.Arrays;

public class BpDeepTest{
    public static void main(String[] args){
        //Initialize the basic configuration of the neural network.
        //The first parameter is an integer array giving the number of layers and the nodes per layer;
        //for example, {3, 10, 10, 10, 10, 2} means 3 input nodes, 2 output nodes,
        //and 4 hidden layers of 10 nodes each in between.
        //The second parameter is the learning step, the third is the momentum coefficient.
        BpDeep bp = new BpDeep(new int[]{2,10,2}, 0.15, 0.8);

        //Sample data, corresponding to the 4 two-dimensional coordinates above
        double[][] data = new double[][]{{1,2},{2,2},{1,1},{2,1}};
        //Target data, corresponding to the classification of the 4 coordinates
        double[][] target = new double[][]{{1,0},{0,1},{0,1},{1,0}};

        //Train for 5000 iterations
        for(int n=0;n<5000;n++)
            for(int i=0;i<data.length;i++)
                bp.train(data[i], target[i]);

        //Check the sample data against the training results
        for(int j=0;j<data.length;j++){
            double[] result = bp.computeOut(data[j]);
            System.out.println(Arrays.toString(data[j])+":"+Arrays.toString(result));
        }

        //Predict the classification of a new data point from the training results
        double[] x = new double[]{3,1};
        double[] result = bp.computeOut(x);
        System.out.println(Arrays.toString(x)+":"+Arrays.toString(result));
    }
}

Summary
The test program above shows that a neural network can have an impressive classification effect. In truth, neural networks have real strengths, but they are not a universal algorithm that approximates the human brain; they will often disappoint, and you need to observe their behavior on plenty of data from many scenarios. We can change the single hidden layer to n layers and adjust the number of nodes per layer, the number of iterations, the learning step, and the momentum coefficient to reach an optimized result (a sketch of such configuration changes follows). In many cases, however, n hidden layers improve little over one, while the computation becomes more complex and time-consuming. Our understanding of neural networks has to be built up through practice and experience.
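For example, deepening the network only means changing the constructor argument. The hyperparameter values below are just a starting point to experiment with, not tuned results:

public class TuneSketch{
    public static void main(String[] args){
        //One hidden layer of 10 nodes, as in the test above
        BpDeep bp1 = new BpDeep(new int[]{2,10,2}, 0.15, 0.8);
        //Two hidden layers of 10 nodes each; the learning step and momentum are also worth varying
        BpDeep bp2 = new BpDeep(new int[]{2,10,10,2}, 0.15, 0.8);
    }
}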
That is all this article has to share about implementing a deep neural network algorithm in 70 lines of Java code. I hope it is helpful; if anything falls short, please leave a comment to point it out.