Recently I have been reading "Programming Collective Intelligence". Compared with other machine learning books, it has many case studies that are close to real-world problems, which makes it well suited to novices like us who are getting started with machine learning.
I think the book's weakness is that it does not explain the formulas behind the algorithms but implements them directly in code, which makes it hard to understand the algorithms in detail. So I want to write a few articles that explain them. This is the first one; it covers the Pearson correlation coefficient and uses Java, the language I am more familiar with.
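The mathematical definition of the Pearson correlation coefficient, from Wikipedia, is as follows:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

where $E$ is the mathematical expectation, $\operatorname{cov}$ is the covariance, $\mu_X$ and $\mu_Y$ are the means, and $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$.
Using $\operatorname{cov}(X,Y) = E[XY] - E[X]E[Y]$ and $\sigma_X^2 = E[X^2] - E[X]^2$, and replacing each expectation with an average over the $N$ samples, this simplifies to:

$$r = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{N}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{N}\right)}}$$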
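As a quick check of the simplified form, here is the arithmetic for the five movies that both Lisa Rose and Jack Matthews rated in the dataset used by the code below, with $X$ as Lisa Rose's score and $Y$ as Jack Matthews's. The sums were worked out by hand from those ratings, so treat this as an illustrative sketch rather than a figure taken from the book.

$$N = 5,\quad \sum X = 15,\quad \sum Y = 18.5,\quad \sum XY = 56.75,\quad \sum X^2 = 46,\quad \sum Y^2 = 71.25$$

$$r = \frac{56.75 - \frac{15 \times 18.5}{5}}{\sqrt{\left(46 - \frac{15^2}{5}\right)\left(71.25 - \frac{18.5^2}{5}\right)}} = \frac{1.25}{\sqrt{1 \times 2.8}} \approx 0.747$$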
The Pearson similarity calculation is simple and not difficult to implement. It needs only five sums over the items that both people rated: the sum of X, the sum of Y, the sum of the products XY, the sum of the squares of X, and the sum of the squares of Y. The test data used by my code comes from the book "Programming Collective Intelligence". The code is as follows:
```java
package pearsonCorrelationScore;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

/**
 * @author shenchao
 *
 * Pearson correlation score.
 *
 * Tested with the user rating similarity dataset in the book
 * "Programming Collective Intelligence".
 */
public class PearsonCorrelationScore {

    // Reviewer name -> (movie title -> rating).
    // Generic type arguments must be the wrapper Double, not the primitive double.
    private Map<String, Map<String, Double>> dataset = null;

    public PearsonCorrelationScore() {
        initDataSet();
    }

    /**
     * Initialize the dataset.
     */
    private void initDataSet() {
        dataset = new HashMap<String, Map<String, Double>>();

        // Initialize the Lisa Rose dataset
        Map<String, Double> roseMap = new HashMap<String, Double>();
        roseMap.put("Lady in the water", 2.5);
        roseMap.put("Snakes on a Plane", 3.5);
        roseMap.put("Just My Luck", 3.0);
        roseMap.put("Superman Returns", 3.5);
        roseMap.put("You, Me and Dupree", 2.5);
        roseMap.put("The Night Listener", 3.0);
        dataset.put("Lisa Rose", roseMap);

        // Initialize the Jack Matthews dataset
        Map<String, Double> jackMap = new HashMap<String, Double>();
        jackMap.put("Lady in the water", 3.0);
        jackMap.put("Snakes on a Plane", 4.0);
        jackMap.put("Superman Returns", 5.0);
        jackMap.put("You, Me and Dupree", 3.5);
        jackMap.put("The Night Listener", 3.0);
        dataset.put("Jack Matthews", jackMap);

        // Initialize the Gene Seymour dataset
        Map<String, Double> geneMap = new HashMap<String, Double>();
        geneMap.put("Lady in the water", 3.0);
        geneMap.put("Snakes on a Plane", 3.5);
        geneMap.put("Just My Luck", 1.5);
        geneMap.put("Superman Returns", 5.0);
        geneMap.put("You, Me and Dupree", 3.5);
        geneMap.put("The Night Listener", 3.0);
        dataset.put("Gene Seymour", geneMap);
    }

    public Map<String, Map<String, Double>> getDataSet() {
        return dataset;
    }

    /**
     * @param person1 name
     * @param person2 name
     * @return Pearson correlation value
     */
    public double sim_pearson(String person1, String person2) {
        Map<String, Double> p1Map = dataset.get(person1);
        Map<String, Double> p2Map = dataset.get(person2);

        // Find the movies that both people have rated (Pearson's algorithm needs paired values)
        List<String> list = new ArrayList<String>();
        for (Entry<String, Double> p1 : p1Map.entrySet()) {
            if (p2Map.containsKey(p1.getKey())) {
                list.add(p1.getKey());
            }
        }

        int N = list.size();
        // No movies in common: there is nothing to correlate
        if (N == 0) {
            return 0;
        }

        double sumX = 0.0;
        double sumY = 0.0;
        double sumX_Sq = 0.0;
        double sumY_Sq = 0.0;
        double sumXY = 0.0;

        for (String name : list) {
            double x = p1Map.get(name);
            double y = p2Map.get(name);
            sumX += x;
            sumY += y;
            sumX_Sq += Math.pow(x, 2);
            sumY_Sq += Math.pow(y, 2);
            sumXY += x * y;
        }

        double numerator = sumXY - sumX * sumY / N;
        double denominator = Math.sqrt((sumX_Sq - sumX * sumX / N)
                * (sumY_Sq - sumY * sumY / N));

        // The denominator cannot be 0
        if (denominator == 0) {
            return 0;
        }
        return numerator / denominator;
    }

    public static void main(String[] args) {
        PearsonCorrelationScore pearsonCorrelationScore = new PearsonCorrelationScore();
        System.out.println(pearsonCorrelationScore.sim_pearson("Lisa Rose", "Jack Matthews"));
    }
}
```

Each movie that both reviewers rated can be plotted as a point on a two-dimensional coordinate plane, with one reviewer's score on each axis.
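The value obtained by the above program is often described as the slope of a line drawn through these points, but it is more accurate to say that it measures how closely the points fit a straight line. The value lies in the interval [-1, 1]. Its sign gives the direction of the relationship, and its absolute value reflects the similarity between the two reviewers: the closer it is to 1, the more similar their tastes. The value is exactly 1 when all the points lie on a single upward-sloping line; if the two reviewers gave identical ratings for every movie, that line is the diagonal.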
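One way to see how the coefficient relates to the slope of a line is to feed it points that lie exactly on lines with different slopes. The sketch below is a hypothetical helper class (`PearsonInvarianceDemo` and its array-based `pearson` method are not from the book or the code above); it uses the same five sums as `sim_pearson` and assumes the two arrays are already paired item by item.

```java
public class PearsonInvarianceDemo {

    // Pearson correlation of two paired arrays, using the same
    // sum-based formula as sim_pearson (hypothetical helper for illustration).
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        if (n == 0 || n != y.length) {
            return 0; // assumed convention: nothing to compare
        }
        double sumX = 0, sumY = 0, sumXSq = 0, sumYSq = 0, sumXY = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXSq += x[i] * x[i];
            sumYSq += y[i] * y[i];
            sumXY += x[i] * y[i];
        }
        double numerator = sumXY - sumX * sumY / n;
        double denominator = Math.sqrt(
                (sumXSq - sumX * sumX / n) * (sumYSq - sumY * sumY / n));
        return denominator == 0 ? 0 : numerator / denominator;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0, 4.0, 5.0};
        double[] steep = {3.0, 5.0, 7.0, 9.0, 11.0};     // y = 2x + 1, slope 2
        double[] shallow = {0.5, 1.0, 1.5, 2.0, 2.5};    // y = 0.5x, slope 0.5
        double[] reversed = {5.0, 4.0, 3.0, 2.0, 1.0};   // y = 6 - x, slope -1
        double[] scattered = {2.0, 1.0, 4.0, 3.0, 5.0};  // upward trend, not on one line

        System.out.println("slope 2:    " + pearson(x, steep));
        System.out.println("slope 0.5:  " + pearson(x, shallow));
        System.out.println("slope -1:   " + pearson(x, reversed));
        System.out.println("scattered:  " + pearson(x, scattered));
    }
}
```

With these inputs the first two results come out at 1 even though the slopes are 2 and 0.5, the third is -1, and the scattered data gives roughly 0.8. This is why one reviewer who consistently scores higher than another can still be rated as very similar.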
Summary
That is all for this article's detailed explanation of computing similarity in Java with the Pearson correlation coefficient. I hope it is helpful to everyone. If you find any shortcomings, please leave a comment to point them out.