1. Introduction to bucket sorting
Bucket sort is a distribution-based, non-comparison sorting algorithm. Its working principle is to divide the data into a limited number of buckets and then sort each bucket separately (either with another sorting algorithm, or by applying bucket sort recursively). When the values to be sorted are evenly distributed, bucket sort runs in Θ(n) time. Unlike quicksort, bucket sort is not a comparison sort, so it is not bound by the Ω(nlogn) lower limit on comparison sorting.
Bucket sorting is carried out in the following 4 steps (a minimal code sketch follows the list):
(1) Set up a fixed number of empty buckets.
(2) Put each piece of data into its corresponding bucket.
(3) Sort the data in each non-empty bucket.
(4) Concatenate the data from the non-empty buckets in order to get the result.
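These four steps translate almost directly into code. The following is a minimal sketch only, assuming plain array buckets that are sorted at merge time; the author's full linked-list implementation appears in section 3:

// Minimal sketch of the four steps: array buckets, sorted when merging.
function bucketSortSketch(arr, count) {
  if (arr.length === 0) return [];
  var min = Math.min.apply(null, arr);
  var max = Math.max.apply(null, arr);
  var delta = (max - min + 1) / count;
  var buckets = [];                                      // (1) empty buckets
  for (var i = 0; i < arr.length; i++) {
    var idx = Math.floor((arr[i] - min) / delta);        // (2) map value -> bucket
    (buckets[idx] = buckets[idx] || []).push(arr[i]);
  }
  var result = [];
  for (var b = 0; b < count; b++) {
    if (!buckets[b]) continue;                           // only non-empty buckets
    buckets[b].sort(function (x, y) { return x - y; });  // (3) sort each bucket
    result = result.concat(buckets[b]);                  // (4) concatenate in order
  }
  return result;
}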
Bucket sorting is mainly suitable for data that falls within a small numeric range and is independently and evenly distributed. It can handle large amounts of data while achieving linear expected time.
2. Bucket sorting algorithm demonstration
For example, there is now a set of data [7, 36, 65, 56, 33, 60, 110, 42, 42, 94, 59, 22, 83, 84, 63, 77, 67, 101]. How can it be sorted in ascending order?
Operation steps:
(1) Set up 5 empty buckets. Find the maximum value 110 and the minimum value 7; the range of each bucket is then (110 - 7 + 1) / 5 = 20.8.
(2) Traverse the original data and put each number into the corresponding bucket, using a linked-list structure. For the number 7 the bucket index is 0, computed as floor((7 - 7) / 20.8); for the number 36 the bucket index is 1, computed as floor((36 - 7) / 20.8). (A snippet after these steps prints the index of every value.)
(3) When inserting data into a bucket that already contains numbers, compare the new number with the existing ones and insert it so the bucket stays ordered from left to right, smallest to largest. For example, in the bucket with index 2: when 63 is inserted, the bucket already contains the 4 numbers 56, 59, 60, and 65, so 63 is inserted to the left of 65.
(4) Merge the non-empty buckets: concatenate buckets 0, 1, 2, 3, and 4 in order from left to right.
(5) Obtain the result of the bucket sort.
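To make the bucket-index calculation in step (2) concrete, here is a small standalone snippet (separate from the full implementation in section 3) that prints the bucket index of every value in the example:

var data = [7, 36, 65, 56, 33, 60, 110, 42, 42, 94, 59, 22, 83, 84, 63, 77, 67, 101];
var min = 7, delta = (110 - 7 + 1) / 5; // 20.8, as computed in step (1)
data.forEach(function (v) {
  // floor((value - min) / delta) falls in [0, 4] for this data
  console.log(v + ' -> bucket ' + Math.floor((v - min) / delta));
});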
3. Node.js program implementation
It is not difficult to implement a mature algorithm like bucket sort. Following the ideas above, I wrote a simple program to implement it. The most troublesome part, I found, was manipulating the linked list in JavaScript.
The actual code is as follows:
'use strict';

var L = require('linklist'); // linked-list operations via the linklist package (see note (2) below)

/**
 * Bucket sort.
 *
 * Examples:
 *   sort([1,4,1,5,3,2,3,3,2,5,2,8,9,2,1],5)
 *   sort([1,4,1,5,3,2,3,3,2,5,2,8,9,2,1],5,0,5)
 */
exports.sort = function (arr, count) {
  if (arr.length == 0) return [];
  count = count || (count > 1 ? count : 10); // default to 10 buckets

  // Find the maximum and minimum values
  var min = arr[0], max = arr[0];
  for (var i = 1; i < arr.length; i++) {
    min = min < arr[i] ? min : arr[i];
    max = max > arr[i] ? max : arr[i];
  }
  var delta = (max - min + 1) / count;
  // console.log(min + "," + max + "," + delta);

  // Initialize buckets; each bucket is a linked list
  var buckets = [];

  // Distribute the data into the buckets
  for (var i = 0; i < arr.length; i++) {
    var idx = Math.floor((arr[i] - min) / delta); // bucket index
    if (buckets[idx]) { // non-empty bucket
      var bucket = buckets[idx];
      var insert = false; // insertion flag
      L.reTraversal(bucket, function (item, done) {
        if (arr[i] <= item.v) { // not greater than the current item: insert before it
          L.append(item, _val(arr[i]));
          insert = true;
          done(); // exit the traversal
        }
      });
      if (!insert) { // greater than every item: append at the end
        L.append(bucket, _val(arr[i]));
      }
    } else { // empty bucket
      var bucket = L.init(); // linked-list implementation
      L.append(bucket, _val(arr[i]));
      buckets[idx] = bucket;
    }
  }

  // Merge the non-empty buckets in order
  var result = [];
  for (var i = 0, j = 0; i < count; i++) {
    if (!buckets[i]) continue; // skip empty buckets
    L.reTraversal(buckets[i], function (item) {
      // console.log(i + ":" + item.v);
      result[j++] = item.v;
    });
  }
  return result;
};

// Linked-list storage object
function _val(v) {
  return { v: v };
}

Run the program:
var algo = require('./index.js');

var data = [ 7, 36, 65, 56, 33, 60, 110, 42, 42, 94, 59, 22, 83, 84, 63, 77, 67, 101 ];
console.log(data);
console.log(algo.bucketsort.sort(data, 5));  // 5 buckets
console.log(algo.bucketsort.sort(data, 10)); // 10 buckets

Output:
[ 7, 36, 65, 56, 33, 60, 110, 42, 42, 94, 59, 22, 83, 84, 63, 77, 67, 101 ]
[ 7, 22, 33, 36, 42, 42, 56, 59, 60, 63, 65, 67, 77, 83, 84, 94, 101, 110 ]
[ 7, 22, 33, 36, 42, 42, 56, 59, 60, 63, 65, 67, 77, 83, 84, 94, 101, 110 ]
Two points need to be explained:
(1) Sorting within a bucket can be done during insertion, as in the program above; alternatively, data can be inserted unsorted and then sorted during the merge step, for example by calling quicksort on each bucket. (Both strategies are sketched after these notes.)
(2) The linked list. Node's internal APIs include a linked-list implementation, but I did not use it directly; instead I called it through the linklist package: https://github.com/nodejs/node-v0.x-archive/blob/master/lib/_linklist.js
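For point (1), the sort-at-merge strategy is what the small array sketch in section 1 already does. The sort-at-insert strategy also works without a linked list; the following hypothetical helper (not part of the program above) keeps an array bucket ordered using binary search:

// Hypothetical helper: insert v into an already-sorted array bucket.
function insertSorted(bucket, v) {
  var lo = 0, hi = bucket.length;
  while (lo < hi) { // binary search for the leftmost position where v fits
    var mid = (lo + hi) >> 1;
    if (bucket[mid] < v) lo = mid + 1; else hi = mid;
  }
  bucket.splice(lo, 0, v); // insert in place, keeping the bucket sorted
}

var bucket = [56, 59, 60, 65];
insertSorted(bucket, 63); // mirrors step (3) of the demonstration
console.log(bucket);      // [ 56, 59, 60, 63, 65 ]

For point (2), this is roughly how the three linklist calls used by the program fit together. It is a sketch assuming the semantics the code above relies on; consult the package itself for the authoritative API:

var L = require('linklist');

var list = L.init();      // create an empty list
L.append(list, { v: 1 }); // append a node holding a value
L.append(list, { v: 2 });

// Visit each node in turn; calling done() stops the traversal early.
L.reTraversal(list, function (item, done) {
  console.log(item.v);
});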
4. Case: Bucket sorting statistics on college entrance examination scores
One of the most famous application scenarios for bucket sorting is tallying college entrance examination scores. In one year there are 9 million candidates nationwide, and the scores are standardized integers with a minimum of 200 and a maximum of 900, no decimals. How should these 9 million numbers be sorted?
Algorithm analysis:
(1) If you use comparison-based sorting, such as quicksort, the average time complexity is O(nlogn): 9,000,000 * log(9,000,000) ≈ 144,114,616, roughly 144 million comparisons.
(2) If you use distribution-based sorting, i.e. bucket sorting, the average complexity can be kept linear. Create 700 buckets, one for each score from 200 to 900; the cost is then O(N) = O(9,000,000), equivalent to scanning the 9 million records once.
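Because each bucket covers exactly one score, no sorting inside the buckets is ever needed, and bucket sort degenerates into a simple tally. Here is a minimal sketch of this special case (an illustration with hypothetical names, separate from the benchmark program below):

// One bucket per integer score in [200, 900]; no in-bucket sorting needed.
function sortScores(scores) {
  var MIN = 200, MAX = 900;
  var counts = [];
  for (var s = MIN; s <= MAX; s++) counts[s - MIN] = 0; // (1) empty buckets
  for (var i = 0; i < scores.length; i++) {
    counts[scores[i] - MIN]++;                          // (2) tally each score
  }
  var result = [];
  for (var s = MIN; s <= MAX; s++) {                    // (3)+(4) read back in order
    for (var k = 0; k < counts[s - MIN]; k++) result.push(s);
  }
  return result;
}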
Let's run a program to compare quicksort and bucket sorting head to head.
// Create 1,000,000 pieces of data in the closed interval [200, 900]
var data = algo.data.randomData(1000 * 1000, 200, 900);

var s1 = new Date().getTime();
algo.quicksort.sort(data); // Quicksort
var s2 = new Date().getTime();
algo.bucketsort.sort(data, 700); // Distribute into 700 buckets
var s3 = new Date().getTime();

console.log("quicksort time: %sms", s2 - s1);
console.log("bucket time: %sms", s3 - s2);

Output:
quicksort time: 14768ms
bucket time: 1089ms
Therefore, bucket sorting is the better fit for the college entrance examination scoring case! Using the appropriate algorithm in the appropriate scenario brings the program performance improvements beyond what hardware alone can deliver.
5. Bucket sorting cost analysis
BUT...
Bucket sorting uses the mapping relationship of a function to eliminate almost all of the comparison work. In effect, computing the f(k) value in bucket sorting plays the same role as partitioning in quicksort: it divides a large amount of data into basically ordered blocks (the buckets). After that, only the small amount of data inside each bucket needs comparison-based sorting.
The time complexity of bucket sorting N keywords is divided into two parts:
(1) Looping over the N keywords to compute each one's bucket mapping f(k); this time complexity is O(N).
(2) Sorting the data in each bucket with a comparison-based sorting algorithm, for a total time complexity of ∑O(Ni*logNi), where Ni is the amount of data in the i-th bucket.
Obviously, part (2) is what determines the performance of bucket sorting. Minimizing the amount of data in each bucket is the only way to improve efficiency (because the best average time complexity achievable by comparison-based sorting is O(N*logN)). Therefore, we should try our best to do the following two things:
(1) Choose a mapping function f(k) that distributes the N data items evenly across the M buckets, so that each bucket holds about [N/M] items.
(2) Increase the number of buckets as much as possible. In the extreme case, each bucket holds only one item, which completely avoids any comparison sorting inside the buckets. Of course, this is not easy to achieve: when the amount of data is huge, f(k) would require an enormous set of buckets, and the wasted space becomes serious. This is a trade-off between time cost and space cost.
For N items to be sorted, M buckets, and an average of [N/M] items per bucket, the average time complexity of bucket sorting is:
O(N) + O(M * (N/M) * log(N/M)) = O(N + N * (logN - logM)) = O(N + N*logN - N*logM)
When N = M, that is, in the limiting case where each bucket holds exactly one item, bucket sorting reaches its best efficiency of O(N).
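As a rough numeric illustration of this formula (using base-2 logarithms and the benchmark sizes from section 4; these are operation estimates, not measurements): with N = 1,000,000 and M = 700, the in-bucket work is about N * (logN - logM) ≈ 1,000,000 * (19.93 - 9.45) ≈ 10.5 million comparisons, plus O(N) = 1 million mapping operations, while a pure comparison sort needs about N * logN ≈ 19.9 million comparisons. The gap widens as M grows toward N.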
6. Summary
The average time complexity of bucket sorting is linear, O(N + C), where C = N * (logN - logM). For the same N, the larger the number of buckets M, the higher the efficiency, and the best time complexity reaches O(N). Of course, the space complexity of bucket sorting is O(N + M); if the input data is very large and the number of buckets is also very large, the space cost is undoubtedly expensive. In addition, bucket sorting is stable, provided the sort applied within each bucket is stable.
Actually, I have one more thought: among search algorithms, the best time complexity of a comparison-based search is O(logN), for example binary search, balanced binary trees, red-black trees, and so on. The hash table, however, offers constant-level search efficiency (reaching O(1) when there are no collisions). Think about it: aren't the ideas behind hash tables and bucket sorting the same song?