The traditional software stack for Taobao's online applications is Nginx + Velocity + Java, that is:
In this structure, Nginx forwards each request to a Java application, which handles the business logic and renders the data into the final page through a Velocity template.
Introducing Node.js inevitably raises the following questions:

- How should the topology of the technology stack be designed, and which deployment method is scientific and reasonable?
- After the project goes live, how should traffic be divided so that operations remain convenient and fast?
- When an online problem occurs, how can the danger be removed as quickly as possible to avoid larger losses?
- How can the application's health be checked and managed at the load-balancing level?

System topology
As discussed in Thinking and Practice of Front-end/Back-end Separation (II): Template Exploration Based on Front-end/Back-end Separation, Velocity needs to be replaced by Node.js, so that the structure becomes:
This is, of course, the ideal end state. However, introducing a Node.js layer into the traditional stack for the first time is a new attempt. To be safe, we decided to enable the new stack only on the item-collection page of Favorites (shoucang.taobao.com/item_collect.htm), while other pages continue to use the traditional solution. That is, Nginx determines the page type of each request and decides whether to forward it to Node.js or to Java. So the final structure became:
Deployment plan
The structure above looks fine, but in fact new problems were waiting ahead. In the traditional structure, Nginx and Java are deployed on the same server: Nginx listens on port 80 and talks to Java, which listens on the high port 7001. Introducing Node.js means a new process that must listen on its own port. Should Node.js be deployed on the same machines as Nginx + Java, or on a separate cluster?
Let’s compare the characteristics of the two methods:
Taobao Favorites is an application with tens of millions of page views per day, so it has extremely high stability requirements (in fact, online instability is unacceptable for any product). With same-cluster deployment, a release only requires distributing files once and restarting the applications twice, and a rollback only requires a single switch back to the baseline package. Same-cluster deployment also has some theoretical performance advantages (although intranet switch bandwidth and latency are already very good). As for the one-to-many or many-to-one mappings that separate clusters would allow, they might in theory utilize servers more fully, but compared with the stability requirement this is not an urgent problem to solve. So for the Favorites transformation, we chose the same-cluster deployment solution.
Grayscale release
To ensure maximum stability, this transformation did not remove the Velocity code outright. The application cluster has nearly 100 servers, and we introduced traffic gradually at server granularity. In other words, although every server runs both the Java and Node.js processes, whether a given server has the corresponding forwarding rule in its Nginx configuration determines whether requests for the item-collection page hitting that server are handled by Node.js. The Nginx configuration is:
location = "/item_collect.htm" { proxy_pass http://127.0.0.1:6001; # Node.js process listening port}Only servers that have added this Nginx rule will let Node.js handle the corresponding request. Through Nginx configuration, it is very convenient and quick to increase and decrease grayscale traffic, and the cost is very low. If you encounter problems, you can roll back the Nginx configuration directly, and instantly return to the traditional technology stack structure to relieve the danger.
When we first released, we enabled this rule on only two servers, meaning less than 2% of online traffic was processed by Node.js, while the rest was still rendered by Velocity. We then increased the traffic step by step as conditions allowed, and by the third week the rule was enabled on all servers. At that point, 100% of the traffic to the item-collection page in production was rendered by Node.js (you can view the page source and search for the "Node.js" keyword).
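For a quick spot-check of which stack served the page, a throwaway script along these lines works. This is only a sketch: it assumes the page is reachable without login and that the "Node.js" marker appears verbatim in the markup.

```js
var http = require('http');

// Fetch the Favorites item-collection page and report which stack rendered
// it, using the "Node.js" keyword mentioned above as the telltale.
http.get({ host: 'shoucang.taobao.com', path: '/item_collect.htm' },
    function (res) {
        var body = '';
        res.setEncoding('utf8');
        res.on('data', function (chunk) { body += chunk; });
        res.on('end', function () {
            console.log(body.indexOf('Node.js') !== -1
                ? 'rendered by Node.js'
                : 'rendered by Velocity');
        });
    });
```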
Pitfalls
The grayscale rollout did not go entirely smoothly. Before switching traffic over in full, we ran into a number of problems, large and small. Most were tied to our specific business; the one worth sharing is a pitfall in a technical detail.
Health check
In the traditional architecture, the load-balancing scheduler issues a GET request to a specific URL on port 80 of each server once per second, and judges whether the server is working normally by whether the returned HTTP status code is 200. If the request times out after 1s, or the status code is not 200, no traffic is routed to that server, avoiding online problems.
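For concreteness, the scheduler's probe amounts to roughly the following. This is only an illustrative sketch: the real scheduler is not written in Node.js, and the server address below is a placeholder.

```js
var http = require('http');

// Probe one server: healthy only if HTTP 200 comes back within 1 second.
function probe(host, callback) {
    var done = false;
    function finish(healthy) {
        if (!done) { done = true; callback(healthy); }
    }
    var req = http.get({ host: host, port: 80, path: '/status.taobao' },
        function (res) {
            res.resume(); // drain the body so the socket is released
            finish(res.statusCode === 200);
        });
    req.setTimeout(1000, function () {
        req.abort();      // a slow answer counts as unhealthy
        finish(false);
    });
    req.on('error', function () { finish(false); });
}

// Poll every second; an unhealthy result means traffic is cut off.
setInterval(function () {
    probe('10.0.0.1' /* placeholder server address */, function (healthy) {
        console.log(healthy ? 'keep traffic on' : 'cut traffic off');
    });
}, 1000);
```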
The path of this request is Nginx -> Java -> Nginx, which means that as long as 200 is returned, both Nginx and Java on that server are healthy. After introducing Node.js, the path becomes Nginx -> Node.js -> Java -> Node.js -> Nginx. The corresponding code is:
```js
var http = require('http');

app.get('/status.taobao', function (req, res) {
    // Forward the health check to the Java process on the same machine.
    http.get({
        host: '127.1',           // shorthand for 127.0.0.1
        port: 7001,
        path: '/status.taobao'
    }, function (innerRes) {     // distinct name, so the outer res is not shadowed
        res.send(innerRes.statusCode);
    }).on('error', function (err) {
        logger.error(err);
        res.send(404);
    });
});
```

However, during testing we found that about once in every six or seven requests forwarded this way, it took several seconds, or even more than ten, before the Java side responded. This caused the load-balancing scheduler to conclude that the server had failed and to cut off its traffic, even though the server could actually work normally. This was obviously a big problem.
After some digging, we found that by default Node.js uses the http.Agent class to create HTTP connections. This class implements a socket connection pool whose default limit is 5 connections per host + port pair. At the same time, requests issued through http.Agent carry Connection: keep-alive by default, so sockets were not released promptly after a response returned, and requests issued later could only wait in the queue.
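To make the queueing concrete, here is a self-contained sketch. Note the assumptions: modern Node defaults maxSockets to Infinity, so the old limit of 5 is pinned explicitly, and a local dummy server stands in for the Java process.

```js
var http = require('http');

http.globalAgent.maxSockets = 5; // reproduce the old default pool limit

// A stand-in "Java" server that takes one second to answer each request.
var server = http.createServer(function (req, res) {
    setTimeout(function () { res.end('ok'); }, 1000);
});

server.listen(7001, function () {
    var pending = 10;
    for (var i = 1; i <= 10; i++) {
        (function (n) {
            var started = Date.now();
            http.get({ host: '127.0.0.1', port: 7001, path: '/' },
                function (res) {
                    res.resume(); // drain the body so the socket is freed
                    console.log('request ' + n + ' done after ' +
                        (Date.now() - started) + ' ms');
                    if (--pending === 0) server.close();
                });
        })(i);
    }
});
// Expected output: five requests finish after ~1s, the other five after ~2s,
// because they had to wait for a free socket from the 5-connection pool.
```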
In the end there were three candidate solutions:
1. Disable the HTTP Agent, i.e. pass the extra parameter agent: false when calling the get method. The final code is:
```js
var http = require('http');

app.get('/status.taobao', function (req, res) {
    http.get({
        host: '127.1',
        port: 7001,
        agent: false,            // bypass the agent's connection pool entirely
        path: '/status.taobao'
    }, function (innerRes) {
        res.send(innerRes.statusCode);
    }).on('error', function (err) {
        logger.error(err);
        res.send(404);
    });
});
```

2. Raise the global socket limit of the http module's agent:
```js
http.globalAgent.maxSockets = 1000;
```
3. Proactively disconnect as soon as the response returns:
```js
http.get(options, function (res) {
    // handle the response as usual
}).on('socket', function (socket) {
    // Emitting agentRemove tells the agent to stop tracking this socket,
    // so it is closed when the response ends instead of idling in the pool.
    socket.emit('agentRemove');
});
```

In practice, we chose the first option. After this adjustment, the health check showed no further problems.
Conclusion
The practice of combining Node.js with traditional business scenarios has only just begun, and plenty of optimization points remain to be explored in depth. For example, once the Java application is fully centralized, could we try separate-cluster deployment to improve server utilization? Could the release and rollback procedures be made more flexible and controllable? All of these details deserve further study.