Scenario
Deploying a microservice usually means building a new package, killing the running application, replacing the old package with the new one, and starting it again.
Spring Cloud uses Eureka as the registry, and the service list it serves can lag behind reality: even after an application has been removed from the service list, clients will keep requesting its address for a while, and those requests fail.
We can tune the refresh intervals to make the service list more up to date, but no matter what, a window of inconsistent data is unavoidable.
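For reference, the intervals that bound this inconsistency window are configurable. A sketch of the relevant settings (the values here are illustrative, not from our setup):

```yaml
eureka:
  client:
    # how often the client pulls the service list from Eureka (default 30s)
    registry-fetch-interval-seconds: 5

ribbon:
  # how often Ribbon refreshes its own server list (default 30000 ms)
  ServerListRefreshInterval: 5000
```

Shorter intervals narrow the window but increase load on the registry, and the window never closes completely.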
So one approach is a retry mechanism. When machine A is restarting, machine B in the same cluster can still serve requests normally. With a retry mechanism, a request that fails against A can be retried against B, and the caller still gets a correct response.
Operation
The following steps are required. First, the Ribbon configuration:
```yaml
ribbon:
  ReadTimeout: 10000
  ConnectTimeout: 10000
  # retries against the same instance (excluding the first attempt)
  MaxAutoRetries: 0
  # how many other instances to try next (here: one more machine)
  MaxAutoRetriesNextServer: 1
  # false: only idempotent (GET) requests are retried
  OkToRetryOnAllOperations: false
```
Next, introduce the spring-retry dependency (no explicit version is declared, so it should be managed by a parent BOM):
```xml
<dependency>
    <groupId>org.springframework.retry</groupId>
    <artifactId>spring-retry</artifactId>
</dependency>
```
Taking Zuul as an example, retry must also be enabled explicitly:
```properties
zuul.retryable=true
```
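For completeness, here is a minimal sketch of a Zuul entry point that the configuration above would apply to (the class name is mine, not from the original project):

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.netflix.zuul.EnableZuulProxy;

// Minimal Zuul gateway. With spring-retry on the classpath and the
// ribbon/zuul properties above, a request that fails against an offline
// instance should be retried against the next instance of the service.
@SpringBootApplication
@EnableZuulProxy
public class GatewayApplication {

    public static void main(String[] args) {
        SpringApplication.run(GatewayApplication.class, args);
    }
}
```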
A problem encountered
However, things did not go so smoothly. In testing, the retry mechanism did take effect, but it did not send the retry to another healthy machine as I expected. So I had to dig into the open source code, and eventually found that it was a source code bug, one that had already been fixed; upgrading the version resolves it.
Code Analysis
The versions used are:
spring-cloud-netflix-core:1.3.6.RELEASE
spring-retry:1.2.1.RELEASE
The Spring Cloud dependency declaration:
```xml
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-dependencies</artifactId>
            <version>${spring-cloud.version}</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>
```

Because retry is enabled, the RetryableRibbonLoadBalancingHttpClient.execute method is executed when a request is made to the application:
```java
public RibbonApacheHttpResponse execute(final RibbonApacheHttpRequest request, final IClientConfig configOverride) throws Exception {
    final RequestConfig.Builder builder = RequestConfig.custom();
    IClientConfig config = configOverride != null ? configOverride : this.config;
    builder.setConnectTimeout(config.get(
            CommonClientConfigKey.ConnectTimeout, this.connectTimeout));
    builder.setSocketTimeout(config.get(
            CommonClientConfigKey.ReadTimeout, this.readTimeout));
    builder.setRedirectsEnabled(config.get(
            CommonClientConfigKey.FollowRedirects, this.followRedirects));

    final RequestConfig requestConfig = builder.build();
    final LoadBalancedRetryPolicy retryPolicy = loadBalancedRetryPolicyFactory.create(this.getClientName(), this);
    RetryCallback retryCallback = new RetryCallback() {
        @Override
        public RibbonApacheHttpResponse doWithRetry(RetryContext context) throws Exception {
            //on retries the policy will choose the server and set it in the context
            //extract the server and update the request being made
            RibbonApacheHttpRequest newRequest = request;
            if (context instanceof LoadBalancedRetryContext) {
                ServiceInstance service = ((LoadBalancedRetryContext) context).getServiceInstance();
                if (service != null) {
                    //Reconstruct the request URI using the host and port set in the retry context
                    newRequest = newRequest.withNewUri(new URI(service.getUri().getScheme(),
                            newRequest.getURI().getUserInfo(), service.getHost(), service.getPort(),
                            newRequest.getURI().getPath(), newRequest.getURI().getQuery(),
                            newRequest.getURI().getFragment()));
                }
            }
            newRequest = getSecureRequest(request, configOverride);
            HttpUriRequest httpUriRequest = newRequest.toRequest(requestConfig);
            final HttpResponse httpResponse = RetryableRibbonLoadBalancingHttpClient.this.delegate.execute(httpUriRequest);
            if (retryPolicy.retryableStatusCode(httpResponse.getStatusLine().getStatusCode())) {
                if (CloseableHttpResponse.class.isInstance(httpResponse)) {
                    ((CloseableHttpResponse) httpResponse).close();
                }
                throw new RetryableStatusCodeException(RetryableRibbonLoadBalancingHttpClient.this.clientName,
                        httpResponse.getStatusLine().getStatusCode());
            }
            return new RibbonApacheHttpResponse(httpResponse, httpUriRequest.getURI());
        }
    };
    return this.executeWithRetry(request, retryPolicy, retryCallback);
}
```

We can see that a RetryCallback is created first, and then this.executeWithRetry(request, retryPolicy, retryCallback) is executed.
We can clearly see that RetryCallback.doWithRetry contains the code that actually issues the request, which means that this.executeWithRetry will eventually call RetryCallback.doWithRetry.
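Before going into the framework internals, a minimal, self-contained spring-retry example may make that contract clearer (illustrative only; the class and values are mine): RetryTemplate.execute keeps invoking doWithRetry until it succeeds or the policy gives up.

```java
import org.springframework.retry.RetryCallback;
import org.springframework.retry.RetryContext;
import org.springframework.retry.backoff.FixedBackOffPolicy;
import org.springframework.retry.policy.SimpleRetryPolicy;
import org.springframework.retry.support.RetryTemplate;

public class RetryTemplateDemo {

    public static void main(String[] args) throws Exception {
        RetryTemplate template = new RetryTemplate();

        // At most 2 attempts in total: the original call plus one retry,
        // mirroring MaxAutoRetriesNextServer: 1 above.
        template.setRetryPolicy(new SimpleRetryPolicy(2));

        FixedBackOffPolicy backOffPolicy = new FixedBackOffPolicy();
        backOffPolicy.setBackOffPeriod(100L); // wait 100 ms between attempts
        template.setBackOffPolicy(backOffPolicy);

        // execute() keeps calling doWithRetry() until it succeeds
        // or the retry policy refuses further attempts.
        String result = template.execute(new RetryCallback<String, Exception>() {
            @Override
            public String doWithRetry(RetryContext context) throws Exception {
                System.out.println("attempt " + (context.getRetryCount() + 1));
                if (context.getRetryCount() == 0) {
                    throw new RuntimeException("first attempt fails");
                }
                return "ok";
            }
        });

        System.out.println(result); // prints "ok" on the second attempt
    }
}
```

Back to our call chain: executeWithRetry delegates to spring-retry, and the heart of it is RetryTemplate.doExecute: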
```java
protected <T, E extends Throwable> T doExecute(RetryCallback<T, E> retryCallback,
        RecoveryCallback<T> recoveryCallback, RetryState state)
        throws E, ExhaustedRetryException {

    RetryPolicy retryPolicy = this.retryPolicy;
    BackOffPolicy backOffPolicy = this.backOffPolicy;

    // Allow the retry policy to initialise itself...
    RetryContext context = open(retryPolicy, state);
    if (this.logger.isTraceEnabled()) {
        this.logger.trace("RetryContext retrieved: " + context);
    }

    // Make sure the context is available globally for clients who need
    // it...
    RetrySynchronizationManager.register(context);

    Throwable lastException = null;

    boolean exhausted = false;
    try {

        // Give clients a chance to enhance the context...
        boolean running = doOpenInterceptors(retryCallback, context);

        if (!running) {
            throw new TerminatedRetryException(
                    "Retry terminated abnormally by interceptor before first attempt");
        }

        // Get or Start the backoff context...
        BackOffContext backOffContext = null;
        Object resource = context.getAttribute("backOffContext");

        if (resource instanceof BackOffContext) {
            backOffContext = (BackOffContext) resource;
        }

        if (backOffContext == null) {
            backOffContext = backOffPolicy.start(context);
            if (backOffContext != null) {
                context.setAttribute("backOffContext", backOffContext);
            }
        }

        /*
         * We allow the whole loop to be skipped if the policy or context already
         * forbid the first try. This is used in the case of external retry to allow a
         * recovery in handleRetryExhausted without the callback processing (which
         * would throw an exception).
         */
        while (canRetry(retryPolicy, context) && !context.isExhaustedOnly()) {

            try {
                if (this.logger.isDebugEnabled()) {
                    this.logger.debug("Retry: count=" + context.getRetryCount());
                }
                // Reset the last exception, so if we are successful
                // the close interceptors will not think we failed...
                lastException = null;
                return retryCallback.doWithRetry(context);
            }
            catch (Throwable e) {

                lastException = e;

                try {
                    registerThrowable(retryPolicy, state, context, e);
                }
                catch (Exception ex) {
                    throw new TerminatedRetryException("Could not register throwable", ex);
                }
                finally {
                    doOnErrorInterceptors(retryCallback, context, e);
                }

                if (canRetry(retryPolicy, context) && !context.isExhaustedOnly()) {
                    try {
                        backOffPolicy.backOff(backOffContext);
                    }
                    catch (BackOffInterruptedException ex) {
                        lastException = e;
                        // back off was prevented by another thread - fail the retry
                        if (this.logger.isDebugEnabled()) {
                            this.logger.debug("Abort retry because interrupted: count="
                                    + context.getRetryCount());
                        }
                        throw ex;
                    }
                }

                if (this.logger.isDebugEnabled()) {
                    this.logger.debug(
                            "Checking for rethrow: count=" + context.getRetryCount());
                }

                if (shouldRethrow(retryPolicy, context, state)) {
                    if (this.logger.isDebugEnabled()) {
                        this.logger.debug("Rethrow in retry for policy: count="
                                + context.getRetryCount());
                    }
                    throw RetryTemplate.<E>wrapIfNecessary(e);
                }

            }

            /*
             * A stateful attempt that can retry may rethrow the exception before now,
             * but if we get this far in a stateful retry there's a reason for it,
             * like a circuit breaker or a rollback classifier.
             */
            if (state != null && context.hasAttribute(GLOBAL_STATE)) {
                break;
            }
        }

        if (state == null && this.logger.isDebugEnabled()) {
            this.logger.debug(
                    "Retry failed last attempt: count=" + context.getRetryCount());
        }

        exhausted = true;
        return handleRetryExhausted(recoveryCallback, context, state);

    }
    catch (Throwable e) {
        throw RetryTemplate.<E>wrapIfNecessary(e);
    }
    finally {
        close(retryPolicy, context, state, lastException == null || exhausted);
        doCloseInterceptors(retryCallback, context, lastException);
        RetrySynchronizationManager.clear();
    }

}
```

The retry mechanism is implemented in a while loop. When retryCallback.doWithRetry(context) throws, the exception is caught and retryPolicy decides whether to retry. Pay special attention to registerThrowable(retryPolicy, state, context, e): it does not just feed the retry decision; when a retry is going to happen, it also selects a new machine and puts it into the context, which is then passed back into retryCallback.doWithRetry(context), so the retry should go to the newly selected machine.
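That hand-off through the context is the crux, so here is a simplified sketch of the idea (hypothetical classes, not the Ribbon implementation): on failure, the policy writes the next server into the context, and the next attempt is supposed to read it back from there.

```java
import java.util.Arrays;
import java.util.List;

public class ContextHandOffSketch {

    // plays the role of LoadBalancedRetryContext
    static class Context {
        String serviceInstance;
        int retryCount;
    }

    static String call(String server) {
        if ("machine-a".equals(server)) {
            throw new RuntimeException("machine-a is restarting");
        }
        return "response from " + server;
    }

    // plays the role of doExecute(): attempt, catch, register, retry
    static String executeWithRetry(List<String> servers, Context ctx) {
        ctx.serviceInstance = servers.get(0);
        while (true) {
            try {
                // plays the role of doWithRetry(): it MUST read the
                // server from the context on every attempt
                return call(ctx.serviceInstance);
            }
            catch (RuntimeException e) {
                ctx.retryCount++;
                if (ctx.retryCount > 1) {
                    throw e; // retries exhausted, rethrow
                }
                // the "registerThrowable" step: put a new server into the context
                ctx.serviceInstance = servers.get(ctx.retryCount % servers.size());
            }
        }
    }

    public static void main(String[] args) {
        String result = executeWithRetry(Arrays.asList("machine-a", "machine-b"), new Context());
        System.out.println(result); // response from machine-b
    }
}
```

The fail-over works only because every attempt re-reads the server from the context; anything that breaks that read breaks the whole mechanism.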
But why did my configuration not switch machines? Debugging showed that registerThrowable(retryPolicy, state, context, e) selected the machine correctly: it was a new, healthy one. Yet when retryCallback.doWithRetry(context) executed, the request still went to the old machine.
So let's take a closer look at the code of retryCallback.doWithRetry(context):
We found this line of code:
```java
newRequest = getSecureRequest(request, configOverride);
```

where getSecureRequest is:

```java
protected RibbonApacheHttpRequest getSecureRequest(RibbonApacheHttpRequest request, IClientConfig configOverride) {
    if (isSecure(configOverride)) {
        final URI secureUri = UriComponentsBuilder.fromUri(request.getUri())
                .scheme("https").build(true).toUri();
        return request.withNewUri(secureUri);
    }
    return request;
}
```

At this point newRequest has already been rebuilt from the retry context a few lines earlier, while request still refers to the previous attempt. As soon as this line runs, newRequest is overwritten with a value derived from request, so the server chosen by the retry context is always discarded. When we saw this, we knew it was a source code bug.
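Logically, the fix is to apply the secure-scheme handling to the rebuilt request rather than the original one; a sketch of the corrected line (the surrounding code unchanged, per the issue linked below):

```java
// corrected: operate on newRequest, so the server chosen by the
// retry context is preserved across the retry
newRequest = getSecureRequest(newRequest, configOverride);
```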
Issue address: https://github.com/spring-cloud/spring-cloud-netflix/issues/2667
Summary
This was a very ordinary troubleshooting process. When I found that the configuration did not behave as I expected, I first double-checked the meaning of each setting and tried many variations without success. Then I set breakpoints and debugged, and found the anomaly there. Because the scenario requires one healthy machine and one offline machine, I simulated it hundreds of times before I finally located this line of code. Even an excellent open source project inevitably has bugs, so do not treat it as infallible or follow it blindly. On the other hand, the ability to read source code is an important problem-solving skill: for example, just finding the right entry point into the source took me a lot of time.