My Spring Boot based web application relies heavily on "refresh": nearly every configuration value can be refreshed. This now leads to thread starvation on some servers after a refresh.

What I can observe in my heap dump: nearly all threads are waiting for a read lock requested in org.springframework.cloud.context.scope.GenericScope.LockedScopedProxyFactoryBean#invoke: "Lock lock = readWriteLock.readLock();". These threads all wait for the same lock, because they want to call a method on the same refresh-scoped bean.

And there is one thread executing the refresh: it requests the same lock as a write lock in org.springframework.cloud.context.scope.GenericScope#destroy: "Lock lock = this.locks.get(wrapper.getName()).writeLock();" and it is also waiting for the lock.
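For reference, a heavily simplified sketch of this locking pattern (not the actual Spring Cloud source, only an illustration of the per-bean-name read/write lock described above):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified illustration of the per-bean-name locking described above.
// The real classes are org.springframework.cloud.context.scope.GenericScope
// and its LockedScopedProxyFactoryBean; this is only a sketch of the pattern.
class ScopeLockingSketch {

    // One ReadWriteLock per bean *name*, shared by every instance created under that name.
    private final ConcurrentMap<String, ReadWriteLock> locks = new ConcurrentHashMap<>();

    private ReadWriteLock lockFor(String beanName) {
        return locks.computeIfAbsent(beanName, name -> new ReentrantReadWriteLock());
    }

    // Every proxied method call on the refresh-scoped bean takes the read lock.
    Object invoke(String beanName, Callable<Object> call) throws Exception {
        Lock lock = lockFor(beanName).readLock();
        lock.lock();
        try {
            return call.call();
        } finally {
            lock.unlock();
        }
    }

    // Destroying the bean during a refresh takes the write lock for the same name,
    // so it must wait for all in-flight calls, and new calls queue behind it.
    void destroy(String beanName, Runnable destroyCallback) {
        Lock lock = lockFor(beanName).writeLock();
        lock.lock();
        try {
            destroyCallback.run();
        } finally {
            lock.unlock();
        }
    }
}
```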

Some details: my web application is heavily used: there are more than 10 servers in production, and the bean involved is the service that retrieves the advertising for the main page. This takes some time: it determines the ads to show and loads the ad content and images from backends. The result can then be rendered in one go on the main page (=> minimal flickering). Because it takes some time, this is done asynchronously: the main page is shown first and the ads are added later in one go. How the ads are determined is configurable, which is why the bean (service) is a refresh-scoped bean.

In short: a heavily used service running on many servers, with long-running service calls, modeled as a refresh-scoped bean.
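To make the setup more concrete, a minimal sketch of such a service (all class, method, and property names are made up for illustration; it assumes @EnableAsync is configured):

```java
import org.springframework.beans.factory.annotation.Value;
import org.springframework.cloud.context.config.annotation.RefreshScope;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;

import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical ad service: refresh-scoped because the ad-selection rules are
// configurable, and asynchronous because assembling the ads is slow.
@Service
@RefreshScope
public class AdService {

    // Example of a refreshable configuration value (property name is made up).
    @Value("${ads.selection-strategy:default}")
    private String selectionStrategy;

    // Called after the main page is rendered; the result is added to the page in one go.
    @Async
    public CompletableFuture<List<Ad>> loadAdsForMainPage(String userId) {
        List<Ad> ads = determineAds(userId, selectionStrategy); // pick which ads to show
        ads.forEach(this::loadContentAndImages);                // slow backend calls
        return CompletableFuture.completedFuture(ads);
    }

    private List<Ad> determineAds(String userId, String strategy) {
        return List.of(); // placeholder
    }

    private void loadContentAndImages(Ad ad) {
        // placeholder for backend calls
    }

    record Ad(String id) {
    }
}
```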

What happens during a refresh: all servers are informed nearly at the same time, and every server tries to destroy the bean. This means:

- The write lock for the corresponding bean name is requested.
- Acquiring it takes some time, because several threads are still using the bean and hold a read lock for that bean name, and all of these read locks must be released ("unlocked") before the write lock can be granted.
- During this time, no newly arriving thread can get a read lock => state "WAITING". This happens because a write lock has been requested and is waiting to become available (see the sketch below).
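This queueing effect can be reproduced with a plain ReentrantReadWriteLock, independent of Spring (a minimal, self-contained sketch; thread names and timings are arbitrary):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// One long-running reader holds the read lock, a writer then requests the write lock,
// and a second reader arriving afterwards is typically queued behind the waiting writer
// instead of being served immediately.
public class ReadWriteLockQueueDemo {

    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();

        Thread longReader = new Thread(() -> {
            rwl.readLock().lock();
            try {
                System.out.println("reader-1: holding read lock (long service call)");
                sleep(5000);
            } finally {
                rwl.readLock().unlock();
                System.out.println("reader-1: released read lock");
            }
        }, "reader-1");

        Thread writer = new Thread(() -> {
            System.out.println("writer: requesting write lock (bean destroy)");
            rwl.writeLock().lock();
            try {
                System.out.println("writer: got write lock, destroying bean");
            } finally {
                rwl.writeLock().unlock();
            }
        }, "writer");

        Thread lateReader = new Thread(() -> {
            System.out.println("reader-2: requesting read lock (new request)");
            rwl.readLock().lock();
            try {
                System.out.println("reader-2: got read lock");
            } finally {
                rwl.readLock().unlock();
            }
        }, "reader-2");

        longReader.start();
        TimeUnit.MILLISECONDS.sleep(200);
        writer.start();     // the writer now waits for reader-1
        TimeUnit.MILLISECONDS.sleep(200);
        lateReader.start(); // reader-2 usually waits until the writer is done

        longReader.join();
        writer.join();
        lateReader.join();
    }

    private static void sleep(long millis) {
        try {
            TimeUnit.MILLISECONDS.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```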

What you can observe is a kind of "drying out": an increasing number of waiting threads, followed by high pressure on the backends and reduced performance for a while.

With only one server this behavior may be acceptable. But with many servers there is a kind of "wave" that rolls over the servers and potentially kills the weakest one(s). The servers are not perfectly balanced, so when the first server has finished destroying the bean, it puts high pressure on the backends. This further delays the other servers, which are still processing / "drying out". Then the second server finishes => the situation becomes a little worse for the remaining servers ... and so on. With some bad luck, the server(s) with the highest load run(s) out of available threads => starvation.

I think the root problem is that destroying a bean also blocks access to a new instance for the same bean name: the lock is keyed by the bean name, not by the bean instance. But as far as I understand the code, this locking is only needed to ensure that no thread is still using the old bean instance before it is destroyed. The lock seems to be too coarse.
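A possible mitigation (a sketch under the assumption that the service only needs the refreshed values, not a full re-initialization): keep the hot-path service a plain singleton and read the configuration from a @ConfigurationProperties bean, which Spring Cloud rebinds on refresh without routing every method call through the scoped-proxy lock. All names below are made up:

```java
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.stereotype.Component;
import org.springframework.stereotype.Service;

// Hypothetical workaround sketch: the configuration lives in a @ConfigurationProperties
// bean (rebound by Spring Cloud on refresh), while the hot-path service stays a plain
// singleton, so its method calls do not go through the refresh-scope proxy lock.
// Property prefix and field names are made up for illustration.
@Component
@ConfigurationProperties(prefix = "ads")
class AdProperties {

    private String selectionStrategy = "default";

    public String getSelectionStrategy() {
        return selectionStrategy;
    }

    public void setSelectionStrategy(String selectionStrategy) {
        this.selectionStrategy = selectionStrategy;
    }
}

@Service
class AdServiceWithoutRefreshScope {

    private final AdProperties properties;

    AdServiceWithoutRefreshScope(AdProperties properties) {
        this.properties = properties;
    }

    // Reads the current value on every call; after a refresh the rebound
    // properties are picked up without blocking on a bean destroy.
    public String currentStrategy() {
        return properties.getSelectionStrategy();
    }
}
```

Whether this is acceptable depends on whether in-flight requests can tolerate reading values while the properties bean is being rebound.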

Comment From: Mobe91

I am experiencing the same problem with a multi-instance spring-boot application deployment. Requests to the application do not get processed during an ongoing bus refresh as all requests depend on a @RequestScoped bean holding the application configuration.

Are there any plans to address this or guidelines for how to work around it?

Versions used:

- Spring Boot: 3.1.6
- Spring Cloud: 2022.0.3

Comment From: ryanjbaxter

Can you provide a way for us to reproduce the issue?

Comment From: spring-cloud-issues

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

Comment From: spring-cloud-issues

Closing due to lack of requested feedback. If you would like us to look at this issue, please provide the requested information and we will re-open the issue.