The /actuator/health/configServer endpoint always returns UP as long as the local git cache contains a repo for the application used for the health check. This would be okay if the ConfigServer served only one client app.
However, when it serves multiple clients, this is not helpful. If it has a local git cache for the health-indicator app, it always returns UP, even when the git server is down. But if a different app is requested while the git server is down, that other app will receive a 404. The config server still reports UP, so load balancers won't take it out of the pool. (If it were taken out of the pool, clients configured with multiple URLs would try the next URL.)
Can this health indicator be updated to report health without relying on the local git cache? To keep the current behavior, perhaps it could be made configurable whether the health check uses the local git cache.
Comment From: ryanjbaxter
Do you have the Git Refresh Rate property set? https://github.com/spring-cloud/spring-cloud-config/blob/main/docs/src/main/asciidoc/spring-cloud-config.adoc#git-refresh-rate
Comment From: marnee01
I do have it set. I also tried setting up a special git setting for the health check and set refreshRate to 0. (I verified it was set by putting a breakpoint in JGitEnvironmentRepository.shouldPull and checking the refreshRate value.) But it still uses the local git cache, even if it can't do a pull from the repo. I verified that by turning off the wireless after ConfigServer was up and had performed the first health check. I then called /actuator/health again. It logged an exception, but still returned UP (just as it would when a client sends a standard request to it and the git server is down).
It always uses local git cache if present (which is good overall, but not for the health check).
Comment From: ryanjbaxter
OK I think I have a better understanding now.
I think this behavior makes sense mainly because the config server is still able to serve configuration data to the clients.
Now I can see the argument that this config data is stale and might not be 100% accurate due to not being able to refresh the repo. I am not sure we have the "verbiage" from the HealthIndicator API to describe that state. Maybe UNKNOWN would work, but I don't think DOWN would be accurate.
Ultimately I think the fact that we can't reach the backend repo but yet still serve configuration and maintain functionality despite the problem is beneficial for clients.
Comment From: marnee01
But it may not be able to serve configuration data to the client. It can only serve it if it has a local git cache for the specific app. It might have one app's configs cached, but might not have a different app's configs cached.
So, for example, config server might do a health check very early on startup and cache the configs for the health check app. Then the git server goes down shortly thereafter. Then for the next 80 apps that call the config server, they all receive a 404 because the git server is down and the only thing in the cache is the health check repo.
Meanwhile, the health check is still reporting UP.
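The scenario above can be sketched with a toy model. This is not Spring Cloud Config code: the class ToyConfigServer, its methods, and the app names are hypothetical, and the model only assumes what the thread describes (one local clone per application, a failed fetch is ignored, and the health check is an ordinary lookup of the health-check app).

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

/** Toy model (hypothetical, not the real server): one local clone per application. */
class ToyConfigServer {
    private final Set<String> localClones = new HashSet<>();
    private boolean remoteReachable = true;

    void setRemoteReachable(boolean reachable) {
        this.remoteReachable = reachable;
    }

    /** Returns the config for an app, or empty (a 404 for the client) if it cannot be served. */
    Optional<String> findOne(String application) {
        if (remoteReachable) {
            localClones.add(application); // clone or pull succeeds
        }
        // A failed fetch is ignored: whatever is already on disk gets served.
        return localClones.contains(application)
                ? Optional.of("config-for-" + application)
                : Optional.empty();
    }

    /** Health check modeled as a normal lookup of the configured health-check app. */
    String health(String healthCheckApp) {
        return findOne(healthCheckApp).isPresent() ? "UP" : "DOWN";
    }
}
```

In this model, once health("health-app") has run while the remote is reachable, it keeps returning UP after setRemoteReachable(false), while findOne for any app that was never requested before comes back empty.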
Comment From: ryanjbaxter
Not sure what you mean by the "health check repo".
Assuming all the apps have their configuration data in the repos defined in the config server properties, the health check is going to cause that data to be cloned locally. If the repo is then unreachable, all that data should still be there. The only problem would come if new data is added or changed in those repos while the repo is unreachable.
Comment From: marnee01
Sorry - by "health check repo", I meant the repo or repos that are configured for the config server health indicator.
The health check would only clone locally those repos that it is configured to clone. We can't have it cloning over 80 different repos during each check.
If we wanted to clone locally at startup, we'd probably instead set cloneOnStart on each of those. However, it is not desirable to individually list every single app in the bootstrap file in order to set that property. Also, various teams add new apps all the time, so one could easily get missed unless we set up some sort of process for it. And having it clone that many apps at startup would increase startup time to an unacceptable duration. This is just not feasible. (Currently we use a placeholder in the URI for almost all of our apps, e.g. https://git-codecommit.us-east-1.amazonaws.com/v1/repos/blah_configs-{application}.)
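To illustrate why the set of repos is open-ended with a placeholder URI, here is a minimal sketch of the substitution. The helper RepoUriTemplate is hypothetical and only does a plain string replacement; the server's real placeholder handling may differ. The app name "orders" is made up for the example.

```java
/** Hypothetical helper: expands the {application} placeholder in a repo URI template. */
class RepoUriTemplate {
    static String resolve(String template, String application) {
        // Plain string substitution; each distinct application name yields a distinct repo.
        return template.replace("{application}", application);
    }
}
```

Because the repo is only known once a client names its application, there is no fixed list of repos to clone ahead of time unless every app is enumerated by hand.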
Reason we are looking at this:
I would add that we have set up a secondary config server. One points to CodeCommit, the other to Bitbucket as a backup. Most of our clients are now set up with two URLs: the primary and the secondary. We've implemented our own health indicator that reports as down if too many errors are encountered. And we'll set up our load balancer to call /actuator/health and if it does not report UP, then the load balancer takes it out of the loop and forces clients to call the secondary.
But if we had a health indicator that would report down when the git server is down, that would be more reliable. (The ideal would actually be this: https://github.com/spring-cloud/spring-cloud-config/issues/1845. I hope to be able to create a PR for that one next month. However, our client apps won't be upgraded to pick up that change for probably a couple of years).
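For readers curious what an error-count-based indicator might look like, here is a rough sketch in plain Java. It is not our actual implementation: the class name ErrorRateHealth, the threshold, and the time window are all assumptions made for illustration. In a Spring Boot application the status method would presumably be called from a HealthIndicator's health() method.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch: reports DOWN when too many errors occur within a sliding time window. */
class ErrorRateHealth {
    private final int maxErrors;
    private final long windowMillis;
    private final Deque<Long> errorTimes = new ArrayDeque<>();

    ErrorRateHealth(int maxErrors, long windowMillis) {
        this.maxErrors = maxErrors;
        this.windowMillis = windowMillis;
    }

    /** Call whenever serving a config request fails (for example, a git fetch error). */
    synchronized void recordError(long nowMillis) {
        errorTimes.addLast(nowMillis);
    }

    /** Drops errors older than the window, then compares the remaining count to the threshold. */
    synchronized String status(long nowMillis) {
        while (!errorTimes.isEmpty() && nowMillis - errorTimes.peekFirst() > windowMillis) {
            errorTimes.removeFirst();
        }
        return errorTimes.size() > maxErrors ? "DOWN" : "UP";
    }
}
```

The timestamp is passed in rather than read from the system clock so that the behavior is easy to test. A limitation of this approach is that it only reacts after clients have already received errors, which is why an indicator that checks the git server directly would be more reliable.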
Comment From: ryanjbaxter
Can you provide an example of your git server configuration?
Comment From: marnee01
Here it is. SampleGitServerConfig.txt
Comment From: ryanjbaxter
One thought about how to accommodate this is to set a flag so if the fetch fails the refresh fails. This would not be specific to the health check though, it would also apply to all requests to the config server that use git. Right now when the fetch fails we just move on and since nothing else requires us to use the remote repo it all succeeds.
https://github.com/spring-cloud/spring-cloud-config/blob/d04dad3c1aa4dfb322d300d6ca4675661f6d369a/spring-cloud-config-server/src/main/java/org/springframework/cloud/config/server/environment/JGitEnvironmentRepository.java#L273
@spencergibb do you have any opinion?
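A rough sketch of the flag idea, in plain Java rather than the actual JGitEnvironmentRepository code. The class ToyGitRepository, the Remote interface, and the failOnFetchError flag are hypothetical names used only for illustration; no such property is claimed to exist in Spring Cloud Config.

```java
/** Hypothetical sketch of a repository whose refresh can optionally fail when the fetch fails. */
class ToyGitRepository {
    /** Stand-in for the remote fetch; throws when the git server is unreachable. */
    interface Remote {
        String fetchLatest() throws Exception;
    }

    private final Remote remote;
    private final boolean failOnFetchError; // assumed flag, not an existing property
    private String localCopy;

    ToyGitRepository(Remote remote, boolean failOnFetchError, String initialLocalCopy) {
        this.remote = remote;
        this.failOnFetchError = failOnFetchError;
        this.localCopy = initialLocalCopy;
    }

    String refresh() {
        try {
            localCopy = remote.fetchLatest();
        } catch (Exception e) {
            if (failOnFetchError) {
                // Strict mode: surface the failure so callers (including a health check) see it.
                throw new IllegalStateException("Could not fetch from remote", e);
            }
            // Lenient mode (the behavior described above): ignore and serve the local copy.
        }
        return localCopy;
    }
}
```

With the flag off, refresh() returns the cached copy when the remote throws; with it on, the same failure propagates to every caller, not only the health check, which is the trade-off mentioned above.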
Comment From: spencergibb
Maybe building on https://github.com/spring-cloud/spring-cloud-config/issues/1915#issuecomment-872332492 and a new status API, something could be implemented that doesn't have all the caching and performance tweaks.