The new health indicator groups feature in Spring Boot 2.2 allows the creation of arbitrary health indicator groups. It would be nice if Spring Boot provided defaults for those when running on Kubernetes.
A programmatic callback (customizer?) with a condition on running on Kubernetes could be provided. If no liveness and readiness groups exist, we could create a liveness group with "ping" only and a readiness group with all the rest. That doesn't help with customizing the roles and the details, but perhaps such things could be configured via properties.
Comment From: bclozel
After a quick discussion with the team, it seems we need to think about this more.
First, we need to consider making that feature available without Actuator on the classpath - if we decide to do so, that would change our previous plan of leveraging health indicator groups.
We also need to reconsider the actual checks made by each probe.
**Readiness** probes should tell whether the application is ready to receive traffic. In our case, we should:
* make sure to wait for `ApplicationReadyEvent` when the application starts
* consider graceful shutdown in that context (see #4657): should we fail that endpoint when shutdown starts?
* if we expose this on a management port, we should check that the actual port responding to application requests is able to do so. There are cases where the management port is ready but the public port is not
**Liveness** probes:
* should be considered carefully and not necessarily exposed by default. It should only be used when the application is stuck with some internal, corrupted state and the best course of action is to kill it
* relying on external checks could make the problem worse: if an external service/database fails, all application instances depending on it are restarted. We should use this only if the application cannot recover from an error
**Startup** probes:
* should be available soon
* tailored for slow startup times (large database migration, building huge local caches, etc)
* not necessarily a different endpoint from a Spring Boot point of view, but maybe something worth documenting
In all cases, it could be useful to document/point to relevant documentation; each probe needs to be configured with a different spec (failure threshold, period).
Comment From: matthyx
@bclozel I am a Kubernetes reviewer and saw this issue on your side... Can we work together on defining the best strategy for each probe?
For instance, there is more than just setting `ReadinessProbe` to true when you are ready - it can be used to give your application some breathing space when you process your queue.
The `LivenessProbe` should fail when your application requires a reset - I have seen cases where that probe relies on a different Spring Context than the application, and continues to work even when the latter crashed after an OOM...
Please let me know how I can help document and implement these. Thanks!
Comment From: ttddyy
Hi,
I just want to share some input on how we implemented our readiness and liveness. It's pretty much in line with what @bclozel commented above.
**Readiness**:
We have a `ReadinessEndpoint` class that simply keeps a boolean value and also takes `HealthIndicator` beans for readiness. (Built on Spring Boot 2.1, so not using indicator groups yet.)
This class is an application context event listener that receives our `ApplicationReadinessEvent`.
The initial value for this readiness bean indicates `NOT_READY` because we do NOT want traffic until the application is ready to serve.
Once the application has started up and bootstrapped the necessary things, the user fires an `ApplicationReadinessEvent` with `value=READY`, and then the readiness endpoint starts returning `value=READY`.
The application needs to decide when to issue the readiness event (`value=READY`) because being ready to serve traffic is up to the application to decide.
Another place that issues a readiness event is our graceful shutdown logic.
When the graceful shutdown logic is initiated (e.g. by receiving `ContextClosedEvent`), the first thing we do is fire an `ApplicationReadinessEvent` with `value=NOT_READY`. This stops the pod from receiving any more requests while shutdown is in progress.
**Liveness**:
Similar to the readiness class, we have a simple boolean to indicate `LIVE`/`NOT_LIVE`.
The differences from readiness are that the initial value is set to `LIVE` and that no event is issued to change the liveness status right after the application is bootstrapped.
There are also considerations for the initial delay and frequency (period) of the readiness/liveness probes in the k8s config. Once the initial delay has passed, we check readiness more often than liveness. Also, this frequency (period) may affect the duration of graceful shutdown.
Currently, our readiness and liveness are implemented as actuator `Endpoint`s, but it does not have to be this way.
As long as there is a boolean value to keep the state and receive application context events, it can be a service bean. Then, if Actuator is available, it can be put into a `HealthIndicator` (`HealthContributor`) and become part of the respective readiness/liveness health groups.
Comment From: wilkinsona
> The application needs to decide when to issue the readiness event (`value=READY`) because being ready to serve traffic is up to the application to decide.
I'm intrigued by this, @ttddyy. Thanks for sharing your thoughts. I had hoped that performing work during application context refresh and in application and command-line runners, followed by the subsequent `ApplicationReadyEvent`, would be sufficient for indicating that the application was ready to start handling traffic. What's missing from the current events and startup sequencing that led you to issue a separate event?
Comment From: ttddyy
Hi @wilkinsona
When an application becomes ready to serve traffic is not necessarily tied to the `ApplicationContext` lifecycle.
If the application is well behaved and fits into the Spring lifecycle, application developers would put initialization logic into a `[Command|Application]Runner`; however, it is not something we can enforce.
Also, the application receiving `ApplicationReadyEvent` doesn't always mean it is ready to serve traffic. It is more like it is ready to perform application logic. The ready-to-serve flag might depend on external resources. An application may connect to a cache cluster upon `ApplicationReadyEvent`, form the cluster, and warm up the local cache; only then does it become ready to serve traffic.
So, we think it is the application's responsibility to decide when to flip the ready flag.
From a Spring Boot perspective, I think it is OK to set `ready=true` at `ApplicationReadyEvent` by default. But there needs to be a way to disable the default ready event and let users manually flip the ready flag, so that the application can determine the timing.
Comment From: spencergibb
Netflix reports similar requirements, and something similar is built into Eureka.
Comment From: matthyx
If I can add to the debate...
`Readiness` isn't a definite state, and devs could use it to give the application instance some time off, to avoid overfilling queues and prevent aggravating situations where processing time and traffic snowball into timeouts.
`Liveness` should fail 100% and immediately when the application cannot recover and requires a kill. I don't know if your implementation ensures that.
Once the `Startup` probe reaches GA, every probe will have a clear and separate meaning:
* `Startup=true`: my application has started and you can verify other probes.
* `Readiness=true`: my application is functioning properly, give me traffic.
* `Readiness=false`: my application cannot handle more traffic at the moment, please remove me from the load balancer pool.
* `Liveness=false`: my application is dead, please kill the container.
Comment From: markpollack
I don't see the need for another event in the lifecycle; nothing forces a user to implement that new event either. We could recommend that a user write a custom health contribution (`HealthContributor`) so that whatever housekeeping needs to be done at the start of an app feeds into the health endpoint. That is an existing mechanism that seems ideal for this use case.
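As an illustration of that suggestion, here is a minimal sketch of such a custom indicator; the `startupHousekeeping` name and the housekeeping flag are hypothetical, not part of Spring Boot:

```java
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

// Hypothetical indicator: reports OUT_OF_SERVICE until start-of-app housekeeping is done.
@Component("startupHousekeeping")
class StartupHousekeepingHealthIndicator implements HealthIndicator {

    private volatile boolean housekeepingDone;

    public void markHousekeepingDone() {
        this.housekeepingDone = true;
    }

    @Override
    public Health health() {
        return this.housekeepingDone ? Health.up().build() : Health.outOfService().build();
    }
}
```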
Comment From: bclozel
Nothing beats feedback and actual experience, so we're going to implement a first version of this with:
- Probes will ship with Actuator.
- Spring Boot will not enable a liveness probe by default. As an opt-in feature, it should use the `ping` endpoint provided by Actuator. Besides a working server, Spring Boot doesn't know enough about internal application state to have opinions about this.
- We will provide a readiness probe with an `ApplicationReadyHealthIndicator` that by default looks for `ApplicationReadyEvent` and `ContextClosedEvent` to change the state of the probe. Developers will be able to create their own instance and configure the event types to look for.
- This readiness health indicator should not be part of the default global health status; if it were, there would be no way to differentiate an application that is not accepting traffic from an unhealthy application (some platforms might just kill the app as a result).
This first step heavily relies on the existing Actuator infrastructure; the only missing piece is whether we can easily exclude this new indicator from the default group.
With our current understanding, this approach has a few advantages:
* reusing existing infrastructure
* flexible, can be combined with other health endpoints
* can be enabled only on k8s platforms
* it's under the `/actuator` URL namespace, so it won't clash with other endpoints
There are some issues as well:
* those probes are part of the actuator child application context and don't share 100% of the main application infrastructure, so the main app might fail and the probe would still keep working
* it requires Actuator, and arguably this should be core to all Spring Boot applications
After experimenting with this and getting feedback from the community, we will improve/reconsider this approach. We could make it independent of Actuator. This requires more design, more infrastructure (MVC, WebFlux, Jersey, etc) and a separate URL path which might clash with existing routes.
@matthyx we're also wondering about the following:
1. are there conventions around the actual URL paths for those probes? (names, regrouped under a single path segment like `/probes`?)
2. are there security considerations around exposing such probes? Does k8s route external requests to probes or should they be protected from Internet traffic? How?
3. you were mentioning readiness as a way to get some breathing space for the application and "avoid overfilling queues". Are you thinking about messaging queues, HTTP server connection queues, threadpools, all of the above? We would like some pointers to other libraries' docs about the features they provide for this.
Comment From: matthyx
@bclozel thanks for taking this subject seriously, I will reply first with the answers I know, and then after some research come back to the other points :-)
> don't share 100% of the main application infrastructure. So the main app might fail and the probe would still keep working
Ok, this should be clearly stated in the documentation, and you should encourage users to implement their own `liveness` probe in the main application loop instead. Too many times I have seen OOM-killed application contexts with the probe still working.
> are there conventions around the actual URL paths for those probes
No, but I can look around and report if I find something.
> Does k8s route external requests to probes or should they be protected from Internet traffic?
No, but depending on the ingress used you could block them. Or it might be possible to bind the probes to a different port that is not exposed to the outside world via service/ingress.
> Are you thinking about messaging queues, HTTP server connection queues, threadpools, all of the above?
Yes, absolutely. I think the feature is underused today because most people think that `readiness` is a definite state (which it's not). I hope that the `startup` probe being enabled by default (still beta though) in 1.18 will help clear up this confusion.
I will do some research about probe usages and report their "default" paths, features and their implementations.
Nice one: https://github.com/nodeshift/kube-probe
GitLab implementation (Ruby on Rails): https://gitlab.com/gitlab-org/gitlab-foss/-/blob/6a9d7c009e4e5975a89bcc3e458da4b3ec484bd1/spec/requests/health_controller_spec.rb
It is very important that all 3 probes do not depend on external dependencies, as correctly stated here.
Comment From: bclozel
After several rounds of draft implementations, the team decided to go with the following.
First, "Liveness" and "Readiness" are promoted as first class application state. Their current state can be queried with the ApplicationStateProvider
, and Spring Boot also provides ways to track changes or update them using Spring Application Events. Spring Boot uses the state of the application context and startup phases to update these states in regular applications.
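For illustration, updating such a state with an application event could look roughly like the sketch below. It uses the `AvailabilityChangeEvent` and `ReadinessState` names from the API as it eventually shipped (see the code sample at the end of this thread); the `ACCEPTING_TRAFFIC` constant and the warm-up trigger are assumptions here and differ from the in-development `ready`/`busy` naming used elsewhere in this discussion:

```java
import org.springframework.boot.availability.AvailabilityChangeEvent;
import org.springframework.boot.availability.ReadinessState;
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.stereotype.Component;

// Sketch: an application-managed component that flips readiness once its own
// warm-up work (an assumed example) has completed.
@Component
public class ReadinessPublisher {

    private final ApplicationEventPublisher publisher;

    public ReadinessPublisher(ApplicationEventPublisher publisher) {
        this.publisher = publisher;
    }

    public void markReady() {
        // Publishes an availability change event; Spring Boot updates the readiness state accordingly.
        AvailabilityChangeEvent.publish(this.publisher, this, ReadinessState.ACCEPTING_TRAFFIC);
    }
}
```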
If Actuator is on the classpath, we then provide additional `HealthIndicator`s and create dedicated health groups at `/actuator/health/liveness` and `/actuator/health/readiness`. Developers can configure additional indicators under those health groups if they wish to.
In general, we're also adding more guidance and opinions about Kubernetes probes and their meaning.
Comment From: matthyx
Awesome @bclozel ! Some questions though:
1. are those probes part of the application context? Is there a chance that the application crashes without the probe reporting it?
2. what about `startup`? Kubernetes 1.18 is out soon and will have it enabled by default...
Thanks for the nice work 👍
Comment From: bclozel
Hey @matthyx
- yes, they're part of the main application context. In case the developers choose to use a separate management context for the probes themselves, we've documented that this might give you an incomplete/wrong picture. The state itself is tracked in the main application context no matter what.
- As far as I understand, startupProbe is a way to reuse an existing probe (such as liveness) and adapt to long startup times. This avoids cases where the platform will kill your application because it took too long to start. It seems that defining a liveness endpoint (that is up as soon as the app is up) and a separate readiness endpoint (that is up when the startup tasks are done) is enough and should not require a separate startup endpoint. It seems that the Spring Boot startup phase sequence is in line with Kubernetes concepts here.
Snapshot documentation should be up soon (I'll ping you with a link to it), you'll get a chance to have a more complete picture with the docs and guidance. I would be glad to get your feedback on that.
Comment From: matthyx
> startupProbe is a way to reuse an existing probe
True, but it doesn't necessarily have to...one key point is that it's no longer launched once it has succeeded.
I will try using it as a "smoke test" probe which could trigger a heavier test that only makes sense at application startup - and that's too heavy to run periodically like the liveness one.
Comment From: bclozel
@matthyx See https://docs.spring.io/spring-boot/docs/2.3.x-SNAPSHOT/reference/html/spring-boot-features.html#boot-features-kubernetes-application-state
As for the startup sequence, here's how probes should behave:
| Startup phase | LivenessState | ReadinessState |
| --- | --- | --- |
| Start the app | broken | busy |
| App. context is starting | broken | busy |
| App. context started OK, web port is UP | live | busy |
| Startup tasks are running | live | busy |
| App is ready for client requests | live | ready |
Comment From: matthyx
Looks good! I saw the other part regarding HTTP probes as well, which is good because `exec` probes should be avoided when possible.
Last week I saw some regressions in containerd causing zombies to be created after each probe exec...
Comment From: ttddyy
Thank you @bclozel and the team for implementing this. It seems I can replace our current readiness/liveness implementation with this new one.
I took a close look and here is my feedback on the current implementation.
**Graceful shutdown and readiness health indicator**
I think there is a problem with the ordering of graceful shutdown and flipping the readiness probe via `ContextClosedEvent`.
`ServletWebServerApplicationContext#doClose` first performs the graceful shutdown (`webServer.shutDownGracefully()`), then it calls `super.doClose()`, which is `AbstractApplicationContext#doClose`, and that publishes `ContextClosedEvent`.
Therefore, while graceful shutdown is in progress, the readiness is still `READY`, which brings traffic to the pod.
Readiness needs to become `BUSY` before graceful shutdown happens. Ideally, once it has become `BUSY`, it should wait for the k8s readiness probe frequency duration so that k8s can pick up the latest readiness state, and then proceed to graceful shutdown.
**Default health indicators for readiness group**
It seems that, by default, the `liveness`/`readiness` health groups will solely have the `livenessProbe`/`readinessProbe` bean, based on `ProbesHealthEndpointGroupsRegistrar`.
For readiness, I think many use cases are to include all health indicators except `livenessProbe` (or maybe even include `livenessProbe`).
To do so, with the current implementation, it is required to set these properties:
```
management.endpoint.health.group.readiness.include=*
management.endpoint.health.group.readiness.exclude=livenessProbe
```
What do you think about including all health indicators but `livenessProbe` in the `readiness` group by default?
**Documentation**
Can you include the [liveness/readiness state behavior table in the comment above](https://github.com/spring-projects/spring-boot/issues/19593#issuecomment-601212444) in the documentation?
It is very informative and intuitive to understand.
Thanks,
**Comment From: wilkinsona**
> Therefore, while graceful shutdown is in progress, the readiness is still READY which brings traffic to the pod.
I don't believe this is the case. Once `SIGTERM` has been sent to the process (which is what will trigger the graceful shutdown), the liveness and readiness probes are no longer called and their responses become irrelevant.
> What do you think including all but livenessProbe health indicators for readiness group by default?
We do not think this should be the default behaviour as it's dangerous for readiness checks to include external services. If one of those services is shared by multiple instances of the app and it goes down, every instance will indicate that it is not ready to receive traffic. This may trigger auto-scaling and the creation of more instances which is likely to only make the problem worse.
If you know that an external service is not shared among multiple instances, you can safely opt in by including its indicator in the readiness group.
**Comment From: scottmf**
> I don't believe this is the case. Once SIGTERM has been sent to the process (which is what will trigger the graceful shutdown), the liveness and readiness probes are no longer called and its response becomes irrelevant.
In a standard k8s topology you route requests to a pod via the service. Are you suggesting that a SIGTERM triggers the kube-controller and propagates the information to the service not to route traffic to its pods? We tested this at scale 1-1.5 yrs back and I'm pretty certain that we found k8s does not work like this. Perhaps something has changed?
Under the hood, services are an abstraction over iptables routing rules. These rules are updated by a kube-controller when an event occurs. I don't think SIGTERM would trigger this event.
Please let me know if (or where) I'm mistaken.
**Comment From: matthyx**
I think @wilkinsona was referring to when the controller kills the pod for instance during a rolling update. In that case I'm almost sure the probes aren't called anymore, but I can check the relevant code and update here if you want.
**Comment From: scottmf**
Sure, that makes sense. Please feel free to check. We're just concerned about handling things in a robust way for all scenarios. In that I think you'd need knowledge of the readiness probes. Our current logic for graceful shutdown is this:
> 1. Send NOT_READY event - this signals Kubernetes that it should no longer route connections to the service web container
> 2. Shutdown all known ExecutorServices - includes anything inherited from ExecutorConfigurationSupport in the spring container
> 3. Sleep for `readinessProbeTimeout` duration - waits for the readiness probe to go into a `NOT_READY` state in kubernetes
> 4. Pause the tomcat connection pool - will cut off any client connections at the TCP layer and allow any task in the Connector pool to complete
> 5. Wait for the remainder of the specified `gracefulShutdownTimeout` for the pools to shutdown
> 6. After this spring can continue to shutdown
We keep our shared framework as in line with Spring as possible. More or less - Spring sets the direction and we blindly follow. This is to reduce our surface area as much as possible. That is why @ttddyy was asking: the changes you are making are going to impact us, since we'll align with them and make it work. Right now we have mechanisms integrated directly into the readiness / liveness probes. This has worked extremely well. In this thread I get the feeling that you all are saying this is a bad pattern, so we are trying to understand why that is.
**Comment From: wilkinsona**
> Are you suggesting that a SIGTERM triggers the kube-controller and propagates the information to the service not to route traffic to its pods? We tested this at scale 1-1.5 yrs back and I'm pretty certain that we found k8s does not work like this. Perhaps something has changed?
It's not `SIGTERM` that triggers this, but the general shutdown processing that Kubernetes orchestrates. This shutdown processing happens in parallel, so there's a window during which traffic will be routed to a pod that has also begun its shutdown processing. This eventual consistency is unfortunate, but my understanding is that the K8S team deem it to be necessary due to the distributed nature of the various components that are involved. The size of the window is both undefined and unaffected by any probe responses.
To avoid requests being routed to a pod that has already received `SIGTERM` and has already begun shutting down, the recommendation is that a sleep should be configured in a pre-stop hook. This sleep should be long enough for new requests to stop being routed to the pod and its duration will vary from deployment to deployment. Times of 5-10 seconds seem to be quite common from what I have seen so that's probably a good starting point. Once the pre-stop hook has completed, `SIGTERM` will be sent to the container and graceful shutdown will begin, allowing any remaining in-flight requests to complete.
[This blog post](https://blog.gruntwork.io/delaying-shutdown-to-wait-for-pod-deletion-propagation-445f779a8304) describes things quite well, albeit with some slight differences as it's talking about Nginx.
**Comment From: scottmf**
thanks @wilkinsona . My concern is not related to the upgrade flow. Sorry if I wasn't clear about that. I understand that, irrespective of any probes, this will work correctly. My concern is around SIGTERM; more specifically, we have flows that require us to restart a container within a pod. For that to work, graceful shutdown needs to communicate to k8s (or an ingress controller) to stop sending traffic to it.
A common scenario for us is key rotation. We need to have the ability to rotate secrets on all of our services. In order to achieve this we have mechanisms built in to restart the containers within a pod by killing the JVM (kill <pid>). We don't want to delete the pod and have k8s reschedule it, as this is a MUCH longer flow. All that is needed here is to update our k8s secret (or other secret store) and simply restart the JVM. During this time we rely on graceful shutdown to do the right thing. The way it is laid out here, that will not work for us.
Any thoughts on that?
**Comment From: wilkinsona**
Thanks for the additional details.
@bclozel has made a change (not yet pushed to master) that publishes the `ReadinessStateChangedEvent` prior to graceful shutdown commencing. This will result in the readiness probe indicating it is not ready before graceful shutdown begins. You could listen for this event and when it's received, perform any logic that you want before the graceful shutdown proceeds. You'd probably want to include a mechanism that knows the source of the shutdown so that the logic is performed only when it's necessary.
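A rough sketch of what such a listener could look like, using the event type as it eventually shipped (`AvailabilityChangeEvent<ReadinessState>`; this thread refers to it as `ReadinessStateChangedEvent` while it was still in flux), with the draining logic itself left as an assumed placeholder:

```java
import org.springframework.boot.availability.AvailabilityChangeEvent;
import org.springframework.boot.availability.ReadinessState;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;

// Sketch: react to the readiness state flipping before graceful shutdown proceeds.
@Component
class PreShutdownDrainer {

    @EventListener
    void onReadinessChange(AvailabilityChangeEvent<ReadinessState> event) {
        if (event.getState() == ReadinessState.REFUSING_TRAFFIC) {
            // Application-specific draining logic (assumed example), e.g. stop polling queues.
            // Checking the event source can tell whether the change was caused by a shutdown.
        }
    }
}
```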
> All that is needed here is to update our k8s secret (or other secret store) and simply restart the jvm.
This sounds a little risky to me. What happens if the restart takes longer than expected and the liveness probe fails because the process is down?
**Comment From: scottmf**
> This sounds a little risky to me. What happens if the restart takes longer than expected and the liveness probe fails because the process is down?
We've seen wonky behavior in general, but not from your scenario. Our `initialDelaySeconds` was carefully crafted to avoid this. BUT, I have seen issues where the startup just hangs. I don't think this is spring, but there is something funky going on. In that case we simply restart again. The outage of one container is fine as we do this in a rolling fashion and we (should be) n+1 in all areas to avoid it impacting us.
It's great that you have the `ReadinessStateChangedEvent`; all we need is an associated timeout to ensure it doesn't shut down too early, and I think we are good to go!
**Comment From: matthyx**
This is looking super good overall. I think Spring Boot is getting one of the most mature health handling implementations that exists to date!
I hope you plan to communicate this in articles, workshops and conferences!
**Comment From: ttddyy**
Thanks for the discussion and implementation. @wilkinsona @scottmf @bclozel
The updated implementation looks great.
This allows us to support both pod delete and restart.
One last thing: I think it would also be better for `ReadinessStateChangedEvent` to have a `cause`.
As @matthyx described regarding readiness usage for draining queues [in the comment above](https://github.com/spring-projects/spring-boot/issues/19593#issuecomment-582630745), it is possible to use readiness for things other than graceful shutdown.
In that case, a listener for `ReadinessStateChangedEvent` needs to differentiate what caused the event so that it can behave differently (e.g. shut down the task executor for graceful shutdown vs. simply finish up all tasks in the executor for draining).
**Comment From: bclozel**
In the meantime, I've updated this and now HTTP Probes are activated if the Kubernetes CloudPlatform is detected, or if the `management.health.probes.enabled` property is set to `true`.
`LivenessProbeHealthIndicator`, `ReadinessProbeHealthIndicator` and the related health groups are not enabled by default for all applications. They're generally useful, but if we enabled them by default, here's what would happen:
* the `LivenessProbeHealthIndicator` and `ReadinessProbeHealthIndicator` would show up in the default group in `/actuator/health`. Many platforms look at this endpoint to figure out whether an application is broken or not. During startup (and especially with `ApplicationRunner` tasks), the readiness probe will report `OUT_OF_SERVICE`, which could trick platforms into restarting the app indefinitely if the configured timeout is too short. In short, we'd be adding a new facet to this that would make the upgrade experience not great...
* we could enable those health indicators but not take them into account in the default group. In this case, we'd have an inconsistent state (a specific group reports broken, the default group reports "all fine").
We can definitely consider flipping that default in a future release with more breaking changes.
**Comment From: philwebb**
@ttddyy
> One last thing is that I think it is also better ReadinessStateChangedEvent to have cause.
According to @matthyx about readiness usage for draining queues in comment above, it is possible to use readiness for other than graceful shutdown.
>
> In that case, listener for ReadinessStateChangedEvent needs to differentiate what caused the event, then it can behave differently. (e.g: shutdown task executor for graceful shutdown v.s. simply finishup all tasks in executor for draining)
We've made a few more refinements that will be available in the next release that make it easier to add your own state types and/or events. To get the last event that actually caused the update you can use the `getLastChangeEvent` method. For example:
```java
import org.springframework.boot.availability.ApplicationAvailability;
import org.springframework.boot.availability.AvailabilityChangeEvent;
import org.springframework.boot.availability.ReadinessState;
import org.springframework.stereotype.Component;

@Component
public class MyComponent {

    private final ApplicationAvailability availability;

    public MyComponent(ApplicationAvailability availability) {
        this.availability = availability;
    }

    public void someMethod() {
        AvailabilityChangeEvent<ReadinessState> event = this.availability.getLastChangeEvent(ReadinessState.class);
        // check the event source or use instanceof if a custom subclass is in use
    }

}
```
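For completeness, a hedged sketch of the "add your own state types" part mentioned above; `LocalCacheState` and `CacheStatePublisher` are hypothetical names, not Spring Boot API:

```java
import org.springframework.boot.availability.AvailabilityChangeEvent;
import org.springframework.boot.availability.AvailabilityState;
import org.springframework.context.ApplicationEventPublisher;

// Hypothetical custom availability state, published the same way as the built-in ones.
enum LocalCacheState implements AvailabilityState {
    WARMED_UP,
    COLD
}

class CacheStatePublisher {

    private final ApplicationEventPublisher publisher;

    CacheStatePublisher(ApplicationEventPublisher publisher) {
        this.publisher = publisher;
    }

    void markWarmedUp() {
        AvailabilityChangeEvent.publish(this.publisher, this, LocalCacheState.WARMED_UP);
    }
}
```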