springboot Improve handling of cancelled requests when publishing WebClient metrics

Currently, MetricsWebClientFilterFunction does handle the signal via doOnEach() method. When a timeout cancels the pipeline, there's no signal emitted so no metric will be recorded. It seems to me that this can cause the WebClient metrics to be misleading, as they will only show numbers lower than the timeout applied to the pipeline.

Comment From: bclozel

I'm not really sure about this.

We're recording metrics on client operations and we're adding tags using data available. For HTTP clients, we're tagging the "status" as the HTTP status itself, IO_ERROR in case of IOException and CLIENT_ERROR for other errors.

In this case, it seems that the application code is cancelling the operation with a timeout operator. From an infrastructure point of view, we don't know if the cancel signals comes from an expected use case, such as using the fastest responding service and cancelling the others, or an actual application exception that cancels the demand upstream. We don't support the now deprecated AsyncRestTemplate with an AsyncClientHttpRequestInterceptor, so we can't really align with an existing behavior of Future.cancel()

I've got a fix for this issue, but:

it only works for reactor 3.3.0 as I'm leveraging a new operator (so only applicable to Spring Boot 2.2.x)
so far the produced metrics will be outcome:UNKNOWN and status:CLIENT_ERROR, but I'm not sure this really makes sense

From metrics point of view, fixing this problem could show many errors with a timer right at the timeout value. If the timeout is really a feature, then dashboards will be overwhelmed by errors and would create false positives or hide actual errors. If the timeout is considered as an error, then it would help developers to tune their timeout value.

Maybe using a specific CANCEL status tag could at least avoid confusion.

Comment From: rubasace

I was thinking of a specific tag. As you state, UNKNOWN can cause a lot of false positives, while CANCEL will give enough granularity.

Regarding the new operator, can you provide any link to follow up/understand the context?

Also, I'm not sure about why the new operator is needed for this particular scenario, isn't the doOnCancel() good enough? I understand that current implementation depends on the SubscriberContext to store the startTime, and the context won't be available from the doOnCancel() runnable, but would it have any impact if the start time is just stored in a variable outside the pipeline?

Comment From: bclozel

I was thinking of a specific tag. As you state, UNKNOWN can cause a lot of false positives, while CANCEL will give enough granularity.

So you think the following tags outcome:UNKNOWN and status:CANCELLED would work?

Regarding the new operator, can you provide any link to follow up/understand the context?

I'm using Mono.deferredWithContext; there is another way to achieve the same thing but the CANCEL signal is not taken into account in my implementation in that case.

but would it have any impact if the start time is just stored in a variable outside the pipeline?

See this commit for why we're using the reactor context: bdd95f09a49bf02f4fef3d83815303e7d924f898

Comment From: rubasace

Sorry, on the comment above I meant CLIENT_ERROR can cause a lot of false positives.

So you think the following tags outcome:UNKNOWN and status:CANCELLED would work?

I think so, it offers enough information to filter them. Though I don't see the outcome tag now on the http_client_requests_* entries, only on the http_server_requests_* ones. Is this something being introduced on 2.2.x?

Comment From: bclozel

Before adding any new outcome tag value, we need to discuss this with @shakuzen as this use case might happen in with other parts of the infrastructure.

Comment From: rubasace

Sure, there's definitely no rush from my side on such scenario, and definitely I can hack a workaround on my side if needed. Could you, in the meantime, clarify this part? :

Though I don't see the outcome tag now on the http_client_requests_ entries, only on the http_server_requests_ ones. Is this something being introduced on 2.2.x?

Comment From: bclozel

@rubasace yes, see #15594

Comment From: rubasace

@bclozel what's the current implementation behaviour when we are consuming the body as a Flux and we limit it with something like next() or take(n). I implemented something very similar on a custom ExchangeFilterFunction for logging and in those scenarios, I was logging false positives for timeouts as after receiving the requested elements a cancellation signal is sent upstream to stop the producer. I think the same might happen here with the current implementation.

The way I solved it was simply flagging somehow on the doOnNext() operations that we already received something, so if that happens I just ignore the cancellation signal assuming it's desired behaviour. Of course, there's the edge case where we are actually timing out on a Flux if a particular element takes too long, but I think it's an acceptable tradeoff given that we don't know the reason of the cancellation.

What do you think?

Comment From: bclozel

@rubasace this feature is not about recording timeouts, but recording cancelled requests. This can happen for various operators like timeout or take, and in all cases we can't know whether this is the intended behavior or an error.

From a pure WebClient point of view, we're recording cancellations because the request did not complete and we cannot record tags with the response (status, execution time).

The workaround you're describing might work, but we'd still get "false negatives" if the client receives the response headers but no body, or only a partial body. You're right that in this case we could still record metrics twice for the same request: once when receiving the response, and another time if a cancel event happens while reading the body.

I think it's still a good compromise and I'll adapt these changes to only record cancellations if we did not receive any response at all.

Comment From: PyvesB

I also just bumped into this issue recently. Mapping cancelled signals for client requests to CLIENT_ERROR sounds sensible and happens to be consistent with the status tag value that gets used when you configure a ReadTimeoutHandler or WriteTimeoutHandler at the Netty level. Thanks for working on this!