Hi,

since yesterday the number of "failed to create volume" errors seems to have reached a level that "forces" me to open this issue.

I wonder if there's anything we can do from the Spring Boot perspective to investigate what's going on as long as https://github.com/concourse/concourse/issues/800 remains unresolved.

Maybe we could add some "debug" commands to the build pipeline scripts, e.g. lsof | wc -l to show the number of open files, printing the ulimits, some Docker diagnostics like docker system df, and so on.
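Just to illustrate the idea, a rough sketch of such a diagnostics step could look like the following (the script name and the exact set of commands are only a suggestion, nothing that exists in the pipeline today):

```sh
#!/bin/bash
# ci/scripts/diagnostics.sh (hypothetical) - print a few data points that might
# help correlate "failed to create volume" errors with resource exhaustion.
set -u

echo "--- open files ---"
lsof 2>/dev/null | wc -l    # total number of open file handles visible to this task

echo "--- ulimits ---"
ulimit -a                   # per-process limits (open files, processes, etc.)

echo "--- disk usage ---"
df -h                       # disk usage of the mounted volumes

echo "--- docker ---"
# Only meaningful if the task actually has access to a Docker daemon.
docker system df 2>/dev/null || echo "docker not available in this task"
```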

Feel free to close this issue, of course, in case you already know what's going on. But I thought it might be worth tracking the findings in a central place.

Cheers, Christoph

Comment From: wilkinsona

I'm sure this is frustrating for you, @dreis2211. Thank you for raising it in such a constructive manner.

Unfortunately, based on my (rather superficial) understanding of how Concourse works, I don't think the diagnostics that we add to an individual task will help. I believe the failure is occurring outside of that task and possibly outside of the container in which it's running. We've escalated the problem with the Concourse team so hopefully we'll get some guidance soon on what we can do to diagnose the cause.

Given the timing of the volume creation failure – once a task appears to have completed – my current theory is that the failure occurs when Concourse is attempting to honour the task's cache configuration. We use this configuration to share the local Maven and Gradle caches between runs. This caching is task- and worker-specific, i.e. we end up with a volume per task and per worker. The theory is backed up a little by the fact that things improved when we recreated all of the workers, since new workers start with fresh caches.

I'd like to test my theory further by deleting the task-cache volumes, but there's no way for me to do that right now. We can perhaps take the more severe option of asking for all of the workers to be recreated. Another option would be to disable the task caching. That would have a negative impact on build times, but it may well be better than the current situation.
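For anyone with fly access to the CI instance, a quick way to see whether task-cache volumes are piling up would be something like the sketch below. The target name is a placeholder, and it assumes the volume listing labels task caches with a task-cache type:

```sh
# "spring" is a placeholder fly target; adjust to however ci.spring.io is configured.
# List every volume the cluster currently knows about.
fly -t spring volumes

# Count how many of them are task caches (assuming the type column shows "task-cache").
fly -t spring volumes | grep -c task-cache
```

As far as I know, fly does not offer a command to delete an individual volume, which is why the practical options come down to recreating the workers or disabling the caching.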

Comment From: wilkinsona

The workers were recreated over the weekend. That seems to have improved the situation a little, but volume creation failures are still occurring: 3 in build and 1 in build-pull-requests so far today.

We're continuing to work with the Concourse team to get to the bottom of the problem.

Comment From: dreis2211

Thanks for the heads-up. I really think https://github.com/concourse/concourse/issues/800 could help here.

Comment From: dreis2211

I don't want to jinx it, but it seems that on the current master this is happening less frequently now. Did anything change, @wilkinsona? I suspected my PR #20157, but there were failures afterwards.

Comment From: dreis2211

I jinxed it 😆 https://ci.spring.io/builds/107398

Comment From: wilkinsona

The only change that I'm aware of is that @trevormarshall has been recreating the workers on a semi-regular basis. Trevor, any news from the Concourse team that might move things towards a solution?

Comment From: wilkinsona

Our current strategy of recreating the workers periodically seems to be working. Any further improvements will require the involvement of the Concourse team and, perhaps, some changes in Concourse itself. I don't expect any changes in Boot's pipelines, so I'm going to close this one for now. We can re-open it if it transpires that we do need to make some changes.

Comment From: dreis2211

It improved drastically, so totally agree. Thanks @wilkinsona

Comment From: trevormarshall

We had been monitoring the cluster for some time before posting an update here. We have not needed to recreate the workers for 3 weeks now. 21 days ago we re-deployed the Concourse workers to a secondary AZ, on different hardware, with newer CPUs and flash storage. This might point to the overlay driver, and we will report back in the Concourse issue to help with their data collection.