My team has recently decided to move from a default READ consistency level of LOCAL_QUORUM to THREE. After this change, the CassandraHealthIndicator can no longer execute the query below successfully. I'm wondering if there's a better test query that would work at all consistency levels?
@Override
protected void doHealthCheck(Health.Builder builder) throws Exception {
    // Query the node-local system table for the Cassandra release version.
    Select select = QueryBuilder.select("release_version").from("system", "local");
    // Runs at the default (or user-configured) consistency level.
    ResultSet results = this.cassandraOperations.getCqlOperations().queryForResultSet(select);
    builder.up().withDetail("version", results.one().getString(0));
}
Comment From: philwebb
The Cassandra health checks were contributed in #2064 by @jdubois. I'd be interested to know if he has any suggestions.
Comment From: jdubois
I would be very surprised if there were a reason for this request to fail at a specific consistency level. Could you provide some documentation, or anything else, to support this claim?
Comment From: ankit--sethi
Here's one GitHub issue discussing an almost identical problem.
Going by what they say -- which seems right based on my knowledge of the system tables -- some consistency levels can never successfully execute against system tables that use LocalStrategy.
The fix for this should be relatively straightforward -- ignore the default (or user-configured) consistency level set within CassandraOperations and explicitly set it to ONE (or any of the other workable values) for the health check, as sketched below.
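For illustration, here's a minimal sketch of that approach against the DataStax 3.x driver used by the indicator above; pinning the level on the statement itself bypasses the configured default (a sketch only, not the actual Spring Boot change):
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.querybuilder.QueryBuilder;
import com.datastax.driver.core.querybuilder.Select;

Select select = QueryBuilder.select("release_version").from("system", "local");
// A per-statement override takes precedence over the session/cluster
// default, so this read succeeds even when the default is TWO or THREE.
select.setConsistencyLevel(ConsistencyLevel.ONE);
ResultSet results = this.cassandraOperations.getCqlOperations().queryForResultSet(select);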
Comment From: jdubois
I'm very surprised that system tables cannot have a high consistency level... In that case, they seem to be a poor choice for checking the cluster's health - lowering the consistency level would make the cluster look healthy when in fact you cannot read or write... So if that's correct, we would need to create a specific table in the database schema, which wouldn't be easy for Spring Boot to use (because people would then need to create it, etc.).
If nobody finds a good solution, I'll check with my friends from DataStax when I'm back from holidays, in about a month.
Comment From: wilkinsona
@jdubois I hope you enjoyed your holidays. Unfortunately, I don't think we've found a good solution for this one in your absence. If you have a moment, could you please check with your friends at DataStax and see what they would recommend?
Comment From: jdubois
Indeed, let me call @bguedes @clun for help!!!
Comment From: wilkinsona
@clun @bguedes If you have a few minutes, we'd be really grateful for your recommendation here.
Comment From: clun
Hi team,
Thank you @wilkinsona for the poke, I missed the last one for some reason. The system keyspace, like any other keyspace, has a replication_factor attribute, and the default may be 1. So, if you add nodes later on and do not increase the replication factor, you will hit some errors.
Try this:
ALTER KEYSPACE system
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy',
                    'data_center_name_1' : 3, 'data_center_name_2' : 3};
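As a side note, a keyspace's current strategy class and per-data-center factors can be inspected before altering anything; a quick check, assuming Cassandra 3.x or later (where schema metadata lives in system_schema):
-- Shows the strategy class and replication factors for one keyspace.
SELECT keyspace_name, replication
FROM system_schema.keyspaces
WHERE keyspace_name = 'system';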
I personally don't like the TWO and THREE consistency levels; they don't seem generic, since they hard-code a replica count instead of scaling with the replication factor. I would go with ALL to ensure that all nodes are up, or LOCAL_QUORUM for a more optimistic approach.
select * from system.local is still an efficient query, I would say, but why not rely on the driver itself?
This is the same for any system-related keyspace. https://docs.datastax.com/en/security/6.7/security/secSystemKeyspace.html
Comment From: wilkinsona
Thanks very much, @clun. Unfortunately, we're not in a position to alter a keyspace and just have to rely upon what the user has configured.
"select * from system.local is still an efficient query, I would say, but why not rely on the driver itself?"
This is intriguing. How would we go about relying on the driver itself? Is there something provided by the driver that we can call to determine Cassandra's health?
Comment From: adutra
I was just made aware of this issue.
The system keyspace is a bit special. It has a replication factor of 1 and uses a special replication strategy called LocalStrategy. Basically, this means that this keyspace is local to each node.
Concretely speaking, this means that querying that keyspace can only work with the following consistency levels: ONE, LOCAL_ONE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL. This is because the quorum for replication factor 1 is 1 (quorum = floor(RF / 2) + 1, which is 1 when RF = 1), so all the aforementioned levels are equivalent to ONE with RF 1.
However, the TWO and THREE consistency levels cannot be met on that keyspace, because they require more live replicas (2 or 3, respectively) than the single replica that exists. You would get the following error:
UnavailableException: Not enough replicas available for query at consistency THREE (3 required but only 1 alive)
As a consequence, queries to system.local MUST force the consistency level to ONE or LOCAL_ONE. I will see if my team can provide a fix for this quickly.
And as a side note: do not use THREE; use QUORUM or ALL.
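To make "force the consistency level" concrete, here's a minimal sketch with the DataStax Java driver 4.x; readReleaseVersion is a hypothetical helper for illustration, not driver or Spring Boot code:
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

// Hypothetical helper: pins LOCAL_ONE on this one statement, leaving the
// session-wide default (which may be TWO or THREE) untouched.
static String readReleaseVersion(CqlSession session) {
    SimpleStatement statement = SimpleStatement
            .newInstance("SELECT release_version FROM system.local")
            .setConsistencyLevel(DefaultConsistencyLevel.LOCAL_ONE);
    return session.execute(statement).one().getString("release_version");
}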
Comment From: jdubois
Thanks so much @adutra !! Don't hesitate to ping me
Comment From: snicoll
Closing in favour of PR #20709
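For context, the driver-based direction clun suggested sidesteps the problem entirely: the driver's cluster metadata can be consulted without issuing any query, so no consistency level is involved at all. A rough sketch of that idea with the DataStax Java driver 4.x (isCassandraUp is a hypothetical helper, not the code from the PR):
import java.util.Collection;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.metadata.Node;
import com.datastax.oss.driver.api.core.metadata.NodeState;

// Hypothetical helper: healthy if the driver currently sees at least one
// node in the UP state. No CQL is executed, so the configured consistency
// level never comes into play.
static boolean isCassandraUp(CqlSession session) {
    Collection<Node> nodes = session.getMetadata().getNodes().values();
    return nodes.stream().anyMatch((node) -> node.getState() == NodeState.UP);
}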