Cassandra cluster unstable but nodetool shows all nodes are healthy (UN)


My Cassandra cluster runs on EC2 (5 nodes, r6i instance type). Sometimes a node in the cluster becomes unstable and query latency increases.

While latency is high, the connection from jmx_exporter to that node also appears unstable (the green series in my metrics graph).

During that time the node cannot serve writes but can still serve reads.

The disk is not full, though:

Filesystem       Size  Used Avail Use% Mounted on
/dev/root         29G  7.7G   22G  27% /
devtmpfs          62G     0   62G   0% /dev
tmpfs             62G     0   62G   0% /dev/shm
tmpfs             13G  960K   13G   1% /run
tmpfs            5.0M     0  5.0M   0% /run/lock
tmpfs             62G     0   62G   0% /sys/fs/cgroup
/dev/loop0        25M   25M     0 100% /snap/amazon-ssm-agent/6563
/dev/nvme0n1p15  105M  6.1M   99M   6% /boot/efi
/dev/nvme1n1     2.0T  170G  1.8T   9% /srv

nodetool status also reports every node as healthy (host IDs and addresses masked):

# nodetool status
Datacenter: ap-southeast-1
==========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID  Rack
UN  ---  112.17 GiB  256     51.8%             ---  ap-southeast-1b
UN  ---  124.31 GiB  256     65.1%             ---  ap-southeast-1a
UN  ---  128.17 GiB  256     69.0%             ---  ap-southeast-1b
UN  ---  125.68 GiB  256     62.0%             ---  ap-southeast-1a
UN  ---  94.08 GiB   256     52.1%             ---  ap-southeast-1b
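The status output above can also be checked mechanically. The following is only a minimal sketch, assuming the eight-column line layout shown in this post; the function names `parse_nodetool_status` and `summarize` are hypothetical helpers, not part of Cassandra or nodetool.

```python
import re

# Output copied from the post (addresses and host IDs are masked as ---).
SAMPLE = """\
Datacenter: ap-southeast-1
==========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID  Rack
UN  ---  112.17 GiB  256     51.8%             ---  ap-southeast-1b
UN  ---  124.31 GiB  256     65.1%             ---  ap-southeast-1a
UN  ---  128.17 GiB  256     69.0%             ---  ap-southeast-1b
UN  ---  125.68 GiB  256     62.0%             ---  ap-southeast-1a
UN  ---  94.08 GiB   256     52.1%             ---  ap-southeast-1b
"""

# Conversion factors from the load unit printed by nodetool to GiB.
UNIT_TO_GIB = {
    "bytes": 1 / 1024**3,
    "KiB": 1 / 1024**2,
    "MiB": 1 / 1024,
    "GiB": 1.0,
    "TiB": 1024.0,
}


def parse_nodetool_status(text):
    """Return one dict per node line; header and separator lines are skipped."""
    nodes = []
    for line in text.splitlines():
        parts = line.split()
        # Node lines have 8 fields and start with a status/state code (UN, DN, UJ, ...).
        if len(parts) != 8 or not re.fullmatch(r"[UD][NLJM]", parts[0]):
            continue
        state, address, load, unit, tokens, owns, host_id, rack = parts
        if unit not in UNIT_TO_GIB:
            continue
        nodes.append({
            "state": state,
            "address": address,
            "load_gib": float(load) * UNIT_TO_GIB[unit],
            "tokens": int(tokens),
            "owns": owns,
            "host_id": host_id,
            "rack": rack,
        })
    return nodes


def summarize(nodes):
    """Report whether every node is Up/Normal and the spread of on-disk load."""
    if not nodes:
        raise ValueError("no node lines found")
    loads = [n["load_gib"] for n in nodes]
    return {
        "all_up_normal": all(n["state"] == "UN" for n in nodes),
        "min_load_gib": min(loads),
        "max_load_gib": max(loads),
    }


if __name__ == "__main__":
    print(summarize(parse_nodetool_status(SAMPLE)))
```

A check like this only confirms what gossip reports; it says nothing about whether a node that is marked UN is actually able to accept writes.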

I'm using Cassandra 4.1.2

---@---:~$ nodetool describecluster
Cluster Information:
        Name: ---
        Snitch: org.apache.cassandra.locator.Ec2Snitch
        DynamicEndPointSnitch: enabled
        Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
        Schema versions:
                b237cd5a-fbfd-3fdc-b3ba-d435b41372f2: [---]

Stats for all nodes:
        Live: 5
        Joining: 0
        Moving: 0
        Leaving: 0
        Unreachable: 0

Data Centers:
        ap-southeast-1 #Nodes: 5 #Down: 0

Database versions:
        4.1.2: [---]

After some time (around 30 minutes) the node returns to normal on its own. There are no errors in system.log, debug.log, or gc.log.

How can I proceed with debugging this? Thanks!

What I have checked so far: nodetool output, the logs, and JMX metrics.
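One way such checks could be captured repeatedly during the roughly 30-minute window is sketched below. This is an assumption about what might be worth recording, not a known fix: it assumes `nodetool` is on the PATH of the affected node and snapshots three standard subcommands (`tpstats`, `compactionstats`, `proxyhistograms`). The helper names `run_command`, `snapshot`, and `capture` are hypothetical.

```python
import subprocess
import time
from datetime import datetime, timezone

# nodetool subcommands to record on each round (assumed to be the useful ones).
COMMANDS = [
    ["nodetool", "tpstats"],
    ["nodetool", "compactionstats"],
    ["nodetool", "proxyhistograms"],
]


def run_command(cmd):
    """Run one command and return its combined output, or a note if it failed."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    except (subprocess.TimeoutExpired, OSError) as exc:
        # A hang or missing binary is itself worth recording during an incident.
        return f"command failed: {exc!r}\n"
    return result.stdout + result.stderr


def snapshot(runner=run_command, commands=COMMANDS):
    """Return one timestamped text block containing the output of every command."""
    stamp = datetime.now(timezone.utc).isoformat()
    sections = [f"### {stamp} {' '.join(cmd)}\n{runner(cmd)}" for cmd in commands]
    return "\n".join(sections)


def capture(out, rounds, interval_s, runner=run_command):
    """Write `rounds` snapshots to the file-like object `out`, `interval_s` apart."""
    for i in range(rounds):
        out.write(snapshot(runner) + "\n")
        out.flush()
        if i < rounds - 1:
            time.sleep(interval_s)
```

For example, `with open("spike.log", "a") as out: capture(out, rounds=60, interval_s=30)` would record about half an hour of snapshots, which could then be compared against a quiet period.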
