My Cassandra cluster runs on EC2 (5 nodes, r6i instance type). Occasionally a node in the cluster becomes unstable and query latency increases.
During the high-latency period, the connection from jmx_exporter also appears unstable (the green line in the graph).
The affected node can serve reads but not writes.
The disk is not full, though:
Filesystem       Size  Used  Avail  Use%  Mounted on
/dev/root         29G  7.7G    22G   27%  /
devtmpfs          62G     0    62G    0%  /dev
tmpfs             62G     0    62G    0%  /dev/shm
tmpfs             13G  960K    13G    1%  /run
tmpfs            5.0M     0   5.0M    0%  /run/lock
tmpfs             62G     0    62G    0%  /sys/fs/cgroup
/dev/loop0        25M   25M      0  100%  /snap/amazon-ssm-agent/6563
/dev/nvme0n1p15  105M  6.1M    99M    6%  /boot/efi
/dev/nvme1n1     2.0T  170G   1.8T    9%  /srv
nodetool status reports all nodes as healthy (host IDs and addresses masked):
# nodetool status
Datacenter: ap-southeast-1
==========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address  Load        Tokens  Owns (effective)  Host ID  Rack
UN  ---      112.17 GiB  256     51.8%             ---      ap-southeast-1b
UN  ---      124.31 GiB  256     65.1%             ---      ap-southeast-1a
UN  ---      128.17 GiB  256     69.0%             ---      ap-southeast-1b
UN  ---      125.68 GiB  256     62.0%             ---      ap-southeast-1a
UN  ---      94.08 GiB   256     52.1%             ---      ap-southeast-1b
I'm using Cassandra 4.1.2
---@---:~$ nodetool describecluster
Cluster Information:
    Name: ---
    Snitch: org.apache.cassandra.locator.Ec2Snitch
    DynamicEndPointSnitch: enabled
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        b237cd5a-fbfd-3fdc-b3ba-d435b41372f2: [---]
Stats for all nodes:
    Live: 5
    Joining: 0
    Moving: 0
    Leaving: 0
    Unreachable: 0
Data Centers:
    ap-southeast-1 #Nodes: 5 #Down: 0
Database versions:
    4.1.2: [---]
After around 30 minutes the node returns to normal on its own. There are no errors in system.log, debug.log, or gc.log.
How should I proceed to debug this? Thanks!
What I have checked so far: nodetool output, logs, and JMX metrics.
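Since the problem clears up by itself after about 30 minutes, the checks above are easy to miss if they are run by hand. A rough sketch of how a snapshot could be captured while latency is high is shown here. The output directory (`OUT_DIR`), the helper names (`run_check`, `snapshot`), and the particular set of commands are assumptions for illustration, not what was actually run on the cluster.

```shell
#!/bin/sh
# Sketch only: OUT_DIR, run_check, snapshot and the choice of commands
# are illustrative assumptions, not taken from the cluster setup above.
OUT_DIR="${OUT_DIR:-/tmp/cassandra-snapshots}"

# Run one command and save its output, or record that the tool is missing.
run_check() {
  out_file="$1"; shift
  if command -v "$1" >/dev/null 2>&1; then
    "$@" >"$out_file" 2>&1 || true
  else
    echo "skipped: $1 not found" >"$out_file"
  fi
}

# Capture one timestamped snapshot and print the directory it was written to.
snapshot() {
  snap_dir="$OUT_DIR/$(date +%Y%m%dT%H%M%S)"
  mkdir -p "$snap_dir"
  # Thread pool pending/blocked tasks and dropped messages
  run_check "$snap_dir/tpstats.txt" nodetool tpstats
  # Pending and active compactions
  run_check "$snap_dir/compactionstats.txt" nodetool compactionstats
  # Coordinator-level read/write latency percentiles
  run_check "$snap_dir/proxyhistograms.txt" nodetool proxyhistograms
  # Garbage collection statistics
  run_check "$snap_dir/gcstats.txt" nodetool gcstats
  # Extended per-device I/O statistics, 3 samples at 1 s intervals (needs sysstat)
  run_check "$snap_dir/iostat.txt" iostat -x 1 3
  # Filesystem usage, as in the df output above
  run_check "$snap_dir/df.txt" df -h
  echo "$snap_dir"
}

snapshot
```

Running this on the affected node during a spike and again once it has recovered would give two sets of files that can be compared side by side.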