Spring Boot Kafka consumer dies permanently


Configuration

Spring Boot - 2.7.8 (some services are on 3.2.2, but we haven't seen the problem there yet)

Spring Kafka - 2.8.11 (some services are on 3.1.2, but we haven't seen the problem there yet)

Confluent 7.4.x on OpenShift 4.12

Microservices (2 pods / instances) deployed via Helm on Kubernetes (OpenShift 4.12)

More info

  1. Each Spring Boot application has a Kafka consumer (an autowired @KafkaListener configured via ConcurrentMessageListenerContainer)
  2. No concurrency is configured on any listener
  3. Services use Spring's RestTemplate to communicate with each other

Problem

After running fine for days or weeks, a Kafka consumer in production suddenly stops consuming, on one or more pods. When this happens, the Prometheus consumption metrics for that consumer/topic become undefined (rather than showing a trend at 0), which suggests the consumer has stopped producing metrics as well. Kafka consumer lag does not help here, because Kafka rebalances all the load to the second pod. We have had instances where the second consumer died as well, causing business impact. So far we have seen this happen in two situations:

  1. The Kafka consumer runs out of heap and dies (a straightforward case that we were able to resolve)
  2. A service calls another service via RestTemplate. This mostly involves two calls: one to Keycloak to get a token (the token is cached, so the endpoint is not hit every time) and the actual service call

While investigating, we found logs showing that a call was made as in #2 above, yet the called API never saw that particular call. At that point the thread appears to be either blocked (we can't figure out on what) or dead, because there are no further logs from it, and that was the last time it consumed anything. There is nothing substantial in the Kafka-related logs.
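To find out what the listener thread is blocked on, one option I am considering is capturing its state and stack the next time a consumer stalls. Below is a minimal JDK-only sketch of that idea. The class name `BlockedThreadReport` and the thread-name fragment are hypothetical; the fragment would need to match the real listener thread names in the service. A `jstack` thread dump of the pod's JVM should show the same information.

```java
import java.util.stream.Collectors;

class BlockedThreadReport {

    /** Returns name, state and stack trace of every live thread whose name contains the fragment. */
    static String describe(String nameFragment) {
        return Thread.getAllStackTraces().entrySet().stream()
                .filter(e -> e.getKey().getName().contains(nameFragment))
                .map(e -> format(e.getKey(), e.getValue()))
                .collect(Collectors.joining("\n"));
    }

    private static String format(Thread thread, StackTraceElement[] stack) {
        StringBuilder sb = new StringBuilder(thread.getName() + " state=" + thread.getState());
        for (StackTraceElement frame : stack) {
            sb.append("\n    at ").append(frame);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for a listener thread stuck in a blocking call (hypothetical name).
        Thread stuck = new Thread(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException ignored) {
                // interrupted at the end of the demo
            }
        }, "demo-listener-C-1");
        stuck.setDaemon(true);
        stuck.start();
        Thread.sleep(200); // give the thread time to reach its blocking call
        System.out.println(describe("demo-listener"));
        stuck.interrupt();
    }
}
```

A thread that is parked in a socket read would show that read at the top of its stack, which would confirm or rule out the RestTemplate call as the blocking point.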

RestTemplate configuration

restTemplateBuilder
                .errorHandler(new RestClientErrorHandler())
                .requestFactory(() -> new BufferingClientHttpRequestFactory(new HttpComponentsClientHttpRequestFactory()))
                .build();

Timeouts are enforced at the load balancer level, not in the RestTemplate. There is also no connection pooling.
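To illustrate why the missing client-side timeout worries me, here is a JDK-only sketch (not our RestTemplate code) of a read against a peer that never replies. The class `ReadTimeoutDemo` and its method are hypothetical names. With a read timeout the call gives up with `SocketTimeoutException`; with a timeout of 0 the same read would wait indefinitely, which is why the sketch rejects non-positive values.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class ReadTimeoutDemo {

    /**
     * Connects to a local server socket that never sends a reply, then attempts
     * a read with the given timeout. Returns true if the read gave up with
     * SocketTimeoutException.
     */
    static boolean readTimesOut(int readTimeoutMillis) {
        if (readTimeoutMillis <= 0) {
            // SO_TIMEOUT of 0 means "block forever", which would hang this demo.
            throw new IllegalArgumentException("timeout must be positive");
        }
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, silentServer.getLocalPort())) {
            client.setSoTimeout(readTimeoutMillis);
            try {
                client.getInputStream().read(); // no data ever arrives
                return false;
            } catch (SocketTimeoutException expected) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(500));
    }
}
```

In Spring Boot 2.7, `RestTemplateBuilder` offers `setConnectTimeout(Duration)` and `setReadTimeout(Duration)`; neither is set in the configuration above.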

Before I investigate RestTemplate further, I want to understand why a blocked Kafka consumer is considered dead. Isn't the heartbeat thread a background thread that should not be blocked by the worker thread?
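My current understanding, which I would like confirmed: heartbeats are indeed sent from a background thread, but the Kafka consumer also enforces `max.poll.interval.ms`, the maximum time allowed between calls to `poll()`. If the listener thread is blocked for longer than that, the consumer leaves the group and its partitions are reassigned, even though the heartbeat thread is still alive. The sketch below only models that check with plain `java.util.Properties`; the class and method names are hypothetical, and the values shown are the client defaults as I understand them, not our actual settings.

```java
import java.time.Duration;
import java.util.Properties;

class PollIntervalSketch {

    /** Consumer settings relevant to the poll-interval check (assumed client defaults). */
    static Properties consumerProps() {
        Properties props = new Properties();
        props.setProperty("max.poll.interval.ms", "300000"); // 5 minutes between poll() calls
        props.setProperty("max.poll.records", "500");        // records handed over per poll()
        return props;
    }

    /** True if the gap between two poll() calls exceeds max.poll.interval.ms. */
    static boolean wouldLeaveGroup(Properties props, Duration timeBetweenPolls) {
        long maxPollIntervalMs = Long.parseLong(props.getProperty("max.poll.interval.ms"));
        return timeBetweenPolls.toMillis() > maxPollIntervalMs;
    }

    public static void main(String[] args) {
        Properties props = consumerProps();
        System.out.println("blocked 4 min: " + wouldLeaveGroup(props, Duration.ofMinutes(4)));
        System.out.println("blocked 6 min: " + wouldLeaveGroup(props, Duration.ofMinutes(6)));
    }
}
```

If this reading is right, a REST call that never returns would explain both the rebalance to the second pod and the consumer never coming back.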

A couple of our apps use Kafka Streams and have hit a similar issue under similar conditions.

I was expecting that when a consumer thread eventually returns from a blocked API call, it would rejoin the consumer group. (RestTemplate itself is fine at this point: other processes within the service are still able to make calls during this time, though to a different host.)
