Multiple replicas of the same application: how to split data equally across pods to improve performance (Kubernetes/OpenShift)?


My Application:

I have one production application named "Realtime Student Streaming Application". It is connected to MongoDB1, which contains a Student table (collection).

My Task:

  1. My application listens for any insertion into the Student table.
  2. When an insertion happens, Mongo passes the inserted record to my listener class.
  3. When the record reaches my Java class, I insert it into a second database called MongoDB2 (see the sketch after this list).
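
For reference, the listener described above is essentially a MongoDB change stream. Below is a minimal sketch, assuming the MongoDB Java sync driver; the connection strings, database name, and collection names are placeholders, not values from the question:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import com.mongodb.client.model.changestream.OperationType;
import org.bson.Document;

public class StudentStreamListener {

    public static void main(String[] args) {
        // Source and target clusters; connection strings are placeholders.
        // Note: change streams require MongoDB1 to run as a replica set.
        MongoClient source = MongoClients.create("mongodb://mongodb1:27017");
        MongoClient target = MongoClients.create("mongodb://mongodb2:27017");

        MongoCollection<Document> students =
                source.getDatabase("school").getCollection("student");
        MongoCollection<Document> copy =
                target.getDatabase("school").getCollection("student");

        // Every pod running this code watches the same collection and therefore
        // receives every event -- which is exactly the behaviour observed below
        // with 5 replicas.
        students.watch().forEach((ChangeStreamDocument<Document> change) -> {
            if (change.getOperationType() == OperationType.INSERT) {
                copy.insertOne(change.getFullDocument());
            }
        });
    }
}
```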

What I did:

  1. I deployed my application to an OpenShift cluster, and it runs with 5 pods.
  2. In this setup, whenever an insertion happens in MongoDB1, all 5 pods receive the inserted data.

Current Output:

Pod 1: processes inserted documents 1, 2, 3, 4, 5

Pod 2: processes inserted documents 1, 2, 3, 4, 5

Pod 3: processes inserted documents 1, 2, 3, 4, 5

Pod 4: processes inserted documents 1, 2, 3, 4, 5

Pod 5: processes inserted documents 1, 2, 3, 4, 5

Expected Output:

Pod 1: should process only inserted document 1

Pod 2: should process only inserted document 2

Pod 3: should process only inserted document 3

Pod 4: should process only inserted document 4

Pod 5: should process only inserted document 5

Q1: If I insert 5 documents, how can they be shared between the 5 pods?

Q2: In general, when using Kubernetes/OpenShift, how do you split data equally between multiple pods?


1 Answer

Answered by David Ogren:

This really isn't an OpenShift or Kubernetes question; it's an application design question. Fundamentally, it's up to you to shard the data in some fashion. That is a difficult problem to solve, and it gets even tougher once you start tackling transactionality, scaling up/down, and preserving message order.

Therefore, one common way to do this is to use Kafka. Either manually, or with something like Debezium, stream the changes from the DB into a Kafka topic. You can then use Kafka partitions and consumer groups to automatically divide the work between a dynamic number of workers. This may seem like overkill, but it solves a lot of the problems I mentioned above.
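
As a rough illustration of that consumer-group approach, here is a minimal sketch. The topic name `student-inserts`, the group id, the broker address, and the use of plain String values are all assumptions for illustration; every pod runs the same code with the same `group.id`, so Kafka assigns each partition to exactly one pod:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class StudentInsertWorker {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");       // placeholder broker address
        props.put("group.id", "student-copy-workers");      // same group id in every pod
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // All pods subscribe with the same group id, so Kafka assigns each
            // partition of the topic to exactly one pod in the group.
            consumer.subscribe(List.of("student-inserts"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Insert record.value() into MongoDB2 here.
                    System.out.println("Processing " + record.value());
                }
            }
        }
    }
}
```

With a topic of 5 partitions and 5 pods in the group, each pod receives roughly one fifth of the inserts, and Kafka rebalances the partitions automatically when pods are added or removed.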

Of course, you can also do this yourself: shard the work manually and have each worker use Mongo features like findAndModify to "lock" specific updates, then update them again once completed. But then you have to build a lot of those features yourself, including recovery if a worker locks an update but fails before processing the change.
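
A sketch of that manual approach, using the Java driver's findOneAndUpdate (the modern form of findAndModify). The `status` and `owner` fields, the polling loop, and the connection strings are assumptions for illustration, not part of the original setup:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class ManualShardWorker {

    public static void main(String[] args) {
        // Use the pod's hostname as a worker identity.
        String podName = System.getenv().getOrDefault("HOSTNAME", "worker-0");

        MongoClient source = MongoClients.create("mongodb://mongodb1:27017"); // placeholder
        MongoClient target = MongoClients.create("mongodb://mongodb2:27017"); // placeholder

        MongoCollection<Document> students =
                source.getDatabase("school").getCollection("student");
        MongoCollection<Document> copy =
                target.getDatabase("school").getCollection("student");

        while (true) {
            // Atomically claim one unprocessed document so no other pod picks it up.
            Document claimed = students.findOneAndUpdate(
                    Filters.eq("status", "NEW"),
                    Updates.combine(
                            Updates.set("status", "IN_PROGRESS"),
                            Updates.set("owner", podName)));

            if (claimed != null) {
                copy.insertOne(claimed);
                students.updateOne(
                        Filters.eq("_id", claimed.get("_id")),
                        Updates.set("status", "DONE"));
            } else {
                try { Thread.sleep(1000); } catch (InterruptedException e) { return; }
            }
        }
    }
}
```

Note that this sketch does not handle the failure case the answer warns about: if a pod sets a document to IN_PROGRESS and then dies, something else has to detect the stale lock and reset it, which is exactly the kind of bookkeeping Kafka consumer groups give you for free.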