Your service cannot process events fast enough during peak hours.

There is no obvious quick and dirty fix.

Refactoring would take ages.

People have been unhappy for a while now.

What the hell do you do?

Rough representation of the situation we were facing. Not pictured: honking and yelling.

Background

I had the pleasure of working with a legacy backend system recently. It had plenty of ongoing problems, but one was more acute than the others: the service could not process certain very important events fast enough during peak hours, and that was problematic for everyone who relied on those events.

The processing of events was very basic:

  • load 1000 entities
  • for each entity, do the following sequentially
    • do some processing
    • send a JMS message via ActiveMQ
    • wait for confirmation

The processing was triggered by the scheduled jobs system that the service had been relying on since its inception more than a decade ago. That system only allowed one instance of a scheduled job to run at any time.

This processing had worked well for almost a decade at that point, but started causing more and more issues as the service was the backbone of a rapidly growing business. At least it was a nice problem to have.

I was the lucky guy who got assigned to solve this issue, and in hindsight I’m really glad I got it, because tackling this one was fun.

The slowness of the process was down to the part where events had to be sent via ActiveMQ. It’s possible to send events without waiting for any confirmation, but our service went with the slower approach of waiting for a confirmation that each message was successfully sent. After consulting with my colleagues, going through the depths of git blame and doing some Jira archeology, I soon learned that this slowness was there for a reason: some events could fail to be processed if the service suddenly died, which wasn’t that rare of an occurrence. Sure, skipping the confirmation made the processing quick, but you risked losing events, and that was even worse than sending events with a big delay.

When looking at the behaviour of this code path I learned that sending each event took about 20 milliseconds. That is not a lot of time for a single message, but if you have queued up 1000 events, it adds up to 20+ seconds to send all of them. Add the fact that the processing is sequential and that during peak hours you had to process thousands of events within one minute, and you can see where this becomes problematic.
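Spelling out that back-of-the-envelope arithmetic (a throwaway sketch, assuming the measured ~20 ms per send):

```java
class Estimate {
    // Sequential sending: total time grows linearly with batch size,
    // since each send blocks until the broker confirms it.
    static long sequentialMillis(int events, long millisPerSend) {
        return events * millisPerSend;
    }
}
```

At 1000 events × 20 ms, one batch alone eats 20 seconds of a one-minute window, so a few thousand queued events can never fit sequentially.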

To give some additional context: we’re dealing with a service that has at least two instances running at any time in a Kubernetes cluster. We had other high-priority work ongoing as well, and refactoring this part of the system was not feasible in a short time window. Spending weeks or even months on this issue was out of the question.

The solution

The first idea I tried was pretty basic: see if we could process each event in parallel using Java parallel streams. Hibernate and ActiveMQ put the brakes on that idea pretty quickly, since objects related to them are not thread-safe.

The second idea I tried was to find a way to bulk-send events since I suspected that the ActiveMQ connection setup and teardown was being done separately for each processed entity. That, however, was not the case.

The third idea was the one I went with. From my early programming days I learned about the modulo operation. I also had vague knowledge of Kafka and its way to split work via partitions. Didn’t take my mind long to connect the dots, and a TechTipsy certified quickfix™ was born.

I took the existing job, created multiple instances of it, and gave each instance a number from 0 to 4; let’s call it a “worker ID”. Those modified scheduled jobs ran the same database query with a small adjustment: in addition to the other criteria, each scheduled job only picked entities whose numeric ID modulo 5 matched its worker ID.

This meant that the service could process 5 times more entities at once without having to commit to a big and risky rewrite.

I imagine this is what the "just one more lane, bro!" crowd thinks happens in the real world.

The PostgreSQL modulo operator (%) was the key to this: the filtering happened in the database query itself, so we avoided loading all of the entities into memory and filtering the list down with the modulo operator in application code.
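The per-worker query looked roughly like this. The table name, column names and `LIMIT` are invented for the sketch; only the `id % 5 = workerId` filter reflects the actual technique.

```java
// Sketch of the adjusted per-worker query. Table and column names
// are made up; the real query had additional criteria.
class WorkerQuery {
    static final int WORKER_COUNT = 5;

    // PostgreSQL evaluates `id % 5` server-side, so each worker
    // loads only its own slice of the pending entities.
    static String forWorker(int workerId) {
        if (workerId < 0 || workerId >= WORKER_COUNT) {
            throw new IllegalArgumentException("worker ID must be in [0, " + (WORKER_COUNT - 1) + "]");
        }
        return "SELECT * FROM events WHERE status = 'PENDING' "
             + "AND id % " + WORKER_COUNT + " = " + workerId
             + " LIMIT 1000";
    }
}
```

In real code the worker ID would go in as a bind parameter rather than string concatenation; it is inlined here only to keep the sketch readable.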

All in all, this took a few days of work to implement, roll out to staging and production, observe, and refactor to make the solution maintainable. For a service that’s part of the critical path in the tech stack and notorious for its slow deployment cycle, that was pretty fast.

Example

Here’s an example to help illustrate the process. The database holds 7 entities with the following IDs: 1001, 1002, 1003, 1004, 1005, 1006, 1007.

Worker 0 starts up and selects all entities where the result of ID mod 5 = 0.

  • 1001 mod 5 = 1
  • 1002 mod 5 = 2
  • 1003 mod 5 = 3
  • 1004 mod 5 = 4
  • 1005 mod 5 = 0 ← this one
  • 1006 mod 5 = 1
  • 1007 mod 5 = 2

Worker 0 selects entity 1005.

Worker 1 selects entities 1001, 1006.

Worker 2 selects entities 1002, 1007.

Worker 3 selects entity 1003.

Worker 4 selects entity 1004.

All the workers operate in parallel.
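The partitioning above can be reproduced in a few lines of Java (a demo of the idea, not the service's code, which did this filtering in SQL):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.ArrayList;

class PartitionDemo {
    // Groups entity IDs by worker: worker w gets every ID where id % workers == w.
    static Map<Long, List<Long>> partition(List<Long> ids, int workers) {
        Map<Long, List<Long>> byWorker = new TreeMap<>();
        for (long id : ids) {
            byWorker.computeIfAbsent(id % workers, w -> new ArrayList<>()).add(id);
        }
        return byWorker;
    }
}
```

Running it on the 7 example IDs with 5 workers yields exactly the assignment listed above: worker 0 gets 1005, worker 1 gets 1001 and 1006, and so on. Because every ID lands in exactly one residue class, no two workers can ever pick the same entity.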

Caveats

This solution isn’t perfect and has some caveats related to certain implementation details.

Adding additional workers is relatively simple, but requires lots of small but well-documented steps. This will only be a problem if the system sees additional growth that it cannot handle, but that should be years from now. I like to think of it as job security for future generations.

Due to the way the scheduled jobs solution is built, it is possible for one instance of the service to run more than one worker at once, which could be a problem for compute-heavy or memory-hungry processing work. However, for this use case it’s not a problem.

All in all, I’m happy with the solution, the team was happy after I made the TechTipsy certified quickfix™ more maintainable, and everyone relying on this solution to work properly was also happy.

Comments

Have any similar stories?

Or an urge to yell at how bad my solution is?

If you’re not a spammer, just send me an e-mail!