Elasticsearch is a highly scalable data store that allows you to ingest, search, and aggregate millions of records with ease. Recently a customer of mine was running into performance issues with their multi-node deployment. The cluster needed to ingest around 90,000 one-kilobyte records per second, and their configuration looked like this:

  • Two physical servers with 40 cores, 256GB RAM and 8x 1TB HDDs
  • Each server ran 7 nodes on the host OS
  • Each node was allocated a 24GB heap (the recommended maximum is ~32GB)
  • Each node was allocated 5 cores (via the Processors setting)
  • Each node was allocated one physical disk (a sketch of this per-node configuration follows the list)
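
For reference, a per-node split like this is normally expressed in each node's jvm.options and elasticsearch.yml. The node name, data path, and exact values below are illustrative assumptions, not the customer's actual files:

```
# jvm.options (per node) – fixed 24GB heap, as in the original setup
-Xms24g
-Xmx24g
```

```yaml
# elasticsearch.yml (per node) – name and path are illustrative
node.name: node-3
processors: 5                  # size thread pools for 5 cores ("node.processors" in newer releases)
path.data: /var/lib/es-node-3  # one physical disk per node
```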

Despite separating the resources like this, CPU and memory load were extremely high and Elasticsearch was the bottleneck in their system. At first, we reduced the heap allocated to each node after reading the notes on Lucene's off-heap caching, but this did not solve the issue. Next, we reduced the number of nodes: same result. Giving the remaining, fewer nodes more cores made no difference either.

The solution turned out to be related to that off-heap caching.

Lucene will try to use as much RAM as possible for off-heap caching. So even if you allocate a small heap to Elasticsearch, Lucene will attempt to use the remaining RAM off-heap, via the operating system's filesystem cache. In a multi-node deployment on bare-metal hardware this means every node competes for the same pool of RAM. The solution was to put each node into a Docker container, which caps the total memory available to that node and therefore the amount Lucene can consume off-heap.
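
In effect the fix is just a memory (and CPU) cgroup per node. A minimal sketch of launching one such node with the official Elasticsearch image is below; the image tag, container name, ports, paths, and limit values are illustrative assumptions rather than the customer's actual setup:

```
# One container per node. --memory caps all RAM charged to the
# container's cgroup, including the page cache Lucene relies on
# off-heap; --cpus caps CPU. Values and paths are illustrative.
docker run -d --name es-node-3 \
  --memory=32g --cpus=5 \
  -v /var/lib/es-node-3:/usr/share/elasticsearch/data \
  -e ES_JAVA_OPTS="-Xms24g -Xmx24g" \
  -p 9203:9200 -p 9303:9300 \
  docker.elastic.co/elasticsearch/elasticsearch:6.8.0
```

Because page cache used by files a container reads is accounted to its cgroup, the --memory limit bounds Lucene's off-heap usage along with the JVM heap, so the nodes stop competing for the host's entire RAM.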

The result was outstanding. Both throughput and latency improved drastically, and the customer now has nearly double the capacity before additional hardware is required. +1 for containerisation!

