Elasticsearch on AWS, or How I learned to stop worrying and love the Lucene index
Ahh, Elasticsearch, the cause of, and solution to, all of life’s problems.
I run a Logstash/Elasticsearch/Kibana cluster on EC2 as an application/system log aggregator for the web service I’m supporting. And it’s not been plain sailing. I have a limited AWS budget, so I’m somewhat restricted in the instances I can fire up. No cc2.8xlarges for me. So I was stuck with two m1.larges. And they struggled.
It was processing around 3 million documents a day, for a total index size of around 4 GB per day. And sometimes it coped and sometimes it didn’t. I often found myself restarting the Logstash and Elasticsearch services once or twice a week, sometimes losing 7-9 hours of processed logs.
And the most frustrating thing? I had no idea what I was doing wrong. Had I misconfigured something? Or were the instances simply too small?
So I’ve upped my game a bit. Not without some trial and error. “Fake it ’til you make it,” as those of us without an extensive background in Lucene indices and grid clustering are fond of saying.
But I think I’ve cracked it. And this may be a good lesson for people starting out with a setup like this.
- I’ve now got two c3.xlarges, which, with 10 more compute units to play with, make a big difference to throughput.
- I’ve tweaked the Logstash command line to give me 8 filter workers instead of the default 1, which helps a lot when the document volume increases (see the sketch below).
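For reference, the command line ends up looking something like this sketch. I’m assuming the Logstash 1.x agent here, where -w / --filterworkers sets the number of filter workers; the paths are illustrative rather than gospel.
# Start the Logstash agent with 8 filter workers rather than the default 1
/opt/logstash/bin/logstash agent -f /etc/logstash/conf.d/ -w 8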
And the most important thing? I’ve done my homework and put some effort into making my Elasticsearch config right.
- Port specification, to prevent a port mismatch:
transport.tcp.port: 9301
- EC2 discovery plugin with filtering to ensure the instances see each other, and an increased ping timeout to account for network irregularities:
discovery:
  type: ec2
  groups: elasticsearch-node
  ping_timeout: 65s
  tag:
    Elasticsearch: true
- Making sure my nodes are given specific workloads using SaltStack Jinja templating of the config .yml:
{% if grains['elasticsearch_master'] == False %}
node.master: false
node.data: true
{% endif %}
- Scheduled closing and deleting of old indices to cut load, using a cronned Elasticsearch Curator (sketched below).
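To give a flavour of the cron side, here’s a sketch of the sort of crontab entries I mean. I’m assuming Curator 3.x command-line syntax and the default logstash-YYYY.MM.DD index naming; other Curator releases use different flags (check curator --help for your version), and the retention windows below are illustrative rather than a recommendation.
# /etc/cron.d/elasticsearch-curator (sketch; note that % must be escaped as \% inside crontab entries)
# 01:00: close indices older than 15 days; 02:00: delete indices older than 30 days
0 1 * * * root curator --host localhost close indices --older-than 15 --time-unit days --timestring '\%Y.\%m.\%d' --prefix logstash-
0 2 * * * root curator --host localhost delete indices --older-than 30 --time-unit days --timestring '\%Y.\%m.\%d' --prefix logstash-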
For now my problems seem to be mitigated. We’ll see how easy it is in future to scale the service as my user load increases.