Smoothwall Firewall project

Friday, 12 May 2017

Getting an Elasticsearch shard to relocate after a cluster node's disk had become full



I came into work today to find that one of the test environment's Elasticsearch (ES) nodes had run out of disk space. A bad job had gone berserk overnight and filled the logs, which in turn filled the node's disk.

So what to do with a volume at 100% utilisation? After chatting to a few colleagues, it was decided to delete one of the earlier indices, as the data in the dev environment was not life or death.
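
Before picking a victim it is worth checking which indices are actually on the cluster and how big they are. The _cat/indices endpoint gives a quick one-line-per-index overview, including health, document count and store size:

curl localhost:9200/_cat/indices?v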

I first tried this from the GUI, which didn't work at all, so I switched to the CLI and issued the following:

curl -XDELETE localhost:9200/mydodgylogs-2017.05.08
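
To confirm the delete had actually freed up space on the node, the _cat/allocation endpoint shows shard counts and disk usage per node (plain old df -h on the box tells you much the same thing):

curl localhost:9200/_cat/allocation?v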

That did the trick and we were back to 80% disk utilisation, so I expected the shards to sort themselves out calmly. Unfortunately it was no go: one shard was still refusing to relocate, which was keeping the cluster at yellow. After another chat with my colleague, he informed me of a recovery log file that can cause this. Effectively, the disk filling up had left the shard in an unstable state, and it needed some help to sort itself out.
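
To pin down which shard was the problem, the _cat/shards endpoint lists every shard along with its state and the node it lives on, so anything sitting in INITIALIZING or UNASSIGNED rather than STARTED stands out straight away:

curl localhost:9200/_cat/shards?v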

So on the CLI again, I found the offending shard directory under the correct index and removed the following file:

 mydodgylogs-2017.05.08/4/translog/translog-1234567890.recovering
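
If you don't know the shard's location on disk offhand, a quick find under the node's data directory will turn up any leftover recovery files. The /var/lib/elasticsearch path below is just the default for package installs; adjust it to whatever path.data is set to on your node:

find /var/lib/elasticsearch -name '*.recovering'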

As soon as that was removed the shard relocated fine and the system went back to being happy and green.
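
If you want to watch the relocation as it happens, the _cat/recovery endpoint reports each shard's recovery stage and progress:

curl localhost:9200/_cat/recovery?v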

As it took some time with a few colleagues to get to the bottom of the problem, I thought others might find this useful in the future.

Running the normal health check command then gave me the following healthy output:

curl localhost:9200/_cluster/health?pretty

{
  "cluster_name" : "myclustername",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 45,
  "active_shards" : 90,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0

}