Fixing TaskManager out-of-memory crashes when running OsmLib in Flink with high parallelism
written on 14 March 2019
My goal was to run [OsmLib](https://github.com/conveyal/osm-lib) inside a high-parallelism Flink job to process geospatial data. OsmLib was first configured with the OSM map data of Europe, then Germany.
The server I ran the job on provided 40 logical cores and 128GB of memory.
For both maps, the TaskManager crashed within a few seconds of trying to deploy the streaming job. With the Europe map, the crash occurred at an even lower parallelism than with the Germany map.
The error message looked like this:
> *OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f0676d7c000, 262144, 0) failed; error='Cannot allocate memory' (errno=12)*
This led me to think that the machine had actually run out of memory, but `top` showed that memory was still available.
Eventually I found the solution in [this](http://www.mapdb.org/blog/mmap_files_alloc_and_jvm_crash/) blog post from MapDB, the underlying database framework for OsmLib. Apparently, the TaskManager process hit the limit of `vm.max_map_count`, the maximum number of memory mappings a single process may hold. Running a quick check of `sudo watch -n1 'cat /proc/<TM_PID>/maps | wc -l'` (on the Docker host, with `<TM_PID>` replaced by the TaskManager's process ID) while the job was deploying confirmed the problem: each line of that file is one mapping, and the default limit of 65530 mappings was hit rather quickly. Since each OsmLib instance (i.e. each parallel instance of the OsmLib operator in Flink) maps the whole OSM database individually, the number of mappings required grows with the parallelism.
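The check above can also be wrapped in a small helper that prints the current number of mappings next to the configured limit. This is only a sketch: the function name `map_usage` is made up for illustration, and reading `/proc/<pid>/maps` of a process owned by another user typically requires root.

```shell
#!/bin/sh
# Hypothetical helper: print "<used> <limit>" for a given PID,
# i.e. the process's current number of memory mappings and the
# system-wide per-process limit (vm.max_map_count).
map_usage() {
    pid="$1"
    # Each line in /proc/<pid>/maps corresponds to one memory mapping.
    used=$(wc -l < "/proc/$pid/maps")
    # The currently active limit is exposed under /proc/sys/vm.
    limit=$(cat /proc/sys/vm/max_map_count)
    echo "$used $limit"
}
```

For example, `map_usage 12345` (with 12345 standing in for the TaskManager's PID) prints two numbers; if the first approaches the second while the job is deploying, the process is about to hit the limit.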
To fix the problem, you can set a much higher value for `vm.max_map_count` in `sysctl.conf`. Note that if you're running Flink in Docker containers, this value still has to be set on the Docker host.
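As a sketch of what that looks like, assuming the standard location `/etc/sysctl.conf`, the entry is a single line. The value shown is the one I ended up using; an appropriate value depends on your map size and parallelism.

```conf
# /etc/sysctl.conf (on the Docker host if Flink runs in containers)
# Raise the per-process limit on memory mappings.
vm.max_map_count = 512000
```

Running `sudo sysctl -p` reloads the file without a reboot, and `sysctl vm.max_map_count` prints the value that is currently active.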
In my case, I set `vm.max_map_count = 512000` and checked the number of memory mappings in use again: with a parallelism of 32 and the Germany OSM map, the count reached over 180k after (finally) successfully deploying the job!