Part 2: Containerizing and Running Foursquare's TwoFishes with Persistent Storage

In Part 1 of this two-part series, I looked at Forward and Reverse Geocoding using Foursquare's TwoFishes. I provided some background on TwoFishes, and shared some lessons I learned while building the geospatial database/indices for forward and reverse geocoding.

Building the TwoFishes database/indices takes significant time; it's not something you want to do on startup. Once you have the database/indices built, you want to get the service up and running. There's no better way than to use a Docker container to package up the Foursquare code and our newly built database. So, in this blog, I will dive into the best way (IMHO) to containerize TwoFishes to account for its idiosyncrasies.

For example, after the container is built and TwoFishes starts up for the first time, it performs some secondary processing of the database as it prepares to handle requests. This creates some additional files, which are lost along with the container if you run this inside a Docker container that doesn't have a volume. This processing takes some time (more than 5 minutes), so it's not something I want to happen every time the container starts. If you don't attach a volume to your container, that's exactly what will happen. For this purpose, I'm using an EFS volume to provide persistent storage for my TwoFishes container.

Containerizing TwoFishes

Here is a working Docker implementation. When TwoFishes starts up the first time, it looks at the database/index, and does some pre-processing prior to being completely ready to handle requests. This takes a while, and I don't want to incur this cost on every startup. Therefore, I use a volume to avoid long start times on the container.

I have provided this code in https://github.com/CorkHounds/twofishes-docker.git.

Let's step through the Dockerfile. I have provided comments inline below to explain what the various segments are doing.

# Pull base image.
FROM ubuntu:14.04

MAINTAINER Jeremy Glesner "jeremy@corkhounds.com"

# Use bash instead of sh
RUN rm /bin/sh && ln -s /bin/bash /bin/sh

# update system
RUN apt-get update

# set home directory
ENV FSQIO_BASE /home/docker
WORKDIR /home/docker

# make download directory
RUN mkdir -p /home/docker/download

# install openjdk 8
RUN apt-get install -y software-properties-common python-software-properties
RUN add-apt-repository ppa:openjdk-r/ppa
RUN apt-get update
RUN apt-get install -qy openjdk-8-jdk

# install the latest python 2.7 to avoid 
# SNIMissingWarning, InsecurePlatformWarning, etc. 
RUN add-apt-repository ppa:jonathonf/python-2.7
RUN apt-get update
RUN apt-get install -qy git python2.7 python-dev curl wget

# correctly link the cacerts to avoid 
# '...trustAnchors parameter must be non-empty...' errors
RUN bash -l -c "/var/lib/dpkg/info/ca-certificates-java.postinst configure"

# download fsqio git repository to the download directory
RUN git clone https://github.com/foursquare/fsqio.git /home/docker/download

# set LANG and LC_ALL to UTF-8
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8

# copy over the us-data folder and run.sh script
COPY us-data /home/docker/download/us-data/
COPY run.sh run.sh

# create the FSQIO_BUILD environment/folder/volume
ENV FSQIO_BUILD /home/docker/fsqio
RUN mkdir -p "$FSQIO_BUILD" 
VOLUME $FSQIO_BUILD

# expose the native twofishes ports
EXPOSE 4567 8080 8081

# set the entrypoint
ENTRYPOINT ["/home/docker/run.sh"]

Notice that you will need a data folder (e.g. us-data) that holds the geo database/indices we created in Part 1 of this series. At a minimum, this provides forward geocoding, but it could also include the data/indices for reverse geocoding, depending on the steps you followed in Part 1. Just to be clear, the folder doesn't need to be named 'us-data' ... but whatever you call it, you will want to (1) copy it into your Docker image (the COPY us-data line in the Dockerfile), and then (2) reference it from your run.sh script below when starting the service (e.g. serve.py -p 8080 us-data/).

Now, let's look at the run.sh file. First, we'll want to make sure that run.sh is executable by running chmod +x run.sh. Now, we can step through the file.

#!/bin/bash

# exit the shell if any action returns a non-zero status.
set -e

# if the build directory doesn't contain pants.ini, create it
# this will copy the FSQIO repo to persistent volume
if [ ! -e "$FSQIO_BUILD/pants.ini" ]; then
	sourceDir="/home/docker/download"
	targetDir="$FSQIO_BUILD"
	cp -a "$sourceDir/." "$targetDir"
fi

# change to the fsqio directory
cd /home/docker/fsqio

# initialize twofishes service
./src/jvm/io/fsq/twofishes/scripts/serve.py -p 8080 us-data/

This script copies the FSQIO contents to the volume on the first start, and serves the application from there. To build the image, run docker build -t twofishes:latest . from the directory containing the Dockerfile (the trailing dot is the build context). You can replace the tag:version as you see fit.
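To make the pants.ini guard in run.sh easier to reason about, here is a minimal sketch of the same "seed the volume once" pattern as a standalone function. The name seed_volume and its arguments are hypothetical (they are not part of fsqio or the repository above); the point is only that the copy runs when the marker file is missing and is skipped on every later start.

```shell
#!/bin/bash
# Sketch of the "seed once" guard from run.sh.
# seed_volume is a hypothetical helper name, not part of fsqio.
seed_volume() {
    local source_dir="$1" target_dir="$2" marker="$3"

    if [ ! -e "$target_dir/$marker" ]; then
        mkdir -p "$target_dir"
        # "source/." copies the directory's contents, including dotfiles,
        # and -a preserves permissions and timestamps.
        cp -a "$source_dir/." "$target_dir"
        echo "seeded"
    else
        # Marker already present: leave the volume's contents untouched.
        echo "skipped"
    fi
}
```

Called as seed_volume /home/docker/download "$FSQIO_BUILD" pants.ini, this behaves like the if block in run.sh: the first start prints "seeded", and later starts print "skipped" and keep whatever TwoFishes has already written to the volume.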

Running the TwoFishes Container

Before you run this container, you'll need to raise vm.max_map_count on the Docker host. This kernel setting determines the maximum number of memory map areas a process may have, and TwoFishes requires a higher ceiling than the default.

sudo sysctl -w vm.max_map_count=131072

Running that command only temporarily increases the virtual memory map areas; the value resets on reboot. To set it permanently, add the line vm.max_map_count=131072 to the /etc/sysctl.conf file.
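If you script your host setup, it helps to make the /etc/sysctl.conf change idempotent so that repeated runs do not append duplicate lines. The following is a sketch under that assumption; persist_sysctl is a hypothetical helper name, and it takes the file path as an argument so you can try it on a scratch file first.

```shell
#!/bin/bash
# persist_sysctl is a hypothetical helper: it sets "key=value" in a
# sysctl-style config file, replacing an existing entry for the key
# rather than appending a duplicate.
persist_sysctl() {
    local file="$1" key="$2" value="$3"
    local tmp

    touch "$file"
    if grep -q "^[[:space:]]*${key}[[:space:]]*=" "$file"; then
        # Rewrite the existing entry. Writing through a temp file and then
        # back into the original keeps the file's owner and permissions.
        tmp="$(mktemp)"
        sed "s|^[[:space:]]*${key}[[:space:]]*=.*|${key}=${value}|" "$file" > "$tmp"
        cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        echo "${key}=${value}" >> "$file"
    fi
}
```

From a root shell, persist_sysctl /etc/sysctl.conf vm.max_map_count 131072 followed by sysctl -p applies the setting without waiting for a reboot.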

Now we can run the container:

docker run -p 8081:8081 -v /home/ec2-user/fsqio:/home/docker/fsqio -d twofishes:latest

In the above command, I am only publishing port 8081 (the HTTP/JSON port), but there are other ports listed in the TwoFishes docs you may want to publish as well. I also detach the container (-d) so that it runs in the background. To see the logs, run docker ps to get the Container ID, and then run docker logs -f <CONTAINER ID>. You will see that the TwoFishes service goes through several phases. First, it installs all dependencies, beginning with "Installing setuptools, pip, wheel...done" followed by dozens of others.

Next, you will see TwoFishes compiling the Scala source code using a standalone Zinc Compiler for the Pants Framework. Once all code is compiled, you will know that the TwoFishes service is up and running when you see the following lines indicating the ports on which it is listening:

Jul 15, 2018 3:49:00 PM io.fsq.twofishes.server.GeocodeFinagleServer$ main
INFO: serving finagle-thrift on port 8080
Jul 15, 2018 3:49:00 PM io.fsq.twofishes.server.GeocodeFinagleServer$ main
INFO: serving http/json on port 8081
Jul 15, 2018 3:49:00 PM io.fsq.twofishes.server.GeocodeFinagleServer$ main
INFO: serving debug info on port 8082
Jul 15, 2018 3:49:00 PM io.fsq.twofishes.server.GeocodeFinagleServer$ main
INFO: serving slow query http/json on port 8083
Jul 15, 2018 3:49:01 PM com.twitter.finagle.Init$$anonfun$1 apply$mcV$sp
INFO: Finagle version 6.25.0 (rev=78909170b7cc97044481274e297805d770465110) built at 20150423-135046
Jul 15, 2018 3:49:01 PM com.twitter.ostrich.admin.BackgroundProcess start
INFO: Starting LatchedStatsListener
Jul 15, 2018 3:49:01 PM com.twitter.ostrich.admin.BackgroundProcess start
INFO: Starting TimeSeriesCollector
Jul 15, 2018 3:49:01 PM com.twitter.ostrich.admin.AdminHttpService start
INFO: Admin HTTP interface started on port 8082.
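Rather than watching the logs by hand, you can detect the "serving http/json on port" line from the output above in a script. This is a rough sketch: twofishes_ready and wait_for_twofishes are hypothetical helper names, and the default of 60 attempts at 10-second intervals is an arbitrary assumption you should tune to your own startup times.

```shell
#!/bin/bash
# twofishes_ready (hypothetical helper): reads log text on stdin and
# succeeds once the HTTP/JSON "serving" line has been logged.
twofishes_ready() {
    grep -q "serving http/json on port"
}

# wait_for_twofishes (hypothetical helper): polls `docker logs` for the
# given container until the service reports ready or attempts run out.
wait_for_twofishes() {
    local container="$1" attempts="${2:-60}" delay="${3:-10}"
    local i=0

    while [ "$i" -lt "$attempts" ]; do
        # Merge stderr into stdout so the check sees both log streams.
        if docker logs "$container" 2>&1 | twofishes_ready; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, wait_for_twofishes <CONTAINER ID> && echo "TwoFishes is up" blocks until the service is listening, which is handy before running smoke tests against a freshly started container.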

To test it with a forward geocoding request, simply query your server:

http://<IP or DOMAIN>:8081/?query=rego+park+ny

If you curl the above URL from the local machine running the container, you should see a response similar to the below:

$ curl "http://localhost:8081/?query=rego+park+ny"

{"interpretations":[{"what":"","where":"rego park ny","feature":{"cc":"US","geometry":{"center":{"lat":40.72649,"lng":-73.85264}},"name":"Rego Park","displayName":"Rego Park, NY, United States","woeType":7,"ids":[{"source":"geonameid","id":"5133640"},{"source":"woeid","id":"2479954"}],"names":[{"name":"Rego Park","lang":"en","flags":[16,1]}],"highlightedName":"<b>Rego Park</b>, <b>NY</b>, United States","matchedName":"Rego Park, NY, United States","id":"geonameid:5133640","attributes":{"population":43925,"urls":["http://en.wikipedia.org/wiki/Rego_Park%2C_Queens"]},"longId":"72057594043061576","parentIds":["72057594044179937","72057594043056574","72057594043061204"]}}]}

If you built a reverse geocoding shapefile in Part 1, you can test the reverse geocoding endpoint as well. For example, to see a response for a New York latitude/longitude with county data, use:

http://<IP or DOMAIN>:8081/?ll=40.74,-74.0&responseIncludes=DISPLAY_NAME

If you curl the above URL from the local machine running the container, you should see a response similar to the below:

$ curl "http://localhost:8081/?ll=40.74,-74.0&responseIncludes=DISPLAY_NAME"

{"interpretations":[{"what":"","where":"","feature":{"cc":"US","geometry":{"center":{"lat":40.77427,"lng":-73.96981},"bounds":{"ne":{"lat":40.882214,"lng":-73.907},"sw":{"lat":40.679548,"lng":-74.047285}},"source":"gn-us-adm2.shp"},"name":"New York","displayName":"New York","woeType":9,"ids":[{"source":"geonameid","id":"5128594"},{"source":"woeid","id":"12589342"}],"names":[{"name":"NYC","lang":"en","flags":[32,16]},{"name":"New York County","lang":"en","flags":[16]},{"name":"Manhattan","lang":"en","flags":[16]},{"name":"New York","lang":"en","flags":[128,16,8,1]}],"id":"geonameid:5128594","attributes":{"population":1585873,"urls":["http://en.wikipedia.org/wiki/Manhattan"]},"longId":"72057594043056530","parentIds":["72057594044179937","72057594043056574"]},"scores":{}}]}

You can read more about request parameters in the TwoFishes docs.
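The JSON responses above are dense, so for a quick smoke test it can be enough to pull out just the displayName values. Here is a minimal sketch using only grep and sed; display_names is a hypothetical helper name, and the pattern assumes the compact, unescaped formatting seen in the sample responses. A real JSON parser such as jq would be more robust.

```shell
#!/bin/bash
# display_names (hypothetical helper): reads a TwoFishes JSON response on
# stdin and prints each displayName value on its own line.
# Assumes compact JSON with no escaped quotes inside the value.
display_names() {
    grep -o '"displayName":"[^"]*"' | sed 's/^"displayName":"//; s/"$//'
}
```

Piping the forward geocoding request through it, as in curl -s "http://localhost:8081/?query=rego+park+ny" | display_names, should print Rego Park, NY, United States for the sample response shown earlier.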

Troubleshooting

If we do not increase the virtual memory map areas, we see the following error during the first run.

Error: Insufficient per-process virtmem areas: 65530 required: 131060
Please increase the number of per-process VMA with sudo sysctl -w vm.max_map_count=X
or reduce the number required by passing --vm_map_count MAP_COUNT, but expect OOMS!
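To catch this before starting the container, you can compare the host's current value against the ceiling used in this post. The sketch below reads /proc/sys/vm/max_map_count, which is where Linux exposes the setting; check_map_count is a hypothetical helper name, and the 131072 default simply mirrors the sysctl command used earlier.

```shell
#!/bin/bash
# check_map_count (hypothetical helper): fails if the current
# vm.max_map_count is below the required value.
# The second argument lets you point at a different file for testing.
check_map_count() {
    local required="${1:-131072}"
    local source_file="${2:-/proc/sys/vm/max_map_count}"
    local current

    current="$(cat "$source_file")"
    if [ "$current" -lt "$required" ]; then
        echo "vm.max_map_count is $current, need at least $required" >&2
        return 1
    fi
    echo "vm.max_map_count is $current (ok)"
}
```

Running check_map_count || sudo sysctl -w vm.max_map_count=131072 on the host before docker run raises the limit only when it is actually too low.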

Next Steps

In my setup, I'm running this container on AWS Elastic Container Service (ECS), with an underlying Elastic File System (EFS) volume. If anyone is interested in that configuration, just ask and I can provide it.