Geospatial Queries on AWS DynamoDB
I'm currently using Postgres/PostGIS for all our geospatial queries. This works well with our microservices architecture. However, as I build out the backend for our mobile app using API Gateway and Lambda, I'd prefer not to use AWS RDS. Because I don't want to expose my RDS Postgres instance publicly, I have to run the Lambda function inside my VPC, which causes a performance hit when AWS attaches an Elastic Network Interface on a cold start. I've seen queries take up to 14 seconds on the first call. That's a long time to make mobile users wait for results. Therefore, I started looking for an alternate solution ... preferably using DynamoDB, because it's globally available and I can use AWS IAM permissions to access it. This would allow me to query Dynamo from a Lambda function outside my VPC and avoid some of the performance hit.
I quickly found AWS Labs' DynamoDB geospatial library. This looked like it would get me to where I needed to go ... trouble is that this project is 'mostly abandoned'. While there was a merged pull request committed in June of this year (2017), it's still using the 1.5.5 version of the AWS Java SDK. The current version, as of this blog, is 1.11.150, and I wanted to use the 1.11 release. After some quick study, it became apparent that it would be a larger undertaking to modernize it, requiring time I didn't want to spend. Turns out, someone already did this work: Dash Labs, Inc. in their fork of the dynamodb-geo library. I really appreciate their hard work to modernize and open source the library.
Update: Ultimately I had to convert Dash Labs' dynamodb-geo library to a single Maven project because I couldn't figure out the ply compiler; I'm not Python savvy. I recently made the Maven version available through our CorkHounds GitHub account.
The documentation is nearly non-existent, so I had to spend some hours figuring out how to use it effectively, and wanted to share what I learned in case others could benefit.
Overview
Since you're reading this post, you likely have georectified data objects like we do, and want to search for objects within a radius or bounding box.
You will want your geocoordinates represented as decimal degree Latitude and Longitude, as this is the expected format of the dynamodb-geo library. It works by converting your latitude and longitude coordinates into a Geohash.
The Dash library will populate your Dynamo records with this Geohash and a Geohash Key. The Geohash Key is an n-length prefix of the hash, used to create bigger buckets for ease of querying; you set the length of the prefix on ingest. Once all your Dynamo records have the Geohash and Geohash Key, you'll create a Global Secondary Index with the Geohash Key as the partition key and the Geohash as the sort key. The Geohash and Geohash Key work well with the native query/scan conditions of Dynamo, specifically the equals (EQ) and range (BETWEEN) operators.
Given a bounding box or radius for a query, the Dash dynamodb-geo library constructs queries for all the Geohash Key prefixes within the given area, each with a range condition covering all the Geohashes associated with that prefix.
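To make that concrete, here's a rough sketch of the idea in plain Java. This is my own illustration, not the library's code: the geohash is stored as a Number, the geohash key is essentially its first n digits, and each generated query pairs an EQ condition on the key with a BETWEEN condition on the hash range for that prefix. The hash values and ranges below are made up purely for illustration.

// Illustration only -- not the library's implementation. The geohash key is
// (roughly) the first keyLength digits of the numeric geohash, assuming a
// positive hash value.
public static long geoHashKey(long geoHash, int keyLength) {
    long key = geoHash;
    while (String.valueOf(key).length() > keyLength) {
        key /= 10;
    }
    return key;
}
// e.g. geoHashKey(6232755106679239925L, 6) == 623275

// Conceptually, each per-prefix query the library generates looks something like
// this (classes from com.amazonaws.services.dynamodbv2.model; values are made up):
Map<String, Condition> keyConditions = new HashMap<>();
keyConditions.put("geoHashKey", new Condition()
        .withComparisonOperator(ComparisonOperator.EQ)
        .withAttributeValueList(new AttributeValue().withN("623275")));
keyConditions.put("geoHash", new Condition()
        .withComparisonOperator(ComparisonOperator.BETWEEN)
        .withAttributeValueList(
                new AttributeValue().withN("6232750000000000000"),
                new AttributeValue().withN("6232759999999999999")));
QueryRequest perPrefixQuery = new QueryRequest()
        .withTableName("YOUR_TABLE_NAME")
        .withIndexName("geoKeyAndHash")
        .withKeyConditions(keyConditions);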
Because the Geohash Key allows us to bucketize and parallelize our queries, Dash did some analysis on the recommended key sizes which they provide in their readme.md.
When Dash refactored AWS' dynamodb-geo library, they employed the Decorator Pattern to augment the PutItemRequest and QueryRequest objects. This approach builds on top of the AWS Java SDK, thereby reducing the learning curve, allowing us to work with objects we're already familiar with.
Table and Index Setup
There are a couple of important things to know regarding table and index configuration. First, when creating your table, I would recommend against using the Geohash Key as your table's partition key (even with the Geohash as a sort key) unless your dataset is very geographically dispersed. Otherwise, you can overwrite data as you insert records, because two items at the same or very nearby location can end up with an identical hash/key pair. I recommend using a separate, unique partition key for your table. Also, plan to insert/keep your original latitude and longitude data, using those explicit attribute field names.
You can create the Index before or after you create the table, but you'll want to use the Geohash Key as the partition key and the Geohash as the sort key. Index partition/sort keys don't need to be unique, unlike table partition/sort composite keys, which do. Also, if you are not going to use the CompositeKeyDecorator (which I touch on later), then both the Geohash Key and Geohash will be inserted into Dynamo as Numbers, so you'll want to reference those fields as such. For example, my geoKeyAndHash index references a numeric geoHashKey partition key and a numeric geoHash sort key.
When using this Index later, you will need to return, at a minimum, the latitude and longitude attributes in the result set, along with any other data you intend to use. So be sure to choose 'All', or select the specific attributes you want by choosing 'Include' from the 'Projected Attributes' menu item.
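I created my table and index through the AWS Management Console, but for reference, a programmatic setup along the lines described above might look something like this. It's a minimal sketch using the standard SDK model classes and the same AmazonDynamoDB client we'll build in the Ingest section below; the table name, attribute names, and capacity values mirror the examples in this post, so adjust them to your own schema.

// Minimal sketch: a table with a unique String partition key ("key") plus a
// global secondary index keyed on the numeric geoHashKey/geoHash attributes.
// Classes come from com.amazonaws.services.dynamodbv2.model.
CreateTableRequest createTableRequest = new CreateTableRequest()
        .withTableName("YOUR_TABLE_NAME")
        .withAttributeDefinitions(
                new AttributeDefinition("key", ScalarAttributeType.S),
                new AttributeDefinition("geoHashKey", ScalarAttributeType.N),
                new AttributeDefinition("geoHash", ScalarAttributeType.N))
        .withKeySchema(new KeySchemaElement("key", KeyType.HASH))
        .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L))
        .withGlobalSecondaryIndexes(new GlobalSecondaryIndex()
                .withIndexName("geoKeyAndHash")
                .withKeySchema(
                        new KeySchemaElement("geoHashKey", KeyType.HASH),
                        new KeySchemaElement("geoHash", KeyType.RANGE))
                .withProjection(new Projection().withProjectionType(ProjectionType.ALL))
                .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L)));
client.createTable(createTableRequest);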
The Read and Write Capacity Units default to 5. You'll want to read up on this concept. Amazon recently made it possible to auto-scale your read/write capacity, which is something else to examine. Bottom line: on query (not ingest), Dash's dynamodb-geo library provides a couple of convenience methods to parallelize your queries (at least one query for each Geohash Key). You can control the number of simultaneous threads by setting the pool size, but if you do this to enhance performance, you'll want to be aware of your Read Capacity configuration.
General Use
Ingest
First, you will want to either create a new table or update an existing one to add the Geohash and Geohash Key attributes to your dataset. Regardless, you can use the PutItemRequest object to achieve this.
To start, we'll make our connection object and specify our table name.
AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard()
        .withRegion("YOUR_REGION")
        .build();
DynamoDB dynamoDB = new DynamoDB(client);
String tableName = "YOUR_TABLE_NAME";
Here, I'm using the Default Provider Chain, which retrieves my credentials from my local machine, where I set them using 'aws configure'. There are numerous AwsCredentialProviders that you can use to connect to Dynamo by specifying the .withCredentials() option. You can learn more about credentials and their associated APIs. Also, I've already created my table using the AWS Management Console, but there are lots of examples of how to construct a table programmatically.
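If you'd rather be explicit about credentials than rely on the Default Provider Chain, you can hand the builder a provider. Here's a minimal sketch using a named profile from your local credentials file; "YOUR_PROFILE" is just a placeholder.

// Sketch: build the client with an explicit credentials provider
// (com.amazonaws.auth.profile.ProfileCredentialsProvider) instead of the default
// chain. "YOUR_PROFILE" is a placeholder for a profile in ~/.aws/credentials.
AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard()
        .withRegion("YOUR_REGION")
        .withCredentials(new ProfileCredentialsProvider("YOUR_PROFILE"))
        .build();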
Next, I'm going to set up Dash's dynamodb-geo library. For this to work, you will need to either (1) clone their repository, build it locally, and include it in your project, or (2) check out jitpack for referencing GitHub repositories as dependencies. Here, we will instantiate the dynamodb-geo library and set the configuration.
Geo geoLib = new Geo();
// Arguments: index name, geohash key column, geohash column, geohash key length,
// and (optionally) a HashKeyDecorator and composite hash key column -- null here.
GeoConfig geoConfig = new GeoConfig("geoKeyAndHash", "geoHashKey", "geoHash", 6, null, null);
The first three values of the GeoConfig constructor should look familiar from setting up the table and index earlier. The fourth value is the Geohash Key Length. I arbitrarily set my key length to six for testing purposes, although I imagine that some factors come into play here, like the density of your dataset. I also believe that the smaller the key length, the fewer queries the dynamodb-geo library will generate later when querying. There's a balance to strike between your reliance on the partition key vs the sort key when querying.
The last two values in my GeoConfig are null. These are for the HashKeyDecorator and composite hash key column. I'll touch on this concept briefly later, but am skipping it for now.
Now we'll prepare our data for insertion. In the example below, I'm only inserting one record. The dynamodb-geo library expects a PutItemRequest, so we must use a Map<String, AttributeValue> for each attribute we want to insert as part of a record, which is a bit more painful than using the Item class.
You'll notice the AttributeValue().with functions in the example below. For instance, I'm using a UUID as my partition key, so I'm inserting that as a String; hence the .withS(). You can read more about AttributeValue and Dynamo's supported data types.
One thing of note: you will want to include the original latitude and longitude in your table. Later, when you query the table using a radius or rectangle filter, the GeoFilter class looks for 'latitude' and 'longitude' attributes in the result set, which the dynamodb-geo library then uses to filter the results. I'm inserting these as numbers; hence the .withN().
Map<String, AttributeValue> item = new HashMap<String, AttributeValue>();
item.put("key", new AttributeValue().withS(YOUR_KEY));
...
item.put("latitude", new AttributeValue().withN(YOUR_LATITUDE));
item.put("longitude", new AttributeValue().withN(YOUR_LONGITUDE));
PutItemRequest putItemRequest = new PutItemRequest();
putItemRequest.setTableName(tableName);
putItemRequest.setItem(item);
Next, you'll also need to put your GeoConfig into a List object for passing to the geoLib.putItemRequest() decorator method. This allows you to add multiple Geohashes and Keys, HashKeyDecorators, etc., for all the ways you want to index and query your data. I'm only using one GeoConfig in this example.
List<GeoConfig> geoConfigList = new ArrayList<GeoConfig>();
geoConfigList.add(geoConfig);
putItemRequest = geoLib.putItemRequest(putItemRequest, YOUR_LATITUDE, YOUR_LONGITUDE, geoConfigList);
PutItemResult putItemResult = client.putItem(putItemRequest);
For each item you insert, you provide the Latitude and Longitude for the Geo library to construct the Geohash and Key. You will receive an updated PutItemRequest from the geoLib.putItemRequest() method, which you can insert using the client we constructed at the beginning.
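If, as mentioned above, you're updating an existing table rather than creating a new one, the same decorator method can be used to backfill the Geohash and Geohash Key onto records you already have. Here's a minimal sketch of that approach; it's my own code, it assumes the items already carry numeric latitude/longitude attributes, and it ignores scan pagination and write capacity throttling for brevity.

// Backfill sketch: scan existing items and re-put each one through the decorator
// so it picks up the geoHash and geoHashKey attributes. Real code should handle
// LastEvaluatedKey (pagination) and pace writes against your capacity settings.
ScanRequest scanRequest = new ScanRequest().withTableName(tableName);
ScanResult scanResult = client.scan(scanRequest);
for (Map<String, AttributeValue> existing : scanResult.getItems()) {
    double lat = Double.parseDouble(existing.get("latitude").getN());
    double lon = Double.parseDouble(existing.get("longitude").getN());
    PutItemRequest backfill = new PutItemRequest()
            .withTableName(tableName)
            .withItem(existing);
    backfill = geoLib.putItemRequest(backfill, lat, lon, geoConfigList);
    client.putItem(backfill);
}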
Query
Now we're set to query the table. First, we create a QueryRequest object. I'm only passing the tableName to this object; however, you can set most of the other values (except KeyConditions and FilterExpression). Next, we'll create a GeoQueryRequest using the geoLib.radiusQuery() method to perform a radius query from a geopoint latitude and longitude. Because we're not using a HashKeyDecorator, the last value is null. This method will create a collection of QueryRequests, at least one for each Geohash Key in the search area, and an associated Radius GeoFilter, which the dynamodb-geo library will use to filter the results from Dynamo.
QueryRequest queryRequest = new QueryRequest()
        .withTableName(tableName);
GeoQueryRequest geoQueryRequest = geoLib.radiusQuery(queryRequest, YOUR_LATITUDE, YOUR_LONGITUDE, YOUR_DISTANCE_METERS, geoConfig, null);
The dynamodb-geo library allows us to concurrently process this batch of requests through the GeoQueryClient class using the Java ExecutorService. I set my thread pool to five because my Index is set to 5 Read Capacity Units (although I'll enable auto scaling in production).
ExecutorService executorService = Executors.newFixedThreadPool(5);
GeoQueryClient geoQueryClient = new GeoQueryClient(client, executorService);
List<Map<String, AttributeValue>> items = geoQueryClient.execute(geoQueryRequest);
The geoQueryClient.execute() method executes all the queries as Callables via a thread pool, pulling from a queue until all tasks have been executed. Once all the results are aggregated, the Radius GeoFilter will be applied, and the final results are returned to us. We can now loop through the List<Map<String, AttributeValue>> and process the results.
for (Map<String, AttributeValue> item : items) {
    UUID key = UUID.fromString(item.get("key").getS());
    ...
    double latitude = Double.parseDouble(item.get("latitude").getN());
    double longitude = Double.parseDouble(item.get("longitude").getN());
}
The Composite Key Decorator
Should you want to, Dash's dynamodb-geo library provides a mechanism to concatenate another column's data with the Geohash Key to, in their words, "Fetch an item where lat/long is 23.78787, -70.6767 AND category = 'restaurants'". To do this, you must pass in an empty HashKeyDecorator and column name when inserting via the GeoConfig, and also provide a column value when querying. Also, if you intend to use this approach, note that the Geohash Key is inserted into Dynamo as a String rather than a Number, so you'll need to set your Index to reference the Geohash Key as a String partition key.
I have multiple columns that I would prefer to include explicitly in my QueryRequest, so this wouldn't work well for my use case. Therefore, I could either use this library as-is and simply filter the results in my own code, or update their library to use the QueryRequest FilterExpression so the database does the work.
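The first option is straightforward: just post-filter the items that come back from the radius query. Here's a quick sketch; the 'category' attribute and 'restaurants' value are hypothetical examples, not something the library requires.

// Sketch of filtering in application code after the geo query returns. "category"
// and "restaurants" are hypothetical; substitute your own attribute and value.
List<Map<String, AttributeValue>> filtered = new ArrayList<>();
for (Map<String, AttributeValue> item : items) {
    AttributeValue category = item.get("category");
    if (category != null && "restaurants".equals(category.getS())) {
        filtered.add(item);
    }
}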
FilterExpressions
After some analysis, I extended Dash's library to enable FilterExpressions. After I work with this some more, I may contribute it back, but I thought I'd comment on what I did to make it work within their existing construct.
In their GeoQueryHelper.java, I added three additional lines to their 'copyQueryRequest' method. To make this fit better on screen, I changed the incoming variable 'queryRequest' to simply 'qr'.
private QueryRequest copyQueryRequest(QueryRequest qr) {
    QueryRequest copiedQueryRequest = new QueryRequest()
            .withAttributesToGet(qr.getAttributesToGet())
            .withConsistentRead(qr.getConsistentRead())
            .withExclusiveStartKey(qr.getExclusiveStartKey())
            .withIndexName(qr.getIndexName())
            .withKeyConditions(qr.getKeyConditions())
            .withLimit(qr.getLimit())
            .withReturnConsumedCapacity(qr.getReturnConsumedCapacity())
            .withScanIndexForward(qr.getScanIndexForward())
            .withSelect(qr.getSelect())
            .withAttributesToGet(qr.getAttributesToGet())
            .withTableName(qr.getTableName())
            //-> start new code
            .withFilterExpression(qr.getFilterExpression())
            .withExpressionAttributeNames(qr.getExpressionAttributeNames())
            .withExpressionAttributeValues(qr.getExpressionAttributeValues());
            //-> finish new code
    return copiedQueryRequest;
}
Once this is built and added to my project, I can add a FilterExpression and Expression Attributes as necessary when building my own QueryRequest, which I pass into the radiusQuery to build out the GeoQueryRequest. I'll augment the example I provided in the Query section above:
String name = "Speedy's Garage";
String region = "Southwest";
Map<String, String> expAttribNames = new HashMap<>();
expAttribNames.put("#name", "NAME");
expAttribNames.put("#region", "REGION");
Map<String, AttributeValue> expAttribValues = new HashMap<>();
expAttribValues.put(":varName", new AttributeValue().withS(name));
expAttribValues.put(":varRegion", new AttributeValue().withS(region));
QueryRequest queryRequest = new QueryRequest()
        .withTableName(tableName)
        .withFilterExpression("#name = :varName AND #region = :varRegion")
        .withExpressionAttributeNames(expAttribNames)
        .withExpressionAttributeValues(expAttribValues);
GeoQueryRequest geoQueryRequest = geoLib.radiusQuery(queryRequest, YOUR_LATITUDE, YOUR_LONGITUDE, YOUR_DISTANCE_METERS, geoConfig, null);
You must use ExpressionAttributeNames whenever a column name matches a reserved word in the Dynamo lexicon. In the example above, I'm using a column called 'NAME' and a column called 'REGION', both of which are reserved words, so we let Dynamo handle the substitution to avoid confusion. ExpressionAttributeValues let you easily substitute values at runtime, as in the example above with the Strings 'name' and 'region'.
Testing Results
I tested the algorithm tonight from within a Java AWS Lambda function (1024 MB of memory) in the same region as my Dynamo tables. I varied the key length and the radius in miles to see how many queries were generated. I also tried different thread pool sizes to see if that improved performance. Your mileage may vary depending on the density of your dataset, the memory setting you choose in Lambda (which apparently affects the machine type the function runs on), etc. Here are my results:
Key Length | Threads | Radius/Miles | Queries | Duration |
6 | 5 | 50 | 256 | 4055ms |
3 | 5 | 50 | 12 | 137ms |
1 | 5 | 50 | 12 | 180ms |
6 | 5 | 100 | 474 | 2139ms |
3 | 5 | 100 | 3 | 39ms |
1 | 5 | 100 | 3 | 37ms |
6 | 5 | 200 | 2096 | 8301ms |
3 | 5 | 200 | 8 | 199ms |
1 | 5 | 200 | 7 | 243ms |
6 | 10 | 50 | 256 | 17156ms |
3 | 10 | 50 | 12 | 38ms |
1 | 10 | 50 | 12 | 99ms |
6 | 10 | 100 | 474 | 36786ms |
3 | 10 | 100 | 3 | 92ms |
1 | 10 | 100 | 3 | 56ms |
6 | 10 | 200 | 2096 | 138629ms |
3 | 10 | 200 | 8 | 243ms |
1 | 10 | 200 | 7 | 63ms |
As I thought, a key length of 6 generates far more queries than shorter key lengths, and the number of queries grows with the radius. However, I was a bit surprised by the results for key lengths of 1 and 3: there is not much difference in the number of queries, leading to similar performance. With fewer queries, we're putting the onus on the sort key range conditions (and possibly the GeoFilter). Regardless, the performance for a key length of 1 or 3 is dramatically better for our purposes. Lastly, I didn't see much improvement when using 10 threads vs 5.