THOUGHTS, STORIES AND IDEAS

When I saw the XKCD strip about Reaction Maps my first thought was how much fun it would be communicating like that with my friends. Well, you probably can guess the second thought now. It was about how great it would be to implement something that enables me to do so.

Planning the App

The plan is to implement an application with the functionality shown in the below comic strip. So basically the end user can enter a phrase, which should be mapped to a route and displayed on a map.

In the following lines I'll explain a bit more about how the app will be structured. Basically the frontend will be a very dumb client, while the backend holds all the logic and phrase to route mapping.

Frontend

The UI will be really simple and should look like the below pictured mockups. So on the first screen the user can enter a phrase, which will be rendered to a map on the second screen.

As I'm currently learning React, I will implement the frontend with it. Because of the simple user interface it seems like a perfect beginner project to me.

Backend

The backend will provide a very simple REST interface for the frontend server. It will just contain one endpoint which is defined like this:

# REQUEST
GET /route
parameters:
    - phrase: String
# RESPONSE
- 200 OK: Returns a list of places that maps to the given phrase
- 400 Bad Request: No or invalid phrase passed
- 401 Unauthorized: The user is not allowed to make this request (see next section - "Security")
- 404 Not Found: No map for the whole phrase or a word found

As the application will probably idle a lot I will deploy it to a serverless function, to be more precise, to an AWS Lambda function implemented in Kotlin. To stay inside the AWS ecosystem, I will use AWS RDS to host a PostgreSQL database containing the names of all places (cities, towns etc) in the United States. As this will be a very small project it will be perfect to fiddle around with technologies that are mostly new to me. That's one of the biggest reasons I decided to implement it with React and AWS Lambda (amongst other things).

Security

Securing the application will be very easy, as the frontend will be accessible for the whole world anyways, so no authentication is necessary there. The backend (AWS Lambda function) will be accessible via an AWS API Gateway, which will be secured with an API key.

Implementing the Backend

Database Setup

Before starting with the backend itself, I created a PostgreSQL database on AWS RDS. Then I downloaded the US GeoNames from here and imported their names, latitude and longitude to the database using the Flyway Maven plugin. To do so I created a Java-based migration that fetches all lines from the file containing the place names and inserts them to the database. Of course the full source for this can be found on Github. Thanks to my friend Matthias for making this whole migration a lot faster.

Phrase to Route Mapping: Basic Idea

After some research about how to match words that sound similar I stumbled across phonetic matching algorithms and decided to elaborate on this approach. A phonetic matching algorithm matches words that sound similar by comparing their phonemes. So basically it tells you if two words sound approximately the same. According to Wikipedia a phoneme /ˈfoʊniːm/ is a unit of sound that distinguishes one word from another in a particular language.

There are multiple ways to map a sentence to a route. The most obvious approaches probably are mapping it word by word or mapping the whole sentence as one. Mapping the sentence word by word may make it easier to read the result for the end user as he doesn't have to think about splitting words in his mind. But what if there is no good match for a word or the match would be better if two words are matched together? To describe this case we can even take an example provided in the comic strip: "to loose" was mapped to "Toulouse". In the first iteration this problem will be ignored and the mapping will happen word by word. Depending on how good this approach works, I'll either leave it that way or also compare the single word mapping with group mappings (e.g. always try to map the word alone, but also try to map it with the previous and next word to find the best possible result).

Another challenge will be how to decide for a single match if there are multiple results. To solve this problem a string similarity metric might be a good solution.

Phrase to Route Mapping: First Try

The oldest and probably most widely known phonetic matching algorithm is Soundex. I stumbled across it on nearly every page I read, so I gave it a try. The fact that Soundex and other phonetic matching algorithms are included in the Apache Commons Codec makes the given problem a lot easier to solve.

To test Soundex I simply created a small utility class that returns all phonetic matches for the specified words. The result is returned in a simple JSON object. Below the results for the word "truly" are listed:

[
   {
      "word":"truly",
      "testResults":[
         {
            "algorithm":"Soundex",
            "matchedPlaces":[
               "Torrey Hill",
               "Torry Hill",
               "Tory Hill",
               "True Hill",
               "Tower Hill",
               "Trail",
               "Trow Hill",
               "Tar Hollow",
               "Terre Hill",
               "Terry Hill",
               "Thayer Hall",
               "Thayer Hollow",
               "Tyrrell",
               "Tarey Hill",
               "Thayer Hill",
               "Tire Hill",
               "Tirol",
               "Troy Hill",
               "Turley",
               "Torro Well",
               "Thoreau Hall",
               "Tower Hall",
               "Trull",
               "Tarryall",
               "Terry Well",
               "Thayer Well",
               "Toril",
               "Tree Well",
               "Terry Hollow",
               "Threeway Well",
               "Thurlow",
               "Truly",
               "True Hall",
               "Tyrol",
               "Terrell",
               "Three Hill",
               "Thrall",
               "The Rowel",
               "Trial",
               "Tar Heel",
               "Tree Wall",
               "Taral",
               "Turrell",
               "Trilla",
               "Tar Hill",
               "Tary Hollow",
               "Terrill",
               "Troy Hollow",
               "Traw Hollow",
               "Tarheel",
               "True Hollow",
               "Terral",
               "Tyrola",
               "Tri-Hill",
               "Tara Hall",
               "The Whirl",
               "Tory Hollow",
               "Tray Hollow",
               "Tralee",
               "Terril",
               "Torrey Hall"
            ]
         }
      ]
   }
]

Well.... Not really the result I was hoping for. It even returns some places where i can't imagine how they should sound like "truly". Some of these words are "Thayer Hollow", "Torrey Hill", "Torro Well" and so on. Of course there are some good matches as well, but all in all there are too many false positives to continue with that approach in my opinion.

Phrase to Route Mapping: Second First Try

Fortunately the Apache Commons Codec library contains the implementations of more phonetic matching algorithms than just Soundex. Therefore I decided to simply test all of them using the method shown above. Now it is the challenge to choose a good set of test words, which should include words that will be used a lot. As this is just a fun project, I'll use the phrases from the XKCD comic strip for testing and to get a feeling on how good the algorithms work. Please note that this is just a quick and dirty approach! In a professional project I would probably choose some other methodology. A first test run with the word "truly" mapped with all algorithms returned a JSON with more than 20.000 lines. So I adapted my test to just return the number of matched places for the given words and algorithms to find the algorithms that return so many results. Below the number of results per algorithm are pictured.

Algorithm truly sorry to loose to loose a friend this way
Beider-Morse 1 16 5 27 8 7 1 4 8
Caverphone 1 21 35 38 60 26 8 221 35 38
Caverphone 2 21 35 58 18 18 8 1 20 48
Cologne Phonetic 41 26 18 41 25 5 2 23 22
Daitch-Mokotoff Soundex 20 21 10 53 30 2 12 32 8
Double Metaphone 50 15 13 36 18 3 1030 1 13
Match Rating Approach Encoder 11359 2383 186 1692 2341 0 30645 2281 713
Metaphone 18 14 5 27 13 1 917 0 7
Nysiis 2 6 14 10 5 3 420 14 7
Refined Soundex 100093 64756 0 50308 39112 0 17765 5514 0
Soundex 61 34 24 207 86 8 1138 77 26

Looking at these results it became clear that Match Rating Approach Encoder and Refined Soundex are definitely not an option. For most of the words they return a ridiculously huge number of results while they seem to have problems with very short words, where they sometimes don't even return a single result (this especially applies for Refined Soundex).

On the other hand Beider-Morse & Nysiis seem quite appealing because of the low number of results. But especially the results for "friend" are interesting. Here Nysiis returns 420 results, while Beider-Morse only returns one. So lets take a look at the detailled results for these algorithms  here.

When looking at the results the huge amount of matches for "friendship" with the Nysiis algorithm becomes understandable very fast. It also included all words that start with friendship (e.g. "Friendship Cemetery") while Beider-Morse only returned "Friendship". Besides that the results of both algorithms seem to be quite good but with the given test words I couldn't find the one that is really better than the other. It really seemed to depend on the word which one performs better.

Therefore I decided to match the words against both of the algorithms. This means if a word is passed to the matching service, first the matches for Beider-Morse and Nysiis are computed. Next the result sets need to be combined. The first step to do so, is to check if there are results that are contained in both result sets. If yes, return these matches. If there is no intersection between the two sets, the union of them will be returned. If none of them returns a result Soundex will be used as fallback.

At this moment the service returns a nice result set, but still we need to decide for one of the results. To find the best one the Levenshtein distance seems to be a good approach. It is a string metric that measures the difference between two strings. So the Levenshtein distance of all results is calculated, and the best ones are returned. With this approach we can narrow the results down a lot, but still we can have multiple of them. For now just a random result will be chosen to be returned, but for the future I'm thinking about implementing something that decides which of the matches fits best into the resulting route.

To avoid that we always need to encode all places with the described algorithms I added a database migration which encodes all places with each algorithm and stores the computed codes and the place they belong to to the database.

After all of the above described logic has been implemented, it was time to run a first test. Passing the phrase "Truly sorry to loose a friend this way" returned "Truly","Surry","Taus","Lozes","Ai","Friend","Teas","Way", which is a result, I'm very happy about. One thing I was not so happy about, was performance - it took 8 minutes and 3 seconds (483654 ms) to compute this result, which is not acceptable.

Performance Optimization

As we got more than 2 million places also the tables for the Nysiis and Soundex encoded places have this size. Further the Beider-Morse table has more than 1 billion entries, as the Beider-Morse algorithm returns multiple phonetic codes for a word. For this reason computing the results for the test phrase took about 8 minutes. The solution for this problem is to add an index to code columns of the tables for the different algorithms and to the name column of the place table like this:

CREATE INDEX ON beider_morse_encoded_places (code);
CREATE INDEX ON nysiis_encoded_places (code);
CREATE INDEX ON soundex_encoded_places (code);
CREATE INDEX ON places (lower(name));

Computing the same result as above now takes only 832 ms - hooray! :D

This is already quite a big improvement but what about some further optimizations? There are some things already mentioned above that would be nice in the future. In the next section you can read up about them.

Possible future improvements

Group matching

If you take a look at the comic strip, which this project is based on, you can see that "to loose" is matched to "Toulouse" there. This means two words have been combined into one to improve matching. This is something we are planning to consider in the future as it could improve the accuracy of the results a lot.

Route optimization

I already mentioned that sometimes a couple of equally good matches are returned. At the moment the first match of those is taken into the result. In the future it would be great if the match that fits best into the route would be chosen instead of the first one.

Dictionary

Adding a dictionary with already matched places could make the whole solution much faster. This seems to be like an easy one, but it must not interfere with the above described features if they are implemented.

TLDR;

Within this blog post I described how I implemented the backend of an application based on this XKCD strip. It is implemented in Kotlin and running inside an AWS Lambda function. I used some phonetic matching algorithms (Beider-Morse, Nysiis and Soundex as fallback) to map a phrase to a list of places that sound similar, and calculated the Levenshtein distance to find the best match. Thanks again to Matthias who implemented some small improvements so far.

Resources

The full source can be found on Github. The Pre-Release v0.1 contains the code that has been written at the time this blog post was published.

You can try the backend solution by sending a GET request to https://api.map-reactions.ksick.dev/v0-1/route?phrase=[your phrase, separated with blanks]. E.g. https://api.map-reactions.ksick.dev/v0-1/route?phrase=truly%20sorry%20to%20loose%20a%20friend%20this%20way

https://download.geonames.org

Title Photo by Timo Wielink on Unsplash

You've successfully subscribed to ksick.dev
Welcome back! You've successfully signed in.
Great! You've successfully signed up.
Success! Your account is fully activated, you now have access to all content.