Geocoding and visualizing address information with R

The step-by-step in a nutshell

By Carlos A. Toruño P.

December 21, 2022

In the previous series of posts about Web Scraping, we extracted information about lawyers from the German and Polish Bar Associations. Within that information, we were able to retrieve the address for each one of these individuals. Nevertheless, these text strings are just that… text. If we really want to take advantage of this kind of information we need to be able to translate these text characters into a format that we can use to visualize our data in a map. This process of transforming the description of a location into a set of geographical coordinates is called Geocoding.

In this post, I will be using information extracted from the Austrian Bar Association to geocode and visualize the geographical distribution of lawyers across the country using R.

What’s Geocoding?

As mentioned before, geocoding is the process of transforming a description of a location into geographical coordinates such as longitude and latitude. Similarly, you can also perform reverse geocoding, which is the process of recognizing a set of geographical coordinates and provide an associated description about this location and how to reach it.

The most practical way to geocode huge volumes of information is to access the data sets of online mapping services like Google Maps or Open Street Map (OSM) through an API. If you want to know more about such services, I would suggest you check the following video by Google:

Even though Google Maps is the most famous mapping provider out there, in this post I will be showing you how to geocode Austrian addresses using the OSM API. Why? First of all, because their services are freely available through their Nominatim Service and, second, because their coverage and data quality is almost as good as the service provided by Google Maps.

Having said all this, we can begin with the actual example.

How to geocode addresses using the OSM Nominatim Service?

The key packages that we are going to be using are two: sf and tmaptools. The sf package allow us to access and handle simple features using R programming language. If you are not familiar with the concept of simple features, I would suggest you to read this article written by Edzer Pebesma. On the other hand, the tmaptools package is an R-library that provides a wide set of tools for reading and processing spatial data, including the geocode_OSM() function which will allow us to access the OSM Nominatim Service to geocode strings of text.

Both packages can be installed as follows:

install.packages("sf")
install_github("mtennekes/tmaptools")

Now that we know the tools, lets take a look at the data. For this, we first need to load the libraries that we are going to be using and then, we can read and check the data that we have.

# Loading required libraries
library(tmaptools)
library(sf)
library(tidyverse)

# Reading data
austria_data.df <- read_csv("Austria_data.csv")

# How does data look like?
austria_data.df %>%
  slice_head(n = 15) %>%
  knitr::kable(format = "html")

address	phone	email	website
1010 Wien--Stubenring 18	Yes	Yes	No
5020 Salzburg--Mildenburggasse 1/2	Yes	Yes	No
1010 Wien--Stubenring 18	Yes	Yes	Yes
1010 Wien--Stubenring 18	Yes	Yes	No
6900 Bregenz--Rathausstraße 37	Yes	Yes	Yes
1010 Wien--Spiegelgasse 19	Yes	Yes	No
1220 Wien--ARES Tower, Donau-City-Straße 11	Yes	Yes	Yes
1030 Wien--Riesgasse 3/14	Yes	Yes	No
1010 Wien--Wollzeile 17/22	Yes	Yes	Yes
1010 Wien--Wollzeile 17/22	Yes	Yes	Yes
1010 Wien--Kohlmarkt 8-10/Eing. Wallnerstr. 1	Yes	Yes	Yes
1070 Wien--Neustiftgasse 3/7	Yes	Yes	Yes
3100 St. Pölten--Domgasse 2	Yes	Yes	No
6020 Innsbruck--Wilhelm-Greil-Straße 21/IV.	Yes	Yes	No
6800 Feldkirch--Schloßgraben 10	Yes	Yes	Yes

We have a data frame with four columns. For now, we will be focusing exclusively in the first one, the address. In Austria, all these addresses follow the same pattern: a four digits sequence representing the corresponding ZIP code, followed by the name of the city or town in which the individual is located, followed by a line break represented by two hyphens (–) and then, after this line break, we have the street name and the house number. First, let’s change the line break by a comma and extract only the address information:

# Replacing line breaks by commas and retrieving the addresses as a vector
addresses <- austria_data.df %>%
  mutate(address = str_replace_all(address, "--", ", ")) %>%
  pull(address)

# How do the vector look like?
addresses[1:15]

 [1] "1010 Wien, Stubenring 18"                     
 [2] "5020 Salzburg, Mildenburggasse 1/2"           
 [3] "1010 Wien, Stubenring 18"                     
 [4] "1010 Wien, Stubenring 18"                     
 [5] "6900 Bregenz, Rathausstraße 37"               
 [6] "1010 Wien, Spiegelgasse 19"                   
 [7] "1220 Wien, ARES Tower, Donau-City-Straße 11"  
 [8] "1030 Wien, Riesgasse 3/14"                    
 [9] "1010 Wien, Wollzeile 17/22"                   
[10] "1010 Wien, Wollzeile 17/22"                   
[11] "1010 Wien, Kohlmarkt 8-10/Eing. Wallnerstr. 1"
[12] "1070 Wien, Neustiftgasse 3/7"                 
[13] "3100 St. Pölten, Domgasse 2"                  
[14] "6020 Innsbruck, Wilhelm-Greil-Straße 21/IV."  
[15] "6800 Feldkirch, Schloßgraben 10"

A clean string of text increases the likelihood of capturing the desired location with accuracy. Therefore, it is highly recommended to check for encoding or misspelling issues before passing the string to the geocoding function. Once that you have make sure that you are passing tidy and clean strings, you need to know beforehand certain aspects. For example, in which projection system do you want the coordinates to be?, or, in which data structure do you want the output to be returned?

For this example, I’m gonna be retrieving the coordinates in a longitude-latiutude system. More specifically, in the World Geodetic System or WGS84, which is also refered to as the EPSG 4326 system. An explanation of the different projection systems would required their own post, which escapes from the scope of this one at present. Therefore, if you are not familiar with projection systems, I would highly suggest you to watch the following video:

After you have decided in which projection system do you want your coordinates to be, geocode_OSM() allows you to decide in which data structure you want your data to be returned. For this example, I am asking the output to be returned as a simple feature. Let’s geocode the first 100 addresses that we have in our registry and check how does the information look like.

# Geocoding the first 100 addresses from the registry
geocoded_addresses <- geocode_OSM(addresses[1:100], 
                                  projection = 4326, 
                                  as.sf      = T)

# How does this new information looks like?
geocoded_addresses %>%
  slice_head(n = 15) %>%
  knitr::kable(format = "html")

query	x	y	y_min	y_max	x_min	x_max	bbox	point
1010 Wien, Stubenring 18	16.381229	48.20844	48.20839	48.20849	16.381179	16.381280	POLYGON ((16.38118 48.20839...	POINT (16.38123 48.20844)
5020 Salzburg, Mildenburggasse 1/2	13.062617	47.79873	47.79873	47.79881	13.062602	13.062617	POLYGON ((13.0626 47.79873,...	POINT (13.06262 47.79873)
1010 Wien, Stubenring 18	16.381229	48.20844	48.20839	48.20849	16.381179	16.381280	POLYGON ((16.38118 48.20839...	POINT (16.38123 48.20844)
1010 Wien, Stubenring 18	16.381229	48.20844	48.20839	48.20849	16.381179	16.381280	POLYGON ((16.38118 48.20839...	POINT (16.38123 48.20844)
6900 Bregenz, Rathausstraße 37	9.746105	47.50451	47.50442	47.50459	9.745983	9.746226	POLYGON ((9.745983 47.50442...	POINT (9.746105 47.5045)
1010 Wien, Spiegelgasse 19	16.369606	48.20635	48.20621	48.20649	16.369477	16.369900	POLYGON ((16.36948 48.20621...	POINT (16.36961 48.20635)
1030 Wien, Riesgasse 3/14	16.389144	48.19857	48.19813	48.19908	16.389128	16.389157	POLYGON ((16.38913 48.19813...	POINT (16.38914 48.19857)
1010 Wien, Wollzeile 17/22	16.376252	48.20840	48.20765	48.20965	16.373491	16.378753	POLYGON ((16.37349 48.20765...	POINT (16.37625 48.2084)
1010 Wien, Wollzeile 17/22	16.376252	48.20840	48.20765	48.20965	16.373491	16.378753	POLYGON ((16.37349 48.20765...	POINT (16.37625 48.2084)
1070 Wien, Neustiftgasse 3/7	16.337271	48.20616	48.20615	48.20616	16.337165	16.337400	POLYGON ((16.33717 48.20615...	POINT (16.33727 48.20616)
3100 St. Pölten, Domgasse 2	15.624754	48.20537	48.20532	48.20542	15.624704	15.624804	POLYGON ((15.6247 48.20532,...	POINT (15.62475 48.20537)
6020 Innsbruck, Wilhelm-Greil-Straße 21/IV.	11.397080	47.26507	47.26507	47.26518	11.397080	11.397115	POLYGON ((11.39708 47.26507...	POINT (11.39708 47.26507)
6800 Feldkirch, Schloßgraben 10	9.598812	47.23845	47.23824	47.23872	9.598467	9.599150	POLYGON ((9.598467 47.23824...	POINT (9.598812 47.23845)
4020 Linz, Schillerstraße 12	14.293445	48.29761	48.29756	48.29766	14.293395	14.293495	POLYGON ((14.2934 48.29756,...	POINT (14.29345 48.29761)
5020 Salzburg, Erzabt-Klotz-Straße 12/2	13.052071	47.79254	47.79251	47.79254	13.051968	13.052071	POLYGON ((13.05197 47.79251...	POINT (13.05207 47.79254)

Visualizing the results

The output we received has information regarding the longitude, latitude and bounding box of each one of the addresses that we submitted. Let’s try to visualize the coordinates that we now have in a map using the Leaflet package (Remember that you can zoom in a certain area of the map to distinguish between close individuals):

library(leaflet)
leaflet() %>% 
      addProviderTiles(providers$CartoDB.Voyager) %>%
      setView(lng  = 13.5306, 
              lat  = 47.4968, 
              zoom = 7) %>%
      addMarkers(
        data         = geocoded_addresses,
        lng          = ~x,
        lat          = ~y,
        popup        = ~query
      )

By geocoding the full addresses we get a very accurate point. However, it’s possible that we would prefer to group points that are very close to each other and see the major concentrations of lawyers in the country. For this, we can group individuals by major locations such as cities or towns. Let’s remember that, following the ZIP code, we have the name of the specific location in which the individual resides. We can extract this information from the address using regular expressions and use this demographic level to group individuals.

# Grouping individuals and extracting the information
grouped_data.df <- austria_data.df %>%
  mutate(location = str_extract(address, "(?<=\\d{4}\\s).+(?=--)"),
         location = paste0(location, ", Austria")) %>%
  group_by(location) %>%
  summarise(total = n()) %>%
  arrange(desc(total))

# How does the data look like?
grouped_data.df %>%
  slice_head(n = 15) %>%
  knitr::kable(format = "html")

location	total
Wien, Austria	3563
Graz, Austria	379
Innsbruck, Austria	342
Salzburg, Austria	333
Linz, Austria	315
Klagenfurt, Austria	133
Wels, Austria	88
Dornbirn, Austria	67
St. Pölten, Austria	63
Bregenz, Austria	53
Feldkirch, Austria	53
Wiener Neustadt, Austria	53
Villach, Austria	49
Mödling, Austria	46
Baden, Austria	34

As a result of the grouping, we have a data frame containing the names of 370 locations scattered all over Austria and sorted by the number of lawyers residing in that city or town. We can grab this list of locations and geocode them in order to retrieve their geographical coordinates as follows:

# Geocoding all the listed locations in Austria
geocoded_addresses <- geocode_OSM(grouped_data.df %>% pull(location), 
                                  projection = 4326, 
                                  as.sf      = T)

# Joining total number of lawyers to geocoded data
geocoded_addresses <- geocoded_addresses %>%
  select(location = query, x, y) %>%
  left_join(grouped_data.df,
            by = "location") %>%
  mutate(label4map = paste("<p><strong>", location, "</strong><br/>",
                           "Number of lawyers: ", total, "</p>"))

Let’s visualize this data into a Leaflet map and identify the major demographic concentrations of lawyers (Remember than you can scroll over a marker to see the location and the number of lawyers in that city or town):

leaflet() %>% 
      addProviderTiles(providers$CartoDB.Voyager) %>%
      setView(lng  = 13.5306, 
              lat  = 47.4968, 
              zoom = 7) %>%
  addCircleMarkers(data         = geocoded_addresses,
                   lng          = ~x,
                   lat          = ~y,
                   radius       = ~(total)^(1/3.5),
                   color        = "#00A08A",
                   fillOpacity  = 0.1,
                   label        = lapply(geocoded_addresses$label4map, shiny::HTML),
                   labelOptions = labelOptions(style = list("font-weight" = "normal", 
                                                            padding = "3px 8px", 
                                                            "color" = "#00A08A"),
                                               textsize  = "15px", 
                                               direction = "auto")
      )

Data looks beautiful!!!

You can now also display in a map the geographical distribution of other available data. For example, cities with the higher percentage of registered individuals registered with e-mail contacts or cities with the higher percentage of lawyers with websites personalized websites, among others.

Posted on:: December 21, 2022

Length:: 9 minute read, 1819 words

Tags:: GIS Geocoding

See Also:: Simplification Algorithms in R; Exploding geometries with GeoPandas; Programming Web Apps with Streamlit