Geocoding and visualizing address information with R

The step-by-step in a nutshell

By Carlos A. Toruño P.

December 21, 2022

In the previous series of posts about Web Scraping, we extracted information about lawyers from the German and Polish Bar Associations. Within that information, we were able to retrieve the address for each one of these individuals. Nevertheless, these text strings are just that… text. If we really want to take advantage of this kind of information we need to be able to translate these text characters into a format that we can use to visualize our data in a map. This process of transforming the description of a location into a set of geographical coordinates is called Geocoding.

In this post, I will be using information extracted from the Austrian Bar Association to geocode and visualize the geographical distribution of lawyers across the country using R.

What’s Geocoding?

As mentioned before, geocoding is the process of transforming a description of a location into geographical coordinates such as longitude and latitude. Similarly, you can also perform reverse geocoding, which is the process of recognizing a set of geographical coordinates and provide an associated description about this location and how to reach it.

The most practical way to geocode huge volumes of information is to access the data sets of online mapping services like Google Maps or Open Street Map (OSM) through an API. If you want to know more about such services, I would suggest you check the following video by Google:

Even though Google Maps is the most famous mapping provider out there, in this post I will be showing you how to geocode Austrian addresses using the OSM API. Why? First of all, because their services are freely available through their Nominatim Service and, second, because their coverage and data quality is almost as good as the service provided by Google Maps.

Having said all this, we can begin with the actual example.

How to geocode addresses using the OSM Nominatim Service?

The key packages that we are going to be using are two: sf and tmaptools. The sf package allow us to access and handle simple features using R programming language. If you are not familiar with the concept of simple features, I would suggest you to read this article written by Edzer Pebesma. On the other hand, the tmaptools package is an R-library that provides a wide set of tools for reading and processing spatial data, including the geocode_OSM() function which will allow us to access the OSM Nominatim Service to geocode strings of text.

Both packages can be installed as follows:

install.packages("sf")
install_github("mtennekes/tmaptools")

Now that we know the tools, lets take a look at the data. For this, we first need to load the libraries that we are going to be using and then, we can read and check the data that we have.

# Loading required libraries
library(tmaptools)
library(sf)
library(tidyverse)

# Reading data
austria_data.df <- read_csv("Austria_data.csv")

# How does data look like?
austria_data.df %>%
  slice_head(n = 15) %>%
  knitr::kable(format = "html")
address phone email website
1010 Wien--Stubenring 18 Yes Yes No
5020 Salzburg--Mildenburggasse 1/2 Yes Yes No
1010 Wien--Stubenring 18 Yes Yes Yes
1010 Wien--Stubenring 18 Yes Yes No
6900 Bregenz--Rathausstraße 37 Yes Yes Yes
1010 Wien--Spiegelgasse 19 Yes Yes No
1220 Wien--ARES Tower, Donau-City-Straße 11 Yes Yes Yes
1030 Wien--Riesgasse 3/14 Yes Yes No
1010 Wien--Wollzeile 17/22 Yes Yes Yes
1010 Wien--Wollzeile 17/22 Yes Yes Yes
1010 Wien--Kohlmarkt 8-10/Eing. Wallnerstr. 1 Yes Yes Yes
1070 Wien--Neustiftgasse 3/7 Yes Yes Yes
3100 St. Pölten--Domgasse 2 Yes Yes No
6020 Innsbruck--Wilhelm-Greil-Straße 21/IV. Yes Yes No
6800 Feldkirch--Schloßgraben 10 Yes Yes Yes

We have a data frame with four columns. For now, we will be focusing exclusively in the first one, the address. In Austria, all these addresses follow the same pattern: a four digits sequence representing the corresponding ZIP code, followed by the name of the city or town in which the individual is located, followed by a line break represented by two hyphens (–) and then, after this line break, we have the street name and the house number. First, let’s change the line break by a comma and extract only the address information:

# Replacing line breaks by commas and retrieving the addresses as a vector
addresses <- austria_data.df %>%
  mutate(address = str_replace_all(address, "--", ", ")) %>%
  pull(address)

# How do the vector look like?
addresses[1:15]
 [1] "1010 Wien, Stubenring 18"                     
 [2] "5020 Salzburg, Mildenburggasse 1/2"           
 [3] "1010 Wien, Stubenring 18"                     
 [4] "1010 Wien, Stubenring 18"                     
 [5] "6900 Bregenz, Rathausstraße 37"               
 [6] "1010 Wien, Spiegelgasse 19"                   
 [7] "1220 Wien, ARES Tower, Donau-City-Straße 11"  
 [8] "1030 Wien, Riesgasse 3/14"                    
 [9] "1010 Wien, Wollzeile 17/22"                   
[10] "1010 Wien, Wollzeile 17/22"                   
[11] "1010 Wien, Kohlmarkt 8-10/Eing. Wallnerstr. 1"
[12] "1070 Wien, Neustiftgasse 3/7"                 
[13] "3100 St. Pölten, Domgasse 2"                  
[14] "6020 Innsbruck, Wilhelm-Greil-Straße 21/IV."  
[15] "6800 Feldkirch, Schloßgraben 10"              

A clean string of text increases the likelihood of capturing the desired location with accuracy. Therefore, it is highly recommended to check for encoding or misspelling issues before passing the string to the geocoding function. Once that you have make sure that you are passing tidy and clean strings, you need to know beforehand certain aspects. For example, in which projection system do you want the coordinates to be?, or, in which data structure do you want the output to be returned?

For this example, I’m gonna be retrieving the coordinates in a longitude-latiutude system. More specifically, in the World Geodetic System or WGS84, which is also refered to as the EPSG 4326 system. An explanation of the different projection systems would required their own post, which escapes from the scope of this one at present. Therefore, if you are not familiar with projection systems, I would highly suggest you to watch the following video:

After you have decided in which projection system do you want your coordinates to be, geocode_OSM() allows you to decide in which data structure you want your data to be returned. For this example, I am asking the output to be returned as a simple feature. Let’s geocode the first 100 addresses that we have in our registry and check how does the information look like.

# Geocoding the first 100 addresses from the registry
geocoded_addresses <- geocode_OSM(addresses[1:100], 
                                  projection = 4326, 
                                  as.sf      = T)

# How does this new information looks like?
geocoded_addresses %>%
  slice_head(n = 15) %>%
  knitr::kable(format = "html")

query x y y_min y_max x_min x_max bbox point
1010 Wien, Stubenring 18 16.381229 48.20844 48.20839 48.20849 16.381179 16.381280 POLYGON ((16.38118 48.20839... POINT (16.38123 48.20844)
5020 Salzburg, Mildenburggasse 1/2 13.062617 47.79873 47.79873 47.79881 13.062602 13.062617 POLYGON ((13.0626 47.79873,... POINT (13.06262 47.79873)
1010 Wien, Stubenring 18 16.381229 48.20844 48.20839 48.20849 16.381179 16.381280 POLYGON ((16.38118 48.20839... POINT (16.38123 48.20844)
1010 Wien, Stubenring 18 16.381229 48.20844 48.20839 48.20849 16.381179 16.381280 POLYGON ((16.38118 48.20839... POINT (16.38123 48.20844)
6900 Bregenz, Rathausstraße 37 9.746105 47.50451 47.50442 47.50459 9.745983 9.746226 POLYGON ((9.745983 47.50442... POINT (9.746105 47.5045)
1010 Wien, Spiegelgasse 19 16.369606 48.20635 48.20621 48.20649 16.369477 16.369900 POLYGON ((16.36948 48.20621... POINT (16.36961 48.20635)
1030 Wien, Riesgasse 3/14 16.389144 48.19857 48.19813 48.19908 16.389128 16.389157 POLYGON ((16.38913 48.19813... POINT (16.38914 48.19857)
1010 Wien, Wollzeile 17/22 16.376252 48.20840 48.20765 48.20965 16.373491 16.378753 POLYGON ((16.37349 48.20765... POINT (16.37625 48.2084)
1010 Wien, Wollzeile 17/22 16.376252 48.20840 48.20765 48.20965 16.373491 16.378753 POLYGON ((16.37349 48.20765... POINT (16.37625 48.2084)
1070 Wien, Neustiftgasse 3/7 16.337271 48.20616 48.20615 48.20616 16.337165 16.337400 POLYGON ((16.33717 48.20615... POINT (16.33727 48.20616)
3100 St. Pölten, Domgasse 2 15.624754 48.20537 48.20532 48.20542 15.624704 15.624804 POLYGON ((15.6247 48.20532,... POINT (15.62475 48.20537)
6020 Innsbruck, Wilhelm-Greil-Straße 21/IV. 11.397080 47.26507 47.26507 47.26518 11.397080 11.397115 POLYGON ((11.39708 47.26507... POINT (11.39708 47.26507)
6800 Feldkirch, Schloßgraben 10 9.598812 47.23845 47.23824 47.23872 9.598467 9.599150 POLYGON ((9.598467 47.23824... POINT (9.598812 47.23845)
4020 Linz, Schillerstraße 12 14.293445 48.29761 48.29756 48.29766 14.293395 14.293495 POLYGON ((14.2934 48.29756,... POINT (14.29345 48.29761)
5020 Salzburg, Erzabt-Klotz-Straße 12/2 13.052071 47.79254 47.79251 47.79254 13.051968 13.052071 POLYGON ((13.05197 47.79251... POINT (13.05207 47.79254)

Visualizing the results

The output we received has information regarding the longitude, latitude and bounding box of each one of the addresses that we submitted. Let’s try to visualize the coordinates that we now have in a map using the Leaflet package (Remember that you can zoom in a certain area of the map to distinguish between close individuals):

library(leaflet)
leaflet() %>% 
      addProviderTiles(providers$CartoDB.Voyager) %>%
      setView(lng  = 13.5306, 
              lat  = 47.4968, 
              zoom = 7) %>%
      addMarkers(
        data         = geocoded_addresses,
        lng          = ~x,
        lat          = ~y,
        popup        = ~query
      )

By geocoding the full addresses we get a very accurate point. However, it’s possible that we would prefer to group points that are very close to each other and see the major concentrations of lawyers in the country. For this, we can group individuals by major locations such as cities or towns. Let’s remember that, following the ZIP code, we have the name of the specific location in which the individual resides. We can extract this information from the address using regular expressions and use this demographic level to group individuals.

# Grouping individuals and extracting the information
grouped_data.df <- austria_data.df %>%
  mutate(location = str_extract(address, "(?<=\\d{4}\\s).+(?=--)"),
         location = paste0(location, ", Austria")) %>%
  group_by(location) %>%
  summarise(total = n()) %>%
  arrange(desc(total))

# How does the data look like?
grouped_data.df %>%
  slice_head(n = 15) %>%
  knitr::kable(format = "html")
location total
Wien, Austria 3563
Graz, Austria 379
Innsbruck, Austria 342
Salzburg, Austria 333
Linz, Austria 315
Klagenfurt, Austria 133
Wels, Austria 88
Dornbirn, Austria 67
St. Pölten, Austria 63
Bregenz, Austria 53
Feldkirch, Austria 53
Wiener Neustadt, Austria 53
Villach, Austria 49
Mödling, Austria 46
Baden, Austria 34

As a result of the grouping, we have a data frame containing the names of 370 locations scattered all over Austria and sorted by the number of lawyers residing in that city or town. We can grab this list of locations and geocode them in order to retrieve their geographical coordinates as follows:

# Geocoding all the listed locations in Austria
geocoded_addresses <- geocode_OSM(grouped_data.df %>% pull(location), 
                                  projection = 4326, 
                                  as.sf      = T)

# Joining total number of lawyers to geocoded data
geocoded_addresses <- geocoded_addresses %>%
  select(location = query, x, y) %>%
  left_join(grouped_data.df,
            by = "location") %>%
  mutate(label4map = paste("<p><strong>", location, "</strong><br/>",
                           "Number of lawyers: ", total, "</p>"))

Let’s visualize this data into a Leaflet map and identify the major demographic concentrations of lawyers (Remember than you can scroll over a marker to see the location and the number of lawyers in that city or town):

leaflet() %>% 
      addProviderTiles(providers$CartoDB.Voyager) %>%
      setView(lng  = 13.5306, 
              lat  = 47.4968, 
              zoom = 7) %>%
  addCircleMarkers(data         = geocoded_addresses,
                   lng          = ~x,
                   lat          = ~y,
                   radius       = ~(total)^(1/3.5),
                   color        = "#00A08A",
                   fillOpacity  = 0.1,
                   label        = lapply(geocoded_addresses$label4map, shiny::HTML),
                   labelOptions = labelOptions(style = list("font-weight" = "normal", 
                                                            padding = "3px 8px", 
                                                            "color" = "#00A08A"),
                                               textsize  = "15px", 
                                               direction = "auto")
      )

Data looks beautiful!!!

You can now also display in a map the geographical distribution of other available data. For example, cities with the higher percentage of registered individuals registered with e-mail contacts or cities with the higher percentage of lawyers with websites personalized websites, among others.

Posted on:
December 21, 2022
Length:
9 minute read, 1819 words
Tags:
GIS Geocoding
See Also:
Simplification Algorithms in R
Exploding geometries with GeoPandas
Programming Web Apps with Streamlit