Geocoding and visualizing address information with R
The step-by-step in a nutshell
By Carlos A. Toruño P.
December 21, 2022
In the previous series of posts about Web Scraping, we extracted information about lawyers from the German and Polish Bar Associations. Within that information, we were able to retrieve the address for each one of these individuals. Nevertheless, these text strings are just that… text. If we really want to take advantage of this kind of information we need to be able to translate these text characters into a format that we can use to visualize our data in a map. This process of transforming the description of a location into a set of geographical coordinates is called Geocoding.
In this post, I will be using information extracted from the Austrian Bar Association to geocode and visualize the geographical distribution of lawyers across the country using R.
What’s Geocoding?
As mentioned before, geocoding is the process of transforming a description of a location into geographical coordinates such as longitude and latitude. Similarly, you can also perform reverse geocoding, which is the process of recognizing a set of geographical coordinates and provide an associated description about this location and how to reach it.
The most practical way to geocode huge volumes of information is to access the data sets of online mapping services like Google Maps or Open Street Map (OSM) through an API. If you want to know more about such services, I would suggest you check the following video by Google:
Even though Google Maps is the most famous mapping provider out there, in this post I will be showing you how to geocode Austrian addresses using the OSM API. Why? First of all, because their services are freely available through their Nominatim Service and, second, because their coverage and data quality is almost as good as the service provided by Google Maps.
Having said all this, we can begin with the actual example.
How to geocode addresses using the OSM Nominatim Service?
The key packages that we are going to be using are two: sf
and tmaptools
. The
sf package allow us to access and handle simple features using R programming language. If you are not familiar with the concept of simple features, I would suggest you to read
this article written by Edzer Pebesma. On the other hand, the
tmaptools package is an R-library that provides a wide set of tools for reading and processing spatial data, including the geocode_OSM()
function which will allow us to access the OSM Nominatim Service to geocode strings of text.
Both packages can be installed as follows:
install.packages("sf")
install_github("mtennekes/tmaptools")
Now that we know the tools, lets take a look at the data. For this, we first need to load the libraries that we are going to be using and then, we can read and check the data that we have.
# Loading required libraries
library(tmaptools)
library(sf)
library(tidyverse)
# Reading data
austria_data.df <- read_csv("Austria_data.csv")
# How does data look like?
austria_data.df %>%
slice_head(n = 15) %>%
knitr::kable(format = "html")
address | phone | website | |
---|---|---|---|
1010 Wien--Stubenring 18 | Yes | Yes | No |
5020 Salzburg--Mildenburggasse 1/2 | Yes | Yes | No |
1010 Wien--Stubenring 18 | Yes | Yes | Yes |
1010 Wien--Stubenring 18 | Yes | Yes | No |
6900 Bregenz--Rathausstraße 37 | Yes | Yes | Yes |
1010 Wien--Spiegelgasse 19 | Yes | Yes | No |
1220 Wien--ARES Tower, Donau-City-Straße 11 | Yes | Yes | Yes |
1030 Wien--Riesgasse 3/14 | Yes | Yes | No |
1010 Wien--Wollzeile 17/22 | Yes | Yes | Yes |
1010 Wien--Wollzeile 17/22 | Yes | Yes | Yes |
1010 Wien--Kohlmarkt 8-10/Eing. Wallnerstr. 1 | Yes | Yes | Yes |
1070 Wien--Neustiftgasse 3/7 | Yes | Yes | Yes |
3100 St. Pölten--Domgasse 2 | Yes | Yes | No |
6020 Innsbruck--Wilhelm-Greil-Straße 21/IV. | Yes | Yes | No |
6800 Feldkirch--Schloßgraben 10 | Yes | Yes | Yes |
We have a data frame with four columns. For now, we will be focusing exclusively in the first one, the address. In Austria, all these addresses follow the same pattern: a four digits sequence representing the corresponding ZIP code, followed by the name of the city or town in which the individual is located, followed by a line break represented by two hyphens (–) and then, after this line break, we have the street name and the house number. First, let’s change the line break by a comma and extract only the address information:
# Replacing line breaks by commas and retrieving the addresses as a vector
addresses <- austria_data.df %>%
mutate(address = str_replace_all(address, "--", ", ")) %>%
pull(address)
# How do the vector look like?
addresses[1:15]
[1] "1010 Wien, Stubenring 18"
[2] "5020 Salzburg, Mildenburggasse 1/2"
[3] "1010 Wien, Stubenring 18"
[4] "1010 Wien, Stubenring 18"
[5] "6900 Bregenz, Rathausstraße 37"
[6] "1010 Wien, Spiegelgasse 19"
[7] "1220 Wien, ARES Tower, Donau-City-Straße 11"
[8] "1030 Wien, Riesgasse 3/14"
[9] "1010 Wien, Wollzeile 17/22"
[10] "1010 Wien, Wollzeile 17/22"
[11] "1010 Wien, Kohlmarkt 8-10/Eing. Wallnerstr. 1"
[12] "1070 Wien, Neustiftgasse 3/7"
[13] "3100 St. Pölten, Domgasse 2"
[14] "6020 Innsbruck, Wilhelm-Greil-Straße 21/IV."
[15] "6800 Feldkirch, Schloßgraben 10"
A clean string of text increases the likelihood of capturing the desired location with accuracy. Therefore, it is highly recommended to check for encoding or misspelling issues before passing the string to the geocoding function. Once that you have make sure that you are passing tidy and clean strings, you need to know beforehand certain aspects. For example, in which projection system do you want the coordinates to be?, or, in which data structure do you want the output to be returned?
For this example, I’m gonna be retrieving the coordinates in a longitude-latiutude system. More specifically, in the World Geodetic System or WGS84, which is also refered to as the EPSG 4326 system. An explanation of the different projection systems would required their own post, which escapes from the scope of this one at present. Therefore, if you are not familiar with projection systems, I would highly suggest you to watch the following video:
After you have decided in which projection system do you want your coordinates to be, geocode_OSM()
allows you to decide in which data structure you want your data to be returned. For this example, I am asking the output to be returned as a simple feature. Let’s geocode the first 100 addresses that we have in our registry and check how does the information look like.
# Geocoding the first 100 addresses from the registry
geocoded_addresses <- geocode_OSM(addresses[1:100],
projection = 4326,
as.sf = T)
# How does this new information looks like?
geocoded_addresses %>%
slice_head(n = 15) %>%
knitr::kable(format = "html")
query | x | y | y_min | y_max | x_min | x_max | bbox | point |
---|---|---|---|---|---|---|---|---|
1010 Wien, Stubenring 18 | 16.381229 | 48.20844 | 48.20839 | 48.20849 | 16.381179 | 16.381280 | POLYGON ((16.38118 48.20839... | POINT (16.38123 48.20844) |
5020 Salzburg, Mildenburggasse 1/2 | 13.062617 | 47.79873 | 47.79873 | 47.79881 | 13.062602 | 13.062617 | POLYGON ((13.0626 47.79873,... | POINT (13.06262 47.79873) |
1010 Wien, Stubenring 18 | 16.381229 | 48.20844 | 48.20839 | 48.20849 | 16.381179 | 16.381280 | POLYGON ((16.38118 48.20839... | POINT (16.38123 48.20844) |
1010 Wien, Stubenring 18 | 16.381229 | 48.20844 | 48.20839 | 48.20849 | 16.381179 | 16.381280 | POLYGON ((16.38118 48.20839... | POINT (16.38123 48.20844) |
6900 Bregenz, Rathausstraße 37 | 9.746105 | 47.50451 | 47.50442 | 47.50459 | 9.745983 | 9.746226 | POLYGON ((9.745983 47.50442... | POINT (9.746105 47.5045) |
1010 Wien, Spiegelgasse 19 | 16.369606 | 48.20635 | 48.20621 | 48.20649 | 16.369477 | 16.369900 | POLYGON ((16.36948 48.20621... | POINT (16.36961 48.20635) |
1030 Wien, Riesgasse 3/14 | 16.389144 | 48.19857 | 48.19813 | 48.19908 | 16.389128 | 16.389157 | POLYGON ((16.38913 48.19813... | POINT (16.38914 48.19857) |
1010 Wien, Wollzeile 17/22 | 16.376252 | 48.20840 | 48.20765 | 48.20965 | 16.373491 | 16.378753 | POLYGON ((16.37349 48.20765... | POINT (16.37625 48.2084) |
1010 Wien, Wollzeile 17/22 | 16.376252 | 48.20840 | 48.20765 | 48.20965 | 16.373491 | 16.378753 | POLYGON ((16.37349 48.20765... | POINT (16.37625 48.2084) |
1070 Wien, Neustiftgasse 3/7 | 16.337271 | 48.20616 | 48.20615 | 48.20616 | 16.337165 | 16.337400 | POLYGON ((16.33717 48.20615... | POINT (16.33727 48.20616) |
3100 St. Pölten, Domgasse 2 | 15.624754 | 48.20537 | 48.20532 | 48.20542 | 15.624704 | 15.624804 | POLYGON ((15.6247 48.20532,... | POINT (15.62475 48.20537) |
6020 Innsbruck, Wilhelm-Greil-Straße 21/IV. | 11.397080 | 47.26507 | 47.26507 | 47.26518 | 11.397080 | 11.397115 | POLYGON ((11.39708 47.26507... | POINT (11.39708 47.26507) |
6800 Feldkirch, Schloßgraben 10 | 9.598812 | 47.23845 | 47.23824 | 47.23872 | 9.598467 | 9.599150 | POLYGON ((9.598467 47.23824... | POINT (9.598812 47.23845) |
4020 Linz, Schillerstraße 12 | 14.293445 | 48.29761 | 48.29756 | 48.29766 | 14.293395 | 14.293495 | POLYGON ((14.2934 48.29756,... | POINT (14.29345 48.29761) |
5020 Salzburg, Erzabt-Klotz-Straße 12/2 | 13.052071 | 47.79254 | 47.79251 | 47.79254 | 13.051968 | 13.052071 | POLYGON ((13.05197 47.79251... | POINT (13.05207 47.79254) |
Visualizing the results
The output we received has information regarding the longitude, latitude and bounding box of each one of the addresses that we submitted. Let’s try to visualize the coordinates that we now have in a map using the Leaflet package (Remember that you can zoom in a certain area of the map to distinguish between close individuals):
library(leaflet)
leaflet() %>%
addProviderTiles(providers$CartoDB.Voyager) %>%
setView(lng = 13.5306,
lat = 47.4968,
zoom = 7) %>%
addMarkers(
data = geocoded_addresses,
lng = ~x,
lat = ~y,
popup = ~query
)
By geocoding the full addresses we get a very accurate point. However, it’s possible that we would prefer to group points that are very close to each other and see the major concentrations of lawyers in the country. For this, we can group individuals by major locations such as cities or towns. Let’s remember that, following the ZIP code, we have the name of the specific location in which the individual resides. We can extract this information from the address using regular expressions and use this demographic level to group individuals.
# Grouping individuals and extracting the information
grouped_data.df <- austria_data.df %>%
mutate(location = str_extract(address, "(?<=\\d{4}\\s).+(?=--)"),
location = paste0(location, ", Austria")) %>%
group_by(location) %>%
summarise(total = n()) %>%
arrange(desc(total))
# How does the data look like?
grouped_data.df %>%
slice_head(n = 15) %>%
knitr::kable(format = "html")
location | total |
---|---|
Wien, Austria | 3563 |
Graz, Austria | 379 |
Innsbruck, Austria | 342 |
Salzburg, Austria | 333 |
Linz, Austria | 315 |
Klagenfurt, Austria | 133 |
Wels, Austria | 88 |
Dornbirn, Austria | 67 |
St. Pölten, Austria | 63 |
Bregenz, Austria | 53 |
Feldkirch, Austria | 53 |
Wiener Neustadt, Austria | 53 |
Villach, Austria | 49 |
Mödling, Austria | 46 |
Baden, Austria | 34 |
As a result of the grouping, we have a data frame containing the names of 370 locations scattered all over Austria and sorted by the number of lawyers residing in that city or town. We can grab this list of locations and geocode them in order to retrieve their geographical coordinates as follows:
# Geocoding all the listed locations in Austria
geocoded_addresses <- geocode_OSM(grouped_data.df %>% pull(location),
projection = 4326,
as.sf = T)
# Joining total number of lawyers to geocoded data
geocoded_addresses <- geocoded_addresses %>%
select(location = query, x, y) %>%
left_join(grouped_data.df,
by = "location") %>%
mutate(label4map = paste("<p><strong>", location, "</strong><br/>",
"Number of lawyers: ", total, "</p>"))
Let’s visualize this data into a Leaflet map and identify the major demographic concentrations of lawyers (Remember than you can scroll over a marker to see the location and the number of lawyers in that city or town):
leaflet() %>%
addProviderTiles(providers$CartoDB.Voyager) %>%
setView(lng = 13.5306,
lat = 47.4968,
zoom = 7) %>%
addCircleMarkers(data = geocoded_addresses,
lng = ~x,
lat = ~y,
radius = ~(total)^(1/3.5),
color = "#00A08A",
fillOpacity = 0.1,
label = lapply(geocoded_addresses$label4map, shiny::HTML),
labelOptions = labelOptions(style = list("font-weight" = "normal",
padding = "3px 8px",
"color" = "#00A08A"),
textsize = "15px",
direction = "auto")
)
Data looks beautiful!!!
You can now also display in a map the geographical distribution of other available data. For example, cities with the higher percentage of registered individuals registered with e-mail contacts or cities with the higher percentage of lawyers with websites personalized websites, among others.
- Posted on:
- December 21, 2022
- Length:
- 9 minute read, 1819 words