Geocoding is the process of transforming street addresses into geographic coordinates that can then be processed further. I had previously stayed away from geocoding, as it seemed a complicated process requiring large chunks of time and a lot of patience, especially when it starts from a PDF file. Sometimes, however, this process is necessary.
A recent tragic event in Bucharest, Romania made me reconsider this technique, as I weighed the benefits of having a map of the buildings classified as exposed to seismic risk. Your objective might differ from mine, but this technique can be easily replicated. Don't get me wrong, this is a fairly complicated process and requires a certain degree of skill, but I think that once you define your workflow and know your tools well, the process is pretty straightforward.
I will not pretend to reinvent the wheel, or that I am the only one trying to build this map, as this was initially attempted by the Urboteca project. The problem I wanted to solve was reducing the number of points placed by approximation (this might sound complicated or a bit vague at the moment, but I will come back to it below).
DISCLAIMER: This type of geocoding is not an exact science. It is based on, and limited to, the capabilities of the Google Maps and OpenStreetMap APIs, so the accuracy of the process varies with the level of detail these services offer for the places you are trying to geocode.
Let's get to work!
The first step is to get the tools you will need in the process. Here is what I used:
- QGIS with MMQGIS plugin installed - for geocoding purposes.
- Tabula - for extracting data from PDFs in tabular format.
- LibreOffice - for editing/cleaning data in CSV format.
Extracting the data
If you are lucky enough to receive a list of points (addresses) that is already in a tabular/editable format, your work is almost done, but experience shows this is not usually the case. My data came in the form of a PDF file exported from an Excel table. Copy-pasting, or importing directly into an office application, is not really an option here due to text formatting, while OCR-ing can introduce errors due to language processing.
Tabula is different in this respect: it extracts data from tables inside PDF files, which you can then export to CSV or Excel format. After you download and unzip the archive containing the software, run the EXE file and Tabula will open in your default browser. The tool is pretty straightforward to use - you select the table(s) and hit Preview and Export - but I do recommend two things:
- If you have several tables in one PDF file, don't use the Repeat Selection function; I recommend running the file in batches according to the number and type of tables.
- Experiment with the Stream and Lattice extraction methods on small sections to see which one gives the best results for your case.
Depending on the size of your tables the process can take a while, so at this point you can go ahead and make some coffee.
When the preview is ready, check that the tables look good, select your output file type - I used CSV - and hit Export. Don't stress too much if the data doesn't look perfect; at this stage it almost never will, but we will fix that later.
Cleaning up the data
After extracting the data we have to clean and standardize it. I found that the best way to do this is to import the CSV file into LibreOffice Calc. When importing, use the preview window to check two key settings: the character set (mine was UTF-8, but try others if the text doesn't look right) and the delimiter (usually a comma - but watch for other separators your dataset might contain).
After you have imported the data, first make sure that every cell contains only one line; otherwise QGIS might interpret a multi-line cell as multiple rows. Also watch out for data cells that have been split across multiple cells. Depending on your dataset's size, these tasks can take quite some time, patience and attention.
Next, you can use the sort function to ensure no blank rows exist between the data rows.
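If your dataset is large, the one-line-per-cell and blank-row checks can also be scripted instead of done by hand. Here is a minimal Python sketch using only the standard library - the sample data, column layout and function name are made up for illustration - that collapses line breaks inside cells and drops fully blank rows:

```python
import csv
import io

def clean_rows(raw_csv: str) -> list:
    """Collapse line breaks inside cells and drop fully blank rows."""
    cleaned = []
    for row in csv.reader(io.StringIO(raw_csv)):
        # Join any embedded line breaks/extra whitespace so QGIS
        # sees exactly one line per record
        row = [" ".join(cell.split()) for cell in row]
        if any(row):  # keep the row only if at least one cell has content
            cleaned.append(row)
    return cleaned

# Made-up sample: one cell contains a line break, one row is blank
raw = '"Strada Exemplu\n10",Bucuresti\n,\nCalea Victoriei,Bucuresti\n'
print(clean_rows(raw))
```

The cleaned rows can then be written back out with `csv.writer` and opened in LibreOffice Calc for the remaining manual checks.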
One of the most important steps at this stage is formatting the address fields. Here the approach may vary. MMQGIS, the plugin that does the heavy lifting when geocoding in QGIS, uses Google Maps and OpenStreetMap Nominatim. The trick is to format your address fields to match the formatting Google or OpenStreetMap uses for the region you are geocoding. I was pretty lucky and only had to unite two cells to match the format "Street Name No", by adding a new column with a concatenation formula along the lines of `=A2&" "&B2` (replace the cell references according to your case).
After you have created this address column, copy it and paste only the text and numbers (using the Paste Special function in LibreOffice Calc) into a new column. This ensures the final dataset will not be made up of leftover formulas. After this you can delete the original and formula columns.
Thinking ahead, I suggest one more step: to reduce geocoding errors, create a new column for the city. This is only needed if you know your street names can also be found in other cities.
Depending on the language used in the addresses, also consider finding and replacing all non-ASCII characters, to avoid garbled characters in QGIS' attribute table.
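This find-and-replace can also be done in one pass with Python's standard library. A minimal sketch (the function name is mine) that decomposes accented characters and drops the combining marks, shown here on Romanian diacritics:

```python
import unicodedata

def to_ascii(text: str) -> str:
    # NFKD splits accented characters into a base letter plus a
    # combining mark (e.g. "Ș" -> "S" + combining comma below);
    # encoding to ASCII with errors="ignore" then drops the marks
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Șoseaua Ștefan cel Mare"))  # Soseaua Stefan cel Mare
```

Note that this mapping is lossy by design; that is acceptable here, since the goal is only to avoid encoding problems downstream.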
After this "technical" cleaning, we have to organize our table headers to meet our requirements, and at this point you're pretty much all set to geocode your address list.
Geocoding the data
To geocode your address list we use MMQGIS, a QGIS plugin that can batch geocode from CSV files. Fire up QGIS Desktop and go to the MMQGIS dropdown menu, Geocode, Geocode CSV with Google / OpenStreetMap.
Once the Geocoding window opens you need to set up a few things:
- select your CSV file from the location you saved it in;
- select the corresponding fields from the dropdowns - using the city field here will improve accuracy;
- select your service: Google Maps or OpenStreetMap/Nominatim - whichever is more accurate for your location; in my case I used both in subsequent runs;
- select the name and location of the output SHP file;
- select the name and location of the CSV file where the addresses that are not found will be exported.
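For context on what happens behind the dialog: the plugin sends one HTTP request per address to the chosen service. A sketch of what a single Nominatim query could look like - the exact parameters MMQGIS sends may differ; this follows Nominatim's public structured search API, and the address values are made up:

```python
from urllib.parse import urlencode

def nominatim_url(street: str, city: str) -> str:
    # Structured search parameters; format=json asks for
    # machine-readable results, limit=1 keeps only the best match
    params = {"street": street, "city": city, "format": "json", "limit": 1}
    return "https://nominatim.openstreetmap.org/search?" + urlencode(params)

print(nominatim_url("Strada Exemplu 10", "Bucuresti"))
```

This is also why the address formatting done earlier matters: the service only sees the strings you send it.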
An important note here: the Google Maps API only allows 2,500 requests per day, so if your dataset is larger than 2,500 lines you will have to run it in batches.
At this point I suggest running a small number of addresses in order to see how the file is handled and if you need to do some tweaks to your CSV file.
After you run the geocoding process, if you are really lucky, the notfound.csv file is empty and your job is done. But this is rarely the case, and this is where the gritty work begins. If you want all addresses geocoded correctly, you have to figure out what is wrong with the addresses that were not found, and then attempt a new geocoding run. The objective here is zero not-found addresses.
If all went well and all your addresses are geocoded, you can now open the attribute table of the newly added SHP layer, where you will see that MMQGIS added two more columns: addrtype, which shows what the address type is, and addrlocat, which shows how exact the geocoding was for each address. You can probably guess that "ROOFTOP" is the most accurate, while "APPROXIMATE" is the least accurate. The approximate ones are those that I export back to CSV and edit so the addresses properly fit the Google or OSM address formats. Multiple factors can cause these errors; the ones I found to matter most are word order in the addresses, street name changes, bad spelling, and the lack of indexed addresses in the two APIs for certain locations. If you only have a few such points, you can also move them manually to the proper address.
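Pulling out the approximate points for another pass can also be scripted once the attribute table has been saved back to CSV. A minimal sketch, assuming each row is a dict keyed by column name and using the addrlocat column MMQGIS adds (the sample addresses are made up):

```python
def split_by_precision(rows: list) -> tuple:
    """Separate rooftop-accurate rows from the rest, which need re-editing."""
    exact, needs_work = [], []
    for row in rows:
        # Anything short of ROOFTOP goes back into the fix-up pile
        (exact if row.get("addrlocat") == "ROOFTOP" else needs_work).append(row)
    return exact, needs_work

rows = [
    {"address": "Strada Exemplu 10", "addrlocat": "ROOFTOP"},
    {"address": "Calea Victoriei", "addrlocat": "APPROXIMATE"},
]
exact, needs_work = split_by_precision(rows)
print(len(exact), len(needs_work))  # 1 1
```

Writing `needs_work` to a new CSV gives you exactly the file to edit and re-run through the geocoder.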
When you are happy with your geocoding results, all that's left to do is choose a background webmap that you like and publish it.