Mongo Inside Out

Thursday, March 27, 2014

Importing Weather Data

Large sample dataset

I was looking for a fairly large dataset to use as sample data for statistical analysis. I wanted more than a few thousand records, but it still had to be manageable on my own computer: a few gigabytes is OK, terabytes is too much. At http://www.ncdc.noaa.gov/cdo-web/datasets I found 23 GB of weather data from 91,000 stations all over the world, which looks like exactly what I need.

Import the dataset

As is almost always the case, the data is not in a structure that MongoDB can import directly.
A typical approach would be to use tools and programs (sed, grep, awk, ...) to manipulate the text files until they fit the format the import routine expects. This typically involves very long, incomprehensible commands or scripts (and lots of googling ;-). (example)
These scripts and commands are typically lost by the time we need them again and have to be reconstructed; you know the feeling ...
It would be nice if someone tried to import the same data with mongoimport (and sed, grep, ...) to see how it compares with the approach presented here.
The definition of the text files is mostly stored in external files (or hard-coded in the file manipulation commands), which can make the whole import process difficult and error-prone.
MongoDB is so different from typical DB systems that it also requires, and allows, a different way of thinking about and working with data. There's nothing wrong with the classical DBA habits, but I think MongoDB offers some possibilities that might help a lot. In a relational database, we don't often use tables with only one record, because the effort to create the table is quite high. In Mongo, it's very convenient to store additional info (source, import commands, metadata, ...) together with the data itself. Frequently used queries, map-reduce commands, JavaScript functions, source URLs, application settings, metadata and even documentation can very easily be stored in the database itself.
In our case, the data is spread over a lot of small fixed-column text files, all with an identical structure, so we also keep the information about these files in the database itself.
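
As a rough sketch of that idea, the file layout could be written down as a plain document and used to slice each line. The field names and column positions below are made up for illustration and are not the real NOAA layout; parseLine is a hypothetical helper in plain JavaScript, so it should behave the same way in the mongo shell.

```javascript
// Hypothetical layout document: field names and column positions are
// illustrative only, not the actual NOAA file definition.
var layout = {
    _id: "weather_file_layout",
    fields: [
        { name: "station", start: 0,  end: 6,  type: "string" },
        { name: "date",    start: 6,  end: 14, type: "string" },
        { name: "temp",    start: 14, end: 20, type: "number" }
    ]
};

// Slice one fixed-column line into a document, driven by the layout.
function parseLine(line, layout) {
    var doc = {};
    layout.fields.forEach(function (f) {
        var raw = line.substring(f.start, f.end).trim();
        doc[f.name] = (f.type === "number") ? parseFloat(raw) : raw;
    });
    return doc;
}

// A made-up 20-character sample line matching the layout above.
var sample = "01001020140327  -3.5";
var doc = parseLine(sample, layout);
// doc.station is "010010", doc.date is "20140327", doc.temp is -3.5
```

Because the layout is an ordinary document, it could be saved in a collection next to the imported data, so the import can be reproduced later without hunting for the original scripts.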

Facebook

Fun with mongodb and the Facebook Graph API

In a previous post we showed how to make HTTP calls from within the MongoDB shell. Another nice example of what we can do with this wget function is calling the Facebook Graph API. Since the Graph API returns JSON, we can store the results in the database right away.

We put all the relevant functions and parameters in a single object, and even save that object in the db. (I prefer this to using .mongorc.js or the system.js collection; I'll write another blog post about this technique later.)
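
A minimal sketch of such an object might look like this. The object name fb, its property names and the placeholder token are assumptions for illustration, not the object used in the original post; only the base URL and the access_token and fields query parameters come from the Graph API itself.

```javascript
// Hypothetical helper object: names and the token are placeholders.
var fb = {
    baseUrl: "https://graph.facebook.com",
    accessToken: "YOUR_ACCESS_TOKEN",

    // Build a Graph API URL for a path, with optional extra query parameters.
    url: function (path, params) {
        var query = ["access_token=" + encodeURIComponent(this.accessToken)];
        params = params || {};
        for (var key in params) {
            if (Object.prototype.hasOwnProperty.call(params, key)) {
                query.push(encodeURIComponent(key) + "=" +
                           encodeURIComponent(params[key]));
            }
        }
        return this.baseUrl + "/" + path + "?" + query.join("&");
    }
};

var meUrl = fb.url("me", { fields: "id,name" });
```

Keeping the token and the URL logic in one object means the whole thing can be stored as a single document and loaded again in a later shell session.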

wget or XMLHttpRequest

The JavaScript possibilities inside MongoDB allow us to do interesting things from within the database itself. One thing that's missing is the XMLHttpRequest object, which would allow us to call JSON web services from within the mongo shell. Apparently, I'm not the only one who thinks so. The issue is marked as 'lower priority' and 'features we're not sure of', so I doubt we will see this added soon.

run("wget")

To show the power of making an HTTP request from within the database, we can use a workaround. There is an undocumented, 'for internal use only' run() method, which can start an arbitrary process such as wget. (Some others found the run() method interesting as well.)
wget is a tool, available on *nix machines, for making HTTP calls from the command line or in scripts. These code examples run fine on Linux, but might need some changes to run on Windows.
The run() method returns the process's exit code, so we can't capture the output directly. Instead, we let wget write the output to a temporary file and read the contents of that file with cat().

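function wget(url){
    var tmp = "/tmp";
    // id.str is the plain hex string, which gives a clean, unique file name
    var id = new ObjectId();
    var outFile = tmp + "/wget" + id.str;
    // run() hands each argument to wget unchanged, so "-o" and the
    // log file name are passed as separate arguments
    var p = run("wget", "-o", "log", "--output-document=" + outFile, url);
    if (p == 0){
        // success: read the downloaded body and clean up the temp file
        var result = cat(outFile);
        removeFile(outFile);
        return result;
    } else {
        // wget failed: signal this with an empty string
        return "";
    }
}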

It would be cleaner with a real XMLHttpRequest, but it works OK and opens up many possibilities:

web crawler
read / use web services
read JSON-formatted REST APIs and save the results in the database
...
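
Reading a JSON API and preparing the result for saving could be sketched roughly as follows. The helper name fetchJson, the wrapper fields (source, fetched, data), the example.com URL and the stand-in fetch functions are all assumptions for illustration; in the shell, the wget function above would be passed in as httpGet.

```javascript
// Fetch a URL with the supplied function and wrap the parsed JSON together
// with its source, so the origin of the data is stored alongside it.
// Returns null when the fetch failed (empty string) or the body is not JSON.
function fetchJson(httpGet, url) {
    var body = httpGet(url);
    if (!body) { return null; }
    try {
        return { source: url, fetched: new Date(), data: JSON.parse(body) };
    } catch (e) {
        return null;
    }
}

// Stand-ins for wget so the sketch runs without network access.
function fakeGet(url) { return '{"temp": 12.5, "unit": "C"}'; }
function failingGet(url) { return ""; }

var ok = fetchJson(fakeGet, "http://example.com/api/weather");
var failed = fetchJson(failingGet, "http://example.com/api/weather");
// ok.data.temp is 12.5; failed is null
```

Returning null on failure keeps the caller simple: only non-null results are handed to a collection's insert or save.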