Mongo Inside Out, by Geert Van Damme

Importing Weather Data (2014-03-27)

<h3>
Large sample dataset</h3>
To use as sample data for statistical analysis, I was looking for a fairly large dataset: more than a few thousand records, but still manageable on my own computer. A few gigabytes is fine, terabytes is too much. At <a href="http://www.ncdc.noaa.gov/cdo-web/datasets">http://www.ncdc.noaa.gov/cdo-web/datasets</a> I found 23GB of weather data from 91,000 stations all over the world, which looks like exactly what I need.<br />
<br />
<h3>
Import the dataset</h3>
As is almost always the case, the data is not in a structure mongoDB can import directly.<br />
A typical approach would be to use command-line tools (sed, grep, awk, ...) to massage the text files into a format the import routine accepts. This usually involves very long, incomprehensible commands or scripts (and lots of googling ;-). (<a href="http://ronaldbradford.com/blog/mongodb-experience-bulk-loading-data-2010-06-09/" target="_blank">example</a>)<br />
Such scripts and commands are typically lost by the time we need them again, and have to be reconstructed from scratch, you know the feeling ...<br />
It would be interesting if someone imported the same data with mongoimport (and sed, grep, ...) to compare it with the approach presented here.<br />
The text file layout is usually documented in external files (or hard-coded in the file manipulation commands), which can make the whole import process difficult and error prone.<br />
MongoDB is so different from typical DB systems that it also requires, and allows, a different way of thinking and working with data. There's nothing wrong with classical dba habits, but I think mongoDB offers some possibilities that can help a lot. In a relational database we rarely use tables with only one record: the effort to create the table is quite high. In mongo it's very convenient to store additional info (source, import commands, metadata, ...) together with the data itself. Frequently used queries, map-reduce commands, javascript functions, source url's, application settings, metadata and even documentation can very easily be stored in the database itself.<br />
In our case, the data comes as many small fixed-column text files, all with an identical structure, so we also keep the information about these files in the database itself.<br />
<a name='more'></a><br />
<br />
<h3>
Importing the station info</h3>
First we import the data about the weather stations. <br />
<br />
<a href="ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt">ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt</a><br />
This information is in a fixed column format:<br />
<br />
<pre><span style="font-size: x-small;">IV. FORMAT OF "ghcnd-stations.txt"
------------------------------
Variable Columns Type
------------------------------
ID 1-11 Character
LATITUDE 13-20 Real
LONGITUDE 22-30 Real
ELEVATION 32-37 Real
STATE 39-40 Character
NAME 42-71 Character
GSNFLAG 73-75 Character
HCNFLAG 77-79 Character
WMOID 81-85 Character
------------------------------ </span></pre>
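The columns in this spec are 1-based and the stop column is inclusive, so a field maps to JavaScript's substring(start-1, stop). A minimal sketch (plain JavaScript; the station values are made up) of laying a record out into those columns and slicing it back out:

```javascript
// Column definition taken from the ghcnd-stations.txt spec above
// (a subset; State and the flag columns work the same way).
var def = [
  { name: "ID",        type: "String", start: 1,  stop: 11 },
  { name: "Latitude",  type: "Float",  start: 13, stop: 20 },
  { name: "Longitude", type: "Float",  start: 22, stop: 30 },
  { name: "Elevation", type: "Float",  start: 32, stop: 37 },
  { name: "Name",      type: "String", start: 42, stop: 71 }
];

// Build a sample 71-character line by writing each (made-up) value
// into its columns, then extract the fields the same way split() will.
var values = { ID: "BE000006447", Latitude: "50.8", Longitude: "4.35",
               Elevation: "58.0", Name: "UCCLE" };
var chars = new Array(71).fill(" ");
def.forEach(function (col) {
  var s = values[col.name];
  for (var i = 0; i < s.length; i++) chars[col.start - 1 + i] = s[i];
});
var line = chars.join("");

// Columns are 1-based and stop is inclusive: substring(start-1, stop).
function field(line, col) {
  var s = line.substring(col.start - 1, col.stop);
  return col.type === "Float" ? parseFloat(s) : s.trim();
}
```

The same substring(start-1, stop) slicing is what the split() function below does for every column in the definition.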
<br />
Since the data file isn't very big (7.5MB), we can read it with the cat() function instead of mongoimport, and split up the lines according to the column definition. As mentioned before, we store this column definition in the database itself.<br />
<span style="font-family: "Courier New",Courier,monospace;"><br /></span>
<br />
<span style="font-family: "Courier New",Courier,monospace; font-size: x-small;">var settings = new Object();</span><br />
<span style="font-family: "Courier New",Courier,monospace; font-size: x-small;">settings.stationDef = [<br />{name:"ID", type:"String", start:1,stop:11},<br />{name:"Latitude",type:"Float",start:13,stop:20 },<br />{name:"Longitude",type:"Float",start:22,stop:30 },<br />{name:"Elevation",type:"Float",start:32,stop:37},<br />{name:"State",type:"String",start:39,stop:40},<br />{name:"Name",type:"String",start:42,stop:71},<br />{name:"GSNFlag",type:"String",start:73,stop:75},<br />{name:"HCNFlag",type:"String",start:77,stop:79},<br />{name:"WMOID",type:"String",start:81,stop:85},<br />{name:"Country", type:"String", start:1,stop:2}<br />]<br />db.weather.settings.save(settings); </span><br />
<span style="font-family: "Courier New",Courier,monospace;"><br /></span>
If we want to use the definitions later, we simply call:<br />
<br />
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;">var settings = db.weather.settings.findOne();</span></span> <br />
<br />
With 2 simple functions, we can import the data in the format defined by 'stationDef'. The first function splits a text line into the columns defined by 'definition'.<br />
<span style="font-family: "Courier New",Courier,monospace;"><br /></span>
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;">function split(line, definition){<br /> var result = new Object();<br /> for (key in definition){<br /> var column = definition[key];<br /> if (column.type=="fixed"){<br /> result[column.name] = column.value;<br /> } else {<br /> var s = line.substring(column.start-1,column.stop);<br /> if (column.type=="Float"){<br /> result[column.name] = parseFloat(s);<br /> } else if (column.type=="Integer"){<br /> result[column.name] = parseInt(s);<br /> } else {<br /> result[column.name] = s.trim();<br /> }<br /> }<br /> }<br /> return result;<br />}</span></span><br />
<br />
Another function reads a file (with cat()), and uses the split() function to parse every line.<br />
<br />
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;">function readFile(name,collection,definition){<br /> var data = cat(name);<br /> var all = new Array();<br /> var lines = data.split("\n");<br /> for(var i=1;i&lt;lines.length;i++){<br /> var result = split(lines[i],definition);<br /> all.push(result);<br /> }<br /> collection.insert(all);<br /> return all.length;<br />}</span></span><br />
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;">readFile("projects/mongo/weather/ghcnd-stations.txt",db.stations,settings.stationDef);</span></span><br />
<br />
Inserting each station document inside the for loop would be easier, but collecting the documents in an array and using that to do a <a href="http://docs.mongodb.org/manual/core/bulk-inserts/" target="_blank">bulk insert</a> improves the speed a lot. On my PC the 90,000+ records were read, parsed and inserted in about 5 seconds. The 7MB txt file results in a 21MB collection.<br />
Actually, it might be best to also store these 2 functions in the settings object (and thus in db.weather.settings), but that makes it a bit more difficult to call the one function from the other.<br />
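If a file were too large to hold every document in memory at once, a middle ground is inserting in fixed-size batches. A sketch of the chunking logic (plain JavaScript; insertInBatches and its insert callback are hypothetical, not part of the original import code):

```javascript
// Split docs into batches of batchSize and hand each batch to insertFn
// (e.g. function(batch){ collection.insert(batch); } in the mongo shell).
function insertInBatches(docs, batchSize, insertFn) {
  var batches = 0;
  for (var i = 0; i < docs.length; i += batchSize) {
    insertFn(docs.slice(i, i + batchSize));  // one bulk insert per batch
    batches++;
  }
  return batches;
}
```

This keeps most of the bulk-insert speedup while bounding memory use by batchSize.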
<h3>
Importing the weather data</h3>
The weather data itself is located in 1 file per station and has the following format (<a href="ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt" target="_blank">full details</a>)<br />
<br />
<pre><span style="font-size: x-small;">III. FORMAT OF DATA FILES (".dly" FILES)
Each ".dly" file contains data for one station. The name of the file
corresponds to a station's identification code. For example, "USC00026481.dly"
contains the data for the station with the identification code USC00026481).
Each record in a file contains one month of daily data. The variables on each
line include the following:
------------------------------
Variable Columns Type
------------------------------
ID 1-11 Character
YEAR 12-15 Integer
MONTH 16-17 Integer
ELEMENT 18-21 Character
VALUE1 22-26 Integer
MFLAG1 27-27 Character
QFLAG1 28-28 Character
SFLAG1 29-29 Character
VALUE2 30-34 Integer
MFLAG2 35-35 Character
QFLAG2 36-36 Character
SFLAG2 37-37 Character
. . .
. . .
. . .
VALUE31 262-266 Integer
MFLAG31 267-267 Character
QFLAG31 268-268 Character
SFLAG31 269-269 Character
------------------------------</span>
</pre>
<br />
We could import the weather data itself with the same functions (and a new column definition), but that would give us one document per station per month, which isn't really the structure we want in our database. We could import the data one month per document and then split it into one document per day using map-reduce, or by looping over the documents and rewriting them with javascript.<br />
Instead, we simply adjust the 2 functions a little to accept an array of column definitions (instead of just one). Each definition creates a new document, so every line in the text files (one per month) generates 31 new documents (one per day).<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;">function split(line, definitions){<br /> var allResults = new Array();<br /> for (var i in definitions){<br /> var definition = definitions[i];<br /> var result = new Object();<br /> for (var key in definition){<br /> var column = definition[key];<br /> if (column.type=="fixed"){<br /> result[column.name] = column.value;<br /> } else {<br /> var s = line.substring(column.start-1,column.stop);<br /> if (column.type=="Float"){<br /> result[column.name] = parseFloat(s);<br /> } else if (column.type=="Integer"){<br /> result[column.name] = parseInt(s);<br /> } else {<br /> result[column.name] = s.trim();<br /> }<br /> }<br /> }<br /> allResults.push(result);<br /> }<br /> return allResults;<br />}<br /><br />function readFile(name,collection,definition){<br /> var data = cat(name);<br /> var all = new Array();<br /> var lines = data.split("\n");<br /> for(var i=1;i&lt;lines.length;i++){<br /> var result = split(lines[i],definition);<br /> all=all.concat(result);<br /> }<br /> collection.insert(all);<br /> return all.length;<br />}</span></span><br />
<br />
The array that describes the column definitions for the weather data itself, and creates a new document for every day, is something like<br />
<span style="font-size: x-small;"><br /></span>
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;">settings.weatherDefs = [ <br /> [{"name" :"day","type" : "fixed","value" : 1}, <br /> {"name" :"ID", "type" : "String", "start" : 1,"stop" : 11}, <br /> {"name" :"Country","type" : "String","start" : 1,"stop" : 2}, <br /> {"name" :"year","type" : "Integer","start" : 12,"stop" : 15}, <br /> {"name" :"month","type" : "Integer","start" : 16,"stop" : 17}, <br /> {"name" :"element","type" : "String","start" : 18,"stop" : 21}, <br /> {"name" :"Value","type" : "Integer","start" : 22,"stop" : 26}, <br /> {"name" :"M","type" : "String","start" : 27,"stop" : 27}, <br /> {"name" :"Q","type" : "String","start" : 28,"stop" : 28}, <br /> {"name" :"S","type" : "String","start" : 29,"stop" : 29}<br />],</span></span><br />
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;"><span style="font-family: "Courier New",Courier,monospace;">[{"name" :"day","type" : "fixed","value" : 2}, <br /> {"name" :"ID", "type" : "String", "start" : 1,"stop" : 11}, <br /> {"name" :"Country","type" : "String","start" : 1,"stop" : 2}, <br /> {"name" :"year","type" : "Integer","start" : 12,"stop" : 15}, <br /> {"name" :"month","type" : "Integer","start" : 16,"stop" : 17}, <br /> {"name" :"element","type" : "String","start" : 18,"stop" : 21}, <br /> {"name" :"Value","type" : "Integer","start" : 30,"stop" : 34}, <br /> {"name" :"M","type" : "String","start" : 35,"stop" : 35}, <br /> {"name" :"Q","type" : "String","start" : 36,"stop" : 36}, <br /> {"name" :"S","type" : "String","start" : 37,"stop" : 37}<br />],</span> </span></span><br />
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;">...... </span></span><br />
<span style="font-size: x-small;"><br /><span style="font-family: "Courier New",Courier,monospace;"><span style="font-family: "Courier New",Courier,monospace;">[ {"name" :"day","type" : "fixed","value" : 31}, <br /> {"name" :"ID", "type" : "String", "start" : 1,"stop" : 11}, <br /> {"name" :"Country","type" : "String","start" : 1,"stop" : 2}, <br /> {"name" :"year","type" : "Integer","start" : 12,"stop" : 15}, <br /> {"name" :"month","type" : "Integer","start" : 16,"stop" : 17}, <br /> {"name" :"element","type" : "String","start" : 18,"stop" : 21}, <br /> {"name" :"Value","type" : "Integer","start":262,"stop" :266}, <br /> {"name" :"M","type" : "String","start" :267,"stop" :267}, <br /> {"name" :"Q","type" : "String","start" :268,"stop" :268}, <br /> {"name" :"S","type" : "String","start" :269,"stop" :269}<br />]</span> </span></span><br />
<span style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;">] </span></span><br />
<br />
The whole definition is available at <a href="http://pastebin.com/996ec5dg">http://pastebin.com/996ec5dg</a> <br />
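Rather than writing out all 31 per-day blocks by hand, the definition array can also be generated: only the VALUE/MFLAG/QFLAG/SFLAG columns change from day to day, advancing 8 characters per day (VALUE1 starts at column 22, VALUE31 at 262 = 22 + 30*8, per the format spec above). A sketch (plain JavaScript; buildWeatherDefs is a made-up helper name):

```javascript
// Generate the 31 per-day column definitions instead of writing them out.
// The four per-day columns shift 8 characters for each following day.
function buildWeatherDefs() {
  var defs = [];
  for (var day = 1; day <= 31; day++) {
    var off = (day - 1) * 8;  // daily stride from the .dly format spec
    defs.push([
      { name: "day",     type: "fixed",   value: day },
      { name: "ID",      type: "String",  start: 1,  stop: 11 },
      { name: "Country", type: "String",  start: 1,  stop: 2 },
      { name: "year",    type: "Integer", start: 12, stop: 15 },
      { name: "month",   type: "Integer", start: 16, stop: 17 },
      { name: "element", type: "String",  start: 18, stop: 21 },
      { name: "Value",   type: "Integer", start: 22 + off, stop: 26 + off },
      { name: "M",       type: "String",  start: 27 + off, stop: 27 + off },
      { name: "Q",       type: "String",  start: 28 + off, stop: 28 + off },
      { name: "S",       type: "String",  start: 29 + off, stop: 29 + off }
    ]);
  }
  return defs;
}
```

Assigning the result to settings.weatherDefs gives the same structure as the hand-written array above.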
Now we can import a specific file, e.g. all 187,000 observations for Belgium (about 10 seconds):<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;">readFile("projects/mongo/weather/ghcnd_all/BE000006447.dly", db.weather,settings.weatherDefs)</span></span><br />
<br />
In the UK there are 36 weather stations. To import all the data, we can use<br />
<span style="font-size: x-small;"><br /></span>
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;">var tel=0;<br />db.stations.find({Country:"UK"}).forEach(function(entry){<br /> tel++;<br /> var file = "projects/mongo/weather/ghcnd_all/"+entry.ID+".dly";<br /> var count = readFile(file,db.weather,</span></span><span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;"><span style="font-family: "Courier New",Courier,monospace;">settings.weatherDefs</span>);<br /> print(new Date()+" "+tel+" "+entry.ID+" "+count);<br />})</span></span><br />
<br />
If you have a fast computer or lots of time, you might import the data for all 49,000 stations in the US.<br />
<br />
Since, in the text files, every month has 31 days, we need to clean up the missing values (recorded as -9999) with<br />
<br />
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;">db.weather.remove({Value:-9999})</span></span><br />
<br />
The documents in the database now look like<br />
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;"> db.weather.find()<br />{ "_id" : ObjectId("5333f13b018f79b84edfcbff"), "day" : 2, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT08", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }<br />{ "_id" : ObjectId("5333f13b018f79b84edfcc00"), "day" : 3, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT08", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }<br />{ "_id" : ObjectId("5333f13b018f79b84edfcc01"), "day" : 4, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT08", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }<br />{ "_id" : ObjectId("5333f13b018f79b84edfcbf9"), "day" : 27, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT01", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }<br />{ "_id" : ObjectId("5333f13b018f79b84edfcbfa"), "day" : 28, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT01", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }</span></span>...<br />
<br />
<br />
<h3>
Statistics</h3>
I'll try some more advanced statistics in a later blog post, but we can already show some basic information.<br />
e.g. the average minimum temperature for February (from 1995 on) over all UK stations:<br />
<span style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;"><br /></span></span>
<span style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;">db.weather.aggregate(<br /> {$match:{"Country":"UK","year":{$gt:1994},month:2, element:"TMIN"}},<br /> {$group:{_id: {year:"$year",month:"$month"},<br /> Temp:{$avg:"$Value"}<br /> }},<br /> {$sort:{_id:1}} <br />)<br />{ "_id" : { "year" : 1995, "month" : 2 }, "Temp" : 46.26623376623377 }<br />{ "_id" : { "year" : 1996, "month" : 2 }, "Temp" : 23.849529780564264 }<br />{ "_id" : { "year" : 1997, "month" : 2 }, "Temp" : 48.561688311688314 }<br />{ "_id" : { "year" : 1998, "month" : 2 }, "Temp" : 64.81063122923588 }<br />{ "_id" : { "year" : 1999, "month" : 2 }, "Temp" : 43.16233766233766 }<br />{ "_id" : { "year" : 2000, "month" : 2 }, "Temp" : 48.623824451410655 }<br />{ "_id" : { "year" : 2001, "month" : 2 }, "Temp" : 31.214285714285715 }<br />{ "_id" : { "year" : 2002, "month" : 2 }, "Temp" : 49.424710424710426 }<br />{ "_id" : { "year" : 2003, "month" : 2 }, "Temp" : 30.35907335907336 }<br />{ "_id" : { "year" : 2004, "month" : 2 }, "Temp" : 32.4365671641791 }<br />{ "_id" : { "year" : 2005, "month" : 2 }, "Temp" : 19.71563981042654 }<br />{ "_id" : { "year" : 2006, "month" : 2 }, "Temp" : 35.280701754385966 }<br />{ "_id" : { "year" : 2007, "month" : 2 }, "Temp" : 50.10666666666667 }<br />{ "_id" : { "year" : 2008, "month" : 2 }, "Temp" : 43 }<br />{ "_id" : { "year" : 2009, "month" : 2 }, "Temp" : 35.508849557522126 }<br />{ "_id" : { "year" : 2010, "month" : 2 }, "Temp" : 17.67699115044248 }<br />{ "_id" : { "year" : 2011, "month" : 2 }, "Temp" : 55.18274111675127 }<br />{ "_id" : { "year" : 2012, "month" : 2 }, "Temp" : 46.019417475728154 }<br />{ "_id" : { "year" : 2013, "month" : 2 }, "Temp" : 53.49473684210526 }<br />{ "_id" : { "year" : 2014, "month" : 2 }, "Temp" : 66.68627450980392 }</span></span><br />
<span style="font-size: x-small;">(Temperatures are recorded as integers, in units of 0.1°C)</span><br />
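As a sanity check on the pipeline, here is what the $group/$avg stage computes, sketched in plain JavaScript over a few hypothetical TMIN documents (values in tenths of °C, so divide by 10 for degrees):

```javascript
// Hypothetical TMIN documents for one station-month (Value is tenths of °C).
var docs = [
  { year: 2014, month: 2, element: "TMIN", Value: 40 },
  { year: 2014, month: 2, element: "TMIN", Value: 60 },
  { year: 2014, month: 2, element: "TMIN", Value: 80 }
];

// Equivalent of {$group:{_id:{year:"$year",month:"$month"}, Temp:{$avg:"$Value"}}}:
// accumulate sum and count per group key, then divide.
var groups = {};
docs.forEach(function (d) {
  var key = d.year + "-" + d.month;
  var g = groups[key] || (groups[key] = { sum: 0, n: 0 });
  g.sum += d.Value;
  g.n++;
});
var avgTenths = groups["2014-2"].sum / groups["2014-2"].n;  // average in 0.1°C units
var avgDegrees = avgTenths / 10;                            // converted to °C
```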
<br />
Of course, an index on Country, element and month speeds this up a lot, but slows down the import if you'd like to add new data.<br />
<br />
<br />
Facebook (2014-03-11)

<h3>
Fun with mongodb and the Facebook Graph API</h3>
In a <a href="http://mongoinsideout.blogspot.be/2014/02/wget-or-xmlhttprequest.html" target="_blank">previous post</a> we showed how to make http calls from within the mongodb shell. Another nice example of what we can do with this wget function is to make calls to the <a href="https://developers.facebook.com/docs/reference/api/" target="_blank">facebook graph API</a>. Since the fb graph API uses json, we can store the results immediately in the database.<br />
<br />
We put all the relevant functions and parameters in a single object (and even save that in the db). I prefer this to using .mongorc or the system.js collection; I'll write another blog post about this technique later.<br />
<br />
<a name='more'></a><br />
<br />
We create a facebook object and copy the wget function into it.<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">var facebook = {<br /> "wget" : function wget(url){<br /> var tmp = "/tmp";<br /> var id = new ObjectId();<br /> var outFile= tmp+"/wget"+id;<br /> var p = run("wget","-o log","--output-document="+outFile,url);<br /> if (p==0){<br /> var result = cat(outFile);<br /> removeFile(outFile);<br /> return result;<br /> } else {<br /> return "";<br /> }<br /> }<br />}</span></blockquote>
<br />
<br />
(Almost) all calls to the API need an access token. You can generate a token (with the access rights you need) at <a href="https://developers.facebook.com/tools/explorer">https://developers.facebook.com/tools/explorer</a>, which is valid for 2 hours.<br />
<br />
<blockquote class="tr_bq">
<code>facebook.token = "CAACEdEose0cBAFzPP7M........udv2TjMTQQKEZD";</code></blockquote>
Now we create the function to make the graph call itself.<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">facebook.graph = function graph(user,item){</span><br />
<span style="font-family: "Courier New",Courier,monospace;"><span style="font-family: "Courier New",Courier,monospace;"> if (!user){<br /> user = "me";<br /> }</span> var url = "https://graph.facebook.com/"+user;<br /> if (item){<br /> url = url+"/"+item;<br /> }<br /> url=url+"?access_token="+this.token;<br /> var result = this.wget(url);<br /> return JSON.parse(result);<br /> }</span></blockquote>
<br />
<br />
This graph function makes a call to the graph API (using the wget function) with 2 parameters (the second one optional).<br />
<ul>
<li>user (which defaults to 'me')</li>
<li>item: specifies what object from the graph you want (friends, events, photos, ...)</li>
</ul>
Let's try a few things: <br />
<br />
<code>facebook.graph('me','friends');</code> shows a list of all my friends.<br />
<code>facebook.graph('mongodb');</code> shows info about the mongodb page.<br />
<code>facebook.graph('mongodb','photos');</code><br />
<code>facebook.graph('mongodb','likes');</code><br />
<br />
When the API returns a single object, the json reply simply describes that object. When it returns a list of objects, the list is an array named 'data' (usually followed by pagination url's).<br />
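Those pagination url's make it possible to walk through a long list page by page. A sketch of draining such a paginated reply (plain JavaScript; fetchAll is a hypothetical helper, fetchFn stands in for something like JSON.parse(facebook.wget(url)), and the { data, paging.next } shape is assumed from the Graph API replies):

```javascript
// Follow paging.next until it is absent, collecting every page's data array.
function fetchAll(firstUrl, fetchFn) {
  var all = [];
  var url = firstUrl;
  while (url) {
    var reply = fetchFn(url);                 // one page: { data: [...], paging: {...} }
    all = all.concat(reply.data);
    url = reply.paging && reply.paging.next;  // undefined on the last page
  }
  return all;
}
```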
The nice thing is that the function returns a JSON object, so it's very easy to save the results in the database.<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;"><code>var friends = facebook.graph('me','friends');<br />friends = friends.data;<br />friends.forEach(function(friend){db.friends.save(friend)});</code> </span></blockquote>
Or simply print a list of your friends<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;"><code>friends.forEach(function(friend){print(friend.name)});</code></span></blockquote>
Now we can loop over all these friends and look up basic info for each of them with this code:<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">db.friends.find().forEach(<br />function(friend){<br />friend.info = facebook.graph(friend.id); db.friends.save(friend);})</span></blockquote>
<br />
There's lots of extra functionality we could add (extra parameters like 'limit', different access_tokens for different users, POST and DELETE calls instead of only GET, ...).<br />
<br />
Have fun!! <br />
<br />
<br />
wget or XMLHttpRequest (2014-02-27)

<div class="tr_bq">
The javascript possibilities inside mongodb allow us to do interesting things from within the database itself. One thing that's missing is the XMLHttpRequest object, which would allow us to call json webservices from within the mongo shell itself. Apparently, <a href="https://jira.mongodb.org/browse/SERVER-5989" target="_blank">I'm not the only one who thinks this</a>. The issue is marked as 'lower priority' and 'features we're not sure of', so I doubt we will see it added soon.</div>
<br />
<h3>
run("wget")</h3>
To show the power of making an http request from within the database, we can use a workaround. There is an undocumented, 'for internal use only', <a href="http://docs.mongodb.org/manual/reference/method/run/" target="_blank">run()</a> method, which can start an arbitrary process such as wget. (<a href="http://blog.scrt.ch/2013/03/24/mongodb-0-day-ssji-to-rce/" target="_blank">Some others found the run() method interesting as well</a>).<br />
wget is a command-line tool on *nix machines for making http calls from the shell or from scripts. These code examples run fine on linux, but might need some changes to run on windows.<br />
The run() method returns the process' exit code, so we can't capture the output directly. Instead, we let wget dump the output to a temporary file and read the contents of that file with <a href="http://docs.mongodb.org/manual/reference/method/cat/" target="_blank">cat().</a><br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">function wget(url){</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> var tmp = "/tmp";</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> var id = new ObjectId();</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> var outFile= tmp+"/wget"+id;</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> var p = run("wget","-o log","--output-document="+outFile,url);</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> if (p==0){</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> var result = cat(outFile);</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> removeFile(outFile);</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> return result;</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> } else {</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> return "";</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> }</span><br />
<span style="font-family: "Courier New",Courier,monospace;">}</span></blockquote>
It would be cleaner with a real XMLHttpRequest, but it works ok, and opens up many possibilities.<br />
<ul>
<li>webcrawler</li>
<li>read / use webservices</li>
<li>read JSON formatted Rest API's and save the results in the database</li>
<li>...</li>
</ul>
<h3>
<a name='more'></a>
WebCrawler </h3>
Now that we can make an http request, it's not that difficult to extract all hyperlinks. If we follow these hyperlinks in turn, we essentially have a web crawler.<br />
A real crawler needs to deal with a lot more issues (relative hyperlinks, robots.txt, javascript links, ...), but the basic functionality should work.<br />
We can extract all hyperlinks from an html page like this:<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">var html =wget("http://www.google.com");</span><br />
<span style="font-family: "Courier New",Courier,monospace;">var regex = /href=["']?([^"' >]+)/gi;</span><br />
<span style="font-family: "Courier New",Courier,monospace;">var links = new Array();</span><br />
<span style="font-family: "Courier New",Courier,monospace;">var f=regex.exec(html);</span><br />
<span style="font-family: "Courier New",Courier,monospace;">while(f!=null){</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> links.push(f[1]);</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> f=regex.exec(html);</span><br />
<span style="font-family: "Courier New",Courier,monospace;">}</span><br />
<span style="font-family: "Courier New",Courier,monospace;"></span></blockquote>
<br />
We could save the results in a collection and use that collection again to perform a new search.<br />
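That loop can be sketched as a small breadth-first crawl (plain JavaScript; fetchFn and extractLinks are hypothetical stand-ins for wget() and the regex loop above, and a real version would persist the frontier and visited set in collections instead of in memory):

```javascript
// Breadth-first crawl: fetch a url, extract its links, queue the unseen ones.
function crawl(startUrl, fetchFn, extractLinks, maxPages) {
  var queue = [startUrl];
  var visited = {};
  while (queue.length > 0 && Object.keys(visited).length < maxPages) {
    var url = queue.shift();
    if (visited[url]) continue;            // already fetched this page
    visited[url] = true;
    extractLinks(fetchFn(url)).forEach(function (link) {
      if (!visited[link]) queue.push(link); // could also save to a collection here
    });
  }
  return Object.keys(visited);
}
```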
<br />
<h3>
Webservice</h3>
In this example we use our wget function to download <a href="https://www.google.com/finance" target="_blank">google finance</a> data and store it in mongodb. The normal approach would be to write a script or application that downloads the data and either converts it to a format the database can import, or parses it and inserts it.<br />
Here we show that we can implement this functionality inside the database itself. We also store the javascript code and the application parameters inside the same database.<br />
<br />
Google finance lets us download historical stock prices in csv format (e.g. https://www.google.com/finance/historical?q=NYSE%3ATWTR).<br />
We use the collection quotes.settings as a place to keep both the application parameters and the code.<br />
<span style="font-family: "Courier New",Courier,monospace;"> </span><br />
There's only 1 document in quotes.settings, so we assign "unique" as its _id value. This doesn't really do anything, but it makes clear that this collection is meant as a singleton, containing only 1 document. I often use the .settings name to keep everything together; it makes it obvious that the data inside this collection refers to the 'quotes' collection.<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">db.quotes.settings.save({</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> _id:"unique",</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> root : "http://www.google.com/finance/historical?q=$0&start=0&startdate=$1&output=csv",</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> tmpDir : "/tmp",<br /> startdate:"01-jan-2014"<br />})</span></blockquote>
In the mongodb command line we use the settings by loading them into a javascript object.<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">var quotes = db.quotes.settings.findOne();</span> </blockquote>
Not only do we store the app parameters here; the code itself can be put in the same object. First we add the wget function we created before and save it in the settings collection:<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">quotes.wget = wget;</span><br />
<span style="font-family: "Courier New",Courier,monospace;">db.quotes.settings.save(quotes);</span></blockquote>
Or if you didn't make the original wget function:<br />
<blockquote class="tr_bq">
<br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;">quotes.wget = function(url){</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> var tmp = this.tmpDir;</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> var id = new ObjectId();</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> var outFile= tmp+"/wget"+id;</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> var p = run("wget","-o log","--output-document="+outFile,url);</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> if (p==0){</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> var result = cat(outFile);</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> removeFile(outFile);</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> return result;</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> } else {</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> return "";</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> }</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;">}</span></span><span style="font-family: "Courier New",Courier,monospace;"><br />db.quotes.settings.save(quotes);</span> </blockquote>
Notice that we got rid of the hardcoded "/tmp/" and use an application parameter instead. <br />
We will add a few more functions to read the financial data, parse it, and store it in the quotes collection.<br />
The ticker codes for the different companies are stored in the quotes.sources collection. In our example we want to download data for Google, Twitter and Facebook.<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">db.quotes.sources.find()<br />{<br /> "_id" : ObjectId("530e4a12bc808878d61d1fb1"),<br /> "name" : "Google",<br /> "param" : "NASDAQ%3AGOOG"<br />}<br />{<br /> "_id" : ObjectId("530e4a7ebc808878d61d1fb2"),<br /> "name" : "Twitter",<br /> "param" : "NYSE%3ATWTR"<br />}<br />{<br /> "_id" : ObjectId("530e4aadbc808878d61d1fb3"),<br /> "name" : "Facebook",<br /> "param" : "NASDAQ%3AFB"<br />}</span></blockquote>
<br />
Now we can add some more functions to perform the download and parse the data.<br />
<blockquote>
<span style="font-family: "Courier New",Courier,monospace;">quotes.findAll = function(){<br /> var ctx=this;<br /> var sources = db.quotes.sources.find();<br /> sources.forEach(function(entry){ctx.read(entry,ctx)});<br />}</span> </blockquote>
<blockquote>
<span style="font-family: "Courier New",Courier,monospace;">quotes.read = function (entry,ctx){<br /> var last = entry.last;<br /> if (last===undefined) last="01-Jan-2014";<br /> var url = ctx.root.replace("$0",entry.param);<br /> url = url.replace("$1",last);<br /> var quote = ctx.wget(url);<br /> var lines = quote.split("\n");<br /> for(var i=1;i<lines.length;i++){<br /> var day = ctx.parseLine(lines[i]);<br /> if (day.day!==undefined){<br /> var result = db.quotes.findOne({"name":entry.name,"day":day.day});<br /> if (result==null){<br /> result=new Object();<br /> result.name=entry.name;<br /> }<br /> for (var key in day){<br /> result[key] = day[key];<br /> }<br /> db.quotes.save(result);<br /> }<br /> }<br /> entry.lastRun = new Date();<br /> entry.lastCount=lines.length;<br /> db.quotes.sources.save(entry);<br />}</span></blockquote>
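The read function builds the download URL by substituting $0 (the company code) and $1 (the start date) into the template stored in the root setting. The substitution itself is plain string handling, so it can be tried in any JavaScript engine. A minimal sketch, assuming a hypothetical template URL (the real one lives in the settings document):

```javascript
// Hypothetical URL template with $0 / $1 placeholders; the actual template
// stored in quotes.root is an assumption here.
var root = "http://example.com/historical?q=$0&startdate=$1&output=csv";

// Same substitution the read function performs:
// $0 -> the company code from quotes.sources, $1 -> the start date.
function buildUrl(root, param, last) {
  return root.replace("$0", param).replace("$1", last);
}

var url = buildUrl(root, "NASDAQ%3AGOOG", "01-Jan-2014");
```

Because replace with a string argument only replaces the first occurrence, this works as long as each placeholder appears once in the template.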
<blockquote>
<span style="font-family: "Courier New",Courier,monospace;">quotes.parseLine = function (line){<br /> var values = line.split(",");<br /> var day = new Object();<br /> if (values && values.length>5){<br /> day.day = new Date(values[0]);<br /> day.open=parseFloat(values[1]);<br /> day.high=parseFloat(values[2]);<br /> day.low=parseFloat(values[3]);<br /> day.close=parseFloat(values[4]);<br /> day.volume=parseInt(values[5],10);<br /> }<br /> return day;<br />}</span></blockquote>
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">db.quotes.settings.save(quotes);</span> </blockquote>
The findAll function loops over the sources collection and calls the read function for every entry. The read function downloads the CSV data and uses parseLine to parse the received data and store it in the quotes collection.<br />
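The parsing step is easy to exercise on its own, outside the mongo shell. A standalone sketch of the same logic, fed with a made-up sample line in the date,open,high,low,close,volume format the download returns:

```javascript
// Standalone version of the parseLine logic above, runnable in any JS engine.
function parseLine(line) {
  var values = line.split(",");
  var day = {};
  if (values && values.length > 5) {
    day.day = new Date(values[0]);          // trading day
    day.open = parseFloat(values[1]);       // opening price
    day.high = parseFloat(values[2]);       // intraday high
    day.low = parseFloat(values[3]);        // intraday low
    day.close = parseFloat(values[4]);      // closing price
    day.volume = parseInt(values[5], 10);   // shares traded
  }
  return day;                               // empty object for malformed lines
}

// Sample CSV line (values made up, same shape as the downloaded data)
var day = parseLine("2014-01-02,1115.46,1117.75,1108.26,1113.12,1822719");
```

A malformed or empty line yields an object without a day property, which is exactly the condition the read function checks before saving.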
<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;"> db.quotes.find()</span><br />
<span style="font-family: "Courier New",Courier,monospace;">{<br /> "_id" : ObjectId("533154a732e5bdd7813212ae"),<br /> "name" : "Google",<br /> "day" : ISODate("2014-01-02T00:00:00Z"),<br /> "open" : 1115.46,<br /> "high" : 1117.75,<br /> "low" : 1108.26,<br /> "close" : 1113.12,<br /> "volume" : 1822719<br />}<br />{<br /> "_id" : ObjectId("533154a732e5bdd7813212b0"),<br /> "name" : "Twitter",<br /> "day" : ISODate("2014-01-02T00:00:00Z"),<br /> "open" : 65,<br /> "high" : 67.5,<br /> "low" : 64.4,<br /> "close" : 67.5,<br /> "volume" : 29286655<br />}<br />{<br /> "_id" : ObjectId("533154a832e5bdd7813212b2"),<br /> "name" : "Facebook",<br /> "day" : ISODate("2014-01-02T00:00:00Z"),<br /> "open" : 54.83,<br /> "high" : 55.22,<br /> "low" : 54.19,<br /> "close" : 54.71,<br /> "volume" : 43257622<br />}<br />{<br /> "_id" : ObjectId("5331575c32e5bdd7813212ba"),<br /> "name" : "Google",<br /> "day" : ISODate("2014-03-24T00:00:00Z"),<br /> "open" : 1184.19,<br /> "high" : 1184.9,<br /> "low" : 1145.95,<br /> "close" : 1157.93,<br /> "volume" : 3039151<br />}<br />{<br /> "_id" : ObjectId("5331575c32e5bdd7813212bb"),<br /> "name" : "Google",<br /> "day" : ISODate("2014-03-21T00:00:00Z"),<br /> "open" : 1206.31,<br /> "high" : 1209.63,<br /> "low" : 1182.45,<br /> "close" : 1183.04,<br /> "volume" : 3210656<br />}<br />{<br /> "_id" : ObjectId("5331575c32e5bdd7813212bc"),<br /> "name" : "Google",<br /> "day" : ISODate("2014-03-20T00:00:00Z"),<br /> "open" : 1199.95,<br /> "high" : 1209.61,<br /> "low" : 1195.36,<br /> "close" : 1197.16,<br /> "volume" : 1684975<br />}</span><br />
<span style="font-family: "Courier New",Courier,monospace;">......</span></blockquote>
<br />
We could easily call the findAll function on a daily basis, so that we keep adding the most recent data.<br />
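One way to schedule this is a crontab entry that runs the shell non-interactively. A sketch, assuming the database name mydb and the time of day as placeholders, and relying on the pattern used throughout this post of reloading the stored functions from the settings collection:

```shell
# Hypothetical crontab line: fetch the latest quotes every day at 18:30.
# "mydb" and the schedule are assumptions; adjust to your setup.
30 18 * * * mongo mydb --eval 'var quotes = db.quotes.settings.findOne(); quotes.findAll();'
```

Because the sources collection records the last run for each company, rerunning the job only appends data that is not already stored.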
<br />
<br />