Thursday, March 27, 2014

Importing Weather Data

Large sample dataset

To use as sample data for doing statistical analysis, I was looking for a fairly large dataset. I wanted more than a few thousands records, but it still needs to be manageable on my own computer. A few gigabytes is ok, terrabytes is too much. At http://www.ncdc.noaa.gov/cdo-web/datasets I found 23G of weather data from 91.000 stations all over the world which looks exactly what I need.

Import the dataset

As almost always, the data is not in a simple mongoDB importable structure.
A typical approach would be to use tools and programs (sed, grep, awk, ...) to manipulate the text files to fit the format to the import routine. This typically involves very long, incomprehensible commands or scripts (and lots of googling ;-).  (example)
Scripts and commands which are typically lost when we need them again, and need to be reconstructed again, you know the feeling ...
It would be nice if someone tried to import the same data with mongoimport (and sed, grep...) to see the difference with the approach presented here.
The text file definition is mostly stored in external files (or hard coded in the file manipulation commands), which can make the whole import process difficult and error prone.
MongoDB is so different from typical DB systems, that it also requires / allows a different way of thinking and working with data. There's nothing wrong with the classical dba habits, but I think mongoDB offers some possibilities that might help a lot. In a relational database, we don't often use tables with only 1 record. The effort to create the table is quite high. In mongo, it's very convenient to store additional info (source, import commands, metadata ...) together with the data itself. Often used queries, map-reduce commands, javascript functions, source url's, application settings, metadata and even documentation ... can very easy be stored in the database itself.
In our case, the data is in a lot of small fixed column text files, all with an identical structure, so we also keep the information about these files in the database itself.

Tuesday, March 11, 2014

Facebook

Fun with mongodb and the Facebook Graph API

In a previous post we showed how to make http calls from within the mongodb shell. Another nice example of what we can do with this wget function is to make calls to the facebook graph API. Since the fb graph API uses json, we can store the results immediately in the database.

We put all the relevant functions and parameters in a single object (and even save that in the db. I prefer this to using .mongorc or the system.js collection. I'll write another blog post about this technique later).