Thursday, March 27, 2014

Importing Weather Data

Large sample dataset

To use as sample data for statistical analysis, I was looking for a fairly large dataset. I wanted more than a few thousand records, but it still needs to be manageable on my own computer. A few gigabytes is OK, terabytes is too much. At http://www.ncdc.noaa.gov/cdo-web/datasets I found 23 GB of weather data from 91,000 stations all over the world, which looks like exactly what I need.

Import the dataset

As is almost always the case, the data is not in a structure that mongoDB can import directly.
A typical approach would be to use tools and programs (sed, grep, awk, ...) to massage the text files into a format the import routine accepts. This typically leads to very long, incomprehensible commands or scripts (and lots of googling ;-).  (example)
Such scripts and commands are typically lost by the time we need them again, and have to be reconstructed from scratch. You know the feeling ...
It would be nice if someone tried to import the same data with mongoimport (and sed, grep, ...) to compare with the approach presented here.
The definition of the text file layout is usually stored in external files (or hard-coded in the file manipulation commands), which can make the whole import process difficult and error prone.
MongoDB is so different from typical DB systems that it requires, and allows, a different way of thinking about and working with data. There's nothing wrong with classical DBA habits, but I think mongoDB offers some possibilities that can help a lot. In a relational database, we rarely create a table for a single record: the effort to create the table is quite high. In mongo, it's very convenient to store additional info (source, import commands, metadata, ...) together with the data itself. Frequently used queries, map-reduce commands, javascript functions, source URLs, application settings, metadata and even documentation can all very easily be stored in the database itself.
In our case, the data comes as a lot of small fixed-column text files, all with an identical structure, so we also keep the information about these files in the database itself.


Importing the station info

First we import the data about the weather stations.

ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt
This information is in a fixed column format:

IV. FORMAT OF "ghcnd-stations.txt"
------------------------------
Variable   Columns   Type
------------------------------
ID            1-11   Character
LATITUDE     13-20   Real
LONGITUDE    22-30   Real
ELEVATION    32-37   Real
STATE        39-40   Character
NAME         42-71   Character
GSNFLAG      73-75   Character
HCNFLAG      77-79   Character
WMOID        81-85   Character
------------------------------ 

Since the data file isn't very big (7.5 MB), we can import it with the cat() function instead of mongoimport, and split up the lines according to the column definition. As mentioned before, we store this column definition in the database itself.


var settings = new Object();
settings.stationDef = [
{name:"ID", type:"String", start:1,stop:11},
{name:"Latitude",type:"Float",start:13,stop:20 },
{name:"Longitude",type:"Float",start:22,stop:30 },
{name:"Elevation",type:"Float",start:32,stop:37},
{name:"State",type:"String",start:39,stop:40},
{name:"Name",type:"String",start:42,stop:71},
{name:"GSNFlag",type:"String",start:73,stop:75},
{name:"HCNFlag",type:"String",start:77,stop:79},
{name:"WMOID",type:"String",start:81,stop:85},
{name:"Country", type:"String", start:1,stop:2}  // first 2 characters of the ID are the country code
]
db.weather.settings.save(settings);


If we want to use the definitions later, we simply call:

var settings = db.weather.settings.findOne();

With 2 simple functions, we can import the data in the format defined by 'stationDef'. The first function splits a text line into the columns defined by 'definition':

function split(line, definition){
    var result = new Object();
    for (var key in definition){
        var column = definition[key];
        if (column.type == "fixed"){
            result[column.name] = column.value;
        } else {
            var s = line.substring(column.start - 1, column.stop);
            if (column.type == "Float"){
                result[column.name] = parseFloat(s);
            } else if (column.type == "Integer"){
                result[column.name] = parseInt(s, 10);  // radix 10: avoid octal parsing of zero-padded fields
            } else {
                result[column.name] = s.trim();
            }
        }
    }
    return result;
}
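For example, here is split() applied to a fabricated station line in the ghcnd-stations.txt layout (the line and the reduced column list are invented for illustration; the function is repeated so the snippet runs on its own):

```javascript
// split() as defined above, repeated so this snippet is self-contained
function split(line, definition){
    var result = new Object();
    for (var key in definition){
        var column = definition[key];
        if (column.type == "fixed"){
            result[column.name] = column.value;
        } else {
            var s = line.substring(column.start - 1, column.stop);
            if (column.type == "Float"){
                result[column.name] = parseFloat(s);
            } else if (column.type == "Integer"){
                result[column.name] = parseInt(s, 10);
            } else {
                result[column.name] = s.trim();
            }
        }
    }
    return result;
}

// a fabricated line, built column by column per the format table
var line = "XX000012345" +      // ID        columns  1-11
           " " +
           " 50.8000" +         // LATITUDE  columns 13-20
           " " +
           "   4.3500" +        // LONGITUDE columns 22-30
           " " +
           "  13.0" +           // ELEVATION columns 32-37
           " " +
           "  " +               // STATE     columns 39-40
           " " +
           "BRUSSELS TEST";     // NAME      columns 42-71

// a reduced column definition, in the same shape as settings.stationDef
var def = [
    {name:"ID",        type:"String", start:1,  stop:11},
    {name:"Latitude",  type:"Float",  start:13, stop:20},
    {name:"Longitude", type:"Float",  start:22, stop:30},
    {name:"Elevation", type:"Float",  start:32, stop:37},
    {name:"Name",      type:"String", start:42, stop:71}
];

var doc = split(line, def);
// doc is now {ID:"XX000012345", Latitude:50.8, Longitude:4.35, Elevation:13, Name:"BRUSSELS TEST"}
```

Note that substring() is forgiving when a line is shorter than the last column, so trailing optional fields simply come out as empty strings.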


Another function reads a file (with cat()), and uses the split() function to parse every line:

function readFile(name, collection, definition){
    var data = cat(name);
    var all = new Array();
    var lines = data.split("\n");
    for (var i = 0; i < lines.length; i++){
        if (lines[i].length == 0) continue;  // skip empty (e.g. trailing) lines
        all.push(split(lines[i], definition));
    }
    collection.insert(all);
    return all.length;
}

readFile("projects/mongo/weather/ghcnd-stations.txt", db.stations, settings.stationDef);

Inserting each station inside the for loop would be simpler, but collecting the documents in an array and using it for a single bulk insert improves the speed a lot. On my PC, reading, parsing and inserting the >90,000 records took about 5 seconds. The 7 MB text file results in a 21 MB collection.
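If a file were too large to hold in one array, a middle ground would be to insert in batches. This is a sketch, not part of the original import: bulkInsert() is a hypothetical helper, and the batch size of 10000 is an arbitrary choice.

```javascript
// insert documents in fixed-size batches instead of one huge array
function bulkInsert(collection, docs, batchSize){
    var total = 0;
    for (var i = 0; i < docs.length; i += batchSize){
        var batch = docs.slice(i, i + batchSize);
        collection.insert(batch);
        total += batch.length;
    }
    return total;
}

// a stand-in collection that only counts inserts, so the sketch can be
// tried outside the mongo shell; with a real db you'd pass e.g. db.stations
var fake = {
    calls: 0,
    count: 0,
    insert: function(batch){ this.calls++; this.count += batch.length; }
};
var docs = [];
for (var n = 0; n < 25; n++) docs.push({n: n});
var inserted = bulkInsert(fake, docs, 10);  // 3 batches: 10 + 10 + 5 documents
```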
Actually, it might be best to also store these 2 functions in the settings object (and thus in db.weather.settings), but that makes it a bit harder to call the one function from the other.

Importing the weather data

The weather data itself comes in 1 file per station and has the following format (full details):

III. FORMAT OF DATA FILES (".dly" FILES)

Each ".dly" file contains data for one station.  The name of the file
corresponds to a station's identification code.  For example, "USC00026481.dly"
contains the data for the station with the identification code USC00026481).

Each record in a file contains one month of daily data.  The variables on each
line include the following:

------------------------------
Variable   Columns   Type
------------------------------
ID            1-11   Character
YEAR         12-15   Integer
MONTH        16-17   Integer
ELEMENT      18-21   Character
VALUE1       22-26   Integer
MFLAG1       27-27   Character
QFLAG1       28-28   Character
SFLAG1       29-29   Character
VALUE2       30-34   Integer
MFLAG2       35-35   Character
QFLAG2       36-36   Character
SFLAG2       37-37   Character
  .           .          .
  .           .          .
  .           .          .
VALUE31    262-266   Integer
MFLAG31    267-267   Character
QFLAG31    268-268   Character
SFLAG31    269-269   Character
------------------------------

We could import the weather data itself with the same functions (and a new column definition), but that would give us 1 document per station per month, which isn't really the structure we want in our database. We could import the data as 1 month per document and then split it into a document per day, using map-reduce or by looping over the documents and rewriting them in javascript.
Instead, we simply adjust the 2 functions a little to accept an array of column definitions (instead of just 1). Each definition creates a new document, so every line in the text files (one per month) generates 31 new documents (one per day).

function split(line, definitions){
    var allResults = new Array();
    for (var i in definitions){
        var definition = definitions[i];
        var result = new Object();
        for (var key in definition){
            var column = definition[key];
            if (column.type == "fixed"){
                result[column.name] = column.value;
            } else {
                var s = line.substring(column.start - 1, column.stop);
                if (column.type == "Float"){
                    result[column.name] = parseFloat(s);
                } else if (column.type == "Integer"){
                    result[column.name] = parseInt(s, 10);  // radix 10: avoid octal parsing of zero-padded fields
                } else {
                    result[column.name] = s.trim();
                }
            }
        }
        allResults.push(result);
    }
    return allResults;
}

function readFile(name, collection, definition){
    var data = cat(name);
    var all = new Array();
    var lines = data.split("\n");
    for (var i = 0; i < lines.length; i++){
        if (lines[i].length == 0) continue;  // skip empty (e.g. trailing) lines
        all = all.concat(split(lines[i], definition));
    }
    collection.insert(all);
    return all.length;
}
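To see what the adjusted split() produces, here it is applied to a fabricated .dly line, with definitions for the first 2 days only (the line and its values are invented; the function is repeated so the snippet runs on its own):

```javascript
// the adjusted split() from above, repeated so this snippet is self-contained
function split(line, definitions){
    var allResults = new Array();
    for (var i in definitions){
        var definition = definitions[i];
        var result = new Object();
        for (var key in definition){
            var column = definition[key];
            if (column.type == "fixed"){
                result[column.name] = column.value;
            } else {
                var s = line.substring(column.start - 1, column.stop);
                if (column.type == "Float"){
                    result[column.name] = parseFloat(s);
                } else if (column.type == "Integer"){
                    result[column.name] = parseInt(s, 10);
                } else {
                    result[column.name] = s.trim();
                }
            }
        }
        allResults.push(result);
    }
    return allResults;
}

// one fabricated month line: station XX000012345, February 1953, TMIN,
// day 1 = 15 (i.e. 1.5 degrees, flag S=X), day 2 = -9999 (missing)
var line = "XX000012345" + "1953" + "02" + "TMIN" +
           "   15" + " " + " " + "X" +     // VALUE1/MFLAG1/QFLAG1/SFLAG1, cols 22-29
           "-9999" + " " + " " + " ";      // VALUE2/MFLAG2/QFLAG2/SFLAG2, cols 30-37

// definitions for day 1 and day 2 only (the real settings.weatherDefs has 31)
function dayDef(day, start){
    return [
        {name:"day",     type:"fixed",   value: day},
        {name:"ID",      type:"String",  start:1,  stop:11},
        {name:"Country", type:"String",  start:1,  stop:2},
        {name:"year",    type:"Integer", start:12, stop:15},
        {name:"month",   type:"Integer", start:16, stop:17},
        {name:"element", type:"String",  start:18, stop:21},
        {name:"Value",   type:"Integer", start:start,     stop:start + 4},
        {name:"M",       type:"String",  start:start + 5, stop:start + 5},
        {name:"Q",       type:"String",  start:start + 6, stop:start + 6},
        {name:"S",       type:"String",  start:start + 7, stop:start + 7}
    ];
}
var docs = split(line, [dayDef(1, 22), dayDef(2, 30)]);
// docs[0]: {day:1, ..., Value:15, S:"X"}   docs[1]: {day:2, ..., Value:-9999}
```

The -9999 for day 2 is exactly the padding value that the cleanup step further down removes.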


The array that describes the column definitions for the weather data, and creates a new document for every day, looks like this:

settings.weatherDefs = [
  [{"name" :"day","type" : "fixed","value" : 1},
  {"name" :"ID", "type" : "String", "start" : 1,"stop" : 11},
  {"name" :"Country","type" : "String","start" : 1,"stop" : 2},
  {"name" :"year","type" : "Integer","start" : 12,"stop" : 15},
  {"name" :"month","type" : "Integer","start" : 16,"stop" : 17},
  {"name" :"element","type" : "String","start" : 18,"stop" : 21},
  {"name" :"Value","type" : "Integer","start" : 22,"stop" : 26},
  {"name" :"M","type" : "String","start" : 27,"stop" : 27},
  {"name" :"Q","type" : "String","start" : 28,"stop" : 28},
  {"name" :"S","type" : "String","start" : 29,"stop" : 29}
],

  [{"name" :"day","type" : "fixed","value" : 2},
  {"name" :"ID", "type" : "String", "start" : 1,"stop" : 11},
  {"name" :"Country","type" : "String","start" : 1,"stop" : 2},
  {"name" :"year","type" : "Integer","start" : 12,"stop" : 15},
  {"name" :"month","type" : "Integer","start" : 16,"stop" : 17},
  {"name" :"element","type" : "String","start" : 18,"stop" : 21},
  {"name" :"Value","type" : "Integer","start" : 30,"stop" : 34},
  {"name" :"M","type" : "String","start" : 35,"stop" : 35},
  {"name" :"Q","type" : "String","start" : 36,"stop" : 36},
  {"name" :"S","type" : "String","start" : 37,"stop" : 37}
],
 

......

[ {"name" :"day","type" : "fixed","value" : 31},
  {"name" :"ID", "type" : "String", "start" : 1,"stop" : 11},
  {"name" :"Country","type" : "String","start" : 1,"stop" : 2},
  {"name" :"year","type" : "Integer","start" : 12,"stop" : 15},
  {"name" :"month","type" : "Integer","start" : 16,"stop" : 17},
  {"name" :"element","type" : "String","start" : 18,"stop" : 21},
  {"name" :"Value","type" : "Integer","start":262,"stop" :266},
  {"name" :"M","type" : "String","start" :267,"stop" :267},
  {"name" :"Q","type" : "String","start" :268,"stop" :268},
  {"name" :"S","type" : "String","start" :269,"stop" :269}
]
 

]      

The whole definition is available at http://pastebin.com/996ec5dg
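Incidentally, instead of pasting 31 near-identical blocks, the array can also be generated: per the format table, each day occupies 8 columns (5 for the value, 1 each for the M, Q and S flags), with day 1 starting at column 22. A sketch (buildWeatherDefs is a made-up name, not part of the original import):

```javascript
// build the 31 column definitions for one .dly line;
// day d's VALUE field starts at column 22 + 8*(d-1) and is 5 wide,
// followed by the 1-character M, Q and S flags
function buildWeatherDefs(){
    var defs = [];
    for (var day = 1; day <= 31; day++){
        var start = 22 + 8 * (day - 1);
        defs.push([
            {name:"day",     type:"fixed",   value: day},
            {name:"ID",      type:"String",  start:1,  stop:11},
            {name:"Country", type:"String",  start:1,  stop:2},
            {name:"year",    type:"Integer", start:12, stop:15},
            {name:"month",   type:"Integer", start:16, stop:17},
            {name:"element", type:"String",  start:18, stop:21},
            {name:"Value",   type:"Integer", start:start,     stop:start + 4},
            {name:"M",       type:"String",  start:start + 5, stop:start + 5},
            {name:"Q",       type:"String",  start:start + 6, stop:start + 6},
            {name:"S",       type:"String",  start:start + 7, stop:start + 7}
        ]);
    }
    return defs;
}

var weatherDefs = buildWeatherDefs();
// weatherDefs[30] describes day 31: Value in columns 262-266, flags in 267-269
```

The result can then be stored in the settings document in the same way as stationDef.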
Now we can import a specific file, e.g. all 187,000 observations for Belgium (about 10 seconds):

readFile("projects/mongo/weather/ghcnd_all/BE000006447.dly", db.weather,settings.weatherDefs)

In the UK there are 36 weather stations. To import all their data, we can use:

var tel = 0;  // counts the processed stations
db.stations.find({Country:"UK"}).forEach(function(entry){
    tel++;
    var file = "projects/mongo/weather/ghcnd_all/" + entry.ID + ".dly";
    var count = readFile(file, db.weather, settings.weatherDefs);
    print(new Date() + " " + tel + " " + entry.ID + "   " + count);
})


If you have a fast computer or lots of time, you can import the data for all 49,000 stations in the US.

Since every month in the text files has 31 days, the nonexistent days are padded with the missing-data value -9999, so we need to clean up a bit with

db.weather.remove({Value:-9999})

The documents in the database now look like

db.weather.find()
{ "_id" : ObjectId("5333f13b018f79b84edfcbff"), "day" : 2, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT08", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }
{ "_id" : ObjectId("5333f13b018f79b84edfcc00"), "day" : 3, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT08", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }
{ "_id" : ObjectId("5333f13b018f79b84edfcc01"), "day" : 4, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT08", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }
{ "_id" : ObjectId("5333f13b018f79b84edfcbf9"), "day" : 27, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT01", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }
{ "_id" : ObjectId("5333f13b018f79b84edfcbfa"), "day" : 28, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT01", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }
...


Statistics

I'll try some more advanced statistics in a later blog post, but we can already show some basic information,
e.g. the average minimum temperature for February (since 1995) over all UK stations:

db.weather.aggregate(
    {$match:{"Country":"UK","year":{$gt:1994},month:2, element:"TMIN"}},
    {$group:{_id: {year:"$year",month:"$month"},
        Temp:{$avg:"$Value"}
        }},
    {$sort:{_id:1}}   
)
{ "_id" : { "year" : 1995, "month" : 2 }, "Temp" : 46.26623376623377 }
{ "_id" : { "year" : 1996, "month" : 2 }, "Temp" : 23.849529780564264 }
{ "_id" : { "year" : 1997, "month" : 2 }, "Temp" : 48.561688311688314 }
{ "_id" : { "year" : 1998, "month" : 2 }, "Temp" : 64.81063122923588 }
{ "_id" : { "year" : 1999, "month" : 2 }, "Temp" : 43.16233766233766 }
{ "_id" : { "year" : 2000, "month" : 2 }, "Temp" : 48.623824451410655 }
{ "_id" : { "year" : 2001, "month" : 2 }, "Temp" : 31.214285714285715 }
{ "_id" : { "year" : 2002, "month" : 2 }, "Temp" : 49.424710424710426 }
{ "_id" : { "year" : 2003, "month" : 2 }, "Temp" : 30.35907335907336 }
{ "_id" : { "year" : 2004, "month" : 2 }, "Temp" : 32.4365671641791 }
{ "_id" : { "year" : 2005, "month" : 2 }, "Temp" : 19.71563981042654 }
{ "_id" : { "year" : 2006, "month" : 2 }, "Temp" : 35.280701754385966 }
{ "_id" : { "year" : 2007, "month" : 2 }, "Temp" : 50.10666666666667 }
{ "_id" : { "year" : 2008, "month" : 2 }, "Temp" : 43 }
{ "_id" : { "year" : 2009, "month" : 2 }, "Temp" : 35.508849557522126 }
{ "_id" : { "year" : 2010, "month" : 2 }, "Temp" : 17.67699115044248 }
{ "_id" : { "year" : 2011, "month" : 2 }, "Temp" : 55.18274111675127 }
{ "_id" : { "year" : 2012, "month" : 2 }, "Temp" : 46.019417475728154 }
{ "_id" : { "year" : 2013, "month" : 2 }, "Temp" : 53.49473684210526 }
{ "_id" : { "year" : 2014, "month" : 2 }, "Temp" : 66.68627450980392 }

(Temperatures are recorded as integers in tenths of °C, so an average of 46.27 corresponds to about 4.6 °C.)

Of course, an index on Country, element and month speeds this up a lot, but it lowers the import speed if you later want to add new data.

