Large sample dataset
To use as sample data for statistical analysis, I was looking for a fairly large dataset. I wanted more than a few thousand records, but it still needs to be manageable on my own computer: a few gigabytes is OK, terabytes is too much. At http://www.ncdc.noaa.gov/cdo-web/datasets I found 23G of weather data from 91,000 stations all over the world, which looks like exactly what I need.
Import the dataset
As almost always, the data is not in a structure that mongoDB can import directly. A typical approach would be to use command-line tools (sed, grep, awk, ...) to massage the text files into a format the import routine accepts. This typically involves very long, incomprehensible commands or scripts (and lots of googling ;-). (example)
Scripts and commands like that are typically lost by the time we need them again, and have to be reconstructed from scratch. You know the feeling ...
It would be nice if someone tried to import the same data with mongoimport (and sed, grep, ...) to see how it compares with the approach presented here.
The layout of the text files is usually documented in external files (or hard-coded in the file manipulation commands), which can make the whole import process difficult and error-prone.
MongoDB is so different from typical DB systems that it also requires, and allows, a different way of thinking and working with data. There's nothing wrong with classical DBA habits, but I think mongoDB offers some possibilities that can help a lot. In a relational database, we rarely use tables with only one record: the effort to create the table is quite high. In mongo, it's very convenient to store additional info (source, import commands, metadata ...) together with the data itself. Frequently used queries, map-reduce commands, javascript functions, source URLs, application settings, metadata and even documentation can very easily be stored in the database itself.
In our case, the data is in a lot of small fixed column text files, all with an identical structure, so we also keep the information about these files in the database itself.
Importing the station info
First we import the data about the weather stations, from ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt
This information is in a fixed column format:
IV. FORMAT OF "ghcnd-stations.txt"
------------------------------
Variable Columns Type
------------------------------
ID 1-11 Character
LATITUDE 13-20 Real
LONGITUDE 22-30 Real
ELEVATION 32-37 Real
STATE 39-40 Character
NAME 42-71 Character
GSNFLAG 73-75 Character
HCNFLAG 77-79 Character
WMOID 81-85 Character
------------------------------
Since the data file isn't very big (7.5M), we can read it with the cat() function instead of mongoimport, and split the lines up according to the column definition. As mentioned before, we store this column definition in the database itself.
var settings = new Object();
settings.stationDef = [
    {name: "ID", type: "String", start: 1, stop: 11},
    {name: "Latitude", type: "Float", start: 13, stop: 20},
    {name: "Longitude", type: "Float", start: 22, stop: 30},
    {name: "Elevation", type: "Float", start: 32, stop: 37},
    {name: "State", type: "String", start: 39, stop: 40},
    {name: "Name", type: "String", start: 42, stop: 71},
    {name: "GSNFlag", type: "String", start: 73, stop: 75},
    {name: "HCNFlag", type: "String", start: 77, stop: 79},
    {name: "WMOID", type: "String", start: 81, stop: 85},
    {name: "Country", type: "String", start: 1, stop: 2}
];
db.weather.settings.save(settings);
If we want to use the definitions later, we simply call:
var settings = db.weather.settings.findOne();
With two simple functions, we can import the data in the format defined by 'stationDef'. The first function splits a text line into the columns given by 'definition':
function split(line, definition) {
    var result = new Object();
    for (var key in definition) {
        var column = definition[key];
        if (column.type == "fixed") {
            // constant value, not read from the line
            result[column.name] = column.value;
        } else {
            // column positions in the definition are 1-based and inclusive
            var s = line.substring(column.start - 1, column.stop);
            if (column.type == "Float") {
                result[column.name] = parseFloat(s);
            } else if (column.type == "Integer") {
                result[column.name] = parseInt(s, 10);
            } else {
                result[column.name] = s.trim();
            }
        }
    }
    return result;
}
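As a quick sanity check, we can try the function on a line in the fixed-column format described above, with a subset of 'stationDef' (this sketch is plain javascript, so it also runs outside the mongo shell):

```javascript
// A station line in the format described above: ID in columns 1-11,
// latitude 13-20, longitude 22-30, elevation 32-37.
var line = "ACW00011604  17.1167  -61.7833   10.1    ST JOHNS COOLIDGE FLD";

function split(line, definition) {
    var result = {};
    for (var key in definition) {
        var column = definition[key];
        if (column.type == "fixed") {
            result[column.name] = column.value;
        } else {
            var s = line.substring(column.start - 1, column.stop);
            if (column.type == "Float") result[column.name] = parseFloat(s);
            else if (column.type == "Integer") result[column.name] = parseInt(s, 10);
            else result[column.name] = s.trim();
        }
    }
    return result;
}

var def = [
    {name: "ID", type: "String", start: 1, stop: 11},
    {name: "Latitude", type: "Float", start: 13, stop: 20},
    {name: "Longitude", type: "Float", start: 22, stop: 30},
    {name: "Elevation", type: "Float", start: 32, stop: 37}
];
var doc = split(line, def);
// doc is {ID: "ACW00011604", Latitude: 17.1167, Longitude: -61.7833, Elevation: 10.1}
```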
Another function reads a file (with cat()), and uses the split() function to parse every line.
function readFile(name, collection, definition) {
    var data = cat(name);
    var all = new Array();
    var lines = data.split("\n");
    for (var i = 0; i < lines.length; i++) {
        // skip empty lines, e.g. the one produced by the trailing newline
        if (lines[i].length == 0) continue;
        all.push(split(lines[i], definition));
    }
    // one bulk insert instead of one insert per document
    collection.insert(all);
    return all.length;
}
readFile("projects/mongo/weather/ghcnd-stations.txt", db.stations, settings.stationDef);
Inserting each station inside the for loop would be easier, but collecting the documents in an array and doing one bulk insert improves the speed a lot. On my PC, the >90,000 records were read, parsed and inserted in about 5 seconds. The 7MB text file results in a 21MB collection.
Actually, it might be best to also store these two functions in the settings object (and thus in db.weather.settings), but that makes it a bit harder to call the one function from the other.
Importing the weather data
The weather data itself is located in one file per station and has the following format (full details):
III. FORMAT OF DATA FILES (".dly" FILES)
Each ".dly" file contains data for one station. The name of the file
corresponds to a station's identification code. For example, "USC00026481.dly"
contains the data for the station with the identification code USC00026481).
Each record in a file contains one month of daily data. The variables on each
line include the following:
------------------------------
Variable Columns Type
------------------------------
ID 1-11 Character
YEAR 12-15 Integer
MONTH 16-17 Integer
ELEMENT 18-21 Character
VALUE1 22-26 Integer
MFLAG1 27-27 Character
QFLAG1 28-28 Character
SFLAG1 29-29 Character
VALUE2 30-34 Integer
MFLAG2 35-35 Character
QFLAG2 36-36 Character
SFLAG2 37-37 Character
. . .
. . .
. . .
VALUE31 262-266 Integer
MFLAG31 267-267 Character
QFLAG31 268-268 Character
SFLAG31 269-269 Character
------------------------------
We could import the weather data with the same functions (and a new column definition), but that would give us one document per station per month, which isn't really the structure we want in our database. We could import the data one month per document and then split it into one document per day with map-reduce, or by looping over the documents and rewriting them in javascript.
Instead, we simply adjust the two functions a little to accept an array of column definitions (instead of just one). Each definition in the array produces a new document, so every line in the text files (one per month) generates 31 new documents (one per day).
function split(line, definitions) {
    var allResults = new Array();
    for (var i in definitions) {
        var definition = definitions[i];
        var result = new Object();
        for (var key in definition) {
            var column = definition[key];
            if (column.type == "fixed") {
                result[column.name] = column.value;
            } else {
                var s = line.substring(column.start - 1, column.stop);
                if (column.type == "Float") {
                    result[column.name] = parseFloat(s);
                } else if (column.type == "Integer") {
                    result[column.name] = parseInt(s, 10);
                } else {
                    result[column.name] = s.trim();
                }
            }
        }
        allResults.push(result);
    }
    return allResults;
}
function readFile(name, collection, definitions) {
    var data = cat(name);
    var all = new Array();
    var lines = data.split("\n");
    for (var i = 0; i < lines.length; i++) {
        if (lines[i].length == 0) continue;
        all = all.concat(split(lines[i], definitions));
    }
    collection.insert(all);
    return all.length;
}
The array that describes the column definitions for the weather data itself, and creates a new document for every day, looks something like:
settings.weatherDefs = [
[{"name" :"day","type" : "fixed","value" : 1},
{"name" :"ID", "type" : "String", "start" : 1,"stop" : 11},
{"name" :"Country","type" : "String","start" : 1,"stop" : 2},
{"name" :"year","type" : "Integer","start" : 12,"stop" : 15},
{"name" :"month","type" : "Integer","start" : 16,"stop" : 17},
{"name" :"element","type" : "String","start" : 18,"stop" : 21},
{"name" :"Value","type" : "Integer","start" : 22,"stop" : 26},
{"name" :"M","type" : "String","start" : 27,"stop" : 27},
{"name" :"Q","type" : "String","start" : 28,"stop" : 28},
{"name" :"S","type" : "String","start" : 29,"stop" : 29}
],
[{"name" :"day","type" : "fixed","value" : 2},
{"name" :"ID", "type" : "String", "start" : 1,"stop" : 11},
{"name" :"Country","type" : "String","start" : 1,"stop" : 2},
{"name" :"year","type" : "Integer","start" : 12,"stop" : 15},
{"name" :"month","type" : "Integer","start" : 16,"stop" : 17},
{"name" :"element","type" : "String","start" : 18,"stop" : 21},
{"name" :"Value","type" : "Integer","start" : 30,"stop" : 34},
{"name" :"M","type" : "String","start" : 35,"stop" : 35},
{"name" :"Q","type" : "String","start" : 36,"stop" : 36},
{"name" :"S","type" : "String","start" : 37,"stop" : 37}
],
......
[ {"name" :"day","type" : "fixed","value" : 31},
{"name" :"ID", "type" : "String", "start" : 1,"stop" : 11},
{"name" :"Country","type" : "String","start" : 1,"stop" : 2},
{"name" :"year","type" : "Integer","start" : 12,"stop" : 15},
{"name" :"month","type" : "Integer","start" : 16,"stop" : 17},
{"name" :"element","type" : "String","start" : 18,"stop" : 21},
{"name" :"Value","type" : "Integer","start":262,"stop" :266},
{"name" :"M","type" : "String","start" :267,"stop" :267},
{"name" :"Q","type" : "String","start" :268,"stop" :268},
{"name" :"S","type" : "String","start" :269,"stop" :269}
]
]
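Since the 31 per-day definitions differ only in the value/flag columns (each day occupies 8 characters, starting at column 22), the whole array can also be generated with a short loop instead of being written out by hand:

```javascript
// Build the 31 per-day column definitions programmatically.
// Day d's VALUE starts at column 22 + (d - 1) * 8 and is 5 characters
// wide, followed by the three 1-character M/Q/S flags.
var weatherDefs = [];
for (var day = 1; day <= 31; day++) {
    var v = 22 + (day - 1) * 8;
    weatherDefs.push([
        {name: "day", type: "fixed", value: day},
        {name: "ID", type: "String", start: 1, stop: 11},
        {name: "Country", type: "String", start: 1, stop: 2},
        {name: "year", type: "Integer", start: 12, stop: 15},
        {name: "month", type: "Integer", start: 16, stop: 17},
        {name: "element", type: "String", start: 18, stop: 21},
        {name: "Value", type: "Integer", start: v, stop: v + 4},
        {name: "M", type: "String", start: v + 5, stop: v + 5},
        {name: "Q", type: "String", start: v + 6, stop: v + 6},
        {name: "S", type: "String", start: v + 7, stop: v + 7}
    ]);
}
// Day 31's Value lands on columns 262-266, matching the format spec above.
```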
The whole definition is available at http://pastebin.com/996ec5dg
Now we can import a specific file, e.g. all 187,000 observations for Belgium (about 10 seconds):
readFile("projects/mongo/weather/ghcnd_all/BE000006447.dly", db.weather,settings.weatherDefs)
In the UK there are 36 weather stations. To import all their data, we can use:
var tel = 0;
db.stations.find({Country: "UK"}).forEach(function(entry) {
    tel++;
    var file = "projects/mongo/weather/ghcnd_all/" + entry.ID + ".dly";
    var count = readFile(file, db.weather, settings.weatherDefs);
    print(new Date() + " " + tel + " " + entry.ID + " " + count);
})
If you have a fast computer or lots of time, you can import the data for all 49,000 stations in the US.
Since every month in the text files has 31 days, the non-existent days come in as the missing-value marker -9999, so we need to clean up a bit with
db.weather.remove({Value:-9999})
The documents in the database now look like this:
db.weather.find()
{ "_id" : ObjectId("5333f13b018f79b84edfcbff"), "day" : 2, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT08", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }
{ "_id" : ObjectId("5333f13b018f79b84edfcc00"), "day" : 3, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT08", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }
{ "_id" : ObjectId("5333f13b018f79b84edfcc01"), "day" : 4, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT08", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }
{ "_id" : ObjectId("5333f13b018f79b84edfcbf9"), "day" : 27, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT01", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }
{ "_id" : ObjectId("5333f13b018f79b84edfcbfa"), "day" : 28, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT01", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }...
Statistics
I'll try some more advanced statistics in a later blog post, but we can already show some basic information, e.g. the average minimum temperature for February over all UK stations:
db.weather.aggregate(
{$match:{"Country":"UK","year":{$gt:1994},month:2, element:"TMIN"}},
{$group:{_id: {year:"$year",month:"$month"},
Temp:{$avg:"$Value"}
}},
{$sort:{_id:1}}
)
{ "_id" : { "year" : 1995, "month" : 2 }, "Temp" : 46.26623376623377 }
{ "_id" : { "year" : 1996, "month" : 2 }, "Temp" : 23.849529780564264 }
{ "_id" : { "year" : 1997, "month" : 2 }, "Temp" : 48.561688311688314 }
{ "_id" : { "year" : 1998, "month" : 2 }, "Temp" : 64.81063122923588 }
{ "_id" : { "year" : 1999, "month" : 2 }, "Temp" : 43.16233766233766 }
{ "_id" : { "year" : 2000, "month" : 2 }, "Temp" : 48.623824451410655 }
{ "_id" : { "year" : 2001, "month" : 2 }, "Temp" : 31.214285714285715 }
{ "_id" : { "year" : 2002, "month" : 2 }, "Temp" : 49.424710424710426 }
{ "_id" : { "year" : 2003, "month" : 2 }, "Temp" : 30.35907335907336 }
{ "_id" : { "year" : 2004, "month" : 2 }, "Temp" : 32.4365671641791 }
{ "_id" : { "year" : 2005, "month" : 2 }, "Temp" : 19.71563981042654 }
{ "_id" : { "year" : 2006, "month" : 2 }, "Temp" : 35.280701754385966 }
{ "_id" : { "year" : 2007, "month" : 2 }, "Temp" : 50.10666666666667 }
{ "_id" : { "year" : 2008, "month" : 2 }, "Temp" : 43 }
{ "_id" : { "year" : 2009, "month" : 2 }, "Temp" : 35.508849557522126 }
{ "_id" : { "year" : 2010, "month" : 2 }, "Temp" : 17.67699115044248 }
{ "_id" : { "year" : 2011, "month" : 2 }, "Temp" : 55.18274111675127 }
{ "_id" : { "year" : 2012, "month" : 2 }, "Temp" : 46.019417475728154 }
{ "_id" : { "year" : 2013, "month" : 2 }, "Temp" : 53.49473684210526 }
{ "_id" : { "year" : 2014, "month" : 2 }, "Temp" : 66.68627450980392 }
(Temperatures are recorded as integers in tenths of °C.)
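Converting the returned averages to degrees is simple arithmetic (a sketch; the sample value is the 1995 average from the output above):

```javascript
// Averages come back in tenths of degrees Celsius;
// convert to degrees, rounded to one decimal.
function toCelsius(tenths) {
    return Math.round(tenths) / 10;
}
var feb1995 = toCelsius(46.26623376623377); // about 4.6 degrees C
```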
Of course, an index on Country, element and month (db.weather.ensureIndex({Country: 1, element: 1, month: 1})) speeds this query up a lot, but it slows down the import if you want to add new data later.