Mongo Inside Out, by Geert Van Damme

Importing Weather Data (2014-03-27)

<h3>
Large sample dataset</h3>
To use as sample data for statistical analysis, I was looking for a fairly large dataset: more than a few thousand records, but still manageable on my own computer. A few gigabytes is fine, terabytes is too much. At <a href="http://www.ncdc.noaa.gov/cdo-web/datasets">http://www.ncdc.noaa.gov/cdo-web/datasets</a> I found 23GB of weather data from 91,000 stations all over the world, which looks like exactly what I need.<br />
<br />
<h3>
Import the dataset</h3>
As is almost always the case, the data is not in a structure mongoDB can import directly.<br />
A typical approach would be to use command-line tools (sed, grep, awk, ...) to massage the text files into a format the import routine accepts. This usually involves very long, incomprehensible commands or scripts (and lots of googling ;-). (<a href="http://ronaldbradford.com/blog/mongodb-experience-bulk-loading-data-2010-06-09/" target="_blank">example</a>)<br />
Such scripts and commands are typically lost by the time we need them again, and have to be reconstructed from scratch, you know the feeling ...<br />
It would be interesting if someone imported the same data with mongoimport (and sed, grep, ...) to compare it with the approach presented here.<br />
The text file layout is usually documented in external files (or hard-coded in the file manipulation commands), which can make the whole import process difficult and error prone.<br />
MongoDB is so different from typical DB systems that it also requires, and allows, a different way of thinking and working with data. There's nothing wrong with classical dba habits, but I think mongoDB offers some possibilities that can help a lot. In a relational database we rarely use tables with only one record: the effort to create the table is quite high. In mongo it's very convenient to store additional info (source, import commands, metadata, ...) together with the data itself. Frequently used queries, map-reduce commands, javascript functions, source url's, application settings, metadata and even documentation can very easily be stored in the database itself.<br />
In our case, the data comes as many small fixed-column text files, all with an identical structure, so we also keep the information about these files in the database itself.<br />
<a name='more'></a><br />
<br />
<h3>
Importing the station info</h3>
First we import the data about the weather stations. <br />
<br />
<a href="ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt">ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt</a><br />
This information is in a fixed column format:<br />
<br />
<pre><span style="font-size: x-small;">IV. FORMAT OF "ghcnd-stations.txt"
------------------------------
Variable Columns Type
------------------------------
ID 1-11 Character
LATITUDE 13-20 Real
LONGITUDE 22-30 Real
ELEVATION 32-37 Real
STATE 39-40 Character
NAME 42-71 Character
GSNFLAG 73-75 Character
HCNFLAG 77-79 Character
WMOID 81-85 Character
------------------------------ </span></pre>
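The columns in this spec are 1-based and the stop column is inclusive, so a field maps to JavaScript's substring(start-1, stop). A minimal sketch (plain JavaScript; the station values are made up) of laying a record out into those columns and slicing it back out:

```javascript
// Column definition taken from the ghcnd-stations.txt spec above
// (a subset; State and the flag columns work the same way).
var def = [
  { name: "ID",        type: "String", start: 1,  stop: 11 },
  { name: "Latitude",  type: "Float",  start: 13, stop: 20 },
  { name: "Longitude", type: "Float",  start: 22, stop: 30 },
  { name: "Elevation", type: "Float",  start: 32, stop: 37 },
  { name: "Name",      type: "String", start: 42, stop: 71 }
];

// Build a sample 71-character line by writing each (made-up) value
// into its columns, then extract the fields the same way split() will.
var values = { ID: "BE000006447", Latitude: "50.8", Longitude: "4.35",
               Elevation: "58.0", Name: "UCCLE" };
var chars = new Array(71).fill(" ");
def.forEach(function (col) {
  var s = values[col.name];
  for (var i = 0; i < s.length; i++) chars[col.start - 1 + i] = s[i];
});
var line = chars.join("");

// Columns are 1-based and stop is inclusive: substring(start-1, stop).
function field(line, col) {
  var s = line.substring(col.start - 1, col.stop);
  return col.type === "Float" ? parseFloat(s) : s.trim();
}
```

The same substring(start-1, stop) slicing is what the split() function below does for every column in the definition.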
<br />
Since the data file isn't very big (7.5MB), we can read it with the cat() function instead of mongoimport, and split up the lines according to the column definition. As mentioned before, we store this column definition in the database itself.<br />
<span style="font-family: "Courier New",Courier,monospace;"><br /></span>
<br />
<span style="font-family: "Courier New",Courier,monospace; font-size: x-small;">var settings = new Object();</span><br />
<span style="font-family: "Courier New",Courier,monospace; font-size: x-small;">settings.stationDef = [<br />{name:"ID", type:"String", start:1,stop:11},<br />{name:"Latitude",type:"Float",start:13,stop:20 },<br />{name:"Longitude",type:"Float",start:22,stop:30 },<br />{name:"Elevation",type:"Float",start:32,stop:37},<br />{name:"State",type:"String",start:39,stop:40},<br />{name:"Name",type:"String",start:42,stop:71},<br />{name:"GSNFlag",type:"String",start:73,stop:75},<br />{name:"HCNFlag",type:"String",start:77,stop:79},<br />{name:"WMOID",type:"String",start:81,stop:85},<br />{name:"Country", type:"String", start:1,stop:2}<br />]<br />db.weather.settings.save(settings); </span><br />
<span style="font-family: "Courier New",Courier,monospace;"><br /></span>
If we want to use the definitions later, we simply call:<br />
<br />
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;">var settings = db.weather.settings.findOne();</span></span> <br />
<br />
With 2 simple functions, we can import the data in the format defined by 'stationDef'. The first function splits a text line into the columns defined by 'definition'.<br />
<span style="font-family: "Courier New",Courier,monospace;"><br /></span>
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;">function split(line, definition){<br /> var result = new Object();<br /> for (key in definition){<br /> var column = definition[key];<br /> if (column.type=="fixed"){<br /> result[column.name] = column.value;<br /> } else {<br /> var s = line.substring(column.start-1,column.stop);<br /> if (column.type=="Float"){<br /> result[column.name] = parseFloat(s);<br /> } else if (column.type=="Integer"){<br /> result[column.name] = parseInt(s);<br /> } else {<br /> result[column.name] = s.trim();<br /> }<br /> }<br /> }<br /> return result;<br />}</span></span><br />
<br />
Another function reads a file (with cat()), and uses the split() function to parse every line.<br />
<br />
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;">function readFile(name,collection,definition){<br /> var data = cat(name);<br /> var all = new Array();<br /> var lines = data.split("\n");<br /> for(var i=1;i&lt;lines.length;i++){<br /> var result = split(lines[i],definition);<br /> all.push(result);<br /> }<br /> collection.insert(all);<br /> return all.length;<br />}</span></span><br />
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;">readFile("projects/mongo/weather/ghcnd-stations.txt",db.stations,settings.stationDef);</span></span><br />
<br />
Inserting each station document inside the for loop would be easier, but collecting the documents in an array and using that to do a <a href="http://docs.mongodb.org/manual/core/bulk-inserts/" target="_blank">bulk insert</a> improves the speed a lot. On my PC the 90,000+ records were read, parsed and inserted in about 5 seconds. The 7MB txt file results in a 21MB collection.<br />
Actually, it might be best to also store these 2 functions in the settings object (and thus in db.weather.settings), but that makes it a bit more difficult to call the one function from the other.<br />
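If a file were too large to hold every document in memory at once, a middle ground is inserting in fixed-size batches. A sketch of the chunking logic (plain JavaScript; insertInBatches and its insert callback are hypothetical, not part of the original import code):

```javascript
// Split docs into batches of batchSize and hand each batch to insertFn
// (e.g. function(batch){ collection.insert(batch); } in the mongo shell).
function insertInBatches(docs, batchSize, insertFn) {
  var batches = 0;
  for (var i = 0; i < docs.length; i += batchSize) {
    insertFn(docs.slice(i, i + batchSize));  // one bulk insert per batch
    batches++;
  }
  return batches;
}
```

This keeps most of the bulk-insert speedup while bounding memory use by batchSize.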
<h3>
Importing the weather data</h3>
The weather data itself is located in 1 file per station and has the following format (<a href="ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt" target="_blank">full details</a>)<br />
<br />
<pre><span style="font-size: x-small;">III. FORMAT OF DATA FILES (".dly" FILES)
Each ".dly" file contains data for one station. The name of the file
corresponds to a station's identification code. For example, "USC00026481.dly"
contains the data for the station with the identification code USC00026481).
Each record in a file contains one month of daily data. The variables on each
line include the following:
------------------------------
Variable Columns Type
------------------------------
ID 1-11 Character
YEAR 12-15 Integer
MONTH 16-17 Integer
ELEMENT 18-21 Character
VALUE1 22-26 Integer
MFLAG1 27-27 Character
QFLAG1 28-28 Character
SFLAG1 29-29 Character
VALUE2 30-34 Integer
MFLAG2 35-35 Character
QFLAG2 36-36 Character
SFLAG2 37-37 Character
. . .
. . .
. . .
VALUE31 262-266 Integer
MFLAG31 267-267 Character
QFLAG31 268-268 Character
SFLAG31 269-269 Character
------------------------------</span>
</pre>
<br />
We could import the weather data itself with the same functions (and a new column definition), but that would give us one document per station per month, which isn't really the structure we want in our database. We could import the data one month per document and then split it into one document per day using map-reduce, or by looping over the documents and rewriting them with javascript.<br />
Instead, we simply adjust the 2 functions a little to accept an array of column definitions (instead of just one). Each definition creates a new document, so every line in the text files (one per month) generates 31 new documents (one per day).<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;">function split(line, definitions){<br /> var allResults = new Array();<br /> for (var i in definitions){<br /> var definition = definitions[i];<br /> var result = new Object();<br /> for (var key in definition){<br /> var column = definition[key];<br /> if (column.type=="fixed"){<br /> result[column.name] = column.value;<br /> } else {<br /> var s = line.substring(column.start-1,column.stop);<br /> if (column.type=="Float"){<br /> result[column.name] = parseFloat(s);<br /> } else if (column.type=="Integer"){<br /> result[column.name] = parseInt(s);<br /> } else {<br /> result[column.name] = s.trim();<br /> }<br /> }<br /> }<br /> allResults.push(result);<br /> }<br /> return allResults;<br />}<br /><br />function readFile(name,collection,definition){<br /> var data = cat(name);<br /> var all = new Array();<br /> var lines = data.split("\n");<br /> for(var i=1;i&lt;lines.length;i++){<br /> var result = split(lines[i],definition);<br /> all=all.concat(result);<br /> }<br /> collection.insert(all);<br /> return all.length;<br />}</span></span><br />
<br />
The array that describes the column definitions for the weather data itself, and creates a new document for every day, is something like<br />
<span style="font-size: x-small;"><br /></span>
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;">settings.weatherDefs = [ <br /> [{"name" :"day","type" : "fixed","value" : 1}, <br /> {"name" :"ID", "type" : "String", "start" : 1,"stop" : 11}, <br /> {"name" :"Country","type" : "String","start" : 1,"stop" : 2}, <br /> {"name" :"year","type" : "Integer","start" : 12,"stop" : 15}, <br /> {"name" :"month","type" : "Integer","start" : 16,"stop" : 17}, <br /> {"name" :"element","type" : "String","start" : 18,"stop" : 21}, <br /> {"name" :"Value","type" : "Integer","start" : 22,"stop" : 26}, <br /> {"name" :"M","type" : "String","start" : 27,"stop" : 27}, <br /> {"name" :"Q","type" : "String","start" : 28,"stop" : 28}, <br /> {"name" :"S","type" : "String","start" : 29,"stop" : 29}<br />],</span></span><br />
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;"><span style="font-family: "Courier New",Courier,monospace;">[{"name" :"day","type" : "fixed","value" : 2}, <br /> {"name" :"ID", "type" : "String", "start" : 1,"stop" : 11}, <br /> {"name" :"Country","type" : "String","start" : 1,"stop" : 2}, <br /> {"name" :"year","type" : "Integer","start" : 12,"stop" : 15}, <br /> {"name" :"month","type" : "Integer","start" : 16,"stop" : 17}, <br /> {"name" :"element","type" : "String","start" : 18,"stop" : 21}, <br /> {"name" :"Value","type" : "Integer","start" : 30,"stop" : 34}, <br /> {"name" :"M","type" : "String","start" : 35,"stop" : 35}, <br /> {"name" :"Q","type" : "String","start" : 36,"stop" : 36}, <br /> {"name" :"S","type" : "String","start" : 37,"stop" : 37}<br />],</span> </span></span><br />
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;">...... </span></span><br />
<span style="font-size: x-small;"><br /><span style="font-family: "Courier New",Courier,monospace;"><span style="font-family: "Courier New",Courier,monospace;">[ {"name" :"day","type" : "fixed","value" : 31}, <br /> {"name" :"ID", "type" : "String", "start" : 1,"stop" : 11}, <br /> {"name" :"Country","type" : "String","start" : 1,"stop" : 2}, <br /> {"name" :"year","type" : "Integer","start" : 12,"stop" : 15}, <br /> {"name" :"month","type" : "Integer","start" : 16,"stop" : 17}, <br /> {"name" :"element","type" : "String","start" : 18,"stop" : 21}, <br /> {"name" :"Value","type" : "Integer","start":262,"stop" :266}, <br /> {"name" :"M","type" : "String","start" :267,"stop" :267}, <br /> {"name" :"Q","type" : "String","start" :268,"stop" :268}, <br /> {"name" :"S","type" : "String","start" :269,"stop" :269}<br />]</span> </span></span><br />
<span style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;">] </span></span><br />
<br />
The whole definition is available at <a href="http://pastebin.com/996ec5dg">http://pastebin.com/996ec5dg</a> <br />
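Rather than writing out all 31 per-day blocks by hand, the definition array can also be generated: only the VALUE/MFLAG/QFLAG/SFLAG columns change from day to day, advancing 8 characters per day (VALUE1 starts at column 22, VALUE31 at 262 = 22 + 30*8, per the format spec above). A sketch (plain JavaScript; buildWeatherDefs is a made-up helper name):

```javascript
// Generate the 31 per-day column definitions instead of writing them out.
// The four per-day columns shift 8 characters for each following day.
function buildWeatherDefs() {
  var defs = [];
  for (var day = 1; day <= 31; day++) {
    var off = (day - 1) * 8;  // daily stride from the .dly format spec
    defs.push([
      { name: "day",     type: "fixed",   value: day },
      { name: "ID",      type: "String",  start: 1,  stop: 11 },
      { name: "Country", type: "String",  start: 1,  stop: 2 },
      { name: "year",    type: "Integer", start: 12, stop: 15 },
      { name: "month",   type: "Integer", start: 16, stop: 17 },
      { name: "element", type: "String",  start: 18, stop: 21 },
      { name: "Value",   type: "Integer", start: 22 + off, stop: 26 + off },
      { name: "M",       type: "String",  start: 27 + off, stop: 27 + off },
      { name: "Q",       type: "String",  start: 28 + off, stop: 28 + off },
      { name: "S",       type: "String",  start: 29 + off, stop: 29 + off }
    ]);
  }
  return defs;
}
```

Assigning the result to settings.weatherDefs gives the same structure as the hand-written array above.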
Now we can import a specific file, e.g. all 187,000 observations for Belgium (about 10 seconds):<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;">readFile("projects/mongo/weather/ghcnd_all/BE000006447.dly", db.weather,settings.weatherDefs)</span></span><br />
<br />
In the UK there are 36 weather stations. To import all the data, we can use<br />
<span style="font-size: x-small;"><br /></span>
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;">var tel=0;<br />db.stations.find({Country:"UK"}).forEach(function(entry){<br /> tel++;<br /> var file = "projects/mongo/weather/ghcnd_all/"+entry.ID+".dly";<br /> var count = readFile(file,db.weather,</span></span><span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;"><span style="font-family: "Courier New",Courier,monospace;">settings.weatherDefs</span>);<br /> print(new Date()+" "+tel+" "+entry.ID+" "+count);<br />})</span></span><br />
<br />
If you have a fast computer or lots of time, you might import the data for all 49,000 stations in the US.<br />
<br />
Since, in the text files, every month has 31 days, we need to clean up the missing values (recorded as -9999) with<br />
<br />
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;">db.weather.remove({Value:-9999})</span></span><br />
<br />
The documents in the database now look like<br />
<span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;"> db.weather.find()<br />{ "_id" : ObjectId("5333f13b018f79b84edfcbff"), "day" : 2, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT08", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }<br />{ "_id" : ObjectId("5333f13b018f79b84edfcc00"), "day" : 3, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT08", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }<br />{ "_id" : ObjectId("5333f13b018f79b84edfcc01"), "day" : 4, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT08", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }<br />{ "_id" : ObjectId("5333f13b018f79b84edfcbf9"), "day" : 27, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT01", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }<br />{ "_id" : ObjectId("5333f13b018f79b84edfcbfa"), "day" : 28, "ID" : "UKW00035047", "Country" : "UK", "year" : 1953, "month" : 2, "element" : "WT01", "Value" : 1, "M" : "", "Q" : "", "S" : "X" }</span></span>...<br />
<br />
<br />
<h3>
Statistics</h3>
I'll try some more advanced statistics in a later blog post, but we can already show some basic information.<br />
e.g. the average minimum temperature for February (from 1995 on) over all UK stations:<br />
<span style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;"><br /></span></span>
<span style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;">db.weather.aggregate(<br /> {$match:{"Country":"UK","year":{$gt:1994},month:2, element:"TMIN"}},<br /> {$group:{_id: {year:"$year",month:"$month"},<br /> Temp:{$avg:"$Value"}<br /> }},<br /> {$sort:{_id:1}} <br />)<br />{ "_id" : { "year" : 1995, "month" : 2 }, "Temp" : 46.26623376623377 }<br />{ "_id" : { "year" : 1996, "month" : 2 }, "Temp" : 23.849529780564264 }<br />{ "_id" : { "year" : 1997, "month" : 2 }, "Temp" : 48.561688311688314 }<br />{ "_id" : { "year" : 1998, "month" : 2 }, "Temp" : 64.81063122923588 }<br />{ "_id" : { "year" : 1999, "month" : 2 }, "Temp" : 43.16233766233766 }<br />{ "_id" : { "year" : 2000, "month" : 2 }, "Temp" : 48.623824451410655 }<br />{ "_id" : { "year" : 2001, "month" : 2 }, "Temp" : 31.214285714285715 }<br />{ "_id" : { "year" : 2002, "month" : 2 }, "Temp" : 49.424710424710426 }<br />{ "_id" : { "year" : 2003, "month" : 2 }, "Temp" : 30.35907335907336 }<br />{ "_id" : { "year" : 2004, "month" : 2 }, "Temp" : 32.4365671641791 }<br />{ "_id" : { "year" : 2005, "month" : 2 }, "Temp" : 19.71563981042654 }<br />{ "_id" : { "year" : 2006, "month" : 2 }, "Temp" : 35.280701754385966 }<br />{ "_id" : { "year" : 2007, "month" : 2 }, "Temp" : 50.10666666666667 }<br />{ "_id" : { "year" : 2008, "month" : 2 }, "Temp" : 43 }<br />{ "_id" : { "year" : 2009, "month" : 2 }, "Temp" : 35.508849557522126 }<br />{ "_id" : { "year" : 2010, "month" : 2 }, "Temp" : 17.67699115044248 }<br />{ "_id" : { "year" : 2011, "month" : 2 }, "Temp" : 55.18274111675127 }<br />{ "_id" : { "year" : 2012, "month" : 2 }, "Temp" : 46.019417475728154 }<br />{ "_id" : { "year" : 2013, "month" : 2 }, "Temp" : 53.49473684210526 }<br />{ "_id" : { "year" : 2014, "month" : 2 }, "Temp" : 66.68627450980392 }</span></span><br />
<span style="font-size: x-small;">(Temperatures are recorded as integers, in units of 0.1°C)</span><br />
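As a sanity check on the pipeline, here is what the $group/$avg stage computes, sketched in plain JavaScript over a few hypothetical TMIN documents (values in tenths of °C, so divide by 10 for degrees):

```javascript
// Hypothetical TMIN documents for one station-month (Value is tenths of °C).
var docs = [
  { year: 2014, month: 2, element: "TMIN", Value: 40 },
  { year: 2014, month: 2, element: "TMIN", Value: 60 },
  { year: 2014, month: 2, element: "TMIN", Value: 80 }
];

// Equivalent of {$group:{_id:{year:"$year",month:"$month"}, Temp:{$avg:"$Value"}}}:
// accumulate sum and count per group key, then divide.
var groups = {};
docs.forEach(function (d) {
  var key = d.year + "-" + d.month;
  var g = groups[key] || (groups[key] = { sum: 0, n: 0 });
  g.sum += d.Value;
  g.n++;
});
var avgTenths = groups["2014-2"].sum / groups["2014-2"].n;  // average in 0.1°C units
var avgDegrees = avgTenths / 10;                            // converted to °C
```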
<br />
Of course, an index on Country, element and month speeds this up a lot, but slows down the import if you'd like to add new data.<br />
<br />
<br />
Facebook (2014-03-11)

<h3>
Fun with mongodb and the Facebook Graph API</h3>
In a <a href="http://mongoinsideout.blogspot.be/2014/02/wget-or-xmlhttprequest.html" target="_blank">previous post</a> we showed how to make http calls from within the mongodb shell. Another nice example of what we can do with this wget function is to make calls to the <a href="https://developers.facebook.com/docs/reference/api/" target="_blank">facebook graph API</a>. Since the fb graph API uses json, we can store the results immediately in the database.<br />
<br />
We put all the relevant functions and parameters in a single object (and even save that in the db). I prefer this to using .mongorc or the system.js collection; I'll write another blog post about this technique later.<br />
<br />
<a name='more'></a><br />
<br />
We create a facebook object and copy the wget function into it.<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">var facebook = {<br /> "wget" : function wget(url){<br /> var tmp = "/tmp";<br /> var id = new ObjectId();<br /> var outFile= tmp+"/wget"+id;<br /> var p = run("wget","-o log","--output-document="+outFile,url);<br /> if (p==0){<br /> var result = cat(outFile);<br /> removeFile(outFile);<br /> return result;<br /> } else {<br /> return "";<br /> }<br /> }<br />}</span></blockquote>
<br />
<br />
(Almost) all calls to the API need an access token. You can generate a token (with the access rights you need) at <a href="https://developers.facebook.com/tools/explorer">https://developers.facebook.com/tools/explorer</a>, which is valid for 2 hours.<br />
<br />
<blockquote class="tr_bq">
<code>facebook.token = "CAACEdEose0cBAFzPP7M........udv2TjMTQQKEZD";</code></blockquote>
Now we create the function to make the graph call itself.<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">facebook.graph = function graph(user,item){</span><br />
<span style="font-family: "Courier New",Courier,monospace;"><span style="font-family: "Courier New",Courier,monospace;"> if (!user){<br /> user = "me";<br /> }</span> var url = "https://graph.facebook.com/"+user;<br /> if (item){<br /> url = url+"/"+item;<br /> }<br /> url=url+"?access_token="+this.token;<br /> var result = this.wget(url);<br /> return JSON.parse(result);<br /> }</span></blockquote>
<br />
<br />
This graph function makes a call to the graph API (using the wget function) with 2 parameters (the second one optional).<br />
<ul>
<li>user (which defaults to 'me')</li>
<li>item: specifies what object from the graph you want (friends, events, photos, ...)</li>
</ul>
Let's try a few things: <br />
<br />
<code>facebook.graph('me','friends');</code> shows a list of all my friends.<br />
<code>facebook.graph('mongodb');</code> shows info about the mongodb page.<br />
<code>facebook.graph('mongodb','photos');</code><br />
<code>facebook.graph('mongodb','likes');</code><br />
<br />
When the API returns a single object, the json reply simply describes that object. When it returns a list of objects, the list is an array named 'data' (usually followed by pagination url's).<br />
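Those pagination url's make it possible to walk through a long list page by page. A sketch of draining such a paginated reply (plain JavaScript; fetchAll is a hypothetical helper, fetchFn stands in for something like JSON.parse(facebook.wget(url)), and the { data, paging.next } shape is assumed from the Graph API replies):

```javascript
// Follow paging.next until it is absent, collecting every page's data array.
function fetchAll(firstUrl, fetchFn) {
  var all = [];
  var url = firstUrl;
  while (url) {
    var reply = fetchFn(url);                 // one page: { data: [...], paging: {...} }
    all = all.concat(reply.data);
    url = reply.paging && reply.paging.next;  // undefined on the last page
  }
  return all;
}
```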
The nice thing is that the function returns a JSON object, so it's very easy to save the results in the database.<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;"><code>var friends = facebook.graph('me','friends');<br />friends = friends.data;<br />friends.forEach(function(friend){db.friends.save(friend)});</code> </span></blockquote>
Or simply print a list of your friends<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;"><code>friends.forEach(function(friend){print(friend.name)});</code></span></blockquote>
Now we can loop over all these friends and look up basic info for each of them with this code:<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">db.friends.find().forEach(<br />function(friend){<br />friend.info = facebook.graph(friend.id); db.friends.save(friend);})</span></blockquote>
<br />
There's lots of extra functionality we could add (extra parameters like 'limit', different access_tokens for different users, POST and DELETE calls instead of only GET, ...).<br />
<br />
Have fun!! <br />
<br />
<br />
wget or XMLHttpRequest (2014-02-27)

<div class="tr_bq">
The javascript possibilities inside mongodb allow us to do interesting things from within the database itself. One thing that's missing is the XMLHttpRequest object, which would allow us to call json webservices from within the mongo shell itself. Apparently, <a href="https://jira.mongodb.org/browse/SERVER-5989" target="_blank">I'm not the only one who thinks this</a>. The issue is marked as 'lower priority' and 'features we're not sure of', so I doubt we will see it added soon.</div>
<br />
<h3>
run("wget")</h3>
To show the power of making an http request from within the database, we can use a workaround. There is an undocumented, 'for internal use only', <a href="http://docs.mongodb.org/manual/reference/method/run/" target="_blank">run()</a> method, which can start an arbitrary process such as wget. (<a href="http://blog.scrt.ch/2013/03/24/mongodb-0-day-ssji-to-rce/" target="_blank">Some others found the run() method interesting as well</a>).<br />
wget is a command-line tool on *nix machines for making http calls from the shell or from scripts. These code examples run fine on linux, but might need some changes to run on windows.<br />
The run() method returns the process' exit code, so we can't capture the output directly. Instead, we let wget dump the output to a temporary file and read the contents of that file with <a href="http://docs.mongodb.org/manual/reference/method/cat/" target="_blank">cat().</a><br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">function wget(url){</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> var tmp = "/tmp";</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> var id = new ObjectId();</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> var outFile= tmp+"/wget"+id;</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> var p = run("wget","-o log","--output-document="+outFile,url);</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> if (p==0){</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> var result = cat(outFile);</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> removeFile(outFile);</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> return result;</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> } else {</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> return "";</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> }</span><br />
<span style="font-family: "Courier New",Courier,monospace;">}</span></blockquote>
It would be cleaner with a real XMLHttpRequest, but it works ok, and opens up many possibilities.<br />
<ul>
<li>webcrawler</li>
<li>read / use webservices</li>
<li>read JSON formatted Rest API's and save the results in the database</li>
<li>...</li>
</ul>
<h3>
<a name='more'></a>
WebCrawler </h3>
Now that we can make an http request, it's not that difficult to extract all hyperlinks. If we follow these hyperlinks in turn, we essentially have a web crawler.<br />
A real crawler needs to deal with a lot more issues (relative hyperlinks, robots.txt, javascript links, ...), but the basic functionality should work.<br />
We can extract all hyperlinks from an html page like this:<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">var html =wget("http://www.google.com");</span><br />
<span style="font-family: "Courier New",Courier,monospace;">var regex = /href=["']?([^"' >]+)/gi;</span><br />
<span style="font-family: "Courier New",Courier,monospace;">var links = new Array();</span><br />
<span style="font-family: "Courier New",Courier,monospace;">var f=regex.exec(html);</span><br />
<span style="font-family: "Courier New",Courier,monospace;">while(f!=null){</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> links.push(f[1]);</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> f=regex.exec(html);</span><br />
<span style="font-family: "Courier New",Courier,monospace;">}</span><br />
<span style="font-family: "Courier New",Courier,monospace;"></span></blockquote>
<br />
We could save the results in a collection and use that collection again to perform a new search.<br />
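That loop can be sketched as a small breadth-first crawl (plain JavaScript; fetchFn and extractLinks are hypothetical stand-ins for wget() and the regex loop above, and a real version would persist the frontier and visited set in collections instead of in memory):

```javascript
// Breadth-first crawl: fetch a url, extract its links, queue the unseen ones.
function crawl(startUrl, fetchFn, extractLinks, maxPages) {
  var queue = [startUrl];
  var visited = {};
  while (queue.length > 0 && Object.keys(visited).length < maxPages) {
    var url = queue.shift();
    if (visited[url]) continue;            // already fetched this page
    visited[url] = true;
    extractLinks(fetchFn(url)).forEach(function (link) {
      if (!visited[link]) queue.push(link); // could also save to a collection here
    });
  }
  return Object.keys(visited);
}
```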
<br />
<h3>
Webservice</h3>
In this example we use our wget function to download <a href="https://www.google.com/finance" target="_blank">google finance</a> data and store it in mongodb. The normal approach would be to write a script or application that downloads the data and either converts it to a format the database can import, or parses it and inserts it.<br />
Here we show that we can implement this functionality inside the database itself. We also store the javascript code and the application parameters inside the same database.<br />
<br />
Google finance lets us download historical stock prices in csv format (e.g. https://www.google.com/finance/historical?q=NYSE%3ATWTR).<br />
We use the collection quotes.settings as a place to keep both the application parameters and the code.<br />
<span style="font-family: "Courier New",Courier,monospace;"> </span><br />
There's only 1 document in quotes.settings, so we assign "unique" as its _id value. This doesn't really do anything, but it makes clear that this collection is meant as a singleton, containing only 1 document. I often use the .settings name to keep everything together; it makes it obvious that the data inside this collection refers to the 'quotes' collection.<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">db.quotes.settings.save({</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> _id:"unique",</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> root : "http://www.google.com/finance/historical?q=$0&start=0&startdate=$1&output=csv",</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> tmpDir : "/tmp",<br /> startdate:"01-jan-2014"<br />})</span></blockquote>
In the mongodb command line we use the settings by loading them into a javascript object.<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">var quotes = db.quotes.settings.findOne();</span> </blockquote>
Not only do we store the app parameters here; the code itself can be put in the same object. First we add the wget function we created before and save it in the settings collection:<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">quotes.wget = wget;</span><br />
<span style="font-family: "Courier New",Courier,monospace;">db.quotes.settings.save(quotes);</span></blockquote>
Or if you didn't make the original wget function:<br />
<blockquote class="tr_bq">
<br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;">quotes.wget = function(url){</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> var tmp = this.tmpDir;</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> var id = new ObjectId();</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> var outFile= tmp+"/wget"+id;</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> var p = run("wget","-o log","--output-document="+outFile,url);</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> if (p==0){</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> var result = cat(outFile);</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> removeFile(outFile);</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> return result;</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> } else {</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> return "";</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;"> }</span></span><br />
<span style="background-color: white;"><span style="font-family: "Courier New",Courier,monospace;">}</span></span><span style="font-family: "Courier New",Courier,monospace;"><br />db.quotes.settings.save(quotes);</span> </blockquote>
Notice that we got rid of the hardcoded "/tmp/" and use an application parameter instead. <br />
We will add a few more functions to read the financial data, parse it, and store it in the quotes collection.<br />
The ticker codes for the different companies are stored in the quotes.sources collection. In our example we want to download data for Google, Twitter and Facebook.<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">db.quotes.sources.find()<br />{<br /> "_id" : ObjectId("530e4a12bc808878d61d1fb1"),<br /> "name" : "Google",<br /> "param" : "NASDAQ%3AGOOG"<br />}<br />{<br /> "_id" : ObjectId("530e4a7ebc808878d61d1fb2"),<br /> "name" : "Twitter",<br /> "param" : "NYSE%3ATWTR"<br />}<br />{<br /> "_id" : ObjectId("530e4aadbc808878d61d1fb3"),<br /> "name" : "Facebook",<br /> "param" : "NASDAQ%3AFB"<br />}</span></blockquote>
<br />
Now we can add some more functions to perform the download and parse the data.<br />
<blockquote>
<span style="font-family: "Courier New",Courier,monospace;">quotes.findAll = function(){<br /> var ctx=this;<br /> var sources = db.quotes.sources.find();<br /> sources.forEach(function(entry){ctx.read(entry,ctx)});<br />}</span> </blockquote>
<blockquote>
<span style="font-family: "Courier New",Courier,monospace;">quotes.read = function (entry,ctx){<br /> var last = entry.last;<br /> if (last===undefined) last="01-Jan-2014";<br /> var url = ctx.root.replace("$0",entry.param);<br /> url = url.replace("$1",last);<br /> var quote = ctx.wget(url);<br /> var lines = quote.split("\n");<br /> for(var i=1;i<lines.length;i++){<br /> var day = ctx.parseLine(lines[i]);<br /> if (day.day!==undefined){<br /> var result = db.quotes.findOne({"name":entry.name,"day":day.day});<br /> if (result==null){<br /> result=new Object();<br /> result.name=entry.name;<br /> }<br /> for (var key in day){<br /> result[key] = day[key];<br /> }<br /> db.quotes.save(result);<br /> }<br /> }<br /> entry.lastRun = new Date();<br /> entry.lastCount=lines.length;<br /> db.quotes.sources.save(entry);<br />}</span></blockquote>
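The read function builds the download URL by substituting $0 (the company code) and $1 (the start date) into the template stored in the root setting. The substitution itself is plain string handling, so it can be tried in any JavaScript engine. A minimal sketch, assuming a hypothetical template URL (the real one lives in the settings document):

```javascript
// Hypothetical URL template with $0 / $1 placeholders; the actual template
// stored in quotes.root is an assumption here.
var root = "http://example.com/historical?q=$0&startdate=$1&output=csv";

// Same substitution the read function performs:
// $0 -> the company code from quotes.sources, $1 -> the start date.
function buildUrl(root, param, last) {
  return root.replace("$0", param).replace("$1", last);
}

var url = buildUrl(root, "NASDAQ%3AGOOG", "01-Jan-2014");
```

Because replace with a string argument only replaces the first occurrence, this works as long as each placeholder appears once in the template.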
<blockquote>
<span style="font-family: "Courier New",Courier,monospace;">quotes.parseLine = function (line){<br /> var values = line.split(",");<br /> var day = new Object();<br /> if (values && values.length>5){<br /> day.day = new Date(values[0]);<br /> day.open=parseFloat(values[1]);<br /> day.high=parseFloat(values[2]);<br /> day.low=parseFloat(values[3]);<br /> day.close=parseFloat(values[4]);<br /> day.volume=parseInt(values[5],10);<br /> }<br /> return day;<br />}</span></blockquote>
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">db.quotes.settings.save(quotes);</span> </blockquote>
The findAll function loops over the sources collection and calls the read function for every entry. The read function downloads the CSV data and uses parseLine to parse the received data and store it in the quotes collection.<br />
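The parsing step is easy to exercise on its own, outside the mongo shell. A standalone sketch of the same logic, fed with a made-up sample line in the date,open,high,low,close,volume format the download returns:

```javascript
// Standalone version of the parseLine logic above, runnable in any JS engine.
function parseLine(line) {
  var values = line.split(",");
  var day = {};
  if (values && values.length > 5) {
    day.day = new Date(values[0]);          // trading day
    day.open = parseFloat(values[1]);       // opening price
    day.high = parseFloat(values[2]);       // intraday high
    day.low = parseFloat(values[3]);        // intraday low
    day.close = parseFloat(values[4]);      // closing price
    day.volume = parseInt(values[5], 10);   // shares traded
  }
  return day;                               // empty object for malformed lines
}

// Sample CSV line (values made up, same shape as the downloaded data)
var day = parseLine("2014-01-02,1115.46,1117.75,1108.26,1113.12,1822719");
```

A malformed or empty line yields an object without a day property, which is exactly the condition the read function checks before saving.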
<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;"> db.quotes.find()</span><br />
<span style="font-family: "Courier New",Courier,monospace;">{<br /> "_id" : ObjectId("533154a732e5bdd7813212ae"),<br /> "name" : "Google",<br /> "day" : ISODate("2014-01-02T00:00:00Z"),<br /> "open" : 1115.46,<br /> "high" : 1117.75,<br /> "low" : 1108.26,<br /> "close" : 1113.12,<br /> "volume" : 1822719<br />}<br />{<br /> "_id" : ObjectId("533154a732e5bdd7813212b0"),<br /> "name" : "Twitter",<br /> "day" : ISODate("2014-01-02T00:00:00Z"),<br /> "open" : 65,<br /> "high" : 67.5,<br /> "low" : 64.4,<br /> "close" : 67.5,<br /> "volume" : 29286655<br />}<br />{<br /> "_id" : ObjectId("533154a832e5bdd7813212b2"),<br /> "name" : "Facebook",<br /> "day" : ISODate("2014-01-02T00:00:00Z"),<br /> "open" : 54.83,<br /> "high" : 55.22,<br /> "low" : 54.19,<br /> "close" : 54.71,<br /> "volume" : 43257622<br />}<br />{<br /> "_id" : ObjectId("5331575c32e5bdd7813212ba"),<br /> "name" : "Google",<br /> "day" : ISODate("2014-03-24T00:00:00Z"),<br /> "open" : 1184.19,<br /> "high" : 1184.9,<br /> "low" : 1145.95,<br /> "close" : 1157.93,<br /> "volume" : 3039151<br />}<br />{<br /> "_id" : ObjectId("5331575c32e5bdd7813212bb"),<br /> "name" : "Google",<br /> "day" : ISODate("2014-03-21T00:00:00Z"),<br /> "open" : 1206.31,<br /> "high" : 1209.63,<br /> "low" : 1182.45,<br /> "close" : 1183.04,<br /> "volume" : 3210656<br />}<br />{<br /> "_id" : ObjectId("5331575c32e5bdd7813212bc"),<br /> "name" : "Google",<br /> "day" : ISODate("2014-03-20T00:00:00Z"),<br /> "open" : 1199.95,<br /> "high" : 1209.61,<br /> "low" : 1195.36,<br /> "close" : 1197.16,<br /> "volume" : 1684975<br />}</span><br />
<span style="font-family: "Courier New",Courier,monospace;">......</span></blockquote>
<br />
We could easily call the findAll function on a daily basis, so that we keep adding the most recent data.<br />
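One way to schedule this is a crontab entry that runs the shell non-interactively. A sketch, assuming the database name mydb and the time of day as placeholders, and relying on the pattern used throughout this post of reloading the stored functions from the settings collection:

```shell
# Hypothetical crontab line: fetch the latest quotes every day at 18:30.
# "mydb" and the schedule are assumptions; adjust to your setup.
30 18 * * * mongo mydb --eval 'var quotes = db.quotes.settings.findOne(); quotes.findAll();'
```

Because the sources collection records the last run for each company, rerunning the job only appends data that is not already stored.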
<br />
<br />