Thursday, February 27, 2014

wget or XMLHttpRequest

The JavaScript possibilities inside MongoDB allow us to do interesting things from within the database itself. One thing that's missing is the XMLHttpRequest object, which would allow us to call JSON web services from within the mongo shell. Apparently, I'm not the only one who thinks this. The issue is marked as 'lower priority' and 'features we're not sure of', so I doubt we will see it added soon.

run("wget")

To show the power of making an HTTP request from within the database, we can use a workaround. There is an undocumented, 'for internal use only', run() method, which can start an arbitrary process, e.g. wget. (Some others found the run() method interesting as well.)
wget is a command-line tool on *nix machines for making HTTP calls from the shell or from scripts. These code examples run fine on Linux, but might need some changes to run on Windows.
The run() method returns the process' exit code, so we can't capture the output directly. Instead, we let wget dump the output to a temporary file and read the contents of that file with cat().
function wget(url){
    var tmp = "/tmp";
    var id = new ObjectId();              // gives the temporary file a unique name
    var outFile = tmp + "/wget" + id;
    // "-o log" sends wget's own log messages to the file 'log';
    // the response body goes to outFile
    var p = run("wget", "-o", "log", "--output-document=" + outFile, url);
    if (p == 0){                          // exit code 0 means success
        var result = cat(outFile);        // read the downloaded contents
        removeFile(outFile);              // clean up the temporary file
        return result;
    } else {
        return "";
    }
}
It would be cleaner with a real XMLHttpRequest, but it works OK, and opens up many possibilities:
  • web crawler
  • read / use web services
  • read JSON-formatted REST APIs and save the results in the database (a quick sketch follows below)
  • ...
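As a quick sketch of that last idea (the endpoint URL and the 'results' collection are made up for illustration; the V8-based shell provides JSON.parse; the Webservice section below works this out in full):
var body = wget("http://api.example.com/data.json");   // hypothetical JSON endpoint
if (body != ""){
    var doc = JSON.parse(body);   // parse the JSON response
    db.results.save(doc);         // store it in an example collection
}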

WebCrawler

Now that we can make an HTTP request, it's not that difficult to extract all hyperlinks from the response. If we follow those hyperlinks in turn, we essentially have a web crawler.
A real crawler would need to deal with a lot more issues (relative hyperlinks, robots.txt, javascript links ...), but the basic functionality should work.
We can extract all hyperlinks from an html page like this:
var html = wget("http://www.google.com");
var regex = /href=["']?([^"' >]+)/gi;   // matches href attributes, quoted or not
var links = new Array();
var f = regex.exec(html);
while (f != null){
    links.push(f[1]);                   // capture group 1 holds the link target
    f = regex.exec(html);
}

We could save the results in a collection and use that collection again to perform a new search.
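A minimal sketch of such a loop, assuming a hypothetical links collection where the URL doubles as _id so every link is stored only once ($setOnInsert needs MongoDB 2.4+):
// store the extracted links; the upsert keeps already-known links untouched
links.forEach(function(url){
    db.links.update({_id: url}, {$setOnInsert: {visited: false}}, {upsert: true});
});
// fetch every page we haven't visited yet
db.links.find({visited: false}).forEach(function(doc){
    var page = wget(doc._id);
    // ... extract the hyperlinks from 'page' as above and upsert them ...
    db.links.update({_id: doc._id}, {$set: {visited: true}});
});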

Webservice

In this example we use our wget function to download Google Finance data and store it in MongoDB. The normal approach would be to write a script or application that downloads the data and either converts it to a format that can be imported into the database, or parses the data and inserts it.
Here we show that we can implement this functionality inside the database itself. We also store the JavaScript code and the application parameters inside the same database.

Google Finance lets us download historical stock prices in csv format (eg. https://www.google.com/finance/historical?q=NYSE%3ATWTR ).
We use the collection quotes.settings as a place to keep both the application parameters and the code.
  
There's only 1 document in quotes.settings, so we assign "unique" as its _id value. This doesn't really do anything, but it makes it clear this collection is meant as a singleton, containing only 1 document. I often use the .settings suffix to keep everything together: it makes it obvious that the data inside this collection refers to the 'quotes' collection.
db.quotes.settings.save({
   _id:"unique",
   root : "http://www.google.com/finance/historical?q=$0&start=0&startdate=$1&output=csv",
   tmpDir : "/tmp",
   startdate:"01-jan-2014"
})
On the mongodb command line we use the settings by loading them into a JavaScript object.
var quotes = db.quotes.settings.findOne();
Not only do we store the app parameters here, the code itself can be put in the same object as well. First we add the wget function we had before and save it in the settings collection:
quotes.wget = wget;
db.quotes.settings.save(quotes);
Or if you didn't make the original wget function:

quotes.wget = function(url){
    var tmp = this.tmpDir;                // the tmp directory now comes from the settings
    var id = new ObjectId();
    var outFile = tmp + "/wget" + id;
    var p = run("wget", "-o", "log", "--output-document=" + outFile, url);
    if (p == 0){
        var result = cat(outFile);
        removeFile(outFile);
        return result;
    } else {
        return "";
    }
}
db.quotes.settings.save(quotes);
Notice that we got rid of the hardcoded "/tmp" and use an application parameter instead.
We will add a few more functions to read the financial data, parse it and store it in the quotes collection.
The codes for the different companies are stored in the collection quotes.sources. In our example we want to download data for Google, Twitter and Facebook.
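The source documents could have been created like this (the _id values in the listing below were generated automatically by the shell):
db.quotes.sources.save({name: "Google", param: "NASDAQ%3AGOOG"})
db.quotes.sources.save({name: "Twitter", param: "NYSE%3ATWTR"})
db.quotes.sources.save({name: "Facebook", param: "NASDAQ%3AFB"})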
db.quotes.sources.find()
{
    "_id" : ObjectId("530e4a12bc808878d61d1fb1"),
    "name" : "Google",
    "param" : "NASDAQ%3AGOOG"
}
{
    "_id" : ObjectId("530e4a7ebc808878d61d1fb2"),
    "name" : "Twitter",
    "param" : "NYSE%3ATWTR"
}
{
    "_id" : ObjectId("530e4aadbc808878d61d1fb3"),
    "name" : "Facebook",
    "param" : "NASDAQ%3AFB"
}

Now we can add some more functions to perform the download and parse the data.
quotes.findAll = function(){
    var ctx = this;
    var sources = db.quotes.sources.find();
    sources.forEach(function(entry){ ctx.read(entry, ctx); });
}

quotes.read = function(entry, ctx){
    var last = entry.last;
    if (last === undefined) last = ctx.startdate;   // fall back to the startdate parameter from the settings
    var url = ctx.root.replace("$0", entry.param);  // fill in the company code
    url = url.replace("$1", last);                  // fill in the start date
    var quote = ctx.wget(url);
    var lines = quote.split("\n");
    for (var i = 1; i < lines.length; i++){         // start at 1 to skip the csv header line
        var day = ctx.parseLine(lines[i]);
        if (day.day !== undefined){
            // update the existing document for this company and day, or create a new one
            var result = db.quotes.findOne({"name": entry.name, "day": day.day});
            if (result == null){
                result = {name: entry.name};
            }
            for (var key in day){
                result[key] = day[key];
            }
            db.quotes.save(result);
        }
    }
    entry.lastRun = new Date();
    entry.lastCount = lines.length;
    db.quotes.sources.save(entry);
}
quotes.parseLine = function(line){
    var values = line.split(",");
    var day = new Object();
    if (values && values.length > 5){
        // csv columns: Date,Open,High,Low,Close,Volume
        day.day = new Date(values[0]);
        day.open = parseFloat(values[1]);
        day.high = parseFloat(values[2]);
        day.low = parseFloat(values[3]);
        day.close = parseFloat(values[4]);
        day.volume = parseInt(values[5], 10);   // radix 10: avoids octal parsing of leading zeros
    }
    return day;
}
db.quotes.settings.save(quotes); 
The findAll function loops over the sources collection and calls the read function for every entry. The read function downloads the csv data, uses parseLine to parse the received lines, and stores the results in the quotes collection.
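Running the whole import then comes down to loading the settings document again (the stored functions come back as callable code in the shell) and calling findAll:
var quotes = db.quotes.settings.findOne();
quotes.findAll();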

 db.quotes.find()
{
    "_id" : ObjectId("533154a732e5bdd7813212ae"),
    "name" : "Google",
    "day" : ISODate("2014-01-02T00:00:00Z"),
    "open" : 1115.46,
    "high" : 1117.75,
    "low" : 1108.26,
    "close" : 1113.12,
    "volume" : 1822719
}
{
    "_id" : ObjectId("533154a732e5bdd7813212b0"),
    "name" : "Twitter",
    "day" : ISODate("2014-01-02T00:00:00Z"),
    "open" : 65,
    "high" : 67.5,
    "low" : 64.4,
    "close" : 67.5,
    "volume" : 29286655
}
{
    "_id" : ObjectId("533154a832e5bdd7813212b2"),
    "name" : "Facebook",
    "day" : ISODate("2014-01-02T00:00:00Z"),
    "open" : 54.83,
    "high" : 55.22,
    "low" : 54.19,
    "close" : 54.71,
    "volume" : 43257622
}
{
    "_id" : ObjectId("5331575c32e5bdd7813212ba"),
    "name" : "Google",
    "day" : ISODate("2014-03-24T00:00:00Z"),
    "open" : 1184.19,
    "high" : 1184.9,
    "low" : 1145.95,
    "close" : 1157.93,
    "volume" : 3039151
}
{
    "_id" : ObjectId("5331575c32e5bdd7813212bb"),
    "name" : "Google",
    "day" : ISODate("2014-03-21T00:00:00Z"),
    "open" : 1206.31,
    "high" : 1209.63,
    "low" : 1182.45,
    "close" : 1183.04,
    "volume" : 3210656
}
{
    "_id" : ObjectId("5331575c32e5bdd7813212bc"),
    "name" : "Google",
    "day" : ISODate("2014-03-20T00:00:00Z"),
    "open" : 1199.95,
    "high" : 1209.61,
    "low" : 1195.36,
    "close" : 1197.16,
    "volume" : 1684975
}

...

We could easily call the findAll routine on a daily basis, so that we keep adding the most recent data.
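A possible cron entry for that (a sketch: 'stocks' is a made-up database name, and the mongo shell must be installed on the machine) would run the import every weekday at 18:00, after the markets close:
0 18 * * 1-5  mongo stocks --eval 'var quotes = db.quotes.settings.findOne(); quotes.findAll();'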


2 comments:

  1. Thank you so much! Do you know how I can avoid the shell prints when using the wget command? ('shell: started program (sh5467): /usr/local/bin/wget...')

  2. Great discovery! And good wget function implementation.
    Looking forward to them implementing an XHTTP client.
