Large sample datasetTo use as sample data for doing statistical analysis, I was looking for a fairly large dataset. I wanted more than a few thousands records, but it still needs to be manageable on my own computer. A few gigabytes is ok, terrabytes is too much. At http://www.ncdc.noaa.gov/cdo-web/datasets I found 23G of weather data from 91.000 stations all over the world which looks exactly what I need.
Import the datasetAs almost always, the data is not in a simple mongoDB importable structure.
A typical approach would be to use tools and programs (sed, grep, awk, ...) to manipulate the text files to fit the format to the import routine. This typically involves very long, incomprehensible commands or scripts (and lots of googling ;-). (example)
Scripts and commands which are typically lost when we need them again, and need to be reconstructed again, you know the feeling ...
It would be nice if someone tried to import the same data with mongoimport (and sed, grep...) to see the difference with the approach presented here.
The text file definition is mostly stored in external files (or hard coded in the file manipulation commands), which can make the whole import process difficult and error prone.
In our case, the data is in a lot of small fixed column text files, all with an identical structure, so we also keep the information about these files in the database itself.