Saturday, April 26, 2014

How to keep in sync with a git repository of another person on GitHub or Bitbucket

Suppose rjha94 has a repo on bitbucket.org called dl (rjha94/dl, copy #1), and you also want a copy of this repository on your machine. The first thing to do is fork this repo using your own bitbucket.org account.

 #1. fork this repository on bitbucket.org first (fork rjha94/dl as srj0408/dl)
 #2. to get this code on your local machine, just clone your fork of the repo.

 $ git clone ssh://git@bitbucket.org/srj0408/dl.git (get the actual clone URL from the bitbucket interface)

When you forked, you made a copy on the hosting server (copy #2). When you cloned, you made a copy of your server repo on your local machine (copy #3). From a git point of view, all these copies are equally valid (there is no central or "one true" copy). So you have rjha94/dl, which you forked into srj0408/dl on the server, and then cloned onto your local machine, creating a third copy.

All these copies can evolve independently of each other: you can make changes on your local machine that no one else knows about, and in the same way the original repo (copy #1) can change. Now suppose some new changes have landed in rjha94/dl. How can you get them into the repo on your local machine (copy #1 -> copy #3) and then push them to your own server repo srj0408/dl (copy #3 -> copy #2)?

 #3. To get the changes from rjha94/dl into the repo on your local machine
 $ git checkout master
 $ git remote add rjha94/dl ssh://git@bitbucket.org/rjha94/dl.git

Check out your local master branch and then add a new remote called rjha94/dl that points to the original server repo you forked from (copy #1). Then, to merge in the new changes from the rjha94/dl repo:

 $ git fetch rjha94/dl
 $ git merge remotes/rjha94/dl/master

This pulls the changes from the rjha94/dl repo and merges them into your local copy (copy #1 -> copy #3). To get these changes into your server repo copy:

 $ git commit -m "merged changes from upstream on 25-apr-14"  (only needed if you had to resolve conflicts; a clean merge commits or fast-forwards by itself)
 $ git push origin master

Doing this pushes the changes you just merged into your local copy up to your server repo (copy #3 -> copy #2), where origin is a shortcut (alias) for your own server copy.
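
Putting it all together, here is a sketch of the whole three-copy workflow, simulated locally with bare repositories standing in for the bitbucket server. All paths and names below are made up for illustration, and the remote is named upstream (the common convention) instead of rjha94/dl:

```shell
set -e
cd /tmp && rm -rf sync-demo && mkdir sync-demo && cd sync-demo

# copy #1: the original author's repo (stand-in for rjha94/dl)
# note: "git init -b master" needs git >= 2.28
git init -q -b master original
git -C original -c user.email=a@a -c user.name=a commit -q --allow-empty -m "initial"
git clone -q --bare original upstream.git

# copy #2: your server-side fork (a fork is just a clone on the server)
git clone -q --bare upstream.git fork.git

# copy #3: your local clone of the fork; "origin" points at the fork
git clone -q fork.git work
git -C work remote add upstream ../upstream.git

# meanwhile, a new commit lands in copy #1
git -C original -c user.email=a@a -c user.name=a commit -q --allow-empty -m "upstream change"
git -C original push -q ../upstream.git master

# copy #1 -> copy #3: fetch the upstream remote and merge its master
git -C work fetch -q upstream
git -C work merge -q upstream/master

# copy #3 -> copy #2: push the merged history to your own server copy
git -C work push -q origin master
```

After this runs, both your local clone and your server fork contain the new upstream commit, while the fork itself was never touched directly by the original author.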

Sunday, April 13, 2014

Time series database survey for IoT and m2m devices

This is a survey of time series databases available for use, covering both cloud offerings and "install on your own machines" solutions. Our requirements are:


  • store high-velocity time series data (frequent datapoints arriving from one node)
  • store data from a lot of nodes
  • compute aggregates (e.g. a sum over a day's worth of data)
  • grouping functions (average, STDEV)
  • analyze the data for patterns etc.

No one is paying me to write this, so I will steer clear of jargon like "slice and dice", "cubes" and all that b.s. In plain, simple terms: we are receiving data from a lot of devices very frequently, so the first problem is simply storing a lot of data. MySQL and other RDBMSs are not optimized for storing such time series data. That is problem #1.

Another problem is that it may not be prudent to fetch all the raw datapoints for certain queries later on. Say you want to watch the trend over a month; fetching all the raw datapoints would be overkill. What you would like instead is to fetch just 30 datapoints, each an average over a day's worth of data. Creating such buckets (rollups) on demand would be an expensive operation, so we need to push data into the rollup buckets as the datapoints arrive. That is problem #2: good, solid support for whatever rollups I would like to create. For data arriving at millisecond intervals, a rollup interval can be as short as one minute!

There is actually a rollup hierarchy. Say data is arriving at 5-minute intervals and you make a rollup of an hour (an average over 12 datapoints). You would then like to make a rollup of a day (averaged over the 24 datapoints of the previous bucket), and so on.
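
As a concrete sketch of that hierarchy (illustrative only, not tied to any particular database): generate one day of 5-minute readings as "minute value" rows, average them into hourly buckets, then roll the hourly buckets up into a daily value:

```shell
# one day of 5-minute readings: t in minutes, value = the sample index
seq 0 287 | awk '{ print $1 * 5, $1 }' > /tmp/raw.tsv

# level 1 rollup: hourly averages, 12 raw datapoints per bucket
awk '{ b = int($1 / 60) * 60; s[b] += $2; n[b]++ }
     END { for (b in s) print b, s[b] / n[b] }' /tmp/raw.tsv | sort -n > /tmp/hourly.tsv

# level 2 rollup: the daily average, computed from the 24 hourly buckets
# rather than from the 288 raw datapoints
awk '{ s += $2; n++ } END { print s / n }' /tmp/hourly.tsv
```

The first hourly bucket averages samples 0-11 to 5.5, and the daily value (143.5) matches the mean of the raw data only because every hourly bucket holds the same number of points; with unequal buckets you would need to carry the counts along into the next level.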

Then we also need aggregates: we would like to sum over the datapoints in a particular interval for reporting (say, rainfall over a day).
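
A sum aggregate in the same spirit (again just a sketch with made-up numbers): total the rainfall for one day from "minute value" rows, here a constant 0.5 mm per 5-minute reading:

```shell
# one day of 5-minute rainfall readings, 0.5 mm each
seq 0 287 | awk '{ print $1 * 5, 0.5 }' > /tmp/rain.tsv

# sum aggregate over the interval [0, 1440) minutes, i.e. one day
awk '$1 >= 0 && $1 < 1440 { total += $2 } END { print total }' /tmp/rain.tsv
```

This prints 144, i.e. 288 readings of 0.5 mm; a time series database would give you this as a query primitive instead of a scan over raw rows.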

For IoT/m2m kinds of use cases, you also need to detect patterns in real time (this is apart from the threshold alerts). Finally, we would like to analyze the data and perform statistical operations on it.

RRDTOOL

Nice circular buffer
Expects data at required intervals
Language bindings available
Good fit for a small number of metrics

KairosDB

Forked from OpenTSDB
Stores metrics in HBase/Cassandra
Good storage facility, allows tagging of data
However, the data model is very limited
Aggregates are calculated at query time and can be a performance drag
No support for automatic rollups


OpenTSDB 

Looks very married to the graphs
Good for computer-metrics use cases
Does not look like a good fit for the device case
(where the data dictionary is device dependent)

Graphite

Cloud offerings
Xively, a.k.a. Pachube, a.k.a. whatever-it-was

Good PR buzz
Good ecosystem
Support is a black hole if you are in Asia
Rollups supported (in their own way)
Good provisioning and device activation support
Device-side things are unnecessarily complicated
Support for the average function only (haven't found others yet)


Librato

Digi m2m cloud

Tempo-DB


I think all cloud-based offerings would run into limitations for serious applications. Also, there is no way others can do your analytics for you. For the moment, my strategy is to prototype on Xively and then switch to InfluxDB (or maybe another on-my-machine solution). For realtime analytics, look at Amazon Kinesis or NumPy with HDF5. The debate is far from settled.


© Life of a third world developer