A few decades ago, in the early 1980s, a gigabyte of data storage was nearly the size of a refrigerator. Today, even an ordinary smartphone lets us store and access hundreds of gigabytes in the palm of our hand.
Technology has evolved a lot – enabling, among other things, a massive shrinkage of storage hardware. It has also, however, allowed us to create data at an unprecedented rate.
Feeds from social media and online streaming sites, voice and video calling, sensors attached to drill bits on onshore oil rigs, sensors on production systems, satellites, traffic monitoring systems, smart health devices, online transactions – all contribute to an enormous sea of data.
This data, besides being enormous, is heterogeneous. It comes in many forms: static and dynamic, historical and streaming, persistent and transient, structured and unstructured, local and remote or distributed.
This kind of data is now called Big Data and is characterized by the five Vs: Volume, Variety, Velocity, Variability, and Veracity.
The techniques required to create, access, and process Big Data are therefore very different from those used by traditional systems. The skills involved are not easy to master and, as a result, are much in demand today.
Below are a few of the interesting Big Data projects we have been involved in:
Internet-based cloud telephony companies need to analyze the performance of streaming media. We helped them understand patterns of network bandwidth consumption along with the quality of the media being exchanged between parties. For this we made extensive use of open-source technologies such as Kafka, Storm, Spark, and Cassandra.
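To give a feel for the kind of analysis involved, here is a minimal sketch in plain Python of a fixed (tumbling) window summary of bandwidth and media quality per call. It is not the project code and does not use Kafka, Storm, Spark, or Cassandra; the names `MediaSample` and `summarize_windows`, the choice of packet loss as the quality measure, and the 10-second window are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MediaSample:
    """One hypothetical measurement reported for a call."""
    call_id: str
    timestamp: float    # seconds since the start of the capture
    bytes_sent: int     # payload bytes observed since the previous sample
    packet_loss: float  # fraction of packets lost, between 0 and 1


def summarize_windows(samples, window_seconds=10):
    """Group samples into fixed windows per call and summarize each window."""
    buckets = defaultdict(list)
    for sample in samples:
        # Align each sample to the start of the window it falls into.
        window_start = int(sample.timestamp // window_seconds) * window_seconds
        buckets[(sample.call_id, window_start)].append(sample)

    summary = {}
    for key, group in buckets.items():
        total_bytes = sum(s.bytes_sent for s in group)
        summary[key] = {
            # Average bandwidth over the window, in kilobits per second.
            "kbps": total_bytes * 8 / 1000 / window_seconds,
            # Mean packet loss over the samples seen in the window.
            "mean_packet_loss": sum(s.packet_loss for s in group) / len(group),
        }
    return summary
```

In a real deployment the same grouping would typically be expressed with the windowing operators of a stream processor rather than an in-memory dictionary; the sketch only shows the shape of the computation.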
We have written solutions that enable retail chains to analyze stocks and their movements. Data was ingested using Spark DStreams, Kafka queues were used to process it, and Cassandra was used to store the output for visualization.
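As a rough illustration of tracking stock movements in small batches, the sketch below keeps a running stock level per store and item in plain Python. It stands in for the idea only: it does not use Spark, Kafka, or Cassandra, and the event shape `(store, sku, quantity_delta)` and the helper names `apply_batch` and `low_stock` are hypothetical.

```python
from collections import defaultdict


def apply_batch(stock_levels, batch):
    """Apply one batch of (store, sku, quantity_delta) events in place."""
    for store, sku, delta in batch:
        # Positive deltas are deliveries, negative deltas are sales.
        stock_levels[(store, sku)] += delta
    return stock_levels


def low_stock(stock_levels, threshold):
    """Return the (store, sku) keys whose level is below the threshold."""
    return sorted(key for key, qty in stock_levels.items() if qty < threshold)


# Usage: two small batches arriving one after the other.
levels = defaultdict(int)
apply_batch(levels, [("s1", "milk", 20), ("s1", "bread", 5)])
apply_batch(levels, [("s1", "milk", -18)])
alerts = low_stock(levels, threshold=5)
```

In a production pipeline the running levels would live in a durable store rather than in memory, so that results survive restarts and can feed a visualization layer.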
More recently, we have been working with a team that is building a database system that uses the parallel processing power of GPUs, on clusters of commodity-grade hardware, to accelerate Big Data applications. The system uses CUDA libraries to run work in parallel, speeding up database operations by orders of magnitude.
We have more than 15 man-years of experience in this domain.