Big Data
As the name indicates, “Big Data” refers to data in huge volumes. The pace of data generation accelerated towards the end of the 20th century, mainly because computers and other technical devices became cheaper, smaller, more advanced, more efficient and faster. This in turn increased the number of users, which resulted in the generation of humongous amounts of data.
Data can be Structured, Semi-Structured or Unstructured. No matter what form the data exists in, considerable effort and efficient technology are required to extract information from it.
In Structured data, both the fields and their data types are known; data stored in relational (SQL) databases such as MySQL is an example. In Semi-Structured data, only the fields are known, while their data types are not; data in a CSV file is an example. In Unstructured data, neither the fields nor their data types are known; Word documents and PDF files are common examples.
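To make the distinction concrete, here is a small Python sketch; the table, field names and values are invented purely for illustration:

```python
import csv
import sqlite3

# Structured: a relational table declares every field and its data type up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, signup_date TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Asha', '2021-06-01')")

# Semi-structured: a CSV file names its fields in the header row, but every
# value arrives as a plain string -- the data types are not declared anywhere.
csv_text = "id,name,signup_date\n1,Asha,2021-06-01\n"
row = next(csv.DictReader(csv_text.splitlines()))
print(row)   # {'id': '1', 'name': 'Asha', 'signup_date': '2021-06-01'}

# Unstructured: free text gives us neither fields nor types; any structure
# has to be inferred, for example by parsing or text mining.
note = "Asha signed up on the 1st of June 2021 and mentioned she likes cycling."
print(note)
```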
The process of extracting data from its source, transforming it into a structured form and loading it into a target store is known as Extract, Transform and Load, or ETL.
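As a minimal sketch of that idea, assuming the source is a small CSV export (the records, field names and target table are all made up), an ETL step in Python might look like this:

```python
import csv
import sqlite3

# Extract: pull raw records out of the source. Here the source is an inline
# CSV export; in practice it could be log files, documents or an external API.
raw_csv = "id,name,signup_date\n2,ravi kumar,2021-07-15\n3,MEERA N,2021-07-16\n"
raw_rows = list(csv.DictReader(raw_csv.splitlines()))

# Transform: cast the string values to proper types and normalise the names.
cleaned = [(int(r["id"]), r["name"].strip().title(), r["signup_date"]) for r in raw_rows]

# Load: write the typed rows into a structured (relational) table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, signup_date TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", cleaned)
print(conn.execute("SELECT * FROM users").fetchall())
```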
As outlined above, big data refers to data in humongous amounts. However, not every large volume of data can be called big data. Data can be categorized as big data only if it possesses the following characteristics:
- Volume – a huge amount of data
- Velocity – data arriving at high speed, with many new inputs every second
- Variety – data in many complex forms (structured, semi-structured and unstructured)
In short, big data cannot be processed using ordinary tools and techniques; it requires advanced software and expertise. This is where the distributed architecture of computing becomes important. Big data can only be processed using distributed computing, i.e., a group of networked computers working together to achieve a single specific goal.
Normally a computer is composed of four components: the CPU (which determines the speed of the computer), the RAM, the SSD and the network. While handling humongous amounts of data, any one of these four elements can become a bottleneck in processing. This is why we depend on a distributed computing network to process big data.
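As a toy illustration of that divide-and-combine idea, the sketch below uses local worker processes to stand in for networked machines and counts words in a few invented chunks of text; real distributed systems apply the same split-process-merge pattern across a whole cluster:

```python
from multiprocessing import Pool

def count_words(chunk):
    # Each "node" counts the words in its own chunk independently.
    counts = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def merge(partials):
    # The partial results from every node are combined into one answer.
    total = {}
    for partial in partials:
        for word, n in partial.items():
            total[word] = total.get(word, 0) + n
    return total

if __name__ == "__main__":
    chunks = [
        "big data needs distributed computing",
        "distributed computing splits big jobs",
        "big data means big volume and big velocity",
    ]
    with Pool(processes=3) as pool:   # three workers stand in for three machines
        partial_counts = pool.map(count_words, chunks)
    print(merge(partial_counts))
```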