How Big Tech Rules the World with Big Data

Pawan Trivedi
6 min readSep 17, 2020


Understanding what Big Data is and how companies like Facebook, Google, and Twitter handle it.

Have you ever wondered what happens when you click that 👍 like button on Facebook or 🔍 search for the date of your favorite festival on Google?

Every single click, search, or other activity you perform creates raw data that gets collected by giants like Google, Facebook, Twitter, Amazon, Salesforce, Cloudera, and many more, and they use that data to turn a profit. All these companies work on the philosophy that 'the more information/data you have about your customers, the better you can understand their interests, wants, and needs'.

This (collecting data and making a business out of it) isn't as easy as it seems; it takes a lot of computing power, intelligence, and storage capacity.

To show you how all this happens in real time, without even a second's delay, I'm going to explain it in detail.

What I mean by real-time is: which #hashtag is trending on Twitter, how people are reacting to a particular story or post on Facebook, or which video tops YouTube's trending section based on likes, comments, and views. All this happens because of the power of Big Data: collecting it in mass, processing it in a matter of seconds, and affecting the real world in real time.

To understand this better, take a look at these Twitter trend screenshots.

Compare trends 2 and 3 in the first picture (at 17:33) with trends 3 and 2 in the second picture (at 17:34).
Twitter trends changing within 30 seconds.

Now the question arises: how do they process so much data at such a fast rate? Before getting into that, let's look at how much data we actually create per day.

  • 5bn searches made across all search engines
  • 300bn emails sent every day
  • 300 PB of data created by Facebook (350m photos, 100m hours of video watched)
  • 65bn messages sent over WhatsApp (including 2bn voice and video calls)
  • 95m photos and videos shared on Instagram
  • ~28 PB of data from wearable devices per day

In total, we are producing 2.5 quintillion bytes of data every day. And it is estimated that 463 EB of data will be created every day by 2050.

By definition, the three 'V's that define Big Data are Volume, Variety, and Velocity.

Even though they collect so much data, they are not able to utilize all of it. This unknown and unused data is known as Dark Data, and it comprises approximately 7.5 septillion bytes of the data generated every day. So why aren't companies able to use this dark data? Largely because it is unstructured, unlabeled, and too costly to process.

Now let's talk about how they manage this vast amount of data with high speed and efficiency.

Distributed storage and Apache Hadoop are the solutions to the Big Data problem.

In a traditional system, I/O is the most costly operation in data processing. For example, a traditional database system stores data on a single machine; when you need data, you make a request and fetch it from that store. With this method you are limited in both the amount of data you can handle and the processing capability available.

In Big Data, data is stored on multiple computers (thousands of devices) using a distributed technique, so it is fast to process. To reduce the cost of I/O, data movement is kept to a minimum using an algorithm known as MapReduce: it runs your query on every node where your data is present, then aggregates the partial results and returns the final answer to you.
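The "run the query where the data lives, aggregate only small results" idea can be sketched in a few lines of Python. This is a toy single-process model, not a real MapReduce framework; the three `shards` stand in for data living on three hypothetical nodes.

```python
from collections import defaultdict

# Hypothetical 3-node cluster: each shard is the data local to one node.
shards = [
    ["big data", "map reduce"],
    ["big data", "hadoop"],
    ["map reduce", "big data"],
]

def map_phase(shard):
    """Runs on each node, next to its local data: emit (word, 1) pairs."""
    pairs = []
    for line in shard:
        for word in line.split():
            pairs.append((word, 1))
    return pairs

def reduce_phase(all_pairs):
    """Aggregate the small partial results from every node into final counts."""
    counts = defaultdict(int)
    for word, n in all_pairs:
        counts[word] += n
    return dict(counts)

# Only the compact (word, 1) pairs travel over the network, never the raw data.
partials = [pair for shard in shards for pair in map_phase(shard)]
print(reduce_phase(partials))
# {'big': 3, 'data': 3, 'map': 2, 'reduce': 2, 'hadoop': 1}
```

In a real cluster the map phase runs in parallel on each node, and a shuffle step routes each word's pairs to the reducer responsible for it; the toy version collapses both into one process to show the data flow.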

So Big Tech uses Hadoop, which runs applications on a distributed system with thousands of nodes (computers) holding petabytes of information. It has a distributed file system, called the Hadoop Distributed File System (HDFS), which enables fast data transfer among the nodes. In HDFS, the master (NameNode) splits a single large file into N chunks (small files) and sends them to N other systems (nodes/slaves) for storage. When the file is needed, the master requests the chunks back from the nodes and reassembles them into the original file. This technique is called clustering.
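The split-and-reassemble step described above can be modeled in a few lines. This is a minimal sketch of the NameNode's chunking logic only, with a toy 256-byte chunk size; real HDFS uses much larger blocks (64 MB in older versions, 128 MB by default today) and also replicates each chunk to several DataNodes.

```python
def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """NameNode role: split one large file into fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def reassemble(chunks: list[bytes]) -> bytes:
    """On a read request, fetch the chunks back and join them in order."""
    return b"".join(chunks)

# Toy example: a 1 KB "file" split into 256-byte chunks for 4 data nodes.
file_data = bytes(range(256)) * 4
chunks = split_into_chunks(file_data, 256)
assert len(chunks) == 4
assert reassemble(chunks) == file_data   # round-trips back to the original
```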

This is how a distributed system works.

How does Big Data work?

Collect -> Store -> Process & Analyze -> Consume & Visualize
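The four stages above can be sketched as a tiny pipeline. The sample click events and the in-memory "store" are made up for illustration; in production, storage would be HDFS or object storage and the analysis would run distributed.

```python
from collections import Counter

def collect():
    """Collect: raw events from user activity (hypothetical sample data)."""
    return [{"user": "a", "action": "like"},
            {"user": "b", "action": "like"},
            {"user": "a", "action": "search"}]

def store(events):
    """Store: persist raw events (here, just an in-memory list)."""
    return list(events)

def process_and_analyze(stored):
    """Process & Analyze: reduce raw events to an aggregate."""
    return Counter(e["action"] for e in stored)

def consume_and_visualize(result):
    """Consume & Visualize: present the aggregate to a human."""
    for action, n in sorted(result.items()):
        print(f"{action}: {n}")

consume_and_visualize(process_and_analyze(store(collect())))
# like: 2
# search: 1
```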

→ How does Facebook manage the big data challenge?

“Facebook runs the world’s largest Hadoop cluster” — Jay Parikh, VP Infrastructure Engineering, Facebook.

Facebook Big Data Arch. to handle/Process data

With a huge amount of unstructured data coming in each day, Facebook uses its self-developed tool Scuba (as the name suggests), which helps Hadoop developers dive into massive data sets and carry out analysis in real time.

Scuba

→ How does Google manage the big data challenge?

Google has designed a system known as Google File System (GFS), a scalable distributed file system for large distributed data. The largest cluster provides hundreds of terabytes of storage across thousands of disks on over a thousand machines.

A GFS cluster consists of a single master and multiple chunkservers (aka slaves/nodes). Files are divided into fixed-size chunks of 64 MB each. The master stores the metadata needed to connect everything back together; the three major things it stores are the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas. All metadata is kept in the master's memory.
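A toy model of those three in-memory structures makes the master's lookup job concrete. The file path, chunk ids, and chunkserver names below are invented for illustration; only the 64 MB chunk size comes from the text.

```python
# The three metadata structures held in the master's memory (toy data).
namespace = {"/logs/2020-09-17.log"}                     # file & chunk namespace
file_to_chunks = {"/logs/2020-09-17.log": ["c1", "c2"]}  # file -> ordered chunk ids
chunk_locations = {                                      # chunk id -> replica servers
    "c1": ["chunkserver-3", "chunkserver-7", "chunkserver-9"],
    "c2": ["chunkserver-1", "chunkserver-4", "chunkserver-8"],
}

def locate(path: str, offset: int, chunk_size: int = 64 * 2**20):
    """Which chunk holds byte `offset` of `path`, and where are its replicas?"""
    chunk_id = file_to_chunks[path][offset // chunk_size]
    return chunk_id, chunk_locations[chunk_id]

# Byte 70 MiB falls past the first 64 MB chunk, so it lives in chunk c2.
print(locate("/logs/2020-09-17.log", 70 * 2**20))
```

Because all of this fits in RAM, the master can answer "where is byte X of file Y?" with a couple of dictionary lookups, and clients then talk to the chunkservers directly for the data itself.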

Each chunkserver uses checksumming to detect corruption of stored data. Since a GFS cluster often has thousands of disks on hundreds of machines, it regularly experiences disk failures that cause data corruption or loss on both the read and write paths. It recovers from corruption using other chunk replicas, so each chunkserver must independently verify the integrity of its own copy by maintaining checksums.
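The checksumming idea can be sketched with Python's standard `zlib.crc32`. This is an illustration of the technique, not GFS's actual code: each chunk is checksummed in 64 KB blocks, and a mismatch on read means the copy is corrupt and a clean replica should be fetched from another chunkserver.

```python
import zlib

BLOCK = 64 * 1024  # checksum granularity: one CRC32 per 64 KB block

def checksums(chunk: bytes) -> list[int]:
    """Per-block CRC32s a chunkserver might keep alongside the data."""
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verify(chunk: bytes, expected: list[int]) -> bool:
    """Independently re-check this copy's integrity before serving it."""
    return checksums(chunk) == expected

data = b"replica payload" * 10_000
stored = checksums(data)              # computed when the chunk was written
assert verify(data, stored)           # an intact copy passes

corrupted = b"X" + data[1:]           # a single flipped byte on disk...
assert not verify(corrupted, stored)  # ...is caught, so a good replica is used
```

Checking per-block rather than per-chunk means a read of a few kilobytes only has to verify the blocks it touches, not the whole 64 MB chunk.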

→ How does Twitter manage the big data challenge?

Hundreds of millions of tweets are sent every day. They are processed, stored, cached, served, and analyzed. With such massive content, Twitter needs a suitably robust infrastructure.

To overcome this problem they use Hadoop, Manhattan (Twitter's in-house distributed database), and many more tools.

This chart shows how Twitter manages the Big Data problem.

In the next blog, I'll guide you through how you can perform all this on the cloud.

Thank you for reading! If you found this article helpful, share it with your colleagues and give it a 👏🏿.

In case of any query, you can reach me at 59r@protonmail.com
