Why Do We Need to Explore Datasets?

During the COVID-19 pandemic, many people became infected, and many more worried about themselves and their relatives. Consequently, there was high demand for information, and people liked to share their experiences and news about the disease. As a result, a huge volume of prevention instructions and treatment advice flooded social media.

Alongside the official news, a large amount of misinformation and many conspiracy theories also spread daily on social networks. Examples of this false news include:

  • The outbreak was a population-control scheme created by former Microsoft CEO Bill Gates.
  • 5G mobile networks caused the pandemic.
  • Drinking alcohol and using cocaine protect against COVID-19.

This mixture of true and false information is called an infodemic. As stated by the WHO, an infodemic is "too much information including false or misleading information in digital and physical environments during a disease outbreak."

The main question is whether the infodemic is unique to COVID-19. Do we not face a similar situation in other areas? Looking elsewhere, we can see that the volume of information is growing exponentially in many aspects of daily life, and this poses a serious challenge when choosing the proper content. The following sections give some examples.

Academic Publications

The Microsoft Open Academic dataset contains the metadata of academic papers, books, and patents. It covers 208 million publications. The annual publication count can be found in the plot below.

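According to the plot, the number of publications has increased exponentially. To show the growth more clearly, the graph’s vertical axis uses a base-2 logarithmic scale, and the data is fitted with a linear regression model. On such a scale, exponential growth appears as a straight line, and each unit step on the vertical axis corresponds to one doubling.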
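
The fit described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available; the function name `doubling_time` is hypothetical, and the yearly counts are synthetic placeholder values, not the dataset's real numbers.

```python
import numpy as np

def doubling_time(years, counts):
    """Fit log2(counts) = slope * (years - first year) + intercept.

    Returns (slope, intercept, doubling time in years).
    """
    years = np.asarray(years, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(years - years[0], np.log2(counts), deg=1)
    # A slope of 1 in log2 space means one doubling per year,
    # so the doubling time is the reciprocal of the slope.
    return slope, intercept, 1.0 / slope

# Synthetic series for illustration only (NOT real publication counts):
# it is constructed to double every 10 years.
years = np.arange(1950, 2021, 10)
counts = 1000 * 2.0 ** ((years - 1950) / 10)

slope, intercept, t_double = doubling_time(years, counts)
print(f"slope = {slope:.3f} doublings/year, doubling time = {t_double:.1f} years")
```

With real data the points will scatter around the fitted line, so the reciprocal of the slope should be read as an average doubling time over the fitted period.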

Podcasts

Nowadays, podcasts are one of the most popular digital media, and their number is increasing rapidly.

The number of podcast accounts is growing exponentially and doubling every two years.

Videos

Almost 5 billion videos are watched on YouTube every single day. This is one of the mind-blowing facts about YouTube. To see the growth of YouTube content, we look at how the total length of video uploaded per minute has changed from year to year.

As you can see in the next plot, the growth of content on YouTube is exponential, doubling every 15 months.

Apps

In 2020, 108.5 billion apps were downloaded from the Google Play Store. In addition to the number of downloads, the number of apps available on the Play Store has also grown exponentially.

As you can see, between 2010 and 2018, the number of unique apps doubled every 16 months.

Worldwide Global Data

Worldwide global data is a measure of how much new data is created, captured, and replicated each year. We can see the past volume and the estimated future volume of worldwide data.

As can be seen in the figure below, the data fits an exponential function well. Worldwide data volume doubles every 30 months.
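
The doubling times quoted in the sections above are easier to compare once they are converted to a yearly growth factor, using factor = 2^(12 / doubling time in months). The sketch below does only this arithmetic; the function name `annual_growth_factor` and the dictionary labels are illustrative choices, and the inputs are the doubling times stated in this post.

```python
def annual_growth_factor(doubling_months):
    """Yearly multiplication factor for a quantity that doubles every `doubling_months` months."""
    return 2.0 ** (12.0 / doubling_months)

# Doubling times as stated in this post, in months.
doubling_months = {
    "podcasts": 24,
    "YouTube uploads": 15,
    "Play Store apps": 16,
    "worldwide data": 30,
}

for name, months in doubling_months.items():
    factor = annual_growth_factor(months)
    print(f"{name}: x{factor:.2f} per year ({(factor - 1) * 100:.0f}% growth)")
```

For example, doubling every 24 months corresponds to multiplying by the square root of 2, roughly 1.41, each year.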

Data Exploration

As seen above, the growth in information volume is not limited to COVID-19 tweets, and we are seeing exponential growth of data in almost every field. This trend is expected to accelerate with the advent of 5G technology. In such a huge volume of data, it is very difficult to distinguish correct information from misinformation and conspiracy theories. Some tech companies, such as Google, Facebook, and Twitter, claim to be developing algorithms to detect fake news. However, these efforts have not been effective so far, and perhaps because of their business models, these companies may never provide data without bias. Most of these companies’ revenue comes from advertising, so even in the best case, we are shown more of the content related to their customers. The plot below shows the share of advertising in Google’s total revenue between 2017 and 2020.

What should we do now? Obviously, this problem does not have a clear or easy solution that can be written in a few lines. I think that in this disappointing infodemic era, data analysis gives us a tool, like a magnifying glass, with which we can sometimes look directly at the data to find facts. For example, if we are interested in understanding what people think about COVID-19, we can extract the related tweets and start analyzing them with NLP tools. Naturally, this work requires data and is time-consuming, and we cannot use it in all cases, but at least sometimes we can get first-hand stories for ourselves. On this blog, I try to write posts that help us understand the world around us better by using data.
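
As a very rough sketch of that first look at a set of tweets, the snippet below counts the most frequent terms using only the Python standard library. It is a starting point under stated assumptions, not a full NLP pipeline: the function name `top_terms`, the tiny stopword list, and the three example tweets are all made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "is", "to", "and", "of", "in", "i", "my", "for", "it"}

def top_terms(texts, n=5):
    """Return the n most common lowercase tokens across texts, ignoring stopwords."""
    counts = Counter()
    for text in texts:
        # Keep letters, digits, hashtags, and apostrophes; drop other punctuation.
        tokens = re.findall(r"[a-z0-9#']+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)

# Made-up example tweets (NOT real data).
tweets = [
    "Got my vaccine today, feeling fine",
    "Is the vaccine safe for kids?",
    "Lockdown again... working from home",
]

print(top_terms(tweets, n=3))
```

Term counts like these only hint at which topics dominate; understanding opinions would need further steps such as sentiment or topic analysis.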

Licensed under CC BY-NC-SA 4.0
Last updated on Nov 21, 2021 18:23 CET