Machine Learning and Big Data


As the datasets used in the field of machine learning grow bigger and more complex, we enter the realm of big data: massive amounts of data, growing exponentially and coming from a variety of sources. This information needs to be stored and analyzed efficiently and rapidly in order to provide business value by enhancing insight and assisting in decision making and process automation. Size alone, however, is not enough to identify big data; complexity is another important factor that must be taken into account.

The challenges of big data

Unfortunately, big data analysis cannot be accomplished using traditional processing methods. Examples of big data sources include social media interactions, geographical coordinates collected by Global Positioning System (GPS) devices, sensor readings, and retail store transactions. These examples give us a glimpse into what is known as the three Vs of big data:
  • Volume: Huge amounts of data that are being generated every moment
  • Variety: The different forms that data can come in, such as plain text, pictures, audio and video, geospatial data, and so on
  • Velocity: The speed at which data is generated, transmitted, stored, and retrieved
The wide variety of sources that a business can use to gather data for analysis results in large amounts of information, and it only keeps growing. This requires special technologies for data storage and management that were not available, or whose use was not widespread, a decade ago.

The first V in big data – volume

When we talk about volume as one of the dimensions of big data, one of the challenges is the physical space required to store the data efficiently, considering its size and projected growth. Another challenge is retrieving, moving, and analyzing the data efficiently enough to get results when we need them. There are yet other challenges associated with handling high volumes of information: the availability and maintenance of high-speed networks and sufficient bandwidth, along with the related costs, are only two of them.

While the huge volume of information can refer to a single dataset, it can also refer to thousands or millions of smaller sets put together. Think of the millions of e-mails sent, tweets and Facebook posts published, and YouTube videos uploaded every day, and you will get a grasp of the vast amount of data that is being generated, transmitted, and analyzed as a result.

The second V – variety

Variety refers not only to the many sources data comes from but also to the way it is represented (structural variety), the medium in which it is delivered (medium variety), and its availability over time. As an example of structural variety, the satellite image of a forming hurricane is very different from the tweets sent out by people observing it as it makes its way over an area. As an example of medium variety, an audio speech and its transcript may represent the same information, but they are delivered via different media. Finally, we must take into consideration that data may be available all the time and in real time (for example, from a security camera) or only intermittently (when a satellite passes over an area of interest).

Additionally, the study of data cannot be restricted to the analysis of structured data (traditional databases, tables, spreadsheets, and files), however valuable these resources may be. As we mentioned in the introduction, in the era of big data, lots of unstructured data (SMS messages, images, audio files, and so on) is being generated, transmitted, and analyzed using special methods and tools. It is no wonder, then, that data scientists agree that variety really means diversity and complexity.

The third V – velocity

When we consider velocity as one of the dimensions of big data, we may think it refers only to the speed at which data is transmitted from one point to another. However, as we indicated in the introduction, it means much more than that: it also covers the speed at which data is generated, stored, and retrieved for analysis. Failure to take advantage of data as it is being generated can lead to lost business opportunities. Let's consider the following examples to illustrate the importance of velocity in big data analytics:
  • If you want to give your son or daughter a present for his or her birthday, would you consider what they wanted a year ago, or would you ask them what they would like today?
  • If you are considering moving to a new career, would you take into consideration the top careers from a decade ago or the ones that are most relevant today and are expected to experience a remarkable growth in the future?

Introducing a fourth V – veracity

For introductory purposes, simplifying big data into the three Vs (Volume, Variety, and Velocity) is a good approach. However, it is somewhat simplistic in that there is (at least) one more dimension we must consider in our analysis: the veracity, or quality, of the data.

Why is big data so important?

“Why is big data such a big deal in today's computing?” In other words, what makes big data so valuable as to deserve recurring million-dollar investments from big companies? Let's consider the following real-life story to illustrate the answer. In the early 2000s, a large retailer in the United States hired a statistician to analyze the shopping habits of its customers. In time, as his computers analyzed past sales associated with customer data and credit card information, he was able to assign each customer a pregnancy prediction score and estimate due dates within a small window. Without going into the nitty-gritty of the story, it is enough to say that the retailer used this information to mail discount coupons to people buying a combination of several pregnancy and baby care products. Needless to say, this ended up increasing the retailer's revenue significantly.
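
To get a rough feel for what such a scoring model might look like, here is a toy sketch in Python using scikit-learn. Every feature, label, and number below is invented for illustration; the retailer's actual model was proprietary and certainly far more sophisticated. The point is simply that purchase-history indicators can be turned into a probability-like prediction score.

# Toy purchase-based prediction score; all features, labels, and data are invented.
from sklearn.linear_model import LogisticRegression
import numpy as np

# Hypothetical per-customer purchase indicators:
# [bought unscented lotion, bought prenatal vitamins, bought an oversized handbag]
X = np.array([
    [1, 1, 1],
    [0, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
])
y = np.array([1, 0, 1, 0])  # made-up label: 1 = later signed up for a baby registry

model = LogisticRegression()
model.fit(X, y)

new_customer = np.array([[1, 1, 0]])
score = model.predict_proba(new_customer)[0, 1]  # probability-like prediction score
print(f"Pregnancy prediction score: {score:.2f}")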

MapReduce and Hadoop

During the early 2000s, Google and Yahoo were starting to experience the challenges associated with the massive amounts of information they were handling. However, the challenge resided not only in the amount but also in the complexity of the data. As opposed to structured data, this information could not be easily processed using traditional methods, if at all. As a result of subsequent studies and joint effort, and building on the MapReduce programming model described by Google, Hadoop was born as an efficient and cost-effective tool for reducing huge analytical problems to small tasks that can be executed in parallel on clusters made up of commodity (affordable) hardware. Hadoop can handle many data sources: sensor data, social network trends, regular table data, search engine queries and results, and geographical coordinates are only a few examples. In time, Hadoop grew into a powerful and robust ecosystem that includes multiple tools to make the management of big data a walk in the park for data scientists. It is currently maintained by the Apache Software Foundation, with Yahoo being one of its main contributors.
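
To make the idea of splitting a huge analytical problem into small parallel tasks more concrete, here is a minimal sketch of the MapReduce programming model written in plain Python (the classic word count). The tiny document list is made up, and everything runs in a single process; a real Hadoop job would distribute the map and reduce steps across the machines of a cluster.

# A minimal, single-process sketch of the MapReduce programming model.
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the document.
    for word in document.lower().split():
        yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group values by key, as Hadoop does between the map and reduce steps.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: combine the values for each key into a single result.
    return {word: sum(counts) for word, counts in grouped.items()}

if __name__ == "__main__":
    documents = [
        "big data needs big tools",
        "hadoop splits big problems into small parallel tasks",
    ]
    mapped = (pair for doc in documents for pair in map_phase(doc))
    print(reduce_phase(shuffle_phase(mapped)))  # for example: {'big': 3, 'data': 1, ...}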
