Machine Learning and Big Data
As the datasets used in the field of machine learning grow bigger and more complex, we enter the realm of big data: massive amounts of data, growing exponentially and coming from a variety of sources. This information needs to be stored and analyzed efficiently and rapidly in order to provide business value by enhancing insight and assisting in decision making and process automation. Note that size by itself cannot be used to identify big data; complexity is another important factor that must also be taken into account.
The challenges of big data
Unfortunately, big-data analysis cannot be accomplished using traditional processing methods. Some examples of big data sources are social media interactions, geographical coordinates collected by Global Positioning System devices, sensor readings, and retail store transactions, to name a few. These examples give us a glimpse into what is known as the three Vs of big data:
- Volume: The huge amounts of data that are being generated every moment
- Variety: The different forms that data can come in: plain text, pictures, audio and video, geospatial data, and so on
- Velocity: The speed at which data is generated, transmitted, stored, and retrieved
The first V in big data – volume
When we talk about volume as one of the dimensions of big data, one challenge is the physical space required to store the data efficiently, considering its size and projected growth. Another challenge is that we need to retrieve, move, and analyze this data efficiently to get the results when we need them. There are yet other challenges associated with handling high volumes of information: the availability and maintenance of high-speed networks and sufficient bandwidth, along with the related costs, are only two of them.
While the huge volume of information can refer to a single dataset, it can also refer to thousands or millions of smaller sets put together. Think of the millions of e-mails sent, tweets and Facebook posts published, and YouTube videos uploaded every day, and you will get a grasp of the vast amount of data that is being generated, transmitted, and analyzed as a result.
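To make the storage-planning side of volume concrete, here is a minimal sketch in Python of projecting capacity needs under a compound annual growth assumption. The starting size and growth rate are entirely hypothetical figures chosen for illustration:

```python
def projected_volume(current_tb, annual_growth_rate, years):
    """Project future storage needs in terabytes, assuming
    compound annual growth at a fixed rate.

    current_tb: current dataset size in TB (hypothetical figure)
    annual_growth_rate: fractional yearly growth, e.g. 0.40 for 40%
    years: planning horizon in years
    """
    return current_tb * (1 + annual_growth_rate) ** years

# A hypothetical 500 TB dataset growing 40% per year:
for year in range(1, 6):
    print(f"Year {year}: {projected_volume(500, 0.40, year):,.1f} TB")
```

Even this toy calculation shows why projected growth matters: at a steady 40% yearly rate, the dataset would more than quintuple within five years, with matching implications for network bandwidth and cost.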
The second V – variety
Variety does not only refer to the many sources that data comes from but also to the way it is represented (structural variety), the medium in which it gets delivered (medium variety), and its availability over time. As an example of structural variety, the satellite image of a forming hurricane is different from the tweets sent out by people who are observing it as it makes its way over an area. Medium variety refers to the medium in which the data gets delivered: an audio speech and its transcript may represent the same information, but they are delivered via different media. Finally, we must take into consideration that data may be available all the time, in real time (for example, from a security camera), or only intermittently (when a satellite is over the area of interest). Additionally, the study of data can't be restricted to the analysis of structured data (traditional databases, tables, spreadsheets, and files), however valuable these long-standing resources may be. As we mentioned in the introduction, in the era of big data, lots of unstructured data (SMS messages, images, audio files, and so on) is being generated, transmitted, and analyzed using special methods and tools. That said, it is no wonder that data scientists agree that variety actually means diversity and complexity.
The third V – velocity
When we consider velocity as one of the dimensions of big data, we may think it only refers to the speed at which data is transmitted from one point to another. However, as we indicated in the introduction, it means much more than that: it also implies the speed at which data is generated, stored, and retrieved for analysis. Failure to take advantage of data as it is being generated can lead to loss of business opportunities. Let's consider the following examples to illustrate the importance of velocity in big data analytics:
- If you want to give your son or daughter a birthday present, would you consider what they wanted a year ago, or would you ask them what they would like today?
- If you are considering moving to a new career, would you take into consideration the top careers from a decade ago or the ones that are most relevant today and are expected to experience a remarkable growth in the future?
Introducing a fourth V – veracity
For introductory purposes, simplifying big data into the three Vs (Volume, Variety, and Velocity) can be considered a good approach. However, it may be somewhat overly simplistic, in that there is (at least) one more dimension that we must consider in our analysis: the veracity (or quality) of the data.
Why is big data so important?
Why is big data such a big deal in today's computing? In other words, what makes big data so valuable as to deserve million-dollar investments from big companies on a periodic basis? Let's consider the following real-life story to illustrate the answer. In the early 2000s, a large retailer in the United States hired a statistician to analyze the shopping habits of its customers. In time, as his computers analyzed past sales associated with customer data and credit card information, he was able to assign a pregnancy prediction score to shoppers and estimate due dates within a small window. Without going into the nitty-gritty of the story, it is enough to say that the retailer used this information to mail discount coupons to people buying a combination of several pregnancy and baby care products. Needless to say, this ended up increasing the retailer's revenue significantly.
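The story does not reveal the retailer's actual model, but the general idea of a purchase-based prediction score can be illustrated with a toy weighted-sum sketch in Python. The product names and weights below are invented for illustration; a real system would learn such weights from historical sales data:

```python
# Hypothetical weights: how strongly each product signals the
# outcome we are trying to predict. A real model would learn these.
WEIGHTS = {
    "unscented_lotion": 0.3,
    "prenatal_vitamins": 0.9,
    "large_tote_bag": 0.2,
    "cotton_balls": 0.25,
}

def prediction_score(purchases):
    """Sum the weights of the flagged products a customer bought,
    capped at 1.0 so the score reads like a probability-style value.
    Products not in WEIGHTS contribute nothing."""
    return min(1.0, sum(WEIGHTS.get(item, 0.0) for item in purchases))

customer = ["unscented_lotion", "prenatal_vitamins", "cotton_balls"]
print(prediction_score(customer))  # prints 1.0 (1.45 capped at 1.0)
```

The key point the story illustrates is not the scoring arithmetic itself but that combinations of otherwise unremarkable purchases, analyzed at scale, can reveal actionable patterns that no single transaction would.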