Predictive analytics is a subset of big data analytics that attempts to forecast future events or behavior based on historical data. It draws on data mining, modeling, and machine learning techniques to predict what will happen next, and it is often used for fraud detection, credit scoring, marketing, and financial and business analysis. Put simply, big data refers to larger, more complex data sets, especially from new data sources. These data sets are so voluminous that traditional data processing software just can’t manage them, yet they can be used to address business problems you wouldn’t have been able to tackle before. While Apache Hadoop may not be as dominant as it once was, it is nearly impossible to talk about big data without mentioning this open-source framework for distributed processing of large data sets. Last year, Forrester predicted that “100% of all large enterprises will adopt it (Hadoop and related technologies such as Spark) for big data analytics within the next two years.”
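As a concrete illustration of the credit-scoring use case mentioned above, here is a minimal sketch using scikit-learn on purely synthetic data; the features, labels, and cut-offs are illustrative assumptions, not anything from a real scoring system.

```python
# A minimal predictive-analytics sketch, assuming scikit-learn and NumPy are
# installed. All data below is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical applicant features: income, outstanding debt, years of credit history.
X = rng.normal(loc=[50_000, 10_000, 8], scale=[15_000, 5_000, 4], size=(1_000, 3))
# Hypothetical historical outcome: 1 = repaid, 0 = defaulted (synthetic rule plus noise).
y = (X[:, 0] - 0.8 * X[:, 1] + 2_000 * X[:, 2]
     + rng.normal(0, 10_000, size=1_000) > 45_000).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a simple model on historical outcomes, then score a new applicant.
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
print("estimated default risk:", 1 - model.predict_proba([[40_000, 12_000, 3]])[0, 1])
```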
Multidimensional big data can also be represented as data cubes or, mathematically, tensors. Array database systems have set out to provide storage and high-level query support for this data type. Additional technologies being applied to big data include efficient tensor-based computation (such as multilinear subspace learning), massively parallel processing (MPP) databases, search-based applications, data mining, distributed file systems, distributed databases, cloud and HPC-based infrastructure (applications, storage, and computing resources), and the Internet. Although many approaches and technologies have been developed, it remains difficult to carry out machine learning with big data.
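As a small illustration of the data cube/tensor representation, the sketch below builds a three-dimensional cube with NumPy; the store × product × month dimensions and the random sales figures are assumptions made for the example, not data from any array database.

```python
# A minimal data-cube sketch as a 3-D tensor, assuming NumPy; dimensions and
# sales figures are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
stores, products, months = 4, 10, 12

# Each cell holds the sales of one product in one store during one month.
cube = rng.integers(0, 500, size=(stores, products, months))

# Typical cube operations: aggregate ("roll up") along one or more dimensions...
sales_per_store = cube.sum(axis=(1, 2))   # collapse product and month
monthly_totals = cube.sum(axis=(0, 1))    # collapse store and product

# ...or fix a dimension ("slice") to inspect one of its members.
june_slice = cube[:, :, 5]                # every store and product in month 6

print(sales_per_store, monthly_totals.shape, june_slice.shape)
```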
MIKE2.0 is an open approach to information management that acknowledges the need for revisions due to big data implications identified in an article titled “Big Data Solution Offering”. The methodology addresses handling big data in terms of useful permutations of data sources, complexity in interrelationships, and difficulty in deleting (or modifying) individual records. Big data brings together data from many disparate sources and applications, and traditional data integration mechanisms, such as ETL (extract, transform, and load), generally aren’t up to the task. Analyzing big data sets at terabyte, or even petabyte, scale requires new strategies and technologies. Some MPP relational databases can store and manage petabytes of data; implicit in that is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS.
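To make the ETL step concrete, here is a minimal extract-transform-load sketch in plain Python; the sales_export.csv source file, its columns, and the SQLite warehouse table are hypothetical stand-ins, and a production pipeline would use purpose-built integration tooling at much larger scale.

```python
# A minimal ETL sketch in plain Python; file name, columns, and the SQLite
# target table are hypothetical stand-ins for real disparate sources.
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a source system (here, a CSV export).
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    # Transform: normalize types and drop records that fail basic checks.
    for row in rows:
        if row.get("amount"):
            yield (row["customer_id"], float(row["amount"]), row["date"])

def load(records, db_path="warehouse.db"):
    # Load: write the cleaned records into the analytical store.
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS sales "
                     "(customer_id TEXT, amount REAL, date TEXT)")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)

if __name__ == "__main__":
    load(transform(extract("sales_export.csv")))
```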
In 2004, Google published a paper on a process called MapReduce. The MapReduce concept provides a parallel processing model, and an associated implementation was released to process huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the Map step); the results are then gathered and delivered (the Reduce step). The framework was very successful, so others wanted to replicate the algorithm, and an implementation of the MapReduce framework was adopted by an Apache open-source project named Hadoop. Apache Spark was developed in 2012 in response to limitations in the MapReduce paradigm, as it adds the ability to set up many operations (not just map followed by reduce).
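The split-and-gather flow described above can be sketched in a few lines of plain Python. This is not Hadoop’s or Spark’s API, just an illustration of the map, shuffle, and reduce phases on a word-count task; the documents list is invented for the example.

```python
# A minimal sketch of the MapReduce idea in plain Python (not Hadoop's API):
# a map phase emits (key, value) pairs, a shuffle groups them by key, and a
# reduce phase aggregates each group.
from collections import defaultdict

def map_phase(document):
    # Map step: split the input and emit one (word, 1) pair per occurrence.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped_pairs):
    # Shuffle: group values by key so each reducer sees one word's counts.
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    # Reduce step: aggregate the grouped values, here by summing the counts.
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data needs big tools", "data beats opinions"]

# In Hadoop or Spark the map calls would run on many nodes in parallel;
# here they simply run one after another on a single machine.
mapped = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(shuffle(mapped)))
```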
The fastest growth in investment in emerging Big Data technologies is in the banking, healthcare, insurance, securities and investment, and telecommunications sectors. Three of these industries lie in the financial sector. Irrespective of sector, Big Data is helping businesses improve operational efficiency and make informed decisions on the basis of the very latest, up-to-the-moment information.
To learn more, you can download our recent whitepapers on Data Analytics.