The development of large data sets in industry and research has created computational challenges but also great opportunities [46]. The data are now large enough that they can no longer be processed on a single computer; hence the need for cluster-based applications, with the data stored across clusters of machines. Initially, the applications that were developed targeted a specific task: MapReduce was built for batch processing, Dremel for interactive queries, Pregel for graph processing, and other systems for streaming. These applications need to support different ways of processing because, by their very nature, big data come from different sources with different structures and different variables. For example, an application may need MapReduce, support for SQL queries, or machine learning functions. These were the reasons why, in 2009, the University of California, Berkeley began developing Apache Spark: a single, unified engine for processing data stored in clusters of computers.
4.1.1 Databases for Big Data
The relational model used in recent decades has been the default for modeling all data. However, with the emergence of big data, relational databases were no longer able to cope [38]. In addition, data now come from many different sources, such as social networks, logs, and more. This diversity of data and sources made it difficult to maintain the atomicity, consistency, isolation and durability (the ACID properties) on which relational databases depend [38] [47]. For these reasons, new tools and models were developed, known as NoSQL, which are governed by the CAP theorem [39]. The CAP theorem states that in a distributed system it is impossible to guarantee all three of the following properties at the same time [40]:
1. Consistency: all nodes see the same data at any time.
2. Availability: the system guarantees that every request to a node receives a response.
3. Partition tolerance: the system continues to function even when the network fails.
NoSQL databases are divided into four families: 1) key/value stores, 2) column stores, 3) document databases, and 4) graph databases.
Apache Spark is a relatively new open-source programming framework that supports the distributed processing of large volumes of data. It was originally developed at the AMPLab of the University of California, Berkeley, and the core of the code was later donated to the Apache Software Foundation. At the time of writing, the latest stable version is 2.2.0. Spark provides an application programming interface (API), data parallelism, and fault tolerance: even if a computer in the cluster fails, execution of the application does not stop but continues with the help of the remaining active nodes [43] [44] [45]. Spark's central abstraction is a data set called the resilient distributed dataset (RDD): a read-only collection of records partitioned across the cluster of computers [49], referred to below simply as an RDD.
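As a rough illustration of the RDD abstraction described above, the following minimal Scala sketch (the application name, master URL, and data are placeholders chosen for this example) creates an RDD partitioned across the cluster and derives a new RDD through a transformation, since RDDs themselves are read-only.

import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    // "local[*]" is a placeholder master; on a real cluster the
    // cluster manager's URL would be used instead.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // An RDD is a read-only collection of records split into partitions
    // that are distributed over the nodes of the cluster.
    val numbers = sc.parallelize(1 to 1000000, numSlices = 8)

    // Transformations do not modify the original RDD; they return a new one.
    val evens = numbers.filter(_ % 2 == 0)

    // Actions trigger the actual distributed computation.
    println(s"Even numbers: ${evens.count()}")

    sc.stop()
  }
}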
Apache Spark
Figure: Basic modules for running Spark applications
Spark Core
Spark Core can run on different cluster managers and can access Hadoop data; in addition, many packages have been created to work either with the Spark core itself or with Spark's own libraries. Spark Core contains the basic features of Apache Spark, such as memory management, failure recovery, and interaction with the storage system. The core is written in the Scala programming language, but programming interfaces are also provided in Java, Python, and R besides Scala. These interfaces support functions such as data transformations and actions that are necessary for data analysis. With these functions, applications built on Spark can use the processing power, memory, and storage space of the compute cluster. The basic idea is that we manipulate the data by applying functions to RDDs, which are themselves stored across the compute cluster [51] [50].
SQL and DataFrames: One of the most common data processing needs is relational queries. Spark SQL is the replacement of Shark and allows queries to be run on Spark in a way that is technically similar to analytical databases. The basic idea is that the system behaves like a relational analytical database. Each RDD record holds a row stored in a binary format, and the system generates the code that executes against that layout. A further improvement to Spark is the DataFrame, which is part of Spark SQL. A DataFrame is conceptually similar to a table in a relational database, but it performs better because Spark evaluates it lazily. DataFrames are distributed data collections, like RDDs, but they are organized according to a schema. This allows Spark to hold more information about the structure of the data, and that information can be used for additional optimization of the computations. Compared to RDDs, DataFrames offer more opportunities for automatic optimization because they carry information about their structure. Additionally, they allow Spark applications to combine relational queries with the procedural processing applied to RDDs. Finally, a recent evolution of the DataFrame is the Dataset API, which provides a framework for object-oriented programming.
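To make the relationship between DataFrames and Spark SQL described above more concrete, the following Scala sketch is a minimal, illustrative example (the column names and sample data are invented): a DataFrame carries a schema, its transformations are evaluated lazily until an action such as show() is called, and the same data can also be queried with SQL.

import org.apache.spark.sql.SparkSession

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-sketch")
      .master("local[*]")          // placeholder; normally set by the cluster manager
      .getOrCreate()
    import spark.implicits._

    // A DataFrame is a distributed collection organized into named,
    // typed columns, i.e. it carries a schema that Spark can optimize on.
    val people = Seq(("Alice", 34), ("Bob", 41), ("Carol", 29))
      .toDF("name", "age")

    // Transformations are lazy: nothing runs until an action is invoked.
    val adults = people.filter($"age" > 30).select("name")

    // Registering the DataFrame lets us express the same query in SQL.
    people.createOrReplaceTempView("people")
    val adultsSql = spark.sql("SELECT name FROM people WHERE age > 30")

    adults.show()      // action: triggers execution of the DataFrame query
    adultsSql.show()   // action: triggers execution of the SQL query

    spark.stop()
  }
}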
Spark Streaming: Spark Streaming processes data streams incrementally, and its model is called discretized streams (DStreams).
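As a sketch of the discretized-stream model, assuming a text stream arriving on a TCP socket (the host name and port are placeholders), the incoming data is cut into small batches and each batch is processed with ordinary RDD-style operations.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one for the receiver, one for processing.
    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
    // The stream is discretized into batches, here one every 5 seconds.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder source: lines of text read from a TCP socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each batch is an RDD, so the usual transformations apply.
    val wordCounts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.print()       // output operation, run once per batch

    ssc.start()              // start receiving and processing data
    ssc.awaitTermination()   // keep the application running
  }
}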
4.2.4 Apache Spark Architecture
The cluster manager allocates resources across the cluster system; resources are assigned to each job that runs on the system. Spark Core can run on one of the following managers: Hadoop YARN, Apache Mesos, Amazon EC2, or Spark's built-in standalone manager. The cluster manager manages the cluster's resources and divides them between Spark applications. Spark can access data stored in HDFS (Hadoop Distributed File System), Cassandra, HBase, Hive, Alluxio, and any Hadoop data source [16].
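As a hedged illustration of how an application selects one of these cluster managers, the Scala sketch below sets the master URL when building the session; all host names, ports, and paths are placeholders, and in practice the master is usually supplied externally through spark-submit rather than hard-coded.

import org.apache.spark.sql.SparkSession

object ClusterManagerSketch {
  def main(args: Array[String]): Unit = {
    // The master URL decides which cluster manager allocates the resources.
    val spark = SparkSession.builder()
      .appName("cluster-manager-sketch")
      // .master("spark://master-host:7077") // Spark standalone manager
      // .master("yarn")                     // Hadoop YARN
      // .master("mesos://mesos-host:5050")  // Apache Mesos
      .master("local[*]")                    // local mode, no cluster manager
      .getOrCreate()

    // Any supported Hadoop data source can then be read, e.g. a file in HDFS
    // (the path below is a placeholder).
    val lines = spark.read.textFile("hdfs://namenode:9000/data/sample.txt")
    println(s"Lines read: ${lines.count()}")

    spark.stop()
  }
}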
Spark Applications
The following entities are required to execute Spark applications:
Driver program (Spark Driver)
Cluster Manager
Workers
Executors
Tasks
The figure shows how these entities interact in a Spark application.
The driver program is an application that uses the Spark libraries and defines the control flow of the computation. When the workers make their resources (CPU, memory, and storage space) available to the Spark application, an executor (a Java virtual machine process) is created on each worker. A job is a set of computations that Spark executes on the cluster in order to return results to the driver program. Apache Spark breaks a job down into stages that form a directed acyclic graph (DAG), and each stage consists of tasks (Figure 22). A task is the smallest unit of work that Spark can send to an executor.
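To make the driver, executor, stage, and task terminology more tangible, the following Scala sketch (the sample data are invented) runs in the driver: the transformations are only recorded, and the final action launches a job that Spark splits into stages of tasks executed by the workers' executors.

import org.apache.spark.{SparkConf, SparkContext}

object JobStagesSketch {
  def main(args: Array[String]): Unit = {
    // This main method is the driver program: it defines the control flow.
    val conf = new SparkConf().setAppName("job-stages-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val words = sc.parallelize(Seq("spark", "hadoop", "spark", "hdfs", "spark"))

    // The narrow transformation (map) stays within one stage; the shuffle
    // required by reduceByKey introduces a stage boundary in the DAG.
    val counts = words
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // The action below launches a job: the driver turns the DAG into stages,
    // each stage into tasks, and ships the tasks to the executors, which
    // return their results to the driver.
    counts.collect().foreach { case (word, n) => println(s"$word -> $n") }

    sc.stop()
  }
}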
Spark packages are open-source software that integrates with Spark but is not part of Apache Spark itself. Some of these packages are built to run directly on the Apache Spark core, others on its core libraries. There are more than 200 packages in various categories, such as connectors to new data sources, new machine learning libraries, graph algorithms, streaming, and PySpark.
Apache Spark has been adopted and supported by both the academic community and industry. The development community is quite large, with contributions coming from all over the world, and dozens of changes are made to the application code every day. The last major change was version 2.0. Apache Spark is becoming a default framework for data analysis. It provides a unified engine that goes beyond batch processing by combining different components such as iterative algorithms, streaming, graphs, machine learning models, and external packages that extend its functionality. It can sort data directly in memory without having to read the data from disk, which makes memory an important resource in the system.
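Since the paragraph above stresses that memory is a key resource, the brief sketch below (the input path is a placeholder) shows the mechanism usually meant by this: an intermediate result is explicitly cached in memory so that repeated computations do not have to re-read the data from disk.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("caching-sketch")
      .master("local[*]")      // placeholder master URL
      .getOrCreate()

    // Placeholder input: any Hadoop-compatible path would do.
    val events = spark.read.textFile("hdfs://namenode:9000/data/events.txt")

    // Keep the data in executor memory (spilling to disk only if it does
    // not fit), so that repeated actions reuse the cached copy.
    val cached = events.persist(StorageLevel.MEMORY_AND_DISK)

    // Both actions below reuse the in-memory copy instead of re-reading
    // the file from the distributed file system.
    println(s"Total lines : ${cached.count()}")
    println(s"Error lines : ${cached.filter(_.contains("ERROR")).count()}")

    spark.stop()
  }
}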