Aerospike Connect for Spark

Aerospike Connect for Spark provides APIs that leverage Spark Structured Streaming and Spark SQL. These APIs deliver very low latency for both reads and writes, enabling AI/ML use cases to run against data stored in Aerospike while new data is simultaneously ingested at high rates.

Vendor

Aerospike

Company Website

Product details

Aerospike Connect for Spark

Aerospike Connect for Spark loads data from Aerospike into a Spark streaming or batch DataFrame in a massively parallel manner, so that you can process the data with Spark APIs in multiple languages, such as Java, Scala, and Python. The connector can easily be added as a JAR to a Spark application, which can then be deployed in a Kubernetes or non-Kubernetes environment, in the cloud or on bare metal.
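As a quick illustration, the sketch below loads an Aerospike set into a PySpark DataFrame. The JAR path, host, namespace, set, and column names are placeholders, and the option keys (aerospike.seedhost, aerospike.namespace, aerospike.set) follow the connector's commonly documented configuration names; verify them against the connector version you deploy.

```python
from pyspark.sql import SparkSession

# Start a session with the connector JAR on the classpath
# (the JAR path below is a placeholder).
spark = (
    SparkSession.builder
    .appName("aerospike-read-example")
    .config("spark.jars", "/path/to/aerospike-spark-connector.jar")
    .getOrCreate()
)

# Read an Aerospike set into a DataFrame.
df = (
    spark.read
    .format("aerospike")
    .option("aerospike.seedhost", "127.0.0.1:3000")   # cluster seed node
    .option("aerospike.namespace", "test")            # Aerospike namespace
    .option("aerospike.set", "users")                 # Aerospike set
    .load()
)

# Qualifying filters can be pushed down to the database (see Features below);
# "age" is an illustrative bin name.
df.where("age > 30").show()
```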

Aerospike is a schemaless NoSQL database with a rich data model that differs somewhat from Spark SQL's relational model. An Aerospike set, unlike a Spark DataFrame, is not required to have the same bins (the equivalent of DataFrame columns) across all of its records. The Spark connector reconciles those differences and offers the familiar SQL experience in both DataFrame and Spark SQL syntaxes, while preserving the benefits of the Aerospike database, including strong consistency and its hybrid-memory architecture.
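To make that reconciliation concrete, here is a hedged sketch of supplying your own schema on a read, continuing from the session above. The bin names and types are illustrative, and the __key metadata column name is an assumption based on the connector's conventions; with a user-supplied schema, a record that lacks a given bin would simply surface that column as null.

```python
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Illustrative schema for a set whose records may not all share the same bins.
user_schema = StructType([
    StructField("__key", StringType(), nullable=False),  # record key (assumed meta column)
    StructField("name",  StringType(), nullable=True),
    StructField("age",   LongType(),   nullable=True),   # missing bin -> null
])

df = (
    spark.read
    .format("aerospike")
    .schema(user_schema)                                 # skip inference, use this schema
    .option("aerospike.seedhost", "127.0.0.1:3000")
    .option("aerospike.namespace", "test")
    .option("aerospike.set", "users")
    .load()
)
```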

The Spark connector allows your applications to parallelize work on a massive scale by leveraging up to 32,768 Spark partitions (configured by setting aerospike.partition.factor) to read data from an Aerospike namespace, which can store up to 32 billion records per node across 4,096 Aerospike partitions. For example, if you had a 256-node Aerospike cluster, which is the maximum allowed, you would be able to store a whopping 140 trillion records.
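The partition factor is set as a read option. In the sketch below it is treated as a power-of-two exponent (2**15 = 32,768 Spark partitions), which is an assumption consistent with the figure above rather than a documented guarantee; confirm the exact semantics for your connector version.

```python
# Widen read parallelism; the factor is assumed here to be a power-of-two
# exponent, so 15 would yield 2**15 = 32,768 Spark partitions.
df = (
    spark.read
    .format("aerospike")
    .option("aerospike.seedhost", "127.0.0.1:3000")
    .option("aerospike.namespace", "test")
    .option("aerospike.set", "events")
    .option("aerospike.partition.factor", "15")
    .load()
)

print(df.rdd.getNumPartitions())
```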

The Spark connector scans Aerospike database partitions in parallel; combined with Spark's distributed workers, this lets Spark rapidly process large volumes of Aerospike data and can significantly reduce the execution time of a Spark job. For example, you can use Spark APIs in your ETL applications to read from and write to Aerospike through the connector. When such an ETL application is submitted as a Spark job, Spark distributes it across your cluster's worker nodes, which read from and write to Aerospike in a massively parallel manner.
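A minimal ETL sketch along those lines: read from one set, transform, and write the result back to Aerospike. The write option aerospike.updateByKey, which names the DataFrame column to use as the record key, is based on the connector's commonly documented write configuration; treat it, along with the column and set names, as placeholders to verify.

```python
from pyspark.sql import functions as F

# Extract: read the source set (options as in the earlier read example).
src = (
    spark.read
    .format("aerospike")
    .option("aerospike.seedhost", "127.0.0.1:3000")
    .option("aerospike.namespace", "test")
    .option("aerospike.set", "events")
    .load()
)

# Transform: aggregate events per user.
agg = src.groupBy("user_id").agg(F.count("*").alias("event_count"))

# Load: write the aggregates back to another Aerospike set.
(
    agg.write
    .format("aerospike")
    .mode("append")
    .option("aerospike.seedhost", "127.0.0.1:3000")
    .option("aerospike.namespace", "test")
    .option("aerospike.set", "user_event_counts")
    .option("aerospike.updateByKey", "user_id")   # column used as the record key
    .save()
)
```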

Features

  • Enables Python users to use PySpark to load Aerospike data into a Spark DataFrame and bring to bear various open-source Python libraries, such as Pandas and NumPy; AI/ML frameworks, such as Scikit-learn, PyTorch, TensorFlow, SparkML; and visualization libraries, such as Matplotlib.
  • Enables users to write query results, predictions, enriched and transformed data, and more from DataFrames to Aerospike.
  • Enables streaming use cases that need millisecond latency, such as fraud detection, using Spark Structured Streaming read and write APIs (see the streaming sketch after this list).
  • Supports Jupyter Notebook to help you quickly progress from your POC to production.
  • Supports Aerospike’s rich data model, which includes collection data types such as maps and lists, allowing you to write compelling Spark applications.
  • Supports Aerospike’s secondary indexes.
  • Supports aerolookup for Python, Scala, and Java for looking up records in the database corresponding to a set of keys in a Spark DataFrame. aerolookup returns faster results than using inner joins in Spark for the same purpose.
  • Enables secure connections between Aerospike and Spark clusters with TLS and LDAP.
  • Supports schema inference to allow you to query data stored in Aerospike when the schema is not known a priori. You can also provide the schema, if it’s available. The connector also supports flexible schemas.
  • Allows you to push down Aerospike Expressions to the database directly from Spark APIs, limiting data movement between the Aerospike and Spark clusters and consequently accelerating queries.
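As referenced in the streaming bullet above, here is a hedged sketch of a Structured Streaming pipeline that lands micro-batches in Aerospike. It reuses the connector's batch write path inside foreachBatch rather than any connector-specific streaming sink, so the only assumptions beyond the earlier write options are the illustrative Kafka source, topic, and column names.

```python
# Read a stream (Kafka here is just an illustrative source).
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "transactions")
    .load()
    .selectExpr("CAST(key AS STRING) AS txn_id",
                "CAST(value AS STRING) AS payload")
)

# Write each micro-batch to Aerospike via the connector's batch writer.
def write_to_aerospike(batch_df, batch_id):
    (
        batch_df.write
        .format("aerospike")
        .mode("append")
        .option("aerospike.seedhost", "127.0.0.1:3000")
        .option("aerospike.namespace", "test")
        .option("aerospike.set", "transactions")
        .option("aerospike.updateByKey", "txn_id")   # column used as the record key
        .save()
    )

query = (
    events.writeStream
    .foreachBatch(write_to_aerospike)
    .option("checkpointLocation", "/tmp/aerospike-stream-checkpoint")
    .start()
)
# query.awaitTermination()
```

foreachBatch is used here because it exercises exactly the same write path as the batch examples; the connector also exposes native Structured Streaming read and write APIs, as noted in the feature list.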