Interface class to represent a Spark Resilient Distributed Dataset (RDD)
A Resilient Distributed Dataset, or RDD, is a programming abstraction in Spark™. It represents a collection of elements distributed across many nodes that can be operated on in parallel. All work in Spark is expressed either as creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. You can create RDDs in two ways:
By loading an external dataset
By parallelizing a collection of objects in the driver program
Once an RDD is created, you can perform two types of operations on it: transformations and actions.
An RDD object can only be created using the methods of the SparkContext class. A collection of SparkContext methods used to create RDDs is listed below for convenience. See the SparkContext class documentation for more information.
| SparkContext Method Name | Purpose |
| --- | --- |
| `parallelize` | Create an RDD from local MATLAB® values |
| `textFile` | Create an RDD from a text file |
Once an RDD has been created using a method from the SparkContext class, you can use any of the methods in the RDD class to manipulate your RDD.
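As a minimal sketch of both creation paths, the following assumes MATLAB Compiler SDK with the `matlab.compiler.mlspark` API; the application name, master URL, file name, and values are hypothetical choices for illustration:

```matlab
% Sketch: build a SparkContext, then create RDDs both ways.
% 'rddExample', 'local[1]', and 'input.txt' are hypothetical values.
conf = matlab.compiler.mlspark.SparkConf( ...
    'AppName', 'rddExample', ...
    'Master', 'local[1]');
sc = matlab.compiler.mlspark.SparkContext(conf);

rddFromFile = sc.textFile('input.txt');    % load an external dataset
rddFromVals = sc.parallelize({10,20,30});  % parallelize driver-program values
```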
The properties of this class are hidden.
Transformations

| Method Name | Purpose |
| --- | --- |
| `aggregateByKey` | Aggregate the values of each key, using given combine functions and a neutral “zero value” |
| `cartesian` | Create an RDD that is the Cartesian product of two RDDs |
| `coalesce` | Reduce the number of partitions in an RDD |
| `cogroup` | Group data from RDDs sharing the same key |
| `combineByKey` | Combine the elements for each key using a custom set of aggregation functions |
| `distinct` | Return a new RDD containing the distinct elements of an existing RDD |
| `filter` | Return a new RDD containing only the elements that satisfy a predicate function |
| `flatMap` | Return a new RDD by first applying a function to all elements of an existing RDD, and then flattening the results |
| `flatMapValues` | Pass each value in the key-value pair RDD through a flatMap function without changing the keys |
| `foldByKey` | Merge the values for each key using an associative function and a neutral “zero value” |
| `fullOuterJoin` | Perform a full outer join between two key-value pair RDDs |
| `glom` | Coalesce all elements within each partition of an RDD |
| `groupBy` | Return an RDD of grouped items |
| `groupByKey` | Group the values for each key in the RDD into a single sequence |
| `intersection` | Return the set intersection of one RDD with another |
| `join` | Return an RDD containing all pairs of elements with matching keys |
| `keyBy` | Create tuples of the elements in an RDD by applying a function |
| `keys` | Return an RDD with the keys of each tuple |
| `leftOuterJoin` | Perform a left outer join between two key-value pair RDDs |
| `map` | Return a new RDD by applying a function to each element of an input RDD |
| `mapValues` | Pass each value in a key-value pair RDD through a map function without modifying the keys |
| `reduceByKey` | Merge the values for each key using an associative reduce function |
| `repartition` | Return a new RDD that has exactly the specified number of partitions |
| `rightOuterJoin` | Perform a right outer join between two key-value pair RDDs |
| `sortBy` | Sort an RDD by a given function |
| `sortByKey` | Sort an RDD consisting of key-value pairs by key |
| `subtract` | Return the values resulting from the set difference between two RDDs |
| `subtractByKey` | Return key-value pairs resulting from the set difference of keys between two RDDs |
| `union` | Return the set union of one RDD with another |
| `values` | Return an RDD with the values of each tuple |
| `zip` | Zip one RDD with another |
| `zipWithIndex` | Zip an RDD with its element indices |
| `zipWithUniqueId` | Zip an RDD with generated unique Long IDs |
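Transformations chain together, each returning a new RDD without triggering computation by itself. A brief sketch, assuming a SparkContext `sc` already exists and using illustrative values:

```matlab
% Assumes 'sc' is an existing matlab.compiler.mlspark.SparkContext.
nums = sc.parallelize({1,2,3,4,5});     % hypothetical input values
doubled = nums.map(@(x) x * 2);         % transformation: returns a new RDD
bigOnly = doubled.filter(@(x) x > 4);   % transformation: another new RDD
```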
Actions

| Method Name | Purpose |
| --- | --- |
| `aggregate` | Aggregate the elements of each partition and subsequently the results for all partitions into a single value |
| `collect` | Return a MATLAB cell array that contains all of the elements in an RDD |
| `collectAsMap` | Return the key-value pairs in an RDD as a MATLAB containers.Map object |
| `count` | Count the number of elements in an RDD |
| `fold` | Aggregate the elements of each partition and the subsequent results for all partitions |
| `reduce` | Reduce the elements of an RDD using the specified commutative and associative function |
| `reduceByKeyLocally` | Merge the values for each key using an associative reduce function, but return the results immediately to the driver |
| `saveAsKeyValueDatastore` | Save a key-value RDD as a binary file that can be read back using the datastore function |
| `saveAsTallDatastore` | Save an RDD as a MATLAB tall array to a binary file that can be read back using the datastore function |
| `saveAsTextFile` | Save an RDD as a text file |
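Unlike transformations, actions return results to the driver program. A sketch, assuming a SparkContext `sc` already exists and using hypothetical values:

```matlab
% Assumes 'sc' is an existing matlab.compiler.mlspark.SparkContext.
nums = sc.parallelize({1,2,3,4});
n = nums.count();                   % action: returns the element count
vals = nums.collect();              % action: returns elements as a cell array
total = nums.reduce(@(a,b) a + b);  % action: aggregates to a single value
```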
Other Methods

| Method Name | Purpose |
| --- | --- |
| `cache` | Store an RDD in memory |
| `checkpoint` | Mark an RDD for checkpointing |
| `getCheckpointFile` | Get the name of the file to which an RDD is checkpointed |
| `getDefaultReducePartitions` | Get the number of default reduce partitions in an RDD |
| `getNumPartitions` | Return the number of partitions in an RDD |
| `isEmpty` | Determine if an RDD contains any elements |
| `keyLimit` | Return the threshold of unique keys that can be stored before spilling to disk |
| `persist` | Set the value of an RDD’s storage level to persist across operations after it is computed |
| `toDebugString` | Obtain a description of an RDD and its recursive dependencies for debugging |
| `unpersist` | Mark an RDD as nonpersistent, and remove all blocks for it from memory and disk |
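These methods are commonly used to control storage and inspect an RDD. A sketch, assuming a SparkContext `sc` already exists; the file path is a hypothetical placeholder:

```matlab
% Assumes 'sc' is an existing SparkContext; 'data.txt' is a hypothetical file.
lines = sc.textFile('data.txt');
lines = lines.cache();          % keep the RDD in memory across actions
p = lines.getNumPartitions();   % inspect how the data is partitioned
disp(lines.toDebugString());    % print the RDD's lineage for debugging
```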
Resilient Distributed Dataset
A Resilient Distributed Dataset, or RDD, is a programming abstraction in Spark. It represents a collection of elements distributed across many nodes that can be operated on in parallel. RDDs are fault-tolerant: if a partition is lost, it can be recomputed from the sequence of transformations that produced it. You can create RDDs in two ways:
By loading an external dataset.
By parallelizing a collection of objects in the driver program.
After creation, you can perform two types of operations using RDDs: transformations and actions.
Transformations are operations on an existing RDD that return a new RDD. Many, but not all, transformations are element-wise operations.
Actions compute a final result based on an RDD and either return that result to the driver program or save it to an external storage system such as HDFS™.
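The distinction matters because transformations are evaluated lazily; no work happens until an action is called. A sketch, assuming a SparkContext `sc` already exists and using a hypothetical log file path:

```matlab
% Assumes 'sc' is an existing SparkContext; the log path is hypothetical.
lines = sc.textFile('hdfs:///logs/app.log');
errs = lines.filter(@(s) contains(s, 'ERROR'));  % transformation: lazy, no work yet
n = errs.count();   % action: triggers the computation and returns a result
```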
See the latest Spark documentation for more information.
Introduced in R2016b