Data is the new gold

Spark data abstractions

In Spark, RDDs, DataFrames, and Datasets are three abstractions for working with distributed data. An RDD (Resilient Distributed Dataset) is Spark's fundamental data structure. A DataFrame is a distributed collection of data organized into named columns. A Dataset extends the DataFrame API with a type-safe, object-oriented programming interface.

July 10, 2023
Tuning Spark code

Why do some teams run lean Spark workloads while others doing similar work use 2-3x more compute and wait 5-10x longer for their jobs to finish? There isn’t any rocket science behind this. Follow these basic steps consistently. Minimize data shuffling: You can achieve this by partitioning data on the columns that are frequently used for joins. Avoid…
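To see why partitioning on the join column reduces shuffling, here is a toy model in plain Python. It is only a sketch of the idea and does not use the Spark API. The tables, the partition count, and the helper names (`partition_of`, `partition_by_key`, `local_join`) are all hypothetical.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical number of partitions (assumption)


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Map an integer key to a partition id (toy stand-in for hash partitioning)."""
    return key % num_partitions


def partition_by_key(rows, num_partitions=NUM_PARTITIONS):
    """Group (key, value) rows into partitions based on the key."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_of(row[0], num_partitions)].append(row)
    return parts


def local_join(left_parts, right_parts, num_partitions=NUM_PARTITIONS):
    """Join two co-partitioned tables one partition at a time.

    Both sides used the same key and the same partition function, so
    matching keys always sit in the same partition id and no row has
    to move between partitions during the join.
    """
    joined = []
    for p in range(num_partitions):
        # Build a lookup over the right side of this partition only.
        lookup = defaultdict(list)
        for key, value in right_parts.get(p, []):
            lookup[key].append(value)
        # Probe it with the left side of the same partition.
        for key, value in left_parts.get(p, []):
            for other in lookup.get(key, []):
                joined.append((key, value, other))
    return joined


# Hypothetical sample data: (customer_id, item) and (customer_id, name).
orders = [(1, "book"), (2, "pen"), (5, "lamp"), (6, "desk")]
customers = [(1, "Ana"), (2, "Raj"), (5, "Li"), (7, "Sam")]

orders_parts = partition_by_key(orders)
customers_parts = partition_by_key(customers)
result = sorted(local_join(orders_parts, customers_parts))
```

If the two tables were partitioned on different columns, rows with the same `customer_id` could land in different partitions, and one side would have to be redistributed before joining. That redistribution is the shuffle this tip tries to avoid.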

July 9, 2023
