Data is the new gold

  • Spark data abstractions

    In Spark, RDDs, DataFrames, and Datasets are three different abstractions for working with distributed data. The RDD (Resilient Distributed Dataset) is Spark's fundamental data structure. A DataFrame is a distributed collection of data organized into named columns. The Dataset API extends the DataFrame API with a type-safe, object-oriented programming interface.
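
    To make the difference in structure concrete, here is a minimal plain-Python analogy. It is not Spark code: the names `rdd_like`, `df_like`, `ds_like`, and `Person` are made up for illustration, and ordinary lists stand in for distributed collections. It only shows how much the engine "knows" about each record at the three levels. Note that Spark's typed Dataset API is offered in Scala and Java, and that a Python dataclass gives no compile-time checking, so the last part is only a rough stand-in for type safety.

```python
from dataclasses import dataclass

# RDD-like: a collection of opaque records. Nothing describes the fields,
# so the filter has to reach into each record by position.
rdd_like = [("alice", 34), ("bob", 45)]
adults_rdd = [rec for rec in rdd_like if rec[1] > 40]

# DataFrame-like: the same data organized into named columns.
# The filter can now refer to a column by name.
columns = ("name", "age")
df_like = [dict(zip(columns, rec)) for rec in rdd_like]
adults_df = [row for row in df_like if row["age"] > 40]


# Dataset-like: each record is an object of a declared type,
# so fields are accessed as typed attributes.
@dataclass(frozen=True)
class Person:
    name: str
    age: int


ds_like = [Person(*rec) for rec in rdd_like]
adults_ds = [person for person in ds_like if person.age > 40]
```

    All three filters select the same record; what changes is whether the fields are addressed by position, by column name, or as attributes of a declared type.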

  • Tuning Spark code

    Why do some teams run lean Spark workloads while others doing similar work use 2-3x more compute and see their jobs run 5-10x longer? There isn’t any rocket science behind this; it comes down to following a few basic steps consistently. Minimize data shuffling: you can achieve this by partitioning data on the columns that are frequently used for joins, so that rows with the same join key already sit in the same partition. Avoid…
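
    The point about partitioning on join columns can be illustrated with a small plain-Python simulation. This is a sketch under stated assumptions, not Spark's implementation: `partition_by_key`, `round_robin`, `rows_to_shuffle`, and the `orders` data are hypothetical, and a simple hash-modulo rule stands in for the partitioner. In Spark itself, the corresponding tools would be calls such as `repartition` on the join column or bucketed tables.

```python
from collections import defaultdict

NUM_PARTITIONS = 4


def partition_by_key(rows, key_index, num_partitions=NUM_PARTITIONS):
    """Place each row in the partition chosen by hashing its join key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key_index]) % num_partitions].append(row)
    return partitions


def round_robin(rows, num_partitions=NUM_PARTITIONS):
    """Spread rows evenly across partitions, ignoring the join key."""
    partitions = defaultdict(list)
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


def rows_to_shuffle(partitions, key_index, num_partitions=NUM_PARTITIONS):
    """Count rows that are not yet in the partition their key hashes to.

    These are the rows a key-based join would have to move over the network.
    """
    return sum(
        1
        for pid, rows in partitions.items()
        for row in rows
        if hash(row[key_index]) % num_partitions != pid
    )


# Hypothetical data: (customer_id, order_label), joined on customer_id.
customer_ids = [7, 3, 7, 1, 3, 2, 6, 5]
orders = [(cid, f"order-{n}") for n, cid in enumerate(customer_ids)]

moved_if_prepartitioned = rows_to_shuffle(partition_by_key(orders, 0), 0)
moved_if_round_robin = rows_to_shuffle(round_robin(orders), 0)
```

    When the data is already partitioned on the join key, no rows need to move; when it is laid out without regard to the key, most of them do. That movement is the shuffle cost the step above is trying to avoid.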