Scala with Apache Spark: Complete Introduction
Apache Spark is the dominant framework for large-scale data processing, and Scala is its native language. Spark itself is written in Scala, and while Python (PySpark) is popular, Scala gives you full access to Spark's API, better performance, and the ability to contribute to Spark itself. This module gets you up and running with Spark in Scala.
Why Scala for Spark?
Scala has unique advantages for Spark development:
- Native API — Spark's Scala API is the source of truth. Python and Java APIs are wrappers
- Type safety — The Dataset API gives you compile-time type checking that PySpark can't offer
- Performance — No Python serialization overhead; Spark jobs run on the JVM
- Expressiveness — Functional programming style maps naturally to Spark transformations
- Industry standard — Most large Spark installations (banks, fintechs, FAANG) use Scala
Setting Up a Spark Project with SBT
Create a new SBT project:
my-spark-app/
├── build.sbt
├── project/
│ └── build.properties
└── src/
└── main/
└── scala/
└── Main.scala

build.sbt:
name := "my-spark-app"
version := "0.1.0"
scalaVersion := "2.12.18"
val sparkVersion = "3.5.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
)
// To make `sbt run` work while keeping "provided" scope, use the standard
// sbt recipe (Compile, not Runtime: the Runtime classpath excludes provided):
// Compile / run := Defaults.runTask(Compile / fullClasspath, Compile / run / mainClass, Compile / run / runner).evaluated

project/build.properties:
sbt.version=1.9.7

The provided scope means Spark won't be bundled into your JAR — the cluster already has it. For local testing, remove % "provided" or use spark-submit.
SparkSession: The Entry Point
SparkSession is the unified entry point for Spark 2.x and later:
import org.apache.spark.sql.SparkSession
object Main extends App {
val spark = SparkSession.builder()
.appName("MySparkApp")
.master("local[*]") // local mode — use all CPU cores
.getOrCreate()
spark.sparkContext.setLogLevel("WARN")
println(s"Spark version: ${spark.version}")
// Your Spark code goes here
spark.stop()
}

- local[*] — run locally using all available cores
- local[2] — run locally with 2 cores
- In production, the master URL is set by spark-submit, not in code
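Since production jobs get their master URL from spark-submit, a common pattern is to make the local fallback explicit and keep it out of the main logic. A minimal sketch, assuming a hypothetical SPARK_MASTER environment variable (not a standard Spark variable, just an illustration of the pattern):

```scala
// Sketch: resolve the master URL outside the Spark builder so the same
// build runs locally and on a cluster. SPARK_MASTER is a hypothetical
// env var for this example; on a real cluster you would simply omit
// .master() and let spark-submit supply it.
object MasterConfig {
  def resolveMaster(env: Map[String, String]): String =
    env.getOrElse("SPARK_MASTER", "local[*]") // dev fallback: all local cores
}
```

Usage in the builder would then be .master(MasterConfig.resolveMaster(sys.env)), keeping the development default in one place.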
SparkContext vs SparkSession
- SparkContext (spark.sparkContext) — the original entry point (Spark 1.x), used for RDD operations
- SparkSession — the modern entry point (Spark 2.x+), wraps SparkContext and adds SQL/DataFrame support
Access SparkContext from SparkSession when needed:
val sc = spark.sparkContext

Your First Spark Job: Word Count
The classic Spark example:
import org.apache.spark.sql.SparkSession
object WordCount extends App {
val spark = SparkSession.builder()
.appName("WordCount")
.master("local[*]")
.getOrCreate()
spark.sparkContext.setLogLevel("WARN")
val lines = spark.sparkContext.parallelize(Seq(
"apache spark is a fast engine",
"spark uses scala as its native language",
"scala is functional and object oriented"
))
val wordCounts = lines
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
.sortBy(_._2, ascending = false)
wordCounts.collect().foreach { case (word, count) =>
println(s"$word: $count")
}
spark.stop()
}

Output:
scala: 2
spark: 2
is: 2
a: 1
fast: 1
...

Reading and Writing Data
Spark can read from files, databases, and streaming sources:
// Read CSV
val df = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv("/path/to/data.csv")
df.show(5)
df.printSchema()
// Read Parquet (the preferred format for big data)
val parquet = spark.read.parquet("/path/to/data.parquet")
// Write to Parquet
df.write
.mode("overwrite")
.parquet("/path/to/output")
// Write partitioned Parquet
df.write
.partitionBy("year", "month")
.mode("overwrite")
.parquet("/path/to/partitioned-output")

Running with spark-submit
To run on a cluster or locally with the full Spark runtime:
spark-submit --class Main --master local[*] target/scala-2.12/my-spark-app_2.12-0.1.0.jar

Build the JAR with sbt package first.
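On a real cluster the same command grows resource flags. A sketch with illustrative values (cluster name, executor counts, and memory sizes are placeholders to tune for your environment):

```shell
# Build the application JAR first
sbt package

# Submit to YARN in cluster mode with explicit resources (illustrative values)
spark-submit \
  --class Main \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  target/scala-2.12/my-spark-app_2.12-0.1.0.jar
```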
Spark's Execution Model
Understanding how Spark executes is important for performance:
- Driver — the JVM process running your Main object; coordinates the job
- Executor — JVM processes on worker nodes that run tasks
- RDD/DataFrame — distributed collections partitioned across executors
- Transformation — lazy operation that builds a computation plan (map, filter, select)
- Action — triggers actual computation (collect, count, show, write)
Spark is lazy — nothing runs until you call an action. This lets Spark optimize the full computation plan before executing.
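The transformation/action split mirrors Scala's own lazy collections, so you can see the same idea in pure Scala without a cluster. In this sketch, a view plays the role of an RDD: map and filter only describe work, and nothing evaluates until toList (the "action") forces the pipeline.

```scala
// Pure-Scala analogy for Spark's lazy evaluation (no Spark required).
object LazyDemo {
  def demo(): (Int, Int, List[Int]) = {
    var evaluated = 0
    val plan = (1 to 5).view                  // like an RDD: a description of work
      .map { x => evaluated += 1; x * 2 }     // "transformation": not run yet
      .filter(_ > 4)                          // another lazy transformation
    val beforeAction = evaluated              // still 0: no element touched
    val result = plan.toList                  // "action": forces the whole pipeline
    (beforeAction, evaluated, result)
  }
}
```

Running demo() shows beforeAction is 0 even though two transformations were declared; only the terminal toList makes the five map calls happen. Spark applies the same principle, with the added benefit that it can optimize the whole plan before execution.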
Frequently Asked Questions
Q: Should I use PySpark or Scala for Spark? If you're a data scientist focused on analysis and ML, PySpark is fine — the ecosystem (pandas, scikit-learn, notebooks) fits that workflow. If you're a data engineer building pipelines, ETL jobs, or Spark infrastructure, Scala is the better choice: you get the full Dataset API with compile-time type safety, better performance, and you can read Spark's source code when debugging. Most production Spark jobs at large companies are written in Scala.
Q: What is the difference between local[*] and cluster mode?
local[*] runs Spark entirely in a single JVM on your machine using all cores — there's no cluster involved. This is for development and testing. Cluster mode (yarn, k8s, spark://host:7077) distributes the job across multiple machines. You switch between them by changing the .master() setting or the --master argument to spark-submit. Your application code doesn't change.
Q: Why does the Spark JAR use "provided" scope for Spark dependencies?
When you run on a cluster, Spark is already installed on the cluster nodes — you don't want to bundle it into your application JAR (it would be hundreds of MBs). The provided scope tells SBT that the dependency will be provided at runtime by the environment. For local development, you can either remove % "provided" or use sbt run with special configuration to include provided dependencies on the classpath.
Part of Scala Mastery Course — Module 15 of 22.
