Scala with Apache Spark: Complete Introduction

Apache Spark is the dominant framework for large-scale data processing, and Scala is its native language. Spark itself is written in Scala, and while Python (PySpark) is popular, Scala gives you full access to Spark's API, better performance, and the ability to contribute to Spark itself. This module gets you up and running with Spark in Scala.

Why Scala for Spark?

Scala has unique advantages for Spark development:

  • Native API — Spark's Scala API is the source of truth; the Python API is a wrapper over it, and new features land in Scala first
  • Type safety — The Dataset API gives you compile-time type checking that PySpark can't offer
  • Performance — No Python serialization overhead; Spark jobs run on the JVM
  • Expressiveness — Functional programming style maps naturally to Spark transformations
  • Industry standard — Most large Spark installations (banks, fintechs, FAANG) use Scala
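
To make the type-safety point concrete, here is a minimal sketch of the Dataset API. The `User` case class is a hypothetical domain type introduced for illustration; the rest uses the standard `SparkSession` and `spark.implicits._` machinery:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical domain type, used only for this example
case class User(name: String, age: Int)

object DatasetExample extends App {
  val spark = SparkSession.builder()
    .appName("DatasetExample")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._   // brings in .toDS() and Encoders for case classes

  val users: Dataset[User] = Seq(User("Ada", 36), User("Grace", 45), User("Kid", 10)).toDS()

  // Typed transformation: user.age is an Int, and a typo like user.agee
  // fails at compile time — a DataFrame (or PySpark) would fail at runtime.
  val adults: Dataset[User] = users.filter(user => user.age >= 18)

  adults.show()
  spark.stop()
}
```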

Setting Up a Spark Project with SBT

Create a new SBT project:

text
my-spark-app/
├── build.sbt
├── project/
│   └── build.properties
└── src/
    └── main/
        └── scala/
            └── Main.scala

build.sbt:

scala
name := "my-spark-app"
version := "0.1.0"
scalaVersion := "2.12.18"

val sparkVersion = "3.5.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
)

// To run locally while keeping the "provided" scope, put those deps back on the run classpath:
// Compile / run := Defaults.runTask(Compile / fullClasspath, Compile / run / mainClass, Compile / run / runner).evaluated

project/build.properties:

text
sbt.version=1.9.7

The provided scope means Spark won't be bundled into your JAR — the cluster already has it. For local testing, remove % "provided" or use spark-submit.

SparkSession: The Entry Point

SparkSession is the unified entry point for Spark 2.x and later:

scala
import org.apache.spark.sql.SparkSession

object Main extends App {

  val spark = SparkSession.builder()
    .appName("MySparkApp")
    .master("local[*]")   // local mode — use all CPU cores
    .getOrCreate()

  spark.sparkContext.setLogLevel("WARN")

  println(s"Spark version: ${spark.version}")

  // Your Spark code goes here

  spark.stop()
}

  • local[*] — run locally using all available cores
  • local[2] — run locally with 2 cores
  • In production, the master URL is set by spark-submit, not in code

SparkContext vs SparkSession

  • SparkContext (spark.sparkContext) — the original entry point (Spark 1.x), used for RDD operations
  • SparkSession — the modern entry point (Spark 2.x+), wraps SparkContext and adds SQL/DataFrame support

Access SparkContext from SparkSession when needed:

scala
val sc = spark.sparkContext

Your First Spark Job: Word Count

The classic Spark example:

scala
import org.apache.spark.sql.SparkSession

object WordCount extends App {

  val spark = SparkSession.builder()
    .appName("WordCount")
    .master("local[*]")
    .getOrCreate()

  spark.sparkContext.setLogLevel("WARN")

  val lines = spark.sparkContext.parallelize(Seq(
    "apache spark is a fast engine",
    "spark uses scala as its native language",
    "scala is functional and object oriented"
  ))

  val wordCounts = lines
    .flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .sortBy(_._2, ascending = false)

  wordCounts.collect().foreach { case (word, count) =>
    println(s"$word: $count")
  }

  spark.stop()
}

Output:

text
scala: 2
spark: 2
is: 2
a: 1
fast: 1
...
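
The same count can also be expressed with the DataFrame API instead of raw RDDs; here is a sketch using the standard `split`, `explode`, and `desc` functions from `org.apache.spark.sql.functions`:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{desc, explode, split}

object WordCountDF extends App {
  val spark = SparkSession.builder()
    .appName("WordCountDF")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  val lines = Seq(
    "apache spark is a fast engine",
    "spark uses scala as its native language",
    "scala is functional and object oriented"
  ).toDF("line")

  // split each line into words, explode into one row per word, group, count
  val wordCounts = lines
    .select(explode(split($"line", " ")).as("word"))
    .groupBy("word")
    .count()
    .orderBy(desc("count"))

  wordCounts.show()
  spark.stop()
}
```

Because this goes through the Catalyst optimizer, the DataFrame version is usually the better default for production jobs.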

Reading and Writing Data

Spark can read from files, databases, and streaming sources:

scala
// Read CSV
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/data.csv")

df.show(5)
df.printSchema()

// Read Parquet (the preferred format for big data)
val parquet = spark.read.parquet("/path/to/data.parquet")

// Write to Parquet
df.write
  .mode("overwrite")
  .parquet("/path/to/output")

// Write partitioned Parquet
df.write
  .partitionBy("year", "month")
  .mode("overwrite")
  .parquet("/path/to/partitioned-output")
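
The same reader interface extends to databases over JDBC. In this sketch the URL, table name, and credentials are placeholders, and the matching JDBC driver JAR (e.g. for PostgreSQL) must be on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object JdbcRead extends App {
  val spark = SparkSession.builder()
    .appName("JdbcRead")
    .master("local[*]")
    .getOrCreate()

  // URL, table, and credentials below are placeholders for illustration
  val users = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")
    .option("dbtable", "users")
    .option("user", "spark")
    .option("password", "secret")
    .load()

  users.show(5)
  spark.stop()
}
```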

Running with spark-submit

To run on a cluster or locally with the full Spark runtime:

bash
spark-submit \
  --class Main \
  --master local[*] \
  target/scala-2.12/my-spark-app_2.12-0.1.0.jar

Build the JAR with sbt package first.

Spark's Execution Model

Understanding how Spark executes is important for performance:

  • Driver — the JVM process running your Main object; coordinates the job
  • Executor — JVM processes on worker nodes that run tasks
  • RDD/DataFrame — distributed collections partitioned across executors
  • Transformation — lazy operation that builds a computation plan (map, filter, select)
  • Action — triggers actual computation (collect, count, show, write)

Spark is lazy — nothing runs until you call an action. This lets Spark optimize the full computation plan before executing.
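
A quick way to observe this laziness is to compare `explain()`, which only prints the physical plan, with an action; a minimal sketch assuming a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession

object LazyDemo extends App {
  val spark = SparkSession.builder()
    .appName("LazyDemo")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  val numbers = spark.range(1, 1000000)

  // Transformation: nothing executes yet, Spark only records the plan
  val evens = numbers.filter($"id" % 2 === 0)

  evens.explain()           // prints the physical plan — still no execution

  val n = evens.count()     // action: the plan actually runs here
  println(s"even count: $n")

  spark.stop()
}
```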

Frequently Asked Questions

Q: Should I use PySpark or Scala for Spark? If you're a data scientist focused on analysis and ML, PySpark is fine — the ecosystem (pandas, scikit-learn, notebooks) fits that workflow. If you're a data engineer building pipelines, ETL jobs, or Spark infrastructure, Scala is the better choice: you get the full Dataset API with compile-time type safety, better performance, and you can read Spark's source code when debugging. Most production Spark jobs at large companies are written in Scala.

Q: What is the difference between local[*] and cluster mode? local[*] runs Spark entirely in a single JVM on your machine using all cores — there's no cluster involved. This is for development and testing. Cluster mode (yarn, k8s, spark://host:7077) distributes the job across multiple machines. You switch between them by changing the .master() setting or the --master argument to spark-submit. Your application code doesn't change.

Q: Why does the Spark JAR use "provided" scope for Spark dependencies? When you run on a cluster, Spark is already installed on the cluster nodes — you don't want to bundle it into your application JAR (it would be hundreds of MBs). The provided scope tells SBT that the dependency will be provided at runtime by the environment. For local development, you can either remove % "provided" or use sbt run with special configuration to include provided dependencies on the classpath.
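
For the sbt run route, one common build.sbt fragment (sbt 1.x slash syntax) puts provided dependencies back on the run classpath; this is a sketch of one approach, not the only one:

```scala
// build.sbt: include "provided" dependencies when launching with `sbt run`
Compile / run := Defaults.runTask(
  Compile / fullClasspath,
  Compile / run / mainClass,
  Compile / run / runner
).evaluated
```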


Part of Scala Mastery Course — Module 15 of 22.