Scala with Apache Spark: Complete Introduction
Apache Spark is the dominant framework for large-scale data processing, and Scala is its native language. Spark itself is written in Scala, and while Python (PySpark) is popular, Scala gives you full access to Spark's API, better performance, and the ability to contribute to Spark itself. This module gets you up and running with Spark in Scala.
Why Scala for Spark?
Scala has unique advantages for Spark development:
- Native API — Spark's Scala API is the source of truth. Python and Java APIs are wrappers
- Type safety — The Dataset API gives you compile-time type checking that PySpark can't offer
- Performance — No Python serialization overhead; Spark jobs run on the JVM
- Expressiveness — Functional programming style maps naturally to Spark transformations
- Industry standard — Most large Spark installations (banks, fintechs, FAANG) use Scala
Setting Up a Spark Project with SBT
Create a new SBT project:
my-spark-app/
├── build.sbt
├── project/
│ └── build.properties
└── src/
└── main/
└── scala/
└── Main.scala

build.sbt:
name := "my-spark-app"
version := "0.1.0"
scalaVersion := "2.12.18"
val sparkVersion = "3.5.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
)
// To make `sbt run` work while keeping "provided" scope, use the standard
// sbt recipe (Compile, not Runtime: the Runtime classpath excludes provided):
// Compile / run := Defaults.runTask(Compile / fullClasspath, Compile / run / mainClass, Compile / run / runner).evaluated

project/build.properties:
sbt.version=1.9.7

The provided scope means Spark won't be bundled into your JAR — the cluster already has it. For local testing, remove % "provided" or use spark-submit.
SparkSession: The Entry Point
SparkSession is the unified entry point for Spark 2.x and later:
import org.apache.spark.sql.SparkSession
object Main extends App {
val spark = SparkSession.builder()
.appName("MySparkApp")
.master("local[*]") // local mode — use all CPU cores
.getOrCreate()
spark.sparkContext.setLogLevel("WARN")
println(s"Spark version: ${spark.version}")
// Your Spark code goes here
spark.stop()
}

- local[*] — run locally using all available cores
- local[2] — run locally with 2 cores
- In production, the master URL is set by spark-submit, not in code
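Since production jobs get their master URL from spark-submit, a common pattern is to make the local fallback explicit and keep it out of the main logic. A minimal sketch, assuming a hypothetical SPARK_MASTER environment variable (not a standard Spark variable, just an illustration of the pattern):

```scala
// Sketch: resolve the master URL outside the Spark builder so the same
// build runs locally and on a cluster. SPARK_MASTER is a hypothetical
// env var for this example; on a real cluster you would simply omit
// .master() and let spark-submit supply it.
object MasterConfig {
  def resolveMaster(env: Map[String, String]): String =
    env.getOrElse("SPARK_MASTER", "local[*]") // dev fallback: all local cores
}
```

Usage in the builder would then be .master(MasterConfig.resolveMaster(sys.env)), keeping the development default in one place.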
SparkContext vs SparkSession
- SparkContext (spark.sparkContext) — the original entry point (Spark 1.x), used for RDD operations
- SparkSession — the modern entry point (Spark 2.x+), wraps SparkContext and adds SQL/DataFrame support
Access SparkContext from SparkSession when needed:
val sc = spark.sparkContext

Your First Spark Job: Word Count
The classic Spark example:
import org.apache.spark.sql.SparkSession
object WordCount extends App {
val spark = SparkSession.builder()
.appName("WordCount")
.master("local[*]")
.getOrCreate()
spark.sparkContext.setLogLevel("WARN")
val lines = spark.sparkContext.parallelize(Seq(
"apache spark is a fast engine",
"spark uses scala as its native language",
"scala is functional and object oriented"
))
val wordCounts = lines
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
.sortBy(_._2, ascending = false)
wordCounts.collect().foreach { case (word, count) =>
println(s"$word: $count")
}
spark.stop()
}

Output:
scala: 2
spark: 2
is: 2
a: 1
fast: 1
...

Reading and Writing Data
Spark can read from files, databases, and streaming sources:
// Read CSV
val df = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv("/path/to/data.csv")
df.show(5)
df.printSchema()
// Read Parquet (the preferred format for big data)
val parquet = spark.read.parquet("/path/to/data.parquet")
// Write to Parquet
df.write
.mode("overwrite")
.parquet("/path/to/output")
// Write partitioned Parquet
df.write
.partitionBy("year", "month")
.mode("overwrite")
.parquet("/path/to/partitioned-output")

Running with spark-submit
To run on a cluster or locally with the full Spark runtime:
spark-submit --class Main --master local[*] target/scala-2.12/my-spark-app_2.12-0.1.0.jar

Build the JAR with sbt package first.
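On a real cluster the same command grows resource flags. A sketch with illustrative values (cluster name, executor counts, and memory sizes are placeholders to tune for your environment):

```shell
# Build the application JAR first
sbt package

# Submit to YARN in cluster mode with explicit resources (illustrative values)
spark-submit \
  --class Main \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  target/scala-2.12/my-spark-app_2.12-0.1.0.jar
```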
Spark's Execution Model
Understanding how Spark executes is important for performance:
- Driver — the JVM process running your Main object; coordinates the job
- Executor — JVM processes on worker nodes that run tasks
- RDD/DataFrame — distributed collections partitioned across executors
- Transformation — lazy operation that builds a computation plan (map, filter, select)
- Action — triggers actual computation (collect, count, show, write)
Spark is lazy — nothing runs until you call an action. This lets Spark optimize the full computation plan before executing.
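The transformation/action split mirrors Scala's own lazy collections, so you can see the same idea in pure Scala without a cluster. In this sketch, a view plays the role of an RDD: map and filter only describe work, and nothing evaluates until toList (the "action") forces the pipeline.

```scala
// Pure-Scala analogy for Spark's lazy evaluation (no Spark required).
object LazyDemo {
  def demo(): (Int, Int, List[Int]) = {
    var evaluated = 0
    val plan = (1 to 5).view                  // like an RDD: a description of work
      .map { x => evaluated += 1; x * 2 }     // "transformation": not run yet
      .filter(_ > 4)                          // another lazy transformation
    val beforeAction = evaluated              // still 0: no element touched
    val result = plan.toList                  // "action": forces the whole pipeline
    (beforeAction, evaluated, result)
  }
}
```

Running demo() shows beforeAction is 0 even though two transformations were declared; only the terminal toList makes the five map calls happen. Spark applies the same principle, with the added benefit that it can optimize the whole plan before execution.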
Frequently Asked Questions
Q: Should I use PySpark or Scala for Spark? If you're a data scientist focused on analysis and ML, PySpark is fine — the ecosystem (pandas, scikit-learn, notebooks) fits that workflow. If you're a data engineer building pipelines, ETL jobs, or Spark infrastructure, Scala is the better choice: you get the full Dataset API with compile-time type safety, better performance, and you can read Spark's source code when debugging. Most production Spark jobs at large companies are written in Scala.
Q: What is the difference between local[*] and cluster mode?
local[*] runs Spark entirely in a single JVM on your machine using all cores — there's no cluster involved. This is for development and testing. Cluster mode (yarn, k8s, spark://host:7077) distributes the job across multiple machines. You switch between them by changing the .master() setting or the --master argument to spark-submit. Your application code doesn't change.
Q: Why does the Spark JAR use "provided" scope for Spark dependencies?
When you run on a cluster, Spark is already installed on the cluster nodes — you don't want to bundle it into your application JAR (it would be hundreds of MBs). The provided scope tells SBT that the dependency will be provided at runtime by the environment. For local development, you can either remove % "provided" or use sbt run with special configuration to include provided dependencies on the classpath.
Part of Scala Mastery Course — Module 15 of 22.
