Spark OOM is firstly a Modeling Problem
Some Qlik data modeling practices are actually underrated fixes for Spark OOM errors.
Underrated fix 1: Data Type
Spark infers schemas using “greedy” data types that are often wider than the data needs. Change to an optimized schema:
- use ShortType instead of IntegerType if values never exceed 32,767.
- use DecimalType only when necessary; otherwise, use FloatType or DoubleType.
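Before joining on StringType ID values, consider hashing them to an IntegerType or LongType key (a lesson from Qlik's Autonumber()). Unlike Autonumber(), a hash can collide, so prefer a 64-bit hash when the number of distinct IDs is large.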
Underrated fix 2: Cardinality
When Spark groups by a column whose values are mostly identical or null, all of those rows are shuffled to the same partition, which can cause an OOM on the executor that processes it.
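Add a salt column, salt = id % N, then add the salt to the group by so the hot key is spread across up to N partitions instead of one. Aggregate the partial results once more by the original key to get the final values. Also see Partition Salting Example.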
Note on joins: if one side is small enough to fit in memory on every executor, just broadcast it.
Underrated fix 3: avoid SELECT *
Enough said.
Last modified on 2026-05-16