Nguyen Pham

Spark OOM is firstly a Modeling Problem

Some Qlik data modeling practices are underrated fixes for Spark OOM errors.

Underrated fix 1: Data Type
Spark infers schemas using “greedy” data types. Change to an optimized schema:
- use ShortType instead of IntegerType if values always fit in the range -32,768 to 32,767.
- use DecimalType only when necessary; otherwise, use FloatType or DoubleType.
- before joining on StringType ID values, consider hashing them to an IntegerType or LongType (lesson from Qlik Autonumber()).

Underrated fix 2: Cardinality
When Spark groups by a column with mostly the same/null values, those rows are sent to the same partition, causing OOM for that executor.
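As a rough illustration of the hashing idea, here is a plain-Python sketch that maps string IDs to signed 64-bit integers, the range a Spark LongType can hold. The helper name `id_to_int64` and the sample IDs are hypothetical, and the choice of BLAKE2b is an assumption made for the example; in a real job a built-in Spark hash function would be the more likely tool. Unlike Qlik Autonumber(), a hash can collide, so this only fits where a very small collision risk is acceptable.

```python
import hashlib


def id_to_int64(value: str) -> int:
    """Map a string ID to a signed 64-bit integer (hypothetical helper)."""
    # 8-byte digest -> 64 bits, interpreted as signed to match a LongType range.
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, byteorder="big", signed=True)


# Sample IDs are made up for illustration.
ids = ["CUST-000123", "CUST-000124", "CUST-000123"]
keys = [id_to_int64(v) for v in ids]

# Equal strings always produce equal keys, so joins on the key still match.
print(keys[0] == keys[2])
```

The same function has to be applied to both sides of the join, otherwise the keys will not line up.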

Why AI is Bad at SQL but Good at DataFrame

Logic                  | DataFrame                      | SQL
-----------------------+--------------------------------+------------------------------
Filter before Agg      | .filter().groupBy().agg()      | (WHERE) AS cte, ... GROUP BY
Filter after Agg       | .groupBy().agg().filter()      | GROUP BY ... HAVING
Filter in Window       | .withColumn().filter()         | OVER() ... QUALIFY
Cross Join             | .crossJoin(df2)                | CROSS JOIN / ,
Exists / SemiJoin      | .join(df2, "id", "left_semi")  | WHERE EXISTS (SELECT 1 FROM)
Not Exists / AntiJoin  | .join(df2, "id", "left_anti")  | WHERE NOT EXISTS / EXCEPT

For DataFrame, code is a stream of tokens in the exact same order as the logic, with the same consistent keywords.
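The "Filter before Agg" versus "Filter after Agg" contrast can be sketched without a Spark cluster. The example below uses the standard-library sqlite3 module for the SQL side and plain Python functions as a stand-in for the DataFrame pipeline; the `orders` table, its columns and the thresholds are all hypothetical. It is only meant to show that SQL switches keyword (WHERE to HAVING) while the pipeline keeps the same filter step and just moves it.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (customer, amount)
rows = [("a", 10), ("a", 200), ("b", 50), ("b", 60), ("c", 500)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

# SQL: the filter changes keyword depending on where it sits in the logic.
sql_before = conn.execute(
    "SELECT customer, SUM(amount) FROM orders WHERE amount >= 50 "
    "GROUP BY customer ORDER BY customer"
).fetchall()
sql_after = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer HAVING SUM(amount) >= 200 ORDER BY customer"
).fetchall()


def total_by_customer(pairs):
    """Group by customer and sum the amounts."""
    totals = defaultdict(int)
    for customer, amount in pairs:
        totals[customer] += amount
    return sorted(totals.items())


# Pipeline style: the same kind of filter step, placed before or after the aggregation.
py_before = total_by_customer(r for r in rows if r[1] >= 50)
py_after = [t for t in total_by_customer(rows) if t[1] >= 200]

print(sql_before == py_before, sql_after == py_after)
```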

DBT Macro is Bad for Business Logic

A macro is good for technical logic (e.g. format, SCD2 from and to dates), but is bad for business logic:
- invisible in data lineage
- a macro cannot be unit tested
- every re-use in a model needs its own redundant unit test
- technically complex for a business user to work on

For business logic re-use, apply a design pattern that puts the logic in a base upstream model instead of a macro.

DBT AI Guardrail

Before letting AI touch your DBT code, put these guardrails in place:
- code integrity: enforce linting (SQLFluff)
- structure integrity: dbt-bouncer
- logic integrity: unit test first
- model integrity: primary_key constraints and enforced contract
- data integrity: automate data diff in CI/CD

Go Rust Python code lazy iteration

Python

    def OneTwo():
        yield "One"
        yield "Two"

    for val in OneTwo():
        print(val)

Go

    // Channel-based generator: a goroutine feeds values on demand.
    func OneTwo() <-chan string {
        ch := make(chan string)
        go func() {
            defer close(ch)
            ch <- "One"
            ch <- "Two"
        }()
        return ch
    }

    // Range-over-func iterator: stop as soon as yield returns false.
    func OneTwoNew(yield func(string) bool) {
        if !yield("One") {
            return
        }
        if !yield("Two") {
            return
        }
    }

    func main() {
        // OneTwo must be called to obtain the channel to range over.
        for val := range OneTwo() {
            fmt.Println(val)
        }
        for val := range OneTwoNew { fmt.

DBT Hourly Refresh Anti-Pattern

When we join a daily heavy table (Fact) with an hourly light table (Status update), we often fall into a costly trap: materializing the result as a table. The moment it is materialized, the data becomes stale. To keep statuses fresh, we are forced to run the entire heavy model hourly, wasting massive compute credits on data that hasn’t actually changed.

The Solution: the Base Table + View pattern breaks the dependency by separating heavy transformations from the “fresh” join.
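A minimal sketch of the Base Table + View idea, using the standard-library sqlite3 module in place of a warehouse. All table, view and column names here are hypothetical. The heavy fact data sits in a base table that is built once, the light status data sits in its own table, and the join lives in a view, so a status update shows up on the next query without rebuilding the base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Heavy daily model: materialized as a table, rebuilt once per day.
    CREATE TABLE fact_orders_base (order_id INTEGER PRIMARY KEY, amount INTEGER);
    INSERT INTO fact_orders_base VALUES (1, 100), (2, 250);

    -- Light hourly source: small and cheap to refresh.
    CREATE TABLE order_status (order_id INTEGER PRIMARY KEY, status TEXT);
    INSERT INTO order_status VALUES (1, 'open'), (2, 'open');

    -- The "fresh" join is a view, evaluated at query time.
    CREATE VIEW fact_orders AS
    SELECT b.order_id, b.amount, s.status
    FROM fact_orders_base AS b
    LEFT JOIN order_status AS s ON s.order_id = b.order_id;
""")

query = "SELECT status FROM fact_orders WHERE order_id = 1"
before = conn.execute(query).fetchone()[0]

# Hourly refresh touches only the light table.
conn.execute("UPDATE order_status SET status = 'shipped' WHERE order_id = 1")
after = conn.execute(query).fetchone()[0]

print(before, "->", after)
```

In dbt terms this would presumably correspond to a base model materialized as a table and a downstream model materialized as a view, though the exact layout depends on the project.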

Online Shop vs Petabytes

Shop: I need to understand how users navigate the website.
Consultant: Best practice is to capture every user action in BigQuery/<insert certification here>.
Consultant: Best practice for BigQuery is append-only so we keep adding user actions into raw table, deduplication/aggregation is for later.
Consultant: Best practice is to partition by day and cluster by user_id/anonymous_id.
Consultant: Best practice is to build materialized views.
Consultant: Best practice is to build data cubes.

Partition Salting Example

Partition Skew Example
Recent partitions 2025-10-11 and 2025-10-12 are 6 times the size of history partition 2001-10-12.

Day Partition   | OrderID | Customer ID
----------------+---------+-------------
2001-10-12      | 01      |
----------------+---------+-------------
2001-10-13      | 02      |
2001-10-13      | 03      |
----------------+---------+-------------
2001-10-14      | 04      |
2001-10-14      | 05      |
----------------+---------+-------------
...             |         |
----------------+---------+-------------
2025-10-10      | 06      |
2025-10-10      | 07      |
2025-10-10      | 08      |
----------------+---------+-------------
2025-10-11      | 09      |
2025-10-11      | 10      |
2025-10-11      | 11      |
2025-10-11      | 12      |
2025-10-11      | 13      |
2025-10-11      | 14      |
----------------+---------+-------------
2025-10-12      | 15      |
2025-10-12      | 16      |
2025-10-12      | 17      |
2025-10-12      | 18      |
2025-10-12      | 19      |
2025-10-12      | 20      |

Partition Salting for BigQuery/Databricks
Use if STRING partition is not supported, Day is DATE datatype.
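The excerpt stops before the salting method itself is described, so the following is only a generic sketch of salting in plain Python, not the BigQuery/Databricks technique the post goes on to explain. It mirrors the day and OrderID values from the example and splits each day into salt buckets so that no salted partition exceeds a target size. `MAX_ROWS` and `salted_key` are hypothetical names, and using `order_id % buckets` as the salt is an assumption that only spreads rows evenly when IDs are roughly sequential.

```python
from collections import Counter

# (day, order_id) rows mirroring the skew example.
rows = (
    [("2001-10-12", 1)]
    + [("2001-10-13", i) for i in (2, 3)]
    + [("2001-10-14", i) for i in (4, 5)]
    + [("2025-10-10", i) for i in (6, 7, 8)]
    + [("2025-10-11", i) for i in range(9, 15)]
    + [("2025-10-12", i) for i in range(15, 21)]
)

MAX_ROWS = 2  # hypothetical target size per salted partition


def salted_key(day, order_id, sizes):
    """Return (day, salt), giving larger days more salt buckets."""
    buckets = -(-sizes[day] // MAX_ROWS)  # ceiling division
    return (day, order_id % buckets)


sizes = Counter(day for day, _ in rows)
salted = Counter(salted_key(day, oid, sizes) for day, oid in rows)

print("largest partition before salting:", max(sizes.values()))
print("largest partition after salting:", max(salted.values()))
```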

Go Rust Python code complexity AI review

For AI code complexity review (for example Checkmarx), code complexity varies among Go, Rust and Python.
- Most code branches: Go, because of idiomatic error handling if err != nil.
- Fewest code branches: Python, thanks to list comprehensions, map, filter, try/except.
- Flexible: Rust. High if using match for error handling and branching. Low if using ? operator and iterator chains map, filter, collect.

High Complexity Go | High Complexity Rust

    // estimated complexity 4
    func classify(values []string) ([]string, error) {
        var result []string
        for _, v := range values {
            num, err := strconv.

Python List Comprehension Performance

List Comprehension is better than Loop

          | List Comprehension          | Loop
----------+-----------------------------+--------------------------------------
Syntax    | [i * i for i in range(10)]  | result = [None]*10
          |                             | for i in range(10):
          |                             |     result[i] = i * i  # append is slower
----------+-----------------------------+--------------------------------------
Bytecode  | LOAD_FAST i                 | LOAD_FAST i
          | LOAD_FAST i                 | LOAD_FAST i
          | BINARY_MULTIPLY             | BINARY_MULTIPLY
          | LIST_APPEND                 | LOAD_FAST result
          |                             | LOAD_FAST i
          |                             | STORE_SUBSCR

Comparison Benchmark and Bytecodes

    import dis
    import timeit

    def comprehension_version() -> list[int]:
        return [i * i for i in range(10)]

    def loop_version() -> list[int]:
        result = [None] * 10
        for i in range(10):
            result[i] = i * i  # append is slower
        return result

    comp_time = timeit.
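The listing above is cut off, so here is a separate self-contained sketch of the same kind of comparison, not a reconstruction of the original script. The helper `bytecode_of` and the iteration count are choices made for this example. Timings vary by machine and Python version, and opcode names differ between CPython releases (for instance, newer versions no longer show BINARY_MULTIPLY), so no timing result is claimed here.

```python
import dis
import io
import timeit


def comprehension_version():
    return [i * i for i in range(10)]


def loop_version():
    result = [None] * 10
    for i in range(10):
        result[i] = i * i
    return result


def bytecode_of(func):
    """Return the disassembly of func (including nested code objects) as text."""
    buf = io.StringIO()
    dis.dis(func, file=buf)
    return buf.getvalue()


# Time both versions over the same number of calls.
comp_time = timeit.timeit(comprehension_version, number=100_000)
loop_time = timeit.timeit(loop_version, number=100_000)
print(f"comprehension: {comp_time:.3f}s, loop: {loop_time:.3f}s")

# The comprehension appends with LIST_APPEND; the loop stores with STORE_SUBSCR.
print("LIST_APPEND" in bytecode_of(comprehension_version))
print("STORE_SUBSCR" in bytecode_of(loop_version))
```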

Data Lake Analytics Lead | DB Architect | Multi-Cloud | R, Python, SQL
