Nguyen Pham

Online Shop vs Petabytes

Shop: I need to understand how users navigate the website.
Consultant: Best practice is to capture every user action in BigQuery/<insert certification here>.
Consultant: Best practice for BigQuery is append-only, so we keep adding user actions to the raw table; deduplication and aggregation come later.
Consultant: Best practice is to partition by day and cluster by user_id/anonymous_id.
Consultant: Best practice is to build materialized views.
Consultant: Best practice is to build data cubes.

Partition Salting Example

Partition Skew Example

Recent partitions 2025-10-11, 2025-10-12 are 6 times the size of history partition 2001-10-12.

Day Partition   | OrderID | Customer ID
----------------+---------+-------------
2001-10-12      | 01      |
----------------+---------+-------------
2001-10-13      | 02      |
2001-10-13      | 03      |
----------------+---------+-------------
2001-10-14      | 04      |
2001-10-14      | 05      |
----------------+---------+-------------
...             |         |
----------------+---------+-------------
2025-10-10      | 06      |
2025-10-10      | 07      |
2025-10-10      | 08      |
----------------+---------+-------------
2025-10-11      | 09      |
2025-10-11      | 10      |
2025-10-11      | 11      |
2025-10-11      | 12      |
2025-10-11      | 13      |
2025-10-11      | 14      |
----------------+---------+-------------
2025-10-12      | 15      |
2025-10-12      | 16      |
2025-10-12      | 17      |
2025-10-12      | 18      |
2025-10-12      | 19      |
2025-10-12      | 20      |

Partition Salting for BigQuery/Databricks

Use if STRING partition is not supported, Day is DATE datatype.
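The excerpt stops before the post's own salting example, so what follows is only a minimal sketch of one common approach, not the author's method. It assumes the day stays a DATE and that an extra integer salt column (used as a secondary key next to the day) splits oversized days into smaller buckets. The names TARGET_ROWS, salt_buckets and salt_for are hypothetical, and the order IDs are taken from the table above.

```python
import math
import zlib
from collections import Counter
from datetime import date

TARGET_ROWS = 2  # hypothetical target number of rows per (day, salt) bucket


def salt_buckets(row_count: int, target_rows: int = TARGET_ROWS) -> int:
    """Number of salt buckets a day needs so each holds about target_rows rows."""
    return max(1, math.ceil(row_count / target_rows))


def salt_for(order_id: str, buckets: int) -> int:
    """Deterministic salt in [0, buckets); crc32 is stable across runs,
    unlike Python's built-in hash() for strings."""
    return zlib.crc32(order_id.encode("utf-8")) % buckets


# One small history day and the two skewed recent days from the table.
orders = (
    [(date(2001, 10, 12), "01")]
    + [(date(2025, 10, 11), f"{n:02d}") for n in range(9, 15)]
    + [(date(2025, 10, 12), f"{n:02d}") for n in range(15, 21)]
)

rows_per_day = Counter(day for day, _ in orders)

# Each row keeps its DATE partition and gains an integer salt column.
salted = [
    (day, salt_for(order_id, salt_buckets(rows_per_day[day])), order_id)
    for day, order_id in orders
]

for day, salt, order_id in salted:
    print(day.isoformat(), salt, order_id)
```

Small days get a single bucket (salt 0), so only the skewed days are split. How evenly a hot day spreads depends on the hash and on the number of rows, which is why the sketch prints the assignment instead of promising a particular distribution.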

Go Rust Python code complexity AI review

For AI code complexity review (for example Checkmarx), code complexity varies among Go, Rust and Python.

Most code branches: Go, because of idiomatic error handling with if err != nil.
Fewest code branches: Python, thanks to list comprehensions, map, filter, try/except.
Flexible: Rust. High if using match for error handling and branching. Low if using the ? operator and iterator chains (map, filter, collect).

High Complexity Go | High Complexity Rust

    // estimated complexity 4
    func classify(values []string) ([]string, error) {
        var result []string
        for _, v := range values {
            num, err := strconv.

Python List Comprehension Performance

List Comprehension is better than Loop

         | List Comprehension         | Loop
---------+----------------------------+-------------------------------------------
Syntax   | [i * i for i in range(10)] | result = [None]*10
         |                            | for i in range(10):
         |                            |     result[i] = i * i  # append is slower
---------+----------------------------+-------------------------------------------
Bytecode | LOAD_FAST i                | LOAD_FAST i
         | LOAD_FAST i                | LOAD_FAST i
         | BINARY_MULTIPLY            | BINARY_MULTIPLY
         | LIST_APPEND                | LOAD_FAST result
         |                            | LOAD_FAST i
         |                            | STORE_SUBSCR

Comparison Benchmark and Bytecodes

    import dis
    import timeit

    def comprehension_version() -> list[int]:
        return [i * i for i in range(10)]

    def loop_version() -> list[int]:
        result = [None] * 10
        for i in range(10):
            result[i] = i * i  # append is slower
        return result

    comp_time = timeit.
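The excerpt cuts off at the timing call, so here is a self-contained sketch of how such a comparison could be run. It is not the post's original benchmark: the bench helper and its repeat and number settings are assumptions chosen for illustration. Exact timings and opcode names depend on the interpreter version; for instance, CPython 3.11 and later print BINARY_OP where older versions print BINARY_MULTIPLY.

```python
import dis
import timeit


def comprehension_version() -> list[int]:
    return [i * i for i in range(10)]


def loop_version() -> list[int]:
    result = [0] * 10  # preallocated, then filled by index
    for i in range(10):
        result[i] = i * i
    return result


def bench(func, number: int = 100_000) -> float:
    """Best of 5 runs, each timing `number` calls (hypothetical helper)."""
    return min(timeit.repeat(func, number=number, repeat=5))


if __name__ == "__main__":
    print(f"comprehension: {bench(comprehension_version):.4f}s")
    print(f"loop:          {bench(loop_version):.4f}s")
    # Disassemble both functions to compare the per-iteration instructions.
    dis.dis(comprehension_version)
    dis.dis(loop_version)
```

Taking the minimum over several repeats reduces noise from other processes, which matters when the two versions differ by only a few instructions per iteration.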

Code Security Python Package

Code Security for Python Package

    # audit
    uv add --dev pip-audit
    pip-audit

    # available versions
    pip index versions urllib3

    # installed version
    uv pip freeze | grep urllib3
    poetry show urllib3

    # why package is installed
    uv tree --invert --package urllib3
    poetry show --tree --why urllib3

    # installs and adds to pyproject.toml
    uv add urllib3~=1.2.3
    poetry add urllib3~=1.2.3

    # requirements.txt if needed
    uv export --no-hashes > requirements.txt
    poetry export --without-hashes --format=requirements.

Data Model SCD Type 2, Vault, Kimball, Inmon

SCD Type 2, 6NF/7NF, Data Vault

SCD Type 2: One big table, all attributes together, duplicates on change.

ProductID | Name    | Category    | Price | ValidFrom | ValidTo
----------+---------+-------------+-------+-----------+--------
456       | iPhone9 | Electronics | 1000  | 2020      | 2021
456       | iPhone9 | Electronics | 900   | 2021      | 2022
456       | iPhone9 | Computers   | 900   | 2022      | NULL

6NF/7NF: One table per attribute.

ProductID | Name    | ValidFrom | ValidTo
----------+---------+-----------+--------
456       | iPhone9 | 2020      | NULL

ProductID | Price | ValidFrom | ValidTo
----------+-------+-----------+--------
456       | 1000  | 2020      | 2021
456       | 900   | 2021      | NULL

ProductID | Category    | ValidFrom | ValidTo
----------+-------------+-----------+--------
456       | Electronics | 2020      | 2022
456       | Computers   | 2022      | NULL

Data Vault: Source-based grouping.

ProductID | BusinessKey
----------+------------
456       | iPhone9

ProductID | Category    | Price | Source           | LoadDate
----------+-------------+-------+------------------+---------
456       | Electronics | 1000  | Dell ERP         | 2020
456       | Electronics | 900   | Distributor Feed | 2021
456       | Electronics | 900   | Dell ERP         | 2021
456       | Computers   | 900   | Dell ERP         | 2022

Inmon vs Kimball

Inmon builds Kimball-like views in Data Mart from core 3NF EDW
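To make the "duplicates on change" behaviour of SCD Type 2 concrete, the following minimal sketch replays the iPhone9 history from the first table. ProductVersion and apply_change are hypothetical names introduced for illustration, and the year-granularity validity columns simply mirror the table; a real warehouse would usually do this with a MERGE statement and timestamps.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ProductVersion:
    product_id: int
    name: str
    category: str
    price: int
    valid_from: int
    valid_to: Optional[int] = None  # None marks the current row (NULL in the table)


def apply_change(history, product_id, year, **changes):
    """Close the current row for product_id and append a full copy with the changes."""
    updated = []
    current = None
    for row in history:
        if row.product_id == product_id and row.valid_to is None:
            current = row
            updated.append(replace(row, valid_to=year))  # close the old version
        else:
            updated.append(row)
    if current is None:
        raise KeyError(f"no current row for product {product_id}")
    # The new version repeats every unchanged attribute: that is the duplication.
    updated.append(replace(current, valid_from=year, valid_to=None, **changes))
    return updated


history = [ProductVersion(456, "iPhone9", "Electronics", 1000, 2020)]
history = apply_change(history, 456, 2021, price=900)
history = apply_change(history, 456, 2022, category="Computers")

for row in history:
    print(row)
```

Each change rewrites the whole row even when a single attribute moved, which is the cost that the one-table-per-attribute layout avoids by versioning Name, Price and Category independently.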

dbt Core, SQLMesh, Fivetran, Spark Declarative Pipelines

Data Wars

The Phantom Menace: dbt Core
Attack of the Clones: SQLMesh
Revenge of the Sith: Fivetran
A New Hope: Spark Declarative Pipelines

Data Engineering Hard Things

There are only two hard things in Data Engineering: schema changes, source of truth, and off-by-one counts. Oh wait, that’s three things.

LIMIT Effectiveness

PostgreSQL: row-store
Snowflake: micro-partition pruning
Iceberg: rich metadata Parquet pruning
Delta Lake: basic metadata Parquet pruning
BigQuery: columnar, MPP

Python Structural Pattern Match

match in Python 3.10+ is much more powerful than the traditional switch/case in other languages. For example, consider merging the overlapping intervals [[1,3],[2,6],[8,10],[10,15]] into [[1,6],[8,15]].

Without match:

    intervals.sort()
    result = []
    for interval in intervals:
        if not result:
            result.append(interval)
        # repeated, unclear index access
        elif result[-1][1] >= interval[0]:
            result[-1][1] = max(
                result[-1][1],
                interval[1]
            )
        else:
            result.append(interval)

With match:

    intervals.sort()
    result = []
    for interval in intervals:
        match (result, interval):
            case ([], _):
                result.

Data Lake Analytics Lead | DB Architect | Multi-Cloud | R, Python, SQL
