Understanding OLAP, Partitioning, and Why Data Warehouses like BigQuery Exist
This week a lot clicked for me, so this post is longer than usual. Explaining concepts in detail helps me understand them. Hopefully it helps you too.
Data is abundant and messy, but properly partitioning and clustering your data warehouse can save your company money and improve query performance. This week I learned why.
Fundamental Data System Concepts: OLAP and OLTP, Partitioning, Clustering
Before jumping right into BigQuery, it is important to understand the difference between online analytical processing (OLAP) and online transaction processing (OLTP). Both are “database management systems for storing and processing data in large volumes.” We use OLAP to analyze aggregated data, and OLTP to process database transactions.
Think OLAP for Big Data.
OLAP systems aggregate across data dimensions (e.g., customers, purchases, inventory), while OLTP systems are your traditional relational databases that store things like credit card transactions or customer purchases.
As I am learning and brushing up on these concepts, I often need real-world analogies to solidify my mental model. I found this analogy from the data engineering subreddit, and it clicked.
Keeping an eye on the score displayed on the big screen at a cricket stadium (OLAP) is much better than running onto the field to ask the umpire (OLTP) about the score after every ball.
The umpire’s primary role is to make crucial decisions during the game (process transactions), and while they can provide the score, interrupting them is of course not ideal LOL. The scoreboard (OLAP), which is regularly updated after each ball or over, is a more efficient and non-disruptive way to stay informed.
Google’s BigQuery and Amazon’s Redshift are examples of OLAP data warehouses. They are read-heavy systems that are optimized to allow users to execute highly performant queries across large data sets and aggregate insights from that data.
This week, I used Python to load Parquet yellow taxi data to Google Cloud Storage (GCS). To solidify concepts, I used Google’s BigQuery to play with this data.
Partitioning and Clustering
We want queries we run in our data warehouse to be as performant as possible.
Partitions allow us to divide a dataset into smaller segments for manageability and data skipping. Clustering organizes data in storage blocks (i.e. sort order) based on a column value so related values are stored together. This speeds up queries that filter or aggregate by that column because BigQuery can skip irrelevant blocks.
For example, any query that filters by date will automatically skip partitions that do not contain that date, avoiding the need to scan massive amounts of data to return our results. Faster and cheaper.
So, keeping with our yellow taxi dataset from previous weeks, I can partition my table by tpep_dropoff_datetime and analyze what changes when BigQuery evaluates my query. The partitioned table is also clustered by VendorID.
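For reference, here is a sketch of how that partitioned and clustered table could be created. This is my own reconstruction, not the exact DDL from my project; BigQuery wants a date expression in PARTITION BY, hence the DATE() wrapper around the timestamp column:

-- Sketch: table partitioned by dropoff day, clustered by vendor
CREATE OR REPLACE TABLE `yellow_taxi_tripdata_2024_partitioned`
PARTITION BY DATE(tpep_dropoff_datetime)
CLUSTER BY VendorID
AS
SELECT * FROM `yellow_taxi_tripdata_2024`;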
-- No partition
-- BQ Scans 310.24 MB of data
SELECT count(distinct VendorID)
FROM yellow_taxi_tripdata_2024
WHERE tpep_dropoff_datetime BETWEEN '2024-03-01' AND '2024-03-15';
-- Partitioned by tpep_dropoff_datetime
-- BQ Scans 26.84 MB of data
SELECT count(distinct VendorID)
FROM yellow_taxi_tripdata_2024_partitioned
WHERE tpep_dropoff_datetime BETWEEN '2024-03-01' AND '2024-03-15';

Note the difference in the amount of data scanned by the two queries.
Let’s take a look at another example. I created two tables: one clustered by VendorID with no partitioning at all, and the other partitioned by datetime with no clustering.
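Here’s roughly how I’d expect those two tables to be defined (a sketch, assuming the same materialized source table as before):

-- Clustered by VendorID, no partitioning
CREATE OR REPLACE TABLE `yellow_taxi_tripdata_2024_clus_no_partition`
CLUSTER BY VendorID
AS
SELECT * FROM `yellow_taxi_tripdata_2024`;

-- Partitioned by dropoff date, no clustering
CREATE OR REPLACE TABLE `yellow_taxi_tripdata_2024_partition_no_clus`
PARTITION BY DATE(tpep_dropoff_datetime)
AS
SELECT * FROM `yellow_taxi_tripdata_2024`;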
-- Clustering, no partition
-- 465.36 MB scanned, 352ms
SELECT VendorID, avg(fare_amount)
FROM yellow_taxi_tripdata_2024_clus_no_partition
WHERE tpep_dropoff_datetime BETWEEN '2024-03-01' AND '2024-03-15'
GROUP BY VendorID
-- Partitioning, no clustering
-- 40.26 MB scanned, 263ms
SELECT VendorID, avg(fare_amount)
FROM yellow_taxi_tripdata_2024_partition_no_clus
WHERE tpep_dropoff_datetime BETWEEN '2024-03-01' AND '2024-03-15'
GROUP BY VendorID

For these types of queries, partitioning is more important since we are filtering by datetime: BigQuery can discard entire partitions by date before any data is read.
Clustering does not help in this second example since the filter is on datetime, not VendorID; BigQuery still needs to scan all the data to filter by datetime. If the query instead filtered or aggregated heavily on VendorID, clustering would let BigQuery skip storage blocks and we’d see an improvement.
Key takeaway: partition on what you filter and cluster as a secondary optimization.
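One way to sanity-check a partitioning choice is BigQuery’s INFORMATION_SCHEMA.PARTITIONS view, which reports per-partition row and byte counts. The project and dataset names here are placeholders:

-- Inspect how rows and bytes are spread across partitions
SELECT partition_id, total_rows, total_logical_bytes
FROM `<project>.<dataset>.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'yellow_taxi_tripdata_2024_partitioned'
ORDER BY partition_id;

If most queries land on a few huge partitions, or most partitions are nearly empty, that’s a hint the partitioning column was the wrong choice.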
Using BigQuery to Learn about Query and Cost Optimization
In BigQuery, we can create external tables to query data directly from Google Cloud Storage (GCS). An external table is a great option for previewing data, but not for a frequently accessed, longer-term solution. Querying external tables is slower since the data must be read from its source every time, and they can’t leverage BigQuery’s columnar storage optimizations.
CREATE OR REPLACE EXTERNAL TABLE `yellow_taxi_tripdata_2024_ext`
OPTIONS (
format = "parquet",
uris = ['gs://<bucket>/<path>/*.parquet']
);

External tables are also a good fit when the underlying data changes frequently, since they avoid re-ingestion.
When we create a proper materialized table in BigQuery from an external table, we get real benefits: if you query your data more than occasionally, these tables win on both cost and performance.
Evaluating data reads and columnar storage
Let’s take a look at the difference between reads from an external table and those from a materialized BigQuery table:
-- create a proper BQ table from the external table
CREATE OR REPLACE TABLE `yellow_taxi_tripdata_2024` AS
SELECT * FROM `yellow_taxi_tripdata_2024_ext`;
-- Materialized table
-- BQ Scans 155.15 MB of data
-- Query time: 272ms
SELECT count(distinct PULocationID) FROM `yellow_taxi_tripdata_2024`;
-- External table
-- BQ Scans 0 B of data (reads from GCS, not counted in BQ metrics)
-- Query time: 535ms
SELECT count(distinct PULocationID) FROM `yellow_taxi_tripdata_2024_ext`;

This is a relatively small dataset compared to other big data workloads, so the difference in query time would likely be more dramatic for larger datasets and more complex queries that run frequently. Materialized tables are stored in Colossus, BigQuery’s distributed file system, which is one reason queries against them are faster.
This week made me realize the questions I should be asking when designing a data warehouse:
What insights do I need to gather about my business or product?
Who are the downstream consumers (e.g. machine learning models, data scientists, AI agents, executives) and what do they need?
Which columns in the data warehouse are going to be filtered or aggregated?
Which columns are high cardinality?
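For that last question, a quick way to gauge cardinality in BigQuery is APPROX_COUNT_DISTINCT, which is cheaper than an exact distinct count; the column choices below are just illustrative:

-- Rough cardinality of candidate clustering columns
SELECT
  APPROX_COUNT_DISTINCT(VendorID) AS vendor_ids,
  APPROX_COUNT_DISTINCT(PULocationID) AS pickup_location_ids
FROM `yellow_taxi_tripdata_2024`;

Very low-cardinality columns (like VendorID) give clustering little to work with, while very high-cardinality columns are often better clustering candidates than partitioning ones.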
Key takeaways
Spend more time thinking about your data before jumping right into data warehouse design
BigQuery is a great option if you prioritize simplicity and minimal management overhead
What’s still fuzzy for me
Thinking about clustering, what happens if business or product use cases pivot and you have petabytes of data in your data warehouse? A partition / clustering strategy redesign could be expensive and time consuming. How do organizations that operate at scale handle this problem?
These questions keep me digging deeper. The beautiful thing about diving into data engineering fundamentals is that each concept builds on the last. We’ve now covered Docker, infrastructure as code, workflow orchestrators, cloud, and data warehouses.
Understanding these fundamentals transforms data warehouse design from a guessing game into an intentional architectural decision.
Next week, I’ll be taking a look at analytics engineering. What kind of insights can we drive from our data warehouse and what tools will help us get there?
Thanks for reading!

