If you work with large volumes of data in BigQuery, you’ve probably wondered how to optimize your queries and manage costs more efficiently. Upon researching, you’ll find various configurations that can be applied to a table, including the possibility of partitioning it—one of the most powerful strategies for achieving this. But when is the right time to partition a table? In this article, we will explore in depth the criteria and benefits of partitioning in BigQuery, analyze practical use cases, and debunk some common misconceptions surrounding this topic.
What is a partitioned table?
Let's suppose you have a huge collection of books stored in your library, but they are all mixed together, organized only in the order in which you acquired them. In this case, searching for a book with a particular characteristic—such as a science fiction book—becomes very complicated because you would have to go through each book to identify the ones that meet that criterion. This would be equivalent to having a flat table in BigQuery and applying a filter (WHERE).
Now, imagine that you decide to organize your books into sections or shelves based on a specific criterion: one for science fiction novels, another for thrillers, one for biographies, and so on. Each section represents a part of your collection, so each book is classified into a single group, and when combined, they form the complete collection. This example is equivalent to having a partitioned table in BigQuery.
Just as when searching for a science fiction book, you only need to check the corresponding section, in a partitioned table, queries are executed only on the relevant subset. This makes the search much faster and more efficient, optimizing both time and processing costs. Mathematically, this process is known as set partitioning: dividing a set into disjoint subsets—without common elements—whose union forms the original set.
Partitioned Table: A partitioned table is a table in which data is divided into segments called partitions. These segments are created based on the values of a partition column (e.g., a date column or a numeric range).[documentation source]
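As a quick sketch, this is how such a table could be declared with BigQuery DDL (the dataset, table, and column names here are hypothetical, chosen only for illustration):

```sql
-- Hypothetical example: a table of daily metrics, partitioned by its DATE column.
-- Each day's rows are stored in their own partition automatically.
CREATE TABLE my_dataset.daily_metrics (
  metric_date DATE,
  category STRING,
  value INT64
)
PARTITION BY metric_date;
```

From that point on, any query that filters on `metric_date` can be answered by reading only the matching partitions.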
What performance and cost advantages does partitioning offer?
When you run a query on a partitioned table in BigQuery, the query engine uses a process known as "partition pruning" or partition elimination. This means that if your query includes a filter on the partitioning column (e.g., a date), BigQuery identifies and scans only the partitions that contain relevant data for the filter.
To understand the impact, let's talk in concrete numbers: Suppose we have a table with 100 GB of data. If we leave the table flat, when executing a query that filters by a date range, BigQuery will have to scan the entire 100 GB. You’ll notice that if your query is:
SELECT *
FROM table_2024
it will have the same cost as if it were:
SELECT *
FROM table_2024
WHERE date >= "2024-12-01" AND date <= "2024-12-31"
despite the fact that the output will return fewer rows when executing this second query.
On the other hand, if we partition this table by the date column (assuming one year of daily data, 366 days), each day would represent, on average, a 273 MB partition.
Partition size ≈ 100 GB / 366 days ≈ 0.273 GB ≈ 273 MB
So instead of scanning 100 GB, the query will scan approximately 8.46 GB.
31 × 273 MB ≈ 8.463 GB
It is important to note that in BigQuery, the cost of a query is based on the amount of data scanned. According to the documentation [link], the current on-demand query price is $6.25 USD per TiB scanned. Note that "TiB" (tebibyte) refers to 2⁴⁰ bytes, which is slightly more than a decimal terabyte (1 TB = 10¹² bytes).
So, going back to our example of the 100 GB table queried for the month of December, the calculations would be as follows:
1 TiB = 1024 GiB ≈ 1.1 TB
Treating GB ≈ GiB for simplicity:
⇒ 100 GB ≈ 100 / 1024 ≈ 0.09766 TiB
⇒ 8.463 GB ≈ 8.463 / 1024 ≈ 0.00826 TiB
∴ The approximate cost for 100 GB would be 0.09766 × $6.25 ≈ $0.61 USD.
∴ The approximate cost for 8.463 GB would be 0.00826 × $6.25 ≈ $0.05 USD.
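To double-check the arithmetic, here is a tiny Python sketch (the $6.25 per TiB rate is the on-demand price quoted above, and 1 TiB is taken as 1024 GB, following the same simplification as the calculation):

```python
PRICE_PER_TIB_USD = 6.25  # on-demand rate quoted above; check the current pricing page

def query_cost_usd(gb_scanned: float) -> float:
    """Approximate on-demand query cost, treating 1 TiB as 1024 GB."""
    return gb_scanned / 1024 * PRICE_PER_TIB_USD

print(round(query_cost_usd(100), 2))    # flat table: scans all 100 GB -> 0.61
print(round(query_cost_usd(8.463), 2))  # partitioned: scans only December -> 0.05
```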
Therefore, we are talking about a cost reduction of approximately 91.8% just by partitioning the table. Incredible!
In summary, the benefits are seen in:
Efficiency: By applying a filter on the partition column, the table is "pruned," scanning only the necessary subset.
Cost: Fewer scanned data means lower processing costs in BigQuery.
Performance: By reducing the amount of data to process, queries run faster.
So, not only does it improve query efficiency, but it also optimizes resource usage and reduces costs, which is especially important when handling large volumes of data.
It is also important to note that, in BigQuery, storage costs are based on the amount of data stored and do not vary significantly whether the table is partitioned or flat. In other words, the price per GB stored is the same for both structures.
Is it always a good idea to partition?
And here’s the crux of the matter: while it may seem ideal, partitioning is not always the best choice. Let’s go back to the library example: organizing books by genre on different shelves sounds great—if I feel like reading a crime novel, I can easily find my options. But now, suppose I have 15 novels, each from a different genre; in that case, it wouldn’t make much sense to organize 15 books across 15 different shelves, right? In computing, this extra cost of organization is known as "overhead."
Overhead: In the context of computing systems, overhead refers to the additional resource cost (such as processing time, memory, and storage space) required to manage, coordinate, or support a primary operation. In other words, it is the “extra expense” needed for the system to perform administrative or control tasks without directly contributing to the final outcome of the operation.
So, the "overhead" of a partitioned table refers to the small extra cost associated with organizing and managing the table in partitions instead of keeping it as a single entity. In BigQuery, this overhead is primarily the cost of managing the additional information (metadata) for each partition.
It is important to highlight that while this additional cost does not appear as an extra charge in monetary terms—since it is built into the system and does not generate a separate charge on your bill—it does have implications for performance and data management efficiency. This is why we must carefully consider whether partitioning makes sense or not.
If you look at the official BigQuery documentation, the term "overhead" is not always used explicitly, but there are references to the impact of having partitions that are too small. In the section on Best Practices for Partitioned Tables [link documentation] it is recommended that each partition contain at least 1 GB of data. This recommendation is clearly not about saving money but rather about preventing the creation of too many small partitions, which can impact system performance in the following ways:
Metadata management: Each partition generates additional information (metadata) that BigQuery must handle. If you have too many partitions with only one row each, this metadata overhead can make internal operations less efficient.
Query optimization: The goal of partitioning is to allow BigQuery to scan only the relevant portion of the table when filtering. If partitions are too small, the advantage of skipping large volumes of data diminishes, potentially complicating query optimization.
Internal limits: BigQuery imposes certain limits on the number of partitions allowed per table (currently a maximum of 10,000 partitions). Having too many small partitions could bring you closer to these limits and create management issues. [link documentation]
In summary, it's not just about "partitioning everything and that's it," as partitioning without a clear strategy would be like ordering a cold drink at a bar and receiving it with a cardboard sleeve—the thermal cover meant for hot beverages.
The partitioning strategy should align with the data volume and query patterns you will be executing. This way, you ensure that data organization truly improves query performance without overloading the system with an excessive number of partitions.
How can we anticipate whether to partition or not?
As mentioned earlier, the recommendation is that each partition should be at least 1 GB in size. Therefore, to determine whether a table should be partitioned, we need to estimate its total size—or its potential future size—and calculate how many partitions would be created. With this information, we can make an informed decision on whether partitioning is truly beneficial.
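That rule of thumb can be sketched as a small helper function (the function name and the 1 GB default are my own illustration of the best-practices recommendation, not an official API):

```python
def partitioning_worthwhile(total_gb: float, n_partitions: int,
                            min_partition_gb: float = 1.0) -> bool:
    """Rough heuristic: partition only if the average partition would
    reach the recommended minimum size of about 1 GB."""
    return total_gb / n_partitions >= min_partition_gb

# A 500 GB table with daily partitions over one year: ~1.37 GB each
print(partitioning_worthwhile(500, 366))   # True
# A 100 GB table with the same daily scheme: ~0.27 GB each
print(partitioning_worthwhile(100, 366))   # False
```

Note that, by this heuristic, a 100 GB table would be better served by coarser partitions (e.g., monthly) than daily ones.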
To calculate the total storage cost of a table, we need to:
Estimate the number of rows it will have.
Calculate the storage cost based on the official documentation.
To make it easier to understand, I'll provide an example I worked on this week—the one that led me to this entire investigation.
Table information: Number of followers per day, per social network, per business unit of the company.
So, to estimate the approximate size of the table, we will compute the Cartesian product of the dimensional columns, multiplying together the number of possible values in each one.
Column | Data type | Detail | Possibilities |
---|---|---|---|
Date | DATE | Historical data for 1 year and 4 months | 468 |
Business unit | STRING | Working with 5 business units | 5 |
Social media | STRING | Working with 7 social networks | 7 |
Therefore, the maximum number of combinations that can occur is 16,380, which represents the approximate number of rows that the table can have.
468 × 5 × 7 = 16,380
BigQuery uses columnar storage and has estimated sizes per data type. [link documentation]:
Data type | Approx. size |
---|---|
INT64 | 8 bytes |
FLOAT64 | 8 bytes |
STRING | length × 2 bytes |
BOOLEAN | 1 byte |
DATE | 4 bytes |
TIMESTAMP | 8 bytes |
To calculate the table's storage cost, we need to determine the average size per row and then multiply it by the total number of rows obtained.
Average row size ≈ ∑(size of each column)
Total size ≈ average row size × number of rows
So, going back to our initial example, to calculate the cost of a row, we need to determine the storage cost of each column.
Column | Data type | Size (bytes) |
---|---|---|
Date | DATE | 4 |
Business unit | STRING (10) | 20 |
Social media | STRING (15) | 30 |
Followers | INT64 | 8 |
One row contains: one DATE value, one STRING(10), one STRING(15), and one INT64 value.
⇒ 4 + 20 + 30 + 8 = 62 bytes per row
⇒ 62 bytes × 16,380 = 1,015,560 bytes
⇒ ≈ 992 KB
⇒ ≈ 0.97 MB
⇒ ≈ 0.00095 GB ⟺ approximately one-thousandth of 1 GB
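These numbers are easy to reproduce in a few lines of Python (following the per-type sizes above and the article's 2-bytes-per-character rule for strings):

```python
# Estimated per-value sizes, following the type-size table above
DATE_BYTES = 4
INT64_BYTES = 8

def string_bytes(max_len: int) -> int:
    # Rule of thumb used here: 2 bytes per character, sized for the longest value
    return 2 * max_len

row_bytes = DATE_BYTES + string_bytes(10) + string_bytes(15) + INT64_BYTES
rows = 468 * 5 * 7          # date x business unit x social network combinations
total_bytes = row_bytes * rows

print(row_bytes)                        # 62 bytes per row
print(total_bytes)                      # 1015560 bytes
print(round(total_bytes / 1024**2, 2))  # ~0.97 MB
```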
Things to keep in mind about these calculations:
We are calculating the maximum number of permutations, meaning if not all business units have all social networks, the number of rows would be reduced.
When determining the size of the STRING columns, we should use the longest value we expect. For example, "X" will have a size of 2 bytes, while "Instagram" will have a size of 18 bytes. This way, we ensure we are calculating the maximum possible size.
And the indisputable conclusion is: This size is small and clearly does not justify the need for partitioning, as BigQuery optimizes flat tables better for this volume of data.
The last consideration is to think that, although this extraction is static, the table is not, since new rows with data will continue to be added. So, could it not make sense to partition now, but make sense later? This is an important question because we need to consider that the goal is not to modify the table in the future. To address this, we will perform a similar calculation, but estimating future changes.
Column | Data type | Detail | Possibilities |
---|---|---|---|
Date | DATE | Historical data for 5 years | 1825 |
Business unit | STRING | Working with 10 business units | 10 |
Social media | STRING | Working with 60 social networks | 60 |
New number of rows: 1,825 × 10 × 60 = 1,095,000
If one row is 62 bytes:
⇒ 62 bytes × 1,095,000 = 67,890,000 bytes
⇒ ≈ 66,299 KB
⇒ ≈ 64.74 MB
⇒ ≈ 0.0632 GB ⟺ roughly one-sixteenth of 1 GB
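The same calculation, rerun for the projected scenario (keeping the 62-byte row size computed for the current table):

```python
ROW_BYTES = 62                    # per-row size from the current-scenario estimate
rows_future = 1825 * 10 * 60      # 5 years x 10 business units x 60 networks
total_bytes = ROW_BYTES * rows_future

print(rows_future)                      # 1095000 rows
print(round(total_bytes / 1024**2, 2))  # ~64.74 MB
print(round(total_bytes / 1024**3, 4))  # ~0.0632 GB, far below the 1 GB threshold
```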
Even with these projections, the table remains small compared to the 1 GB per partition threshold, indicating that partitioning is not justified.
In conclusion, this table, in both the current and future scenarios, is too small to benefit from partitioning in BigQuery. Partitioning would introduce overhead in metadata management without offering significant advantages in query performance.
To put it simply
While partitioning a table in BigQuery can improve performance and reduce costs, it’s important to do it thoughtfully. As a general rule, a table should have more than 10 GB of data to justify partitioning, and each partition should approach the 1 GB threshold. This means that, for example, partitioning by day is not always the best option if daily data doesn't reach that size.
The key is to assess the volume and growth of your data to choose a strategy that provides real benefits without creating unnecessary overhead. With this approach, optimizing your queries and managing resources becomes an informed and effective decision.
Thanks for reading!
And don't forget, you can give this note a "like" without even logging in!