
Data governance in GCP: the Marie Kondo of BigQuery

  • Writer: perdidosenmatematica
  • 30 Jun
  • 10 min read

Compulsive hoarding disorder is a mental health condition marked by a persistent difficulty in discarding or parting with possessions—regardless of their actual value. Sound familiar?


If you work in data, or have ever been part of a data-driven company, chances are you've already witnessed what I’m calling the Data Hoarding Disorder: the endless stockpiling of datasets, tables, views, logs, backups—all “just in case”.


But why does it happen? The honest answer is: there’s no real reason. Still, if we try to make sense of something that might not have any, here are a few likely suspects:

  • Fear of losing information — classic digital FOMO.

  • The belief that cleaning up is risky or a waste of time — “best leave it”.

  • A ‘more is better’ mindset — as if storage were free and searching painless.

  • No automated cleanup tools or policies — no one owns it, so no one acts.


From my experience working with multiple clients, I can confirm that many BigQuery projects live in a state of chronic digital clutter: orphaned datasets, forgotten tables, broken views... everyone sees them, no one deletes them.


Where there’s data, there’s power. But when that data is disorganised, duplicated or simply forgotten, it becomes noise, cost, and vulnerability instead of value.


In this piece, you'll learn:

  • How to spot unnecessary data hoarding in BigQuery.

  • A simple way to audit your environment—no extra tools needed.

  • What metadata is already available to govern your data better.

  • How to make decisions without checking tables one by one.

  • Why this approach saves time, money and frustration.



The BrotDani Method: Auditing BigQuery Without Losing Your Mind

Marie Kondo is a world-renowned organisation consultant and author, famous for creating the KonMari Method — a system that encourages people to keep only the items that “spark joy.” Her method recommends tidying by category rather than by physical location. More than just order, it’s about aligning your space with your values and priorities.


Daniela Rodriguez-Brot Rey — better known in certain Slack circles as Daniela Brot — is a (still) little-known communicator bridging data science and data engineering. One day, she was given a daunting task: audit the BigQuery estate of several client projects with over ten thousand tables. The goal was clear: understand what was there and clean it up. And so, the BrotDani Method was born: a practical approach to auditing and documenting BigQuery, based entirely on metadata.


The BrotDani Method offers a tangible way to apply data governance principles to BigQuery, using metadata as a starting point and visualisation as a decision-making tool.


In short, it involves:

  • Mapping what exists: Gathering information about all BigQuery assets — datasets, tables, views, jobs, structure, size, usage, and more.

  • Crossing real usage with structure: Combining activity (last query, last update) with technical traits (number of columns, partitions, clustering, size, naming conventions).

  • Detecting common issues: Unused views, outdated tables, forgotten backups and tests, poorly designed or expensive structures.

  • Generating automated recommendations: Based on a simple set of CASE WHEN rules, each object is classified into possible actions: delete, optimise, document, rename, monitor.


The BrotDani Method is a way to audit BigQuery with clarity and purpose. It doesn’t just list what’s there — it diagnoses, classifies, and suggests specific actions to improve the health of your data ecosystem.



Where does all this information come from?

INFORMATION_SCHEMA views in BigQuery are automatic, system-generated views that are read-only and give you access to the metadata of your objects — tables, views, columns, jobs, etc. No external tools, no obscure scripts, and no special permissions beyond access to the project or dataset.


The official documentation lists around 70 INFORMATION_SCHEMA views you can query to retrieve metadata in BigQuery. Some are marked with a cross (†), which means they have variants like BY_PROJECT, BY_USER, or BY_FOLDER. Others show a flask icon, indicating they are experimental.


There are views for almost everything: listing datasets, columns or routines; checking permissions; monitoring sessions; or reviewing automatic system recommendations.


Keep in mind that each view has its own way of being queried. For some, it’s enough to specify the region:

SELECT *
FROM `region-{eu/us}.INFORMATION_SCHEMA.SCHEMATA`

Others only accept dataset-level qualifiers — meaning you’ll need to specify the full path, including project and dataset, to query them.

SELECT *
FROM `{project_id}.{dataset_id}.INFORMATION_SCHEMA.PARTITIONS`

Some accept both formats — region-level and dataset-level — giving you more flexibility when querying across your BigQuery environment.

SELECT *
FROM `{project_id}.{dataset_id}.INFORMATION_SCHEMA.TABLES`

SELECT *
FROM `region-{eu/us/...}.INFORMATION_SCHEMA.TABLES`

There are other cases where project-level qualifiers are required, and filtering becomes mandatory to avoid overly large or restricted queries.

SELECT *
FROM `{project_id}`.`region-{eu/us/...}`.INFORMATION_SCHEMA.OBJECT_PRIVILEGES
WHERE object_schema = "{dataset_id}"
  AND object_name   = "{table_id}";

All of this is detailed in the syntax section of the official documentation, or on each individual view's page.



What was used in the BrotDani Method?

As a quick reminder, the goal of the BrotDani Method is to audit and document BigQuery projects using only the available metadata—no external tools required. The idea is to build a living inventory that combines structure with actual usage, allowing for clear decisions: what to delete, what to document, what to optimise, and so on.


This method relies on seven main INFORMATION_SCHEMA views, all queried at the regional level.

  • SCHEMATA: Identifies all datasets, their location, creation and modification dates — a snapshot of your data estate.

  • TABLES: Lists all tables and views, including type and creation dates — helps detect proliferation and redundancies.

  • JOBS: Reveals who runs what, when, and at what cost — crucial for understanding usage and accountability.

  • TABLE_STORAGE: Gives storage metrics (logical/physical bytes, time travel, fail-safe) — supports cost and performance decisions.

  • TABLE_OPTIONS: Retrieves descriptive metadata (labels, descriptions, tags) — essential for assessing documentation quality.

  • COLUMNS: Shows table structures and clustering — useful for detecting oversized or badly designed tables.

  • VIEWS: Maps views to their source tables — helpful for identifying broken links or unnecessary views.

Together, these views allow you to connect structure, usage, and documentation — the three pillars for informed decision-making in data governance.



Execution of the BrotDani Method

The execution of the method involved designing a series of SQL queries that cross-reference structure, usage, and documentation to build a living inventory of the entire BigQuery ecosystem.


These queries were organised into five main blocks:

  • Dataset Detail: Map all project datasets (view used: SCHEMATA).

  • Dimension Table/View: Standardise and group objects by logical name (view used: TABLES).

  • Job Detail: Retrieve actual usage, cost, duration, and errors (view used: JOBS, including its referenced_tables and labels fields).

  • Table Detail: Consolidate structure, size, usage, and technical configuration of each table (views used: TABLE_STORAGE, TABLE_OPTIONS, COLUMNS, JOBS).

  • View Detail: Audit views: origin, documentation, and activity (views used: VIEWS, TABLE_OPTIONS, COLUMNS, JOBS).

Each block follows its own logic to combine relevant metadata. In the case of tables and views, this combination enables an automatic recommendation per object (e.g., delete, optimise, document, no action, etc.), based on simple rules defined using CASE WHEN statements.
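
As a purely illustrative example (the thresholds, field names, and the table_detail input below are placeholders, not the published rules), such a rule set might look like this:

SELECT table_id
      , CASE
          -- Placeholder thresholds: tune them to your own retention and cost policies
          WHEN days_since_last_query  > 365
           AND days_since_last_update > 365 THEN "delete"
          WHEN description IS NULL          THEN "document"
          WHEN total_logical_gb > 100
           AND partition_column IS NULL     THEN "optimise"
          WHEN days_since_last_query > 180  THEN "monitor"
          ELSE "no action"
        END AS recommendation
FROM table_detail   -- hypothetical table consolidating the metadata gathered in the blocks below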


Dataset detail

Returns the list of all datasets in the project, including creation and last modified dates, as well as their regional location.


What’s it for?

  • Helps identify how many datasets exist, when they were created, and whether they’re active or abandoned.

  • Enables cross-referencing with other views (like JOBS or TABLES) to spot datasets with no recent use or no clear owner.

  • Acts as a starting point to build an inventory of the BigQuery ecosystem.


Resulting table:

  • project_id: Project ID (from catalog_name)

  • dataset_id: Full dataset path (project_id.dataset_name)

  • dataset_name: Dataset name (schema_name)

  • creation_date: Dataset creation date

  • last_modified_date: Last modified date

  • location: Region where the dataset is hosted (e.g. EU)

Code:

SELECT catalog_name AS project_id
      , CONCAT(catalog_name, ".", schema_name) AS dataset_id
      , schema_name AS dataset_name
      , EXTRACT(DATE FROM creation_time) AS creation_date
      , EXTRACT(DATE FROM last_modified_time) AS last_modified_date
      , location
FROM `region-eu`.INFORMATION_SCHEMA.SCHEMATA
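
As an illustration of that cross-referencing (this is not part of the published method's code), the result above could be joined with the JOBS view to flag inactive datasets. The sketch below assumes the default retention of the JOBS view (roughly the last 180 days) and an arbitrary 90-day inactivity threshold:

-- Illustrative only: datasets with no queries in the recent JOBS history
WITH dataset_activity AS (
  SELECT rt.dataset_id AS dataset_name
        , MAX(EXTRACT(DATE FROM creation_time)) AS last_query_date
  FROM `region-eu`.INFORMATION_SCHEMA.JOBS
  , UNNEST(referenced_tables) AS rt
  GROUP BY dataset_name
)
SELECT s.schema_name AS dataset_name
      , a.last_query_date
FROM `region-eu`.INFORMATION_SCHEMA.SCHEMATA AS s
LEFT JOIN dataset_activity AS a
       ON s.schema_name = a.dataset_name
WHERE a.last_query_date IS NULL
   OR a.last_query_date < DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)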


Dimension table view

Groups and standardises tables and views to create a dimension of logical objects, by removing numerical suffixes and harmonising similar names (e.g. date-stamped versions of the same table).


What’s it for?

  • Identifies how often the same table has been recreated with different dates or suffixes.

  • Helps detect patterns of automated creation or unsustainable practices (such as excessive duplication).

  • Serves as a key to join with other views (like storage or jobs) and consolidate metrics by logical group.


Resulting table:

  • project_id: Project ID (from table_catalog)

  • dataset_id: Dataset the table or view belongs to

  • table_id: Full object path, using the unified name (_* as wildcard)

  • table_view_name: Logical name, with suffixes like dates or versions removed

  • table_type: Whether it is a BASE TABLE or a VIEW

  • first_creation_date: Earliest creation date among all objects sharing the same logical name

  • last_creation_date: Most recent creation date among all objects sharing the same logical name

Code:

SELECT table_catalog AS project_id
      , CONCAT(table_catalog, ".", table_schema) AS dataset_id
      , CONCAT(table_catalog, ".", table_schema, ".",
               REGEXP_REPLACE(table_name, r'_\d{1,8}\*?$', '_*')) AS table_id
      , REGEXP_REPLACE(table_name, r'_\d{1,8}\*?$', '_*') AS table_view_name
      , table_type
      , MIN(EXTRACT(DATE FROM creation_time)) AS first_creation_date
      , MAX(EXTRACT(DATE FROM creation_time)) AS last_creation_date
FROM `region-eu`.INFORMATION_SCHEMA.TABLES
GROUP BY ALL
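
For instance, a family of date-sharded tables (say, events_20240101 and events_20240102) would collapse into the single logical object events_*, and the MIN/MAX creation dates are then computed per logical object.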


Job detail

Explores the history of jobs run within the project, extracting metadata on queries, users, costs, durations, errors, and referenced tables.


What’s it for?

  • Helps identify who accessed which table and when it was last queried.

  • Enables analysis of estimated costs and detection of expensive or failed queries.

  • Essential for linking actual usage to the objects analysed (tables, views) and assigning recent activity.


Resulting table:

  • creation_date: Job creation date (formatted as YYYYMMDD)

  • user_email: User who ran the job, or the requestor (if labelled)

  • job_type, statement_type: Job type (e.g., QUERY, LOAD) and the SQL statement executed

  • duration_seconds: Job duration in seconds

  • state, error_message: Final job state and any error message

  • total_bytes_processed: Raw data processed

  • total_slot_ms, total_bytes_billed: Slot consumption and billed bytes

  • estimated_cost_usd: Estimated cost in USD

  • referenced_table: Table accessed or modified

  • table_id_in_query: Table inferred from the SQL text if not listed as a referenced_table (useful for SELECT, INSERT, etc.)

  • destination_table.*: Destination table (if applicable)

  • looker_studio_report_id: Report ID if the job was triggered from Looker Studio

  • is_scheduled_query: Whether the job was a Scheduled Query

Code:
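
The published query (linked in the repository at the end of this post) is the reference; what follows is only a minimal sketch of the same idea. It assumes on-demand pricing of roughly 6.25 USD per TiB billed, the looker_studio_report_id label that Looker Studio normally attaches to its jobs, and a "scheduled_query" job-ID prefix for Scheduled Queries — all of which may differ in your environment:

SELECT FORMAT_DATE("%Y%m%d", EXTRACT(DATE FROM creation_time)) AS creation_date
      , user_email
      , job_type
      , statement_type
      , TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_seconds
      , state
      , error_result.message AS error_message
      , total_bytes_processed
      , total_slot_ms
      , total_bytes_billed
      -- Assumption: on-demand pricing of ~6.25 USD per TiB billed
      , total_bytes_billed / POW(1024, 4) * 6.25 AS estimated_cost_usd
      , CONCAT(rt.project_id, ".", rt.dataset_id, ".", rt.table_id) AS referenced_table
      , destination_table.project_id AS destination_project_id
      , destination_table.dataset_id AS destination_dataset_id
      , destination_table.table_id AS destination_table_id
      -- Looker Studio normally labels its jobs with looker_studio_report_id
      , (SELECT value FROM UNNEST(labels) WHERE key = "looker_studio_report_id") AS looker_studio_report_id
      -- Assumption: scheduled queries run with a job_id starting with "scheduled_query"
      , STARTS_WITH(job_id, "scheduled_query") AS is_scheduled_query
FROM `region-eu`.INFORMATION_SCHEMA.JOBS
LEFT JOIN UNNEST(referenced_tables) AS rt
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 180 DAY)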



Table detail

Consolidates structure, size, usage and technical configuration of all tables in the project. It merges multiple sources to enrich each table with key metadata.


What’s it for?

  • Audits logical/physical size, partitioning, clustering, and number of columns.

  • Assesses whether tables are properly documented (description, labels, tags).

  • Detects inactivity, inefficient or poorly designed structures, and applies rules to generate automatic recommendations such as delete, optimise, document, etc.


Resulting table:

  • table_id: Unique identifier of the table

  • total_logical_gb / total_physical_gb: Logical and physical size in GB

  • total_rows: Total number of rows

  • naming_snake_case: Whether the table name follows naming conventions

  • num_columns: Number of columns in the table

  • description, labels, tags: Useful metadata for data cataloguing

  • last_date_query: Last time the table was queried

  • storage_last_modified_date: Last actual content update

  • query_made_by: Who queried the table (user, Looker, or process)

  • days_since_last_query / days_since_last_update: Key metrics for data clean-up

  • recommendation: Suggested action (delete, document, optimise, etc.)

Code:
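
The published query is considerably longer; the following is a simplified sketch of the join logic, where the recommendation branches are placeholders standing in for the rule set sketched earlier:

WITH storage AS (
  SELECT CONCAT(project_id, ".", table_schema, ".", table_name) AS table_id
        , total_rows
        , total_logical_bytes  / POW(1024, 3) AS total_logical_gb
        , total_physical_bytes / POW(1024, 3) AS total_physical_gb
        , EXTRACT(DATE FROM storage_last_modified_time) AS storage_last_modified_date
  FROM `region-eu`.INFORMATION_SCHEMA.TABLE_STORAGE
  WHERE NOT deleted   -- ignore tables that only survive in time travel
)
, tbl_options AS (
  SELECT CONCAT(table_catalog, ".", table_schema, ".", table_name) AS table_id
        , MAX(IF(option_name = "description", option_value, NULL)) AS description
        , MAX(IF(option_name = "labels", option_value, NULL)) AS labels
  FROM `region-eu`.INFORMATION_SCHEMA.TABLE_OPTIONS
  GROUP BY table_id
)
, tbl_columns AS (
  SELECT CONCAT(table_catalog, ".", table_schema, ".", table_name) AS table_id
        , COUNT(*) AS num_columns
  FROM `region-eu`.INFORMATION_SCHEMA.COLUMNS
  GROUP BY table_id
)
, activity AS (
  SELECT CONCAT(rt.project_id, ".", rt.dataset_id, ".", rt.table_id) AS table_id
        , MAX(EXTRACT(DATE FROM creation_time)) AS last_date_query
  FROM `region-eu`.INFORMATION_SCHEMA.JOBS
  , UNNEST(referenced_tables) AS rt
  GROUP BY table_id
)
SELECT s.table_id
      , s.total_logical_gb
      , s.total_physical_gb
      , s.total_rows
      -- Naming check: lower-case letters, digits and underscores only
      , REGEXP_CONTAINS(SPLIT(s.table_id, ".")[OFFSET(2)], r"^[a-z][a-z0-9_]*$") AS naming_snake_case
      , c.num_columns
      , o.description
      , o.labels
      , s.storage_last_modified_date
      , a.last_date_query
      , DATE_DIFF(CURRENT_DATE(), a.last_date_query, DAY) AS days_since_last_query
      -- Simplified placeholder for the CASE WHEN rule set sketched earlier
      , CASE
          WHEN a.last_date_query IS NULL THEN "monitor"   -- no queries in the JOBS window
          WHEN o.description IS NULL     THEN "document"
          ELSE "no action"
        END AS recommendation
FROM storage AS s
LEFT JOIN tbl_options AS o USING (table_id)
LEFT JOIN tbl_columns AS c USING (table_id)
LEFT JOIN activity    AS a USING (table_id)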



View detail

Extracts information about views created in the project: their origin (base table), structure, documentation, and recent usage.


What’s it for?

  • Helps detect views that haven’t been queried in a long time or have poorly defined names.

  • Makes it easier to check whether views are documented or reference non-existent objects.

  • Generates recommendations such as delete view, add description, or change name, based on defined rules.


Resulting table:

  • view_id / view_name: Full name and logical name of the view

  • origin_table / origin_table_group: Base table from which the view is built

  • naming_snake_case: Whether it follows a naming convention

  • description, labels, tags: Metadata for documentation

  • last_date_query, query_made_by: Last time it was used, and by whom

  • days_since_last_query: Time since the last execution

  • num_columns: Structural complexity

  • recommendation: Suggested action (delete, document, rename, etc.)

Code:
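
As before, the repository version is the reference; the following is a minimal sketch that assumes the origin table can be pulled out of view_definition with a simple regular expression (which will not cover every view):

SELECT CONCAT(v.table_catalog, ".", v.table_schema, ".", v.table_name) AS view_id
      , v.table_name AS view_name
      -- Assumption: the first backtick-quoted reference in the SQL text is the origin table
      , REGEXP_EXTRACT(v.view_definition, r"`([^`]+)`") AS origin_table
      , REGEXP_CONTAINS(v.table_name, r"^[a-z][a-z0-9_]*$") AS naming_snake_case
      , MAX(IF(o.option_name = "description", o.option_value, NULL)) AS description
      , COUNT(DISTINCT c.column_name) AS num_columns
FROM `region-eu`.INFORMATION_SCHEMA.VIEWS AS v
LEFT JOIN `region-eu`.INFORMATION_SCHEMA.TABLE_OPTIONS AS o
       ON o.table_catalog = v.table_catalog
      AND o.table_schema  = v.table_schema
      AND o.table_name    = v.table_name
LEFT JOIN `region-eu`.INFORMATION_SCHEMA.COLUMNS AS c
       ON c.table_catalog = v.table_catalog
      AND c.table_schema  = v.table_schema
      AND c.table_name    = v.table_name
GROUP BY ALL
-- last_date_query, query_made_by and days_since_last_query would come from JOBS,
-- joined exactly as in the table-detail sketch above.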


The full code is published in this public GitHub repository: Link to repository

BrotDani Model Schema

There’s no magic behind the BrotDani Method—just structure. A relational model was built, where tables are connected through primary and foreign keys. This setup brings together real usage, technical structure, and descriptive metadata into a final table that enables clear recommendations and informed decision-making.


Snapshot of the data model in Power BI

What every data steward wants

This approach goes beyond technical auditing: it provides an automated and scalable way to cover many of the core responsibilities a tactical data steward should handle.


Specifically, it addresses:

  • Usage and activity auditing: Understand which tables and views are used, when, and by whom — and detect abandoned resources.

  • Data documentation: Check whether objects have descriptions, labels, or other metadata that clarify their purpose.

  • Naming convention control: Evaluate whether names follow a defined format (like snake_case) to ensure consistency.

  • Storage and performance optimisation: Identify opportunities to partition, cluster, or reduce columns — lowering costs and improving performance.

  • Obsolete asset removal: Combine real usage with structure and metadata to suggest deletions or revisions.

  • Living data catalogue: Build a comprehensive view of your BigQuery ecosystem with health logic per object and full traceability.


This solution covers roughly 80% of the tactical data governance work needed for BigQuery. Best of all: it’s automated, replicable, and sustainable. What often feels like a never-ending project becomes a practical system with real impact.



Too long; didn't read

Auditing BigQuery isn’t just a technical task — it’s a core practice of data governance. Without proper controls, projects quickly become cluttered with unnecessary tables, views, and datasets — driving up costs, creating confusion, and putting data quality at risk.


The BrotDani Method offers a practical and automatable approach to auditing and documenting BigQuery, using metadata alone. By querying INFORMATION_SCHEMA, it builds a relational model that combines real usage, technical structure, and descriptive metadata to generate automatic recommendations: delete, optimise, document, or monitor.


The result isn’t just a one-off diagnosis — it’s a living, scalable inventory that covers around 80%, if not more, of the tactical responsibilities of a BigQuery data steward. A concrete solution to support informed decision-making and foster a more organised, efficient, and sustainable data culture.



Thanks for reading!

And don’t forget—you can “like” this post without logging in.

A heartfelt thank you to Esther Ponce for the shared learning and support.

Hi, I'm Daniela. For many years, I was a mathematics teacher at schools and universities, and I also edited and wrote educational books. Today, I work as a Data Science Engineer and share content about mathematics, statistics, and data science on this site and social media. Creating and sharing knowledge is my way of staying connected to teaching. I strongly believe in two ideas that have always guided me: "acquiring new knowledge keeps you relevant" and "sharing is living."



