
Data governance in GCP: the Marie Kondo of BigQuery

  • Writer: perdidosenmatematica
  • 30 Jun
  • 10 min read

Compulsive hoarding disorder is a mental health condition marked by a persistent difficulty in discarding or parting with possessions—regardless of their actual value. Sound familiar?


If you work in data, or have ever been part of a data-driven company, chances are you've already witnessed what I’m calling the Data Hoarding Disorder: the endless stockpiling of datasets, tables, views, logs, backups—all “just in case”.


But why does it happen? The honest answer is: there’s no real reason. Still, if we try to make sense of something that might not have any, here are a few likely suspects:

  • Fear of losing information — classic digital FOMO.

  • The belief that cleaning up is risky or a waste of time — “best leave it”.

  • A ‘more is better’ mindset — as if storage were free and searching painless.

  • No automated cleanup tools or policies — no one owns it, so no one acts.


From my experience working with multiple clients, I can confirm that many BigQuery projects live in a state of chronic digital clutter: orphaned datasets, forgotten tables, broken views... everyone sees them, no one deletes them.


Where there’s data, there’s power. But when that data is disorganised, duplicated or simply forgotten, it becomes noise, cost, and vulnerability instead of value.


In this piece, you'll learn:

  • How to spot unnecessary data hoarding in BigQuery.

  • A simple way to audit your environment—no extra tools needed.

  • What metadata is already available to govern your data better.

  • How to make decisions without checking tables one by one.

  • Why this approach saves time, money and frustration.



The BrotDani Method: Auditing BigQuery Without Losing Your Mind

Marie Kondo is a world-renowned organisation consultant and author, famous for creating the KonMari Method — a system that encourages people to keep only the items that “spark joy.” Her method recommends tidying by category rather than by physical location. More than just order, it’s about aligning your space with your values and priorities.


Daniela Rodriguez-Brot Rey — better known in certain Slack circles as Daniela Brot — is a (still) little-known communicator bridging data science and data engineering. One day, she was given a daunting task: audit the BigQuery estate of several client projects with over ten thousand tables. The goal was clear: understand what was there and clean it up. And so, the BrotDani Method was born: a practical approach to auditing and documenting BigQuery, based entirely on metadata.


The BrotDani Method offers a tangible way to apply data governance principles to BigQuery, using metadata as a starting point and visualisation as a decision-making tool.


In short, it involves:

  • Mapping what exists: Gathering information about all BigQuery assets — datasets, tables, views, jobs, structure, size, usage, and more.

  • Crossing real usage with structure: Combining activity (last query, last update) with technical traits (number of columns, partitions, clustering, size, naming conventions).

  • Detecting common issues: Unused views, outdated tables, forgotten backups and tests, poorly designed or expensive structures.

  • Generating automated recommendations: Based on a simple set of CASE WHEN rules, each object is classified into possible actions: delete, optimise, document, rename, monitor.


The BrotDani Method is a way to audit BigQuery with clarity and purpose. It doesn’t just list what’s there — it diagnoses, classifies, and suggests specific actions to improve the health of your data ecosystem.



Where does all this information come from?

INFORMATION_SCHEMA views in BigQuery are automatic, system-generated views that are read-only and give you access to the metadata of your objects — tables, views, columns, jobs, etc. No external tools, no obscure scripts, and no special permissions beyond access to the project or dataset.


The official documentation lists around 70 INFORMATION_SCHEMA views you can query to retrieve metadata in BigQuery. Some are marked with a cross (†), which means they have variants like BY_PROJECT, BY_USER, or BY_FOLDER. Others show a flask icon, indicating they are experimental.


There are views for almost everything: listing datasets, columns or routines; checking permissions; monitoring sessions; or reviewing automatic system recommendations.


Keep in mind that each view has its own way of being queried. For some, it’s enough to specify the region:

SELECT *
FROM `region-{eu/us}.INFORMATION_SCHEMA.SCHEMATA`

Others only accept dataset-level qualifiers — meaning you’ll need to specify the full path, including project and dataset, to query them.

SELECT *
FROM `{project_id}.{dataset_id}.INFORMATION_SCHEMA.PARTITIONS`

Some accept both formats — region-level and dataset-level — giving you more flexibility when querying across your BigQuery environment.

SELECT *
FROM `{project_id}.{dataset_id}.INFORMATION_SCHEMA.TABLES`

SELECT *
FROM `region-{eu/us/...}.INFORMATION_SCHEMA.TABLES`

There are other cases where project-level qualifiers are required, and filtering becomes mandatory to avoid overly large or restricted queries.

SELECT *
FROM `{project_id}`.`region-{eu/us/...}`.INFORMATION_SCHEMA.OBJECT_PRIVILEGES
WHERE object_schema = "{dataset_id}"
  AND object_name   = "{table_id}";

All of this is detailed in the syntax section of the official documentation, or on each individual view's page.



What was used in the BrotDani Method?

As a quick reminder, the goal of the BrotDani Method is to audit and document BigQuery projects using only the available metadata—no external tools required. The idea is to build a living inventory that combines structure with actual usage, allowing for clear decisions: what to delete, what to document, what to optimise, and so on.


This method relies on seven main INFORMATION_SCHEMA views, all queried at the regional level.

  • SCHEMATA: Identifies all datasets, their location, creation and modification dates — a snapshot of your data estate.

  • TABLES: Lists all tables and views, including type and creation dates — helps detect proliferation and redundancies.

  • JOBS: Reveals who runs what, when, and at what cost — crucial for understanding usage and accountability.

  • TABLE_STORAGE: Gives storage metrics (logical/physical bytes, time travel, fail-safe) — supports cost and performance decisions.

  • TABLE_OPTIONS: Retrieves descriptive metadata (labels, descriptions, tags) — essential for assessing documentation quality.

  • COLUMNS: Shows table structures and clustering — useful for detecting oversized or badly designed tables.

  • VIEWS: Maps views to their source tables — helpful for identifying broken links or unnecessary views.

Together, these views allow you to connect structure, usage, and documentation — the three pillars for informed decision-making in data governance.



Execution of the BrotDani Method

The execution of the method involved designing a series of SQL queries that cross-reference structure, usage, and documentation to build a living inventory of the entire BigQuery ecosystem.


These queries were organised into five main blocks:

  • Dataset Detail: Map all project datasets (view used: SCHEMATA).

  • Dimension Table/View: Standardise and group objects by logical name (view used: TABLES).

  • Job Detail: Retrieve actual usage, cost, duration, and errors (view used: JOBS, including its referenced_tables and labels fields).

  • Table Detail: Consolidate structure, size, usage, and technical configuration of each table (views used: TABLE_STORAGE, TABLE_OPTIONS, COLUMNS, JOBS).

  • View Detail: Audit views: origin, documentation, and activity (views used: VIEWS, TABLE_OPTIONS, COLUMNS, JOBS).

Each block follows its own logic to combine relevant metadata. In the case of tables and views, this combination enables an automatic recommendation per object (e.g., delete, optimise, document, no action, etc.), based on simple rules defined using CASE WHEN statements.
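
As a purely illustrative example (the thresholds, field names, and the table_detail input below are placeholders, not the published rules), such a rule set might look like this:

SELECT table_id
      , CASE
          -- Placeholder thresholds: tune them to your own retention and cost policies
          WHEN days_since_last_query  > 365
           AND days_since_last_update > 365 THEN "delete"
          WHEN description IS NULL          THEN "document"
          WHEN total_logical_gb > 100
           AND partition_column IS NULL     THEN "optimise"
          WHEN days_since_last_query > 180  THEN "monitor"
          ELSE "no action"
        END AS recommendation
FROM table_detail   -- hypothetical table consolidating the metadata gathered in the blocks below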


Dataset detail

Returns the list of all datasets in the project, including creation and last modified dates, as well as their regional location.


What’s it for?

  • Helps identify how many datasets exist, when they were created, and whether they’re active or abandoned.

  • Enables cross-referencing with other views (like JOBS or TABLES) to spot datasets with no recent use or no clear owner.

  • Acts as a starting point to build an inventory of the BigQuery ecosystem.


Resulting table:

  • project_id: Project ID (from catalog_name)

  • dataset_id: Full dataset path (project_id.dataset_name)

  • dataset_name: Dataset name (schema_name)

  • creation_date: Dataset creation date

  • last_modified_date: Last modified date

  • location: Region where the dataset is hosted (e.g. EU)

Code:

SELECT catalog_name AS project_id
      , CONCAT(catalog_name, ".", schema_name) AS dataset_id
      , schema_name AS dataset_name
      , EXTRACT(DATE FROM creation_time) AS creation_date
      , EXTRACT(DATE FROM last_modified_time) AS last_modified_date
      , location
FROM `region-eu`.INFORMATION_SCHEMA.SCHEMATA
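
As an illustration of that cross-referencing (this is not part of the published method's code), the result above could be joined with the JOBS view to flag inactive datasets. The sketch below assumes the default retention of the JOBS view (roughly the last 180 days) and an arbitrary 90-day inactivity threshold:

-- Illustrative only: datasets with no queries in the recent JOBS history
WITH dataset_activity AS (
  SELECT rt.dataset_id AS dataset_name
        , MAX(EXTRACT(DATE FROM creation_time)) AS last_query_date
  FROM `region-eu`.INFORMATION_SCHEMA.JOBS
  , UNNEST(referenced_tables) AS rt
  GROUP BY dataset_name
)
SELECT s.schema_name AS dataset_name
      , a.last_query_date
FROM `region-eu`.INFORMATION_SCHEMA.SCHEMATA AS s
LEFT JOIN dataset_activity AS a
       ON s.schema_name = a.dataset_name
WHERE a.last_query_date IS NULL
   OR a.last_query_date < DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)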


Dimension table view

Groups and standardises tables and views to create a dimension of logical objects, by removing numerical suffixes and harmonising similar names (e.g. date-stamped versions of the same table).


What’s it for?

  • Identifies how often the same table has been recreated with different dates or suffixes.

  • Helps detect patterns of automated creation or unsustainable practices (such as excessive duplication).

  • Serves as a key to join with other views (like storage or jobs) and consolidate metrics by logical group.


Resulting table:

  • project_id: Project ID (from table_catalog)

  • dataset_id: Dataset the table or view belongs to

  • table_id: Full object path, using the unified name (_* as wildcard)

  • table_view_name: Logical name, with suffixes like dates or versions removed

  • table_type: Whether it is a BASE TABLE or a VIEW

  • first_creation_date: Earliest creation date among all objects sharing the same logical name

  • last_creation_date: Most recent creation date among all objects sharing the same logical name

Code:

SELECT table_catalog AS project_id
      , CONCAT(table_catalog, ".", table_schema) AS dataset_id
      , CONCAT(table_catalog, ".", table_schema, ".",
               REGEXP_REPLACE(table_name, r'_\d{1,8}\*?$', '_*')) AS table_id
      , REGEXP_REPLACE(table_name, r'_\d{1,8}\*?$', '_*') AS table_view_name
      , table_type
      , MIN(EXTRACT(DATE FROM creation_time)) AS first_creation_date
      , MAX(EXTRACT(DATE FROM creation_time)) AS last_creation_date
FROM `region-eu`.INFORMATION_SCHEMA.TABLES
GROUP BY ALL
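
For instance, a family of date-sharded tables (say, events_20240101 and events_20240102) would collapse into the single logical object events_*, and the MIN/MAX creation dates are then computed per logical object.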


Job detail

Explores the history of jobs run within the project, extracting metadata on queries, users, costs, durations, errors, and referenced tables.


What’s it for?

  • Helps identify who accessed which table and when it was last queried.

  • Enables analysis of estimated costs and detection of expensive or failed queries.

  • Essential for linking actual usage to the objects analysed (tables, views) and assigning recent activity.


Resulting table:

  • creation_date: Job creation date (formatted as YYYYMMDD)

  • user_email: User who ran the job, or the requestor (if labelled)

  • job_type, statement_type: Job type (e.g., QUERY, LOAD) and the SQL statement executed

  • duration_seconds: Job duration in seconds

  • state, error_message: Final job state and any error message

  • total_bytes_processed: Raw data processed

  • total_slot_ms, total_bytes_billed: Slot consumption and billed bytes

  • estimated_cost_usd: Estimated cost in USD

  • referenced_table: Table accessed or modified

  • table_id_in_query: Table inferred from the SQL text if not listed as a referenced_table (useful for SELECT, INSERT, etc.)

  • destination_table.*: Destination table (if applicable)

  • looker_studio_report_id: Report ID if the job was triggered from Looker Studio

  • is_scheduled_query: Whether the job was a Scheduled Query

Code:
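
The published query (linked in the repository at the end of this post) is the reference; what follows is only a minimal sketch of the same idea. It assumes on-demand pricing of roughly 6.25 USD per TiB billed, the looker_studio_report_id label that Looker Studio normally attaches to its jobs, and a "scheduled_query" job-ID prefix for Scheduled Queries — all of which may differ in your environment:

SELECT FORMAT_DATE("%Y%m%d", EXTRACT(DATE FROM creation_time)) AS creation_date
      , user_email
      , job_type
      , statement_type
      , TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_seconds
      , state
      , error_result.message AS error_message
      , total_bytes_processed
      , total_slot_ms
      , total_bytes_billed
      -- Assumption: on-demand pricing of ~6.25 USD per TiB billed
      , total_bytes_billed / POW(1024, 4) * 6.25 AS estimated_cost_usd
      , CONCAT(rt.project_id, ".", rt.dataset_id, ".", rt.table_id) AS referenced_table
      , destination_table.project_id AS destination_project_id
      , destination_table.dataset_id AS destination_dataset_id
      , destination_table.table_id AS destination_table_id
      -- Looker Studio normally labels its jobs with looker_studio_report_id
      , (SELECT value FROM UNNEST(labels) WHERE key = "looker_studio_report_id") AS looker_studio_report_id
      -- Assumption: scheduled queries run with a job_id starting with "scheduled_query"
      , STARTS_WITH(job_id, "scheduled_query") AS is_scheduled_query
FROM `region-eu`.INFORMATION_SCHEMA.JOBS
LEFT JOIN UNNEST(referenced_tables) AS rt
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 180 DAY)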



Table detail

Consolidates structure, size, usage and technical configuration of all tables in the project. It merges multiple sources to enrich each table with key metadata.


What’s it for?

  • Audits logical/physical size, partitioning, clustering, and number of columns.

  • Assesses whether tables are properly documented (description, labels, tags).

  • Detects inactivity, inefficient or poorly designed structures, and applies rules to generate automatic recommendations such as delete, optimise, document, etc.


Resulting table:

  • table_id: Unique identifier of the table

  • total_logical_gb / total_physical_gb: Logical and physical size in GB

  • total_rows: Total number of rows

  • naming_snake_case: Whether the table name follows naming conventions

  • num_columns: Number of columns in the table

  • description, labels, tags: Useful metadata for data cataloguing

  • last_date_query: Last time the table was queried

  • storage_last_modified_date: Last actual content update

  • query_made_by: Who queried the table (user, Looker, or process)

  • days_since_last_query / days_since_last_update: Key metrics for data clean-up

  • recommendation: Suggested action (delete, document, optimise, etc.)

Code:
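
The published query is considerably longer; the following is a simplified sketch of the join logic, where the recommendation branches are placeholders standing in for the rule set sketched earlier:

WITH storage AS (
  SELECT CONCAT(project_id, ".", table_schema, ".", table_name) AS table_id
        , total_rows
        , total_logical_bytes  / POW(1024, 3) AS total_logical_gb
        , total_physical_bytes / POW(1024, 3) AS total_physical_gb
        , EXTRACT(DATE FROM storage_last_modified_time) AS storage_last_modified_date
  FROM `region-eu`.INFORMATION_SCHEMA.TABLE_STORAGE
  WHERE NOT deleted   -- ignore tables that only survive in time travel
)
, tbl_options AS (
  SELECT CONCAT(table_catalog, ".", table_schema, ".", table_name) AS table_id
        , MAX(IF(option_name = "description", option_value, NULL)) AS description
        , MAX(IF(option_name = "labels", option_value, NULL)) AS labels
  FROM `region-eu`.INFORMATION_SCHEMA.TABLE_OPTIONS
  GROUP BY table_id
)
, tbl_columns AS (
  SELECT CONCAT(table_catalog, ".", table_schema, ".", table_name) AS table_id
        , COUNT(*) AS num_columns
  FROM `region-eu`.INFORMATION_SCHEMA.COLUMNS
  GROUP BY table_id
)
, activity AS (
  SELECT CONCAT(rt.project_id, ".", rt.dataset_id, ".", rt.table_id) AS table_id
        , MAX(EXTRACT(DATE FROM creation_time)) AS last_date_query
  FROM `region-eu`.INFORMATION_SCHEMA.JOBS
  , UNNEST(referenced_tables) AS rt
  GROUP BY table_id
)
SELECT s.table_id
      , s.total_logical_gb
      , s.total_physical_gb
      , s.total_rows
      -- Naming check: lower-case letters, digits and underscores only
      , REGEXP_CONTAINS(SPLIT(s.table_id, ".")[OFFSET(2)], r"^[a-z][a-z0-9_]*$") AS naming_snake_case
      , c.num_columns
      , o.description
      , o.labels
      , s.storage_last_modified_date
      , a.last_date_query
      , DATE_DIFF(CURRENT_DATE(), a.last_date_query, DAY) AS days_since_last_query
      -- Simplified placeholder for the CASE WHEN rule set sketched earlier
      , CASE
          WHEN a.last_date_query IS NULL THEN "monitor"   -- no queries in the JOBS window
          WHEN o.description IS NULL     THEN "document"
          ELSE "no action"
        END AS recommendation
FROM storage AS s
LEFT JOIN tbl_options AS o USING (table_id)
LEFT JOIN tbl_columns AS c USING (table_id)
LEFT JOIN activity    AS a USING (table_id)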



View detail

Extracts information about views created in the project: their origin (base table), structure, documentation, and recent usage.


What’s it for?

  • Helps detect views that haven’t been queried in a long time or have poorly defined names.

  • Makes it easier to check whether views are documented or reference non-existent objects.

  • Generates recommendations such as delete view, add description, or change name, based on defined rules.


Resulting table:

  • view_id / view_name: Full name and logical name of the view

  • origin_table / origin_table_group: Base table from which the view is built

  • naming_snake_case: Whether it follows a naming convention

  • description, labels, tags: Metadata for documentation

  • last_date_query, query_made_by: Last time it was used, and by whom

  • days_since_last_query: Time since the last execution

  • num_columns: Structural complexity

  • recommendation: Suggested action (delete, document, rename, etc.)

Code:
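
As before, the repository version is the reference; the following is a minimal sketch that assumes the origin table can be pulled out of view_definition with a simple regular expression (which will not cover every view):

SELECT CONCAT(v.table_catalog, ".", v.table_schema, ".", v.table_name) AS view_id
      , v.table_name AS view_name
      -- Assumption: the first backtick-quoted reference in the SQL text is the origin table
      , REGEXP_EXTRACT(v.view_definition, r"`([^`]+)`") AS origin_table
      , REGEXP_CONTAINS(v.table_name, r"^[a-z][a-z0-9_]*$") AS naming_snake_case
      , MAX(IF(o.option_name = "description", o.option_value, NULL)) AS description
      , COUNT(DISTINCT c.column_name) AS num_columns
FROM `region-eu`.INFORMATION_SCHEMA.VIEWS AS v
LEFT JOIN `region-eu`.INFORMATION_SCHEMA.TABLE_OPTIONS AS o
       ON o.table_catalog = v.table_catalog
      AND o.table_schema  = v.table_schema
      AND o.table_name    = v.table_name
LEFT JOIN `region-eu`.INFORMATION_SCHEMA.COLUMNS AS c
       ON c.table_catalog = v.table_catalog
      AND c.table_schema  = v.table_schema
      AND c.table_name    = v.table_name
GROUP BY ALL
-- last_date_query, query_made_by and days_since_last_query would come from JOBS,
-- joined exactly as in the table-detail sketch above.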


The full code is published in this public GitHub repository: Link to repository

BrotDani Model Schema

There’s no magic behind the BrotDani Method—just structure. A relational model was built, where tables are connected through primary and foreign keys. This setup brings together real usage, technical structure, and descriptive metadata into a final table that enables clear recommendations and informed decision-making.


Snapshot of the data model in Power BI

What every data steward wants

This approach goes beyond technical auditing: it provides an automated and scalable way to cover many of the core responsibilities a tactical data steward should handle.


Specifically, it addresses:

  • Usage and activity auditing: Understand which tables and views are used, when, and by whom — and detect abandoned resources.

  • Data documentation: Check whether objects have descriptions, labels, or other metadata that clarify their purpose.

  • Naming convention control: Evaluate whether names follow a defined format (like snake_case) to ensure consistency.

  • Storage and performance optimisation: Identify opportunities to partition, cluster, or reduce columns — lowering costs and improving performance.

  • Obsolete asset removal: Combine real usage with structure and metadata to suggest deletions or revisions.

  • Living data catalogue: Build a comprehensive view of your BigQuery ecosystem with health logic per object and full traceability.


This solution covers roughly 80% of the tactical data governance work needed for BigQuery. Best of all: it’s automated, replicable, and sustainable. What often feels like a never-ending project becomes a practical system with real impact.



Too long; didn't read

Auditing BigQuery isn’t just a technical task — it’s a core practice of data governance. Without proper controls, projects quickly become cluttered with unnecessary tables, views, and datasets — driving up costs, creating confusion, and putting data quality at risk.


The BrotDani Method offers a practical and automatable approach to auditing and documenting BigQuery, using metadata alone. By querying INFORMATION_SCHEMA, it builds a relational model that combines real usage, technical structure, and descriptive metadata to generate automatic recommendations: delete, optimise, document, or monitor.


The result isn’t just a one-off diagnosis — it’s a living, scalable inventory that covers around 80%, if not more, of the tactical responsibilities of a BigQuery data steward. A concrete solution to support informed decision-making and foster a more organised, efficient, and sustainable data culture.



Thanks for reading!

And don’t forget—you can “like” this post without logging in.

A heartfelt thank you to Esther Ponce for the shared learning and support.

Hi, I'm Daniela. For many years, I was a mathematics teacher at schools and universities, and I also edited and wrote educational books. Today, I work as a Data Science Engineer and share content about mathematics, statistics, and data science on this site and social media. Creating and sharing knowledge is my way of staying connected to teaching. I strongly believe in two ideas that have always guided me: "acquiring new knowledge keeps you relevant" and "sharing is living."



