Many months ago I posted an introductory primer to analytical data warehouses, specifically targeted to folks who were familiar with data work but did not have a lot of experience working with data warehouses. In that post I made reference to a follow-up blog post that would focus more on the transformation flow within a data warehouse:
There may be many, many downstream transformations for a particular data source, and the structure of these can vary widely. There are many different design philosophies around this area of the data warehouse, but I will leave that for another post. Suffice to say, a large portion of the “thinking” work in a data warehouse is in writing these downstream transforms.
Well, this is part 1 of that promised follow-up post (finally!). There is so much to say about the transformation flow that I’ve had to break it up into 3 posts: Part 1 will attempt to demonstrate that you are likely already familiar with the transformation flow in other tools, and relate your experience back to how it would work within data warehouses. Part 2 will go over the major data modeling design paradigms, which provide a common structure for these transformations. And part 3 will focus on how these transformations and data models can be put together and implemented in the data warehouse (this is the one that will focus on dbt).
These posts will continue to be in the same spirit with the same target audience as the initial one: this is written for those of you who work with data but have not yet had a chance (or perhaps have recently begun) to work on a team that uses a data warehouse. I’m going to assume that you’ve started to develop intuitions about how to clean, wrangle, organize, and analyze data, and you want some help jumpstarting your ability to work professionally on a data team that has a data warehouse. And while the previous post introduced some broad principles for what data warehouses are, why you would want to use them, and how they tend to be used, this post will start to dive deeper into the transformation and modeling flow of data within the warehouse - the bread and butter of the “analytics engineer”.
A philosophical aside
Throughout this post, I want you to pay attention to how I talk about the transformation flow. I will use words like “intuition”, “design”, “philosophies”, “choices”, etc. The underlying point is this: data modeling - the thing you are doing when you implement specific transformations - is a creative act. There are no “right” or “wrong” choices (though there are certainly “better” and “worse” choices for your specific context and use case), and with time and practice you can grow your data modeling craft as a creative knowledge worker.
Please note that when I say “creative”, I mean it in the sense of thinking about a concept or solving a problem in a new/different way, not in the imaginative or fantastical sense. Think of data modeling in the same way that you would think of mapmaking: you are creating a simplified and useful representation of the landscape that captures enough detail that it can serve a variety of use cases, while not including so much detail that the map becomes a mess of illegible scribbles. “The map is not the territory” - in other words, the model should not be confused with the reality it is representing. Models are imperfect, limited abstractions that must be interpreted. A model that perfectly and completely describes its reality has no utility at all (besides being impossible) - we simplify and choose what to emphasize because our minds have limits. To go back to maps: would you rather navigate while driving with your map application in “map” view or “satellite” view?
So remember: the work of translating perceived reality into a limited representation that can be described in a machine-readable format, and then further transforming, integrating, and analyzing that data into a human-digestible format that can then be used to influence how others perceive reality (and thus their actions, which may affect reality)… is deeply philosophical, creative, and intellectually rewarding work (and hopefully impactful!).
Furthermore, it is important to remember that all of your more abstract data work should be grounded in the reality that is your source of truth and the impact that you want your data work to have. So it helps to have both knowledge of and interest in the reality you are trying to model and influence. If you are still learning and looking for practice projects, or at the start of your career and looking for jobs, don’t make the mistake of thinking all data work is the same. Focus on the subjects that interest you the most and that you have specialized knowledge in - and the parts of the world where you most want to make an impact. For me, that domain is (generally speaking) government services and international development. My favorite data sources (and the focus of my upcoming personal project) are about peace/conflict (a.k.a. wars, and preventing/resolving them) and comparing political systems. What domains and data sources are you most interested in?
All of that to say - it is impossible for any one person, blog post, book, or methodology to tell you how to transform and organize your data. What I hope to accomplish in this post is to build upon the intuitions you likely already have, and provide some structure and guiderails for how to go further. In the following posts, I will point to some resources you can use to dive even deeper into the many rabbit holes that exist in the realm of data modeling. But in the end, it is up to you to become the expert on your data and make creative data modeling choices to craft your transformations - and hopefully inform and influence the decisions and actions people take in a positive, productive way.
What’s in a table? (and what’s in a CTE?)
If you read the previous post, then you know that data flows through many different tables and views in the data warehouse, from source to final data/analytical product. However, beyond a few basic rules like making sure your first transform is simple and has a 1:1 relationship with its source table, I did not go into just how these transforms get divided up into tables and organized in the warehouse. Figuring out when a particular transform should be materialized as a table/view, or just broken out into a CTE in your larger query, is often a matter of intuition (intuition founded on learned and internalized guidelines). This can be frustrating for beginners, but in this section I want to show that you probably already have developed some of that intuition - you just developed it in a different context.
Before trying to bridge from intuition you developed either by writing code in notebooks or applying formulas in spreadsheets, let’s define a few key terms and note some caveats.
Transformations. I’ve used this term quite a few times already, and if you’re not sure what it means you’re probably already confused. A transformation simply refers to any type of change you are applying to a dataset. Changing the shape of data by joining two tables or aggregating a table is a substantial transformation, while applying functions to columns like trimming whitespace or changing a text field into a datetime are simpler transformations. Adding data by appending rows or updating values is not a transformation. Essentially, any SQL select query that is persisted in the database (the result creates a new table/view) is a transformation.
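As a minimal sketch of that definition (using Python’s built-in sqlite3 module and an invented `raw_orders` table - the names and data are mine, not from any real warehouse), a simple column-level transformation persisted as a view looks like this:

```python
import sqlite3

# In-memory database with a hypothetical raw source table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, status TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "  shipped ", "19.99"), (2, "SHIPPED", "5.00")],
)

# A simple transformation: trim whitespace, normalize case, and cast a
# text column to a number. Persisting this select query as a view is
# what makes it a "transformation" in the sense used above.
conn.execute("""
    CREATE VIEW orders_clean AS
    SELECT
        id,
        lower(trim(status)) AS status,
        CAST(amount AS REAL) AS amount
    FROM raw_orders
""")

rows = conn.execute("SELECT * FROM orders_clean ORDER BY id").fetchall()
print(rows)  # both rows now have status 'shipped' and numeric amounts
```

Note that the `raw_orders` source table is never modified - the cleaned result lives alongside it as `orders_clean`.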
Tables, views, and CTEs. A table in a database contains physical data structured by rows & columns. A view is a saved query, and doesn’t actually contain data. A “Common Table Expression” (CTE) is similar to a subquery in that it is a self-contained SQL select query inside of a larger SQL query, but while subqueries can exist within a clause of the main query (e.g. the from clause), CTEs are pulled out, ordered, and the first CTE is preceded by the with keyword. CTEs are more readable, maintainable, and DRY (a CTE can be referenced multiple times in later queries, while a subquery must be repeated). If CTEs are a new concept for you, here is a good introductory article.
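To make the shape of a CTE concrete, here is a hedged sketch (again via Python’s stdlib sqlite3, with an invented `sales` table) of two chained CTEs, where the second reads from the first and the main query reads from the second:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("north", 50.0), ("south", 30.0)],
)

# Two CTEs introduced by a single WITH keyword: 'regional_totals'
# aggregates the source, and 'big_regions' filters that aggregate.
query = """
WITH regional_totals AS (
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
),
big_regions AS (
    SELECT region, total
    FROM regional_totals
    WHERE total > 50
)
SELECT region, total FROM big_regions ORDER BY region
"""
result = conn.execute(query).fetchall()
print(result)  # only 'north' (total 150.0) passes the filter
```

Writing the same logic as nested subqueries would bury the aggregation inside the filter’s from clause; the CTE version reads top-to-bottom in the order the transformations happen.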
You should already know SQL. If you are unfamiliar with SQL, then I’d suggest you revisit this post (and the previous one) after you have learned the basics of SQL and databases. If there is interest, I can write a separate blog post focused on how to learn SQL and the many free online resources available (as well as excellent books to borrow/buy). For now, if you’re still reading, I’m going to assume you are comfortable with writing SQL select queries.
Bring your own examples. This post does not really have any specific examples to illustrate the transformation flow. Instead, I’m going to be relying on you, the reader, to think back to past projects and fit the data & code you are intimately familiar with to approaches I’m describing. The goal of this post is not to teach you how to do a specific technical thing, but rather to help develop a general mindset and approach to transforming data within a data warehouse. Plus, I’d much rather demonstrate the entire process by blogging my way through a personal project than coming up with a few trite examples - so there may not be any short examples right now, but there will be one very long example in the future. (Please don’t hate - I’m mostly writing these posts for the fun of it!)
Burgeoning data practitioners typically come to data warehouses having already developed some expertise in manipulating data in one of two paradigms: dataframes or spreadsheets. Pick which of the following sections to read depending on which best describes you.
For the pandas (or polars/dplyr/etc) practitioner
I first want to revisit a paragraph from the previous blog post:
If you have worked on enough code-based analytics projects, you have probably learned the value of maintaining a proper data transformation flow. This may show up in how you organize your data folder: you have your “raw” data that is never to be altered, your cleaned/processed data that is an intermediate output, and your final data. You learn the value of using code to transform data - it means all transformations are documented, replicable, and easy to alter. Your notebooks/scripts are a mix of exploration, small cleaning tasks, and big transformations. You may have some functions in there for transformations that need to be performed repeatedly or to encapsulate more complicated code. You are displaying chunks of dataframes in order to visually examine the results and decide what needs to be done next. You are also saving dataframes to CSVs to be used elsewhere. All of these elements show up in the analytical data warehouse as well - they just look a bit different.
Let’s dive deeper into this comparison: the analytical data flow of pandas code in jupyter notebooks and data in files vs the analytical data flow of SQL code in a dbt project and data in database tables.
I’m going to assume that if you chose this section you are already familiar with manipulating data using pandas (or polars, or dplyr) in jupyter (or R markdown) notebooks in a git-tracked repo, and it is how this process works in data warehouses that you are not sure about. If that is not the case, then this section is probably not useful for you, as there is nothing to create a conceptual bridge from. Feel free to skip ahead to the next section or come back later.
It may be useful for you to pause reading and go back to look at your last project. Scan through your notebooks with a critical eye. Don’t focus on the individual lines of code, but rather the overall structure. How did you organize your data files? How did you organize your notebooks? What code did you chunk together into one cell, and what code did you pull out so you could run it separately? When did you display your data, and what were you looking for? What types of transformations get saved to the same dataframe variable name (overwriting the dataframe), and what types of transformations get saved to a new variable name? Why did you choose to use a new variable name, and what naming schema organically developed for your variable names? When did you save your data to a file, and why? What code has been organized into functions, and why did you decide to do that? At what point in the development process did you decide to make that change? What notes/documentation did you include, in code comments or markdown cells? How different is your final completed notebook from how it looked during the development process? Did you pull out any exploratory notebook code and formalize it in a python script? How does the code in the python file compare to the code (that presumably accomplishes the same or a similar goal) in the notebook?
Seriously, take a few minutes to pause reading, pull up GitHub or your local repo, examine your project, and try to answer these questions before continuing. It will help, and this blog post isn’t going to disappear.
Caveats and code review concluded, let’s proceed.
Your notebook likely follows a pattern like this: first, you import the data files you need and save it to a dataframe. Then, you apply some set of transformations, saving it to some new variable name. Then, you might apply a different set of transformations, and save those to a new variable name. Finally, you will save your transformed data to a file - perhaps at multiple points in the process.
It’s useful to save sets of transformations to different dataframes because each time you tweak your code you have to re-run all of the transformations relevant for that dataframe variable name (or risk being unable to re-run your notebook and get the same result). You don’t want a new variable name for every single transformation, because that gets unwieldy, and you don’t want to use the same variable name throughout the notebook, because some blocks of code take longer to run and it becomes tiresome to re-run the whole notebook every time you make a tweak. So, you naturally develop some conventions for when to save a dataframe to a new variable name. Those groups of transformations that get applied to the same variable name - those are the same groups of transformations that correspond to CTEs. CTEs serve two purposes: (1) they allow you to order and chain many transformations, allowing the final overall query to accomplish more than one simple query alone; and (2) they make complex SQL code more readable, by logically grouping together transformations and assigning the result a meaningful name. So, think of your CTE aliases as being analogous to your dataframe variable names, and CTEs as accomplishing the same conceptual chunking that each successive dataframe iteration does.
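To illustrate that correspondence (a sketch with invented names - the dataframe lines in the comments are hypothetical notebook code, and the query runs against a made-up `raw_events` table via stdlib sqlite3), each CTE alias plays exactly the role a dataframe variable name would:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_id INTEGER, event TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [(1, "click"), (1, "click"), (2, "view"), (2, "click")],
)

# Each CTE alias maps to a dataframe variable name from a notebook, e.g.:
#   df_clicks = df_raw[df_raw.event == "click"]         -> clicks
#   df_counts = df_clicks.groupby("user_id").size()     -> click_counts
query = """
WITH clicks AS (
    SELECT user_id, event
    FROM raw_events
    WHERE event = 'click'
),
click_counts AS (
    SELECT user_id, COUNT(*) AS n_clicks
    FROM clicks
    GROUP BY user_id
)
SELECT user_id, n_clicks FROM click_counts ORDER BY user_id
"""
counts = conn.execute(query).fetchall()
print(counts)  # one row per user with their click count
```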
Why save a dataframe to a CSV file? You may have reached a point where you want to share the resulting transformed data with someone else. Or, you may want to use that data in another notebook. Or, you may be ready to do something else with that data - make a visualization, run a statistical analysis, etc. Regardless, you have identified that the data has been sufficiently transformed that it needs to be preserved in that state in a more permanent and reusable way than just an ephemeral dataframe that will disappear when the notebook is shut down. It is at this same point that you should conclude the set of transformations performed by a SQL query, and save the resulting data to the database as a table/view. This will allow others to query the transformed data - for their own subsequent queries, for visualization, for exploratory analysis, for productionized ML pipelines, etc. So, think of saving a dataframe to a CSV as being analogous to saving a query’s output to a table/view.
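The database-side equivalent of that “save to CSV” moment is a single statement. A minimal sketch (invented `raw_measurements` table, stdlib sqlite3): persist the query’s output as a named view, and anyone else can query it by name without knowing your transformation logic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_measurements (sensor TEXT, reading REAL)")
conn.executemany(
    "INSERT INTO raw_measurements VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
)

# The database equivalent of "save this dataframe to a CSV": persist the
# query's output under a name that others can reuse.
conn.execute("""
    CREATE VIEW sensor_averages AS
    SELECT sensor, AVG(reading) AS avg_reading
    FROM raw_measurements
    GROUP BY sensor
""")

# A colleague (or a BI tool) only needs the view's name, not your query.
shared = conn.execute(
    "SELECT * FROM sensor_averages ORDER BY sensor"
).fetchall()
print(shared)  # one averaged row per sensor
```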
In short, data goes through many transformations. These transformations tend to be chunked together, due to necessity or conceptual similarity. At some point the transformation is substantial enough that the resulting data has its own value, and that state of the transformed data should be preserved for multiple future uses. This is true both in the dataframe flow of pandas code in jupyter notebooks, and the database flow of SQL queries making tables and views.
For the Excel (or Sheets) Enthusiast
Spreadsheets remain the most common way most people interact with data. They are intuitive, easy to use, and extremely flexible and powerful. For the vast majority of use cases, spreadsheets are sufficient. But if you are reading this, then you have probably already bumped up against some constraints and know that databases can solve your spreadsheet woes.
Have you ever accidentally deleted the content of a cell without realizing it? Deleted some cells, only to realize that now the rest of your data has been shifted out of order as well? Realized that your dataset has exactly 1,048,576 rows… and some data is missing? Maybe you have manually, laboriously, transformed your raw data into something analyzable… only to learn that new data is available that needs to go through the exact same process? Or you get handed a spreadsheet to analyze, and the organization of the sheets makes you question why you decided to work with data in the first place? Do you have a folder full of the exact same data, with some kind of versioning note in the file name (_v2, _FINAL, _FINALFINAL, _2022, _2020-2021, _thisyear, _forCDO, _forJane, etc)? At some point in your data journey, you’re going to look at all of your spreadsheets and think… there has to be a better way.
Congratulations - there is! But if you have worked only with spreadsheets for your whole career, even the word “database” may be intimidating (or, you may be thinking, I already made a database with my spreadsheets!). Big, complex, analytical data warehouses may not initially seem like a viable solution for you, but I promise that if you have already started learning SQL then you are ready to start making that transition. Let’s talk about how the work you are currently doing translates over to the process you will follow when transforming data in the data warehouse, and the advantages this shift will offer.
First, rather than getting emailed a spreadsheet to transform and analyze, that data should already be present in the database as a source table - you just need to know what the table name is. (If the data is not already in the database, that’s probably someone else’s job to get it in there and make sure it gets updated on a regular schedule). Next, you will need to write a SQL query in order to transform it. Think about the transformation process you would follow in your spreadsheet: you may do some data cleaning by making sure that a categorical column has only the set of values that should exist (correcting misspellings, removing whitespace, etc). You may turn a text column into a date column (if Excel hasn’t already tried to do it for you - good luck). You may do some math on your numerical values by creating a new column and using a formula. You may then need to do further math by subtracting that new column from another column. You may need to create a pivot table after these cleaned and transformed columns are created. You may need to create a new sheet with only a subset of the original data, or a new sheet that combines data together from two other sheets. Some of these transformations are small/intermediate - they are needed in order to get the data to its final form, and are typically done by modifying data inplace or creating a new column. Some transformations are more substantial - the result gets its own sheet. Usually transformations have to be done in a specific order. At various points you may save and share the resulting data by emailing the modified spreadsheet to a colleague, or start creating inline visualizations based on this transformed data.
Think about at what point during the transformation process the data reaches a “final” state (which may happen multiple times in the overall process). You want to share out or visualize this transformed data. In a SQL transformation flow, the group of transformations you need to get to that point will compose your overall query, and result in a table or view being created in the database. Rather than emailing the spreadsheet, you can just let your colleague know the new table name, and they can then query that table. You can query that table with a BI tool in order to visualize it. And remember - all of those transformations were done with a SQL query, and the original source table remains unchanged. This means that if the source table gets updated with new data, you can simply re-run the SQL query, all of your transformations will get applied, and your transformed table will have the new data. If you made a mistake or need to modify your query, you can simply update the SQL and re-run the query. This is only possible because the source table remains unchanged.
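That “re-run against an unchanged source” property is worth seeing in miniature. In this sketch (invented `source_orders` table, Python’s stdlib sqlite3), the transformation lives as a view, so when new rows land in the source, the next read reflects them automatically; a materialized table would instead need its create-table-as-select re-run:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_orders (region TEXT, amount REAL)")
conn.execute("INSERT INTO source_orders VALUES ('east', 100.0)")

# The transformation lives as a view; the source table stays untouched.
conn.execute("""
    CREATE VIEW region_totals AS
    SELECT region, SUM(amount) AS total
    FROM source_orders
    GROUP BY region
""")
before = conn.execute("SELECT * FROM region_totals").fetchall()

# New data arrives in the source table...
conn.execute("INSERT INTO source_orders VALUES ('east', 50.0)")

# ...and the same transformation applies automatically on the next read.
# (A materialized table would need its CREATE TABLE ... AS re-run instead.)
after = conn.execute("SELECT * FROM region_totals").fetchall()
print(before, after)  # the total grows without touching the query
```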
Now consider all of the transformations that would make up that overall query, and group them into stages. First, this group of transformations must happen. Then, this next group of transformations must happen - but they can only be done after the first group of transformations is done. Maybe you want to break those stages into smaller groups because certain transformations seem to form natural groups, and it makes more narrative sense for them to be separate. These groups of transformations, whether due to the necessity of the order they must be performed or the nature of the transformations, would be equivalent to the CTEs that make up your overall query. CTEs are your query building blocks - each one is complete and sufficient on its own, but stack them together and you can build a more complex overall structure. Chunking transformations together conceptually also helps your overall query to be more readable - after all, you don’t just want the transformation to happen, you also want other people to understand what the transformation is doing.
In short, data goes through many transformations. These transformations tend to be chunked together, due to necessity or conceptual similarity. At some point the transformation is substantial enough that the resulting data has its own value, and that state of the transformed data should be preserved for multiple future uses. This is true both in the spreadsheet flow of functions and pivot tables, and the database flow of SQL queries making tables/views.
Did the conceptual bridges hold?
I hope that these comparisons made sense and you can more easily see how the data transformation work you’ve already been performing translates over to the SQL transformations you will craft to build out your data warehouse. However, so far these bridges are only built out of theoretical foundations - you’ll have to finish the rest of the job by practicing using SQL to transform raw input data into actionable output datasets, and all of the intermediate tables along the way. Most of this will only make sense after you’ve had experience building data models in your own data warehouse. Only with practice can you turn abstract theoretical knowledge into internalized tacit knowledge.
If that still sounds intimidating, and you’re not sure how to get started with your own personal data warehouse project, you are in luck! I have my own personal project that I’ve been meaning to work on for quite a while now, and I’ve decided to write a series of blog posts along the way documenting my process (and motivating my progress). Stay tuned for that series in the coming months.
I want to note that so far, we have been focused on scenarios where we know where we are and where we want to go. We know what the raw source data looks like, and we know what the transformed data needs to look like in order to accomplish a specific purpose (such as creating a visualization, calculating a metric, or sharing summarized data). This is the most fundamental and impactful type of data design. Shifting transformations that you may have performed in a manual, isolated, undocumented way into a data warehouse paradigm where the transformation process can be automated, transparent, and documented is inherently valuable and a necessary first step. You know your data, you are analyzing your data, and you know how to get your data from point A to point B (and point C, D, etc). You know exactly what this data will be used for, what the purposes of the transformations are, and you can see your project through the entire data lifecycle.
However, one drawback of this purpose-driven design paradigm is that it does not scale very well. On large data teams, there may be different people assigned to each task: ingesting, transforming, documenting, visualizing, and analyzing the data. Those tasked with transforming the data may not know all of the many purposes the data may need to be designed for. Furthermore, the data warehouse will need to house many different datasets, for many different purposes, transformed by different people at different times for different projects. If every dataset is only transformed for the initial known purpose, over time the design of the data warehouse will become rather chaotic (this is how you get a spaghetti DAG). Plus, it’s very likely that most of the data will no longer be in use, data transformations will have been duplicated as slightly different use cases arise, and users may not even realize that the data they need is already in the warehouse (though perhaps not in the form they need). This is not necessarily a bad problem to have, as it means that your data warehouse has survived for a few years and has been actively used by many different people. However, at some point it really does become a problem, and someone is going to look at the data spaghetti and think, “there must be a better way.”
And of course, there is. In fact, there are quite a few you can choose from or combine in whatever way suits the larger purpose of your data warehouse. In part 2 (coming soon!) I will cover a few of the most important data modeling design paradigms in broad strokes and link to resources where you can learn more about each one.