Many months ago I posted an introductory primer to analytical data warehouses, specifically targeted to folks who were familiar with data work but did not have a lot of experience working with data warehouses. In that post I made reference to a follow-up blog post that would focus more on the transformation flow within a data warehouse:
There may be many, many downstream transformations for a particular data source, and the structure of these can vary widely. There are many different design philosophies around this area of the data warehouse, but I will leave that for another post. Suffice to say, a large portion of the “thinking” work in a data warehouse is in writing these downstream transforms.
Well, this is part 1 of that promised follow-up post (finally!). There is so much to say about the transformation flow that I’ve had to break it up into 3 posts: Part 1 will attempt to demonstrate that you are likely already familiar with the transformation flow in other tools, and relate your experience back to how it would work within data warehouses. Part 2 will go over the major data modeling design paradigms, which provide a common structure for these transformations. And part 3 will focus on how these transformations and data models can be put together and implemented in the data warehouse (this is the one that will focus on dbt).
These posts will continue to be in the same spirit with the same target audience as the initial one: this is written for those of you who work with data but have not yet had a chance (or perhaps have recently begun) to work on a team that uses a data warehouse. I’m going to assume that you’ve started to develop intuitions about how to clean, wrangle, organize, and analyze data, and you want some help jumpstarting your ability to work professionally on a data team that has a data warehouse. And while the previous post introduced some broad principles for what data warehouses are, why you would want to use them, and how they tend to be used, this post will start to dive deeper into the transformation and modeling flow of data within the warehouse - the bread and butter of the “analytics engineer”.
Throughout this post, I want you to pay attention to how I talk about the transformation flow. I will use words like “intuition”, “design”, “philosophies”, “choices”, etc. The underlying point is this: data modeling - the thing you are doing when you implement specific transformations - is a creative act. There are no “right” or “wrong” choices (though there are certainly “better” and “worse” choices for your specific context and use case), and with time and practice you can grow your data modeling craft as a creative knowledge worker.
Please note that when I say “creative”, I mean it in the sense of thinking about a concept or solving a problem in a new/different way, not in the imaginative or fantastical sense. Think of data modeling in the same way that you would think of mapmaking: you are creating a simplified and useful representation of the landscape that captures enough detail that it can serve a variety of use cases, while not including so much detail that the map becomes a mess of illegible scribbles. “The map is not the territory” - in other words, the model should not be confused with the reality it is representing. Models are imperfect, limited abstractions that must be interpreted. A model that perfectly and completely describes its reality has no utility at all (besides being impossible) - we simplify and choose what to emphasize because our minds have limits. To go back to maps: would you rather navigate while driving with your map application in “map” view or “satellite” view?
So remember: the work of translating perceived reality into a limited representation that can be described in a machine-readable format, and then further transforming, integrating, and analyzing that data into a human-digestible format that can then be used to influence how others perceive reality (and thus their actions, which may affect reality)… is deeply philosophical, creative, and intellectually rewarding work (and hopefully impactful!).
Furthermore, it is important to remember that all of your more abstract data work should be grounded in the reality that is your source truth and the impact that you want your data work to have. So it helps to have both knowledge and interest in that reality you are trying to model and influence. If you are still learning and looking for practice projects, or at the start of your career and looking for jobs, don’t make the mistake of thinking all data work is the same. Focus on the subjects that interest you the most and you have specialized knowledge in - and the parts of the world where you most want to make an impact. For me, that domain is (generally speaking) government services and international development. My favorite data sources (and the focus of my upcoming personal project) are about peace/conflict (a.k.a wars, and preventing/resolving them) and comparing political systems. What domains and data sources are you most interested in?
All of that to say - it is impossible for any one person, blog post, book, or methodology to tell you how to transform and organize your data. What I hope to accomplish in this post is to build upon the intuitions you likely already have, and provide some structure and guiderails for how to go further. In the following posts, I will point to some resources you can use to dive even deeper into the many rabbit holes that exist in the realm of data modeling. But in the end, it is up to you to become the expert on your data and make creative data modeling choices to craft your transformations - and hopefully inform and influence the decisions and actions people take in a positive, productive way.
If you read the previous post, then you know that data flows through many different tables and views in the data warehouse, from source to final data/analytical product. However, beyond a few basic rules like making sure your first transform is simple and has a 1:1 relationship with its source table, I did not go into just how these transforms get divided up into tables and organized in the warehouse. Figuring out when a particular transform should be materialized as a table/view, or just broken out into a CTE in your larger query, is often a matter of intuition (intuition founded on learned and internalized guidelines). This can be frustrating for beginners, but in this section I want to show that you probably already have developed some of that intuition - you just developed it in a different context.
Before trying to bridge from intuition you developed either by writing code in notebooks or applying formulas in spreadsheets, let’s define a few key terms and note some caveats.
Transformations. I’ve used this term quite a few times already, and if you’re not sure what that means you’re probably already confused. A transformation simply refers to any type of change you are applying to a dataset. Changing the shape of data by joining two tables or aggregating a table is a substantial transformation, while applying functions to columns like trimming whitespace or changing a text field into a datetime are simpler transformations. Adding data by appending rows or updating values is not a transformation. Essentially, any SQL select query that is persisted in the database (the result creates a new table/view) is a transformation.
Tables, views, and CTEs. A table in a database contains physical data structured by rows & columns. A view is a saved query, and doesn’t actually contain data. A “Common Table Expression” (CTE) is similar to a subquery in that it is a self-contained SQL select query inside of a larger SQL query, but while subqueries can exist within a clause of the main query (e.g. the select or from clause), CTEs are pulled out, ordered, and the first CTE is preceded by a with keyword. CTEs are more readable, maintainable, and DRY (a CTE can be referenced multiple times later in the same query, while a subquery must be repeated). If CTEs are a new concept for you, here is a good introductory article.
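To make the subquery vs. CTE distinction concrete, here is a minimal sketch that runs the same logic both ways. It uses Python's built-in sqlite3 module with an in-memory database purely so the SQL can be executed; the orders table and its columns are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table orders (order_id integer, customer text, amount real);
    insert into orders values
        (1, 'ana', 20.0), (2, 'ana', 30.0), (3, 'ben', 5.0);
""")

# Subquery version: the inner select lives inside the from clause.
subquery_sql = """
    select customer, total
    from (
        select customer, sum(amount) as total
        from orders
        group by customer
    ) as customer_totals
    where total > 10
    order by customer
"""

# CTE version: the same inner select is pulled out and named after `with`.
cte_sql = """
    with customer_totals as (
        select customer, sum(amount) as total
        from orders
        group by customer
    )
    select customer, total
    from customer_totals
    where total > 10
    order by customer
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()
print(subquery_rows == cte_rows, cte_rows)
```

Both queries return the same rows; the CTE version just reads top to bottom, and customer_totals could be referenced again further down without repeating the inner select.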
You should already know SQL. If you are unfamiliar with SQL, then I’d suggest you revisit this post (and the previous one) after you have learned the basics of SQL and databases. If there is interest, I can write a separate blog post focused on how to learn SQL and the many free online resources available (as well as excellent books to borrow/buy). For now, if you’re still reading, I’m going to assume you are comfortable with writing SQL select queries.
Bring your own examples. This post does not really have any specific examples to illustrate the transformation flow. Instead, I’m going to be relying on you, the reader, to think back to past projects and fit the data & code you are intimately familiar with to approaches I’m describing. The goal of this post is not to teach you how to do a specific technical thing, but rather to help develop a general mindset and approach to transforming data within a data warehouse. Plus, I’d much rather demonstrate the entire process by blogging my way through a personal project than coming up with a few trite examples - so there may not be any short examples right now, but there will be one very long example in the future. (Please don’t hate - I’m mostly writing these posts for the fun of it!)
Burgeoning data practitioners typically come to data warehouses having already developed some expertise in manipulating data in one of two paradigms: dataframes or spreadsheets. Pick which of the following sections to read depending on which best describes you.
I first want to revisit a paragraph from the previous blog post:
If you have worked on enough code-based analytics projects, you have probably learned the value of maintaining a proper data transformation flow. This may show up in how you organize your data folder: you have your “raw” data that is never to be altered, your cleaned/processed data that is an intermediate output, and your final data. You learn the value of using code to transform data - it means all transformations are documented, replicable, and easy to alter. Your notebooks/scripts are a mix of exploration, small cleaning tasks, and big transformations. You may have some functions in there for transformations that need to be performed repeatedly or to encapsulate more complicated code. You are displaying chunks of dataframes in order to visually examine the results and decide what needs to be done next. You are also saving dataframes to CSVs to be used elsewhere. All of these elements show up in the analytical data warehouse as well - they just look a bit different.
Let’s dive deeper into this comparison: the analytical data flow of pandas code in jupyter notebooks and data in files vs the analytical data flow of SQL code in a dbt project and data in database tables.
I’m going to assume that if you chose this section you are already familiar with manipulating data using pandas (or polars, or dplyr) in jupyter (or R markdown) notebooks in a git-tracked repo, and it is how this process works in data warehouses that you are not sure about. If that is not the case, then this section is probably not useful for you, as there is nothing to create a conceptual bridge from. Feel free to skip ahead to the next section or come back later.
It may be useful for you to pause reading and go back to look at your last project. Scan through your notebooks with a critical eye. Don’t focus on the individual lines of code, but rather the overall structure. How did you organize your data files? How did you organize your notebooks? What code did you chunk together into one cell, and what code did you pull out so you could run it separately? When did you display your data, and what were you looking for? What types of transformations get saved to the same dataframe variable name (overwriting the dataframe), and what types of transformations get saved to a new variable name? Why did you choose to use a new variable name, and what naming schema organically developed for your variable names? When did you save your data to a file, and why? What code has been organized into functions, and why did you decide to do that? At what point in the development process did you decide to make that change? What notes/documentation did you include, in code comments or markdown cells? How different is your final completed notebook from how it looked during the development process? Did you pull out any exploratory notebook code and formalize it in a python script? How does the code in the python file compare to the code (that presumably accomplishes the same or a similar goal) in the notebook?
Seriously, take a few minutes to pause reading, pull up GitHub or your local repo, examine your project, and try to answer these questions before continuing. It will help, and this blog post isn’t going to disappear.
Caveats and code review concluded, let’s proceed.
Your notebook likely follows a pattern like this: first, you import the data files you need and save them to dataframes. Then, you apply some set of transformations, saving the result to some new variable name. Then, you might apply a different set of transformations, and save those to a new variable name. Finally, you will save your transformed data to a file - perhaps at multiple points in the process.
It’s useful to save sets of transformations to different dataframes because each time you tweak your code you have to re-run all of the transformations relevant for that dataframe variable name (or risk being unable to re-run your notebook and get the same result). You don’t want a new variable name for every single transformation, because that gets unwieldy, and you don’t want to use the same variable name throughout the notebook, because some blocks of code take longer to run and it becomes tiresome to re-run the whole notebook every time you make a tweak. So, you naturally develop some conventions for when to save a dataframe to a new variable name. Those groups of transformations that get applied to the same variable name - those are the same groups of transformations that correspond to CTEs. CTEs serve two purposes: (1) they allow you to order and chain many transformations, allowing the final overall query to accomplish more than one simple query alone; and (2) they make complex SQL code more readable, by logically grouping together transformations and assigning the result a meaningful name. So, think of your CTE aliases as being analogous to your dataframe variable names, and CTEs as accomplishing the same conceptual chunking that each successive dataframe iteration does.
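Here is a rough sketch of that mapping, assuming a made-up raw_permits table. Each CTE plays the role of one dataframe variable name in a notebook (the pandas-style names in the SQL comments are hypothetical), and sqlite3 is only used here so the query can actually run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table raw_permits (permit_id text, neighborhood text, fee text);
    insert into raw_permits values
        ('p1', ' Roxbury ', '100'),
        ('p2', 'roxbury',   '50'),
        ('p3', 'Allston',   '75');
""")

query = """
    with cleaned as (
        -- the group of small fixes you would assign to a `cleaned` dataframe
        select
            permit_id,
            lower(trim(neighborhood)) as neighborhood,
            cast(fee as real) as fee
        from raw_permits
    ),
    fees_by_neighborhood as (
        -- the reshaping step you would assign to a new `summary` dataframe
        select neighborhood, count(*) as permits, sum(fee) as total_fee
        from cleaned
        group by neighborhood
    )
    select neighborhood, permits, total_fee
    from fees_by_neighborhood
    order by neighborhood
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The second CTE can only run after the first, just as the summary dataframe can only be built from the cleaned one.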
Why save a dataframe to a CSV file? You may have reached a point where you want to share the resulting transformed data with someone else. Or, you may want to use that data in another notebook. Or, you may be ready to do something else with that data - make a visualization, run a statistical analysis, etc. Regardless, you have identified that the data has been sufficiently transformed that it needs to be preserved in that state in a more permanent and reusable way than just an ephemeral dataframe that will disappear when the notebook is shut down. It is at this same point that you should conclude a transformation being performed by a SQL query, and save the resulting data to the database as a table/view. This will allow others to query the transformed data - for their own subsequent queries, for visualization, for exploratory analysis, for productionized ML pipelines, etc. So, think of saving a dataframe to a CSV as being analogous to saving a query’s output to a table/view.
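As a small illustration of what "saving a query's output" can look like, the sketch below persists the same select both as a table and as a view, then adds a row to the raw table afterwards. The raw_trips table and the object names are invented for the example, and the exact create syntax varies a little between databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table raw_trips (trip_id integer, minutes text);
    insert into raw_trips values (1, '12'), (2, '30');

    -- Table: the query runs once and the resulting rows are stored.
    create table trips_clean_tbl as
        select trip_id, cast(minutes as integer) as minutes from raw_trips;

    -- View: only the query is stored; it runs whenever the view is queried.
    create view trips_clean_vw as
        select trip_id, cast(minutes as integer) as minutes from raw_trips;

    -- New raw data arrives after both objects were created.
    insert into raw_trips values (3, '45');
""")

table_count = conn.execute("select count(*) from trips_clean_tbl").fetchone()[0]
view_count = conn.execute("select count(*) from trips_clean_vw").fetchone()[0]
print(table_count, view_count)
```

The table behaves like the CSV you wrote last week: it holds whatever the data looked like when it was built. The view always reflects the current raw data, at the cost of re-running the query each time.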
In short, data goes through many transformations. These transformations tend to be chunked together, due to necessity or conceptual similarity. At some point the transformation is substantial enough that the resulting data has its own value, and that state of the transformed data should be preserved for multiple future uses. This is true both in the dataframe flow of pandas code in jupyter notebooks, and the database flow of SQL queries making tables and views.
Spreadsheets remain the most common way most people interact with data. They are intuitive, easy to use, and extremely flexible and powerful. For the vast majority of use cases, spreadsheets are sufficient. But if you are reading this, then you have probably already bumped up against some constraints and know that databases can solve your spreadsheet woes.
Have you ever accidentally deleted the content of a cell without realizing it? Deleted some cells, only to realize that now the rest of your data has been shifted out of order as well? Realized that your dataset has exactly 1,048,576 rows… and some data is missing? Maybe you have manually, laboriously, transformed your raw data into something analyzable… only to learn that new data is available that needs to go through the exact same process? Or you get handed a spreadsheet to analyze, and the organization of the sheets makes you want to question why you decided to work with data in the first place? Do you have a folder full of the exact same data, with some kind of versioning note in the file name (_v2, _FINAL, _FINALFINAL, _2022, _2020-2021, _thisyear, _forCDO, _forJane, etc)? At some point in your data journey, you’re going to look at all of your spreadsheets and think… there has to be a better way.
Congratulations - there is! But if you have worked only with spreadsheets for your whole career, even the word “database” may be intimidating (or, you may be thinking, I already made a database with my spreadsheets!). Big, complex, analytical data warehouses may not initially seem like a viable solution for you, but I promise that if you have already started learning SQL then you are ready to start making that transition. Let’s talk about how the work you are currently doing translates over to the process you will follow when transforming data in the data warehouse, and the advantages this shift will offer.
First, rather than getting emailed a spreadsheet to transform and analyze, that data should already be present in the database as a source table - you just need to know what the table name is. (If the data is not already in the database, that’s probably someone else’s job to get it in there and make sure it gets updated on a regular schedule). Next, you will need to write a SQL query in order to transform it. Think about the transformation process you would follow in your spreadsheet: you may do some data cleaning by making sure that a categorical column has only the set of values that should exist (correcting misspellings, removing whitespace, etc). You may turn a text column into a date column (if Excel hasn’t already tried to do it for you - good luck). You may do some math on your numerical values by creating a new column and using a formula. You may then need to do further math by subtracting that new column from another column. You may need to create a pivot table after these cleaned and transformed columns are created. You may need to create a new sheet with only a subset of the original data, or a new sheet that combines data together from two other sheets. Some of these transformations are small/intermediate - they are needed in order to get the data to its final form, and are typically done by modifying data in place or creating a new column. Some transformations are more substantial - the result gets its own sheet. Usually transformations have to be done in a specific order. At various points you may save and share the resulting data by emailing the modified spreadsheet to a colleague, or start creating inline visualizations based on this transformed data.
Think about at what point during the transformation process the data reaches a “final” state (which may happen multiple times in the overall process). You want to share out or visualize this transformed data. In a SQL transformation flow, the group of transformations you need to get to that point will compose your overall query, and result in a table or view being created in the database. Rather than emailing the spreadsheet, you can just let your colleague know the new table name, and they can then query that table. You can query that table with a BI tool in order to visualize it. And remember - all of those transformations were done with a SQL query, and the original source table remains unchanged. This means that if the source table gets updated with new data, you can simply re-run the SQL query, all of your transformations will get applied, and your transformed table will have the new data. If you made a mistake or need to modify your query, you can simply update the SQL and re-run the query. This is only possible because the source table remains unchanged.
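A minimal sketch of that re-run loop, assuming a hypothetical source_sales table: the transformation is kept as a query, and rebuilding the output table is just a matter of running it again after the source changes. In practice a scheduler or a tool like dbt would do the rebuilding; the helper function here is only for illustration.

```python
import sqlite3

# The whole transformation lives in one query; the source table is never edited.
TRANSFORM_SQL = """
    select region, sum(cast(sales as real)) as total_sales
    from source_sales
    group by region
"""

def rebuild_sales_by_region(conn):
    """Drop and recreate the transformed table from the current source data."""
    conn.execute("drop table if exists sales_by_region")
    conn.execute("create table sales_by_region as " + TRANSFORM_SQL)

def read_sales_by_region(conn):
    sql = "select region, total_sales from sales_by_region order by region"
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("create table source_sales (region text, sales text)")
conn.executemany(
    "insert into source_sales values (?, ?)",
    [("north", "10"), ("south", "5")],
)

rebuild_sales_by_region(conn)
first = read_sales_by_region(conn)

# New data lands in the source table; re-running applies every transformation again.
conn.execute("insert into source_sales values ('north', '7')")
rebuild_sales_by_region(conn)
second = read_sales_by_region(conn)

print(first)
print(second)
```

If the query turns out to be wrong, the fix is the same: edit TRANSFORM_SQL and rebuild, because nothing upstream was modified.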
Now consider all of the transformations that would make up that overall query, and group them into stages. First, this group of transformations must happen. Then, this next group of transformations must happen - but they can only be done after the first group of transformations is done. Maybe you want to break those stages into smaller groups because certain transformations seem to form natural groups, and it makes more narrative sense for them to be separate. These groups of transformations, whether due to the necessity of the order they must be performed or the nature of the transformations, would be equivalent to the CTEs that make up your overall query. CTEs are your query building blocks - each one is complete and sufficient on its own, but stack them together and you can build a more complex overall structure. Chunking transformations together conceptually also helps your overall query to be more readable - after all, you don’t just want the transformation to happen, you also want other people to understand what the transformation is doing.
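For spreadsheet folks, here is one way those stages might look, using an invented source_expenses table. The three CTEs stand in for the cleanup you would do in place, the formula column you would add, and the pivot table you would build last. Pivot syntax differs across databases, so this sketch uses plain conditional aggregation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table source_expenses (category text, spent_on text, budget text, actual text);
    insert into source_expenses values
        (' Travel',  '2024-01-15', '100', '80'),
        ('travel ',  '2024-02-03', '100', '120'),
        ('Supplies', '2024-01-20', '50',  '40');
""")

query = """
    with cleaned as (
        -- stage 1: fix categories, turn text into dates and numbers
        select
            lower(trim(category)) as category,
            date(spent_on)        as spent_on,
            cast(budget as real)  as budget,
            cast(actual as real)  as actual
        from source_expenses
    ),
    with_variance as (
        -- stage 2: the "new column with a formula" step
        select
            category,
            strftime('%m', spent_on) as month,
            budget - actual          as variance
        from cleaned
    ),
    pivoted as (
        -- stage 3: the pivot table, one column per month
        select
            category,
            sum(case when month = '01' then variance else 0.0 end) as jan_variance,
            sum(case when month = '02' then variance else 0.0 end) as feb_variance
        from with_variance
        group by category
    )
    select category, jan_variance, feb_variance
    from pivoted
    order by category
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each stage depends on the one before it, which is exactly the ordering constraint you already respect when working through a spreadsheet by hand.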
In short, data goes through many transformations. These transformations tend to be chunked together, due to necessity or conceptual similarity. At some point the transformation is substantial enough that the resulting data has its own value, and that state of the transformed data should be preserved for multiple future uses. This is true both in the spreadsheet flow of functions and pivot tables, and the database flow of SQL queries making tables/views.
I hope that these comparisons made sense and you can more easily see how the data transformation work you’ve already been performing translates over to the SQL transformations you will craft to build out your data warehouse. However, so far these bridges are only built out of theoretical foundations - you’ll have to finish the rest of the job by practicing using SQL to transform raw input data into actionable output datasets, and all of the intermediate tables along the way. Most of this will only make sense after you’ve had experience building data models in your own data warehouse. Only with practice can you turn abstract theoretical knowledge into internalized tacit knowledge.
If that still sounds intimidating, and you’re not sure how to get started with your own personal data warehouse project, you are in luck! I have my own personal project that I’ve been meaning to work on for quite a while now, and I’ve decided to write a series of blog posts along the way documenting my process (and motivating my progress). Stay tuned for that series in the coming months.
I want to note that so far, we have been focused on scenarios where we know where we are and where we want to go. We know what the raw source data looks like, and we know what the transformed data needs to look like in order to accomplish a specific purpose (such as creating a visualization, calculating a metric, or sharing summarized data). This is the most fundamental and impactful type of data design. Shifting transformations that you may have performed in a manual, isolated, undocumented way into a data warehouse paradigm where the transformation process can be automated, transparent, and documented is inherently valuable and a necessary first step. You know your data, you are analyzing your data, and you know how to get your data from point A to point B (and point C, D, etc). You know exactly what this data will be used for, what the purposes of the transformations are, and you can see your project through the entire data lifecycle.
However, one drawback of this purpose-driven design paradigm is that it does not scale very well. On large data teams, there may be different people assigned to each task: ingesting, transforming, documenting, visualizing, and analyzing the data. Those tasked with transforming the data may not know all of the many purposes the data may need to be designed for. Furthermore, the data warehouse will need to house many different datasets, for many different purposes, transformed by different people at different times for different projects. If every dataset is only transformed for the initial known purpose, over time the design of the data warehouse will become rather chaotic (this is how you get a spaghetti DAG). Plus, it’s very likely that most of the data will no longer be in use, data transformations will have been duplicated as slightly different use cases arise, and users may not even realize that the data they need is already in the warehouse (though perhaps not in the form they need). This is not necessarily a bad problem to have, as it means that your data warehouse has survived for a few years and has been actively used by many different people. However, at some point it really does become a problem, and someone is going to look at the data spaghetti and think, “there must be a better way.”
And of course, there is. In fact, there are quite a few you can choose from or combine in whatever way suits the larger purpose of your data warehouse. In part 2 (coming soon!) I will cover a few of the most important data modeling design paradigms in broad strokes and link to resources where you can learn more about each one.
When I first started working on the City of Boston Analytics Team, I had never really worked with an analytical data warehouse before. I had experience with databases, and with the data flow in analytics projects (mostly using pandas, in jupyter notebooks), and I had used databases as a starting point for analytics projects, but I had not yet put all of those pieces together. Then I started working as a data engineer whose primary role was to operate within an analytical data warehouse, and I took a 4-week crash course in dbt… and my brain went into overdrive trying to synthesize everything I already knew and was learning in order to come up with a mental model of what an analytical data warehouse was, how it should operate, and why. Now, almost two years later, I’m pretty sure I get it - and I want to share how I conceptualize a data warehouse in hopes that it makes the journey easier for anyone else who may be in the same position I was.
But first, a couple caveats:
An analytical data warehouse is the data source that powers analytical outputs - dashboards, reports, one-off statistical analyses, machine learning models, etc. Having a data warehouse makes sense when you are working in a team - you often need to share data, pick up other people’s projects, and know that an analytical output will continue to function even after a project is wrapped up. Having a reliable, managed, and shared infrastructure is a huge productivity boost. Also, pretty much every BI tool is built to at least connect to a database.
But what makes a data warehouse different from a (relational) database? In large part, this is simply a matter of semantics. Any database can be a data warehouse; it all comes down to how you use it. If you are using a database for analytics, you are likely running many more (and more complicated) queries that read from the database. If you are using a database to power an application, you are doing a lot more (and need faster) writes to the database. Transactional databases are optimized for writes, and analytical databases are optimized for reads. Some DBMSs are balanced and can be used for both (e.g. PostgreSQL). Others optimize heavily for one use case (Snowflake is pretty much only used for analytics). How you are using a database also dictates how you organize it. If you are using a transactional database to power an application, there is a very good chance your database is normalized (most typically into 3rd or Boyce-Codd Normal Form).
Normalizing your data is the best way to organize your data when you want to reduce the duplication of data, avoid data anomalies for inserts/deletes/updates, and ensure relational integrity. But the normalized structure can make it difficult to answer seemingly simple questions - you often have to write complex queries with many joins. And analytics is all about asking questions of your data.
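To give a feel for the "many joins" problem, here is a toy normalized schema (all table and column names are invented) and the query needed to answer a seemingly simple question: how much revenue came from each city?

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table cities      (city_id integer primary key, name text);
    create table customers   (customer_id integer primary key, city_id integer);
    create table products    (product_id integer primary key, price real);
    create table orders      (order_id integer primary key, customer_id integer);
    create table order_items (order_id integer, product_id integer, qty integer);

    insert into cities      values (1, 'Boston'), (2, 'Salem');
    insert into customers   values (1, 1), (2, 2);
    insert into products    values (1, 2.5), (2, 10.0);
    insert into orders      values (1, 1), (2, 2), (3, 1);
    insert into order_items values (1, 1, 2), (2, 2, 1), (3, 2, 1);
""")

# One simple question, four joins: the price of a well-normalized schema.
revenue_by_city = conn.execute("""
    select ci.name as city, sum(oi.qty * p.price) as revenue
    from order_items oi
    join orders    o  on o.order_id    = oi.order_id
    join customers cu on cu.customer_id = o.customer_id
    join cities    ci on ci.city_id     = cu.city_id
    join products  p  on p.product_id   = oi.product_id
    group by ci.name
    order by ci.name
""").fetchall()

print(revenue_by_city)
```

In an analytical warehouse, a query like this is the kind of thing you would typically write once as a transform and save, so that the next person can select from a single wide table instead.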
If you are using your database to answer many different analytical questions, you probably have many different data sources - and many of those data sources may be completely different from each other and never intended to be joined together. This is when a database starts to become a data warehouse - rather than one cohesive place to store data for one purpose (and therefore the data is all related), the data warehouse is more like a data repository. It is stored in a single place because of the convenience, not because the data is necessarily related (though some of it certainly will be).
In a well-designed relational database, it is usually pretty easy to understand how data is organized and linked through primary and foreign keys. You don’t necessarily need to know the data to query it so long as you know the structure. But an analytical data warehouse is not so easy to comprehend. There are almost never foreign keys, because they introduce complications to the ETL processes that update the data. If you are lucky, there are primary keys on some of the tables. To understand how an analytical data warehouse is organized, you instead need to look at the transformation flow.
If you have worked on enough code-based analytics projects, you have probably learned the value of maintaining a proper data transformation flow. This may show up in how you organize your data folder: you have your “raw” data that is never to be altered, your cleaned/processed data that is an intermediate output, and your final data. You learn the value of using code to transform data - it means all transformations are documented, replicable, and easy to alter. Your notebooks/scripts are a mix of exploration, small cleaning tasks, and big transformations. You may have some functions in there for transformations that need to be performed repeatedly or to encapsulate more complicated code. You are displaying chunks of dataframes in order to visually examine the results and decide what needs to be done next. You are also saving dataframes to CSVs to be used elsewhere.
All of these elements show up in the analytical data warehouse as well - they just look a bit different. No matter where the data lives, analytics data work should follow a set of core principles that cement data management best practices in the analytical output.
When raw data is loaded into a data warehouse via an ETL process, it is loaded in a “source” table. Different teams have different names for this set of tables - “base”, “staging”, etc - but I will use “source” for the sake of simplicity/consistency. These source tables are often isolated or otherwise identifiable in the data warehouse. Some teams have a separate database only for source data, others have a separate schema or set of schemas for source tables, and some may use a naming convention to identify these tables (e.g. a suffix of “_stg”).
These source tables reflect the actual source data as closely as possible. This means that they may have the wrong datatypes, need to be cleaned (e.g. trimming whitespace), split out into multiple columns (especially if it is JSON text), the column names may need to be changed, or some other transformation is needed before it can be considered “production ready”. But it is important that these raw source tables remain exactly the same as they were when first loaded into the database - you want to avoid alter table statements.
Sometimes, these raw sources may come from a relational database. In this case, you may want to implement primary keys on the source tables - because if the primary key constraint is violated, that means something went wrong in the ETL process. However, implementing foreign keys is generally not recommended, as it can severely complicate the ETL process. Instead, document these foreign keys (in tests, in text, etc) and use them to inform downstream transformations.
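As a minimal sketch of how a primary key on a source table can surface an ETL problem, here is an example using Python's built-in `sqlite3` module as a stand-in for a real warehouse. The table name `src_customers`, its columns, and the `load_rows` helper are all hypothetical, made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical source table; the primary key mirrors the upstream relational database.
conn.execute("CREATE TABLE src_customers (customer_id INTEGER PRIMARY KEY, name TEXT)")


def load_rows(conn, rows):
    """Load a batch in one transaction; return False if the key constraint is violated."""
    try:
        with conn:  # commits on success, rolls back the whole batch on error
            conn.executemany("INSERT INTO src_customers VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False


ok_batch = load_rows(conn, [(1, "Ada"), (2, "Grace")])
# A faulty load that delivers the same key twice is rejected as a whole.
bad_batch = load_rows(conn, [(3, "Edsger"), (3, "Edsger")])
row_count = conn.execute("SELECT COUNT(*) FROM src_customers").fetchone()[0]
print(ok_batch, bad_batch, row_count)
```

The failed batch leaves the source table untouched, which gives you a clear signal that something went wrong upstream rather than silently duplicated rows.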
The next step is to transform these source tables into “production” tables. There may be many transformations before producing the table that will be used for analysis, but there should always be at least one. In a data warehouse, a “transform” usually means a SQL select query that is used to create either a table or a view. For the purpose of this post, “production” refers to tables or views in the database that are ready to be used for any downstream analytical purposes (dashboards, analysis, further downstream views, etc), have been through a development and approval process, and have been designated as being “production” based on their name, schema, database, or some other indication.
The first transform is the simplest - it should select from the source table and apply any basic cleanup that is needed. Let’s call the result of this first transformation the “base production” table. It should correspond to the source table closely, and should have a 1:1 relationship to the source table (so, no joins). It should also always be materialized as a table (not as a view). The base production tables - and all downstream transforms - should live in a separate schema/database from the source tables, or otherwise be easily distinguished from the source tables.
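Here is a rough sketch of what such a first transform could look like, again using Python's `sqlite3` as a stand-in warehouse. SQLite has no separate schemas, so this example assumes a naming convention (`src_` and `base_` prefixes) to keep source and production apart; the `permits` tables and their columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical raw source table: everything arrived as text, with messy names and whitespace.
conn.execute('CREATE TABLE src_permits ("Permit ID" TEXT, "Issued Dt" TEXT, "Fee" TEXT)')
conn.executemany(
    "INSERT INTO src_permits VALUES (?, ?, ?)",
    [(" A-100 ", "2023-01-15", "250.00"), ("A-101", "2023-02-01", " 75.5")],
)

# The transform is a plain SELECT: rename, trim, and cast. No joins, one row out per row in.
BASE_PERMITS_SQL = """
CREATE TABLE base_permits AS
SELECT
    TRIM("Permit ID")         AS permit_id,
    DATE("Issued Dt")         AS issued_date,
    CAST(TRIM("Fee") AS REAL) AS fee
FROM src_permits
"""
conn.execute("DROP TABLE IF EXISTS base_permits")  # rebuild from scratch on every run
conn.execute(BASE_PERMITS_SQL)
conn.commit()

base_rows = conn.execute(
    "SELECT permit_id, issued_date, fee FROM base_permits ORDER BY permit_id"
).fetchall()
source_count = conn.execute("SELECT COUNT(*) FROM src_permits").fetchone()[0]
print(base_rows)
```

The source table is only ever read from, never altered, and the base production table can be dropped and re-created at any time by re-running the saved query.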
Even if no cleanup or alteration of the source data is needed, this first transformation is still a necessary step. Why? Sometimes there is an error in the ETL process, and the source table may be wiped. If you only have one table (the source table), then you have lost that data, and it will affect any downstream uses of that table. However, if you have a base production table (which selects from the source table), and the source table is wiped, then you can stop the transformation before the corresponding base production table is wiped. So, the base production table still has data (even if it is outdated - which is preferable to no data at all). Note: this is only possible if the base production table is materialized as a table, not a view.
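The "stop the transformation" step can be sketched as a simple guard in front of the rebuild. This is an illustration only, assuming Python's `sqlite3` as the database; the `refresh_base` function, the `min_rows` threshold, and the `orders` tables are hypothetical names.

```python
import sqlite3


def refresh_base(conn, min_rows=1):
    """Rebuild base_orders from src_orders, unless the source looks wiped."""
    n_source = conn.execute("SELECT COUNT(*) FROM src_orders").fetchone()[0]
    if n_source < min_rows:
        return False  # skip the rebuild and keep the existing (stale) base table
    with conn:  # drop and re-create inside one transaction
        conn.execute("DROP TABLE IF EXISTS base_orders")
        conn.execute(
            "CREATE TABLE base_orders AS "
            "SELECT order_id, TRIM(status) AS status FROM src_orders"
        )
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_orders (order_id INTEGER, status TEXT)")
conn.executemany("INSERT INTO src_orders VALUES (?, ?)", [(1, " open"), (2, "closed ")])
first_run = refresh_base(conn)

conn.execute("DELETE FROM src_orders")  # simulate an ETL error wiping the source
second_run = refresh_base(conn)
surviving = conn.execute("SELECT COUNT(*) FROM base_orders").fetchone()[0]
print(first_run, second_run, surviving)
```

After the simulated wipe, the guard declines to rebuild, so the base production table still holds the rows from the last good run. A view would have gone empty along with the source.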
Even if a source is static (not being updated by an ETL process), you still want to have this source -> base production table transform, and preserve both this separation and the transformation applied. Why? You may discover at a later point that the base production table needs to be altered in some way - a column should be renamed, a data type needs to be corrected, whitespace needs to be removed, etc. Or, you may discover that a transform you thought you needed is actually incorrect, and you need to alter the transform. Once you alter the source table, there is no going back (besides dropping the table and re-creating it from the source). Whereas if you have a transformation query, you can simply re-run the query to re-create the altered production table. This is only possible if the raw data in the source table remains the same - and if the SQL used to do the transform is saved (preferably in a git-tracked repo).
While base production tables should be materialized as tables and should each have a 1:1 relationship with their source table, all further downstream transformations can be materialized as views and can query multiple tables. The primary reason to materialize a downstream transformation as a table instead of a view is performance - if the query is computationally expensive and takes a long time to run. Otherwise, views are preferred.
Note: further downstream transformations should never query a source table - only the production tables/views. These tables/views generally live in the same schemas/databases as the base production tables.
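To make the table-versus-view distinction concrete, here is a small sketch with Python's `sqlite3`. The `base_customers` and `base_orders` tables and the `customer_totals` view are hypothetical; the point is only that the view joins production tables (never source tables) and picks up new data without being rebuilt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE base_customers (customer_id INTEGER, name TEXT);
CREATE TABLE base_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO base_customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO base_orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5);

-- Downstream model: a view that reads only from base production tables.
CREATE VIEW customer_totals AS
SELECT c.customer_id, c.name, SUM(o.amount) AS total_amount
FROM base_customers AS c
JOIN base_orders AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.name;
""")

query = "SELECT name, total_amount FROM customer_totals ORDER BY name"
before = conn.execute(query).fetchall()

# New data lands in a base table; the view reflects it with no rebuild step.
conn.execute("INSERT INTO base_orders VALUES (13, 2, 2.5)")
after = conn.execute(query).fetchall()
print(before, after)
```

The trade-off is that the view's query runs every time it is read, which is why an expensive transformation may be better off materialized as a table.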
There may be many, many downstream transformations for a particular data source, and the structure of these can vary widely. There are many different design philosophies around this area of the data warehouse, but I will leave that for another post. Suffice to say, a large portion of the “thinking” work in a data warehouse is in writing these downstream transforms.
When new data is imported and transformed on a regular cadence, it is important to have Data Unit Tests (DUTs) to automatically check that previous assumptions about the data still hold true. DUTs can be performed on any table/view in the data warehouse - source, production, or downstream views. For example, you might want to check that the source table contains some minimum number of rows before doing the transform to production - and if the test fails, then the transform will not happen. You may want to check that a field you assume is your primary key is actually unique and has no null values. You may want to check that a categorical field only has a limited list of values - and if it fails the test, you only want it to warn you, but not prevent the transformation. The most important tests (the ones you want to stop downstream transforms if they fail) should be performed as early in the transformation flow as possible (ideally, the source tables), while the nice-to-have tests that should only warn you about failures (and not prevent transformations) can be implemented further downstream.
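The tests described above could be sketched roughly as follows. This is not how any particular tool implements them; it is a hand-rolled illustration using Python's `sqlite3`, where the `TESTS` list, the `run_tests` function, the "error"/"warn" severity labels, and the `src_tickets` table are all assumptions made for the example. Each test is a query that returns the number of failures.

```python
import sqlite3

# Hypothetical tests: (name, severity, SQL that returns the number of failures).
TESTS = [
    ("source_has_rows", "error",
     "SELECT CASE WHEN COUNT(*) >= 1 THEN 0 ELSE 1 END FROM src_tickets"),
    ("ticket_id_not_null", "error",
     "SELECT COUNT(*) FROM src_tickets WHERE ticket_id IS NULL"),
    ("ticket_id_unique", "error",
     "SELECT COUNT(ticket_id) - COUNT(DISTINCT ticket_id) FROM src_tickets"),
    ("status_accepted_values", "warn",
     "SELECT COUNT(*) FROM src_tickets WHERE status NOT IN ('open', 'closed')"),
]


def run_tests(conn, tests):
    """Return the names of failed tests, split into blocking errors and warnings."""
    errors, warnings = [], []
    for name, severity, sql in tests:
        failures = conn.execute(sql).fetchone()[0]
        if failures:
            (errors if severity == "error" else warnings).append(name)
    return errors, warnings


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_tickets (ticket_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO src_tickets VALUES (?, ?)",
    [(1, "open"), (2, "closed"), (3, "pending")],
)

errors, warnings = run_tests(conn, TESTS)
run_transform = not errors  # only blocking failures stop the transform

conn.execute("INSERT INTO src_tickets VALUES (3, 'open')")  # a duplicate key arrives
errors_after, warnings_after = run_tests(conn, TESTS)
print(errors, warnings, errors_after)
```

In the first run the unexpected "pending" status only produces a warning, so the transform would still go ahead. Once a duplicate key shows up, the uniqueness test fails as an error and the transform should be skipped.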
DUTs are a way to productionize data testing, but can also be useful during the exploratory phase when first designing transformations. You can test your data through a formal DUT structure, or through a series of queries. For example, you may want to run `select distinct` on a field to see all possible values, or compare `select count(distinct __)` on a field to `select count(__)` to see whether there are duplicates. Based on this exploratory data testing, you may wish to revise your transform. And while it is important to document and preserve any SQL transform queries so they can be re-run, it is less essential to preserve these ad-hoc data tests - their primary purpose for static data is to inform your SQL query for transforming source data into the production table. However, if the data source may be updated, then formalizing them into DUTs is a good idea.
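Those exploratory queries might look like this when run from a notebook. This sketch uses Python's `sqlite3`, with a hypothetical `src_visits` table and made-up values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_visits (visitor_id TEXT, channel TEXT)")
conn.executemany(
    "INSERT INTO src_visits VALUES (?, ?)",
    [("v1", "web"), ("v2", "Web"), ("v2", "web"), ("v3", "phone")],
)

# What values does this field actually take? (Reveals the inconsistent casing.)
distinct_channels = [
    row[0]
    for row in conn.execute("SELECT DISTINCT channel FROM src_visits ORDER BY channel")
]

# Is this field unique? Compare the plain count to the distinct count.
total, unique = conn.execute(
    "SELECT COUNT(visitor_id), COUNT(DISTINCT visitor_id) FROM src_visits"
).fetchone()
has_duplicates = total > unique
print(distinct_channels, total, unique, has_duplicates)
```

Here the distinct values show that `channel` needs its casing normalized in the transform, and the count comparison shows that `visitor_id` cannot be treated as a primary key.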
Imagine you are an analyst, staring at this data warehouse and all of its many tables, trying to figure out which tables are going to have the answer to your question. Or, imagine you are an engineer, informed that something has gone wrong in a dashboard and you need to figure out what (and where) in the data flow something went wrong. What is the one thing that you are really going to want (need!) in order to do your job effectively and efficiently?
Documentation.
You are going to want documentation for your tables, the columns in those tables, the relationships and dependencies between those tables, the tests performed on those tables, and any external dependencies (dashboards relying on those tables, ETL processes powering the source tables). But documentation has a dirty secret - (almost) nobody wants to or has time to write documentation. Unless you make documentation fast and easy to produce (or you pay for it by making documentation someone’s job or primary responsibility), it isn’t going to happen (at least not consistently).
dbt may be advertised as a tool to transform (and test!) your data, but its real superpower is in the documentation that is produced as a side effect of how those transformations and tests are implemented. If you do not currently have an infrastructure to support transformations and data unit tests within your data warehouse, then implementing dbt is a no-brainer because it is the easiest out-of-the-box way to accomplish those fundamental tasks. But even if you do have a (likely custom/house built) infrastructure to do those transforms & tests, making the switch to dbt is worth it, because of the documentation that is produced (and the culture of documentation that is encouraged/supported). Once you see a DAG (directed acyclic graph of table nodes) that visualizes every step from source to dashboard, there is no going back.
Why? Because given enough time, your team will accumulate enough data sources (and the people most knowledgeable about those sources will leave at some point), and enough complicated transforms and table structures, that working within the data warehouse will become a snakes’ nest of tangled dependencies, table rot, and dangerous assumptions. And if you can’t rely on your data warehouse, then your stakeholders can’t rely on your analytical outputs. Documentation is not just a nice to have - it is essential. Document your data, your transforms, your tests, your analytical outputs, everything - and then put that documentation in one easily accessible and searchable place. dbt makes this easy, and that is the real reason it has become the darling of the data community - and the reason I’ve wanted to implement it for our team ever since I learned enough about the tool and our current data infrastructure.
Which brings us back to our principles of good data management for analytics. There was, in fact, one principle missing from the list, so let’s round it out.
Now that we’ve brought dbt into the picture, we can talk about one final piece of the (modern) analytical data warehouse: the development cycle. In the principles above (and the blocks of text further above), note the use of the term “production”. The presence of “production” data implies the presence of data that is not “production” - something distinct from the raw source data already described. This missing something is the “development” version of the data - basically, the data/transform that has not yet been finalized.
Tables that are still in development should be distinguished from production tables in the data warehouse. Dev tables may live in a separate database or schema, or be indicated through a table naming convention. Teams may also have less formal (read: social) ways of indicating that a table is in development, though this introduces opportunity for error. A table in development is not finalized - column names, order, and content may change, the table name itself may change, and the logic has not yet been reviewed and approved by the team. Essentially, it’s very risky (and certainly inefficient) to build downstream dependencies on dev tables, though there will always be cases where speedy delivery is a necessity and this rule has to be broken.
One great feature of dbt is that this development phase of tables/views (dbt refers to them all as “models”) is built into how dbt works through the use of profiles. It’s very easy to designate a specific schema or database as a dev schema/database, and to build models agnostic of the dev/prod schema. It’s also why git is a core component of any dbt project - the “main” branch should have models in “production”, while feature branches should have models in “development”. You have to be familiar with the use of version control in a software development cycle in order to use dbt effectively, and the fact that dbt brings these software engineering practices into the modern analytical workflow is considered a major selling point of dbt.
Unless you are already part of a data team that uses a data warehouse, it can be hard to understand what an analytical data warehouse is, how to use it properly, and why it is such an important part of working on a data team. If you’re in school or otherwise studying for a career in data, knowing how to work within a data warehouse will give you a big leg up over other entry level candidates. My hope is that this blog post can (1) convince you that you already know the most important pieces of the puzzle, (2) fill in some of the missing pieces of the puzzle, and (3) help you synthesize and put all of the pieces together to form a cohesive picture.
In future posts I would like to continue to flesh out how to work in a data warehouse, but this concludes the introduction. If you have any questions or future topics you’d like me to focus on, let me know!
This website, and therefore this blog, has fallen by the wayside since I graduated from library school with my MSLIS in the spring of 2020… in the midst of a global pandemic. I wasn’t really sure what to do with the site, beyond letting it exist and adding some projects here and there, but I recently decided that it was time for a revamp (and rebrand/redesign). I wanted my personal website to say “working professional”, not “overly enthusiastic grad student” - though I am keeping all the old posts so the “overly enthusiastic grad student” is not going away. But it has been 3 years (wow!) since I was in grad school, and I have managed to get some professional experience under my belt - though with the pandemic raging as I entered the work force I had a bit of a rocky start.
My last semester of grad school was not the best. There was before spring break, and there was after spring break. Before spring break, everything was continuing on as usual, with some masks starting to appear. After spring break, everything shut down. All classes and work went remote. There was no graduation ceremony. I had no idea that the last time I saw my professors, classmates, coworkers, and friends, it would be the last time I saw them ever. While in undergrad I could not care less about the graduation ceremony, for grad school - a place where I really found my academic home - I was looking forward to the chance to see everyone one last time and say goodbye. Instead, there was just endless isolation.
After graduating into a pandemic with no job offer yet in hand, I did what everyone seemed to be doing - I moved back home. My mom was living alone, I couldn’t afford to pay rent without a job, and after the isolation of the previous months I was looking forward to forming a little family bubble. I put most of my stuff in storage, sure that I would be moving soon to wherever I found a job. And sure enough, within a few months of graduating I got a job offer… with one big catch. I would need to get my security clearance. I was so excited to get a job offer, I ignored the perfectly reasonable advice every entry level professional is given (and ignores): don’t accept the first offer you get. I was told that the process would take about 6-12 months, and I hoped that my previous security clearance (for my internship with the State Department in undergrad) would help speed things up. In the meantime, I could find some short-term contracting gigs to start advancing my data skills and developing experience. So that’s exactly what I did.
Pretty much as soon as I had some income coming in, I kept a promise that I had made to myself years ago at the end of undergrad (when I knew I was going to be doing a lot of traveling). I finally adopted a puppy. I adopted my sweet girl from Saving Grace, a rescue in Raleigh. At the rescue, the name on her tag was “Genuine Risk”. I named her Bella, after the loyal pony Bela in Wheel of Time. She was about 14 weeks old, the last of her litter, and shy as could be, but after 5 minutes of me sitting cross legged in her pen she crawled into my lap and fell asleep. That’s when I knew - and my mom really knew - that she would be coming home with us.
I also took up knitting (I know, such a classic pandemic hobby). The very first thing I knit was a blanket for Bella. My reasoning was: (1) she won’t care what it looks like, so it’s fine if it turns out awful and ugly, and (2) it’s getting cold, she’d probably appreciate a blanket that smells like me. (Actually I lie, the first 4 blanket squares got combined into a blanket for Molly, our much smaller and much older family dog). Sadly, that blanket no longer exists - it got slowly torn apart and consumed because… I gave it to a puppy. C’est la vie. But while the blanket slowly got smaller, I kept on knitting, and eventually started knitting shawls as wedding presents. (6 shawls and 5 weddings later, I’m finally now starting on a shawl for myself)
A year came… and went. Still no word on the security clearance. I had done all of the steps (including going in person for the drug test, polygraph, etc) and I got no updates, no estimated timeline, nothing. I knew that I wanted to kick off my career properly - in a full-time position doing data work as a part of a larger team focused on public service - and the contracting gigs had only been meant as a temporary stopgap until I could join such a team. I had been applying for data librarian jobs and even got to the final round of interviews a couple times, but no dice (actually, my Carpentries lesson was a side product of one of those interviews!).
Finally, in late fall of 2021, I hit gold. A mutual on twitter pointed me towards a posting for a data engineer at the City of Boston’s Analytics Team. It was perfect - data work, in public service, and a genuine way to really start my career. I applied, made a streamlit app for extra credit, and by the end of the year I had accepted a job offer.
I kid you not, a week after accepting that job offer, my security clearance came through. But I knew that the data engineer position in Boston was a better fit for me, so I stuck with it and moved to Boston in January of 2022. Almost a year and a half of waiting on that security clearance… well, sometimes, timing really is everything.
My first year as a Data Engineer on the Analytics Team was a true learning experience - exactly what I was hoping for from my first job to kick off this new career. In grad school, I had learned SQL and how to design databases in 3rd Normal Form. At my job, I learned about analytical data warehouses and how SQL was used in production code, and I leveled up my SQL skills substantially (can you believe we learned about subqueries but not CTEs?). In grad school, I had learned about automated workflows as related to improving data quality. At my job, I learned about data orchestration (hello, YAML files and YAQL) and the pros and cons of different workflow design paradigms. In grad school, I mostly wrote my own python code from start to finish. At my job, I learned how to work with a team of engineers and contribute to an existing codebase. I learned how to work and contribute within a larger analytics team - rather than doing every part of the analytics flow myself.
I also made sure to focus on my professional development in other ways. In March I took CoRise’s course “Analytics Engineering with dbt”, taught by Emily Hawkins. In August I took GovEx’s course on Data Governance. Both were valuable and what I learned in them I could immediately apply in my work.
During my first year, I focused on learning how the Analytics Team, and the Engineering Team especially, worked, and how I could best contribute within the existing framework. I learned so much from my coworkers, and I found the experience of having fellow engineers that I could lean on and collaborate with invaluable. But in addition to learning how to contribute to what was already there, I was also starting to identify new ideas that I wanted to add to the mix. Primarily: dbt.
For my second year on the Analytics Team, I wanted to make a substantial impact to improve how the engineering team worked. This involved a lot of thinking and iterating through designs and strategies before I brought my proposals to the team. I wanted to make sure that when it came to the actual implementation of these plans, that it would go as quickly and smoothly as possible. It also meant communicating with my team and building interest and agreement in my proposals. I had been talking about dbt almost since I joined the team, but in a “this is a cool tool” way, not in a “let’s do it” way. If I was going to ask everyone to learn and use this new tool, I wanted to make sure they believed it was worth it too.
I don’t want to go into too much detail about what dbt is and how we are implementing it - I’ll save that for a future blog post. But suffice to say that the project got started early in 2023, and after 6 months we are finally in a sprint of building models and adding to the project, with almost all engineers onboarded and starting to onboard analysts. I’m very excited to see what further progress we can make for the rest of the year - and I’m especially excited to be able to share this experience of implementing dbt for a city analytics team in a talk at this year’s Coalesce (dbt Labs’ annual conference).
Besides the dbt project, this year I’ve also been working on larger and more complex projects that have involved a lot of data architecture/design work. Without going into too much detail, the city has a lot of legacy systems that don’t talk to each other, when city workers really need for information to be passed between those systems in order to do their work well and efficiently. I have been involved in a couple projects this year that have been focused on how to integrate these systems (or at least have all of the data in one place and able to be joined together), particularly for housing and development work. I love building ETL pipelines, but I know a project will be special when I get to do system design and data architecture work before building the pipeline. These projects are always bigger, longer, and trickier, involve working with more stakeholders and teams, and are just so rewarding because you can see the difference your contribution is making and you can build relationships outside of your team.
Another high point for this year on the team is that I’ve had the opportunity to start teaching Carpentries workshops again. I taught a workshop on git because the analysts on the team have been writing more code and wanted to preserve and collaborate on their code in a GitHub repository, and I taught some sessions on python and pandas as a part of a Data Culture pilot program focused on python. Later this year I’m planning on teaching a [SQL workshop](https://swcarpentry.github.io/sql-novice-survey/) and a session on querying APIs with python. Teaching computational skills is something I really enjoy, and it was the one thing I was really missing in this job last year.
Finally, if you think that working on a city analytics team sounds interesting, I will plug the fact that we are currently hiring for several positions - if you live in Boston or want to move to Boston, apply!
My Carpentries teaching journey started out slow - I taught a couple of times that first year, and stuck to the lesson I was most comfortable with (SQL & Databases). However, at the start of the summer in 2021 my work put me in charge (or I volunteered to be in charge) of a group of undergraduate interns who wanted to learn how to use computational methods for open source intelligence analysis. So I put together a curriculum of Carpentries lessons to take my interns from zero-assumed knowledge to the ability to complete a computational analysis in Python. I’m teaching the same curriculum in the Fall (slightly expanded and refined) to another group of undergraduate interns. This allowed me to gain experience in teaching more of the core Carpentries lessons… and also inspired me to develop my own lesson focused on interactive data visualizations!
Each week I teach for 2 hours, and then learners can practice what they’ve learned and give feedback on the workshop in an assignment (delivered via Google Forms). I’ve included the curriculum schedule below:
Goals:
Assignment:
Goals:
Assignment:
Resources:
Lesson Plan:
Goals:
Assignment:
Further Resources:
Lesson Plan:
Goals:
Assignment:
Further Resources:
Lesson Plan:
Goals:
Assignment:
Lesson Plan:
Goals:
Assignment:
Further Resources:
Lesson Plan:
Goals:
Assignment:
Lesson Plan:
Goals:
Assignment:
Further Resources:
Lesson Plan:
Goals
Lesson Plan:
Goals:
Assignment:
Lesson Plan:
Goals:
Assignment:
Lesson Plan:
Goals:
Assignment:
Further Resources:
If you’ve never heard of The Carpentries before, you may be confused - have I gotten into woodworking?! (No… but I have gotten into knitting - the real kind, not the RMarkdown kind).
From the horse’s mouth, here’s what The Carpentries is all about:
Vision: Our vision is to be the leading inclusive community teaching data and coding skills.
Mission: The Carpentries builds global capacity in essential data and computational skills for conducting efficient, open, and reproducible research. We train and foster an active, inclusive, diverse community of learners and instructors that promotes and models the importance of software and data in research. We collaboratively develop openly-available lessons and deliver these lessons using evidence-based teaching practices. We focus on people conducting and supporting research.
My first semester of my MSLIS, I took an Intro to Python class with Elizabeth Wickes. After completing the class, Elizabeth (an Instructor Trainer with The Carpentries and an elected member of the Executive Council since 2018) encouraged me to take the Carpentries instructor training class and get involved with the local UIUC chapter of The Carpentries… but it took me another year before I felt comfortable enough with my technical skills to feel like I could teach them. In December 2019, I finally took the instructor training class with Elizabeth Wickes & Neal Davis. It was a 2-day class that prepared new instructors to teach Carpentries workshops - with a special emphasis on how to teach while live-coding. As with all things Carpentries, the Instructor Training course is also freely available.
I had the chance to be a helper for one in-person workshop in early 2020 before… the pandemic happened. Then, everything went remote. I was a helper for another workshop series online, before I finally felt comfortable enough to teach a workshop for myself in Summer of 2020 - the SQL lesson, which remains my favorite lesson to teach!
The Carpentries is something special. For one, all of the workshop lessons are freely available - so anyone can use the lesson plans to teach or learn for themselves. The lessons live in GitHub repos, so they can be collaboratively developed. Anyone can develop their own Carpentries lesson using the provided template and submit the lesson to the Carpentries Incubator, so there are a ton of lessons out there beyond the core set. All of the lessons follow a similar pedagogical philosophy: live coding throughout the workshop, getting learners to a place of being able to use their new skills as quickly as possible, and being as inclusive and supportive as possible. This reflection blog post details some of the pedagogy that is the foundation of all Carpentries lessons.
The Carpentries is also all about community. Some instructors are based at a university and have the support of their local chapter, but even if instructors don’t have a local chapter, they can be supported by the global online community. There are community discussions via Zoom and a Slack workspace so instructors can stay connected and learn from each other, and many instructors are active on social media, like Twitter. There’s also CarpentryCon!
So if you’re interested in getting involved with The Carpentries, go for it! I cannot emphasize enough how much I have learned and benefited from being a part of The Carpentries community.
This is it - the last semester of my MS/LIS degree. As I am currently job searching, I decided to take a slightly lighter class load for this final semester: 2 regular classes, 1 short class, and an independent study. Here they are:
Note that the course descriptions & learning outcomes are pulled from the syllabi for each respective course.
While I have encountered a lot of the topics covered in this class in previous classes (e.g. Regular Expressions, the Relational Model), this is still a must-take class at the iSchool. I have actually been meaning to take this class for a while, but because it is reliably offered every semester, I kept putting it off to take other classes. This class teaches several concepts important to data curation (e.g. data provenance, reproducibility), and includes some new topics that are quite exciting to me - like logic programming (Datalog) and workflow automation.
Data cleaning (also: cleansing) is the process of assessing and improving data quality for later analysis and use, and is a crucial part of data curation and analysis. This course identifies data quality issues throughout the data lifecycle, and reviews specific techniques and approaches for checking and improving data quality. Techniques are drawn primarily from the database community, using schema-level and instance-level information, and from different scientific communities, which are developing practical tools for data pre-processing and cleaning.
This is a new class at the iSchool (this semester is the 2nd time it has been offered). The goal of the class is to teach students about data structures and algorithms in a fun and interesting way… through puzzles and games. Each week we are given a new puzzle or game that we have to represent in some kind of data structure and then solve based on a specific algorithmic approach. Many of the projects are done in groups, although some can also be completed individually. This class is challenging and fast-paced, but also highly rewarding because it pushes me to advance my python problem solving skills. Also, as a huge fan of puzzles and games, this class is just intrinsically interesting.
Learn, experiment, code with, and compare performance of common data structures and algorithms in a fun, collaborative, and challenging context. In class, students will discuss and solve logic puzzles and play several types of strategy games. In small teams they will explore the deductive, strategic, and tactical decisions involved, select appropriate data structures & algorithms to develop efficient program solutions to automate playing, solving, generating, or analyzing puzzles & games. Techniques and tools used include analysis of efficiency (Big-O), recursion, minimax, Monte Carlo Tree Search, client/server network communications, deterministic vs non-deterministic algorithms. Structures used include arrays, matrices, hash tables, stacks, various trees, network graphs, and custom structures. For some projects, students will have competitions pitting their solutions against other teams.
Though the contextual focus of the course is on strategic games and logic puzzles, the underlying purpose is for students to learn and practice the following critical and broadly-useful skills:
This is a one-month short course offered at the end of the semester. It covers many of the vital skills often glossed over in typical programming classes - like version control, shell scripting, and computer cluster/cloud tools.
This class will provide an overview of the history and commonly offered command line interfaces and essential shell scripting tools. These approaches will be extended to cover common version control tools, including git and GitHub, their value, and how to appropriately organize a project within them. We will also review how to submit projects to the Illinois Campus Cluster tool, and touch on situations where it may be valuable to do so.
“The Linux Command Line” (2nd ed), by William Shotts
I decided to do an independent study this semester because I wanted to learn how to analyze data using python packages that are typically used for data science projects. Last semester, I learned how to perform statistical analyses in R (IS542) and how to use various machine learning algorithms for data mining in WEKA (IS590DT). However, the language that I am most comfortable with - and use for data wrangling - is python, and I have yet to learn how to apply the concepts I learned in those previous classes to the python data science ecosystem. I also want to learn how to do time-series analysis - something that neither of those previous classes covered.
This independent study is also functioning as a follow up to another class I took last semester, Open Data Mashups (IS590OM). The goal of that class was to produce a new analysis-ready dataset that combined data from other open data sources. I created a country-year time-series dataset that combined data from the Correlates of War project, the UCDP/PRIO Armed Conflict dataset, the World Bank’s World Development Indicators, and the Polity IV project. For my independent study, I will be analyzing this dataset in a variety of ways, implementing the statistical learning techniques I learned last semester in python.
“Introduction to Machine Learning with Python” (2016), by Andreas Müller and Sarah Guido
“Python for Data Analysis” (2017), by Wes McKinney
“Python Data Science Handbook” (2017), by Jake VanderPlas
“Machine Learning” MOOC by Andrew Ng
One year down, one year to go! During the first year of my MSLIS degree, I focused on the fundamentals: how to organize and process data. This year, I am shifting my focus to analysis. This semester I am taking three classes focused on different ways to analyze data, as well as one project-oriented class that will allow me to collect together the data I want to analyze.
Note that the course descriptions, themes, learning outcomes and such are pulled from the syllabi for each respective course.
This is an introduction to statistics for information professionals - it covers the fundamental concepts of statistical analysis and how to perform those analyses in base R using RStudio.
An introduction to statistical and probabilistic models that are used to quantify information, assess information quality, and make decisions in a principled way. The increasing prevalence of massive data sets and falling computational barriers have rendered statistical modeling an integral part of contemporary information management. With this in mind, this class prepares students to select and properly undertake common modeling tasks. The course reviews relevant results from probability theory, emphasizing the merits and limitations of familiar probability distributions as vehicles for modeling information. Subsequent consideration includes parametric and non-parametric predictive models, as well as extensions of these models for unsupervised learning. Throughout these discussions, the course focuses on model selection and gauging model quality. Applications of statistical and probabilistic models to tasks in information management (e.g. prediction, ranking, and data reduction) are emphasized.
“An Introduction to Statistical Learning: with Applications in R” (2013), by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
“OpenIntro Statistics” (3rd ed)
Publisher’s Site, free download
“An Introduction to R” (3rd ed), Venables, W.N., Smith, D.M and the R Core Team
One of the most popular classes in the iSchool, this class is all about data visualization. The class mostly focuses on python libraries, and each class has both a lecture that deep-dives on one particular aspect of visualization as well as live code-alongs.
Data visualization is crucial to conveying information drawn from models, observations or investigations. This course will provide an overview of historical and modern techniques for visualizing data, drawing on quantitative, statistical, and network-focused datasets. Topics will include construction of communicative visualizations, the modern software ecosystem of visualization, and techniques for aggregation and interpretation of data through visualization. Particular attention will be paid to the Python ecosystem and multi-dimensional quantitative datasets.
Lecture slides and Jupyter notebooks from the code-alongs are available on the course’s GitHub website.
While IS542 focuses on the more traditional methods of statistical analysis, this class focuses on modern methods of extracting information from large data. Techniques for classification, clustering, and prediction are taught using WEKA, a free software for machine learning.
Data mining refers to the process of exploring large datasets with the goal of uncovering interesting patterns. This process usually involves a number of tasks such as data collection, pre-processing, and characterization; model fitting, selection, and evaluation; classification, clustering, and prediction. Although data mining has its roots in database management, it has grown into a discipline that focuses on algorithm design (to ensure computational feasibility) and statistical modeling (to separate the signal from the noise). As such, it draws heavily upon a variety of other disciplines including statistics, machine learning, operations research, and information retrieval. This course will cover the major data mining concepts, principles, and techniques that every information scientist should know about. Lectures will introduce and discuss the major approaches to data mining, computer lab sessions coupled with assignments will provide hands-on experience with these approaches, and term projects offer the opportunity to use data mining in a novel way. Mathematical detail will be left to the students who are so inclined.
“Data Mining: Practical Machine Learning Tools and Techniques” (4th ed), by I.H. Witten, E. Frank, M.A. Hall, and C.J. Pal
Data used for the course is available through the professor’s website.
The entire point of this class is to produce a new dataset that combines multiple other open data sources. The project is the class - with some career planning and development sprinkled in along the way. The end goal is to have a project for your portfolio, and the tools to use that project when applying for jobs - think elevator pitch, cover letter, resume, presentation, etc. As an added bonus, you have an interesting dataset ready and waiting for analysis.
Data sharing and modern open data standards have been creating large repositories of data that remain disconnected. Many data science and machine learning techniques are boosted by incorporating data representing a variety of domains and granularities. Topics on data curation, data cleaning, copyright, web scraping, storage, processing, and automation will be reviewed. This course seeks to explore techniques and perspectives of combining various data sources to create a dataset ready for analysis, but in a project oriented space so that each topic is synthesized with practice and experienced in context. Students will select a project area and explore the technical and conceptual requirements of that project space, eventually producing a proof of concept around it. All project domains and areas are open, with the only requirement being that they combine several data sources into a new dataset. This course is meant for students who have completed at least two semesters of coursework, are comfortable with programming in Python (the project can be completed in any language, but instruction will be in Python), and desire a space to explore and develop a capstone or independent study project. However, further work on the project is not a requirement. Guest speakers and field experts from the University Library will be invited. Students will be encouraged to share and publish their datasets at the end of the semester.
Today was the last day of both my class and my internship for the summer! I have been taking a class online, IS562: Metadata in Theory & Practice, as well as working as a Data Science Intern at the Program on Governance and Local Development in Gothenburg, Sweden. First I’ll introduce the class, and then share a reflection on my internship experience.
Rather than focusing on learning a specific metadata schema (some classes focus just on using those common in libraries, like MARC and MODS), this class focused on teaching the overarching concepts and technical traits that all metadata schemas share, in order to prepare students to work with any of the wide variety of metadata schemas and standards used in the real world (libraries, companies, academia, government, archives, etc). Along the way we got practice using a wide variety of specific metadata schemas (Dublin Core, CDWALite, PREMIS, and CIDOC-CRM just to name a few). While most metadata schemas are based in XML, we also learned about schemas that use RDF and how the semantic web and linked open data are becoming more prominent in the metadata world. PREMIS is a schema that currently uses XML but is in the process of transitioning to RDF, which makes it a very interesting schema to study.
Metadata plays an increasingly critical role in the creation, distribution, management and use of electronic materials. This course will combine theoretical examination of the design of metadata schema with their practical application in a variety of settings. Hands-on experience in the creation of descriptive, administrative, technical, provenance and structural metadata in XML- and RDF-based schema, along with their application in systems such as harvesting and digital repositories, will help students develop a thorough understanding of current practices in metadata and metadata schema creation.
Combines theoretical examination of the design of metadata schema with their practical application in a variety of settings. Hands-on experience in the creation of descriptive, administrative, and structural metadata, along with their application in systems such as OAI harvesting, OpenURL resolution systems, metasearch systems and digital repositories, will help students develop a thorough understanding of current metadata standards as well as such issues as crosswalking, metadata schema, metadata’s use in information retrieval and data management applications, and the role of standards bodies in metadata schema development.
I have spent the past 10 weeks working as a Data Science Intern at the Program on Governance and Local Development (GLD). The internship was kicked off in the best way possible - attending GLD’s annual conference to learn about accountability in local governance and enjoying the wonderful hospitality only a Swedish seaside spa hotel can offer. The conference was a great way to learn about the ongoing academic work in local governance and development, and I was able to meet many interesting professionals from both academia and development agencies. But after the conference, it was time to get to work. This summer GLD has been conducting a survey in Kenya, Zambia, and (later) Malawi to gather data for their Local Governance Performance Index. GLD had just hired a new Data Scientist, and it was my job to help her with monitoring the data incoming daily from the survey. In addition to learning new tools like Survey To Go and dipping my toes into R, I was able to use the Python skills I’ve spent the last year developing almost every day. My skillset of information processing and organization was a perfect complement to the Data Scientist’s skillset in statistical analysis (she had just completed her PhD in Statistics), and we worked really well together. Data science typically has two sides - preparing the data, and analyzing the data. I happen to enjoy (and have more experience in) preparing the data, while her expertise was in analysis. Along the way, I got a few ad-hoc lessons in various statistical concepts (she loves to teach statistics, and I was a very willing audience), and I did my best to make her job easier and allow her to spend more time on analyzing the data.
The vast majority of my python work was in processing XML documents using XPath (extracting needed information and sometimes putting it into tabular form), reorganizing tabular data using Pandas (with my design choices informed by relational databases), and writing extracted data into whatever format was needed for analysis (such as WKT files for use in QGIS). Every week - sometimes every day - there was some new information puzzle to solve… and it was fun. This was messy, real-world data that would eventually be used to try and improve people’s lives. While using my newly developed technical skills was a major priority for me, it was just as important that my work, my time and effort, would go toward improving the world.
This internship was also a perfect transition between my first year at the iSchool - which was focused on information organization and processing - and my second year at the iSchool - which will be more focused on data analysis. I got to use my python skills to prepare data for analysis, while getting an introduction to analysis through the osmosis of office conversations. I’ve even started working my way through Hadley Wickham’s R For Data Science, in anticipation of learning R in one of my classes next semester. I have learned so much over the past year… I can only hope that I will learn just as much over the next!
Note: This was originally written as a final paper for IS590TH: Theories of Information, in May 2019. I’ve edited it slightly for this blog post.
Generally speaking, a Unique Identifier (UID) is an inscription that represents (no more than) one entity within a given system. UIDs are essential to the functioning of modern information systems, so it is important to understand and define what a UID is and how it should be used.
In “Identifiers for the 21st Century”, McMurry et al. provide guidelines for creating good unique identifiers. However, while they outline the desirable characteristics of a UID, they refrain from stating which characteristics are necessary for an identifier to truly qualify as a UID, and which are merely good practice. In “Sense and Reference on the Web”, Halpin examines a specific type of identifier, the URI (uniform resource identifier), but also refrains from defining unique identifiers more generally outside of the context of the web.
In this post, I will define unique identifiers by deducing their essential properties.
Information systems exist to represent or describe some aspect of the real world. Entities are the specific things that we are describing through their attributes and relationships. For example, think of a person. You may wish to record specific attributes of this person, such as their name, age, eye color, hair color, height, etc. You may wish to describe their role within the system by establishing relationships with other entities - what company they are employed by, who their parents are, or what objects they own. But how can you identify this person? In natural language, you would probably use their name. But what if there are multiple people with the same name? You might add other attributes - “the Jane with the brown hair”, or “the Jane who works for B Corp.” You may need to use multiple attributes to identify the intended person. However, the combination of these attributes (which can uniquely identify Jane) is not a UID.
Think about what is happening in your mind. You have a mental concept of the person, but you cannot telepathically transmit this mental concept to another person. So instead, you describe the person using attributes - things that you associate with your concept of the person. This person’s name is Jane. This person is a woman. This person has brown hair, and so on. You know that the other person you are conversing with also has a mental concept of Jane. If you settle on the right combination of attributes (which uniquely identify Jane for both parties), then you will know that saying “Jane” refers to that specific person.
Just as you cannot telepathically transmit your mental concept of Jane to another person, you cannot transmit this concept to a computer. However, you still want to create a record of Jane in the information system, including all of her attributes. Unlike a person, however, a computer is unable to form a mental concept. Instead, the concept of the person named Jane is represented in the information system by a UID. Let’s say Jane’s UID is CH4TW1N. You can tell the computer that CH4TW1N’s name is Jane, that CH4TW1N is a woman, that CH4TW1N has brown hair, and so on. Thus, a UID functions for a computer like a mental concept functions for a person. In Fregean terms, a UID is a proper Name, with the entity it picks out as its referent. This is the UID’s fundamental purpose in information systems.
Let’s first begin by describing an identifier.
An identifier is the physical, inscribable representation of a concept. The most common form of a UID is an alphanumeric string stored on a computer. For example, take the alphanumeric string ‘CH4TW1N’. On a computer, this alphanumeric string has a specific encoding, and could be translated to some combination of 1s and 0s, which themselves are physically encoded in the computer’s hardware. A person could also physically write down ‘CH4TW1N’ with pen and paper, and it would still be an identifier. However, I could not simply say out loud the phonetic pronunciation of this string and expect it to be an identifier - it cannot be referenced at a later time (unless it was recorded, in which case it is still being physically encoded on some medium). So, the first fundamental trait of an identifier is that it must be an inscription in some form that can be referenced at a later time.
Furthermore this inscription must pick out an entity. An entity could be a concept, a person, a physical object, a place, etc. Anything that has an independent concept in your mind is an entity; more commonly, in English a good rule of thumb is that entities are nouns (person, place, thing, or idea). When two entities are connected in some way, that connection is referred to as a relationship. Relationships are not entities; however, certain relationships may be instances of a concept, which may be an entity. For example, Jane’s brother is Martin. The Jane-Martin relationship is not an independent entity, because it cannot mentally exist without either Jane or Martin. However, the concept of being siblings can exist independently, and could be an entity. While translating relationships into entities can be quite complex, attributes are simpler. If Jane has brown hair, then ‘brown hair’ is an attribute of Jane. However, ‘brown hair’ is not exclusive to Jane - we can picture brown hair independently of Jane. Attributes can often be entities themselves.
So, we understand what an identifier is. However, unique identifiers have two core additional properties: UIDs must be both unique and unambiguous.
If a UID is unambiguous, any given UID points to exactly one entity (McMurry et al). For example, if there are two Janes at B Corp, then the identifier ‘Jane’ will pick out two separate entities. The identifier ‘Jane’ is ambiguous, and therefore not a UID. If a UID is unique, any given entity is assigned exactly one identifier (McMurry et al). Suppose Jane is assigned the identifier CH4TW1N at B Corp, then finds a new job at F Corp, where she has the identifier W4TCH3R. She later moves back to B Corp, and is assigned the new identifier 3L1Z4PM. Jane now has two identifiers at B Corp. Thus, Jane does not have a unique identifier, by virtue of the fact that two identifiers will pick her out.
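The two properties can be checked mechanically against a table of assignments. The following is a minimal sketch, not a definitive implementation: the function names are hypothetical, Martin's identifier is made up for illustration, and plain name strings stand in for entities only to keep the example short.

```python
from collections import defaultdict

# Hypothetical (identifier, entity) assignments within B Corp's system.
assignments = [
    ("CH4TW1N", "Jane"),
    ("3L1Z4PM", "Jane"),    # second identifier issued when Jane returned
    ("M4RT1N0", "Martin"),  # made-up identifier for illustration
]

def ambiguous_identifiers(pairs):
    """Return identifiers that pick out more than one entity."""
    entities_by_id = defaultdict(set)
    for identifier, entity in pairs:
        entities_by_id[identifier].add(entity)
    return {i: e for i, e in entities_by_id.items() if len(e) > 1}

def non_unique_entities(pairs):
    """Return entities that are picked out by more than one identifier."""
    ids_by_entity = defaultdict(set)
    for identifier, entity in pairs:
        ids_by_entity[entity].add(identifier)
    return {e: i for e, i in ids_by_entity.items() if len(i) > 1}
```

With these assignments, no identifier is ambiguous, but Jane is reported as non-unique because two identifiers pick her out.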
The two core properties of uniqueness and unambiguousness imply two further properties - identifier stability and entity stability. An identifier is stable if it does not change over time (McMurry et al). Jane’s identifier at B Corp changed over time from CH4TW1N to 3L1Z4PM, so it was not stable. An entity is stable if the entity picked out by an identifier does not change over time (McMurry et al). If, after Jane left B Corp, her CH4TW1N identifier was reassigned to Julia, then the entity it picks out is not stable. In other words, an identifier’s uniqueness and unambiguousness must persist over time.
These properties of uniqueness, unambiguity, identifier stability, and entity stability can only be enforced if the UIDs exist within a system. Theoretically, this system would contain every possible UID, whether it is currently mapped to an entity or not. In this system, a UID could exist in one of three possible states: not yet assigned to an entity, currently assigned to a single entity, or retired (was assigned to an entity, which for some reason was removed from the system). A UID can only travel one way along this path - it cannot go backwards from being retired to being assigned to an entity. In this system, a UID isn’t really created or deleted, merely assigned. And a UID cannot be reassigned, because that would result in the entire system failing to meet the required properties of unambiguity and entity stability.
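The one-way path through the three states can be sketched as a small registry. This is an illustration under assumed names (UidState, UidRegistry, and their methods are hypothetical), not a description of any particular system.

```python
from enum import Enum

class UidState(Enum):
    UNASSIGNED = 1
    ASSIGNED = 2
    RETIRED = 3

class UidRegistry:
    """Tracks each UID's state; transitions only ever move forward."""

    def __init__(self):
        self._entity_by_uid = {}  # currently assigned UIDs -> entity
        self._retired = set()     # retired UIDs are kept so they are never reused

    def state(self, uid):
        if uid in self._retired:
            return UidState.RETIRED
        if uid in self._entity_by_uid:
            return UidState.ASSIGNED
        return UidState.UNASSIGNED

    def assign(self, uid, entity):
        # Only a never-used UID may be assigned; this forbids reassignment.
        if self.state(uid) is not UidState.UNASSIGNED:
            raise ValueError(f"{uid} has already been used and cannot be reassigned")
        self._entity_by_uid[uid] = entity

    def retire(self, uid):
        if self.state(uid) is not UidState.ASSIGNED:
            raise ValueError(f"{uid} is not currently assigned")
        del self._entity_by_uid[uid]
        self._retired.add(uid)
```

Attempting to assign CH4TW1N to Julia after it has been retired raises an error, which is the behaviour the one-way rule calls for.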
The fact that a UID must exist within a system of all possible UIDs implies that there are a finite number of possible UIDs within the system - and furthermore, that UIDs must obey certain rules.
When a system is created in order to record information about entities (which will have UIDs), the set of all possible UIDs must be defined. In practical terms, this means that a UID can be defined by a regular expression; in linguistic terms, this means that a UID must be describable by a formal grammar.
For example, let’s say that B Corp’s system defines a UID as a string of alphanumeric characters of length 7. There are 26 letters and 10 digits, and 36^7 is 78,364,164,096. This system of UIDs can identify over 78 billion entities - more than sufficient. Most of these UIDs will never be assigned, and thus never need to enter the database. They will never leave the first stage. Some UIDs will be used to identify employees, and thus enter the second stage. The UID will be entered into a database and be assigned attributes with specific values. Some of these UIDs may be retired - for example, if an employee passes away. However, these retired UIDs must never be used again - which means that the physical system must still keep track of them, if only to prevent their reassignment.
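As a quick sketch of that example, the grammar can be written as a regular expression and the size of the UID space computed directly. The pattern assumes uppercase letters only, which matches the count of 26 letters above; the names UID_PATTERN and is_well_formed are hypothetical.

```python
import re

# Assumed grammar: exactly 7 characters, each an uppercase letter or a digit.
UID_PATTERN = re.compile(r"[A-Z0-9]{7}")

def is_well_formed(candidate):
    """True if the whole string is produced by the assumed grammar."""
    return UID_PATTERN.fullmatch(candidate) is not None

ALPHABET_SIZE = 26 + 10   # letters plus digits
UID_LENGTH = 7
TOTAL_UIDS = ALPHABET_SIZE ** UID_LENGTH   # 36^7 possible UIDs
```

Under this grammar CH4TW1N is well formed, while a six-character or lowercase string is not.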
How these UIDs are actually generated is not an issue, so long as they follow these simple rules: they must be formed from a formal grammar’s production rules, and once assigned to an entity they must never be reassigned.
It is generally regarded as bad practice to embed meaning within UIDs. When meaning is embedded in a UID, it may become easy to confuse a UID with an attribute. However, while a UID’s value could be considered an attribute of the entity, the UID’s purpose is to stand in for the entity itself - and thus is not an attribute. There is one fundamental distinction between UIDs and attributes: while an attribute’s value may change, a UID may not. Thus, if a UID’s value is based on an attribute’s value, and that attribute’s value changes, confusion may ensue. The UID cannot be changed, or the entire UID system will be invalidated. But a person looking at the UID may make the wrong assumption. Suppose, for example, that the first character ‘C’ in Jane’s UID means that she is a level C employee. But Jane recently got promoted to level B. Her UID cannot be changed and still be valid, but her UID also incorrectly implies that she is a level C employee.
Encoding meaning in a UID is dangerous - but is it invalid? Even though someone might make the wrong assumption by only looking at Jane’s UID, so long as the UID remains unique, unambiguous, and stable over time, it is still a valid UID. Rather than a person making an erroneous assumption, the reverse should be true - individuals must disassociate meaning from the UID since it cannot be guaranteed to be correct. As McMurry et al recommend, “Meaning should only be embedded if it is indisputable, unchangeable and also useful to the data consumer.”
When a UID system is created, it is usually created to work within a specific context - a company, a university, a government database, etc. Outside of the system, a UID’s uniqueness cannot be guaranteed. However, data systems frequently need to be integrated. And so this system, this context, needs to be defined. This is frequently accomplished by assigning the context itself to a unique identifier (commonly denoted as a prefix followed by a colon). The prefix designates which system a UID belongs to, and thus becomes a part of the identifier itself. Together, all of the contexts and their prefixes form a new system - a new context that is the union of all sub-contexts contained within. And so, the prefix becomes a part of this new system’s UID formal grammar.
If the prefix is to become part of the UID, then it too must follow the rules outlined thus far: to be unambiguous (one prefix designates exactly one context), to be unique (a context has exactly one prefix), and to be stable (prefixes do not change over time or refer to different contexts over time). This stability over time is what necessitates the use of prefixes - for even if the local UIDs happen to still be unique and unambiguous in the merged system, there is no guarantee that this will still be the case for future states of the system, as each local UID’s uniqueness is only guaranteed within its local system. Whenever one system is merged with another, namespaces and prefixes must be designated, or the identifier cannot qualify as a unique identifier.
Unique context identifiers can ensure the new UID system is unambiguous (each UID picks out exactly one entity); however, ensuring that the new UID system is unique is more complicated. For example, suppose B Corp and F Corp are both acquired by L Corp, which combines all employee UIDs into one system using context identifiers. Suppose B Corp’s prefix is B and F Corp’s prefix is F. Suddenly, we have a problem. B:CH4TW1N and F:W4TCH3R both refer to Jane, and so the entire UID system no longer fulfills the uniqueness requirement.
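A prefixed identifier can be modelled as a pair of context identifier and local identifier. The sketch below assumes the prefix-colon-local convention described above; the class name PrefixedUid and its parse method are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PrefixedUid:
    prefix: str  # context identifier, e.g. "B" for B Corp
    local: str   # identifier within that context

    @classmethod
    def parse(cls, text):
        # Split on the first colon only; both parts must be non-empty.
        prefix, sep, local = text.partition(":")
        if not sep or not prefix or not local:
            raise ValueError(f"expected 'prefix:local', got {text!r}")
        return cls(prefix, local)

    def __str__(self):
        return f"{self.prefix}:{self.local}"
```

The prefix keeps the merged system unambiguous, since the same local string in two contexts yields two different UIDs. It does nothing for uniqueness: B:CH4TW1N and F:W4TCH3R remain two distinct UIDs for one person.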
To solve this problem, L Corp should find the combination of stored attributes which can uniquely identify a person. For example, B Corp and F Corp will likely store attributes such as her first name, last name, and birthdate. Together, these attributes may uniquely identify an employee (though there is no guarantee). Or they may have stored other attributes for security purposes, such as fingerprints or a retinal scan. These are much more likely to guarantee uniqueness. Regardless, having a combination of attributes that can uniquely identify an entity can help a system enforce its UID uniqueness. In database terms, this is known as a compound primary key. It’s also possible that a single attribute may uniquely identify Jane, such as her social security number - a UID from a much broader system. L Corp can now create a mapping from one context to another. However, this is only the first step in the solution.
The real problem is that the two local contexts identify entities of the same type (person), thus creating the danger of overlap (non-uniqueness). It can be generalized that in a UID system that contains multiple contexts, besides the prefix itself being unique, the entity type that the prefix describes must also be unique. This means that in order for L Corp to have a UID system, they must create a formal grammar and then assign a new UID to each employee. L Corp may wish to preserve the old B Corp and F Corp UIDs for historical purposes, and so may create mappings between the UID systems, but only by establishing a completely new set of UIDs can L Corp ensure that each employee has an unambiguous and unique identifier.
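The merge described here can be sketched in a few lines: match records across contexts on a combination of attributes, issue one new UID per matched person, and keep a mapping from the old prefixed UIDs. Everything specific in this example is an assumption made for illustration: the attribute values, Martin's identifier, the function name, and the sequential format of the new UIDs.

```python
from itertools import count

# Hypothetical records: local UID -> (first name, last name, birthdate).
contexts = {
    "B": {
        "CH4TW1N": ("Jane", "Doe", "1990-01-01"),
        "M4RT1N0": ("Martin", "Doe", "1992-02-02"),
    },
    "F": {
        "W4TCH3R": ("Jane", "Doe", "1990-01-01"),
    },
}

def build_mapping(contexts):
    """Map every old prefixed UID to a newly issued UID, one per person."""
    new_uid_by_key = {}   # attribute combination -> new UID
    mapping = {}          # old prefixed UID -> new UID
    counter = count(1)
    for prefix, records in contexts.items():
        for local_uid, key in records.items():
            if key not in new_uid_by_key:
                # Assumed format for the new system: seven zero-padded digits.
                new_uid_by_key[key] = f"{next(counter):07d}"
            mapping[f"{prefix}:{local_uid}"] = new_uid_by_key[key]
    return mapping

mapping = build_mapping(contexts)
```

Both of Jane's old identifiers now map to the same new UID, while Martin receives a different one. The result is only as reliable as the attribute combination used for matching, which, as noted above, carries no guarantee.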
Suppose that a UID system identified multiple types of entities - people, locations, objects, etc. If any one of these entity types overlaps with another UID system, the same process of (1) finding the combination of attributes that can uniquely identify an entity, (2) mapping between the systems, and then using that mapping when (3) creating a completely new UID system must be followed. (This also implies that it is good practice for a local UID system to identify one entity type, but this is not a necessary condition for UIDs.)
It is important to note that there is no such thing as a truly global UID - nor is it desirable for there to be a global UID system (though that discussion is outside the scope of this post). For there to be a truly global UID, all entities - all information - would have to exist within a single information system. Even with the existence of the internet, this is not feasible. Therefore, the issue of context identifiers, namespaces, prefixes, and mappings between systems is unavoidable - any definition of a UID must discuss both local identifiers and context identifiers as components of an overall UID.
For local (single context) UIDs, embedding meaning is strongly discouraged (though not disqualifying). However, when context identifiers and prefixes are added to the mix, embedded meaning quickly becomes impossible to escape. Many prefixes are acronyms or short words, describing either the organization that produced the UID or the type of entity that the UID set describes (or a combination of the two). Should meaning be a disqualifying characteristic of UIDs, no non-local UID with a context identifier could be considered a UID - and as I have already established, any definition of UIDs must consider context identifiers as an essential aspect of a UID. Therefore, while opacity is a recommended characteristic for UIDs, it is not a necessary property. So long as the UID still follows the production rules of its formal grammar, any meaning, purposefully embedded or not, is irrelevant.
Despite the best of intentions, it is inevitable that a UID system will change over time. A UID may need to be reassigned, or the formal grammar may need to change. When this happens, a completely new UID system is created. UIDs cannot be reassigned, and the set of all possible UIDs has already been defined. By definition, should either of these things happen, it necessitates a distinct system and context, and thus a distinct prefix as well.
This most frequently happens as a result of versioning. A common convention is to append the version number at the end of the original prefix (again embedding meaning in the context identifier). It is important to document this association, so as to avoid both contexts being used in a larger system (as this would introduce a uniqueness problem), as well as to provide a mapping from the old system to the new.
This implies that in addition to context identifiers having a specific entity type (or set) that their local UIDs reference as an attribute, context identifiers also have a specific time period as an attribute. If B Corp changed their production rules (for example, they wish to make their UIDs 5 characters long instead of 7) on January 1, 2018, then the context identifier bcorp has the attribute of identifying entities of the type person and the attribute of identifying entities prior to 2018. The context identifier bcorp_v2 has the attribute of identifying entities during and after 2018. Another property of context identifiers is the formal grammar used to generate all local UIDs. If this set of UIDs is stored on the web, another attribute of the context identifier may be its web address (URL).
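One way to picture a context identifier together with its attributes is as a small record. In this sketch the class name Context and its field names are hypothetical, and the grammar is assumed to be the uppercase alphanumeric pattern from the earlier B Corp example.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Context:
    identifier: str               # the context identifier itself
    entity_type: str              # attribute: type of entity identified
    pattern: str                  # attribute: grammar for local UIDs, as a regex
    valid_from: Optional[date]    # attribute: start of the period (None = open)
    valid_until: Optional[date]   # attribute: end of the period, exclusive

    def accepts(self, local_uid):
        """True if local_uid is produced by this context's grammar."""
        return re.fullmatch(self.pattern, local_uid) is not None

bcorp = Context("bcorp", "person", r"[A-Z0-9]{7}", None, date(2018, 1, 1))
bcorp_v2 = Context("bcorp_v2", "person", r"[A-Z0-9]{5}", date(2018, 1, 1), None)
```

Here CH4TW1N belongs to the grammar of bcorp but not to that of bcorp_v2, so the two versions are distinct systems even though they describe the same entity type.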
However, it is important to distinguish between a context’s attributes and a context’s identifier. An online location is an attribute, because it is not guaranteed to be stable over time. Within the semantic web community, a web address (URL) is often treated as a unique identifier. This is erroneous, as web addresses have the potential to change from one moment to the next. For a web address to be a UID, the entire system would have to be created anew every second. Perhaps this is feasible - for the web’s system to incorporate a timestamp into each context identifier, but it is certainly not practical or desirable. Furthermore, the semantic web community does not argue that this is the case - and so, we are left with the conclusion that web addresses (URLs) are not UIDs, both because they are not stable over time and because their uniqueness cannot be guaranteed (there is nothing restricting me from copying one webpage exactly at a different web address). I will repeat my point: location is an attribute, not a unique identifier.
There is a larger issue at play here, which I will touch on only briefly - the issue of identity. This is absolutely core to the concept of unique identifiers, as each UID must pick out an entity with a distinct identity. However, it is generally accepted that entities may change - Jane can dye her hair blonde, change employers, move to a new location… even change her name. But you would not consider blonde Eliza of P Corp to be a different person. You could argue that she still has the same fingerprints, the same DNA, but even these can change over time. Eliza and Jane are still the same person. When it comes to physical entities like people, identity is fairly easy to grasp, if hard to define. However, identity is not nearly so easily traced when it comes to imaginary entities - countries, corporations, books, etc. When I say imaginary entities, I mean those things that would not exist without people to think them up. A tiger is a tiger, whether human beings ever evolved from great apes or not. But a corporation only really exists in society’s collective consciousness. It may be described on paper, have physical offices, and pay employees, but the moment everybody stops believing that the corporation exists, it will cease to exist as an entity (Yuval Noah Harari explores this concept in “Sapiens”). For these types of entities, identity is much harder to trace. What if the corporation changes its name? What if it fires all of its employees, sells all of its offices, and finds new ones? What if it sells a different product? Is it still the same corporation? What is the line that an entity must cross in order to assume a new identity, and thus a new UID? These are questions that must still be considered in order to have a full understanding of unique identifiers - for the meaning of the entity that a UID identifies is essential to the meaning of the UID itself.
Definition of a UID:
Remember
A UID may be decomposed into a context identifier and a local identifier. In this case, all context identifiers must also follow this definition, wherein the entity referred to is the set of local identifiers.
When a local identifier violates any of the rules in the definition, a new context is created and a new context identifier must be defined.
Opacity (lack of inherent or inferred meaning), while highly recommended for UIDs, is not necessary.
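The decomposition described in these points can be sketched in a few lines. Note the assumptions: the colon separator, the function name `parse_uid`, and the sample local identifiers are all made up for illustration, since the text does not prescribe how the two parts are written together.

```python
from typing import NamedTuple


class UID(NamedTuple):
    """A UID split into its context identifier and local identifier."""
    context: str
    local: str


def parse_uid(text: str, sep: str = ":") -> UID:
    """Split 'context<sep>local' into its two parts (the separator is an assumption)."""
    context, found, local = text.partition(sep)
    if not found or not context or not local:
        raise ValueError(f"not a context{sep}local identifier: {text!r}")
    return UID(context, local)


# The same local identifier under two contexts yields two distinct UIDs,
# which is why a new context identifier restores uniqueness.
old = parse_uid("bcorp:00123")
new = parse_uid("bcorp_v2:00123")
```

Here `old` and `new` share a local identifier but compare as different UIDs, because the context identifier is part of the identity.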
McMurry JA, Juty N, Blomberg N, Burdett T, Conlin T, Conte N, et al. (2017) Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data. PLoS Biol 15(6): e2001414. https://doi.org/10.1371/journal.pbio.2001414
Harry Halpin. (2011) Sense and Reference on the Web. Minds Mach. 21, 2 (May 2011), 153-178. http://dx.doi.org/10.1007/s11023-011-9230-6
]]>It’s summer, and I have officially completed half of my Master’s in Library & Information Science degree. This past semester has been extremely busy for me - not only was I taking more classes, but I was working 20 hours/week as a Graduate Research Assistant at the Cline Center on campus. And while that meant that I had essentially no free time (hence the utter lack of blog posts), there was nothing that I wanted to drop… so I can only blame myself. And next year looks to be at least as busy! But before I look ahead, I want to take some time to reflect on what I have learned and accomplished during my first year of library school.
The MSLIS degree at UIUC has only two required classes (IS501 & IS502), and I have now completed both. The degree also requires 40 credit hours, and I have now completed 24 credit hours - I’ll definitely be going well over the credit hour requirement by the last semester. My cumulative GPA for the first year is 3.94 - despite an overloaded semester, I was able to earn straight A’s. Check, check, check - from an academic standpoint, my MSLIS degree is going very well.
I chose UIUC and planned my classes with the express purpose of gaining and advancing my technical skills. I had a fairly good idea of what fundamentals were important to learn, and I could hold conversations about various technical concepts. I was familiar enough with the tech world to not be overly intimidated by the concepts we were learning about in class, but not enough to actually be able to accomplish something on my own. Now, just a year later, I have confidence in my ability to solve information problems. I feel like an information magician - in the same way that the magicians in my favorite fantasy novels can manipulate the physical world around them with spells, I can manipulate data in an information world with code - a very empowering feeling in our Information Age. I was able to reach this point by focusing my learning on 3 core concepts:
Philosophy has always been extremely interesting to me. I love having philosophical debates and learning about new philosophical ideas. There is nothing more important to learning than being able to think critically - and nothing teaches critical thinking skills like philosophy. I have a whole rant about how severely undervalued and underutilized philosophy is in the modern era - but I will save that for another time. The point is, philosophy is important because it is foundational to every field of study… whether you realize it or not.
In the fall I audited a practical philosophy class, PHIL103: Logic & Reasoning. I was introduced to propositional logic - propositions, validity, truth tables, translating arguments into the logical language, probability and Bayes’ theorem. Essentially, I learned how to think in a purely logical way - the same way that computers “think”. I was cultivating a logical headspace that proved to be extremely useful in both my database and python classes.
In the spring I took IS590TH: Theories of Information. This was a one-time, small seminar-style class taught by the iSchool Dean, Dr. Renear. The goal of the class was to answer one seemingly simple question: what is information? We learned about the technique of conceptual analysis - essentially defining an abstract concept - and slowly accumulated the building blocks we would need in order to define information. We read original works by Gottlob Frege, Alonzo Church, Bertrand Russell, Saul Kripke, Edmund Gettier, H. P. Grice, and Jonathan Furner. While the Logic class had introduced me to the concept of a proposition, in this class we dissected it - to a degree that I had not thought possible. We dissected a lot of concepts, and as a result gained a much deeper understanding of all the fundamental building blocks that make up the concept of information. This was a class that really got my brain fired up - what can I say? I like thinking deeply about abstract and obscure things. But as a result, the headspace I have been cultivating for working with information deepened, broadened, and became more connected. I am able to think critically about information and data in a way that I could not have imagined a year ago, and I feel prepared to tackle more advanced subjects because I have such a solid foundation in the study of information science.
I knew from the beginning that databases were one of those fundamental technologies that I needed to learn how to use, simply because of how prevalent they are in storing information. But I had yet to learn just how important it is to properly organize and structure data so that it can be used properly.
For my first semester, I took IS490DB: Introduction to Databases. My main goal was to learn just what databases are and how to access the information in them using SQL. The course, however, placed a lot of emphasis on database design - and in the end, that was far more fascinating to me than learning how to write SQL queries (though I certainly learned that as well). We learned how to design a database from scratch - modeling the information in terms of entities, attributes, and relationships and drafting a Chen-style ER diagram (and later EER diagrams), mapping to the relational model, building the relational model with a GUI and then finally with SQL. At every stage there are design decisions to make that fundamentally affect how users will interact with the information. Throughout the course we were taught the principles of normalization, but we weren’t formally taught about normalization and functional dependencies until the very end of the course… and at that point it just felt natural and instinctual to use those good database design principles. I was fully converted - relational databases are da bomb!
During the spring I took IS561: Information Modeling. In this class, relational databases were just one technique of many for organizing information. And before you can organize information for use with a particular technology, you need to create a model of that information. Different types of information lend themselves well to different models and technology. We learned how to model documents using XML and how to craft the schema in a DTD; how to model networks using graphs and how the semantic web uses knowledge graphs formed from a combination of RDF syntax and OWL ontologies; we studied logic through phrase-structured formal grammars (BNF) and learned how to use first-order predicate logic (an expansion of propositional logic) and its applications in developing ontologies. Only at the end of the class did we learn about relational databases and how to map between the relational model and RDF. As it turns out, relational databases are just one of many ways to structure data - and the same information can be modeled using many different techniques and technologies, each with their advantages and disadvantages. And besides adding a bunch of modeling/organizational tools to my information magician toolbelt, I was also able to advance my understanding of formal logic. This was also a very project-oriented class, so I got a lot of hands-on practice using all of these tools.
I knew from the start that learning how to program would be very important - but it was also something I was kind of dreading. I was afraid I would find it boring, or too difficult and complicated. You can’t really say you have technical skills in this age and not know how to program - and my whole goal for this degree was to gain and advance my technical skills. What if I simply didn’t enjoy it?
As you can probably tell at this point, those fears were not realized. My first semester, I began learning how to program using the python language in IS452: Foundations of Information Processing. In this class, the emphasis was on using python to process information (vs building a software program). I had never considered software development to be very interesting, and I hadn’t really thought about other uses of programming languages outside of the realm of software development. But using it to process and transform data? That I found myself enjoying. I had already encountered the problem of having the data I needed but not in the right form for analysis (during my undergrad honors thesis). Having the power to manipulate data structures and get the information I actually needed in a form that I could analyze? That was some powerful stuff. Besides learning how to use all of the basic data structures/techniques in python (lists, dictionaries, tuples, loops, functions, decision structures) we also learned how to read/write files (text, csv, json), navigate XML documents using XPath, and craft Regular Expressions. By the end of the course, I felt like I had a decent grasp of the base python language, and could use it to do some simple data wrangling tasks. But I did not yet have full confidence in my ability to handle any real information processing task independently - which is why the next class was so important.
In the spring I took IS590PR: Programming for Analytics and Data Processing. This class was designed as a follow-up to IS452, and it was all about manipulating data structures to get the target information out of various public datasets. This class allowed me to build confidence in my ability to use python for practical data wrangling projects. Every assignment I started out unsure whether I would even be able to complete it, but by the end I had a solution that I was proud of - and by repeating that process over and over again, I gained confidence that even if I did not know immediately how to solve a problem, I could figure it out. Along the way I learned how to use several important python libraries (Pandas, NumPy), how to properly document code using docstrings and test code using doctests, and how to use classes to create object-oriented programs. We also covered several other topics, including XPath, regex, requesting data from an API, graphs and the NetworkX library, and efficiency/optimization using the Cython and Numba packages. The tool that was most heavily used in this class, however, was Pandas - a library I have come to appreciate greatly.
One of the things that I most appreciate about this degree is how project-oriented the classes are. Furthermore, many of my technical classes have very open-ended final projects - allowing me to work within the domain that I am interested in (political science, international relations). There are two major projects that I am especially proud of.
During my first semester, I was able to combine the final projects for both my python and database classes. My goal was to create a relational database version of the various datasets that make up the Correlates of War project. I had previously used some of the CoW datasets for my undergraduate thesis - and had to go to the statistical consulting center to transform the data into the right form so it could be integrated with other datasets. The CoW project is widely used by political scientists studying international conflict, and I wanted to create something that would be useful for them. I knew that the CoW datasets could work so much better together if they existed within one cohesive database - but I had no idea just how difficult it would be to bring them back together. Over time, the datasets had been split apart and maintained by different parties, who made different design decisions. While the datasets could work together because they shared unique IDs for the various entities the project tracked (countries, territories, wars, etc), it was no longer possible to simply merge the datasets and move on.
So on the database side, I had to design and create a database schema that would integrate all of the various datasets. On the python side, I had to transform the publicly available datasets into the format required for my database design. And since there is no better library for wrangling tabular data than Pandas, that meant that I had to start teaching myself Pandas before formally learning it in the Spring semester. It was a much more ambitious undertaking than I had expected - and I continued to work on the project even after turning in the components required by my classes. In fact, I plan to continue working on this project throughout my degree, expanding it to incorporate other datasets that use the CoW identifier system.
If you’re interested in exploring this project more, please refer to its GitHub repo:
International Relations Database GitHub Repo
This was the final project for my spring programming class, IS590PR. This project is significant to me for a few reasons - the amount of work I put into it, the game theory + poli-sci element, the potential for future academic work using this program… but mostly because it is the first time I have written my own classes. Up till now, including for all course assignments, I did everything procedurally. I did not foresee just how difficult it would be to wrap my head around programming with classes. They are just another data structure, yes, but they function in a fundamentally different way, and it took a lot of time and experimentation for me to be able to use classes properly (at a beginner level) and feel comfortable with the data flow.
So what does this project do? Essentially, it runs an iterated prisoner’s dilemma tournament (you will find umpteen-million poli-sci journal articles on this topic) many times with several randomized elements in order to rank different strategies on their game score. The twist that makes this prisoner’s dilemma tournament unique is “reactive noise”: in addition to having a starting noise environmental variable, the noise level is incremented up for defecting plays and down for cooperative plays.
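To make the “reactive noise” idea concrete, here is a minimal sketch of a single match. It is not the code from my repo: the function and strategy names, the payoff values (the common 5/3/1/0 matrix), the step size, and the reading of noise as “the probability that a chosen move gets flipped” are all assumptions made for illustration.

```python
import random

# Assumed payoffs: (score for A, score for B) for each pair of moves.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def always_cooperate(opponent_history):
    return "C"


def always_defect(opponent_history):
    return "D"


def tit_for_tat(opponent_history):
    # Cooperate first, then copy whatever the opponent actually played last.
    return opponent_history[-1] if opponent_history else "C"


def flip(move):
    return "D" if move == "C" else "C"


def play_match(strategy_a, strategy_b, rounds, start_noise, step, rng):
    """Play one iterated match; return (score_a, score_b, final_noise)."""
    noise = start_noise
    seen_by_a, seen_by_b = [], []  # opponent moves as actually played
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(seen_by_a)
        move_b = strategy_b(seen_by_b)
        # Noise: each intended move is flipped with probability `noise`.
        if rng.random() < noise:
            move_a = flip(move_a)
        if rng.random() < noise:
            move_b = flip(move_b)
        points_a, points_b = PAYOFFS[(move_a, move_b)]
        score_a += points_a
        score_b += points_b
        seen_by_a.append(move_b)
        seen_by_b.append(move_a)
        # Reactive part: each defection raises the noise, each cooperation lowers it.
        for move in (move_a, move_b):
            noise += step if move == "D" else -step
        noise = min(1.0, max(0.0, noise))  # keep it a valid probability
    return score_a, score_b, noise


calm = play_match(always_cooperate, always_cooperate, 10, 0.0, 0.01, random.Random(0))
tense = play_match(tit_for_tat, always_defect, 50, 0.05, 0.01, random.Random(0))
```

A full tournament would repeat matches like this across every pairing of strategies and many random seeds, then rank the strategies by their accumulated scores.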
If you’re interested in trying to run your own simulations using my code, look for the Quick Start Guide in the GitHub repo:
Prisoner’s Dilemma GitHub Repo
So the Correlates of War database project is a good showcase of “organizing information”, and the Prisoner’s Dilemma project is a good showcase of “processing information” - it’s only proper that I should also have a final project to showcase “an information-oriented headspace.” Well, it’s not really a project, or even really a proper academic paper, but I do have something for you.
For my spring philosophy class (IS590TH), our single/final assignment was to write our own (or critique someone else’s) conceptual analysis on a topic relevant to library/information science. I chose to write a conceptual analysis (a.k.a. definition) on unique identifiers. For some reason, identifiers fascinate me - and I had devoted a lot of thought throughout the semester to understanding identifiers on a deeper level.
So what are unique identifiers? Stay tuned to my next blog post to find out! I’ll be posting my paper in its full, rambling glory.
As much fun as learning for the sake of learning is, I’m earning this degree so that I can get a job doing what I enjoy. Taking classes is only half the battle - I also need to gain practical on-the-job experience.
During the fall semester, I had an hourly job and volunteered my time with different research projects in the iSchool. I talked with professors and told them what I wanted to do - work with data in the field of political science. Fortunately, many of those professors knew exactly where I had to go: the Cline Center for Advanced Social Research. The Cline Center is based in UIUC’s Research Park, and works on computational social science research with partners at the University and other organizations focused on providing data for political scientists. The Cline Center was the one place on campus that would provide me the opportunity to combine information science with political science. At the end of the fall semester, I found out that they were hiring a graduate research assistant. I owe it to several amazing professors at the iSchool who advocated for me and connected me with the folks at the Cline Center. The GRA-ship turned out to be a perfect melding of my various skillsets: I would be creating the documentation for the Cline Center’s Global News Index - a massive database of metadata and extracted features for international news articles.
I was hired, and in January I got started writing the documentation that would allow the Cline Center to open up the database to the campus at large. Researching for and writing the documentation used my journalism skills (investigating algorithms and distilling complicated information into a concise and readable form), my political science and journalism domain knowledge, and my newly acquired understanding of information science (knowing what information about the variables and corpora was important to include for researchers from a wide variety of domains). I got to use InDesign to lay out the documentation (and even learned some new tricks!), and learned that technical writing was definitely in my wheelhouse. By the end of the semester, after spending 20 hours every week working on this documentation, I had produced documentation for the Global News Index itself, as well as user guides for the applications used to access the database.
I’m happy to say that next semester I will be continuing to work at the Cline Center as a Graduate Research Assistant, with my duties shifting from creating documentation to more directly assisting with the research conducted by the Cline Center and its partners. This GRA-ship was important to me for many reasons - not only did I gain experience in a new skill (technical writing), but I gained confidence in my ability to operate as a professional. I didn’t feel like a student, just there to help with menial tasks or shadow the real experts; I felt like a part of the team, contributing something valuable on my own merits. What’s more - I enjoyed my work. I even had a couple small opportunities to use my coding skills (which I enjoyed the most!). I’m so excited to see how I can further develop in my next semester.
But before the fall semester arrives, there is summer - and what an exciting summer it will be! This summer I am in Gothenburg, Sweden, as a Data Science Intern for the Program on Governance and Local Development at the University of Gothenburg. I am working directly with GLD’s Data Scientist on a major project - a massive survey on local governance issues in multiple Southeast African countries. My internship started off in the best way possible… attending GLD’s annual conference (at an amazing spa hotel) to get a crash course on accountability research in international development. I got to meet both academics and practitioners from international aid organizations and enjoyed many interesting discussions. As for the internship itself, I’ve been able to use my python skills to process information and put it into the form our Data Scientist needs in order to run her analyses. I’m very glad that both of my programming instructors taught a unit on XPath - most of the information I’ve been processing comes from XML documents! I’m enjoying myself immensely, because every day has a new coding challenge - and it’s all real problems (not class assignments) that need to be solved. I’m so excited to be at an internship that uses the information processing and organization skills I’ve spent the past year acquiring, in the domain of international development. Plus… it’s Sweden! Living and working in a foreign country feels pretty normal at this point, but I am beyond stoked to experience life in a Scandinavian nation (especially considering the cool summers). Here’s to another year of professional development - hopefully, it leads to a job!
]]>