In 2011 at the Oracle OpenWorld conference, I attended a presentation of Kent Graziano about Data Vault Modeling. It was the first time I heard of this special modeling method for Data Warehouses, and after this session I was a bit confused about these data models with hubs, links and satellites. Is it really something new, or is it more or less the same technique we are using for master data versioning using head and version tables?
The idea of Data Vault Modeling seems to be interesting, but a lot of details were not clear for me. I ordered the book The Business of Data Vault Modeling of Dan Linstedt, but this book was more confusing than helpful for me. After some discussions with several colleagues at Trivadis, my opinion was clear: Yes, it is more or less the same as our approach described in our DWH book. Problem solved, let’s go back to work…
Last week at the Trivadis TechEvent, there was another session about Data Vault Modeling, presented by Hans Hultgren. The presenation was just an introduction into this modeling technique, and most of the information were the same things I already heard two years ago in San Francisco. But after his presentation we had an interesting discussion about the similaritites and differences between Data Vault Modeling and our method of head and version tables. Although Hans has never heard of our approach before, he found it at least “interesting” – whatever that means. After a beer, I decided to do some homework and to design my first draft of a Data Vault Model – based on an example in our book. Here it is.
Master Data Versioning with Head and Version Tables
As described in our Trivadis BI Blueprints and in the book “Data Warehousing mit Oracle“, we use the following versioning method to store historical master data:
- For each master data entity, a head table is created. It contains a surrogate key, the business key (which is often the primary key of the source system entity) as well as all “static” business attributes that never change during the lifecycle of a record.
- All “dynamic” business attributes that may change over the time are stored in a version table. The version table contains a foreign key to the master table, a validation range (effective and expriation date) and all attributes that can be changed.
- Relationships between entities are designed with foreign key relationships. A foreign key always refers the head tables to decouple the versions of two entities. Depending on whether the relationship can change over time or not, the foreign key column is either stored in the head or in the version table. In the example below, a product can be moved to another product subcategory during its lifecycle, but a subcategory belongs always to the same product category.
Master Data Versioning with Data Vault Model
Let’s try to make a first draft for a Data Vault Model based on the example above. The head tables can be compared to Hubs, the version tables to Satellites in a Data Vault Model. The relationships are always designed as Links, i.e.. many-to-many relationshiops between hubs.
- For each master data entity, a Hub is created. It contains a surrogate key and the business key (which is often the primary key of the source system entity), but no additional business attributes. The hubs in the example below are blue colored.
- All business attributes are stored in a Satellite. There is no distinction between “static” and “dynamic” attributes. It is possible to have more than one Satellite for the same Hub, but in the example below, all business attributes are stored in one Satellite. The effective date is stored as a business attribute, too, but not the expiration date. The Satellite contains a foreign key to the corresponding Hub and is colored red in the example below.
- A Link is used to store the relationships between the entities, in our case between products and subcategories as well as between subcategories and categories. Because a Link allows to store many-to-many relationships, the are no business constraints like “a product belongs to exactly one subcategory” visible in the data model. This increases the flexibility of the data model: if the business rule changes (e.g. a products can belong to more than one subcategories at the same time), no model change is required. Links are colored green in our example.
Similarities and Differences
Both modeling concepts – head and version tables as well as Data Vault Modeling – are used to design the Core model, i.e. the central data store used for data integration and data versioning. They are definitely not designed for end user queries. For this, data marts using Star Schemas or multidimensional OLAP cubes are typically used.
What is common in both methods is a unique key for each business entity that is independent from a particular version (this is one of the main differences to the Slowly Changing Dimensions type 2 where a particular version is referred). In our approach, we use a head table for this, in a Data Vault Model, this is implemented with a Hub.
Version tables and Satellites are not exactly the same. The version tables contain only the business attributes that can be changed during the time. Whenever at least one of the attributes is changed in the source system, a new version is created in the Data Warehouse, and the previous version is marked as outdated by updating the expiration date. This is different in a Satellite. Here, a new version is create for every change, too. But there is no expiration date, only a load date an propably an effective date as a business attribute.
Another difference is that the static attributes, i.e. business attributes that never change in the source system, are stored in the Satellite as well. In Data Vault Modeling, there is no distinction between static or dynamic attributes – or even relationships. A business attribute is a business attribute – that’s it.
Relationships in a Data Vault Model are always many-to-many relationships and stored in Links. This increases the complexity of the data model, but allows more flexibility because new business rules can be implemented without changing the data model. This is one of the main reasons for Data Vault Modeling: In agile environment with many change requests is can be very useful to have a flexible data model that can be enhanced easily. This is also the case for new attributes that are just included with additional Satellites.
What I never read or heard about yet is the complexity of the ETL processes. From my current point of view it makes no difference whether to load source data into head and version tables or into a Data Vault. A kind of delta determination to detect changed attributes must be implemented in both models. But even if it would be easier to load data into a Data Vault, it is more complex and expensive to load the Data Marts from a Data Vault because the queries to determine the correct version of each Satellite is not trivial. But these are only assumptions. We will propably see in the future what the benefits and restrictions of Data Vault Modeling will be.