In April, Hans Hultgren held a presentation about Data Vault Modeling on the Trivadis TechEvent (see my previous blog post “Data Vault Modeling – My first attempt to walk“). Now, we had the opportunity to attend a “Data Vault Challenge Workshop” at Trivadis to learn more about Data Vault Modeling.
Six DWH key players at Trivadis spent a one day workshop with Hans Hultgren and discussed with him several aspects of this modeling technique and its implementation. To prepare the workshop, we were able to send a list of questions to Hans. We were interested in general modeling tips for Data Vault and especially in ETL patterns to load data from source systems to a Data Vault and from there to dimensional data marts. Because all of us work as consultants in data warehouse projects for different customers, we wanted to know a lot of details, based on our experience of “real life” DWH projects. Hans prepared several examples and explanations to answer our questions. The presentation of his answers inspired us to more questions and many discussions. It was a very interesting, but also challenging day for all of us – including Hans 🙂
Here my personal summary of the Data Vault Challenge Workshop and the most important findings from my point of view:
Modeling Tips for Data Vault
- The timestamp usually referred as part of the primary key in the satellite tables is not a business date, but only the technical load date when the row was loaded into the Data Vault. Every hub, satellite and link table contains a load timestamp, and in the satellites it is used as part of the primary key. Business related validity dates (e.g. effective and expiration date) are stored as additional attributes in the satellite. Instead of a load timestamp is would also be possible to use a technical audit id that identifies an individual execution of a load job. We had an extensive discussion about this topic, but did not find a final answer. Anyway, from my point of view, this is a feasible approach.
- It is common to have hubs with multiple satellites. Typical cases for an additional satellite are the rate of change, the type of data, a new source system or security considerations. There are no fix rules when a new satellite has to be created. It is even possible to load data from different sources into the same satellite table, but separate satellites allow more flexibility for the ETL jobs.
- All relationships between hubs must be implemented with link tables, even if the relationship is a 1-to-1 or a 1-to-many relationship. Because a link table allows to store many-to-many relationships, the Data Vault model has no restrictions on the cardinality of a relationship. If such restrictions exist, they must be implemented with ETL rules. A link can refer two or more hubs, and for additional attributes, e.g. validity information of a relationship, satellites can also be added to link tables.
- There is no differentiation between master data (dimensions) and transactional data (facts) in a Data Vault model. Even for transactional data, a hub with one or more satellites is provided. From a modeling point of view, this is useful and flexible, but it can lead to performance issues for incremental loads of data marts (see below).
Loading and Extending Data Vault Models
- The ETL patterns to load data from source systems into a Data Vault are clear and easy to implement. One major benefit is that it is possible to load all tables of the same type in parallel: 1. Parallel load of all stage tables. 2. Parallel load of all hub tables. 3. Parallel load of all satellite tables of the hubs. 4. Parallel load of all link tables. 5. Parallel load of all satellite tables of the links. Here again, there is no difference between master data and transactional data. Because the ETL patterns are common for each type of table, it should be possible to generate the load jobs with little effort.
- A Data Vault model is very flexible for structure changes. There are several ways to extend a Data Vault model with additional attributes for an existing hub. Which approach is the best depends on the specific situation of the project. 1. Additional attributes can be added to an already existing satellite. 2. A new satellite is added to the same hub. In this case, no migration of existing satellites is required. 3. A new satellite that contains the already existing and the new attributes is introduced. Existing data is migrated to the new satellite, and after the migration, the previous satellite is removed.
Loading Data Marts from a Data Vault
- As assumed before, the extraction of data from a Data Vault to load the correct versions to the Data Marts is not trivial. Especially for hubs with multiple satellites that contain different validities for each version, the logic to build the correct validity ranges is quite complex, but possible. A typical approach is to create “point-in-time tables” to build all intermediate versions based on all satellites of a particular hub. It is recommended to store this information in physical tables (a kind of stage tables) to improve the load performance of the data marts.
- A real challenge is the performance for Incremental loads of fact tables in a data mart. The reason is that the relevant information (e.g. the transaction date) is stored in a satellite table, not in the hub. Therefore, it is not easy to find an efficient way to implement the Data Vault tables to allow incremental extraction of transactional data. Probably, it is possible to use table partitioning on the hub and satellite tables based on the load timestamp (or better the audit id) and then use this information for the incremental load of the fact tables. I think it’s hard to find a common pattern that fits for all projects, but for large data warehouses, an adequate solution must be implemented that allows to load the fact tables efficiently.
- A Data Vault model is definitely not appropriate for user queries. Therefore it is not an option to implement a “virtual” data mart as a set of views to simulate dimensions and fact tables. To allow good performance for the BI applications, the data must be loaded into an additional persistent data mart layer, implemented as a star schema or an OLAP cube.
After the workshop with Hans Hultgren, most of my questions and concerns about Data Vault Modeling are cleared now. I think that this modeling method can be very useful for enterprise data warehouses that are mainly source-driven. In an agile project environment with regular change requests and enhancements of the data requirements, Data Vault Modeling is a flexible way to implement the Core data model of a data warehouse. Of course, it will not fit for all DWH projects, but depending on the requirements, a Data Vault model can be the right approach.
By the way: In November I will hold a presentation at the DOAG conference in Nürnberg about different modeling approaches for a Core DWH. Data Vault Modeling will be one of the methods I will present.