카테고리 없음

[MSTR] Implementing Slowly Changing Dimensions

시처럼 음악처럼 2014. 4. 1. 16:40

Implementing Slowly Changing Dimensions

One of the characteristics of the Data Warehouse is that it stores more historical data than the transactional systems.  While this is traditionally in the form of years and years of old data, it can also store modifications over time.  If a property is changed in the source, more often than not it's in the form of an update.  But if you're trying to perform analyses on this data and explain trends and outliers, missing the data as it existed at the time it was measured can make that task more difficult or impossible.  This architecture is known as Versioning (storing each version), Slowly Changing Dimensions (properties changing incrementally over time instead of on set recurring schedules) or Type 2 (the most common method that involves storing all history).

MicroStrategy doesn't natively support these types of architectures, but there are some various methods to implement them in your schema.  I've tried a few, and here is my recommended approach.

MicroStrategy has a Tech Note with a few options for handling this type of schema.  They basically suggest creating Logical Views to break apart the dimensions into Current and Historical, and then creating separate attributes for each.  I'm not really a fan of this approach because in a large warehouse, the performance is going to suffer with all of the lookups "between dates", not to mention a lot of hand coded SQL which complicates maintenance and makes platform migrations more difficult.

Their other approach is to use a database surrogate key (arbitrary value assigned during ETL), but then use a clunky Current Flag to differentiate the records which is difficult for business users to have to remember to use.

One more consideration before we dive in is that MicroStrategy does not support Theta Joins (and from the wording of that tech note, has no plans to).  These are joins where the evaluation condition is something other than equals (=).  Since a traditional implementation may require a "join between", this eliminates a lot of options.

My approach
I like a hybrid of the two options MicroStrategy provided.  First, the surrogate keys are the foundation for this schema.  The important feature is that the versions are controlled by these different surrogate keys and MicroStrategy doesn't have to worry about any dates.  Not only does this greatly reduce the schema complexity (including fewer objects to juggle for report writers), but its also much faster since it can do single integer joins instead of involving date "join betweens".

Next, instead of using a Current Flag to identify the latest records, I like to do this in a database view that includes only the latest version of the record.  The advantage here is that we don't have any of the hand coded SQL in MicroStrategy and instead can work with traditional tables.

Here's what it looks like:

Say that you have a Customer dimension that includes fields like Marital StatusAddress and Income.  Certainly these properties can change over time and will greatly affect the insight you'll derive by analyzing their data.

Your dimCustomer table will look like this:
Customer_Key (Surrogate Key)
Customer_ID (Transactional System ID)
Name
Address
Marital_Status
Income
(Sure you may have Name and Address broken out into more fields, but just for simplicity lets keep it basic.)

It's important to keep the original transactional system ID around, because that's the language that users will understand.  They more than likely see this ID in the source system's front end and this is how they'll communicate with you in the event you're working with a specific customer record.  As data changes (customer gets married, gets a raise, moves houses), the Customer_Key will change each time but the Customer_ID will always remain the same.

Also notice that there are no effective dates here (although ETL will still probably store them for their purposes, logically to MicroStrategy, they aren't here and aren't needed).  The key itself has the logic for effective dates embedded, and I'll discuss that in more detail when we get to the Fact Table.

Your dimCustomerCurrent (view) will look like this:
Customer_ID (Transactional System ID)
Name
Address
Marital_Status
Income
Whereas the dimCustomer table will contain records for every change, the dimCustomerCurrent view will only contain the most recent version.  It also doesn't contain the Customer_Key field since it doesn't matter at this level.  This table will match what the transactional system has at this moment.

Your FactTable will look like this:
Date 
Customer_Key
Product_Key
Units_Sold
Price_Paid
When ETL loads the fact records, it will grab the appropriate Customer_Key by handling the Date "join between" in the back end.  The result from the MicroStrategy perspective is a single integer Key that links back to the correct properties for that Customer at the time of their purchase. 

Defining the MicroStrategy Schema

Create two Attributes:

Customer Key
(ID) Customer_Key - FactTable, dimCustomer

Customer
(ID) Customer_ID - dimCustomer, dimCustomerCurrent
(DESC) Name - dimCustomerCurrent
(Parent of Customer Key)

Stash Customer_Key away in a hidden folder (never to be heard from again) and use the Customer attribute freely among your system. 

What we've done is laid the current layer on top of the historical layer.  Even though we've jumped through a lot of hoops up until now, I've still found that the majority of reporting still only requires current reporting.  You don't want to complicate development or usage, so this model makes it all transparent.  You can use theCustomer attribute in filters, prompts and reports like you normally would. 

The report SQL will come out like this:

select a13.Customer_ID, a13.Name, SUM(Price_Pad*Units_Sold)
from FactTable a11
join dimCustomer a12
   on a11.Customer_Key = a12.Customer_Key
join dimCustomerCurrent a13
   on a12.Customer_ID = a13.Customer_ID
group by a13.Customer_ID, a13.Name

The query joins through the dimCustomer table and aggregates all of the versions into the current version. If you're only interested in who your most profitable customers are, this is the query you'd want to use and yet you haven't over complicated anything.

For the other attributes that are parents of Customer, you can make them a parent of the Customer attribute and they'll use the current definition, or you can make them a parent of Customer_Key and they'll be a historical definition.  This gives you the ability to flip between "As-Is, As-Was".

Marital Status
(ID) Marital_Status - dimCustomerCurrent
(Parent of Customer)

Marital Status (Historical)
(ID) Marital_Status - dimCustomer
(Parent of Customer Key)

If you want to do analysis based on your Customer's Marital Status at the time of purchase, you would use theMarital Status (Historical) attribute and the SQL for the report will come out like this:
select a12.Marital_Status, a13.Customer_ID, a13.Name, SUM(Price_Paid*Units_Sold)
from FactTable a11
join dimCustomer a12
   on a11.Customer_Key = a12.Customer_Key
join dimCustomerCurrent a13
   on a12.Customer_ID = a13.Customer_ID
group by a12.Marital_Status, a13.Customer_ID, a13.Name
Since we used the Marital Status (Historical) attribute, the field is sourced from the dimCustomer table which is the historical version.  If this customer has data over a long period of time and changed Marital Status over that time, we'll see data grouped by their various status changes.

Alternatively, if we want to build a list of our best, currently married customers, we'd use the current Marital Status attribute:

select a13.Marital_Status, a13.Customer_ID, a13.Name, SUM(Price_Paid*Units_Sold)
from FactTable a11
join dimCustomer a12
   on a11.Customer_Key = a12.Customer_Key
join dimCustomerCurrent a13
   on a12.Customer_ID = a13.Customer_ID
where a13.Marital_Status = 'M'
group by a13.Marital_Status, a13.Customer_ID, a13.Name

To the developer or end user, everything is very transparent.  They don't have to worry about dates, filters or flags.  They only have to choose between the normal (current) version of an attribute or the historical one, which is a very intuitive and self explanatory choice.

Aggregates
Aggregate tables also have this added flexibility since you're using single integer Keys/IDs.  You have the choice of aggregating to the Key values if you want to retain your historical relationships, or aggregating to the IDs if you want to save space.  Once again, it's completely transparent to the schema since the Current structure joins through the Historical layer.  (Of course, if you store your aggregates with IDs, you won't be able to use your Historical attributes with them).

Multisource
If you're extending your project with Multisource, chances are the other platforms are using ID and not your surrogate keys.  With this schema, you can simply extend your Customer attribute (based on ID) to the other platforms without any extra work.


참고 : Slowly Changing Dimensions - MicroStrategy