Overview
What is a Multi-Active Satellite?
A Multi-Active Satellite (MASH) is a satellite structure in Data Vault where you can have multiple valid records at the same time for the same parent hash key (HK). This is different from standard satellites where typically only one record is active per HK at any given time.
What are the two types of MASH in beVault?
beVault supports two modes of multi-active satellites:
-
Standard: On each load, all records are inserted, yielding a consistent load_dts across the set. This is the only variant recognized by the Data Vault Alliance. Depending on the number of records and load frequency, this may drastically increase database size.
-
Delta-driven: A "multi-row" hashdiff per hash key is used to detect changes across the full set of records for the same hash key. When a change is detected, all sub-records for that key are (re)inserted to preserve a single load_dts.
Understanding the Sequence Column
What is a Sequence Column?
The sequence column (also known as Child Dependent Key or CDK in Data Vault terminology) is the attribute that differentiates multiple records belonging to the same parent hash key. In a multi-active satellite, the sequence column is essential to:
-
Distinguish between different child records of the same parent entity
-
Ensure deterministic ordering when calculating hashdiffs
-
Form part of the unique index to prevent duplicates
Example: In mash_customer_contact_info, the sequence column is ContactType (e.g., "Email", "Phone", "Mobile") which differentiates multiple contact methods for the same customer.
Where is the sequence column used?
The sequence column appears in three critical places:
1. In the hashdiff calculation (staging L2)
BINARY(HASH(LISTAGG(COALESCE((COALESCE(stg1."ContactType", '')
|| ';;' || COALESCE(stg1."ContactValue", '')
|| ';;' || COALESCE(CAST(stg1."PreferenceRank" AS VARCHAR(1960)), '')
|| ';;' || COALESCE(stg1."IsPrimary", '')
), ''),';;') OVER (
PARTITION BY BINARY(HASH(COALESCE(TRIM(stg1."CustomerCode"), '-1'), 0), 16)
ORDER BY stg1."ContactType" ASC -- โ Sequence column ensures deterministic order
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), 0), 16) AS "B_hdiff"
2. In the unique index
CREATE UNIQUE INDEX "uidx_mash_customer_contact_info"
ON "dv"."mash_customer_contact_info"("hk","load_dts","ContactType");
-- ContactType (sequence column) is part of the unique constraint
3. In the MASH table structure
CREATE TABLE IF NOT EXISTS "dv"."mash_customer_contact_info"(
"hk" BINARY(16) NOT NULL
,"load_dts" TIMESTAMP NOT NULL
,"src" VARCHAR(255) NOT NULL
,"hdiff" BINARY(16) NOT NULL
,"ContactType" VARCHAR(255) NULL -- โ Sequence column
,"ContactValue" VARCHAR(255) NULL
,"PreferenceRank" SMALLINT NULL
,"IsPrimary" VARCHAR(255) NULL
-- ...
)
Common Questions
Why is the sequence column NOT in the WHERE clause?
This is a key characteristic of delta-driven MASH and is intentional. The WHERE clause compares only the HK and the aggregate hashdiff:
WHERE
masat."hk" IS NULL -- New HK
OR stg."B_hdiff" <> masat."hdiff" -- Change detected on the entire set
Why this approach?
1. Set-based comparison (not row-by-row)
Delta-driven MASH compares the entire set of records for a given HK, not individual rows. The hashdiff represents the "fingerprint" of all child records together.
2. Deletion detection
If a sequence column value is deleted from the source (e.g., a customer removes their mobile number), the aggregate hashdiff changes, triggering a complete reinsertion of all remaining records. This materializes the deletion in the historical record.
Example:
-
Previous load: Customer had Email, Phone, Mobile (3 records)
-
Current load: Customer has only Email, Phone (2 records)
-
Result: The hashdiff changes because Mobile is missing, so both Email and Phone are reinserted with a new load_dts
3. Temporal consistency
All records belonging to the same HK that change together share the same load_dts. This ensures a temporally consistent view of the complete state at any point in time.
If we compared row-by-row using the sequence column, some child records might get different load_dts values within the same logical set, breaking temporal consistency.
Example:
-- Load 1: Initial state
HK=Customer123, load_dts=2026-01-01, ContactType='Email', ContactValue='email@example.com'
HK=Customer123, load_dts=2026-01-01, ContactType='Phone', ContactValue='555-1234'
-- Load 2: Email changes
HK=Customer123, load_dts=2026-01-05, ContactType='Email', ContactValue='newemail@example.com'
HK=Customer123, load_dts=2026-01-05, ContactType='Phone', ContactValue='555-1234'
-- โ Both records reinserted with same load_dts for consistency
Why does the sequence column need to guarantee deterministic ordering?
Without deterministic ordering, the same set of records could produce different hashdiffs depending on the read order, causing false positive change detections.
Example without ordering:
-
Load 1: Hash(['Email;;email@...', 'Phone;;555...']) = ABC123
-
Load 2: Hash(['Phone;;555...', 'Email;;email@...']) = DEF456 โ Different hash for same data!
With sequence column ordering (ORDER BY ContactType ASC):
-
Load 1: Hash(['Email;;email@...', 'Phone;;555...']) = ABC123
-
Load 2: Hash(['Email;;email@...', 'Phone;;555...']) = ABC123 โ Same hash!
What is the stg.load_process table?
The stg.load_process table is a traceability table that records:
-
process_start_date: When the load process started -
process_name: The name of the loading process
INSERT INTO "stg"."load_process"("process_start_date","process_name")
VALUES (TO_UTC_TIMESTAMP(CURRENT TIMESTAMP, CURRENT TIMEZONE),'nw_crm_customers_contact');
Purpose
This table is mandated by the VTCP (Virtual Temporal Consistency Point) standard from the Data Vault Alliance for standard multi-active satellites. It ensures that a set of inserted records belongs to the same logical load cycle, which is crucial for maintaining temporal consistency.
For implementation simplicity, beVault uses stg.load_process for both standard and delta-driven MASH, even though it's technically required only for the standard mode.
Benefits
-
Audit trail: Track all load executions
-
Performance monitoring: Analyze load durations and patterns
-
Temporal consistency: Comply with Data Vault Alliance standards
-
Debugging: Identify incomplete or failed load processes
Complete Example Walkthrough
Scenario: Customer Contact Information
We want to track multiple contact methods (Email, Phone, Mobile) for each customer using a delta-driven MASH.
Step 1: Create the MASH table
CREATE TABLE IF NOT EXISTS "dv"."mash_customer_contact_info"(
"hk" BINARY(16) NOT NULL
,"load_dts" TIMESTAMP NOT NULL
,"src" VARCHAR(255) NOT NULL
,"hdiff" BINARY(16) NOT NULL
,"ContactType" VARCHAR(255) NULL -- Sequence column
,"ContactValue" VARCHAR(255) NULL
,"PreferenceRank" SMALLINT NULL
,"IsPrimary" VARCHAR(255) NULL
,CONSTRAINT "fk_mash_customer_contact_info"
FOREIGN KEY("hk")
REFERENCES "dv"."h_customer"("hk")
)
ORGANIZE BY ROW
COMPRESS YES ADAPTIVE
;
CREATE UNIQUE INDEX "uidx_mash_customer_contact_info"
ON "dv"."mash_customer_contact_info"("hk","load_dts","ContactType");
Step 2: Create staging L2 with hashdiff calculation
CREATE OR REPLACE VIEW "stg"."v_nw_crm_customers_contact_l2"(
"src", "load_dts", "A_bk", "A_hk"
,"ContactType", "ContactValue", "PreferenceRank", "IsPrimary"
,"B_hdiff"
)
AS
SELECT
'nw_crm' AS "src"
, TO_UTC_TIMESTAMP(CURRENT TIMESTAMP, CURRENT TIMEZONE) AS "load_dts"
, COALESCE(TRIM(stg1."CustomerCode"), '-1') AS "A_bk"
, BINARY(HASH(COALESCE(TRIM(stg1."CustomerCode"), '-1'), 0), 16) AS "A_hk"
, stg1."ContactType" AS "ContactType"
, stg1."ContactValue" AS "ContactValue"
, stg1."PreferenceRank" AS "PreferenceRank"
, stg1."IsPrimary" AS "IsPrimary"
-- Multi-row hashdiff calculation
, BINARY(HASH(LISTAGG(COALESCE((COALESCE(stg1."ContactType", '')
|| ';;' || COALESCE(stg1."ContactValue", '')
|| ';;' || COALESCE(CAST(stg1."PreferenceRank" AS VARCHAR(1960)), '')
|| ';;' || COALESCE(stg1."IsPrimary", '')
), ''),';;') OVER (
PARTITION BY BINARY(HASH(COALESCE(TRIM(stg1."CustomerCode"), '-1'), 0), 16)
ORDER BY stg1."ContactType" ASC -- Sequence column ordering!
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), 0), 16) AS "B_hdiff"
FROM "stg"."nw_crm_customers_contact" AS stg1
;
Step 3: Create the delta detection view
CREATE OR REPLACE VIEW "dv"."v_mash_nw_crm_customers_contact_B"(
"hk", "load_dts", "src"
,"ContactType", "ContactValue", "PreferenceRank", "IsPrimary"
,"hdiff"
)
AS
SELECT DISTINCT
stg."A_hk" AS "hk"
, TO_UTC_TIMESTAMP(CURRENT TIMESTAMP, CURRENT TIMEZONE) AS "load_dts"
, CAST((COALESCE(stg."src", '') || '-' || 'nw_crm_customers_contact') AS VARCHAR(255)) AS "src"
, stg."ContactType" AS "ContactType"
, stg."ContactValue" AS "ContactValue"
, stg."PreferenceRank" AS "PreferenceRank"
, stg."IsPrimary" AS "IsPrimary"
, stg."B_hdiff" AS "hdiff"
FROM "stg"."nw_crm_customers_contact_l2" AS stg
LEFT OUTER JOIN "dv"."mash_customer_contact_info" AS masat
ON stg."A_hk" = masat."hk" AND
(
SELECT MAX(z."load_dts") AS "load_dts"
FROM "dv"."mash_customer_contact_info" AS z
WHERE masat."hk" = z."hk"
) = masat."load_dts"
WHERE
masat."hk" IS NULL -- New customer
OR stg."B_hdiff" <> masat."hdiff" -- Change detected (addition, modification, or deletion)
;
Step 4: Load process
-- Record load start
INSERT INTO "stg"."load_process"("process_start_date","process_name")
VALUES (TO_UTC_TIMESTAMP(CURRENT TIMESTAMP, CURRENT TIMEZONE),'nw_crm_customers_contact');
-- Populate staging L2
INSERT INTO "stg"."nw_crm_customers_contact_l2" (...)
SELECT * FROM "stg"."v_nw_crm_customers_contact_l2" AS v;
-- Insert changed records into MASH
INSERT INTO "dv"."mash_customer_contact_info" (...)
SELECT * FROM "dv"."v_mash_nw_crm_customers_contact_B" AS v;
Summary
|
Aspect |
Delta-Driven MASH |
|---|---|
|
Sequence column in hashdiff calculation |
โ Yes - ensures deterministic ordering |
|
Sequence column in unique index |
โ Yes - prevents duplicates |
|
Sequence column in WHERE clause |
โ No - compares aggregate hashdiff only |
|
Comparison method |
Set-based (all records for an HK) |
|
Detects deletions |
โ Yes - via hashdiff change |
|
Temporal consistency |
โ Yes - same load_dts for all records in set |
|
load_process table |
Used (inherited from VTCP standard) |
Key Takeaway: In delta-driven MASH, the sequence column (also called Child Dependent Key/CDK) is essential for ordering and identification, but change detection operates at the set level (via aggregate hashdiff) rather than the row level. This enables deletion detection and ensures temporal consistency across all child records.