Breadcrumbs

Understanding Multi-Active Satellites (MASH) and the Role of the Sequence Column in Data Vault

Overview

What is a Multi-Active Satellite?

A Multi-Active Satellite (MASH) is a satellite structure in Data Vault where you can have multiple valid records at the same time for the same parent hash key (HK). This is different from standard satellites where typically only one record is active per HK at any given time.

What are the two types of MASH in beVault?

beVault supports two modes of multi-active satellites:

  1. Standard: On each load, all records are inserted, yielding a consistent load_dts across the set. This is the only variant recognized by the Data Vault Alliance. Depending on the number of records and load frequency, this may drastically increase database size.

  2. Delta-driven: A "multi-row" hashdiff per hash key is used to detect changes across the full set of records for the same hash key. When a change is detected, all sub-records for that key are (re)inserted to preserve a single load_dts.

Understanding the Sequence Column

What is a Sequence Column?

The sequence column (also known as Child Dependent Key or CDK in Data Vault terminology) is the attribute that differentiates multiple records belonging to the same parent hash key. In a multi-active satellite, the sequence column is essential to:

  • Distinguish between different child records of the same parent entity

  • Ensure deterministic ordering when calculating hashdiffs

  • Form part of the unique index to prevent duplicates

Example: In mash_customer_contact_info, the sequence column is ContactType (e.g., "Email", "Phone", "Mobile") which differentiates multiple contact methods for the same customer.

Where is the sequence column used?

The sequence column appears in three critical places:

1. In the hashdiff calculation (staging L2)

SQL
BINARY(HASH(LISTAGG(COALESCE((COALESCE(stg1."ContactType", '')
    || ';;' || COALESCE(stg1."ContactValue", '')
    || ';;' || COALESCE(CAST(stg1."PreferenceRank" AS VARCHAR(1960)), '')
    || ';;' || COALESCE(stg1."IsPrimary", '')
    ), ''),';;') OVER (
    PARTITION BY BINARY(HASH(COALESCE(TRIM(stg1."CustomerCode"), '-1'), 0), 16)
    ORDER BY stg1."ContactType" ASC  -- โ† Sequence column ensures deterministic order
    RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), 0), 16) AS "B_hdiff"

2. In the unique index

SQL
CREATE UNIQUE INDEX "uidx_mash_customer_contact_info" 
ON "dv"."mash_customer_contact_info"("hk","load_dts","ContactType");
-- ContactType (sequence column) is part of the unique constraint

3. In the MASH table structure

SQL
CREATE TABLE IF NOT EXISTS "dv"."mash_customer_contact_info"(
    "hk" BINARY(16) NOT NULL
    ,"load_dts" TIMESTAMP NOT NULL
    ,"src" VARCHAR(255) NOT NULL
    ,"hdiff" BINARY(16) NOT NULL
    ,"ContactType" VARCHAR(255) NULL  -- โ† Sequence column
    ,"ContactValue" VARCHAR(255) NULL
    ,"PreferenceRank" SMALLINT NULL
    ,"IsPrimary" VARCHAR(255) NULL
    -- ...
)

Common Questions

Why is the sequence column NOT in the WHERE clause?

This is a key characteristic of delta-driven MASH and is intentional. The WHERE clause compares only the HK and the aggregate hashdiff:

SQL
WHERE 
    masat."hk" IS NULL                    -- New HK
    OR stg."B_hdiff" <> masat."hdiff"     -- Change detected on the entire set

Why this approach?

1. Set-based comparison (not row-by-row)

Delta-driven MASH compares the entire set of records for a given HK, not individual rows. The hashdiff represents the "fingerprint" of all child records together.

2. Deletion detection

If a sequence column value is deleted from the source (e.g., a customer removes their mobile number), the aggregate hashdiff changes, triggering a complete reinsertion of all remaining records. This materializes the deletion in the historical record.

Example:

  • Previous load: Customer had Email, Phone, Mobile (3 records)

  • Current load: Customer has only Email, Phone (2 records)

  • Result: The hashdiff changes because Mobile is missing, so both Email and Phone are reinserted with a new load_dts

3. Temporal consistency

All records belonging to the same HK that change together share the same load_dts. This ensures a temporally consistent view of the complete state at any point in time.

If we compared row-by-row using the sequence column, some child records might get different load_dts values within the same logical set, breaking temporal consistency.

Example:

-- Load 1: Initial state
HK=Customer123, load_dts=2026-01-01, ContactType='Email', ContactValue='email@example.com'
HK=Customer123, load_dts=2026-01-01, ContactType='Phone', ContactValue='555-1234'

-- Load 2: Email changes
HK=Customer123, load_dts=2026-01-05, ContactType='Email', ContactValue='newemail@example.com'
HK=Customer123, load_dts=2026-01-05, ContactType='Phone', ContactValue='555-1234'
-- โ†‘ Both records reinserted with same load_dts for consistency

Why does the sequence column need to guarantee deterministic ordering?

Without deterministic ordering, the same set of records could produce different hashdiffs depending on the read order, causing false positive change detections.

Example without ordering:

  • Load 1: Hash(['Email;;email@...', 'Phone;;555...']) = ABC123

  • Load 2: Hash(['Phone;;555...', 'Email;;email@...']) = DEF456 โ† Different hash for same data!

With sequence column ordering (ORDER BY ContactType ASC):

  • Load 1: Hash(['Email;;email@...', 'Phone;;555...']) = ABC123

  • Load 2: Hash(['Email;;email@...', 'Phone;;555...']) = ABC123 โ† Same hash!

What is the stg.load_process table?

The stg.load_process table is a traceability table that records:

  • process_start_date: When the load process started

  • process_name: The name of the loading process

SQL
INSERT INTO "stg"."load_process"("process_start_date","process_name")
    VALUES (TO_UTC_TIMESTAMP(CURRENT TIMESTAMP, CURRENT TIMEZONE),'nw_crm_customers_contact');

Purpose

This table is mandated by the VTCP (Virtual Temporal Consistency Point) standard from the Data Vault Alliance for standard multi-active satellites. It ensures that a set of inserted records belongs to the same logical load cycle, which is crucial for maintaining temporal consistency.

For implementation simplicity, beVault uses stg.load_process for both standard and delta-driven MASH, even though it's technically required only for the standard mode.

Benefits

  1. Audit trail: Track all load executions

  2. Performance monitoring: Analyze load durations and patterns

  3. Temporal consistency: Comply with Data Vault Alliance standards

  4. Debugging: Identify incomplete or failed load processes

Complete Example Walkthrough

Scenario: Customer Contact Information

We want to track multiple contact methods (Email, Phone, Mobile) for each customer using a delta-driven MASH.

Step 1: Create the MASH table

SQL
CREATE TABLE IF NOT EXISTS "dv"."mash_customer_contact_info"(
    "hk" BINARY(16) NOT NULL
    ,"load_dts" TIMESTAMP NOT NULL
    ,"src" VARCHAR(255) NOT NULL
    ,"hdiff" BINARY(16) NOT NULL
    ,"ContactType" VARCHAR(255) NULL        -- Sequence column
    ,"ContactValue" VARCHAR(255) NULL
    ,"PreferenceRank" SMALLINT NULL
    ,"IsPrimary" VARCHAR(255) NULL
    ,CONSTRAINT "fk_mash_customer_contact_info"
        FOREIGN KEY("hk")
            REFERENCES "dv"."h_customer"("hk")
)
    ORGANIZE BY ROW
    COMPRESS YES ADAPTIVE
;

CREATE UNIQUE INDEX "uidx_mash_customer_contact_info" 
ON "dv"."mash_customer_contact_info"("hk","load_dts","ContactType");

Step 2: Create staging L2 with hashdiff calculation

SQL
CREATE OR REPLACE VIEW "stg"."v_nw_crm_customers_contact_l2"(
    "src", "load_dts", "A_bk", "A_hk"
    ,"ContactType", "ContactValue", "PreferenceRank", "IsPrimary"
    ,"B_hdiff"
)
AS
    SELECT
          'nw_crm' AS "src"
        , TO_UTC_TIMESTAMP(CURRENT TIMESTAMP, CURRENT TIMEZONE) AS "load_dts"
        , COALESCE(TRIM(stg1."CustomerCode"), '-1') AS "A_bk"
        , BINARY(HASH(COALESCE(TRIM(stg1."CustomerCode"), '-1'), 0), 16) AS "A_hk"
        , stg1."ContactType" AS "ContactType"
        , stg1."ContactValue" AS "ContactValue"
        , stg1."PreferenceRank" AS "PreferenceRank"
        , stg1."IsPrimary" AS "IsPrimary"
        -- Multi-row hashdiff calculation
        , BINARY(HASH(LISTAGG(COALESCE((COALESCE(stg1."ContactType", '')
            || ';;' || COALESCE(stg1."ContactValue", '')
            || ';;' || COALESCE(CAST(stg1."PreferenceRank" AS VARCHAR(1960)), '')
            || ';;' || COALESCE(stg1."IsPrimary", '')
            ), ''),';;') OVER (
            PARTITION BY BINARY(HASH(COALESCE(TRIM(stg1."CustomerCode"), '-1'), 0), 16)
            ORDER BY stg1."ContactType" ASC  -- Sequence column ordering!
            RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), 0), 16) AS "B_hdiff"
    FROM "stg"."nw_crm_customers_contact" AS stg1
;

Step 3: Create the delta detection view

SQL
CREATE OR REPLACE VIEW "dv"."v_mash_nw_crm_customers_contact_B"(
    "hk", "load_dts", "src"
    ,"ContactType", "ContactValue", "PreferenceRank", "IsPrimary"
    ,"hdiff"
)
AS
    SELECT DISTINCT
          stg."A_hk" AS "hk"
        , TO_UTC_TIMESTAMP(CURRENT TIMESTAMP, CURRENT TIMEZONE) AS "load_dts"
        , CAST((COALESCE(stg."src", '') || '-' || 'nw_crm_customers_contact') AS VARCHAR(255)) AS "src"
        , stg."ContactType" AS "ContactType"
        , stg."ContactValue" AS "ContactValue"
        , stg."PreferenceRank" AS "PreferenceRank"
        , stg."IsPrimary" AS "IsPrimary"
        , stg."B_hdiff" AS "hdiff"
    FROM "stg"."nw_crm_customers_contact_l2" AS stg
        LEFT OUTER JOIN "dv"."mash_customer_contact_info" AS masat
            ON stg."A_hk" = masat."hk" AND
                (
                    SELECT MAX(z."load_dts") AS "load_dts"
                    FROM "dv"."mash_customer_contact_info" AS z
                    WHERE masat."hk" = z."hk"
                ) = masat."load_dts"
    WHERE 
        masat."hk" IS NULL                    -- New customer
        OR stg."B_hdiff" <> masat."hdiff"     -- Change detected (addition, modification, or deletion)
;

Step 4: Load process

SQL
-- Record load start
INSERT INTO "stg"."load_process"("process_start_date","process_name")
    VALUES (TO_UTC_TIMESTAMP(CURRENT TIMESTAMP, CURRENT TIMEZONE),'nw_crm_customers_contact');

-- Populate staging L2
INSERT INTO "stg"."nw_crm_customers_contact_l2" (...)
SELECT * FROM "stg"."v_nw_crm_customers_contact_l2" AS v;

-- Insert changed records into MASH
INSERT INTO "dv"."mash_customer_contact_info" (...)
SELECT * FROM "dv"."v_mash_nw_crm_customers_contact_B" AS v;

Summary

Aspect

Delta-Driven MASH

Sequence column in hashdiff calculation

โœ… Yes - ensures deterministic ordering

Sequence column in unique index

โœ… Yes - prevents duplicates

Sequence column in WHERE clause

โŒ No - compares aggregate hashdiff only

Comparison method

Set-based (all records for an HK)

Detects deletions

โœ… Yes - via hashdiff change

Temporal consistency

โœ… Yes - same load_dts for all records in set

load_process table

Used (inherited from VTCP standard)

Key Takeaway: In delta-driven MASH, the sequence column (also called Child Dependent Key/CDK) is essential for ordering and identification, but change detection operates at the set level (via aggregate hashdiff) rather than the row level. This enables deletion detection and ensures temporal consistency across all child records.