MARS Technical Architecture

Deep dive into the forensic recovery pipeline

System Overview

The MARS architecture follows a two-phase approach: first building a reference knowledge base (exemplars), then using that knowledge to classify and recover data from damaged or carved databases (candidates).

High-Level Architecture

┌─────────────────────────────────────────────────────────┐
│                   EXEMPLAR PHASE                        │
│  (Known-Good macOS System → Reference Knowledge Base)   │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
         ┌──────────────────────────────────┐
         │   Scan Live System / Image       │
         │   Extract Database Schemas       │
         └──────────────────────────────────┘
                           │
                           ▼
         ┌──────────────────────────────────┐
         │   Generate Rubrics (JSON)        │
         │   Create Hash Lookup Table       │
         └──────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                   CANDIDATE PHASE                       │
│     (Carved Files → Classified & Recovered Data)        │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
         ┌──────────────────────────────────┐
         │   File Categorization            │
         │   (raw_scanner/file_categorizer) │
         └──────────────────────────────────┘
                           │
                           ▼
         ┌──────────────────────────────────┐
         │   Database Variant Selection     │
         │   (O/C/R/D/X variant testing)    │
         └──────────────────────────────────┘
                           │
                ┌──────────┴──────────┐
                │                     │
                ▼                     ▼
    ┌──────────────────┐   ┌──────────────────┐
    │  Variant X       │   │  Valid Variants  │
    │  (Empty/Failed)  │   │  (O/C/R/D)       │
    └──────────────────┘   └──────────────────┘
                │                     │
                ▼                     ▼
    ┌──────────────────┐   ┌──────────────────┐
    │  Byte Carving    │   │  LF Processor    │
    │  (Extract raw)   │   │  (4 modes)       │
    └──────────────────┘   └──────────────────┘
                │                     │
                └──────────┬──────────┘
                           ▼
         ┌──────────────────────────────────┐
         │   Final Output Organization      │
         │   (catalog/, metamatches/, etc.) │
         └──────────────────────────────────┘

Data Flow: Exemplar → Candidate

The system operates in two phases:

Phase 1: Exemplar Scanning

Scans a known-good macOS system to extract database schemas and create matching rubrics. This builds the "source of truth" for identifying carved databases.

  • Input: Live macOS system, disk image, or directory
  • Output: Database schemas, JSON rubrics, hash lookup table
  • Key Artifact: exemplar_hash_lookup.json enables O(1) matching
Phase 2: Candidate Processing

Processes carved/recovered files and matches them against exemplar schemas. Recovers data from damaged databases using multiple strategies.

  • Input: Carved files from PhotoRec, Scalpel, etc.
  • Output: Classified databases, reconstructed tables, carved data

Key Terminology

Term           Definition
Exemplar       Reference database from known-good system; provides schema baseline
Candidate      Carved/recovered database being analyzed against exemplars
Rubric         JSON schema definition with table structures, column roles, and statistics
Variant        Recovery method: O=Original, C=Clone, R=Recover, D=Dissect, X=Failed
Lost & Found   SQLite's recovery mechanism for orphaned database pages
Catalog        Exact schema matches - recovered databases that perfectly match exemplar schemas
Metamatch      Non-catalog matches with intact data; identical schemas are combined and deduplicated
Found Data     ORPHAN: Unmatched L&F fragments preserved for manual review
Design Philosophy
MARS prioritizes traceability. Every output filename includes the original source filename, and reconstructed rows carry data-source provenance, ensuring all recovered data can be traced back to its origins.

Exemplar Scanning

Exemplar scanning builds the reference knowledge base that powers the entire recovery pipeline. By analyzing known-good databases from a live macOS system, MARS creates detailed schema fingerprints that enable accurate classification and reconstruction of carved data.

What Gets Extracted

The exemplar scanner performs comprehensive analysis of each database:

  • Schema Structure: Full table and column definitions
  • Column Roles: Semantic classification (timestamp, UUID, URL, email, etc.)
  • String Statistics: Average string lengths for disambiguation
  • Example Values: Sample data for pattern matching
  • Row Counts: Expected data volumes per table

Rubric Generation Process

1. File Discovery
   Scan source using dfVFS, matching patterns from artifact_recovery_catalog.yaml.

2. Schema Extraction
   For each SQLite database, extract complete schema using sqlite_master queries. Identify primary keys, foreign keys, and indexes.

3. Semantic Analysis
   Analyze column data to assign semantic roles (timestamp, UUID, URL, email, path). Extract example values and calculate string statistics.

4. Rubric Generation
   Combine schema + semantics into JSON rubric format. Save to databases/schemas/{db_name}.rubric.json.

5. Hash Lookup Creation
   Compute MD5 hash of schema signature (tables + columns). Store in exemplar_hash_lookup.json for O(1) matching.

Hash-Based Fast Matching

The hash lookup table enables rapid O(1) exemplar identification without loading full rubrics or iterating through databases:

Schema Hash Format
MD5(table1|col1,col2,col3|table2|col4,col5)

Each table is represented by name + sorted column names, separated by pipes. This creates a unique fingerprint for each schema variant.
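
A minimal sketch of computing this signature with Python's standard library (the exact canonicalization MARS applies beyond the stated table|column format is an assumption here):

import hashlib
import sqlite3

def schema_hash(db_path: str) -> str:
    """Build the pipe-delimited table/column signature and hash it (sketch)."""
    conn = sqlite3.connect(db_path)
    try:
        tables = [r[0] for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
        parts = []
        for table in tables:
            cols = sorted(r[1] for r in conn.execute(f'PRAGMA table_info("{table}")'))
            parts.append(table)
            parts.append(",".join(cols))
        signature = "|".join(parts)          # table1|col1,col2,col3|table2|col4,col5
        return hashlib.md5(signature.encode("utf-8")).hexdigest()
    finally:
        conn.close()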

Lookup Table Structure
{
  "a1b2c3d4": "Safari_History",
  "e5f6a7b8": "Chrome_Cookies",
  "c9d0e1f2": "Messages_chat"
}

Schema hashes include both table names AND column names, making false positives virtually impossible. Only databases with identical structures will match.

Rubric System

Rubrics are the intelligence layer that enables MARS to match fragmented data back to its original table structure.

Rubric Anatomy

A complete rubric contains three layers of information:

1. Structural Layer
  • Table names and column definitions
  • Data types (INTEGER, TEXT, REAL, BLOB)
  • Primary key and foreign key relationships
  • Index definitions
2. Semantic Layer
  • Column roles: timestamp, UUID, URL, email, path, domain
  • Pattern detection: timestamp formats, ID patterns
  • Semantic anchors: weighted confidence scores
  • Example values for pattern matching
3. Statistical Layer
  • Average string lengths per column (avg_length)
  • Row counts and data volumes
  • Value distributions and frequency patterns
  • Most common values for disambiguation

Column Roles & Semantic Anchors

Semantic anchors are weighted scores that boost confidence when specific patterns are detected. They help create a strong "fingerprint" and disambiguate between similar schemas:

Role Weight Detection Method
timestamp 0.9 Unix epoch, Cocoa timestamp, ISO 8601 formats
uuid 1.0 8-4-4-4-12 hexadecimal pattern
url 0.8 HTTP/HTTPS scheme detection
email 0.7 name@domain.tld pattern
domain 0.6 Hostname pattern without scheme
path 0.5 Filesystem path patterns

String Statistics for Disambiguation

Average string lengths (avg_length) help distinguish between similar tables with different purposes:

Example: URL vs. Title Columns
Column Type Avg Length
url TEXT 87
title TEXT 32

Both are TEXT columns, but length statistics reveal their different purposes and improve matching accuracy.

Rubric Matching Algorithm

When matching lost_and_found fragments to rubrics, MARS uses a multi-factor confidence scoring system:

  1. Column Matching: Compare column names and types between fragment and rubric
  2. Chunk Analysis: Identify longest matching column sequences (chunks)
  3. Semantic Boost: Add weights for detected semantic anchors
  4. String Validation: Compare avg_length statistics
  5. Threshold Filtering: Reject matches below minimum confidence (default 0.7)
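
A simplified sketch of that scoring flow (the combination weights and the rubric data structure used here are illustrative assumptions, not the exact MARS implementation):

def score_fragment(fragment_cols, rubric_table, min_confidence=0.7):
    """Illustrative multi-factor confidence score for one fragment/table pairing."""
    rubric_cols = {c["name"]: c for c in rubric_table["columns"]}

    # 1. Column matching: fraction of fragment columns present in the rubric table
    matched = [c for c in fragment_cols if c in rubric_cols]
    column_score = len(matched) / max(len(fragment_cols), 1)

    # 2. Chunk analysis: reward the longest run of consecutively matching columns
    longest = run = 0
    for c in fragment_cols:
        run = run + 1 if c in rubric_cols else 0
        longest = max(longest, run)
    chunk_score = longest / max(len(fragment_cols), 1)

    # 3. Semantic boost: add anchor weights attached to the matched columns
    boost = sum(rubric_cols[c].get("anchor_weight", 0.0) for c in matched)

    # 4./5. Combine (avg_length string validation omitted here) and apply threshold
    confidence = 0.6 * column_score + 0.3 * chunk_score + 0.1 * min(boost, 1.0)
    return confidence if confidence >= min_confidence else None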

Candidate Scan Pipeline

The candidate pipeline processes carved files through multiple specialized stages, each handling a specific aspect of recovery and classification.

Pipeline Stages

1. raw_scanner/         File Discovery & Categorization
   file_categorizer     ├─ Identify file types via fingerprinting
                        ├─ Skip media/executables
                        └─ Organize by artifact type

                              │
                              ▼

2. fingerprinter/       Type-Specific Identification
                        ├─ SQLite: Magic bytes + schema extraction
                        ├─ Text logs: Pattern-based classification
                        └─ Archives: Compression format detection

                              │
                              ▼

3. matcher/             Schema Matching & Classification
   rubric_matcher       ├─ Hash-based O(1) exemplar lookup
                        ├─ Rubric confidence scoring
                        └─ Group unmatched by schema similarity

                              │
                              ▼

4. db_variant_selector/ Recovery Variant Testing
                        ├─ O (Original): Test as-is
                        ├─ C (Clone): Clean copy via VACUUM
                        ├─ R (Recover): SQLite .recover output
                        ├─ D (Dissect): sqlite_dissect rebuild
                        └─ X (Failed): Send to byte carving

                              │
                ┌─────────────┴─────────────┐
                │                           │
                ▼                           ▼

5a. carver/           5b. lf_processor/
    Byte-Level Carving      Fragment Reconstruction
    ├─ Extract timestamps   ├─ MERGE: Metamatch groups
    ├─ Parse protobuf       ├─ CATALOG: Exact matches
    ├─ Analyze URLs         ├─ NEAREST: Best-fit exemplar
    └─ Output JSONL/CSV     └─ ORPHAN: Unmatched

                              │
                              ▼

6. output/              Final Organization
   structure            ├─ catalog/ (exact matches)
                        ├─ metamatches/ (identical schemas, no exemplar)
                        ├─ found_data/ (L&F orphans)
                        └─ carved/ (byte-carved)

Stage 1: File Categorization

The file categorizer scans input directories and identifies file types using magic bytes and content analysis:

  • SQLite databases: Identified by SQLite format 3\x00 magic bytes
  • Text logs: WiFi logs, system logs, install logs via pattern matching
  • Archives: gzip, bzip2 (decompressed and re-scanned)
  • Ignored: Images, videos, executables (defined in config)
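
A minimal sketch of the leading magic-byte check (the bucket names and media signatures shown are illustrative):

SQLITE_MAGIC = b"SQLite format 3\x00"   # first 16 bytes of every SQLite database
GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"

def categorize(path: str) -> str:
    """Rough file-type bucket based on leading magic bytes (sketch)."""
    with open(path, "rb") as fh:
        header = fh.read(16)
    if header.startswith(SQLITE_MAGIC):
        return "sqlite"
    if header.startswith((GZIP_MAGIC, BZIP2_MAGIC)):
        return "archive"                  # decompressed and re-scanned
    if header.startswith((b"\x89PNG", b"\xff\xd8\xff", b"GIF8")):
        return "ignored"                  # media files are skipped
    return "text_candidate"               # handed to pattern-based log classification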

Stage 2: Fingerprinting

The fingerprinter performs deep type analysis beyond simple magic bytes:

SQLite Fingerprinting
  • Extract full schema via sqlite_master queries
  • Compute schema hash (MD5 of table+column structure)
  • Count rows per table
  • Detect lost_and_found tables from recovery
Text Log Fingerprinting
  • Sample first 1000-10000 lines
  • Detect timestamp patterns (multiple formats)
  • Match against known log prefixes
  • Calculate pattern frequency for confidence scoring
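
A sketch of pattern-frequency scoring for text logs (the two regexes are examples, not the full MARS pattern set):

import re

TIMESTAMP_PATTERNS = [
    re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}"),        # ISO 8601
    re.compile(r"[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}"),     # syslog-style
]

def log_confidence(path: str, sample_lines: int = 1000) -> float:
    """Fraction of sampled lines containing a recognizable timestamp (sketch)."""
    hits = total = 0
    with open(path, "r", errors="replace") as fh:
        for total, line in enumerate(fh, start=1):
            if any(p.search(line) for p in TIMESTAMP_PATTERNS):
                hits += 1
            if total >= sample_lines:
                break
    return hits / total if total else 0.0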

Stage 3: Schema Matching

The matcher uses the exemplar hash lookup table for instant classification:

  1. Compute schema hash of candidate database
  2. O(1) lookup in exemplar_hash_lookup.json
  3. On match: Load full rubric for detailed validation
  4. On miss: Group with other unmatched databases by schema hash
Fallback Matching
If hash lookup fails, MARS falls back to full rubric comparison using fuzzy matching with configurable tolerance (default 0.8 table overlap).
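
A sketch of the lookup-then-fallback flow (exemplar_hash_lookup.json is the artifact named above; the exemplar_tables mapping and overlap formula are illustrative):

import json

def classify(candidate_hash: str, candidate_tables: set, exemplar_tables: dict,
             lookup_path: str = "exemplar_hash_lookup.json",
             overlap_threshold: float = 0.8):
    """O(1) hash lookup with a fuzzy table-overlap fallback (sketch)."""
    with open(lookup_path) as fh:
        lookup = json.load(fh)                 # {"a1b2c3d4": "Safari_History", ...}

    exemplar = lookup.get(candidate_hash)
    if exemplar:
        return exemplar, "hash"                # exact schema match

    # Fallback: fuzzy comparison by table-name overlap against each exemplar
    # (exemplar_tables maps exemplar name -> set of its table names)
    best_name, best_overlap = None, 0.0
    for name, tables in exemplar_tables.items():
        overlap = len(candidate_tables & tables) / max(len(tables), 1)
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    if best_overlap >= overlap_threshold:
        return best_name, "fuzzy"
    return None, "unmatched"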

Stage 4: Database Variant Selection

See detailed section below on variant selection system.

Stage 5: Lost & Found Reconstruction

See detailed section below on LF processing (MERGE/CATALOG/NEAREST/ORPHAN).

Stage 6: Output Organization

Final databases are organized by match quality and processing path:

Directory     Content                                                              Quality
catalog/      CATALOG exact matches + promoted NEAREST (successful L&F recovery)   Highest confidence
metamatches/  MERGE combined identical schemas (no exemplar match)                 High confidence
found_data/   ORPHAN: Unmatched L&F fragments                                      Low confidence - requires review
empty/        Catalog match with no usable data                                    May contain rejected/ data for manual review
carved/       Byte-carved residue (variant X)                                      Requires manual review

Database Variant Selection System

When a carved SQLite database is encountered, it may be corrupted, incomplete, or structurally damaged. The variant selector attempts multiple recovery strategies and chooses the best result.

The O/C/R/D/X Variant Approach

Each candidate database is processed through up to five different recovery methods:

Variant  Method    When to Use
O        Original  Test raw carved file as-is. Often works for cleanly-carved databases.
C        Clone     Copy database to clean state using VACUUM INTO. Removes freelist corruption.
R        Recover   Run sqlite3 .recover command. Creates lost_and_found tables for orphaned pages.
D        Dissect   Use sqlite_dissect to rebuild from raw pages. Only when exemplar match found.
X        Failed    All variants failed validation. Send to byte carving pipeline.
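
The C and R variants rely on standard SQLite facilities (VACUUM INTO, SQLite 3.27+, and the CLI's .recover command, SQLite 3.29+); a sketch of generating them from Python, with output naming as an assumption:

import sqlite3
import subprocess

def make_clone(src: str, dst: str) -> None:
    """Variant C: VACUUM INTO writes a defragmented copy without freelist pages."""
    conn = sqlite3.connect(src)
    try:
        conn.execute("VACUUM INTO ?", (dst,))
    finally:
        conn.close()

def make_recover(src: str, dst: str) -> None:
    """Variant R: the sqlite3 CLI's .recover emits SQL, including lost_and_found tables."""
    dump = subprocess.run(["sqlite3", src, ".recover"],
                          capture_output=True, text=True, check=True)
    subprocess.run(["sqlite3", dst], input=dump.stdout, text=True, check=True)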

Variant Selection Logic

1. Discovery & Introspection
   Attempt O, C, R (and D if exemplar matched). Collect metadata: table sets, row counts, integrity check results.

2. Hash-Based Matching
   Compute schema hash for each variant. O(1) lookup against exemplar_hash_lookup.json. Skip non-matching variants.

3. Profiling & Weighting
   For matched variants, sample up to PROFILE_TABLE_SAMPLE_LIMIT tables. Generate per-table row counts and completeness scores.

4. Best Variant Selection
   Combine profile score with base heuristic (integrity + row count). Choose highest-scoring variant. Mark others for cleanup.

Variant Scoring Heuristics

Each variant is scored based on multiple factors:

  • Integrity: PRAGMA integrity_check result (pass = higher score)
  • Row Count: Total rows across all tables (more = better)
  • Table Completeness: Percentage of expected tables present
  • Match Quality: Exact hash match vs. fuzzy table match
  • lost_and_found Presence: R variant bonus for LF reconstruction potential
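
A condensed sketch of how those factors might be combined (the weights are illustrative, not MARS's actual constants):

def score_variant(v: dict, expected_tables: int) -> float:
    """Heuristic score from the factors listed above (illustrative weights)."""
    score = 0.0
    if v.get("integrity") == "ok":
        score += 2.0                                   # integrity check passed
    score += min(v.get("rows", 0) / 1000.0, 1.0)       # more rows, capped contribution
    if expected_tables:
        score += v.get("tables_present", 0) / expected_tables   # table completeness
    if v.get("match_type") == "hash":
        score += 1.0                                   # exact hash match beats fuzzy
    if v.get("has_lf"):
        score += 0.5                                   # R-variant bonus for LF potential
    return score
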
Variant Decision Record (sqlite_scan_results.jsonl)

Each database gets a decision record documenting the chosen variant and rationale:

{
  "case_name": "f12345678",
  "chosen_variant": "R",
  "match_type": "hash",
  "exemplar_name": "Safari_History",
  "profile_score": 0.95,
  "variants": {
    "O": {"valid": true, "rows": 150, "integrity": "ok"},
    "C": {"valid": true, "rows": 150, "integrity": "ok"},
    "R": {"valid": true, "rows": 168, "has_lf": true},
    "D": {"valid": false}
  }
}

Residue Processing

After variant selection, the residue processor performs cleanup and extraction:

  1. Lost-and-Found Extraction: Extract lost_and_found_* tables from chosen variant into separate databases
  2. Storage Cleanup: Delete non-chosen variants to save disk space
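
A sketch of the lost-and-found extraction step using plain SQLite (output naming is a placeholder):

import sqlite3

def extract_lost_and_found(chosen_variant_db: str, out_db: str) -> int:
    """Copy every lost_and_found_* table into a separate database (sketch)."""
    conn = sqlite3.connect(chosen_variant_db)
    try:
        names = [r[0] for r in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type='table' AND name LIKE 'lost_and_found%'")]
        if not names:
            return 0
        conn.execute("ATTACH DATABASE ? AS lf", (out_db,))
        for name in names:
            conn.execute(f'CREATE TABLE lf."{name}" AS SELECT * FROM main."{name}"')
        conn.commit()
        return len(names)
    finally:
        conn.close()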

Lost & Found (LF) Reconstruction

When SQLite's .recover command succeeds, it creates lost_and_found tables containing orphaned database pages. The LF processor matches these fragments against exemplar rubrics and reconstructs coherent databases.

The Four Processing Modes

LF reconstruction follows a prioritized processing order based on match quality:

Mode     Description        Match Type                                       Output
MERGE    Metamatch groups   Identical schemas, no exemplar match             metamatches/
CATALOG  Exact matches      Exact schema match to exemplar                   catalog/
NEAREST  Best-fit exemplar  Matched to the nearest (not exact) exemplar;     catalog/ or empty/
                            rebuilds using the exemplar schema as template;
                            successful recoveries move to catalog/
ORPHAN   Unmatched tables   No match found; adds match hints to remnant      found_data/
                            LF tables when possible

Processing Order & Rationale

Phase 1: Split Databases
Extract lost_and_found tables from all recovered databases into separate "split" databases for matching.

Phase 2: MERGE (First)
Group databases with identical schemas (tables + columns) that don't match any exemplar. Combine into superrubrics for later matching.

Phase 3: CATALOG (Second)
Process exact matches using canonical exemplar rubrics. Highest-quality reconstruction with known-good schemas.

Phase 4: NEAREST (Third)
For databases that don't fit MERGE or CATALOG, match to the nearest exemplar based on schema similarity. Rebuild using the exemplar schema as template. Results initially go to found_data/, then Phase 7 reclassifies.

Phase 5: ORPHAN
Collect all unmatched fragments from MERGE/CATALOG/NEAREST. Preserve for manual forensic review. Must run after matching phases to capture remnants.

Phase 7: Reclassification (Final)
Reclassify NEAREST results based on recovery success: successful L&F recovery (total_lf_rows > 0) promotes to catalog/; no recovery moves to empty/. Also cleans up empty CATALOG entries.

MERGE: Metamatch Groups (Identical Schemas)

When multiple databases share identical schemas (same tables AND columns) but don't match any known exemplar, they're grouped by schema hash and processed together:

  1. Grouping: Databases classified by schema hash (tables + columns)
  2. Combining: Merge all group members into single database
  3. Superrubric Generation: Create schema rubric from merged data
  4. Fragment Matching: Match lost_and_found tables against superrubric
  5. Reconstruction: Rebuild combined database with intact + recovered data
Example: Unknown App Database Metamatch Group

Three carved databases with identical schema (no exemplar match):

  • f12345678: Unknown app database from disk offset 0x00BC
  • f23456789: Same schema found at disk offset 0x1A00
  • f34567890: Same schema found at disk offset 0x2F00

Result: Combined into a single unknown_app_a1b2c3d4 database with data from all three sources. The filename is based on the first table name.

CATALOG: Exact Matches

Exact schema matches use the canonical exemplar rubric for reconstruction:

  1. Hash Match: Schema hash exactly matches exemplar
  2. Rubric Loading: Load canonical exemplar rubric
  3. Fragment Matching: Match lost_and_found columns to rubric tables
  4. Reconstruction: Rebuild using exemplar schema as template
  5. Remnant Handling: Unmatched fragments saved to lost_and_found/

NEAREST: Best-Fit Exemplar Matching

When a database doesn't exactly match any exemplar but is close enough for useful reconstruction, NEAREST matches it to the nearest exemplar schema based on similarity:

  1. Schema Comparison: Find the nearest matching exemplar based on table and column similarity
  2. Fragment Matching: Match lost_and_found fragments against the nearest exemplar rubric
  3. Schema Rebuild: Reconstruct database using the nearest exemplar schema as template (structurally compatible with CATALOG results)
  4. Initial Output: Results go to found_data/ initially
  5. Phase 7 Reclassification: Successful recovery (L&F rows > 0) promotes to catalog/; no recovery moves to empty/
Why Reclassification?
Since NEAREST rebuilds databases using the exemplar schema as template, successful recoveries are structurally compatible with CATALOG results and can be merged with existing CATALOG entries for the same exemplar.

ORPHAN: Unmatched Fragments

Fragments that don't match any schema are preserved for manual review:

  • Collection: Gather all remnants from MERGE/CATALOG/NEAREST processing
  • Preservation: Create standalone databases with original fragment structure
  • Naming: Include match hints when partial matches existed
  • Traceability: Filenames preserved for forensic correlation
Output Example: Safari_History_f12345678_orphans.sqlite
  • Safari_History: Match hint (what we think it is)
  • f12345678: Original file offset (where found)
  • orphans: Contains unmatched lost_and_found fragments

Data Source Tracking

All reconstructed tables include a data_source column for provenance:

Value Meaning
carved_{db_name} Intact data from original database structure
found_{db_name} Reconstructed data from lost_and_found fragments
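
A sketch of how this provenance might be stamped while copying rows during reconstruction (the helper name and call pattern are illustrative, not the MARS internals):

import sqlite3

def copy_with_provenance(src_db: str, dst_conn: sqlite3.Connection,
                         table: str, source_tag: str) -> None:
    """Copy rows into the reconstructed table, tagging each with its origin (sketch)."""
    src = sqlite3.connect(src_db)
    try:
        cols = [r[1] for r in src.execute(f'PRAGMA table_info("{table}")')]
        col_list = ",".join(f'"{c}"' for c in cols)
        placeholders = ",".join("?" for _ in cols)
        for row in src.execute(f'SELECT {col_list} FROM "{table}"'):
            dst_conn.execute(
                f'INSERT INTO "{table}" ({col_list}, data_source) '
                f'VALUES ({placeholders}, ?)',
                (*row, source_tag))    # e.g. "carved_Safari_History" or "found_Safari_History"
    finally:
        src.close()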

Manifest Files

Each reconstructed database includes a *_manifest.json documenting:

  • Source databases that contributed data
  • Intact rows vs. LF-recovered rows per source
  • Remnant tables (unmatched fragments)
  • Duplicates removed during deduplication
  • Table-level statistics (row counts per table)
Forensic Traceability
Every output filename includes the original filename, allowing investigators to trace recovered data back to its origins. The data_source column provides row-level provenance tracking.

Byte-Carving Pipeline

When all variant recovery methods fail (variant X), MARS falls back to byte-level carving. This extracts raw data directly from database pages without relying on SQLite's structural integrity.

Carving Strategy

The byte carver processes databases page-by-page, extracting forensic artifacts:

1. Timestamp Detection

Identifies multiple timestamp formats:

  • Unix epoch: Seconds, milliseconds, nanoseconds since 1970
  • Cocoa/Core Data: Seconds since 2001-01-01 (macOS/iOS)
  • Chrome: Microseconds since 1601-01-01
  • WebKit: Seconds since 2001-01-01
  • Windows FILETIME: 100ns ticks since 1601
  • ISO 8601: Text timestamps (2025-01-18T10:30:00Z)
2. URL Extraction & Analysis

Uses regex patterns to find URLs, then analyzes with Unfurl:

  • Parse query parameters and fragments
  • Extract embedded timestamps from URL structure
  • Detect UUIDs and session IDs in paths
  • Identify platform (Facebook, YouTube, Twitter, etc.)
3. Protobuf Decoding

Uses blackboxprotobuf for schema-agnostic decoding:

  • Detect protobuf magic patterns in BLOB fields
  • Decode without schema using type inference
  • Extract nested timestamps and strings
  • Generate JSON representation with inferred typedef
4. Text Extraction
  • Extract printable ASCII strings (minimum length filtering)
  • Preserve context (surrounding bytes for analysis)
  • Deduplicate close timestamps to reduce noise
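
A sketch of range-based classification for raw integers found in pages, covering the timestamp formats in item 1 above (epoch offsets are the standard constants; the 2015-2030 window mirrors the carver's default ts_start/ts_end filter; first plausible match wins in this simplified version):

from datetime import datetime, timezone

COCOA_EPOCH_OFFSET = 978307200       # seconds from 1970-01-01 to 2001-01-01
CHROME_EPOCH_OFFSET = 11644473600    # seconds from 1601-01-01 to 1970-01-01

def classify_timestamp(value: int,
                       lo=datetime(2015, 1, 1, tzinfo=timezone.utc),
                       hi=datetime(2030, 1, 1, tzinfo=timezone.utc)):
    """Return (format, datetime) if the integer is plausibly a timestamp, else None."""
    candidates = {
        "unix_seconds":        value,
        "unix_milliseconds":   value / 1_000,
        "unix_nanoseconds":    value / 1_000_000_000,
        "cocoa_seconds":       value + COCOA_EPOCH_OFFSET,
        "chrome_microseconds": value / 1_000_000 - CHROME_EPOCH_OFFSET,
    }
    for fmt, epoch_seconds in candidates.items():
        try:
            dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
        except (OverflowError, OSError, ValueError):
            continue                 # value out of representable range for this format
        if lo <= dt <= hi:
            return fmt, dt
    return None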

Carving Process Flow

Database File (Variant X)
        │
        ▼
┌───────────────────┐
│ Read Page-by-Page │ (4096-byte SQLite pages)
└───────────────────┘
        │
        ├─── Scan for numeric values ───→ Timestamp Classifier
        │                                  ├─ Unix epoch?
        │                                  ├─ Cocoa timestamp?
        │                                  ├─ Chrome time?
        │                                  └─ Valid range filter
        │
        ├─── Regex match URLs ──────────→ Unfurl Analyzer
        │                                  ├─ Parse structure
        │                                  ├─ Extract query params
        │                                  └─ Detect embedded timestamps
        │
        ├─── Scan for BLOB data ────────→ Protobuf Decoder
        │                                  ├─ blackboxprotobuf decode
        │                                  ├─ Extract nested data
        │                                  └─ Convert to JSON
        │
        └─── Extract text strings ──────→ Text Scanner
                                           ├─ Printable ASCII filter
                                           └─ Context preservation
        │
        ▼
┌──────────────────────────────────┐
│ Output Generation                │
│ ├─ timestamps.csv (optional)     │
│ ├─ carved.jsonl (detailed)       │
│ └─ carved.db (structured SQLite) │
└──────────────────────────────────┘

Integration with Unfurl

Unfurl provides context-aware URL analysis that helps distinguish real timestamps from ID values:

Example: Facebook URL Analysis
https://facebook.com/photo.php?fbid=123456789&id=987654321

Unfurl extracts:

  • Platform: facebook
  • Photo ID: 123456789 (NOT a timestamp, despite numeric format)
  • User ID: 987654321 (confirmed ID, not time)

This context prevents false positive timestamp classifications.

Integration with blackboxprotobuf

Schema-agnostic protobuf decoding recovers structured data without requiring .proto definitions:

Protobuf Decoding Example
Input: Binary BLOB (unknown structure)

blackboxprotobuf output:
{
  "message": {
    "1": "user@example.com",
    "2": 1705584600,
    "3": {
      "1": "https://example.com/path",
      "2": 42
    }
  },
  "typedef": {
    "1": {"type": "string"},
    "2": {"type": "int"},
    "3": {"type": "message", ...}
  }
}

Field "2" is identified as an integer and classified as a timestamp candidate by the timestamp detector.

Carving Output Formats

Format Use Case Content
timestamps.csv Timeline analysis All detected timestamps with format, confidence, and offset
carved.jsonl Detailed review Page-by-page extraction with URLs, protobuf, text, and context
carved.db Structured analysis SQLite database with tables for timestamps, URLs, and extracted data

dfVFS Integration

MARS uses Digital Forensics Virtual File System (dfVFS) for universal disk image access. This provides consistent file access across all forensic image formats and archive types.

Supported Formats

Category Formats
Disk Images E01, Ex01, DD, DMG, VMDK, VHD
Volumes GPT, APM, MBR, APFS containers
Filesystems APFS, HFS+, NTFS, ext4, FAT
Archives TAR, ZIP, GZIP, BZIP2

Glob Pattern Matching

MARS extends dfVFS with full globstar (**) support for flexible pattern matching:

Glob Pattern Examples
Pattern Matches
/Users/*/Library/Safari/History.db Safari history for any user
**/Library/Caches All cache directories (any depth)
/private/var/**/com.apple.*.db Apple databases in /private/var tree

Globstar Implementation

The ** wildcard matches zero or more directory levels:

  • ** at start: Matches from root (any depth prefix)
  • /**/ in middle: Matches zero or more intermediate directories
  • /** at end: Matches everything below current level
  • * alone: Matches exactly one directory level
Pattern:
**/Library/Caches

Matches:

  • /Library/Caches (zero segments before)
  • /Users/admin/Library/Caches (two segments before)
  • /System/Volumes/Data/Users/admin/Library/Caches (many segments)

Does NOT match:

  • /Users/admin/Library/Caches/Chrome (extends past pattern)
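
A compact sketch of segment-wise globstar matching (MARS's actual implementation walks dfVFS file entries rather than plain path strings; this only illustrates the ** semantics):

from fnmatch import fnmatch

def glob_match(pattern: str, path: str) -> bool:
    """Match a path against a pattern where '**' spans zero or more segments."""
    return _match(pattern.strip("/").split("/"), path.strip("/").split("/"))

def _match(pat, segs):
    if not pat:
        return not segs              # pattern exhausted: path must be exhausted too
    if pat[0] == "**":
        # '**' matches zero segments, or consumes one segment and tries again
        return _match(pat[1:], segs) or (bool(segs) and _match(pat, segs[1:]))
    if not segs:
        return False
    return fnmatch(segs[0], pat[0]) and _match(pat[1:], segs[1:])

# glob_match("**/Library/Caches", "/Users/admin/Library/Caches")         -> True
# glob_match("**/Library/Caches", "/Users/admin/Library/Caches/Chrome")  -> False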

Volume System Integration

dfVFS provides automatic volume enumeration and metadata extraction:

  • GPT Partitions: Partition name, GUID, size, type
  • APFS Containers: Volume names, encryption status, roles
  • Volume Labels: Filesystem labels from volume attributes

EWF/E01 Mount Utilities

MARS includes utilities for mounting forensic images for interactive exploration:

  • macOS: Requires Fuse-T for userspace filesystem mounting
  • Windows: Arsenal Image Mounter or similar tools
  • Linux: libewf + FUSE for native EWF support

Directory Filtering & Exclusion

To prevent hangs on directories with millions of cache files, MARS includes smart exclusions:

Skip Directories (Performance)
  • */*/Library/Caches/Google/Chrome/*/Code Cache
  • */*/Library/Caches/com.apple.Safari/fsCachedData
  • */*/Library/Caches/*/com.apple.metal
  • */*/Library/Caches/*/GPUCache

These directories contain hundreds of thousands of small cache files that slow down scans without forensic value.

Data Volume Variants
MARS automatically generates System/Volumes/Data/ pattern variants for paths starting with Users/, Library/, private/, or var/. This handles macOS Big Sur+ volume layout changes.

Configuration System

MARS uses a centralized configuration system defined in config/schema.py. All settings are organized into logical sections with dataclasses.

Configuration Sections

MatchingConfig

Controls database matching and confidence thresholds:

  • min_confidence: Minimum match confidence (default 0.7)
  • min_rows: Minimum rows required for valid match (default 10)
  • min_columns: Minimum columns for substantial match (default 3)
  • semantic_anchor_threshold: Minimum anchor score (default 2.0)
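
A sketch of what such a section might look like as a dataclass (field names and defaults follow the list above; the authoritative definitions live in config/schema.py):

from dataclasses import dataclass

@dataclass
class MatchingConfig:
    """Database matching thresholds (defaults mirror the values listed above)."""
    min_confidence: float = 0.7              # minimum match confidence
    min_rows: int = 10                       # minimum rows required for a valid match
    min_columns: int = 3                     # minimum columns for a substantial match
    semantic_anchor_threshold: float = 2.0   # minimum combined anchor score
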
SemanticAnchorWeights

Weights for pattern detection in matching:

  • uuid: 1.0 (UUID pattern weight)
  • timestamp_text: 0.9 (timestamp detection)
  • url: 0.8 (URL pattern)
  • email: 0.7 (email pattern)
  • uuid_in_pk: 2.0 (UUID in primary key - strong signal)
SchemaComparisonConfig

Tables and prefixes to ignore during schema comparison:

  • GLOBAL_IGNORABLE_TABLES: sqlite_sequence, sqlite_stat*, meta, z_metadata, etc.
  • ignorable_prefixes: sqlite_, sqlean_ (system tables)
  • ignorable_suffixes: _content, _segments, _segdir (FTS tables)
  • salvage_tables: lost_and_found, carved, recovered_rows
VariantSelectorConfig

Database variant selection behavior:

  • dissect_all: Attempt sqlite_dissect on all variants (default false)
CarverConfig

Byte-carving settings:

  • ts_start: Filter timestamps after this date (default 2015-01-01)
  • ts_end: Filter timestamps before this date (default 2030-01-01)
  • filter_mode: 'permissive', 'balanced', 'strict', or 'all'
  • decode_protobuf: Attempt protobuf decoding (default true)
  • csv_export: Generate CSV output (default false)
ExemplarScanConfig

Exemplar scanning parameters:

  • epoch_min: Minimum valid timestamp date (default 2000-01-01)
  • epoch_max: Maximum valid timestamp date (default 2038-01-19)
  • min_role_sample_size: Minimum rows for semantic role assignment (default 5)
  • enabled_catalog_groups: Database groups to include (empty = all)
  • excluded_file_types: File types to skip (e.g., 'cache', 'log')

Ignorable Tables

GLOBAL_IGNORABLE_TABLES defines tables that are always filtered during schema comparison:

Category Tables
SQLite System sqlite_sequence, sqlite_stat1/2/3/4, sqlite_master, sqlite_temp_master
Extensions sqlean_define, rtree_*, fts5_*, etc.
CoreData meta, dbinfo, z_primarykey, z_metadata, z_modelcache

User-Configurable Settings

Settings marked with user_configurable=True are exposed in the UI and saved to .marsproj files:

  • Basic: Debug mode, progress bars
  • Exemplar: Date ranges, catalog groups, excluded file types
  • Carver: Timestamp filtering, protobuf decoding, CSV export
  • Advanced: Variant selection, dissect options