MARS Technical Architecture

Deep dive into the forensic recovery pipeline

System Overview

The MARS architecture follows a two-phase approach: first building a reference knowledge base (exemplars), then using that knowledge to classify and recover data from damaged or carved databases (candidates).

High-Level Architecture

┌─────────────────────────────────────────────────────────┐
│                   EXEMPLAR PHASE                        │
│  (Known-Good macOS System → Reference Knowledge Base)   │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
         ┌──────────────────────────────────┐
         │   Scan Live System / Image       │
         │   Extract Database Schemas       │
         └──────────────────────────────────┘
                           │
                           ▼
         ┌──────────────────────────────────┐
         │   Generate Rubrics (JSON)        │
         │   Create Hash Lookup Table       │
         └──────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                   CANDIDATE PHASE                       │
│     (Carved Files → Classified & Recovered Data)        │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
         ┌──────────────────────────────────┐
         │   File Categorization            │
         │   (raw_scanner/file_categorizer) │
         └──────────────────────────────────┘
                           │
                           ▼
         ┌──────────────────────────────────┐
         │   Database Variant Selection     │
         │   (O/C/R/D/X variant testing)    │
         └──────────────────────────────────┘
                           │
                ┌──────────┴──────────┐
                │                     │
                ▼                     ▼
    ┌──────────────────┐   ┌──────────────────┐
    │  Variant X       │   │  Valid Variants  │
    │  (Empty/Failed)  │   │  (O/C/R/D)       │
    └──────────────────┘   └──────────────────┘
                │                     │
                ▼                     ▼
    ┌──────────────────┐   ┌──────────────────┐
    │  Byte Carving    │   │  LF Processor    │
    │  (Extract raw)   │   │  (4 modes)       │
    └──────────────────┘   └──────────────────┘
                │                     │
                └──────────┬──────────┘
                           ▼
         ┌──────────────────────────────────┐
         │   Final Output Organization      │
         │   (catalog/, metamatches/, etc.) │
         └──────────────────────────────────┘

Data Flow: Exemplar → Candidate

The system operates in two phases:

Phase 1: Exemplar Scanning

Scans a known-good macOS system to extract database schemas and create matching rubrics. This builds the "source of truth" for identifying carved databases.

  • Input: Live macOS system, disk image, or directory
  • Output: Database schemas, JSON rubrics, hash lookup table
  • Key Artifact: exemplar_hash_lookup.json enables O(1) matching
Phase 2: Candidate Processing

Processes carved/recovered files and matches them against exemplar schemas. Recovers data from damaged databases using multiple strategies.

  • Input: Carved files from PhotoRec, Scalpel, etc.
  • Output: Classified databases, reconstructed tables, carved data

Key Terminology

Term           Definition
Exemplar       Reference database from known-good system; provides schema baseline
Candidate      Carved/recovered database being analyzed against exemplars
Rubric         JSON schema definition with table structures, column roles, and statistics
Variant        Recovery method: O=Original, C=Clone, R=Recover, D=Dissect, X=Failed
Lost & Found   SQLite's recovery mechanism for orphaned database pages
Catalog        Exact schema matches - recovered databases that perfectly match exemplar schemas
Metamatch      Non-catalog matches with intact data; identical schemas are combined and deduplicated
Found Data     ORPHAN: Unmatched L&F fragments preserved for manual review
Design Philosophy
MARS prioritizes traceability. Every output filename includes the original source filename, and reconstructed rows carry data-source provenance, ensuring all recovered data can be traced back to its origins.

Exemplar Scanning

Exemplar scanning builds the reference knowledge base that powers the entire recovery pipeline. By analyzing known-good databases from a live macOS system, MARS creates detailed schema fingerprints that enable accurate classification and reconstruction of carved data.

What Gets Extracted

The exemplar scanner performs comprehensive analysis of each database:

  • Schema Structure: Full table and column definitions
  • Column Roles: Semantic classification (timestamp, UUID, URL, email, etc.)
  • String Statistics: Average string lengths for disambiguation
  • Example Values: Sample data for pattern matching
  • Row Counts: Expected data volumes per table

Rubric Generation Process

1. File Discovery
   Scan source using dfVFS, matching patterns from artifact_recovery_catalog.yaml.

2. Schema Extraction
   For each SQLite database, extract complete schema using sqlite_master queries. Identify primary keys, foreign keys, and indexes.

3. Semantic Analysis
   Analyze column data to assign semantic roles (timestamp, UUID, URL, email, path). Extract example values and calculate string statistics.

4. Rubric Generation
   Combine schema + semantics into JSON rubric format. Save to databases/schemas/{db_name}.rubric.json.

5. Hash Lookup Creation
   Compute MD5 hash of schema signature (tables + columns). Store in exemplar_hash_lookup.json for O(1) matching.

Hash-Based Fast Matching

The hash lookup table enables rapid O(1) exemplar identification without loading full rubrics or iterating through databases:

Schema Hash Format
MD5(table1|col1,col2,col3|table2|col4,col5)

Each table is represented by name + sorted column names, separated by pipes. This creates a unique fingerprint for each schema variant.
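
A minimal sketch of computing this signature with Python's standard library (the exact canonicalization MARS applies beyond the stated table|column format is an assumption here):

import hashlib
import sqlite3

def schema_hash(db_path: str) -> str:
    """Build the pipe-delimited table/column signature and hash it (sketch)."""
    conn = sqlite3.connect(db_path)
    try:
        tables = [r[0] for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
        parts = []
        for table in tables:
            cols = sorted(r[1] for r in conn.execute(f'PRAGMA table_info("{table}")'))
            parts.append(table)
            parts.append(",".join(cols))
        signature = "|".join(parts)          # table1|col1,col2,col3|table2|col4,col5
        return hashlib.md5(signature.encode("utf-8")).hexdigest()
    finally:
        conn.close()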

Lookup Table Structure
{
  "a1b2c3d4": "Safari_History",
  "e5f6a7b8": "Chrome_Cookies",
  "c9d0e1f2": "Messages_chat"
}

Schema hashes include both table names AND column names, making false positives virtually impossible. Only databases with identical structures will match.

Rubric System

Rubrics are the intelligence layer that enables MARS to match fragmented data back to its original table structure.

Rubric Anatomy

A complete rubric contains three layers of information:

1. Structural Layer
  • Table names and column definitions
  • Data types (INTEGER, TEXT, REAL, BLOB)
  • Primary key and foreign key relationships
  • Index definitions
2. Semantic Layer
  • Column roles: timestamp, UUID, URL, email, path, domain
  • Pattern detection: timestamp formats, ID patterns
  • Semantic anchors: weighted confidence scores
  • Example values for pattern matching
3. Statistical Layer
  • Average string lengths per column (avg_length)
  • Row counts and data volumes
  • Value distributions and frequency patterns
  • Most common values for disambiguation

Column Roles & Semantic Anchors

Semantic anchors are weighted scores that boost confidence when specific patterns are detected. They help create a strong "fingerprint" and disambiguate between similar schemas:

Role Weight Detection Method
timestamp 0.9 Unix epoch, Cocoa timestamp, ISO 8601 formats
uuid 1.0 8-4-4-4-12 hexadecimal pattern
url 0.8 HTTP/HTTPS scheme detection
email 0.7 name@domain.tld pattern
domain 0.6 Hostname pattern without scheme
path 0.5 Filesystem path patterns

String Statistics for Disambiguation

Average string lengths (avg_length) help distinguish between similar tables with different purposes:

Example: URL vs. Title Columns
Column Type Avg Length
url TEXT 87
title TEXT 32

Both are TEXT columns, but length statistics reveal their different purposes and improve matching accuracy.

Rubric Matching Algorithm

When matching lost_and_found fragments to rubrics, MARS uses a multi-factor confidence scoring system:

  1. Column Matching: Compare column names and types between fragment and rubric
  2. Chunk Analysis: Identify longest matching column sequences (chunks)
  3. Semantic Boost: Add weights for detected semantic anchors
  4. String Validation: Compare avg_length statistics
  5. Threshold Filtering: Reject matches below minimum confidence (default 0.7)
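
A simplified sketch of that scoring flow (the combination weights and the rubric data structure used here are illustrative assumptions, not the exact MARS implementation):

def score_fragment(fragment_cols, rubric_table, min_confidence=0.7):
    """Illustrative multi-factor confidence score for one fragment/table pairing."""
    rubric_cols = {c["name"]: c for c in rubric_table["columns"]}

    # 1. Column matching: fraction of fragment columns present in the rubric table
    matched = [c for c in fragment_cols if c in rubric_cols]
    column_score = len(matched) / max(len(fragment_cols), 1)

    # 2. Chunk analysis: reward the longest run of consecutively matching columns
    longest = run = 0
    for c in fragment_cols:
        run = run + 1 if c in rubric_cols else 0
        longest = max(longest, run)
    chunk_score = longest / max(len(fragment_cols), 1)

    # 3. Semantic boost: add anchor weights attached to the matched columns
    boost = sum(rubric_cols[c].get("anchor_weight", 0.0) for c in matched)

    # 4./5. Combine (avg_length string validation omitted here) and apply threshold
    confidence = 0.6 * column_score + 0.3 * chunk_score + 0.1 * min(boost, 1.0)
    return confidence if confidence >= min_confidence else None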

Candidate Scan Pipeline

The candidate pipeline processes carved files through multiple specialized stages, each handling a specific aspect of recovery and classification.

Pipeline Stages

1. raw_scanner/         File Discovery & Categorization
   file_categorizer     ├─ Identify file types via fingerprinting
                        ├─ Skip media/executables
                        └─ Organize by artifact type

                              │
                              ▼

2. fingerprinter/       Type-Specific Identification
                        ├─ SQLite: Magic bytes + schema extraction
                        ├─ Text logs: Pattern-based classification
                        └─ Archives: Compression format detection

                              │
                              ▼

3. matcher/             Schema Matching & Classification
   rubric_matcher       ├─ Hash-based O(1) exemplar lookup
                        ├─ Rubric confidence scoring
                        └─ Group unmatched by schema similarity

                              │
                              ▼

4. db_variant_selector/ Recovery Variant Testing
                        ├─ O (Original): Test as-is
                        ├─ C (Clone): Clean copy via VACUUM
                        ├─ R (Recover): SQLite .recover output
                        ├─ D (Dissect): sqlite_dissect rebuild
                        └─ X (Failed): Send to byte carving

                              │
                ┌─────────────┴─────────────┐
                │                           │
                ▼                           ▼

5a. carver/           5b. lf_processor/
    Byte-Level Carving      Fragment Reconstruction
    ├─ Extract timestamps   ├─ MERGE: Metamatch groups
    ├─ Parse protobuf       ├─ CATALOG: Exact matches
    ├─ Analyze URLs         ├─ NEAREST: Best-fit exemplar
    └─ Output JSONL/CSV     └─ ORPHAN: Unmatched

                              │
                              ▼

6. output/              Final Organization
   structure            ├─ catalog/ (exact matches)
                        ├─ metamatches/ (identical schemas, no exemplar)
                        ├─ found_data/ (L&F orphans)
                        └─ carved/ (byte-carved)

Stage 1: File Categorization

The file categorizer scans input directories and identifies file types using magic bytes and content analysis:

  • SQLite databases: Identified by SQLite format 3\x00 magic bytes
  • Text logs: WiFi logs, system logs, install logs via pattern matching
  • Archives: gzip, bzip2 (decompressed and re-scanned)
  • Ignored: Images, videos, executables (defined in config)
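
A minimal sketch of the leading magic-byte check (the bucket names and media signatures shown are illustrative):

SQLITE_MAGIC = b"SQLite format 3\x00"   # first 16 bytes of every SQLite database
GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"

def categorize(path: str) -> str:
    """Rough file-type bucket based on leading magic bytes (sketch)."""
    with open(path, "rb") as fh:
        header = fh.read(16)
    if header.startswith(SQLITE_MAGIC):
        return "sqlite"
    if header.startswith((GZIP_MAGIC, BZIP2_MAGIC)):
        return "archive"                  # decompressed and re-scanned
    if header.startswith((b"\x89PNG", b"\xff\xd8\xff", b"GIF8")):
        return "ignored"                  # media files are skipped
    return "text_candidate"               # handed to pattern-based log classification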

Stage 2: Fingerprinting

The fingerprinter performs deep type analysis beyond simple magic bytes:

SQLite Fingerprinting
  • Extract full schema via sqlite_master queries
  • Compute schema hash (MD5 of table+column structure)
  • Count rows per table
  • Detect lost_and_found tables from recovery
Text Log Fingerprinting
  • Sample first 1000-10000 lines
  • Detect timestamp patterns (multiple formats)
  • Match against known log prefixes
  • Calculate pattern frequency for confidence scoring
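
A sketch of pattern-frequency scoring for text logs (the two regexes are examples, not the full MARS pattern set):

import re

TIMESTAMP_PATTERNS = [
    re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}"),        # ISO 8601
    re.compile(r"[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}"),     # syslog-style
]

def log_confidence(path: str, sample_lines: int = 1000) -> float:
    """Fraction of sampled lines containing a recognizable timestamp (sketch)."""
    hits = total = 0
    with open(path, "r", errors="replace") as fh:
        for total, line in enumerate(fh, start=1):
            if any(p.search(line) for p in TIMESTAMP_PATTERNS):
                hits += 1
            if total >= sample_lines:
                break
    return hits / total if total else 0.0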

Stage 3: Schema Matching

The matcher uses the exemplar hash lookup table for instant classification:

  1. Compute schema hash of candidate database
  2. O(1) lookup in exemplar_hash_lookup.json
  3. On match: Load full rubric for detailed validation
  4. On miss: Group with other unmatched databases by schema hash
Fallback Matching
If hash lookup fails, MARS falls back to full rubric comparison using fuzzy matching with configurable tolerance (default 0.8 table overlap).
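
A sketch of the lookup-then-fallback flow (exemplar_hash_lookup.json is the artifact named above; the exemplar_tables mapping and overlap formula are illustrative):

import json

def classify(candidate_hash: str, candidate_tables: set, exemplar_tables: dict,
             lookup_path: str = "exemplar_hash_lookup.json",
             overlap_threshold: float = 0.8):
    """O(1) hash lookup with a fuzzy table-overlap fallback (sketch)."""
    with open(lookup_path) as fh:
        lookup = json.load(fh)                 # {"a1b2c3d4": "Safari_History", ...}

    exemplar = lookup.get(candidate_hash)
    if exemplar:
        return exemplar, "hash"                # exact schema match

    # Fallback: fuzzy comparison by table-name overlap against each exemplar
    # (exemplar_tables maps exemplar name -> set of its table names)
    best_name, best_overlap = None, 0.0
    for name, tables in exemplar_tables.items():
        overlap = len(candidate_tables & tables) / max(len(tables), 1)
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    if best_overlap >= overlap_threshold:
        return best_name, "fuzzy"
    return None, "unmatched"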

Stage 4: Database Variant Selection

See detailed section below on variant selection system.

Stage 5: Lost & Found Reconstruction

See detailed section below on LF processing (MERGE/CATALOG/NEAREST/ORPHAN).

Stage 6: Output Organization

Final databases are organized by match quality and processing path:

Directory     Content                                                              Quality
catalog/      CATALOG exact matches + promoted NEAREST (successful L&F recovery)   Highest confidence
metamatches/  MERGE combined identical schemas (no exemplar match)                 High confidence
found_data/   ORPHAN: Unmatched L&F fragments                                      Low confidence - requires review
empty/        Catalog match with no usable data                                    May contain rejected/ data for manual review
carved/       Byte-carved residue (variant X)                                      Requires manual review

Database Variant Selection System

When a carved SQLite database is encountered, it may be corrupted, incomplete, or structurally damaged. The variant selector attempts multiple recovery strategies and chooses the best result.

The O/C/R/D/X Variant Approach

Each candidate database is processed through up to five different recovery methods:

Variant  Method    When to Use
O        Original  Test raw carved file as-is. Often works for cleanly-carved databases.
C        Clone     Copy database to clean state using VACUUM INTO. Removes freelist corruption.
R        Recover   Run sqlite3 .recover command. Creates lost_and_found tables for orphaned pages.
D        Dissect   Use sqlite_dissect to rebuild from raw pages. Only when exemplar match found.
X        Failed    All variants failed validation. Send to byte carving pipeline.
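
The C and R variants rely on standard SQLite facilities (VACUUM INTO, SQLite 3.27+, and the CLI's .recover command, SQLite 3.29+); a sketch of generating them from Python, with output naming as an assumption:

import sqlite3
import subprocess

def make_clone(src: str, dst: str) -> None:
    """Variant C: VACUUM INTO writes a defragmented copy without freelist pages."""
    conn = sqlite3.connect(src)
    try:
        conn.execute("VACUUM INTO ?", (dst,))
    finally:
        conn.close()

def make_recover(src: str, dst: str) -> None:
    """Variant R: the sqlite3 CLI's .recover emits SQL, including lost_and_found tables."""
    dump = subprocess.run(["sqlite3", src, ".recover"],
                          capture_output=True, text=True, check=True)
    subprocess.run(["sqlite3", dst], input=dump.stdout, text=True, check=True)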

Variant Selection Logic

1. Discovery & Introspection
   Attempt O, C, R (and D if exemplar matched). Collect metadata: table sets, row counts, integrity check results.

2. Hash-Based Matching
   Compute schema hash for each variant. O(1) lookup against exemplar_hash_lookup.json. Skip non-matching variants.

3. Profiling & Weighting
   For matched variants, sample up to PROFILE_TABLE_SAMPLE_LIMIT tables. Generate per-table row counts and completeness scores.

4. Best Variant Selection
   Combine profile score with base heuristic (integrity + row count). Choose highest-scoring variant. Mark others for cleanup.

Variant Scoring Heuristics

Each variant is scored based on multiple factors:

  • Integrity: PRAGMA integrity_check result (pass = higher score)
  • Row Count: Total rows across all tables (more = better)
  • Table Completeness: Percentage of expected tables present
  • Match Quality: Exact hash match vs. fuzzy table match
  • lost_and_found Presence: R variant bonus for LF reconstruction potential
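
A condensed sketch of how those factors might be combined (the weights are illustrative, not MARS's actual constants):

def score_variant(v: dict, expected_tables: int) -> float:
    """Heuristic score from the factors listed above (illustrative weights)."""
    score = 0.0
    if v.get("integrity") == "ok":
        score += 2.0                                   # integrity check passed
    score += min(v.get("rows", 0) / 1000.0, 1.0)       # more rows, capped contribution
    if expected_tables:
        score += v.get("tables_present", 0) / expected_tables   # table completeness
    if v.get("match_type") == "hash":
        score += 1.0                                   # exact hash match beats fuzzy
    if v.get("has_lf"):
        score += 0.5                                   # R-variant bonus for LF potential
    return score
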
Variant Decision Record (sqlite_scan_results.jsonl)

Each database gets a decision record documenting the chosen variant and rationale:

{
  "case_name": "f12345678",
  "chosen_variant": "R",
  "match_type": "hash",
  "exemplar_name": "Safari_History",
  "profile_score": 0.95,
  "variants": {
    "O": {"valid": true, "rows": 150, "integrity": "ok"},
    "C": {"valid": true, "rows": 150, "integrity": "ok"},
    "R": {"valid": true, "rows": 168, "has_lf": true},
    "D": {"valid": false}
  }
}

Residue Processing

After variant selection, the residue processor performs cleanup and extraction:

  1. Lost-and-Found Extraction: Extract lost_and_found_* tables from chosen variant into separate databases
  2. Storage Cleanup: Delete non-chosen variants to save disk space
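
A sketch of the lost-and-found extraction step using plain SQLite (output naming is a placeholder):

import sqlite3

def extract_lost_and_found(chosen_variant_db: str, out_db: str) -> int:
    """Copy every lost_and_found_* table into a separate database (sketch)."""
    conn = sqlite3.connect(chosen_variant_db)
    try:
        names = [r[0] for r in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type='table' AND name LIKE 'lost_and_found%'")]
        if not names:
            return 0
        conn.execute("ATTACH DATABASE ? AS lf", (out_db,))
        for name in names:
            conn.execute(f'CREATE TABLE lf."{name}" AS SELECT * FROM main."{name}"')
        conn.commit()
        return len(names)
    finally:
        conn.close()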

Lost & Found (LF) Reconstruction

When SQLite's .recover command succeeds, it creates lost_and_found tables containing orphaned database pages. The LF processor matches these fragments against exemplar rubrics and reconstructs coherent databases.

The Four Processing Modes

LF reconstruction follows a prioritized processing order based on match quality:

Mode     Description        Match Type                                       Output
MERGE    Metamatch groups   Identical schemas, no exemplar match             metamatches/
CATALOG  Exact matches      Exact schema match to exemplar                   catalog/
NEAREST  Best-fit exemplar  Matched to the nearest (not exact) exemplar;     catalog/ or empty/
                            rebuilds using the exemplar schema as template;
                            successful recoveries move to catalog/
ORPHAN   Unmatched tables   No match found; adds match hints to remnant      found_data/
                            LF tables when possible

Processing Order & Rationale

Phase 1: Split Databases
Extract lost_and_found tables from all recovered databases into separate "split" databases for matching.

Phase 2: MERGE (First)
Group databases with identical schemas (tables + columns) that don't match any exemplar. Combine into superrubrics for later matching.

Phase 3: CATALOG (Second)
Process exact matches using canonical exemplar rubrics. Highest-quality reconstruction with known-good schemas.

Phase 4: NEAREST (Third)
For databases that don't fit MERGE or CATALOG, match to the nearest exemplar based on schema similarity. Rebuild using the exemplar schema as template. Results initially go to found_data/, then Phase 7 reclassifies.

Phase 5: ORPHAN
Collect all unmatched fragments from MERGE/CATALOG/NEAREST. Preserve for manual forensic review. Must run after matching phases to capture remnants.

Phase 7: Reclassification (Final)
Reclassify NEAREST results based on recovery success: successful L&F recovery (total_lf_rows > 0) promotes to catalog/; no recovery moves to empty/. Also cleans up empty CATALOG entries.

MERGE: Metamatch Groups (Identical Schemas)

When multiple databases share identical schemas (same tables AND columns) but don't match any known exemplar, they're grouped by schema hash and processed together:

  1. Grouping: Databases classified by schema hash (tables + columns)
  2. Combining: Merge all group members into single database
  3. Superrubric Generation: Create schema rubric from merged data
  4. Fragment Matching: Match lost_and_found tables against superrubric
  5. Reconstruction: Rebuild combined database with intact + recovered data
Example: Unknown App Database Metamatch Group

Three carved databases with identical schema (no exemplar match):

  • f12345678: Unknown app database from disk offset 0x00BC
  • f23456789: Same schema found at disk offset 0x1A00
  • f34567890: Same schema found at disk offset 0x2F00

Result: Combined into a single unknown_app_a1b2c3d4 database with data from all three sources. The filename is based on the first table name.

CATALOG: Exact Matches

Exact schema matches use the canonical exemplar rubric for reconstruction:

  1. Hash Match: Schema hash exactly matches exemplar
  2. Rubric Loading: Load canonical exemplar rubric
  3. Fragment Matching: Match lost_and_found columns to rubric tables
  4. Reconstruction: Rebuild using exemplar schema as template
  5. Remnant Handling: Unmatched fragments saved to lost_and_found/

NEAREST: Best-Fit Exemplar Matching

When a database doesn't exactly match any exemplar but is close enough for useful reconstruction, NEAREST matches it to the nearest exemplar schema based on similarity:

  1. Schema Comparison: Find the nearest matching exemplar based on table and column similarity
  2. Fragment Matching: Match lost_and_found fragments against the nearest exemplar rubric
  3. Schema Rebuild: Reconstruct database using the nearest exemplar schema as template (structurally compatible with CATALOG results)
  4. Initial Output: Results go to found_data/ initially
  5. Phase 7 Reclassification: Successful recovery (L&F rows > 0) promotes to catalog/; no recovery moves to empty/
Why Reclassification?
Since NEAREST rebuilds databases using the exemplar schema as template, successful recoveries are structurally compatible with CATALOG results and can be merged with existing CATALOG entries for the same exemplar.

ORPHAN: Unmatched Fragments

Fragments that don't match any schema are preserved for manual review:

  • Collection: Gather all remnants from MERGE/CATALOG/NEAREST processing
  • Preservation: Create standalone databases with original fragment structure
  • Naming: Include match hints when partial matches existed
  • Traceability: Filenames preserved for forensic correlation
Output Example: Safari_History_f12345678_orphans.sqlite
  • Safari_History: Match hint (what we think it is)
  • f12345678: Original file offset (where found)
  • orphans: Contains unmatched lost_and_found fragments

Data Source Tracking

All reconstructed tables include a data_source column for provenance:

Value Meaning
carved_{db_name} Intact data from original database structure
found_{db_name} Reconstructed data from lost_and_found fragments
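
A sketch of how this provenance might be stamped while copying rows during reconstruction (the helper name and call pattern are illustrative, not the MARS internals):

import sqlite3

def copy_with_provenance(src_db: str, dst_conn: sqlite3.Connection,
                         table: str, source_tag: str) -> None:
    """Copy rows into the reconstructed table, tagging each with its origin (sketch)."""
    src = sqlite3.connect(src_db)
    try:
        cols = [r[1] for r in src.execute(f'PRAGMA table_info("{table}")')]
        col_list = ",".join(f'"{c}"' for c in cols)
        placeholders = ",".join("?" for _ in cols)
        for row in src.execute(f'SELECT {col_list} FROM "{table}"'):
            dst_conn.execute(
                f'INSERT INTO "{table}" ({col_list}, data_source) '
                f'VALUES ({placeholders}, ?)',
                (*row, source_tag))    # e.g. "carved_Safari_History" or "found_Safari_History"
    finally:
        src.close()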

Manifest Files

Each reconstructed database includes a *_manifest.json documenting:

  • Source databases that contributed data
  • Intact rows vs. LF-recovered rows per source
  • Remnant tables (unmatched fragments)
  • Duplicates removed during deduplication
  • Table-level statistics (row counts per table)
Forensic Traceability
Every output filename includes the original filename, allowing investigators to trace recovered data back to its origins. The data_source column provides row-level provenance tracking.

Byte-Carving Pipeline

When all variant recovery methods fail (variant X), MARS falls back to byte-level carving. This extracts raw data directly from database pages without relying on SQLite's structural integrity.

Carving Strategy

The byte carver processes databases page-by-page, extracting forensic artifacts:

1. Timestamp Detection

Identifies multiple timestamp formats:

  • Unix epoch: Seconds, milliseconds, nanoseconds since 1970
  • Cocoa/Core Data: Seconds since 2001-01-01 (macOS/iOS)
  • Chrome: Microseconds since 1601-01-01
  • WebKit: Seconds since 2001-01-01
  • Windows FILETIME: 100ns ticks since 1601
  • ISO 8601: Text timestamps (2025-01-18T10:30:00Z)
2. URL Extraction & Analysis

Uses regex patterns to find URLs, then analyzes with Unfurl:

  • Parse query parameters and fragments
  • Extract embedded timestamps from URL structure
  • Detect UUIDs and session IDs in paths
  • Identify platform (Facebook, YouTube, Twitter, etc.)
3. Protobuf Decoding

Uses blackboxprotobuf for schema-agnostic decoding:

  • Detect protobuf magic patterns in BLOB fields
  • Decode without schema using type inference
  • Extract nested timestamps and strings
  • Generate JSON representation with inferred typedef
4. Text Extraction
  • Extract printable ASCII strings (minimum length filtering)
  • Preserve context (surrounding bytes for analysis)
  • Deduplicate close timestamps to reduce noise
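
A sketch of range-based classification for raw integers found in pages, covering the timestamp formats in item 1 above (epoch offsets are the standard constants; the 2015-2030 window mirrors the carver's default ts_start/ts_end filter; first plausible match wins in this simplified version):

from datetime import datetime, timezone

COCOA_EPOCH_OFFSET = 978307200       # seconds from 1970-01-01 to 2001-01-01
CHROME_EPOCH_OFFSET = 11644473600    # seconds from 1601-01-01 to 1970-01-01

def classify_timestamp(value: int,
                       lo=datetime(2015, 1, 1, tzinfo=timezone.utc),
                       hi=datetime(2030, 1, 1, tzinfo=timezone.utc)):
    """Return (format, datetime) if the integer is plausibly a timestamp, else None."""
    candidates = {
        "unix_seconds":        value,
        "unix_milliseconds":   value / 1_000,
        "unix_nanoseconds":    value / 1_000_000_000,
        "cocoa_seconds":       value + COCOA_EPOCH_OFFSET,
        "chrome_microseconds": value / 1_000_000 - CHROME_EPOCH_OFFSET,
    }
    for fmt, epoch_seconds in candidates.items():
        try:
            dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
        except (OverflowError, OSError, ValueError):
            continue                 # value out of representable range for this format
        if lo <= dt <= hi:
            return fmt, dt
    return None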

Carving Process Flow

Database File (Variant X)
        │
        ▼
┌───────────────────┐
│ Read Page-by-Page │ (4096-byte SQLite pages)
└───────────────────┘
        │
        ├─── Scan for numeric values ───→ Timestamp Classifier
        │                                  ├─ Unix epoch?
        │                                  ├─ Cocoa timestamp?
        │                                  ├─ Chrome time?
        │                                  └─ Valid range filter
        │
        ├─── Regex match URLs ──────────→ Unfurl Analyzer
        │                                  ├─ Parse structure
        │                                  ├─ Extract query params
        │                                  └─ Detect embedded timestamps
        │
        ├─── Scan for BLOB data ────────→ Protobuf Decoder
        │                                  ├─ blackboxprotobuf decode
        │                                  ├─ Extract nested data
        │                                  └─ Convert to JSON
        │
        └─── Extract text strings ──────→ Text Scanner
                                           ├─ Printable ASCII filter
                                           └─ Context preservation
        │
        ▼
┌──────────────────────────────────┐
│ Output Generation                │
│ ├─ timestamps.csv (optional)     │
│ ├─ carved.jsonl (detailed)       │
│ └─ carved.db (structured SQLite) │
└──────────────────────────────────┘

Integration with Unfurl

Unfurl provides context-aware URL analysis that helps distinguish real timestamps from ID values:

Example: Facebook URL Analysis
https://facebook.com/photo.php?fbid=123456789&id=987654321

Unfurl extracts:

  • Platform: facebook
  • Photo ID: 123456789 (NOT a timestamp, despite numeric format)
  • User ID: 987654321 (confirmed ID, not time)

This context prevents false positive timestamp classifications.

Integration with blackboxprotobuf

Schema-agnostic protobuf decoding recovers structured data without requiring .proto definitions:

Protobuf Decoding Example
Input: Binary BLOB (unknown structure)

blackboxprotobuf output:
{
  "message": {
    "1": "user@example.com",
    "2": 1705584600,
    "3": {
      "1": "https://example.com/path",
      "2": 42
    }
  },
  "typedef": {
    "1": {"type": "string"},
    "2": {"type": "int"},
    "3": {"type": "message", ...}
  }
}

Field "2" is identified as an integer and classified as a timestamp candidate by the timestamp detector.

Carving Output Formats

Format Use Case Content
timestamps.csv Timeline analysis All detected timestamps with format, confidence, and offset
carved.jsonl Detailed review Page-by-page extraction with URLs, protobuf, text, and context
carved.db Structured analysis SQLite database with tables for timestamps, URLs, and extracted data

dfVFS Integration

MARS uses Digital Forensics Virtual File System (dfVFS) for universal disk image access. This provides consistent file access across all forensic image formats and archive types.

Supported Formats

Category Formats
Disk Images E01, Ex01, DD, DMG, VMDK, VHD
Volumes GPT, APM, MBR, APFS containers
Filesystems APFS, HFS+, NTFS, ext4, FAT
Archives TAR, ZIP, GZIP, BZIP2

Glob Pattern Matching

MARS extends dfVFS with full globstar (**) support for flexible pattern matching:

Glob Pattern Examples
Pattern Matches
/Users/*/Library/Safari/History.db Safari history for any user
**/Library/Caches All cache directories (any depth)
/private/var/**/com.apple.*.db Apple databases in /private/var tree

Globstar Implementation

The ** wildcard matches zero or more directory levels:

  • ** at start: Matches from root (any depth prefix)
  • /**/ in middle: Matches zero or more intermediate directories
  • /** at end: Matches everything below current level
  • * alone: Matches exactly one directory level
Pattern:
**/Library/Caches

Matches:

  • /Library/Caches (zero segments before)
  • /Users/admin/Library/Caches (two segments before)
  • /System/Volumes/Data/Users/admin/Library/Caches (many segments)

Does NOT match:

  • /Users/admin/Library/Caches/Chrome (extends past pattern)
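
A compact sketch of segment-wise globstar matching (MARS's actual implementation walks dfVFS file entries rather than plain path strings; this only illustrates the ** semantics):

from fnmatch import fnmatch

def glob_match(pattern: str, path: str) -> bool:
    """Match a path against a pattern where '**' spans zero or more segments."""
    return _match(pattern.strip("/").split("/"), path.strip("/").split("/"))

def _match(pat, segs):
    if not pat:
        return not segs              # pattern exhausted: path must be exhausted too
    if pat[0] == "**":
        # '**' matches zero segments, or consumes one segment and tries again
        return _match(pat[1:], segs) or (bool(segs) and _match(pat, segs[1:]))
    if not segs:
        return False
    return fnmatch(segs[0], pat[0]) and _match(pat[1:], segs[1:])

# glob_match("**/Library/Caches", "/Users/admin/Library/Caches")         -> True
# glob_match("**/Library/Caches", "/Users/admin/Library/Caches/Chrome")  -> False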

Volume System Integration

dfVFS provides automatic volume enumeration and metadata extraction:

  • GPT Partitions: Partition name, GUID, size, type
  • APFS Containers: Volume names, encryption status, roles
  • Volume Labels: Filesystem labels from volume attributes

EWF/E01 Mount Utilities

MARS includes utilities for mounting forensic images for interactive exploration:

  • macOS: Requires Fuse-T for userspace filesystem mounting
  • Windows: Arsenal Image Mounter or similar tools
  • Linux: libewf + FUSE for native EWF support

Directory Filtering & Exclusion

To prevent hangs on directories with millions of cache files, MARS includes smart exclusions:

Skip Directories (Performance)
  • */*/Library/Caches/Google/Chrome/*/Code Cache
  • */*/Library/Caches/com.apple.Safari/fsCachedData
  • */*/Library/Caches/*/com.apple.metal
  • */*/Library/Caches/*/GPUCache

These directories contain hundreds of thousands of small cache files that slow down scans without forensic value.

Data Volume Variants
MARS automatically generates System/Volumes/Data/ pattern variants for paths starting with Users/, Library/, private/, or var/. This handles macOS Big Sur+ volume layout changes.

Configuration System

MARS uses a centralized configuration system defined in config/schema.py. All settings are organized into logical sections with dataclasses.

Configuration Sections

MatchingConfig

Controls database matching and confidence thresholds:

  • min_confidence: Minimum match confidence (default 0.7)
  • min_rows: Minimum rows required for valid match (default 10)
  • min_columns: Minimum columns for substantial match (default 3)
  • semantic_anchor_threshold: Minimum anchor score (default 2.0)
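
A sketch of what such a section might look like as a dataclass (field names and defaults follow the list above; the authoritative definitions live in config/schema.py):

from dataclasses import dataclass

@dataclass
class MatchingConfig:
    """Database matching thresholds (defaults mirror the values listed above)."""
    min_confidence: float = 0.7              # minimum match confidence
    min_rows: int = 10                       # minimum rows required for a valid match
    min_columns: int = 3                     # minimum columns for a substantial match
    semantic_anchor_threshold: float = 2.0   # minimum combined anchor score
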
SemanticAnchorWeights

Weights for pattern detection in matching:

  • uuid: 1.0 (UUID pattern weight)
  • timestamp_text: 0.9 (timestamp detection)
  • url: 0.8 (URL pattern)
  • email: 0.7 (email pattern)
  • uuid_in_pk: 2.0 (UUID in primary key - strong signal)
SchemaComparisonConfig

Tables and prefixes to ignore during schema comparison:

  • GLOBAL_IGNORABLE_TABLES: sqlite_sequence, sqlite_stat*, meta, z_metadata, etc.
  • ignorable_prefixes: sqlite_, sqlean_ (system tables)
  • ignorable_suffixes: _content, _segments, _segdir (FTS tables)
  • salvage_tables: lost_and_found, carved, recovered_rows
VariantSelectorConfig

Database variant selection behavior:

  • dissect_all: Attempt sqlite_dissect on all variants (default false)
CarverConfig

Byte-carving settings:

  • ts_start: Filter timestamps after this date (default 2015-01-01)
  • ts_end: Filter timestamps before this date (default 2030-01-01)
  • filter_mode: 'permissive', 'balanced', 'strict', or 'all'
  • decode_protobuf: Attempt protobuf decoding (default true)
  • csv_export: Generate CSV output (default false)
ExemplarScanConfig

Exemplar scanning parameters:

  • epoch_min: Minimum valid timestamp date (default 2000-01-01)
  • epoch_max: Maximum valid timestamp date (default 2038-01-19)
  • min_role_sample_size: Minimum rows for semantic role assignment (default 5)
  • enabled_catalog_groups: Database groups to include (empty = all)
  • excluded_file_types: File types to skip (e.g., 'cache', 'log')

Ignorable Tables

GLOBAL_IGNORABLE_TABLES defines tables that are always filtered during schema comparison:

Category Tables
SQLite System sqlite_sequence, sqlite_stat1/2/3/4, sqlite_master, sqlite_temp_master
Extensions sqlean_define, rtree_*, fts5_*, etc.
CoreData meta, dbinfo, z_primarykey, z_metadata, z_modelcache

User-Configurable Settings

Settings marked with user_configurable=True are exposed in the UI and saved to .marsproj files:

  • Basic: Debug mode, progress bars
  • Exemplar: Date ranges, catalog groups, excluded file types
  • Carver: Timestamp filtering, protobuf decoding, CSV export
  • Advanced: Variant selection, dissect options