Spaces:

MCP-1st-Birthday
/

Vault.MCP

Sleeping

File size: 50,896 Bytes

2510c5e

# Research: Multi-Tenant Obsidian-Like Docs Viewer

**Branch**: `001-obsidian-docs-viewer` | **Date**: 2025-11-15 | **Plan**: [plan.md](./plan.md)

## Overview

This document captures technical research and decisions for the implementation of a multi-tenant Obsidian-like documentation viewer. Each section addresses a specific research topic from Phase 0 of the implementation plan.

---

## 1. FastMCP HTTP Transport Authentication (Bearer Token)

### Decision

Use FastMCP's built-in `BearerAuth` mechanism with JWT token validation for HTTP transport authentication.

**Implementation approach**:
- Server: Configure FastMCP HTTP transport to accept `Authorization: Bearer <token>` header
- Client: Pass JWT token as string to `auth` parameter (FastMCP adds "Bearer" prefix automatically)
- Token format: JWT with claims `sub=user_id`, `exp=now+90days`, signed with `HS256` and server secret

### Rationale

1. **Native FastMCP support**: FastMCP provides first-class Bearer token authentication via `BearerAuth` class and string token shortcuts
2. **Minimal configuration**: Client code is as simple as `Client("https://...", auth="<token>")`
3. **Standard compliance**: Uses industry-standard `Authorization: Bearer` header pattern
4. **Transport flexibility**: Works seamlessly with both HTTP and SSE (Server-Sent Events) transports
5. **Non-interactive workflow**: Perfect for AI agents and service accounts that need programmatic access

### Alternatives Considered

**Alternative 1: Custom header authentication**
- **Rejected**: FastMCP supports custom headers but requires manual implementation of auth logic
- **Why rejected**: More complex, loses benefit of FastMCP's built-in token handling and validation

**Alternative 2: OAuth flow for MCP clients**
- **Rejected**: FastMCP supports full OAuth 2.1 flows with browser-based authentication
- **Why rejected**: Overly complex for AI agent use case; requires interactive browser flow which doesn't suit MCP STDIO or programmatic access patterns

**Alternative 3: API key-based authentication**
- **Rejected**: Could use simple API keys instead of JWTs
- **Why rejected**: JWTs provide expiration, claims, and stateless validation; better security posture for multi-tenant system

### Implementation Notes

**Server-side setup**:
```python
from fastmcp import FastMCP
from fastmcp.server.auth import BearerAuthProvider
import jwt

# For token validation (if using external issuer)
auth_provider = BearerAuthProvider(
    public_key="<RSA_PUBLIC_KEY>",
    issuer="https://your-issuer.com",
    audience="your-api"
)

# For internal JWT validation (our use case)
# Validate manually in middleware/dependency injection
def validate_jwt(token: str) -> str:
    payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
    return payload["sub"]  # user_id
```

**Client-side setup**:
```python
from fastmcp import Client

# Simplest approach - pass token as string
async with Client(
    "https://fastmcp.cloud/mcp",
    auth="eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."
) as client:
    await client.call_tool("list_notes", {})

# Explicit approach - use BearerAuth class
from fastmcp.client.auth import BearerAuth

async with Client(
    "https://fastmcp.cloud/mcp",
    auth=BearerAuth(token="eyJhbGci...")
) as client:
    await client.call_tool("list_notes", {})
```

**Key points**:
- Do NOT include "Bearer" prefix when passing token - FastMCP adds it automatically
- Token validation happens on every MCP tool call via HTTP transport
- STDIO transport bypasses authentication (local development only)
- For HF Space deployment, combine with HF OAuth to issue user-specific JWTs

**References**:
- FastMCP Bearer Auth docs: https://gofastmcp.com/clients/auth/bearer
- FastMCP authentication patterns: https://gyliu513.github.io/jekyll/update/2025/08/12/fastmcp-auth-patterns.html

---

## 2. Hugging Face Space OAuth Integration

### Decision

Use `huggingface_hub` library's built-in OAuth helpers (`attach_huggingface_oauth`, `parse_huggingface_oauth`) for zero-configuration OAuth integration in HF Spaces.

**Implementation approach**:
- Add `hf_oauth: true` to Space metadata in README.md
- Call `attach_huggingface_oauth(app)` to auto-register OAuth endpoints (`/oauth/huggingface/login`, `/oauth/huggingface/logout`, `/oauth/huggingface/callback`)
- Call `parse_huggingface_oauth(request)` in route handlers to extract authenticated user info
- Map HF username/ID to internal `user_id` for vault scoping

### Rationale

1. **Zero-configuration**: HF Spaces automatically injects OAuth environment variables (`OAUTH_CLIENT_ID`, `OAUTH_CLIENT_SECRET`, `OAUTH_SCOPES`) when `hf_oauth: true` is set
2. **Local dev friendly**: `parse_huggingface_oauth` returns mock user in local mode, enabling seamless development without OAuth setup
3. **Minimal code**: Two function calls provide complete OAuth flow (login redirect, callback handling, session management)
4. **First-class support**: Official HF library with guaranteed compatibility with Spaces platform
5. **Standard OAuth 2.0**: Under the hood, implements industry-standard OAuth with PKCE

### Alternatives Considered

**Alternative 1: Manual OAuth implementation**
- **Rejected**: Implement OAuth flow manually using `authlib` or `requests-oauthlib`
- **Why rejected**: Significantly more code, requires manual handling of PKCE, state validation, and token exchange; error-prone and loses HF Spaces auto-configuration

**Alternative 2: Third-party auth provider (Auth0, WorkOS)**
- **Rejected**: Use external auth service and connect HF as identity provider
- **Why rejected**: Adds unnecessary complexity and external dependencies for a system designed specifically for HF Spaces deployment

**Alternative 3: Session-based auth without OAuth**
- **Rejected**: Use simple username/password with cookie sessions
- **Why rejected**: Poor UX (users already have HF accounts), requires password management, doesn't leverage HF ecosystem integration

### Implementation Notes

**Space configuration** (README.md frontmatter):
```yaml
---
title: Obsidian Docs Viewer
emoji: 📚
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
hf_oauth: true  # <-- Enable OAuth
---
```

**Backend integration** (FastAPI):
```python
from fastapi import FastAPI, Request
from huggingface_hub import attach_huggingface_oauth, parse_huggingface_oauth

app = FastAPI()

# Auto-register OAuth endpoints
attach_huggingface_oauth(app)

@app.get("/")
def index(request: Request):
    oauth_info = parse_huggingface_oauth(request)

    if oauth_info is None:
        return {"message": "Not logged in", "login_url": "/oauth/huggingface/login"}

    # Extract user info
    user_id = oauth_info.user_info.preferred_username  # or use 'sub' for UUID
    display_name = oauth_info.user_info.name
    avatar = oauth_info.user_info.picture

    return {
        "user_id": user_id,
        "display_name": display_name,
        "avatar": avatar
    }

@app.get("/api/me")
def get_current_user(request: Request):
    oauth_info = parse_huggingface_oauth(request)
    if oauth_info is None:
        raise HTTPException(status_code=401, detail="Not authenticated")

    # Map to internal user model
    user_id = oauth_info.user_info.preferred_username

    # Initialize vault on first login if needed
    vault_service.ensure_vault_exists(user_id)

    return {
        "user_id": user_id,
        "hf_profile": {
            "username": oauth_info.user_info.preferred_username,
            "name": oauth_info.user_info.name,
            "avatar": oauth_info.user_info.picture
        }
    }
```

**Frontend integration** (React):
```typescript
// Check auth status on app load
useEffect(() => {
  fetch('/api/me')
    .then(res => {
      if (res.ok) return res.json();
      throw new Error('Not authenticated');
    })
    .then(user => setCurrentUser(user))
    .catch(() => window.location.href = '/oauth/huggingface/login');
}, []);
```

**Key points**:
- `attach_huggingface_oauth` must be called BEFORE defining routes that need auth
- `parse_huggingface_oauth` returns `None` if not authenticated (check before accessing user_info)
- In local development, returns mocked user with deterministic username (e.g., "local-user")
- OAuth tokens/sessions are managed by `huggingface_hub` (stored in cookies)
- For API/MCP access, issue separate JWT after OAuth login via `POST /api/tokens`

**Environment variables** (auto-injected in HF Space):
- `OAUTH_CLIENT_ID`: Public client identifier
- `OAUTH_CLIENT_SECRET`: Secret for token exchange
- `OAUTH_SCOPES`: Space-specific scopes (typically `openid profile`)

**References**:
- HF OAuth docs: https://huggingface.co/docs/hub/spaces-oauth
- huggingface_hub API: https://huggingface.co/docs/huggingface_hub/en/package_reference/oauth

---

## 3. SQLite Schema Design for Multi-Index Storage

### Decision

Use SQLite with FTS5 (Full-Text Search 5) for full-text indexing, plus separate regular tables for tags and link graph. Implement per-user isolation via `user_id` column in all tables.

**Schema approach**:
- **Full-text index**: FTS5 virtual table with `title` and `body` columns, using `porter` tokenizer for stemming
- **Tag index**: Regular table with `user_id`, `tag`, `note_path` (many-to-many relationship)
- **Link graph**: Regular table with `user_id`, `source_path`, `target_path`, `link_text`, `is_resolved`
- **Metadata index**: Regular table with `user_id`, `note_path`, `version`, `created`, `updated`, `title` for fast lookups
- **Index health**: Regular table with `user_id`, `note_count`, `last_full_rebuild`, `last_incremental_update`

### Rationale

1. **FTS5 performance**: Native full-text search with inverted index, sub-100ms query times for thousands of documents
2. **Separate concerns**: Full-text (FTS5), tags (simple lookup), and links (graph traversal) have different query patterns; separate tables optimize each
3. **Per-user isolation**: `user_id` column in all tables enables simple WHERE filtering without complex row-level security
4. **External content pattern**: FTS5 with `content=''` (contentless) avoids storing document text twice (already in filesystem)
5. **Version tracking**: Metadata table stores version counter for optimistic concurrency without polluting frontmatter
6. **Prefix indexes**: FTS5 `prefix='2 3'` option enables fast autocomplete/prefix search

### Alternatives Considered

**Alternative 1: Single FTS5 table for everything**
- **Rejected**: Store tags and links as UNINDEXED columns in FTS5 table
- **Why rejected**: FTS5 is optimized for full-text, not structured data; complex queries (e.g., "all notes with tag X") would require scanning all rows; tags/links don't benefit from tokenization

**Alternative 2: Separate SQLite database per user**
- **Rejected**: One `.db` file per user instead of `user_id` column
- **Why rejected**: Increases file I/O overhead, complicates connection pooling, harder to implement global admin queries (e.g., total user count)

**Alternative 3: PostgreSQL with pg_trgm or RUM indexes**
- **Rejected**: Use full Postgres instead of SQLite
- **Why rejected**: Overkill for single-server deployment, adds deployment complexity, SQLite is sufficient for target scale (5,000 notes/user, 10 concurrent users)

**Alternative 4: In-memory index only**
- **Rejected**: Build inverted index in Python dict, no persistence
- **Why rejected**: Slow startup (rebuild on every restart), no durability, doesn't scale beyond single process

### Implementation Notes

**Schema definition**:
```sql
-- Metadata index (fast lookups, version tracking)
CREATE TABLE IF NOT EXISTS note_metadata (
    user_id TEXT NOT NULL,
    note_path TEXT NOT NULL,
    version INTEGER NOT NULL DEFAULT 1,
    title TEXT NOT NULL,
    created TEXT NOT NULL,  -- ISO 8601 timestamp
    updated TEXT NOT NULL,  -- ISO 8601 timestamp
    PRIMARY KEY (user_id, note_path)
);
CREATE INDEX idx_metadata_user ON note_metadata(user_id);
CREATE INDEX idx_metadata_updated ON note_metadata(user_id, updated DESC);

-- Full-text search index (FTS5, contentless)
CREATE VIRTUAL TABLE IF NOT EXISTS note_fts USING fts5(
    user_id UNINDEXED,
    note_path UNINDEXED,
    title,
    body,
    content='',  -- Contentless (we don't store the actual text)
    tokenize='porter unicode61',  -- Stemming + Unicode support
    prefix='2 3'  -- Prefix indexes for autocomplete
);

-- Tag index (many-to-many: notes <-> tags)
CREATE TABLE IF NOT EXISTS note_tags (
    user_id TEXT NOT NULL,
    note_path TEXT NOT NULL,
    tag TEXT NOT NULL,
    PRIMARY KEY (user_id, note_path, tag)
);
CREATE INDEX idx_tags_user_tag ON note_tags(user_id, tag);
CREATE INDEX idx_tags_user_path ON note_tags(user_id, note_path);

-- Link graph (outgoing links from notes)
CREATE TABLE IF NOT EXISTS note_links (
    user_id TEXT NOT NULL,
    source_path TEXT NOT NULL,
    target_path TEXT,  -- NULL if unresolved
    link_text TEXT NOT NULL,  -- Original [[link text]]
    is_resolved INTEGER NOT NULL DEFAULT 0,  -- Boolean: 0=broken, 1=resolved
    PRIMARY KEY (user_id, source_path, link_text)
);
CREATE INDEX idx_links_user_source ON note_links(user_id, source_path);
CREATE INDEX idx_links_user_target ON note_links(user_id, target_path);
CREATE INDEX idx_links_unresolved ON note_links(user_id, is_resolved);

-- Index health tracking
CREATE TABLE IF NOT EXISTS index_health (
    user_id TEXT PRIMARY KEY,
    note_count INTEGER NOT NULL DEFAULT 0,
    last_full_rebuild TEXT,  -- ISO 8601 timestamp
    last_incremental_update TEXT  -- ISO 8601 timestamp
);
```

**Query patterns**:
```python
# Full-text search with ranking
cursor.execute("""
    SELECT
        note_path,
        title,
        bm25(note_fts, 3.0, 1.0) AS rank  -- Title weight=3, body weight=1
    FROM note_fts
    WHERE user_id = ? AND note_fts MATCH ?
    ORDER BY rank DESC
    LIMIT 50
""", (user_id, query))

# Get all notes with a specific tag
cursor.execute("""
    SELECT DISTINCT note_path, title
    FROM note_tags t
    JOIN note_metadata m USING (user_id, note_path)
    WHERE t.user_id = ? AND t.tag = ?
    ORDER BY m.updated DESC
""", (user_id, tag))

# Get backlinks for a note
cursor.execute("""
    SELECT DISTINCT l.source_path, m.title
    FROM note_links l
    JOIN note_metadata m ON l.user_id = m.user_id AND l.source_path = m.note_path
    WHERE l.user_id = ? AND l.target_path = ?
    ORDER BY m.updated DESC
""", (user_id, target_path))

# Get all unresolved links for UI display
cursor.execute("""
    SELECT source_path, link_text
    FROM note_links
    WHERE user_id = ? AND is_resolved = 0
""", (user_id,))
```

**Incremental update strategy**:
1. On `write_note`: Delete all existing rows for `(user_id, note_path)`, then insert new rows
2. Use transactions to ensure atomicity (delete old + insert new = single atomic operation)
3. Update `index_health.last_incremental_update` on every write

**Full rebuild strategy**:
1. Delete all index rows for `user_id`
2. Scan all `.md` files in vault directory
3. Parse each file and insert into all indexes
4. Update `index_health.note_count` and `last_full_rebuild`

**Key points**:
- FTS5 with `content=''` is contentless - we must manually INSERT/DELETE rows (no automatic synchronization)
- Use `porter` tokenizer for English stemming (search "running" matches "run")
- `bm25()` function provides relevance ranking (better than simple MATCH count)
- Prefix indexes (`prefix='2 3'`) enable fast `MATCH 'prefix*'` queries
- `UNINDEXED` columns in FTS5 are retrievable but not searchable (good for IDs)

**References**:
- SQLite FTS5 docs: https://www.sqlite.org/fts5.html
- FTS5 structure deep dive: https://darksi.de/13.sqlite-fts5-structure/

---

## 4. Wikilink Normalization and Resolution

### Decision

Implement case-insensitive normalized slug matching with deterministic ambiguity resolution based on Obsidian's behavior.

**Normalization algorithm**:
1. Extract link text from `[[link text]]`
2. Normalize: lowercase, replace spaces/hyphens/underscores with single dash, remove non-alphanumeric except dashes
3. Match normalized slug against normalized filename stems AND normalized frontmatter titles
4. If multiple matches: prefer same-folder match, then lexicographically smallest path

**Slug normalization function**:
```python
import re

def normalize_slug(text: str) -> str:
    """Normalize text to slug for case-insensitive matching."""
    text = text.lower()
    text = re.sub(r'[\s_]+', '-', text)  # Spaces/underscores → dash
    text = re.sub(r'[^a-z0-9-]', '', text)  # Keep only alphanumeric + dash
    text = re.sub(r'-+', '-', text)  # Collapse multiple dashes
    return text.strip('-')
```

### Rationale

1. **Obsidian compatibility**: Matches Obsidian's link resolution behavior (case-insensitive, flexible matching)
2. **User-friendly**: Users don't need to remember exact case or spacing (e.g., `[[API Design]]` matches `api-design.md`)
3. **Deterministic**: Same-folder preference + lexicographic tiebreaker ensures consistent resolution
4. **Efficient indexing**: Normalized slugs can be pre-computed and indexed for O(1) lookup
5. **Graceful fallback**: Broken links are tracked and displayed distinctly in UI

### Alternatives Considered

**Alternative 1: Exact case-sensitive matching**
- **Rejected**: Require `[[exact-filename]]` to match `exact-filename.md`
- **Why rejected**: Brittle user experience, doesn't match Obsidian behavior, forces users to remember exact capitalization

**Alternative 2: Fuzzy matching (Levenshtein distance)**
- **Rejected**: Use string similarity to find "close enough" matches
- **Why rejected**: Non-deterministic, slower, can match wrong notes ("Setup" matches "Startup"), confusing UX

**Alternative 3: Path-based links only**
- **Rejected**: Require full paths like `[[guides/setup]]` instead of `[[Setup]]`
- **Why rejected**: Verbose, doesn't match Obsidian's short-link paradigm, poor UX for large vaults

**Alternative 4: UUID-based links**
- **Rejected**: Use unique IDs like `[[#uuid-123]]` for stable references
- **Why rejected**: Not human-readable, requires additional metadata, doesn't match Obsidian convention

### Implementation Notes

**Resolution algorithm** (priority order):
```python
def resolve_wikilink(user_id: str, link_text: str, current_note_folder: str) -> str | None:
    """Resolve wikilink to note path, or None if unresolved."""
    normalized = normalize_slug(link_text)

    # Build candidate index: normalized_slug -> [note_paths]
    candidates = defaultdict(list)

    # Scan all notes for this user
    for note in list_all_notes(user_id):
        # Match against filename stem
        stem = Path(note.path).stem
        if normalize_slug(stem) == normalized:
            candidates[note.path].append(note.path)

        # Match against frontmatter title
        if note.title and normalize_slug(note.title) == normalized:
            candidates[note.path].append(note.path)

    if not candidates:
        return None  # Unresolved link

    paths = list(set(candidates.keys()))  # Deduplicate

    if len(paths) == 1:
        return paths[0]  # Unique match

    # Ambiguity resolution
    # 1. Prefer same-folder match
    same_folder = [p for p in paths if Path(p).parent == current_note_folder]
    if same_folder:
        return sorted(same_folder)[0]  # Lexicographic tiebreaker

    # 2. Lexicographically smallest path
    return sorted(paths)[0]
```

**Index optimization**:
Pre-compute normalized slugs for all notes and store in `note_metadata` table:
```sql
ALTER TABLE note_metadata ADD COLUMN normalized_title_slug TEXT;
ALTER TABLE note_metadata ADD COLUMN normalized_path_slug TEXT;
CREATE INDEX idx_metadata_title_slug ON note_metadata(user_id, normalized_title_slug);
CREATE INDEX idx_metadata_path_slug ON note_metadata(user_id, normalized_path_slug);
```

**Link extraction from Markdown**:
```python
import re

def extract_wikilinks(markdown_body: str) -> list[str]:
    """Extract all wikilink texts from markdown body."""
    pattern = r'\[\[([^\]]+)\]\]'
    return re.findall(pattern, markdown_body)
```

**Update link graph on write**:
```python
def update_link_graph(user_id: str, note_path: str, body: str):
    """Update outgoing links and backlinks for a note."""
    current_folder = str(Path(note_path).parent)

    # Extract wikilinks from body
    link_texts = extract_wikilinks(body)

    # Delete old links from this note
    db.execute("DELETE FROM note_links WHERE user_id=? AND source_path=?",
               (user_id, note_path))

    # Insert new links
    for link_text in link_texts:
        target_path = resolve_wikilink(user_id, link_text, current_folder)
        is_resolved = 1 if target_path else 0

        db.execute("""
            INSERT INTO note_links (user_id, source_path, target_path, link_text, is_resolved)
            VALUES (?, ?, ?, ?, ?)
        """, (user_id, note_path, target_path, link_text, is_resolved))
```

**UI rendering**:
```typescript
// Transform wikilinks to clickable links in rendered Markdown
function transformWikilinks(markdown: string, linkIndex: Record<string, string>): string {
  return markdown.replace(/\[\[([^\]]+)\]\]/g, (match, linkText) => {
    const targetPath = linkIndex[linkText];

    if (targetPath) {
      // Resolved link
      return `<a href="#/note/${encodeURIComponent(targetPath)}" class="wikilink">${linkText}</a>`;
    } else {
      // Broken link
      return `<a href="#/create/${encodeURIComponent(linkText)}" class="wikilink broken">${linkText}</a>`;
    }
  });
}
```

**Key points**:
- Pre-compute and cache slug mappings for performance (avoid re-scanning on every link resolution)
- Same-folder preference matches Obsidian's behavior (local references are intuitive)
- Lexicographic tiebreaker ensures determinism (same input always resolves to same output)
- Track `is_resolved` flag to identify broken links for UI warnings/affordances
- Update entire link graph on every note write (incremental update, not rebuild)

**Edge cases**:
- Empty link text `[[]]` - ignore/skip
- Nested brackets `[[foo [[bar]]]]` - naive regex fails; use proper parser or limit to non-nested pattern
- Link with pipe `[[link|display]]` - out of scope for MVP; treat entire string as link text

---

## 5. React + shadcn/ui Directory Tree Component

### Decision

Use **shadcn-extension Tree View** component with built-in virtualization via `@tanstack/react-virtual` for directory tree rendering.

**Component choice**: `shadcn-extension` Tree View
- **Installation**: Available at https://shadcn-extension.vercel.app/docs/tree-view
- **Features**: Virtualization, accordion-based expand/collapse, keyboard navigation, selection, custom icons
- **Why this one**: Only shadcn tree component with native virtualization support; critical for large vaults (5,000 notes)

### Rationale

1. **Virtualization required**: 5,000 notes would create 5,000+ DOM nodes without virtualization; TanStack Virtual renders only visible rows (~20-50 nodes)
2. **Performance**: Virtualization reduces initial render from ~2s to <100ms for large trees
3. **shadcn ecosystem**: Consistent styling with other shadcn/ui components (Button, ScrollArea, etc.)
4. **Accessibility**: Built on Radix UI primitives with keyboard navigation and ARIA support
5. **Customizable**: Supports custom icons per node, expand/collapse callbacks, and selection handling

### Alternatives Considered

**Alternative 1: MrLightful's shadcn Tree View**
- **Rejected**: Feature-rich component with drag-and-drop, custom icons
- **Why rejected**: No virtualization support; would cause performance issues with 1,000+ notes

**Alternative 2: Neigebaie's shadcn Tree View**
- **Rejected**: Advanced features (multi-select, checkboxes, context menus)
- **Why rejected**: No virtualization; overkill for simple directory browsing

**Alternative 3: react-arborist**
- **Rejected**: Powerful tree view library with virtualization and drag-and-drop
- **Why rejected**: Not part of shadcn ecosystem; requires custom styling to match UI; heavier dependency

**Alternative 4: Custom implementation with react-window**
- **Rejected**: Build tree view from scratch using `react-window` or `react-virtual`
- **Why rejected**: Significant development effort; reinventing the wheel; shadcn-extension already provides this

### Implementation Notes

**Installation**:
```bash
npx shadcn add https://shadcn-extension.vercel.app/registry/tree-view.json
```

**Component usage**:
```tsx
import { Tree, TreeNode } from "@/components/ui/tree-view";

interface FileTreeNode {
  id: string;
  name: string;
  path: string;
  isFolder: boolean;
  children?: FileTreeNode[];
}

function DirectoryTree({ vault, onSelectNote }: Props) {
  // Transform vault notes into tree structure
  const treeData = useMemo(() => buildTree(vault.notes), [vault.notes]);

  return (
    <Tree
      data={treeData}
      onSelectChange={(nodeId) => {
        const node = findNode(treeData, nodeId);
        if (!node.isFolder) {
          onSelectNote(node.path);
        }
      }}
      // Virtualization is enabled by default
      className="w-full h-full"
    />
  );
}

// Transform flat list of note paths into hierarchical tree
function buildTree(notes: Note[]): TreeNode[] {
  const root: Map<string, TreeNode> = new Map();

  for (const note of notes) {
    const parts = note.path.split('/');
    let currentLevel = root;

    for (let i = 0; i < parts.length; i++) {
      const part = parts[i];
      const isFile = i === parts.length - 1;
      const id = parts.slice(0, i + 1).join('/');

      if (!currentLevel.has(part)) {
        currentLevel.set(part, {
          id,
          name: isFile ? note.title : part,
          path: id,
          isFolder: !isFile,
          children: isFile ? undefined : new Map()
        });
      }

      if (!isFile) {
        currentLevel = currentLevel.get(part)!.children!;
      }
    }
  }

  return Array.from(root.values());
}
```

**Styling for Obsidian-like appearance**:
```css
/* Custom styles for file tree */
.tree-view-node {
  @apply py-1 px-2 rounded hover:bg-accent transition-colors;
}

.tree-view-node.selected {
  @apply bg-accent text-accent-foreground font-medium;
}

.tree-view-folder {
  @apply flex items-center gap-2;
}

.tree-view-file {
  @apply flex items-center gap-2 text-sm;
}

/* Icons */
.folder-icon {
  @apply text-yellow-500;
}

.file-icon {
  @apply text-gray-500;
}
```

**Collapsible behavior**:
```tsx
// Track expanded folders in state
const [expanded, setExpanded] = useState<Set<string>>(new Set(['root']));

<Tree
  data={treeData}
  expanded={expanded}
  onExpandedChange={setExpanded}
  // Auto-expand to selected note's folder
  onSelectChange={(nodeId) => {
    const path = nodeId.split('/');
    const folders = path.slice(0, -1);
    setExpanded(new Set([...expanded, ...folders]));
  }}
/>
```

**Key points**:
- Virtualization is automatic with shadcn-extension Tree View (uses TanStack Virtual internally)
- Must transform flat note list into nested tree structure (use `buildTree` utility)
- Track expanded/collapsed state separately from tree data
- Custom icons per node type (folder vs file) via `icon` prop
- Use `ScrollArea` component from shadcn to wrap tree for custom scrollbars

**Performance targets**:
- Initial render: <200ms for 5,000 notes
- Expand/collapse: <50ms per folder
- Search filter: <100ms to re-render filtered tree

**Accessibility**:
- Keyboard navigation: Arrow keys to navigate, Enter to select, Space to expand/collapse
- Screen reader support: ARIA labels for folders/files, expand/collapse state
- Focus management: Visible focus indicators, focus restoration after selection

---

## 6. Optimistic Concurrency Implementation

### Decision

Use **version counter** (integer) stored in SQLite index with `if_version` parameter for UI writes. Implement **ETag-like validation** via `If-Match` header in HTTP API.

**Approach**:
- Version counter: Integer field in `note_metadata` table, incremented on every write
- UI writes: Include `if_version: N` in `PUT /api/notes/{path}` body
- Server validation: Compare `if_version` with current version; return `409 Conflict` if mismatch
- MCP writes: No version checking (last-write-wins)
- ETag header: Return `ETag: "<version>"` in `GET /api/notes/{path}` response for HTTP compliance

### Rationale

1. **Simple implementation**: Integer counter is trivial to increment and compare
2. **Explicit versioning**: Version in request body makes intent clear ("I'm updating version 5")
3. **Database-backed**: Version persists in index, not frontmatter (keeps note content clean)
4. **HTTP-friendly**: Can expose as ETag header for standards compliance
5. **Performance**: Integer comparison is O(1), no hash computation needed

### Alternatives Considered

**Alternative 1: ETag with content hash**
- **Rejected**: Compute MD5/SHA hash of note content, return as ETag header
- **Why rejected**: Hash computation on every read adds latency; version counter is sufficient and faster

**Alternative 2: Last-Modified timestamps**
- **Rejected**: Use `updated` timestamp + `If-Unmodified-Since` header
- **Why rejected**: Timestamp precision issues (SQLite stores ISO strings, not microsecond precision); race conditions if multiple updates within same second

**Alternative 3: Version in frontmatter**
- **Rejected**: Store `version: 5` in YAML frontmatter
- **Why rejected**: Pollutes user-facing metadata; incrementing version requires parsing/re-serializing frontmatter; harder to manage

**Alternative 4: MVCC (Multi-Version Concurrency Control)**
- **Rejected**: Store multiple versions of each note, allow rollback
- **Why rejected**: Complex implementation; storage overhead; out of scope for MVP (no version history requirement)

### Implementation Notes

**Schema addition**:
```sql
-- Version counter in note_metadata table
ALTER TABLE note_metadata ADD COLUMN version INTEGER NOT NULL DEFAULT 1;
```

**API endpoint implementation**:
```python
from fastapi import HTTPException, Header
from typing import Optional

@app.put("/api/notes/{path}")
async def update_note(
    path: str,
    body: NoteUpdateRequest,
    user_id: str = Depends(get_current_user),
    if_match: Optional[str] = Header(None)  # ETag header support
):
    # Get current version
    current = get_note_metadata(user_id, path)

    # Check if_version in body OR If-Match header
    expected_version = body.if_version or (int(if_match.strip('"')) if if_match else None)

    if expected_version is not None and current.version != expected_version:
        raise HTTPException(
            status_code=409,
            detail={
                "error": "version_conflict",
                "message": "Note was updated by another process",
                "current_version": current.version,
                "provided_version": expected_version
            }
        )

    # Update note and increment version
    new_version = current.version + 1
    save_note(user_id, path, body.content)
    update_metadata(user_id, path, version=new_version, updated=now())

    return {
        "status": "ok",
        "version": new_version
    }

@app.get("/api/notes/{path}")
async def get_note(
    path: str,
    user_id: str = Depends(get_current_user)
):
    note = load_note(user_id, path)

    return JSONResponse(
        content={
            "path": note.path,
            "title": note.title,
            "metadata": note.metadata,
            "body": note.body,
            "version": note.version,
            "created": note.created,
            "updated": note.updated
        },
        headers={
            "ETag": f'"{note.version}"',  # Expose version as ETag
            "Cache-Control": "no-cache"   # Prevent stale reads
        }
    )
```

**Frontend implementation** (React):
```typescript
interface Note {
  path: string;
  title: string;
  body: string;
  version: number;
  // ...
}

async function saveNote(note: Note, newBody: string) {
  try {
    const response = await fetch(`/api/notes/${encodeURIComponent(note.path)}`, {
      method: 'PUT',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${token}`,
        // Option 1: Version in body
      },
      body: JSON.stringify({
        body: newBody,
        if_version: note.version  // Optimistic concurrency check
      })
    });

    if (response.status === 409) {
      const error = await response.json();
      alert(`Conflict: Note was updated (current version: ${error.current_version}). Please reload and try again.`);
      return;
    }

    const updated = await response.json();
    // Update local state with new version
    setNote({ ...note, body: newBody, version: updated.version });

  } catch (error) {
    console.error('Save failed:', error);
  }
}
```

**MCP tool implementation** (last-write-wins):
```python
@mcp.tool()
async def write_note(path: str, body: str, title: str = None) -> dict:
    """Write note via MCP (no version checking)."""
    user_id = get_user_from_context()

    # Load existing note to get current version (if exists)
    try:
        current = get_note_metadata(user_id, path)
        new_version = current.version + 1
    except NotFoundError:
        new_version = 1  # New note

    # Write without version check (last-write-wins)
    save_note(user_id, path, body, title)
    update_metadata(user_id, path, version=new_version, updated=now())

    return {"status": "ok", "path": path, "version": new_version}
```

**Conflict resolution UI**:
```tsx
function ConflictDialog({ currentVersion, serverVersion }: Props) {
  return (
    <Alert variant="destructive">
      <AlertTitle>Version Conflict</AlertTitle>
      <AlertDescription>
        This note was updated while you were editing (version {currentVersion} → {serverVersion}).
        <div className="mt-4 space-x-2">
          <Button onClick={reload}>Reload and Discard Changes</Button>
          <Button variant="outline" onClick={saveAsCopy}>Save as Copy</Button>
        </div>
      </AlertDescription>
    </Alert>
  );
}
```

**Key points**:
- Version counter starts at 1 for new notes, increments on every write
- HTTP API returns `409 Conflict` with detailed error message (current vs provided version)
- ETag header is optional but recommended for HTTP standards compliance
- MCP writes skip version check (AI agents don't need optimistic concurrency)
- Frontend shows clear error message with options: reload, save as copy, or manual merge

**Performance considerations**:
- Version check is single integer comparison (O(1))
- No need to read entire note content for validation
- Version update is atomic (SQLite transaction)

**References**:
- Optimistic concurrency patterns: https://event-driven.io/en/how_to_use_etag_header_for_optimistic_concurrency/
- HTTP conditional requests: https://developer.mozilla.org/en-US/docs/Web/HTTP/Conditional_requests

---

## 7. Markdown Frontmatter Parsing with Fallback

### Decision

Use `python-frontmatter` library for YAML parsing with try-except wrapper to handle malformed frontmatter gracefully. Implement fallback strategy: malformed YAML → treat as no frontmatter, extract title from first `# Heading` or filename stem.

**Parsing approach**:
```python
import frontmatter
from pathlib import Path

def parse_note(file_path: str) -> dict:
    """Parse note with frontmatter fallback."""
    try:
        # Attempt to parse frontmatter
        post = frontmatter.load(file_path)
        metadata = dict(post.metadata)
        body = post.content

    except (yaml.scanner.ScannerError, yaml.parser.ParserError) as e:
        # Malformed YAML - treat entire file as body
        with open(file_path, 'r', encoding='utf-8') as f:
            full_text = f.read()

        metadata = {}
        body = full_text

        # Log warning for debugging
        logger.warning(f"Malformed frontmatter in {file_path}: {e}")

    # Extract title (priority: frontmatter > first H1 > filename)
    title = (
        metadata.get('title') or
        extract_first_heading(body) or
        Path(file_path).stem
    )

    return {
        'title': title,
        'metadata': metadata,
        'body': body
    }

def extract_first_heading(markdown: str) -> str | None:
    """Extract first # Heading from markdown body."""
    match = re.match(r'^#\s+(.+)$', markdown, re.MULTILINE)
    return match.group(1).strip() if match else None
```

### Rationale

1. **Graceful degradation**: Malformed YAML doesn't break the system; note is still readable
2. **User-friendly**: Non-technical users may create invalid YAML; system should be forgiving
3. **Simple implementation**: Try-except wrapper is minimal code; `python-frontmatter` handles valid cases
4. **Fallback chain**: Title extraction has clear priority order (explicit > inferred > default)
5. **Debugging support**: Log warnings for malformed YAML so admins can fix source files

### Alternatives Considered

**Alternative 1: Strict parsing (fail on malformed YAML)**
- **Rejected**: Raise error and reject note with invalid frontmatter
- **Why rejected**: Poor UX; users may accidentally create invalid YAML (e.g., unquoted colons); breaks read-first workflow

**Alternative 2: TOML or JSON frontmatter**
- **Rejected**: Use `+++` TOML or `{{{ }}}` JSON delimiters instead of YAML
- **Why rejected**: Obsidian uses YAML exclusively; compatibility is critical

**Alternative 3: Lenient YAML parser**
- **Rejected**: Use `ruamel.yaml` with error recovery instead of PyYAML
- **Why rejected**: Adds complexity; `python-frontmatter` uses PyYAML internally; fallback strategy is simpler

**Alternative 4: Partial frontmatter extraction**
- **Rejected**: Parse valid keys, ignore malformed keys
- **Why rejected**: Difficult to implement; unclear semantics (which keys are valid?); safer to treat all as invalid

### Implementation Notes

**Error types to catch**:
```python
import yaml

try:
    post = frontmatter.load(file_path)
except (
    yaml.scanner.ScannerError,  # Invalid YAML syntax (e.g., unmatched quotes)
    yaml.parser.ParserError,    # Invalid YAML structure
    UnicodeDecodeError          # Non-UTF8 file encoding
) as e:
    # Fallback to no frontmatter
    pass
```

**Common malformed YAML examples**:
```yaml
---
title: API Design: Overview  # Unquoted colon - INVALID
tags: [backend, api]
---

---
title: "Setup Guide
description: Installation steps  # Unclosed quote - INVALID
---

---
  title: Indented incorrectly  # Bad indentation - INVALID
tags:
- frontend
---
```

**Auto-fix on write** (optional enhancement):
```python
def save_note(user_id: str, path: str, title: str, metadata: dict, body: str):
    """Save note with valid frontmatter (auto-fix on write)."""
    # Merge title into metadata
    metadata['title'] = title

    # Create Post object with validated metadata
    post = frontmatter.Post(body, **metadata)

    # Serialize with valid YAML
    file_content = frontmatter.dumps(post)

    # Write to file
    full_path = get_vault_path(user_id) / path
    full_path.write_text(file_content, encoding='utf-8')
```

**Title extraction regex**:
```python
def extract_first_heading(markdown: str) -> str | None:
    """Extract first # Heading (must be H1, not H2/H3)."""
    # Match # Heading (H1 only, not ## or ###)
    pattern = r'^#\s+(.+?)(?:\s+\{[^}]+\})?\s*$'
    match = re.search(pattern, markdown, re.MULTILINE)

    if match:
        heading = match.group(1).strip()
        # Remove Markdown formatting (e.g., **bold**, *italic*)
        heading = re.sub(r'[*_`]', '', heading)
        return heading

    return None
```

**Fallback priority**:
1. `metadata.get('title')` - Explicit frontmatter title
2. `extract_first_heading(body)` - First `# Heading` in body
3. `Path(file_path).stem` - Filename without `.md` extension

**Validation warnings**:
```python
# Add validation warnings to API response
if malformed_frontmatter:
    warnings.append({
        "type": "malformed_frontmatter",
        "message": "YAML frontmatter is invalid and was ignored",
        "line": error.problem_mark.line if hasattr(error, 'problem_mark') else None
    })
```

**UI display for warnings**:
```tsx
function NoteViewer({ note, warnings }: Props) {
  return (
    <div>
      {warnings.map(w => (
        <Alert key={w.type} variant="warning">
          <AlertTitle>Warning</AlertTitle>
          <AlertDescription>{w.message}</AlertDescription>
        </Alert>
      ))}
      <Markdown>{note.body}</Markdown>
    </div>
  );
}
```

**Key points**:
- Always catch `yaml.scanner.ScannerError` and `yaml.parser.ParserError` from PyYAML
- Log warnings with file path and error details for admin debugging
- Prefer graceful fallback over strict validation (read-first workflow)
- Auto-fix on write ensures newly saved notes have valid frontmatter
- Extract title from first `# Heading`, not `## Subheading` (H1 only)

**References**:
- python-frontmatter docs: https://python-frontmatter.readthedocs.io/
- PyYAML error handling: https://pyyaml.org/wiki/PyYAMLDocumentation

---

## 8. JWT Token Management in React

### Decision

Use **hybrid approach**: Store short-lived access token (JWT) in **memory** (React state/context), store long-lived refresh token in **HttpOnly cookie** (server-managed). For MVP without refresh tokens, store JWT in **memory only** with 90-day expiration.

**MVP approach** (no refresh tokens):
- Store JWT in React Context (memory)
- Token expires after 90 days (long-lived)
- On app load, check if token exists in memory → if not, redirect to login
- No localStorage (XSS vulnerability mitigation)
- No refresh flow (acceptable for MVP scale)

**Production approach** (with refresh tokens):
- Access token: 15-minute expiration, stored in memory
- Refresh token: 90-day expiration, stored in HttpOnly cookie
- Automatic refresh before access token expires
- Refresh endpoint: `POST /api/auth/refresh` (validates cookie, issues new access token)

### Rationale

1. **XSS protection**: Memory storage prevents JavaScript-based token theft (localStorage is vulnerable to XSS)
2. **CSRF protection**: HttpOnly cookies can't be accessed by JS, mitigating CSRF (when combined with SameSite attribute)
3. **Industry best practice (2025)**: Hybrid approach is current security standard for React SPAs
4. **Acceptable UX**: User logs in once per 90 days (or once per session if memory-only)
5. **No additional dependencies**: Built-in React Context API handles memory storage

### Alternatives Considered

**Alternative 1: localStorage for JWT**
- **Rejected**: Store JWT in `localStorage.setItem('token', jwt)`
- **Why rejected**: Vulnerable to XSS attacks (malicious scripts can read localStorage); still in OWASP Top 10; unacceptable security risk for multi-tenant system

**Alternative 2: sessionStorage for JWT**
- **Rejected**: Store JWT in `sessionStorage` (cleared on tab close)
- **Why rejected**: Poor UX (re-login on every new tab); still vulnerable to XSS

**Alternative 3: Cookies for both access and refresh tokens**
- **Rejected**: Store JWT in regular cookies (not HttpOnly)
- **Why rejected**: Vulnerable to CSRF if not using HttpOnly; vulnerable to XSS if accessible to JS

**Alternative 4: No token storage (re-authenticate on every request)**
- **Rejected**: Use HF OAuth on every API call
- **Why rejected**: Unacceptable latency; OAuth flow is slow (~2-3s per request)

### Implementation Notes

**MVP implementation** (memory-only, 90-day JWT):

```typescript
// Auth context (memory storage)
import { createContext, useContext, useState, useEffect } from 'react';

interface AuthContextType {
  token: string | null;
  setToken: (token: string) => void;
  logout: () => void;
}

const AuthContext = createContext<AuthContextType | null>(null);

export function AuthProvider({ children }: { children: React.ReactNode }) {
  const [token, setTokenState] = useState<string | null>(null);

  const setToken = (newToken: string) => {
    setTokenState(newToken);
  };

  const logout = () => {
    setTokenState(null);
    window.location.href = '/oauth/huggingface/logout';
  };

  return (
    <AuthContext.Provider value={{ token, setToken, logout }}>
      {children}
    </AuthContext.Provider>
  );
}

export function useAuth() {
  const context = useContext(AuthContext);
  if (!context) throw new Error('useAuth must be used within AuthProvider');
  return context;
}
```

```typescript
// App initialization (fetch token after OAuth)
function App() {
  const { token, setToken } = useAuth();
  const [loading, setLoading] = useState(true);

  useEffect(() => {
    // Check if authenticated via HF OAuth
    fetch('/api/me')
      .then(res => {
        if (!res.ok) throw new Error('Not authenticated');
        return res.json();
      })
      .then(user => {
        // Issue JWT token for API access
        return fetch('/api/tokens', { method: 'POST' });
      })
      .then(res => res.json())
      .then(data => {
        setToken(data.token);
        setLoading(false);
      })
      .catch(() => {
        // Redirect to OAuth login
        window.location.href = '/oauth/huggingface/login';
      });
  }, []);

  if (loading) return <div>Loading...</div>;

  return <MainApp />;
}
```

```typescript
// API client (include token in headers)
async function apiRequest(endpoint: string, options: RequestInit = {}) {
  const { token } = useAuth();

  const response = await fetch(`/api${endpoint}`, {
    ...options,
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${token}`,
      ...options.headers
    }
  });

  if (response.status === 401) {
    // Token expired or invalid
    logout();
    throw new Error('Unauthorized');
  }

  return response;
}
```

**Production implementation** (with refresh tokens):

```typescript
// Token refresh logic
let refreshPromise: Promise<string> | null = null;

async function refreshAccessToken(): Promise<string> {
  // Prevent multiple concurrent refresh calls
  if (refreshPromise) return refreshPromise;

  refreshPromise = fetch('/api/auth/refresh', {
    method: 'POST',
    credentials: 'include'  // Send HttpOnly cookie
  })
    .then(res => {
      if (!res.ok) throw new Error('Refresh failed');
      return res.json();
    })
    .then(data => {
      setToken(data.access_token);
      refreshPromise = null;
      return data.access_token;
    })
    .catch(err => {
      refreshPromise = null;
      logout();
      throw err;
    });

  return refreshPromise;
}

// Automatic refresh before token expires
useEffect(() => {
  if (!token) return;

  // Parse token to get expiration
  const payload = JSON.parse(atob(token.split('.')[1]));
  const expiresAt = payload.exp * 1000;
  const now = Date.now();
  const refreshAt = expiresAt - (5 * 60 * 1000);  // 5 minutes before expiry

  const timeoutId = setTimeout(() => {
    refreshAccessToken();
  }, refreshAt - now);

  return () => clearTimeout(timeoutId);
}, [token]);
```

**Backend refresh endpoint**:
```python
from fastapi import Cookie, HTTPException

@app.post("/api/auth/refresh")
async def refresh_token(
    refresh_token: str = Cookie(None, httponly=True, samesite='strict')
):
    if not refresh_token:
        raise HTTPException(status_code=401, detail="No refresh token")

    # Validate refresh token
    try:
        payload = jwt.decode(refresh_token, SECRET_KEY, algorithms=["HS256"])
        user_id = payload["sub"]
    except jwt.ExpiredSignatureError:
        raise HTTPException(status_code=401, detail="Refresh token expired")
    except jwt.InvalidTokenError:
        raise HTTPException(status_code=401, detail="Invalid refresh token")

    # Issue new access token (15-minute expiry)
    access_token = create_jwt(user_id, expiration_minutes=15)

    return {"access_token": access_token, "token_type": "bearer"}
```

**Key points**:
- Memory storage = token lost on page refresh (re-login required) → acceptable for MVP
- HttpOnly cookies cannot be accessed by JavaScript (XSS protection)
- Set `SameSite=strict` on refresh token cookie (CSRF protection)
- Refresh token rotation: issue new refresh token on each refresh (advanced security)
- Use `credentials: 'include'` in fetch to send HttpOnly cookies
- Parse JWT client-side to schedule refresh (or use server-sent expiry hint)

**Security checklist**:
- ✅ Access token in memory (XSS-resistant)
- ✅ Refresh token in HttpOnly cookie (XSS-resistant)
- ✅ SameSite=strict on cookies (CSRF-resistant)
- ✅ HTTPS required (prevent MITM)
- ✅ Short access token expiry (limit blast radius)
- ✅ Token refresh before expiry (seamless UX)
- ✅ Logout clears both tokens

**MVP vs Production tradeoff**:
- **MVP**: 90-day JWT in memory → simpler, acceptable for hackathon/PoC
- **Production**: 15-min access + 90-day refresh → better security, more complex

**References**:
- JWT storage best practices: https://www.descope.com/blog/post/developer-guide-jwt-storage
- HttpOnly cookies vs localStorage: https://www.wisp.blog/blog/understanding-token-storage-local-storage-vs-httponly-cookies
- React authentication patterns: https://marmelab.com/blog/2020/07/02/manage-your-jwt-react-admin-authentication-in-memory.html

---

## Summary of Key Decisions

| Topic | Decision | Primary Rationale |
|-------|----------|-------------------|
| **FastMCP Auth** | Bearer token with JWT validation | Native FastMCP support, minimal config, standard-compliant |
| **HF OAuth** | `attach_huggingface_oauth` + `parse_huggingface_oauth` | Zero-config, local dev friendly, official HF support |
| **SQLite Schema** | FTS5 for full-text + separate tables for tags/links | Performance, per-user isolation, optimized query patterns |
| **Wikilink Resolution** | Case-insensitive slug matching + same-folder preference | Obsidian compatibility, user-friendly, deterministic |
| **Directory Tree** | shadcn-extension Tree View with virtualization | Only shadcn option with virtualization for 5K+ notes |
| **Optimistic Concurrency** | Version counter in SQLite + `if_version` param | Simple, fast, HTTP-friendly, no content hashing overhead |
| **Frontmatter Parsing** | `python-frontmatter` + fallback to no frontmatter | Graceful degradation, user-friendly error handling |
| **JWT Management** | Memory storage (MVP) or memory + HttpOnly cookie (prod) | XSS protection, industry best practice (2025) |

---

## Next Steps

With research complete, proceed to **Phase 1: Data Model & Contracts**:

1. Create `data-model.md` with detailed Pydantic models and SQLite schemas
2. Create `contracts/http-api.yaml` with OpenAPI 3.1 specification
3. Create `contracts/mcp-tools.json` with MCP tool schemas (JSON Schema format)
4. Create `quickstart.md` with setup instructions and testing workflows

After Phase 1, run `/speckit.tasks` to generate dependency-ordered implementation tasks.