bigwolfe
init
2510c5e
# Research: Multi-Tenant Obsidian-Like Docs Viewer
**Branch**: `001-obsidian-docs-viewer` | **Date**: 2025-11-15 | **Plan**: [plan.md](./plan.md)
## Overview
This document captures technical research and decisions for the implementation of a multi-tenant Obsidian-like documentation viewer. Each section addresses a specific research topic from Phase 0 of the implementation plan.
---
## 1. FastMCP HTTP Transport Authentication (Bearer Token)
### Decision
Use FastMCP's built-in `BearerAuth` mechanism with JWT token validation for HTTP transport authentication.
**Implementation approach**:
- Server: Configure FastMCP HTTP transport to accept `Authorization: Bearer <token>` header
- Client: Pass JWT token as string to `auth` parameter (FastMCP adds "Bearer" prefix automatically)
- Token format: JWT with claims `sub=user_id`, `exp=now+90days`, signed with `HS256` and server secret
### Rationale
1. **Native FastMCP support**: FastMCP provides first-class Bearer token authentication via `BearerAuth` class and string token shortcuts
2. **Minimal configuration**: Client code is as simple as `Client("https://...", auth="<token>")`
3. **Standard compliance**: Uses industry-standard `Authorization: Bearer` header pattern
4. **Transport flexibility**: Works seamlessly with both HTTP and SSE (Server-Sent Events) transports
5. **Non-interactive workflow**: Perfect for AI agents and service accounts that need programmatic access
### Alternatives Considered
**Alternative 1: Custom header authentication**
- **Rejected**: FastMCP supports custom headers but requires manual implementation of auth logic
- **Why rejected**: More complex, loses benefit of FastMCP's built-in token handling and validation
**Alternative 2: OAuth flow for MCP clients**
- **Rejected**: FastMCP supports full OAuth 2.1 flows with browser-based authentication
- **Why rejected**: Overly complex for AI agent use case; requires interactive browser flow which doesn't suit MCP STDIO or programmatic access patterns
**Alternative 3: API key-based authentication**
- **Rejected**: Could use simple API keys instead of JWTs
- **Why rejected**: JWTs provide expiration, claims, and stateless validation; better security posture for multi-tenant system
### Implementation Notes
**Server-side setup**:
```python
from fastmcp import FastMCP
from fastmcp.server.auth import BearerAuthProvider
import jwt
# For token validation (if using external issuer)
auth_provider = BearerAuthProvider(
public_key="<RSA_PUBLIC_KEY>",
issuer="https://your-issuer.com",
audience="your-api"
)
# For internal JWT validation (our use case)
# Validate manually in middleware/dependency injection
def validate_jwt(token: str) -> str:
payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
return payload["sub"] # user_id
```
**Client-side setup**:
```python
from fastmcp import Client
# Simplest approach - pass token as string
async with Client(
"https://fastmcp.cloud/mcp",
auth="eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."
) as client:
await client.call_tool("list_notes", {})
# Explicit approach - use BearerAuth class
from fastmcp.client.auth import BearerAuth
async with Client(
"https://fastmcp.cloud/mcp",
auth=BearerAuth(token="eyJhbGci...")
) as client:
await client.call_tool("list_notes", {})
```
**Key points**:
- Do NOT include "Bearer" prefix when passing token - FastMCP adds it automatically
- Token validation happens on every MCP tool call via HTTP transport
- STDIO transport bypasses authentication (local development only)
- For HF Space deployment, combine with HF OAuth to issue user-specific JWTs
**References**:
- FastMCP Bearer Auth docs: https://gofastmcp.com/clients/auth/bearer
- FastMCP authentication patterns: https://gyliu513.github.io/jekyll/update/2025/08/12/fastmcp-auth-patterns.html
---
## 2. Hugging Face Space OAuth Integration
### Decision
Use `huggingface_hub` library's built-in OAuth helpers (`attach_huggingface_oauth`, `parse_huggingface_oauth`) for zero-configuration OAuth integration in HF Spaces.
**Implementation approach**:
- Add `hf_oauth: true` to Space metadata in README.md
- Call `attach_huggingface_oauth(app)` to auto-register OAuth endpoints (`/oauth/huggingface/login`, `/oauth/huggingface/logout`, `/oauth/huggingface/callback`)
- Call `parse_huggingface_oauth(request)` in route handlers to extract authenticated user info
- Map HF username/ID to internal `user_id` for vault scoping
### Rationale
1. **Zero-configuration**: HF Spaces automatically injects OAuth environment variables (`OAUTH_CLIENT_ID`, `OAUTH_CLIENT_SECRET`, `OAUTH_SCOPES`) when `hf_oauth: true` is set
2. **Local dev friendly**: `parse_huggingface_oauth` returns mock user in local mode, enabling seamless development without OAuth setup
3. **Minimal code**: Two function calls provide complete OAuth flow (login redirect, callback handling, session management)
4. **First-class support**: Official HF library with guaranteed compatibility with Spaces platform
5. **Standard OAuth 2.0**: Under the hood, implements industry-standard OAuth with PKCE
### Alternatives Considered
**Alternative 1: Manual OAuth implementation**
- **Rejected**: Implement OAuth flow manually using `authlib` or `requests-oauthlib`
- **Why rejected**: Significantly more code, requires manual handling of PKCE, state validation, and token exchange; error-prone and loses HF Spaces auto-configuration
**Alternative 2: Third-party auth provider (Auth0, WorkOS)**
- **Rejected**: Use external auth service and connect HF as identity provider
- **Why rejected**: Adds unnecessary complexity and external dependencies for a system designed specifically for HF Spaces deployment
**Alternative 3: Session-based auth without OAuth**
- **Rejected**: Use simple username/password with cookie sessions
- **Why rejected**: Poor UX (users already have HF accounts), requires password management, doesn't leverage HF ecosystem integration
### Implementation Notes
**Space configuration** (README.md frontmatter):
```yaml
---
title: Obsidian Docs Viewer
emoji: πŸ“š
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
hf_oauth: true # <-- Enable OAuth
---
```
**Backend integration** (FastAPI):
```python
from fastapi import FastAPI, Request
from huggingface_hub import attach_huggingface_oauth, parse_huggingface_oauth
app = FastAPI()
# Auto-register OAuth endpoints
attach_huggingface_oauth(app)
@app.get("/")
def index(request: Request):
oauth_info = parse_huggingface_oauth(request)
if oauth_info is None:
return {"message": "Not logged in", "login_url": "/oauth/huggingface/login"}
# Extract user info
user_id = oauth_info.user_info.preferred_username # or use 'sub' for UUID
display_name = oauth_info.user_info.name
avatar = oauth_info.user_info.picture
return {
"user_id": user_id,
"display_name": display_name,
"avatar": avatar
}
@app.get("/api/me")
def get_current_user(request: Request):
oauth_info = parse_huggingface_oauth(request)
if oauth_info is None:
raise HTTPException(status_code=401, detail="Not authenticated")
# Map to internal user model
user_id = oauth_info.user_info.preferred_username
# Initialize vault on first login if needed
vault_service.ensure_vault_exists(user_id)
return {
"user_id": user_id,
"hf_profile": {
"username": oauth_info.user_info.preferred_username,
"name": oauth_info.user_info.name,
"avatar": oauth_info.user_info.picture
}
}
```
**Frontend integration** (React):
```typescript
// Check auth status on app load
useEffect(() => {
fetch('/api/me')
.then(res => {
if (res.ok) return res.json();
throw new Error('Not authenticated');
})
.then(user => setCurrentUser(user))
.catch(() => window.location.href = '/oauth/huggingface/login');
}, []);
```
**Key points**:
- `attach_huggingface_oauth` must be called BEFORE defining routes that need auth
- `parse_huggingface_oauth` returns `None` if not authenticated (check before accessing user_info)
- In local development, returns mocked user with deterministic username (e.g., "local-user")
- OAuth tokens/sessions are managed by `huggingface_hub` (stored in cookies)
- For API/MCP access, issue separate JWT after OAuth login via `POST /api/tokens`
**Environment variables** (auto-injected in HF Space):
- `OAUTH_CLIENT_ID`: Public client identifier
- `OAUTH_CLIENT_SECRET`: Secret for token exchange
- `OAUTH_SCOPES`: Space-specific scopes (typically `openid profile`)
**References**:
- HF OAuth docs: https://huggingface.co/docs/hub/spaces-oauth
- huggingface_hub API: https://huggingface.co/docs/huggingface_hub/en/package_reference/oauth
---
## 3. SQLite Schema Design for Multi-Index Storage
### Decision
Use SQLite with FTS5 (Full-Text Search 5) for full-text indexing, plus separate regular tables for tags and link graph. Implement per-user isolation via `user_id` column in all tables.
**Schema approach**:
- **Full-text index**: FTS5 virtual table with `title` and `body` columns, using `porter` tokenizer for stemming
- **Tag index**: Regular table with `user_id`, `tag`, `note_path` (many-to-many relationship)
- **Link graph**: Regular table with `user_id`, `source_path`, `target_path`, `link_text`, `is_resolved`
- **Metadata index**: Regular table with `user_id`, `note_path`, `version`, `created`, `updated`, `title` for fast lookups
- **Index health**: Regular table with `user_id`, `note_count`, `last_full_rebuild`, `last_incremental_update`
### Rationale
1. **FTS5 performance**: Native full-text search with inverted index, sub-100ms query times for thousands of documents
2. **Separate concerns**: Full-text (FTS5), tags (simple lookup), and links (graph traversal) have different query patterns; separate tables optimize each
3. **Per-user isolation**: `user_id` column in all tables enables simple WHERE filtering without complex row-level security
4. **External content pattern**: FTS5 with `content=''` (contentless) avoids storing document text twice (already in filesystem)
5. **Version tracking**: Metadata table stores version counter for optimistic concurrency without polluting frontmatter
6. **Prefix indexes**: FTS5 `prefix='2 3'` option enables fast autocomplete/prefix search
### Alternatives Considered
**Alternative 1: Single FTS5 table for everything**
- **Rejected**: Store tags and links as UNINDEXED columns in FTS5 table
- **Why rejected**: FTS5 is optimized for full-text, not structured data; complex queries (e.g., "all notes with tag X") would require scanning all rows; tags/links don't benefit from tokenization
**Alternative 2: Separate SQLite database per user**
- **Rejected**: One `.db` file per user instead of `user_id` column
- **Why rejected**: Increases file I/O overhead, complicates connection pooling, harder to implement global admin queries (e.g., total user count)
**Alternative 3: PostgreSQL with pg_trgm or RUM indexes**
- **Rejected**: Use full Postgres instead of SQLite
- **Why rejected**: Overkill for single-server deployment, adds deployment complexity, SQLite is sufficient for target scale (5,000 notes/user, 10 concurrent users)
**Alternative 4: In-memory index only**
- **Rejected**: Build inverted index in Python dict, no persistence
- **Why rejected**: Slow startup (rebuild on every restart), no durability, doesn't scale beyond single process
### Implementation Notes
**Schema definition**:
```sql
-- Metadata index (fast lookups, version tracking)
CREATE TABLE IF NOT EXISTS note_metadata (
user_id TEXT NOT NULL,
note_path TEXT NOT NULL,
version INTEGER NOT NULL DEFAULT 1,
title TEXT NOT NULL,
created TEXT NOT NULL, -- ISO 8601 timestamp
updated TEXT NOT NULL, -- ISO 8601 timestamp
PRIMARY KEY (user_id, note_path)
);
CREATE INDEX idx_metadata_user ON note_metadata(user_id);
CREATE INDEX idx_metadata_updated ON note_metadata(user_id, updated DESC);
-- Full-text search index (FTS5, contentless)
CREATE VIRTUAL TABLE IF NOT EXISTS note_fts USING fts5(
user_id UNINDEXED,
note_path UNINDEXED,
title,
body,
content='', -- Contentless (we don't store the actual text)
tokenize='porter unicode61', -- Stemming + Unicode support
prefix='2 3' -- Prefix indexes for autocomplete
);
-- Tag index (many-to-many: notes <-> tags)
CREATE TABLE IF NOT EXISTS note_tags (
user_id TEXT NOT NULL,
note_path TEXT NOT NULL,
tag TEXT NOT NULL,
PRIMARY KEY (user_id, note_path, tag)
);
CREATE INDEX idx_tags_user_tag ON note_tags(user_id, tag);
CREATE INDEX idx_tags_user_path ON note_tags(user_id, note_path);
-- Link graph (outgoing links from notes)
CREATE TABLE IF NOT EXISTS note_links (
user_id TEXT NOT NULL,
source_path TEXT NOT NULL,
target_path TEXT, -- NULL if unresolved
link_text TEXT NOT NULL, -- Original [[link text]]
is_resolved INTEGER NOT NULL DEFAULT 0, -- Boolean: 0=broken, 1=resolved
PRIMARY KEY (user_id, source_path, link_text)
);
CREATE INDEX idx_links_user_source ON note_links(user_id, source_path);
CREATE INDEX idx_links_user_target ON note_links(user_id, target_path);
CREATE INDEX idx_links_unresolved ON note_links(user_id, is_resolved);
-- Index health tracking
CREATE TABLE IF NOT EXISTS index_health (
user_id TEXT PRIMARY KEY,
note_count INTEGER NOT NULL DEFAULT 0,
last_full_rebuild TEXT, -- ISO 8601 timestamp
last_incremental_update TEXT -- ISO 8601 timestamp
);
```
**Query patterns**:
```python
# Full-text search with ranking
cursor.execute("""
SELECT
note_path,
title,
bm25(note_fts, 3.0, 1.0) AS rank -- Title weight=3, body weight=1
FROM note_fts
WHERE user_id = ? AND note_fts MATCH ?
ORDER BY rank DESC
LIMIT 50
""", (user_id, query))
# Get all notes with a specific tag
cursor.execute("""
SELECT DISTINCT note_path, title
FROM note_tags t
JOIN note_metadata m USING (user_id, note_path)
WHERE t.user_id = ? AND t.tag = ?
ORDER BY m.updated DESC
""", (user_id, tag))
# Get backlinks for a note
cursor.execute("""
SELECT DISTINCT l.source_path, m.title
FROM note_links l
JOIN note_metadata m ON l.user_id = m.user_id AND l.source_path = m.note_path
WHERE l.user_id = ? AND l.target_path = ?
ORDER BY m.updated DESC
""", (user_id, target_path))
# Get all unresolved links for UI display
cursor.execute("""
SELECT source_path, link_text
FROM note_links
WHERE user_id = ? AND is_resolved = 0
""", (user_id,))
```
**Incremental update strategy**:
1. On `write_note`: Delete all existing rows for `(user_id, note_path)`, then insert new rows
2. Use transactions to ensure atomicity (delete old + insert new = single atomic operation)
3. Update `index_health.last_incremental_update` on every write
**Full rebuild strategy**:
1. Delete all index rows for `user_id`
2. Scan all `.md` files in vault directory
3. Parse each file and insert into all indexes
4. Update `index_health.note_count` and `last_full_rebuild`
**Key points**:
- FTS5 with `content=''` is contentless - we must manually INSERT/DELETE rows (no automatic synchronization)
- Use `porter` tokenizer for English stemming (search "running" matches "run")
- `bm25()` function provides relevance ranking (better than simple MATCH count)
- Prefix indexes (`prefix='2 3'`) enable fast `MATCH 'prefix*'` queries
- `UNINDEXED` columns in FTS5 are retrievable but not searchable (good for IDs)
**References**:
- SQLite FTS5 docs: https://www.sqlite.org/fts5.html
- FTS5 structure deep dive: https://darksi.de/13.sqlite-fts5-structure/
---
## 4. Wikilink Normalization and Resolution
### Decision
Implement case-insensitive normalized slug matching with deterministic ambiguity resolution based on Obsidian's behavior.
**Normalization algorithm**:
1. Extract link text from `[[link text]]`
2. Normalize: lowercase, replace spaces/hyphens/underscores with single dash, remove non-alphanumeric except dashes
3. Match normalized slug against normalized filename stems AND normalized frontmatter titles
4. If multiple matches: prefer same-folder match, then lexicographically smallest path
**Slug normalization function**:
```python
import re
def normalize_slug(text: str) -> str:
"""Normalize text to slug for case-insensitive matching."""
text = text.lower()
text = re.sub(r'[\s_]+', '-', text) # Spaces/underscores β†’ dash
text = re.sub(r'[^a-z0-9-]', '', text) # Keep only alphanumeric + dash
text = re.sub(r'-+', '-', text) # Collapse multiple dashes
return text.strip('-')
```
### Rationale
1. **Obsidian compatibility**: Matches Obsidian's link resolution behavior (case-insensitive, flexible matching)
2. **User-friendly**: Users don't need to remember exact case or spacing (e.g., `[[API Design]]` matches `api-design.md`)
3. **Deterministic**: Same-folder preference + lexicographic tiebreaker ensures consistent resolution
4. **Efficient indexing**: Normalized slugs can be pre-computed and indexed for O(1) lookup
5. **Graceful fallback**: Broken links are tracked and displayed distinctly in UI
### Alternatives Considered
**Alternative 1: Exact case-sensitive matching**
- **Rejected**: Require `[[exact-filename]]` to match `exact-filename.md`
- **Why rejected**: Brittle user experience, doesn't match Obsidian behavior, forces users to remember exact capitalization
**Alternative 2: Fuzzy matching (Levenshtein distance)**
- **Rejected**: Use string similarity to find "close enough" matches
- **Why rejected**: Non-deterministic, slower, can match wrong notes ("Setup" matches "Startup"), confusing UX
**Alternative 3: Path-based links only**
- **Rejected**: Require full paths like `[[guides/setup]]` instead of `[[Setup]]`
- **Why rejected**: Verbose, doesn't match Obsidian's short-link paradigm, poor UX for large vaults
**Alternative 4: UUID-based links**
- **Rejected**: Use unique IDs like `[[#uuid-123]]` for stable references
- **Why rejected**: Not human-readable, requires additional metadata, doesn't match Obsidian convention
### Implementation Notes
**Resolution algorithm** (priority order):
```python
def resolve_wikilink(user_id: str, link_text: str, current_note_folder: str) -> str | None:
"""Resolve wikilink to note path, or None if unresolved."""
normalized = normalize_slug(link_text)
# Build candidate index: normalized_slug -> [note_paths]
candidates = defaultdict(list)
# Scan all notes for this user
for note in list_all_notes(user_id):
# Match against filename stem
stem = Path(note.path).stem
if normalize_slug(stem) == normalized:
candidates[note.path].append(note.path)
# Match against frontmatter title
if note.title and normalize_slug(note.title) == normalized:
candidates[note.path].append(note.path)
if not candidates:
return None # Unresolved link
paths = list(set(candidates.keys())) # Deduplicate
if len(paths) == 1:
return paths[0] # Unique match
# Ambiguity resolution
# 1. Prefer same-folder match
same_folder = [p for p in paths if Path(p).parent == current_note_folder]
if same_folder:
return sorted(same_folder)[0] # Lexicographic tiebreaker
# 2. Lexicographically smallest path
return sorted(paths)[0]
```
**Index optimization**:
Pre-compute normalized slugs for all notes and store in `note_metadata` table:
```sql
ALTER TABLE note_metadata ADD COLUMN normalized_title_slug TEXT;
ALTER TABLE note_metadata ADD COLUMN normalized_path_slug TEXT;
CREATE INDEX idx_metadata_title_slug ON note_metadata(user_id, normalized_title_slug);
CREATE INDEX idx_metadata_path_slug ON note_metadata(user_id, normalized_path_slug);
```
**Link extraction from Markdown**:
```python
import re
def extract_wikilinks(markdown_body: str) -> list[str]:
"""Extract all wikilink texts from markdown body."""
pattern = r'\[\[([^\]]+)\]\]'
return re.findall(pattern, markdown_body)
```
**Update link graph on write**:
```python
def update_link_graph(user_id: str, note_path: str, body: str):
"""Update outgoing links and backlinks for a note."""
current_folder = str(Path(note_path).parent)
# Extract wikilinks from body
link_texts = extract_wikilinks(body)
# Delete old links from this note
db.execute("DELETE FROM note_links WHERE user_id=? AND source_path=?",
(user_id, note_path))
# Insert new links
for link_text in link_texts:
target_path = resolve_wikilink(user_id, link_text, current_folder)
is_resolved = 1 if target_path else 0
db.execute("""
INSERT INTO note_links (user_id, source_path, target_path, link_text, is_resolved)
VALUES (?, ?, ?, ?, ?)
""", (user_id, note_path, target_path, link_text, is_resolved))
```
**UI rendering**:
```typescript
// Transform wikilinks to clickable links in rendered Markdown
function transformWikilinks(markdown: string, linkIndex: Record<string, string>): string {
return markdown.replace(/\[\[([^\]]+)\]\]/g, (match, linkText) => {
const targetPath = linkIndex[linkText];
if (targetPath) {
// Resolved link
return `<a href="#/note/${encodeURIComponent(targetPath)}" class="wikilink">${linkText}</a>`;
} else {
// Broken link
return `<a href="#/create/${encodeURIComponent(linkText)}" class="wikilink broken">${linkText}</a>`;
}
});
}
```
**Key points**:
- Pre-compute and cache slug mappings for performance (avoid re-scanning on every link resolution)
- Same-folder preference matches Obsidian's behavior (local references are intuitive)
- Lexicographic tiebreaker ensures determinism (same input always resolves to same output)
- Track `is_resolved` flag to identify broken links for UI warnings/affordances
- Update entire link graph on every note write (incremental update, not rebuild)
**Edge cases**:
- Empty link text `[[]]` - ignore/skip
- Nested brackets `[[foo [[bar]]]]` - naive regex fails; use proper parser or limit to non-nested pattern
- Link with pipe `[[link|display]]` - out of scope for MVP; treat entire string as link text
---
## 5. React + shadcn/ui Directory Tree Component
### Decision
Use **shadcn-extension Tree View** component with built-in virtualization via `@tanstack/react-virtual` for directory tree rendering.
**Component choice**: `shadcn-extension` Tree View
- **Installation**: Available at https://shadcn-extension.vercel.app/docs/tree-view
- **Features**: Virtualization, accordion-based expand/collapse, keyboard navigation, selection, custom icons
- **Why this one**: Only shadcn tree component with native virtualization support; critical for large vaults (5,000 notes)
### Rationale
1. **Virtualization required**: 5,000 notes would create 5,000+ DOM nodes without virtualization; TanStack Virtual renders only visible rows (~20-50 nodes)
2. **Performance**: Virtualization reduces initial render from ~2s to <100ms for large trees
3. **shadcn ecosystem**: Consistent styling with other shadcn/ui components (Button, ScrollArea, etc.)
4. **Accessibility**: Built on Radix UI primitives with keyboard navigation and ARIA support
5. **Customizable**: Supports custom icons per node, expand/collapse callbacks, and selection handling
### Alternatives Considered
**Alternative 1: MrLightful's shadcn Tree View**
- **Rejected**: Feature-rich component with drag-and-drop, custom icons
- **Why rejected**: No virtualization support; would cause performance issues with 1,000+ notes
**Alternative 2: Neigebaie's shadcn Tree View**
- **Rejected**: Advanced features (multi-select, checkboxes, context menus)
- **Why rejected**: No virtualization; overkill for simple directory browsing
**Alternative 3: react-arborist**
- **Rejected**: Powerful tree view library with virtualization and drag-and-drop
- **Why rejected**: Not part of shadcn ecosystem; requires custom styling to match UI; heavier dependency
**Alternative 4: Custom implementation with react-window**
- **Rejected**: Build tree view from scratch using `react-window` or `react-virtual`
- **Why rejected**: Significant development effort; reinventing the wheel; shadcn-extension already provides this
### Implementation Notes
**Installation**:
```bash
npx shadcn add https://shadcn-extension.vercel.app/registry/tree-view.json
```
**Component usage**:
```tsx
import { Tree, TreeNode } from "@/components/ui/tree-view";
interface FileTreeNode {
id: string;
name: string;
path: string;
isFolder: boolean;
children?: FileTreeNode[];
}
function DirectoryTree({ vault, onSelectNote }: Props) {
// Transform vault notes into tree structure
const treeData = useMemo(() => buildTree(vault.notes), [vault.notes]);
return (
<Tree
data={treeData}
onSelectChange={(nodeId) => {
const node = findNode(treeData, nodeId);
if (!node.isFolder) {
onSelectNote(node.path);
}
}}
// Virtualization is enabled by default
className="w-full h-full"
/>
);
}
// Transform flat list of note paths into hierarchical tree
function buildTree(notes: Note[]): TreeNode[] {
const root: Map<string, TreeNode> = new Map();
for (const note of notes) {
const parts = note.path.split('/');
let currentLevel = root;
for (let i = 0; i < parts.length; i++) {
const part = parts[i];
const isFile = i === parts.length - 1;
const id = parts.slice(0, i + 1).join('/');
if (!currentLevel.has(part)) {
currentLevel.set(part, {
id,
name: isFile ? note.title : part,
path: id,
isFolder: !isFile,
children: isFile ? undefined : new Map()
});
}
if (!isFile) {
currentLevel = currentLevel.get(part)!.children!;
}
}
}
return Array.from(root.values());
}
```
**Styling for Obsidian-like appearance**:
```css
/* Custom styles for file tree */
.tree-view-node {
@apply py-1 px-2 rounded hover:bg-accent transition-colors;
}
.tree-view-node.selected {
@apply bg-accent text-accent-foreground font-medium;
}
.tree-view-folder {
@apply flex items-center gap-2;
}
.tree-view-file {
@apply flex items-center gap-2 text-sm;
}
/* Icons */
.folder-icon {
@apply text-yellow-500;
}
.file-icon {
@apply text-gray-500;
}
```
**Collapsible behavior**:
```tsx
// Track expanded folders in state
const [expanded, setExpanded] = useState<Set<string>>(new Set(['root']));
<Tree
data={treeData}
expanded={expanded}
onExpandedChange={setExpanded}
// Auto-expand to selected note's folder
onSelectChange={(nodeId) => {
const path = nodeId.split('/');
const folders = path.slice(0, -1);
setExpanded(new Set([...expanded, ...folders]));
}}
/>
```
**Key points**:
- Virtualization is automatic with shadcn-extension Tree View (uses TanStack Virtual internally)
- Must transform flat note list into nested tree structure (use `buildTree` utility)
- Track expanded/collapsed state separately from tree data
- Custom icons per node type (folder vs file) via `icon` prop
- Use `ScrollArea` component from shadcn to wrap tree for custom scrollbars
**Performance targets**:
- Initial render: <200ms for 5,000 notes
- Expand/collapse: <50ms per folder
- Search filter: <100ms to re-render filtered tree
**Accessibility**:
- Keyboard navigation: Arrow keys to navigate, Enter to select, Space to expand/collapse
- Screen reader support: ARIA labels for folders/files, expand/collapse state
- Focus management: Visible focus indicators, focus restoration after selection
---
## 6. Optimistic Concurrency Implementation
### Decision
Use **version counter** (integer) stored in SQLite index with `if_version` parameter for UI writes. Implement **ETag-like validation** via `If-Match` header in HTTP API.
**Approach**:
- Version counter: Integer field in `note_metadata` table, incremented on every write
- UI writes: Include `if_version: N` in `PUT /api/notes/{path}` body
- Server validation: Compare `if_version` with current version; return `409 Conflict` if mismatch
- MCP writes: No version checking (last-write-wins)
- ETag header: Return `ETag: "<version>"` in `GET /api/notes/{path}` response for HTTP compliance
### Rationale
1. **Simple implementation**: Integer counter is trivial to increment and compare
2. **Explicit versioning**: Version in request body makes intent clear ("I'm updating version 5")
3. **Database-backed**: Version persists in index, not frontmatter (keeps note content clean)
4. **HTTP-friendly**: Can expose as ETag header for standards compliance
5. **Performance**: Integer comparison is O(1), no hash computation needed
### Alternatives Considered
**Alternative 1: ETag with content hash**
- **Rejected**: Compute MD5/SHA hash of note content, return as ETag header
- **Why rejected**: Hash computation on every read adds latency; version counter is sufficient and faster
**Alternative 2: Last-Modified timestamps**
- **Rejected**: Use `updated` timestamp + `If-Unmodified-Since` header
- **Why rejected**: Timestamp precision issues (SQLite stores ISO strings, not microsecond precision); race conditions if multiple updates within same second
**Alternative 3: Version in frontmatter**
- **Rejected**: Store `version: 5` in YAML frontmatter
- **Why rejected**: Pollutes user-facing metadata; incrementing version requires parsing/re-serializing frontmatter; harder to manage
**Alternative 4: MVCC (Multi-Version Concurrency Control)**
- **Rejected**: Store multiple versions of each note, allow rollback
- **Why rejected**: Complex implementation; storage overhead; out of scope for MVP (no version history requirement)
### Implementation Notes
**Schema addition**:
```sql
-- Version counter in note_metadata table
ALTER TABLE note_metadata ADD COLUMN version INTEGER NOT NULL DEFAULT 1;
```
**API endpoint implementation**:
```python
from fastapi import HTTPException, Header
from typing import Optional
@app.put("/api/notes/{path}")
async def update_note(
path: str,
body: NoteUpdateRequest,
user_id: str = Depends(get_current_user),
if_match: Optional[str] = Header(None) # ETag header support
):
# Get current version
current = get_note_metadata(user_id, path)
# Check if_version in body OR If-Match header
expected_version = body.if_version or (int(if_match.strip('"')) if if_match else None)
if expected_version is not None and current.version != expected_version:
raise HTTPException(
status_code=409,
detail={
"error": "version_conflict",
"message": "Note was updated by another process",
"current_version": current.version,
"provided_version": expected_version
}
)
# Update note and increment version
new_version = current.version + 1
save_note(user_id, path, body.content)
update_metadata(user_id, path, version=new_version, updated=now())
return {
"status": "ok",
"version": new_version
}
@app.get("/api/notes/{path}")
async def get_note(
path: str,
user_id: str = Depends(get_current_user)
):
note = load_note(user_id, path)
return JSONResponse(
content={
"path": note.path,
"title": note.title,
"metadata": note.metadata,
"body": note.body,
"version": note.version,
"created": note.created,
"updated": note.updated
},
headers={
"ETag": f'"{note.version}"', # Expose version as ETag
"Cache-Control": "no-cache" # Prevent stale reads
}
)
```
**Frontend implementation** (React):
```typescript
interface Note {
path: string;
title: string;
body: string;
version: number;
// ...
}
async function saveNote(note: Note, newBody: string) {
try {
const response = await fetch(`/api/notes/${encodeURIComponent(note.path)}`, {
method: 'PUT',
headers: {
'Content-Type': 'application/json',
'Authorization': `Bearer ${token}`,
// Option 1: Version in body
},
body: JSON.stringify({
body: newBody,
if_version: note.version // Optimistic concurrency check
})
});
if (response.status === 409) {
const error = await response.json();
alert(`Conflict: Note was updated (current version: ${error.current_version}). Please reload and try again.`);
return;
}
const updated = await response.json();
// Update local state with new version
setNote({ ...note, body: newBody, version: updated.version });
} catch (error) {
console.error('Save failed:', error);
}
}
```
**MCP tool implementation** (last-write-wins):
```python
@mcp.tool()
async def write_note(path: str, body: str, title: str = None) -> dict:
"""Write note via MCP (no version checking)."""
user_id = get_user_from_context()
# Load existing note to get current version (if exists)
try:
current = get_note_metadata(user_id, path)
new_version = current.version + 1
except NotFoundError:
new_version = 1 # New note
# Write without version check (last-write-wins)
save_note(user_id, path, body, title)
update_metadata(user_id, path, version=new_version, updated=now())
return {"status": "ok", "path": path, "version": new_version}
```
**Conflict resolution UI**:
```tsx
function ConflictDialog({ currentVersion, serverVersion }: Props) {
return (
<Alert variant="destructive">
<AlertTitle>Version Conflict</AlertTitle>
<AlertDescription>
This note was updated while you were editing (version {currentVersion} β†’ {serverVersion}).
<div className="mt-4 space-x-2">
<Button onClick={reload}>Reload and Discard Changes</Button>
<Button variant="outline" onClick={saveAsCopy}>Save as Copy</Button>
</div>
</AlertDescription>
</Alert>
);
}
```
**Key points**:
- Version counter starts at 1 for new notes, increments on every write
- HTTP API returns `409 Conflict` with detailed error message (current vs provided version)
- ETag header is optional but recommended for HTTP standards compliance
- MCP writes skip version check (AI agents don't need optimistic concurrency)
- Frontend shows clear error message with options: reload, save as copy, or manual merge
**Performance considerations**:
- Version check is single integer comparison (O(1))
- No need to read entire note content for validation
- Version update is atomic (SQLite transaction)
**References**:
- Optimistic concurrency patterns: https://event-driven.io/en/how_to_use_etag_header_for_optimistic_concurrency/
- HTTP conditional requests: https://developer.mozilla.org/en-US/docs/Web/HTTP/Conditional_requests
---
## 7. Markdown Frontmatter Parsing with Fallback
### Decision
Use `python-frontmatter` library for YAML parsing with try-except wrapper to handle malformed frontmatter gracefully. Implement fallback strategy: malformed YAML β†’ treat as no frontmatter, extract title from first `# Heading` or filename stem.
**Parsing approach**:
```python
import frontmatter
from pathlib import Path
def parse_note(file_path: str) -> dict:
"""Parse note with frontmatter fallback."""
try:
# Attempt to parse frontmatter
post = frontmatter.load(file_path)
metadata = dict(post.metadata)
body = post.content
except (yaml.scanner.ScannerError, yaml.parser.ParserError) as e:
# Malformed YAML - treat entire file as body
with open(file_path, 'r', encoding='utf-8') as f:
full_text = f.read()
metadata = {}
body = full_text
# Log warning for debugging
logger.warning(f"Malformed frontmatter in {file_path}: {e}")
# Extract title (priority: frontmatter > first H1 > filename)
title = (
metadata.get('title') or
extract_first_heading(body) or
Path(file_path).stem
)
return {
'title': title,
'metadata': metadata,
'body': body
}
def extract_first_heading(markdown: str) -> str | None:
"""Extract first # Heading from markdown body."""
match = re.match(r'^#\s+(.+)$', markdown, re.MULTILINE)
return match.group(1).strip() if match else None
```
### Rationale
1. **Graceful degradation**: Malformed YAML doesn't break the system; note is still readable
2. **User-friendly**: Non-technical users may create invalid YAML; system should be forgiving
3. **Simple implementation**: Try-except wrapper is minimal code; `python-frontmatter` handles valid cases
4. **Fallback chain**: Title extraction has clear priority order (explicit > inferred > default)
5. **Debugging support**: Log warnings for malformed YAML so admins can fix source files
### Alternatives Considered
**Alternative 1: Strict parsing (fail on malformed YAML)**
- **Rejected**: Raise error and reject note with invalid frontmatter
- **Why rejected**: Poor UX; users may accidentally create invalid YAML (e.g., unquoted colons); breaks read-first workflow
**Alternative 2: TOML or JSON frontmatter**
- **Rejected**: Use `+++` TOML or `{{{ }}}` JSON delimiters instead of YAML
- **Why rejected**: Obsidian uses YAML exclusively; compatibility is critical
**Alternative 3: Lenient YAML parser**
- **Rejected**: Use `ruamel.yaml` with error recovery instead of PyYAML
- **Why rejected**: Adds complexity; `python-frontmatter` uses PyYAML internally; fallback strategy is simpler
**Alternative 4: Partial frontmatter extraction**
- **Rejected**: Parse valid keys, ignore malformed keys
- **Why rejected**: Difficult to implement; unclear semantics (which keys are valid?); safer to treat all as invalid
### Implementation Notes
**Error types to catch**:
```python
import yaml
try:
post = frontmatter.load(file_path)
except (
yaml.scanner.ScannerError, # Invalid YAML syntax (e.g., unmatched quotes)
yaml.parser.ParserError, # Invalid YAML structure
UnicodeDecodeError # Non-UTF8 file encoding
) as e:
# Fallback to no frontmatter
pass
```
**Common malformed YAML examples**:
```yaml
---
title: API Design: Overview # Unquoted colon - INVALID
tags: [backend, api]
---
---
title: "Setup Guide
description: Installation steps # Unclosed quote - INVALID
---
---
title: Indented incorrectly # Bad indentation - INVALID
tags:
- frontend
---
```
**Auto-fix on write** (optional enhancement):
```python
def save_note(user_id: str, path: str, title: str, metadata: dict, body: str):
"""Save note with valid frontmatter (auto-fix on write)."""
# Merge title into metadata
metadata['title'] = title
# Create Post object with validated metadata
post = frontmatter.Post(body, **metadata)
# Serialize with valid YAML
file_content = frontmatter.dumps(post)
# Write to file
full_path = get_vault_path(user_id) / path
full_path.write_text(file_content, encoding='utf-8')
```
**Title extraction regex**:
```python
def extract_first_heading(markdown: str) -> str | None:
"""Extract first # Heading (must be H1, not H2/H3)."""
# Match # Heading (H1 only, not ## or ###)
pattern = r'^#\s+(.+?)(?:\s+\{[^}]+\})?\s*$'
match = re.search(pattern, markdown, re.MULTILINE)
if match:
heading = match.group(1).strip()
# Remove Markdown formatting (e.g., **bold**, *italic*)
heading = re.sub(r'[*_`]', '', heading)
return heading
return None
```
**Fallback priority**:
1. `metadata.get('title')` - Explicit frontmatter title
2. `extract_first_heading(body)` - First `# Heading` in body
3. `Path(file_path).stem` - Filename without `.md` extension
**Validation warnings**:
```python
# Add validation warnings to API response
if malformed_frontmatter:
warnings.append({
"type": "malformed_frontmatter",
"message": "YAML frontmatter is invalid and was ignored",
"line": error.problem_mark.line if hasattr(error, 'problem_mark') else None
})
```
**UI display for warnings**:
```tsx
function NoteViewer({ note, warnings }: Props) {
return (
<div>
{warnings.map(w => (
<Alert key={w.type} variant="warning">
<AlertTitle>Warning</AlertTitle>
<AlertDescription>{w.message}</AlertDescription>
</Alert>
))}
<Markdown>{note.body}</Markdown>
</div>
);
}
```
**Key points**:
- Always catch `yaml.scanner.ScannerError` and `yaml.parser.ParserError` from PyYAML
- Log warnings with file path and error details for admin debugging
- Prefer graceful fallback over strict validation (read-first workflow)
- Auto-fix on write ensures newly saved notes have valid frontmatter
- Extract title from first `# Heading`, not `## Subheading` (H1 only)
**References**:
- python-frontmatter docs: https://python-frontmatter.readthedocs.io/
- PyYAML error handling: https://pyyaml.org/wiki/PyYAMLDocumentation
---
## 8. JWT Token Management in React
### Decision
Use **hybrid approach**: Store short-lived access token (JWT) in **memory** (React state/context), store long-lived refresh token in **HttpOnly cookie** (server-managed). For MVP without refresh tokens, store JWT in **memory only** with 90-day expiration.
**MVP approach** (no refresh tokens):
- Store JWT in React Context (memory)
- Token expires after 90 days (long-lived)
- On app load, check if token exists in memory β†’ if not, redirect to login
- No localStorage (XSS vulnerability mitigation)
- No refresh flow (acceptable for MVP scale)
**Production approach** (with refresh tokens):
- Access token: 15-minute expiration, stored in memory
- Refresh token: 90-day expiration, stored in HttpOnly cookie
- Automatic refresh before access token expires
- Refresh endpoint: `POST /api/auth/refresh` (validates cookie, issues new access token)
### Rationale
1. **XSS protection**: Memory storage prevents JavaScript-based token theft (localStorage is vulnerable to XSS)
2. **CSRF protection**: HttpOnly cookies can't be accessed by JS, mitigating CSRF (when combined with SameSite attribute)
3. **Industry best practice (2025)**: Hybrid approach is current security standard for React SPAs
4. **Acceptable UX**: User logs in once per 90 days (or once per session if memory-only)
5. **No additional dependencies**: Built-in React Context API handles memory storage
### Alternatives Considered
**Alternative 1: localStorage for JWT**
- **Rejected**: Store JWT in `localStorage.setItem('token', jwt)`
- **Why rejected**: Vulnerable to XSS attacks (malicious scripts can read localStorage); still in OWASP Top 10; unacceptable security risk for multi-tenant system
**Alternative 2: sessionStorage for JWT**
- **Rejected**: Store JWT in `sessionStorage` (cleared on tab close)
- **Why rejected**: Poor UX (re-login on every new tab); still vulnerable to XSS
**Alternative 3: Cookies for both access and refresh tokens**
- **Rejected**: Store JWT in regular cookies (not HttpOnly)
- **Why rejected**: Vulnerable to CSRF if not using HttpOnly; vulnerable to XSS if accessible to JS
**Alternative 4: No token storage (re-authenticate on every request)**
- **Rejected**: Use HF OAuth on every API call
- **Why rejected**: Unacceptable latency; OAuth flow is slow (~2-3s per request)
### Implementation Notes
**MVP implementation** (memory-only, 90-day JWT):
```typescript
// Auth context (memory storage)
import { createContext, useContext, useState, useEffect } from 'react';
interface AuthContextType {
token: string | null;
setToken: (token: string) => void;
logout: () => void;
}
const AuthContext = createContext<AuthContextType | null>(null);
export function AuthProvider({ children }: { children: React.ReactNode }) {
const [token, setTokenState] = useState<string | null>(null);
const setToken = (newToken: string) => {
setTokenState(newToken);
};
const logout = () => {
setTokenState(null);
window.location.href = '/oauth/huggingface/logout';
};
return (
<AuthContext.Provider value={{ token, setToken, logout }}>
{children}
</AuthContext.Provider>
);
}
export function useAuth() {
const context = useContext(AuthContext);
if (!context) throw new Error('useAuth must be used within AuthProvider');
return context;
}
```
```typescript
// App initialization (fetch token after OAuth)
function App() {
const { token, setToken } = useAuth();
const [loading, setLoading] = useState(true);
useEffect(() => {
// Check if authenticated via HF OAuth
fetch('/api/me')
.then(res => {
if (!res.ok) throw new Error('Not authenticated');
return res.json();
})
.then(user => {
// Issue JWT token for API access
return fetch('/api/tokens', { method: 'POST' });
})
.then(res => res.json())
.then(data => {
setToken(data.token);
setLoading(false);
})
.catch(() => {
// Redirect to OAuth login
window.location.href = '/oauth/huggingface/login';
});
}, []);
if (loading) return <div>Loading...</div>;
return <MainApp />;
}
```
```typescript
// API client (include token in headers)
async function apiRequest(endpoint: string, options: RequestInit = {}) {
const { token } = useAuth();
const response = await fetch(`/api${endpoint}`, {
...options,
headers: {
'Content-Type': 'application/json',
'Authorization': `Bearer ${token}`,
...options.headers
}
});
if (response.status === 401) {
// Token expired or invalid
logout();
throw new Error('Unauthorized');
}
return response;
}
```
**Production implementation** (with refresh tokens):
```typescript
// Token refresh logic
let refreshPromise: Promise<string> | null = null;
async function refreshAccessToken(): Promise<string> {
// Prevent multiple concurrent refresh calls
if (refreshPromise) return refreshPromise;
refreshPromise = fetch('/api/auth/refresh', {
method: 'POST',
credentials: 'include' // Send HttpOnly cookie
})
.then(res => {
if (!res.ok) throw new Error('Refresh failed');
return res.json();
})
.then(data => {
setToken(data.access_token);
refreshPromise = null;
return data.access_token;
})
.catch(err => {
refreshPromise = null;
logout();
throw err;
});
return refreshPromise;
}
// Automatic refresh before token expires
useEffect(() => {
if (!token) return;
// Parse token to get expiration
const payload = JSON.parse(atob(token.split('.')[1]));
const expiresAt = payload.exp * 1000;
const now = Date.now();
const refreshAt = expiresAt - (5 * 60 * 1000); // 5 minutes before expiry
const timeoutId = setTimeout(() => {
refreshAccessToken();
}, refreshAt - now);
return () => clearTimeout(timeoutId);
}, [token]);
```
**Backend refresh endpoint**:
```python
from fastapi import Cookie, HTTPException
@app.post("/api/auth/refresh")
async def refresh_token(
refresh_token: str = Cookie(None, httponly=True, samesite='strict')
):
if not refresh_token:
raise HTTPException(status_code=401, detail="No refresh token")
# Validate refresh token
try:
payload = jwt.decode(refresh_token, SECRET_KEY, algorithms=["HS256"])
user_id = payload["sub"]
except jwt.ExpiredSignatureError:
raise HTTPException(status_code=401, detail="Refresh token expired")
except jwt.InvalidTokenError:
raise HTTPException(status_code=401, detail="Invalid refresh token")
# Issue new access token (15-minute expiry)
access_token = create_jwt(user_id, expiration_minutes=15)
return {"access_token": access_token, "token_type": "bearer"}
```
**Key points**:
- Memory storage = token lost on page refresh (re-login required) β†’ acceptable for MVP
- HttpOnly cookies cannot be accessed by JavaScript (XSS protection)
- Set `SameSite=strict` on refresh token cookie (CSRF protection)
- Refresh token rotation: issue new refresh token on each refresh (advanced security)
- Use `credentials: 'include'` in fetch to send HttpOnly cookies
- Parse JWT client-side to schedule refresh (or use server-sent expiry hint)
**Security checklist**:
- βœ… Access token in memory (XSS-resistant)
- βœ… Refresh token in HttpOnly cookie (XSS-resistant)
- βœ… SameSite=strict on cookies (CSRF-resistant)
- βœ… HTTPS required (prevent MITM)
- βœ… Short access token expiry (limit blast radius)
- βœ… Token refresh before expiry (seamless UX)
- βœ… Logout clears both tokens
**MVP vs Production tradeoff**:
- **MVP**: 90-day JWT in memory β†’ simpler, acceptable for hackathon/PoC
- **Production**: 15-min access + 90-day refresh β†’ better security, more complex
**References**:
- JWT storage best practices: https://www.descope.com/blog/post/developer-guide-jwt-storage
- HttpOnly cookies vs localStorage: https://www.wisp.blog/blog/understanding-token-storage-local-storage-vs-httponly-cookies
- React authentication patterns: https://marmelab.com/blog/2020/07/02/manage-your-jwt-react-admin-authentication-in-memory.html
---
## Summary of Key Decisions
| Topic | Decision | Primary Rationale |
|-------|----------|-------------------|
| **FastMCP Auth** | Bearer token with JWT validation | Native FastMCP support, minimal config, standard-compliant |
| **HF OAuth** | `attach_huggingface_oauth` + `parse_huggingface_oauth` | Zero-config, local dev friendly, official HF support |
| **SQLite Schema** | FTS5 for full-text + separate tables for tags/links | Performance, per-user isolation, optimized query patterns |
| **Wikilink Resolution** | Case-insensitive slug matching + same-folder preference | Obsidian compatibility, user-friendly, deterministic |
| **Directory Tree** | shadcn-extension Tree View with virtualization | Only shadcn option with virtualization for 5K+ notes |
| **Optimistic Concurrency** | Version counter in SQLite + `if_version` param | Simple, fast, HTTP-friendly, no content hashing overhead |
| **Frontmatter Parsing** | `python-frontmatter` + fallback to no frontmatter | Graceful degradation, user-friendly error handling |
| **JWT Management** | Memory storage (MVP) or memory + HttpOnly cookie (prod) | XSS protection, industry best practice (2025) |
---
## Next Steps
With research complete, proceed to **Phase 1: Data Model & Contracts**:
1. Create `data-model.md` with detailed Pydantic models and SQLite schemas
2. Create `contracts/http-api.yaml` with OpenAPI 3.1 specification
3. Create `contracts/mcp-tools.json` with MCP tool schemas (JSON Schema format)
4. Create `quickstart.md` with setup instructions and testing workflows
After Phase 1, run `/speckit.tasks` to generate dependency-ordered implementation tasks.