Aparecium Baseline Model Card

Summary

  • Task: Reconstruct natural language posts from token‑level MPNet embeddings (reverse embedding).
  • Focus: Crypto domain, with equities as auxiliary domain.
  • Checkpoint: Baseline model trained with a phased schedule and early stopping.
  • Data: 1.0M synthetic posts (500k crypto + 500k equities), programmatically generated via the OpenAI API. No real social‑media content was used.
  • Input contract: token‑level MPNet matrix of shape (seq_len, 768), not a pooled vector.

Intended use

  • Research and engineering use for studying reversibility of embedding spaces and for building diagnostics/tools around embedding interpretability.
  • Not intended to reconstruct private or sensitive content; reconstruction accuracy depends on embedding fidelity and domain match.

Model architecture

  • Encoder side: external; an MPNet‑family encoder (default: sentence-transformers/all-mpnet-base-v2) is assumed to produce the token‑level embeddings.
  • Decoder: Transformer decoder consuming the MPNet memory (a configuration sketch follows this list):
    • d_model: 768
    • Decoder layers: 2
    • Attention heads: 8
    • FFN dim: 2048
    • Token and positional embeddings; GELU activations
  • Decoding:
    • Supports greedy, sampling, and beam search.
    • Optional embedding‑aware rescoring (cosine similarity between the candidate’s re‑embedded sentence and the pooled MPNet target).
    • Optional lightweight constraints for hashtag/cashtag/URL continuity.
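A minimal configuration sketch of such a decoder, built from standard PyTorch modules, is shown below. The vocabulary size, maximum target length, and module layout are illustrative assumptions; this is not the checkpoint's exact implementation.

```python
# Sketch only: mirrors the listed hyperparameters, not the checkpoint's code.
import torch
import torch.nn as nn

VOCAB_SIZE = 30527    # MPNet tokenizer vocabulary size (assumed)
MAX_TGT_LEN = 128     # illustrative; matches the training max_target_length below

class BaselineDecoder(nn.Module):
    def __init__(self, d_model=768, num_layers=2, num_heads=8, ffn_dim=2048):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos_emb = nn.Embedding(MAX_TGT_LEN, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=num_heads, dim_feedforward=ffn_dim,
            activation="gelu", batch_first=True,
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.lm_head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tgt_tokens, mpnet_memory):
        # mpnet_memory: token-level MPNet embeddings, shape (batch, src_len, 768)
        seq_len = tgt_tokens.size(1)
        positions = torch.arange(seq_len, device=tgt_tokens.device)
        x = self.tok_emb(tgt_tokens) + self.pos_emb(positions)
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=tgt_tokens.device),
            diagonal=1,
        )
        hidden = self.decoder(x, mpnet_memory, tgt_mask=causal_mask)
        return self.lm_head(hidden)   # logits over the vocabulary
```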

Recommended inference defaults:

  • num_beams=8
  • length_penalty_alpha=0.6
  • lambda_sim=0.6
  • rescore_every_k=4, rescore_top_m=8
  • beta=10.0
  • enable_constraints=True
  • deterministic=True
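For programmatic use, the same defaults can be gathered into a single config, as in the sketch below. The key names mirror the list above, but the repository's own API/CLI argument names may differ, and the inline comments give only a plausible reading of each knob.

```python
# Recommended decode defaults in one place (names/semantics are assumptions
# beyond what the list above states).
DECODE_DEFAULTS = dict(
    num_beams=8,
    length_penalty_alpha=0.6,
    lambda_sim=0.6,            # weight of the embedding-aware rescoring term
    rescore_every_k=4,         # rescore candidates every k decoding steps (assumed)
    rescore_top_m=8,           # rescore only the top-m beam candidates (assumed)
    beta=10.0,
    enable_constraints=True,   # hashtag/cashtag/URL continuity constraints
    deterministic=True,
)
```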

Training data and provenance

  • 1,000,000 synthetic posts total:
    • 500,000 crypto‑domain posts
    • 500,000 equities‑domain posts
  • All posts were programmatically generated via the OpenAI API (synthetic). No real social‑media content was used.
  • Embeddings:
    • Token‑level MPNet (default: sentence-transformers/all-mpnet-base-v2).
    • Cached to SQLite to avoid recomputation and allow resumable training.
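A minimal caching sketch is shown below, assuming the Hugging Face transformers API and an illustrative SQLite schema (table/column names and file paths are not the repository's actual cache format).

```python
# Illustrative cache: token-level MPNet embeddings stored as float32 blobs in
# SQLite so later runs can resume without re-embedding.
import sqlite3
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

NAME = "sentence-transformers/all-mpnet-base-v2"
tokenizer = AutoTokenizer.from_pretrained(NAME)
encoder = AutoModel.from_pretrained(NAME).eval()

db = sqlite3.connect("mpnet_cache.sqlite")
db.execute(
    "CREATE TABLE IF NOT EXISTS cache (post_id TEXT PRIMARY KEY, seq_len INT, emb BLOB)"
)

def cache_post(post_id: str, text: str) -> None:
    batch = tokenizer(text, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        token_matrix = encoder(**batch).last_hidden_state[0]   # (seq_len, 768)
    mat = token_matrix.numpy().astype(np.float32)
    db.execute(
        "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
        (post_id, mat.shape[0], mat.tobytes()),
    )
    db.commit()
```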

Training procedure (baseline regimen)

  • Domain emphasis: 80% crypto / 20% equities per training phase.
  • Phased training (10% of the available chunks per phase), with evaluation after each phase:
    • In‑sample: small subset from the phase’s chunks
    • Out‑of‑sample: small hold‑out from both domains (not seen in the phase)
    • Early‑stop condition: stop if out‑of‑sample cosine degrades relative to the prior phase (see the loop sketch after this list).
  • Optimizer: AdamW
  • Learning rate (baseline finetune): 5e‑5
  • Batch size: 16
  • Input max_source_length: 256
  • Target max_target_length: 128
  • Checkpointing: every 2,000 steps and at phase end.
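The phase loop and early‑stop condition reduce to roughly the following sketch; train_one_phase and eval_out_of_sample_cosine are hypothetical stand‑ins for the project's own training and evaluation routines.

```python
# Phase-level loop sketch; the two callables are hypothetical stand-ins for
# the repository's code (which also checkpoints every 2,000 steps and at
# phase end).
def run_phased_training(phase_chunk_lists, train_one_phase, eval_out_of_sample_cosine):
    prev_cosine = float("-inf")
    for phase_idx, chunks in enumerate(phase_chunk_lists):   # ~10% of chunks per phase
        train_one_phase(chunks)                  # 80:20 crypto/equities, AdamW, lr=5e-5, batch size 16
        oos_cosine = eval_out_of_sample_cosine() # hold-out drawn from both domains
        if oos_cosine < prev_cosine:
            # Out-of-sample cosine degraded relative to the prior phase: stop.
            print(f"Early stop at phase {phase_idx}: {oos_cosine:.3f} < {prev_cosine:.3f}")
            break
        prev_cosine = oos_cosine
```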

Notes

  • Training used early stopping based on out‑of‑sample cosine.

Evaluation protocol (for the metrics below)

  • Sample size: 1,000 examples per domain drawn from cached embedding databases.
  • Decode config: num_beams=8, length_penalty_alpha=0.6, lambda_sim=0.6, rescore_every_k=4, rescore_top_m=8, beta=10.0, enable_constraints=True, deterministic=True.
  • Metrics:
    • cosine_mean/median/p10/p90: cosine between pooled MPNet embedding of generated text and the pooled MPNet target vector (higher is better).
    • score_norm_mean: length‑penalized language model score (more positive is better; negative values are common for log‑scores).
    • degenerate_pct: % of clearly degenerate generations (very short/blank/only hashtags).
    • domain_drift_pct: % of outputs containing cross‑domain terms (equity‑like terms in crypto outputs, or crypto‑like terms in equities outputs); heuristic text filter, intended as a rough indicator only.
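For reference, the cosine_* metrics above reduce to the computation sketched below, assuming the sentence-transformers package; the actual evaluation harness is not reproduced here.

```python
# Pooled-cosine metric sketch: re-embed the generated text with the same MPNet
# model and compare it to the pooled target vector (both L2-normalized).
import numpy as np
from sentence_transformers import SentenceTransformer

_encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def pooled_cosine(generated_text: str, pooled_target: np.ndarray) -> float:
    gen = _encoder.encode(generated_text, normalize_embeddings=True)  # (768,)
    tgt = pooled_target / np.linalg.norm(pooled_target)
    return float(np.dot(gen, tgt))
```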

Results (current models/baseline checkpoint)

  • Crypto (n=1000)
    • cosine_mean: 0.681
    • cosine_median: 0.843
    • cosine_p10: 0.000
    • cosine_p90: 0.984
    • score_norm_mean: −1.977
    • degenerate_pct: 5.2%
    • domain_drift_pct: 0.0%
  • Equities (n=1000)
    • cosine_mean: 0.778
    • cosine_median: 0.901
    • cosine_p10: 0.326
    • cosine_p90: 0.986
    • score_norm_mean: −1.344
    • degenerate_pct: 2.2%
    • domain_drift_pct: 4.4%

Interpretation

  • The model reconstructs many posts with strong embedding alignment (p90 ≈ 0.98 cosine in both domains).
  • Equities shows higher average/median cosine and lower degeneracy than crypto, consistent with the auxiliary‑domain role and data characteristics.
  • A small fraction of degenerate outputs exists in both domains (crypto ~5.2%, equities ~2.2%).
  • Domain drift is minimal from crypto→equities (0.0%) and present at a modest rate from equities→crypto (~4.4%) under the chosen heuristic.

Input contract and usage

  • Input: MPNet token‑level matrix (seq_len × 768) for a single post. Do not pass a pooled vector.
  • Tokenizer/model alignment matters: use the same MPNet tokenizer/model version that produced the embeddings.
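A small read‑back sketch, continuing the illustrative SQLite schema from the training‑data section, shows what the expected input looks like (the file name and post_id are hypothetical).

```python
# Restore a cached token-level matrix and check the input contract.
# Schema, file name, and post_id continue the illustrative cache sketch above.
import sqlite3
import numpy as np

db = sqlite3.connect("mpnet_cache.sqlite")
seq_len, blob = db.execute(
    "SELECT seq_len, emb FROM cache WHERE post_id = ?", ("post-000001",)
).fetchone()
token_matrix = np.frombuffer(blob, dtype=np.float32).reshape(seq_len, 768)

# The decoder expects this (seq_len, 768) matrix, not a pooled (768,) vector.
assert token_matrix.ndim == 2 and token_matrix.shape[1] == 768
```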

Limitations and responsible use

  • Reconstruction is not guaranteed to match the original post text; it optimizes alignment within the MPNet embedding space and LM scoring.
  • The model can produce generic or incomplete outputs (see degenerate_pct).
  • Domain drift can occur depending on decode settings (see domain_drift_pct).
  • The data are synthetic, programmatically generated posts, not real social‑media content; domain semantics may differ from real‑world distributions.
  • Do not use for reconstructing sensitive/private content or for attempting to de‑anonymize embedding corpora. This model is a research/diagnostic tool.

Reproducibility (high‑level)

  • Prepare embedding caches (not included): build local token‑level MPNet embedding caches for your corpora (e.g., via a data prep script) and store them in your own paths.
  • Baseline training: iterative 10% phases, 80:20 (crypto:equities), LR=5e‑5, BS=16, early‑stop on out‑of‑sample cosine degradation.
  • Evaluation: 1,000 samples/domain with the decode settings shown above.
  • The released checkpoint corresponds to the latest non‑degrading phase under early‑stopping.

License

  • Code: MIT (per repository).
  • Model weights: same as code unless declared otherwise upon release.

Citation

If you use this model or codebase, please cite the Aparecium project and this baseline report.
