Aparecium Baseline Model Card
Summary
- Task: Reconstruct natural language posts from token‑level MPNet embeddings (reverse embedding).
- Focus: Crypto domain, with equities as auxiliary domain.
- Checkpoint: Baseline model trained with a phased schedule and early stopping.
- Data: 1.0M synthetic posts (500k crypto + 500k equities), programmatically generated via OpenAI API. No real social‑media content used.
- Input contract: token‑level MPNet matrix of shape (seq_len, 768), not a pooled vector.
Intended use
- Research and engineering use for studying reversibility of embedding spaces and for building diagnostics/tools around embedding interpretability.
- Not intended to reconstruct private or sensitive content; reconstruction accuracy depends on embedding fidelity and domain match.
Model architecture
- Encoder side: external; an MPNet‑family encoder (default: sentence-transformers/all-mpnet-base-v2) is assumed to produce the token‑level embeddings.
- Decoder: Transformer decoder consuming the MPNet memory (see the sketch after this list):
- d_model: 768
- Decoder layers: 2
- Attention heads: 8
- FFN dim: 2048
- Token and positional embeddings; GELU activations
- Decoding:
- Supports greedy, sampling, and beam search.
- Optional embedding‑aware rescoring (cosine similarity between the candidate’s re‑embedded sentence and the pooled MPNet target).
- Optional lightweight constraints for hashtag/cashtag/URL continuity.
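A minimal PyTorch sketch of a decoder with the hyperparameters listed above; the class and argument names are illustrative and do not reflect the repository's actual modules.

```python
import torch
import torch.nn as nn

class BaselineDecoder(nn.Module):
    """Transformer decoder over an MPNet token-level memory (illustrative)."""

    def __init__(self, vocab_size: int, d_model: int = 768, n_layers: int = 2,
                 n_heads: int = 8, ffn_dim: int = 2048, max_len: int = 128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # token embeddings
        self.pos_emb = nn.Embedding(max_len, d_model)      # learned positional embeddings
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=ffn_dim,
            activation="gelu", batch_first=True,
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # memory: token-level MPNet embeddings, shape (batch, seq_len, 768)
        sz = input_ids.size(1)
        positions = torch.arange(sz, device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(positions)
        causal_mask = torch.triu(
            torch.full((sz, sz), float("-inf"), device=input_ids.device), diagonal=1)
        h = self.decoder(x, memory, tgt_mask=causal_mask)
        return self.lm_head(h)  # logits over the target vocabulary
```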
Recommended inference defaults:
num_beams=8, length_penalty_alpha=0.6, lambda_sim=0.6, rescore_every_k=4, rescore_top_m=8, beta=10.0, enable_constraints=True, deterministic=True
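The same defaults expressed as a plain Python dict for convenience; the comments are interpretive, and the actual decode entry point may use a different interface.

```python
# Keys mirror the documented defaults above; comments are interpretive.
DECODE_DEFAULTS = dict(
    num_beams=8,               # beam search width
    length_penalty_alpha=0.6,  # length-penalty exponent for normalized scores
    lambda_sim=0.6,            # weight of the embedding-similarity rescoring term
    rescore_every_k=4,         # how often (in decode steps) candidates are rescored
    rescore_top_m=8,           # how many top candidates are re-embedded each time
    beta=10.0,                 # scaling of the similarity term (exact semantics not specified here)
    enable_constraints=True,   # hashtag/cashtag/URL continuity constraints
    deterministic=True,        # disable sampling for reproducible decoding
)
```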
Training data and provenance
- 1,000,000 synthetic posts total:
- 500,000 crypto‑domain posts
- 500,000 equities‑domain posts
- All posts were programmatically generated via the OpenAI API (synthetic). No real social‑media content was used.
- Embeddings:
- Token‑level MPNet (default: sentence-transformers/all-mpnet-base-v2).
- Cached to SQLite to avoid recomputation and allow resumable training (see the sketch below).
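A minimal sketch of an SQLite‑backed token‑level embedding cache of the kind described above; the table schema and function names are illustrative, not the project's actual layout.

```python
import sqlite3
from typing import Optional

import numpy as np

def open_cache(path: str) -> sqlite3.Connection:
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS embeddings ("
        "post_id TEXT PRIMARY KEY, seq_len INTEGER, data BLOB)"
    )
    return con

def put_embedding(con: sqlite3.Connection, post_id: str, emb: np.ndarray) -> None:
    # emb: token-level MPNet matrix of shape (seq_len, 768), stored as float32 bytes
    con.execute(
        "INSERT OR REPLACE INTO embeddings VALUES (?, ?, ?)",
        (post_id, emb.shape[0], emb.astype(np.float32).tobytes()),
    )
    con.commit()

def get_embedding(con: sqlite3.Connection, post_id: str) -> Optional[np.ndarray]:
    row = con.execute(
        "SELECT seq_len, data FROM embeddings WHERE post_id = ?", (post_id,)
    ).fetchone()
    if row is None:
        return None  # not cached yet: compute, then store with put_embedding
    seq_len, blob = row
    return np.frombuffer(blob, dtype=np.float32).reshape(seq_len, 768)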
Training procedure (baseline regimen)
- Domain emphasis: 80% crypto / 20% equities per training phase.
- Phased training (10% of available chunks per phase), with evaluation after each phase (see the training‑loop sketch below):
- In‑sample: small subset from the phase’s chunks
- Out‑of‑sample: small hold‑out from both domains (not seen in the phase)
- Early‑stop condition: stop if out‑of‑sample cosine degrades relative to prior phase.
- Optimizer: AdamW
- Learning rate (baseline finetune): 5e‑5
- Batch size: 16
- Input max_source_length: 256
- Target max_target_length: 128
- Checkpointing: every 2,000 steps and at phase end.
Notes
- Training used early stopping based on out‑of‑sample cosine.
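A minimal sketch of the phased loop with early stopping on out‑of‑sample cosine; the callables (train_fn, eval_oos_cosine, save_ckpt) are placeholders for the project's own training, evaluation, and checkpointing code.

```python
def run_phased_training(phase_chunks, train_fn, eval_oos_cosine, save_ckpt):
    """phase_chunks: iterable of per-phase chunk lists (each roughly 10% of the
    data, pre-mixed at 80% crypto / 20% equities). The three callables wrap the
    project's own training (AdamW, lr=5e-5, batch_size=16), out-of-sample
    evaluation, and checkpointing code."""
    prev_cosine = float("-inf")
    for phase, chunks in enumerate(phase_chunks):
        train_fn(chunks)                     # one phase of training
        oos_cosine = eval_oos_cosine()       # hold-out from both domains
        save_ckpt(tag=f"phase_{phase}")      # checkpoint at phase end
        if oos_cosine < prev_cosine:
            break                            # early stop: out-of-sample cosine degraded
        prev_cosine = oos_cosine
```

The released checkpoint corresponds to the latest phase that did not degrade under this criterion.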
Evaluation protocol (for the metrics below)
- Sample size: 1,000 examples per domain drawn from cached embedding databases.
- Decode config:
num_beams=8, length_penalty_alpha=0.6, lambda_sim=0.6, rescore_every_k=4, rescore_top_m=8, beta=10.0, enable_constraints=True, deterministic=True.
- Metrics:
- cosine_mean/median/p10/p90: cosine between the pooled MPNet embedding of the generated text and the pooled MPNet target vector (higher is better); see the metric sketch after this list.
- score_norm_mean: length‑penalized language‑model score (more positive is better; negative values are common for log‑scores).
- degenerate_pct: % of clearly degenerate generations (very short, blank, or hashtag‑only).
- domain_drift_pct: % of equity‑like terms in crypto outputs (or crypto‑like terms in equities outputs). Heuristic text filter; intended as a rough indicator only.
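A minimal sketch of the pooled‑cosine metric and a simple degeneracy heuristic, assuming sentence-transformers is installed; the repository's actual metric code may differ in detail.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def pooled_cosine(generated_text: str, target_pooled: np.ndarray) -> float:
    # Re-embed the generated text and compare with the pooled MPNet target vector.
    gen = encoder.encode(generated_text)  # pooled 768-d vector
    denom = float(np.linalg.norm(gen) * np.linalg.norm(target_pooled)) or 1.0
    return float(np.dot(gen, target_pooled)) / denom

def is_degenerate(text: str, min_tokens: int = 3) -> bool:
    # Rough heuristic in the spirit of degenerate_pct: very short, blank,
    # or hashtag-only outputs count as degenerate.
    tokens = text.split()
    return len(tokens) < min_tokens or all(t.startswith("#") for t in tokens)
```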
Results (current models/baseline checkpoint)
- Crypto (n=1000)
- cosine_mean: 0.681
- cosine_median: 0.843
- cosine_p10: 0.000
- cosine_p90: 0.984
- score_norm_mean: −1.977
- degenerate_pct: 5.2%
- domain_drift_pct: 0.0%
- Equities (n=1000)
- cosine_mean: 0.778
- cosine_median: 0.901
- cosine_p10: 0.326
- cosine_p90: 0.986
- score_norm_mean: −1.344
- degenerate_pct: 2.2%
- domain_drift_pct: 4.4%
Interpretation
- The model reconstructs many posts with strong embedding alignment (p90 ≈ 0.98 cosine in both domains).
- Equities shows higher average/median cosine and lower degeneracy than crypto, consistent with the auxiliary‑domain role and data characteristics.
- A small fraction of degenerate outputs exists in both domains (crypto ~5.2%, equities ~2.2%).
- Domain drift is minimal from crypto→equities (0.0%) and present at a modest rate from equities→crypto (~4.4%) under the chosen heuristic.
Input contract and usage
- Input: MPNet token‑level matrix (seq_len × 768) for a single post. Do not pass a pooled vector.
- Tokenizer/model alignment matters: use the same MPNet tokenizer/model version that produced the embeddings (a minimal extraction sketch follows this list).
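A minimal sketch of producing the expected (seq_len, 768) token‑level input with Hugging Face transformers, using the same MPNet checkpoint end to end; the project's own preprocessing may differ.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def token_level_embedding(post: str, max_source_length: int = 256) -> torch.Tensor:
    inputs = tokenizer(post, truncation=True, max_length=max_source_length,
                       return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.squeeze(0)                      # (seq_len, 768): pass this, not a pooled vector
```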
Limitations and responsible use
- Reconstruction is not guaranteed to match the original post text; it optimizes alignment within the MPNet embedding space and LM scoring.
- The model can produce generic or incomplete outputs (see degenerate_pct).
- Domain drift can occur depending on decode settings (see domain_drift_pct).
- Data are synthetic programmatic generations, not real social‑media posts; domain semantics may differ from real‑world distributions.
- Do not use for reconstructing sensitive/private content or for attempting to de‑anonymize embedding corpora. This model is a research/diagnostic tool.
Reproducibility (high‑level)
- Prepare embedding caches (not included): build local token‑level MPNet embedding caches for your corpora (e.g., via a data prep script) and store them in your own paths.
- Baseline training: iterative 10% phases, 80:20 (crypto:equities), LR=5e‑5, BS=16, early‑stop on out‑of‑sample cosine degradation.
- Evaluation: 1,000 samples/domain with the decode settings shown above.
- The released checkpoint corresponds to the latest non‑degrading phase under early‑stopping.
License
- Code: MIT (per repository).
- Model weights: same as code unless declared otherwise upon release.
Citation
If you use this model or codebase, please cite the Aparecium project and this baseline report.