---
license: apache-2.0
language:
- en
datasets:
- wikitext
- glue
pipeline_tag: text-generation
tags:
- transformer
- attention
- mla
- research

---

# DeepSeek Tiny v0.1

A 6-layer DeepSeek-V3 model with Multi-head Latent Attention (MLA), trained for research on shared subspaces in Transformer attention mechanisms.

## Model Description

- **Model Type**: Transformer Decoder (DeepSeek-V3 based)
- **Architecture**: 6-layer decoder with Mixture of Experts
- **Parameters**: 16.26M
- **Hidden Size**: 256
- **Attention Heads**: 8
- **Head Dimension**: 32
- **Sequence Length**: 1,024 tokens
- **Query Latent Dimension**: 96
- **Key-Value Latent Dimension**: 64

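These hyperparameters can be read directly from the model's configuration. Below is a minimal sketch; the attribute names assume the standard `DeepseekV3Config` fields in 🤗 Transformers, so treat `config.json` as the authoritative source.

```python
from transformers import AutoConfig

# Inspect the architecture hyperparameters listed above. The attribute names
# below follow the usual DeepseekV3Config conventions and are assumptions
# here -- config.json in the repo is the source of truth.
config = AutoConfig.from_pretrained("ChrisMcCormick/deepseek-tiny-v0.1")

print(config.num_hidden_layers)    # 6 decoder layers
print(config.hidden_size)          # 256
print(config.num_attention_heads)  # 8 heads
print(config.q_lora_rank)          # query latent dimension (96)
print(config.kv_lora_rank)         # key-value latent dimension (64)
```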

## Performance

- **SST-2 Accuracy**: 87.96%
- **WikiText-103 Perplexity**: 28.89

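The exact evaluation protocol (context length, striding, tokenization) is not documented in this card; the sketch below is one hypothetical way to reproduce a perplexity number on the WikiText-103 test split using non-overlapping 1,024-token windows.

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, DeepseekV3ForCausalLM

# Hypothetical perplexity check; settings here are assumptions, not the
# protocol used to produce the number reported above.
model = DeepseekV3ForCausalLM.from_pretrained("ChrisMcCormick/deepseek-tiny-v0.1").eval()
tokenizer = AutoTokenizer.from_pretrained("ChrisMcCormick/deepseek-tiny-v0.1")

test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

nlls, n_tokens = [], 0
for start in range(0, ids.size(1) - 1, 1024):  # non-overlapping 1,024-token windows
    chunk = ids[:, start : start + 1024]
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss  # mean NLL over the window
    nlls.append(loss * (chunk.size(1) - 1))
    n_tokens += chunk.size(1) - 1

print("Perplexity:", math.exp(torch.stack(nlls).sum().item() / n_tokens))
```
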
## Research Context

This model is part of the [shared-subspaces](https://github.com/chrisjmccormick/shared-subspaces) research project investigating the impact of shared output latent spaces in Transformer attention mechanisms.

## Usage

```python
import torch
from transformers import AutoTokenizer, DeepseekV3ForCausalLM

# Load model and tokenizer
model = DeepseekV3ForCausalLM.from_pretrained("ChrisMcCormick/deepseek-tiny-v0.1")
tokenizer = AutoTokenizer.from_pretrained("ChrisMcCormick/deepseek-tiny-v0.1")
model.eval()

# Generate text (sampling must be enabled for temperature to take effect)
inputs = tokenizer("The future of AI is", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

- **Pre-training Dataset**: WikiText-103
- **Fine-tuning Dataset**: SST-2 (GLUE)
- **Optimizer**: AdamW
- **Learning Rate**: 5e-4 (pre-training), 5e-5 (fine-tuning)
- **Weight Decay**: 0.01 (pre-training), 0.05 (fine-tuning)
- **Precision**: bfloat16
- **Compilation**: torch.compile with inductor backend
- **Training Steps**: 12,500 (pre-training), 1,500 (fine-tuning)
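
For reference, the pre-training settings above map roughly onto standard 🤗 `TrainingArguments` as sketched below. This is illustrative only: batch size, warmup, and LR schedule are not listed in this card and are left at defaults, which may differ from the actual training run.

```python
from transformers import TrainingArguments

# Illustrative pre-training arguments matching the values listed above.
# Anything not stated in the card (batch size, warmup, LR schedule) is an
# assumption left at its default here.
pretrain_args = TrainingArguments(
    output_dir="deepseek-tiny-pretrain",
    optim="adamw_torch",               # AdamW
    learning_rate=5e-4,                # 5e-5 for SST-2 fine-tuning
    weight_decay=0.01,                 # 0.05 for SST-2 fine-tuning
    max_steps=12_500,                  # 1,500 for SST-2 fine-tuning
    bf16=True,                         # bfloat16 precision
    torch_compile=True,
    torch_compile_backend="inductor",
)
```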

## Limitations

- Small-scale model (~16M parameters) intended for research purposes
- Trained on limited data compared to production models
- May require custom loading code for output subspace variants

## Citation

```bibtex
@misc{mccormick2025sharedsubspaces,
  title={Shared Subspaces in Transformer Attention: Investigating Output Latent Spaces},
  author={McCormick, Chris},
  year={2025},
  howpublished={\url{https://github.com/chrisjmccormick/shared-subspaces}}
}
```

## License

Apache 2.0