Check out my blog!
Update 17/10/2025
V4 released! This time, instead of training a vision model from scratch, it uses a simple MLP that takes the CLS token from a DINOv3 model to produce the embedding.
Far more accurate than the previous V3! You do need access to DINOv3 with your HuggingFace token, though.
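Roughly, the V4 pipeline is: DINOv3 backbone → CLS token → small MLP → style vector. Here is a minimal sketch of that second stage, with a random vector standing in for the real CLS token (the 768 input dimension and the layer sizes are assumptions, not the actual model's configuration; see minimal_script.py for real usage):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the CLS token from a DINOv3 backbone
# (the real CLS dimension depends on the DINOv3 variant; 768 is assumed here).
cls_token = rng.normal(size=768)

def mlp_head(x, sizes=(768, 256, 8)):
    """Tiny MLP projecting a backbone embedding down to a short style vector.
    Weights are random here; the real model loads trained weights."""
    for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        w = rng.normal(scale=n_in ** -0.5, size=(n_in, n_out))
        x = x @ w
        if i < len(sizes) - 2:  # ReLU on hidden layers only
            x = np.maximum(x, 0.0)
    return x

style_vec = mlp_head(cls_token)
print(style_vec.shape)  # (8,)
```

The point of the MLP-on-CLS design is that the heavy lifting (general visual features) is delegated to the frozen DINOv3 backbone, so only a tiny head needs training.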
You can use just 6 or 7 numbers to fully describe the style of an (anime) image!
What is it and what can it do?
Many diffusion models use artist tags to control the style of their output images. I'm really not a fan of that, for three reasons:
- Many artists share very similar styles, making many artist tags redundant.
- Some artists have more than one distinct art style in their work. A basic example: sketches vs. finished images.
- Prone to content bleeding. If the artist behind the tag you choose draws lots of recurring content, that content is very likely to bleed into your output even when you don't prompt for it.
One way to overcome this is to use a style embedding model: a model that takes in images of arbitrary sizes and outputs a style vector for each one. The style vector lives in an N-dimensional space and is essentially just a list of N numbers, where each number corresponds to a specific style element of the input image.
Images with similar styles should have similar embeddings (low distance), while images with different styles will have embeddings that are far apart (high distance).
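To make the distance idea concrete, here is a small sketch using cosine distance (the style vectors below are made up purely for illustration, and the real model may use a different distance metric):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: near 0 for similar vectors, larger when they differ."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

# Made-up 4-D style vectors purely for illustration.
watercolor_a = [0.9, 0.1, 0.4, 0.0]
watercolor_b = [0.8, 0.2, 0.5, 0.1]   # similar style to watercolor_a
flat_cel     = [0.0, 0.9, 0.1, 0.8]   # very different style

d_same = cosine_distance(watercolor_a, watercolor_b)
d_diff = cosine_distance(watercolor_a, flat_cel)
print(f"same-style distance: {d_same:.3f}, different-style distance: {d_diff:.3f}")
```

Clustering images by style then reduces to grouping points that sit close together in this space.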
The included Python files give minimal usage examples. minimal_script.py contains the minimal code for running an image through the network and obtaining an output, while gallery_review.py contains the code I used to generate the visualisations and clustering.
Training data is here.
Training Hyperparameters
For the current version (v4):
Training was done using PyTorch Lightning.
lr = 0.0005
weight_decay = 0.01
AdamW optimizer
ExponentialLR scheduler, with a gamma of 0.99, applied every epoch.
Batch size of 9999 (so all data goes through the network at once).
With every anchor image, 4 positive images and 16 negative images are used.
Trained for 150 epochs on a single RTX 3080 GPU, for a total of 150 optimizer updates (one per epoch, since the whole dataset fits in a single batch).
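To make the schedule concrete, this is the per-epoch learning rate implied by the hyperparameters above (ExponentialLR multiplies the learning rate by gamma after every epoch; plain Python, no framework needed):

```python
# lr, gamma, and epoch count as listed in the hyperparameters above.
lr, gamma, epochs = 0.0005, 0.99, 150

# Learning rate used during each epoch e.
schedule = [lr * gamma ** e for e in range(epochs)]

print(f"epoch   0: {schedule[0]:.6f}")   # 0.000500
print(f"epoch 149: {schedule[-1]:.6f}")
```

With gamma = 0.99, the learning rate decays gently: after 150 epochs it is still roughly a fifth of its starting value, which suits the small number of total updates here.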