	---
	license: apache-2.0
	language:
	- en
	library_name: litert-lm
	tags:
	- embeddings
	- text-embedding
	- gemma
	- tflite
	- litert
	- on-device
	- edge-ai
	pipeline_tag: feature-extraction
	---

	# EmbeddingGemma 300M - LiteRT-LM Format

	This is Google's EmbeddingGemma 300M model converted to the LiteRT-LM `.litertlm` format for use with Google's [LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM) runtime. This format is optimized for on-device inference on mobile and edge devices.

	## Model Details

	| Property | Value |
	|----------|-------|
	| Base Model | [google/embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m) |
	| Source TFLite | [litert-community/embeddinggemma-300m](https://huggingface.co/litert-community/embeddinggemma-300m) |
	| Format | LiteRT-LM (.litertlm) |
	| Embedding Dimension | 256 |
	| Max Sequence Length | 512 tokens |
	| Precision | Mixed (int8/fp16) |
	| Model Size | ~171 MB |
	| Parameters | ~300M |

	## How This Model Was Created

	### Conversion Process

	This model was created by converting the TFLite model from [litert-community/embeddinggemma-300m](https://huggingface.co/litert-community/embeddinggemma-300m) to the LiteRT-LM `.litertlm` bundle format using Google's official tooling:

	1. Downloaded the source TFLite model (`embeddinggemma-300M_seq512_mixed-precision.tflite`)

	2. Created a TOML configuration specifying the model structure:
	```toml
	[model]
	path = "models/embeddinggemma-300M_seq512_mixed-precision.tflite"
	spm_model_path = ""

	[model.start_tokens]
	model_input_name = "input_ids"

	[model.output_logits]
	model_output_name = "Identity"
	```

	3. Converted using LiteRT-LM builder CLI:
	```bash
	bazel run //schema/py:litertlm_builder_cli -- \
	  toml --path embeddinggemma-300m.toml \
	  output --path embeddinggemma-300m.litertlm
	```

	The `.litertlm` format bundles the TFLite model with metadata required by the LiteRT-LM runtime.

	## Node.js Native Bindings (node-gyp)

	To use this model from Node.js, we created custom N-API bindings that wrap the LiteRT-LM C API. The binding was built using:

	- node-gyp for native addon compilation
	- N-API (Node-API) for stable ABI compatibility
	- clang-20 with C++20 support
	- Linking against the prebuilt `liblibengine_napi` library from LiteRT-LM

	### Building the Native Bridge

	```bash
	cd native-bridge
	npm install
	CC=/usr/lib/llvm-20/bin/clang CXX=/usr/lib/llvm-20/bin/clang++ npm run rebuild
	```

	### TypeScript Interface

	```typescript
	export interface EmbedderConfig {
	  modelPath: string;
	  embeddingDim?: number; // default: 256
	  maxSeqLength?: number; // default: 512
	  numThreads?: number; // default: 4
	}

	export class LiteRtEmbedder {
	  constructor(config: EmbedderConfig);
	  embed(text: string): Float32Array;
	  embedBatch(texts: string[]): Float32Array[];
	  isValid(): boolean;
	  getEmbeddingDim(): number;
	  getMaxSeqLength(): number;
	  close(): void;
	}
	```
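
Because the embedder holds a native handle, it helps to pair every construction with a `close()` call. The sketch below shows one way to do that using only the methods declared in the interface above. `withEmbedder`, `createEmbedder`, and `FakeEmbedder` are hypothetical names introduced for illustration; the fake class stands in for `LiteRtEmbedder` so the snippet runs without the native addon.

```javascript
// Run `work` with an embedder and always release it afterwards.
// `createEmbedder` is a hypothetical factory, for example
// () => new LiteRtEmbedder({ modelPath: 'embeddinggemma-300m.litertlm' }).
function withEmbedder(createEmbedder, work) {
  const embedder = createEmbedder();
  try {
    if (!embedder.isValid()) {
      throw new Error('Embedder failed to initialize (isValid() returned false)');
    }
    return work(embedder);
  } finally {
    // Runs on both success and failure, so the handle is not leaked.
    embedder.close();
  }
}

// Stand-in with the same method names as the interface above (not the real binding).
class FakeEmbedder {
  constructor(valid = true) {
    this.valid = valid;
    this.closed = false;
  }
  isValid() { return this.valid; }
  getEmbeddingDim() { return 256; }
  embed(_text) { return new Float32Array(256); }
  close() { this.closed = true; }
}

const fake = new FakeEmbedder();
const dim = withEmbedder(() => fake, (e) => e.embed('Hello world').length);
console.log(dim, fake.closed); // 256 true
```

The declared methods are synchronous, so `work` is assumed to be synchronous as well. If it returned a promise, `close()` would run before the promise settled.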

	### Usage Example

	```javascript
	const { LiteRtEmbedder } = require('@mcp-agent/litert-lm-native');

	const embedder = new LiteRtEmbedder({
	  modelPath: 'embeddinggemma-300m.litertlm',
	  embeddingDim: 256,
	  maxSeqLength: 512,
	  numThreads: 4
	});

	// Single embedding
	const embedding = embedder.embed("Hello world");
	console.log('Dimension:', embedding.length); // 256

	// Batch embedding
	const embeddings = embedder.embedBatch([
	  "First document",
	  "Second document",
	  "Third document"
	]);

	// Cleanup
	embedder.close();
	```
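
Embeddings are usually compared with cosine similarity. The helper below is a minimal sketch of that comparison for the `Float32Array` values returned by `embed()`. The name `cosineSimilarity` is an illustrative choice and not part of the binding, and the sample vectors are stand-ins so the snippet runs without the native addon.

```javascript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new RangeError(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  // A zero vector has no direction; report 0 instead of NaN.
  return denom === 0 ? 0 : dot / denom;
}

// Stand-in vectors; in practice these would come from embedder.embed(...).
const x = Float32Array.from([1, 0, 0, 0]);
const y = Float32Array.from([0.5, 0.5, 0, 0]);
console.log(cosineSimilarity(x, x).toFixed(3)); // 1.000
console.log(cosineSimilarity(x, y).toFixed(3)); // 0.707
```

With real embeddings, the call would look like `cosineSimilarity(embedder.embed(a), embedder.embed(b))`.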

	## Benchmarks (CPU Only)

	Benchmarks were run on a ThinkPad X1 Carbon 9th Gen (Intel Core i7-1165G7 @ 2.80 GHz), CPU only, with no GPU acceleration.

	> Note: The current wrapper uses a hash-based placeholder for tokenization and inference, so the measured figures below reflect API call overhead only, not model execution. Latency with real TFLite inference will be substantially higher.

	### API Overhead Benchmarks

	| Metric | Value |
	|--------|-------|
	| Initialization | <1ms |
	| Latency (short text) | 0.002ms |
	| Latency (medium text) | 0.003ms |
	| Latency (long text) | 0.003ms |
	| Memory per embedding | 0.32 KB |

	### Batch Processing

	| Batch Size | Time/Batch | Time/Item |
	|------------|------------|-----------|
	| 1 | 0.004ms | 0.004ms |
	| 5 | 0.015ms | 0.003ms |
	| 10 | 0.031ms | 0.003ms |
	| 20 | 0.074ms | 0.004ms |
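
To repeat these measurements, for example once real inference is wired in, a small timing harness along the following lines may be enough. This is a sketch and not the script that produced the tables above. `benchmark` and `fakeEmbed` are hypothetical names, and the global `performance` object is assumed to be available (Node.js 16 or later).

```javascript
// Time `fn` over `iterations` calls, after `warmup` untimed calls.
function benchmark(fn, { warmup = 100, iterations = 1000 } = {}) {
  if (!Number.isInteger(iterations) || iterations < 1) {
    throw new RangeError('iterations must be a positive integer');
  }
  for (let i = 0; i < warmup; i++) fn();

  const samples = new Float64Array(iterations);
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    fn();
    samples[i] = performance.now() - start;
  }

  samples.sort(); // typed arrays sort numerically by default
  const mean = samples.reduce((sum, v) => sum + v, 0) / iterations;
  const percentile = (p) =>
    samples[Math.min(iterations - 1, Math.floor(p * iterations))];

  return { meanMs: mean, p50Ms: percentile(0.5), p95Ms: percentile(0.95) };
}

// Stand-in workload; replace with () => embedder.embed('some text').
const fakeEmbed = () => new Float32Array(256);
const stats = benchmark(fakeEmbed, { warmup: 10, iterations: 200 });
console.log(stats);
```

Reporting a median and a tail percentile alongside the mean makes outliers such as garbage-collection pauses easier to spot at sub-millisecond scales.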

	### Expected Real-World Performance

	These figures are estimates, not measurements, based on similar embedding models running on comparable hardware:

	| Scenario | Expected Latency |
	|----------|------------------|
	| Single embedding (CPU) | 10-50ms |
	| Batch of 10 (CPU) | 50-200ms |
	| With XNNPACK optimization | 5-20ms |

	## C API Usage

	For direct C/C++ integration:

	```c
	#include <stdio.h>

	#include "c/embedder.h"

	int main(void) {
	    // Create settings
	    LiteRtEmbedderSettings* settings = litert_embedder_settings_create(
	        "embeddinggemma-300m.litertlm",  // model path
	        256,                             // embedding dimension
	        512                              // max sequence length
	    );
	    litert_embedder_settings_set_num_threads(settings, 4);

	    // Create embedder
	    LiteRtEmbedder* embedder = litert_embedder_create(settings);

	    // Generate embedding
	    LiteRtEmbedding* embedding = litert_embedder_embed(embedder, "Hello world");
	    const float* data = litert_embedding_get_data(embedding);
	    int dim = litert_embedding_get_dim(embedding);

	    // Use embedding for similarity search, etc.
	    printf("dim=%d first=%f\n", dim, data[0]);

	    // Cleanup
	    litert_embedding_delete(embedding);
	    litert_embedder_delete(embedder);
	    litert_embedder_settings_delete(settings);
	    return 0;
	}
	```

	## Use Cases

	- Semantic search on mobile/edge devices
	- Document similarity without cloud dependencies
	- RAG (Retrieval Augmented Generation) with local embeddings
	- MCP tool matching for AI agents
	- Offline text classification
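
Several of these use cases reduce to nearest-neighbour lookup over stored embeddings. The sketch below shows a brute-force, in-memory version of that lookup. `VectorIndex` and `normalize` are hypothetical names rather than part of this package, and the 3-dimensional vectors stand in for real 256-dimensional embeddings.

```javascript
// Scale a vector to unit length so a dot product equals cosine similarity.
function normalize(vec) {
  let sumSq = 0;
  for (const v of vec) sumSq += v * v;
  const norm = Math.sqrt(sumSq);
  return norm === 0
    ? Float32Array.from(vec)
    : Float32Array.from(vec, (v) => v / norm);
}

// Brute-force index: stores normalized vectors and scans all of them per query.
class VectorIndex {
  constructor() {
    this.entries = [];
  }

  add(id, embedding) {
    this.entries.push({ id, vector: normalize(embedding) });
  }

  // Return the k entries most similar to the query, best match first.
  search(queryEmbedding, k = 3) {
    const q = normalize(queryEmbedding);
    return this.entries
      .map(({ id, vector }) => {
        if (vector.length !== q.length) {
          throw new RangeError(`Dimension mismatch for entry "${id}"`);
        }
        let score = 0;
        for (let i = 0; i < q.length; i++) score += q[i] * vector[i];
        return { id, score };
      })
      .sort((a, b) => b.score - a.score)
      .slice(0, k);
  }
}

// Toy vectors; real entries would come from embedder.embed(text).
const index = new VectorIndex();
index.add('a', [1, 0, 0]);
index.add('b', [0, 1, 0]);
index.add('c', [0.9, 0.1, 0]);
const hits = index.search([1, 0, 0], 2);
console.log(hits.map((h) => h.id)); // [ 'a', 'c' ]
```

A linear scan is typically adequate for a few thousand vectors of this size. Larger collections would call for an approximate nearest-neighbour index, which is outside the scope of this sketch.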

	## Limitations

	1. Tokenization: Currently uses a simplified character-based tokenizer. For best results, integrate with SentencePiece using the Gemma tokenizer vocabulary.

	2. Model Inference: The current wrapper uses placeholder inference, so the vectors it returns are not yet real EmbeddingGemma embeddings. Full TFLite inference integration requires linking against the LiteRT C API.

	3. Platform Support: Currently tested on Linux x86_64. macOS and Windows support requires platform-specific builds.

	## Repository Structure

	```
	models/
	├── embeddinggemma-300m.litertlm # This model
	├── embeddinggemma-300m.toml # Conversion config
	└── embeddinggemma-300M_seq512_mixed-precision.tflite # Source TFLite

	native-bridge/
	├── src/litert_lm_binding.cc # N-API bindings
	├── binding.gyp # Build configuration
	└── lib/index.d.ts # TypeScript definitions

	deps/LiteRT-LM/c/
	├── embedder.h # C API header
	└── embedder.cc # C implementation
	```

	## License

	This model conversion is provided under the Apache 2.0 license. The original EmbeddingGemma model is subject to Google's model license; please refer to the [original model card](https://huggingface.co/google/embeddinggemma-300m) for details.

	## Acknowledgments

	- EmbeddingGemma by Google Research
	- LiteRT-LM by Google AI Edge team
	- LiteRT Community for the pre-converted TFLite model

	## Citation

	If you use this model, please cite the original EmbeddingGemma paper:

	```bibtex
	@article{embeddinggemma2025,
	  title={EmbeddingGemma: Powerful and Lightweight Text Representations},
	  author={Google Research},
	  year={2025}
	}
	```