---
license: apache-2.0
base_model: Qwen/Qwen3-8B
tags:
- semantic-ids
- recommendation-system
- video-games
- generative-retrieval
- qwen
- fine-tuned
datasets:
- eugeneyan/video-games-semantic-ids-mapping
language:
- en
library_name: transformers
pipeline_tag: text-generation
---

# Semantic ID Recommender - Qwen3 8B (Video Games)

## Model Description

This is a Qwen3 8B model fine-tuned for video game product recommendation using semantic IDs. The model is trained to understand and generate hierarchical semantic identifiers that encode product relationships, enabling generative retrieval for recommendation systems.

See the writeup and demo here: https://eugeneyan.com/writing/semantic-ids/

### What are Semantic IDs?

Semantic IDs are learned hierarchical representations that encode product similarities and relationships in their structure. Unlike traditional IDs, semantic IDs carry meaning: similar products have similar ID prefixes.

## Special Tokens

The model uses special tokens to work with semantic IDs:

- `<|sid_start|>`: Marks the beginning of a semantic ID
- `<|sid_X|>`: Hierarchical level tokens where X ∈ [0, 1023]
- `<|sid_end|>`: Marks the end of a semantic ID
- `<|rec|>`: Trigger token for generating recommendations

### Semantic ID Format

`<|sid_start|><|sid_127|><|sid_45|><|sid_89|><|sid_12|><|sid_end|>`

This represents a 4-level hierarchy where each level provides increasingly specific categorization.
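As a quick sanity check, you can tokenize a semantic ID and confirm that each level maps to a single token. This is a minimal sketch; the expected output assumes the `<|sid_*|>` tokens are registered with the tokenizer as special tokens:

```python
from transformers import AutoTokenizer

# Load the fine-tuned tokenizer, which includes the <|sid_*|> tokens
tokenizer = AutoTokenizer.from_pretrained(
    "eugeneyan/semantic-id-qwen3-8b-video-games", trust_remote_code=True
)

# A 4-level semantic ID should split into exactly 6 tokens:
# <|sid_start|>, four level tokens, and <|sid_end|>
sid = "<|sid_start|><|sid_127|><|sid_45|><|sid_89|><|sid_12|><|sid_end|>"
print(tokenizer.tokenize(sid))
# Expected (assuming the tokens are registered as special tokens):
# ['<|sid_start|>', '<|sid_127|>', '<|sid_45|>', '<|sid_89|>', '<|sid_12|>', '<|sid_end|>']
```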
## Training Details

- **Base Model**: Qwen3 8B
- **Fine-tuning Method**: Supervised Fine-Tuning (SFT)
- **Dataset**: Amazon Video Games reviews and metadata
- **Number of Products**: 66,097
- **Training Epochs**: 2
- **Task**: Next item prediction and recommendation generation

## Usage

### Installation

```bash
pip install transformers torch datasets
```

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "eugeneyan/semantic-id-qwen3-8b-video-games"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Set padding for generation
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Generate recommendations
prompt = "User: <|sid_start|><|sid_8|><|sid_454|><|sid_630|><|sid_768|><|sid_end|>\n<|rec|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        temperature=0.3,
        top_p=0.7,
        top_k=20,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

# Decode only the generated portion
input_length = inputs["input_ids"].shape[1]
generated_tokens = outputs[:, input_length:]
response = tokenizer.decode(generated_tokens[0], skip_special_tokens=False)
print(response)
```
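The decoded response contains the recommended semantic ID as special tokens and may include chat markers such as `<|im_end|>`. Since the model can occasionally emit an ID that does not correspond to a real product (see Limitations below), one way to extract and filter the generated IDs is sketched here; the helper name is ours, and it assumes the mapping dataset described in the next section:

```python
import re
from datasets import load_dataset

# Load the semantic ID -> title mapping (also used in the next section)
mapping_df = load_dataset("eugeneyan/video-games-semantic-ids-mapping")["train"].to_pandas()
valid_sids = set(mapping_df["semantic_id"])

def extract_semantic_ids(text: str) -> list:
    """Pull complete <|sid_start|>...<|sid_end|> spans out of a model response."""
    return re.findall(r"<\|sid_start\|>(?:<\|sid_\d+\|>)+<\|sid_end\|>", text)

# `response` is the decoded output from the snippet above
candidate_ids = extract_semantic_ids(response)

# Keep only IDs that map to a real product
recommended = [sid for sid in candidate_ids if sid in valid_sids]
print(recommended)
```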
<|rec|>""".strip() response = chat(INPUT) # Output: Multiple relevant products <|sid_start|><|sid_64|><|sid_313|><|sid_637|><|sid_768|><|sid_end|>, <|sid_start|><|sid_219|><|sid_463|><|sid_660|><|sid_768|><|sid_end|>, <|sid_start|><|sid_64|><|sid_313|><|sid_608|><|sid_768|><|sid_end|><|im_end|> # Output mapped ASSISTANT: "Halo 3 Limited Edition -Xbox 360", "Battlefield: Bad Company - Playstation 3", "Halo Reach - Limited Edition -Xbox 360" ``` ### Attribute-Steered Recommendations ```python # Steering recommendations given an item and attribute (Xbox) INPUT = """Recommend Xbox games similar to <|sid_start|><|sid_201|><|sid_396|><|sid_608|><|sid_769|><|sid_end|>: <|rec|>""".strip() response = chat(INPUT) # Output: Xbox-specific recommendations <|sid_start|><|sid_64|><|sid_271|><|sid_576|><|sid_768|><|sid_end|>, <|sid_start|><|sid_64|><|sid_400|><|sid_594|><|sid_768|><|sid_end|>, <|sid_start|><|sid_167|><|sid_271|><|sid_578|><|sid_768|><|sid_end|><|im_end|> # Output mapped ASSISTANT: "Fallout: New Vegas - Xbox 360 Ultimate Edition", "Tales of Vesperia - Xbox 360", "Halo Reach - Legendary Edition ``` ```python # Provide natural language chat input and get item recommendations INPUT = """I like animal and cute games. <|rec|>""".strip() response = chat(INPUT) # Output: Games matching the genre preference <|sid_start|><|sid_173|><|sid_324|><|sid_764|><|sid_768|><|sid_end|>, <|sid_start|><|sid_201|><|sid_397|><|sid_738|><|sid_769|><|sid_end|>, <|sid_start|><|sid_173|><|sid_305|><|sid_670|><|sid_768|><|sid_end|><|im_end|> # Output mapped ASSISTANT: "Animal Crossing: New Leaf", "Disney Magical World - Nintendo 3DS", "Nintendogs + Cats: Golden Retriever and New Friends" ``` ### Explanatory Recommendations ```python # Provide item to get recommendation and explanation INPUT = """I just finished <|sid_start|><|sid_125|><|sid_417|><|sid_656|><|sid_768|><|sid_end|>. Suggest another <|rec|> and explain why:""".strip() response = chat(INPUT) # Output: Recommendation with natural language explanation <|sid_start|><|sid_139|><|sid_289|><|sid_534|><|sid_768|><|sid_end|> If you liked Dragon Quest Heroes II, you might like Nights of Azure because both are action RPGs for the PlayStation 4 with a focus on combat and character progression. Both games offer a narrative-driven experience with a strong emphasis on combat mechanics, suggesting a shared appeal for players who enjoy this genre on the platform.<|im_end|> # Output mapped ASSISTANT: "Nights of Azure - PlayStation 4" If you liked Dragon Quest Heroes II, you might like Nights of Azure because both are action RPGs for the PlayStation 4 with a focus on combat and character progression. Both games offer a narrative-driven experience with a strong emphasis on combat mechanics, suggesting a shared appeal for players who enjoy this genre on the platform. 
## Example Interactions

The `chat` helper used in these examples is defined in the Multi-Turn Conversations section below.

### Single Item Recommendation

```python
# Provide the user's past interactions as input and get a recommendation
INPUT = """User: <|sid_start|><|sid_8|><|sid_454|><|sid_630|><|sid_768|><|sid_end|>, <|sid_start|><|sid_126|><|sid_501|><|sid_553|><|sid_768|><|sid_end|>, <|sid_start|><|sid_205|><|sid_370|><|sid_548|><|sid_768|><|sid_end|>
<|rec|>""".strip()

response = chat(INPUT)

# Output: Recommended product
# <|sid_start|><|sid_205|><|sid_407|><|sid_586|><|sid_768|><|sid_end|><|im_end|>

# Output mapped
# ASSISTANT: "Assassin's Creed 2 Deluxe Edition [Download]"
```

```python
# Provide a single past item as input and get a similar item
INPUT = """Customers who bought <|sid_start|><|sid_201|><|sid_311|><|sid_758|><|sid_768|><|sid_end|> also bought:
<|rec|>""".strip()

response = chat(INPUT)

# Output: Recommended product
# <|sid_start|><|sid_201|><|sid_396|><|sid_608|><|sid_769|><|sid_end|><|im_end|>

# Output mapped
# ASSISTANT: "The Legend of Zelda: Ocarina of Time 3D"
```

### Natural Language with Semantic IDs

```python
# Provide natural language chat input and get item recommendations
INPUT = """I like scifi and action games.
<|rec|>""".strip()

response = chat(INPUT)

# Output: Multiple relevant products
# <|sid_start|><|sid_64|><|sid_313|><|sid_637|><|sid_768|><|sid_end|>, <|sid_start|><|sid_219|><|sid_463|><|sid_660|><|sid_768|><|sid_end|>, <|sid_start|><|sid_64|><|sid_313|><|sid_608|><|sid_768|><|sid_end|><|im_end|>

# Output mapped
# ASSISTANT: "Halo 3 Limited Edition -Xbox 360", "Battlefield: Bad Company - Playstation 3", "Halo Reach - Limited Edition -Xbox 360"
```

### Attribute-Steered Recommendations

```python
# Steer recommendations given an item and an attribute (Xbox)
INPUT = """Recommend Xbox games similar to <|sid_start|><|sid_201|><|sid_396|><|sid_608|><|sid_769|><|sid_end|>:
<|rec|>""".strip()

response = chat(INPUT)

# Output: Xbox-specific recommendations
# <|sid_start|><|sid_64|><|sid_271|><|sid_576|><|sid_768|><|sid_end|>, <|sid_start|><|sid_64|><|sid_400|><|sid_594|><|sid_768|><|sid_end|>, <|sid_start|><|sid_167|><|sid_271|><|sid_578|><|sid_768|><|sid_end|><|im_end|>

# Output mapped
# ASSISTANT: "Fallout: New Vegas - Xbox 360 Ultimate Edition", "Tales of Vesperia - Xbox 360", "Halo Reach - Legendary Edition"
```

```python
# Provide natural language chat input and get item recommendations
INPUT = """I like animal and cute games.
<|rec|>""".strip()

response = chat(INPUT)

# Output: Games matching the genre preference
# <|sid_start|><|sid_173|><|sid_324|><|sid_764|><|sid_768|><|sid_end|>, <|sid_start|><|sid_201|><|sid_397|><|sid_738|><|sid_769|><|sid_end|>, <|sid_start|><|sid_173|><|sid_305|><|sid_670|><|sid_768|><|sid_end|><|im_end|>

# Output mapped
# ASSISTANT: "Animal Crossing: New Leaf", "Disney Magical World - Nintendo 3DS", "Nintendogs + Cats: Golden Retriever and New Friends"
```

### Explanatory Recommendations

```python
# Provide an item to get a recommendation and an explanation
INPUT = """I just finished <|sid_start|><|sid_125|><|sid_417|><|sid_656|><|sid_768|><|sid_end|>. Suggest another <|rec|> and explain why:""".strip()

response = chat(INPUT)

# Output: Recommendation with natural language explanation
# <|sid_start|><|sid_139|><|sid_289|><|sid_534|><|sid_768|><|sid_end|> If you liked Dragon Quest Heroes II, you might like Nights of Azure because both are action RPGs for the PlayStation 4 with a focus on combat and character progression. Both games offer a narrative-driven experience with a strong emphasis on combat mechanics, suggesting a shared appeal for players who enjoy this genre on the platform.<|im_end|>

# Output mapped
# ASSISTANT: "Nights of Azure - PlayStation 4" If you liked Dragon Quest Heroes II, you might like Nights of Azure because both are action RPGs for the PlayStation 4 with a focus on combat and character progression. Both games offer a narrative-driven experience with a strong emphasis on combat mechanics, suggesting a shared appeal for players who enjoy this genre on the platform.
```

### Multi-Turn Conversations

The model supports multi-turn conversations with context preservation:

```python
from transformers import TextStreamer


def chat(text_input: str, messages: list = None, stream: bool = True):
    """Interactive chat with the model; pass `messages` back in to continue a conversation."""
    if messages is None:
        messages = []

    messages.append({"role": "user", "content": text_input})

    # Apply chat template
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    # Stream output for better UX
    streamer = TextStreamer(tokenizer, skip_prompt=True) if stream else None

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.3,
            top_p=0.7,
            top_k=20,
            do_sample=True,
            streamer=streamer
        )

    # Extract only new tokens
    input_length = inputs["input_ids"].shape[1]
    generated = tokenizer.decode(output[0][input_length:], skip_special_tokens=True)

    messages.append({"role": "assistant", "content": generated})
    return generated, messages
```

```python
# 1st turn: Ask for games similar to Mario Kart
INPUT = "I'm looking for games similar to Mario Kart. <|rec|>"
response1, messages = chat(INPUT)

# Output
# <|sid_start|><|sid_131|><|sid_492|><|sid_639|><|sid_768|><|sid_end|>, <|sid_start|><|sid_145|><|sid_480|><|sid_617|><|sid_768|><|sid_end|>, <|sid_start|><|sid_145|><|sid_290|><|sid_620|><|sid_768|><|sid_end|><|im_end|>

# Output mapped
# ASSISTANT: "CTR: Crash Team Racing", "Crazy Taxi 2 - Sega Dreamcast", "Mario Kart: Super Circuit"

# 2nd turn: Tweak it for Xbox, continuing the same conversation
INPUT = "How about something similar but for Xbox? <|rec|>"
response2, messages = chat(INPUT, messages)

# Output
# <|sid_start|><|sid_183|><|sid_461|><|sid_517|><|sid_768|><|sid_end|>, <|sid_start|><|sid_183|><|sid_313|><|sid_679|><|sid_769|><|sid_end|>, <|sid_start|><|sid_183|><|sid_313|><|sid_605|><|sid_768|><|sid_end|><|im_end|>

# Output mapped
# ASSISTANT: "Need for Speed Carbon - Xbox 360", "Forza Motorsport 2 - Xbox 360", "NASCAR '14 - Xbox 360"

# 3rd turn: Ask for a bundle name
INPUT = "Suggest a name and description for the bundle"
response3, messages = chat(INPUT, messages)

# Output
# ASSISTANT: Xbox Racing Legends: NASCAR & Forza Collection
```

### Performance

- Model Size: ~16GB
- Inference: Requires a GPU with at least 20GB VRAM for float16
- Quantization: Can run on 12GB VRAM with 8-bit quantization
- CPU Inference: Possible but slow; use MPS on Apple Silicon for better performance
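For the 8-bit option above, one possible setup is `bitsandbytes` quantization via `BitsAndBytesConfig`; this is a minimal sketch, not the only way to quantize, and it requires `pip install bitsandbytes`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the model in 8-bit to fit roughly 12GB of VRAM (assumes bitsandbytes is installed)
model_name = "eugeneyan/semantic-id-qwen3-8b-video-games"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
```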
### Category Information

This model is specifically trained for Video Games products:

- Total products: 66,097
- Hierarchy levels: 4
- Tokens per level: 1024
- Semantic similarity encoded in hierarchy depth

### Limitations

- Trained specifically on video games products
- Semantic IDs are fixed at training time
- Requires the mapping dataset to interpret semantic IDs
- Performance may degrade on products very different from the training data
- May occasionally generate invalid semantic IDs (these can be filtered post-generation)

### Citation

If you use this model, please cite:

```bibtex
@misc{semantic_id_qwen3_8b_video_games,
  author = {Eugene Yan},
  title = {Semantic ID Recommender - Qwen3 8B (Video Games)},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/eugeneyan/semantic-id-qwen3-8b-video-games}
}
```

### Acknowledgments

- Base model: Qwen Team
- Training approach inspired by: https://arxiv.org/abs/2305.12218 and https://arxiv.org/abs/2306.08121
- Dataset: Amazon Video Games

### Related Resources

- Mapping Dataset: https://huggingface.co/eugeneyan/video-games-semantic-ids-mapping
- GitHub: https://github.com/eugeneyan/semantic-ids