NuExtract-2-4B [experimental version] by NuMind 🔥
NuExtract 2.0 experimental is a family of models trained specifically for structured information extraction tasks. It supports multimodal inputs and is multilingual. NB: This is an experimental version that will be superseded by NuExtract 2.0.
We provide several versions of different sizes, all based on the InternVL2.5 family.
| Model Size | Model Name | Base Model | Huggingface Link | 
|---|---|---|---|
| 2B | NuExtract-2.0-2B | InternVL2_5-2B | NuExtract-2-2B | 
| 4B | NuExtract-2.0-4B | InternVL2_5-4B | NuExtract-2-4B | 
| 8B | NuExtract-2.0-8B | InternVL2_5-8B | NuExtract-2-8B | 
Overview
To use the model, provide an input text/image and a JSON template describing the information you need to extract. The template should be a JSON object specifying field names and their expected types.
Supported types include:
- `verbatim-string` - instructs the model to extract text that is present verbatim in the input.
- `string` - a generic string field that can incorporate paraphrasing/abstraction.
- `integer` - a whole number.
- `number` - a whole or decimal number.
- `date-time` - an ISO-formatted date.
- Array of any of the above types (e.g. `["string"]`)
- `enum` - a choice from a set of possible answers (represented in the template as an array of options, e.g. `["yes", "no", "maybe"]`).
- `multi-label` - an enum that can have multiple possible answers (represented in the template as a double-wrapped array, e.g. `[["A", "B", "C"]]`).
If the model does not identify relevant information for a field, it will return null or [] (for arrays and multi-labels).
The following is an example template:
{
  "first_name": "verbatim-string",
  "last_name": "verbatim-string",
  "description": "string",
  "age": "integer",
  "gpa": "number",
  "birth_date": "date-time",
  "nationality": ["France", "England", "Japan", "USA", "China"],
  "languages_spoken": [["English", "French", "Japanese", "Mandarin", "Spanish"]]
}
An example output:
{
  "first_name": "Susan",
  "last_name": "Smith",
  "description": "A student studying computer science.",
  "age": 20,
  "gpa": 3.7,
  "birth_date": "2005-03-01",
  "nationality": "England",
  "languages_spoken": ["English", "French"]
}
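For an input that mentions, say, only a name, the remaining fields come back empty. A hypothetical output illustrating the null/[] behavior described above:
{
  "first_name": "Susan",
  "last_name": "Smith",
  "description": null,
  "age": null,
  "gpa": null,
  "birth_date": null,
  "nationality": null,
  "languages_spoken": []
}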
⚠️ We recommend using NuExtract with a temperature at or very close to 0. Some inference frameworks, such as Ollama, use a default of 0.7, which is not well suited to many extraction tasks.
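If you do serve the model through such a framework, override the default explicitly. A minimal sketch using Ollama's REST API (the nuextract model tag is hypothetical and assumes you have imported the model locally):
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "nuextract",             # hypothetical tag for a locally imported model
        "prompt": "...",                  # your formatted NuExtract prompt
        "options": {"temperature": 0.0},  # override Ollama's 0.7 default
        "stream": False,
    },
)
print(response.json()["response"])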
Inference
Use the following code to handle loading and preprocessing of input data:
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio
def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height
    # enumerate candidate tile grids (columns x rows) within the patch budget
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)
    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images
def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
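# Example usage (example.jpg is a hypothetical local file): load_image returns a
# stack of 448x448 tiles, e.g. shape [7, 3, 448, 448]; the tile count varies with
# the image's aspect ratio (up to max_num tiles, plus one thumbnail).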
def prepare_inputs(messages, image_paths, tokenizer, device='cuda', dtype=torch.bfloat16):
    """
    Prepares multi-modal input components (supports multiple images per prompt).
    
    Args:
        messages: List of input messages/prompts (strings or dicts with 'role' and 'content')
        image_paths: List where each element is either None (for text-only) or a list of image paths
        tokenizer: The tokenizer to use for applying chat templates
        device: Device to place tensors on ('cuda', 'cpu', etc.)
        dtype: Data type for image tensors (default: torch.bfloat16)
    
    Returns:
        dict: Contains 'prompts', 'pixel_values_list', and 'num_patches_list' ready for the model
    """
    # Make sure image_paths list is at least as long as messages
    if len(image_paths) < len(messages):
        # Pad with None for text-only messages
        image_paths = image_paths + [None] * (len(messages) - len(image_paths))
    
    # Process images and collect patch information
    loaded_images = []
    num_patches_list = []
    for paths in image_paths:
        if paths and isinstance(paths, list) and len(paths) > 0:
            # Load each image in this prompt
            prompt_images = []
            prompt_patches = []
            
            for path in paths:
                # Load the image
                img = load_image(path).to(dtype=dtype, device=device)
                
                # Ensure img has correct shape [patches, C, H, W]
                if len(img.shape) == 3:  # [C, H, W] -> [1, C, H, W]
                    img = img.unsqueeze(0)
                    
                prompt_images.append(img)
                # Record the number of patches for this image
                prompt_patches.append(img.shape[0])
            
            loaded_images.append(prompt_images)
            num_patches_list.append(prompt_patches)
        else:
            # Text-only prompt
            loaded_images.append(None)
            num_patches_list.append([])
    
    # Create the concatenated pixel_values_list
    pixel_values_list = []
    for prompt_images in loaded_images:
        if prompt_images:
            # Concatenate all images for this prompt
            pixel_values_list.append(torch.cat(prompt_images, dim=0))
        else:
            # Text-only prompt
            pixel_values_list.append(None)
    
    # Format messages for the model
    if all(isinstance(m, str) for m in messages):
        # Simple string messages: convert to chat format
        batch_messages = [
            [{"role": "user", "content": message}] 
            for message in messages
        ]
    else:
        # Assume messages are already in the right format
        batch_messages = messages
    
    # Apply chat template
    prompts = tokenizer.apply_chat_template(
        batch_messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    return {
        'prompts': prompts,
        'pixel_values_list': pixel_values_list,
        'num_patches_list': num_patches_list
    }
def construct_message(text, template, examples=None):
    """
    Construct the individual NuExtract message texts, prior to chat template formatting.
    """
    # add few-shot examples if needed
    if examples is not None and len(examples) > 0:
        icl = "# Examples:\n"
        for row in examples:
            icl += f"## Input:\n{row['input']}\n## Output:\n{row['output']}\n"
    else:
        icl = ""
        
    return f"""# Template:\n{template}\n{icl}# Context:\n{text}"""
To handle inference:
IMG_START_TOKEN='<img>'
IMG_END_TOKEN='</img>'
IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'
def nuextract_generate(model, tokenizer, prompts, generation_config, pixel_values_list=None, num_patches_list=None):
    """
    Generate responses for a batch of NuExtract inputs.
    Supports multiple and varying numbers of images per prompt.
    
    Args:
        model: The vision-language model
        tokenizer: The tokenizer for the model
        prompts: List of text prompts
        generation_config: Configuration for text generation
        pixel_values_list: List of tensor batches, one per prompt.
                           Each batch has shape [num_images, channels, height, width], or None for text-only prompts
        num_patches_list: List of lists, each containing patch counts for the images in a prompt
        
    Returns:
        List of generated responses
    """
    img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
    model.img_context_token_id = img_context_token_id
    
    # Replace all image placeholders with appropriate tokens
    modified_prompts = []
    total_image_files = 0
    total_patches = 0
    image_containing_prompts = []
    for idx, prompt in enumerate(prompts):
        # check if this prompt has images
        has_images = (pixel_values_list and
                      idx < len(pixel_values_list) and 
                      pixel_values_list[idx] is not None and 
                      isinstance(pixel_values_list[idx], torch.Tensor) and
                      pixel_values_list[idx].shape[0] > 0)
        
        if has_images:
            # prompt with image placeholders
            image_containing_prompts.append(idx)
            modified_prompt = prompt
            
            patches = num_patches_list[idx] if (num_patches_list and idx < len(num_patches_list)) else []
            num_images = len(patches)
            total_image_files += num_images
            total_patches += sum(patches)
            
            # replace each <image> placeholder with image tokens
            for i, num_patches in enumerate(patches):
                image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * model.num_image_token * num_patches + IMG_END_TOKEN
                modified_prompt = modified_prompt.replace('<image>', image_tokens, 1)
        else:
            # text-only prompt
            modified_prompt = prompt
        
        modified_prompts.append(modified_prompt)
    
    # process all prompts in a single batch
    tokenizer.padding_side = 'left'
    model_inputs = tokenizer(modified_prompts, return_tensors='pt', padding=True)
    input_ids = model_inputs['input_ids'].to(model.device)
    attention_mask = model_inputs['attention_mask'].to(model.device)
    
    eos_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
    generation_config['eos_token_id'] = eos_token_id
    
    # prepare pixel values
    flattened_pixel_values = None
    if image_containing_prompts:
        # collect and concatenate all image tensors
        all_pixel_values = []
        for idx in image_containing_prompts:
            all_pixel_values.append(pixel_values_list[idx])
        
        flattened_pixel_values = torch.cat(all_pixel_values, dim=0)
        print(f"Processing batch with {len(prompts)} prompts, {total_image_files} actual images, and {total_patches} total patches")
    else:
        print(f"Processing text-only batch with {len(prompts)} prompts")
    
    # generate outputs
    outputs = model.generate(
        pixel_values=flattened_pixel_values,  # will be None for text-only prompts
        input_ids=input_ids,
        attention_mask=attention_mask,
        **generation_config
    )
    
    # Decode responses
    responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    
    return responses
To load the model:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "numind/NuExtract-2-4B"  # or another size from the table above
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side='left')
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, 
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2" # we recommend using flash attention
                                            ).to("cuda")
Simple 0-shot text-only example:
template = """{"names": ["verbatim-string"]}"""
text = "John went to the restaurant with Mary. James went to the cinema."
input_messages = [construct_message(text, template)]
input_content = prepare_inputs(
    messages=input_messages,
    image_paths=[],
    tokenizer=tokenizer,
)
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
with torch.no_grad():
    result = nuextract_generate(
        model=model,
        tokenizer=tokenizer,
        prompts=input_content['prompts'],
        pixel_values_list=input_content['pixel_values_list'],
        num_patches_list=input_content['num_patches_list'],
        generation_config=generation_config
    )
for y in result:
    print(y)
# {"names": ["John", "Mary", "James"]}
Text-only input with an in-context example:
template = """{"names": ["verbatim-string"], "female_names": ["verbatim-string"]}"""
text = "John went to the restaurant with Mary. James went to the cinema."
examples = [
    {
        "input": "Stephen is the manager at Susan's store.",
        "output": """{"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}"""
    }
]
input_messages = [construct_message(text, template, examples)]
input_content = prepare_inputs(
    messages=input_messages,
    image_paths=[],
    tokenizer=tokenizer,
)
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
with torch.no_grad():
    result = nuextract_generate(
        model=model,
        tokenizer=tokenizer,
        prompts=input_content['prompts'],
        pixel_values_list=input_content['pixel_values_list'],
        num_patches_list=input_content['num_patches_list'],
        generation_config=generation_config
    )
for y in result:
    print(y)
# {"names": ["JOHN", "MARY", "JAMES"], "female_names": ["MARY"]}
Example with image input and an in-context example. Image inputs should use an <image> placeholder in place of the text, and image paths should be provided as a list, in order of appearance in the prompt (in this example, 0.jpg is the in-context example and 1.jpg the true input).
template = """{"store": "verbatim-string"}"""
text = "<image>"
examples = [
    {
        "input": "<image>",
        "output": """{"store": "Walmart"}"""
    }
]
input_messages = [construct_message(text, template, examples)]
images = [
    ["0.jpg", "1.jpg"]
]
input_content = prepare_inputs(
    messages=input_messages,
    image_paths=images,
    tokenizer=tokenizer,
)
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
with torch.no_grad():
    result = nuextract_generate(
        model=model,
        tokenizer=tokenizer,
        prompts=input_content['prompts'],
        pixel_values_list=input_content['pixel_values_list'],
        num_patches_list=input_content['num_patches_list'],
        generation_config=generation_config
    )
for y in result:
    print(y)
# {"store": "Trader Joe's"}
Multi-modal batched input:
inputs = [
    # image input with no ICL examples
    {
        "text": "<image>",
        "template": """{"store_name": "verbatim-string"}""",
        "examples": None,
    },
    # image input with 1 ICL example
    {
        "text": "<image>",
        "template": """{"store_name": "verbatim-string"}""",
        "examples": [
            {
                "input": "<image>",
                "output": """{"store_name": "Walmart"}""",
            }
        ],
    },
    # text input with no ICL examples
    {
        "text": "John went to the restaurant with Mary. James went to the cinema.",
        "template": """{"names": ["verbatim-string"]}""",
        "examples": None,
    },
    # text input with ICL example
    {
        "text": "John went to the restaurant with Mary. James went to the cinema.",
        "template": """{"names": ["verbatim-string"], "female_names": ["verbatim-string"]}""",
        "examples": [
            {
                "input": "Stephen is the manager at Susan's store.",
                "output": """{"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}"""
            }
        ],
    },
]
input_messages = [
    construct_message(
        x["text"], 
        x["template"], 
        x["examples"]
    ) for x in inputs
]
images = [
    ["0.jpg"],
    ["0.jpg", "1.jpg"],
    None,
    None
]
input_content = prepare_inputs(
    messages=input_messages,
    image_paths=images,
    tokenizer=tokenizer,
)
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
with torch.no_grad():
    result = nuextract_generate(
        model=model,
        tokenizer=tokenizer,
        prompts=input_content['prompts'],
        pixel_values_list=input_content['pixel_values_list'],
        num_patches_list=input_content['num_patches_list'],
        generation_config=generation_config
    )
for y in result:
    print(y)
# {"store_name": "WAL*MART"}
# {"store_name": "Trader Joe's"}
# {"names": ["John", "Mary", "James"]}
# {"names": ["JOHN", "MARY", "JAMES"], "female_names": ["MARY"]}
Template Generation
If you want to convert existing schema files in other formats (e.g. XML, YAML, etc.), or start from an example, NuExtract 2 models can automatically generate a NuExtract template for you.
E.g. to convert XML into a NuExtract template:
def generate_template(description):
    input_messages = [description]
    input_content = prepare_inputs(
        messages=input_messages,
        image_paths=[],
        tokenizer=tokenizer,
    )
    generation_config = {"do_sample": True, "temperature": 0.4, "max_new_tokens": 256}
    with torch.no_grad():
        result = nuextract_generate(
            model=model,
            tokenizer=tokenizer,
            prompts=input_content['prompts'],
            pixel_values_list=input_content['pixel_values_list'],
            num_patches_list=input_content['num_patches_list'],
            generation_config=generation_config
        )
    return result[0]
xml_template = """<SportResult>
    <Date></Date>
    <Sport></Sport>
    <Venue></Venue>
    <HomeTeam></HomeTeam>
    <AwayTeam></AwayTeam>
    <HomeScore></HomeScore>
    <AwayScore></AwayScore>
    <TopScorer></TopScorer>
</SportResult>"""
result = generate_template(xml_template)
print(result)
# { 
#     "SportResult": {
#         "Date": "date-time",
#         "Sport": "verbatim-string",
#         "Venue": "verbatim-string",
#         "HomeTeam": "verbatim-string",
#         "AwayTeam": "verbatim-string",
#         "HomeScore": "integer",
#         "AwayScore": "integer",
#         "TopScorer": "verbatim-string"
#     }
# }
E.g. generate a template from natural language description:
text = """Give me relevant info about startup companies mentioned."""
result = generate_template(text)
print(result)
# {
#     "Startup_Companies": [
#         {
#             "Name": "verbatim-string",
#             "Products": [
#                 "string"
#             ],
#             "Location": "verbatim-string",
#             "Company_Type": [
#                 "Technology",
#                 "Finance",
#                 "Health",
#                 "Education",
#                 "Other"
#             ]
#         }
#     ]
# }
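A generated template can then be fed straight back into the extraction flow. A sketch, where startup_text stands in for a hypothetical input document:
startup_text = "..."  # hypothetical document mentioning startup companies
input_messages = [construct_message(startup_text, result)]
input_content = prepare_inputs(
    messages=input_messages,
    image_paths=[],
    tokenizer=tokenizer,
)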