weathon/qwen_2_5_vision_reward · gen[0]["generated_text"] has only one score

Sep 25

Hello, when I input an image, why does the return value of gen[0]['generated_text'] only contain a score? Is there an error?
gen[0]['generated_text']='Here's my rating for the image based on the provided guidelines: - Background: 0 ('

weathon

Owner Sep 25

Please try this checkpoint:
weathon/qwen_2_5_vision_reward_long

haoxincool

Sep 25

I used the checkpoint 'weathon/qwen_2_5_vision_reward_long', but gen[0]['generated_text'] still only returns a score.
gen[0]['generated_text']='Here's my rating for the image based on the provided guidelines: - Background: 0 ('

weathon

Owner Sep 25

Did you give it the system prompt? Please see readme.md for the code, and use the rules.csv file in the prompt. Also set max tokens to a very large number (10240).
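
A minimal sketch of the max-token advice above, assuming the generation budget is passed at call time as `max_new_tokens` (a generation argument the transformers pipelines accept). The helper name `generate_rating` is hypothetical, and the thread does not say exactly which setting resolved the issue, so treat this as one plausible way to apply the advice rather than the confirmed fix.

```python
def generate_rating(pipe, messages, max_new_tokens=10240):
    """Run the pipeline with an explicit budget for newly generated tokens.

    `pipe` is assumed to be an "image-text-to-text" pipeline (or any callable
    with the same call signature); `messages` is the chat-style list holding
    the system prompt and the image.
    """
    gen = pipe(
        text=messages,
        max_new_tokens=max_new_tokens,  # cap on generated tokens only
        return_full_text=False,         # return just the model's answer
    )
    return gen[0]["generated_text"]
```

If the budget is too small, generation can stop partway through the first entry, which would look like the truncated "Background: 0 (" output reported above.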

haoxincool

Sep 25

Here is my code. I have used the rules.csv file in the prompt, and the result of gen[0]["generated_text"] is: Here's my rating for the image based on the provided guidelines: /n- Background: 0
from transformers import pipeline
import pandas as pd
import re
from PIL import Image
import json

pipe = pipeline("image-text-to-text", model="/data/eval/Qwen_2_5_vision_reward/model/qwen_2_5_vision_reward_long", max_length=10240)
df = pd.read_csv("./model/qwen_2_5_vision_reward_long/rules.csv")
df.columns = df.columns.str.strip()
df['Dimension'] = df['Dimension'].ffill()
df['dim_key'] = df['Dimension'].apply(lambda x: re.search(r'\((.*?)\)', x).group(1) if re.search(r'\((.*?)\)', x) else x)
guide = {
    dim_key: {
        int(row['Score']): str(row['Description']).strip()
        for _, row in group.iterrows()
    }
    for dim_key, group in df.groupby('dim_key')
}

question = f"You need to rate the quality of an image, guideline: {guide}."

def rate(image):
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": question}],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image.resize((512, 512)),
                }
            ],
        }]
    gen = pipe(text=messages, return_full_text=False)
    print(gen[0]["generated_text"])

imgs_path = "/data/1014-001.jpg"
image = Image.open(imgs_path)
rate(image)
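
For reference, a small sketch of what the dim_key step in this code is meant to do: pull the short name out of the parentheses in a Dimension label, and fall back to the whole label when there are none. The sample label format ("Background Quality (background)") is taken from the model output quoted later in this thread; that rules.csv uses the same format is an assumption. The pattern needs its backslashes and `*` intact, which forum rendering tends to swallow.

```python
import re

def dim_key(dimension: str) -> str:
    """Return the text inside the first pair of parentheses, else the label itself."""
    match = re.search(r"\((.*?)\)", dimension)
    return match.group(1) if match else dimension

# Labels in the assumed "Long Name (short name)" format
print(dim_key("Background Quality (background)"))
print(dim_key("Safety Rating"))
```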

weathon

Owner Sep 25

Please try this code first:

import json
import re

import pandas as pd
from datasets import load_dataset
from PIL import Image
from transformers import pipeline

# load dataset, you can also use any images
train_ds = load_dataset("zai-org/VisionRewardDB-Image", split='train[:40000]')
test_ds = load_dataset("zai-org/VisionRewardDB-Image", split='train[40000:]')

pipe = pipeline("image-text-to-text", model="weathon/qwen_2_5_vision_reward")

# load the scoring rules and clean up the column names
df = pd.read_csv("rules.csv")
df.columns = df.columns.str.strip()
# a dimension name is only given on its first row, so fill it downwards
df['Dimension'] = df['Dimension'].ffill()

# use the short name in parentheses as the key when there is one
df['dim_key'] = df['Dimension'].apply(lambda x: re.search(r'\((.*?)\)', x).group(1) if re.search(r'\((.*?)\)', x) else x)

# {dimension key: {score: description}}
guide = {
    dim_key: {
        int(row['Score']): str(row['Description']).strip()
        for _, row in group.iterrows()
    }
    for dim_key, group in df.groupby('dim_key')
}

question = f"You need to rate the quality of an image, guideline: {guide}."

def rate(image):
  messages = [
      {
          "role": "system",
          "content": [{"type": "text", "text": question}],
      },
      {
          "role": "user",
          "content": [
              {
                  "type": "image",
                  "image": image.resize((512, 512)),
              }
          ],
  }]
  gen = pipe(text=messages, return_full_text=False)
  # the model answers with a dict of per-dimension scores; sum them up
  return sum(json.loads(gen[0]["generated_text"].replace("'", '"')).values())

# predicted total score for one test image
rate(test_ds[3]["image"])

# total score from the dataset annotation, for comparison
sum(test_ds[3]["annotation"].values())
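
The quote-replacement trick in `rate` assumes the generated text never contains an apostrophe. A slightly more defensive sketch is below; it assumes the model answers with a Python-style or JSON-style dict of scores, and the helper name `parse_scores` is hypothetical, not part of the README code.

```python
import ast
import json

def parse_scores(text):
    """Parse a dict of scores from generated text, or return None if it is not one.

    Tries strict JSON first, then a Python literal (single-quoted keys), so no
    blind quote replacement is needed.
    """
    text = text.strip()
    for loader in (json.loads, ast.literal_eval):
        try:
            parsed = loader(text)
        except (ValueError, SyntaxError):
            continue  # not valid in this syntax, try the next loader
        if isinstance(parsed, dict):
            return parsed
    return None

scores = parse_scores("{'background': 0, 'face': -1}")
total = sum(scores.values()) if scores is not None else None
```

Returning None for free-form text makes a truncated or prose answer easy to spot instead of raising deep inside the scoring call.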

weathon

Owner Sep 25

It seems you are not using the right model: model="/data/eval/Qwen_2_5_vision_reward/model/qwen_2_5_vision_reward_long"

haoxincool

Sep 25

I downloaded the checkpoint 'weathon/qwen_2_5_vision_reward_long' to my local path "/data/eval/Qwen_2_5_vision_reward/model/qwen_2_5_vision_reward_long".
My code is the same as yours, except that I load a local image, but gen[0]['generated_text'] only contains the score for the first key.

weathon

Owner Sep 25

https://colab.research.google.com/drive/1i0QS7NPFZ_oNmhCTjshH4sjrH0fEixRC?usp=sharing
I tried the code and it works. Could you confirm that the content of rules.csv is correct?

haoxincool

Sep 26

Hello, thanks to your help, the problem of no output has been resolved. It was due to the max tokens setting. But during batch processing, I discovered that the format of the results returned by the model is inconsistent. How can I make the output format consistent?
Some outputs look like this:
This image has a high level of detail authenticity, with the subject's face appearing photorealistic and free of noticeable flaws. The color aesthetics are pleasant, contributing to an overall aesthetically pleasing image. The lighting and shadow distinctions are clear, enhancing the image's depth and realism.

Rating Breakdown:

Detail Authenticity: 1
Detail Refinement: 1
Emotional Response: 1
Environmental Light and Shadow Prominence: 1
Face Quality: 2
Hand Quality: 0
Harm Type: 0
Human Body Accuracy: 0
Light and Shadow Aesthetics: 1
Main Object Position: 1
Object Composition: 0
Overall Clarity: 1
Overall Symmetry: 0
Safety Rating: 1
Scene Richness: 0

Final Rating: 6/10
and this:
Here's my assessment of the image based on the provided guidelines:

Background Quality (background): 0
- The background is not remarkable and does not capture the viewer's attention.
Brightness (color brightness): 1
- The color is bright.
Color Aesthetics(color aesthetic): 1
- The colors in the image are aesthetically pleasing, and you can imagine the picture becoming more attractive solely due to its coloring.
Detail Authenticity (detail realism): 1
- The imagery is photorealistic and completely free of noticeable flaws.
Detail Refinement (detail refinement): 1
- The image is relatively refined overall, but still with some visible flaws or room for improvement in precision.
Emotional Response (emotion): 0
- The emotional response to the image is neutral or indifferent.
Environmental Light and Shadow Prominence (lighting distinction): 1
- Lighting or shadows are clearly visible, and a light source may be apparent in the image.
Face Quality (face): -1
- No human face is present.
Hand Quality (hands): -1
- No depiction of hands—select this without making assumptions.
Harm Type (unsafe type): 0
- Harmless
Human Body Accuracy (body): -1
- Minor anatomical flaws are present but not severe (e.g., subtle issues with proportions or facial features).
Light and Shadow Aesthetics (lighting aesthetic)（Maintain controlled variables—judge whether light and shadow significantly enhance the image's aesthetics.）: 0
- Light and shadow are present, but they do not significantly enhance the appeal of the image.
Main Object Position (main object): 1
- The main objects are prominent if they meet all the following conditions:A main object or objects exist in the image.The main objects are in noticeable positions (e.g., central areas of the image or focal points).The main objects are not too small.

weathon

Owner Sep 26

I still think you did not load the correct checkpoint. As in the Colab I shared with you, the output should be JSON. If it is not JSON, the likely cause is that the wrong checkpoint or prompt is being used.
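
For batch runs, one way to act on this is to check every output before using it and set aside anything that is not a dict with the expected keys. This is only a sketch: the function name `check_outputs` is hypothetical, and it assumes the expected keys are the dimension keys of the guide (for example `guide.keys()`).

```python
import ast

def check_outputs(outputs, expected_keys):
    """Split generated texts into parsed score dicts and indices of bad outputs."""
    expected = set(expected_keys)
    valid, invalid = {}, []
    for i, text in enumerate(outputs):
        try:
            parsed = ast.literal_eval(text.strip())
        except (ValueError, SyntaxError):
            parsed = None  # free-form prose or truncated output
        if isinstance(parsed, dict) and set(parsed) == expected:
            valid[i] = parsed
        else:
            invalid.append(i)  # candidates for a retry or manual inspection
    return valid, invalid
```

A large share of invalid outputs would point back to the checkpoint, prompt, or environment rather than to individual images.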

haoxincool

Sep 28

Hello, thanks to your help, the generated results are now correct. I've found that the versions of transformers and peft can affect the generated results (both the scores and the format).
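
Since the library versions turned out to matter, it can help to record them next to the results. The sketch below uses only the standard library; the thread does not state which versions are the working ones, so none are given here, and the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages=("transformers", "peft")):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

print(installed_versions())
```

Comparing this output between an environment that returns JSON and one that does not is a quick way to narrow down a version mismatch.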
