---
license: apache-2.0
datasets:
- OpenGVLab/ScaleCUA-Data
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- agent
---

# SCALECUA: SCALING UP COMPUTER USE AGENTS WITH CROSS-PLATFORM DATA

[\[📂 GitHub\]](https://github.com/OpenGVLab/ScaleCUA) [\[📜 Paper\]](https://github.com/OpenGVLab/ScaleCUA) [\[🚀 Quick Start\]](#model-loading)

## Introduction

Recent advances in Vision-Language Models have enabled agents that automate interactions with graphical user interfaces. While some computer use agents demonstrate strong performance, they are typically built on closed-source models or inaccessible proprietary datasets, and existing open-source datasets remain insufficient for developing cross-platform, general-purpose computer use agents. To bridge this gap, we scale up computer use data via a novel dual-loop interactive pipeline that brings together automated agents and human experts for data collection. The resulting corpus spans **6 operating systems** and **3 task domains**, offering a large-scale and diverse resource for training computer use agents.
Building on this corpus, we develop **ScaleCUA**, capable of seamless operation across heterogeneous platforms. Trained on our dataset, it delivers consistent gains on several benchmarks, improving absolute success rates by **+26.6 points** on WebArena-Lite-v2 and **+10.7 points** on ScreenSpot-Pro over the baseline. Moreover, our ScaleCUA family achieves state-of-the-art performance across multiple benchmarks, e.g., **94.4%** on MMBench-GUI L1-Hard, **60.6%** on OSWorld-G, and **47.4%** on WebArena-Lite-v2. These results highlight the effectiveness of our data-centric methodology in scaling GUI understanding, grounding, and cross-platform task completion. We make our data, models, and code publicly available to facilitate future research: https://github.com/OpenGVLab/ScaleCUA.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6502f241b1792803da7e8def/YdK0I790ehLAKpR1vGkX1.png)

---

## Model Loading

We provide example code to run `ScaleCUA` with `transformers`.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "OpenGVLab/ScaleCUA-3B", torch_dtype="auto", device_map="auto"
)

# Resolution bounds used when preprocessing screenshots
min_pixels = 3136
max_pixels = 2109744
processor = AutoProcessor.from_pretrained("OpenGVLab/ScaleCUA-3B", min_pixels=min_pixels, max_pixels=max_pixels)
```

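The `min_pixels` and `max_pixels` bounds control how the processor resamples each screenshot, and the same bounds are reused later to map predicted coordinates back onto the original image. Below is a minimal sketch of that relationship, assuming a hypothetical 1920x1080 screenshot:

```python
from qwen_vl_utils import smart_resize

# smart_resize snaps (height, width) onto the vision patch grid and keeps the
# total pixel count within [min_pixels, max_pixels]; the model sees the
# screenshot at this resolution, so its predicted coordinates live in this space.
resize_h, resize_w = smart_resize(1080, 1920, min_pixels=min_pixels, max_pixels=max_pixels)

# A predicted point (x_model, y_model) therefore maps back to the original
# screenshot as: x = x_model / resize_w * 1920, y = y_model / resize_h * 1080,
# which is exactly the rescaling applied in the sections below.
print(resize_h, resize_w)
```
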
## Direct Action Mode as a grounder

For tasks that require direct GUI grounding (e.g., identifying and clicking a specific button from a description), or to serve as the grounder in an agentic workflow, use the **Direct Action Mode**. This mode generates immediate, executable actions directly from the visual input.

1. To enable this mode, set the system prompt as follows:
```python
SCALECUA_SYSTEM_PROMPT_GROUNDER = '''You are an autonomous GUI agent capable of operating on desktops, mobile devices, and web browsers. Your primary function is to analyze screen captures and perform appropriate UI actions to complete assigned tasks.

## Action Space
def click(
    x: float | None = None,
    y: float | None = None,
    clicks: int = 1,
    button: str = "left",
) -> None:
    """Clicks on the screen at the specified coordinates. The `x` and `y` parameter specify where the mouse event occurs. If not provided, the current mouse position is used. The `clicks` parameter specifies how many times to click, and the `button` parameter specifies which mouse button to use ('left', 'right', or 'middle')."""
    pass

def doubleClick(
    x: float | None = None,
    y: float | None = None,
    button: str = "left",
) -> None:
    """Performs a double click. This is a wrapper function for click(x, y, 2, 'left')."""
    pass

def rightClick(x: float | None = None, y: float | None = None) -> None:
    """Performs a right mouse button click. This is a wrapper function for click(x, y, 1, 'right')."""
    pass

def moveTo(x: float, y: float) -> None:
    """Move the mouse to the specified coordinates."""
    pass

def dragTo(
    x: float | None = None, y: float | None = None, button: str = "left"
) -> None:
    """Performs a drag-to action with optional `x` and `y` coordinates and button."""
    pass

def swipe(
    from_coord: tuple[float, float] | None = None,
    to_coord: tuple[float, float] | None = None,
    direction: str = "up",
    amount: float = 0.5,
) -> None:
    """Performs a swipe action on the screen. The `from_coord` and `to_coord` specify the starting and ending coordinates of the swipe. If `to_coord` is not provided, the `direction` and `amount` parameters are used to determine the swipe direction and distance. The `direction` can be 'up', 'down', 'left', or 'right', and the `amount` specifies how far to swipe relative to the screen size (0 to 1)."""
    pass

def long_press(x: float, y: float, duration: int = 1) -> None:
    """Long press on the screen at the specified coordinates. The `duration` specifies how long to hold the press in seconds."""
    pass

## Input Specification
- Screenshot of the current screen + task description

## Output Format
<action>
[A set of executable action command]
</action>

## Note
- Avoid action(s) that would lead to invalid states.
- The generated action(s) must exist within the defined action space.
- The generated action(s) should be enclosed within <action></action> tags.'''
```
2. Use the above system prompt to generate a prediction:
```python
low_level_instruction = "Click the 'X' button in the upper right corner of the pop-up to close it and access the car selection options."

messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": SCALECUA_SYSTEM_PROMPT_GROUNDER,
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/path/to/your/image",
            },
            {"type": "text", "text": low_level_instruction},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
3. Extract the coordinates and rescale them from the resized image back to the original resolution:
```python
import re

from qwen_vl_utils import smart_resize

def parse_scalecua_grounder_response(response: str, image_width: int, image_height: int,
                                     resized_width: int, resized_height: int) -> tuple[int, int] | None:
    """Parse a grounder response such as click(x=123, y=456) and rescale the
    coordinates from the resized input image back to the original screenshot."""
    response = response.strip()
    match = re.search(r"\((\d+),\s*(\d+)\)", response)
    if not match:
        pattern = r'\((?:x=)?([-+]?\d*\.\d+|\d+)(?:,\s*(?:y=)?([-+]?\d*\.\d+|\d+))?\)'
        match = re.search(pattern, response)
    if not match:
        return None
    x = int(float(match.group(1)) / resized_width * image_width)
    y = int(float(match.group(2)) / resized_height * image_height) if match.group(2) else None
    if y is not None:
        return (x, y)
    return None


# image_width / image_height are the dimensions of the original screenshot
resize_h, resize_w = smart_resize(image_height, image_width, min_pixels=min_pixels, max_pixels=max_pixels)
x, y = parse_scalecua_grounder_response(output_text[0], image_width, image_height, resize_w, resize_h)
```
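
The returned `(x, y)` pair is already expressed in the original screenshot's pixel space, so it can be handed straight to an execution backend. Below is a minimal sketch of that hand-off, assuming `pyautogui` as the backend and a hypothetical `run_grounder` helper that wraps the generation code from step 2; it is not part of the official ScaleCUA tooling.

```python
import pyautogui
from PIL import Image

def ground_and_click(instruction: str, screenshot_path: str = "screen.png") -> None:
    # Capture the current screen and record its native resolution.
    pyautogui.screenshot(screenshot_path)
    image_width, image_height = Image.open(screenshot_path).size
    resize_h, resize_w = smart_resize(image_height, image_width,
                                      min_pixels=min_pixels, max_pixels=max_pixels)

    # `run_grounder` is a hypothetical wrapper around the model call in step 2:
    # it builds the messages with SCALECUA_SYSTEM_PROMPT_GROUNDER and returns the decoded text.
    response = run_grounder(screenshot_path, instruction)
    coords = parse_scalecua_grounder_response(response, image_width, image_height,
                                              resize_w, resize_h)
    if coords is not None:
        pyautogui.click(coords[0], coords[1])
```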

-----

## Reasoned Action Mode as a native agent

For complex, multi-step tasks, the **Reasoned Action Mode** guides the model to first think through the problem, state its intended operation, and then generate the corresponding action code. This is the recommended mode for general computer use automation. Below we demonstrate ScaleCUA on Ubuntu:

1. To enable this mode, use the following system prompt:

```python
SCALECUA_SYSTEM_PROMPT_AGENT = '''You are an autonomous GUI agent operating on the **Linux (Ubuntu)** platform. Your primary function is to analyze screen captures and perform appropriate UI actions to complete assigned tasks.

## Action Space
def click(
    x: float | None = None,
    y: float | None = None,
    clicks: int = 1,
    button: str = "left",
) -> None:
    """Clicks on the screen at the specified coordinates. The `x` and `y` parameter specify where the mouse event occurs. If not provided, the current mouse position is used. The `clicks` parameter specifies how many times to click, and the `button` parameter specifies which mouse button to use ('left', 'right', or 'middle')."""
    pass


def doubleClick(
    x: float | None = None,
    y: float | None = None,
    button: str = "left",
) -> None:
    """Performs a double click. This is a wrapper function for click(x, y, 2, 'left')."""
    pass


def rightClick(x: float | None = None, y: float | None = None) -> None:
    """Performs a right mouse button click. This is a wrapper function for click(x, y, 1, 'right')."""
    pass


def scroll(clicks: int, x: float | None = None, y: float | None = None) -> None:
    """Performs a scroll of the mouse scroll wheel at the specified coordinates. The `clicks` specifies how many clicks to scroll. The direction of the scroll (vertical or horizontal) depends on the underlying operating system. Normally, positive values scroll up, and negative values scroll down."""
    pass


def moveTo(x: float, y: float) -> None:
    """Move the mouse to the specified coordinates."""
    pass


def dragTo(
    x: float | None = None, y: float | None = None, button: str = "left"
) -> None:
    """Performs a drag-to action with optional `x` and `y` coordinates and button."""
    pass


def press(keys: str | list[str], presses: int = 1) -> None:
    """Performs a keyboard key press down, followed by a release. The function supports pressing a single key or a list of keys, multiple presses, and customizable intervals between presses."""
    pass


def hotkey(*args: str) -> None:
    """Performs key down presses on the arguments passed in order, then performs key releases in reverse order. This is used to simulate keyboard shortcuts (e.g., 'Ctrl-Shift-C')."""
    pass


def keyDown(key: str) -> None:
    """Performs a keyboard key press without the release. This will put that key in a held down state."""
    pass


def keyUp(key: str) -> None:
    """Performs a keyboard key release (without the press down beforehand)."""
    pass


def write(message: str) -> None:
    """Write the specified text."""
    pass


def call_user() -> None:
    """Call the user."""
    pass


def wait(seconds: int = 3) -> None:
    """Wait for the change to happen."""
    pass


def response(answer: str) -> None:
    """Answer a question or provide a response to an user query."""
    pass


def terminate(status: str = "success", info: str | None = None) -> None:
    """Terminate the current task with a status. The `status` specifies the termination status ('success', 'failure'), and the `info` can provide additional information about the termination."""
    pass


## Input Specification
- Screenshot of the current screen + task description + your past interaction history with UI to finish assigned tasks.

## Output Format
<think>
[Your reasoning process here]
</think>
<operation>
[Next intended operation description]
</operation>
<action>
[A set of executable action command]
</action>

## Note
- Avoid actions that would lead to invalid states.
- The generated action(s) must exist within the defined action space.
- The reasoning process, operation and action(s) in your response should be enclosed within <think></think>, <operation></operation> and <action></action> tags, respectively.'''
```

2. Use the above system prompt to generate a prediction:
```python
SCALECUA_USER_PROMPT = '''Please generate the next move according to the UI screenshot, the task and previous operations.

Task:
{instruction}

Previous operations:
{history}
'''

def format_history(history):
    if len(history) > 0:
        actions_history = [f"Step {i+1}: {low_level}" for i, low_level in enumerate(history)]
        return "\n".join(actions_history)
    else:
        return None

history = ["Click on 'Chrome'", "Click on the three-dot menu icon in the top right corner of the Chrome window to open the browser settings menu."]
step_history = format_history(history)

task_instruction = "I want to check my password information in Chrome"
user_prompt = SCALECUA_USER_PROMPT.format(
    instruction=task_instruction,
    history=step_history,
)

messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": SCALECUA_SYSTEM_PROMPT_AGENT,
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/path/to/your/image",
            },
            {"type": "text", "text": user_prompt},
        ],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

3. Extract the reasoning, the intended operation, and the actions from the response:
```python
import ast
import re

def parse_response(response: str) -> tuple:
    action_matches = re.findall(r'<action>\s*(.*?)\s*</action>', response, re.DOTALL)
    actions = []
    if action_matches:
        for match in action_matches:
            # Split each match by newline and strip whitespace from each line
            lines = [line.strip() for line in match.split('\n') if line.strip()]
            actions.extend(lines)
    operation_match = re.search(r'<operation>\s*(.*?)\s*</operation>', response, re.DOTALL)
    operation = operation_match.group(1).strip() if operation_match else None

    think_match = re.search(r'<think>\s*(.*?)\s*</think>', response, re.DOTALL)
    think = think_match.group(1).strip() if think_match else None

    return (think, operation, actions)

def parse_actions(actions):
    parsed_action = []
    for action in actions:
        match = re.match(r"(\w+)\((.*)\)", action)
        if not match:
            return None

        func_name = match.group(1)
        args_str = match.group(2)
        args = {}

        if 'hotkey' in func_name.lower():
            keys = re.findall(r"'(.*?)'", args_str)
            keys = [key.lower() for key in keys]
            args["args"] = keys
        elif 'press' in func_name.lower():
            keys = None
            presses = 1
            presses_match = re.search(r"presses\s*=\s*(\d+)", args_str)
            if presses_match:
                presses = int(presses_match.group(1))
                args_str = args_str[:presses_match.start()] + args_str[presses_match.end():]
                args_str = args_str.rstrip(", ").strip()

            keys_keyword_match = re.search(r"keys\s*=\s*(.*)", args_str, re.DOTALL)
            if keys_keyword_match:
                keys_str = keys_keyword_match.group(1).strip()
                if (keys_str.startswith("'") and keys_str.endswith("'")) or \
                   (keys_str.startswith('"') and keys_str.endswith('"')):
                    keys_str = keys_str[1:-1]
                elif keys_str.startswith("[") and keys_str.endswith("]"):
                    keys_str = ast.literal_eval(keys_str)
                keys = keys_str
            elif args_str:
                keys_str = args_str.strip()
                if (keys_str.startswith("'") and keys_str.endswith("'")) or \
                   (keys_str.startswith('"') and keys_str.endswith('"')):
                    keys_str = keys_str[1:-1]
                keys = keys_str

            args["keys"] = keys
            args["presses"] = presses
        elif 'scroll' in func_name.lower():
            clicks, x, y = None, None, None
            if '=' in args_str:
                kwargs = dict(re.findall(r'(\w+)\s*=\s*(-?\d+)', args_str))

                clicks = int(kwargs.get('clicks')) if kwargs.get('clicks') is not None else None
                x = int(kwargs.get('x')) if kwargs.get('x') is not None else None
                y = int(kwargs.get('y')) if kwargs.get('y') is not None else None

            elif args_str:
                try:
                    clicks = int(args_str)
                except ValueError:
                    pass

            if clicks: args['clicks'] = clicks
            if x: args['x'] = x
            if y: args['y'] = y

        else:
            if "=" in args_str:
                # Keyword arguments with list values, e.g. keys=['ctrl', 'c']
                for arg in re.finditer(r"(\w+)=\[([^\]]+)\]", args_str):
                    param = arg.group(1)
                    list_str = arg.group(2)

                    list_items = []
                    for item in re.finditer(r"'([^']*)'|\"([^\"]*)\"|([^,\]]+)", list_str):
                        val = (item.group(1) or item.group(2) or item.group(3)).strip()
                        if val:
                            list_items.append(val.strip('"\''))

                    args[param] = list_items

                # Remaining scalar keyword arguments, e.g. x=100, button='left'
                for arg in re.finditer(r"(\w+)=([^,)]+)", args_str):
                    param = arg.group(1)
                    if param in args:
                        continue

                    value_str = arg.group(2).strip()

                    if value_str.isdigit():
                        value = int(value_str)
                    elif value_str.replace(".", "", 1).isdigit():
                        value = float(value_str)
                    elif value_str.lower() in ("true", "false"):
                        value = value_str.lower() == "true"
                    else:
                        value = value_str.strip('"\'')

                    args[param] = value

            else:
                # Positional arguments only
                args_list = []
                for arg in re.finditer(r"'([^']*)'|\"([^\"]*)\"|([^,]+)", args_str):
                    val = (arg.group(1) or arg.group(2) or arg.group(3)).strip()
                    if val:
                        args_list.append(val.strip('"\''))

                if args_list:
                    args["args"] = args_list

        parsed_action.append({
            'name': func_name,
            'parameters': args
        })

    return parsed_action

think, operation, actions = parse_response(output_text[0])
structured_actions = parse_actions(actions)
```

4. Rescale the predicted coordinates from the resized image back to the original resolution:
```python
from qwen_vl_utils import smart_resize

# image_width / image_height are the dimensions of the original screenshot
resize_h, resize_w = smart_resize(image_height, image_width, min_pixels=min_pixels, max_pixels=max_pixels)
for action in structured_actions:
    if 'x' in action['parameters']:
        x = float(action['parameters']['x']) / resize_w * image_width
        action['parameters']['x'] = "{:.4f}".format(x)
    if 'y' in action['parameters']:
        y = float(action['parameters']['y']) / resize_h * image_height
        action['parameters']['y'] = "{:.4f}".format(y)
```
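
Putting the pieces together, a single agent step is: capture a screenshot, query the model with the task and the operation history, parse the response, rescale the coordinates, and execute the actions. The sketch below shows one possible loop, assuming `pyautogui` for capture and execution and a hypothetical `generate_step` helper that wraps the generation code from step 2; only a few actions are dispatched here, and the full agent framework lives in the GitHub repository.

```python
import pyautogui
from PIL import Image

def generate_step(screenshot_path: str, task: str, history: list[str]) -> str:
    """Hypothetical wrapper: build the messages with SCALECUA_SYSTEM_PROMPT_AGENT and
    SCALECUA_USER_PROMPT (steps 1-2), run model.generate, and return the decoded text."""
    raise NotImplementedError

def run_task(task: str, max_steps: int = 15) -> str:
    history: list[str] = []
    for _ in range(max_steps):
        screenshot_path = "screen.png"
        pyautogui.screenshot(screenshot_path)
        image_width, image_height = Image.open(screenshot_path).size

        response = generate_step(screenshot_path, task, history)
        think, operation, actions = parse_response(response)
        structured_actions = parse_actions(actions)

        # Rescale model coordinates to screen pixels, exactly as in step 4.
        resize_h, resize_w = smart_resize(image_height, image_width,
                                          min_pixels=min_pixels, max_pixels=max_pixels)
        for action in structured_actions:
            params = action['parameters']
            if 'x' in params:
                params['x'] = float(params['x']) / resize_w * image_width
            if 'y' in params:
                params['y'] = float(params['y']) / resize_h * image_height

        # Minimal dispatch table; extend it to cover the full action space.
        for action in structured_actions:
            name, params = action['name'], action['parameters']
            if name == 'terminate':
                return params.get('status', 'success')
            elif name == 'click':
                pyautogui.click(params.get('x'), params.get('y'),
                                clicks=params.get('clicks', 1),
                                button=params.get('button', 'left'))
            elif name == 'write':
                text = params.get('message') or (params.get('args') or [''])[0]
                pyautogui.write(text)
            elif name == 'hotkey':
                pyautogui.hotkey(*params.get('args', []))

        history.append(operation)
    return "max_steps_reached"
```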

-----

## Citation

If you find our project useful in your research, please consider citing:

```bibtex
@article{scalecua2025,
  title={ScaleCUA: Scaling Up Computer Use Agents with Cross-Platform Data},
  author={},
  journal={},
  year={2025}
}
```