Browser Interaction
Instruct and control what the agent should do in the browser
Taking Action
Use act() to tell the agent what to do.
Instructions provided to act() can be high-level tasks or low-level actions.
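As a minimal sketch of the difference between the two styles, the snippet below uses a hypothetical `SketchAgent` stand-in that only records calls. The class name, the `calls` list, and the example instructions are assumptions for illustration, not the library's real API.

```python
class SketchAgent:
    """Hypothetical stand-in for the browser agent: records calls, drives no browser."""

    def __init__(self):
        self.calls = []

    def act(self, instruction):
        # A real agent would carry out the instruction in the browser.
        self.calls.append(instruction)


agent = SketchAgent()

# High-level task: state the goal and let the agent work out the steps.
agent.act("Create a new task called 'Prepare quarterly report'")

# Low-level action: name one concrete interaction.
agent.act("Click the 'Save' button")
```

Both calls go through the same act() entry point; only the granularity of the instruction changes.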
Think of it as giving instructions to a coworker: you don't need to break the task into tiny steps, but you should be specific enough to make the objective clear.
Chaining Acts
Combine multiple act calls to accomplish complex sequences of interactions:
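A chained sequence might look roughly like the sketch below. `SketchAgent` is a hypothetical recorder used so the example runs without a browser, and the instructions themselves are made up for illustration.

```python
class SketchAgent:
    """Hypothetical stand-in for the browser agent: records calls, drives no browser."""

    def __init__(self):
        self.calls = []

    def act(self, instruction):
        # A real agent would carry out the instruction in the browser.
        self.calls.append(instruction)


agent = SketchAgent()

# Each act() call starts from whatever state the previous one left the page in.
agent.act("Open the 'Projects' page")
agent.act("Create a project named 'Website redesign'")
agent.act("Add a task called 'Draft wireframes' to the new project")
```

Splitting a flow into separate calls makes it easier to see which step failed when something goes wrong.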
You can also chain multiple steps together in the same act call for convenience:
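The same idea in a single call could be sketched as follows, again with a hypothetical `SketchAgent` recorder and an invented instruction.

```python
class SketchAgent:
    """Hypothetical stand-in for the browser agent: records calls, drives no browser."""

    def __init__(self):
        self.calls = []

    def act(self, instruction):
        # A real agent would carry out the instruction in the browser.
        self.calls.append(instruction)


agent = SketchAgent()

# Several steps described in one instruction and handed over in one act() call.
agent.act(
    "Open the 'Projects' page, create a project named 'Website redesign', "
    "and add a task called 'Draft wireframes' to it"
)
```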
Providing Data
You can provide arbitrary data fields that the agent will use where appropriate during its actions:
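One way this could look is sketched below. The `data` keyword argument, the `SketchAgent` class, and the field names are assumptions for illustration; check the library reference for the actual way data is passed.

```python
class SketchAgent:
    """Hypothetical stand-in for the browser agent: records calls, drives no browser."""

    def __init__(self):
        self.calls = []

    def act(self, instruction, data=None):
        # 'data' is an assumed keyword: fields the agent may fill in where relevant.
        self.calls.append({"instruction": instruction, "data": data or {}})


agent = SketchAgent()

# The instruction stays generic; the concrete values travel separately as data.
agent.act(
    "Fill out and submit the contact form",
    data={"name": "Ada Lovelace", "email": "ada@example.com"},
)
```

Keeping values out of the instruction string lets the same instruction be reused with different inputs.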
Custom Prompting
Provide custom system prompt instructions as needed:
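A rough sketch of per-call prompting follows. The `prompt` keyword and the `SketchAgent` class are assumptions for illustration; the real library may accept the prompt under a different name, or when the agent is created rather than per call.

```python
class SketchAgent:
    """Hypothetical stand-in for the browser agent: records calls, drives no browser."""

    def __init__(self):
        self.calls = []

    def act(self, instruction, prompt=None):
        # 'prompt' is an assumed keyword carrying extra system prompt instructions.
        self.calls.append({"instruction": instruction, "prompt": prompt})


agent = SketchAgent()

# The extra guidance shapes how the agent carries out the instruction.
agent.act(
    "Create a new task called 'Prepare quarterly report'",
    prompt="Always confirm dialogs, and never delete existing tasks.",
)
```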
Navigating Directly
While the agent is capable of navigating to URLs on its own, you may sometimes want to navigate to a specific URL directly.
To do this, use nav:
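Direct navigation followed by an instruction might be sketched like this. `SketchAgent`, its `current_url` attribute, and the example URL are hypothetical; only the names nav and act come from the text above, and their real signatures may differ.

```python
class SketchAgent:
    """Hypothetical stand-in for the browser agent: records calls, drives no browser."""

    def __init__(self):
        self.calls = []
        self.current_url = None

    def nav(self, url):
        # A real agent would load the URL in the active tab.
        self.current_url = url
        self.calls.append(("nav", url))

    def act(self, instruction):
        # A real agent would carry out the instruction in the browser.
        self.calls.append(("act", instruction))


agent = SketchAgent()

# Jump straight to a known page instead of asking the agent to find it.
agent.nav("https://example.com/settings")
agent.act("Switch the theme to dark mode")
```

Navigating directly skips the exploration the agent would otherwise spend locating the page.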
Agent Capabilities
What can the agent do in act()?
The agent is capable of mouse, keyboard, and browser-specific actions, including but not limited to:
- Clicking with the mouse
- Dragging with the mouse
- Typing long blocks of content
- Pressing specific keystrokes
- Switching tabs
- Navigating to URLs
What is the agent aware of?
The agent knows about and sees:
- The current screenshot plus some past screenshots
- History of its own actions from the same act()
- All currently open tabs
- Which tab is active