Browser Automation Agent

This agent lets users control a web browser with natural language commands. It uses MCP, Gemini AI, and Selenium WebDriver to interpret each command, map it to browser actions, and execute them.

Features

Natural Language Commands: Accepts user input like "Navigate to google.com and search for 'Rust programming'" and executes it in the browser.
Automated Browser Actions:
- Navigate to URLs.
- Click on elements (e.g., buttons, links).
- Scroll the page up, down, left, or right.
- Search for queries on websites.
- Type text into input fields.
AI-Powered Parsing: Uses Gemini AI to convert natural language commands into structured JSON-RPC requests.
Dynamic Element Detection: Locates elements dynamically using text or attributes.

How It Works

User Input: The user provides a natural language command (e.g., "Go to YouTube and search for 'Python tutorials'").

Command Parsing:

The input is sent to Gemini AI, which parses it into structured commands with functions and parameters.

Example Output:

```
[
  { "function": "navigate", "params": { "url": "https://www.youtube.com" } },
  { "function": "search", "params": { "query": "Python tutorials" } }
]
```
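
As a rough sketch of how a reply in this shape could be turned into Python objects, the helper below extracts the JSON list and checks that every entry names a function. The name `extract_commands` is hypothetical, and the idea that the model may wrap its JSON in a Markdown code fence is an assumption, not something this README states.

```python
import json
import re

# Three backticks, built indirectly so this example can sit inside a fenced block.
FENCE = "`" * 3


def extract_commands(reply: str) -> list[dict]:
    """Pull a JSON list of commands out of a model reply (hypothetical helper)."""
    # Assumption: the reply is either bare JSON or JSON inside a code fence.
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, reply, re.DOTALL)
    payload = match.group(1) if match else reply
    commands = json.loads(payload)
    if not isinstance(commands, list):
        raise ValueError("expected a JSON list of commands")
    for command in commands:
        if not isinstance(command, dict) or "function" not in command:
            raise ValueError(f"malformed command: {command!r}")
        # Commands without parameters get an empty params object.
        command.setdefault("params", {})
    return commands
```

Rejecting malformed entries at this point keeps later steps from having to guess what the model meant.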

Command Execution:
- The parsed commands are executed using Selenium WebDriver.
- Each function (e.g., navigate, click, scroll) performs the corresponding browser action.
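
The execution step can be pictured as a lookup from function name to handler. The sketch below is an illustration under assumptions, not the code in `browser_agent.py`: `run_commands` and the stand-in handlers are hypothetical, and in the real agent each handler would call Selenium (for example `driver.get(url)` for `navigate`).

```python
from typing import Any, Callable

Handler = Callable[..., Any]


def run_commands(commands: list[dict], handlers: dict[str, Handler]) -> list[Any]:
    """Execute parsed commands in order and collect each handler's result."""
    results = []
    for command in commands:
        name = command["function"]
        if name not in handlers:
            raise ValueError(f"unsupported function: {name}")
        # The params object is passed to the handler as keyword arguments.
        results.append(handlers[name](**command.get("params", {})))
    return results


# Stand-in handlers that only record what they were asked to do.
log: list[str] = []
demo_handlers = {
    "navigate": lambda url: log.append(f"navigate {url}"),
    "search": lambda query: log.append(f"search {query}"),
}
run_commands(
    [
        {"function": "navigate", "params": {"url": "https://www.youtube.com"}},
        {"function": "search", "params": {"query": "Python tutorials"}},
    ],
    demo_handlers,
)
print(log)
```

Keeping the handlers in a dictionary means a new action only needs a new entry, and an unknown function name fails loudly instead of being silently skipped.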

Installation

Prerequisites

Python 3 and uv
Google Chrome installed
ChromeDriver installed (compatible with your Chrome version)
Gemini API key for AI command parsing

Steps

Clone the repository:

```
git clone https://github.com/your-repo/browser-agent.git
cd browser-agent
```

Install dependencies:
```
uv sync
```
Set up environment variables:
- Create a .env file in the root directory:
```
GEMINI_API_KEY=your_gemini_api_key_here
```
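
For readers curious how the key might be picked up at startup, here is a minimal standard-library sketch. It is an assumption about the mechanism: the project may well use a library such as python-dotenv instead, and `load_env_file` and `get_gemini_api_key` are hypothetical names.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=value lines, ignoring blank lines and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_gemini_api_key(path: str = ".env") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("GEMINI_API_KEY") or load_env_file(path).get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to .env")
    return key
```

Failing early with a clear message is friendlier than letting the first Gemini request fail with an authentication error.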

Usage

Start the agent by running:
```
uv run main.py
```
Enter natural language commands in the terminal, such as:
- Navigate to youtube.com
- Search for 'Python tutorials'
- Scroll down
- Go to github and click on the login button
To exit, type:
```
exit
```
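
The interactive loop described above could look roughly like the following. This is a sketch, not the contents of `main.py`: `repl` and its `handle` callback are hypothetical, and treating `exit` case-insensitively is an assumption.

```python
from typing import Callable, Iterable


def repl(lines: Iterable[str], handle: Callable[[str], None]) -> int:
    """Feed each command to `handle` until "exit"; return how many were run."""
    count = 0
    for line in lines:
        command = line.strip()
        if not command:
            continue  # skip empty input
        if command.lower() == "exit":
            break
        handle(command)
        count += 1
    return count
```

Passing `sys.stdin` as `lines` would give a terminal session, while a plain list of strings makes the loop easy to exercise in tests.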

Example Commands

| Command | Action |
| --- | --- |
| `Navigate to github.com` | Opens GitHub in the browser. |
| `Search for 'AI tools' on YouTube` | Navigates to YouTube and searches for "AI tools". |
| `Scroll to the bottom of the page` | Scrolls to the bottom of the current page. |
| `Click on the login button` | Finds and clicks on a button with text containing "login". |
| `Type 'hello' into the username field` | Types "hello" into an input field labeled or named "username". |

Project Structure

```
.
├── browser_agent.py        # Core logic for parsing and executing commands
├── function_decls.json     # Schema of supported functions and their parameters
├── logging_utils.py        # Utility functions for logging actions and errors
├── main.py                 # Entry point for running the agent
├── .env                    # Environment variables (e.g., Gemini API key)
└── README.md               # Project documentation (this file)
```

Supported Commands

1. Navigate (`navigate`)

Description: Opens a specified URL in the browser.
Parameters:
- url (string): The URL to navigate to.
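
Users tend to type bare hosts such as `google.com`, while a browser driver generally needs a full URL with a scheme. The sketch below shows one way the `url` parameter could be normalized before navigation. `normalize_url` is a hypothetical helper, and defaulting to `https://` is an assumption.

```python
from urllib.parse import urlparse


def normalize_url(raw: str) -> str:
    """Add an https:// scheme when the input is a bare host such as google.com."""
    candidate = raw.strip()
    if not candidate:
        raise ValueError("empty URL")
    if "://" not in candidate:
        candidate = "https://" + candidate
    parsed = urlparse(candidate)
    # Only plain web URLs with a host are accepted.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not a web URL: {raw!r}")
    return candidate
```

With Selenium, the result could then be passed to `driver.get(normalize_url(url))`.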

2. Click (`click`)

Description: Clicks on an element in the browser.
Parameters:
- element (string): Text or keyword from the element's content.
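
One plausible way to find an element from a keyword is a case-insensitive XPath match on visible text. The sketch below only builds the expression. The helper names are hypothetical, and restricting the search to links and buttons is an assumption about what "element" covers here.

```python
_UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
_LOWER = "abcdefghijklmnopqrstuvwxyz"


def xpath_literal(text: str) -> str:
    """Quote text for XPath 1.0, which has no escape character for quotes."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # Both quote kinds present: split on ' and rejoin the pieces with concat().
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join(f"'{p}'" for p in parts) + ")"


def click_xpath(keyword: str) -> str:
    """XPath for links or buttons whose text contains keyword, ignoring ASCII case."""
    needle = xpath_literal(keyword.lower())
    lowered_text = f"translate(normalize-space(.), '{_UPPER}', '{_LOWER}')"
    return f"//*[self::a or self::button][contains({lowered_text}, {needle})]"
```

With Selenium, the expression could be used as `driver.find_element(By.XPATH, click_xpath("login")).click()`, where `By` comes from `selenium.webdriver.common.by`.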

3. Scroll (`scroll`)

Description: Scrolls up, down, left, or right on a webpage.
Parameters:
- direction (string): "up", "down", "left", or "right".
- bound (boolean): Whether to scroll all the way to the edge of the page (true) or only by a limited amount (false).
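
To make the two parameters concrete, the sketch below maps them to a JavaScript snippet that a driver could run. `scroll_script` is a hypothetical helper, and the 500-pixel default step is an assumed value, not one taken from the project.

```python
def scroll_script(direction: str, bound: bool = False, step: int = 500) -> str:
    """Return JavaScript for the requested scroll; step is an assumed pixel size."""
    if direction not in ("up", "down", "left", "right"):
        raise ValueError(f"unknown direction: {direction}")
    if bound:
        # Jump to the edge of the page along one axis, keeping the other axis fixed.
        targets = {
            "up": "window.scrollTo(window.scrollX, 0);",
            "down": "window.scrollTo(window.scrollX, document.documentElement.scrollHeight);",
            "left": "window.scrollTo(0, window.scrollY);",
            "right": "window.scrollTo(document.documentElement.scrollWidth, window.scrollY);",
        }
        return targets[direction]
    offsets = {"up": (0, -step), "down": (0, step), "left": (-step, 0), "right": (step, 0)}
    dx, dy = offsets[direction]
    return f"window.scrollBy({dx}, {dy});"
```

With Selenium, the snippet could be run as `driver.execute_script(scroll_script("down", bound=True))`.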

4. Search (`search`)

Description: Searches for a query on a website.
Parameters:
- query (string): The search term.

5. Type (`type`)

Description: Types text into an input field.
Parameters:
- field (string): Text or keyword identifying the field.
- text (string): The text to type.
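
Because the commands come from a language model, it can help to check them against the parameter lists above before touching the browser. The sketch below is one way to do that. `SCHEMA` is transcribed from this section and is an assumption about what `function_decls.json` contains, `validate` is a hypothetical helper, and treating every parameter as required is also an assumption.

```python
# Assumed schema, transcribed from the parameter lists in this section;
# the real function_decls.json may be structured differently.
SCHEMA: dict[str, dict[str, type]] = {
    "navigate": {"url": str},
    "click": {"element": str},
    "scroll": {"direction": str, "bound": bool},
    "search": {"query": str},
    "type": {"field": str, "text": str},
}


def validate(command: dict) -> list[str]:
    """Return a list of problems with a parsed command; an empty list means valid."""
    name = command.get("function")
    if name not in SCHEMA:
        return [f"unknown function: {name!r}"]
    expected = SCHEMA[name]
    params = command.get("params", {})
    problems = []
    for key, kind in expected.items():
        if key not in params:
            problems.append(f"{name}: missing parameter {key!r}")
        elif not isinstance(params[key], kind):
            problems.append(f"{name}: {key!r} should be {kind.__name__}")
    for key in params:
        if key not in expected:
            problems.append(f"{name}: unexpected parameter {key!r}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it possible to report every issue in a single message.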

Future Enhancements

Add support for more complex workflows, such as form submissions or multi-step tasks.
Improve element detection using fuzzy matching or AI-based visual recognition.
Integrate additional browsers like Firefox or Edge.
Provide a web-based interface for easier use.
