
A peek into rabbit’s progress with LAM playground

Background


Our vision at rabbit is to develop a cross-platform generic agent system that enables smart, autonomous agents to act on your behalf. To achieve that, we have been closely following the latest published research and investigating emerging web-agent architectures. As the state of the art in LLMs progresses, so too do autonomous web agents. The general public may be familiar with progress made by companies like OpenAI, Google, Anthropic, and Meta on LLMs, but researchers from universities and other labs have paved the way for a new wave of LLM-based web agents like CogAgent, WebVoyager, and Agent-E [3][7][8]. We at rabbit have been building on this research internally, with some promising results.

Starting with a web agent

LAM playground advances our goal of achieving a truly cross-platform agent by implementing a generalist web agent: an agent that can navigate the constrained environment of a web browser. By definition, generalist web agents must be robust to new websites and new tasks, and they must be able to operate autonomously [1-10]. LAM playground is based on WebVoyager, with modifications made to integrate with our browser automation stack and to take advantage of features provided by rabbit OS.


For humans, the internet has become a second home, and navigating the web has become second nature. We both work and play on the internet. Nevertheless, we might dread spending just a few minutes on menial, boring tasks. The tasks we dread are often simple, so why don’t we just give them to AI? Something so simple for humans should be easy for AI, right? Not quite. We’ve seen impressive capabilities from LLMs, but even simple web tasks are anything but simple for machines.

How does a web agent work?

Web agents share the same basic problems as more general cross-platform agents, and they approach these problems by performing two main kinds of work: planning and grounding [11][12]. Planning is predicting how to use a particular website, step by step. Grounding is knowing how to find the interactive elements on the rendered page, and knowing how to interact with them effectively. Both planning and grounding are difficult even for state-of-the-art LLM-powered web agents.
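
Before digging into each, here is a minimal sketch of how the two responsibilities separate, assuming a hypothetical call_llm helper that stands in for whatever model API an agent might use. It shows the shape of the problem, not rabbit’s implementation.

```python
# Planning vs. grounding as two separate responsibilities.
# `call_llm` is a hypothetical helper standing in for any LLM API.

def plan_next_step(task: str, steps_taken: list[str], page_summary: str) -> str:
    """Planning: decide, in natural language, what to do next on this website."""
    prompt = (
        f"Task: {task}\n"
        f"Steps taken so far: {steps_taken}\n"
        f"Current page: {page_summary}\n"
        "What single step should the agent take next?"
    )
    return call_llm(prompt)  # hypothetical LLM call


def ground_step(step: str, interactive_elements: list[dict]) -> dict:
    """Grounding: map a planned step onto one concrete element and action."""
    prompt = (
        f"Step to perform: {step}\n"
        f"Interactive elements on the page: {interactive_elements}\n"
        "Which element should be acted on, and how (click / type / press)?"
    )
    return {"decision": call_llm(prompt)}  # hypothetical LLM call
```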

Planning

Let’s walk through a hypothetical task to see its subparts. The task is to find an Airbnb in LA. Your first step is to enter “Los Angeles” into the location bar. Your second step is to add the dates of your trip. Your third step is to add your guests. Then you need to go into the filters and select the ones you care about. Then comes the actual search. You start looking through all the displayed listings, and one by one you decide: do I like this or not? This process of searching might not seem very planned to us, but it does follow a pattern. You have an internal set of preferences, and for each listing you measure how it fares against those preferences. The listing that fits best wins.
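
That “measure each listing against your preferences” step can be made concrete with a toy example. The listings, fields, and weights below are invented for illustration; a real agent would read this information off the rendered results page.

```python
# Invented example data: what an agent might extract from a results page.
listings = [
    {"name": "Venice bungalow",  "price": 210, "bedrooms": 1, "rating": 4.9},
    {"name": "Downtown loft",    "price": 160, "bedrooms": 2, "rating": 4.6},
    {"name": "Hollywood studio", "price": 120, "bedrooms": 1, "rating": 4.2},
]

# The user's (possibly implicit) preferences.
preferences = {"max_price": 200, "min_bedrooms": 1, "min_rating": 4.5}

def score(listing: dict) -> float:
    """Reward each satisfied preference; lightly penalize price as a tie-breaker."""
    s = 0.0
    s += 1.0 if listing["price"] <= preferences["max_price"] else -1.0
    s += 1.0 if listing["bedrooms"] >= preferences["min_bedrooms"] else -1.0
    s += 1.0 if listing["rating"] >= preferences["min_rating"] else -1.0
    return s - listing["price"] / 1000

best = max(listings, key=score)
print(best["name"])  # the listing that fits best wins ("Downtown loft" here)
```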


LLMs have a hard time with this kind of planning. They have to reason over both the structure of a website and the user’s set of preferences (which may or may not be explicit). Most people visit the same websites over and over again, and at some level they have memorized the flow of the application. You might even be able to close your eyes and visualize the layout of a website you use often. Web agents built on LLMs can’t do that out of the box. Reasoning over the structure of an application is much harder if you don’t know the website ahead of time.


Out of the box, every time an LLM “sees” a new website, it’s seeing it for the first time. Try going to a website you’ve never visited and performing a task. You’ll find that even for us humans, new websites can be unintuitive. So when an agent comes to a new website, it needs to plan on the fly with no prior knowledge, just like a human would on a site they’ve never seen before. This often leads to sequences of actions that end in a dead end. This is where planning blends with search, and depending on the website it can become a major time sink. This is one of the problems we’ve been working on at rabbit, and in LAM playground you may see the agent go down the wrong path. Ideally, it will recognize the error and find its way back to a better state.


The fact that LLMs don’t have dependable, baked-in prior knowledge about web interfaces means that they can’t complete tasks exactly the way a human would. A plan for an agent must be under-specified from the outset, and then updated as the agent continues its task. The result should be a high-level, functional, step-by-step outline. Once the agent sees a new page load, it can start to map its rough plan onto actionable items (e.g. click the “Location” text box). Every new page yields new information (e.g. menus expand to reveal new options), so a good agent should update and adjust its plan constantly. As you can see, simple tasks for humans are not so simple for machines. So far we’ve only talked about planning, but the story continues when we get to grounding.
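
Before moving on to grounding, one way to picture this planning loop is a rough outline that gets revised after every page load. The agent object and its methods below are assumed for illustration (e.g. backed by an LLM); only the control flow is the point.

```python
# Control-flow sketch: start with an under-specified outline, then refine it
# as each new page reveals more information. `agent` and its methods are
# assumed helpers, not a real API.

def run(task: str, agent) -> None:
    plan = agent.draft_plan(task)            # rough, high-level outline
    while plan:
        step = plan.pop(0)                   # next high-level step
        observation = agent.observe_page()   # a new page means new information
        action = agent.ground(step, observation)
        agent.execute(action)
        # a new page can invalidate the old outline, so revise it every time
        plan = agent.refine_plan(task, plan, observation)
```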

Grounding

Grounding is where the rubber hits the road. It’s where the agent’s plan turns into specific, concrete actions on a rendered webpage. Let’s say the task is: “Buy me a 12-pack of Diet Coke on Amazon.” The agent needs to make a plan without dependable prior knowledge of Amazon.com’s web interface. The agent can make a plan that turns the task conditions into a logical sequence: 1) go to amazon.com, 2) search for a 12-pack of Diet Coke, 3) add it to the cart, 4) proceed to checkout, 5) finalize the payment. When the agent sees Amazon.com’s homepage, it must translate “search for a 12-pack of Diet Coke” into “click on the search bar, type in ‘12-pack of Diet Coke’, hit enter”. The grounding problem is finding the search bar and engaging it with the webdriver.
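
Once grounded, that single step becomes a handful of webdriver calls. Below is a hedged sketch using Playwright’s Python API; the CSS selector is an assumption made for illustration, and discovering the right element at run time is exactly the hard part the agent has to solve.

```python
from playwright.sync_api import sync_playwright

# What "click on the search bar, type the query, hit enter" looks like once
# it has been grounded to concrete webdriver calls. The selector below is an
# assumption; a real agent has to find it on the live page.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.amazon.com")

    search_box = "input[name='field-keywords']"    # assumed selector
    page.fill(search_box, "12-pack of Diet Coke")  # focus the field and type the query
    page.keyboard.press("Enter")                   # submit the search

    browser.close()
```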


Again, this is such a trivial task for humans that we don’t even think about it. When you click an element on a webpage, you just move the cursor over the element and click it. This is a basic motor task, but it involves a tight feedback loop between perception and movement. It’s not practical to implement web agents with this same approach. Instead, we aim to find the location of the search bar and engage it directly. This can be done either by referencing the location of the search bar in pixels as it’s rendered in the browser, or by finding the search bar in the underlying source code of the webpage. Both of these approaches are difficult, and this “finding” is the grounding problem. Researchers have investigated both approaches, and combining the visual input (screenshot) with the underlying HTML structure has shown promising results. This is also what we’ve found works best, and it is what currently runs in LAM playground.
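
As a rough sketch of what “combining the screenshot with the HTML” can mean in practice, the snippet below uses Playwright to collect both signals: the rendered screenshot plus the page’s interactive elements, each with its pixel bounding box. How these are actually fed to a model is a separate (and harder) question.

```python
from playwright.sync_api import sync_playwright

# Collect both grounding signals from a page: the screenshot (pixels) and the
# interactive elements from the page source (structure), each with the box
# it occupies on screen.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")

    screenshot = page.screenshot()  # visual input, returned as PNG bytes

    elements = []
    for el in page.query_selector_all("a, button, input, select, textarea"):
        box = el.bounding_box()     # None if the element is not visible
        if box is None:
            continue
        elements.append({
            "tag": el.evaluate("node => node.tagName.toLowerCase()"),
            "text": (el.inner_text() or "").strip()[:80],
            "box": box,             # {x, y, width, height} in page pixels
        })

    browser.close()
```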


As you can see, web-based tasks that are simple for humans are not so simple for machines. The agent must understand what a user wants, break the task down into conditions, turn those conditions into a plan, update the plan over time, and ground all actions on the webpage. The latest advances in LLMs have brought great progress in planning and grounding, and at rabbit our goal is to take the best of what’s available, productionize a useful AI agent system, and make it accessible to our users as early as possible.

rabbit’s approach – LAM playground

Our latest progress in LAM playground employs an approach that directly interprets your request and iterates over its planned steps to complete the task. The current approach makes use of both visual input (i.e. the screenshot) and structured input (i.e. the page source code) in order to take actions via a webdriver (e.g. Playwright). LAM playground is built to be agnostic to specific tasks and websites; it will try its best to finish any task you give it. Our new approach can answer questions, summarize information, and also perform tasks.
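
As a minimal sketch of that loop (not the production LAM code), the snippet below gathers both inputs each turn and executes whatever action a model chooses; choose_next_action is a hypothetical stand-in for the planning and grounding models.

```python
from playwright.sync_api import sync_playwright

def run_task(task: str, start_url: str, max_steps: int = 20) -> None:
    """Observe (screenshot + HTML), decide, act, repeat."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(start_url)

        for _ in range(max_steps):
            screenshot = page.screenshot()  # visual input
            html = page.content()           # structured input
            # hypothetical: planning + grounding condensed into one call
            action = choose_next_action(task, screenshot, html)

            if action["kind"] == "click":
                page.click(action["selector"])
            elif action["kind"] == "type":
                page.fill(action["selector"], action["text"])
            elif action["kind"] == "press":
                page.keyboard.press(action["key"])
            elif action["kind"] == "done":
                break

        browser.close()
```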


LAM playground brings us one step closer to achieving our vision of a cross-platform generic AI agent system. Our goal is for LAM playground to eventually succeed at any task that a human could perform in a web browser. Later we will be expanding to additional platforms, including desktop and mobile apps, making our agent system cross-platform. We’re not quite there yet, but the progress is exciting, and we’re glad to share some of what we’ve done so far.

Works cited:

[1] Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., ... & Schulman, J. (2021). WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.

[2] Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., ... & Su, Y. (2023). Mind2Web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070.

[3] Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., ... & Ding, M. (2023). CogAgent: A visual language model for GUI agents. arXiv preprint arXiv:2312.08914.

[4] Kim, G., Baldi, P., & McAleer, S. (2023). Language Models can Solve Computer Tasks. arXiv preprint arXiv:2307.03981.

[5] Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., ... & Sun, M. (2023). ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv preprint arXiv:2307.16789.

[6] Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M., ... & Neubig, G. (2024). OpenDevin: An Open Platform for AI Software Developers as Generalist Agents. arXiv preprint arXiv:2407.16741.

[7] Abuelsaad, T., Akkil, D., Dey, P., Jagmohan, A., Vempaty, A., & Kokku, R. (2024). Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems. arXiv preprint arXiv:2407.13032.

[8] He, H., Yao, W., Ma, K., Yu, W., Dai, Y., Zhang, H., ... & Yu, D. (2024). WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. arXiv preprint arXiv:2401.13919.

[9] Lai, H., Liu, X., Iong, I. L., Yao, S., Chen, Y., Shen, P., ... & Tang, J. (2024). AutoWebGLM: A Large Language Model-based Web Navigating Agent. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

[10] Putta, P., Mills, E., Garg, N., Motwani, S., Finn, C., Garg, D., & Rafailov, R. (2024). Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents. arXiv preprint arXiv:2408.07199.

[11] https://en.wikipedia.org/wiki/Automated_planning_and_scheduling

[12] https://en.wikipedia.org/wiki/Symbol_grounding_problem