Tencent recently published a paper titled “AppAgent: Multimodal Agents as Smartphone Users” that uses multimodal LLMs to intelligently navigate mobile UI screens and perform various actions. This includes

  • sending out an email
  • watching a video
  • etc.

It works by feeding a screenshot of the current UI screen to a multimodal LLM (e.g. OpenAI’s gpt-4-vision-preview), which identifies the different components displayed (text fields, buttons, etc.). The agent then performs actions (tap, swipe, etc.) on those components and repeats the process on the next screen.
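To make that loop concrete, here is a minimal sketch of a single observe-decide-act iteration. This is not the paper’s implementation: adb grabs the screenshot and injects the tap, an OpenAI vision model picks the next action, and the goal text, prompt wording, and single-action reply format are all assumptions for illustration.

```python
import base64
import re
import subprocess

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def screenshot() -> str:
    """Capture the current screen via adb and return it base64-encoded."""
    png = subprocess.run(
        ["adb", "exec-out", "screencap", "-p"], capture_output=True, check=True
    ).stdout
    return base64.b64encode(png).decode()

def decide(image_b64: str, goal: str) -> str:
    """Ask the vision model which single action to take next on this screen."""
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=50,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Goal: {goal}. Reply with exactly one action, "
                         f"e.g. tap(x, y) or swipe(x1, y1, x2, y2)."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def act(action: str) -> None:
    """Execute a tap action returned by the model (swipe handling omitted)."""
    m = re.search(r"tap\((\d+),\s*(\d+)\)", action)
    if m:
        subprocess.run(["adb", "shell", "input", "tap", m.group(1), m.group(2)],
                       check=True)

# One iteration of the observe -> decide -> act loop.
action = decide(screenshot(), goal="open the inbox and compose an email")
act(action)
```

In practice this would run in a loop with some history of previous actions in the prompt, but the core cycle is just screenshot in, action out.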

A few years back we were trying to dynamically analyze applications to scan them for security vulnerabilities. This required an agent to drive the application. Back then we had to resort to crude tools like Android’s Monkey 🤦

This has lots of use cases:

  1. Dynamic Analysis - Combine this with dynamic/instrumentation analysis tools like Frida, letting the agent exercise the app while hooks record its runtime behavior (see the sketch after this list).

  2. Testing - Avoid time-consuming QA cycles. During the “learning” phase, QA can show the model how to navigate the app. During this process, the agent builds a “document” that records what the different screens and UI elements do, and it consults that document during the “run” phase to drive the app without QA interaction (a hypothetical example of such a document follows this list).

  3. DevSecOps - Connect this with DevSecOps platforms, so GitHub/Azure/GitLab pipelines can run security analysis without requiring manual review by a security researcher. This might be a company of its own, or an app in their marketplaces 😁
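For the dynamic-analysis idea, a minimal sketch of the Frida side, assuming the LLM agent is driving the UI in parallel. The package name and the hooked API (`Cipher.getInstance`) are placeholders; any interesting runtime behavior could be reported the same way.

```python
import frida

# Hypothetical target package; swap in the app under test.
PACKAGE = "com.example.targetapp"

HOOK_JS = """
Java.perform(function () {
    var Cipher = Java.use("javax.crypto.Cipher");
    Cipher.getInstance.overload("java.lang.String").implementation = function (transformation) {
        // Report the requested crypto transformation back to the host.
        send({api: "Cipher.getInstance", transformation: transformation});
        return this.getInstance(transformation);
    };
});
"""

def on_message(message, data):
    # Collect findings while the agent navigates the app.
    if message["type"] == "send":
        print("finding:", message["payload"])

device = frida.get_usb_device()
pid = device.spawn([PACKAGE])      # start the app suspended
session = device.attach(pid)
script = session.create_script(HOOK_JS)
script.on("message", on_message)
script.load()
device.resume(pid)                 # let the agent start driving the UI
input("Press Enter to detach...\n")
session.detach()
```

The agent provides the coverage that Monkey never could, while the hooks turn that coverage into security findings.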
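And for the testing idea, a hypothetical sketch of what the per-screen “document” built during the learning phase might contain (the paper’s actual format differs); during the run phase the agent would look up an element’s recorded purpose and feed it into the prompt before acting.

```python
# Hypothetical knowledge "document" recorded while QA demonstrates the app,
# keyed by screen, then by UI element.
ui_document = {
    "InboxScreen": {
        "element_3": {"type": "button", "label": "Compose",
                      "effect": "Opens a blank email draft."},
        "element_7": {"type": "list_item", "label": "Email row",
                      "effect": "Opens the selected email."},
    },
    "ComposeScreen": {
        "element_1": {"type": "text_field", "label": "To",
                      "effect": "Recipient address input."},
    },
}

def describe(screen: str, element: str) -> str:
    """Return the recorded effect of an element for inclusion in the prompt."""
    return ui_document.get(screen, {}).get(element, {}).get("effect", "unknown element")
```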

Demo

Paper

https://arxiv.org/abs/2312.13771