How AI Agents Orchestrate Document Actions: From Natural Language to Client-Side WASM
By LLM Practitioner
When you tell TellPDF to 'merge files, delete the third page, and add a watermark,' the system translates your natural-language request into a structured sequence of actions. Rather than directly manipulating documents, a language model interprets your intent and generates machine-readable instructions that can be executed by specialized tools.
At runtime, your command is processed by an AI planning layer. The model does not execute code directly; instead, it acts as an orchestrator that maps your request to a sequence of JSON-structured operations drawn from the set of actions the application defines. For example, a request to remove a page may be translated into a DELETE_PAGES action with a corresponding page index parameter. The Next.js API sits between the browser and the model as a proxy: it keeps the model's API key on the server and returns the generated execution plan to the browser.
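To make the idea of a JSON-structured plan concrete, here is a minimal TypeScript sketch of how an application might validate the model's output before acting on it. Only DELETE_PAGES is named in this article; the MERGE_FILES and ADD_WATERMARK action names, the field names (fileIds, pageIndices, text), the zero-based page indexing, and the parsePlan helper are assumptions made for illustration, not TellPDF's actual schema.

```typescript
// Hypothetical action vocabulary. Only DELETE_PAGES appears in the article;
// the other names and all field names are illustrative assumptions.
type Action =
  | { type: "MERGE_FILES"; fileIds: string[] }
  | { type: "DELETE_PAGES"; pageIndices: number[] } // assumed zero-based
  | { type: "ADD_WATERMARK"; text: string };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Check one step of the plan and return it as a typed Action.
function validateAction(step: unknown, i: number): Action {
  if (!isRecord(step)) throw new Error(`Step ${i} is not an object`);
  switch (step.type) {
    case "MERGE_FILES": {
      const ids = step.fileIds;
      if (!Array.isArray(ids) || ids.length < 2 || !ids.every((x) => typeof x === "string")) {
        throw new Error(`Step ${i}: MERGE_FILES needs at least two string fileIds`);
      }
      return { type: "MERGE_FILES", fileIds: ids as string[] };
    }
    case "DELETE_PAGES": {
      const pages = step.pageIndices;
      if (!Array.isArray(pages) || pages.length === 0 ||
          !pages.every((x) => Number.isInteger(x) && x >= 0)) {
        throw new Error(`Step ${i}: DELETE_PAGES needs non-negative integer pageIndices`);
      }
      return { type: "DELETE_PAGES", pageIndices: pages as number[] };
    }
    case "ADD_WATERMARK": {
      const text = step.text;
      if (typeof text !== "string" || text.length === 0) {
        throw new Error(`Step ${i}: ADD_WATERMARK needs non-empty text`);
      }
      return { type: "ADD_WATERMARK", text };
    }
    default:
      // Anything outside the known vocabulary is rejected, never executed.
      throw new Error(`Step ${i}: unknown action type ${String(step.type)}`);
  }
}

// Parse the raw JSON string returned by the planning layer into typed actions.
function parsePlan(raw: string): Action[] {
  const data: unknown = JSON.parse(raw);
  if (!Array.isArray(data)) throw new Error("Plan must be a JSON array");
  return data.map((step, i) => validateAction(step, i));
}
```

Under these assumptions, the example request from the introduction could arrive as a three-element array: a MERGE_FILES step, a DELETE_PAGES step with pageIndices [2] for the third page, and an ADD_WATERMARK step. Rejecting unknown action types keeps the model confined to operations the application already knows how to run.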
Once the structured plan reaches the browser, the client-side orchestration layer takes over. The JavaScript coordinator loads the required WebAssembly (WASM) modules and executes each operation sequentially. This architecture helps preserve privacy because the language model is responsible only for planning the workflow, while the actual document data remains in browser memory during processing. The same orchestration pattern can be implemented using commercial AI services or local open-weight models such as Gemma, Llama, and Mistral, depending on deployment requirements.
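The client-side loop described above can be sketched in TypeScript as follows. This is an illustration under stated assumptions rather than TellPDF's implementation: PlanStep, Handler, HandlerLoader, executePlan, and the flat step shape are hypothetical names, and the demo handler is a plain-JavaScript stand-in that treats each byte as one "page" so the sketch runs without a real WASM module. In a real build, a loader would more likely fetch and instantiate a module with the standard WebAssembly API.

```typescript
// A plan step as the coordinator sees it: an action name plus parameters.
// The flat shape (parameters alongside `type`) is an assumption.
interface PlanStep {
  type: string;
  [param: string]: unknown;
}

// A handler transforms the document bytes held in browser memory.
type Handler = (doc: Uint8Array, step: PlanStep) => Promise<Uint8Array>;

// A loader lazily produces a handler, e.g. by instantiating a WASM module.
type HandlerLoader = () => Promise<Handler>;

// Run each step in order, feeding the output of one step into the next.
// Each handler is loaded at most once, the first time its action appears.
async function executePlan(
  doc: Uint8Array,
  plan: PlanStep[],
  loaders: Map<string, HandlerLoader>,
): Promise<Uint8Array> {
  const loaded = new Map<string, Handler>();
  let current = doc;
  for (const step of plan) {
    let handler = loaded.get(step.type);
    if (!handler) {
      const load = loaders.get(step.type);
      if (!load) throw new Error(`No handler registered for ${step.type}`);
      handler = await load();
      loaded.set(step.type, handler);
    }
    current = await handler(current, step);
  }
  return current;
}

// Stand-in handler (NOT real PDF logic): treats every byte as one "page"
// and drops the bytes at the requested zero-based indices.
const deletePagesStandIn: Handler = async (doc, step) => {
  const drop = new Set(step.pageIndices as number[]);
  return doc.filter((_, i) => !drop.has(i));
};

const demoLoaders = new Map<string, HandlerLoader>([
  ["DELETE_PAGES", async () => deletePagesStandIn],
]);
```

Calling executePlan with a four-byte input and a single DELETE_PAGES step for index 2 returns the input without its third byte. The point of the design is that the document bytes only ever pass between handlers in the browser; the planning layer sees the request text and returns step descriptions, nothing more.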