Network and cloud operations are increasingly time-consuming and risky, yet operational workflows rarely transfer across deployments because tooling, observability, and interfaces are heterogeneous and inconsistently deployed. Recent AI-based agents can help operators plan, diagnose, and remediate incidents from natural-language goals (AI-for-Ops), but production adoption remains limited: a single wrong action by an AI agent can cascade under delayed, noisy feedback and hidden dependencies. This proposal presents a blueprint that makes AI-for-Ops efficient, reliable, and verifiable. It first introduces two works: (1) MeshAgent, which replaces raw context with compact and efficient constraints that guide and validate DSL generation; and (2) NetArena, an emulator-backed offline evaluator that dynamically generates queries and ground truth to evaluate AI agents' performance at scale. It then discusses future work on online verification of the running agentic system, which serves as a safety guardrail on AI output and monitors outcomes to trigger rollback or escalation. Together, these ideas aim to turn AI-for-Ops from helpful suggestions into trustworthy, deployable automation across diverse network and cloud environments.
Yajie Zhou is a PhD student at the University of Maryland, College Park, advised by Prof. Zaoxing (Alan) Liu. Her research focuses on AI for systems and networking. She aims to make AI-driven methods robust and deployable for real-world network and cloud operations across diverse environments.

