Using Mixture-of-Experts to Preserve Privacy

The system first employs a Mixture-of-Experts (MoE) architecture in which multiple specialized Vision-Language Models (VLMs) analyze distinct aspects of a scene: traffic dynamics, pedestrians, road signs, and environmental conditions. Each expert generates a targeted description, and these descriptions are aggregated into a comprehensive initial analysis of the image.

A feed-forward gating network then assigns a relevance weight to each expert's output based on the content of the input image, prioritizing the information that matters most in the given context.

In the final stage, a Reinforcement Learning (RL) agent refines the aggregated text into a single, coherent, privacy-aware description. The agent is trained with a composite reward function that balances three objectives: semantic relevance (measured by BERTScore), coverage of key details from the experts (measured by ROUGE), and conciseness, which penalizes redundancy. Trained with policy-gradient methods, the agent learns to preserve the information needed for tasks such as traffic monitoring while abstracting away personally identifiable details. This shifts the privacy-preserving paradigm from traditional image obfuscation to controlled semantic abstraction.
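The expert-weighting step can be sketched as a small feed-forward gate over an image embedding. This is a minimal illustration, not the system's actual implementation: the single linear layer, the eight-dimensional stand-in image features, and the four expert names are all assumptions made for the example; a real gate would take features from a vision backbone and may have hidden layers.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

class GatingNetwork:
    """Single-layer feed-forward gate: maps an image feature vector
    to one relevance weight per expert (softmax-normalized)."""

    def __init__(self, feature_dim, num_experts, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.gauss(0.0, 0.1) for _ in range(feature_dim)]
                  for _ in range(num_experts)]
        self.b = [0.0] * num_experts

    def forward(self, features):
        logits = [sum(wi * f for wi, f in zip(row, features)) + bi
                  for row, bi in zip(self.w, self.b)]
        return softmax(logits)

# Usage: weight four hypothetical expert outputs for one image.
experts = ["traffic", "pedestrians", "signs", "environment"]
gate = GatingNetwork(feature_dim=8, num_experts=len(experts))
features = [0.3, -1.2, 0.7, 0.0, 2.1, -0.4, 0.9, 0.5]  # stand-in image embedding
weights = gate.forward(features)  # one weight per expert, summing to 1
```

The softmax keeps the weights positive and normalized, so the aggregation step can treat them directly as mixing coefficients over the experts' descriptions.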
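The policy-gradient training loop can be sketched with REINFORCE on a deliberately simplified one-step problem: the agent chooses among a fixed set of candidate rewrites and is paid their composite reward. This toy setup (four candidates, hand-picked rewards, a running baseline) is an assumption for illustration; the actual agent operates token-by-token over text.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce(rewards_per_action, steps=3000, lr=0.1, seed=0):
    """Minimal REINFORCE: a categorical policy over candidate rewrites,
    updated with the score-function gradient and a running baseline."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards_per_action)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = rewards_per_action[action]
        baseline += 0.05 * (reward - baseline)   # variance-reducing baseline
        advantage = reward - baseline
        for i in range(len(logits)):
            # d/d logit_i of log pi(action) = 1[i == action] - probs[i]
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)

# Hypothetical rewards: candidate 2 best balances accuracy and privacy.
final_probs = reinforce([0.4, 0.55, 0.9, 0.3])
```

After training, the policy concentrates its probability mass on the highest-reward rewrite, mirroring how the full agent learns to favor descriptions that score well on the composite objective.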