Image-to-Text Transformation with Maximum Text Generation Using Feedback-Based RL
The core of the system is a Vision-Language Model (VLM) that performs the initial transformation of an image into a textual description. This step is designed to capture the essential information of a scene without exposing personally identifiable visual details. In this way, the system avoids the privacy risks of traditional methods such as blurring, which can be reversed or may fail to fully conceal sensitive content. To enhance the quality and relevance of the generated text, the framework employs a hierarchical, feedback-driven reinforcement learning (RL) agent. This agent, trained with Proximal Policy Optimization (PPO), iteratively refines the initial description by selecting more specific prompts from a predefined list. A Retrieval-Augmented Generation (RAG) module is integrated to provide external feedback, validating the RL agent's prompt selections and correcting potential errors. Together these components form an iterative loop in which the textual description is progressively improved for accuracy and detail over several cycles, ensuring the final output is both semantically rich and privacy-preserving.
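The refinement loop described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the VLM, the PPO policy, and the RAG validator are replaced by hypothetical stub functions (`vlm_describe`, `rag_feedback`, and a random prompt choice standing in for the policy's action), and the prompt list is invented for the example.

```python
import random

# Hypothetical predefined prompt list the RL agent selects from.
PROMPTS = [
    "Describe the scene in general terms.",
    "List the objects present and their spatial relations.",
    "Describe the actions taking place, omitting identities.",
]

def vlm_describe(image_id: str, prompt: str) -> str:
    """Stub VLM: returns a caption conditioned on the chosen prompt."""
    return f"[{image_id}] caption for: {prompt}"

def rag_feedback(caption: str) -> float:
    """Stub RAG validator: scores a caption against (imagined) retrieved
    references. Here longer captions score higher, purely for illustration."""
    return min(1.0, len(caption) / 80.0)

def refine(image_id: str, cycles: int = 3) -> tuple[str, float]:
    """Iterative loop: a policy picks a prompt from the predefined list
    (random here, a PPO policy in the full system); the RAG score plays
    the role of the external feedback / reward signal."""
    best_caption, best_score = "", -1.0
    for _ in range(cycles):
        prompt = random.choice(PROMPTS)   # policy action
        caption = vlm_describe(image_id, prompt)
        score = rag_feedback(caption)     # external feedback as reward
        if score > best_score:            # keep best description so far
            best_caption, best_score = caption, score
    return best_caption, best_score

caption, score = refine("img_001")
```

In the full system, the `score` would be used as the reward for PPO updates to the prompt-selection policy rather than only for keeping the best caption.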
