Multimodal AI System Design

10 professional roles

Audio-Visual Grounding Specialist
Build AI systems that spatially and temporally ground language in audio-visual scenes for applications in robotics, media, and accessibility.
Cross-Modal Fusion Architect
Design AI systems that seamlessly fuse text, vision, audio, and sensor data into unified multimodal pipelines for real-world applications.
Embodied AI Perception Designer
Design multimodal perception systems for embodied AI agents — robots, drones, and autonomous systems — integrating vision, language, and sensor data.
Multimodal Content Moderation Architect
Design AI-powered content moderation systems that detect harmful or policy-violating content across text, images, video, and audio at scale.
Multimodal Dataset Curator
Design, collect, annotate, and quality-control multimodal training datasets combining text, images, audio, and video for AI model development.
Multimodal Evaluation Benchmark Designer
Design rigorous evaluation benchmarks and metrics for multimodal AI systems, ensuring fair, reproducible, and meaningful capability measurement.
Multimodal Medical AI System Designer
Design multimodal AI systems for healthcare that integrate medical imaging, clinical notes, lab data, and genomics for diagnosis support and clinical decision-making.
Multimodal RAG System Designer
Design retrieval-augmented generation systems that retrieve and reason over text, images, tables, and documents for knowledge-intensive AI applications.
Speech-Vision Dialogue Architect
Design conversational AI systems that combine speech understanding with visual perception for voice-driven, visually aware assistants and interfaces.
Vision-Language Model Designer
Architect and fine-tune vision-language models (VLMs) for tasks like image captioning, visual QA, document understanding, and grounded reasoning.