Hi Guys,
I’d love to see real-time multimodal support for agent requests, allowing agents to process voice, video, and image inputs & outputs. Is this on your roadmap? If so, do you have a timeline for it?
Thanks!
I love this idea. Could you give an example flow of how you would want to use it?
Dear Tony,
Thank you for your reply; I’m glad you like it. I want to build a real-time application in which my client continuously streams his/her voice or video. I need to pass this stream to my Crew and get back a stream of responses as video, voice, text, tool calls, etc.
For example, I would stream the user’s voice to my crew, and the crew would process it in real time and respond according to the instructions given to it.
So far, some LLMs such as GPT-4V and Gemini already support this feature.
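To make the flow more concrete, here is a rough sketch of the shape I have in mind. Everything in it is hypothetical: `MediaChunk`, `AgentEvent`, `client_stream`, and `realtime_crew` are names I made up for illustration, not existing CrewAI APIs, and the "crew" just echoes what it receives instead of calling a model.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class MediaChunk:
    kind: str       # "voice", "video", or "image"
    payload: bytes  # raw bytes of one chunk of the stream


@dataclass
class AgentEvent:
    kind: str       # "text", "voice", "video", or "tool_call"
    content: str


async def client_stream() -> AsyncIterator[MediaChunk]:
    """Hypothetical stand-in for the user's live microphone/camera feed."""
    for i in range(3):
        await asyncio.sleep(0)  # yield control, as a real network read would
        yield MediaChunk(kind="voice", payload=f"chunk-{i}".encode())


async def realtime_crew(chunks: AsyncIterator[MediaChunk]) -> AsyncIterator[AgentEvent]:
    """Hypothetical crew entry point: emits one response per incoming chunk."""
    async for chunk in chunks:
        # Assumption: a real implementation would forward the chunk to a
        # multimodal model here; this placeholder only reports what arrived.
        yield AgentEvent(
            kind="text",
            content=f"heard {len(chunk.payload)} bytes of {chunk.kind}",
        )


async def run_session() -> list[AgentEvent]:
    """Consume responses as they are produced, without waiting for the input to end."""
    events = []
    async for event in realtime_crew(client_stream()):
        events.append(event)
    return events


if __name__ == "__main__":
    for event in asyncio.run(run_session()):
        print(event.kind, event.content)
```

The important part for me is that both sides are streams: the crew starts responding while input is still arriving, rather than receiving one complete request and returning one complete answer.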
Thanks for your attention,
Kind regards,
Hadi.
Another use case is to consider screenshots in email attachments along with the text, to give the agent the full context of, for example, a help desk ticket.