Multimodal is the name of the game here, because ChatGPT and Grok can both do the expected (that is, generate text) and the new ...