they don’t pretend (stop anthropomorphizing smh), they just fail to trigger the image-processing engine, and there’s no simple internal script checking whether that engine was ever engaged. to the text encoder there’s no difference between an input that came with a file attached and one that didn’t, so it will produce an answer regardless. a simple validation script, on the other hand, can’t check whether the text output actually matches the ingested data, so the whole thing just fumbles around.
what i’m saying is it has to have very defined scripts for how to behave with outside data. it can reply regardless of that, and the best you can hope for is that the reply in some way relates to the other part of the machine (the “agent” or whatever image/file/video/audio recognition thingy sits there). the obvious sane way: launch a one-shot image descriptor, take its output, and on failure send the LLM a description saying there was no image, then train the damn thing to synthesize both contexts into “sorry mate, no inputs”
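a minimal sketch of that glue layer, assuming a captioner that returns `None` on failure; `run_image_descriptor` and `build_prompt` are made-up names for illustration, not anyone’s actual API:

```python
def run_image_descriptor(image_bytes):
    """One-shot captioner stub: returns a description string, or None on failure.
    Stand-in for whatever real image-recognition component sits in the pipeline."""
    if not image_bytes:
        return None
    return "a photo of something"  # placeholder for the real captioner output


def build_prompt(user_text, image_bytes):
    """The 'defined script': explicitly tell the LLM whether image ingestion
    succeeded or failed, instead of letting it answer as if nothing happened."""
    description = run_image_descriptor(image_bytes)
    if description is None:
        # the failure branch: the model should be trained on this context
        # so it can actually say "sorry mate, no inputs"
        return f"[no image was provided]\nUser: {user_text}"
    return f"[image description: {description}]\nUser: {user_text}"
```

the point is that both branches produce an explicit context string, so the downstream model never has to guess whether an attachment existed.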