It's high time that such tools were developed: https://wise.wmcloud.org/
Search in #wikicommons is a pain and mostly doesn't yield results that are in any sense representative of what Commons actually has to offer. Using multimodal embeddings in this context is very promising (and a good example of a reasonable application of AI).