We get a rather substantial winter break at SUNY Brockport, and I often use this time to do something “different”. This year, I wanted to explore image generation using AI (since everything AI is all the rage right now). Furthermore, I upgraded my computer recently and purchased a high(er) end graphics card. I got to thinking that running AI generation software locally might be a good test of my system, but I needed a project. Then, I found this in my inbox:
I tend not to trust NY Times emails (if you load the blocked content, then any ads in the email are shown as well; if I go directly to the website, my ad blockers do their job). What is interesting is that the alt text for the image seems very AI-like. I wonder why a company focused on writing words needs to use an AI to generate a caption for an image they created (hopefully they did create it). It got me thinking: if they used AI to generate this caption, I would like to see what AI does with those words as a prompt to generate images.
I would not say that getting my computer set up for image generation was trivial; however, the documentation at Hugging Face has been extremely helpful, with clear and concise examples. Knowing a bit of Python (and knowing that ChatGPT can generate the code I don’t know) has made creating a simple, locally driven image-generation app fairly straightforward.
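For the curious, the core of such an app is only a handful of lines. Here is a minimal sketch assuming the Hugging Face diffusers library and a CUDA-capable card; the checkpoint name and settings are just examples, not a prescription:

```python
# Minimal text-to-image sketch using the Hugging Face diffusers library.
# The checkpoint and settings below are illustrative examples.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example checkpoint
    torch_dtype=torch.float16,         # half precision to fit on a consumer GPU
).to("cuda")

image = pipe("a watercolor painting of a lighthouse at sunset").images[0]
image.save("lighthouse.png")
```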
Anyway, on to the results. I took the caption above and used it to prompt some image generation models. There are a bunch out there, including Stable Diffusion, OpenJourney, and OpenDalle. I’ve got a version of each stored locally, and here’s what they came up with.
PROMPT: An illustration of a silhouetted cut out of a woman from the shoulders up on a wooden surface. Through the negative space of the woman we see blue sky with white daisies and green leaves flowing in a circular pattern.
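As a rough sketch of the workflow (again assuming diffusers; the OpenJourney and OpenDalle model IDs below are just examples of checkpoints one might have cached locally), the same prompt can be fed to each model and the four outputs tiled into a grid:

```python
# Sketch: run the same prompt through several locally cached checkpoints
# and save a 2x2 grid from each. The model IDs are examples only.
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import make_image_grid

MODELS = [
    "runwayml/stable-diffusion-v1-5",  # Stable Diffusion (example ID)
    "prompthero/openjourney",          # OpenJourney (example ID)
    "dataautogpt3/OpenDalleV1.1",      # OpenDalle (example ID)
]

prompt = (
    "An illustration of a silhouetted cut out of a woman from the shoulders up "
    "on a wooden surface. Through the negative space of the woman we see blue "
    "sky with white daisies and green leaves flowing in a circular pattern."
)

for model_id in MODELS:
    pipe = AutoPipelineForText2Image.from_pretrained(
        model_id, torch_dtype=torch.float16
    ).to("cuda")
    images = pipe(prompt, num_images_per_prompt=4).images
    make_image_grid(images, rows=2, cols=2).save(model_id.split("/")[-1] + ".png")
```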
First up is Stable Diffusion, which is relatively quick on my computer, generating the four examples above in about 30 seconds. It seems to get the silhouette, white daisies, wooden surface, and circular pattern correct, but it is not consistent on the “from the shoulders up” part and doesn’t do much with the “green leaves”. None of these images seems like something the NYTimes would publish. Note that, at the time of this writing, I haven’t looked at the actual image; I’m going to wait to do that until after AI has generated the images.
Next up is OpenJourney, which has a very hard time with the wood portion of the prompt. Like Stable Diffusion, it is inconsistent with the “from the shoulders up” portion as well. “Blue sky” seems to be interpreted as blue hair in one case. We don’t even get circular patterns in all four cases. The flowers seem to be consistent.
Last up is OpenDalle, which in my opinion generates the most stunning results. It is also the most processor-intensive of the three, requiring about five minutes to generate the grid of four images. I am impressed by the presence of wooden surfaces in all four, and this model seems to be the only one to consistently interpret “from the shoulders up”. We still see some blue hair. None of these examples has what I might consider “negative space”; Stable Diffusion created two examples with it (the two on the right-hand side) and OpenJourney created one (upper right).
So now the great reveal – here is the actual image. I suspect this is more suspenseful for me than my loyal readers.
First, let me credit Sean Dong for the image. Second, let me say that all three AI models did a pretty poor job of turning the (most likely AI-generated) prompt back into an image. Granted, I am using publicly available models and I don’t know what they’ve been trained on (although there’s a better chance of learning that than with the commercial models). All three had a really hard time with the concept of “negative space”, and none of them adequately represents the flowers and sky within it.
This little exercise gets me thinking about what would happen if I ran the create-an-image/create-a-prompt cycle for multiple iterations. Does it converge on a consistent image/prompt combination, or does it blow up and go in some bizarre direction? I’ve got another week before having to think about classes, so perhaps I’ll find out.
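In rough form, that loop might look something like this sketch, with BLIP standing in for the captioning step and Stable Diffusion for the generation step (both are just placeholders for whatever pair I end up using):

```python
# Rough sketch of the image -> caption -> image loop: generate an image,
# caption it, and use the caption as the next prompt. Model choices are
# stand-ins, not a recommendation.
import torch
from diffusers import StableDiffusionPipeline
from transformers import pipeline as hf_pipeline

generator = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
captioner = hf_pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")

prompt = "the original alt-text caption goes here"  # starting prompt
for i in range(10):
    image = generator(prompt).images[0]
    image.save(f"iteration_{i:02d}.png")
    prompt = captioner(image)[0]["generated_text"]  # the caption becomes the next prompt
    print(i, prompt)
```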
An interesting experiment.
The kind of thing that will keep your interest even once the semester starts, I expect. Good luck with it.
Hey Bob—I know you are a Mathematica head….
You can do Dall-E-3 image generation from within Mathematica 13.3+
https://jschrier.github.io/blog/2023/11/24/Dall-E-3-image-generation-in-Mathematica.html
I have a few experiments working with this programmatically, e.g., having GPT-4 generate text descriptions and then turning them into images, which might give you some ideas
https://jschrier.github.io/blog/tag/dall-e-3
(and of course, if you want to run locally: https://resources.wolframcloud.com/FunctionRepository/resources/StableDiffusionSynthesize/)
Very cool – I’ve been out of the loop with respect to new Mathematica features.