OsaTeching

Grok vs. Gemini: a seven-round AI image generator showdown


Creating images with artificial intelligence has never been easier, and chatbots make it easier still, because the language model takes the guesswork out of writing image prompts.

Grok is a relative newcomer in the chat platform arena. It is built into X and freely available, but rumor has it that at some point next year it will go independent with its own dedicated URL. This would put it in more direct competition with Gemini, ChatGPT, Claude, and Meta AI.

The xAI team has also given Grok its own custom AI image generation model. It previously used Flux to create images but has now moved to Aurora, although Elon Musk has said that name should not be used and that Grok should simply be thought of as creating its own images.

Gemini also recently underwent a major overhaul, with Gemini 2.0 Flash joining the models available to Gemini Advanced subscribers. For now, at least, Gemini 2.0 still uses the Imagen 3 model to create images; this will change once Gemini 2.0 has native image generation capabilities.

Both Grok and Gemini are particularly good at tasks adjacent to image generation, such as creating prompts for different models or improving prompts that are already written. So I pitted them against each other on image generation itself.

Creating prompts to test the image generation capabilities of the two chatbots is a bit different from writing prompts for Midjourney or Ideogram. Because the AI fills in the gaps, I kept the prompts simple, using top-level concepts and some description.

It also helps to use trigger words or phrases like “imagine,” “draw,” or “craft” so the model knows you want an image rather than a story or text response. I wanted photographs rather than drawings, so I used “photo” as a keyword.
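For anyone who wants to reuse this pattern, here is a minimal sketch in Python of how such a prompt could be assembled. The helper name `build_image_prompt` and its parameters are hypothetical and only for illustration; they are not part of Grok's or Gemini's interface, and the three trigger words are just the examples mentioned in this article.

```python
# Hypothetical helper for illustration only; not an API of Grok or Gemini.
TRIGGER_WORDS = ("imagine", "draw", "craft")  # example trigger words from this article


def build_image_prompt(scene, trigger="imagine", style="photo"):
    """Combine a trigger word, a style keyword, and a short scene description."""
    trigger = trigger.lower()
    if trigger not in TRIGGER_WORDS:
        raise ValueError(f"unknown trigger word: {trigger!r}")
    # Keep the prompt simple: the chatbot is expected to fill in the details.
    return f"{trigger.capitalize()} a {style}-style image of {scene}."


prompt = build_image_prompt("a violinist practicing alone in a room at dusk")
print(prompt)
```

The style keyword is kept as a separate argument so that swapping “photo” for another style does not require rewriting the scene description.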

Gemini only outputs images with a 1:1 aspect ratio, while Grok so far seems to prefer 4:3. Unless otherwise noted, all images are initial responses, with no subsequent refinements. All were requested within the same session, rather than in a new chat for each prompt.

Prompt: “Please generate a photo style image of a red fox navigating a crosswalk in a rainy city at dawn while a pedestrian with an umbrella waits at a stoplight.”

This first prompt is designed to test how well each model can portray an animal while capturing the appropriate lighting and background elements. The ideal output would look like a stylized photograph with rain effects, while keeping the cityscape as realistic as possible.

Gemini's image is more striking, but I think Grok's is closer to what I had in mind. Its fox is much more realistic than the one in the Gemini image.

Prompt: “Imagine a professional chef's kitchen during the dinner rush in a photographic style, with steam rising from the pots and flames visible from the grill station.”

This tests how accurately each model renders kitchen equipment and how well it handles the elements called for in the prompt, such as heat and steam. The image should show how a commercial kitchen looks and operates, and give a sense of activity.

This was an easy win for Grok because Gemini failed to understand the context of the prompt, namely that a chef is expected to be in the kitchen.

Prompt: “Produce a photograph, in the style of documentary photography, of a mid-rise building under construction with workers installing glass panels on a sunny afternoon with a crane operating overhead.”

This prompt is intended to see how well each model handles perspective, as the image has to convey height and position. It also needs to show the characteristics of the materials and be as realistic as possible. I chose the documentary style because it adds further complexity.

Gemini's image looks much more realistic than Grok's, which shows none of the workers and only a wide view.

Prompt: “Take a photo of a busy farmer's market at 7:00 AM in the style of a smartphone photo.”

In this round, the model needs to show not only the freshness of the produce and the movement of people, but also the time of day through correct lighting. I was looking at shadow length and activity level.

This was the most difficult decision for me. I liked the natural look of the Gemini image, but I think Grok captured the lighting and time of day more accurately.

Prompt: “Make a black and white retro style photo of a mechanic using a diagnostic tool on a modern car with the hood up and the engine compartment visible.”

I wanted to see how well both models could handle black and white photography. The image also had to show the tools, the lighting, and the engine details.

Again, the two images were very close, but I gave the edge to Gemini, which showed the engine details more accurately.

Prompt: “Please make an action photo of paramedics treating a patient on a neighborhood street while the police are directing traffic around the scene.”

Action photography is difficult. I did it for a while as a journalist early in my career (though not very well). The image needs to show correct positioning, public safety measures, and a sense of urgency.

Gemini created an image that was closer to the prompt and more realistic. This was an easy decision.

Prompt: “Create a photo-style image of a violinist practicing alone in a room at dusk.”

Finally, something more artistic. Here I was looking at the position of the violinist's hands, the natural lighting effects, and the quality of the sheet music.

One of these looks like a classical album cover and the other looks like a photo of a person practicing the violin. Since the prompt calls for a person practicing, I gave the win to Grok.

Grok is very impressive, not only as a chatbot but in its ability to generate realistic images. Gemini, while very impressive in its own right, has a habit of over-stylizing its images.

It was a close race: Grok took four rounds to Gemini's three. The two models are about even on quality, but Grok is better at interpreting prompts and produces more natural-looking images.
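For readers who want the round-by-round result spelled out, here is a minimal sketch in Python that tallies the winners. The winners are taken from the verdicts given in each round of this article; the short round labels are my own shorthand, added only for illustration.

```python
from collections import Counter

# Winner of each round as stated in the round verdicts; labels are illustrative shorthand.
ROUND_WINNERS = {
    "fox at a crosswalk": "Grok",
    "chef's kitchen": "Grok",
    "building under construction": "Gemini",
    "farmer's market": "Grok",
    "mechanic in black and white": "Gemini",
    "paramedics action shot": "Gemini",
    "violinist at dusk": "Grok",
}


def tally(winners):
    """Count how many rounds each chatbot won."""
    return Counter(winners.values())


scores = tally(ROUND_WINNERS)
for name, wins in scores.most_common():
    print(f"{name}: {wins} rounds")
```

Note that the farmer's market round was the closest call, so a different judgment there would flip the overall result.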

It is worth noting that Google will soon announce a new version of Gemini that can create images natively. Gemini will then no longer need Imagen 3 and will be able to create images on its own.