Generative AI — this isn’t the revolution

somak roy
9 min read · Aug 11, 2024

Midjourney: The Taj Mahal on a full moon night, as seen by Van Gogh

Generative AI is a strange, otherworldly intelligence.

ChatGPT and Midjourney have bottled the collective human civilisational output with math. The result is still not quite human.

This is not a knock on the machine. If smart is what smart does, Midjourney is smarter than Rembrandt. The bot on Discord can give you a Monet, a Dali, a Mughal miniature, a Dutch Master, a cubist nude descending the stairs, a Rajasthani stepwell, a Raja Ravi Varma, and an anime assassin in an hour. The minutiae of technique essential to each school take a lifetime of toil. Midjourney can do it all.

In a phenomenon known as convergent evolution, nature sometimes arrives at a particular design independently. Flight evolved separately in insects, mammals, and birds. The fourth invention of the wing was that of the pterosaur. Of course, one must also bring up Kitty Hawk in 1903. The convergence is functional. The form might differ greatly. Feathers versus skin, for instance.

The mosquito’s buzzing wings, the bat’s darkly foreboding shadow, and the Albatross’s majestic span have entirely different genealogies. The same is the case with intelligence (exhibit A is the Giant Pacific Octopus), though an intelligence like ours, with the infinite recursion of natural language, evolved just once.

The replication of a useful function is what’s critical. Less so the form.

It’s not a good use of a working adult’s time to quibble over whether generative AI constitutes intelligence. Smart is what smart does. As with the different flight engineering of mammals, birds, and the jet engine, what matters is not the definition of flight, but that the jet engine, while lacking the aesthetics and elegance of the Albatross, can ferry 400 squishy bodies across oceans in less than a day with a record of safety unmatched in the history of transportation. It’s what it does.

Generative AI too will live and die by what it does. Human intelligence is as fallible as machine intelligence. Anybody who brings up politics on the Internet learns its limitations first-hand. But human intelligence does useful things, such as Rembrandt and LinkedIn. Can generative AI?

It might. But as of this Monday, and as of Midjourney v5.2, it runs into a trifecta of challenges: reasoning, composability, and world knowledge.

I asked Midjourney for a take on:

“Rabindranath Tagore and Bob Dylan hanging out.”

The two men won the literature Nobel for their work as lyricists a century apart. Dylan was born 75 days before Tagore died (coincidence? I think not).

The system made grave errors.

Midjourney — Rabindranath Tagore and Bob Dylan hanging out.

That’s all right. Human artists too do things wrong and do the wrong things. Damien Hirst exists. Rothko was considered a genius. But Midjourney is wrong in a way no human artist ever is. The errors are specific and revelatory.

A human artist tasked with bringing an imagined clash of titans to life would have to consider several matters of practicality. Tagore from which era? The Indian poet was precocious, and prolific throughout his long life. Born to privilege, he was photographed often. The corpus has images from the time he was a strikingly handsome man in his 20s to the eventual trailing off at 80.

Which is also the case with Dylan, who became a legend before 25. Are they meeting as young men? Is this Dylan in his years on and off with Joan Baez, or a bit earlier with Woody Guthrie? Is this after the motorcycle accident? Is it after the Never Ending Tour kicked off in 1988?

Where are they meeting? On neutral ground — Istanbul, where the irreconcilable twain, the East and the West, supposedly meet? Or, is one Master hosting the other? Does this happen in the American Midwest, in New York, or at the university Tagore founded in India?

What are they doing? Are they jamming? Which instruments? Does Tagore wield the ektara? Do they smoke — nicotine or marijuana? Do they drop acid?

The central premise of the current AI paradigm is that such questions will be resolved by billions of parameters tuned to precision by being trained on our species’ civilisational output. Add GPUs, billions, and 160-IQ scientists and engineers, and a plausible composite will be achieved in tens of seconds.

The plausibility will be physical and cultural. Legs are two at a time, fingers five, people are seated on solid objects and don’t float in midair. Objects have boundaries. A friendly arm around the shoulder cannot pierce the friend’s lung. That is world knowledge. The world knowledge needn’t be built; the training corpus has it built in. Since the days of the Altamira cave paintings, human artists have been limiting legs to two, fingers to five, and avoiding death by mangled chest.

World knowledge is contained within the corpus of images and text. The authoring of explicit rules — unnecessary.

Plausibility on the dimension of physical world knowledge has been achieved to an extent. Phantom digits and limbs have been solved in recent iterations. But there is a world of rules, not mere probabilistic relationships, that are in the domain of law, art, architecture, sociology, economics, politics, literature, and history — broadly culture.

This is unsolved. That this is unsolved becomes particularly evident when you mix genres and milieus. Exhibit A is fashion. Exhibit B is music gear. Tagore wore a kind of kimono. Bob Dylan, famously, has a style of his own. Sometimes it’s bohemian, sometimes it’s all-American. But it’s purposeful and distinct. The style has rules.

It’s the same with instruments. If there’s Dylan there should be a harmonica. The Midjourney image clearly doesn’t come from any line of reasoning based on historical fact, nor is it composed. There isn’t any picking up of elements and fusing them together in a culturally plausible way. Midjourney seems to infer that the task is to create hirsute, oriental, spiritual men, in a period, sepia-toned setting, with a side of music. The grim, grave, solemn mood of every photograph back then is the element the algorithm prioritised.

Now, could it be the case that the image comes out the way it does because the two milieus are worlds apart, and the training dataset simply didn’t have enough instances of works blending ‘Indian between 1861 and 1941’ and ‘Anglophone folk/rock/poetry since 1961’?

How about a similar ask where the milieus are culturally closer?

“Kurt Cobain and Bob Dylan hanging out”.

Midjourney: Bob Dylan and Kurt Cobain hang out.

‘Seattle, grunge, the early 1990s’ meets ‘an unbroken chain connecting everything before 1961 and everything after’.

It does come out better. Cobain’s unruly hair and movie-star good looks are intact. However, Dylan is not quite Dylan. And interestingly, the system has concluded that the attribute to be prioritised is Cobain’s long, unruly hair… for Dylan wears his mane in the exact same way. Meanwhile both geniuses have lost a leg each. The choice of setting is strange. This isn’t recognisably anywhere. A picture of excess (needles on the floor, signs of a fight with Courtney Love) would have been appropriate. How about a studio? These are two working musicians. How about a stage? How about backstage? What’s with the flowers?

(I can assure you the other iterations of the same prompt weren’t much better, but everybody had two legs. I tried all sorts of prompt engineering blood sugar sex magic as well. The kind and degree of error weren’t much different.)

The specificity of the error betrays the inner workings of generative AI. The training doesn’t lead to any higher levels of abstraction around people, celebrities, cultural epochs, fashion artefacts, and even objects. It’s brute force and chance all the way down. There are no levers for reasoning; there is no composability; there’s no world knowledge. And none of these emerge. There is no emergent celebrity object.

As decades of industrial automation have established, limited, task-specific automation is fine, even world-changing, as long as it’s reliable. Assembly lines are effective theatres for man-machine collaboration, but the enterprise requires predictability. How can a floor of professionals with deadlines and clients rely on prompt ‘engineering’, which is mystical, closer to the incantations of a modern-day priest than actual commands?

After months of fooling around with Midjourney, I began to get a sense of what the machine was actually good at.

I asked Midjourney for a Van Gogh take on heavy rains. And the collective consciousness, bottled by linear algebra, delivered.

This came out well. And that it did says something about what generative AI is good at and where it falters. Asked for a canonical example of a genre where the defining elements of the genre are well understood, AI delivers. The trouble starts when you mix genres.

Generative AI is great at ‘geist’: the spirit of a time, a place, a genre, a category, anything described by a prompt that is about nothing but a clearly enunciated ‘geist’. The system can pick up the whole ‘thing’ as an attribute well, when the ‘thing’ is well represented in the corpora. A Van Gogh painting is instantly recognisable and there are plenty of Van Gogh paintings.

The trouble starts when you mix genres and things, and elemental attributes (limbs, fashion styles) have to be parsed and assembled in accordance with reason and world knowledge (which fashion paradigms go together). That cannot be done with any regularity because higher level abstractions are not emergent.

All we have is the imprecision of ‘prompt engineering’.

Which is why what generative AI does best is stock photography. In stock photography, genres are limited and the specifics of the individual work don’t matter, only the presentation of the genre-defining attributes within a well understood framework does.

If you need “a man in a store aisle with a smartphone” for your corporate website, Midjourney is your friend.

Midjourney

But here too trouble awaits if you have a brief. Or a picture in your mind that has taken hold and is begging to be replicated. There is no composability. If you want the shopper to be a certain way, oriented a certain way, or in a particular kind of store, or in the company of his family, much heartbreak will follow. Phrases, clauses, commas, other delimiters — it’s hard, and what’s worse: it’s unpredictable.

Disrupting the stock photography subscription business is a great use case, but that’s not where civilisation-changing tech should stop.

There is another thing generative AI does well, which is… ah, a kind of pastiche of digital art.

Here I ask for “The Taj Mahal, as seen by Van Gogh.”

The Taj Mahal on a full moon night, as seen by Van Gogh.

This is good. In fact, it’s brilliant. The sky is indisputably from the mad, mad mind of the man, just a few short years before history’s worst case of unmourned death. There is a star, perhaps from Islamic imagery, an appropriate detail in a fusion that has a Mughal mausoleum as subject.

The AI gets several things wrong though. The minarets at the back are off structure. The garden is inaccurate. The yellow spots aren’t quite the brilliant exaggeration of stars above the Rhône. It’s not clear what they represent.

Of course, I am being petty.

AI does deliver here; to say otherwise is nitpicking. The marriage of two widely different cultures and schools of art has been forged through mathematical hijinks. It has been said AI is whatever machines can’t do at the moment. The Van Gogh Taj Mahal would have been sorcery in early 2022. The pace of change shouldn’t cost us the capacity for wonder.

Perhaps Midjourney gets the Van Gogh Taj Mahal right for a reason. I am not the first person to wonder what the genius would have made of the monument to love on a full moon night. I am not the first person to wonder what Van Gogh would have made of the Eiffel Tower or the Sydney Opera House or the Golden Gate Bridge. There is an active genre of such amateur art on Behance, DeviantArt, and Reddit. This is an obvious item on an intern’s portfolio.

Replicating the form is a win for generative AI. But this isn’t the revolution.


Written by somak roy

Head, digital advisory services, Litmus7
