augmenting human creativity: lessons learned from giving a computer a creativity test

Remember the specific period of the pandemic when Twitter feeds were filled with magical new use cases of OpenAI’s newly public language prediction model GPT-3 — new chapters of Harry Potter (1), a comedic conversation between Alan Turing and Claude Shannon (2), the Figma designer on speed dial (3)?

All of these are impressive creative outputs produced by a computer. So, my first question was this: Knowing that artificial intelligence builds on pre-established training data, is it possible for a computer to produce objectively creative outputs? To test this in a scoped manner, I was curious how GPT-3 might perform on standard creativity tests.

What I learned was remarkable: GPT-3 did not perform very well on the majority of the tests, but in the process, I discovered that pairing GPT-3’s outputs with my own intelligence and creativity dramatically improved my own performance on these creativity tests. And GPT-3’s results on one test in particular were in fact embarrassingly superior to my own.

Background on Test Methods

I selected creativity tests for this experiment based on a few criteria:

  1. Reputability of the creativity test.
  2. Objectivity of the grading standards.
  3. Interpretability of the test prompts by a computer within the constraints of GPT-3.

Based on these criteria, I landed on two primary creativity tests:

Convergent thinking test: the Remote Associations Test

The Remote Associations Test (Mednick and Mednick, 1962) contains a series of questions that each provide three seemingly-unrelated words. The user must think of a fourth word that unites the three.

For example, take the words RIGHT, CAT, and CARBON. The correct response is COPY (copyright, copycat, carbon copy). Or, take SANDWICH, HOUSE, and GOLF. The expected response is CLUB.

The test itself is considered a “convergent” thinking test. Because much of the creative process involves combining seemingly unrelated ideas into a solution or product that is both novel and useful, this test has always stood out to me as a fascinating measure of creativity.

Importantly, from an inputs and outputs standpoint, this test seemed optimal. There is really only one correct answer to each of the test questions, and the questions are conveniently formatted to prompt a computer.

Divergent thinking test: the Divergent Association Task

The Divergent Association Task (Olson et al., 2021) includes a simple prompt: Please enter 10 words that are as different from each other as possible, in all meanings and uses of the words.

If you have not taken the test yet and are curious, the authors recommend taking it before learning how it works, so feel free to do so here to avoid any spoilers: datcreativity.com

The test then provides a precise score by evaluating how different each word is from the rest, calculating “the average semantic distance between each of the words” based on how frequently words are used together. (4)
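
To make that scoring idea concrete, here is a minimal sketch of the average-semantic-distance calculation, written in the same JavaScript used later in this post. It is an illustration rather than the official implementation: the embeddings lookup is a hypothetical stand-in for pre-trained word vectors (the published test uses GloVe-style embeddings), and the final multiplication by 100 simply mirrors the scale on which scores are reported.

// Cosine distance between two word vectors: 1 minus their cosine similarity.
function cosineDistance(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Average the distance over every pair of submitted words, then scale by 100.
function datScore(words, embeddings) {
  const vectors = words.map((w) => embeddings[w]);
  let total = 0;
  let pairs = 0;
  for (let i = 0; i < vectors.length; i++) {
    for (let j = i + 1; j < vectors.length; j++) {
      total += cosineDistance(vectors[i], vectors[j]);
      pairs += 1;
    }
  }
  return (total / pairs) * 100;
}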

This is a form of a divergent thinking test, as it requires creating open-ended responses to a particular prompt. Divergent thinking (this test) coupled with convergent thinking (the above test) make up creative thinking, so this seemed like a prime test to pair with the above.

What I especially like about this test is its data-driven approach. I knew I would be able to take GPT-3’s outputs, plug them into this online test, and immediately see the semantic distance between words, the computer’s final score on this test, and how the computer compared to the rest of the participants who have taken this test.

Both of these tests have highly objective grading that requires no evaluation from me, as well as a pool of average results from prior participants to benchmark the computer’s scores against.

However, out of curiosity, I wanted to give one additional test a whirl in this experiment. This test requires more human effort in grading (I’m hesitant to say “subjectivity,” since all of these tests are meant to be objective, but some do require manual grading against a rubric), which makes it less useful for my purposes here. Still, I was curious to try it even without an established pool of results to benchmark against. This final test is:

Divergent thinking test: Alternative Uses Task (Guilford)

The Alternative Uses Task (Guilford, 1967) prompts users with a common object, such as a brick or trash can, and invites them to generate as many use cases for the object as possible. Note: Some of you may recognize this as a relatively common interview question, particularly in product management.

For example, a brick can be used to build a house, but it also might be used as a weapon, a paperweight, or a bookend.

The outputs are then formally graded on four qualities: originality, fluency, flexibility, and elaboration. These are typically graded against a sample pool, and while there is a precise and methodical approach to scoring, it makes the test more difficult to grade in a one-off instance like this than the two tests above.
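
As a rough illustration of how two of those four qualities are often tallied, here is a small sketch in the same JavaScript used later in this post. Originality and elaboration are omitted because they require comparison against a normed sample or a rater’s judgment; the example responses and category labels below are hypothetical and would normally be assigned by a trained grader.

// Fluency is simply the number of distinct ideas; flexibility is the number of
// distinct categories those ideas span. The category labels here are hypothetical.
const responses = [
  { use: "build a house", category: "construction" },
  { use: "use as a weapon", category: "weapon" },
  { use: "use as a paperweight", category: "holding things in place" },
  { use: "use as a book end", category: "holding things in place" },
];

const fluency = responses.length; // 4 ideas
const flexibility = new Set(responses.map((r) => r.category)).size; // 3 categories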

Nonetheless, the outputs speak for themselves in a sense, and I figured it would be insightful to see how a computer would respond to this prompt and test.

Process and Results

For each of the above three tests, I used the default playground version of GPT-3. I’ve outlined the processes and results for each individual test in detail below.

Convergent thinking test: the Remote Associations Test

Recall that this test prompts the user to think of a word uniting the other three.

After tweaking the language used in the prompt and the sample words provided, I landed on the following prompt and examples:

const { Configuration, OpenAIApi } = require("openai");

const configuration = new Configuration({ apiKey: process.env.OPENAI_API_KEY });
const openai = new OpenAIApi(configuration);

// Two worked examples (club, ice) precede the item being tested (duck, fold, dollar).
const response = await openai.createCompletion("text-davinci-001", {
  prompt: "List a single word that links the other three words together.\n\nThree words: sandwich, house, golf\nWord: club\nThree words: cream, skate, water\nWord: ice\nThree words: duck, fold, dollar\nWord:",
  temperature: 0.7,
  max_tokens: 64,
  top_p: 1,
  frequency_penalty: 0,
  presence_penalty: 0,
});
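
For completeness, with the version of the Node.js SDK shown above, the model’s answer should come back on the response object roughly as follows; the expected linking word for duck, fold, and dollar is “bill.”

// The completion text sits on the returned data object in this SDK version.
// Trim whitespace before comparing it against the expected answer.
const answer = response.data.choices[0].text.trim().toLowerCase(); // ideally "bill"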

I selected questions from this online Remote Associations Test. I included five each of Very Easy, Easy, Medium, Hard and Very Hard.

Out of the total 25 questions, the computer answered 8 correctly. There was no clear relationship between difficulty and accuracy (the computer consistently got one or two of the five questions right at each difficulty level).
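
For those curious how a run like this might be scripted rather than pasted question by question, here is a hedged sketch that reuses the openai client from above. The items array is a placeholder for the 25 questions I selected, and the few-shot examples from the earlier prompt are omitted for brevity.

// Hypothetical batch runner for the 25 selected items. Each entry pairs the
// three given words with the test's expected answer.
const items = [
  { words: ["right", "cat", "carbon"], expected: "copy" },
  // ...the remaining 24 items
];

let correct = 0;
for (const item of items) {
  const prompt = `List a single word that links the other three words together.\n\nThree words: ${item.words.join(", ")}\nWord:`;
  const res = await openai.createCompletion("text-davinci-001", {
    prompt,
    temperature: 0.7,
    max_tokens: 64,
  });
  if (res.data.choices[0].text.trim().toLowerCase() === item.expected) correct += 1;
}
console.log(`${correct} of ${items.length} correct`); // the computer managed 8 of 25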

This wasn’t a particularly great score. I suspect most readers who are comfortable with English could score 80% or higher on the Very Easy and Easy questions, at minimum. That said, the computer’s speed (less than a second) was certainly quicker than the average human’s. (5)

The most fascinating takeaway from this test arose upon seeing a few truly creative twists in GPT-3’s responses. While many of the computer’s incorrect responses simply missed the mark (perhaps by tying only two of the words together, or by selecting a synonym for one of the words), a few responses really surprised me and almost convinced me that there could, in some cases, be more than one valid fourth word.

For example, with “AGE / MILE / SAND,” the computer’s response was “TIME.” That’s deep, a bit existential, and pretty impressive. The correct response, according to the official Remote Association Test response, is “STONE,” which also works.

But frankly I’m inclined to believe the computer outsmarted the test here.

For a full writeup on process and a full overview of the results, see here.

Divergent thinking test: the Divergent Association Task

GPT-3 struggled to understand this test’s prompt (“Please enter 10 words that are as different from each other as possible, in all meanings and uses of the words.”). Depending on the wording, GPT-3 might list five pairs of opposites, but not 10 words that were all inherently different from each other.

Some phrasings of the prompt did eventually seem to get through to GPT-3, but the results were still below average. There were some pretty weird results in general:

  • Much like a human, GPT-3 tended to fall into a “rut” of one type of word, producing words that were technically different but not different enough, such as “fire,” then “wind,” then “water”;
  • At one point, GPT-3 did not seem to be using actual words (“yar,” then “zar”);
  • In another case, GPT-3 listed out ten variations of the word shelter.

Ultimately, the highest score I was able to squeeze out of GPT-3 was 77, which was higher than 44% of people who have taken the test.

Divergent thinking test: Alternative Uses Task (Guilford)

Recall this test prompts users to list out as many uses of an everyday object as possible.

The prompt I pulled was “hair pin.” I first tested this on myself, giving myself two minutes to list out alternative, unique use cases for a hair pin. I did not grade myself officially, but I took note of my responses.

I then experimented with a variety of written prompts to land on one that seemed to be best understood by GPT-3. This was far less of a struggle than with the prior test. The computer seemed to consistently understand each prompt I chose, even though some worked better than others.

Each variation of wording on the prompt seemed to result in entirely new responses of potential use cases for the hair pin. GPT-3 really just kept generating new ideas.
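
The exact wordings I tried are not reproduced here, but as an illustration, a prompt in this spirit can be issued with the same playground settings shown earlier. The specific phrasing and the larger max_tokens below are assumptions for the sketch, not the precise values I used.

// One plausible phrasing of the Alternative Uses prompt, reusing the client from above.
const autResponse = await openai.createCompletion("text-davinci-001", {
  prompt: "List as many alternative uses for a hair pin as you can, one per line:",
  temperature: 0.7,
  max_tokens: 256,
});

// Split the completion into individual use cases, dropping empty lines.
const uses = autResponse.data.choices[0].text
  .split("\n")
  .map((line) => line.trim())
  .filter(Boolean);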

Simply put: GPT-3 blew me away on this exercise. The answers were phenomenal. I did not objectively grade these responses, but the responses were objectively better than my own.

The computer’s proposed use cases for a hair pin were shockingly impressive. Some were related to hair:

  • Divide hair into sections when styling.
  • Hold hair back while you sleep.
  • Add volume to your hair by lifting it at the roots and pinning it in place.
  • Hide a bad hair day.
  • Curl your hair with a hair pin. (6)
  • Create a faux bob.
  • Use a hairpin to keep your hair out of your face when you’re working out.

And many went beyond hair:

  • Secure a scarf.
  • Open a stubborn jar lid with a hairpin.
  • Keep your earrings on while you sleep by using hairpins to hold them in place. (7)
  • Decorate your Christmas tree with hairpins by threading them through the branches.
  • Pick up a small item off of the ground with a hairpin.
  • Make a makeshift bookmark by using a hairpin.
  • Use a hairpin to pick a lock.

Just remember: GPT-3 could create 10 or so of these creative use cases in a matter of milliseconds.

Insights and Lessons / Why?

When Douglas Engelbart wrote Augmenting Human Intellect in 1962, he dreamed of a world where personal computers would be an extension of human intelligence. It’s possible that now, 60 years later, computers have the capacity — to an extent — to replace human intelligence.

But computers have not replaced human creativity. The ability to create, to build something both novel and useful, is still mostly a uniquely human trait. I believe we are in an extraordinary era of Augmenting Human Creativity, where computers are in fact becoming the extension that allows us to create in more effective, optimal, and extraordinary ways.

This is precisely what I uncovered, unexpectedly, in this experiment. The computer failed pretty dramatically in two of the three creativity tests, and I am confident an average human could perform better. Yet I found that many of the computer’s responses actually spurred my own creativity, pushing me to think outside the box a bit more, too.

And in the third test, despite it being the least measurable and perhaps more of a free-form, “think outside the box” brainstorm, the computer absolutely outperformed me, at a speed beyond what I could possibly imagine.

I envision a world where we could be using a version of GPT-3, outfitted with much more effective human-computer interaction capabilities, to augment our creativity every day.

I’ll leave you with an overview of my primary insights after this experiment:

  1. We still have a long way to go before artificial intelligence can replace human creativity. There were some obvious flaws with computers attempting these tests, and humans would consistently outperform GPT-3 as it stands.
  2. The time is now for artificial intelligence to become a tool in augmenting human creativity. As demonstrated in this series of tests, computers can immediately be a helpful tool for brainstorming.
  3. One of the most time-consuming elements of this experiment was trying variations of prompts to most effectively issue these tests to the computer. Perhaps we don’t yet know how to give computers instructions effectively. This could all simply be an HCI problem, and improving this element could expedite the immediate potential of AI in augmenting human creativity.
  4. There did not seem to be a way to effectively enforce the idea of usefulness in addition to novelty (the full definition of creativity) when brainstorming with GPT-3.
  5. The speed was remarkable and far above what humans are capable of.

There is much left to be discovered, there is much still to build, and there is much more progress yet to be made.

But I personally cannot imagine a more exciting area of focus than the research happening around computers and creativity, as we move from the computer-augmented information and intelligence era to new bounds of computer-augmented creativity.

Any possibility for improving the effective utilization of the intellectual power of society’s problem solvers warrants the most serious consideration. This is because man’s problem-solving capability represents possibly the most important resource possessed by a society. The other contenders for first importance are all critically dependent for their development and use upon this resource. Any possibility for evolving an art or science that can couple directly and significantly to the continued development of that resource should warrant doubly serious consideration. - Douglas Engelbart




  1. Okay, this was a good read. Click here. ↩︎

  2. Nothing better than two of my favorite people chatting with each other chaotically. Click here. ↩︎

  3. Simply amazed at this one. It’s one of those moments that feel pretty futuristic, but that future is now. Click here. ↩︎

  4. It’s worth taking the time to read the whole paper. Click here. ↩︎

  5. Does speed mean something even if the answer isn’t correct? ↩︎

  6. I’m not sure GPT-3 has ever tried to curl its hair before, but I appreciate the optimism. ↩︎

  7. This one certainly got a good chuckle out of me. ↩︎