Rorschach Tests for Deep Learning Image Classifiers

Mathieu Lemay
Towards Data Science
7 min read · Jun 25, 2018


What do they see, when there is nothing to see?

The inventor of the Rorschach Test, Hermann Rorschach, next to his modern day counterpart, Walter Kovacs.

When exploring image classification and labeling, we tend to play with various libraries to quickly assess and label pictures. But other than memory use and ease of retraining layers, how do they behave when presented with unclear, grainy pictures?

In order to get a better feel for the nuances and trends of each ConvNet at a practical level, I ran the most popular ones, in series, on all of the cards from the Rorschach Test.

The goal was simple: set my expectations and clear up my understanding for each one, so I can use the right tool for the job moving forward.

The Rorschach Test

Invented (or more accurately, modified) by Hermann Rorschach, the Rorschach Test is the most famous of the inkblot tests. An inkblot test is a simple conversation starter with your therapist: you’re shown an intentionally abstract (and usually symmetrical) blot of ink squeezed inside a folded sheet of paper and asked, “What do you see?” Rorschach took it one step further, and created a standard set. This standard set could now be used across patients and therapists.

All 10 cards from the original Rorschach Test. Note the use of colors, as opposed to some of the inkblot tests used before Mr. Rorschach’s standardized version.

The goal of this test is to evoke a subconscious reaction to the abstract art, overlaying it with your own biases and deepest thoughts, so that whatever you see first is what you’re thinking about. Although it has fallen out of favor in the last few years (“pseudo-science” has been used to describe it), it’s still a very useful trope to describe the field of psychology.

In this case, I’ve used it to force various libraries to classify this innocuous dataset into pre-defined labels.

The Classifiers

In order to classify the pictures, I looked at the following libraries:

  • ResNet50
  • VGG16
  • VGG19
  • InceptionV3
  • InceptionResNetV2
  • Xception
  • MobileNet
  • MobileNetV2
  • DenseNet
  • NASNet

(These are all of the native application libraries available with Keras 2.2.0.)

Each one of these classifiers has a particular structure and weights associated with it.

Documentation for each classifier from Keras.

Beyond memory usage and trainable parameter counts, there are a lot of differences in the implementation details of each one. Instead of digging into the particulars of each structure, let’s look at how they view the same confusing data.

(If you want to learn more about each structure individually, here are links for ResNet, Inception, DenseNet, NASNet, VGG16/19, and MobileNet.)
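
If you want those size figures programmatically rather than from the documentation table, a quick sketch is to instantiate a couple of the models and compare their parameter counts:

# Sketch: compare total weight counts for two of the classifiers.
# count_params() is a standard Keras Model method.
from keras.applications.resnet50 import ResNet50
from keras.applications.mobilenet import MobileNet

for name, net in [("ResNet50", ResNet50()), ("MobileNet", MobileNet())]:
    print("%s: %d parameters" % (name, net.count_params()))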

The Results

Overall, the goal is to get a quick sense of each prediction and of the intensity of the belief behind it. In order to do this, I’ve grouped the Top-1 predictions by name, and added their scores together. This gives a sense of how much the classifiers, taken together, are ready to gamble on a particular label for a given card. Doing this for each label gives us a good proxy for the belief of each classifier, and gives us a sense of their relative prediction confidence for each card.

As an example, InceptionResNetV2, NASNetLarge, and DenseNet201 all believed that Card 1 was a warplane (with scores of 88.3%, 46.2%, and 18.6%, respectively). I then added these up to a dimensionless score of 153.1. Now, I can compare these scores across labels and cards to see which predictions the classifiers back most strongly.
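
If you want to reproduce that aggregation, a minimal sketch with pandas looks like this (the column names here are illustrative, not necessarily what the project’s CSVs use):

import pandas as pd

# Hypothetical top-1 rows for Card 1, one per classifier
results = pd.DataFrame([
    {"classifier": "InceptionResNetV2", "card": 1, "labelname": "warplane", "score": 88.3},
    {"classifier": "NASNetLarge",       "card": 1, "labelname": "warplane", "score": 46.2},
    {"classifier": "DenseNet201",       "card": 1, "labelname": "warplane", "score": 18.6},
])

# Sum the top-1 scores per (card, label) to get the dimensionless belief score
belief = results.groupby(["card", "labelname"])["score"].sum()
print(belief)  # Card 1, warplane -> 153.1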

Cards 1, 2, and 3

Cards 1, 2, 3. Or, according to the classifiers, a fighter jet, a rooster, and a book jacket. Go figure.
Card 1:
warplane 153.05
wall_clock 75.41
letter_opener 47.29
lampshade 33.13
buckeye 28.95
jigsaw_puzzle 23.19
paper_towel 22.51
birdhouse 11.50
Card 2:
cock 72.84 # <--- Rooster. They mean rooster.
lampshade 59.99
brassiere 59.47
velvet 43.84
earthstar 41.42
mask 29.46
candle 21.84
tray 19.30
wall_clock 18.41
hair_slide 17.84
vase 11.44
Card 3:
book_jacket 325.35 # <--- No idea what's going on here.
coffee_mug 62.61
chocolate_sauce 45.00
candle 32.68
ant 25.81
matchstick 24.02
jersey 16.94

Cards 4, 5, 6, and 7 — Bugs and Explosions

First row: Card 4 and Card 5. Second row: Card 6 and Card 7. Card 5 is without a doubt a picture of some sort of bug.

Cards 4 through 7 evoke a bit more smoke, spaceships, and bugs for me. Which brings me back to my rural childhood, and playing with the best LEGO sets of all time. (By which I obviously mean M-Tron, the evil Blacktron, and Ice Planet.)

Card 4:
volcano 101.76
fountain 72.11
space_shuttle 32.72
hen-of-the-woods 29.40
pitcher 28.28
vase 25.00
king_crab 23.99
wall_clock 18.25
triumphal_arch 11.04
Card 5: # <--- Bugs. Definitely bugs.
isopod 106.54
king_crab 83.67
ant 61.58
long-horned_beetle 32.23
tick 30.05
hip 26.32
rhinoceros_beetle 14.52
Card 6:
space_shuttle 174.48
warplane 155.58
conch 104.73
missile 63.71
airship 57.73
fountain 11.57
Card 7:
missile 195.66
parachute 52.52
projectile 42.31
space_shuttle 31.18
hip 29.89
geyser 20.92
warplane 17.50

Cards 8, 9, and 10 — parachutes, conchs, and trays?

Cards 8, 9, and 10. How the right one is a “tray” is beyond me.
Card 8:
fountain 189.59
parachute 98.27
umbrella 94.61
pencil_sharpener 63.27
spoonbill 51.08
poncho 45.19
coral_fungus 12.05
shovel 10.12
Card 9:
missile 238.45
fountain 64.82
parachute 48.21
volcano 44.77
paper_towel 41.59
lampshade 13.48
Card 10:
tray 229.22
handkerchief 151.63
conch 77.94
feather_boa 60.34
rapeseed 15.95

Of course, we can go a lot deeper to see the average confidence of each one of the predictions. Below is some sample data for Card 1. All of the results are available on the project’s GitHub page.

All of the results available for Card #1.

The Code

In order to label each picture, we start off by loading Pandas, NumPy, and the Keras image pre-processing libraries:

from keras.preprocessing import image
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.models import Model
import keras.backend as K
import numpy as np
import pandas as pd
import json

We then create a helper function that returns a dataframe containing the scores of the top 10 results for a given model, so we can quickly assemble each image’s scores:

def getLabels(model, dims, pi, dp):
    """
    Returns the top 10 labels for each image, given a model, its input
    image dimensions, its preprocess_input() function, and its
    decode_predictions() function.
    """
    df = pd.DataFrame()

    # 'images' is the list of Rorschach card image paths, defined elsewhere
    for img_path in images:
        # Load and resize the image to the dimensions the model expects
        image = load_img(img_path, target_size=(dims, dims))
        image = img_to_array(image)
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        image = pi(image)

        # Predict what is in the image
        yhat = model.predict(image)

        # Keep the top 10 labels
        labels = dp(yhat, top=10)[0]

        # Create empty list and counter
        labellist = []
        counter = 0

        # Get labels for each image
        for label in labels:
            # Display the score of the label
            print('%s (%.2f%%)' % (label[1], label[2] * 100))

            # Add image results to list
            labellist.append(
                {"labelnumber": counter,
                 "labelname":   label[1],
                 "pct":         '%.2f%%' % (label[2] * 100),
                 "image":       img_path})
            counter = counter + 1

        # Add this image's results to the dataframe
        df_row = pd.read_json(json.dumps(labellist), typ='frame', orient='records')
        df = df.append(df_row, ignore_index=True)
        print("------------")

    return df

Now, we have a method for evaluating each library very quickly.

# ...
# ResNet50 Rorschach assessment
from keras.applications.resnet50 import ResNet50
from keras.applications.resnet50 import preprocess_input
from keras.applications.resnet50 import decode_predictions
model = ResNet50()
df = getLabels(model, 224, preprocess_input, decode_predictions)
df.to_csv('results_resnet50.csv', index=False)

# ----------
# VGG16 Rorschach assessment
from keras.applications.vgg16 import VGG16
from keras.applications.vgg16 import preprocess_input
from keras.applications.vgg16 import decode_predictions
model = VGG16()
df = getLabels(model, 224, preprocess_input, decode_predictions)
df.to_csv('results_vgg16.csv', index=False)
# ...

…and so forth.
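
One detail worth flagging: not every model takes 224×224 input. The Inception-family and Xception models, for example, expect 299×299 images, so the dims argument changes with them (same pattern as above; the output file name is just illustrative):

# ----------
# InceptionV3 Rorschach assessment -- note the 299x299 input size
from keras.applications.inception_v3 import InceptionV3
from keras.applications.inception_v3 import preprocess_input
from keras.applications.inception_v3 import decode_predictions
model = InceptionV3()
df = getLabels(model, 299, preprocess_input, decode_predictions)
df.to_csv('results_inceptionv3.csv', index=False)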

What’s fun about this approach is that it makes for a very functional notebook when playing with different image sets:

A snapshot of the notebook from the repository.

This notebook structure allows you to quickly explore different data sets, and see which classifiers are best for your product or project. Are you looking for reliable top-1 accuracy on a fuzzy picture? What about identifying all of the different contents of a picture? Your mileage may vary, but you can now start exploring on your own.
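
For example, if top-1 behaviour is what you care about, a quick sketch for pulling the top-1 rows back out of the saved CSVs (assuming the files written by the snippets above) looks like this:

import pandas as pd

# Pull the top-1 prediction per image back out of each saved results file
for fname in ['results_resnet50.csv', 'results_vgg16.csv']:
    df = pd.read_csv(fname)
    top1 = df[df['labelnumber'] == 0].copy()
    top1['pct'] = top1['pct'].str.rstrip('%').astype(float)  # '88.30%' -> 88.30
    print(fname)
    print(top1[['image', 'labelname', 'pct']].to_string(index=False))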

You can get the full code here.

Have fun!

-matt.
matt@lemay.ai

Lemay.ai
1 (855) LEMAY-AI


P.S. It’s pronounced roar-shawk.


Matt Lemay, P.Eng (matt@lemay.ai) is the co-founder of lemay.ai, an international enterprise AI consultancy, and of AuditMap.ai, an internal audit platform.