Simple Learned Weighted Sums of Inferior Temporal Neuronal Firing Rates Accurately Predict Human Core Object Recognition Performance

Majaj et al., J. Neurosci (2015). All figures are from the linked paper.

Quick Summary

Key Question: Which area of the ventral visual stream allows readout of core object recognition behavior? Is such readout even possible?

Three-point summary:

  • Record from macaque V4 and IT while the animal is shown images from several object classes at several variation levels
  • Learn the weights of a decoder (SVM) that translates neural responses into categorization
  • Learned weighted sums of IT neural responses, read out by the SVM, were sufficient to match human-level core object recognition performance

Majaj et al. (2015) explores whether a single linking hypothesis can quantitatively account for human-level performance on core object classification. A linking hypothesis, as the authors use the term, can be thought of as a way to read out behavior from neural responses. While a series of studies, Majaj et al. (2015) among them, confirms that image classification ability can be read out from IT, this was not obvious when the study was conducted.

A previous paper we discussed, Yamins et al. (2014), demonstrated the ability to predict IT neural responses from a convolutional neural network optimized for classification performance. However, it was not obvious that the inverse, classification from neural responses, would also hold.


The study described in this paper was performed in four steps:

  1. development of a comprehensive behavioral assay for human subjects
  2. collection of neural data from macaque V4 and IT
  3. implementation of different linking hypotheses
  4. comparison of predicted behavior to actual behavior

Object Recognition Tests & Image Generation

Renderings of 64 objects were used to create the images for the object recognition tasks. The objects shown in the leftmost column of Fig. 1a had their viewing parameters (horizontal and vertical position, size, and rotation around the x, y, and z axes) changed at three variation levels (low, medium, high). The rendered object was then placed on a natural scene image, as shown in the middle column of Fig. 1a. These images were used to create the tasks listed in the rightmost column of Fig. 1a.


Figure 1 Object Recognition Task Setup

Human Behavior Assay

Each human subject was asked to perform one of three larger sets of tests: 8-way basic-level categorization, 8-way car categorization, or 8-way face categorization. Their responses were collected over Amazon Mechanical Turk.

Macaque Neural Response

Macaque V4 and IT neural population responses were measured using multielectrode arrays. The responses were collected during rapid visual stimulus presentation (RSVP), where images were presented in series. Each image was typically presented about 50 times, and every image was shown at least 28 times. A characterization of the neural recording data can be found in Fig. 2a, where the green saturation level indicates the site’s response magnitude to the specific image. Using the multielectrode arrays, the study recorded from 168 sites in IT and 128 sites in V4. The placement of the arrays is shown in Fig. 2b. Note that the monkeys were not performing a categorization task while their neural responses were recorded, so the neural data collected here is simply the response to the images shown.


Figure 2 Neural responses

Training linear decoders

A linking hypothesis, which characterizes how neural responses encode behavior, requires a decoder that can translate the neural code into observable behavior. Monkey neural responses to images and human core object recognition data were used to train linear decoders. While different decoders were tested in this study, the authors ended up using SVMs (Support Vector Machines). Monkey neural data and human psychophysics data were divided into ‘training’ and ‘testing’ sets for cross-validation. The training set was used to learn the weights of 8-way linear classifiers by optimizing for categorization performance.
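To make the idea of a learned weighted sum concrete, here is a minimal sketch of an 8-way one-vs-rest linear decoder trained with a hinge loss on a training split and scored on a held-out split. It is not the authors' pipeline: the synthetic "site responses", the names `train_one_vs_rest` and `predict`, the plain stochastic subgradient training loop, and every hyperparameter are assumptions made purely for illustration.

```python
import random

def train_one_vs_rest(X, y, n_classes, epochs=20, lr=0.05, reg=0.001):
    """Learn one weight vector and bias per class by hinge-loss subgradient descent."""
    n_sites = len(X[0])
    W = [[0.0] * n_sites for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, label in zip(X, y):
            for c in range(n_classes):
                target = 1.0 if label == c else -1.0
                score = sum(w * xi for w, xi in zip(W[c], x)) + b[c]
                # L2 shrinkage on the weights (the regularization term of a linear SVM)
                W[c] = [w * (1.0 - lr * reg) for w in W[c]]
                if target * score < 1.0:  # inside the margin: take a hinge step
                    W[c] = [w + lr * target * xi for w, xi in zip(W[c], x)]
                    b[c] += lr * target
    return W, b

def predict(W, b, x):
    """Pick the class whose weighted sum of site responses is largest."""
    scores = [sum(w * xi for w, xi in zip(Wc, x)) + bc for Wc, bc in zip(W, b)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical data: 16 sites, 8 classes, each class drives two sites (arbitrary units).
rng = random.Random(0)
n_classes, n_sites, per_class = 8, 16, 30
data = []
for c in range(n_classes):
    for _ in range(per_class):
        x = [(1.0 if i % n_classes == c else 0.0) + rng.gauss(0.0, 0.2)
             for i in range(n_sites)]
        data.append((x, c))
rng.shuffle(data)

# Split into 'training' and 'testing' sets, as in the cross-validation described above.
train, test = data[:180], data[180:]
W, b = train_one_vs_rest([x for x, _ in train], [c for _, c in train], n_classes)
accuracy = sum(predict(W, b, x) == c for x, c in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The point of the sketch is the shape of the computation: each category score is a weighted sum of site responses plus a bias, and the decoder's "behavior" is the category with the highest score.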

Comparison of different linking hypotheses

After training linear decoders for each candidate linking hypothesis, the testing set was used to generate predicted behavioral output for each 8-way task set (basic level, cars, faces). Both the predicted level of core object recognition performance and its pattern across tasks were compared to the observed human data to choose the correct linking hypothesis.


Human core object recognition results

Human core object recognition performance was computed using d’, the sensitivity index: d’ = Z(hit rate) - Z(false alarm rate), where Z is the inverse of the CDF of the standard Gaussian distribution. In the context of this paper, a higher d’ value indicates higher core object recognition performance.
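As a small worked example of the d’ formula, the sketch below computes the sensitivity index with Python's standard library. The function name `d_prime` and the clipping constant `eps` are assumptions for illustration; the summary does not say how the paper handles hit or false alarm rates of exactly 0 or 1.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate, eps=1e-3):
    """Sensitivity index d' = Z(hit rate) - Z(false alarm rate)."""
    z = NormalDist().inv_cdf  # inverse CDF of the standard Gaussian
    # Clip rates away from 0 and 1 so Z stays finite (eps is an assumed value).
    h = min(max(hit_rate, eps), 1.0 - eps)
    f = min(max(false_alarm_rate, eps), 1.0 - eps)
    return z(h) - z(f)

# 90% hits with 10% false alarms gives a d' of roughly 2.56.
print(round(d_prime(0.9, 0.1), 2))
# Chance performance (hits equal to false alarms) gives a d' of zero.
print(round(d_prime(0.5, 0.5), 2))
```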

The study notes two unsurprising results:

  1. human core object recognition ability depends on shape similarity
  2. human core object recognition ability drops for high variation images

To simplify, the first result demonstrates that basic-level categorizations, such as car vs. not car and animal vs. not animal, were easy for humans. However, performance dropped for tasks with similar objects. For example, car 1 vs. car 2 was harder than basic-level categorization, and face 1 vs. face 2 proved to be even harder. The second result showed that objects with higher-variation viewing parameters were harder for humans. These results are shown in Fig. 3a. Note that face recognition was quite difficult for humans across variation levels, so much so that high-variation faces were left out of the data. This raises an interesting question regarding face recognition. More on this in the Questions section.

Across the board, humans seem to be quite invariant to object view variation, though certainly not perfect.


Figure 3 Human core object recognition results


The authors claim that their analysis of various linking hypotheses revealed that LaWS of RAD IT, or learned weighted sums of randomly selected average neuronal responses spatially distributed over monkey IT, successfully predicted both the level and the pattern of human core object recognition performance. The specification of the LaWS of RAD IT hypothesis used in Fig. 4 was 128 IT neuronal sites, a time window of 70-170 ms after image onset, and an SVM as the decoder.

As shown in Fig. 4b and Fig. 4c, LaWS of RAD IT predicted the human core object recognition performance pattern shown in Fig. 3a quite well. The predicted performance also shared the two dominant trends identified in observed human performance: LaWS of RAD IT performed best in basic-level categorization, with performance decreasing in subordinate-level categorization tasks where the objects are more similar in shape, and its performance decreased with increased object view variation.
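The "randomly selected average" part of the hypothesis can be sketched as two small steps: averaging each site's firing rate in the 70-170 ms window across repeated presentations, and drawing a random subset of sites. The spike times, the helper names `mean_rate` and `sample_sites`, and the seed are hypothetical; the paper's actual preprocessing (for example, any normalization of the rates) is not reproduced here.

```python
import random

def mean_rate(spike_times_per_trial, start_ms=70.0, stop_ms=170.0):
    """Mean firing rate (spikes/s) in [start_ms, stop_ms), averaged over trials."""
    window_s = (stop_ms - start_ms) / 1000.0
    counts = [sum(start_ms <= t < stop_ms for t in trial)
              for trial in spike_times_per_trial]
    return sum(counts) / len(counts) / window_s

def sample_sites(n_available, n_used, seed=0):
    """Randomly choose which recording sites feed the decoder."""
    return sorted(random.Random(seed).sample(range(n_available), n_used))

# Hypothetical spike times (ms after image onset) for one site, three repetitions.
trials = [
    [12.0, 85.0, 110.0, 160.0, 240.0],  # 3 spikes fall inside the window
    [75.0, 130.0, 300.0],               # 2 spikes fall inside the window
    [90.0, 95.0, 150.0, 169.0],         # 4 spikes fall inside the window
]
rate = mean_rate(trials)
print(rate)  # about 30 spikes/s: 9 spikes over 3 trials of 0.1 s each

# 128 of the 168 recorded IT sites, matching the counts quoted in this summary.
sites = sample_sites(168, 128)
print(len(sites))
```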


Figure 4 Predicted performance of LaWS of RAD IT (70-170ms.128N.SVM)

Other candidate hypotheses

The paper explored many other candidate linking hypotheses. While none of them yielded results as satisfactory as LaWS of RAD IT, one that is discussed relatively extensively is LaWS of RAD V4. This linking hypothesis shares most of its features with LaWS of RAD IT, but behavior is read out from V4 instead of IT. This change had a significant impact on how well human core object recognition performance could be predicted. As shown in Fig. 5a, the predicted performance pattern differs quite drastically from that of LaWS of RAD IT. The consistency metric, which evaluates how similar the pattern of sensitivity values (d’) predicted by each candidate linking hypothesis is to the human pattern, shows the drastic difference between LaWS of RAD V4 and LaWS of RAD IT. The nominal d’ pattern for LaWS of RAD V4 is shown in Fig. 5a and the consistency metric comparison is shown in Fig. 5b. While LaWS of RAD IT reaches the human-to-human consistency level, LaWS of RAD V4 fails to do so.


Figure 5 Candidate linking hypotheses

Quantitatively assessing linking hypothesis: consistency & performance

At this point, the manuscript has already established that LaWS of RAD IT outperforms the other candidate linking hypotheses. However, the authors are also interested in assessing whether LaWS of RAD IT is sufficiently similar to observed human core object recognition performance. To do so, the paper explored two key metrics:

  1. consistency
  2. performance

I mentioned consistency in the previous section but, to recap, this metric measures how closely the pattern of predicted behavior matches the pattern of observed human behavior. Visually, consistency measures the similarity in the color pattern of Fig. 3a and Fig. 5a. This metric allowed the authors to rule out any candidate linking hypothesis that failed to match the behavior of individual human subjects. As mentioned previously, V4-based linking hypotheses failed to predict the observed human object recognition performance pattern (Fig. 5b), while LaWS of RAD IT produced a performance pattern similar to that of human subjects. This is not to say that V4 isn’t significant for object recognition, but that V4’s internal object representation isn’t sufficient for a reliable behavior readout.
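One way to picture a pattern-similarity score of this kind is a rank correlation between the predicted and observed d’ values across tasks. The sketch below assumes a Spearman-style correlation; this summary does not spell out the paper's exact definition or any normalization against human-to-human agreement, so treat the choice as an assumption. All d’ values and names (`ranks`, `pearson`, `consistency`, `model_a`, `model_b`) are made up for illustration.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def consistency(predicted, observed):
    """Rank correlation between predicted and observed d' across tasks."""
    return pearson(ranks(predicted), ranks(observed))

# Hypothetical d' values for four tasks.
human = [3.1, 2.4, 1.2, 0.8]
model_a = [2.9, 2.0, 1.5, 0.5]  # same task ordering as the human data
model_b = [1.0, 2.5, 2.0, 1.8]  # different task ordering

print(round(consistency(model_a, human), 2))  # 1.0
print(round(consistency(model_b, human), 2))  # -0.2
```

Note that `model_a` scores a perfect 1.0 even though its absolute d’ values differ from the human ones, which is why pattern similarity alone cannot establish that the absolute values match.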

The performance metric in this section refers to the absolute predicted d’ values compared to the observed d’ values. It could be the case that the predicted and observed patterns are similar while the predicted d’ values are nowhere close to the observed ones. The authors found that the predicted d’ values depended strongly on the number of neuronal sites.

The authors took these metrics further by varying the number of neuronal sites to estimate how many are needed to reach human levels of both performance and consistency. The results for the LaWS of RAD IT and LaWS of RAD V4 hypotheses are summarized in Fig. 7b. As visible in the plot, LaWS of RAD IT with 128 neuronal sites was close to human levels on both fronts, while 168 neuronal sites allowed LaWS of RAD IT to reach human levels. LaWS of RAD V4, by extrapolation, could reach human levels on the performance front, but no matter how many neuronal sites it employs, it never reaches human parity on the consistency front. This result is aligned with the claim that behavior readout from IT is functionally appropriate while readout from V4 is insufficient.


Figure 7 Human performance parity