# Dimensional Reduction

Dimensional reduction is the process of assigning high-dimensional stimuli to locations in a lower-dimensional space in such a way that the relations between the high-dimensional stimuli are recreated or preserved by their low-dimensional analogs. It is a process that has many applications, including information compression, noise attenuation, the induction of similarity metrics, and data visualization. There are many ways of performing dimensional reduction. In this post I describe a novel method for dimensional reduction which involves assigning stimuli to low-dimensional locations that allow the original high-dimensional stimuli to be optimally reconstructed.

# The Principle Of Reconstruction

The approach I am describing involves assigning the high-dimensional stimulus patterns, and learned bases, to a posited low-dimensional space, and then reconstructing the high-dimensional stimulus patterns by taking weighted sums over the bases. The bases are of the same dimensionality as the stimulus patterns, and the weights in the reconstructions summations are a decreasing function of the distance in the low-dimensional space between the stimulus pattern being reconstructed and each basis. The overarching goal is to learn locations for the stimulus patterns and the bases, and values for the bases themselves, to allow the high-dimensional patterns to be reconstructed with minimum error. The hope is that this approach will cause 'similar' stimuli to get assigned to proximal locations to one another in the low-dimensional space because they can be reconstructed with similar bases. This approach is notable because it defines a generative model for the high-dimensional stimulus patterns, which distinguishes it from some other recent and popular approaches to this problem such as t-SNE.

# Model

Let's assume we have an $N \times H$ matrix $S$ containing our high-dimensional stimuli, where $N$ is the number of stimulus objects, $H$ is the number of dimensions the stimulus objects have, and $S_{sh}$ denotes the value of pattern $s$ on dimension $h$. We also have an $M \times H$ matrix $B$ consisting of the reconstruction bases, where $M$ is the number of bases we elect to use in the reconstruction process (e.g. 100), and $B_{bh}$ denotes the value that basis $b$ has on dimension $h$.

In addition to this information the model needs to represent the low-dimensional locations assigned to both the stimuli and to the bases. Let's assume the locations of the stimulus objects are stored in an $N \times L$ matrix $P$, where $L$ is the number of dimensions we want our lower-dimensional space to possess (e.g., 2), and $P_{sl}$ represents the location of stimulus object $s$ on dimension $l$. In addition, let's assume the locations of the bases are stored in an $B \times L$ matrix $Q$ where $Q_{bl}$ represents the location of basis $b$ on dimension $l$.

If we denote the reconstruction of stimulus object $s$ on dimension $h$ as $R_{sh}$ then we have $$R_{sh} = \sum\limits_{b=1}^{B} w_{sb} B_{bh}$$ which is to say that reconstruction of a feature of a stimulus object via a weighted sum over the corresponding feature values of each basis, where the weight assigned to each basis in the sum is a monotonically decreasing function of the distance between the stimulus object and the basis in the hypothesized lower-dimensional space. In other words, $$w_{sb} = w_{bs} = \exp(-\sqrt[2]{d_{sb}})$$ where $$d_{sb} = \sqrt[2]{\sum\limits_{l=1}^{L} (l_{il} - l_{bl})^2}$$

Note that there are many other possible parameterizations of this basic framework. For example in the equations above we have assumed that the weight a basis receives is an exponentially decaying function of distance, whereas we could have weight decay with inverse squared distance, or use a sigmoid decay curve, and so on. Also, as we have outlined the model, we do not normalize the reconstruction weights a data point receives, whereas it is possible to normalize the reconstruction weights so that they sum to 1, in which case the reconstruction model is becomes a probabilistic mixture model over the bases.

Learning in this model involves finding the low-dimensional locations of the stimulus objects and the bases, and the high-dimensional feature values of the bases. This can be accomplished via gradient-descent by defining the reconstruction error of the stimulus objects, and then finding the gradient of this error with respect to each of these parameters. In what follows, to keep the notation simple, we consider only the reconstruction of a single high-dimensional feature (i.e. we drop the $s$ and $h$ subscripts); the full gradient equations can easily be extrapolated from what we write here through appropriate summation however.

Let's start with the squared reconstruction error, $E$, where $r$ represents the reconstructed value and $p$ $$E = (r-p)^2$$

The gradient of the reconstruction error with respect to $q_b$, the relevant feature value of basis $b$, is $$\frac{\partial E}{\partial q_b} = 2(r-p)w_b$$

The gradient of the reconstruction error with respect to the location of the stimulus object being reconstructed, on the low-dimensional dimension $x$, is $$\frac{\partial E}{\partial x_p} = 2 (r-p) \frac{\partial}{\partial x_p} \left( \sum\limits_b w_b q_b \right) = 2 (r-p) \left( \sum\limits_b \frac{q_b(x_b-x_p)w_b}{d_b} \right)$$

And the gradient of the reconstruction error with respect to the location of basis $b$ on the low-dimension dimension $x$, is $$\frac{\partial E}{\partial x_b} = 2 (r-p) \left( \frac{q_b(x_p-x_b)w_b}{d_b} \right)$$ Note that, modulo the summing over $b$, this is just the negative of the previous gradient equation, which reflects the underlying symmetry of the reconstruction model: moving the location of a bases one unit in a direction accomplishes the same as moving the location of the object being reconstructed in the opposite direction, with respect to the single basis and object.

# Performance On MNIST Data

The MNIST data set is a digitized collection of handwritten digits (0 to 9) commonly used to benchmark machine learning algorithms. We applied a model like the one described above to assign low-dimensional locations to these high-dimensional images, and by examining the learned low-dimensional locations we can visualize the structure of the digits.

The model we applied to the MNIST data differed from the model described above in a few respects:

1. Instead of exponential decay, we used a sigmoidal decay function to model the relationship between low-dimensional distance and reconstruction weight ($w_{sb} = \frac{1}{1 + e^{d_{sb} - 5}}$). This means that there is very little weight decay in the immediate vicinity of a basis, which acts as an attractor space for stimuli similar to the basis.
2. We normalized the weight activations of the bases as experimentation revealed that this produced better reconstructions and visualizations. This changes (complicates) the gradient equations a little; deriving the new equations is left as an exercise to the interested reader.
3. We applied noise to the basis weights on each iteration in order to help prevent the model settling into local optima too easily.

We implemented the model using numpy and autograd. Autograd in particular is a wonderful tool which allows you to express the generative part of a statistical model using numpy, and then automatically computes the backward pass over the model in order to automagically derive the gradient equations for the model. This is fantastically convenient for prototyping statistical models, and is sufficiently performant that it may even be useable in production-scale environments too. I'd strongly recommend that you check it out.

We used a random sample of 10,000 digits from the training data of MNIST and trained the reconstruction model we have described on this data using 100 learned bases. The model was trained using the ADAM algorithm with a learning rate of 0.005 for 50,000 iterations. The chart below shows the reconstruction error as a function of training epochs and it shows that the learned parameters were indeed resulting in more and more accurate reconstructions of the digits, as measured by mean squared error, over time.

The next chart shows the locations assigned to MNIST handwritten digits by this reconstruction model. The locations are color-coded by which digit class they belong to (and the learned locations of the bases are shown in black). As you can see, the model does a good job of compressing the high-dimensional stimuli into low-dimensional locations while allowing the digits to be reconstructed by the learned bases and preserving information about the identity of the digits.

Overall the model produces aesthetically pleasing visualizations that reflect the underlying high-dimensional structure of the MNIST digits rather well. I'm quite impressed by the quality of the dimensional reduction achieved via this reconstructive process.

The final chart shows the bases learned by the algorithm in order to reconstruct the MNIST digits. It's clear from this data that the model has successfully identified the canonical digit classes and is using these canonical forms as the basis of its reconstructions.