# Quantum Unsupervised Feature Learning (QUFL)

Unsupervised feature learning with image data using the D-Wave OneTM System

Contents

Section 1

• 1.1 - Introduction
• 1.2 - High level overview
• 1.3 - Some terminology, definitions, jargon and assumptions
• 1.4 - A worked example, using simple images

Section 2

• 2.1 - The feature dictionary

Section 3

• 3.1 - Learning features by solving an optimization problem
• 3.2 - A procedure for solving this optimization problem
• 3.3 - Worked example: Butterfly and the {Geordie, MukMuk, Suz, Apple} universe
• 3.4 - Searching for good features
• 3.5 - Analysis of the solver procedure

Section 4

• 4.1 - Definitions, parameters & indexing

Aim, audience and required background

This material was developed to help those interested in the physics of the D-Wave OneTM System to perform some simple experiments and explore some of the interesting science behind the processor technology. This tutorial series is designed to introduce you to quantum computer programming using a practical, hands-on approach.

All of the software necessary for developers to begin learning how to program D-Wave quantum computing systems can be downloaded from the DOWNLOAD DEVKITS page. As the developer portal is currently in BETA test stage, you will need to have a D-Wave developer account to download these packs. To follow this tutorial it will be necessary to be familiar with the BlackBox compiler tutorial.

The high level programming language used in the included examples and source code is Python 2.6. Some experience of using the Python programming language is necessary to follow these advanced tutorials.

What you will learn

By following through the material in this white paper, you will learn:

• What Unsupervised Feature Learning (UFL) is, and why it is a powerful method in machine learning
• How to craft an algorithmic procedure that implements UFL on quantum hardware
• How to apply the QUFL algorithm to image data
Software requirements

In order to proceed you should install several free software packages.

• Python - The material here assumes you are using Python 2.6. You can download Python from http://www.Python.org. If you need a programming development environment, PyCharm from JetBrains (http://www.jetbrains.com/pycharm/) is excellent, though there is a small license fee to use it.
• NumPy - This is a scientific computing library. It is at http://numpy.scipy.org/
• Python Imaging Library (PIL) - This is a free graphics library. It is at http://www.pythonware.com/products/pil/
• Code for this tutorial (optional - i.e. if you want to see the full code files for each section). You can find the code referenced in this tutorial here: Github Repository for QUFL code. You will need a github account to access this code. Once you have a github account you can email devportal at dwavesys dot com and we can add you to the github D-Wave repositories.
• If you wish to run the code on the images used in section 3.3, you will need to download the image data files separately. They can be found here: Images for QUFL .zip file
SECTION 1

1.1 - Introduction

In this tutorial we introduce a framework for learning features from data. We'll focus on the simplest case where the data doesn't come with labels. To make things less abstract, we'll work through some examples where the data is image data.

The algorithm attempts to learn a small set of features from which all the data the algorithm has seen can be reconstructed. These features are something like the 'distilled essence' of your data. It turns out, rather remarkably, that high fidelity reconstruction of large amounts of data can often be achieved with a very sparse set of features.

The ideas underlying the algorithm are closely related to those in the field of compressive sensing. A good motivation for why unsupervised feature learning is such an important task can be found here. It's highly recommended that you also watch this presentation.

Later we'll extend the framework in several ways. We'll allow a mix of labeled and unlabeled examples (this is called semi-supervised feature learning), extend the features into the time dimension (this provides a mechanism for predicting future outcomes), modify the algorithm to be able to learn as it receives input (this is called online learning), change the structure of the features (this allows the framework to include hierarchical feature learning or deep nets), and several other interesting modifications.

1.2 - High level overview

First we'll introduce the basic underlying concepts. Throughout we're going to present both the general framework, and specific examples based on analysis of images. While we're going to try to keep this first pass through at a (relatively) high level, the material is quite complex and challenging. In order to follow along with the material, it may be helpful to spend some time thinking about how the framework might be applied to a data set of your choosing, and as the document proceeds, to implement the ideas here yourself on your own data set.

1.3 - Some terminology, definitions, jargon and assumptions

The only thing you need to provide in order to use this framework is data. We define $$S^U$$ to be the total number of data objects you wish to use. In this tutorial, all of the data objects are unlabeled. For example, let's say you have a data set consisting of five total images (see Figure 1). Then in this case you'd have $$S^U = 5$$ unlabeled images.

The objects in your data set are assumed to all be representable as column vectors of real numbers. The vector for the ith object we call $$\vec{z}_i$$. We define the length of these vectors to be N. What N is depends on how you choose to represent the objects in this set. We define a matrix $$\hat{Z}^U$$ whose column vectors are the objects $$\vec{z}_i$$. The size of $$\hat{Z}^U$$ is therefore $$N \times S^U$$ (one column per object), and it stores all of your data. The matrix $$\hat{Z}^U$$ is the most important input you will provide to the framework.

1.4 - A worked example, using simple images

If our objects are images, one way (not necessarily the best way) to represent our objects is to use the RGB color values of the pixels to populate the $$\hat{Z}^U$$ matrix. In the example shown in Figure 1, each image is an RGB bitmap image formed of a total of 2 x 2 pixels. Each pixel has three numbers associated with it that represent the values of each of the three color channels. These numbers are each in the range 0..255 for images with 8-bit color depth.

We can represent these images in the necessary form by simply loading the pixel values into our vector one by one. Starting with the top leftmost pixel, we write its R value as the first element of our vector; then its G value as the second element; and then its B value as the third element. We then move one pixel to the right, and write its R value as the fourth element; its G value as the fifth element; and its B value as the sixth element. We continue doing this until we've written all of the information in the image into our vector.

Here is how this looks. Starting with the leftmost image in Figure 1, we write down the R,G,B numbers for all four pixels. This gives

$\vec{z}_1^T=[255,0,0,0,0,0,0,0,0,255,0,0]$

(or 'red, black, black, red'). Note that we've written this down as a row vector (i.e. the transpose of $$\vec{z}_1$$ ) just so that it's easier to look at on the page.

Figure 1. This data set consists of five tiny RGB ('Red', 'Green', 'Blue') bitmap images. Each image contains four (i.e. 2 x 2) pixels. Each pixel's color is defined by three numbers, each in the range 0..255, which tell us the amount of red, green and blue in that pixel. For example, a completely red pixel is represented by the three numbers (255, 0, 0), while a completely blue pixel is (0,0,255). White is (255,255,255) and black is (0, 0, 0).

Similarly we write

$\vec{z}_2^T=[255,0,0,0,0,0,0,255,0,255,0,0] \hspace{5mm} (red, black, green, red)$

$\vec{z}_3^T=[255,0,0,0,0,255,0,0,0,255,0,0] \hspace{5mm} (red, blue, black, red)$

$\vec{z}_4^T=[255,0,0,0,0,0,0,0,0,0,0,255] \hspace{5mm} (red, black, black, blue)$

$\vec{z}_5^T=[0,255,0,0,0,0,0,0,0,0,255,0] \hspace{5mm} (green, black, black, green)$

The length of these vectors is N=3*2*2=12. The matrix $$\hat{Z}^U$$ is then formed by using the $$\vec{z}_i$$ vectors as its columns. Here it is, again written as the transpose (in this case, the $$\vec{z}^T_i$$ row vectors are the rows of the transpose $${(\hat{Z}^U)}^T$$), just for ease of viewing:

${(\hat{Z}^U)}^T = \left[ \begin{array}{c} \vec{z}_1^T \\ \vec{z}_2^T \\ \vec{z}_3^T \\ \vec{z}_4^T \\ \vec{z}_5^T \end{array} \right] = \left[ \begin{array}{cccccccccccc} 255 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 255 & 0 & 0 \\ 255 & 0 & 0 & 0 & 0 & 0 & 0 & 255 & 0 & 255 & 0 & 0 \\ 255 & 0 & 0 & 0 & 0 & 255 & 0 & 0 & 0 & 255 & 0 & 0 \\ 255 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 255 \\ 0 & 255 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 255 & 0 \end{array} \right]$

Now we have all of our data ready for what follows. If you have a specific data set in mind that's of interest to you, it might be worth pausing at this point and working out how your own data set could be written down in this way.
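
If you'd like to try this in code, here is a minimal sketch of how $$\hat{Z}^U$$ might be built using PIL and NumPy. The file names are hypothetical placeholders, not files shipped with this tutorial.

```python
# A minimal sketch of building the data matrix Z^U from image files.
# The file names img_0.bmp .. img_4.bmp are hypothetical placeholders.
import numpy as np
from PIL import Image

S_U = 5                                    # number of unlabeled images
file_names = ['img_%d.bmp' % i for i in range(S_U)]

columns = []
for name in file_names:
    img = Image.open(name).convert('RGB')
    # np.asarray gives shape (height, width, 3); flattening in row-major
    # order interleaves R, G, B pixel by pixel, scanning left to right and
    # top to bottom, exactly as described above.
    columns.append(np.asarray(img, dtype=float).reshape(-1))

Z_U = np.column_stack(columns)             # shape (N, S^U); here N = 3*2*2 = 12
print(Z_U.T)                               # rows of the transpose are the z_i^T
```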

SECTION 2

2.1 - The feature dictionary

One of the algorithm's main objectives is to find the elements of a matrix $$\hat{D}_X$$. We refer to this matrix as the feature dictionary. The feature dictionary has K columns and N rows. K is the number of features we are allowing in the dictionary we are trying to learn. In the current version of the algorithm, you will need to supply K. We call the K column vectors dictionary atoms or features.

In this framework, what we are going to try to do is find a set of features from which all of the data we have can be optimally reconstructed. This reconstruction happens by linearly combining a subset of the features for each element of our data set. This concept is the key to understanding what we are trying to do here, and so we're going to spend some time going over it from a few different perspectives.

First let's take a look at a variant of our simple example in Figure 1. In this variant, instead of having five different images, we'll instead have five identical images (see Figure 2).

The question is this: if we wanted to construct each of these five images from a 'basis set' of as few images as possible, how would we do it?

Since the images are all identical, you could create a single (K=1) dictionary atom consisting of the image itself, and then represent each image by a single bit, representing whether or not that dictionary atom was 'turned on' (bit value = 1) or not (bit value = 0) in each of the data images.

Let's see how this works. For the image set from Figure 2, our data matrices are (check that this is correct!)

${(\hat{Z}^U)}^T = \left[ \begin{array}{c} \vec{z}_1^T \\ \vec{z}_2^T \\ \vec{z}_3^T \\ \vec{z}_4^T \\ \vec{z}_5^T \end{array} \right] = \left[ \begin{array}{cccccccccccc} 255 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 255 & 0 & 0 \\ 255 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 255 & 0 & 0 \\ 255 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 255 & 0 & 0 \\ 255 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 255 & 0 & 0 \\ 255 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 255 & 0 & 0 \end{array} \right]$

Let's define a new matrix, which we call the reconstruction matrix, to have the following form:

$\hat{Z}^R_U = \hat{D}_X \hat{W}^U$

We've also introduced the matrix $$\hat{W}^U$$. $$\hat{W}^U$$ is a $$K \times S^U$$ matrix whose entries are binary (0/1) valued. This matrix determines which of the features to 'turn on' to represent the different objects in our data set.

Now what we want to find are reconstruction matrices that are as close as possible to the original data in our set.

The core idea in what follows is that we would like to find matrices $$\hat{D}_X$$ and $$\hat{W}^U$$ that minimize the difference between $$\hat{Z}^U$$ and its reconstruction $$\hat{Z}^R_U$$.

Now for the example in Figure 2, this is really easy to do. We can get perfect reconstruction by making the following choice:

$\hat{D}_X^T = [255,0,0,0,0,0,0,0,0,255,0,0] \hspace{5mm} ; \hspace{5mm} \hat{W}^U = [1,1,1,1,1]$

If you multiply these out, you will find,

${(\hat{Z}^R_U)}^T = {(\hat{D}_X \hat{W}^U )}^T = \left[ \begin{array}{cccccccccccc} 255 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 255 & 0 & 0 \\ 255 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 255 & 0 & 0 \\ 255 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 255 & 0 & 0 \\ 255 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 255 & 0 & 0 \\ 255 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 255 & 0 & 0 \end{array} \right]$

Figure 2. Here our data set consists of five identical images.

So we find that we can exactly (i.e. $$\hat{Z}^R_U = \hat{Z}^U$$) reconstruct all of our data using a single dictionary atom, namely $$\hat{D}_X^T = [255,0,0,0,0,0,0,0,0,255,0,0]$$, together with the prescription for whether or not to 'turn on' this feature to represent each image in turn, $$\hat{W}^U = [1,1,1,1,1]$$ (which means 'turn on this feature, i.e. set the bit to 1, to best represent each image in the data set').
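
As a quick numerical sanity check, here is a short NumPy sketch (illustrative only, not part of the tutorial's code files) verifying that this choice gives perfect reconstruction:

```python
# Verify the K=1 perfect-reconstruction example from Figure 2.
import numpy as np

atom = np.array([255,0,0, 0,0,0, 0,0,0, 255,0,0], dtype=float)
D_X = atom.reshape(-1, 1)             # N x K dictionary, with N = 12 and K = 1
W_U = np.ones((1, 5))                 # K x S^U: the atom is 'on' for all 5 images

Z_R = D_X.dot(W_U)                    # N x S^U reconstruction matrix
Z_U = np.column_stack([atom] * 5)     # the five identical data vectors
print(np.array_equal(Z_R, Z_U))       # True: reconstruction is exact
```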

While the mathematical formalism might seem a little dense, the underlying idea in this example is pretty straightforward. Here is another example that may help clarify what just happened.

Imagine the five images in Figure 2 are sequential images in a video sequence. Now imagine you would like to ship this entrancing video sequence to a friend over the internet. Here are two ways you could try to do this.

Option 1: Ship all the pixel values for all of the images. In this case, the total amount of information is just the total number of pixels in each image (four), times the total number of images (five), times the number of bits required to specify the RGB color value for each pixel (8*3=24). This is a total of 4*5*8*3 = 480 bits.

Option 2: Only ship $$\hat{D}_X$$ and $$\hat{W}^U$$. In this case, the total amount of information is just the total number of bits required to specify $$\hat{D}_X$$, i.e. (4*1*8*3 = 96) plus the number of bits required to specify $$\hat{W}^U$$ (5) for a total of 101 bits.

Because we can exactly reconstruct our original data with $$\hat{D}_X$$ and $$\hat{W}^U$$, option 2 is clearly better - we get to ship almost 5 times fewer bits and get perfect reconstruction.

What's going on here is something that, depending on how you look at it, is either kind of obvious or highly non-trivial. On one level all we are noticing is that, since each of the images is an exact copy, it makes no sense to send all of them - just send one of them, together with instructions that there should be five when the final video is viewed. But there is something going on here which is quite a deep concept, illustrated even by this simple example, that motivates the approach here. This deep concept is that the reason we can compress in this way is related to the underlying similarity of the images in our data set. Because they are all similar (in this case, identical), we can 'distill the essence' of that similarity and create a feature to represent it.

The dictionary atom $$\hat{D}_X^T = [255,0,0,0,0,0,0,0,0,255,0,0]$$ is in some sense the 'distilled essence' of our original data set.

The reason that this is such a deep concept is that most natural images have lots of structure because of the underlying structure in the natural world. This can be exploited to create features that are common to vast numbers of images that are not identical like the images in Figure 2, but share some underlying commonalities. For example, objects for which we have labels (i.e. words) - such as tigers, books, cups or bananas - tend to be contiguous. Our perception evolved to assign labels to such contiguous objects. We have words for banana and desk, but we have no word for the more distributed composite object 'banana on desk'. One of the objectives of this algorithm is to discover features that are commonly shared by all objects for which we have labels. If this could be done, we could hope to generate images of (for example) any possible banana from a relatively small number of dictionary atoms, learned from a study of images of bananas.

Sounds great - but how do you find $$\hat{D}_X$$ and $$\hat{W}^U$$?

In the previous example, we could guess what the values for $$\hat{D}_X$$ and $$\hat{W}^U$$ were to get perfect reconstruction. Of course in general you can't! So now let's lay out the framework for finding these for the general case.

SECTION 3

3.1 - Learning features by solving an optimization problem

The first thing we have to do is decide what the loss function will be for our data. The loss function is the function that tells us how to penalize deviations from perfect reconstruction. Here we're going to use the quadratic loss, but depending on what type of data you are using, you may prefer to use a different loss function.

In the case of images stored as we have been doing, the quadratic loss counts the squared difference between the pixel values of the original images and the pixel values of the reconstructions, summed over all pixels. We write this

$L_X = \sum^{S^U}_{i=1} \mu_i^U {\| \vec{z}_i - \hat{D}_X \vec{w}_i^U \|}^2 = \sum^{S^U}_{i=1} \mu_i^U {[\vec{z}_i - \hat{D}_X \vec{w}_i^U]}^T [ \vec{z}_i - \hat{D}_X \vec{w}_i^U ]$

Here we've introduced some new notation so let's sort that out first. $$\vec{w}_i^U$$ is the ith column of $$\hat{W}^U$$. The vector $$\vec{\mu}^U$$ has length $$S^U$$; its ith component is $$\mu_i^U$$. We've added this term to be able to individually weight the contributions arising from each image, a feature we can use later on as part of the optimization procedure. If you'd like to think of this in the least complex form possible, you can just set all of the $$\mu$$ terms to 1.

Although the expression for $$L_X$$ looks complicated, what it's evaluating is very simple - it's just the sum of the squared differences between the pixel values of the reconstructed images and the actual images. If the reconstructed images are exactly the same as the originals, $$L_X$$ is zero. As the reconstruction starts deviating from perfection, $$L_X$$ grows.

Here's an example, using the first image in Figure 2. In this case,

$\vec{z}_1^T=[255,0,0,0,0,0,0,0,0,255,0,0] \hspace{5mm} (red, black, black, red)$

Now let's say the reconstruction we got by computing $$\hat{D}_X \vec{w}_i^U$$ was perfect. Then $$\hat{D}_X \vec{w}_i^U =[255,0,0,0,0,0,0,0,0,255,0,0]$$ also, and the loss for that image would be $L_X=(255-255)^2+(0-0)^2+(0-0)^2+(0-0)^2+(0-0)^2+(0-0)^2$ $+(0-0)^2+(0-0)^2+(0-0)^2+(255-255)^2+(0-0)^2+(0-0)^2=0$ (each of these terms represents the difference in the pixel values between the original and the reconstruction).

Now let's say our reconstruction was off by a bit, and $$\hat{D}_X \vec{w}_i^U =[255,10,0,5,0,0,0,7,0,255,0,0]$$ Then the loss for the image would be $L_X=(255-255)^2+(10-0)^2+(0-0)^2+(5-0)^2+(0-0)^2+(0-0)^2+(0-0)^2$ $+(7-0)^2+(0-0)^2+(255-255)^2+(0-0)^2+(0-0)^2=100+25+49=174$ which is bigger (i.e. worse).
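
Here is the same calculation as a short NumPy sketch, reproducing the value 174 from the example above:

```python
# Quadratic loss for a single image, using the numbers from the text.
import numpy as np

z_1 = np.array([255,0,0, 0,0,0, 0,0,0, 255,0,0], dtype=float)
reconstruction = np.array([255,10,0, 5,0,0, 0,7,0, 255,0,0], dtype=float)

diff = z_1 - reconstruction
loss = diff.dot(diff)                 # sum of squared per-channel differences
print(loss)                           # 174.0, matching the hand calculation
```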

We also add a zero-norm regularization term, whose strength $$\lambda$$ will be set by cross-validation (since the entries of $$\vec{w}_i^U$$ are binary, summing them simply counts the number of features turned on):

$R = \lambda \sum^{S^U}_{i=1} \vec{1}^T \vec{w}_i^U$

Adding up $$L_X$$ and $$R$$ gives us our full optimization objective function. Explicitly, this is

$F(\vec{\mu}^U, \hat{Z}^U, \hat{D}_X, \hat{W}^U, \lambda) = L_X + R = \sum^{S^U}_{i=1} \mu_i^U {[\vec{z}_i - \hat{D}_X \vec{w}_i^U]}^T [ \vec{z}_i - \hat{D}_X \vec{w}_i^U ] + \lambda \sum^{S^U}_{i=1} \vec{1}^T \vec{w}_i^U$

What we want to find here are the values of $$\hat{D}_X$$ and $$\hat{W}^U$$ that minimize this expression when $$\vec{\mu}^U = \vec{1}$$. The parameter $$\lambda$$ gives you flexibility to perform cross-validation. If you don't want to perform cross-validation or regularization, you can set $$\lambda = 0$$.
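
As a sketch, the full objective might be evaluated in NumPy as follows, using the shape conventions established earlier ($$\hat{Z}^U$$ is $$N \times S^U$$, $$\hat{D}_X$$ is $$N \times K$$, $$\hat{W}^U$$ is $$K \times S^U$$). The function name is ours, not the tutorial's.

```python
# Evaluate F = L_X + R for a given dictionary and binary weight matrix.
import numpy as np

def objective(Z_U, D_X, W_U, mu, lam):
    residual = Z_U - D_X.dot(W_U)                    # reconstruction error, N x S^U
    loss = (mu * (residual ** 2).sum(axis=0)).sum()  # quadratic loss L_X
    reg = lam * W_U.sum()                            # zero-norm term: counts on-bits
    return loss + reg
```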

3.2 - A procedure for solving this optimization problem

The objective function F is defined over both real variables (the elements of the dictionary $$\hat{D}_X$$) and binary variables (the elements of the $$\hat{W}^U$$ matrix). This makes optimizing over both a difficult mixed-integer optimization problem.

We will use the following four step procedure to optimize over F. This procedure gives us a local minimum which is a function of the initial dictionary.

Step 1. Begin by selecting a random dictionary $$\hat{D}_X$$.

Step 2. Solve for the optimal values of the matrix $$\hat{W}^U$$ by solving a total of $$S^U$$ independent discrete optimization problems:

$\vec{w}_i^U = argmin_{\vec{w}} \{ \mu_i^U \vec{w}^T \hat{D}^T_X \hat{D}_X \vec{w} + {( \lambda \vec{1} - 2 \mu_i^U \hat{D}^T_X \vec{z}_i )}^T \vec{w} \}$

Note that all of the optimizations in this step can be done in parallel.
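
To make Step 2 concrete, here is a sketch that builds the quadratic and linear coefficients from the expression above and solves one image's problem by exhaustive enumeration. Enumeration is only feasible for small K; for larger K this is exactly the discrete problem you would hand to BlackBox instead. The helper name solve_w is our own.

```python
# Solve one K-variable discrete problem from Step 2 by brute force.
import itertools
import numpy as np

def solve_w(z_i, D_X, mu_i, lam):
    K = D_X.shape[1]
    Q = mu_i * D_X.T.dot(D_X)                           # quadratic coefficients
    h = lam * np.ones(K) - 2.0 * mu_i * D_X.T.dot(z_i)  # linear coefficients
    best_w, best_val = None, None
    for bits in itertools.product([0.0, 1.0], repeat=K):  # all 2^K bitstrings
        w = np.array(bits)
        val = w.dot(Q).dot(w) + h.dot(w)
        if best_val is None or val < best_val:
            best_w, best_val = w, val
    return best_w
```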

Step 3. Now that we've got values for $$\hat{W}^U$$, we solve for $$\hat{D}_X$$ given those values. For the choice of quadratic loss for $$L_X$$ we can do this by computing (you should check that this choice minimizes F when $$\hat{W}^U$$ is held fixed):

$\hat{D}_X = \left( \sum^{S^U}_{i=1} \mu_i^U \vec{z}_i {[\vec{w}_i^U]}^T \right) {\left( \sum^{S^U}_{i=1} \mu_i^U \vec{w}_i^U {[\vec{w}_i^U]}^T \right)}^{-1}$

Step 4. Go back to Step 2, using the newly found dictionary $$\hat{D}_X$$, and iterate through Steps 2-3 until some user-specified exit criteria are met. At each stage in the process the values of the key parameters $$\hat{D}_X, \hat{W}^U$$ will change, iteratively reducing the value of $$F$$ and getting closer and closer to perfect reconstruction of the data.
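
Putting the four steps together, here is a sketch of the full loop (the names learn_dictionary and solve_w are ours; the pseudo-inverse is our own guard against a singular matrix in Step 3 and is not prescribed by the procedure above):

```python
# Alternating minimization: Steps 1-4, assuming the solve_w helper above.
import numpy as np

def learn_dictionary(Z_U, K, mu, lam, n_iter=20):
    N, S_U = Z_U.shape
    D_X = np.random.randn(N, K)                        # Step 1: random dictionary
    for _ in range(n_iter):                            # Step 4: iterate Steps 2-3
        # Step 2: one independent discrete problem per image
        W_U = np.column_stack([solve_w(Z_U[:, i], D_X, mu[i], lam)
                               for i in range(S_U)])
        # Step 3: closed-form dictionary update for the quadratic loss
        A = (mu * Z_U).dot(W_U.T)                      # sum_i mu_i z_i w_i^T
        B = (mu * W_U).dot(W_U.T)                      # sum_i mu_i w_i w_i^T
        D_X = A.dot(np.linalg.pinv(B))
    return D_X, W_U
```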

3.3 - Worked example: Butterfly and the {Geordie, MukMuk, Suz, Apple} universe

One application of the framework is to learn features from image data that correlate to human concepts - such as 'apple', or the myriad other objects in our natural world for which we have words. Conceptually what we'd like to do is replicate the type of learning about the natural world that an infant does (although in this example restricted to visual data).

To test this idea, let's for the moment anthropomorphize the system running this framework - let's call the 'blank slate' system Butterfly. Prior to running the framework, Butterfly has no sensory input and no concept of anything.

We are going to try to teach Butterfly about the natural world. The first thing we'll do is attempt to teach her about a small number of concepts driven entirely by vision. In this first experiment, think of Butterfly as being akin to a newborn infant that tragically has no sensory input except vision. Even though this dramatically reduces the types of contextual information we usually use to understand our world, it is a start - and one of the great things about this framework is that data from other senses can be eventually added in a straightforward way.

The {Geordie, MukMuk, Suz, Apple} universe

In order to give Butterfly sight, we use a webcam controlled by the OpenCV Python module. This allows us quite a bit of flexibility in how we capture and process visual information near the webcam.

The first experiment we'd like to perform is to try to teach Butterfly to recognize four different objects in her field of vision - Geordie, MukMuk, Suz, and an apple (see Figure 4 for images of each of these objects).

We captured video sequences consisting of 256 images, under consistent but natural conditions, of each of these four objects (giving a total of 4 * 256 = 1,024 images in our full data set). No data curation was done on these images. From this initial data set, we chose 64 images from each category, giving a total of $$S^U$$ = 64 * 4 = 256 images. This subset will be used to learn features able to reconstruct the subset. We'll then use these features as input to a neural-net based classifier. This classification process is described in the tutorial Training a neural network.

Figure 4. Images of the four objects {Geordie, MukMuk, Suz, Apple} in Butterfly's universe. The standard image size we will use is 80x112 pixels, and the input images from the webcam are stored as RGB bitmaps.

3.4 - Searching for good features

Our objective is to find $$\hat{D}_X, \hat{W}^U$$ that minimize: $F(\vec{\mu}^U, \hat{Z}^U, \hat{D}_X, \hat{W}^U, \lambda) = L_X + R = \sum^{S^U}_{i=1} \mu_i^U {[\vec{z}_i - \hat{D}_X \vec{w}_i^U]}^T [ \vec{z}_i - \hat{D}_X \vec{w}_i^U ] + \lambda \sum^{S^U}_{i=1} \vec{1}^T \vec{w}_i^U$

For simplicity we postpone doing any regularization $$(\lambda=0)$$ for the time being, and set $$\mu_i^U = 1$$. In this case there are two parameters that we can vary. The first is the number of dictionary atoms K, which we'll vary from $$K=1$$ up to $$K=30$$. The second is the number of initial starting points for the random dictionaries at the start of the procedure. Recall that the final results we get will be local minima determined by the initial dictionaries. Increasing the number of runs provides a simple mechanism for exploring different solutions.

The experiment we will do will increment K, and for every value of K we will choose 100 different starting points and record the final values of the objective function for each.
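
Here is a sketch of that experiment loop, reusing the learn_dictionary and objective helpers sketched earlier (our own names, not the tutorial's code files). Note that the brute-force inner solver must be swapped for BlackBox, or another heuristic, well before K reaches 30:

```python
# Record final objective values for K = 1..30, 100 random restarts each.
import numpy as np

mu = np.ones(Z_U.shape[1])           # mu_i^U = 1 and lambda = 0, as in the text
results = {}                         # K -> list of 100 final objective values
for K in range(1, 31):
    finals = []
    for run in range(100):           # 100 different random starting dictionaries
        D_X, W_U = learn_dictionary(Z_U, K, mu, lam=0.0)
        finals.append(objective(Z_U, D_X, W_U, mu, lam=0.0))
    results[K] = finals
```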

Figure 5. Top: the average single pixel error (the sum of the squared differences between ground truth and reconstructions) decreases logarithmically as the number of features increases. Bottom: as the number of features grows, the time required for the exact enumeration solver becomes prohibitive and D-Wave's BlackBox takes over as the weapon of choice.

Figure 6. These are the best features found as the allowed number of features increases from (top) K=1 to (bottom) K=8. We see that at K=4 the four classes of object in the image data are captured, while for smaller values the dictionary has to 'decide' what to not represent. For larger values of K, the features start mixing together in non-trivial ways.

3.5 - Analysis of the solver procedure

The parameters $$\hat{D}_X, \hat{W}^U$$ found by following this procedure will not be the globally optimal parameters. The strategy of separating the optimization over real variables from the optimization over binary variables and iterating makes it possible (and probably inevitable) that the procedure will get stuck in local optima that depend on the specifics of the starting random dictionaries. In addition, while the optimization over the real variables is robust and fast, the binary optimization steps involve solving NP-hard discrete optimization problems (defined over K variables) and therefore we expect that for sufficiently large dictionaries (i.e. large K) the results obtained from Steps 2 and 3 will not be globally optimal.

It is difficult to analyze the procedure analytically. The strategy we will take to analyze the performance of this approach is to empirically measure performance, and see if the performance we get is 'good enough', where what 'good enough' means is set by the context of the application.

We can analyze the total runtime scaling of this approach in a straightforward manner.

Step 1: Generate and store $$KN$$ random numbers. This is dominated by the other two steps so we can ignore it.

Step 2: Solve a total of $$S^U$$ completely independent K-variable discrete optimization problems. We'll assume for now that we're doing these in series, but this step can be fully parallelized. This is the step we will solve using D-Wave BlackBox, whose runtime we'll call $$T_R [K]$$.

Step 3: Here we have to compute two matrices, invert one of them, and then multiply them together. The cost of doing this scales as follows: computing the first matrix $$O(S^U NK)$$; computing the second matrix $$O(S^U K^2)$$; inverting the second matrix $$O(K^3 )$$; and multiplying them together $$O(NK^2)$$. So the total cost of this step is $$O(K(N+K)(S^U+K))$$.

If $$N \gg K$$, $$S^U \gg K$$ and $$T_R [K] \gg KN$$, and we restart the algorithm R times at different starting dictionaries, then the total runtime scaling of the procedure is dominated by the time for the discrete optimizations in Step 2, i.e. the total runtime scales like $$O(S^U R T_R [K])$$. Note that both the individual restarts and the discrete optimization steps can be run in parallel, so the run time can be sped up linearly in the number of compute nodes, up to a maximum of $$S^U R$$ of these.
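
As an illustration of that parallelism, here is a sketch of Step 2 distributed across worker processes with Python's multiprocessing module, again assuming the hypothetical solve_w helper from Section 3.2:

```python
# Parallelize Step 2: the S^U discrete problems are fully independent.
from multiprocessing import Pool
import numpy as np

def solve_column(args):
    z_i, D_X, mu_i, lam = args
    return solve_w(z_i, D_X, mu_i, lam)

def step2_parallel(Z_U, D_X, mu, lam, n_workers=4):
    tasks = [(Z_U[:, i], D_X, mu[i], lam) for i in range(Z_U.shape[1])]
    pool = Pool(n_workers)
    columns = pool.map(solve_column, tasks)   # one problem per worker at a time
    pool.close()
    pool.join()
    return np.column_stack(columns)
```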

SECTION 4

4.1 - Definitions, parameters & indexing

The hardest part of implementing this framework is simply keeping all of the definitions, parameters and indexing straight. This section provides a simple 'cheat sheet' for what all the terms mean.

Variables & parameters

• N: number of pixels * 3 (for the 3 RGB color channels)
• $$S^U$$: number of images without labels
• K: number of features

The feature dictionary matrix. The $$N \times K$$ matrix $$\hat{D}_X$$ stores the K features we are trying to find. Its entries $$D_{i,j}$$ are real numbers. The jth feature is stored in the jth column of $$\hat{D}_X$$.

Input data matrix. The $$\hat{Z}^U$$ matrix holds all of the (unlabeled) input data. Its ith column is $$\vec{z}_i$$, whose N entries are the pixel values for unlabeled image i. $$\hat{Z}^U$$ is $$N \times S^U$$.

Compression matrix. The binary (0/1) variables telling us which features to turn on to represent the input images are denoted w. The matrix storing the w values for the unlabeled data we call $$\hat{W}^U$$. It is $$K \times S^U$$, and its ith column is $$\vec{w}_i^U$$.
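
One simple way to keep these conventions straight in your own code is to assert the shapes, as in this small sketch (the sizes are illustrative):

```python
# Shape cheat sheet: data, dictionary and compression matrices.
import numpy as np

N, S_U, K = 12, 5, 3                       # illustrative sizes
Z_U = np.zeros((N, S_U))                   # input data: one image per column
D_X = np.zeros((N, K))                     # feature dictionary: one atom per column
W_U = np.zeros((K, S_U))                   # binary compression matrix
assert D_X.dot(W_U).shape == Z_U.shape     # reconstruction matches the data shape
```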

You may want to keep these references handy when implementing your code!