In this article, we describe how to build a machine learning model for the automatic recognition of handwritten digits. The model is generated by a distributed Python script executed on the Qarnot HPC service. The handwritten digits we intend to recognize are those of the MNIST database.
We assume a dataset (Train_Set) of images, each corresponding to a handwritten digit. The correspondences are encoded in a vector (labels) that, given the index of an image, returns a digit from 0 to 9. The goal is to induce a model that predicts, as accurately as possible, the digits on another set of images (Test_Set). For Train_Set and Test_Set, we consider the collection of handwritten digits of the MNIST database. This collection includes 70,000 sample images. The images are greyscale and normalized to fit into 28x28 pixels. In the database, each digit is associated with many handwritten images.
To build a machine learning model, we need a numerical representation (an encoding) of the data from which we want to induce the model. In the case of MNIST, such a representation is already provided: each image in the database is represented as a matrix of 28x28 integers capturing the darkness level of each pixel. This level goes from 0 (white) to 255 (black). Below, for instance, we show how a handwritten image corresponding to the digit 1 is encoded (normalized to 1.0).
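As a rough illustration of this encoding, the sketch below builds a synthetic 28x28 image containing a single vertical stroke (a stand-in for a handwritten 1, not an actual MNIST sample), flattens it into a vector of 784 values and rescales it to the range 0.0 to 1.0. The array names and the stroke position are assumptions made for the example.

```python
import numpy as np

# Synthetic 28x28 greyscale image: 0 is white, 255 is black.
# Hypothetical stand-in for a handwritten "1", not real MNIST data.
image = np.zeros((28, 28), dtype=np.uint8)
image[4:24, 13:15] = 255  # vertical stroke, 20 pixels tall and 2 pixels wide

# Flatten the matrix row by row into a vector of 784 values.
vector = image.reshape(784)

# Normalize the darkness levels to the range [0.0, 1.0].
normalized = vector / 255.0

# Reshaping restores the original 28x28 layout.
restored = normalized.reshape((28, 28))

print(vector.shape, normalized.min(), normalized.max())
```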
The scikit-learn library provides a built-in function for loading data from the MNIST database. Below, we show a script for loading the MNIST data from which Train_Set and Test_Set will be built.
from sklearn.datasets import fetch_openml
# Fetch MNIST database
mnist = fetch_openml('mnist_784', version=1, cache=True)
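X, y = mnist.data / 255., mnist.target
You can have a look at some values of the X and y arrays. Each element of the X array is a vector of 28x28=784 values. The raw pixel values lie between 0 and 255, as in the output below; after the division by 255 they lie between 0 and 1.
>>> y[0]
0.0
>>> X[0].reshape((28,28))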
array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  51, 159, 253, 159,  50,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  48, 238, 252, 252, 252, 237,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  54, 227, 253, 252, 239, 233, 252,  57,   6,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  10,  60, 224, 252, 253, 252, 202,  84, 252, 253, 122,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 163, 252, 252, 252, 253, 252, 252,  96, 189, 253, 167,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  51, 238, 253, 253, 190, 114, 253, 228,  47,  79, 255, 168,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,  48, 238, 252, 252, 179,  12,  75, 121,  21,   0,   0, 253, 243,  50,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,  38, 165, 253, 233, 208,  84,   0,   0,   0,   0,   0,   0, 253, 252, 165,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   7, 178, 252, 240,  71,  19,  28,   0,   0,   0,   0,   0,   0, 253, 252, 195,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,  57, 252, 252,  63,   0,   0,   0,   0,   0,   0,   0,   0,   0, 253, 252, 195,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0, 198, 253, 190,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 255, 253, 196,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  76, 246, 252, 112,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 253, 252, 148,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  85, 252, 230,  25,   0,   0,   0,   0,   0,   0,   0,   0,   7, 135, 253, 186,  12,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  85, 252, 223,   0,   0,   0,   0,   0,   0,   0,   0,   7, 131, 252, 225,  71,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  85, 252, 145,   0,   0,   0,   0,   0,   0,   0,  48, 165, 252, 173,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  86, 253, 225,   0,   0,   0,   0,   0,   0, 114, 238, 253, 162,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  85, 252, 249, 146,  48,  29,  85, 178, 225, 253, 223, 167,  56,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  85, 252, 252, 252, 229, 215, 252, 252, 252, 196, 130,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  28, 199, 252, 252, 253, 252, 252, 233, 145,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,  25, 128, 252, 253, 252, 141,  37,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0]], dtype=uint8)
The MNIST database is ordered as follows:
X[0] to X[5922] are 0's
X[5923] to X[12664] are 1's
X[12665] to X[18622] are 2's
In the following scripts, we focus only on the 0's, 1's and 2's, i.e. on the X[:L], y[:L] subset with L = 18622.
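Whether the samples are really grouped by digit depends on how the data was loaded, so it can be worth checking. The sketch below, which assumes a label array sorted in increasing order, recovers the first and last index of each digit with NumPy. The helper name `class_ranges` and the toy labels are hypothetical.

```python
import numpy as np

def class_ranges(labels):
    """Return {digit: (first_index, last_index)} for a sorted label array."""
    labels = np.asarray(labels)
    if np.any(labels[1:] < labels[:-1]):
        raise ValueError("labels are not sorted by digit")
    ranges = {}
    for digit in np.unique(labels):
        # searchsorted gives the insertion points bounding the run of `digit`.
        first = np.searchsorted(labels, digit, side="left")
        last = np.searchsorted(labels, digit, side="right") - 1
        ranges[int(digit)] = (int(first), int(last))
    return ranges

# Toy example: three 0's, four 1's and two 2's.
toy_labels = [0, 0, 0, 1, 1, 1, 1, 2, 2]
print(class_ranges(toy_labels))  # {0: (0, 2), 1: (3, 6), 2: (7, 8)}
```

Applied to a label vector such as y[:L], the helper either returns the index ranges or raises an error when the labels are not grouped by digit.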
The k-nearest neighbors (KNN) method is one of the most intuitive methods for building machine learning models. The idea is the following. Let's assume that we have a set of images X_train, associated with the corresponding digits y_train. Then, if we want to know which digit a new input x corresponds to, we simply compute the similarity between x and all the images in X_train. The simplest way to define similarity is to use a basic distance measure. By default, the classifier uses the Euclidean distance between vectors. For instance, if x and x' are two 784-dimensional vectors, their Euclidean distance is defined by:
$$D(x, x') = \sqrt{\sum_{i=1}^{n} (x_i - x'_i)^2}$$
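The formula translates directly into NumPy. The following is a minimal sketch on random vectors rather than MNIST images; the function name `euclidean_distance` is chosen for the example, and the result is compared with `numpy.linalg.norm` as a sanity check.

```python
import numpy as np

def euclidean_distance(x, x_prime):
    """Square root of the sum of squared coordinate differences."""
    x = np.asarray(x, dtype=float)
    x_prime = np.asarray(x_prime, dtype=float)
    return np.sqrt(np.sum((x - x_prime) ** 2))

# Two random 784-dimensional vectors standing in for flattened images.
rng = np.random.default_rng(0)
a = rng.random(784)
b = rng.random(784)

d = euclidean_distance(a, b)
assert np.isclose(d, np.linalg.norm(a - b))

print(euclidean_distance([0, 0], [3, 4]))  # prints 5.0
```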
The K most similar images, i.e. those with the smallest distance to the candidate image, are then selected. We then pick the corresponding digits in y_train and return the most frequent one. With scikit-learn, we can implement this machine learning scheme with the following code.
from sklearn import neighbors
from sklearn.datasets import fetch_openml
# Fetch MNIST database
mnist = fetch_openml('mnist_784', version=1, cache=True)
# crop to keep only 0's, 1's and 2's and normalize greyscale to 1.0
L = 18622
X, y = mnist.data[:L] / 255., mnist.target[:L]
# split even and odd elements into training and testing subsets
X_train, X_test = X[0::2], X[1::2]
y_train, y_test = y[0::2], y[1::2]
# create a simple KNN classifier with k=2 and uniform weight policy
clf = neighbors.KNeighborsClassifier(2, weights='uniform')
# fit the model using X_train as training data and y_train as target values
clf.fit(X_train, y_train)
# return the mean accuracy on the training and testing sets
print('Training set score: %f' % clf.score(X_train, y_train))
print('Test set score: %f' % clf.score(X_test, y_test))
The scores measure the fraction of correct predictions made by the model: the higher the score, the better the model. In this example, we only considered the first 18622 images of the MNIST database; otherwise the process would have been more time consuming. By running the Python script, we get:
Training set score: 0.994200
Test set score: 0.993986
The obtained score is high. However, one should notice that we did not consider the entire database. Moreover, better scores are possible. Indeed, we ran the KNN method with K=2, assuming that all points in each neighborhood are equally weighted (weights='uniform'). In practice, however, it might be more interesting to consider more neighbors (a greater value for K) or other definitions of the most frequent digit. To include these possibilities, we propose to modify the above script as follows.
import sys
from sklearn import neighbors
from sklearn.datasets import fetch_openml
# Fetch MNIST database
mnist = fetch_openml('mnist_784', version=1, cache=True)
options = [['2', 'uniform'], ['2', 'distance'], ['3', 'uniform'], ['3', 'distance'] ]
index = int(sys.argv[1])
# crop to keep only 0's, 1's and 2's and normalize greyscale to 1.0
L = 18622
X, y = mnist.data[:L] / 255., mnist.target[:L]
# split even and odd elements into training and testing subsets
X_train, X_test = X[0::2], X[1::2]
y_train, y_test = y[0::2], y[1::2]
# create a simple KNN classifier with the specified k and weight policy
n_neighbors = int(options[index][0])
weights = options[index][1]
print ('(n_neighbors: '+ str(n_neighbors) +' , '+'weigths: '+weights+')')
clf = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
# fit the model using X_train as training data and y_train as target values
clf.fit(X_train, y_train)
# return the mean accuracy on the training and testing sets
print('Training set score: %f' % clf.score(X_train, y_train))
print('Test set score: %f' % clf.score(X_test, y_test))
The new mnist_knn2.py script proposes four possible combinations of parameters for the KNN model. The first two combinations generate 2-NN models and the other two generate 3-NN models. The combination is selected by the index passed on the command line:
smith@wesson python mnist_knn2.py 1
(n_neighbors: 2 , weigths: distance)
Training set score: 1.000000
Test set score: 0.999785
One can notice that by changing the weight policy from 'uniform' to 'distance', we get a better prediction score. Thus, to obtain the best KNN model, one should try all possible combinations of parameters. However, this could be time consuming, in particular if we do not restrict L to 18622. A solution to reduce this time is to use the Qarnot HPC service to build the models in parallel.
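To make the difference between the two weight policies concrete, here is a simplified sketch of the voting step. It is an illustration written for this article, not scikit-learn's actual implementation: the function `knn_vote`, the tie-breaking rule (the smallest label wins) and the toy data are all assumptions. With 'uniform', each of the K neighbors counts for one vote; with 'distance', each vote is weighted by the inverse of the distance, so closer neighbors count more.

```python
import numpy as np

def knn_vote(X_train, y_train, x, k=2, weights="uniform"):
    """Predict the label of x from its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from x to every training point.
    diffs = X_train - np.asarray(x, dtype=float)
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    if weights == "uniform":
        w = np.ones(len(nearest))
    elif weights == "distance":
        # Inverse-distance weights; a tiny epsilon avoids division by zero.
        w = 1.0 / (dists[nearest] + 1e-12)
    else:
        raise ValueError("weights must be 'uniform' or 'distance'")
    # Sum the weights per label and return the heaviest label
    # (np.unique sorts the labels, so ties go to the smallest one).
    labels, inverse = np.unique(y_train[nearest], return_inverse=True)
    totals = np.bincount(inverse, weights=w)
    return labels[np.argmax(totals)]

# Toy 2-D data: the query is very close to a "1" and farther from a "0".
X_toy = [[0.0, 0.0], [1.0, 0.0]]
y_toy = [0, 1]
query = [0.9, 0.0]

print(knn_vote(X_toy, y_toy, query, k=2, weights="uniform"))   # tie, prints 0
print(knn_vote(X_toy, y_toy, query, k=2, weights="distance"))  # prints 1
```

With K=2 and uniform weights, two neighbors that disagree produce a tie, which may partly explain why the 'distance' policy scores higher in the run above.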
With the Qarnot SDK, we can build the KNN models of our problem concurrently. For this, we use the following script:
import qarnot
conn =  qarnot.connection.Connection(client_token='<YOUR_API_TOKEN>')
task = conn.create_task('mnist-digitRecognition', 'docker-network', 4)
d = conn.create_bucket('digit-recognition-bucket')
d.add_file('mnist_knn2.py')
task.resources.append(d)
task.constants['DOCKER_REPO'] = 'huanjason/scikit-learn'
task.constants['DOCKER_CMD'] = 'python mnist_knn2.py ${FRAME_ID}'
task.run()
In this script, we first create a task with 4 frames:
task = conn.create_task('mnist-digitRecognition', 'docker-network', 4)
The task is run with the Docker Hub image named 'huanjason/scikit-learn'. This image includes Python 3.7, scikit-learn and numpy. Then, each frame processes the command:
python mnist_knn2.py ${FRAME_ID}
By definition, the possible values for ${FRAME_ID} are 0, 1, 2 and 3. Thus, frame number i generates the KNN model whose parameters are defined in options[i]. Running this script produces the results shown below. The execution can also be monitored on the Qarnot console. Once finished, the result is available in the stdout tab:
1> (n_neighbors: 2 , weigths: distance)
0> (n_neighbors: 2 , weigths: uniform)
2> (n_neighbors: 3 , weigths: uniform)
3> (n_neighbors: 3 , weigths: distance)
0> Training set score: 0.994200
1> Training set score: 1.000000
2> Training set score: 0.995274
3> Training set score: 1.000000
1> Test set score: 0.999785
0> Test set score: 0.993986
2> Test set score: 0.995060
3> Test set score: 0.999785
For further information, the interested reader can have a look at the KNN parameters defined in [2] and can define a larger parallel exploration by redefining the options array. We also recommend changing the above script to use the first 60,000 images of the MNIST database for training and the remaining ones for testing.
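As one possible way of redefining the options array, the sketch below generates every combination of a few candidate values with `itertools.product`. The candidate values and the helper `params_for_frame` are arbitrary choices made for this example. Frame i still reads its parameters from options[i], exactly as in mnist_knn2.py.

```python
from itertools import product

# Candidate values to explore (arbitrary choices for illustration).
neighbor_counts = [2, 3, 5, 7]
weight_policies = ['uniform', 'distance']

# Same layout as the options array of mnist_knn2.py:
# each entry is [n_neighbors as a string, weight policy].
options = [[str(k), w] for k, w in product(neighbor_counts, weight_policies)]

def params_for_frame(index):
    """Return (n_neighbors, weights) for a given frame index."""
    n_neighbors = int(options[index][0])
    weights = options[index][1]
    return n_neighbors, weights

print(len(options))         # number of frames to request, prints 8
print(params_for_frame(0))  # (2, 'uniform')
print(params_for_frame(1))  # (2, 'distance')
```

With this array, the number of frames requested when creating the task would be len(options) instead of 4.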
With Qarnot HPC services, you can go well beyond what could be described in this article. You will probably come up with many more interesting use cases. So, go ahead! Try it now, and share your feedback with the community, so that we can keep improving on this product.
[1] The KNN algorithm, Wikipedia
[2] KNN classifier in scikit-learn, scikit-learn.org
[3] Similarity / distance measures for KNN, YouTube