MATH089 Project 6 - Graph Laplacian and Spectral Clustering

Posted: 11/15/21

Due: 11/22/21, 11:55PM

Science usually arises from observation of regularity in data. A first step in all scientific analysis is the organization of such observations into groups based upon some similarity criterion. Grouping through quantitative mathematical procedures requires the definition of a distance between observations and a procedure to form observation clusters. This project is an introduction to this basic task in data science using classification of letter images as an example.

Figure 1. First 400 characters in Kaggle data base (First Steps With Julia | Kaggle)

Tasks:

1Introduction

(Read First Steps With Julia | Kaggle and research k-Means algorithm, e.g., Wikipedia)

2Methods

2.1Import data

The Kaggle database on character images contains 6220 files in .Bmp format, each of size 20 x 20 pixels. The ImageMagick package allows the images to be imported into Julia.

Once the Julia environment has been updated to contain the ImageMagick and Images packages, they can be used to analyze the character image data.

using Images, ImageMagick
cd(homedir()*"/courses/MATH089/clustering/testResized")
files=readdir(); c=load(files[1]); nFiles=size(files); mx,my=size(c); [nFiles mx my]

[ [ 6220 ] 20 20 ] (1)

c[1,1]

RGB{N0f8}(0.475,0.424,0.514)

Gray(c[1,1])

Gray{N0f8}(0.451)

Float64(Gray(c[1,1]))

0.45098039215686275

C=Float64.(Gray.(c)); imshow(C,cmap="gray");

2.2k-means algorithm

Clustering is a common data analysis procedure, and a Julia package is available.

import Pkg; Pkg.add("Clustering")

Resolving package versions…

Installed NearestNeighbors ─ v0.4.9

Installed Clustering ─────── v0.14.2

Updating ‘~/.julia/environments/v1.6/Project.toml‘

[aaaa29a8] + Clustering v0.14.2

Updating ‘~/.julia/environments/v1.6/Manifest.toml‘

[aaaa29a8] + Clustering v0.14.2

[b8a86587] + NearestNeighbors v0.4.9

Precompiling project…

[32m ✓ [39m[90mNearestNeighbors[39m

[32m ✓ [39mClustering

2 dependencies successfully precompiled in 3 seconds (288 already precompiled)

The k-means algorithm seeks to group n observations, each of dimension d into k clusters. For this application an observation is one of the character images and d=3mxmy using all color channels.

using Clustering

2.2.1Single vector representation of color image

Each pixel in the image contains three color channels

c[1,1]

RGB{N0f8}(0.714,0.153,0.188)

red(c[1,1])

0.714N0f8

Float64(red(c[1,1]))

0.7137254901960784

Define a function to form a single vector of size 3mxmy from an image of size mx×my

function FormObs(c)
  mx,my = size(c); d = 3*mx*my; 
  r=Float64.(red.(c)); g=Float64.(green.(c)); b=Float64.(blue.(c));
  reshape([r g b],(d,1))
end

FormObs

x=FormObs(c);
size(x)

[ 1200 1 ] (2)

size(c)

[ 20 20 ] (3)

2.2.2Apply k-means on data subset

The data set contains vectors 𝒙i for i=1,,n, where n is the number of images, and each 𝒙i contains d=1200 numbers. The goal of the k-means algorithm is to define sets Sj for j=1,,k that contain similar images, presumably of the same letter. Let S={S1,,Sk}. The k-means algorithm minimizes the variance in each cluster

mini=1k𝒙Si||𝒙-𝝁i||2
function FormDataSet(n0,n,f)
  c = load.(f[n0:n0+n-1])
  X = FormObs(c[1])
  for i=2:n
    X = [X FormObs(c[i])]
  end
  return(X)
end

FormDataSet

n0=200; n=100; X=FormDataSet(n0,n,files); size(X)

[ 1200 100 ] (4)

kGroups=2; K=kmeans(X,kGroups);
S=assignments(K); nS=counts(K); C=K.centers;
S[1:10]'

[ 2 1 2 2 2 2 1 2 1 1 ] (5)

size(C)

[ 1200 2 ] (6)

2.3Display k-means results

2.3.1Display centers

Each center represents an image. Display it.

3Results

4Discussion

(Present your conclusions on the use of random numbers to simulate real-life models)