DCEM

Package Overview

Implements the Expectation Maximisation Algorithm for clustering the multivariate and univariate datasets. There are two versions of EM implemented-EM* (converge faster by avoiding revisiting the data) and EM. For more details on EM*, see the ‘References’ section below.

The package has been tested with both real and simulated datasets. The package comes bundled with a dataset for demonstration (ionosphere_data.csv). More help about the package can be seen by typing ?DCEM in the R console (after installing the package).

Currently, data imputation is not supported and user has to handle the missing data before using the package.

Issues

Installation Instructions

Dependencies First, install all the required packages as follows:

install.packages(c(“matrixcalc”, “mvtnorm”, “MASS”, “Rcpp”))

Installing from CRAN

Use install.packages() in the R console as follow:

install.packages("DCEM")

Installing from the Source Package

Download the source tar ball for DCEM from Github and install as follows:

R CMD install DCEM_2.0.3.tar.gz

Quick Start

• For demonstration purpose, use the dcem_test() function from the R console. This function invokes the dcem_star_train() on the bundeled ionosphere_data.

• The function dcem_test() returns a list containing the output i.e., posterior probabilities, meu, sigma, prior and cluster membership for data. The parameters can be accessed as follows where sample_out is the list containing the output:

Call the dcem_test()

library("DCEM")
sample_out = dcem_test()

Access the output

1. sample_out$prob: A matrix of posterior-probabilities 2. sample_out$meu: A matrix of cluster centers.

3. sample_out$sigma • For multivariate data: List of co-variance matrices for the Gaussian(s). • For univariate data: Vector of standard deviation for the Gaussian(s)) 4. sample_out$prior: A vector of prior.

5. sample_out4membership: A vector of cluster membership for data.

Use EM* (improved EM algorithm) to cluster the Ionosphere data

DCEM comes bundeled with the Ionosphere data. The example below shows how to use EM* for clustering this data set.

# Make sure you have imported the package.
library("DCEM")

# Set the file path
data_file =  file.path(trimws(getwd()), "data", "ionosphere_data.csv")

# Reading the input file into a dataframe.
file = data_file,
sep = ",",
stringsAsFactors = FALSE
)

# Cleaning the data by removing the 35th and 2nd column as they contain the
# labels and 0's respectively.
ionosphere_data =  trim_data("2, 35", ionosphere_data)

# Call the dcem_star_train() function on the cleaned data.
sample_out = dcem_star_train(ionosphere_data) 

Use EM-T (original EM algorithm) to cluster the Ionosphere data

The example below shows how to use EM-T for clustering the Ionosphere data set.

# Make sure you have imported the package.
library("DCEM")

# Set the file path
data_file =  file.path(trimws(getwd()), "data", "ionosphere_data.csv")

# Reading the input file into a dataframe.
file = data_file,
sep = ",",
stringsAsFactors = FALSE
)

# Cleaning the data by removing the 35th and 2nd column as they contain the
# labels and 0's respectively.
ionosphere_data =  trim_data("2, 35", ionosphere_data)

# Call the dcem_star_train() function on the cleaned data.
sample_out = dcem_train(ionosphere_data) 

Parameters in the function call:

Both dcem_star_train() and dcem_train() calls share the same parameters except the argument ‘threshold’ which is only present in dcem_train(). This is because for EM*, threshold is empirically found to not affect the clustering results significantly. The function arguments are described below:

* data (dataframe): Dataframe containing the user specified data.

* threshold (decimal): Convergence threshold (if meu are within this threshold then the algorithm stops and exit (default = 0.0001).

* iteration_count (numeric): Number of iterations for which the algorithm should run, if the convergence is not achieved then the algorithm stops and exit (default = 200).

* num_clusters (numeric): The number of clusters (default = 2).

* seed_meu (matrix): User specified set of meu to be used as initial centers (default = None).

* seeding (string): The initialization scheme (choices = ‘rand’ or ‘improved’, default = rand).

Initialization schemes

In case of iterative clustering algorithm like EM, choice of initial cluster centers can affect the rate of convergence in terms of execution time and number of iterations. Therefore, DCEM allows the users to choose from the following initialization schemes according to their requirement.

1. Random initialization: This is the (default) choice for both EM-T and EM* procedures. Cluster centers are selected by randomly sampling from the input data. It is fast, though it may not result in the best performance i.e., selected centers may result in the algorithm taking more iterations hence longer execution time. The following code illustrate how to use random initialization explicitly.

Set the seed and create a mixture of gaussians.

R> set.seed(49)
R> sample_uv_data = as.data.frame(c(rnorm(500, 5, 0.5), rnorm(1000, 20, 1),
+ rnorm(100, 31, 2)))

Randomly shuffle the data, set the seed and call the dcem_train() function.

R> sample_uv_data = as.data.frame(sample_uv_data[sample(nrow(sample_uv_data)),])
R> set.seed(21)
R> sample_uv_out = dcem_train(sample_uv_data, num_clusters = 3,
+ iteration_count = 100, threshold = 0.0001, seeding = "rand")

Note: The run with random initialization took 14 iterations to converge.

[1] "Specified threshold =  1e-04"
[1] "Specified iterations =  100"
[1] "Specified number of  clusters =  3"
[1] "Using the improved Kmeans++ initialization scheme."
[1] "Convergence at iteration number:  14"
1. Improved initialization: This approach utilizes an advanced initialization procedure based on the k++ technique. It was originally proposed to improve the initialization of the K-means algorithm. When compared to the random initialization, this method has been empirically shown to improve both the speed and accuracy. It can be used as follows.

Use the same seed and call the dcem_train() function with seeding set to ‘improved’.

R> set.seed(21)
R> sample_uv_out = dcem_train(sample_uv_data, num_clusters = 3,
+ iteration_count = 100, threshold = 0.0001, seeding = "improved")

Note: The run with improved initialization took 9 iterations to converge.

[1] "Specified threshold =  1e-04"
[1] "Specified iterations =  100"
[1] "Specified number of  clusters =  3"
[1] "Using the improved Kmeans++ initialization scheme."
[1] "Convergence at iteration number:  9"`