Spinning up a Spark Cluster on AWS and Running a Machine Learning Program.

In my previous blog I tried to create a standalone Spark cluster using two machines, but we all know the limitations of our physical machines. So now we will move things to the cloud and set up our Spark cluster on AWS (Amazon Web Services). We will start from the very basics of setting up an account on AWS and go all the way to running a machine learning program on our newly formed cluster. We have a long way to go, so let's swiftly begin.

Start by creating a new account on Amazon Web Services.

Select the basic support plan. Fill in all your personal details and payment card details.


Amazon gives access to one basic machine with 1 GB RAM and a 1-core processor free for one year, and thereafter they charge as per usage. Once you are done with your details and account verification, log in to your account. You will see various AWS services listed. Do not get confused: we have laid out the steps to reach our goal, but you can explore more services on your own.

Under the Build a solution tab, select the first option, i.e. Launch a virtual machine with EC2, which creates a new EC2 instance wherein we will configure our machine.

Now, instead of clicking on Get Started, we will make a few changes. Amazon provides various Amazon Machine Images (AMIs) which come pre-defined with multiple setups to suit your needs. These images are stored at different server locations, so we will change our region to the server where our AMI is present.


Go to the top right corner, select US West (N. California) as the server location, and click on the advanced EC2 launch instance wizard. This will open a console which will help us choose our AMI.


Select the Community AMIs tab and type "ami-125b2c72" in the search box. This is the ID of the AMI which we will be using as the virtual machine for our system. Select the listed AMI and proceed to choose the instance type (the free tier eligible machine).


Now we will configure our instance.


To do that, the first step is to create a hashed password. Open an IPython notebook and type the following commands to create your hashed password.

In [1]: from notebook.auth import passwd
In [2]: passwd()
Enter password:
Verify password:


The output will be a hashed password, which you will copy and use in the script below.
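If you prefer to generate the hash non-interactively, passwd also accepts the password as an argument (here 'mypassword' is just a stand-in; use your own):

```python
from notebook.auth import passwd

# Returns a salted hash string (e.g. starting with 'sha1:' on older
# notebook versions); the exact value changes every run due to the salt.
hashed = passwd('mypassword')
print(hashed)
```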



#!/bin/bash
# jupyter_userdata.sh -- the two directory paths below are assumptions;
# adjust them to match your AMI's layout.
CERTIFICATE_DIR="/home/ubuntu/certificate"
JUPYTER_CONFIG_DIR="/home/ubuntu/.jupyter"

if [ ! -d "$CERTIFICATE_DIR" ]; then
    mkdir -p "$CERTIFICATE_DIR"
    openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout "$CERTIFICATE_DIR/mykey.key" -out "$CERTIFICATE_DIR/mycert.pem" -batch
    chown -R ubuntu "$CERTIFICATE_DIR"
fi

if [ ! -f "$JUPYTER_CONFIG_DIR/jupyter_notebook_config.py" ]; then
    # generate default config file
    #jupyter notebook --generate-config
    mkdir -p "$JUPYTER_CONFIG_DIR"
    # append notebook server settings
    cat <<EOF >> "$JUPYTER_CONFIG_DIR/jupyter_notebook_config.py"
# Set options for certfile, ip, password, and toggle off browser auto-opening
c.NotebookApp.certfile = u'$CERTIFICATE_DIR/mycert.pem'
c.NotebookApp.keyfile = u'$CERTIFICATE_DIR/mykey.key'
# Set ip to '*' to bind on all interfaces (ips) for the public server
c.NotebookApp.ip = '*'
# Paste your hashed password here (the output of passwd() above)
c.NotebookApp.password = u'YOUR_HASHED_PASSWORD'
c.NotebookApp.open_browser = False
# It is a good idea to set a known, fixed port for server access
c.NotebookApp.port = 8888
EOF
    chown -R ubuntu "$JUPYTER_CONFIG_DIR"
fi

Update the script with the hashed password and paste it into the text box of the Advanced Details tab. The bash script, jupyter_userdata.sh, executes Jupyter's instructions for setting up a public notebook server, so you don't have to manually configure the notebook server every time you want to spin up a new AMI instance. For the script to work, Jupyter should already be installed, which is the case in our "ami-125b2c72" AMI.


The Storage and Tags tabs need no changes. Proceed to the Configure Security Group tab. We can see that a Secure Shell (SSH) connection is already configured on port 22. Let's add another rule: Type: Custom TCP Rule; Protocol: TCP; Port Range: 8888; Source: Anywhere.


Now we get to click the most awaited button: Review and Launch. In the pop-up window, select Create a new key pair, give the key a name, and click on Download. Save this key and guard it with your life, as it is your only way to access the virtual machine you have been trying to create for so long.

Now click on Launch Instances and wait for the instance to be created. Be patient, as this might take a few minutes.


Utilize this time to change the permissions on the key file while your instance is being created. From the terminal, go to the directory where your key file was saved and type the following command:

$ chmod 600 PEM_FILENAME

Your instance must be created by now; select it in the instances list to see its details.


Also make sure to take note of the public IP of this instance. We will need this IP address to connect to our instance from the terminal.

Open the terminal and write the following code :

$ ssh -i PEM_FILENAME ubuntu@PUBLIC_IP


Here we are, accessing our AWS machine! Not so difficult, is it? It comes pre-loaded with Jupyter notebook and Python. We just have to install Spark on our virtual machine and it will be all set to run our machine learning problem. The process is similar to the Spark installation on our personal system.

Step 1: Download Spark on the virtual machine, unzip it and rename the folder
$ tar -zxvf *.tgz
$ mv spark-2.0.2-bin-hadoop2.7 spark16
Step 2: Update and upgrade the virtual machine and install Java on it
$ sudo apt-get update
$ sudo apt-get install default-jdk
$ sudo apt-get upgrade
Step 3: Update the Spark path in the .profile file
$ export SPARK_HOME=/home/ubuntu/spark16
$ export PATH=$SPARK_HOME/bin:$PATH
$ export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH
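Step 1 assumes the Spark tarball is already on the machine; one way to fetch it (the 2.0.2 build from the Apache archive — adjust the version and mirror to your needs):

```shell
# Download Spark 2.0.2 pre-built for Hadoop 2.7 from the Apache archive
wget https://archive.apache.org/dist/spark/spark-2.0.2/spark-2.0.2-bin-hadoop2.7.tgz
tar -zxvf spark-2.0.2-bin-hadoop2.7.tgz
mv spark-2.0.2-bin-hadoop2.7 spark16
```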
It's complete! Let's open the IPython notebook and run our first machine learning program on our AWS Spark cluster.
But launching the notebook on this machine is a little different. Go to the web browser and type https://PUBLIC_IP:8888, add it as a security exception, and enter the password to log in. It is the same password which we earlier converted into a hashed password. Now we can start a new Python notebook and begin with our machine learning problem.

Implementing a Machine Learning Problem: Classification using Random Forest

Data set information: I will be working on the breast cancer dataset obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. This data is used to diagnose breast cancer from fine-needle aspirates.

Problem statement: Our aim is to create a model which can distinguish a cell as benign or malignant.

Data description: The data set contains data on 699 patients who got tested. The variables available for study are:

#    Attribute                      Domain
---  -----------------------------  -----------------------------
1.   Sample code number             id number
2.   Clump Thickness                1 - 10
3.   Uniformity of Cell Size        1 - 10
4.   Uniformity of Cell Shape       1 - 10
5.   Marginal Adhesion              1 - 10
6.   Single Epithelial Cell Size    1 - 10
7.   Bare Nuclei                    1 - 10
8.   Bland Chromatin                1 - 10
9.   Normal Nucleoli                1 - 10
10.  Mitoses                        1 - 10
11.  Class                          2 for benign, 4 for malignant

Data preparation: We have to clean the data before we can use it to create our model. First and most important is to check for NA or missing values. Our data set contained 16 missing values; since this is medical data, which can be very sensitive, instead of imputing the missing values (which might affect the results adversely) we decided to remove those rows. Also, Sample code number is just an identifier which is of no use in modelling, so we remove that attribute. The Class attribute contains the values 2 and 4; we convert 2 to 0 and 4 to 1 for the convenience of the machine to understand and predict.
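The cleaning steps above can be sketched in plain Python (a sketch only; the rows here are made up in the dataset's format of id, nine features and class, with missing values appearing as "?" as in the original file):

```python
# Illustrative rows in the dataset's format: id, 9 features, class (2 or 4)
raw_rows = [
    "1000025,5,1,1,1,2,1,3,1,1,2",
    "1002945,5,4,4,5,7,10,3,2,1,2",
    "1057013,8,4,5,1,2,?,7,3,1,4",   # has a missing value -> dropped
    "1044572,8,7,5,10,7,9,5,5,4,4",
]

clean = []
for row in raw_rows:
    fields = row.split(",")
    if "?" in fields:                          # drop rows with missing values
        continue
    values = [int(x) for x in fields[1:]]      # drop the id column
    values[-1] = 0 if values[-1] == 2 else 1   # class: 2 -> 0, 4 -> 1
    clean.append(values)

print(len(clean))    # 3 rows survive
print(clean[0][-1])  # first surviving row is benign -> 0
```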

Technique used: We will be using random forest classification to solve our problem.

Random Forest is considered to be a panacea for all data science problems. On a funny note, when you can't think of any algorithm (irrespective of the situation), use random forest!
Random Forest is a versatile machine learning method capable of performing both regression and classification tasks. It also undertakes dimensionality reduction, treats missing values and outlier values, performs other essential steps of data exploration, and does a fairly good job. It is a type of ensemble learning method, where a group of weak models combine to form a powerful model.
In Random Forest, we grow multiple trees as opposed to a single tree in CART model. To classify a new object based on attributes, each tree gives a classification and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest) and in case of regression, it takes the average of outputs by different trees.
Let's use the data and create our classification model. We already have our notebook running, so bring in the libraries you need: load everything from the Spark context to the random forest model.
Load the clean dataset and split it into two parts: 70% of the data will be used to train the model and the remaining 30% will later be used to test it.
Train the random forest model using the training data. We specify the number of classes as 2, which is the number of classes we want to classify our data into, and the number of trees as 3, so three trees will be built and their votes combined.
Evaluate the prediction model on the test data and compute the error to check the accuracy of the model.
The output of the program will be:
Three trees are formed and the best is voted to be the final model.
Conclusion: We can see that the test error is 5.23%, so the model is approximately 95% accurate in predicting whether a given cell is benign or malignant. This is a very important classification, as cells in benign tumors do not spread to other parts of the body, whereas malignant tumors are cancerous and are made up of cells that grow out of control.
Every correct classification can help diagnose a patient better.
Now that we have used random forest for classification, in my next blog we can try to solve a regression problem with it. Until then, you can try some more classification algorithms.
(Source of data: UCI Machine Learning Repository)

Creating a Standalone Spark cluster and Running Machine Learning Program.

My previous blog gave details on how to set up your own Spark machine and execute a small word count program in three different ways; this one will help you create a standalone cluster of your Spark machines. Before starting, let me introduce you to cluster terminology in Apache Spark:

Spark Master – the manager of resources, i.e. worker nodes.
Spark Worker – a cluster node which actually executes the tasks.
Spark Driver – the client application which asks for resources from the Spark master and executes tasks on worker nodes.

There are three different types of cluster managers in Apache Spark:

  1. Spark Standalone : Spark workers are registered with the Spark master.
  2. YARN : Spark workers are registered with the YARN cluster manager.
  3. Mesos : Spark workers are registered with Mesos.

We will be creating a Spark Standalone cluster today. Spark will be sitting on two machines, one acting as the master and the other as the slave. If you need to add more slave nodes, simply follow the steps in the section below on each machine you need to run a Spark slave on.

Step 1: Creating two machines

Make sure you have Spark installed and running on both machines, along with the IPython notebook, because this is where we will run our machine learning program. If you are creating these machines for the first time, check my previous blog for step-by-step instructions.

In case you don't have two machines, you can try this out on virtual machines on the same system. Make one machine with the full setup and then clone it in VMware, renaming the copies as master and slave.

Step 2: Configuring SSH

Now that we have our machines ready (we will refer to them as master and slave from now on), our first task is to create a secure connection between the master and the slave. For cluster-related communications, the Spark master should be able to create a passwordless SSH login to the Spark slave.

Let us enable this communication.

  1. Make a note of the IP address of both machines. You can check it from the terminal using:

$ ifconfig

2. On both your machines, do the following:

$ sudo apt-get install  openssh-server


3. Now do the following on the master machine:

$ ssh-keygen


This generates a public key, id_rsa.pub, in the ~/.ssh directory, which will enable a passwordless connection from the master to the slave.

4. Copy the public key of the master node and add it as an authorized key on the slave node:

$ ssh-copy-id -i ~/.ssh/id_rsa.pub osboxes@SLAVE_IP


(Please note: osboxes is the host name of my slave machine and SLAVE_IP stands for its IP address; replace both with your own.)

5. Check if the key was added successfully; you should be able to log into the machine:

$ ssh osboxes@SLAVE_IP


Step 3: Starting the cluster

On the master machine, open the terminal, go to the Spark folder, and type:

$ ./sbin/start-master.sh


This runs a script starting your Spark master. When starting, Spark writes the node's details to a log file, which will contain the Spark URL of the master node. It usually looks like spark://<host-name>:7077.
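The worker is started separately on the slave machine and pointed at the master's URL (a sketch; replace <master-host> with your master's host name from the log):

```shell
# On the slave machine, from the Spark folder: register this worker
# with the master at its spark:// URL
./sbin/start-slave.sh spark://<master-host>:7077
```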

Let's go to the web console of the Spark master (http://<master-host>:8080) and check the status of the cluster. If the slave node starts up correctly and joins the cluster, we should see the details of the worker node under the Workers section.


Implementing a Machine Learning program : K-means Clustering

Now that we have our cluster in place, let's solve a machine learning problem.

Data set information: The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High-quality visualization of the internal kernel structure was obtained using a soft X-ray technique. It is non-destructive and considerably cheaper than other, more sophisticated imaging techniques like scanning microscopy or laser technology. The studies were conducted using combine-harvested wheat grain originating from experimental fields.

Problem statement: The objective is to create distinct clusters from the provided data set, based on the different attributes, representing the three varieties of wheat.

Data description: Data for 210 randomly selected wheat samples were collected. To construct the data, seven geometric parameters of wheat kernels were measured:
1. area (A)
2. perimeter (P)
3. compactness  (C = 4*pi*A/P^2)
4. length of kernel
5. width of kernel
6. asymmetry coefficient
7. length of kernel groove
All of these parameters are real-valued continuous.

Data preparation: Check if the data set has any missing or NA values; this data set had none. Also make sure all the attribute values are numeric, as we will be doing clustering.

Technique Used : We will be using K-means clustering technique to solve this problem. K-Means Clustering is an unsupervised learning algorithm that tries to cluster data based on their similarity. Unsupervised learning means that there is no outcome to be predicted, and the algorithm just tries to find patterns in the data.

K-means is preferred when we already know the number of clusters to be formed from the data. With our data set we know that 3 varieties of wheat are present, so we can specify the number of clusters as 3.

Let's get straight to the code now and get our wheat grains grouped. Log in to your master machine and open your IPython or Jupyter notebook.

Bring in the libraries you need, and configure the spark context.


Make functions to read the data, split each line on whitespace, and then apply K-means over that data.



Declare the name of the data file to be used and the number of clusters to be formed, then call the main function.


Running it gives us the output discussed below.


Let's understand what can be concluded from the output.

The model accuracy is 92.38%; this tells how accurately our K-means could identify the variety of wheat from the sample.

The confusion matrix gives an insight into how many misclassifications occurred.

Variety 1: 64 correct classifications; 1 got grouped under variety 2 and 5 under variety 3.
Variety 2: 60 correct classifications; 10 got grouped under variety 1.

All variety 3 wheat samples got classified correctly.
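As a quick sanity check, the 92.38% accuracy follows directly from the confusion matrix described above:

```python
# rows = true variety, columns = predicted cluster (from the results above)
confusion = [
    [64, 1, 5],    # variety 1: 64 correct, 1 -> variety 2, 5 -> variety 3
    [10, 60, 0],   # variety 2: 10 -> variety 1, 60 correct
    [0, 0, 70],    # variety 3: all 70 correct
]
correct = sum(confusion[i][i] for i in range(3))
total = sum(sum(row) for row in confusion)
accuracy = correct / total
print("%.2f%%" % (accuracy * 100))  # 92.38%
```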

Insights: This could mean the features of variety 3 wheat are very different from varieties 1 and 2, and thus it forms a distinct cluster.

A few variety 2 samples got classified as variety 1 (approx. 15%), which means some variety 2 samples might contain attributes or properties which highly resemble variety 1.

A few variety 1 samples got classified as variety 3, which can indicate that either they have a few matching attributes or there are some outliers in the data.

Also, the image capturing technique, even though not very expensive, has still been able to capture the features very accurately.


K-means is a widely used algorithm for clustering data. So now that you have got the ball rolling, try it on more famous data sets (like Iris).


(Source of data: UCI Machine Learning Repository)

Install Spark and Run your first Program!

With the increased adoption of Spark over Hadoop in major business systems, due to its fast computing and its compatibility with Hadoop itself, it's a good time to make a move to Spark. Not just that: Spark is also an escape artist with its memory management skills, and if that was not enough, it supports numerous APIs in Java, Scala and Python.

This post will guide you through your journey of installing Spark and Python on your Ubuntu system, and then will help you run a very simple "word count" program to give you the start you need to launch your expedition into the world of Spark.

I am assuming that you already have Ubuntu running on your system, or a Windows machine with Ubuntu running in a virtual machine setup (to create a virtual machine you can follow these instructions).

Installing Python

Lucky us! Ubuntu (14.04 and higher) comes with pre-installed versions of Python 2 and Python 3 by default. You can check your default version with the following commands in the terminal.

python -V

python --version


Installing Spark

The latest version of Spark can be downloaded from the official website (do it from here). Make sure you choose the latest pre-built Hadoop version and Direct Download (do not download from a mirror).

Unzip the tgz file that you just downloaded and move it to your home directory.


Just to simplify the paths, I have renamed the extracted folder to spark16. You can leave it as is or do likewise, but keep in mind that this affects the following commands, which contain the path to that folder.

So it's time to set the path for Spark. You have to make the following changes in the .profile, or else type these commands every time before running Spark.

  1. export SPARK_HOME=/home/user/spark16
  2. export PATH=$SPARK_HOME/bin:$PATH
  3. export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH

Make sure to check your file name, path and py4j*.zip version in $SPARK_HOME/python/lib.


But trust me, I would prefer making the changes in the .profile. Make sure you do not delete previous paths from the .profile; to be on the safer side, keep a copy of your .profile in some other folder.

Edit your .profile by

gedit ~/.profile

This opens the .profile in an editable mode; add the above-mentioned export statements.


We are all set. Let's see if we have Spark up and running. Type pyspark in the terminal and (fingers crossed) hope for the Spark welcome screen to appear.


Going good! Let's keep up the spirit and move towards running our first program, "word count", the 'hello world' of the Spark universe.

Now the two most important requirements before going further are the word count program and a text file you want to count words in. I will be using a 'Hobbit.txt' file as my input; you can choose your own text file for this, just make sure the file is placed in your home directory or the same directory as your Spark.

I will try helping you execute this program in three different ways-

In terminal using Pyspark

Taking it up from where we stopped last (the screen after typing pyspark in the terminal), execute each of the following lines:

text = sc.textFile("hobbit.txt")
print text
from operator import add
def tokenize(text):
    return text.split()
words = text.flatMap(tokenize)
print (words)
wc = words.map(lambda x: (x, 1))
print wc.toDebugString()
counts = wc.reduceByKey(add)
counts.saveAsTextFile("output-dir")
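To see what these transformations compute, here is the same pipeline in plain Python, no Spark needed (flatMap splits and flattens, map pairs each word with 1, and reduceByKey sums the pairs per word):

```python
from collections import Counter

lines = ["in a hole in the ground", "there lived a hobbit"]

# flatMap(tokenize): split every line and flatten into one word list
words = [w for line in lines for w in line.split()]
# map + reduceByKey(add): count occurrences per word
counts = Counter(words)

print(counts["in"])  # 2
print(counts["a"])   # 2
```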


The final output gets stored in a directory called "output-dir" (written by counts.saveAsTextFile("output-dir")). Check the files in the output directory.


By Batch Processing – using python outside pyspark

Let's come out of the pyspark environment by typing quit() in the terminal. Now let's save the word count program in a batch file with the .py extension, e.g. 'wordcount3.py'. Save this file in the home directory and type the following command in the terminal:

python ./wordcount3.py hobbit.txt


It gives the output directly in the terminal.

Lastly, by using the Jupyter Notebook

Which brings me to the point where I should also tell you how to install the Jupyter notebook on your machine. Go to the terminal and type the following commands:

sudo apt-get -y install python-pip python-dev
sudo apt-get -y install ipython ipython-notebook
sudo -H pip install jupyter

Voila! pip does everything for you. Now let's get started with the notebook.

Type jupyter notebook in the terminal


and wait for the notebook to open. Select the wordcount program (wordcount.ipynb) in the notebook.



Each line of code is executed, and the output is stored in a newly created folder, hobbit-out3.


Hope all of us are on the same page after going through this blog. Well, this is only the beginning; keep following for more data science posts.