In my previous blog I created a standalone Spark cluster using two machines, but we all know the limitations of our physical machines. So now we will move things to the cloud and set up our Spark cluster on AWS (Amazon Web Services). We will go from the very basics of setting up an AWS account to running a machine learning program on our newly formed cluster. We have a long way to go, so let's begin.
Start by creating a new account on Amazon Web Services.
Select the basic support plan. Fill in all your personal details and payment card details.
Amazon gives access to one basic machine with 1 GB RAM and a 1-core processor free for one year, and thereafter charges as per usage. Once you are done with your details and account verification, log in to your account. You will see various AWS services listed. Do not get confused: we will follow specific steps to reach our goal, but you can explore more services on your own.
Under the Build a solution tab, select the first option, Launch a virtual machine with EC2. This starts the creation of a new EC2 instance, in which we will configure our machine, and loads the EC2 launch page.
Now, instead of clicking on Get started, we will make a few changes. Amazon provides various Amazon Machine Images (AMIs), which come pre-configured with different setups to suit your needs. These images are stored in different server locations, so we will change the location to the one where our AMI is present.
Go to the top right corner and select US West (N. California) as the server location, then click on the advanced EC2 launch instance wizard. This opens a console which will help us choose our AMI.
Select the Community AMIs tab and type "ami-125b2c72" in the search box. This is the ID of the AMI that we will use as the virtual machine for our system. Select the listed AMI and proceed to choose the instance type (the free tier eligible machine).
Now we will configure our instance.
The first thing we need is a hashed password. Open an IPython notebook and type the following commands to create your hashed password.
In [1]: from notebook.auth import passwd
In [2]: passwd()
You will be prompted to enter and verify a password. The output is a hashed password, which you will copy into the script below.
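To see what a salted hash of this kind looks like, here is a minimal sketch using only the Python standard library. It assumes the classic "sha1:salt:digest" layout produced by older notebook releases (newer releases may default to a different algorithm), and the helper names salted_sha1 and verify are made up for this illustration. For the real server, keep using passwd().

```python
import hashlib
import secrets
from typing import Optional


def salted_sha1(passphrase: str, salt: Optional[str] = None) -> str:
    """Return 'sha1:<salt>:<hexdigest>' for a passphrase (illustrative helper)."""
    if salt is None:
        # A fresh random salt makes two hashes of the same password differ.
        salt = secrets.token_hex(6)  # 12 hexadecimal characters
    digest = hashlib.sha1((passphrase + salt).encode("utf-8")).hexdigest()
    return ":".join(("sha1", salt, digest))


def verify(hashed: str, passphrase: str) -> bool:
    """Recompute the digest with the stored salt and compare it to the stored one."""
    algorithm, salt, digest = hashed.split(":", 2)
    candidate = hashlib.new(algorithm, (passphrase + salt).encode("utf-8")).hexdigest()
    return secrets.compare_digest(candidate, digest)
```

The point of the salt is that the config file never contains the password itself, and the same password hashed twice gives two different strings.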
#!/bin/bash
# Assumed locations (not given in the original listing); adjust them to your setup
CERTIFICATE_DIR="/home/ubuntu/certificate"
JUPYTER_CONFIG_DIR="/home/ubuntu/.jupyter"
if [ ! -d "$CERTIFICATE_DIR" ]; then
    mkdir -p "$CERTIFICATE_DIR"
    openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout "$CERTIFICATE_DIR/mykey.key" -out "$CERTIFICATE_DIR/mycert.pem" -batch
    chown -R ubuntu "$CERTIFICATE_DIR"
fi
if [ ! -f "$JUPYTER_CONFIG_DIR/jupyter_notebook_config.py" ]; then
    # generate default config file
    #jupyter notebook --generate-config
    mkdir -p "$JUPYTER_CONFIG_DIR"
    # append notebook server settings
    cat <<EOF >> "$JUPYTER_CONFIG_DIR/jupyter_notebook_config.py"
# Set options for certfile, ip, password, and toggle off browser auto-opening
c.NotebookApp.certfile = u'$CERTIFICATE_DIR/mycert.pem'
c.NotebookApp.keyfile = u'$CERTIFICATE_DIR/mykey.key'
# Set ip to '*' to bind on all interfaces (ips) for the public server
c.NotebookApp.ip = '*'
# Paste the hashed password generated by passwd() between the quotes
c.NotebookApp.password = u''
c.NotebookApp.open_browser = False
# It is a good idea to set a known, fixed port for server access
c.NotebookApp.port = 8888
EOF
    chown -R ubuntu "$JUPYTER_CONFIG_DIR"
fi
Update the script with the hashed password and paste it into the User data text box under the Advanced Details section. This bash script, jupyter_userdata.sh, carries out Jupyter's instructions for setting up a public notebook server, so you don't have to manually configure the notebook server every time you want to spin up a new AMI instance. For the script to work, Jupyter should already be installed, which is the case in our "ami-125b2c72" AMI.
The Storage and Tags tabs need no changes. Proceed to the Configure Security Group tab. We can see that a Secure Shell (SSH) connection is already configured on port 22. Let's add another rule: Type: Custom TCP Rule; Protocol: TCP; Port Range: 8888; Source: Anywhere.
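If you prefer to think of the security group as data rather than as form fields, the two rules can be sketched as plain Python dictionaries. The helper name tcp_ingress_rule is made up for this illustration; the dictionary keys follow the shape that, to my knowledge, boto3's EC2 client accepts as IpPermissions in authorize_security_group_ingress, but treat that as an assumption and check the boto3 documentation before relying on it.

```python
def tcp_ingress_rule(port, cidr="0.0.0.0/0", description=""):
    """Describe one inbound TCP rule for a single port (illustrative helper)."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        # "Anywhere" in the console corresponds to the 0.0.0.0/0 CIDR block.
        "IpRanges": [{"CidrIp": cidr, "Description": description}],
    }


# The two rules used in this walkthrough: SSH and the Jupyter notebook port.
rules = [
    tcp_ingress_rule(22, description="SSH"),
    tcp_ingress_rule(8888, description="Jupyter notebook"),
]
```

Passing your own IP address as cidr (for example "203.0.113.10/32") instead of the default restricts access to your machine only, which is a tighter choice than Anywhere.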
Now we get to click the most awaited button, Review and Launch. In the pop-up window, select Create a new key pair, give the key a name, and click on Download. Save this private key file (.pem) and guard it with your life, as it is your only way to access the virtual machine you have been trying to create for so long.
Now click on Launch Instances and wait for the instance to be created. Be patient, as this might take a few minutes.
Use this time to change the permissions on the key file while your instance is being created. From the terminal, go to the directory where your key file was saved and type the following command:
$ chmod 600 PEM_FILENAME
Your instance should be created by now. Select it in the EC2 console to view its details.
Also make sure to take note of the public IP of this instance, which will be something like 22.214.171.124. We will need this IP address to connect to our instance from the terminal.
Open the terminal and run the following command:
$ ssh -i PEM_FILENAME ubuntu@PUBLIC_IP
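ssh refuses a key file that other users can read, which is why the chmod 600 step matters. The sketch below checks the permissions before building the same ssh command as a list of arguments. The function names key_is_private and ssh_command are made up for this illustration, and the check assumes a Unix-like system.

```python
import os
import stat


def key_is_private(pem_path):
    """Return True if group and others have no permission bits on the key file."""
    mode = stat.S_IMODE(os.stat(pem_path).st_mode)
    return mode & 0o077 == 0


def ssh_command(pem_path, public_ip, user="ubuntu"):
    """Build the argument list for: ssh -i PEM_FILENAME ubuntu@PUBLIC_IP"""
    if not key_is_private(pem_path):
        raise PermissionError(f"{pem_path} is too open; run: chmod 600 {pem_path}")
    return ["ssh", "-i", pem_path, f"{user}@{public_ip}"]
```

The returned list can be handed to subprocess.run to open the connection from a script instead of typing the command by hand.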
Here we are, accessing our AWS machine! Not so difficult, is it? It comes pre-loaded with Jupyter notebook and Python. We just have to install Spark on our virtual machine and we will be all set to run our machine learning problem. The installation is similar to installing Spark on a personal system.
Implementing Machine Learning Problem : Classification using Random Forest
Data set information: I will be working on the breast cancer data set obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. This data is used to diagnose breast cancer from fine-needle aspirates.
Problem Statement: Our aim is to create a model which can classify a cell sample as benign or malignant.
Data description: The data set contains data on 699 patients who got tested. The variables available for study are:
#    Attribute                      Domain
1.   Sample code number             id number
2.   Clump Thickness                1 - 10
3.   Uniformity of Cell Size        1 - 10
4.   Uniformity of Cell Shape       1 - 10
5.   Marginal Adhesion              1 - 10
6.   Single Epithelial Cell Size    1 - 10
7.   Bare Nuclei                    1 - 10
8.   Bland Chromatin                1 - 10
9.   Normal Nucleoli                1 - 10
10.  Mitoses                        1 - 10
11.  Class                          2 for benign, 4 for malignant
Data Preparation: We have to clean the data before we can use it for creating our model. First and most important is to check for NA or missing values. Our data set contains 16 missing values. Since this is medical data, which can be really sensitive, imputing the missing values might affect the results adversely, so we decided to remove those rows instead. Also, Sample code number is just an identifier which is of no use in modelling, so we remove that attribute. The Class attribute contains the values 2 and 4; we convert 2 to 0 and 4 to 1 so that the model works with a standard binary label.
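The cleaning steps above can be sketched in plain Python. This is a minimal illustration, not the code that runs on the cluster: it assumes the file is comma-separated with the 11 columns in the order listed above and that missing values are marked with "?", as in the publicly distributed version of this data set. The function name clean_rows and the sample rows are made up for the example.

```python
import csv

# Class labels in the raw file mapped to binary labels: benign -> 0, malignant -> 1.
LABEL_MAP = {2: 0, 4: 1}


def clean_rows(lines):
    """Return (features, label) pairs from raw comma-separated lines."""
    cleaned = []
    for row in csv.reader(lines):
        # Skip blank lines and any row that contains a missing value.
        if not row or "?" in row:
            continue
        values = [int(v) for v in row[1:]]  # row[0] is the sample code number
        features, raw_label = values[:-1], values[-1]
        cleaned.append((features, LABEL_MAP[raw_label]))
    return cleaned


# Hypothetical rows in the same layout as the data set (not real patient records).
sample = [
    "101,5,1,1,1,2,1,3,1,1,2",
    "102,8,10,10,8,7,10,9,7,1,4",
    "103,8,4,5,1,2,?,7,3,1,4",
]
prepared = clean_rows(sample)
```

Here the third sample row is dropped because of the "?", and each remaining row keeps its nine feature values plus a 0/1 label.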
Technique used: We will be using random forest classification to solve our problem.
Random Forest is a versatile machine learning method capable of performing both regression and classification tasks. It provides feature importance estimates that can guide dimensionality reduction, tends to be robust to outliers, and in some implementations can also handle missing values, so it does a fairly good job with little preprocessing. It is a type of ensemble learning method, where a group of weak models combine to form a powerful model: each tree is trained on a random sample of the data, and the forest predicts the class that most trees vote for.
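To make the "group of weak models" idea concrete, here is a deliberately simplified sketch in plain Python. It is not a full random forest: each "tree" is a one-split decision stump on one randomly chosen feature, trained on a bootstrap sample, and the ensemble predicts by majority vote. All function names are made up for this illustration; on the cluster you would use a library implementation (for example, Spark's pyspark.ml.classification.RandomForestClassifier).

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Pick the threshold and polarity with the fewest errors on one feature."""
    values = sorted(set(row[feature] for row in rows))
    # Candidate cut points: one below every value, then midpoints between neighbours.
    candidates = [values[0] - 0.5] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for threshold in candidates:
        errors = sum((row[feature] > threshold) != bool(y) for row, y in zip(rows, labels))
        # Polarity 1 predicts class 1 above the threshold; polarity 0 is the inverted rule.
        for polarity, err in ((1, errors), (0, len(rows) - errors)):
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return feature, best[1], best[2]


def stump_predict(stump, row):
    """Predict 0 or 1 for one row with a single stump."""
    feature, threshold, polarity = stump
    above = row[feature] > threshold
    return int(above == bool(polarity))


def fit_forest(rows, labels, n_trees=15, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample rows with replacement
        feature = rng.randrange(len(rows[0]))           # random feature for this stump
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest


def forest_predict(forest, row):
    """Return the class that most stumps vote for."""
    votes = Counter(stump_predict(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]
```

An odd number of stumps is used so that a two-class vote can never end in a tie. A real random forest grows deeper trees and picks a random subset of features at every split, but the bootstrap-and-vote structure is the same.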