Week 5 Lab: Analyzing Data With Hadoop
Lab Overview
Scenario/Summary
Hadoop is a computing framework that provides the ability to store and analyze very large data sets using a cluster of many inexpensive computers. The objective of this lab is to familiarize you with the capabilities of the Hadoop platform, including how data files are transferred to Hadoop in preparation for analysis.
In this lab, you will view a video about a large-scale data analysis project done as part of the “Big Data for Social Good Challenge” sponsored by IBM and Hadoop. Because of the large volume of data involved, Hadoop was used to perform each analysis. You will answer a series of questions about how the analysis was performed, what could have been done differently, and other applications for this analytical technique. You will then create your own account on a cloud-based Hadoop system and practice performing some basic operations with Hadoop.
NOTE
Parts of this lab activity are adapted from “Big Data for Social Good Example Demo” by the IBM Bluemix Team, 2015, retrieved from http://ibmhadoop.devpost.com/submissions. Adapted with permission under terms of the IBM Academic Initiative.
Additional resources and activities for learning about Hadoop and other big data topics are available through the IBM Big Data University site at http://bigdatauniversity.com/. Students may optionally register for a free account at Big Data University and explore these supplemental resources.
Deliverables
Submit a Word document named LabWeek5xxx.docx (where xxx = your initials) that includes your answers to the questions about the Hadoop tutorial video, plus screen shots documenting your completion of basic Hadoop operations on the cloud-based system.
Your deliverable will be evaluated according to the following rubric.
| Step | Criteria | Points | % | 
| 1 | Answers to Questions on “Big Data for Social Good” Project: All questions about the project answered clearly and correctly, with at least one full paragraph per question, written in standard professional English. | 30 | 60% | 
| 3 | Connection Data and Ambari Console: Hadoop connection details for the IBM Demo Cloud are shown in the document. A screen shot showing the Ambari Console is included. Answers to 3 questions about the Ambari Console match the information shown in the screen shot. | 10 | 20% | 
| 4 | Basic HDFS Commands: Screen shots show the establishment of a connection to Hadoop via the ssh command; a list of files and directories in the root directory of the Hadoop file system; the Hadoop test directory and contents after upload of a test file. | 10 | 20% | 
| Total | | 50 | 100% | 
Required Software
MICROSOFT OFFICE: WORD
Use a personal copy or access the software at https://lab.devry.edu.
All Steps
CYGWIN 1_7_35
Access the software at https://lab.devry.edu.
Step 4
APACHE HADOOP AND COGNITIVE CLASS LABS CLOUD PLATFORM
Access the software at https://my.imdemocloud.com/.
All Steps
Tutorial
Watch the following tutorial videos to see how to perform the steps in this lab activity.
Week 5 Lab Video Part 1 (covers Steps 1 and 2)
Lab Steps
Step 1: Review Hadoop Project Video and Answer Questions
- Go to the “Big Data for Social Good Challenge” submissions page at http://ibmhadoop.devpost.com/submissions. Select and review any one of the project videos on this site that interests you, plus any other available information about the project.
- Answer the following questions. Each answer should be at least a full paragraph in length. Place your answers in a Word document named LabWeek5xxx.docx (where xxx = your initials).
- Question 1: What were the goals of the selected project?
- Question 2: Describe the data sources used in the project, and evaluate whether these data sources meet the criteria to be described as “big data.”
- Question 3: What analytical methods were used in this project, and why was Hadoop needed to perform this analysis?
- Question 4: Describe how the results of the analysis were presented (i.e., via maps, charts, tables, etc.), and evaluate how effective you believe this presentation was.
- Question 5: What are some ways in which this project could potentially be extended or improved?
- Question 6: What other applications can you envision using the analytic methods and/or data sources involved in this project?
 
Step 2: Create Account on Cognitive Class Labs Cloud
- Go to the Cognitive Class Labs Cloud site (formerly known as the IBM Demo Cloud) at https://my.imdemocloud.com.
- Click Sign Up to create a new account. Enter your name, e-mail address, and a password you create (twice), then click Sign Up. Record the password for signing into the account later.
- You will receive an email with a confirmation link. This may take up to 30 minutes. Click the link to confirm your email address and activate your account.
- After clicking the link in the confirmation email, you will be taken to a login page. Log in using your email address and the password you created earlier. The first time you log in, a terms of service agreement will be displayed. Review this agreement and, if you accept, select the “I accept” option at the bottom and click Next.
- Click the Hadoop Sandbox link.
- Review the license agreement for the software and, if you accept, select the “I accept” option and click Submit. (You may need to do this for multiple license pages.)
- After accepting the license agreement(s), you should see the Hadoop Sandbox page.
Step 3: Get Connection Details and View Ambari Console
- WAIT AT LEAST 30 MINUTES after creating your account before attempting to access the Ambari Console. It may take this long for your access to be fully enabled.
- Select Systems.
- Scroll to the bottom of the page under Connection Details. Here you will find a username and server host name for connecting to your cloud-based Hadoop cluster. Copy and paste the username and server host name into your LabWeek5xxx.docx Word document under the heading “Hadoop Connection Details”.
- Under the Ambari Console section (just above Connection Details), click Launch and sign in using the username given under Connection Details (NOT your full email address) and the password you created earlier for your Cognitive Class Labs Cloud account.
- You should now see the Ambari Console which shows the status of the Hadoop cluster you are connected to. Some sections may show “No Data Available”; this is normal depending on what monitoring services are running on the console. Notice that you can obtain additional details for some sections by hovering your cursor over the section.
- Capture a screen shot of the Ambari Console and paste it into your Word document. In the Word document, below the Ambari Console screen shot, answer the following questions based on what you see on the Ambari Console:
- What percentage of disk space on the Hadoop Distributed File System (HDFS) is being used? How many gigabytes (GB) or terabytes (TB) does this represent? (Hint: Hover over the HDFS Disk Usage section and add the values for DFS Used and non-DFS Used.)
- How many DataNodes are in the cluster? (Hint: Look in the HDFS Links section.)
- How long has the cluster been running since the last restart? (Hint: Look in the NameNode Uptime section.)
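The hint for the first question above amounts to a small calculation. As a minimal sketch, the shell snippet below adds DFS Used and non-DFS Used and expresses the sum as a percentage of total capacity. All three numbers are hypothetical placeholders, not values from any real cluster, and the snippet assumes the console also reports a Remaining value so that capacity can be taken as used plus remaining. Substitute the figures shown on your own Ambari Console.

```shell
#!/bin/sh
# Hypothetical readings (in GB) from the HDFS Disk Usage section;
# replace them with the values shown on your own Ambari Console.
dfs_used_gb=120
non_dfs_used_gb=30
remaining_gb=450   # assumption: the console also reports remaining space

# Total used = DFS Used + non-DFS Used; capacity = used + remaining.
summary=$(awk -v dfs="$dfs_used_gb" -v non="$non_dfs_used_gb" -v rem="$remaining_gb" 'BEGIN {
    used = dfs + non
    capacity = used + rem
    printf "Total used: %.1f GB (%.1f%% of %.1f GB)", used, 100 * used / capacity, capacity
}')

echo "$summary"
```

If your console reports values in terabytes, keep all three inputs in the same unit and adjust the label in the printf format accordingly.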
 
Step 4: Use Basic HDFS Commands
- In a separate browser tab or window, go to the DeVry University Citrix server at https://lab.devry.edu and log in with your D-number and DeVry University password.
- Go to Apps, open a web browser in Citrix (Chrome or Firefox is recommended), and in this Citrix browser go to https://www.consumerfinance.gov/data-research/consumer-complaints/. Select the View Complaint Data option, use the filtering tools to select a subset of a few thousand complaints, and export the results as a CSV file. Save the exported complaint file in the Downloads folder on the Citrix server (not on your local computer). Close the Citrix browser window.
- Go to Apps and use the search box at the upper right to locate the Cygwin 1_7_35 application. Click the icon to launch the application. (If desired, you can click the Details link and add the app to your Favorites page to make it easier to find in the future.) Cygwin is a Linux-like command-line environment containing tools we will use to communicate with our Hadoop cluster.
- Enter the following commands at the $ prompt in Cygwin:
 cd Downloads
 ls
- You should see the complaint file you downloaded in the previous step displayed.
- Enter the following command at the $ prompt in Cygwin:
 scp *.csv username@hostname:
 where username and hostname are the names that were provided under Connection Details earlier. (You should have these saved in your Word document.) Don’t forget to add the colon ( : ) at the end of this command; without it, scp copies the file to a local file named username@hostname instead of to the remote system.
 You may receive a message saying that the authenticity of the host cannot be established and asking if you are sure you want to continue. Answer yes and press Enter. (This message appears whenever the server's host key is not already in your list of known hosts. The Hadoop cluster on the Cognitive Class Labs Cloud is intended only for educational purposes, so you can proceed.) You may also see messages indicating that a directory could not be created and that the system failed to add the host to the list of known hosts. These are normal and will not create any problems for our purposes.
 When prompted, enter the password for the Cognitive Class Labs Cloud account you set up earlier, and press Enter. (Nothing will display on the screen as you enter the password.) The complaint file will be copied to your account on the Cognitive Class Labs Cloud system.
 Capture a screen shot of this command and its output, and paste it into your Word document.
- After the previous command completes, enter the following command at the $ prompt in Cygwin:
 ssh username@hostname
 where username and hostname are the names that were provided under Connection Details earlier. (You should have these saved in your Word document.)
 You may receive a message saying that the authenticity of the host cannot be established and asking if you are sure you want to continue. Answer yes and press Enter. (This message appears whenever the server's host key is not already in your list of known hosts. The Hadoop cluster on the Cognitive Class Labs Cloud is intended only for educational purposes, so you can proceed.) You may also see messages indicating that a directory could not be created and that the system failed to add the host to the list of known hosts. These are normal and will not create any problems for our purposes.
 When prompted, enter the password for the Cognitive Class Labs Cloud account you set up earlier, and press Enter. (Nothing will display on the screen as you enter the password.) A prompt like [username@hostname ~]$ indicates that the connection is established and you are ready to enter Hadoop commands.
- Enter the following command to list files and directories in your home directory on the system:
 ls
 You should see the complaint file you transferred previously. Capture a screen shot of the resulting file list and paste it into your Word document.
- Enter the following command to create a directory called test in the Hadoop file system:
 hadoop fs -mkdir test
- Enter the following to load the complaint file into the test directory in HDFS that you just created:
 hadoop fs -put /mnt/home/username/*.csv test
 where username is the Hadoop username provided under Connection Details earlier.
- Enter the following command to display your Hadoop test directory and its contents:
 hadoop fs -ls -R
 Capture a screen shot showing this command and its resulting output, and paste it into your Word document.
 Note that the number “3” appears after the file permissions on the line for the complaint file. This is the replication factor for the file: by default, 3 copies of the file have been created on the Hadoop cluster. This is part of the distributed nature of HDFS; files are replicated on multiple DataNodes so that if one DataNode in the cluster goes down, the data will still be accessible.
- Enter the following command to display the first few lines of the complaint file:
 head -3 *.csv
 You should see the first 3 rows of data from the complaint file. Capture a screen shot of this command and its output, and paste it into your Word document.
- Enter the following command to close the connection to the Hadoop cluster:
 exit
- Save your Word document and close it; close the Cygwin application; and log out of the Cognitive Class Labs Cloud site.
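The replication factor noted in the hadoop fs -ls -R step above can also be picked out of the listing with standard shell tools. The sketch below is illustrative only: it parses one sample listing line whose owner, group, size, date, and file name are invented for the example, and it assumes the usual column order of permissions, replication, owner, group, size, date, time, and path. It then estimates the raw storage consumed on the cluster as file size times replication factor.

```shell
#!/bin/sh
# Hypothetical sample line in the format printed by `hadoop fs -ls -R`.
# The owner, group, size, date, and file name are made up for illustration.
sample_line='-rw-r--r--   3 username supergroup    1048576 2017-06-01 10:15 test/complaints.csv'

# Field 2 is the replication factor; field 5 is the file size in bytes.
replication=$(printf '%s\n' "$sample_line" | awk '{ print $2 }')
size_bytes=$(printf '%s\n' "$sample_line" | awk '{ print $5 }')

# Raw storage consumed across the cluster = file size x number of copies.
raw_bytes=$((size_bytes * replication))

echo "Replication factor: $replication"
echo "File size (bytes): $size_bytes"
echo "Raw storage used across DataNodes (bytes): $raw_bytes"
```

On the cluster itself, the same awk expressions could be applied to the real output of hadoop fs -ls -R rather than to a sample string. Directory lines show a dash in the replication column, so they would need to be skipped.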
Step 5: Submit the Deliverable
Submit your completed file.