Over the last week or so we've had a few support calls asking about the scripts provided in chapter 5 of the course, which are used to switch Hadoop between standalone and pseudo-distributed modes.
This post will explain in a bit more detail what each script is and how it works. These are custom scripts that I've developed while working with Hadoop, so you won't (probably!) find them elsewhere on the internet, but I think they make the process of managing Hadoop configurations on a development machine really easy.
Not everything in this post will make sense until you've got through chapter 8 of the course, but feel free to contact me if you have any questions after reading it, or raise a support call through https://www.virtualpairprogrammers.com/technical-support.html
There are 4 scripts provided with the course:
The first script is designed to clear down your HDFS workspace - that is, to empty out all the files and folders in the Hadoop file system. It's like formatting a drive. What the script actually does is:
- stop any running Hadoop processes
- delete the HDFS folder structure from your computer
- recreate the top level HDFS folder, and set its permissions so that the logged on user can write to it
- run the hdfs format command - this will create the sub-folder structure needed
- restart the Hadoop processes
- create the default folder structure within HDFS that's required for your pseudo-distributed jobs (/user/yourusername)
(1) You must be in the folder where the script is located to run this script. You should run it by entering the following command:
(2) The script contains a number of lines that must be run with admin privileges - these contain the word sudo. As a result, running this script will require you to enter your admin password one or more times. Although this might seem frustrating, you will not be running this script regularly - only when you wish to delete all your data, and for that it's a quick and easy way to do it.
(3) Because this script creates the file and folder structures that HDFS requires, we also use it to create them for the first time. When the course was first released there was a typing error - on line 2, sudo was misspelt sduo. This has been corrected, but if you downloaded a copy with the typo, you'll want to correct it!
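As a rough illustration, the steps listed above could be sketched in a shell script along the following lines. This is not the script supplied with the course: the HDFS_DIR location, the run helper and the reset_hdfs function name are all assumptions made for the example, and it assumes Hadoop's stop-dfs.sh, start-dfs.sh and hdfs commands are on your PATH. It defaults to a dry run that only prints each command, so nothing is deleted unless you set DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Illustrative sketch only - NOT the script supplied with the course.
# HDFS_DIR is an assumed location; check your own configuration.
HDFS_DIR="${HDFS_DIR:-/tmp/hadoop-example/hdfs}"
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

reset_hdfs() {
  local user
  user="$(id -un)"
  run stop-dfs.sh                        # stop any running Hadoop processes
  run sudo rm -rf "${HDFS_DIR:?}"        # delete the HDFS folder structure
  run sudo mkdir -p "$HDFS_DIR"          # recreate the top level folder
  run sudo chown "$user" "$HDFS_DIR"     # let the logged-on user write to it
  run hdfs namenode -format              # create the sub-folder structure
  run start-dfs.sh                       # restart the Hadoop processes
  run hdfs dfs -mkdir -p "/user/$user"   # default folder for pseudo-distributed jobs
}

reset_hdfs
```

The three sudo lines are the ones that would prompt for your admin password when run for real.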
The startHadoopPseudo script will switch Hadoop into pseudo-distributed mode - if you're currently in standalone mode then this is the only script you need to run.
What the script actually does is:
- remove the existing symbolic link to the configuration directory
- create a new symbolic link to the configuration directory containing the pseudo-distributed configuration files
- start the Hadoop processes
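The symbolic link swap in the first two steps can be sketched as follows. This is a minimal example under assumptions: the switch_conf function and the conf.standalone and conf.pseudo directory names are hypothetical, and the demonstration runs in a temporary directory rather than against a real Hadoop installation. Starting the Hadoop processes afterwards is not shown.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: directory and function names are hypothetical.

# Point the symbolic link at a different configuration directory.
switch_conf() {
  local link="$1" target="$2"
  if [ -L "$link" ]; then
    rm "$link"                 # remove the existing symbolic link
  fi
  ln -s "$target" "$link"      # create the new symbolic link
}

# Demonstration in a throwaway directory, so no real Hadoop files are touched.
demo_dir="$(mktemp -d)"
mkdir "$demo_dir/conf.standalone" "$demo_dir/conf.pseudo"
switch_conf "$demo_dir/conf" "$demo_dir/conf.standalone"
switch_conf "$demo_dir/conf" "$demo_dir/conf.pseudo"
readlink "$demo_dir/conf"      # prints the path ending in conf.pseudo
```

Because Hadoop only ever reads its settings through the link, swapping the link target is enough to change mode without editing any configuration file.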
The stopHadoop script simply stops the Hadoop processes - it should be run if you're in pseudo-distributed mode and are going to switch back to standalone mode. It doesn't change any configuration settings, it just stops the running processes.
This script removes the existing symbolic link to the configuration directory, and creates a new symbolic link to the configuration directory containing the standalone files. Although I've called this script "startHadoopStandalone" it doesn't actually start anything, as no processes run in standalone mode.
So... which scripts do you need to run, and when?
- If you're in standalone mode and you want to be in pseudo-distributed mode, just run startHadoopPseudo
- If you're in pseudo-distributed mode and you want to be in standalone mode, first run stopHadoop and then run startHadoopStandalone
- If you have just switched on your machine and want to run in either mode, just run the relevant start script. In this instance you don't need to run the stop script, because you have no running processes if you have just booted up.
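The rules above can be captured in a small helper that prints which scripts to run for a given switch. This is only a sketch: the scripts_to_run function and the "booted" label for a freshly started machine are invented for the example, while the script names it prints are the ones described in this post.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: scripts_to_run and the "booted" label are hypothetical.

# Usage: scripts_to_run <current> <wanted>
#   current: standalone | pseudo | booted (machine just switched on)
#   wanted:  standalone | pseudo
scripts_to_run() {
  local current="$1" wanted="$2"
  case "$current:$wanted" in
    standalone:pseudo) echo "startHadoopPseudo" ;;
    pseudo:standalone) echo "stopHadoop startHadoopStandalone" ;;
    booted:pseudo)     echo "startHadoopPseudo" ;;
    booted:standalone) echo "startHadoopStandalone" ;;
    *)
      echo "unknown combination: $current -> $wanted" >&2
      return 1
      ;;
  esac
}

scripts_to_run pseudo standalone   # prints: stopHadoop startHadoopStandalone
```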