Our popular Hadoop for Java Developers course was recorded using version 2.4.0 of Hadoop. Since the course was released there have been some further releases of Hadoop, with the current version being 2.6.0.
There are no differences in the content that we cover on the course between the two versions of Hadoop, so the course is completely valid if you wish to use 2.6.0 or 2.4.0. In this blog post, however, I want to point out a reason to stick with version 2.4.0, and a couple of pointers that you should be aware of if you are going to use 2.6.0. I'll also mention the process to upgrade from 2.4.0 to 2.6.0.
If you're starting to develop with Hadoop today then you might just want to download the latest version from the Hadoop website (2.6.0) and there is only really one reason that I can think of not to do this... and that is that Amazon's Elastic Map Reduce (EMR) service, which can be used to run Hadoop jobs "in the cloud" is not yet compliant with versions of Hadoop newer than 2.4.0.
Over the last week or so we've had a few support calls asking questions about the scripts provided in chapter 5 of the course, that are used to switch Hadoop between standalone and pseudo-distributed modes.
This post will explain in a bit more detail what each script is and how it works. These are custom scripts that I've developed while working with Hadoop, so you won't (probably!) find them elsewhere on the internet, but I think they make the process of managing Hadoop configurations on a development machine really easy.
We have just been made aware of an error in the Hadoop course, chapter 5, at approximately 19 minutes in the video. This error has been fixed in and the video was replaced on 22nd October - so this will only affect you if you downloaded chapter 5 before the 22nd October 2014. All customers can download the replacement video if preferred.
The issue relates to the part of the video that deals with setting your environment variables, and instructs you to edit either your .bashrc file (Linux) or .bash_profile (Mac)
I'm currently in the middle of writing my next course for Virtual Pair Programmers, which will be on using Hadoop. Typically in Virtual Pair Programmers courses, we use Apache's Derby database. We choose this because that's because it's light-weight, and so easy to distribute. It needs pretty much no installation / configuration etc. We can provide students with a full copy of the application and sample databases, and they can avoid having to spend time setting up and configuring a database server, such as MySQL, and having to import a database.
One of the topics we'll be covering on the Hadoop course is the use of the DBInputFormat and DBOutputFormat to read from and write to a relational database (we'll be learning about Sqoop too but the same issue will affect Sqoop... it's just that I've not got to that part of the script just yet!).