Install Hadoop on macOS
In today’s data-driven world, organizations constantly seek ways to handle and analyze massive amounts of information. Enter Hadoop - a game-changing framework that’s revolutionizing big data processing.
Hadoop is an open-source software framework that stores and processes big data in a distributed computing environment. Developed by the Apache Software Foundation, Hadoop allows for the distributed processing of large data sets across clusters of computers using simple programming models. In this blog, we will see how to install Hadoop on macOS.
Step 1: Prerequisites for Hadoop
- Java: Java must be installed on your system; different versions of Hadoop require different versions of Java (this can be found here). We will use Java 8 for our setup since we will be working with Hadoop 3.3 and above.
Download Java from here; you will be required to create an Oracle account to download the JDK, or you can download the OpenJDK version instead.
- Set JAVA_HOME: Edit .bash_profile or .zprofile and add the JDK installation path (see the tip below for locating this path):

export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_411.jdk/Contents/Home"
export PATH=${JAVA_HOME}/bin:${PATH}

Run the following command to apply the changes, or restart the terminal:

source ~/.zprofile
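If you are not sure where the JDK is installed, macOS ships a helper that prints the home directory of an installed JDK; the snippet below assumes a Java 8 installation:

/usr/libexec/java_home -v 1.8
# Prints something like: /Library/Java/JavaVirtualMachines/jdk1.8.0_411.jdk/Contents/Home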
- ssh: ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons, if the optional start and stop scripts are to be used (follow this - enable ssh on mac). A quick terminal check is shown below.
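On macOS, sshd is controlled by the Remote Login setting. As a sketch, you can check and enable it from the terminal with the built-in systemsetup utility (requires sudo), or use the Sharing pane in System Settings:

sudo systemsetup -getremotelogin    # prints "Remote Login: On" or "Remote Login: Off"
sudo systemsetup -setremotelogin on # enables sshd if it is off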
Step 2: Download and Install Hadoop
- Download Hadoop from Apache here. We need to download binaries for Mac: for an Intel-based Mac we would download the binary package, and for an Apple Silicon-based Mac we would download the binary-aarch64 package.
- Extract the binary files and put them in the directory mentioned below (see the terminal sketch that follows):

/usr/local/hadoop-3.4.0/
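As a sketch, the download and extraction can be done from the terminal. The URL below assumes version 3.4.0 and an Intel Mac; adjust the file name for your version and architecture:

cd ~/Downloads
curl -O https://downloads.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz
# Extract into /usr/local/, creating /usr/local/hadoop-3.4.0/
sudo tar -xzf hadoop-3.4.0.tar.gz -C /usr/local/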
Step 3: Configure Hadoop Environment Variables
We need to update 5 configuration files in total, located at /usr/local/hadoop-3.4.0/etc/hadoop/. They are mentioned below.
- core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
- hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
- mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>
- yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>
- hadoop-env.sh

export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_411.jdk/Contents/Home"
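Before moving on, it can help to confirm the edited files are still well-formed XML; xmllint ships with macOS:

cd /usr/local/hadoop-3.4.0/etc/hadoop
xmllint --noout core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml
# No output means all four files parsed cleanly.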
Step 4: Add Hadoop Variables to the Shell Profile
Next, we will add the following path variables to the .bash_profile or .zprofile file.
export PATH=$PATH:/usr/local/hadoop-3.4.0/bin/
export HADOOP_CLASSPATH=$(hadoop classpath)
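Reload the profile and confirm that the hadoop command resolves; this also checks that Hadoop can run under the JAVA_HOME set in hadoop-env.sh:

source ~/.zprofile
hadoop version   # should report Hadoop 3.4.0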
Step 5: Set Up Passwordless SSH to Localhost
The final setup step is to enable your user to ssh to localhost without being prompted for a password, which can be done by following the steps below.
$ ssh localhost
# If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
Step 6: Format NameNode
- Change the current directory to /usr/local/hadoop-3.4.0/bin in the terminal:

cd /usr/local/hadoop-3.4.0/bin

- Execute the following command to format the NameNode:

hdfs namenode -format
Step 7: Run Hadoop
- Change the current directory to /usr/local/hadoop-3.4.0/sbin in the terminal:

cd /usr/local/hadoop-3.4.0/sbin

- Execute the following command in the terminal:

./start-all.sh
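Once the daemons are up, a quick smoke test is to create an HDFS home directory for your user and list the filesystem root; this sketch uses the standard HDFS shell:

hdfs dfs -mkdir -p /user/$(whoami)   # create an HDFS home directory for the current user
hdfs dfs -ls /                       # should list /user without errors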
Step 8: Confirm Hadoop is Installed and Running
- Execute the command jps in the terminal. With all daemons running, the output should list the NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager processes, as shown below.
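A sample of what jps typically prints for this setup (the process IDs are illustrative):

$ jps
2001 NameNode
2102 DataNode
2233 SecondaryNameNode
2390 ResourceManager
2487 NodeManager
2555 Jps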
Step 9: Check Hadoop Configurations
Open a web browser and go to http://localhost:9870 to see the current configurations for the Hadoop session. (The YARN ResourceManager UI is available at http://localhost:8088.)
Step 10: Stop all Running Daemons for Hadoop
- Execute the following command in the terminal (from /usr/local/hadoop-3.4.0/sbin):

./stop-all.sh
Troubleshooting
- If NameNode doesn’t start: You might face issues after shutting down or restarting your system. This is because the NameNode data is stored in the system's tmp storage by default, and is lost after a shutdown/restart. Either re-run hdfs namenode -format, or point hadoop.tmp.dir at a persistent directory, as sketched below.
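A sketch of the persistent-directory fix, assuming /usr/local/hadoop-3.4.0/hdfs/tmp as the location (any writable directory that survives reboots works); add this property to core-site.xml and re-format the NameNode once:

<property>
  <name>hadoop.tmp.dir</name>
  <!-- hypothetical path; replace with any directory that survives reboots -->
  <value>/usr/local/hadoop-3.4.0/hdfs/tmp</value>
</property>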
- If NodeManager doesn’t start: Double-check that you have added the JAVA_HOME configuration to the hadoop-env.sh file mentioned in Step 3.
Enjoyed the read? Give it a thumbs up 👍🏻!