In today’s data-driven world, organizations constantly seek ways to handle and analyze massive amounts of information. Enter Hadoop - a game-changing framework that’s revolutionizing big data processing.

Hadoop is an open-source software framework that stores and processes big data in a distributed computing environment. Developed by the Apache Software Foundation, Hadoop allows for the distributed processing of large data sets across clusters of computers using simple programming models. In this blog, we will see how to install Hadoop on macOS.

Step 1: Prerequisite for Hadoop

  • Java: Java must be installed on your system. Different versions of Hadoop require different versions of Java (the supported combinations can be found here). We will use Java 8 for our setup since we will be working with Hadoop 3.3 and above.

    Download Java from here. You will be required to create an Oracle account to download the JDK, or you can download an OpenJDK build instead.

  • Set JAVA_HOME: Edit .bash_profile or .zprofile and add the JDK installation path.

      export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_411.jdk/Contents/Home"
      export PATH=${JAVA_HOME}/bin:${PATH}
    

    Run the following command to apply changes or restart the terminal:

      source ~/.zprofile
    
  • ssh: ssh must be installed and sshd must be running if you want to use the optional start and stop scripts, which manage the Hadoop daemons over ssh. On macOS this means enabling Remote Login (follow this - enable ssh on mac). A quick verification sketch for these prerequisites follows this list.
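
A minimal sketch to verify the prerequisites, assuming the Java 8 path exported above (adjust it to your own installation):

      java -version        # should report a 1.8.x (Java 8) runtime
      echo $JAVA_HOME      # should print the JDK installation path exported above
      ssh localhost exit   # the connection is refused if Remote Login/sshd is not enabled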

Step 2: Download and Install Hadoop

  • Download Hadoop from the Apache releases page here. We need the binaries for Mac: for an Intel-based Mac download the binary package, and for an Apple Silicon-based Mac download the binary-aarch64 package.
  • Extract the binary files and put them in the directory mentioned below (a download-and-extract sketch follows this list):
      /usr/local/hadoop-3.4.0/
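
As a sketch, the download and extraction can also be done from the terminal; the mirror URL and the 3.4.0 package below are assumptions, so substitute the link and package you actually chose above:

      # URL and version are assumptions; use the download link chosen above
      curl -O https://downloads.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz
      # extract into /usr/local/, which creates /usr/local/hadoop-3.4.0/
      sudo tar -xzf hadoop-3.4.0.tar.gz -C /usr/local/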
    

Step 3: Configure Hadoop Environment Variables

We need to update 5 configuration files in total, located at /usr/local/hadoop-3.4.0/etc/hadoop/. They are listed below.

  • core-site.xml

      <configuration>
      <property>
          <name>fs.defaultFS</name>
          <value>hdfs://localhost:9000</value>
      </property>
      </configuration>
    
  • hdfs-site.xml

      <configuration>
      <property>
          <name>dfs.replication</name>
          <value>1</value>
      </property>
      </configuration>
    
  • mapred-site.xml

      <configuration>
      <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
      </property>
      <property>
          <name>mapreduce.application.classpath</name>
          <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
      </property>
      </configuration>
    
  • yarn-site.xml

      <configuration>
      <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
      </property>
      <property>
          <name>yarn.nodemanager.env-whitelist</name>
          <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
      </property>
      </configuration>
    
  • hadoop-env.sh

      export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_411.jdk/Contents/Home"
    

Step 4: Add Hadoop Variables to the Shell Profile

Next, we will add the following path variables to the .bash_profile or .zprofile file (a quick sanity check follows the snippet).

export PATH=$PATH:/usr/local/hadoop-3.4.0/bin/
export HADOOP_CLASSPATH=$(hadoop classpath)
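
After saving the profile, reload it and run a quick sanity check (a sketch assuming the exports above are in place):

source ~/.zprofile       # reload the profile (or open a new terminal)
hadoop version           # should print the Hadoop 3.4.0 build information
echo $HADOOP_CLASSPATH   # should list the jars reported by hadoop classpath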

Step 5: Set Up Passwordless SSH to Localhost

Next, enable your user to ssh to the localhost without being prompted for a password, which can be done by following the steps below.

$ ssh localhost
# If you cannot ssh to localhost without a passphrase, execute the following commands:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
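
To confirm that passwordless login now works, try a non-interactive connection (a sketch; BatchMode simply disables password prompts so the command fails instead of asking for a passphrase):

$ ssh -o BatchMode=yes localhost 'echo passwordless ssh works'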

Step 6: Format NameNode

  • Change the current directory to /usr/local/hadoop-3.4.0/bin in the terminal.

      cd /usr/local/hadoop-3.4.0/bin
    
  • Execute the following command to format NameNode.

      hdfs namenode -format
    

Expected Output
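
As a quick check, the formatted NameNode metadata should now exist on disk. By default it lives under the system tmp directory (a sketch; the path below is the stock default and will differ if you change hadoop.tmp.dir as in the Troubleshooting section):

      ls /tmp/hadoop-$(whoami)/dfs/name/current/   # should list VERSION and fsimage files after a successful format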

Step 7: Run Hadoop

  • Change the current directory to /usr/local/hadoop-3.4.0/sbin in the terminal.

      cd /usr/local/hadoop-3.4.0/sbin
    
  • Execute the following command in the terminal.

      ./start-all.sh
    

Start Hadoop Daemons
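
start-all.sh is a convenience wrapper; newer Hadoop releases suggest starting HDFS and YARN separately, which is roughly equivalent to the following sketch:

      ./start-dfs.sh     # starts NameNode, DataNode, and SecondaryNameNode
      ./start-yarn.sh    # starts ResourceManager and NodeManager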

Step 8: Confirm Hadoop is Installed and Running

  • Execute the command jps in the terminal. The expected output is shown below:

Expected jps Output
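
If everything started correctly, jps should list roughly the following daemons (the process IDs here are illustrative and will differ on your machine):

      12345 NameNode
      12412 DataNode
      12503 SecondaryNameNode
      12621 ResourceManager
      12718 NodeManager
      12820 Jps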

Step 9: Check Hadoop Configurations

Open http://localhost:9870 in a web browser to see the current configuration for the Hadoop session.

Visit localhost:9870

Visit HDFS
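
You can also exercise HDFS from the command line; a minimal sketch, where the directory name is just an example:

      hdfs dfs -mkdir -p /user/$(whoami)   # create a home directory in HDFS (example path)
      hdfs dfs -ls /                       # list the HDFS root; the new directory should appear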

Step 10: Stop all Running Daemons for Hadoop

  • Execute the following command in the terminal.

      ./stop-all.sh
    

Troubleshooting

  • If NameNode doesn’t start: You might face issues after shutting down/restarting your system. This is because, by default, the NameNode metadata and other Hadoop data are stored in the system’s tmp storage, which is cleared after a shutdown/restart. Either re-format the NameNode or point Hadoop at a persistent directory, as shown in the sketch after this list.

  • If Node Manager doesn’t start: If you get the error shown in the screenshot below, double-check that you have added the JAVA_HOME line to the hadoop-env.sh file mentioned in Step 3.

Node Manager Error
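
One way to make the NameNode survive reboots is to point hadoop.tmp.dir at a persistent location in core-site.xml; the path below is only an example, so pick any writable directory:

      <!-- add to core-site.xml; the path is an example, any persistent writable directory works -->
      <property>
          <name>hadoop.tmp.dir</name>
          <value>/usr/local/hadoop-3.4.0/hdfs-data</value>
      </property>

After changing this, re-run hdfs namenode -format (this wipes any existing HDFS data) and restart the daemons.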