To use Hadoop in your Java application, there are multiple points to consider, such as where it will be executed and what performance you need. Depending on these parameters, you can choose to bundle all the libraries with your application or to use the libraries already available on the Hadoop cluster.
Usually, all the needed libraries must be shipped with your software. To do so, it is possible to:

- distribute your application jar together with all its dependency jars and reference them on the classpath;
- build a single "fat" (über) jar that unpacks all the dependencies' classes into one archive;
- build a single jar that embeds the dependency jars inside it (jar-in-jar).
For Hadoop, the first technique is not used as is, but the result is similar: all the Hadoop libraries sit in a local folder and are available on the classpath. You therefore do not need to ship them with your application, as long as you can access that local folder.
Your software’s other libraries still need to be bundled in one of the three previously described ways. The simplest are the last two, because distributing all the jars to all the nodes is more laborious than distributing a single jar (that jar would contain your application and all the third-party libraries, but not the Hadoop libraries).
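As a sketch of the fat-jar technique, the Maven Shade plugin can merge your application and its third-party dependencies into a single archive at package time (the version number here is an assumption; pin whichever release your build uses):

```xml
<!-- pom.xml fragment: build an über ("fat") jar during the package phase. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.4.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

With this in place, `mvn package` produces one jar containing your classes plus all compile-scope dependencies, which is the single file you would distribute.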
This document details when you should include the Hadoop libraries with your software, depending on where you will run it and on your use case.
In this case, all the Hadoop libraries are included inside your single jar. That lets your software run from anywhere, but it means that a big file will be distributed and replicated everywhere it needs to run, and that the bundled Hadoop version is fixed. That last point is important because it can cause a lot of issues if the cluster’s Hadoop version is different from the one in your application.
This versioning issue arises during binary communications with the Hadoop services, since they use Hadoop’s own compact binary RPC protocol (based on Writable serialization in older releases and on Protocol Buffers in newer ones). Whenever a field is added, removed or modified in the messages between versions, calls tend to break with obscure error messages. The most used of these services is HDFS, which manages and serves files. Since access to data is normally the main feature of a piece of software, that point is not negligible.
On the other hand, there is a workaround for that HDFS issue. Instead of using the binary protocol, you can use WebHDFS, a REST API that runs over plain HTTP. Being text-based, it is useful for operating across Hadoop versions, but there is a bandwidth price to pay since HTTP is more verbose. If your software needs to exchange large files, you will see a drop in throughput.
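For illustration, WebHDFS can be exercised with nothing but `curl` (the hostname, port and paths below are placeholders; the NameNode HTTP port also differs between Hadoop 2 and 3):

```shell
# List a directory through WebHDFS: pure HTTP, no Hadoop client jars needed.
curl -s "http://namenode.example.com:9870/webhdfs/v1/user/alice?op=LISTSTATUS"

# Read a file. The NameNode replies with a redirect to a DataNode; -L follows it.
curl -s -L "http://namenode.example.com:9870/webhdfs/v1/user/alice/data.csv?op=OPEN"
```

Because every byte travels as an HTTP response body, this is version-agnostic but noticeably slower than the binary protocol for large transfers.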
When using the Hadoop libraries installed on the cluster’s servers, the versioning issue disappears. For all applications started by Hadoop, as a Map/Reduce, Yarn or Spark application, the local libraries are automatically available on the classpath. If the software is started on a cluster node, but by a user in a shell, that user will need to append the Hadoop classpath to their own; the “hadoop classpath” command prints the list of paths to add.
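For example (the jar and class names are hypothetical), a user in a shell on a cluster node could launch a plain Java client like this:

```shell
# Print the local Hadoop jar and configuration directories.
hadoop classpath

# Append that list to your own classpath when starting the application,
# so the cluster's libraries and configuration are picked up automatically.
java -cp "my-app.jar:$(hadoop classpath)" com.example.MyHadoopClient
```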
Contrary to what some might think, using the local jars is not such an absurd idea, since it is already common practice in enterprise software built on Java EE. Think of Tomcat, WebSphere and all the other EE containers: they are not bundled with the application; instead, you deploy your application onto them. That makes your application usable on all the different flavours of these containers. Doing the same with Hadoop makes the software distribution- and version-agnostic. Since versions change quickly and there are so many distributions (Cloudera, MapR, Hortonworks, etc.), it makes sense not to need to recompile your application every time you upgrade or change your cluster.
A Hadoop installation has several components: the binaries to execute tasks from the shell, the Java libraries to use in your own applications, and the cluster’s configuration. Here are more details about each part.
To launch a Map/Reduce, Yarn or Spark application, you can create your own client that prepares and starts the final application on the cluster or, more simply, use the “hadoop”, “yarn” or “spark-submit” commands. With that last choice, the binaries need to be available in your PATH, and the cluster’s configuration must be present on that machine so the command knows where to execute the job.
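As a sketch (the jar, class names and paths are hypothetical), submitting a job through those commands looks like this:

```shell
# Run a Map/Reduce job packaged in my-app.jar; the cluster location comes
# from the Hadoop configuration files installed on this machine.
hadoop jar my-app.jar com.example.MyJob /input /output

# Submit a Spark application to Yarn, again relying on the local configuration.
spark-submit --master yarn --class com.example.MySparkJob my-app.jar
```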
Whenever you are on a cluster node, the configuration is present and available on the classpath. Whether your application is executed by Hadoop directly, or outside of it with the classpath produced by the “hadoop classpath” command, you can grab the configuration with a simple “new Configuration()” in your Java code. There is no need to know where the configuration files live on the machine, since they are on the classpath.
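A minimal sketch of that case, assuming the Hadoop client jars and the cluster configuration are already on the classpath (the class name is made up for the example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClasspathConfigExample {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml, hdfs-site.xml, etc. from the classpath
        // automatically; no file paths are hard-coded anywhere.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Default FS: " + conf.get("fs.defaultFS"));
        System.out.println("/tmp exists: " + fs.exists(new Path("/tmp")));
    }
}
```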
On the other hand, if you are creating a “fat jar”, you will need to know the path to the configuration files or, even worse, to include them in your “fat jar”. That is worse because if you then modify some parameters of your cluster, you will need to repackage your application, which removes part of Hadoop’s dynamic configuration.
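In the “fat jar” case, the configuration files have to be loaded explicitly; a sketch, with a hypothetical configuration directory:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// The cluster configuration is not on the classpath here, so its location
// must be supplied by hand (this path is an assumption; adapt it).
Configuration conf = new Configuration();
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
```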
Your application can be launched in three different places: on the cluster, on an edge node, or completely outside the cluster. Depending on that choice, certain limitations apply and constrain your final solution.
To run your software on a server that is not part of the Hadoop cluster, you have only one choice: since you have neither the binaries, nor the libraries, nor the configuration, you need to include all of them in your application.
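A sketch of this situation: everything is bundled, and the NameNode address (hypothetical below) is set by hand because no cluster configuration files are available locally:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Running completely outside the cluster: point the client at the cluster
// explicitly instead of relying on local configuration files.
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
FileSystem fs = FileSystem.get(conf);
```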
The main use cases for that setup are:
For an application that uses Hadoop services but is not a Map/Reduce, Yarn or Spark program, this is the recommended place to run. This machine is not part of the cluster itself, since it does not host any Hadoop services (no HDFS, no Yarn containers, etc.), but it contains all the Hadoop binaries, libraries and configuration files of the cluster it is associated with. That lets you launch your application with “hadoop classpath” to grab everything Hadoop-related.
Usually, on the cluster itself you should only run applications that execute inside the Hadoop environment: Map/Reduce tasks or Yarn services managed by Yarn itself. It is technically possible to execute an application the same way you would run it on an edge node, but if you do, its memory and processes will not be managed by Yarn, while Yarn resources could still be allocated on your node. That means your application will affect the cluster and the cluster will affect your application.
By using this location correctly, your application or task will be started by Hadoop, and you will automatically get the libraries and the configuration on your classpath. In this mode, if you create a “fat jar” that also contains the Hadoop libraries, you risk class collision issues even if the versions match. So yes, create a “fat jar” that includes all your non-Hadoop libraries, and no, do not include the Hadoop libraries in it.
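With Maven, the usual way to keep the Hadoop classes out of the fat jar is the `provided` scope (the `${hadoop.version}` property is an assumption; match it to your cluster):

```xml
<!-- pom.xml fragment: compile against the Hadoop client, but leave it out of
     the shaded jar — at run time it comes from the cluster's classpath. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
  <scope>provided</scope>
</dependency>
```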
| | Outside | Edge node | In the cluster |
| --- | --- | --- | --- |
| Hadoop binaries and jars | Not available | Available | Available |
| Cluster’s configuration | Not available | Available | Available |
| Available techniques | Fat jar with Hadoop libraries; configuration must be included or generated | Local Hadoop libraries (first choice); bundled Hadoop libraries (second choice) | Local Hadoop libraries (first choice); bundled Hadoop libraries (not recommended) |