Easy deployment of Zookeeper and Storm in RPM packages

In this post we will package Storm and its dependencies to achieve seamless deployment of a realtime big data processing system. Following up on the first Meteorit project article, we will add the minimal process supervisor mon, ZooKeeper, ZeroMQ and finally Storm itself. Packaging everything as RPMs enables fast deployment of the whole processing system.

Make sure you check out the code on Meteorit, which stands on the shoulders of giants to package a distributed fault-tolerant event and data processing system.

First, we provide a high-level overview of the packages added to the project, which will eventually interoperate with the subprojects shown in the first article.

Meteorit packages

Overview of the packages included in the project

  • meteorit-env – basic shell functions shared among packages
  • meteorit-mon – RPM packaging of the excellent mon(1) process supervisor
  • meteorit-zookeeper – RPM packaging of Apache ZooKeeper
  • meteorit-storm – RPM packaging of Storm

meteorit-compile-parent

This is a Maven parent POM project which does not build into a package itself. It was created in the previous article and has been expanded a little further. It defines a few common extra properties. The most interesting ones are binary.architecture_, which defines the CPU architecture the binaries are compiled for (default x86_64), and install.prefix_, which sets the installation prefix of all the packages in the project (default /opt).
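
As a rough illustration of how these two properties could be overridden at build time, here is a minimal sketch. It relies on standard Maven behaviour (a -D user property takes precedence over a property declared in the POM); the helper name mvn_package_cmd is hypothetical and is not part of Meteorit. It only prints the command so it can be inspected before running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Meteorit): print the mvn invocation that
# overrides the two parent POM properties described above.
mvn_package_cmd() {
    # $1 = installation prefix (defaults to /opt, as in the parent POM)
    # $2 = binary architecture (defaults to x86_64, as in the parent POM)
    printf 'mvn package -Dinstall.prefix_=%s -Dbinary.architecture_=%s\n' \
        "${1:-/opt}" "${2:-x86_64}"
}

# Example: show the command for a build installing under /usr/local
mvn_package_cmd /usr/local
```

Called without arguments, the helper simply reproduces the defaults, which should be equivalent to a plain mvn package.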

meteorit-env – setting up the environment

In this case, we provide bootstrap-style scripts to set up both the development and the deployment environments. Specifically, meteorit-build.sh installs all the packages needed to build Meteorit. These include the required compilers and utilities, which are standard packages installed using yum, as well as JDK 1.6 from OpenJDK and Apache Maven for builds. The script needs to be run as root and uses the following (optional) environment variables:

  • MAVEN_URL – URL to the Maven tgz, currently set to version 3.1.1
  • PREFIX – base folder where to install Maven, /opt by default, which means Maven will end up installed at /opt/apache-maven/apache-maven-x.x.x
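
To make the effect of PREFIX concrete, here is a minimal sketch of how a script could resolve the optional variable and derive the Maven installation folder described above. The function names are hypothetical and the real meteorit-build.sh may well do this differently.

```shell
#!/bin/sh
# Hypothetical sketch (not taken from meteorit-build.sh): resolve the optional
# PREFIX variable and compute where Maven would end up installed.
resolve_prefix() {
    # Use PREFIX from the environment when set and non-empty, otherwise /opt
    printf '%s\n' "${PREFIX:-/opt}"
}

maven_home() {
    # $1 = Maven version, for instance 3.1.1
    printf '%s/apache-maven/apache-maven-%s\n' "$(resolve_prefix)" "$1"
}

# Example: print the installation folder for Maven 3.1.1
maven_home 3.1.1
```

The `${VAR:-default}` expansion keeps the variable truly optional: running the script with no environment set yields the /opt layout, while `PREFIX=/usr/local` relocates everything.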

Review the script before running it, particularly because it needs root permissions.
Next is meteorit-bootstrap.sh, a script that installs the minimal standard software needed to deploy and run all the Meteorit packages. It is also run as root.
The RPM also installs some common shell script functions and helpers for the daemon init scripts. That common functionality is sourced by the relevant scripts whenever needed.

meteorit-backend-mon – monitoring processes with mon(1)

We’ll be using the simple and compact yet highly effective mon(1) process monitoring program so our daemon processes can run under supervision and be automatically restarted in case of failure. Configuring mon(1) is really easy as all settings are passed as flags to the mon command, which means there is no need to maintain complex setup files. To build it we download the code and run make with the appropriate PREFIX. To lay out the files we configure a series of mappings (translation of files from Maven space to RPM installation) as usual, leaving everything in place and not forgetting the LICENSE file.
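
To show what "all settings are passed as flags" looks like in practice, here is a minimal sketch that assembles a mon(1) command line. The flags used (-d to daemonize, -s for the sleep before restarting, -l for the log file, -p and -m for the pid files) are documented mon options. The helper name, the five second sleep and the /var/log and /var/run paths are assumptions for illustration, not the values used by the Meteorit init scripts.

```shell
#!/bin/sh
# Hypothetical helper (not part of Meteorit): print a mon(1) invocation that
# supervises a command as a daemon. Paths and the sleep value are assumptions.
mon_cmdline() {
    # $1 = daemon name, used to derive log and pid file names
    # $2 = command to supervise, passed to mon as one quoted argument
    name=$1
    cmd=$2
    printf 'mon -d -s 5 -l /var/log/%s/mon.log -p /var/run/%s.pid -m /var/run/%s.mon.pid "%s"\n' \
        "$name" "$name" "$name" "$cmd"
}

# Example: show how a supervised nimbus daemon could be launched
mon_cmdline nimbus '/opt/storm/bin/storm nimbus'
```

Printing the command instead of executing it makes it easy to check the flags before wiring them into an init script.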

meteorit-backend-zookeeper – cluster coordination with Apache Zookeeper

Next, we package Apache ZooKeeper to allow the Storm cluster to coordinate its various processes and daemons across machines. Packaging is a little more convoluted this time as we need to first download and patch the source in two places:

  • mt_adaptor.c – patch it so it compiles on non-x86 architectures
  • zoo.cfg – patch it so autopurge is enabled and the data folder is not in /tmp but in the location defined by the Maven property ${zookeeper.datafolder_}
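
The effect of the zoo.cfg change can be sketched with sed. Note that this is a swapped-in technique for illustration and not the patch shipped in the project. It assumes the input looks like the sample configuration distributed with ZooKeeper, where dataDir points to /tmp/zookeeper and the two autopurge settings are present but commented out. The function name and the example data folder are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch (the project applies its own patch instead): rewrite a
# zoo.cfg read from stdin so that the data folder moves out of /tmp and the
# commented-out autopurge settings become active.
patch_zoo_cfg() {
    # $1 = data folder; it must not contain '|' or '&', which sed would interpret
    sed -e "s|^dataDir=.*|dataDir=$1|" \
        -e 's|^#autopurge\.snapRetainCount=|autopurge.snapRetainCount=|' \
        -e 's|^#autopurge\.purgeInterval=|autopurge.purgeInterval=|'
}

# Example: patch a tiny configuration and print the result
printf 'dataDir=/tmp/zookeeper\n#autopurge.purgeInterval=1\n' | patch_zoo_cfg /var/zookeeper
```

Lines that do not match any of the three expressions pass through unchanged, so the rest of the configuration is preserved.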

Once that is done, we have Maven build ZooKeeper from source and add other resources such as a log4j configuration and the file zookeeper-env.sh. This environment file is read by the zookeeper init script and specifies a few basic properties such as the log destination folder.

Once we have everything compiled and the resources added, we can build the package with mvn package, which generates an RPM with everything needed to run ZooKeeper under the zookeeper user, including an init.d script. Please note that the RPM postinstall script neither modifies the system startup configuration nor opens any firewall ports. This should be done outside the script.
To test the installation, we can run the following command sequence:

The meteorit-backend-zookeeper-installtest.sh script leverages shUnit2 to connect to ZooKeeper using the command line clients included in the package and test that the connection is made successfully (do not forget to enable the relevant ports on the firewall!).
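
For a quick manual check that does not need shUnit2, ZooKeeper's ruok four-letter command can also be used: a healthy server replies imok. The sketch below is an alternative to the install test script and not part of it. It assumes nc is installed and that four-letter commands are enabled on the server (newer ZooKeeper releases restrict them behind a whitelist). The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch (not the project's install test): check that a ZooKeeper
# server is alive using the "ruok" four-letter command.
zk_reply_ok() {
    # Succeeds only when the reply read from stdin is exactly "imok"
    [ "$(cat)" = "imok" ]
}

zk_ruok() {
    # $1 = host (default localhost), $2 = client port (default 2181)
    # Assumption: nc exits once the server closes the connection; some nc
    # variants may need an extra timeout option.
    printf 'ruok' | nc "${1:-localhost}" "${2:-2181}" | zk_reply_ok
}

# Usage, once ZooKeeper is running:
#   zk_ruok localhost 2181 && echo "ZooKeeper is up"
```

Keeping the reply check in its own function means the parsing can be exercised without a running server.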

meteorit-backend-storm – data processing

Finally, we package Storm itself with some specific configuration options, such as patching the Logback configuration so log files sit under /var/log/storm and setting the Storm data directory to /var/storm. All these values are configurable as Maven properties, so the RPM package can easily be built with completely different folders by supplying -D values on the command line. The Maven build process fetches and builds the frozen JZMQ version that is guaranteed to work with Storm. Also included are the init.d files needed to start the required Storm daemons: nimbus, storm-supervisor and storm-ui.
Also included in the package is a shell script to change the main storm.yaml configuration file:

The init scripts leverage mon(1) to run the processes under supervision, so that if, say, nimbus fails, mon(1) will wait a few seconds and restart the process, as suggested in the Setting up a Storm cluster documentation.

Wrapping up

We have given an overview of the process of building a process supervisor, the ZooKeeper cluster coordinator and Storm itself. Building the whole setup on a CentOS 6.x system should be as easy as:

This will create all the needed RPMs. Once they are installed, we can fire up the Storm cluster on localhost like this:

And we have a complete Storm solution running on localhost, with software installed under /opt by default. Locations are intuitive: for instance, the storm executable can be found at /opt/storm/bin/storm and all the logs can be found under /var/log. The different RPMs can be installed on different nodes in a cluster for a complete distributed production system. Do not forget to read the documentation and other sources such as Michael G. Noll’s excellent blog post on running a cluster.

Full source code available at GitHub on the following URL: https://github.com/danigiri/meteorit. Pull requests, comments and general issues welcome.

Happy deploying!

DISCLAIMER: Please note that authors, copyright and licensing remain the original for all packages and I have included the appropriate copyright notices on the Maven files and on the distributed RPMs themselves.
