How to build and run in Docker

Sawood Alam edited this page May 10, 2017 · 23 revisions

This document describes how to build OpenWayback from source and run it, all within Docker. This is handy for development and for testing in different environments. The OpenWayback source code includes a Dockerfile. The generated Docker image is kept minimal, which also makes it suitable for running in production.

Requirements

Docker (version 17.05 or later is required for building the image).

Building

Acquire the source code.

$ git clone https://github.com/iipc/openwayback.git
$ cd openwayback

Make any changes to the source code if needed, then build the Docker image.

$ docker image build -t openwayback .

This downloads dependencies, compiles the code, runs the tests, packages the application, and places the necessary components in the appropriate locations to build a minimal Docker image named openwayback. The process may take a while, depending on network bandwidth and processor speed. It uses Docker's multi-stage build feature to exclude the compile-time environment and dependencies from the final image, which makes the image both smaller and more secure.

By default, the source is built using the latest versions of Maven and JDK, and the image is then packaged with the latest versions of Tomcat and JRE. However, it is possible to build and package with custom combinations of these dependencies using the MAVEN_TAG and TOMCAT_TAG build arguments. These variations can be helpful for both testing and production needs, without any changes to the Dockerfile.

$ docker image build \
    --build-arg=MAVEN_TAG=3.5-jdk-7 \
    --build-arg=TOMCAT_TAG=7-jre7-alpine \
    -t openwayback:custom .

The above command builds an image named openwayback with the tag custom. The source code is built using Maven 3.5 with JDK 7, and the built artifacts are then packaged in a small Alpine Linux image with Tomcat 7 and JRE 7. See the available values of the MAVEN_TAG and TOMCAT_TAG build arguments.

Another build argument, SKIP_TEST, is available and is set to false by default. To skip the tests, add --build-arg=SKIP_TEST=true to the Docker build command.
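
As a sketch of how the three build arguments combine, the helper below wraps the build command in a shell function. The function name `owb_build` and its positional parameters are hypothetical and introduced only for illustration; the build arguments themselves are the ones described above.

```shell
# Hypothetical helper: owb_build [IMAGE] [MAVEN_TAG] [TOMCAT_TAG] [SKIP_TEST]
# Empty or omitted tags fall back to the defaults in the Dockerfile.
owb_build() {
    image="${1:-openwayback}"
    maven_tag="${2:-}"
    tomcat_tag="${3:-}"
    skip_test="${4:-false}"

    # Start with the fixed part of the command, then append optional arguments.
    set -- image build "--build-arg=SKIP_TEST=${skip_test}"
    if [ -n "$maven_tag" ]; then
        set -- "$@" "--build-arg=MAVEN_TAG=${maven_tag}"
    fi
    if [ -n "$tomcat_tag" ]; then
        set -- "$@" "--build-arg=TOMCAT_TAG=${tomcat_tag}"
    fi

    # Run from the root of the cloned repository, where the Dockerfile lives.
    docker "$@" -t "$image" .
}

# Example: skip the tests and package with Tomcat 7 on Alpine.
# owb_build openwayback:custom 3.5-jdk-7 7-jre7-alpine true
```

Skipping the tests shortens the build, so it is probably best reserved for rebuilds of source that has already passed them.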

Running

The default configuration of OpenWayback uses the automatic BDB indexer and expects WARC files at ${WAYBACK_BASEDIR}/files1/ or ${WAYBACK_BASEDIR}/files2/. By default, WAYBACK_BASEDIR is set to the /data volume in the Docker image. Create the necessary directory structure on the host machine for testing and populate it with some test files.

$ mkdir -p /tmp/owb/files1
$ cp /path/to/sample/*.warc /tmp/owb/files1/

Run a Docker container with appropriately mounted volumes and port mapping. By default, the container runs the Tomcat server.

$ docker container run -it --rm -v /tmp/owb:/data -p 8080:8080 openwayback

Once the WARC files are indexed, they should be ready for lookup at http://localhost:8080/.

OpenWayback allows certain configuration overrides through environment variables that can be set when running a container, but these customizations are very limited.

WAYBACK_HOME=/usr/local/tomcat/webapps/ROOT/WEB-INF
WAYBACK_BASEDIR=/data
WAYBACK_URL_SCHEME=http
WAYBACK_URL_HOST=localhost
WAYBACK_URL_PORT=8080
WAYBACK_URL_PREFIX=http://localhost:8080

However, by strategically mounting certain volumes, it is possible to run the OpenWayback server with custom configuration files.

$ docker container run -it --rm -p 8080:8080 \
    -v /tmp/owb:/data \
    -v /path/to/custom/wayback.xml:/usr/local/tomcat/webapps/ROOT/WEB-INF/wayback.xml \
    -v /path/to/custom/CDXCollection.xml:/usr/local/tomcat/webapps/ROOT/WEB-INF/CDXCollection.xml \
    openwayback

This way of mounting configuration files is handy for testing. For production, however, it is better to create a derived image that overrides the configuration files with custom ones.
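
A derived image could be created along the following lines. This is a minimal sketch, not the project's prescribed method: the function name `owb_derive` and the image tag `openwayback:configured` are hypothetical, and the target path is the WEB-INF directory used in the volume mounts above.

```shell
# Hypothetical helper: owb_derive CONFIG_DIR [IMAGE]
# CONFIG_DIR must contain the custom wayback.xml and CDXCollection.xml.
owb_derive() {
    config_dir="$1"
    image="${2:-openwayback:configured}"

    # Write a two-line Dockerfile next to the configuration files.
    cat > "${config_dir}/Dockerfile" <<'EOF'
FROM openwayback
COPY wayback.xml CDXCollection.xml /usr/local/tomcat/webapps/ROOT/WEB-INF/
EOF

    # Build the derived image using CONFIG_DIR as the build context.
    docker image build -t "$image" "$config_dir"
}

# Example:
# owb_derive /path/to/custom openwayback:configured
```

The resulting image carries its configuration with it, so only the data volume needs to be mounted at run time.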

Utilities

The Docker image contains various executable utilities with their necessary dependencies that can be used in one-off mode. The following command illustrates one possible usage of the cdx-indexer to index WARC files into CDX files on the host machine with appropriate volume mounting while utilizing a one-off container.

$ docker container run -it --rm -v /tmp/owb:/data openwayback cdx-indexer /data/files1/sample1.warc > /tmp/owb/index1.cdx

Alternatively, access the bash prompt of the container to run utility scripts inside or perform debugging.

$ docker container run -it --rm -v /tmp/owb:/data openwayback bash
[CONTAINER ID]# cdx-indexer /data/files1/sample1.warc > /data/index1.cdx

IMPORTANT: If you use the sort command to sort CDX files, you must set the environment variable LC_ALL=C. This makes sort compare by byte value, which matches the order in which OpenWayback expects CDX indexes to be sorted.
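
As a short illustration of the note above, the following sketch sorts a CDX file in byte order. The function name `sort_cdx` is hypothetical, and the example paths follow the ones used earlier on this page.

```shell
# Hypothetical helper: sort_cdx INPUT OUTPUT
# LC_ALL=C makes sort compare raw byte values instead of applying
# locale-specific collation rules.
sort_cdx() {
    LC_ALL=C sort "$1" > "$2"
}

# Example:
# sort_cdx /tmp/owb/index1.cdx /tmp/owb/index1.sorted.cdx
```

The variable is set only for the single sort invocation, so the rest of the shell session keeps its usual locale.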
