Install on a Hadoop Cluster
Here we provide instructions for installing and configuring dask-gateway-server
on a Hadoop cluster.
Create a user account
Before installing anything, you’ll need to create the user account which will
be used to run the dask-gateway-server process. The name of the user doesn’t
matter, only the permissions they have. Here we’ll use dask:
$ adduser dask
Enable proxy-user permissions
Dask-Gateway makes full use of Hadoop’s security model, and will start Dask
workers in containers with the requesting user’s permissions (e.g. if alice
creates a cluster, their Dask workers will be running as user alice). To
accomplish this, the gateway server needs proxy-user permissions, which allow
the Dask-Gateway server to perform actions impersonating another user.
For dask-gateway-server to work properly, you’ll need to enable proxy-user
permissions for the dask user account. The users dask has permission to
impersonate can be restricted to certain groups, and requests to impersonate
may be restricted to certain hosts. At a minimum, the dask user will require
permission to impersonate anyone using the gateway, with requests allowed from
at least the host running dask-gateway-server. These settings are typically
configured in your Hadoop core-site.xml:
<property>
<name>hadoop.proxyuser.dask.hosts</name>
<value>host-where-dask-gateway-is-running</value>
</property>
<property>
<name>hadoop.proxyuser.dask.groups</name>
<value>group1,group2</value>
</property>
If looser restrictions are acceptable, you may also use the wildcard *
to allow impersonation of any user or from any host.
<property>
<name>hadoop.proxyuser.dask.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.dask.groups</name>
<value>*</value>
</property>
See the proxy-user documentation for more information.
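After changing these settings you’ll generally need to restart the affected Hadoop services, or refresh the proxy-user configuration. On many clusters this can be done without a full restart (a sketch, assuming the standard hdfs/yarn admin CLIs are available to a suitably privileged user):
# Refresh proxy-user settings in HDFS and YARN
$ hdfs dfsadmin -refreshSuperUserGroupsConfiguration
$ yarn rmadmin -refreshSuperUserGroupsConfiguration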
Create install directories
A dask-gateway-server installation has three types of files, which will need
their own directories created before installation:
Software files: This includes a Python environment and all required libraries. Here we use /opt/dask-gateway.
Configuration files: Here we use /etc/dask-gateway.
Runtime files: Here we use /var/dask-gateway.
# Software files
$ mkdir -p /opt/dask-gateway
# Configuration files
$ mkdir /etc/dask-gateway
# Runtime files
$ mkdir /var/dask-gateway
$ chown dask /var/dask-gateway
Install a Python environment
To avoid interactions between the system Python installation and
dask-gateway-server, we’ll install a full Python environment into the
software directory. The easiest way to do this is to use miniconda, but this
isn’t a strict requirement.
$ curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o /tmp/miniconda.sh
$ bash /tmp/miniconda.sh -b -p /opt/dask-gateway/miniconda
$ rm /tmp/miniconda.sh
We also recommend adding miniconda to the root user’s path to simplify further
commands.
$ echo 'export PATH="/opt/dask-gateway/miniconda/bin:$PATH"' >> /root/.bashrc
$ source /root/.bashrc
Install dask-gateway-server
Now we can install dask-gateway-server and its dependencies.
$ conda install -y -c conda-forge dask-gateway-server-yarn
If you want to use Kerberos for user-facing authentication, you’ll also want to
install dask-gateway-server-kerberos:
$ conda install -y -c conda-forge dask-gateway-server-kerberos
Configure dask-gateway-server
Now we’re ready to configure our dask-gateway-server installation.
Configuration is written as a Python file (typically
/etc/dask-gateway/dask_gateway_config.py). Options are assigned to a config
object c, which is then loaded by the gateway on startup. You are free to use
any Python syntax or libraries you want in this file; the only things that
matter to dask-gateway-server are the values set on the c config object.
Here we’ll walk through a few common configuration options you may want to set.
Enable YARN integration
First you’ll want to enable YARN as the cluster backend.
# Configure the gateway to use YARN as the backend
c.DaskGateway.backend_class = (
"dask_gateway_server.backends.yarn.YarnBackend"
)
Enable Kerberos security (optional)
If your cluster has Kerberos enabled, you’ll need to create a principal and
keytab for the dask user, as well as an HTTP service principal for the host
running dask-gateway-server (if one doesn’t already exist). Keytabs can be
created on the command line as follows:
# Create the dask principal
$ kadmin -q "addprinc -randkey dask@YOUR_REALM.COM"
# Create the HTTP principal (if not already created)
$ kadmin -q "addprinc -randkey HTTP/FQDN"
# Create a keytab
$ kadmin -q "xst -norandkey -k /etc/dask-gateway/dask.keytab dask HTTP/FQDN"
where FQDN is the fully qualified domain name of the host running
dask-gateway-server.
Store the keytab file wherever you see fit (we recommend storing it along with
the other configuration in /etc/dask-gateway/, as above). You’ll also want to
make sure that dask.keytab is only readable by the dask user.
$ chown dask /etc/dask-gateway/dask.keytab
$ chmod 400 /etc/dask-gateway/dask.keytab
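Before continuing, you may want to sanity check the keytab. Assuming the standard MIT Kerberos client tools are installed, you can list its entries and try acquiring a ticket with it:
# Inspect the keytab contents
$ klist -kt /etc/dask-gateway/dask.keytab
# Try obtaining a ticket as the dask principal
$ sudo -u dask kinit -kt /etc/dask-gateway/dask.keytab dask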
To configure dask-gateway-server to use this keytab file, you’ll need to add
the following lines to your dask_gateway_config.py:
# Specify the location of the keytab file, and the principal name
c.YarnBackend.keytab = "/etc/dask-gateway/dask.keytab"
c.YarnBackend.principal = "dask"
# Enable Kerberos for user-facing authentication
c.DaskGateway.authenticator_class = "dask_gateway_server.auth.KerberosAuthenticator"
c.KerberosAuthenticator.keytab = "/etc/dask-gateway/dask.keytab"
Configure the server addresses (optional)
By default, dask-gateway-server will serve all traffic through 0.0.0.0:8000.
This includes both HTTP(S) requests (REST API, dashboards, etc.) and Dask
scheduler traffic.
If you’d like to serve at a different address, or serve web and scheduler traffic on different ports, you can configure the following fields:
c.Proxy.address - Serves HTTP(S) traffic, defaults to :8000.
c.Proxy.tcp_address - Serves Dask client-to-scheduler TCP traffic, defaults to c.Proxy.address.
Here we configure web traffic to serve on port 8000 and scheduler traffic to serve on port 8001:
c.Proxy.address = ':8000'
c.Proxy.tcp_address = ':8001'
Specify user Python environments
Since the Dask workers/schedulers will be running in their own YARN containers, you’ll need to provide a way for Python environments to be available to these containers. You have a few options here:
Install identical Python environments on every node
Archive environments to be distributed to the container at runtime (recommended)
In either case, the Python environment requires at least the dask-gateway
package be installed to work properly.
Using a local environment
If you’ve installed identical Python environments on every node, you only need
to configure dask-gateway-server to use the provided Python. This could be
done in a few different ways:
# Configure the paths to the dask-scheduler/dask-worker CLIs
c.YarnClusterConfig.scheduler_cmd = "/path/to/dask-scheduler"
c.YarnClusterConfig.worker_cmd = "/path/to/dask-worker"
# OR
# Activate a local conda environment before startup
c.YarnClusterConfig.scheduler_setup = 'source /path/to/miniconda/bin/activate /path/to/environment'
c.YarnClusterConfig.worker_setup = 'source /path/to/miniconda/bin/activate /path/to/environment'
# OR
# Activate a virtual environment before startup
c.YarnClusterConfig.scheduler_setup = 'source /path/to/your/environment/bin/activate'
c.YarnClusterConfig.worker_setup = 'source /path/to/your/environment/bin/activate'
Using an archived environment
YARN also provides mechanisms to “localize” files/archives to a container before starting the application. This can be used to distribute Python environments at runtime. This approach is appealing in that it doesn’t require installing anything throughout the cluster, and allows for centrally managing users’ Python environments.
Packaging environments for distribution is usually accomplished using:
conda-pack for conda environments
venv-pack for virtual environments (both venv and virtualenv supported)
Both are tools for taking an environment and creating an archive of it in a way that (most) absolute paths in any libraries or scripts are altered to be relocatable. This archive can then be distributed with your application, and will be automatically extracted during YARN resource localization.
Below we demonstrate creating and packaging a conda environment containing
dask-gateway, as well as pandas and scikit-learn. Additional packages could be
added as needed.
Packaging a conda environment with conda-pack
# Make a folder for storing the conda environments locally
$ mkdir /opt/dask-gateway/envs
# Create a new conda environment
$ conda create -c conda-forge -y -p /opt/dask-gateway/envs/example
...
# Activate the environment
$ conda activate /opt/dask-gateway/envs/example
# Install dask-gateway, along with any other packages
$ conda install -c conda-forge -y dask-gateway pandas scikit-learn conda-pack
# Package the environment into example.tar.gz
$ conda pack -o example.tar.gz
Collecting packages...
Packing environment at '/opt/dask-gateway/envs/example' to 'example.tar.gz'
[########################################] | 100% Completed | 17.9s
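If you manage environments with virtualenv/venv instead, the workflow with venv-pack is similar. A rough sketch, assuming python3 and pip are available on the node (the environment path used here is just an example):
# Create and activate a new virtual environment
$ python3 -m venv /opt/dask-gateway/envs/example-venv
$ source /opt/dask-gateway/envs/example-venv/bin/activate
# Install dask-gateway and any other packages, plus venv-pack itself
$ pip install dask-gateway pandas scikit-learn venv-pack
# Package the active environment into example-venv.tar.gz
$ venv-pack -o example-venv.tar.gz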
Using the packaged environment
It is recommended to upload the environments to some directory on HDFS
beforehand, to avoid repeating the upload cost for every user. This directory
should be readable by all users, but writable only by the admin user managing
Python environments (here we’ll use the dask user, and create a /dask-gateway
directory).
$ hdfs dfs -mkdir -p /dask-gateway
$ hdfs dfs -chown dask /dask-gateway
$ hdfs dfs -chmod 755 /dask-gateway
Uploading our already packaged environment to HDFS:
$ hdfs dfs -put /opt/dask-gateway/envs/example.tar.gz /dask-gateway/example.tar.gz
To use the packaged environment with dask-gateway-server, you need to include
the archive in YarnClusterConfig.localize_files, and activate the environment
in YarnClusterConfig.scheduler_setup/YarnClusterConfig.worker_setup.
c.YarnClusterConfig.localize_files = {
'environment': {
'source': 'hdfs:///dask-gateway/example.tar.gz',
'visibility': 'public'
}
}
c.YarnClusterConfig.scheduler_setup = 'source environment/bin/activate'
c.YarnClusterConfig.worker_setup = 'source environment/bin/activate'
Note that we set visibility to public for the environment, so that multiple
users can all share the same localized environment (reducing the cost of
moving the environments around).
For more information, see the Skein documentation on distributing files.
Additional configuration options
dask-gateway-server has several additional configuration fields. See the
Configuration Reference docs (specifically the YARN configuration docs) for
more information on all available options. At a minimum you’ll probably want
to configure the worker resource limits, as well as which YARN queue to use.
# The resource limits for a worker
c.YarnClusterConfig.worker_memory = '4 G'
c.YarnClusterConfig.worker_cores = 2
# The YARN queue to use
c.YarnClusterConfig.queue = '...'
If your cluster is under high load (and jobs may be slow to start), you may also want to increase the cluster/worker timeout values:
# Increase startup timeouts to 10 min (600 seconds) each
c.YarnClusterConfig.cluster_start_timeout = 600
c.YarnClusterConfig.worker_start_timeout = 600
Example
In summary, an example dask_gateway_config.py configuration might look like:
# Configure the gateway to use YARN as the backend
c.DaskGateway.backend_class = (
"dask_gateway_server.backends.yarn.YarnBackend"
)
# Specify the location of the keytab file, and the principal name
c.YarnBackend.keytab = "/etc/dask-gateway/dask.keytab"
c.YarnBackend.principal = "dask"
# Enable Kerberos for user-facing authentication
c.DaskGateway.authenticator_class = "dask_gateway_server.auth.KerberosAuthenticator"
c.KerberosAuthenticator.keytab = "/etc/dask-gateway/dask.keytab"
# Specify location of the archived Python environment
c.YarnClusterConfig.localize_files = {
'environment': {
'source': 'hdfs:///dask-gateway/example.tar.gz',
'visibility': 'public'
}
}
c.YarnClusterConfig.scheduler_setup = 'source environment/bin/activate'
c.YarnClusterConfig.worker_setup = 'source environment/bin/activate'
# Limit resources for a single worker
c.YarnClusterConfig.worker_memory = '4 G'
c.YarnClusterConfig.worker_cores = 2
# Specify the YARN queue to use
c.YarnClusterConfig.queue = 'dask'
Open relevant port(s)
For users to access the gateway server, they’ll need access to the public
port(s) set in Configure the server addresses (optional) above (by default
this is port 8000). How to expose ports is system specific - cluster
administrators should determine how best to perform this task.
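As one possible approach (a sketch, assuming a host firewall managed by firewalld and the default port of 8000), you might open the port with:
# Open the gateway's public port in the host firewall
$ firewall-cmd --permanent --add-port=8000/tcp
$ firewall-cmd --reload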
Start dask-gateway-server
At this point you should be able to start the gateway server as the dask user
using your created configuration file. The dask-gateway-server process will be
a long-running process - how you intend to manage it (supervisord, etc.) is
system specific; one possible supervisord setup is sketched after the startup
script below. The requirements are:
Start with dask as the user
Start with /var/dask-gateway as the working directory
Add /opt/dask-gateway/miniconda/bin to the path
Specify the configuration file location with -f /etc/dask-gateway/dask_gateway_config.py
For ease, we recommend creating a small bash script stored at
/opt/dask-gateway/start-dask-gateway to set this up:
#!/usr/bin/env bash
export PATH="/opt/dask-gateway/miniconda/bin:$PATH"
cd /var/dask-gateway
dask-gateway-server -f /etc/dask-gateway/dask_gateway_config.py
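If you happen to use supervisord to manage services, a minimal program definition wrapping this script might look like the following sketch (the program name and log path are just examples; adjust them to your system):
[program:dask-gateway-server]
command=/opt/dask-gateway/start-dask-gateway
user=dask
directory=/var/dask-gateway
autostart=true
autorestart=true
redirect_stderr=true
stdout_logfile=/var/dask-gateway/dask-gateway.log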
For testing, here’s how you might start dask-gateway-server manually:
$ cd /var/dask-gateway
$ sudo -iu dask /opt/dask-gateway/start-dask-gateway
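Once it’s running, a quick smoke test (assuming the default address of :8000, run from the same host) is to check that the web proxy responds to HTTP at all; any HTTP response indicates the server is up and listening:
$ curl -i http://localhost:8000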
Validate things are working
If the server started with no errors, you’ll want to check that things are working properly. The easiest way to do this is to try connecting as a user.
A user’s environment requires the dask-gateway library be installed. If your
cluster is secured with Kerberos, you’ll also need to install
dask-gateway-kerberos.
# Install the dask-gateway client library
$ conda create -n dask-gateway -c conda-forge dask-gateway
# If kerberos is enabled, also install dask-gateway-kerberos
$ conda install -n dask-gateway -c conda-forge dask-gateway-kerberos
You can connect to the gateway by creating a dask_gateway.Gateway object,
specifying the public address (note that if you configured
c.Proxy.tcp_address you’ll also need to specify the proxy_address).
>>> from dask_gateway import Gateway
# When running without kerberos
>>> gateway = Gateway("http://public-address")
# OR, if kerberos is enabled, you'll need to kinit and then do
>>> gateway = Gateway("http://public-address", auth="kerberos")
You should now be able to make API calls. Try
dask_gateway.Gateway.list_clusters(); this should return an empty list.
>>> gateway.list_clusters()
[]
Next, see if you can create a cluster. This may take a few minutes.
>>> cluster = gateway.new_cluster()
The last thing you’ll want to check is if you can successfully connect to your newly created cluster.
>>> client = cluster.get_client()
If everything worked properly, you can shut down your cluster with
dask_gateway.GatewayCluster.shutdown().
>>> cluster.shutdown()