Spark is a fast and general engine for large-scale data processing and computing on a distributed cluster. Apache Spark provides a simple standalone deploy mode that uses its own resource manager and allows the creation of a distributed master-slave architecture.
Here is a quick step-by-step guide on how to set up a Spark Standalone Cluster on Qarnot, connect to it via SSH tunneling and submit a Spark application that counts the number of words in the Iliad.
If you are interested in another version, please send us an email at qlab@qarnot.com.
Before starting a calculation with the Python SDK, a few steps are required: create a Qarnot account, retrieve your authentication token from the console, and install the Python SDK (pip install qarnot).
Note: in addition to the Python SDK, Qarnot provides C# and Node.js SDKs and a command-line interface.
This tutorial will showcase how to start a Qarnot Spark cluster from your computer by following these simple steps: set up your working environment, launch the cluster with the Python SDK, connect to it via SSH tunneling, submit the word-count application and retrieve the results.
Before moving forward you should set up your working environment to contain the following files, which can be downloaded from here:

input/iliad.txt
: text file containing the Iliad to be counted on Qarnot

input/word_count.py
: script for counting the number of words in iliad.txt

ssh-spark.py
: script for starting the cluster on Qarnot, see below

Once everything is set up, use the following script in your terminal to start the cluster on Qarnot.
Be sure to copy your authentication token and your public SSH key (in place of <<<MY_SECRET_TOKEN>>> and <<<PUBLIC_SSH_KEY>>>) in the script ssh-spark.py. By default, your public SSH key can be found in ~/.ssh/<<<ssh_key>>>.pub.

<<<MY_SECRET_TOKEN>>>
: in the call to qarnot.Connection

<<<PUBLIC_SSH_KEY>>>
: in the DOCKER_SSH task constant

<<<PORT>>>
: in the ssh command, set to the port you want to use for SSH tunneling with Qarnot

ssh-spark.py
#!/usr/bin/env python3
# Import the Qarnot SDK
import sys
import argparse
import qarnot
import subprocess
# Connect to the Qarnot platform
conn = qarnot.Connection(client_token = '<<<MY_SECRET_TOKEN>>>')
# Create a task
task = conn.create_task('Hello World - SSH-Spark', 'docker-cluster-network-ssh', 3)
# Create a resource bucket and add input files
input_bucket = conn.create_bucket('ssh-spark-in')
input_bucket.sync_directory('input/')
# Attach the bucket to the task
task.resources.append(input_bucket)
# Create a result bucket and attach it to the task
task.results = conn.create_bucket('ssh-spark-out')
# Task constants are the main way of controlling a task's behaviour
task.constants['DOCKER_SSH'] = '<<<PUBLIC_SSH_KEY>>>'
task.constants['DOCKER_REPO'] = "qarnotlab/spark"
task.constants['DOCKER_TAG'] = "v3.1.2"
task.constants['DOCKER_CMD_MASTER'] = "/bin/bash \
/opt/start-master.sh ${INSTANCE_COUNT} /opt/log.sh"
task.constants['DOCKER_CMD_WORKER'] = "/bin/bash -c '/opt/start-worker.sh /opt/log.sh && sleep infinity'"
# Submit the task to the API; it will be launched on the cluster
task.submit()
error_happened = False
# Initial values for forward port and ssh tunneling bool
ssh_forward_port = None
ssh_tunneling_done = False
# Get optional terminal app name from argument
parser = argparse.ArgumentParser()
parser.add_argument("--terminal", type=str,
                    help="Unix terminal app to be used for ssh connection",
                    default='gnome-terminal', required=False)
args = parser.parse_args()
# If not provided by the user it will be set to gnome-terminal by default
terminal = args.terminal
done = False
while not done:
    # Wait for the task to be FullyExecuting
    if task.state == 'FullyExecuting':
        # If the ssh connection was not done yet and the list active_forward is available (len != 0)
        active_forward = task.status.running_instances_info.per_running_instance_info[0].active_forward
        if not ssh_tunneling_done and len(active_forward) != 0:
            ssh_forward_port = active_forward[0].forwarder_port
            # Once the port has been fetched, spawn a new terminal with the ssh command
            cmd = 'ssh -L <<<PORT>>>:localhost:6060 -o StrictHostKeyChecking=no root@forward01.qarnot.net -p ' + str(ssh_forward_port)
            # If --terminal was set to off, bypass this and connect manually with ssh
            if terminal != 'off':
                ssh_cmd = [terminal, '-e', cmd]
                subprocess.call(ssh_cmd)
            else:
                print("\n** Run this command in your terminal to connect via ssh **\n", cmd,
                      "\n**********************************************************")
            # Set this var to True in order to not run ssh again
            ssh_tunneling_done = True
    # Sync output files of the Qarnot machine with the bucket,
    # then the bucket with your local directory
    task.instant()
    done = task.wait(5)

task.download_results('outputs')
To launch this script, simply copy the above code into a Python file and execute python3 ssh-spark.py in your terminal.
By default, it will connect you to Qarnot via SSH in a gnome-terminal. If you do not have this terminal app installed or wish to use another one, you can run python3 ssh-spark.py --terminal=<<<unix-terminal-app>>>. If you prefer to run the SSH command yourself, set --terminal=off and the script will only print the command.
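For reference, the three ways of launching it look like this (xterm is only an example of an alternative terminal emulator):

# Default: spawns a gnome-terminal with the SSH tunnel already opened
python3 ssh-spark.py
# Use another terminal emulator (example value)
python3 ssh-spark.py --terminal=xterm
# Do not spawn a terminal, only print the ssh command to run yourself
python3 ssh-spark.py --terminal=off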
Once a new terminal spawns on your end, it means that the SSH connection with the cluster is established. All you have to do is run the following commands in your ssh terminal:
# commands to source the environment variables in the shell
PID=$(cat /tmp/pid_start_master)
. <(xargs -0 bash -c 'printf "export %q\n" "$@"' -- < /proc/$PID/environ)
# submit an app to the cluster to count the number of words in the Iliad
/spark/bin/spark-submit --master $SPARK_MASTER --driver-memory 2G --executor-memory 14G /job/word_count.py `sed -n 1p /share/qhost`
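The counting itself is done by the word_count.py provided in the archive; its exact contents are not reproduced here, but a minimal PySpark script along the following lines would do the same job. The input path /job/iliad.txt and the handling of the hostname argument are assumptions made for illustration; only the output file name output_iliad.txt is taken from this tutorial.

#!/usr/bin/env python3
# Hypothetical sketch of word_count.py -- prefer the version from the downloaded archive.
# Assumptions: iliad.txt is available as /job/iliad.txt on the nodes, and sys.argv[1]
# receives the first line of /share/qhost (presumably the master's hostname), which is
# not needed here because --master is already passed to spark-submit.
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iliad-word-count").getOrCreate()

# Split every line of the text into words and count them all
lines = spark.sparkContext.textFile("/job/iliad.txt")
n_words = lines.flatMap(lambda line: line.split()).count()

# Write the result where the output bucket will pick it up
with open("output_iliad.txt", "w") as out:
    out.write("The Iliad contains {} words\n".format(n_words))

spark.stop()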
If you see the terminal pop up and quickly disappear, it most likely means that the port you chose is currently busy and the connection cannot be established. You can change which port to use and try again.
The Spark UI dashboard is forwarded through the SSH tunnel: type localhost:<<<PORT>>>
in your browser to monitor your cluster's status and the different applications you have submitted.
At any given time, you can monitor the status of your task on our platform.
The output of the count is written to a text file output_iliad.txt
that can be found in your output bucket ssh-spark-out
, along with the master and worker logs.
Once you are done with the task, just type exit
in the ssh terminal to close the tunnel, and make sure to abort the task from our platform. If you do not abort the task manually, it will keep running and consume your credits.
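Aborting can also be done with the Python SDK; here is a minimal sketch, assuming the task still carries the name given in ssh-spark.py and that your token is filled in:

#!/usr/bin/env python3
# Sketch: abort the Spark task with the Python SDK instead of the web console.
import qarnot

conn = qarnot.Connection(client_token='<<<MY_SECRET_TOKEN>>>')

# Look for the task submitted by ssh-spark.py and abort it so it stops consuming credits
for task in conn.tasks():
    if task.name == 'Hello World - SSH-Spark':
        task.abort()
        print('Aborted task', task.uuid)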
That’s it! If you have any questions, please contact qlab@qarnot.com and we will help you with pleasure!