SLING workshop FRI 9.11.2017

Prerequisites for using the SLING infrastructure

For this workshop the environment is already set up and the demo certificate is installed on the system. We will work without VO membership; the cluster is configured to accept your certificates.

Otherwise, in order to use the SLING infrastructure or follow this workshop on your own computer, you need to follow the procedure below.

  1. Request SiGNET certificate
  2.  Certificate installation

    openssl pkcs12 -in usercert.p12 -clcerts -nokeys -out usercert.pem
    openssl pkcs12 -in usercert.p12 -nocerts -out userkey.pem
    chmod 400 userkey.pem
    chmod 644 usercert.pem
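    mkdir -p ~/.arc   # make sure the ~/.arc directory exists before moving the files into it (only needed on a fresh installation)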
    mv userkey.pem ~/.arc/
    mv usercert.pem ~/.arc/
  3.  Virtual organization membership
    • In order to use the grid, all researchers and students must be members of at least one virtual organization (VO).
    • Slovenian supercomputing infrastructure can be used by joining the national VO gen.vo.sling.si. To join, follow this page: https://voms.sling.si:8443/voms/gen.vo.sling.si
    • FRI has its own virtual organization. Join this VO as well (some clusters give this VO higher priority): https://voms.sling.si:8443/voms/fri.vo.sling.si
    • With the SiGNET certificate you can also use EGI (European Grid Initiative) resources by joining an international VO, based on your project membership or field of research. The list of international VOs is available at this address: https://operations-portal.egi.eu/vo/search
  4. EGI-trustanchors repository and CA installation

    • Detailed instructions are available in the ARC Client installation instructions (see section 5), general information is available here: https://wiki.egi.eu/wiki/EGI_IGTF_Release
    • Note: CAs need to be updated regularly. It is best to install the fetch-crl package (the fetch-crl tool updates the CRLs for the installed CAs when run) and set it up as a cron job (run it at least once a day). Fetch-crl must be run as root.
  5. ARC client installation and VOMS settings
    • Follow the ARC Client installation instructions that are available on this page (mostly in Slovenian): http://www-old.sling.si/sling/uporabniki/uporabniski-vmesniki/
    • VOMS settings for gen.vo.sling.si and fri.vo.sling.si are summarized here:
      cd ~/.arc
      mkdir -p vomsdir/gen.vo.sling.si
      cat vomsdir/gen.vo.sling.si/voms.sling.si.lsc 
      /C=SI/O=SiGNET/O=SLING/CN=voms.sling.si
      /C=SI/O=SiGNET/CN=SiGNET CA
      
      cat vomses/gen.vo.sling.si 
      "gen.vo.sling.si" "voms.sling.si" "15001" "/C=SI/O=SiGNET/O=SLING/CN=voms.sling.si""gen.vo.sling.si"
      cd ~/.arc
      mkdir -p vomsdir/fri.vo.sling.si
      cat vomsdir/fri.vo.sling.si/voms.sling.si.lsc 
      /C=SI/O=SiGNET/O=SLING/CN=voms.sling.si
      /C=SI/O=SiGNET/CN=SiGNET CA
      
      cat vomses/fri.vo.sling.si 
      "fri.vo.sling.si" "voms.sling.si" "15003" "/C=SI/O=SiGNET/O=SLING/CN=voms.sling.si" "fri.vo.sling.si"

The above set-up is required for using the infrastructure from your own computer. If you are using it through the pre-prepared virtual machine image, please proceed to the following sections.

Anatomy of a Supercomputing Cluster Network

Scaling the CPU power

While developments in processor and system architecture have yielded multi-socket systems with multi-core processors, and operating systems these days support the underlying NUMA architectures, many tasks surpass what can be achieved with a single computer system.

There are different architectures that enable the construction of single-system-image systems, where a number of computers with a low-latency network interconnect behave as a single machine, but such systems are difficult to manage and they stretch the operating system paradigm so far across the non-uniform layers that efficiency becomes a serious problem.

For this reason, most heavy computing tasks are performed on networked computing clusters (in specialized computing centres or on cloud infrastructure). Hundreds of individual computers (called “worker nodes”) are interconnected with fast networks, often low-latency networks (InfiniBand and custom variants), provided with networked file systems and managed by specialized servers.

Managing the Load

It would be very difficult to use such a system efficiently if every user were to connect to individual machines and run their applications directly on them. Traditionally, the problem is solved by the use of a “batch system” that accepts information on what application the user wants to run and then runs it on one or several of the computers when they become available. If all the computers are in use, the request (“the job”) is put in a queue and waits for the resources to become available.

Batch systems have been in use for half a century because they offer a number of advantages when managing a large computing resource for many users:

  • load balancing – the system can assign the jobs to the best resources
  • fair share use – the data on system use is available, so the system can be configured to allow a fair share of the resources to be used by users, groups etc., with reservations and manual overrides to manage special situations
  • better resource usage – when sufficient data is available, the scheduling of jobs can be performed in such a way as to maximize resource availability – for example, instead of spreading out small jobs, they can be concentrated to allow some computers to be completely freed up for computing tasks with large memory requirements etc.

A number of important additional features have developed in the supercomputing and grid community. One of the more important ones from the user perspective is the management of software. When starting their work, it is natural for users to require the most recent computing environment, compilers, libraries and applications. But when they are nearing the end of their project, they need their environment to remain stable. To facilitate these requirements, supercomputing systems package the software in “runtime environments”. There are different technologies – the most pervasive (installable on any POSIX-like system) being “environment modules”, and more recently various lightweight virtualization technologies.
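For illustration only: on systems that provide software through environment modules, a session typically looks like this (the module names below are hypothetical examples, not a list of what is actually installed):

module avail                 # list the software made available through modules
module load gcc/7.2.0        # load a specific compiler version (hypothetical name)
module load openmpi/2.0.2    # load a matching MPI library (hypothetical name)
module list                  # show what is currently loaded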

Data

In times where “Big Data” is a term we use every day, it is worth considering the data rates involved: on modern clusters suitably equipped with High Throughput Computing software and features, many jobs process from 100 to 250 GB in 5 to 6 hour runs, which across a cluster can amount to around 112 TB per hour or roughly 260 Gbit/s. With this in mind, data management is a serious matter.
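As a back-of-the-envelope check of these figures (the job count is an assumption for illustration, not stated above): roughly 3000 concurrent jobs, each processing about 225 GB over 6 hours, give 3000 × 225 GB / 6 h ≈ 112 TB/h; spread over 3600 seconds that is about 31 GB/s, i.e. roughly 250 Gbit/s, in the same ballpark as the figures above.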

Conceptually, clusters feature the following data spaces:

  • scratch space – local to your process, directly on the node (or in a high-speed localized storage on some supercomputers)
  • networked job home – on a (possibly parallel) high-speed networked file system, where all the processes in a single job are guaranteed to access the same disk space – so data sharing is possible
  • mid-term storage space – usually a larger data storage system in the same or even in a remote data centre, from which data can be retrieved before and to which it can be stored after processing – but which is too remote to be accessed during the run.

Distributed Network

Grid middleware is used for better load balancing, data management and uniform distribution of work in grids and in supercomputing centres. The use of grid middleware is important because it provides similar, compatible interfaces on different clusters, data management (including remote storage and replication, and data expulsion), user management and better resource utilization – since grid interfaces require users to provide job requirement descriptions.

ARC middleware – how does it work?

ARC server provides 3 main interfaces:

  • job submission interface
  • file access interface
  • information query interface

ARC CE main services:

  • Grid-manager = the execution service (communicates with LRMS)
  • Information system services
  • GridFTP Server (GFS)

The interfaces (source: nordugrid.org)

 

User mapping (source: nordugrid.org):


Job handling by A-REX (source: nordugrid.org):

Job submission:

ARC client usage

The ARC client is the part of the grid middleware that is used directly by the user to manage their jobs and data. The official documentation is available on the NorduGrid website (nordugrid.org).

A few common commands are necessary to be able to use the system.

To see available services and interfaces of a certain cluster:

$ arcinfo jost.arnes.si
Computing service: Arnes (production)
Information endpoint: ldap://jost.arnes.si:2135/Mds-Vo-Name=local,o=grid
Information endpoint: ldap://jost.arnes.si:2135/Mds-Vo-Name=resource,o=grid
Information endpoint: ldap://jost.arnes.si:2135/o=glue
Information endpoint: https://jost.arnes.si:443/arex
Information endpoint: https://jost.arnes.si:443/arex
Submission endpoint: https://jost.arnes.si:443/arex \
(status: ok, interface: org.ogf.bes)
Submission endpoint: https://jost.arnes.si:443/arex \
(status: ok, interface: org.ogf.glue.emies.activitycreation)
Submission endpoint: gsiftp://jost.arnes.si:2811/jobs \
(status: ok, interface: org.nordugrid.gridftpjob)

There are two interfaces that can be used for job submission: HTTPS or GridFTP. You can specify the preferred protocol in the client configuration (typically ~/.arc/client.conf); see the examples below.

For GRIDFTP:

[computing/jost] 
url=ldap://jost.arnes.si:2135 
infointerface=org.nordugrid.ldapng 
submissioninterface=org.nordugrid.gridftpjob

For HTTPS:

[computing/jost]
url=https://jost.arnes.si:443/arex
infointerface=org.ogf.glue.emies.resourceinfo
submissioninterface=org.ogf.glue.emies.activitycreation
default=yes

Create a proxy

Before using the cluster (before job submission, job retrieval etc.) a proxy certificate has to be created from your certificate. By default, the proxy certificate is only valid for 12 hours (it can be extended up to 24 hours). For long-term proxies, the myproxy service has to be used, or the proxy has to be renewed with arcproxy.

arcproxy #create a proxy
arcproxy -S gen.vo.sling.si #create a proxy for gen.vo.sling.si
arcproxy -I #check the proxy information
arcproxy -S gen.vo.sling.si -c validityPeriod=24h -c vomsACvalidityPeriod=24h
arcrenew #renew the proxies of jobs already submitted to the cluster (after creating a new local proxy)

Useful commands for job management

#job submission
arcsub
arcsub -c jost.arnes.si test.xrsl
arcsub -c jost.arnes.si test.xrsl -d DEBUG #debug mode

#check status of the job
arcstat
arcstat JOBID  #check job with ID
arcstat -a #check the status of all submitted jobs

#check stdout/stderr of a running job
arccat 
arccat JOBID or arccat -a

#retrieve results (they are deleted from the cluster)
arcget 
arcget JOBID or arcget -a
arcget -K #keep the results on the cluster after retrieval

#cancel/kill the jobs
arckill JOBID #cancel an active job (use -a for all active jobs)
arcclean #remove job records from the cluster

#resume the jobs
arcresume #required after arcrenew

Check the help and man pages for more detail.

Job statuses

Basic commands for data management

Data management on external location (e.g. dcache.arnes.si):

arcls #list content of the directory
arccp #copy
arcrm #remove
arcmkdir #create folder
arcrename #rename a file
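For illustration, assuming your files live in the gen.vo.sling.si area on dcache.arnes.si used in the xRSL examples below:

arcls gsiftp://dcache.arnes.si/data/arnes.si/gen.vo.sling.si/ #list the directory
arccp gsiftp://dcache.arnes.si/data/arnes.si/gen.vo.sling.si/file1.py file1.py #download file1.py to the current directory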

Other useful ARC commands

arcinfo jost.arnes.si #check the status of the cluster
arcsync #sync the job list from the cluster
arctest #send a test job

How to debug a failed job?

If a job fails, you will get error messages from the grid-manager. See the grid-manager log files in the retrieved log folder. The name of the folder depends on the gmlog setting in your xRSL, for example:

(gmlog="log")

For example:

cat log/failed
LRMS error: (-1) Job was cancelled

This error tells us that the job was cancelled. If the user did not cancel it, the error message doesn’t explain much. In this case, we have to check the errors log:

cat log/errors
job submitted successfully!
local job id: 2078340
----- exiting submit_slurm_job -----

2017-10-04T12:36:46Z Job state change SUBMIT -> INLRMS   Reason: Job is passed to LRMS
------- Contents of output stream forwarded by the LRMS ---------
slurmstepd: error: Job 2078340 exceeded virtual memory limit (134084 > 4096), being killed
/bin/bash: error while loading shared libraries: libdl.so.2: failed to map segment from shared object
slurmstepd: error: Exceeded job memory limit
------------------------- End of output -------------------------
2017-10-04T12:37:41Z Job state change INLRMS -> FINISHING   Reason: Job processing error
2017-10-04T12:37:41Z Job state change FINISHING -> FINISHED   Reason: Job processing error

If you can see “job submitted successfully”, it is obvious that nothing went wrong on the server side, since the job got to the LRMS. In the LRMS output, we can see that the job was killed because it exceeded the virtual memory limit. We have to increase the required amount of memory in the job description file.
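For example, assuming the job needs more than the previously requested amount, you could raise the memory attribute in the xRSL (the value is in MB per requested core):

(memory=8000)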

xRSL job description

The documentation is available here: http://www.nordugrid.org/documents/xrsl.pdf

This is an example of an xRSL job description.

  &
  (executable = /usr/bin/env)
  (jobname = "test")

  (stdout=test.log)
  (join=yes)
  (gmlog=log)
  (memory=1000)

Example with input and output files on external location:

&
(executable="execution_script.sh")
(stdout="stdout.txt")
(join=yes)
(stdin="stdin.txt")
(inputFiles=
    ("stdin.txt" "http://www.external-url.com/example/stdin.txt")
    ("file1.py" "gsiftp://dcache.arnes.si/data/arnes.si/gen.vo.sling.si/file1.py")
    ("file2.py" "gsiftp://dcache.arnes.si/data/arnes.si/gen.vo.sling.si/file2.py")
)
(outputFiles=
    ("output1.txt" "gsiftp://dcache.arnes.si/data/arnes.si/gen.vo.sling.si/output1.txt")
    ("output2.txt" "gsiftp://dcache.arnes.si/data/arnes.si/gen.vo.sling.si/output2.txt")
)

Simple job examples

Before you begin with the job submission, create a proxy:

arcproxy

Check its validity:

arcproxy -I
Subject: /C=SI/O=SiGNET/O=SLING/OU=ARCSchoool FRI 2017/CN=Arc School Student 01/CN=1936774291
Issuer: /C=SI/O=SiGNET/O=SLING/OU=ARCSchoool FRI 2017/CN=Arc School Student 01
Identity: /C=SI/O=SiGNET/O=SLING/OU=ARCSchoool FRI 2017/CN=Arc School Student 01
Time left for proxy: 11 hours 59 minutes 51 seconds
Proxy path: /tmp/x509up_u541
Proxy type: X.509 Proxy Certificate Profile RFC compliant impersonation proxy - RFC inheritAll proxy
Proxy key length: 1024
Proxy signature: sha512

Helloworld example

Job description hellogrid.xrsl

&
(executable=hellogrid.sh)
(wallTime="5 minutes")
(stdout="stdout.txt")
(stderr="stderr.txt")
(runtimeenvironment="APPS/DEMO/DEFAULT")

Shell script: hellogrid.sh

#!/bin/sh
echo "Hello, grid"

Send the job to the cluster

arcsub -c jost.arnes.si hellogrid.xrsl

Check the status and retrieve the results.
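A minimal sequence for this, using the commands introduced above (JOBID stands for the job ID printed by arcsub):

arcstat JOBID #check the job status
arccat JOBID #inspect stdout while the job is running
arcget JOBID #retrieve the results once the job has finished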

Runtime environments

Runtime environments are software packages which are preinstalled on a computing resource and made available on the cluster. The list of available RTEs can be obtained by querying the information system or by selecting the cluster on the Grid Monitor.

The query for the information system:

ldapsearch -x -h jost.arnes.si -p 2135 -b 'Mds-Vo-name=local,o=grid' \
| grep nordugrid-cluster-runtimeenvironment

You can also send the software as an input file with the job, but this is not recommended if its size exceeds 100 MB. If the software you require is not available on the cluster, send an email to support@sling.si and the local administrator will arrange it for you. You can also put the binaries on the SLING storage system (dcache.arnes.si) and specify their location in the xRSL.

To specify a certain environment in the description file:

(runtimeenvironment="APPS/BASE/GPU")

OpenMPI example using the InfiniBand network

The InfiniBand network ensures low latency and is recommended for multicore jobs.

Helloworld example using OpenMPI:

First the program hellompi.c

/* C Example */
#include <stdio.h>
#include <mpi.h> 


int main (int argc, char *argv[])
{
  int rank, size;

  MPI_Init (&argc, &argv);      /* starts MPI */
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);        /* get current process id */
  MPI_Comm_size (MPI_COMM_WORLD, &size);        /* get number of processes */
  printf( "Hello world from process %d of %d\n", rank, size );
  MPI_Finalize();
  return 0;
}

Write the executable:

#!/bin/bash
date
hostname
echo "Compiling example"
mpicc -o hello hellompi.c
echo "Done."
echo "Running example:"
mpirun --mca btl openib,self -np 4 ${PWD}/hello > hellompi.out
echo "Done."
date

Write the job description for this job:

&

(count = 4)
(jobname = "hellompi")
(inputfiles =
  ("hellompi.sh" "")
  ("hellompi.c" "")
)
(outputfiles = 
  ("hellompi.out" "")
)
(executable = "hellompi.sh")
(stdout = "hellompi.log")
(join = yes)
(walltime = "15 minutes")
(gmlog = log)
(memory = 2000)
(runtimeenvironment = "APPS/DEMO/OPENMPI-2.0.2")

If you want to run a multicore job on a single machine, use count=4 and countpernode=4 (see the sketch below). If you have larger jobs (more than 30 cores), limit the number of worker nodes on which the job will run. You will wait longer for the worker nodes to drain, but the success rate of the job execution will be higher.
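A minimal sketch of the relevant xRSL attributes for a 4-core, single-node job:

(count = 4)
(countpernode = 4)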

GPU job using CUDA

Measure the memory bandwidth with CUDA:

First the program cudatest.cu:

#include <stdio.h>  

int main() {
  int nDevices;

  cudaGetDeviceCount(&nDevices);
  for (int i = 0; i < nDevices; i++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("Device ID: %d\n", i);
    printf("  GPU, device name: %s\n", prop.name);
    printf("  Memory Clock Rate (KHz): %d\n",
           prop.memoryClockRate);
    printf("  Memory Bus Width (bits): %d\n",
           prop.memoryBusWidth);
    printf("  Peak Memory Bandwidth (GB/s): %f\n\n",
           2.0*prop.memoryClockRate*(prop.memoryBusWidth/8)/1.0e6);
  }
}

Execution script cudatest.sh:

#!/bin/bash
env| grep -i cuda > cudaenv.out
nvcc cudatest.cu -o cudatest
./cudatest > cudatest.out

And finally the job description:

&
(executable="cudatest.sh")
(inputfiles=
("cudatest.sh" "")
("cudatest.cu" "")
)
(outputfiles=
("cudatest.out"  " ")
("cudaenv.out"  " ")
)
(stdout="error.txt")
(join=yes)
(gmlog="log")
(jobName="cudatest")
(walltime="10")
(count="1")
(memory="500")
(runtimeenvironment="APPS/DEMO/GPU")

Singularity

  • Permissions: user inside == user outside; if you want to have root permissions, you need to be root on the host system
  • Customization of the runtime environment without administrator privileges on the cluster (portability): create an image on your desktop and use it on the cluster
  • IO is passed directly through the container – bound directories from the host
  • no impact on performance
  • limitations: devices, drivers, stability of overlayfs; to bind directories from the host, they need to exist in the container

There are multiple options for using Singularity. Some images are, and more will be, available on the cluster. They will be bound from CVMFS (cvmfs.sling.si) – not available yet.

Use the image from Docker or Singularity hub

singularity exec --nv docker://nvidia/cuda:latest nvcc --version
Docker image path: index.docker.io/nvidia/cuda:latest
Cache folder set to /net/jost/home/gen009/.singularity/docker
[10/10] |===================================| 100.0% 
Creating container runtime...
tar: usr/local/cuda-9.0/.wh..wh..opq: implausibly old time stamp 1970-01-01 01:00:00
tar: usr/share/doc/cuda-cudart-9-0/.wh..wh..opq: implausibly old time stamp 1970-01-01 01:00:00
...
...

WARNING: Skipping user bind, non existent bind point (file) in container: '/usr/bin/nvidia-smi'
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

Or if you want to use the latest Ubuntu image:

$ singularity exec docker://ubuntu:latest echo "Hello from the container"
Docker image path: index.docker.io/library/ubuntu:latest
Cache folder set to /net/jost/home/gen009/.singularity/docker
Creating container runtime...
Hello from the container

singularity pull docker://ubuntu:latest
singularity build ubuntu.img docker://ubuntu:latest
$ singularity exec docker://fedora:latest cat /etc/redhat-release
Docker image path: index.docker.io/library/fedora:latest
Cache folder set to /net/jost/home/gen009/.singularity/docker
[1/1] |===================================| 100.0%
Creating container runtime...
WARNING: Could not chdir to home: /net/jost/home/gen009
Fedora release 26 (Twenty Six)

Multiple collections available on the hub:

  • https://singularity-hub.org/collections
  • https://hub.docker.com/explore/

Use the image on the cluster

Images will be eventually available on the cvmfs filesystem, which is automounted when required. For example, to use the CUDA container, you can run:

$ singularity exec /grid/arc/sw/singularity/cuda-latest.img cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.3 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.3 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

Use your own image

Put your image on an external location, such as dCache, HTTPS or FTP. Specify its path in inputfiles (xRSL), then run your workload within the container, as in the sketch below.
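A minimal sketch of such a job (the image name, the wrapper script and the dCache path below are hypothetical placeholders):

&
(executable="run_in_container.sh")
(inputfiles=
    ("run_in_container.sh" "")
    ("myimage.img" "gsiftp://dcache.arnes.si/data/arnes.si/gen.vo.sling.si/myimage.img")
)
(stdout="stdout.txt")
(join=yes)
(gmlog="log")

where run_in_container.sh wraps your workload, for example:

#!/bin/bash
# run the payload inside the staged container image; replace the command with your actual workload
singularity exec myimage.img cat /etc/os-release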

Exercise/Homework

The goal of the homework is to familiarize yourself with the system and to successfully submit a number of jobs, retrieve results and finish the task using the remote interfaces. You are allowed to use the provided access to pre-installed VM image or install and run the software from your own system.

The task is to use the Monte Carlo method to calculate the number π (PI, if your Unicode fails you). The premise has been explained at the Workshop, but to reiterate:

If we imagine a square with side 2r and a circle with radius r inscribed in it, we know that the area A of the circle is π × r², and the area of the square is (2r)² = 4r² – therefore the ratio of the area of the circle to the area of the square will be π/4. This means that if you randomly pick N points inside the square, approximately N × π/4 of those points should fall inside the circle.
Using the Monte Carlo method (a large number of random samples), we can therefore generate a large number of examples and use the ratio between the points inside and outside the circle to estimate the value of π. In order to do that, our program will generate a large number of random points and check for each point whether it is located inside or outside the circle. We can then use the result to estimate π in the following way: π ≈ (4×m)/n, where n is the number of points and m is the number of points inside the circle.
It is easy to check whether a point is inside the circle by testing x² + y² < r², where x and y are the coordinates of the point relative to the centre of the circle and r is the radius of the circle.
It is possible to trivially combine multiple sample runs to get a better estimate, so this problem is in the category of “embarrassingly parallel” problems, where you can run parallel jobs without any inter-job communication and combine the results in post-processing.
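As a rough sketch of a single sample run (one possible implementation among many, using the quarter-circle variant; N controls the number of points):

#!/bin/bash
# pi_sample.sh - estimate pi from N random points in the unit square (quarter-circle variant)
N=${1:-1000000}
awk -v n="$N" 'BEGIN {
    srand()
    m = 0
    for (i = 0; i < n; i++) {
        x = rand(); y = rand()      # random point in the unit square
        if (x*x + y*y < 1.0) m++    # point falls inside the quarter circle of radius 1
    }
    printf "n=%d m=%d pi=%f\n", n, m, 4.0 * m / n
}'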

There are a number of online references for this idea, including code examples, alternate set-ups (using a quarter of the circle, for example) etc. You are allowed to use existing code snippets or write your own, as long as you manage to run the jobs on the provided cluster to complete the task.

There are some different options according to the desired level of complexity/challenge:

  1. Calculate the number π on a single machine with one long or multiple sequential jobs
  2. Calculate the number π on multiple machines (using MPI) or, perhaps, using NVidia/CUDA.
  3. Calculate number π in your own container on the grid.

Please use support@sling.si at any time to get regular user support.

Final note: The same approach of using the Monte Carlo method to estimate values that are difficult to approach analytically is used extensively in many fields with real problems, for example in complex mechanical problems, in the calculation of volumes of complex bodies (phase space calculations), and especially where the volume is defined by a probability function (i.e. when touching on quantum mechanics). The same approach is adaptable to many other fields, is comparatively easy to implement, and profits extensively from the use of a large computing resource since it is intrinsically slow (or, if you like, CPU intensive).

Exercise/Homework Report

The goal of the work is to successfully use the computing resources. Please submit your programme, running script, job description file and any output files from the jobs (with logs) in an archive, and include or add a short report explaining: (1) your approach to the problem, (2) any problems with the platform, (3) the way you used jobs/tasks and (4) the final result (the π estimate, number of points, number of jobs, total calculation time etc.).

Thank you!