Archive for February, 2018

February 21st, 2018

In a previous blog post, I told the story of how I used Amazon AWS and AlcesFlight to create a temporary multi-user HPC cluster for use in a training course.  Here are the details of how I actually did it.

Note that I have only ever used this configuration as a training cluster.  I am not suggesting that the customisations are suitable for real work.

Before you start

Before attempting to use AlcesFlight on AWS, I suggest that you ensure that you have the following things working

Customizing the HPC cluster on AWS

AlcesFlight provides a CloudFormation template for launching cluster instances on Amazon AWS.  The practical upshot of this is that you answer a bunch of questions on a web form to customise your cluster and then you launch it.

We are going to use this CloudFormation template along with some bash scripts that provide additional customisation.

Get the customisation scripts

The first step is to get some customization scripts in an S3 bucket. You could use your own or you could use the ones I created.

If you use mine, make sure you take a good look at them first to make sure you are happy with what I’ve done!  It’s probably worth using your own fork of my repo so you can customise your cluster further.

It’s the bash scripts that allow the creation of a bunch of user accounts for trainees with randomized passwords.  My scripts do some other things too and I’ve listed everything in the github README.md.

git clone https://github.com/mikecroucher/alces_flight_customisation
cd alces_flight_customisation

Now you need to upload these to an s3 bucket. I called mine walkingrandomly-aws-cluster

aws s3api create-bucket --bucket walkingrandomly-aws-cluster --region eu-west-2 --create-bucket-configuration LocationConstraint=eu-west-2
aws s3 sync . s3://walkingrandomly-aws-cluster --delete

Set up the CloudFormation template

  • Head over to Alces Flight Solo (Community Edition) and click on continue to subscribe
  • Choose the region you want to create the cluster in, select Personal HPC compute cluster and click on Launch with CloudFormationConsole

personal_hpc_compute_cluster

  • Go through the CloudFormation template screens, creating the cluster as you want it until you get to the S3 bubcket for customization profiles box where you fill in the name of the S3 bucket you created earlier.
  • Enable the default profile

flight_customisation

  • Continue answering the questions asked by the web form.  For this simple training cluster, I just accepted all of the defaults and it worked fine

When the CloudFormation stack has been fully created, you can log into your new cluster as an administrator.  To get the connection details of the headnode, go to the EC2 management console in your web-browser, select the headnode and click on Connect.

When you log in to the cluster as administrator, the usernames and passwords for your training cohort will be in directory specified by the password_file variable in the configure.d/run_me.sh script. I set my administrator account to be called walkingrandomly and so put the password file in /home/walkingrandomly/users.txt.  I could then print this out and distribute the usernames and passwords to each training delegate.

This is probably not great sysadmin practice but worked on the day.  If anyone can come up with a better way, Pull Requests are welcomed!

alces_flight_login

Try a training account

At this point, I suggest that you try logging in as one of the training user accounts and make sure you can successfully submit a job.  When I first tried all of this, the default scheduler on the created cluster was SunGridEngine and my first attempt at customisation left me with user accounts that couldn’t submit jobs.

The current scripts have been battle tested with Sun Grid Engine, including MPI job submission and I’ve also done a very basic test with Slurm. However, you really should check that a user account can submit all of the types of job you expect to use in class.

Troubleshooting

When I first tried to do this, things didn’t go completely smoothly.  Here are some things I learned to help diagnose the problems

Full documentation is available at http://docs.alces-flight.com/en/stable/customisation/customisation.html

On the cluster, we can see where its looking for its customisation scripts with the alces about command


alces about customizer
Customizer bucket prefix: s3://walkingrandomly-aws-cluster/customizer

The log file at /var/log/clusterware/instance.log on both the head node and worker nodes is very useful.

Once, I did all of this using a Windows CMD bash prompt and the customisation scripts failed to run.  The logs showed this error

/bin/bash^M: bad interpreter: No such file or directory

This is a classic dos2unix error and could be avoided, for example, by using the Windows Subsystem for linux instead of CMD.exe.

February 21st, 2018

I needed a supercomputer…..quickly!

One of the things that we do in Sheffield’s Research Software Engineering Group is host training courses delivered by external providers.  One such course is on parallel programming using MPI for which we turn to the experts at NAG (Numerical Algorithms Group).  A few days before turning up to deliver the course, the trainer got in touch with me to ask for details about our HPC cluster.

Because Croucher’s law, I had forgotten to let our HPC sysadmin know that I’d need a bunch of training accounts and around 128 cores set-aside for us to play around with for a couple of days.

In other words, I was hosting a supercomputing course and had forgotten the supercomputer.

Building a HPC cluster in the cloud

AlcesFlight is a relatively new product that allows you to spin up a traditional-looking High Performance Computing cluster on cloud computing substrates such as Microsoft Azure or Amazon AWS.  You get a head node, a bunch of worker nodes and a job scheduler such as Slurm or Sun Grid Engine. It looks just the systems that The University of Sheffield provides for its researchers!

You also get lots of nice features such as the ability to scale the number of worker nodes according to demand, a metric ton of available applications and the ability to customise the cluster at start up.

The supercomputing budget was less than the coffee budget

…and I only bought coffee for myself and the two trainers over the two days!  The attendees had to buy their own (In my defence…the course was free for attendees!).

I used the following

  • A head node of:  t2.large (2 vCPUs, 8Gb RAM)
  • Initial worker nodes: 4 of c4.4xlarge (16 vCPUs and 30GB RAM each)
  • Maximum worker nodes: 8 of c4.4xlarge (16 vCPUs and 30GB RAM each)

This gave me a cluster with between 64 and 128 virtual cores depending on the amount that the class were using it.  Much of the time, only 4 nodes were up and running – the others spun up automatically when the class needed them and vanished when they hadn’t been used for a while.

I was using the EU (Ireland) region and the prices at the time were

  • Head node: On demand pricing of $0.101 per hour
  • Worker nodes: $0.24 (ish) using spot pricing. Each one about twice as powerful as a 2014 Macbook Pro according to this benchmark.

HPC cost: As such, the maximum cost of this cluster was $2.73 per hour when all nodes were up and running. The class ran from 10am to 5pm for two days so we needed it for 14 hours.  Maximum cost would have been $38.22.

Coffee cost: 2 instructors and me needed coffee twice a day. So that’s 12 coffees in total.  Around £2.50 or $3.37 per coffee so $40.44

The HPC cost was probably less than that since we didn’t use 128 cores all the time and the coffee probably cost a little more.

Setting up the cluster

Technical details of how I configured the cluster can be found in the follow up post at https://www.walkingrandomly.com.wb0.net/?p=6431

 

February 10th, 2018

The Meltdown bug which affects most modern CPUs has been called by some ‘The worst ever CPU bug’. Accessible explanations about what the Meltdown bug actually is are available here and here.

Software patches have been made available but some people have estimated a performance hit of up to 30% in some cases. Some of us in the High Performance Computing (HPC) community (See here for the initial twitter conversation) started to wonder what this might mean for the type of workloads that run on our systems. After all, if the worst case scenario of 30% is the norm, it will drastically affect the power of our systems and hence reduce the amount of science we are able to support.

In the video below, Professor Mark Handley from University College London gives a detailed explanation of both Meltdown and Spectre at an event held at Alan Turing Institute in London.

Another video that gives a great introduction to this topic was given by Jon Masters at  https://fosdem.org/2018/schedule/event/closing_keynote/ 

To patch or not to patch

To a first approximation, a patch causing a 30% performance hit on a system costing £1 million pounds is going to cost an equivalent of £300,000 — not exactly small change! This has led to some people wondering if we should patch HPC systems at all:

All of the UK Tier-3 HPC centres I’m aware of have applied the patches (Sheffield, Leeds and Manchester) but I’d be interested to learn of centres that decided not to.  Feel free to comment here or message me on twitter if you have something to add to this discussion and I’ll update this post where appropriate.

Research paper discussing the performance penalties of these patches on HPC workloads

A group of people have written a paper on Arxiv that looks at HPC performance penalties in more detail.  From the paper’s abstract:

The results show that although some specific functions can have execution times decreased by as much as 74%, the majority of individual metrics indicates little to no decrease in performance. The real-world applications show a 2-3% decrease in performance for single node jobs and a 5-11% decrease for parallel multi node jobs.
The full pdf is available at https://arxiv.org/abs/1801.04329

 

Other relevant results and benchmarks

Here are a few other links that discuss the performance penalty of applying the Meltdown patch.

Acknowledgements

Thanks to Adrian Jackson, Phil Tooley, Filippo Spiga and members of the UK HPC-SIG for useful discussions.

TOP