CLUSTERPUNCH v0.1

Cluster monitoring and performance measurement with mini-benchmarks

Martin Krzywinski
Genome Sciences Centre
martink@bcgsc.ca

################################################################

0. INTRODUCTION
   0.a purpose
   0.b requirements
   0.c distribution
1. GETTING STARTED
   1.a installation
   1.b testing
   1.c deployment
2. BUGS
   2.a report bugs and comments

################################################################

0. INTRODUCTION

0.a purpose

  The purpose of Clusterpunch is to provide a portable system which
  distributes user-defined mini-benchmarks across nodes on a network,
  analyzes the results and ranks the nodes accordingly. The tool is
  designed to be simple enough to use casually, while retaining
  flexibility for the command-line sysadmin.

  Clusterpunch uses 'punches', defined to be either bits of Perl code
  or calls to external binaries, to act as benchmarks. Punches can be
  timed or return their own values. Diagnostic punches, which poll the
  nodes for information, return values.

0.b requirements

  You need

    - at least one computer
    - Perl v5.005 (I did not test this with older versions)

  Modules you may need to download

    Config::General

  Modules you probably already have

    Data::Dumper
    FindBin
    Getopt::Std
    IO::Socket
    Time::HiRes

0.c distribution

  The latest version can always be found at

    http://mkweb.bcgsc.ca/clusterpunch

  I'd be very happy to hear whether this package is useful to you and
  how it can be improved.

1. GETTING STARTED

1.a installation

  > tar xvfz clusterpunch-x.xx.tgz
  ....
  > cd clusterpunch-x.xx
  > ls
  -rw-r--r--    1 martink  users     555 Jan 16 23:58 CHANGES
  -rw-r--r--    1 martink  users    2079 Jan 17 00:19 README
  drwxr-xr-x    2 martink  users    4096 Jan 16 23:51 bin/
  -rwxr-xr-x    1 martink  users     326 Jan 16 23:58 clusterpunch.start
  drwxr-xr-x    6 martink  users    4096 Jan 16 23:54 doc/
  drwxr-xr-x    2 martink  users    4096 Jan 17 00:19 etc/
  -rw-r--r--    1 martink  users     110 Jan 16 23:58 hosts.sample
  drwxr-xr-x    2 martink  users    4096 Jan 16 23:53 lib/

  Done.
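  For a concrete picture of the punch idea from section 0.a, here is a
  minimal sketch. It is written in Python purely for illustration;
  Clusterpunch itself is written in Perl, and this is not its
  implementation. The names run_timed_punch, run_value_punch,
  rank_hosts and the two sample punches are hypothetical. The host
  timings in the example are copied from the sample b_all output in
  section 1.c, and the ranking assumes that a lower time is better.

```python
import time


def run_timed_punch(code):
    """Timed punch: run a callable and report the elapsed seconds."""
    start = time.perf_counter()
    code()
    return time.perf_counter() - start


def run_value_punch(code):
    """Value punch: run a callable and report whatever it returns."""
    return code()


def rank_hosts(results):
    """Rank host names by their punch value, smallest (fastest) first."""
    return sorted(results, key=results.get)


def busy_loop():
    """Hypothetical CPU-bound punch body."""
    total = 0
    for i in range(100_000):
        total += i * i
    return total


def count_items():
    """Hypothetical diagnostic punch body that returns its own value."""
    return len([1, 2, 3])


if __name__ == "__main__":
    # A timed punch reports a duration, a value punch reports a value.
    print("busy_loop", run_timed_punch(busy_loop))
    print("count_items", run_value_punch(count_items))
    # Timings taken from the sample b_all column in section 1.c.
    print(rank_hosts({"1of4": 17.130, "0of1": 2.284, "1of0": 2.299}))
```

  The point of the split is that a timed punch is judged by how long it
  takes to run, while a diagnostic punch is judged by what it reports.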

1.b testing

  The first thing to run is the benchdriver script. This script reads
  the configuration file in etc/ and executes all the punches. If
  nothing appears out of the ordinary, you're ready to start using the
  system. If you get an error, read the
  doc/man/clusterpunch.conf.man man page, check
  http://mkweb.bcgsc.ca/clusterpunch for solutions, or send me an
  email.

  > bin/benchdriver
  punch1     0of8  0.429812
  punch2     0of8  0.248565
  benchmem   0of8  0.792111
  benchio    0of8  1.285386
  benchcpu   0of8  0.573965
  mhz        0of8  2792
  load       0of8  0.11
  uptime     0of8  90.5988078703704
  nusers     0of8  3
  jobusers   0of8  mapper:0.00 martink:0.17
  lsof       0of8  440
  date       0of8  00:20:03
  nrunning   0of8  1

  The output shows the name of each punch, along with the hostname on
  which the script was executed and the return value of the punch.
  Notice that some punches are timed (e.g. punch1) and some punches
  return their own values (e.g. date).

  Now open etc/clusterpunch.conf and change the

    port = 8095
    broadcast = 10.1.2.255

  lines to match your own network settings. Try running the
  clusterpunchserver on one machine, in the foreground, using the new
  settings.

  > bin/clusterpunchserver -v
  0of8 | - [2003-01-17 00:25:23] logdir
  0of8 | - [2003-01-17 00:25:23] broadcast 10.1.2.255
  0of8 | - [2003-01-17 00:25:23] daemon 0
  0of8 | - [2003-01-17 00:25:23] port 8095
  0of8 | - [2003-01-17 00:25:23] verbose 1
  0of8 | - [2003-01-17 00:25:23] timeout 5
  0of8 | - [2003-01-17 00:25:23] debug 1
  0of8 | - [2003-01-17 00:25:23] logging 0
  0of8 | - [2003-01-17 00:25:23] sort 3
  0of8 | - [2003-01-17 00:25:23] punch 13
  0of8 | - [2003-01-17 00:25:23] Running in foreground
  0of8 | - [2003-01-17 00:25:23] Servicing incoming requests - parent PID 20656

  Now open another terminal on the same machine and type

  > bin/clustersnapshot -c "live"

  You should see something like

  host  live
  0of8     1
  TOTAL    1

  Meanwhile, the clusterpunchserver should have produced output like

  0of8 | 0of8 [2003-01-17 00:27:42] RCV 4 bytes
  0of8 | 0of8 [2003-01-17 00:27:42] client 0of8 commands live()
  0of8 | 0of8 [2003-01-17 00:27:42] $STAT1 = { 'live' => 1, 'host' => '0of8' };

  Now go to another machine on the same network and run
  clustersnapshot in the same way. If you get the same output,
  everything is working. If you get no output, but clusterpunchserver
  appears to receive the punch, extend the timeout using

  > bin/clustersnapshot -c "live" -t 10

  Ten seconds is more than enough time for a simple "live" punch.

1.c deployment

  Edit the hosts.sample file in the distribution root directory and
  add your own host names.

  By default, none of the nodes do any logging. If you want logging,
  create the log directory and set

    logging = true

  in the etc/clusterpunch.conf file.

  Start the clusterpunch daemons with clusterpunch.start. This script
  assumes that rsh works, and is just a loop over the hosts.

  Once you've started the daemons, try running clustersnapshot again

  > bin/clustersnapshot -c "live"
  host  live
  0of0     1
  0of1     1
  0of2     1
  0of3     1
  ....
  9of3     1
  9of4     1
  9of7     1
  TOTAL   59

  Now try executing a set of benchmark punches, sorting by the
  cumulative benchmark

  > bin/clustersnapshot -c "benchcpu;benchmem;benchio" -t 20 -s "b_all"

  Use a large timeout to make sure that all the nodes get a chance to
  finish the mini-benchmarks. At this point it might be a good idea to
  read the man page for clusterpunch.conf and adjust any of the
  default settings for the punches to suit your hardware.

  host     b_all   b_cpu     b_io   b_mem  live
  0of1     2.284   0.717    0.609   0.959     1
  1of0     2.299   0.720    0.595   0.984     1
  1of3     2.323   0.726    0.613   0.984     1
  3of1     2.329   0.723    0.604   1.002     1
  ....
  0of0     6.856   0.726    5.080   1.050     1
  5of4     7.824   1.224    4.654   1.946     1
  6of4     9.009   1.291    5.392   2.326     1
  1of4    17.130   1.628   12.230   3.272     1
  TOTAL  227.091  45.856  113.243  67.993    59

2. BUGS

2.a report bugs and comments

  If you find a bug or would like to see something added or changed,
  please let me know.

  Martin Krzywinski
  martink@bcgsc.ca
  http://mkweb.bcgsc.ca/clusterpunch

################################################################

$Id: README,v 1.3 2003/01/17 08:53:02 martink Exp $