Clusterpunch: a distributed mini-benchmark system for clusters

Methodology

Clusterpunch uses a UDP broadcast mechanism to sent punches to nodes. The data sent to nodes is a string command. The data received from nodes is a serialized hash.

Communication Phases

During the distribution, processing and gathering of punches there are four distinct phases. These phases are organizational only, and do not necessarily happen serially in time.. In the first phase, a client utility, like clustersnapshot, broadcasts a punch request to all machines on the network. In the second phase, all nodes running the clusterpunchserver receive the request. In the third phase, each node executes the code for each punch received and builds up a hash of results. In the fourth phase, the results are returned to the requesting client and the client processes the data.

Each of these stages is illustrated below. The examples use the client utility clustersnapshot and the network is 10.1.1.0/24 in this example. The client is the machine at 10.1.1.2.

Broadcast Phase

A client makes a punch request via UDP to all hosts on the network. The request is comprised of a series of punch names, concatenated with a semi-colon and sent as a text message. The client utility, in this case clustersnapshot, was run at the command line by a user, or possibly from a cron job.

UDP is the protocol chosen to minimize overhead in the communication between the nodes.

Phase 1. The client announces a punch request on the network using UDP broadcast.

Message Reception Phase

Each node running the clusterpunchserver receives the punch request. The node has loaded the punch definitions from the configuration file and it analyzes the command to figure out what punches it understands.

Phase 2. Nodes receive punch requests.

Punch Execution Phase

After the reception of the message, a node executes each punch. A punch is associated with Perl code or a system call. The node performs the punch task and stores either the return value of the code or the time taken to execute the code. The punch result type (timed or value) is defined in the configuration file. As the node executes all the requested punches, it stores the results in a hash.

Since punches are stored in an external configuration file, and multiple configuration files can be read in (e.g. one global and one local), it is possible for each node to perform a different task for a given punch. For example, you could define a CPU benchmark punch differently on an alpha system then on an Intel system.

Phase 3. Nodes execute punches. All punch definitions are read in from global and local configuration files, allowing nodes can respond differently to the same punch request.

Result Gathering Phase

The client will wait a prescribed amount of time for answers from the nodes. As results come in over this time, the client constructs a reponse hash, comprised of the nodes' results.

After the client is done waiting, it process the data. The client can show the results in a formatted table (clustersnapshot), derive a total cluster benchmark (clusterbench) or provide a wrapper around a login facility, such as rsh/ssh, to log the user into the highest ranked node (clusterlogin).

For each punch, the client knows how to rank a node based on the return value. The type of sort done for each punch is defined in the configuration file.

Phase 4. As nodes complete their tasks, their result hashes are returned to the client.