GEXEC installed on the cluster


Posted in categories: Work related

Caltech GEXEC is a scalable cluster remote execution system that provides fast, RSA-authenticated remote execution of parallel and distributed jobs. It transparently forwards stdin, stdout, stderr, and signals to and from remote processes, propagates the local environment, and is designed to be robust and to scale to systems of more than 1000 nodes. Internally, GEXEC builds an n-ary tree of TCP sockets and threads between gexec daemons and propagates control information up and down the tree. By using hierarchical control, GEXEC distributes both the work and the resource usage associated with massive parallelism across multiple nodes, which avoids the problems caused by single-node resource limits (e.g., limits on the number of file descriptors on front-end nodes). The initial release of the software consists of a daemon, a client program, and a library that provides a programmatic interface to the GEXEC system.

Please refer to http://www.theether.org/gexec/ and http://www.theether.org/authd/ for the original documentation.

To use it, see the following examples:

1.  When you are logged in to bitc and want to submit a job to node12, do the following:
# export GEXEC_SVRS="node12"
# gexec -n 1 <your_command_here>

2.  If you want to submit a job to several nodes simultaneously (for example, node11 through node13), do the following:
# export GEXEC_SVRS="node11 node12 node13"
# gexec -n 3 <your_command_here>

You can also write a script along these lines to submit a series of jobs to a series of nodes.
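For instance, here is a minimal sketch of such a script. It only reuses the GEXEC_SVRS variable and the -n option from the examples above. The function name run_on_each_node and the GEXEC_CMD variable are my own inventions for illustration, not part of gexec; GEXEC_CMD just lets you swap in another command for a dry run.

```shell
#!/bin/sh
# Sketch: run the same command on each node, one node at a time.
# Assumption: gexec honours GEXEC_SVRS and -n as in the examples above.
# GEXEC_CMD is a hypothetical override (not a gexec feature) for dry runs.

run_on_each_node() {
    # $1 is a space-separated node list; the remaining arguments are the command.
    nodes=$1
    shift
    for node in $nodes; do
        # Restrict gexec to a single node for this invocation, as in example 1.
        GEXEC_SVRS="$node" ${GEXEC_CMD:-gexec} -n 1 "$@"
    done
}
```

After sourcing the file, call it as run_on_each_node "node11 node12 node13" <your_command_here>. Setting GEXEC_CMD=echo first prints the gexec arguments instead of running anything, which is handy for checking the loop before you submit real jobs.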

For questions, please ask me.
