Cluster: New BobSCEd Install Log

From Earlham Cluster Department

The source code for ''anything'' installed locally is in <code>/usr/local/src</code>.  The source for ''anything'' installed on NFS is in <code>/mounts/bobsced/usr/local/src</code>.

== Modules Software ==
'''Intel Compilers'''
* [https://wiki.cs.earlham.edu/images/8/83/Release_NotesC_en_US.pdf C/C++ Compiler Release Notes (PDF)]
* Installed both the C/C++ (cce) and Fortran (fce) compilers with a custom install, ''not from RPM'', so that I could put them in a different install location (on NFS):
** /mounts/bobsced/usr/local/modules-sw/intel/cce/11.1/
** /mounts/bobsced/usr/local/modules-sw/intel/fce/11.1/

'''Openmpi'''
* 1.3.1: configured with <code>./configure --prefix=/mounts/bobsced/usr/local/modules-sw/openmpi/1.3.1/ --enable-mpi-threads --with-openib</code>
* 1.3.3: same as 1.3.1
* 1.4.1: same as 1.3.1

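Since 1.3.3 and 1.4.1 reuse the 1.3.1 options, the rebuilds can be scripted. A minimal sketch, assuming (hypothetically) that each source tree is unpacked as <code>openmpi-VERSION</code> under <code>/mounts/bobsced/usr/local/src</code> and that each version is installed into its own directory under <code>modules-sw/openmpi/</code>; the function names are made up for illustration:

```shell
#!/bin/bash
# Sketch only: rebuild Open MPI versions with the options used for 1.3.1.
# Assumptions (hypothetical): sources are in $SRC_ROOT/openmpi-<version>,
# and each version gets its own $PREFIX_ROOT/<version>/ install prefix.
SRC_ROOT=/mounts/bobsced/usr/local/src
PREFIX_ROOT=/mounts/bobsced/usr/local/modules-sw/openmpi

# Print the configure options for one version on a single line.
openmpi_configure_opts() {
    local version=$1
    printf '%s\n' "--prefix=${PREFIX_ROOT}/${version}/ --enable-mpi-threads --with-openib"
}

# Configure, build and install one version; returns non-zero on failure.
build_openmpi() {
    local version=$1
    (
        # Subshell so the cd does not leak into the caller's shell.
        cd "${SRC_ROOT}/openmpi-${version}" || exit 1
        # The options contain no spaces, so unquoted word splitting is intended here.
        ./configure $(openmpi_configure_opts "$version") && make && make install
    )
}
```

For example: <code>for v in 1.3.1 1.3.3 1.4.1; do build_openmpi "$v"; done</code>. Keeping the options in one function means the three versions cannot drift apart.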
'''MPICH'''
* MPICH1 installed with <code>./configure --prefix=/mounts/bobsced/usr/local/modules-sw/mpich1/1.2.7p1</code>
** Note: to uninstall, run <code>/mounts/bobsced/usr/local/modules-sw/mpich1/1.2.7p1/sbin/mpiuninstall</code>
** When run without a machine file, it uses <code>/mounts/bobsced/usr/local/modules-sw/mpich1/1.2.7p1/share/machines.ARCH</code> instead of just running on the host machine.  I removed this file to force people to use a machine file, which will have to be generated from PBS as part of the qsub script. <font color="green">Gave up and added it back in until I finish writing a qsub script that incorporates it.</font>
* MPICH2 installed with <code>./configure --prefix=/mounts/bobsced/usr/local/modules-sw/mpich2/2.1.2 --enable-cxx --enable-f90 --enable-f77 --enable-threads=multiple --with-thread-package=posix</code>
** Installed OSC's Torque-aware <code>mpiexec</code> on top of it (originally this was package mpich2, now it's mpich2-osc), configured with:
*** <code>./configure --prefix=/mounts/bobsced/usr/local/modules-sw/mpich2/2.1.2 --with-pbs=/var/spool/pbs/ --with-default-comm=mpich2-pmi</code>
*** See [http://debianclusters.cs.uni.edu/index.php/MPICH_with_Torque_Functionality Debian Clusters: MPICH with Torque]

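The MPICH1 note says the machine file will have to be generated from PBS inside the qsub script. A minimal sketch of that step, assuming Torque's usual <code>$PBS_NODEFILE</code> format (one hostname per line, repeated once per processor slot), which MPICH1's <code>mpirun -machinefile</code> can read as-is; the function name and output path are hypothetical:

```shell
#!/bin/bash
# Sketch: build a per-job MPICH1 machine file from Torque's node file.
# Assumption: the node file has one hostname per line, one line per slot.
# make_machinefile NODEFILE OUTFILE -> writes OUTFILE, prints the slot count.
make_machinefile() {
    local nodefile=$1 outfile=$2
    if [ ! -r "$nodefile" ]; then
        echo "make_machinefile: cannot read '$nodefile'" >&2
        return 1
    fi
    # Drop blank lines; keep one line per slot so ranks are placed per slot.
    grep -v '^[[:space:]]*$' "$nodefile" > "$outfile"
    # Arithmetic expansion strips any padding wc adds around the number.
    echo $(( $(wc -l < "$outfile") ))
}
```

In a job script this would look something like <code>NP=$(make_machinefile "$PBS_NODEFILE" "$HOME/tmp/machines.$PBS_JOBID") && mpirun -machinefile "$HOME/tmp/machines.$PBS_JOBID" -np "$NP" ./a.out</code>, which keeps MPICH1 on the nodes qsub actually assigned.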
=== Using Modules ===
Important commands:
* <code>. /etc/profile.d/modules.sh</code> (defines the <code>module</code> command in the current shell)
* <code>module load modules modules-init modules-bobsced</code>
* <code>module avail</code> (lists the available modules)
* <code>module load x</code> (loads module ''x'')

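Every job script on this page repeats the first two commands. A sketch of a shell function that wraps them (the name <code>load_bobsced_modules</code> is hypothetical), with the init-script path as an optional parameter so that it fails with a clear message on machines without Modules:

```shell
#!/bin/bash
# Sketch: initialise Environment Modules and load the BobSCEd module tree.
# load_bobsced_modules [INIT_SCRIPT] -> 0 on success, 1 if the init script is missing.
load_bobsced_modules() {
    local init=${1:-/etc/profile.d/modules.sh}
    if [ ! -r "$init" ]; then
        echo "load_bobsced_modules: $init not found" >&2
        return 1
    fi
    # Sourcing the init script defines the 'module' shell function.
    . "$init"
    module load modules modules-init modules-bobsced
}
```

Usage: <code>load_bobsced_modules && module load openmpi/1.3.1</code>.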
== Log ==
* For Intel updates:
** compat-libstdc++-33.i386
* blas.x86_64 (on all nodes)
'''Install C3 tools''' from http://www.csm.ornl.gov/torc/C3/C3softwarepage.shtml
* Installed perl-LDAP.noarch with yum; that didn't work, so used CPAN to install Authen::Simple::LDAP
* Edited /var/www/cgi-bin/interfaces/authen.conf for our LDAP settings
* Before externally authenticated users can use WebMO, you have to log in as administrator and check the box that allows them into the WebMO group (or whatever other group)
* GAMESS:
** Installed compat-gcc-34-g77.x86_64 and gfortran with yum
** Followed the directions from the [http://www.webmo.net/support/gamess_linux.html WebMO site]
* Added the following line to httpd.conf:
:<code>SuexecUserGroup bob users</code>
* Gaussian 09 is not supported, though it's installed in /mounts/bobsced/usr/local/g09
* Installed g03, but it gives errors:
<pre>Erroneous write during file extend. write 160 instead of 4096
Probably out of disk space.
Write error in NtrExt1: No such file or directory
</pre>
or
<pre>Write error in NtrExt1: Bad address</pre>
** To fix this, run <code>echo 0 > /proc/sys/kernel/randomize_va_space</code>
** <font color="green">This needs to be set to happen all the time on boot</font>

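The green note asks for the <code>randomize_va_space</code> setting to survive reboots. On Red Hat-style systems the usual place for this is <code>/etc/sysctl.conf</code>, as the line <code>kernel.randomize_va_space = 0</code>; that file is read at boot, and <code>sysctl -p</code> applies it immediately. A sketch of an idempotent helper for adding the line (the function name is hypothetical; the file is a parameter so it can be tried on a scratch copy first):

```shell
#!/bin/bash
# Sketch: set "KEY = VALUE" in a sysctl-style file, replacing an existing entry
# for KEY or appending a new one. Relies on GNU sed's -i option.
# The dots in sysctl keys are regex wildcards here, which is harmless for these names.
set_sysctl_key() {
    local file=$1 key=$2 value=$3
    touch "$file" || return 1
    if grep -q "^[[:space:]]*${key}[[:space:]]*=" "$file"; then
        # Rewrite the existing entry in place.
        sed -i "s|^[[:space:]]*${key}[[:space:]]*=.*|${key} = ${value}|" "$file"
    else
        echo "${key} = ${value}" >> "$file"
    fi
}
```

On the node this would be <code>set_sysctl_key /etc/sysctl.conf kernel.randomize_va_space 0 && sysctl -p</code>. Running it twice leaves a single entry.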
== Infiniband ==
* Drivers downloaded from [http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=26&menu_section=34#tab-three the Mellanox site] (the Red Hat 5.3 ones)
** Mount the ISO as a loopback device somewhere (e.g. <code>mount -o loop /mounts/bobsced/usr/src/MLNX_OFED_LINUX-1.4-rhel5.3.iso /media/</code>)
** Run the installer with <code>--msm</code> (e.g. <code>/media/mlnxofedinstall --msm</code>)
* You need to boot into the kernel that came with the original install (2.6.18-128.el5); otherwise you get a message like this:
<pre>
The 2.6.18-164.el5 kernel is installed, but do not have drivers available.
Cannot continue.</pre>
* Then (before rebooting back to the old kernel), run <code>mst start</code>
* Then symlink the driver directory to the current kernel version, like this (see the [https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2009-May/039758.html Rocks discussion about it]):
** The version number to use appears in the error that <code>mst start</code> gives you
:<code>ln -s /usr/mst/lib/2.6.18-128.el5/ /usr/mst/lib/2.6.18-164.2.1.e15.plus</code>
* Success looks like this:
<pre>
[root@bs0-new ~]# mst start
    Starting MST (Mellanox Software Tools) driver set:
Loading MST PCI module                                    [  OK  ]
Loading MST PCI configuration module                      [  OK  ]
Saving configuration for PCI device 01:00.0                [  OK  ]
Create devices
</pre>
* <font color="green">Does this need to be done every time?  It looks like yes.</font>

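If <code>mst start</code> really has to be rerun after every boot, one option is to run it from <code>/etc/rc.d/rc.local</code>, after making sure the symlink for the running kernel exists. A sketch under stated assumptions: the drivers were built for 2.6.18-128.el5 as above, the directory name <code>mst start</code> looks for matches <code>uname -r</code> (not verified here; the steps above say to take it from the error message), and <code>link_mst_lib</code> is a hypothetical name:

```shell
#!/bin/bash
# Sketch: make LIB_ROOT/<running kernel> point at the directory the MST
# drivers were built for. Intended to be called from /etc/rc.d/rc.local.
# link_mst_lib LIB_ROOT BUILT_KERNEL RUNNING_KERNEL
link_mst_lib() {
    local root=$1 built=$2 running=$3
    if [ ! -d "$root/$built" ]; then
        echo "link_mst_lib: $root/$built does not exist" >&2
        return 1
    fi
    # Leave an existing directory or link for the running kernel untouched.
    if [ ! -e "$root/$running" ]; then
        ln -s "$root/$built/" "$root/$running" || return 1
    fi
}

# Example rc.local line (commented out so the sketch is safe to source):
# link_mst_lib /usr/mst/lib 2.6.18-128.el5 "$(uname -r)" && mst start
```

The function does nothing if the link already exists, so it is safe to run on every boot.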
== Cluster Image Review ==
* LDAP needs to be installed on hopper
* Need to test whether these are working:
** mpich1
** mpich2
** openmpi
* Intel MPI needs to be installed (maybe?)
* Init scripts for pbs and maui need to be set up

{| {{table}}
| align="center" style="background:#f0f0f0;"|
| align="center" style="background:#f0f0f0;"|'''Command Line -np 8'''
| align="center" style="background:#f0f0f0;"|'''Command Line -np 9'''
| align="center" style="background:#f0f0f0;"|'''Torque -np 8'''
| align="center" style="background:#f0f0f0;"|'''Torque -np 9'''
| align="center" style="background:#f0f0f0;"|'''Machinefile'''
| align="center" style="background:#f0f0f0;"|'''Notes'''
|-
| mpich1||OK, except allocates 1 process to bs0||OK, except allocates 1 process to bs0||OK||OK||/mounts/bobsced/usr/local/modules-sw/mpich1/1.2.7p1/share/machines.LINUX||Creates temporary file PIxxxxx while running under qsub; MPI does not respect the nodes given to qsub (could get around this by creating a machine file at runtime)
|-
| mpich2-osc||N/A||N/A||OK, MUST specify # nodes, ppn||OK, MUST specify # nodes, ppn||Generated by PBS||Currently uses OSC's PBS-specific mpiexec (cannot be run outside of qsub)
|-
| mpich2||OK, must set up mpd ring||OK, must set up mpd ring||OK, must set up mpd ring||OK, must set up mpd ring||N/A (mpd ring)||Requires <code>.mpd.conf</code> in the home directory containing <code>MPD_SECRETWORD=somevalue</code>, with <code>chmod 600 .mpd.conf</code>; set up the ring with <code><nowiki>sort $PBS_NODEFILE | uniq -c | awk '{print $2":"$1}' > /cluster/home/kwanous/tmp/mpd.nodes; mpdboot -f /cluster/home/kwanous/tmp/mpd.nodes -n 2</nowiki></code>
|-
| openmpi 1.*||OK for running on one node||OK for running on one node||Uses PBS_NODES automatically||Uses PBS_NODES automatically||
|}

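The mpich2 row's ring setup can be wrapped in two helpers. One writes the <code>host:count</code> file read by <code>mpdboot -f</code>, using the same sort/uniq/awk pipeline as the table, and prints the number of distinct hosts (the value the scripts hard-code as <code>mpdboot -n 2</code>). The other creates <code>~/.mpd.conf</code> with mode 600 if it is missing. A sketch; the function names are hypothetical and the secret word is a placeholder:

```shell
#!/bin/bash
# Sketch: helpers for starting an MPICH2 mpd ring inside a Torque job.

# write_mpd_hosts NODEFILE OUTFILE -> writes "host:slots" lines and prints
# the number of distinct hosts.
write_mpd_hosts() {
    local nodefile=$1 outfile=$2
    [ -r "$nodefile" ] || return 1
    sort "$nodefile" | uniq -c | awk '{print $2":"$1}' > "$outfile" || return 1
    echo $(( $(wc -l < "$outfile") ))
}

# ensure_mpd_conf FILE SECRET -> creates FILE with MPD_SECRETWORD if absent
# and enforces mode 600, which the table lists as a requirement for mpd.
ensure_mpd_conf() {
    local conf=$1 secret=$2
    if [ ! -e "$conf" ]; then
        # Create with restrictive permissions before the secret is written.
        ( umask 077 && echo "MPD_SECRETWORD=${secret}" > "$conf" ) || return 1
    fi
    chmod 600 "$conf"
}
```

In a job script: <code>NHOSTS=$(write_mpd_hosts "$PBS_NODEFILE" "$HOME/tmp/mpd.nodes") && mpdboot -f "$HOME/tmp/mpd.nodes" -n "$NHOSTS"</code>, so the ring size follows the <code>nodes=</code> request instead of a fixed 2.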
Submission scripts:<br>
'''Mpich1'''

'''Mpich2'''
<pre>
#!/bin/bash
#PBS -N testmpich2
#PBS -l cput=00:60:00
#PBS -l nodes=2:ppn=4

. /etc/profile.d/modules.sh
module load modules modules-init modules-bobsced
module load mpich2

# Build the mpd host file (host:slots) from the nodes Torque assigned
sort "$PBS_NODEFILE" | uniq -c | awk '{print $2":"$1}' > /cluster/home/kwanous/tmp/mpd.nodes

# Start one mpd per node (2 nodes requested above), run the job, then shut the ring down
mpdboot -f /cluster/home/kwanous/tmp/mpd.nodes -n 2
mpiexec -np 9 /cluster/home/kwanous/a.out
mpdallexit

rm -f /cluster/home/kwanous/tmp/mpd.nodes
</pre>

'''Mpich2 - OSC'''
<pre>
#!/bin/bash
#PBS -N testmpich2osc
#PBS -l cput=00:60:00
#PBS -l nodes=2:ppn=4

hostname
. /etc/profile.d/modules.sh
module load modules modules-init modules-bobsced
module load mpich2
mpirun -np 8 /cluster/home/kwanous/a.out</pre>

'''Openmpi'''
<pre>#!/bin/bash
#PBS -N testopenmpi
#PBS -l cput=00:60:00
#PBS -l nodes=2:ppn=4

hostname
. /etc/profile.d/modules.sh
module load modules modules-init modules-bobsced
module load openmpi/1.3.1
mpirun -np 9 /cluster/home/kwanous/a.out</pre>
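The three submission scripts share the same PBS header and module setup and differ only in job name, MPI module and launch line. A sketch of a generator that prints such a script, so the Torque columns of the comparison table can be rerun consistently; the function name and argument order are hypothetical, and the resource limits are simply the ones used in the scripts:

```shell
#!/bin/bash
# Sketch: print a Torque submission script in the style used on this page.
# make_pbs_script JOBNAME MODULE LAUNCH_COMMAND...
make_pbs_script() {
    local name=$1 module=$2
    shift 2
    # Unquoted EOF: ${name}, ${module} and $* are expanded; everything else is literal.
    cat <<EOF
#!/bin/bash
#PBS -N ${name}
#PBS -l cput=00:60:00
#PBS -l nodes=2:ppn=4

hostname
. /etc/profile.d/modules.sh
module load modules modules-init modules-bobsced
module load ${module}
$*
EOF
}
```

For example: <code>make_pbs_script testopenmpi openmpi/1.3.1 mpirun -np 9 /cluster/home/kwanous/a.out > job.sh && qsub job.sh</code>.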

Latest revision as of 00:16, 22 April 2010

== Log ==
Green color indicates something that still needs to be done.

=== Cloning ===

=== Head Node ===
Yum installed:

=== Ganglia ===

=== Networking ===
<pre>
static_routes="bs0"
route_bs0="192.168.0.1 159.28.234.200"
</pre>

=== Modules ===

=== Torque ===

=== Maui ===

=== Intel Firmware Updates ===

=== Mail ===

=== NFS ===

=== WebMO ===
<pre>
Path to perl:         /usr/bin/perl
Webserver name:       bs0-new.cluster.earlham.edu
HTML directory:       /var/www/webmo
HTML URL:             /webmo
CGI script directory: /var/www/cgi-bin
CGI script URL:       /cgi-bin
User files directory: /mounts/bobsced/WebMO
</pre>