Ccg-admin

= Current To Do =
* This section has been moved to a GDrive document, https://docs.google.com/document/d/1_qsH4eFZRW_rmqq2kgMF_DbKBL-HukjBuAdr93j0oJ0/edit
= Cluster Pages =
* http://cluster.earlham.edu/wiki/index.php/Al-salam
= New Software =
Much of the software we install on the clusters is installed as modules. Environment modules are an easy way of installing multiple versions of a package and letting users trivially change their environment variables (<tt>PATH, LD_LIBRARY_PATH, C_INCLUDE_PATH, etc.</tt>) to point at whichever version they want to use.
If I want to use gcc version 5.1.0 instead of the system default 4.4.7, all I would have to type is:
    <tt> module load gcc/5.1.0 </tt>
If gcc 4.4.7 were also a module and I wanted to swap them to make sure I'm using gcc 5.1.0, all I would type is:
    <tt> module swap gcc/4.4.7 gcc/5.1.0 </tt>
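If you are not sure what is already installed, the standard Environment Modules commands can be used to look around first (a quick sketch; the versions shown are only whatever happens to be installed on that machine):

<pre class="text">
module avail              # list all modules/versions available on this machine
module list               # list the modules currently loaded in this shell
module unload gcc/5.1.0   # drop a module you no longer want
</pre>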
Before installing new software, figure out whether it should be a yum package or a source kit, whether it belongs in the system space or in modules, and what library dependencies it has. Proceed as appropriate.
== Building software packages ==
* Download the source tarball into /root/
* Unpack, build with ./configure, etc. Make sure you set the --prefix option to be --prefix=/mounts/[machine]/software/[package-name]/[package-version]. A typical sequence is sketched below.
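As a concrete example, a source build might look like this (a minimal sketch; <tt>foo-1.2.3</tt> is a hypothetical package and <tt>layout</tt> stands in for whichever machine you are building on):

<pre class="text">
cd /root/
tar xzf foo-1.2.3.tar.gz
cd foo-1.2.3
./configure --prefix=/mounts/layout/software/foo/1.2.3
make
make install
</pre>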
=== Needs to be Updated ===
* Make a <tt><package>-<version>.config.sh</tt> script that runs ./configure with all your options (so that it's kept around in case we need to reinstall).
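For example, a config script for the hypothetical <tt>foo-1.2.3</tt> package above might look like this (a sketch; the option list will differ per package):

<pre class="text">
#!/bin/sh
# foo-1.2.3.config.sh -- keep the exact configure invocation around for future rebuilds
./configure \
    --prefix=/mounts/layout/software/foo/1.2.3 \
    --enable-shared
</pre>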
== Installing a yum package into Modules structure ==
* How?
== Enabling built software packages within Modules structure ==
On the head node of each cluster, modules are installed into <tt> /mounts/machine/software </tt>, where machine is the name of the actual machine. This directory is visible to all of the nodes of that machine. In that directory there <i>should</i> be a subdirectory for each module that is available, and within those each version has its own directory (as of July 15 2015 the modules setup and organization is messy on many of the clusters; the notes here are how it should be set up from here on out).

So, for example, if we have gcc versions 5.1.0, 4.7.1, and 4.9.0 installed on fatboy, it would look like this:
  <tt> $ ls -1 /mounts/fatboy/software/gcc</tt>
  <tt>    5.1.0 </tt>
  <tt>    4.7.1 </tt>
  <tt>    4.9.0 </tt>
* <tt>$ sudo su -</tt>
* <tt>$ cd /mounts/al-salam/software/Modules/3.2.7/modulefiles</tt>
* <tt>$ ls</tt> and look for another package that has a similar usage model as the package you're installing (e.g., Python module, C/C++ library, utility, library+utilities)
* Copy that to your new package, e.g., <tt>cp -r openmpi <software></tt>
* <tt>$ cd <software>; ls</tt> Note the filename that appears.
* Move that file to your package's <version>
* Edit <version>
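Putting those steps together for a hypothetical new gcc module (a sketch only; the existing openmpi modulefile is just being used as a template, and the version numbers are examples):

<pre class="text">
sudo su -
cd /mounts/al-salam/software/Modules/3.2.7/modulefiles
cp -r openmpi gcc   # start from a package with a similar usage model
cd gcc
ls                  # note the existing version file, e.g. 1.6.5
mv 1.6.5 5.1.0      # rename it to the version you just installed
vi 5.1.0            # edit the paths inside to point at the new install
</pre>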
If you think your new package is important enough to be loaded by default, then add it to the list in <tt>/mounts/al-salam/software/Modules/3.2.7/init/al-salam.{sh,csh}</tt>
= DNS/DHCP for a single host =
Find an IP that's not in use. The easiest way to do that is to look in the file <tt>/var/named/master/cluster.zone</tt>.
Add the name and IP following the pattern in the file, like the line below. Be sure to change the serial number at the top of the file to represent the year, month, day, and version.
    <tt> dali.cluster.earlham.edu. IN A 159.28.234.126 </tt>
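The serial number convention is the date plus a two-digit revision counter (year, month, day, version), so the first edit made on 13 October 2017 might use a serial like this (a hypothetical value; the only hard requirement is that the new serial is larger than the old one):

<pre class="text">
2017101301    ; serial
</pre>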
 
Save the file. Every time you add an entry to the zone file, you have to edit the reverse zone file. The reverse zone file is <tt>/var/named/master/159.28.234.zone</tt>.
Add an entry for the host you added in the zone file. Notice the first number there is the last octet of the IP that you gave the host.  
    <tt> 126 IN PTR dali.cluster.earlham.edu. </tt>
 
Next you'll want to stop and then start DNS with the following commands.
    <tt>service named stop</tt>
    <tt>service named start</tt>

Now that DNS is updated, we have to update DHCP. The file you want to edit is <tt>/etc/dhcp/dhcpd.conf</tt>.
Towards the bottom of the file you'll add an entry like this:
    <tt> host <hostname> { hardware ethernet <MACaddress>; fixed-address <hostname>.cluster.earlham.edu; } </tt>

Save the file. Just like we did for the DNS config file, we need to stop and then start DHCP.
    <tt>service dhcpd stop</tt>
    <tt>service dhcpd start</tt>

As a test, reboot the client.
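You can also verify the new records directly from any machine that uses this DNS server (a quick sanity check using the dali example from above):

<pre class="text">
host dali.cluster.earlham.edu     # should return 159.28.234.126
dig -x 159.28.234.126 +short      # should return dali.cluster.earlham.edu.
</pre>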
= Setting up LDAP =
When installing and configuring LDAP, it can be tedious and frustrating, but no worries! I went through the troubles and took notes as I went so no one else would have to suffer like I did! These notes are pretty detailed, but I would suggest using one of the other servers with a newer CentOS version (layout, fatboy) as a resource when installing and configuring, especially if you are configuring it for a cluster.

Packages that need to be installed (both head and compute nodes):
*openldap
*openldap-clients
*openssh-ldap
*pam_ldap
*nss-pam-ldapd

We use NSS and nslcd in conjunction with PAM for LDAP authentication. It may be older than SSSD, but we already know how to do it. So, we want to turn off SSSD. If sssd is not running, then great, that'll make your life a lot easier!
    <tt>service sssd stop</tt>
    <tt>chkconfig sssd off</tt> #so it doesn't restart if the machine reboots
    <tt>chkconfig --del sssd</tt> #delete it as a service because we don't want it

There are a lot of files that need to be modified in order for LDAP to work correctly.
*/etc/openldap/ldap.conf:
    URI ldap://cluster.earlham.edu/
    BASE dc=cluster, dc=loc
    TLS_CACERTDIR /etc/openldap/cacerts

*/etc/nsswitch.conf
<pre class="text">
    passwd:  ldap files
    group:  ldap files
    shadow: ldap files

    ethers:    files
    netmasks:  files
    networks:  files
    protocols:  files
    rpc:        files
    services:  files ldap

    netgroup:  ldap files

    publickey:  nisplus

    automount:  files ldap
    aliases:    files

    sudoers:    ldap files
</pre>
 
*/etc/pam_ldap.conf
    rootbinddn cn=Manager,dc=cluster,dc=loc
    nss_base_passwd ou=people,dc=cluster,dc=loc?one
    nss_base_shadow ou=people,dc=cluster,dc=loc?one
    nss_base_group ou=group,dc=cluster,dc=loc?one
    nss_map_objectclass posixAccount User
    nss_map_objectclass shadowAccount User
    nss_map_objectclass posixGroup Group
    nss_map_attribute uid userName
    nss_map_attribute gidNumber gid
    nss_map_attribute uidNumber uid
    nss_map_attribute cn groupName
    base dc=cluster,dc=loc
    pam_password crypt
    uri ldap://cluster.earlham.edu/
    ssl no
    tls_cacertdir /etc/openldap/cacerts

*/etc/sudo-ldap.conf
    uri ldap://cluster.earlham.edu/

*/etc/pam.d/password-auth
*/etc/pam.d/system-auth
    instead of pam_sss.so, it should be pam_ldap.so
 
*/etc/nslcd.conf
    base  group  ou=group,dc=cluster,dc=loc
    base  passwd ou=people,dc=cluster,dc=loc
    base  shadow ou=people,dc=cluster,dc=loc
    uid nslcd
    gid ldap
    uri ldap://cluster.earlham.edu/
    base dc=cluster, dc=loc
    ssl no
    tls_cacertdir /etc/openldap/cacerts

*/etc/ssh/sshd_config
    UsePAM yes

*/etc/sysconfig/authconfig
    USESSSDAUTH=no
    USESHADOW=yes
    USESSSD=no
    USELDAPAUTH=yes
    USELDAP=yes
    USECRACKLIB=yes
    PASSWDALGORITHM=descrypt

Since we deleted sssd, we need to start the alternative, and make sure it starts on boot up.
    <tt>service nslcd start</tt>
    <tt>chkconfig nslcd on</tt>

Check to make sure nscd is turned off. That is a caching service for ldap. Since we're so small here, we don't really need that.
    <tt>service nscd stop</tt>
    <tt>chkconfig nscd off</tt>
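Once everything is configured, it's worth confirming that the node can actually resolve users and groups through LDAP before moving on (a quick check; substitute any known LDAP username):

<pre class="text">
getent passwd <username>      # should return the user's LDAP entry
getent group users            # should show the shared "users" group
ldapsearch -x -b "dc=cluster,dc=loc" "(uid=<username>)"   # query the directory directly
</pre>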
= Users and Groups =
Users are authenticated using an LDAP (Lightweight Directory Access Protocol) server running on Hopper; this is how users are authenticated throughout the entire cluster realm. We use LDAP for all users and groups except for the ccg-admin user, the root user, and the wheel group, which are local to each cluster. Every user is a member of the users LDAP group, which is group number 115, and all clusters should look at ldap first and then files; this is specified in the /etc/nsswitch.conf file.

A user can change their password with the <tt>passwd</tt> command while on Hopper. It will prompt them for their current password and then their desired new password. If it succeeds it will say so at the end, with something like 'All LDAP tokens successfully changed.' or something close to that.
== Creating New Users ==
Because creating users in LDAP is somewhat confusing, sample files and a python script were written to help. The script is addusers.py and lives in <tt>~root/ldap-files/</tt> on hopper. I'll explain things here, but there's a README file in that directory that explains everything as well.

To create a user in LDAP, you must create an .ldif file for that user. This is what <tt>addusers.py</tt> does for you. <tt>addusers.py</tt> takes a file of new users as a command line argument. The file must specify <tt> First Name:username:email </tt> for each user, with each user on a separate line. The file <tt>add-user.ldif</tt> is an example of what the file should look like.
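For instance, a two-line input file might look like this (hypothetical names and addresses, following the <tt>First Name:username:email</tt> pattern described above):

<pre class="text">
Ada:alovelace:alovelace@earlham.edu
Grace:ghopper:ghopper@earlham.edu
</pre>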
 
<tt> sudo python addusers.py add-user.ldif </tt> will create an .ldif file for each user and use ldapadd to add them to the LDAP database. The contents of the .ldif file for each user added will be printed to the screen, and each user will be sent an email with their username and password. addusers.py cleans up after itself, so you don't need to worry about that. There's one more thing that has to be done after this step: we need to set up the ssh keys for each user.
For each user created:

Become that user:
    <tt> su - user </tt>

SSH to as0:
    <tt> ssh as0 </tt>

It will ask you for the password of the user. Then it will prompt you for information about where to save the public key file and for a passphrase. For all of these, just hit enter. That will set it to the default. Go back to hopper and do the same thing for all the new users.
 
It is VERY important that you use UID and GID numbers that have not already been taken. If new users and groups have been added correctly, then there shouldn't be a problem with that. maxuid is a file that specifies the next UID to use when creating a new user. addusers.py reads from that file when creating the .ldif files for each user, and at the end overwrites that file with the new maxuid. If you're nervous about the UID numbers, it is ok to double check. Running <tt> ldapsearch -x </tt> should output everything in the database with the latest entry at the bottom. Look for the UID in that entry and compare it with maxuid. If maxuid is one above that number then all is golden. It's also safe to look in /etc/passwd to make sure no one is using that number either.
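A quick way to do that double check from the shell (a sketch; this pulls the highest uidNumber currently in the people branch and compares it with maxuid, assuming maxuid sits in the same ldap-files directory as the script):

<pre class="text">
ldapsearch -x -b "ou=people,dc=cluster,dc=loc" "(objectClass=posixAccount)" uidNumber | grep '^uidNumber:' | awk '{print $2}' | sort -n | tail -1
cat ~root/ldap-files/maxuid    # should be one higher than the number above
</pre>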
 
== Other modifications to the DB ==
Other modifications to the database, like adding a new group, adding users to that group, or deleting users, all use .ldif files similar to adding users. In the same directory as the above files there are sample .ldif files that do these operations, and each file's name says what it does. add-group.ldif will add a group, using the ldapadd command. add-user-to-group.ldif can be used to add a user to a group, del-from-grp.ldif can delete users from groups, and chg-pw shows an example of changing the password of a user; those three use the ldapmodify command. Make sure you modify the files to what you need, and especially don't forget to change the gid if you add a group. Make sure it's a GID that's not already in use.
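As an illustration, the kind of content such a file holds for a brand new group might look like this (a hypothetical sketch, not the actual contents of the sample files; pick a gidNumber that is not already in use):

<pre class="text">
# written here with a shell heredoc just to show the LDIF contents
cat > add-group.ldif <<'EOF'
dn: cn=newgroup,ou=group,dc=cluster,dc=loc
objectClass: posixGroup
cn: newgroup
gidNumber: 1200
EOF
</pre>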
 
The command for modifying the database, for example to add a user to a group, is below. You'll need to append the Manager password to the end of this command; it has been redacted here but can be found in the README file, readable only with root privileges.

    <tt> ldapmodify -f add-user-to-group.ldif -D "cn=Manager,dc=cluster,dc=loc" </tt>

If you're adding a group, change <tt>ldapmodify</tt> to <tt>ldapadd</tt>. The cn=Manager stuff is just specifying the manager of the database, which will allow you to change it.

To delete a user from the ldap database, it's a little simpler. Use the command below. Again, the Manager password must be specified and has been redacted here, but it will be the same as the password used in the other commands, and can be found in the README file. You will need to change the uid to be equal to the uid of the person you want to delete. The uid here is just the person's username.

    <tt> ldapdelete "uid=sbsp,ou=people,dc=cluster,dc=loc" -D "cn=Manager,dc=cluster,dc=loc" </tt>
= Monitoring =  
* Ganglia - http://cluster.earlham.edu/ganglia/?r=month&s=descending&c=
* Machine room environment - (Xunfei/Tugi/Wilson)

= Disable graphical booting screen in CentOS =

To enable verbose booting and remove the loading bar graphic, simply remove <code>rhgb quiet</code> from the kernel line in <code>/boot/grub/grub.conf</code>.

<code>rhgb</code> stands for Red Hat graphical boot; the <code>quiet</code> option tells CentOS to suppress even more boot information.

Grub2 works a little differently. There is no <code>grub.conf</code> file. Instead, the config is generated from <code>/etc/default/grub</code> and a bunch of files living inside <code>/etc/grub.d/</code>.

To disable graphical boot on grub2:
1. Back up the default grub file <pre>cp /etc/default/grub /etc/default/grub.bak</pre>
2. Remove <code>rhgb quiet</code> from the line <code>GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet"</code> <pre>vim /etc/default/grub</pre>
3. Generate the new grub config <pre>grub2-mkconfig --output=/boot/grub2/grub.cfg</pre>

* https://www.redhat.com/archives/rhl-list/2004-May/msg07775.html
* Grub2: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/ch-working_with_the_grub_2_boot_loader
 
= Rebuilding and Creating RAID arrays with mdadm =

[[mdadm|mdadm RAID Documentation]]

= Torque PBS =
[[torque|Torque PBS Documentation]]

= Infiniband =

[[Infiniband|Infiniband Documentation]]

= Installing CHARMM =

== Load latest openMPI ==

<pre class="text">module load modules
module load gcc/4.9.0
module load openmpi
</pre>
== Install <code>libquadmath.so.0</code> ==

<pre class="text">./install.com gnu M
</pre>
== Clean (if needed) ==

<pre class="text">  ./install.com gnu M distclean
</pre>

= Bacula Backup Management =

2016-06-21

== Fluorite (Machine) ==

The jail <code>quartz</code> is the CS Bacula director; it lives on the machine fluorite, <code>fluorite.earlham.edu</code>.

The configuration files on BSD (i.e. in the quartz jail) are <code>/usr/local/etc/bacula/bacula-dir.conf</code> and <code>/usr/local/etc/bacula/bacula-fd.conf</code>. Each Bacula client has its own <code>bacula-fd.conf</code> configuration file that points back to <code>quartz</code>.
== Helpful Commands for working with jails ==
* <code>jls</code> lists jails
* <code>jexec &lt;JID&gt; &lt;some command&gt;</code> execute a command inside a jail
* <code>jexec &lt;JID&gt; bash</code> &quot;connect&quot; to a jail
== Bacula Commands (on Quartz) ==
* <code>jexec 1 bconsole</code>
* <code>jexec 1 /usr/local/etc/rc.d/bacula-dir restart</code>
* <code>jexec 1 /usr/local/etc/rc.d/bacula-fd restart</code>
= Using Infiniband for Layout's NFS =
Internal NFS mounts on Layout are now done over Infiniband.
<ol>
<li>add to <code>/etc/hosts</code> of all layout nodes
<ul>
<li><code>192.168.50.100 lo0.layout.ib</code></li></ul>
</li>
<li>update <code>/etc/exports</code> on lo0 with infiniband IP address
<ul>
<li>then run <code>exportfs -a</code> to update nfs</li>
<li>use <code>showmount</code> to check mounted machines</li></ul>
</li>
<li>change nfs mounts to ib fabric on layout nodes</li>
<li><p>un-mount then mount under lo0.layout.ib</p>
<pre class="text"># umount /scratch
# mount lo0.layout.ib:/scratch /scratch
# umount /mounts
# mount lo0.layout.ib:/mounts /mounts
# umount /var/www/
# mount lo0.layout.ib:/var/www/ /var/www/
</pre>
<ul>
<li>if it says &quot;device is busy&quot;, then you can try <code>umount -l /some/mount</code></li></ul>
</li>
<li><p>update <code>/etc/fstab</code> to use the lo0.layout.ib hostname (see the example after this list)</p></li></ol>
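For the last step, the corresponding <code>/etc/fstab</code> lines on a compute node would look something like this (a sketch only; keep whatever mount options the node already uses):

<pre class="text">lo0.layout.ib:/scratch   /scratch    nfs   defaults   0 0
lo0.layout.ib:/mounts    /mounts     nfs   defaults   0 0
lo0.layout.ib:/var/www   /var/www    nfs   defaults   0 0
</pre>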
NFS Reference: https://unix.stackexchange.com/questions/106122/mount-nfs-access-denied-by-server-while-mounting-on-ubuntu-machines
= Routing between 10Gb and IB in Layout =
In order to route traffic over the 10Gb network to layout's compute nodes, we need lo0 to mediate the exchange.

To temporarily establish routing between the servers we can use the <code>ip route add</code> command: <code>ip route add 10.10.10.0/24 via 192.168.50.100 dev ib0</code>

In order to make this persistent, we must create the file <code>/etc/sysconfig/network-scripts/route-ib0</code> with the static routing rule inside it.
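The contents of that file would be just the routing rule itself, mirroring the temporary command above (a sketch; this uses the ip-command syntax that the RHEL/CentOS network scripts accept):

<pre class="text">10.10.10.0/24 via 192.168.50.100 dev ib0
</pre>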
* [https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Deployment_Guide/s1-networkscripts-static-routes.html static routing with centos]
