Running on a9 nodes

The facility CPUs are being upgraded from Scientific Linux 7 (SL7) to a newer operating system, Alma 9 (a9). At this stage, more than two-thirds of the farm has been converted to a9, which is why jobs submitted from SL7 stay pending for so long. To run large numbers of jobs, you are encouraged to follow the instructions below. AFS is no longer supported on a9, so it is being replaced by NFS version 4.

The instructions below bridge the gap while we transition both the underlying OS and the file system.

During this transition period, STAR will continue to use code based on SL7 (we do NOT yet have a9-native support for our code). This means you will need to assemble and compile your code as usual, using rcas node ... Here is a quick recipe for running on a9 in SL7 containers.
 
  1. First, make sure your login is modified as follows:
    1. Wherever you have something like "setenv GROUP_DIR /afs/rhic.bnl.gov/star/group", replace it with "setenv GROUP_DIR /star/nfs4/AFS/star/group"
    2. You need to modify BOTH your $HOME/.cshrc and $HOME/.login
    3. This switches you from AFS to NFSv4. On rcas nodes, everything should work as usual (if not, please do not revert to AFS; report the issue instead)
      NB: For now, we are testing ONLY official STAR libraries - please do NOT use other libraries, private or otherwise.
       
  2. The submit nodes for launching jobs on Alma 9 are named starsub0X, where X is a number from 1 to 7 (e.g. starsub03).
    1. However, we ask that you use starsub03 or higher for now. Reason: starsub03 and above run Condor version 24.0.4 2025-02-02 BuildID: 784178, while 01/02 are still at 23.9.6 2024-08-08 BuildID: 748275 PackageID: 23.9.6-1 (adjustments were made with the latest version)
    2. starsub0X nodes are Alma 9 nodes. The STAR software is not yet available on them, but you can submit to the a9 farm from there.
       
  3. To submit from a9 (starsub0X):
    1. If you are a STAR Scheduler user, please use star-submit-beta and/or star-submit-template-beta. All you need to do is add the following line to your XML:
      <shell>singularity exec -e -B /direct -B /star -B /afs -B /gpfs -B /sdcc/lustre02 /cvmfs/star.sdcc.bnl.gov/containers/rhic_sl7.sif</shell>


    2. If you are not a STAR Scheduler user, make sure that, however you submit jobs, your shell script executes inside the container. In Condor, this may mean adjusting your JDL to read as follows:
      Arguments = "singularity exec -e -B /direct -B /star -B /afs -B /gpfs -B /sdcc/lustre02 /cvmfs/star.sdcc.bnl.gov/containers/rhic_sl7.sif /blabla/where-my-csh-script-is.csh"
      instead of
      Arguments = /blabla/where-my-csh-script-is.csh
      Adapt this as needed.
       
  4. NOTE: Before submitting, you may want to test ONE job interactively to make sure it works. To do this:
    1. Remember that starsub0X nodes run Alma 9, so our code is not yet supported there, as indicated earlier.
    2. Therefore, you will need to start a shell like this:
      singularity exec -e -B /direct -B /star -B /afs -B /gpfs -B /sdcc/lustre02 /cvmfs/star.sdcc.bnl.gov/containers/rhic_sl7.sif csh
      This will start an SL7 shell on an Alma 9 node.
    3. From that shell, execute one of the generated csh scripts and verify that all goes according to plan.
    4. If this runs, you are ready to submit from outside the singularity shell.
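
The recipe above can be sketched as two small shell helpers. This is only an illustration under stated assumptions: the function names migrate_group_dir and sl7_command are hypothetical and not part of any STAR tool, and the sed pattern only matches a GROUP_DIR line that uses exactly the AFS path quoted in step 1 (your login files may differ). The first helper rewrites one login file and keeps a .bak copy; the second only prints the container command from steps 3 and 4, so it can be checked before anything is run.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of the STAR software.

# Container image path as given in the recipe.
SL7_IMAGE="/cvmfs/star.sdcc.bnl.gov/containers/rhic_sl7.sif"

# Step 1: point GROUP_DIR at NFSv4 instead of AFS in one login file.
# A copy of the original is kept as FILE.bak (an existing .bak is overwritten).
migrate_group_dir() {
    file="$1"
    [ -f "$file" ] || return 0      # nothing to do if the file is absent
    cp -p "$file" "$file.bak" || return 1
    sed 's|\(setenv[[:space:]]*GROUP_DIR[[:space:]]*\)/afs/rhic\.bnl\.gov/star/group|\1/star/nfs4/AFS/star/group|' \
        "$file.bak" > "$file"
}

# Steps 3-4: print (do not run) the container command for a script or shell.
sl7_command() {
    echo "singularity exec -e -B /direct -B /star -B /afs -B /gpfs -B /sdcc/lustre02 $SL7_IMAGE $*"
}

# Possible usage (left commented out so that sourcing this file changes nothing):
#   migrate_group_dir "$HOME/.cshrc"
#   migrate_group_dir "$HOME/.login"
#   sl7_command csh        # the interactive test shell from step 4
```

Printing the command rather than executing it keeps the sketch safe to try on any node; once the output matches the line in the recipe, it can be pasted into a terminal or a JDL.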
       

Possible issues
  1. There have been reports of issues with the 32-bit version of ROOT/CINT. If you encounter an issue, please try the 64-bit environment.
  2. When using SIMD instructions, you may need to restrict jobs to a specific CPU architecture. We currently do not have a flag in the STAR scheduler for this, but
    requirements = (Microarch >= "x86_64-v4")  would limit jobs to one kind of node with specific SIMD instructions.
  3. From the production test, we have evidence of a slowdown when more jobs are running.
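
For the SIMD restriction in item 2, a job script could also check the node it lands on before running vectorized code. The sketch below assumes that what matters for your code is the AVX-512 subset that the x86-64-v4 level adds on top of v3 (avx512f, avx512bw, avx512cd, avx512dq, avx512vl, as named in the Linux /proc/cpuinfo flags); it does not verify the lower-level baseline. The function name has_v4_flags is hypothetical.

```shell
#!/bin/sh
# Sketch only: has_v4_flags is a hypothetical helper, not a STAR or Condor tool.

# Succeed only if every AVX-512 flag added by x86-64-v4 appears in the arguments.
has_v4_flags() {
    for need in avx512f avx512bw avx512cd avx512dq avx512vl; do
        case " $* " in
            *" $need "*) ;;        # flag present, keep checking
            *) return 1 ;;         # flag missing
        esac
    done
    return 0
}

# Possible usage on a Linux node (commented out; flags are read from /proc/cpuinfo):
#   if has_v4_flags $(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2); then
#       echo "AVX-512 flags for x86-64-v4 found"
#   fi
```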