AFS and STAR login scripts
STAR Login Testing
The current test node is "spare.star.bnl.gov", running SL 4.6 (i386), with a CVS checkout of the GROUP_DIR in /usr/local/star/group but no other STAR-specific configuration (not even /opt/star).
One point of confusion -- how does one clear the local AFS cache?
e.g.:
[root@spare ~]# fs getcacheparms
AFS using 54372 of the cache's available 100000 1K byte blocks.
[root@spare ~]# fs flushmount /afs/rhic.bnl.gov
[root@spare ~]# fs flushmount -p /afs/rhic.bnl.gov
[root@spare ~]# fs flushvolume -p /afs/rhic.bnl.gov
[root@spare ~]# fs flushvolume /afs/rhic.bnl.gov
[root@spare ~]# fs getcacheparms
AFS using 54372 of the cache's available 100000 1K byte blocks.
These (and several other variations I tried) have no effect on the reported size of cached content.
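For the record, one possible workaround (untested here, and assuming a standard OpenAFS client on SL4): shrinking the cache with "fs setcachesize" should force eviction of cached blocks, and restarting the client daemon discards the cache entirely. A sketch:

```shell
# Hypothetical cache-clearing workaround -- NOT verified on this node.
# Shrink the cache to force eviction of cached blocks...
fs setcachesize 1
# ...then restore the original size (100000 1K blocks, per getcacheparms above).
fs setcachesize 100000
# Alternatively, restarting the AFS client service discards the cache
# entirely (the service name may differ by distribution):
service afs restart
```

Whether "fs setcachesize" actually zeroes the usage reported by "fs getcacheparms" is something I have not confirmed.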
On to the login test results...
I have set DECHO for debugging output in all cases.
---
First, the control state:
TEST 1: Using "setenv GROUP_DIR /afs/rhic.bnl.gov/rhstar/group", with AFS in a normal working state. Results are in the attached file "TEST_01.txt". The login looks normal; root4star is in the path and executes.
---
TEST 2: Next, approximate the original problem state: use "setenv GROUP_DIR /afs/rhic.bnl.gov/rhstar/group" and block the AFS servers with two iptables rules:
iptables -I INPUT 1 -s 130.199.6.0/23 -j REJECT
iptables -I INPUT 1 -d 130.199.6.0/23 -j REJECT
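(To restore AFS between tests, the blocking rules have to be removed again. A sketch, assuming they are still the first two rules in the INPUT chain:)

```shell
# Remove the two blocking rules inserted above (positions 1 and 2 of INPUT).
iptables -D INPUT 1
iptables -D INPUT 1
# Or, equivalently, delete them by specification instead of position:
#   iptables -D INPUT -s 130.199.6.0/23 -j REJECT
#   iptables -D INPUT -d 130.199.6.0/23 -j REJECT
```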
and then wait several minutes. (In this case, I intentionally wait until seeing the "lost contact" messages in the system log.) Then try to log in. The login takes an instant, and there is no STAR environment:
[root@spare ~]# su - test
[test@spare ~]$
(no file associated with this test)
---
TEST 3: Next test -- "setenv GROUP_DIR /afs/rhic.bnl.gov/rhstar/group", block AFS servers as before with iptables, then log in immediately.
Login initially stalls without any output... This is likely the originally reported behaviour that spurred this whole effort. But after about two minutes, the login completes - no STAR environment, no output from the login scripts, but the shell is interactive:
[root@spare ~]# su - test
[test@spare ~]$
This differs from the previous test case only in the delay for the shell to become interactive.
(no file associated with this test)
---
Now move to testing with Jerome's suggestion for /usr/local/star/group:
Tests 4-6 use "setenv GROUP_DIR /usr/local/star/group"
TEST 4: First, with AFS working normally: The login proceeds quickly and root4star works. (See the attached file "TEST_04.txt" for the login output.)
---
TEST 5: Now with the simulated AFS failure (with iptables, as above), waiting several minutes between applying the iptables rules and logging in. The login completes quickly, and has only a handful of STAR or AFS environment variables. (See attached file "TEST_05.txt")
---
TEST 6: And now with simulated AFS failure, but logging in immediately:
The login started and paused for a bit over one minute before proceeding. (See the attached "TEST_06.txt".) This is not much different from TEST 3, though a handful of environment variables do get set (though they are probably not useful by themselves).
Conclusion (so far): I don't see much point in using the /usr/local/star/group installation by itself. It does not significantly change the users' ability to use their accounts. Logins still hang (but do eventually succeed) when attempted within a couple of minutes of the AFS server (or network) failure.
---
TEST 7: Change the simulated AFS failure by using DROP instead of REJECT in the iptables rules. Back to using "GROUP_DIR /afs/rhic.bnl.gov/rhstar/group", and pause for several minutes after the iptables change. The result is the same as TEST 2. (no file associated with this test)
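For reference, the DROP variant of the rules used in TESTS 7-10 (the same addresses as in TEST 2; only the target changes, so blocked packets are silently discarded instead of answered with an ICMP rejection):

```shell
iptables -I INPUT 1 -s 130.199.6.0/23 -j DROP
iptables -I INPUT 1 -d 130.199.6.0/23 -j DROP
```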
---
TEST 8: Use DROP instead of REJECT and log in immediately, with "GROUP_DIR /afs/rhic.bnl.gov/rhstar/group". There is a pause of about a minute during the login. The exact output from the login scripts is a bit different from any of the others. (My guess is that cache effects are coming into play.) (See "TEST_08.txt")
---
TEST 9: Use DROP instead of REJECT, wait to log in, "GROUP_DIR /usr/local/star/group". Login proceeds quickly, and warns that STAR Login is incomplete. The environment contains many AFS and STAR references. This is very different from TEST 5, which I'd expected to be about the same. (See "TEST_09.txt")
---
TEST 10: Use DROP instead of REJECT, immediate login, "GROUP_DIR /usr/local/star/group": A pause of about 1.5 minutes in the middle of the login. (See "TEST_10.txt".) It does not give any particular warning or notice about the login failure or missing STAR environment, instead simply ending with "STAR_ROOT: Undefined variable."
Conclusion -- same as before: having /usr/local/star/group or /afs/rhic.bnl.gov/star/group would make very little difference to the users. Logins can seem to get stuck either way, but eventually complete, and the working environment is crippled without AFS in any case.
more tests to try:
1. Login with AFS working, then blocking AFS - shell is likely to get stuck in some cases for up to two minutes or so.
2. Login after AFS has been blocked for a while, then restore AFS - does the user's environment have enough to start using the AFS software stack?
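For the second of those tests, a small polling loop could detect when AFS becomes reachable again after the iptables rules are removed. A sketch, where "true" stands in for a real probe such as "fs checkservers" (which needs a working AFS client):

```shell
#!/bin/sh
# Poll until the probe succeeds or a deadline passes.
# probe() is a stand-in; on a real node it would run e.g. 'fs checkservers'
# and check its output for "All servers are running."
probe() { true; }

deadline=$(( $(date +%s) + 120 ))
until probe; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
        echo "AFS still unreachable after 120s"
        exit 1
    fi
    sleep 5
done
echo "AFS servers reachable"
```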
None of these conclusions are applicable to the planned longer term state of eliminating AFS and using a local NFS (or something else?) installation of root, which will be subject to its own failure modes.
---
Other tests are possible, such as pulling the local network cable or blocking individual AFS servers rather than all of them at once (eg. only block one file server or only one volume location server).
- wbetts's blog