Boinccmd works fine if I call it once (no problem when launching it on the command line). But in my shell script I call boinccmd twice, and the last call hangs. boinc 7.2.47 works fine, and I have no problem when a monitor is connected to the computer.

I don't have a hint, but I've reported it upstream as #2535. Edit: it would probably help if you described how you're invoking boinccmd on the headless machine. Are you using ssh to run it directly on the affected machine, are you running boinccmd on a different machine with a monitor and using BOINC's inbuilt remote control protocol, or something else?

First, I connect to the host via ssh (there is no screen/keyboard/mouse attached to this computer). I request the boinc status using boinccmd -host -passwd -get_cc_status (to extract the boinc status and the gpu status). I have been able to reproduce the problem: if I make two requests, the second request fails (ps shows the boinccmd process, but it never stops). So I suspect boinccmd requires more time to answer with no screen attached, and the second request is launched before the first request completes. Here is the line from the script that reproduces the problem:

    BOINC_CURRENT_MODE_STATUS=$($BOINCCMD_BIN -host $FQDN:$BOINC_PORT -passwd $BOINC_PASS_VALUE -get_cc_status | grep 'current mode')

Try enabling debug logging and reproduce the problem for starters. Don't have any Manager connected at the same time.

    2 21:48:49 GUI RPC Command = '
    2 21:48:49 GUI RPC reply: '
    2 21:48:49 handler returned -102, closing socket
    2 21:48:49 got new GUI RPC connection

102 is ERR_READ, which in this case means reading from the socket. I have seen issues with boinccmd which go away if you cd to /etc/boinc-client; from memory they seem to have been fixed with a kernel update, even though it shouldn't have anything to do with it. What kernel and Linux flavour are you using?
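For reference, here is a minimal sketch of a wrapper script along the lines described in the thread. The variable names and the first query are taken from the post; the placeholder values, the second query, and the use of timeout as a guard against the hang are assumptions, not the original poster's code.

    #!/bin/bash
    # Sketch only: the values below are placeholders, not the poster's real settings.
    BOINCCMD_BIN=/usr/bin/boinccmd      # assumed install path
    FQDN=headless.example.org           # hypothetical host name
    BOINC_PORT=31416                    # BOINC's default GUI RPC port
    BOINC_PASS_VALUE=secret             # placeholder GUI RPC password

    # First request: overall run mode. This call completes normally.
    BOINC_CURRENT_MODE_STATUS=$($BOINCCMD_BIN -host $FQDN:$BOINC_PORT -passwd $BOINC_PASS_VALUE -get_cc_status | grep 'current mode')

    # Second request: repeat the status query, e.g. to pick out the GPU mode.
    # On the affected machine this is the call that hangs, so wrap it in a
    # timeout (GNU coreutils) as a crude guard instead of letting it block.
    # The 'second current mode line is the GPU line' assumption depends on the
    # exact boinccmd output layout.
    BOINC_GPU_MODE_STATUS=$(timeout 30 $BOINCCMD_BIN -host $FQDN:$BOINC_PORT -passwd $BOINC_PASS_VALUE -get_cc_status | grep 'current mode' | sed -n '2p')

    echo "CPU: $BOINC_CURRENT_MODE_STATUS"
    echo "GPU: $BOINC_GPU_MODE_STATUS"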
There was a scheduled power shutdown in the UM Tier3 server room due to maintenance of the facility. The shutdown lasted 6 hours, and a couple of things broke during it, including the network card of one UPS unit and the containerd/network forwarding service on one of the nodes of the slate kubelet cluster. (The containerd failure was caused by a wrong configuration of the .forwarding and .forwarding settings; they should be set to 1.)

The kubelet node problem caused one of the squid servers hosted on the kubelet cluster to be down; all traffic went to the other squid server, so this did not cause job failures. Later, the slate kubelet cluster node sl-um-es5 reverted the IP forwarding change made by cfengine, so the squid service went down again, and this caused a lot of BOINC jobs to fail, as all BOINC clients are configured to use this proxy server. We switched the BOINC proxy server to sl-um-es3, which is located in the Tier2 server room and should be more robust. The BOINC jobs started to refill the worker nodes after we changed the proxy, and later we fixed the sl-um-es5 node.

During our annual renewal of the host certificates, we made the mistake of requesting the gatekeepers' host certificates from InCommon RSA instead of from InCommon IGTF, and this started to cause authentication errors on all gatekeepers for any incoming jobs. The change was made late in the afternoon, and the error was not caught until the next morning, so the site got drained overnight. We replaced the RSA certs with IGTF certs on the gatekeepers, and the site started to ramp up. During the 17-hour draining period, BOINC jobs ramped up as we designed and filled up the whole cluster, so the overall CPU time used by ATLAS jobs stayed about the same as before the draining.
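The report does not name the two forwarding settings. Assuming they are the usual kernel IP forwarding sysctls (the exact keys here are an assumption), a minimal sketch of enabling and persisting them, run as root, looks like this:

    # Sketch only: the sysctl key names are assumed, not taken from the report.
    # Check the current values (1 = forwarding enabled).
    sysctl net.ipv4.ip_forward net.ipv6.conf.all.forwarding

    # Enable forwarding for the running kernel.
    sysctl -w net.ipv4.ip_forward=1
    sysctl -w net.ipv6.conf.all.forwarding=1

    # Persist the setting so a reboot or a configuration-management revert
    # does not silently disable forwarding again.
    cat <<'EOF' > /etc/sysctl.d/99-forwarding.conf
    net.ipv4.ip_forward = 1
    net.ipv6.conf.all.forwarding = 1
    EOF
    sysctl --system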
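A renewed certificate issued by the wrong CA can be caught before deployment by inspecting its issuer field. A minimal sketch, assuming the gatekeeper certificate sits at the conventional grid path and that the expected issuer string contains "IGTF" (both are assumptions, not details from the report):

    # Sketch only: the certificate path and the expected issuer string are assumptions.
    CERT=/etc/grid-security/hostcert.pem

    # Show who issued the certificate and when it expires.
    openssl x509 -in "$CERT" -noout -issuer -enddate

    # Warn if the issuer does not look like the IGTF CA we expect.
    openssl x509 -in "$CERT" -noout -issuer | grep -q 'IGTF' \
      || echo "WARNING: $CERT was not issued by the expected IGTF CA"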