WU failing, but never ending



Message boards : Number crunching : WU failing, but never ending

rebirthman
Joined: 11 Jun 16
Posts: 7
Credit: 309,976
RAC: 0
    
Message 6129 - Posted: 9 Feb 2017, 8:12:20 UTC

Hello,

I just wanted to point out an effect I notice quite frequently.

WUs occupy all cores but sit idle forever and never end until I abort them.

CPU time is always almost zero.
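
The symptom described here (long wall-clock time, CPU time near zero) can be turned into a simple check. The following is only a minimal sketch: the function name `looks_stalled` and both thresholds (one hour of grace, 5% CPU utilisation) are assumptions chosen for illustration, not values from the project or from BOINC.

```python
def looks_stalled(elapsed_s, cpu_s, cores=1, min_elapsed_s=3600.0, min_ratio=0.05):
    """Return True if a task has used almost no CPU for its elapsed time.

    elapsed_s     -- wall-clock seconds since the task started
    cpu_s         -- CPU seconds reported for the task
    cores         -- number of cores assigned to the task
    min_elapsed_s -- grace period before judging (assumed: 1 hour)
    min_ratio     -- minimum acceptable CPU utilisation (assumed: 5%)
    """
    if elapsed_s < min_elapsed_s:
        # Too early to tell; start-up phases can legitimately be idle.
        return False
    utilisation = cpu_s / (elapsed_s * cores)
    return utilisation < min_ratio


# A task running for 26 hours on 4 cores with one minute of CPU time.
print(looks_stalled(26 * 3600, 60.0, cores=4))
```

The elapsed time and CPU time would have to be read from the BOINC Manager task properties by hand, or fed in from whatever monitoring is already in place.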

Most recent example:
http://atlasathome.cern.ch/result.php?resultid=8286610

I am not sure whether this is just my individual setup or a common issue. Maybe someone from the project team could look into this in more detail and tell me whether I should change something on my side.

It's a bit frustrating to see the PC blocked by nothing useful for such a long time, and so frequently.

Looking forward to your feedback

Best regards
Michael

hsdecalc
Joined: 21 Feb 15
Posts: 5
Credit: 494,496
RAC: 0
    
Message 6130 - Posted: 9 Feb 2017, 11:18:24 UTC

Same here. One task has been running for 26 hours, but with little or no CPU usage.
The older thread is here: Return of the long running..

Yeti
Joined: 20 Jul 14
Posts: 699
Credit: 22,597,832
RAC: 0
    
Message 6131 - Posted: 9 Feb 2017, 14:13:56 UTC - in response to Message 6129.

rebirthman wrote:
CPU time is always almost zero.

Have you already worked through this checklist?

rebirthman
Joined: 11 Jun 16
Posts: 7
Credit: 309,976
RAC: 0
    
Message 6135 - Posted: 10 Feb 2017, 10:43:35 UTC - in response to Message 6131.

Hello,

Yes, I know the checklist and the forum thread, and I assume my setup is in line with them.

Anyway, I was rather hoping to get feedback on the error log of the failed WU I shared, as it might help to understand the cause better.

In particular, lines like these might point the experts closer to the source:

2017-02-08 15:29:12 (9144): Guest Log: Copying input files into RunAtlas.
2017-02-08 15:29:12 (9144): Guest Log: Copied input files into RunAtlas.
2017-02-08 15:29:12 (9144): Guest Log: Starting ATLAS job. Output is redirected into runtime_log.
2017-02-08 15:29:12 (9144): Guest Log: Failed! Shutting down the machine.
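
Lines of this shape can be checked mechanically to see how long the job survived after it was started. The sketch below parses only the four lines quoted above; the regular expression and the function names `parse_guest_log` and `seconds_until_failure` are assumptions made for illustration and are not part of BOINC or of the ATLAS wrapper.

```python
import re
from datetime import datetime

# The four lines quoted in the post, used here as sample input.
SAMPLE = """\
2017-02-08 15:29:12 (9144): Guest Log: Copying input files into RunAtlas.
2017-02-08 15:29:12 (9144): Guest Log: Copied input files into RunAtlas.
2017-02-08 15:29:12 (9144): Guest Log: Starting ATLAS job. Output is redirected into runtime_log.
2017-02-08 15:29:12 (9144): Guest Log: Failed! Shutting down the machine.
"""

# Assumed layout: "<date> <time> (<pid>): Guest Log: <message>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): Guest Log: (.*)$"
)


def parse_guest_log(text):
    """Return a list of (timestamp, message) for every Guest Log line."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            entries.append((ts, match.group(3)))
    return entries


def seconds_until_failure(entries):
    """Seconds between 'Starting ATLAS job' and the next 'Failed!' line.

    Returns None if no such pair is found.
    """
    start = None
    for ts, msg in entries:
        if msg.startswith("Starting ATLAS job"):
            start = ts
        elif msg.startswith("Failed!") and start is not None:
            return (ts - start).total_seconds()
    return None


print(seconds_until_failure(parse_guest_log(SAMPLE)))  # prints 0.0
```

For the excerpt above, the start and the failure carry the same timestamp, so the job failed within the same second it was started.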

Looking forward to feedback

br Michael

PHILIPPE
Joined: 24 Jul 16
Posts: 84
Credit: 53,413
RAC: 0
    
Message 6137 - Posted: 10 Feb 2017, 17:27:58 UTC - in response to Message 6135.

Hi rebirthman,

When an error occurs, it comes from either the server side or the client side.
I don't know what is happening on the server side, but looking at your log, I notice that most of the WUs validated with errors ran in slot 9 of your computer.
So the "weakness" seemed to be in this slot (WUs validated with errors on 7 Feb 2017), perhaps because of remnants of faulty WUs that BOINC did not delete. This may be why you had to abort this abnormal WU whose CPU time was near zero.
But since you changed the VirtualBox version (5.1.10 --> 5.1.14), it now seems to be solved.
From time to time, it is necessary to clean the slot directories after a series of errors (either with VirtualBox or with the file explorer).
It is difficult to say more at this stage...
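
Before cleaning anything by hand, it can help to see what is actually left in each slot. The following is a read-only sketch under stated assumptions: it expects the path of the BOINC data directory to be passed in (the location differs per installation and is not given in this thread) and assumes the usual layout of numbered subdirectories under `slots`. The function name `list_slot_contents` is hypothetical.

```python
from pathlib import Path


def list_slot_contents(data_dir):
    """Map each numbered slot directory to a sorted list of its entries.

    data_dir is the BOINC data directory (supplied by the caller).
    Nothing is deleted or modified; this only reports what is there.
    """
    slots = Path(data_dir) / "slots"
    report = {}
    if not slots.is_dir():
        return report
    for slot in slots.iterdir():
        # Only consider numbered slot directories such as slots/9.
        if slot.is_dir() and slot.name.isdigit():
            report[int(slot.name)] = sorted(p.name for p in slot.iterdir())
    return report
```

A slot that still holds files while no task is running in it would be a candidate for a closer look. Any actual cleanup is best done with the BOINC client stopped.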

Regards.

ヽ(⌒‐⌒)ゝ
