Did the spammer cause new longrunners??


Message boards : Number crunching : Did the spammer cause new longrunners??

Tom*
Joined: 28 Jun 14
Posts: 118
Credit: 8,761,428
RAC: 0
    
Message 6198 - Posted: 9 Mar 2017, 3:12:04 UTC
Last modified: 9 Mar 2017, 3:18:36 UTC

Ever since the spammer spammed, all my tasks have run longer than 10 hours, and none has finished so far.

They do use CPU and look OK in "Show VM Console".

I have 5 tasks on three systems that have been running for over 12 hours now.

DaveM
Joined: 24 Jun 14
Posts: 3
Credit: 1,172,677
RAC: 0
    
Message 6199 - Posted: 9 Mar 2017, 4:10:30 UTC

Same problem here. It's good to know someone else is having the same issue. At least now I know it's not my machine.

Erich
Joined: 18 Dec 15
Posts: 253
Credit: 1,942,248
RAC: 0
    
Message 6200 - Posted: 9 Mar 2017, 4:10:35 UTC - in response to Message 6198.

Same here, on all 3 systems, multi-core as well as single-core.
I have just aborted three tasks, as they had been running for some 15-17 hours and seemed never-ending.
We had this a few months ago, and the WUs turned out to be faulty.
Maybe the same is true this time.

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 13 May 14
Posts: 252
Credit: 2,028,556
RAC: 1
    
Message 6201 - Posted: 9 Mar 2017, 6:44:02 UTC - in response to Message 6200.

I see the same problem everywhere; I'm checking it.

maeax
Joined: 25 Jun 14
Posts: 50
Credit: 1,700,662
RAC: 0
    
Message 6202 - Posted: 9 Mar 2017, 7:20:39 UTC

Same here; I have cancelled the tasks.

On the homepage there is no workunit-list 200.000 with more than 100.000 finished!

hsdecalc
Joined: 21 Feb 15
Posts: 5
Credit: 494,496
RAC: 0
    
Message 6203 - Posted: 9 Mar 2017, 8:09:31 UTC

What I do:
If the value <fraction_done>0.000000</fraction_done> in the file boinc_task_state.xml has not changed after some minutes, I cancel the job.
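
A minimal Python sketch of that manual check, for anyone who wants to script it. It assumes the file contains a single <fraction_done> element, as quoted above; the function names read_fraction_done and looks_stalled are made up for this illustration. Note that later replies in this thread report the value staying at 0.000000 even for tasks that are working fine, so treat it as one hint at most, not as proof of a broken task.

```python
import re

# Matches the element quoted in the post, e.g. <fraction_done>0.000000</fraction_done>
FRACTION_RE = re.compile(r"<fraction_done>\s*([0-9.eE+-]+)\s*</fraction_done>")


def read_fraction_done(xml_text):
    """Return fraction_done as a float, or None if the element is missing."""
    match = FRACTION_RE.search(xml_text)
    return float(match.group(1)) if match else None


def looks_stalled(earlier_text, later_text):
    """Compare two snapshots of boinc_task_state.xml taken some minutes apart.

    Returns True only when both snapshots contain a value and the later
    value is not larger than the earlier one (hypothetical helper).
    """
    before = read_fraction_done(earlier_text)
    after = read_fraction_done(later_text)
    if before is None or after is None:
        return False  # cannot tell without both values
    return after <= before
```

To use it, read boinc_task_state.xml from the task's slot directory twice, a few minutes apart, and pass both contents to looks_stalled.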

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 13 May 14
Posts: 252
Credit: 2,028,556
RAC: 1
    
Message 6204 - Posted: 9 Mar 2017, 8:54:20 UTC

I found the problem, it's related to a small change I made yesterday. What is happening is that the task is running over and over again inside the VM. I have made a fix that will be automatically picked up by running tasks within a few hours so they should eventually exit. So if you want to get the credit for these longrunners it's better to keep them running.

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 13 May 14
Posts: 252
Credit: 2,028,556
RAC: 1
    
Message 6205 - Posted: 9 Mar 2017, 12:34:53 UTC

Things seem to be OK now; I see successful WUs coming in again.

I got 2200 credit for a WU using 13 CPU hours, so it's worth leaving these running until they finish.

Erich
Joined: 18 Dec 15
Posts: 253
Credit: 1,942,248
RAC: 0
    
Message 6206 - Posted: 9 Mar 2017, 14:31:31 UTC - in response to Message 6205.

I got 2200 credit for a WU using 13 CPU hours, so it's worth leaving these running until they finish.

Too bad that I had cancelled all these WUs just half an hour before you told us what was going on :-(

HerveUAE
Joined: 18 Dec 16
Posts: 44
Credit: 509,829
RAC: 0
    
Message 6207 - Posted: 9 Mar 2017, 19:27:19 UTC
Last modified: 9 Mar 2017, 19:29:23 UTC

All my 4 ATLAS@Home tasks are long-runners: 4d 18h, 1d 23h, 1d 2h and 1d 1h. I'll wait till tomorrow to see if the fix from David has helped recover them.

Maybe the one that has been running for more than 4 days is a different issue. I have seen from the stats that a task may take up to 7 days to complete, so I may be waiting a little bit longer.

Edit: it's been 7 hours since the post from David, and my long-runners are still happily crunching (each is using 100% of its allocated cores).

rbpeake
Joined: 27 Jun 14
Posts: 86
Credit: 8,794,961
RAC: 0
    
Message 6208 - Posted: 9 Mar 2017, 20:14:39 UTC - in response to Message 6207.

If you want to force it: I clicked Update for the project in the BOINC application, and the long-runners ended shortly thereafter.

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 13 May 14
Posts: 252
Credit: 2,028,556
RAC: 1
    
Message 6209 - Posted: 10 Mar 2017, 8:35:42 UTC - in response to Message 6208.

I suspect the 4-day task is a different issue, one of the "normal" longrunners.

Can you check the stderr.txt in the slots directory? This will show you if the task is continually restarting. For example, see this result, which logs "Starting ATLAS job" every hour:

http://atlasathome.cern.ch/result.php?resultid=8485201
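
For anyone who prefers to script this check, here is a rough Python sketch. It only counts how often the text "Starting ATLAS job" appears in each stderr.txt; the layout slots/<number>/stderr.txt and the function names are assumptions made for this illustration, so adjust the path to your own BOINC data directory.

```python
import sys
from pathlib import Path

MARKER = "Starting ATLAS job"  # log text mentioned in the post above


def count_job_starts(log_text, marker=MARKER):
    """Return how many lines of the log contain the marker text."""
    return sum(1 for line in log_text.splitlines() if marker in line)


def is_restarting(log_text):
    """More than one start entry suggests the job ran more than once."""
    return count_job_starts(log_text) > 1


def scan_slots(slots_dir):
    """Map each <slots_dir>/*/stderr.txt path to its number of job starts.

    The directory layout is an assumption; pass your own slots directory.
    """
    results = {}
    for log_path in sorted(Path(slots_dir).glob("*/stderr.txt")):
        text = log_path.read_text(errors="replace")
        results[str(log_path)] = count_job_starts(text)
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_restarts.py /path/to/BOINC/slots
    for path, starts in scan_slots(sys.argv[1]).items():
        note = "restarting?" if starts > 1 else "ok"
        print(f"{path}: {starts} start(s) [{note}]")
```

A count of 1 matches a task that started once; a count that keeps growing over time matches the restart loop described above.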

Yeti
Joined: 20 Jul 14
Posts: 699
Credit: 22,597,832
RAC: 0
    
Message 6210 - Posted: 10 Mar 2017, 10:47:59 UTC - in response to Message 6207.
Last modified: 10 Mar 2017, 10:48:15 UTC

Here is another way to find out something about longrunners:

Open the "VM Console"
Click with the mouse into the console-screen
At Login-Prompt, enter a username, e.g. Atlas
If the prompt for Password appears, all seems to be fine
If the prompt for password doesn't appear, something inside the VM seems to be broken and the only thing you can do is abort it

I have tested this with something like 10 WUs on my system, and all long-running WUs that showed the Login prompt finished successfully. I have aborted all the others.

HerveUAE
Joined: 18 Dec 16
Posts: 44
Credit: 509,829
RAC: 0
    
Message 6211 - Posted: 10 Mar 2017, 14:36:08 UTC

Can you check the stderr.txt in the slots directory?

For each of the 4 tasks, the entry "Starting ATLAS job" is logged only once.

HerveUAE
Joined: 18 Dec 16
Posts: 44
Credit: 509,829
RAC: 0
    
Message 6212 - Posted: 10 Mar 2017, 14:50:32 UTC
Last modified: 10 Mar 2017, 14:56:49 UTC

Open the "VM Console"
Click with the mouse into the console-screen
At Login-Prompt, enter a username, e.g. Atlas
If the prompt for Password appears, all seems to be fine
If the prompt for password doesn't appear, something inside the VM seems to be broken and the only thing you can do is abort it

I tried that as well: the "localhost login:" prompt appeared, but the password prompt did not. When I left the console, the VM itself stopped, resulting in a "Computation error" for the WU.
Here is the task itself: http://atlasathome.cern.ch/result.php?resultid=8454991

Erich
Joined: 18 Dec 15
Posts: 253
Credit: 1,942,248
RAC: 0
    
Message 6213 - Posted: 10 Mar 2017, 17:38:07 UTC - in response to Message 6203.

What I do:
If the value <fraction_done>0.000000</fraction_done> in the file boinc_task_state.xml has not changed after some minutes, I cancel the job.

I checked this just now and found that the value is always 0.000000, even for WUs that are working fine.

PHILIPPE
Joined: 24 Jul 16
Posts: 84
Credit: 53,413
RAC: 0
    
Message 6214 - Posted: 10 Mar 2017, 18:26:47 UTC - in response to Message 6213.
Last modified: 10 Mar 2017, 18:27:47 UTC

I agree with Erich.
fraction_done always stays at 0.000000 in boinc_task_state.xml, for both running and paused WUs.

HerveUAE
Joined: 18 Dec 16
Posts: 44
Credit: 509,829
RAC: 0
    
Message 6215 - Posted: 11 Mar 2017, 4:33:37 UTC
Last modified: 11 Mar 2017, 4:37:25 UTC

Open the "VM Console"
Click with the mouse into the console-screen
At Login-Prompt, enter a username, e.g. Atlas
If the prompt for Password appears, all seems to be fine
If the prompt for password doesn't appear, something inside the VM seems to be broken and the only thing you can do is abort it

I did some more tests and now understand the difference in behaviour between a "good" running WU and a "broken" running WU. It appears that all 4 of my long-runners were broken, so I have aborted them.
Many thanks for the hint, Yeti.

Mogens Dam
Send message
Joined: 1 Jul 14
Posts: 26
Credit: 3,264,917
RAC: 0
    
Message 6235 - Posted: 14 Mar 2017, 16:35:54 UTC - in response to Message 6215.

Has this issue been solved for others?
Over the last week none of my tasks has terminated,
and they seem not to use any CPU.

HerveUAE
Joined: 18 Dec 16
Posts: 44
Credit: 509,829
RAC: 0
    
Message 6239 - Posted: 15 Mar 2017, 1:26:08 UTC

Has this issue been solved for others?

For me, ATLAS multi-core tasks work as usual now, so yes, the issue has been solved.
