Issues with a bad payload definition in the past few days


Andrej Filipcic
Project administrator
Project developer
Project tester
Project scientist
Joined: 3 Jun 14
Posts: 41
Credit: 3,891,911
RAC: 0
Message 1719 - Posted: 21 Jan 2015, 13:15:51 UTC

Dear volunteers,

In the past few days, most of you have experienced massive WU failures. These were caused by a single task of 20k WUs which was badly defined, so 80% of its WUs failed. We did not spot this immediately because our monitoring system lacks functionality that still needs to be adapted to BOINC workloads. The offending task was aborted yesterday, so things should be back to normal. In addition, all of us have been traveling and did not have much time to follow the message boards. We have now assigned a dedicated person to track issues, so we hope for a much better response time in the future.

Apologies for the issues you have been experiencing lately, and thank you for your patience and efforts to help us.

Phil
Joined: 27 Jun 14
Posts: 39
Credit: 383,974
RAC: 0
Message 1720 - Posted: 21 Jan 2015, 14:01:09 UTC

Well it was only four days of bad jobs, very little to worry about really!

Phil
Joined: 27 Jun 14
Posts: 39
Credit: 383,974
RAC: 0
Message 1721 - Posted: 21 Jan 2015, 15:03:26 UTC - in response to Message 1720.

Well it was only four days of bad jobs, very little to worry about really!

[edit]There are a few of these jobs still in the pipeline that are bubbling up as resends; I don't think they're creating a real problem.[/edit]

Luigi R.
Joined: 6 Sep 14
Posts: 52
Credit: 123,536
RAC: 0
Message 1722 - Posted: 21 Jan 2015, 19:26:33 UTC - in response to Message 1719.

Good to hear that. I'll immediately put my pc back to work.

Dennis Wynes
Joined: 25 Jun 14
Posts: 19
Credit: 1,007,409
RAC: 0
Message 1723 - Posted: 21 Jan 2015, 19:34:51 UTC
Last modified: 21 Jan 2015, 19:35:03 UTC

Unfortunately still mostly validate errors.

http://atlasathome.cern.ch/ATLAS/results.php?userid=375

Luigi R.
Joined: 6 Sep 14
Posts: 52
Credit: 123,536
RAC: 0
Message 1724 - Posted: 21 Jan 2015, 21:48:18 UTC

I was looking at the same thing on top hosts.

Tom*
Joined: 28 Jun 14
Posts: 118
Credit: 8,761,428
RAC: 1
Message 1725 - Posted: 21 Jan 2015, 21:59:59 UTC
Last modified: 21 Jan 2015, 22:14:17 UTC

Unfortunately still mostly validate errors.


They generated 20,000 work units, of which 80% were validate errors.

That's 16,000 work units, but the number of tasks needed to eliminate them is 64,000 invalid tasks, since each work unit is sent to four users before it dies.

Unless they can find and remove the work units on the server, we all need to process as many as we can to clean up the database.

I am processing good work units now, with occasional validate errors.

I am sure we have made a huge dent in the errors, but while a good task takes 3 to 4 hours, the top users can process a huge number of validate errors that take only 5 to 7 minutes each, leading to a full page of validate errors when perusing the stats.
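As a quick sanity check, the arithmetic works out; the 20,000 / 80% / four-sends figures come from this thread, and the script itself is just illustrative:

```python
# Back-of-the-envelope check of the numbers above: 20,000 WUs, 80% bad,
# and each bad WU sent to up to four users before BOINC gives up on it.
total_wus = 20_000
failure_rate = 0.80       # fraction of the task's WUs that fail validation
max_sends = 4             # users a WU is sent to before it dies

bad_wus = int(total_wus * failure_rate)
invalid_tasks = bad_wus * max_sends
print(bad_wus, invalid_tasks)  # 16000 64000
```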

Dennis Wynes
Joined: 25 Jun 14
Posts: 19
Credit: 1,007,409
RAC: 0
Message 1731 - Posted: 22 Jan 2015, 8:47:25 UTC
Last modified: 22 Jan 2015, 8:49:00 UTC

Fair enough, I took "The corresponding task was aborted yesterday, so the things should be back to normal" in the first post at face value and thought this was new work.

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 13 May 14
Posts: 252
Credit: 2,028,082
RAC: 0
Message 1734 - Posted: 23 Jan 2015, 14:38:27 UTC - in response to Message 1725.


That's 16,000 work units, but the number of tasks needed to eliminate them is 64,000 invalid tasks. Each work unit is sent to four users before it dies.


Unfortunately those bad tasks were resubmitted to BOINC by our ATLAS systems, so the number is much higher than that. In addition, when we cleaned these WUs on Tuesday we did not manage to get all of them, so some were still in there. This is why there are still so many validation errors today. I hope that we have now flushed most of them out, but it is hard to know exactly which are the bad WUs, so some may remain over the next few days. Overall, things should get better over the weekend.

Snow Crash
Joined: 26 Jul 14
Posts: 21
Credit: 929,492
RAC: 0
Message 1735 - Posted: 24 Jan 2015, 12:45:45 UTC - in response to Message 1734.
Last modified: 24 Jan 2015, 12:49:52 UTC

It looks like something else in the scheduler has changed, as work units I have aborted (known bad) are being reissued to the same computer. I even tried aborting the download to push a different error message to BOINC, but it still keeps sending them to me.
http://atlasathome.cern.ch/results.php?hostid=3090

A few times it has sent these work units to one of my other computers. I don't think this is standard behavior, but I could very well be wrong.

I'm throwing a few more cores into the mix so we can finish flushing these bad WUs out of the system. It's a fair amount of babysitting, but we'll see how many I get cleared before returning to normal operations.


Thanks for a very interesting and challenging project - keep up the good work!
Steve

osoyoosking
Joined: 10 Nov 14
Posts: 1
Credit: 35,172
RAC: 0
Message 1799 - Posted: 8 Feb 2015, 4:59:26 UTC - in response to Message 1719.

Yeah, it's still not working; it runs out of memory and then aborts.

It's something to do with memory being set wrong in the VirtualBox VM; the video memory is set to 8 and it needs 9.

Anyhow, I tried to change it and restart it, and it doesn't work.

This is as of Feb 07, Saturday night, West Coast time.

Thanks.

Snow Crash
Joined: 26 Jul 14
Posts: 21
Credit: 929,492
RAC: 0
Message 1803 - Posted: 8 Feb 2015, 14:21:09 UTC - in response to Message 1799.
Last modified: 8 Feb 2015, 14:23:29 UTC

Yeah, it's still not working; it runs out of memory and then aborts. It's something to do with memory being set wrong in the VirtualBox VM ...

I see what you're saying ... VB reports using 2.27 GB but the config file only allocates 2.00 GB. You can change the base VM memory inside the "ATLAS_vbox_job_1.29_windows_x86_64.xml" file. If you are running close to the edge of your installed RAM, I can understand this might be a problem, as BOINC will just keep starting additional tasks even though you've run out of memory. Have you tried running only one at a time?

The 8 GB and 9 GB you are referring to are virtual memory, and I think as long as your disk can swap that back and forth it should not be an issue. I was not having any problems, but long ago I bumped my swap file to the regular recommendation (installed RAM * 1.5) plus the amount of "virtual" disk I need for the number of ATLAS tasks I intend to run on each machine (9 GB * task qty).
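To sketch that sizing rule with made-up numbers (the 16 GB machine and two-task count below are illustrative examples, not recommendations):

```python
# Swap-file sizing per the rule of thumb above: (installed RAM * 1.5)
# plus 9 GB of virtual memory per concurrent ATLAS task.
installed_ram_gb = 16     # hypothetical machine
atlas_tasks = 2           # concurrent ATLAS VMs planned
vm_virtual_gb = 9         # virtual memory per ATLAS task, per this thread

swap_gb = installed_ram_gb * 1.5 + vm_virtual_gb * atlas_tasks
print(swap_gb)  # 42.0
```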

Good luck and happy crunching,
Steve
