v1.28 and validation sample



Message boards : News : v1.28 and validation sample

lancone
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 26 May 14
Posts: 219
Credit: 53,963
RAC: 0
    
Message 1124 - Posted: 28 Oct 2014, 20:06:54 UTC

A new sample of tasks has been launched, with a mixture of short and long jobs. The purpose of these tasks is to validate samples run through BOINC against the standard ATLAS sample, to verify that the physics results do not differ. So far ~2k jobs (out of 3.6k) have terminated successfully.

Monitoring of ATLAS simulation jobs on BOINC against WLCG is available on our test server:
http://boincai04.cern.ch/Atlas-test/atlas_job.php

Also, the logic for launching jobs was changed drastically to minimize downloads of input files. Please report if you notice anything non-optimal with v1.28.

A new thread is being created for this purpose.

Mogens Dam
Send message
Joined: 1 Jul 14
Posts: 26
Credit: 3,264,917
RAC: 24
      
Message 1127 - Posted: 28 Oct 2014, 23:25:43 UTC - in response to Message 1124.

That monitoring page is very interesting. Can one conclude that BOINC does something like a 2000/60000 part, i.e. about 3%, of the complete ATLAS simulation?

BTW, do the BOINC stats shown include jobs which do nothing useful? I believe there are still a few users who participate without doing any real work, due to this firewall glitch.

Profile Yeti
Avatar
Send message
Joined: 20 Jul 14
Posts: 699
Credit: 22,597,832
RAC: 211
      
Message 1129 - Posted: 29 Oct 2014, 13:30:19 UTC - in response to Message 1127.

BTW, do the BOINC stats shown include jobs which do nothing useful? I believe there are still a few users who participate without doing any real work, due to this firewall glitch.


Let me guess: the data shown comes from the servers behind the BOINC platform, so it shows only figures for work that was actually done.

lancone
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 26 May 14
Posts: 219
Credit: 53,963
RAC: 0
    
Message 1133 - Posted: 29 Oct 2014, 16:41:39 UTC - in response to Message 1127.

That monitoring page is very interesting. Can one conclude that BOINC does something like a 2000/60000 part, i.e. about 3%, of the complete ATLAS simulation?

BTW, do the BOINC stats shown include jobs which do nothing useful? I believe there are still a few users who participate without doing any real work, due to this firewall glitch.


Yes, the observation is correct: BOINC is currently at the ~3% level of ATLAS simulation, which in turn represents ~40% of ATLAS computing.

The monitoring shows only jobs that have started the ATLAS software. Unfortunately, those suffering from the firewall issue do appear in the monitoring, since firewall issues only reveal themselves after the ATLAS software initialisation (to be fixed in a future version of our BOINC interface).

Aurel
Send message
Joined: 28 Jun 14
Posts: 9
Credit: 804
RAC: 0
  
Message 1134 - Posted: 29 Oct 2014, 16:59:04 UTC

I got two WUs and let them run over night; now I see the following text:

Waiting to run (Scheduler wait: VM job unmanageable, restarting later)

stderr message:

2014-10-29 07:49:09 (7600): vboxwrapper (7.5.26110): starting
2014-10-29 07:49:09 (7600): Feature: Checkpoint interval offset (114 seconds)
2014-10-29 07:49:10 (7600): Detected: VirtualBox 4.3.4r91027
2014-10-29 07:49:11 (7600): Detected: Minimum checkpoint interval (900.000000 seconds)
2014-10-29 07:49:11 (7600): Restore from previously saved snapshot.
2014-10-29 07:49:11 (7600): Restore completed.
2014-10-29 07:49:11 (7600): Starting VM.
2014-10-29 07:49:17 (7600): Successfully started VM. (PID = '4452')
2014-10-29 07:49:17 (7600): Reporting VM Process ID to BOINC.
2014-10-29 07:49:17 (7600): Lowering VM Process priority.
2014-10-29 07:49:18 (7600): VM state change detected. (old = 'poweroff', new = 'running')
2014-10-29 07:49:18 (7600): Detected: Web Application Enabled
2014-10-29 07:49:18 (7600): Preference change detected
2014-10-29 07:49:18 (7600): Setting CPU throttle for VM. (80%)
2014-10-29 07:49:18 (7600): Setting network throttle for VM. (35KB)
2014-10-29 07:49:18 (7600): Checkpoint Interval is now 60 seconds.
2014-10-29 07:50:10 (7600): Creating new snapshot for VM.
2014-10-29 07:50:10 (7600): Restoring VM Process priority.
2014-10-29 07:50:17 (7600): Lowering VM Process priority.
2014-10-29 07:50:17 (7600): Deleting stale snapshot.
2014-10-29 07:50:17 (7600): Error in delete stale snapshot for VM: -2147467259
Command: VBoxManage -q snapshot "boinc_676b9e0d5138cc0d" delete "1b06eda4-2778-4680-b31a-e326feb83222"
Output: 0%...
Progress state: E_FAIL
VBoxManage.exe: error: Snapshot operation failed
VBoxManage.exe: error: Hard disk 'D:\boincdata\slots\5\vm_cache.vdi' has more than one child hard disk (2)
VBoxManage.exe: error: Details: code E_FAIL (0x80004005), component SessionMachine, interface IMachine
VBoxManage.exe: error: Context: "int __cdecl handleSnapshot(struct HandlerArg *)" at line 431 of file VBoxManageSnapshot.cpp
2014-10-29 07:50:17 (7600): ERROR: Checkpoint maintenance failed, rescheduling task for a later time. (-2147467259)
2014-10-29 07:50:17 (7600): Powering off VM.
2014-10-29 07:50:18 (7600): Status Report: virtualbox.exe/vboxheadless.exe is no longer running.
2014-10-29 07:50:18 (7600): Successfully powered off VM.

Profile Steve Hawker*
Avatar
Send message
Joined: 27 Jul 14
Posts: 27
Credit: 125,084
RAC: 3
      
Message 1137 - Posted: 29 Oct 2014, 17:55:03 UTC

I tried one v1.28 task on a Linux box and two v1.28 tasks on my MacBook. All of them raced to 50% as usual and then started to crawl towards 99.999%, which is where they were when I aborted them at 10x the estimated duration.

Both machines run vLHC perfectly.

Of all the 50+ projects I run, those related to CERN are my favorites. I'd really like to be an ATLAS regular but I can't keep running my machines for 24+ hours without completing a WU.

Profile MAGIC Quantum Mechanic
Avatar
Send message
Joined: 4 Jul 14
Posts: 331
Credit: 485,372
RAC: 0
    
Message 1142 - Posted: 30 Oct 2014, 5:49:10 UTC

There has to be a reason (in the code) that makes these tasks run to 99% in 14 hours and then sit there for another 14 hours gaining 0.001% every 85 seconds.

The BOINC Manager says it is running and VirtualBox also says it is running.

The fact is it will never complete and be sent in.

I am just deleting them any time they get up to 98% and then run like a digital snail.

I have already tried several sitting at 99% and 100%, and they always ended up aborted.

It is obvious this is not a memory problem.

Profile Yeti
Avatar
Send message
Joined: 20 Jul 14
Posts: 699
Credit: 22,597,832
RAC: 211
      
Message 1144 - Posted: 30 Oct 2014, 7:42:39 UTC - in response to Message 1142.

There has to be a reason (in the code) that makes these tasks run to 99% in 14 hours and then sit there for another 14 hours gaining 0.001% every 85 seconds.

The BOINC Manager says it is running and VirtualBox also says it is running.


Forget the percentage figures! Before the WU is sent out, an estimate is made of how long it will take (time, events, or something similar). When this estimate is wrong (for example because of unknown factors inside the WU), the calculation is reduced ad absurdum once the estimated 100% mark is reached.

In reality the WU is doing the same amount of work the whole time; only the percentage figures are sometimes crazy.

Whether a WU is running well or not has nothing to do with the speed at which the percentage grows!

fabby
Avatar
Send message
Joined: 24 Oct 14
Posts: 31
Credit: 13,947
RAC: 0
    
Message 1146 - Posted: 30 Oct 2014, 8:18:12 UTC - in response to Message 1144.

Yep, here too! The percentage slows down towards the end, but the WUs still finish correctly! I haven't had to abort any yet...

lancone
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 26 May 14
Posts: 219
Credit: 53,963
RAC: 0
    
Message 1161 - Posted: 30 Oct 2014, 15:54:20 UTC - in response to Message 1144.

Hello,

Forget the percentage figures! Before the WU is sent out, an estimate is made of how long it will take (time, events, or something similar). When this estimate is wrong (for example because of unknown factors inside the WU), the calculation is reduced ad absurdum once the estimated 100% mark is reached.


Indeed, the estimated time is wrong for the validation campaign. A mixture of all possible types of simulation is being submitted (from very short to very long jobs), and we don't have an easy way to estimate the time needed...

fabby
Avatar
Send message
Joined: 24 Oct 14
Posts: 31
Credit: 13,947
RAC: 0
    
Message 1164 - Posted: 30 Oct 2014, 17:05:58 UTC - in response to Message 1124.

Feedback on 1.28: still receiving 1.27 tasks (Application field: "ATLAS Simulation 1.27 (vbox_64)") as recently as 4h48min ago...

Crystal Pellet
Send message
Joined: 25 Jun 14
Posts: 39
Credit: 59,718
RAC: 0
    
Message 1167 - Posted: 30 Oct 2014, 20:26:28 UTC - in response to Message 1164.

Feedback on 1.28: still receiving 1.27 tasks (Application field: "ATLAS Simulation 1.27 (vbox_64)") as recently as 4h48min ago...

Only the Windows and Mac applications have been updated to v1.28.

Profile MAGIC Quantum Mechanic
Avatar
Send message
Joined: 4 Jul 14
Posts: 331
Credit: 485,372
RAC: 0
    
Message 1168 - Posted: 30 Oct 2014, 21:19:09 UTC - in response to Message 1144.

I have never seen a task in my 10+ years with BOINC get to 99% or 100% while running normally in the Manager and in the VirtualBox log, then sit there at 99% or 100% for another 25 hours or more, slowly gaining 0.001% each minute or longer, and never make it to a completed task.

I checked several other members' tasks and see the same thing happening.

Yet some tasks run normally all the way to 100%.

Sure, if you have 100 cores then it is no big deal to waste 50+ hours on a couple of cores.

But it seems that you could look at the logs and stderr, see why that happens, and fix it.

And I can imagine that people who come here with just one or two hosts will, after a while, figure they should go run tasks they know will work (like the alpha/beta days at T4T).

I know it isn't the host that has the problem when these hosts have had no problems running LHC, GPUs, and vLHC x2 for years (so it sure isn't a VirtualBox problem as far as the host goes).

No big deal to me since all of mine are at home and I am here to watch them.

Jacob Klein
Send message
Joined: 21 Jun 14
Posts: 48
Credit: 27,798
RAC: 0
    
Message 1170 - Posted: 30 Oct 2014, 21:34:31 UTC - in response to Message 1168.
Last modified: 30 Oct 2014, 21:38:09 UTC

Magic / Guys:

Look, it's clearly an estimation problem. Some of the RNA World tasks acted similarly: the project had no idea how long a task would take... so they decided to have the progress control script increment progress up to 98.765%, then just stay there for however long it needed (sometimes up to 6 months of additional processing time!), until it would jump from 98.765% instantly to 100%.

It was a bad idea. People thought the tasks got stuck, and aborted massive amounts of work. People need to see progress on each task.

So I convinced them instead to use logic where they take a conservative estimate of how long it will take, and base the progress control script on being "at" 95% when that estimated time has been spent... and then increment 0.001% every hour after that, so as to indicate progress, while also allowing up to 5000 extra hours of processing time.

Basically, even if a task is stuck somewhere (98.765%, 99.999%, even 100%)... or is indicating progress ever more slowly (0.001% every hour, etc.)... so long as the task is still checkpointing, resuming, and eating CPU, and the admins haven't indicated it's somehow broken or in a loop, I'd suggest running it until completion.

Hopefully the ATLAS admins can use this post to adjust their progress control script to always indicate progress, if they aren't already doing so.

Regards,
Jacob
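The scheme Jacob describes can be sketched in a few lines. This is a hypothetical illustration built only from the numbers in his post (95% at the conservative estimate, then 0.001% per extra hour, i.e. up to 5000 extra hours before 100%); it is not the actual RNA World or ATLAS progress script, and the function name is invented:

```python
def progress(elapsed_hours: float, estimated_hours: float) -> float:
    """Map elapsed runtime to a reported progress fraction.

    Ramp linearly to 95% over the (conservative) estimated runtime,
    then creep upward by 0.001% (0.00001 as a fraction) per extra
    hour, so the task always shows some movement. At that rate the
    remaining 5% allows up to 5000 extra hours before reaching 100%.
    """
    if elapsed_hours <= estimated_hours:
        return 0.95 * elapsed_hours / estimated_hours
    extra_hours = elapsed_hours - estimated_hours
    return min(0.95 + 0.00001 * extra_hours, 1.0)

# A task with a 10-hour estimate:
print(progress(5, 10))   # halfway through the estimate -> 0.475
print(progress(12, 10))  # 2 hours past the estimate -> 0.95002
```

The point of the creep phase is purely psychological: the reported fraction no longer claims to predict remaining time, it only proves the task is alive, which is what keeps volunteers from aborting it.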

Profile Yeti
Avatar
Send message
Joined: 20 Jul 14
Posts: 699
Credit: 22,597,832
RAC: 211
      
Message 1171 - Posted: 30 Oct 2014, 21:37:43 UTC - in response to Message 1168.
Last modified: 30 Oct 2014, 21:38:10 UTC

I have never seen a task in my 10+ years with BOINC get to 99% or 100% while running normally in the Manager and in the VirtualBox log, then sit there at 99% or 100% for another 25 hours or more, slowly gaining 0.001% each minute or longer, and never make it to a completed task.

I have seen this several times with a variety of projects.

All projects that use the wrapper struggle and fight with this problem; I remember that RNA World had a lot of problems with it.

And I can imagine that people who come here with just one or two hosts will, after a while, figure they should go run tasks they know will work (like the alpha/beta days at T4T).

I can understand this, but why must they crunch a project in its alpha phase? The alpha phase is for the project to identify, find, and solve problems before going live, and if you choose to crunch an alpha-phase project, you should know what you are doing.

Tom*
Send message
Joined: 28 Jun 14
Posts: 118
Credit: 8,761,428
RAC: 215
      
Message 1187 - Posted: 31 Oct 2014, 15:48:06 UTC
Last modified: 31 Oct 2014, 15:49:49 UTC

Not just an estimation problem, in my estimation.

David (top participant) has very few or none of the looooong-running jobs, i.e. over 50,000 seconds; everything looks good, but they never finish.

Please, until we get a handle on the validate errors, can we go back to 100 MB input files or short tasks only?

Thanks

zombie67 [MM]
Avatar
Send message
Joined: 18 Jun 14
Posts: 31
Credit: 1,175,117
RAC: 0
    
Message 1188 - Posted: 31 Oct 2014, 16:04:31 UTC

It is not only an estimation problem; there is some other problem going on too. I just had to abort two of these tasks. One had been running for over 4 days, the other for over 2 days.

http://atlasathome.cern.ch/result.php?resultid=389919
http://atlasathome.cern.ch/result.php?resultid=387072
____________
Dublin, California
Team: SETI.USA

Rick Cannan
Send message
Joined: 25 Sep 14
Posts: 4
Credit: 3,328
RAC: 0
  
Message 1191 - Posted: 31 Oct 2014, 23:07:31 UTC

I decided to make a contribution of unused computing time on a whim one day, 6 weeks ago...

Unlike some others I only have 2 cores (and two threads), so this work does affect my ability to do other tasks.

I have 2 simulations running at 100.000%. With 54.5 hrs on one and 48.5 hrs on the second, they seem to be way over the estimated time. It has taken well over 24 hrs to "progress" these tasks from 95%. They just keep running...

I am not sure if it is meant to be like this, but it seems the estimates of running time need to be revised.

If I keep getting tasks that behave this way I will need to re-think my contribution to the project.

Rick Cannan
Send message
Joined: 25 Sep 14
Posts: 4
Credit: 3,328
RAC: 0
  
Message 1192 - Posted: 31 Oct 2014, 23:26:59 UTC - in response to Message 1170.

Great suggestion.

Visible progress helps ensure a task is not "deleted".

Is there any alternate way that a task can show it is still active and NOT caught in an infinite loop? When you are not the person who designed the task, it is often impossible to tell visually whether it is in an infinite loop or making very, very slow progress towards an outcome.

I also suggest that the CPU benchmarks play some general role in limiting the ultimate size of the jobs allocated: say a maximum estimated time of 50 hrs, when that's a realistic estimate of the required computing resources.

:-)

Rick Cannan
Send message
Joined: 25 Sep 14
Posts: 4
Credit: 3,328
RAC: 0
  
Message 1221 - Posted: 2 Nov 2014, 0:43:18 UTC - in response to Message 1191.

I decided to make a contribution of unused computing time on a whim one day, 6 weeks ago...

Unlike some others I only have 2 cores (and two threads), so this work does affect my ability to do other tasks.

I have 2 simulations running at 100.000%. With 54.5 hrs on one and 48.5 hrs on the second, they seem to be way over the estimated time. It has taken well over 24 hrs to "progress" these tasks from 95%. They just keep running...

I am not sure if it is meant to be like this, but it seems the estimates of running time need to be revised.

If I keep getting tasks that behave this way I will need to re-think my contribution to the project.



The runtime and resource usage of these two tasks after reaching 95% is totally out of proportion. One task has now exceeded 80 hrs of processor time, the other 75 hrs.

Task 395436 (WU 327402, host 6156): sent 27 Oct 2014, 15:12:13 UTC, deadline 3 Nov 2014, 15:12:13 UTC, In progress, ATLAS Simulation v1.28 (vbox_64)

Task 395580 (WU 327529, host 6156): sent 27 Oct 2014, 17:11:35 UTC, deadline 3 Nov 2014, 17:11:35 UTC, In progress, ATLAS Simulation v1.28 (vbox_64)

Please review the tasks with this information, as they appear to be caught in infinite loops.

