Longer jobs coming and problems in the last 24h


David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 13 May 14
Posts: 252
Credit: 2,028,556
RAC: 1
Message 252 - Posted: 11 Jul 2014, 14:49:42 UTC

Due to a problem in the submission system that injects ATLAS tasks into BOINC, no new tasks were submitted for around 16 hours, until ~9 UTC today. The tasks are now slowly ramping up again. Apologies for the inconvenience.

We also expect some longer tasks soon, which should bring increased CPU efficiency.

markus tingsnaes
Joined: 4 Jul 14
Posts: 4
Credit: 77,834
RAC: 0
Message 256 - Posted: 12 Jul 2014, 13:38:54 UTC - in response to Message 252.

Hi David,

I don't know if the problem in the submission system that injects ATLAS tasks into BOINC is back again, but there are no new tasks to download.

Best regards,
Markus

Johnny L. Williams
Joined: 9 Jul 14
Posts: 1
Credit: 22,125
RAC: 0
Message 257 - Posted: 12 Jul 2014, 18:12:21 UTC - in response to Message 252.

I will be standing by; I currently have 9 ATLAS@home tasks running (as of 7/12/2014).

Regards,

Johnny Williams
Fort Myers, Florida

lancone
Project administrator
Project developer
Project tester
Project scientist
Joined: 26 May 14
Posts: 219
Credit: 53,963
RAC: 0
Message 277 - Posted: 14 Jul 2014, 8:16:46 UTC - in response to Message 256.

Hello Markus,

The system has some inertia; we should be back to the nominal submission rate now.

Regards
Eric

rulez-alex
Joined: 6 Jul 14
Posts: 1
Credit: 0
RAC: 0
Message 284 - Posted: 15 Jul 2014, 15:32:27 UTC

Hi, I'm a new member of the project. Why are there no jobs?

Tom*
Joined: 28 Jun 14
Posts: 118
Credit: 8,761,428
RAC: 0
Message 285 - Posted: 15 Jul 2014, 15:46:49 UTC

They just switched from wrapper 1.21 to 1.22; not sure if that has anything to do with the dearth of tasks. Must be more inertia in 1.22 than in 1.21.

1.21 was doing so well, too :-)

markus tingsnaes
Joined: 4 Jul 14
Posts: 4
Credit: 77,834
RAC: 0
Message 286 - Posted: 15 Jul 2014, 16:45:35 UTC - in response to Message 285.

Yes, something is wrong again, no jobs to work on :(

MAGIC Quantum Mechanic
Joined: 4 Jul 14
Posts: 332
Credit: 485,372
RAC: 0
Message 288 - Posted: 15 Jul 2014, 19:04:51 UTC

Good thing we have LHC, T4T, and ATLAS at the same time.

markus tingsnaes
Joined: 4 Jul 14
Posts: 4
Credit: 77,834
RAC: 0
Message 317 - Posted: 18 Jul 2014, 17:27:46 UTC

No more tasks to download :(

LHCByloved
Joined: 28 Jun 14
Posts: 29
Credit: 6,406
RAC: 0
Message 320 - Posted: 19 Jul 2014, 0:33:48 UTC
Last modified: 19 Jul 2014, 0:40:13 UTC

Ah, new (and huge!!) jobs are coming; I just got one with 1.2 GB of input files downloading. Let's see how long it takes the computer to chew on it, a rather heavy midnight snack. :-)

dskagcommunity
Joined: 11 Jul 14
Posts: 19
Credit: 44,513
RAC: 0
Message 323 - Posted: 19 Jul 2014, 10:12:24 UTC
Last modified: 19 Jul 2014, 10:49:03 UTC

Then you are a lucky guy; nothing new here :(

Edit: Ah, now I got two :)

LHCByloved
Joined: 28 Jun 14
Posts: 29
Credit: 6,406
RAC: 0
Message 324 - Posted: 19 Jul 2014, 13:34:41 UTC
Last modified: 19 Jul 2014, 14:01:01 UTC

Hi there

The big one finished quickly, but via a regular VM shutdown after just a few minutes (I mean no abort, computation error or the like), so possibly something was not OK that made the VM shut down. I watched it boot, load its files and start the calculations, but did not catch what made it shut down so quickly after that, as I did not expect that to happen.

Greetings,
Bylo

Tom*
Joined: 28 Jun 14
Posts: 118
Credit: 8,761,428
RAC: 0
Message 325 - Posted: 19 Jul 2014, 13:51:25 UTC
Last modified: 19 Jul 2014, 13:57:44 UTC

My theory is the Feeder needs a swift kick, but the other issue is that there seem to be 1400 jobs stuck: they never get processed, whether or not the Feeder is feeding.

LHCByloved
Joined: 28 Jun 14
Posts: 29
Credit: 6,406
RAC: 0
Message 334 - Posted: 20 Jul 2014, 19:51:04 UTC
Last modified: 20 Jul 2014, 20:50:38 UTC

Hi there

There's something wrong with the huge jobs; I just got another one with 1.2 GB of input and kept an eye on it this time.

The VM was again shut down very quickly after loading/starting everything, but no errors (as far as I could see) were shown inside the VM, only that the calculation was finished and files were to be exported. The uploaded result file was only 4 KB (compared to about 20 MB of output for the 70 MB input jobs). Obviously it could not really start the calculations.

These were the two big ones I had. As the VM did a regular shutdown, the tasks were credited, but it's not really comfortable, neither for ATLAS, as the work is not really done, nor for users' computers, if downloading the files takes longer than the (failed) calculations themselves.

ID: 110067, Name: Z7GODmk0nSknDDn7oo6G73TpABFKDmABFKDmqNFKDmABFKDmnUZCTo_0
ID: 113962, Name: 8tLNDmps7SknDDn7oo6G73TpABFKDmABFKDmR0GKDmABFKDmnfZuHo_1

Possibly you could check this on your end for the tasks in question, as the problem was inside the VM and not on the BOINC side.

As the input files alone were 1.2 GB, I'd say the allocated VM memory of 1 GB is far from enough for these big jobs, so they could not work properly and shut down. Already after loading the input, the VM memory usage was at its limit, while with the smaller jobs there's enough memory left, and usage only grows towards the limit as the calculations proceed.

Just an idea, I don't know if it's possible in an easy way: could the VM memory be set individually for each job when the VM is created, instead of the "global" setting in the xml file in the projects directory, for example 1 GB for the ones with smaller input and 2 GB for the bigger ones?

I will try this manually via the ATLAS_vbox.......xml file inside the projects folder if I catch another big job, to see if it solves the problem, and will let you know.

Could someone who gets one of these big jobs please have a look at this too, either to see that it runs properly (and that I'm the black sheep here :-)) or to confirm that there are problems?

Greetings,
Bylo

LHCByloved
Joined: 28 Jun 14
Posts: 29
Credit: 6,406
RAC: 0
Message 339 - Posted: 21 Jul 2014, 0:07:08 UTC
Last modified: 21 Jul 2014, 0:50:02 UTC

Hi there again

I got another 1.2 GB task and played around with it a bit...

I set the checkpoint interval to a bigger value so that I could suspend the task (quickly enough) when the VM's shutdown procedure started, before any checkpoint was made; the task was therefore forced to start over from the beginning. After each try I cleaned up the slot directory and restarted BOINC to make sure the changed values were picked up.

I have now tried setting the memory to higher values step by step, starting the task over and over. I took care of the memory setting in the ATLAS_vbox.......xml file and the rsc_memory_bound values in the client_state and init_data files for that particular task, with 2048, 2560 and even 3072 MB; I couldn't go any further, otherwise my system would run out of memory. All attempts led to the same problem of the calculations not starting, as I can also see from the very low CPU use of 4-5% going on for several minutes after the VM has loaded everything, until it decides to shut down.
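
For illustration, these are the kinds of lines I changed; the values shown are from the 2048 MB try (client_state uses bytes while the vbox file uses MB), and I can't promise which of them actually matters:

In ATLAS_vbox.......xml:
<memory_size_mb>2048</memory_size_mb>

In client_state.xml (and correspondingly in init_data.xml):
<rsc_memory_bound>2048000000.000000</rsc_memory_bound>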

I sent it back now; even with 3072 MB of memory there was no success in solving the problem.

ID: 114492, Name: garNDm0Z9SknDDn7oo6G73TpABFKDmABFKDmxOHKDmABFKDmhLr9um_1

That's the screen I could capture for this recent task in question; it says "successful", but I can't believe that, as it did no real work...

[screenshot]

Or is dark matter hiding in there, so it did dark work? (Sorry, couldn't resist that one) :-)

I'll continue for now with the smaller 70 MB input tasks, which have worked fine so far.

Greetings (and sorry for delivering problems on a Monday morning),
Bylo

Jacob Klein
Joined: 21 Jun 14
Posts: 48
Credit: 27,798
RAC: 0
Message 344 - Posted: 21 Jul 2014, 21:16:50 UTC - in response to Message 339.
Last modified: 21 Jul 2014, 21:20:50 UTC

Were you adjusting the VM's "Base Memory" setting at all, or only the client_state.xml file's "rsc_memory_bound" setting?

rsc_memory_bound does not increase the VM's Base Memory. rsc_memory_bound is used to say "the client's computer will need at least this much memory usable by BOINC before attempting to download this task", and "if it's a VM task, consider this much memory as budgeted when calculating running memory amounts, since Oracle insists that the Base Memory amount must be guaranteed free while the VM is running". There's an assumption (and a necessity) that the admins have the two values set equal to each other.

The VM's actual "Base Memory" setting, however, is a completely different setting than rsc_memory_bound. I think "Base Memory" is controlled via the <memory_size_mb> setting in the ATLAS_vbox_job_1.22_windows_x86_64.xml file.
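
For reference, a minimal sketch of what that job file plausibly contains; only <memory_size_mb> is confirmed above, the other tags are typical vboxwrapper job settings and may differ from ATLAS's actual file:

<vbox_job>
    <os_name>Linux26_64</os_name>
    <memory_size_mb>1024</memory_size_mb>
    <enable_network/>
    <enable_shared_directory/>
</vbox_job>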

LHCByloved
Joined: 28 Jun 14
Posts: 29
Credit: 6,406
RAC: 0
Message 346 - Posted: 22 Jul 2014, 2:43:27 UTC
Last modified: 22 Jul 2014, 3:02:07 UTC

Hi Jacob

Thank you for further information!

I think "Base Memory" is controlled via <memory_size_mb> setting in the ATLAS_vbox_job

I edited that one too, to make sure, but as I did it all together manually, I can't tell which setting influences which value.

I had another look into the client_state file; this is what I actually got for a "small" job (after putting everything back to defaults):

<workunit>
<name>lIPKDmD6XTknDDn7oo6G73TpABFKDmABFKDmsYLKDmABFKDmFLfdDn</name>
<app_name>ATLAS</app_name>
<version_num>122</version_num>
<rsc_fpops_est>299880000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>6000000000000000000.000000</rsc_fpops_bound>
<rsc_memory_bound>1024000000.000000</rsc_memory_bound>
<rsc_disk_bound>15000000000.000000</rsc_disk_bound>

Now, on closer inspection, I also noticed that there are further memory settings hidden in client_state which I did not touch (overlooked them yesterday).

<active_task>
<project_master_url>http://atlasathome.cern.ch/</project_master_url>
<result_name>lIPKDmD6XTknDDn7oo6G73TpABFKDmABFKDmsYLKDmABFKDmFLfdDn_0</result_name>
(...)
<swap_size>6256304128.000000</swap_size>
<working_set_size>1149906944.000000</working_set_size>
<working_set_size_smoothed>1024000000.000000</working_set_size_smoothed>
<page_fault_rate>0.000000</page_fault_rate>
</active_task>

Anyway, while testing settings with the big job yesterday, I saw via the system monitor the memory of the VM headless process growing until it reached 3 GB, but it might not have worked as supposed because the setting for working_set_size (is this the Base Memory you mean??) was not adjusted either. *Argh*

I can't check the VM directly via the VirtualBox manager because of the sandboxing feature with an invisible VM on Mac OS, and setting up a VM from an image outside BOINC to play with memory and other settings (as is possible with vLHC) is not applicable because of the tasks' structure; otherwise I would have tried it that way instead of crawling through settings inside BOINC's files.

I'll take another try with the settings in client_state in case I receive another big job, with the ambition to make the task/VM actually run the calculations instead of the VM's "came, saw, got frightened and went away" behaviour, if it's not beyond my system's abilities.

Today I just got small ones. Possibly the few big ones have been taken out of the queue?

Greetings,
Bylo

Jacob Klein
Joined: 21 Jun 14
Posts: 48
Credit: 27,798
RAC: 0
Message 347 - Posted: 22 Jul 2014, 3:00:21 UTC

I stated what I did in my last message as pure fact. As far as I know, everything in my prior post is correct.

The settings (rsc_memory_bound in client_state.xml, and <memory_size_mb> in ATLAS_vbox_job_1.22_windows_x86_64.xml) should not really be modified by a user. I don't recommend editing settings that you are unfamiliar with. :)

lancone
Project administrator
Project developer
Project tester
Project scientist
Joined: 26 May 14
Posts: 219
Credit: 53,963
RAC: 0
Message 366 - Posted: 22 Jul 2014, 16:47:22 UTC - in response to Message 347.

We incidentally discovered that, at regular time intervals, a few high-memory ATLAS test jobs were also being submitted to BOINC, causing several issues on the client side. The submission of these pathological jobs has been cancelled.

We apologise for the inconvenience caused.

And thank you for reporting the issue.

Regards

Jacob Klein
Joined: 21 Jun 14
Posts: 48
Credit: 27,798
RAC: 0
Message 368 - Posted: 22 Jul 2014, 16:55:46 UTC - in response to Message 366.

Thank you for getting back to us on this.

We have no problem running these, so long as the settings (rsc_memory_bound in client_state.xml, and <memory_size_mb> in ATLAS_vbox_job_1.22_windows_x86_64.xml) are correct.

It seems, though, that since the xml file is used to set up the VM, if you want VMs with differing Base Memory amounts, you may need separate applications; see the sketch below.
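
For instance (the second file name is purely hypothetical, just to illustrate the idea), each application would ship its own job file with its own memory size:

ATLAS_vbox_job_1.22_windows_x86_64.xml (the current app, for the small jobs):
<memory_size_mb>1024</memory_size_mb>

ATLAS_long_vbox_job_1.22_windows_x86_64.xml (a hypothetical separate app for the big jobs):
<memory_size_mb>2048</memory_size_mb>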

Regards,
Jacob
