Tips on how to gain better performance on the ATLAS_MCORE app


Wenjing Wu
Project administrator
Project developer
Project tester
Project scientist
Joined: 23 Jun 14
Posts: 31
Credit: 2,849,678
RAC: 18
      
Message 5208 - Posted: 6 Sep 2016, 1:23:31 UTC
Last modified: 7 Sep 2016, 5:04:48 UTC

According to our tests, the CPU performance varies with the number of cores in the VM. (We measure CPU performance in CPU seconds per event: for example, the current ATLAS jobs running on the ATLAS_MCORE app process 100 events each, so if the overall CPU time for a job is 30,000 seconds, the CPU performance is 300 seconds/event.)

In the tests, we also compared the CPU performance on different VirtualBox versions.

The test results can be seen here:

[figure: CPU seconds/event by number of VM cores, compared across VirtualBox versions]

The above test was done on 2 hosts: HOST1 has HT (Hyper-Threading) enabled, while HOST2 has HT disabled. The results are consistent regardless of whether HT is enabled or disabled.

We also derived a result from ATLAS job statistics based on jobs from a period of over one month. The following shows the average CPU performance for different numbers of cores (ATLAS_MCORE supports up to 12 cores for now).

[figure: average CPU seconds/event by number of cores, from one month of ATLAS job statistics]

The benefit of using more cores in one VM is lower memory usage, but using a large number of cores can also significantly reduce CPU performance. Our tests indicate this is the case on all cloud computing platforms, not just on ATLAS@home.

In order to have a good tradeoff between memory usage and CPU performance, we advise configuring the cores for each ATLAS_MCORE job (VM) according to the overall cores and memory allocated to BOINC. For example, if your host allocates 12 cores to BOINC, ATLAS_MCORE by default creates one VM with 12 cores and 12.1 GB of memory; but if the host has enough memory, you can customize the usage with the app_config.xml file, e.g. each VM uses 6 cores and 7.3 GB of memory, so that your host runs 2 VMs with an overall memory usage of 14.6 GB.

You can limit the multi-core app by using app_config.xml (this file needs to be put in your projects/atlasathome.cern.ch/ directory).

Below is an example that limits each ATLAS_MCORE job to 6 cores:

<app_config>
  <app_version>
    <app_name>ATLAS_MCORE</app_name>
    <avg_ncpus>6.000000</avg_ncpus>
    <plan_class>vbox_64_mt_mcore</plan_class>
    <cmdline>--memory_size_mb 7300</cmdline>
  </app_version>
</app_config>


You should change these two lines to your needs:

  <avg_ncpus>4.000000</avg_ncpus>
  <cmdline>--memory_size_mb 7300</cmdline>


Memory usage for the ATLAS_MCORE app is calculated by this formula:

memory (MB) = 2500 + (800 * NumberOfCores)

so for 6 cores it is 2500 + 800 * 6 = 7300 MB.
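For a different core count, plug the number into the formula: with 4 cores, for instance, it gives 2500 + 800 * 4 = 5700 MB. A minimal app_config.xml for that case might look like the following sketch (it simply mirrors the example above; the values are only illustrative, so adjust them to your host):

<app_config>
  <app_version>
    <app_name>ATLAS_MCORE</app_name>
    <avg_ncpus>4.000000</avg_ncpus>
    <plan_class>vbox_64_mt_mcore</plan_class>
    <!-- memory per the formula: 2500 + 800 * 4 = 5700 MB -->
    <cmdline>--memory_size_mb 5700</cmdline>
  </app_version>
</app_config>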

We will also make some changes on the server side very soon:
1. Require a minimum VirtualBox version (5.0.0) for the ATLAS_MCORE app.
2. Limit the ATLAS_MCORE app to use at most 8 cores.

Jacob Klein
Joined: 21 Jun 14
Posts: 48
Credit: 27,798
RAC: 0
    
Message 5209 - Posted: 6 Sep 2016, 3:25:10 UTC
Last modified: 6 Sep 2016, 3:40:40 UTC

Interesting!!

I wonder if the results for "large number of cores" become skewed by being allocated hyperthreaded cores instead of real ones. What I mean to say is: once you start using up more threads than "real cores", the performance benefits will decrease substantially, and possibly (though not likely) be worse than just using "real cores".

I'd recommend making a clear distinction in your original posts, indicating whether your tests were on "real cores only", or "real and hyperthreaded", or "unknown". Oh, and I wouldn't artificially limit the tasks to a certain number of cores.

For instance, using Intel Performance Counter Monitor (PCM), http://www.intel.com/software/pcm, I recently did some hyperthreading testing. It indicated that, for RNA World tasks, VirtualBox left a lot of headroom to be utilized by hyperthreading with some other CPU-intensive tasks. Note: I don't have enough RNA VM tasks to accurately test whether hyperthreading provides any benefit when running only RNA VM tasks hyperthreaded.


20160904
Intel® Core™ i7-5960X Processor Extreme Edition - Overclocked to 3.8 GHz
8 Cores, 16 Threads
Intel PCM - Instructions Retired

Non-RNA-CPU tasks (WCG, MilkyWay, Universe):
0: 700 M
1: 12 G
2: 20 G
3: 29 G
4: 40 G
5: 45 G
6: 51 G
7: 58 G
8: 61 G
9: 66 G
10: 72 G
11: 75 G
12: 80 G
13: 82 G
14: 84 G
15: 87 G
16: 89 G

RNA-VM-CPU tasks:
0: 700 M
1: 3 G
2: 5 G
3: 7 G
4: 10 G
5: 11 G
6: 13 G
7: 14 G
8: 16 G
9: 16 G

8 RNA-VM-CPU tasks with more Non-RNA-CPU tasks:
8+1: 23 G
8+2: 29 G
8+3: 34 G
8+4: 40 G
8+5: 43 G
8+6: 48 G
8+7: 53 G
8+8: 56 G

Conclusions:
- Each Non-RNA-CPU task takes 3-11 G (average ~7.5 G)
- Throughput increases for any task when hyperthreaded
- CPU-packed ones (like WCG) likely offer less of an increase
- Each RNA-VM-CPU task takes ~1.9 G
- Definitely worth hyperthreading them against other CPU-packed tasks!

Wenjing Wu
Project administrator
Project developer
Project tester
Project scientist
Joined: 23 Jun 14
Posts: 31
Credit: 2,849,678
RAC: 18
      
Message 5210 - Posted: 6 Sep 2016, 4:31:22 UTC - in response to Message 5209.
Last modified: 6 Sep 2016, 8:27:57 UTC

Indeed, a very good point! After checking different machines with and without HT, the performance does vary. I will put together the results and update the information above soon!

Cheers!

Interesting!!

I wonder if the results for "large number of cores" become skewed by being allocated hyperthreaded cores instead of real ones. What I mean to say is: once you start using up more threads than "real cores", the performance benefits will decrease substantially, and possibly (though not likely) be worse than just using "real cores".

Toby Broom
Joined: 1 Jul 14
Posts: 70
Credit: 12,032,688
RAC: 88
      
Message 5213 - Posted: 6 Sep 2016, 6:19:39 UTC

I have a 20-core/40-thread machine; I can see what happens with 12 cores on it.

http://atlasathome.cern.ch/results.php?hostid=9137

I have to run down the older-style WUs first.

How can I get the run times?

Wenjing Wu
Project administrator
Project developer
Project tester
Project scientist
Joined: 23 Jun 14
Posts: 31
Credit: 2,849,678
RAC: 18
      
Message 5214 - Posted: 6 Sep 2016, 6:58:54 UTC - in response to Message 5213.

I notice your host is not running the ATLAS_MCORE app:
http://atlasathome.cern.ch/results.php?hostid=9137&offset=0&show_names=0&state=4&appid=

It is only running the ATLAS single-core app.
From the above link, you can divide the cpu_time by 25 to get the CPU time per event (jobs on the ATLAS single-core app process 25 events each).
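For example, a task showing 10,000 seconds of CPU time (a made-up figure, just for illustration) works out to 10,000 / 25 = 400 seconds/event.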



I have a 20-core/40-thread machine; I can see what happens with 12 cores on it.

http://atlasathome.cern.ch/results.php?hostid=9137

I have to run down the older-style WUs first.

How can I get the run times?

Yeti
Joined: 20 Jul 14
Posts: 699
Credit: 22,597,832
RAC: 211
      
Message 5219 - Posted: 6 Sep 2016, 15:45:16 UTC - in response to Message 5208.

In order to have a good tradeoff between memory usage and CPU performance, we advise configuring the cores for each ATLAS_MCORE job (VM) according to the overall cores and memory allocated to BOINC. For example, if your host allocates 12 cores to BOINC, ATLAS_MCORE by default creates one VM with 12 cores and 12.1 GB of memory; but if the host has enough memory, you can customize the usage with the app_config.xml file, e.g. each VM uses 6 cores and 7.3 GB of memory, so that your host runs 2 VMs with an overall memory usage of 14.6 GB.

...

We will also make some changes on the server side very soon:
1. Require a minimum VirtualBox version (5.0.0) for the ATLAS_MCORE app.
2. Limit the ATLAS_MCORE app to use at most 8 cores.

I'm missing one very important point: find a solution for the scheduler bug when using an app_config.xml. You remember that the scheduler tells the BOINC client it needs far too much memory to run ATLAS multi-core WUs.

Toby Broom
Joined: 1 Jul 14
Posts: 70
Credit: 12,032,688
RAC: 88
      
Message 5222 - Posted: 6 Sep 2016, 20:55:50 UTC - in response to Message 5214.
Last modified: 6 Sep 2016, 21:04:40 UTC

I had to run down the regular tasks before switching to MCORE today; I set up one 12-core task on my 20-core/40-thread machine.

I assume it's a bit less than perfect, as 2 of the cores are on the second CPU.

Toby Broom
Joined: 1 Jul 14
Posts: 70
Credit: 12,032,688
RAC: 88
      
Message 5223 - Posted: 6 Sep 2016, 21:21:19 UTC - in response to Message 5219.
Last modified: 6 Sep 2016, 21:55:59 UTC

On my machines I see the following:

Host cores, #VMs x cores/VM, BOINC working set (WS), app_config memory setting

20 cores, 1x12, 11.82GB WS, 12.1GB
12 Cores, 2x8, 11.82GB WS, 8.9GB
10 cores, 2x8, 11.82GB WS, 8.9GB
10 cores, 1x6, 11.82GB WS, 7.3GB
6 cores, 1x1, 11.82GB WS, 2.5GB
4 cores, 1x1, 6.35GB WS, 2.5GB
4 cores, 1x2, 8.69GB WS, 3.3GB

It seems that on the high-core machines the WS is fixed: if I set a 4- or 6-core WU on those machines, BOINC still assigns them an 11.82 GB WS, even though app_config.xml specifies a much lower amount.

computezrmle
Joined: 29 Oct 14
Posts: 54
Credit: 1,137,404
RAC: 35
      
Message 5226 - Posted: 7 Sep 2016, 6:10:23 UTC

IMHO the cause of several problems is an incorrect server-side use of the parameter <max_ncpus>.

This includes:

  • "Core No." shown in the task table
  • GFLOPS of the device
  • RAM estimation
  • credit calculation

The project server should use <avg_ncpus> instead.

Wenjing Wu
Project administrator
Project developer
Project tester
Project scientist
Joined: 23 Jun 14
Posts: 31
Credit: 2,849,678
RAC: 18
      
Message 5230 - Posted: 7 Sep 2016, 7:44:26 UTC - in response to Message 5226.
Last modified: 7 Sep 2016, 7:53:23 UTC

Thanks!
Could you be more specific about the configuration on the server side?
We use this plan_class:

<plan_classes>
  <plan_class>
    <name>vbox_64</name>
    <virtualbox/>
    <is64bit/>
    <min_vbox_version>30200</min_vbox_version>
  </plan_class>
  <plan_class>
    <name>vbox_64_mt_mcore</name>
    <virtualbox/>
    <is64bit/>
    <min_vbox_version>30200</min_vbox_version>
    <min_ncpus>2</min_ncpus>
    <max_threads>12</max_threads>
    <mem_usage_base_mb>2500</mem_usage_base_mb>
    <mem_usage_per_cpu_mb>800</mem_usage_per_cpu_mb>
    <projected_flops_scale>
    <nthreads_cmdline/>
  </plan_class>
</plan_classes>
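Note how this lines up with the memory formula from the opening post: mem_usage_base_mb = 2500 and mem_usage_per_cpu_mb = 800 are exactly the constants in memory (MB) = 2500 + (800 * NumberOfCores).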

IMHO the cause of several problems is an incorrect server-side use of the parameter <max_ncpus>.

This includes:

  • "Core No." shown in the task table
  • GFLOPS of the device
  • RAM estimation
  • credit calculation

The project server should use <avg_ncpus> instead.

computezrmle
Joined: 29 Oct 14
Posts: 54
Credit: 1,137,404
RAC: 35
      
Message 5237 - Posted: 7 Sep 2016, 9:33:02 UTC - in response to Message 5230.

Every time the client contacts the ATLAS project server, it generates and uploads a file called sched_request_atlasathome.cern.ch.xml.
This file includes a lot of information about the client, WUs etc. that may be interesting/necessary for the server.

One section of this file is a copy of the most recent settings in client_state.xml:

<app_version>
  <app_name>ATLAS_MCORE</app_name>
  <version_num>104</version_num>
  <platform>x86_64-pc-linux-gnu</platform>
  <avg_ncpus>2.000000</avg_ncpus>
  <max_ncpus>7.000000</max_ncpus>
  <flops>8211708435.660692</flops>
  <plan_class>vbox_64_mt_mcore</plan_class>
  <api_version>7.7.0</api_version>
  <cmdline>--memory_size_mb 4608</cmdline>
  <dont_throttle/>
  <is_wrapper/>
  <needs_network/>
</app_version>

In this example the project server is told that the client currently uses avg_ncpus=2 for ATLAS_MCORE and max_ncpus=7 for all attached projects.
Since avg_ncpus is the relevant parameter on the client that controls how many cores are used by an ATLAS_MCORE WU, I would expect avg_ncpus to also be used on the server to calculate/display values like:

  • "Core No." shown in the task table
  • GFLOPS of the device
  • RAM estimation
  • credit calculation



Instead the project server uses max_ncpus as input.
This can be tested if you force your client to send other values (2, 3, 4, ...) for max_ncpus.
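One way to experiment along these lines (a sketch, and only a guess at how the test above was actually done) is the <ncpus> option in cc_config.xml, which makes the BOINC client report a different CPU count to attached projects:

<!-- cc_config.xml, placed in the BOINC data directory -->
<cc_config>
  <options>
    <!-- report 4 CPUs regardless of the real count; -1 (the default) uses the real number -->
    <ncpus>4</ncpus>
  </options>
</cc_config>

After changing the file, have the client re-read its config files (or restart it), trigger a project update, and then check the values in the next sched_request file.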

Results:

  • "Core No." shows "7" although the WU runs on 2 cores
  • GFLOPS of the device shows 7 x GFLOPS of one core instead of 2 x GFLOPS
  • RAM estimation is 8.1 GB but should be 4.1 GB
  • credit calculation ... (who cares, but wrong)

The RAM estimation is IMHO the most critical point, as this value leads to problems on clients that do not have much RAM and don't limit the WUs with an app_config.xml.
Unfortunately, app_config.xml can only modify avg_ncpus, not max_ncpus.
Therefore it's necessary to make some changes on the server side.
See the vLHC project if you need an example.

Toby Broom
Joined: 1 Jul 14
Posts: 70
Credit: 12,032,688
RAC: 88
      
Message 5244 - Posted: 7 Sep 2016, 18:47:07 UTC

My 12-core tests (n=3) gave worse results than yours: I got 1106±58 seconds/event.

My ATLAS Simulation v2.01 (vbox_64) results, when running 20 tasks at once, were 400±167 (n=40).

Toby Broom
Joined: 1 Jul 14
Posts: 70
Credit: 12,032,688
RAC: 88
      
Message 5252 - Posted: 9 Sep 2016, 17:54:16 UTC - in response to Message 5244.

I switched to running 7 six-core tasks on this machine; it did 20 tasks at around 1100, so not much change between the two.

Phil1966
Joined: 14 Jun 14
Posts: 39
Credit: 1,185,758
RAC: 1
    
Message 5254 - Posted: 10 Sep 2016, 18:35:18 UTC
Last modified: 10 Sep 2016, 18:43:25 UTC

Hello,

I switched from using 8 cores to 2 x 4 cores => 2 WUs.

On this machine, no problem : http://atlasathome.cern.ch/show_host_detail.php?hostid=11

But on this one: http://atlasathome.cern.ch/show_host_detail.php?hostid=54516 , it is impossible to have it run, although I added enough RAM to reach 24 GB.

A lot of invalid tasks.

I checked, and for 10 minutes the RAM load is < 4 GB and CPU < 5%. And then BOINC Manager closes the WUs ...

(NB: I have reduced the RAM to 16 GB, as it is impossible to run 2 WUs at the same time)

Another point:

Concerning the RAM, your 2500 + (800 * number of cores) formula seems to be too optimistic.

Currently running 2 WUs at the same time, 4 cores each; although the RAM use should be around 11,400 MB, it is in fact > 18,500 MB (total)!!!

See http://atlasathome.cern.ch/show_host_detail.php?hostid=11 for the valid WUs' details.

Thank You

Best

Phil1966

EDIT: Might have another problem on http://atlasathome.cern.ch/show_host_detail.php?hostid=54516 ... Impossible to complete a WU since I tried to run 2 WUs ....

rbpeake
Joined: 27 Jun 14
Posts: 86
Credit: 8,794,961
RAC: 66
      
Message 5255 - Posted: 10 Sep 2016, 19:33:36 UTC - in response to Message 5254.

Hello,

I switched from using 8 cores to 2 x 4 cores => 2 WUs.


(NB: I have reduced the RAM to 16 GB, as it is impossible to run 2 WUs at the same time)

Another point:

Concerning the RAM, your 2500 + (800 * number of cores) formula seems to be too optimistic.

Currently running 2 WUs at the same time, 4 cores each; although the RAM use should be around 11,400 MB, it is in fact > 18,500 MB (total)!!!

See http://atlasathome.cern.ch/show_host_detail.php?hostid=11 for the valid WUs' details.

Thank You

Best

Phil1966
....

Same here. I cannot run 2x4 on 16GB of RAM.

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 13 May 14
Posts: 251
Credit: 2,028,082
RAC: 30
      
Message 5267 - Posted: 12 Sep 2016, 7:55:31 UTC - in response to Message 5254.


Another point:

Concerning the RAM, your 2500 + (800 * number of cores) formula seems to be too optimistic.

Currently running 2 WUs at the same time, 4 cores each; although the RAM use should be around 11,400 MB, it is in fact > 18,500 MB (total)!!!


This is due to the problem discussed in this thread

For each multi-core WU, BOINC assumes all available cores are used, so in your case each WU thinks it needs 2.5 + 0.8 * 8 = 8.9 GB. But when the virtual machine is started, it is only allocated 5.7 GB, because it knows it only has 4 cores.

So the real RAM usage should not (cannot) be more than 11.4 GB, because that is what the VMs have. But you need to over-commit the memory in your settings so that the BOINC scheduler thinks you have enough to run 2 tasks.
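To make the arithmetic concrete: by BOINC's accounting, two 4-core WUs need 2 x 8.9 = 17.8 GB, even though the VMs themselves only allocate 2 x 5.7 = 11.4 GB; that is why the 16 GB hosts reported above cannot start the second task. Another way to keep memory in check is to cap how many ATLAS_MCORE tasks run at once. A sketch using the standard app_config.xml max_concurrent option (assuming the app's short name is ATLAS_MCORE; values are illustrative):

<app_config>
  <app>
    <name>ATLAS_MCORE</name>
    <!-- never run more than one ATLAS_MCORE task at a time -->
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS_MCORE</app_name>
    <avg_ncpus>4.000000</avg_ncpus>
    <plan_class>vbox_64_mt_mcore</plan_class>
    <cmdline>--memory_size_mb 5700</cmdline>
  </app_version>
</app_config>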

Yeti
Joined: 20 Jul 14
Posts: 699
Credit: 22,597,832
RAC: 211
      
Message 5283 - Posted: 13 Sep 2016, 12:43:41 UTC

One point about performance with and without HT:

In your table, you show the following figures:

HOST 1 (HT) with 4 cores: 330 sec/event

HOST 2 (non HT) with 4 cores: 405 sec/event

I understand this as: Host1 needs only 330 seconds per event, while Host2 needs 405 seconds.

So Host1 (with HT) is faster than Host2 (no HT)?!

sMASH
Joined: 29 Jul 16
Posts: 1
Credit: 22,489
RAC: 0
    
Message 5294 - Posted: 15 Sep 2016, 11:46:09 UTC

OK, I haven't created any VM settings on my machine; I just let ATLAS rip. I have a 6-core HT CPU with 16 GB of RAM available. Any thoughts? I don't see much CPU usage, about 25%, and 7+ GB of RAM usage; both figures are totals while surfing YouTube.

Other BOINC projects use up to 90% of my CPU with around 12 GB of RAM ....
