Discussion: [Rcpp-devel] Question on performance and strategy
Jordi Molins
2018-09-22 15:21:32 UTC
Hello, I am new on this mailing list. I am using Rcpp and
RcppArmadillo for my project.

I will soon have access to a relatively big machine, with both CPUs and
GPUs, and I want to take advantage of it.

So far, I am calling an R function from C++ (nls.lm from minpack.lm,
which is a port of the Fortran MINPACK library for Levenberg-Marquardt
nonlinear least-squares optimization). This call is pretty fast, so I am happy.
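
Roughly, my call looks like this (a minimal sketch; call_nls_lm and the way
I pass the residual function are just how I set it up, simplified here):

#include <Rcpp.h>

// Sketch: call minpack.lm::nls.lm from C++ via Rcpp::Function.
// 'par' holds the start values and 'fn' is the R residual function;
// both come in from R. Callbacks like this must stay on the main R thread.
// [[Rcpp::export]]
Rcpp::List call_nls_lm(Rcpp::NumericVector par, Rcpp::Function fn) {
    Rcpp::Environment minpack = Rcpp::Environment::namespace_env("minpack.lm");
    Rcpp::Function nls_lm = minpack["nls.lm"];
    return nls_lm(Rcpp::Named("par") = par, Rcpp::Named("fn") = fn);
}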

When I learned that packages like RcppParallel and RcppArrayFire existed, I was
quite happy. But then I saw that RcppParallel does not allow calling back into
R, and that RcppArrayFire does not work on Windows (I am planning to set up a
Linux partition on my personal computer soon, though).
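
For illustration, this is the kind of pure-C++ worker RcppParallel expects
(a parallel-sum sketch along the lines of the Rcpp Gallery article): it only
touches the thread-safe RVector wrapper, never the R API, which is exactly
the restriction I mean.

// [[Rcpp::depends(RcppParallel)]]
#include <Rcpp.h>
#include <RcppParallel.h>
#include <numeric>

// Worker that sums a vector in parallel; no R API calls allowed inside.
struct SumWorker : public RcppParallel::Worker {
    const RcppParallel::RVector<double> input;
    double value;
    SumWorker(const Rcpp::NumericVector input) : input(input), value(0) {}
    SumWorker(const SumWorker& other, RcppParallel::Split)
        : input(other.input), value(0) {}
    void operator()(std::size_t begin, std::size_t end) {
        value += std::accumulate(input.begin() + begin,
                                 input.begin() + end, 0.0);
    }
    void join(const SumWorker& rhs) { value += rhs.value; }
};

// [[Rcpp::export]]
double parallel_sum(Rcpp::NumericVector x) {
    SumWorker w(x);
    RcppParallel::parallelReduce(0, x.length(), w);
    return w.value;
}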

So my plan to do parallel calculations from C++ has been set back a bit.
Of course, I can do parallel calculations from R (and my R functions may use
Rcpp). But ideally I would like to do all the calculations from C++, if
possible, especially in production.

Note: I know RcppEigen includes a C++ port of a Levenberg-Marquardt
implementation. But again, that is just luck. Next time I may need the
equivalent of some other R function that does not exist in C++. I really
would like to be able to call R functions from C++.

Also, I do not know how the ArrayFire implementation works and, in particular,
how it relates to parallel calculations (and I do not know whether calling R
functions from RcppArrayFire is ruled out, as it is with RcppParallel). Is
it possible to use both parallel CPU calculations and GPU calculations with
Rcpp and its relatives?
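
For reference, this is what I understand basic RcppArrayFire usage to look
like (a sketch adapted from its documentation, untested on my side since I am
on Windows; matvec is just an illustrative name):

// [[Rcpp::depends(RcppArrayFire)]]
#include "RcppArrayFire.h"

// Matrix-vector product evaluated on the device; RcppArrayFire converts
// R vectors/matrices to af::array on the way in and back on the way out.
// [[Rcpp::export]]
af::array matvec(const RcppArrayFire::typed_array<f32>& X,
                 const RcppArrayFire::typed_array<f32>& y) {
    return af::matmul(X, y);
}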

I know my questions are not about Rcpp per se, but they touch on other,
related packages. As far as I know there is no devel mailing list for
those other packages, so I am asking here.

So, my question is: how do you use Rcpp in relation to parallel and GPU
calculations? Do you use Rcpp single-threaded and then parallelize from R?
Or do you use RcppParallel and/or RcppArrayFire? If so, how do you cope
with the "loss" of not being able to call R functions? And, also if so, can
RcppParallel and RcppArrayFire work together?

Thank you in advance for your guidance.

Jordi Molins i Coronado
+34 69 38 000 59
Dirk Eddelbuettel
2018-09-22 15:33:58 UTC
Jordi,

RcppParallel uses _thread_ parallelism on the CPU. RcppArrayFire can use that
too, but can also use two GPU-related mechanisms. But those do not give you
extra CPUs, so in short you cannot do "CPU x GPU".

None of this is Rcpp specific. But the respective articles in the
Rcpp Gallery are very good. Start by replicating their results. Then
experiment.

Just keep reading, and trying / experimenting. A lot of this is best learned
by trial and error, along with background reading.

Dirk
--
http://dirk.eddelbuettel.com | @eddelbuettel | ***@debian.org
Jordi Molins
2018-09-22 15:52:26 UTC
Thank you, Dirk. I will certainly follow your advice and try /
experiment. Based on your comment, I will probably try RcppArrayFire first,
rather than RcppParallel.

In relation to doing "CPU x GPU": what would happen if I had 3 variables
to be parallelized (independent of each other, with no interdependencies),
and I created an R function, using RcppArrayFire, to GPU-parallelize two of
them? Then I would use foreach (or similar) in R to CPU-parallelize the
third one (and for each value of the third variable, the R function is
called, and internally RcppArrayFire uses the GPUs).

Would this scheme work? Or is there anything that blocks combining CPU and
GPU, even when the CPU and GPU calculations are encapsulated and
"do not see each other"?

Jordi Molins i Coronado
+34 69 38 000 59
Dirk Eddelbuettel
2018-09-22 17:35:45 UTC
On 22 September 2018 at 17:52, Jordi Molins wrote:
| In relation to doing "CPU x GPU": what would happen if I had 3 variables
| to be parallelized (independent of each other, with no interdependencies),
| and I created an R function, using RcppArrayFire, to GPU-parallelize two of
| them? Then I would use foreach (or similar) in R to CPU-parallelize the
| third one (and for each value of the third variable, the R function is
| called, and internally RcppArrayFire uses the GPUs).

Just because you want to access ONE GPU device N times does not make it N GPUs.

And as you have only one GPU, if you call it N times "in parallel" (in
reality: time-sliced) you get contention.

No different from having OpenBLAS or Intel MKL use ALL your cores for
matrix algebra. If you call that from any of the R process-parallel helpers
you get contention. All this is well documented.
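
Concretely: if you must combine both levels, cap the threads per process so
that processes x threads stays at or below the core count. A minimal OpenMP
sketch (names illustrative):

// [[Rcpp::plugins(openmp)]]
#include <Rcpp.h>
#ifdef _OPENMP
#include <omp.h>
#endif

// Cap the per-process thread count so that
// (R worker processes) x (threads per process) <= physical cores.
// [[Rcpp::export]]
double omp_sum(Rcpp::NumericVector x, int threads) {
#ifdef _OPENMP
    omp_set_num_threads(threads);  // avoid oversubscription under foreach et al.
#endif
    double s = 0.0;
#ifdef _OPENMP
    #pragma omp parallel for reduction(+:s)
#endif
    for (R_xlen_t i = 0; i < x.size(); ++i) s += x[i];
    return s;
}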

Dirk
--
http://dirk.eddelbuettel.com | @eddelbuettel | ***@debian.org
Jordi Molins
2018-09-22 18:41:33 UTC
I have access to a machine (not a desktop) with quite a few CPUs and quite
a few GPUs. So if, for example, there are 100 CPU cores and 100,000 GPU
cores, I guess I could run a foreach over the 100 CPU cores for an R
function, and then, if this R function calls RcppArrayFire, RcppArrayFire
could use 1,000 GPU cores for each call, adding up to the whole 100,000 GPU
cores, no? Or is everything more complex than that?

Jordi Molins i Coronado
+34 69 38 000 59
Ralf Stubner
2018-09-24 10:07:08 UTC
ArrayFire can make use of multiple GPUs [1]. I do not know whether it is able
to treat them as one unit. I would expect that one has to do this more
explicitly, e.g. using an explicit loop and setting the device to be
used, as in https://github.com/arrayfire/arrayfire-python/issues/165.
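
Roughly along these lines (an untested sketch; per_device_sums is just an
illustrative name):

// [[Rcpp::depends(RcppArrayFire)]]
#include "RcppArrayFire.h"
#include <Rcpp.h>

// Address each GPU explicitly: loop over the devices ArrayFire reports
// and pin subsequent work to one of them with af::setDevice().
// [[Rcpp::export]]
Rcpp::NumericVector per_device_sums(int n) {
    int devices = af::getDeviceCount();    // number of GPUs ArrayFire sees
    Rcpp::NumericVector out(devices);
    for (int d = 0; d < devices; ++d) {
        af::setDevice(d);                  // subsequent arrays live on GPU d
        af::array a = af::randu(n);        // allocate and compute on that device
        out[d] = af::sum<double>(a);
    }
    return out;
}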

However, all this is not related to Rcpp.

Greetings
Ralf

[1] cf. http://arrayfire.org/docs/group__device__mat.htm
--
Ralf Stubner
Senior Software Engineer / Trainer

daqana GmbH
Dortustraße 48
14467 Potsdam

T: +49 331 23 61 93 11
F: +49 331 23 61 93 90
M: +49 162 20 91 196
Mail: ***@daqana.com

Sitz: Potsdam
Register: AG Potsdam HRB 27966 P
Ust.-IdNr.: DE300072622
Geschäftsführer: Prof. Dr. Dr. Karl-Kuno Kunze