Discussion:
[Rcpp-devel] [RcppParallel] Segfault but only on TravisCI
Alexis Sarda
2018-01-16 20:02:40 UTC
Hello,

I am integrating RcppParallel into my R package and I'm running into
strange problems with segmentation faults, but only during the continuous
integration checks. I have essentially variations of the following (I hope
GitHub gist links are ok):

https://gist.github.com/asardaes/7d78af394f848a967997ff23e433c9cf

On TravisCI, my Linux builds simply freeze, and the OSX builds show
messages like:

*** caught segfault ***
address 0x100000001, cause 'memory not mapped'

I would assume that my distance functions are trying to access memory they
shouldn't, but during interactive use everything works flawlessly, and I've
tested all of the following with no problems (which also test correctness,
i.e. numeric consistency with respect to past results):

- Local Linux R CMD check
- Local Windows R CMD check
- CRAN's WinBuilder check
- AppVeyor (x32 and x64 Windows)
- Docker R CMD check using rocker's r-devel-san
- Local Linux R CMD check with valgrind (no leaks)

It is worth mentioning that some of the examples run during the OSX build
show incorrect results long before the segfault occurs: some results are
zero when they shouldn't be. I don't have access to a machine with OSX, but
the Linux builds in TravisCI also show problems (no segfaults explicitly,
just hangs).

I am at my wit's end. Any input would be appreciated.

Regards,
Alexis.
Dirk Eddelbuettel
2018-01-17 12:32:59 UTC
On 16 January 2018 at 21:02, Alexis Sarda wrote:
| Hello,
|
| I am integrating RcppParallel into my R package and I'm running into
| strange problems with segmentation faults, but only during the continuous
| integration checks. I have essentially variations of the following (I hope
| GitHub gist links are ok):
|
| https://gist.github.com/asardaes/7d78af394f848a967997ff23e433c9cf
|
| On TravisCI, my Linux builds simply freeze, and the OSX builds show
| messages like:
|
| *** caught segfault ***
| address 0x100000001, cause 'memory not mapped'
|
| I would assume that my distance functions are trying to access memory they
| shouldn't, but during interactive use everything works flawlessly, and I've
| tested all of the following with no problems (which also test correctness,
| i.e. numeric consistency with respect to past results):
|
| - Local Linux R CMD check
| - Local Windows R CMD check
| - CRAN's WinBuilder check
| - AppVeyor (x32 and x64 Windows)
| - Docker R CMD check using rocker's r-devel-san
| - Local Linux R CMD check with valgrind (no leaks)
|
| It is worth mentioning that some of the examples ran during the OSX build
| show incorrect results long before the segfault occurs: some results are
| zero when they shouldn't be. I don't have access to a machine with OSX, but
| the Linux builds in TravisCI also show problems (no segfaults explicitly,
| just hangs).
|
| I am at my wit's end. Any input would be appreciated.

It is hard for us to tell, but maybe try the old and trusted route of smaller
and smaller reproducible examples until you reproduce it?

Or else, if it is _just_ Travis CI, maybe it is a compiler version issue? Travis
is very conservative in its default setup, but there are .travis.yml scripts
out there that turn on the PPA for compiler builds, giving you gcc-5, gcc-6,
... and different clang versions.

Dirk
--
http://dirk.eddelbuettel.com | @eddelbuettel | ***@debian.org
Kevin Ushey
2018-01-17 16:41:56 UTC
In your RcppParallel worker, it looks like you're trying to write to
an Rcpp matrix; e.g. you have:

distmat_(i,j) = local_calculator->calculate(i,j);

where distmat_ is a matrix. You should avoid using Rcpp classes within
RcppParallel workers, as there's no guarantee that the methods
available on Rcpp classes are thread-safe (and this could lead to
these kinds of segfaults).

I'll echo Dirk and say that without a minimally reproducible example
(or access to the package source code) there's not much else we can
say.

You could try running your code with `gctorture(TRUE)` to see if that
triggers your segfault more reliably -- if that's the case, then you
almost surely have a protection issue somewhere (most likely the
result of using non-threadsafe APIs within an RcppParallel worker, but
without full context it's impossible to be sure).
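
To illustrate the advice about keeping Rcpp classes out of the worker, here is a minimal sketch in plain C++. MatrixView is a hypothetical stand-in for a thin, non-owning wrapper in the spirit of RcppParallel::RMatrix<double>; toy_distance and parallel_distmat are likewise made-up names, and std::thread replaces parallelFor so the sketch compiles without R. The point is only that each thread writes through a raw pointer to cells no other thread touches.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a thin matrix wrapper: a non-owning,
// column-major view over memory allocated elsewhere. It never allocates
// and never calls into R.
struct MatrixView {
    double* data;
    std::size_t nrow, ncol;
    double& operator()(std::size_t i, std::size_t j) const {
        assert(i < nrow && j < ncol);
        return data[i + j * nrow];
    }
};

// Illustrative distance: absolute difference between two scalars.
inline double toy_distance(double a, double b) { return a > b ? a - b : b - a; }

// Fill an n x n distance matrix. Thread t handles rows t, t + num_threads,
// and so on, so no two threads ever write to the same cell.
std::vector<double> parallel_distmat(const std::vector<double>& x,
                                     unsigned num_threads) {
    assert(num_threads > 0);
    const std::size_t n = x.size();
    std::vector<double> storage(n * n, 0.0);  // allocated before threads start
    MatrixView distmat{storage.data(), n, n};

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) {
        pool.emplace_back([&x, distmat, n, t, num_threads] {
            for (std::size_t i = t; i < n; i += num_threads)
                for (std::size_t j = 0; j < n; ++j)
                    distmat(i, j) = toy_distance(x[i], x[j]);
        });
    }
    for (auto& th : pool) th.join();
    return storage;
}
```

In an actual package the output would presumably be an Rcpp::NumericMatrix allocated on the main thread and wrapped once in an RMatrix<double> before the workers start; that part is an assumption about the surrounding code, not something shown in this thread.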
_______________________________________________
Rcpp-devel mailing list
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
Alexis Sarda
2018-01-17 17:10:46 UTC
(I think I forgot to reply to all)

I failed to mention that I am also using RcppArmadillo elsewhere,
and I found the following post by the Coatless Professor saying that
Armadillo requires at least gcc 4.7.2:

http://thecoatlessprofessor.com/programming/r/selecting-an-alternative-compiler-for-r-package-testing-on-travis-ci/

However, my Travis builds were already using Trusty (gcc 4.8.4).
I think WinBuilder uses gcc 4.9.3, and that one didn't fail.
In case it is of interest, I am using Armadillo for exactly one thing:

arma::vec cc_seq = arma::real(arma::ifft(fftx % ffty));

and that was also failing on Travis: the element-wise multiplication
raised an incompatible-dimensions error that reported vectors with
unreasonably large dimensions.
Using version 5 of the compilers, as explained in the post above, seems to
have solved the problems.

Thanks,
Alexis.
Alexis Sarda
2018-01-20 11:45:30 UTC
I've found out that the problem remains on OSX builds, and apparently it is
caused by clang itself. I used R-hub's fedora-clang-devel to test:

https://artifacts.r-hub.io/dtwclust_5.1.0.9000.tar.gz-6f452fd6aeea4307921df2ab2337e6bb/dtwclust.Rcheck/00check.log

The error that stands out to me is:

*** Error in `/opt/R-devel/lib64/R/bin/exec/R': corrupted double-linked list: 0x00000000099a3870 ***


I am essentially doing a parallel distance matrix calculation as shown in
the Rcpp gallery, but I have several distance functions. All the classes
that provide distance calculations have a member wrapping std::vector of
either RcppParallel's RVector<double>, RMatrix<double>, or Armadillo's
cx_vec. Here's the template I'm using to wrap those members:

https://github.com/asardaes/dtwclust/blob/master/src/utils/TSTSList.h

Could the corruption be caused by this?

Regards,
Alexis.
Dirk Eddelbuettel
2018-01-20 13:46:46 UTC
On 20 January 2018 at 12:45, Alexis Sarda wrote:
| I've found out that the problem remains on OSX builds, and apparently it is
| caused by clang itself. I used R-hub's fedora-clang-devel to test:
|
| https://artifacts.r-hub.io/dtwclust_5.1.0.9000.tar.gz-6f452fd6aeea4307921df2ab2337e6bb/dtwclust.Rcheck/00check.log
|
| The error that stands out to me is:
|
| *** Error in `/opt/R-devel/lib64/R/bin/exec/R': corrupted
| double-linked list: 0x00000000099a3870 ***
|
|
| I am essentially doing a parallel distance matrix calculation as shown in
| the Rcpp gallery, but I have several distance functions. All the classes
| that provide distance calculations have a member wrapping std::vector of
| either RcppParallel's RVector<double>, RMatrix<double>, or Armadillo's
| cx_vec. Here's the template I'm using to wrap those members:
|
| https://github.com/asardaes/dtwclust/blob/master/src/utils/TSTSList.h
|
| Could the corruption be caused by this?

It looks to me like you are just moving _actual Rcpp vectors_ around from the
Rcpp::List into your container, and then accessing them using your operator
types. But ... that still accesses R memory through these vectors, and with
that we may get a (rare?) race condition on stack unwinding etc.

The truly paranoid approach would be to make truly distinct types and copy
the data (i.e. memcpy). That file is short, so maybe you can try it.
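
A minimal sketch of that copy-first approach, in plain C++ so it compiles without R: BorrowedSeries and copy_all are hypothetical names, not part of Rcpp or RcppParallel. The idea is that the main thread deep-copies every series into storage the C++ side owns, so the worker threads never dereference a pointer into memory owned by R.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical description of one borrowed series: a pointer into memory
// that someone else (for example R) owns, plus its length.
struct BorrowedSeries {
    const double* data;
    std::size_t length;
};

// Deep-copy every borrowed series into storage owned by the returned
// object. Meant to run on the main thread, before any worker starts.
std::vector<std::vector<double>> copy_all(const std::vector<BorrowedSeries>& input) {
    std::vector<std::vector<double>> owned;
    owned.reserve(input.size());
    for (const BorrowedSeries& s : input) {
        assert(s.data != nullptr || s.length == 0);
        std::vector<double> copy(s.length);
        if (s.length > 0)
            std::memcpy(copy.data(), s.data, s.length * sizeof(double));
        owned.push_back(std::move(copy));
    }
    return owned;
}
```

The cost is one extra copy of the input per call, which is the trade-off being weighed in this exchange.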

Dirk
--
http://dirk.eddelbuettel.com | @eddelbuettel | ***@debian.org
Alexis Sarda
2018-01-20 14:05:07 UTC
The idea is indeed to avoid copying memory. I thought that doing something
like the following would allow me to read the values created in R from
within the threads:

Rcpp::NumericVector vec(vector_from_R);
std::vector<RcppParallel::RVector<double>> series;
series.push_back(RcppParallel::RVector<double>(vec));
// then in the threads:
double val = series[index_for_this_thread][0];

The data created on the R side is never modified by these functions, just
read. It is possible for different threads to read the same memory, but I
thought reading was not subject to race conditions.

The segfaults are very consistent: every OSX build fails with the same
error at the same point. The fact that it happens with clang++ but not with
g++ is puzzling to me.

The Rcpp::List may contain a lot of NumericVector or NumericMatrix series,
so I would rather not copy all of them.
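
For reference, the read-only sharing I have in mind looks like this in plain C++ (sum_from_each_thread is a made-up name, and std::thread stands in for the RcppParallel machinery). Under the C++ memory model, concurrent reads of data that no thread modifies are not a data race. The sketch covers only plain memory reads; it does not cover calling R or Rcpp APIs from the threads.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Every thread sums the same read-only buffer. Nothing writes to `shared`
// while the threads run, so the concurrent reads do not race. Each thread
// writes only to its own slot in `results`.
std::vector<double> sum_from_each_thread(const std::vector<double>& shared,
                                         unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<double> results(num_threads, 0.0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) {
        pool.emplace_back([&shared, &results, t] {
            results[t] = std::accumulate(shared.begin(), shared.end(), 0.0);
        });
    }
    for (auto& th : pool) th.join();
    return results;
}
```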
Dirk Eddelbuettel
2018-01-20 15:22:19 UTC
On 20 January 2018 at 15:05, Alexis Sarda wrote:
| The idea is indeed to avoid copying memory. I thought that doing something
| like the following would allow me to read the values created in R from
| within the threads:
|
| Rcpp::NumericVector vec(vector_from_R);
| std::vector<RcppParallel::RVector<double>> series;
| series.push_back(RcppParallel::RVector<double>(vec));
| // then in the threads:
| double val = series[index_for_this_thread][0];
|
| The data created on the R side is never modified by these functions, just
| read. It is possible for different threads to read the same memory, but I
| thought reading was not subject to race conditions.
|
| The segfaults are very consistent: every OSX build fails with the same
| error at the same point. The fact that it happens with clang++ but not with
| g++ is puzzling to me.
|
| The Rcpp::List may contain a lot of NumericVector or NumericMatrix series,
| so I would rather not copy all of them.

But the RcppParallel documentation is pretty clear on "do not touch R memory
from multiple threads".

Dirk
--
http://dirk.eddelbuettel.com | @eddelbuettel | ***@debian.org
Alexis Sarda
2018-01-20 18:20:50 UTC
After testing locally with clang, it seems the actual problem was that
there were different classes inheriting from RcppParallel::Worker in
different cpp files, but they all had the same name and the same
constructors, yet different logic. There were no compilation warnings, but
apparently clang doesn't isolate each class to the file where it is
declared and defined, even if it doesn't appear in any header.
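
For anyone hitting the same thing: in standard C++, same-named classes with external linkage and different definitions in different translation units violate the One Definition Rule, and no diagnostic is required, so which definition ends up being used can differ between toolchains. One way to keep a worker private to its file is an unnamed namespace (giving each class a distinct name also works). A minimal sketch of a single cpp file, where DistanceWorker is a plain struct standing in for an RcppParallel::Worker subclass and squared_values is a made-up entry point:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace {  // everything in here is private to this translation unit

// Hypothetical stand-in for a worker class. Another cpp file can define
// its own, different DistanceWorker inside its own unnamed namespace
// without the two definitions clashing at link time.
struct DistanceWorker {
    const std::vector<double>& x;
    std::vector<double>& out;

    DistanceWorker(const std::vector<double>& x_, std::vector<double>& out_)
        : x(x_), out(out_) {}

    // Same call shape as a parallelFor worker: process [begin, end).
    void operator()(std::size_t begin, std::size_t end) {
        assert(end <= x.size() && end <= out.size());
        for (std::size_t i = begin; i < end; ++i) out[i] = x[i] * x[i];
    }
};

}  // namespace

// Externally visible entry point; only this name is shared across files.
std::vector<double> squared_values(const std::vector<double>& x) {
    std::vector<double> out(x.size(), 0.0);
    DistanceWorker worker(x, out);
    worker(0, x.size());  // serial call here; a scheduler would split the range
    return out;
}
```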

Thanks,
Alexis.
Dirk Eddelbuettel
2018-01-20 19:06:38 UTC
On 20 January 2018 at 19:20, Alexis Sarda wrote:
| After testing locally with clang, it seems the problem actually was that
| there were different classes inheriting from RcppParallel::Worker in
| different cpp files, but they all had the same names and the same
| constructors, yet different logic. There were no compilation warnings, but
| apparently clang doesn't isolate each class to the file where it is
| declared+defined, even if it doesn't appear in any header.

Glad to know you found it.

Dirk
--
http://dirk.eddelbuettel.com | @eddelbuettel | ***@debian.org