Memory

One of the most vexing issues in R is memory. For anyone who works with large datasets, even with 64-bit R and plenty (e.g., 18Gb) of RAM, memory can still confound, frustrate, and stymie even experienced R users.

I am putting this page together for two purposes. First, it is for myself - I am sick and tired of forgetting memory issues in R, and so this is a repository for all I learn. Second, it is for others who are equally confounded, frustrated, and stymied.

However, this is a work in progress! And I do not claim to have a complete grasp on the intricacies of R memory issues. That said... here are some hints:


1) Read R> ?"Memory-limits". To see how much memory an object is taking, you can do this:
R> object.size(x)/1048576  # gives you the size of x in Mb (1 Mb = 1048576 bytes)
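A less error-prone alternative is to let R do the unit conversion for you (base R only; x here is whatever object you want to measure):
R> format(object.size(x), units = "Mb")    # size of x, already labelled in Mb
R> sort(sapply(ls(), function(z) object.size(get(z))), decreasing = TRUE)   # bytes used by every object in the workspace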

2) As I said elsewhere, 64-bit computing and a 64-bit version of R are indispensable for working with large datasets (you're capped at ~ 3.5 Gb of RAM with 32-bit computing). An error message of the type “Cannot allocate vector of size...” says that R cannot find a contiguous block of RAM large enough for whatever object it was trying to manipulate right before it crashed. This is usually (but not always; see #5 below) because your OS has no more RAM to give to R.

How to avoid this problem? Short of reworking R to be more memory efficient, you can buy more RAM, use a package designed to store objects on hard drives rather than in RAM (ff, filehash, R.huge, or bigmemory), or use a package designed to perform linear regression out of memory by working with smaller crossproduct matrices such as t(X) %*% X rather than X itself (biglm - haven't used this yet). For example, the bigmemory package helps create, store, access, and manipulate massive matrices. Matrices are allocated to shared memory and may use memory-mapped files. Thus, bigmemory provides a convenient structure for use with parallel computing tools (SNOW, NWS, multicore, foreach/iterators, etc.) and either in-memory or larger-than-RAM matrices. I have yet to delve into the RSQLite package, which provides an interface between R and the SQLite database system (thus, you only bring into memory the portion of the database you need to work with).
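As a rough illustration of the file-backed approach, here is a minimal sketch using bigmemory (the file names and dimensions are made up):

library(bigmemory)
# a 1e6 x 200 double matrix backed by a file on disk instead of held in RAM
big <- filebacked.big.matrix(nrow = 1e6, ncol = 200, type = "double",
                             backingfile = "big.bin", descriptorfile = "big.desc")
big[1:5, 1:3] <- rnorm(15)   # read and write with ordinary matrix indexing
big[1:5, 1:3]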

If you're unwilling to do any of the above, the final option is to read in only the part of the matrix you need, work with that portion of it, and then remove it from memory. Slow but doable for most things.
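For example, here is a minimal sketch of chunked reading with base R (the file name, chunk size, and the 'work' step are hypothetical):

chunk.size <- 10000
con <- file("huge_data.txt", open = "r")
repeat {
  lines <- readLines(con, n = chunk.size)
  if (length(lines) == 0) break        # end of file
  chunk <- read.table(text = lines)    # parse only this block of rows
  # ... do your work on 'chunk' here; it is overwritten on the next pass ...
}
close(con)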

3) It is helpful to constantly keep an eye on the Unix top command (not sure what the equivalent is in windoze) to check how much RAM your R session is taking. Usually I type in Terminal:
top -orsize
which, on my Mac, sorts all programs by the amount of RAM being used. If you want to understand what the readout means, see here. The long and short of it is this: your computer has available to it the “free” PLUS the “inactive” memory. No program should run out of memory until these are depleted. The column to pay attention to in order to see the amount of RAM a program is using is “RSIZE.” Here is an article describing the gory details of the Mac’s memory usage.

4) gc() is a function that triggers garbage collection and can return unused memory to the operating system. I used to think that calling it explicitly could be helpful in certain circumstances, but I no longer believe this. Basically, if you purge an object in R, that unused RAM will remain in R’s ‘possession,’ but it will be returned to the OS (or used by another R object) when needed. Thus, an explicit call to gc() will not help - R’s memory management goes on behind the scenes and does a pretty good job.

Also, you’ll often note that the R process in top uses more memory than the sum of the sizes of all the objects in your workspace. Why? My understanding is that R keeps some memory in reserve that is not returned to the OS but that can be reused for future objects. Thus, don’t worry too much if your R session in top seems to be taking more memory than it should.
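To see what R itself thinks it is using, you can ask the garbage collector for a report (both calls below are base R):
R> gc()               # current Ncells/Vcells usage in Mb, plus the gc trigger and max-used columns
R> gc(reset = TRUE)   # same report, but also resets the 'max used' statistics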

5) Swiss cheese memory and memory fragmentation. R looks for *contiguous* bits of RAM to place any new object. If it cannot find such a contiguous piece of RAM, it returns a “Cannot allocate vector of size...” error. If you are allocating lots of different sized objects with no game plan, your RAM will begin to look like swiss cheese - lots of holes throughout and no order to it. Thus, good programmers keep a mental picture of ‘what their RAM looks like.’ A few ways to do this:
  a) If you are making lots of matrices then removing them, make sure to make the large matrices first. Then, the RAM taken for the smaller matrices can fit inside the footprint left by the larger matrices.
  b) It can be helpful to ‘pre-allocate’ matrices by telling R the size of the matrix before you begin filling it up. The wrong way to fill in a matrix is to allow it to grow dynamically (e.g., in a loop). In this case, R has to find room for a matrix of (say) 100 rows, then 101 rows, then 102 rows, etc. Each new matrix can’t fit inside the RAM footprint of the old one, so R has to find a *new* bit of contiguous RAM for the newly enlarged matrix. Thus, instead of using just the one chunk of RAM it takes to make a matrix of size, say, 1000 rows by 200 columns, you are instead using RAM to make matrices of size 1000 x 200 AND 999 x 200 AND 998 x 200 AND 997 x 200, etc. This is what I meant above by “swiss cheese.” (A short sketch contrasting the two approaches appears after this list.)
  c) Switch to 64-bit computing. Memory fragmentation tends to be much less of an issue (nonexistent?) on 64-bit computing.
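A minimal sketch of the wrong and right ways to fill a matrix (the dimensions are arbitrary):

n <- 1000; p <- 200

# Wrong: grow the matrix one row at a time - each rbind() forces a new, larger allocation
bad <- matrix(numeric(0), nrow = 0, ncol = p)
for (i in 1:n) {
  bad <- rbind(bad, rnorm(p))
}

# Better: pre-allocate the full matrix once, then fill it in place
good <- matrix(NA_real_, nrow = n, ncol = p)
for (i in 1:n) {
  good[i, ] <- rnorm(p)
}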

Useful code to remember for pulling in large datasets:

# create SNP information in new haplotype matrix - 88.9 seconds
# (slow because skip=i makes scan() re-read the file from the top on every iteration)
system.time({
  for (i in 0:199) {
    ss <- paste("X", scan("ss4.out", what = 'character', skip = i, nlines = 1), sep = "")
    index <- match(ss, nms)        # columns of new.hap corresponding to these SNP names
    new.hap[i + 1, index] <- 1
  }
})

# this took 2.3 seconds - an open connection keeps its place in the file,
# so each scan() reads the next line rather than re-reading from the start
system.time({
  con <- file("ss4.out", open = 'r')
  for (i in 0:199) {
    ss <- paste("X", scan(con, what = 'character', nlines = 1), sep = "")
    index <- match(ss, nms)
    new.hap[i + 1, index] <- 1
  }
  close(con)                       # don't leave the connection open
})
 
