
left_join two dataframes out of memory #352

@SMousavi90


I recently started using this library, but I have run into a problem when joining two data.frames:

a: the first data.frame, with 2127625 rows
b: the second data.frame, with 73364 rows

After setting up disk.frame, it reports:
The number of workers available for disk.frame is 8

Both a and b are converted to disk.frames before the join:

a <- as.disk.frame(a, outdir = "/home/ruser/tmp/a.mdf", overwrite = TRUE)
b <- as.disk.frame(b, outdir = "/home/ruser/tmp/b.mdf", overwrite = TRUE) 

Later in my code I run this join:

c <- a %>% left_join(b)

This line returns:

...
Hashing...
Hashing...
Appending disk.frames: 
Error: cannot allocate vector of size 581.5 Gb
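
An allocation this large for inputs of this size can occur when the join keys are duplicated on both sides, because every matching pair of rows produces one output row. The sketch below is a rough diagnostic in base R, not part of disk.frame: it assumes the join uses all columns shared by the two tables (dplyr's default when no `by` is given) and estimates how many rows the left join would produce. The toy tables and the `count_keys` helper are hypothetical stand-ins for the real data.

```r
# Toy stand-ins for the real tables (hypothetical data, not from the issue).
a <- data.frame(id = c(1, 1, 2, 3), x = 1:4)
b <- data.frame(id = c(1, 1, 1, 2), y = 5:8)

# Columns a join would use when no `by` is given (assumed natural-join default).
keys <- intersect(names(a), names(b))

# Count how often each key combination occurs in a table.
# Note: aggregate() drops rows whose keys contain NA.
count_keys <- function(df, keys) {
  aggregate(list(n = rep(1L, nrow(df))), by = df[keys], FUN = sum)
}
na <- count_keys(a, keys)
nb <- count_keys(b, keys)

# A key present n_a times on the left and n_b times on the right yields
# n_a * n_b rows; a left key with no match on the right yields n_a rows.
m <- merge(na, nb, by = keys, all.x = TRUE, suffixes = c("_a", "_b"))
m$n_b[is.na(m$n_b)] <- 1
estimated_rows <- sum(as.numeric(m$n_a) * m$n_b)

print(keys)
print(estimated_rows)
```

If the estimate is far larger than `nrow(a)`, the shared columns do not identify rows in b uniquely, and stating the intended key columns explicitly with `by` may be worth trying.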

Then I tried it this way:
c <- a %>% left_join(b, merge_by_chunk_id = TRUE)

First it used all of my RAM (62.8 GB), then it returned an error:

Error in unserialize(node$con) : 
  Failed to retrieve the value of MultisessionFuture (future_mapply-1) from cluster RichSOCKnode #1 (PID 25440 on localhost ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: A process with this PID exists, which suggests that the localhost worker is still alive.

and the join was not performed.

I also tested it with:
setup_disk.frame(workers = 16)

Same result.

I should mention that joins on my other data.frames completed successfully; only this one, which differs from the others only in being larger, failed.

Could you please help me understand what the problem is?
