csv_to_disk.frame running extremely slow #363

@bryan-rt

I am fairly new to handling medium-sized data, so I could very well be doing something basic wrong, but I can't see what the issue could be.

I have a 9 GB CSV file with 120M rows and 19 columns. I am running on an 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00 GHz with 4 cores, 8 logical processors, and 15.7 GB of memory. I am using R 3.6.1 (I can't update due to my employer).

When I run the following code, the stage 1 splitter runs for hours with no results, and the CPU usage for the "R for Windows front-end" workers is 0% most of the time.

Code

library(dplyr)
library(purrr)
library(disk.frame)

# this will set up disk.frame with multiple workers
setup_disk.frame()

# this will allow unlimited amount of data to be passed from worker to worker
options(future.globals.maxSize = Inf)

path_to_data = "F:/<file_path>"

# read 10 million rows at a time (1e7)
in_chunk_size = 1e7

system.time(
  csv_to_disk.frame(
    paste0(path_to_data, "<file_name>.csv"), 
    in_chunk_size = in_chunk_size
  )
)

Output

 ----------------------------------------------------- 
Stage 1 of 2: splitting the file F:/30000/30000/Credit Policy/BryanT/TU Scorecard CL Monitoring/perf_v3.csv into smallers files:
Destination: C:\Users\B9800\AppData\Local\Temp\Rtmp4AoDer\file42f4627257a4
 ----------------------------------------------------- 
Stage 1 of 2 took: 02:44:06 elapsed (45.4s cpu)
 ----------------------------------------------------- 
Stage 2 of 2: Converting the smaller files into disk.frame
 ----------------------------------------------------- 
csv_to_disk.frame: Reading multiple input files.
Please use `colClasses = `  to set column types to minimize the chance of a failed read
=================================================

 ----------------------------------------------------- 
-- Converting CSVs to disk.frame -- Stage 1 of 2:

Converting 13 CSVs to 20 disk.frames each consisting of 20 chunks

  Progress: ---------------------------------------------------------------- 100%-- Converting CSVs to disk.frame -- Stage 1 or 2 took: 33.0s elapsed (0.120s cpu)
 ----------------------------------------------------- 
 
 ----------------------------------------------------- 
-- Converting CSVs to disk.frame -- Stage 2 of 2:

Row-binding the 20 disk.frames together to form one large disk.frame:
Creating the disk.frame at C:\Users\B9800\AppData\Local\Temp\Rtmp4AoDer\file42f47e5437.df

Appending disk.frames: 
Stage 2 of 2 took: 32.4s elapsed (0.190s cpu)
 ----------------------------------------------------- 
Stage 1 & 2 in total took: 00:01:05 elapsed (0.310s cpu)
Stage 2 of 2 took: 00:01:08 elapsed (0.340s cpu)
 ----------------------------------------------------- 
Stage 2 & 2 took: 02:45:15 elapsed (45.7s cpu)
 ----------------------------------------------------- 
   user  system elapsed 
  45.75   28.98 9915.55
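
The `system.time()` figures above can be turned into a rough CPU-utilisation number, which quantifies the near-0% usage described earlier. This is only a minimal base-R sketch using the three values printed in the output; the variable names are illustrative, and it says nothing about where the remaining wall-clock time went.

```r
# Values copied from the system.time() output above (seconds)
timing <- c(user = 45.75, system = 28.98, elapsed = 9915.55)

# Total CPU time recorded by system.time()
cpu_seconds <- timing[["user"]] + timing[["system"]]

# Fraction of the elapsed wall-clock time that was spent on the CPU
cpu_share <- cpu_seconds / timing[["elapsed"]]

# Report as a percentage, rounded to two decimals
cat(sprintf("CPU time: %.2f s of %.2f s elapsed (%.2f%%)\n",
            cpu_seconds, timing[["elapsed"]], 100 * cpu_share))
```

By this measure, under 1% of the elapsed time was CPU time as recorded by `system.time()`.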

Is this an issue with using R 3.6.1? When I load disk.frame I get this warning:

Warning message: package ‘disk.frame’ was built under R version 3.6.3
