Closed
Description
I am fairly new to handling medium-sized data, so I could very well be doing something basic wrong, but I can't see what my issue could be.
I have a 9 GB CSV file with 120M rows and 19 columns, and am running on an 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00 GHz with 4 cores, 8 logical processors, and 15.7 GB of memory. I am using R 3.6.1 (I can't update because of my employer).
When I run the following code, the stage 1 splitter runs for hours with no results, and the CPU usage of the "R for Windows front-end" worker processes is 0% most of the time.
Code
library(dplyr)
library(purrr)
library(disk.frame)
# this will set up disk.frame with multiple workers
setup_disk.frame()
# this will allow unlimited amount of data to be passed from worker to worker
options(future.globals.maxSize = Inf)
path_to_data = "F:/<file_path>"
# read 10 million rows at once (1e7)
in_chunk_size = 1e7
system.time(
  csv_to_disk.frame(
    paste0(path_to_data, "<file_name>.csv"),
    in_chunk_size = in_chunk_size
  )
)
Output
-----------------------------------------------------
Stage 1 of 2: splitting the file F:/30000/30000/Credit Policy/BryanT/TU Scorecard CL Monitoring/perf_v3.csv into smallers files:
Destination: C:\Users\B9800\AppData\Local\Temp\Rtmp4AoDer\file42f4627257a4
-----------------------------------------------------
Stage 1 of 2 took: 02:44:06 elapsed (45.4s cpu)
-----------------------------------------------------
Stage 2 of 2: Converting the smaller files into disk.frame
-----------------------------------------------------
csv_to_disk.frame: Reading multiple input files.
Please use `colClasses = ` to set column types to minimize the chance of a failed read
=================================================
-----------------------------------------------------
-- Converting CSVs to disk.frame -- Stage 1 of 2:
Converting 13 CSVs to 20 disk.frames each consisting of 20 chunks
Progress: ---------------------------------------------------------------- 100%-- Converting CSVs to disk.frame -- Stage 1 or 2 took: 33.0s elapsed (0.120s cpu)
-----------------------------------------------------
-----------------------------------------------------
-- Converting CSVs to disk.frame -- Stage 2 of 2:
Row-binding the 20 disk.frames together to form one large disk.frame:
Creating the disk.frame at C:\Users\B9800\AppData\Local\Temp\Rtmp4AoDer\file42f47e5437.df
Appending disk.frames:
Stage 2 of 2 took: 32.4s elapsed (0.190s cpu)
-----------------------------------------------------
Stage 1 & 2 in total took: 00:01:05 elapsed (0.310s cpu)
Stage 2 of 2 took: 00:01:08 elapsed (0.340s cpu)
-----------------------------------------------------
Stage 2 & 2 took: 02:45:15 elapsed (45.7s cpu)
-----------------------------------------------------
user system elapsed
45.75 28.98 9915.55
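To put the numbers above in perspective, here is a small base-R sketch that only re-does the arithmetic on figures copied from the log and the `system.time()` output. The variable names are my own. Note that the CPU figures most likely cover only the main R session, not the worker processes, so treat the "busy" fraction as a rough indication rather than a measurement of the workers.

```r
# Figures copied from the log and the system.time() output above
stage1_elapsed_s <- 2 * 3600 + 44 * 60 + 6  # "Stage 1 of 2 took: 02:44:06 elapsed"
total_elapsed_s  <- 9915.55                 # system.time(): elapsed
user_cpu_s       <- 45.75                   # system.time(): user
system_cpu_s     <- 28.98                   # system.time(): system

# Share of wall-clock time spent in the stage 1 file splitter
stage1_share <- stage1_elapsed_s / total_elapsed_s

# Fraction of wall-clock time the measured process spent on the CPU
cpu_busy <- (user_cpu_s + system_cpu_s) / total_elapsed_s

cat(sprintf("Stage 1 share of wall time: %.1f%%\n", 100 * stage1_share))
cat(sprintf("CPU busy fraction:          %.2f%%\n", 100 * cpu_busy))
```

So roughly 99% of the run is the stage 1 split, and the measured process is on the CPU for under 1% of the elapsed time, which suggests it is waiting on something (for example disk or network I/O) rather than computing.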
Is this an issue with using R 3.6.1? When I load disk.frame I get:
Warning message: package ‘disk.frame’ was built under R version 3.6.3
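Separately, the log asks for `colClasses` to be set. Below is a minimal base-R sketch of one way to derive them from a sample of rows. `infer_col_classes` is a hypothetical helper of mine, not part of disk.frame, and the demo file is made up purely so the snippet runs on its own. This addresses the hint in the log; I am not assuming it explains the slow stage 1.

```r
# Hypothetical helper (not part of disk.frame): guess column classes
# from the first n_rows rows of a CSV file.
infer_col_classes <- function(csv_path, n_rows = 10000) {
  sample_rows <- read.csv(csv_path, nrows = n_rows, stringsAsFactors = FALSE)
  # Keep the first class of each column as a named character vector
  vapply(sample_rows, function(col) class(col)[1], character(1))
}

# Demo on a small temporary file (made-up data, for illustration only)
demo_csv <- tempfile(fileext = ".csv")
demo_data <- data.frame(
  id = 1:3,
  score = c(0.5, 1.5, 2.5),
  grade = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
write.csv(demo_data, demo_csv, row.names = FALSE)

col_classes <- infer_col_classes(demo_csv)
print(col_classes)
```

The resulting named vector could then be passed as `colClasses = col_classes` in the `csv_to_disk.frame()` call, on the assumption that the argument is forwarded to the underlying CSV reader as the log message implies. A sample can guess wrong (for example, a column that looks like integers in the first rows but holds decimals later), so the result is worth checking by hand.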