---
title: "Apache Spark + sparklyr"
author: "KTH Library & ITA - ABM project"
date: "2020-01-15"
output:
  ioslides_presentation:
    logo: kth-logo.png
    transition: slower
    mathjax: default
    self_contained: true
    # css: kth.css
---
```{r setup, include=FALSE}
```
## Apache Spark + sparklyr
Resources:

- https://therinspark.com
- https://spark.rstudio.com/guides/data-lakes/#spark-as-an-analysis-engine
## Get Apache Spark locally
```{r, eval=FALSE}
library(sparklyr)
# download and install Apache Spark locally
spark_install()
# install Java 8 (if you don't have it already), on Linux (Debian/Ubuntu):
# sudo apt install openjdk-8-jdk
# update-java-alternatives --list
# sudo update-java-alternatives --set java-1.8.0-openjdk-amd64
# connect to the local Spark instance
sc <- spark_connect(master = "local")
```
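## Check the installation
A minimal sketch for verifying the steps above, assuming `sc` is the connection opened on the previous slide. The variable name `versions` is only illustrative.

```{r, eval=FALSE}
library(sparklyr)
# Spark versions downloaded by spark_install()
versions <- spark_installed_versions()
versions
# version of the Spark instance behind the connection `sc`
spark_version(sc)
```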
## Migrate data
```{r, eval=FALSE}
library(dplyr)
library(purrr)
library(bibliomatrix)
library(RSQLite)
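# read the "masterfile" table from SQLite into R memory
# (copy_to() expects a local data frame, hence collect())
src1 <- con_bib_sqlite() %>% tbl("masterfile") %>% collect()
# copy it to Spark as the table "masterfile"
dst1 <- copy_to(sc, src1, name = "masterfile")
# run an example query in Spark
dst1 %>% filter(Unit_code == "u101eneg")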
```
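## Collect query results
Queries on `dst1` run in Spark and return a lazy table. A minimal sketch of bringing a small result back into R, assuming `dst1` from the previous slide. The name `by_unit` is only illustrative.

```{r, eval=FALSE}
library(dplyr)
# count rows per unit in Spark, then collect the small result into R
by_unit <- dst1 %>%
  count(Unit_code) %>%
  collect()
by_unit
```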
## Apache Spark config
```{r, eval = FALSE}
library(sparklyr)
# memory and core settings for a local Apache Spark
Sys.setenv("SPARK_MEM" = "12g")
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "12G"
config$`sparklyr.shell.executor-memory` <- "4G"
# load the Cassandra connector package (built for Scala 2.11)
config$sparklyr.defaultPackages <- "com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3"
# host of the Cassandra node to connect to
config$spark.cassandra.connection.host <- "localhost"
config$spark.driver.maxResultSize <- "4G"
config$spark.executor.cores <- 3
# make the connection
sc <- spark_connect(master = "local", config = config)
```
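## Inspect and close the connection
A minimal sketch for checking the configured connection and releasing its resources, assuming `sc` was created with the config above. The name `ver` is only illustrative.

```{r, eval=FALSE}
library(sparklyr)
# version of Spark behind the configured connection `sc`
ver <- spark_version(sc)
ver
# open the Spark web UI to inspect executors and memory
spark_web(sc)
# release the local Spark resources when done
spark_disconnect(sc)
```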