---
title: "Apache Spark + sparklyr"
author: "KTH Library & ITA - ABM project"
date: "2020-01-15"
output:
  ioslides_presentation:
    logo: kth-logo.png
    transition: slower
    mathjax: default
    self_contained: true
#    css: kth.css

---

```{r setup, include=FALSE}
```

## Apache Spark + sparklyr

Resources:

- https://therinspark.com
- https://spark.rstudio.com/guides/data-lakes/#spark-as-an-analysis-engine

## Get Apache Spark locally

```{r, eval=FALSE}
library(sparklyr)

# download and install a local copy of Apache Spark
spark_install()

# install Java 8 if you don't have it already; on Debian/Ubuntu Linux:
# sudo apt install openjdk-8-jdk
# update-java-alternatives --list
# sudo update-java-alternatives --set java-1.8.0-openjdk-amd64

# connect to the local Spark instance
sc <- spark_connect(master = "local")
```
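
## Check the connection

A minimal sketch for verifying the setup, assuming the connection `sc` from the previous slide is open. The variable name `v` is only for illustration.

```{r, eval=FALSE}
library(sparklyr)

# list the Spark versions that spark_install() has put on this machine
spark_installed_versions()

# ask the running session which Spark version it uses
v <- spark_version(sc)
v
```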

## Migrate data

```{r, eval=FALSE}

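library(dplyr)
library(purrr)
library(bibliomatrix)
library(RSQLite)

# read the "masterfile" table from the local SQLite database;
# collect() first, so that copy_to() receives a local data frame
src1 <- con_bib_sqlite() %>% tbl("masterfile") %>% collect()

# move the table into Spark under the name "masterfile"
dst1 <- copy_to(sc, src1, name = "masterfile")

# run a query in Spark: all rows for one unit
dst1 %>% filter(Unit_code == "u101eneg")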

```
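
## Query the migrated data

Once the table is in Spark, dplyr verbs are translated to Spark SQL and run there. A small sketch, assuming `sc` and `dst1` from the previous slide. The name `top_units` is illustrative.

```{r, eval=FALSE}
library(sparklyr)
library(dplyr)

# tables currently registered in the Spark session
src_tbls(sc)

# count rows per unit inside Spark, then collect the ten largest groups into R
top_units <- dst1 %>%
  count(Unit_code, sort = TRUE) %>%
  head(10) %>%
  collect()

# show the SQL that dplyr generates for the earlier filter
dst1 %>%
  filter(Unit_code == "u101eneg") %>%
  show_query()
```

Only `collect()` brings rows back into the R session. Everything before it is evaluated lazily by Spark.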

## Apache Spark config

```{r, eval = FALSE}
library(sparklyr)

# config for Apache Spark: memory for the driver and the executors
Sys.setenv("SPARK_MEM" = "12g")
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "12G"
config$`sparklyr.shell.executor-memory` <- "4G"

# load the Cassandra connector and point it at the local Cassandra host
config$sparklyr.defaultPackages <- "com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3"
config$spark.cassandra.connection.host <- "localhost"

# limit the size of results returned to the driver, and the cores per executor
config$spark.driver.maxResultSize <- "4G"
config$spark.executor.cores <- 3

# make the connection
sc <- spark_connect(master = "local", config = config)
```
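
## Inspect the config and disconnect

To see which settings the running session picked up, and to release its resources afterwards, something like the following could be used. It assumes the connection `sc` from the previous slide. The name `conf` is illustrative.

```{r, eval=FALSE}
library(sparklyr)

# settings of the running Spark context, as a named list
conf <- spark_context_config(sc)

# keep only the memory- and core-related entries
conf[grepl("memory|cores|maxResultSize", names(conf))]

# close the connection when done
spark_disconnect(sc)
```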