KTH-Library.github.io/performance/index.Rmd at master · KTH-Library/KTH-Library.github.io · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
---
title: "Performance improvements"
author: "KTH Library & ITA - ABM project"
date: "2020-01-10"
output:
  ioslides_presentation:
    logo: kth-logo.png
    transition: slower
    mathjax: default
    self-contained: true
#    css: kth.css

---


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

## Performance improvements in ABM

A typical issue (db timeout is set to 60s!) is that patience is required:

    > source(here("data-raw/public_data.R"))
    Patience, please. It takes a while to fetch the data into the cache.
     Error: nanodbc/nanodbc.cpp:950: HYT00: [Microsoft]
     [ODBC Driver 17 for SQL Server]Login timeout expired

    > source(here("data-raw/public_data.R"))
    Patience, please. It takes a while to fetch the data into the cache.
    Updating cached data for public data at: /home/markus/.config/bibmon/public.rds
    ✔ Setting active project to '/home/markus/repos/bibliomatrix'
    ✔ Saving 'abm_public_kth' to 'data/abm_public_kth.rda'

## Some potential areas of improvement

- Improve database calls (at backend level - views/SPs/temptables etc)
- Improve functions (at R package level)
- Improve caching (at app level)
- Improve server (more hardware / memory)
- Evaluate alternative backends (investigate using Apache Spark?)
- Faster rendering of rmarkdown (see [how on this link](https://bookdown.org/yihui/rmarkdown-cookbook/cache.html))

> What areas do you see?

> Which areas are the most important?

## Refactoring db calls and functions

- How long does calls to `abm_table*()` take?
- Everything else equal - differences between backends (mssql vs sqlite3)?
- When to issue a `collect`? Should `compute()` be used? See this [link](https://dbplyr.tidyverse.org/articles/dbplyr.html).
- Any opportunities to index?
- Other ideas?

## Caching

- Since interactive time (sub-second) is not there -> caching
- Caching of pre-rendered flexdashboards - a function exists
- Two kinds of flexdashboards (almost dupes), variants:
  - logged in (SAML auth)
  - not logged in ("anonymous")

## Measuring fcn performance {.smaller}

```{r, message=FALSE, echo=TRUE}
library(microbenchmark)
library(bibliomatrix)
library(dplyr)

mb <- microbenchmark(times = 1, unit = "s",
  t1 <- abm_table1(con = pool_bib("mssql")),
  t2 <- abm_table1(con = pool_bib("sqlite"))
)

print(mb)

```

## Measuring fcn performance

We can measure the times spent inside the `abm_table1()` when connecting against mssql using `profvis`:

```{r, message=FALSE, echo=TRUE}
library(profvis)

perf_profile <- profvis(abm_table1(con = pool_bib("mssql")))

```

Results are on the next slide

## {.smaller}

```{r}
perf_profile
```

## Measuring query perf

We can look at the query sent to the backend and profile it using EXPLAIN etc:

```{r, echo=TRUE}

abm_data(unit_code = "KTH") %>% show_query()
#abm_data(unit_code = "KTH") %>% explain()  # for query plan used in db

# sqlite3 db does not have indexes

```

## Adding indexes and tuning a query {.smaller}

    # go to the sqlite3 db and start the db CLI
    cd ~/.config/bibmon && sqlite3 bibmon.db

    # use the dbplyr sql (which uses %in%)
    sqlite> .timer on
    sqlite> SELECT count(*)
       ...> FROM `masterfile`
       ...> WHERE (`Unit_code` IN ('KTH'));
    Run Time: real 0.018 user 0.008986 sys 0.008708

    # add an index
    sqlite> create index idx1 on Masterfile(Unit_code);
    Run Time: real 0.092 user 0.064022 sys 0.016046

    sqlite> select count(*) from masterfile where Unit_code = "KTH";
    Run Time: real 0.006 user 0.005362 sys 0.000000

## SQLite3 db after index{.smaller}

EXPLAIN the query plan (shows that index is being used)

    sqlite> .eqp on
    sqlite> select count(*) from masterfile where Unit_code in ("KTH");
    --EQP-- 0,0,0,SEARCH TABLE masterfile USING COVERING INDEX idx1 (Unit_code=?)
    37274
    Run Time: real 0.005 user 0.005458 sys 0.000000

## Measuring several fcns {.smaller}

```{r, message=FALSE, echo=TRUE}
library(rlang)
library(purrr)
library(dplyr)

fns <- function(table_no, db_type)
  sprintf("abm_table%s(con = pool_bib('%s'))", table_no, db_type)

fns_sqlite <- fns(1:5, "sqlite")
fns_mssql <- fns(1:5, "mssql")

mb <- function(f)
  microbenchmark(times = 1, unit = "s", x <- force(f))$time / 1e9

perf <- function(f) tibble(
    f = f,
    p = map_dbl(f, function(x) mb(eval_tidy(parse_quo(x, env = global_env()))))
  )

res <- bind_cols(
  perf(fns_sqlite),
  perf(fns_mssql)
)

```

Results on next page.

## {.smaller}

```{r}
knitr::kable(res)
```


## Piñatas / TODOs

- Make `abm_table*()` fcns faster
- ...