|
| 1 | +# Hot File Replication |
| 2 | + |
| 3 | +Hot file replication automatically creates additional replicas of files that are being read |
| 4 | +frequently (i.e. "hot" files), distributing load across the pool fleet. |
| 5 | + |
| 6 | +## How It Works |
| 7 | + |
| 8 | +### Protocol-Agnostic Design |
| 9 | + |
| 10 | +All dCache read protocols (DCap, NFS/pNFS, WebDAV, xrootd, FTP, HTTP, pool-to-pool) send a |
| 11 | +`PoolIoFileMessage` to the pool when a client requests a file. The pool dispatches that message |
| 12 | +to `PoolV4.ioFile()`, which is therefore a single, protocol-independent entry point for every |
| 13 | +read request. Hot file monitoring is implemented at that point: |
| 14 | + |
| 15 | +``` |
| 16 | +Any Door (DCap / NFS / WebDAV / xrootd / FTP / HTTP / …) |
| 17 | + → sends PoolIoFileMessage |
| 18 | + → Pool.messageArrived(PoolIoFileMessage) |
| 19 | + → PoolV4.ioFile() ← monitoring happens here |
| 20 | + → queues mover (protocol-specific) |
| 21 | + → FileRequestMonitor.reportFileRequest(pnfsId, currentCount, protocolInfo) |
| 22 | +``` |
| 23 | + |
| 24 | +Because the counting and triggering happen in `PoolV4.ioFile()`, no protocol-specific code |
| 25 | +changes are required to benefit from this feature. |
| 26 | + |
| 27 | +### Request Counting |
| 28 | + |
| 29 | +`PoolV4.ioFile()` reads the concurrent mover count for the file from `IoQueueManager`: |
| 30 | + |
| 31 | +```java |
| 32 | +long requestCount = _ioQueue.numberOfRequestsFor(message.getPnfsId()); |
| 33 | +_fileRequestMonitor.reportFileRequest(message.getPnfsId(), requestCount, |
| 34 | + message.getProtocolInfo()); |
| 35 | +``` |
| 36 | + |
| 37 | +When `requestCount` reaches or exceeds the configured threshold, |
| 38 | +`MigrationModule.reportFileRequest()` creates a migration job named |
| 39 | +`hotfile-<pnfsId>` that replicates the file to additional pools. |
| 40 | + |
| 41 | +### Pool Selection |
| 42 | + |
| 43 | +The migration job selects target pools by querying PoolManager via `PoolMgrQueryPoolsMsg`, |
| 44 | +deriving `protocolUnit` from the triggering request's `ProtocolInfo` (e.g., `"DCap/3"`) and |
| 45 | +`netUnitName` from the client's IP address when available (e.g., `"192.168.1.10"`). When the |
| 46 | +client IP is not available (non-IP protocol or unknown), an empty string is used for `netUnitName`, |
| 47 | +which causes PoolManager to match any network unit. When `ProtocolInfo` is null (e.g., for |
| 48 | +internal pool-to-pool transfers), `protocolUnit` falls back to `"*/*"` and `netUnitName` to `""` |
| 49 | +so that selection is based solely on the file's storage group and pool-group read preferences. |
| 50 | + |
| 51 | +`PoolMgrQueryPoolsMsg.getPools()` returns a `List<String>[]` where index 0 is the highest |
| 52 | +read-preference level. `PoolListByPoolMgrQuery` selects **only** the first non-empty |
| 53 | +preference level, so the file is always replicated to the best available pools: |
| 54 | + |
| 55 | +```java |
| 56 | +// Only take the first non-empty preference level (highest read preference) |
| 57 | +for (int i = 0; i < poolLists.length; i++) { |
| 58 | + List<String> poolList = poolLists[i]; |
| 59 | + if (poolList != null && !poolList.isEmpty()) { |
| 60 | + selectedPools = poolList; |
| 61 | + break; |
| 62 | + } |
| 63 | +} |
| 64 | +``` |
| 65 | + |
| 66 | +Prior to this, all preference levels were flattened into a union, causing files to be |
| 67 | +replicated to pools from lower-preference groups (e.g. flush pools) instead of the intended |
| 68 | +read-only pools. |
| 69 | + |
| 70 | +### Job Housekeeping |
| 71 | + |
| 72 | +To prevent unbounded memory growth, `MigrationModule` keeps at most 50 hotfile jobs. When a |
| 73 | +new job would exceed that limit, the oldest jobs that have reached a terminal state |
| 74 | +(`FINISHED`, `CANCELLED`, `FAILED`) are pruned first. |
| 75 | + |
| 76 | +## Configuration |
| 77 | + |
| 78 | +| Property | Default | Description | |
| 79 | +|---|---|---| |
| 80 | +| `pool.hotfile.replication.enable` | `false` | Enable/disable hot file monitoring. **Must be `true` to activate.** | |
| 81 | +| `pool.migration.hotfile.threshold` | `50` | Number of concurrent read movers required to trigger replication | |
| 82 | +| `pool.migration.hotfile.replicas` | `1` | Number of additional replicas to create | |
| 83 | +| `pool.migration.concurrency.default` | `1` | Number of files the migration job migrates concurrently | |
| 84 | + |
| 85 | +Example (`dcache.conf` or pool layout file): |
| 86 | + |
| 87 | +```ini |
| 88 | +pool.hotfile.replication.enable = true |
| 89 | +pool.migration.hotfile.threshold = 3 |
| 90 | +pool.migration.hotfile.replicas = 3 |
| 91 | +pool.migration.concurrency.default = 1 |
| 92 | +``` |
| 93 | + |
| 94 | +> **Note:** The feature is disabled by default. A pool restart is required after any |
| 95 | +> configuration change. |
| 96 | +
|
| 97 | +## Key Source Files |
| 98 | + |
| 99 | +| File | Role | |
| 100 | +|---|---| |
| 101 | +| `modules/dcache/src/main/java/org/dcache/pool/classic/PoolV4.java` | Entry point; checks enable flag, calls `FileRequestMonitor` | |
| 102 | +| `modules/dcache/src/main/java/org/dcache/pool/migration/MigrationModule.java` | Implements `FileRequestMonitor`; counts requests, creates and manages migration jobs | |
| 103 | +| `modules/dcache/src/main/java/org/dcache/pool/migration/PoolListByPoolMgrQuery.java` | Queries PoolManager for eligible target pools; selects highest-preference level only | |
| 104 | +| `modules/dcache/src/test/java/org/dcache/pool/classic/HotfileMonitoringTest.java` | Spring-context integration test for enable/disable behaviour | |
| 105 | +| `modules/dcache/src/test/java/org/dcache/pool/migration/MigrationModuleTest.java` | Unit tests for `reportFileRequest`, threshold, housekeeping | |
| 106 | +| `modules/dcache/src/test/java/org/dcache/pool/migration/PoolListByPoolMgrQueryTest.java` | Unit tests for pool selection, preference-level handling, unknown net unit, and wildcard protocol | |
| 107 | +| `skel/share/defaults/pool.properties` | Canonical defaults for all `pool.hotfile.*` and `pool.migration.hotfile.*` properties | |
| 108 | + |
| 109 | +## Diagnostics |
| 110 | + |
| 111 | +### Log Messages |
| 112 | + |
| 113 | +With the default log level the following INFO messages are emitted by `MigrationModule`: |
| 114 | + |
| 115 | +``` |
| 116 | +Hot file monitoring: pnfsId=<id>, requests=<n>, threshold=<t> |
| 117 | +Hot file detected! Triggering replication for pnfsId=<id> |
| 118 | +Created migration job with id hotfile-<id> for pnfsId <id> with <n> replicas and concurrency <c> |
| 119 | +Starting migration job hotfile-<id> for pnfsId <id> |
| 120 | +Successfully started migration job hotfile-<id> for pnfsId <id> |
| 121 | +Job hotfile-<id> already exists with state <STATE> |
| 122 | +``` |
| 123 | + |
| 124 | +`PoolV4` emits at INFO: |
| 125 | + |
| 126 | +``` |
| 127 | +PoolV4.ioFile: Received IO request for pnfsId=<id>, hotFileEnabled=<bool>, monitorSet=<bool> |
| 128 | +PoolV4.ioFile: Calling reportFileRequest for pnfsId=<id>, count=<n> |
| 129 | +``` |
| 130 | + |
| 131 | +And at ERROR if the monitor is not wired: |
| 132 | + |
| 133 | +``` |
| 134 | +PoolV4.ioFile: Hot file replication enabled but FileRequestMonitor is NULL! |
| 135 | +``` |
| 136 | + |
| 137 | +`PoolListByPoolMgrQuery` emits at DEBUG when a preference level is selected: |
| 138 | + |
| 139 | +``` |
| 140 | +Selected preference level <i> with <n> pools for <pnfsId>: [pool1, pool2, …] |
| 141 | +``` |
| 142 | + |
| 143 | +### Runtime Log Level Adjustment |
| 144 | + |
| 145 | +``` |
| 146 | +# In the pool's admin shell |
| 147 | +log set org.dcache.pool.classic.PoolV4 DEBUG |
| 148 | +log set org.dcache.pool.migration.MigrationModule DEBUG |
| 149 | +``` |
| 150 | + |
| 151 | +### Checking Job and Replica Status |
| 152 | + |
| 153 | +```bash |
| 154 | +# List active migration jobs on all pools |
| 155 | +ssh -p <admin-port> admin@<host> '\s <pool-pattern> migration ls' |
| 156 | + |
| 157 | +# Inspect a specific job |
| 158 | +ssh -p <admin-port> admin@<host> '\s <pool-name> migration info hotfile-<pnfsId>' |
| 159 | + |
| 160 | +# List replicas of a file |
| 161 | +ssh -p <admin-port> admin@<host> '\sl <pnfsId> rep ls <pnfsId>' |
| 162 | +``` |
| 163 | + |
| 164 | +### Interpreting Absence of Log Messages |
| 165 | + |
| 166 | +| Symptom | Likely Cause | |
| 167 | +|---|---| |
| 168 | +| No `PoolV4.ioFile` messages | IO requests are not reaching the pool, or the feature is disabled (`hotFileEnabled=false`) | |
| 169 | +| `monitorSet=false` in PoolV4 log | `FileRequestMonitor` not wired — check Spring context startup errors | |
| 170 | +| `requests` stays at 1 | IoQueue not counting movers correctly | |
| 171 | +| "Hot file detected" but no job created | Exception during job creation — check ERROR lines for a stack trace | |
| 172 | +| Job created but not started | `MigrationModule` not started — run `migration start` in the admin interface | |
| 173 | +| "Job already exists" repeating | Previous job is stuck in a non-terminal state — inspect job state | |
0 commit comments