High priority MDL_BACKUP_COMMIT_HIGH_PRIO and Slave-side BACKUP-COMMIT tracking to prevent parallel-slave/backup deadlock#5123
Conversation
…backup..
The problem the patch tackles is that the binlogging parallel slave
can expose prepared state of transactions to the backup process.
That is the latter can enroll such transactions, typically waiting for
prior commits - in other words *undecided* (of whether they are going
to commit at all - sic!), into backup image.
The technical possibility of that owes to earlier deadlocks fixes in this area
that made the parallel slave worker to re-acquire its BACKUP MDL within
the wait-for-prior commit phase (which is post- engine prepare) one.
Note the --skip-log-bin or --log-slave-update=OFF parallel slave is
not vulnerable to the exposure of its prepared transactions.
The fixes this patch provides make sure the backup image contains only
transactions that are (binlog-order) committed, while
parallel slave does not deadlock (MDEV-23586).
The principal part of the fixes implements leapfrogging of a waiting
"high-priority" BACKUP MDL by a slave parallel worker which is
exemplified in the following. First mind about notations:
here (as elsewhere in the patch) `Fn` group commit followers
indexed by the gtid seq-no `n`; `F|xyz` indicates the execution
status (timing) of the MDL request. `B` - stands for backup
thread; `Si,Xk` are BACKUP share and exclusive MDL sequenced by
`i,k` that are logical times of the MDL acquisition; The
semi-column separates the granted head of the lock queue from the
waiters.
The queue below is the BACKUP MDL queue.
| which initially is
|
| [Initial Queue State]
| GRANTED ; WAITING QUEUE
| ----------------------------------
| S1(F2) ; X2(B)
|
| Next it receives
| <- Incoming Request: S3(L1)
| the scheduler having the following context
| [Scheduler Evaluation]
| Requestor: L1 (gtid_sub_id = 1)
| Holder : F2 (gtid_sub_id = 2)
| Condition: holder->gtid_sub_id > requestor->gtid_sub_id (2 > 1) -> TRUE!
| Action : ignore_mdl_priority = true. L1 bypasses the waiting X2 lock.
|
| decides where to place it
|
| [Final Queue State]
| GRANTED ; WAITING QUEUE
| ----------------------------------
| S1(F2), S3(L1) ; X2(B)
| so it chooses to team up L1,F2 correctly. Any next S4(F_n), where n > 2
| won't be granted which means F_n will wait for the backup completion.
| Like this
|
| [Final Queue State]
| GRANTED ; WAITING QUEUE
| ----------------------------------
| S1(F2), S3(L1) ; X2(B),S4(F3)
The decision to allow S3 leapfrog X2 requires the current lock holder
F2 to have a greater gtid_sub_id (2 > 1).
To implement this scheduling policy requires the following modifications.
P1 sql/mdl.h
The new method determines whether a BACKUP lock requester is a teammate
of a group that can share the MDL lock and that there exists at least
on granted member of the group.
+MDL_request::bool (*is_teammate_callback)(const THD*, const THD*);
P2. sql/log.cc
dismantling of mdl_context.{release,acquire}_lock() in the parallel worker
wait-for-prior commit
P3. sql/handler.cc
setting the P1 MDL_request::is_teammate_callback to the slave parallel worker
as hint to try acquiring the MDL lock by team membership.
P4. sql/rpl_rli.cc
defines the teammate callback for the parallel slave.
+rpl_group_info::ignore_mdl_priority
P5. sql/handler.cc
Logics of "careful" release of the MDL lock could not be streamlined.
The failed parallel slave worker still has to abide with the former policy of
``` as there is extra replication
book-keeping to be done before rolling back and allowing a conflicting
transaction to continue (MDEV-7458).``` [ha_commit_trans].
For that reason of MDEV-7458 the BACKUP MDL request's memory is now allocated
for the worker in THD::st_transaction::mem_root. Also guards are deployed to not let
the mem-root be be cleaned too early, before the lock gets released. It can be
released fast to follow with transaction->cleanup() for
not failing trx:s.
Failing to commit parallel workers defer that to Relay_log_info::cleanup_context()
from where now ha_rollback_trans() will make it.
Aslo necessary changes are caused by to the gtid implicit statement's design.
1. sql/sql_class.cc
Removed earlier backup-on-parallel-slave bugfixes of MDEV-23586.
The parallel worker does not release anymore BACKUP MDL at its wait-for-prior
commit stage.
2. sql/sql_lex.h, sql/sql_base.cc, sql/rpl_gtid.cc
Has to be introduced
+#define TL_OPTION_GTID_TABLE_SLAVE 64
as a part of a method to find out (see open_table() hunk) at
executing record_gtid() by implicit GTID statement that its
BACKUP MDL lock (however it is necessary - it is not challenged
here) needs the ignore-priority hint. Note this hint applies for granting
a "special" S' share locked not yet the commit time ha_commit_trans()'s
MDL_BACKUP_COMMIT (denoted as S) lock, which is going to be requested later.
Let's exemplify it on the following diagrams, calling S' a sort of
shared lock, compatible with with S of the commit time.
*Without* the teammate hint the request for S'3(F2) at record_gtid() time
S1(F3|wait_for_prior_commit); X2(B) <- S'3(F2|record_gtid)
would end up to wait behind X2
S1; /* waiting */ X2, S'3
and we'd be regress back to hang/deadlock of earlier bugs: F3 would not be
awaken by F2 who is blocked by B.
Tested with extended set of parallel slave with backup scenarios.
This commit must require more and extensive testing and may cause
followup amendments, through review comments as well.
Co-authored-with: hemant.dangi@mariadb.com.
When the standard request of MDL_BACKUP_COMMIT is not granted to a worker
it first searches for a possible grantee that is a parallel slave
worker and if there exist one with greater commit order the request
type is escalate to the highest which is re-submitted.
To the changes in sql_base.cc,
note there also exist a request of the highest (new) type at time of
Xid_apply_log_event::do_record_gtid which is a committing phase of
slave transaction when it is executing an implicit statement.
It must be able to bypass any waiting lock by backup process even
without yet knowing whether the transaction is going to be enrolled
into a pending to commit slave group. When later it turns out
not the case the strong MDL_STATEMENT requested lock would've been
already relinquished.
Tested with previous commit's refined and extended BASE's
rpl.parallel_backup, rpl.rpl_parallel_backup_worker_retry
and new added
rpl.rpl_parallel_backup_waits_worker_retry
Todo:
According to Svoj's review, this patch still needs
removing the slave specific code from the MDL.
That is changes in MDL_context::acquire_lock().
Probably
ha_commit_trans() would call first
try_acquire_lock_impl()
when it returns empty ticket a part of the added to acquire_lock
Teammate-Aware Escalation Check would be invoked just to compute
`has_teammate` and when that is true to call the whole BASE version
of MDL_context::acquire_lock() having the highest type of request.
|
|
There was a problem hiding this comment.
Code Review
This pull request addresses deadlocks between parallel replication and backup operations (MDEV-36025) by introducing a high-priority Metadata Lock (MDL_BACKUP_COMMIT_HIGH_PRIO) and a Backup_commit_team mechanism. Slave workers now register their gtid_sub_id and can escalate their lock priority to bypass waiting backup locks when a teammate with a higher ID is already in flight, replacing previous logic that released locks during waits. Feedback was provided to optimize memory usage in the Backup_commit_team class by erasing empty domain entries from the internal map and simplifying the is_empty function accordingly.
tracking to prevent parallel-slave/backup deadlock A parallel-slave worker that has acquired MDL_BACKUP_COMMIT and is waiting for a prior commit can be deadlocked by an incoming BACKUP STAGE BLOCK_COMMIT: the prior commit needs MDL_BACKUP_COMMIT too but the standard fairness rule denies it while the backup X-request is in the waiting queue. Design ------ Add MDL_BACKUP_COMMIT_HIGH_PRIO, a sibling lock mode of MDL_BACKUP_COMMIT. The only matrix difference is one cell of m_waiting_incompatible: HIGH_PRIO is granted even when a BACKUP X-request is waiting. A committing parallel-slave worker: 1. registers its sub_id in backup_team, 2. queries has_higher() — true iff a teammate in the same domain has a strictly greater sub_id, 3. requests MDL_BACKUP_COMMIT_HIGH_PRIO if true, otherwise MDL_BACKUP_COMMIT, 4. unregisters after release_lock(). Register-before-decide ordering ensures that any teammate arriving later observes our sub_id, and we observe at least our own back. How it solves the problem ------------------------- When backup's X-request is in the waiting queue and a worker that is needed to unblock a prior commit arrives, that worker observes a later-numbered teammate already holding the BACKUP lock, requests HIGH_PRIO, and is granted past the X-request via the matrix. The prior-commit chain drains; the backup X-request is then granted.
db6a6c6 to
a20023b
Compare
A parallel-slave worker that has acquired MDL_BACKUP_COMMIT and is
waiting for a prior commit can be deadlocked by an incoming
BACKUP STAGE BLOCK_COMMIT: the prior commit needs MDL_BACKUP_COMMIT too
but the standard fairness rule denies it while the backup X-request is
in the waiting queue.
Design
Add MDL_BACKUP_COMMIT_HIGH_PRIO, a sibling lock mode of
MDL_BACKUP_COMMIT. The only matrix difference is one cell of
m_waiting_incompatible: HIGH_PRIO is granted even when a BACKUP
X-request is waiting.
A committing parallel-slave worker:
a strictly greater sub_id,
MDL_BACKUP_COMMIT,
Register-before-decide ordering ensures that any teammate arriving
later observes our sub_id, and we observe at least our own back.
How it solves the problem
When backup's X-request is in the waiting queue and a worker that is
needed to unblock a prior commit arrives, that worker observes a
later-numbered teammate already holding the BACKUP lock, requests
HIGH_PRIO, and is granted past the X-request via the matrix. The
prior-commit chain drains; the backup X-request is then granted.
The teammate approach made the MDL scheduler call a slave-supplied callback during lock acquisition: when MDL_BACKUP_COMMIT was blocked, MDL itself walked m_granted.
This new approach inverts the ownership: the slave maintains its own per-(rli, domain) multiset of in-flight gtid_sub_ids on Relay_log_info, and before requesting the lock, the worker reads this slave-owned state to pick its lock mode — MDL_BACKUP_COMMIT_HIGH_PRIO if a teammate with greater sub_id is registered, else MDL_BACKUP_COMMIT. MDL then just consults its compatibility/priority matrices like for any other lock; the leapfrog is a pure matrix property of the new mode, not a scheduler-side override.