-
Notifications
You must be signed in to change notification settings - Fork 15
[TNTP-2109] Switch to safe mode on vshard rebalance #467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from 1 commit
2f630af
9b626bb
abbcfbd
757680b
8ea36f9
a8352f2
bb11b86
d937de4
c7a913e
b8cf51d
b552571
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -23,6 +23,7 @@ local readview = require('crud.readview') | |
| local schema = require('crud.schema') | ||
| local storage_info = require('crud.storage_info') | ||
| local storage = require('crud.storage') | ||
| local rebalance = require('crud.common.rebalance') | ||
|
|
||
| local crud = {} | ||
|
|
||
|
|
@@ -158,8 +159,23 @@ crud.readview = readview.new | |
| -- @function schema | ||
| crud.schema = schema.call | ||
|
|
||
| crud.rebalance = {} | ||
|
|
||
| -- @refer rebalance.router_cache_clear | ||
| -- @function router_cache_clear | ||
| crud.rebalance.router_cache_clear = rebalance.router.cache_clear | ||
|
|
||
| -- @refer rebalance.router_cache_length | ||
| -- @function router_cache_length | ||
| crud.rebalance.router_cache_length = rebalance.router.cache_length | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do we export
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think they will be useful in TCM/Cartridge to indicate that cache has actually been cleared and when did it happen. |
||
|
|
||
| -- @refer rebalance.router_cache_last_clear_ts | ||
| -- @function router_cache_last_clear_ts | ||
| crud.rebalance.router_cache_last_clear_ts = rebalance.router.cache_last_clear_ts | ||
|
|
||
| function crud.init_router() | ||
| rawset(_G, 'crud', crud) | ||
| rawset(_G, 'crud', crud) | ||
| rebalance.metrics.enable_router_metrics() | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looks like a crutch. Will it be disabled, when user disables metrics? It seems the whole metrics part should be done in the |
||
| end | ||
|
|
||
| function crud.stop_router() | ||
|
|
||
| Original file line number | Diff line number | Diff line change | ||
|---|---|---|---|---|
| @@ -0,0 +1,211 @@ | ||||
| local fiber = require('fiber') | ||||
| local log = require('log') | ||||
| local vshard_consts = require('vshard.consts') | ||||
| local utils = require('crud.common.utils') | ||||
|
|
||||
| local has_metrics_module, metrics = pcall(require, 'metrics') | ||||
|
|
||||
| local SETTINGS_SPACE_NAME = '_crud_settings' | ||||
| local SAFE_MOD_ENABLE_EVENT = '_crud.safe_mode_enable' | ||||
|
|
||||
| local M = { | ||||
| safe_mode = false, | ||||
| safe_mode_enable_hooks = {}, | ||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can use |
||||
| safe_mode_disable_hooks = {}, | ||||
| _router_cache_last_clear_ts = fiber.time() | ||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do we initialize the timestamp with current time on module load? It may lead to bugs, we didn't really cleared the router`s cache on module load |
||||
| } | ||||
|
|
||||
| local function create_space() | ||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. AFAICS, the space has general name |
||||
| local settings_space = box.schema.space.create(SETTINGS_SPACE_NAME, { | ||||
| engine = 'memtx', | ||||
| format = { | ||||
| { name = 'key', type = 'string' }, | ||||
| { name = 'value', type = 'any' }, | ||||
| }, | ||||
| if_not_exists = true, | ||||
| }) | ||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, the space is not local, but before restart instances are in different states:
That all seems inconsistent and dangerous. Why can't we just enable safe mode on replicas, when we see the change in the |
||||
| settings_space:create_index('primary', { parts = { 'key' }, if_not_exists = true }) | ||||
| end | ||||
|
|
||||
| local function safe_mode_trigger(_, new, space, op) | ||||
| if space ~= '_bucket' then | ||||
| return | ||||
| end | ||||
| if (op == 'INSERT' and new.status == vshard_consts.BUCKET.RECEIVING) or | ||||
| (op == 'REPLACE' and new.status == vshard_consts.BUCKET.SENDING) then | ||||
| box.broadcast(SAFE_MOD_ENABLE_EVENT, true) | ||||
vakhov marked this conversation as resolved.
Show resolved
Hide resolved
|
||||
| end | ||||
| end | ||||
|
|
||||
| local function register_enable_hook(func) | ||||
| M.safe_mode_enable_hooks[func] = true | ||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do we need that variable and why cannot we just set |
||||
| end | ||||
|
|
||||
| local function remove_enable_hook(func) | ||||
| M.safe_mode_enable_hooks[func] = nil | ||||
| end | ||||
|
|
||||
| local function register_disable_hook(func) | ||||
| M.safe_mode_disable_hooks[func] = true | ||||
| end | ||||
|
|
||||
| local function remove_disable_hook(func) | ||||
| M.safe_mode_disable_hooks[func] = nil | ||||
| end | ||||
|
|
||||
| local function safe_mode_status() | ||||
| return M.safe_mode | ||||
| end | ||||
|
|
||||
| local function safe_mode_enable() | ||||
| if not box.info.ro then | ||||
| box.space[SETTINGS_SPACE_NAME]:replace{ 'safe_mode', true } | ||||
| for _, trig in pairs(box.space._bucket:on_replace()) do | ||||
| if trig == safe_mode_trigger then | ||||
| box.space._bucket:on_replace(nil, safe_mode_trigger) | ||||
| end | ||||
| end | ||||
| end | ||||
| M.safe_mode = true | ||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Alternative to the following:
We can make the change of The problem with this approach is that
The preferred solution, from my point of view: |
||||
|
|
||||
| for hook, _ in pairs(M.safe_mode_enable_hooks) do | ||||
| hook() | ||||
| end | ||||
|
|
||||
| log.info('Rebalance safe mode enabled') | ||||
| end | ||||
|
|
||||
| local function safe_mode_disable() | ||||
| if not box.info.ro then | ||||
| box.space[SETTINGS_SPACE_NAME]:replace{ 'safe_mode', false } | ||||
| box.space._bucket:on_replace(safe_mode_trigger) | ||||
| end | ||||
| M.safe_mode = false | ||||
|
|
||||
| for hook, _ in pairs(M.safe_mode_disable_hooks) do | ||||
| hook() | ||||
| end | ||||
|
|
||||
| log.info('Rebalance safe mode disabled') | ||||
| end | ||||
Satbek marked this conversation as resolved.
Show resolved
Hide resolved
|
||||
|
|
||||
| local function rebalance_init() | ||||
| M.metrics.enable_storage_metrics() | ||||
|
|
||||
| -- box.watch was introduced in tarantool 2.10.0 | ||||
| if not utils.tarantool_supports_box_watch() then | ||||
| log.warn('This version of tarantool does not support autoswitch to safe mode during rebalance. ' | ||||
| .. 'Update to newer version or use `_crud.rebalance_safe_mode_enable()` to enable safe mode manually.') | ||||
| return | ||||
| end | ||||
|
|
||||
| box.watch('box.status', function() | ||||
| if box.info.ro then | ||||
| return | ||||
| end | ||||
|
|
||||
| local stored_safe_mode | ||||
| if box.space[SETTINGS_SPACE_NAME] == nil then | ||||
| create_space() | ||||
| box.space[SETTINGS_SPACE_NAME]:insert{ 'safe_mode', false } | ||||
| else | ||||
| stored_safe_mode = box.space[SETTINGS_SPACE_NAME]:get{ 'safe_mode' } | ||||
| end | ||||
| M.safe_mode = stored_safe_mode and stored_safe_mode.value or false | ||||
|
|
||||
| if M.safe_mode then | ||||
| for hook, _ in pairs(M.safe_mode_enable_hooks) do | ||||
| hook() | ||||
| end | ||||
| else | ||||
| box.space._bucket:on_replace(safe_mode_trigger) | ||||
| for hook, _ in pairs(M.safe_mode_disable_hooks) do | ||||
| hook() | ||||
| end | ||||
| end | ||||
| end) | ||||
|
|
||||
| box.watch(SAFE_MOD_ENABLE_EVENT, function(_, do_enable) | ||||
| if box.info.ro or not do_enable then | ||||
| return | ||||
| end | ||||
| safe_mode_enable() | ||||
| end) | ||||
| end | ||||
|
|
||||
| local function router_cache_clear() | ||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Aren't we supposed to check, that rebalancing is ended and we have no buckets in forbidden states? |
||||
| M._router_cache_last_clear_ts = fiber.time() | ||||
| return utils.get_vshard_router_instance():_route_map_clear() | ||||
| end | ||||
|
|
||||
| local function router_cache_length() | ||||
| return utils.get_vshard_router_instance().known_bucket_count | ||||
| end | ||||
|
|
||||
| local function router_cache_last_clear_ts() | ||||
| return M._router_cache_last_clear_ts | ||||
| end | ||||
|
|
||||
| -- Rebalance related metrics | ||||
| local function enable_storage_metrics() | ||||
| if not has_metrics_module then | ||||
| return | ||||
| end | ||||
|
|
||||
| local safe_mode_enabled_gauge = metrics.gauge( | ||||
| 'tnt_crud_storage_safe_mode_enabled', | ||||
| "is safe mode enabled on this storage instance" | ||||
| ) | ||||
|
|
||||
| metrics.register_callback(function() | ||||
| safe_mode_enabled_gauge:set(safe_mode_status() and 1 or 0) | ||||
| end) | ||||
| end | ||||
|
|
||||
| local function enable_router_metrics() | ||||
| if not has_metrics_module then | ||||
| return | ||||
| end | ||||
|
|
||||
| local router_cache_length_gauge = metrics.gauge( | ||||
| 'tnt_crud_router_cache_length', | ||||
| "number of bucket routes in vshard router cache" | ||||
| ) | ||||
| local router_cache_last_clear_ts_gauge = metrics.gauge( | ||||
| 'tnt_crud_router_cache_last_clear_ts', | ||||
| "when vshard router cache was cleared last time" | ||||
| ) | ||||
|
|
||||
| metrics.register_callback(function() | ||||
| router_cache_length_gauge:set(router_cache_length()) | ||||
| router_cache_last_clear_ts_gauge:set(router_cache_last_clear_ts()) | ||||
| end) | ||||
| end | ||||
|
|
||||
| M.init = rebalance_init | ||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: let's make the code consistent with all other modules, it seems we tend to make the variable with proper name and assign methods to it crud/crud/stats/local_registry.lua Line 39 in c52a6f5
|
||||
| M.safe_mode_status = safe_mode_status | ||||
| M.safe_mode_enable = safe_mode_enable | ||||
| M.safe_mode_disable = safe_mode_disable | ||||
| M.register_safe_mode_enable_hook = register_enable_hook | ||||
| M.remove_safe_mode_enable_hook = remove_enable_hook | ||||
| M.register_safe_mode_disable_hook = register_disable_hook | ||||
| M.remove_safe_mode_disable_hook = remove_disable_hook | ||||
|
|
||||
| M.router = { | ||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: api naming is incosistent, |
||||
| cache_clear = router_cache_clear, | ||||
| cache_length = router_cache_length, | ||||
| cache_last_clear_ts = router_cache_last_clear_ts, | ||||
| } | ||||
Satbek marked this conversation as resolved.
Show resolved
Hide resolved
|
||||
|
|
||||
| M.storage_api = { | ||||
| rebalance_safe_mode_status = safe_mode_status, | ||||
| rebalance_safe_mode_enable = safe_mode_enable, | ||||
| rebalance_safe_mode_disable = safe_mode_disable, | ||||
| } | ||||
|
|
||||
| M.metrics = { | ||||
| enable_storage_metrics = enable_storage_metrics, | ||||
| enable_router_metrics = enable_router_metrics, | ||||
| } | ||||
|
|
||||
| return M | ||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4,6 +4,7 @@ local dev_checks = require('crud.common.dev_checks') | |
| local stash = require('crud.common.stash') | ||
| local utils = require('crud.common.utils') | ||
|
|
||
| local rebalance = require('crud.common.rebalance') | ||
| local call = require('crud.common.call') | ||
| local sharding_metadata = require('crud.common.sharding.sharding_metadata') | ||
| local insert = require('crud.insert') | ||
|
|
@@ -62,6 +63,7 @@ local function init_storage_call(user, storage_api) | |
| end | ||
|
|
||
| local modules_with_storage_api = { | ||
| rebalance, | ||
| call, | ||
| sharding_metadata, | ||
| insert, | ||
|
|
@@ -103,6 +105,8 @@ local function init_impl() | |
| user = utils.get_this_replica_user() or 'guest' | ||
| end | ||
|
|
||
| rebalance.init() | ||
|
|
||
| for _, module in ipairs(modules_with_storage_api) do | ||
| init_storage_call(user, module.storage_api) | ||
| end | ||
|
|
@@ -141,7 +145,6 @@ function storage.stop() | |
| internal_stash.watcher:unregister() | ||
| internal_stash.watcher = nil | ||
| end | ||
|
|
||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: this change is not needed) We should minimize the changed lines in commits, when we can |
||
| rawset(_G, utils.STORAGE_NAMESPACE, nil) | ||
| end | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guys, the commits are absolutely chaotic, this is not the way, we develop open-source modules. We should rebase that branch in master to make the commit history linear, and not just merge it. So, all commits like Minor changes according to review comments will be visible to our users. You can check out, how Georgy Moiseev did it: proper commits, proper commit messages, tests in every commit (e.g. 8d7cae0).
Now, I'm forced to review more than 1k lines in one step, which is very inconvenient and increases the chanse of missed bugs. And our users won't be able to check the individual commits, if they want to. Of course, it's up to you, since I'm not the maintainer of that module, but it's just not nice to develop and merge code like that.
IMHO, the features should be properly split between commits, every commit must include the associated tests, proper commit messages and mentioning of the #448 ticket. Of course, refactoring or code moving should be in the separate commits. Between commits all of the test must pass
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course these commits will not go to master, I will re-split them before merging the PR.
My bad. Never thought of this PR as of multiple features that can be reviewed separately.