@sidkhillon
Contributor

@sidkhillon sidkhillon commented Dec 26, 2025

Currently, sleepForRetries (the sleep time between retry attempts during replication) is only configurable globally via the replication.source.sleepforretries configuration property. This makes it impossible to tune behavior for individual replication peers that may have different requirements.

This change would add support for overriding replication source config values on a per-peer basis, with fallback to the global configuration when not set.

This is related to #7578, which exists because this change will not merge cleanly into branch-2.

skhillon added 5 commits December 26, 2025 08:14
This squashed commit combines 8 commits:
- Allow peers to override sleep config
- Dynamic config update
- Always get value
- Use protobuf instead of string
- Add to test
- Add shell command
- Use builder instead
- Update UI to include sleep
The previous commit incorrectly added methods (getStartPosition,
getRecoveredQueueStartPos, terminate) that don't exist in upstream master.
These were from the old branch base and should not be included.

@ndimiduk ndimiduk requested review from Apache9 and taklwu December 26, 2025 19:04

@Apache9
Contributor

Apache9 commented Dec 27, 2025

Since we have a Configuration object in ReplicationPeerConfig, what about just creating a combined Configuration instance and using it when creating the ReplicationSource? That way you can change sleepForRetries directly through the configuration.

@sidkhillon
Contributor Author

sidkhillon commented Dec 27, 2025

Just to confirm, you're suggesting I use the existing configuration map in ReplicationPeerConfig and store the value as "replication.source.sleepforretries". Then, we can override the config by doing something like:

Configuration combinedConf = new Configuration(globalConf);
// Override any values in globalConf with the existing value in peerConfig
peerConfig.getConfiguration().forEach(combinedConf::set);
// set this.conf = combinedConf in the ReplicationSource

Set peer-specific override via update_peer_config '1', CONFIG => {"replication.source.sleepforretries" => 2000}

Is that the approach you are looking for?

@Apache9
Contributor

Apache9 commented Dec 28, 2025

Yes.

More specifically, we can change the code in ReplicationSourceManager.createSource.

HBase has a CompoundConfiguration class that can merge multiple Configurations together. See the ReplicationPeerConfigUtil.getPeerClusterConfiguration method for an example of its usage.

Thanks.
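
For readers following along, the layering idea can be sketched without any HBase dependencies. This is only an illustration under stated assumptions: `LayeredConf` is a hypothetical class written for this example, not HBase's CompoundConfiguration, and its method names are not taken from the HBase API. Later layers take precedence, and a lookup falls through to earlier layers when a key is absent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a layered configuration; not an HBase class. */
final class LayeredConf {
  // Layers in the order they were added; later layers win on lookup.
  private final List<Map<String, String>> layers = new ArrayList<>();

  LayeredConf add(Map<String, String> layer) {
    layers.add(layer);
    return this;
  }

  /** Returns the value from the most recently added layer that defines the key. */
  String get(String key, String defaultValue) {
    for (int i = layers.size() - 1; i >= 0; i--) {
      String value = layers.get(i).get(key);
      if (value != null) {
        return value;
      }
    }
    return defaultValue;
  }

  /** Parses the resolved value as a long, or returns the default if no layer defines the key. */
  long getLong(String key, long defaultValue) {
    String value = get(key, null);
    return value == null ? defaultValue : Long.parseLong(value.trim());
  }
}
```

With the global settings added first and the peer's configuration map added second, a peer-level "replication.source.sleepforretries" entry would shadow the global one, and every other key would still resolve from the global layer.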

Copilot AI left a comment

Pull request overview

This pull request adds support for configuring the replication source sleepForRetries parameter on a per-peer basis, with fallback to the global configuration when not set. Previously, this parameter was only configurable globally via the replication.source.sleepforretries property.

  • Adds sleepForRetries field to ReplicationPeerConfig with builder support and protobuf serialization
  • Implements getSleepForRetries() method in ReplicationSource with fallback logic to global config when peer value is 0
  • Adds shell command set_peer_sleep_for_retries for managing the per-peer configuration

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
hbase-protocol-shaded/src/main/protobuf/server/master/Replication.proto Adds sleep_for_retries field to ReplicationPeer protobuf message
hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationPeerConfig.java Adds sleepForRetries field to data model with getter/setter and includes in toString()
hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationPeerConfigBuilder.java Adds setSleepForRetries() method to builder interface
hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationPeerConfigUtil.java Adds conversion logic for sleepForRetries between protobuf and Java objects
hbase-client/src/test/java/org/apache/hadoop/hbase/replication/ReplicationPeerConfigTestUtil.java Updates test utilities to include sleepForRetries in config generation and assertions
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java Implements getSleepForRetries() with fallback logic and updates all usage of sleepForRetries to call this method
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceInterface.java Adds getSleepForRetries() method to interface
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReader.java Updates to use source.getSleepForRetries() instead of local field
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java Updates to use source.getSleepForRetries() instead of local field
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java Updates to use source.getSleepForRetries() instead of local field
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/ReplicationSourceDummy.java Implements getSleepForRetries() in test dummy class
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSource.java Updates mocks to return 0L for getSleepForRetries()
hbase-shell/src/main/ruby/shell/commands/set_peer_sleep_for_retries.rb Adds new shell command for setting per-peer sleep for retries
hbase-shell/src/main/ruby/hbase/replication_admin.rb Adds set_peer_sleep_for_retries() method implementation
hbase-shell/src/main/ruby/shell.rb Registers new set_peer_sleep_for_retries command
hbase-shell/src/test/ruby/hbase/replication_admin_test.rb Adds test for set_peer_sleep_for_retries command

public boolean isSerial() {
  return serial;
}

Copilot AI Dec 29, 2025

The public getter method getSleepForRetries() is missing Javadoc documentation. For consistency with other public methods in this class and to help users understand the purpose and behavior of this configuration property, add Javadoc that explains:

  • What this value represents (sleep time between retries in milliseconds)
  • The default value (0, which means use global configuration)
  • When it's used (during replication retries)
Suggested change
/**
 * Returns the per-peer sleep time between replication retries, in milliseconds.
 * <p>
 * A value of {@code 0} means that this peer does not define its own retry sleep and the
 * global replication configuration should be used instead.
 * This value is consulted when performing replication retries for this peer.
 *
 * @return sleep time between replication retries in milliseconds, or {@code 0} to use the
 * global configuration
 */

Comment on lines 530 to 536
public long getSleepForRetries() {
  long peerSleepForRetries = replicationPeer.getPeerConfig().getSleepForRetries();
  if (peerSleepForRetries > 0) {
    return peerSleepForRetries;
  }
  return this.conf.getLong("replication.source.sleepforretries", 1000);
}

Copilot AI Dec 29, 2025

The fallback logic is inconsistent with the existing getCurrentBandwidth() pattern. The bandwidth implementation (lines 519-523) treats 0 as "use default", but this implementation treats values > 0 as valid peer settings and 0 as "use default". This creates an inconsistency where setting sleepForRetries to 0 means "use global default" but for bandwidth, 0 means "use default". Consider using the same pattern as bandwidth (checking != 0 instead of > 0) for consistency, or add explicit documentation about this behavior difference.
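
The practical difference between the two checks appears only for negative peer values. A minimal sketch, assuming the bandwidth-style check is `!= 0` as the comment above describes; the class and method names here are hypothetical and are not HBase code:

```java
/** Hypothetical helpers contrasting the two fallback checks; not HBase code. */
final class FallbackChecks {
  /** Check used in the snippet under review: only a positive peer value overrides. */
  static long positiveOnly(long peerValue, long globalValue) {
    return peerValue > 0 ? peerValue : globalValue;
  }

  /** Bandwidth-style check as described in the comment: any non-zero peer value overrides. */
  static long nonZero(long peerValue, long globalValue) {
    return peerValue != 0 ? peerValue : globalValue;
  }
}
```

For a peer value of 0 or any positive number both helpers agree. For a peer value of -5 and a global value of 1000, `positiveOnly` returns 1000 while `nonZero` returns -5, which may be a reason to prefer the `> 0` check for a sleep duration.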

Comment on lines 27 to 31
Examples:
# set sleep time to 2 seconds (2000ms) between retries for a peer
hbase> set_peer_sleep_for_retries '1', 2000
# unset sleep time for a peer to use the global default configured in server-side

Copilot AI Dec 29, 2025

The comment states "unset sleep time for a peer to use the global default configured in server-side" when passing 0, but this behavior differs from the bandwidth configuration pattern shown in the same file. In the bandwidth test (lines 609-622), 0 is also used as the initial/default value, but the bandwidth implementation treats 0 as "use default bandwidth". For consistency and clarity, consider documenting this explicitly in the help text that 0 has special meaning as a sentinel value for "use global default", and ensure users understand that setting it to 0 is not the same as setting it to the actual configured global default value.

Suggested change
Examples:
# set sleep time to 2 seconds (2000ms) between retries for a peer
hbase> set_peer_sleep_for_retries '1', 2000
# unset sleep time for a peer to use the global default configured in server-side
A value of 0 is treated specially as a sentinel meaning "use the global default
sleep time configured on the server side" rather than an explicit sleep time.
Examples:
# set sleep time to 2 seconds (2000ms) between retries for a peer
hbase> set_peer_sleep_for_retries '1', 2000
# unset the peer-specific sleep time so that this peer uses the global default
# configured on the server side (0 is a sentinel and does not change that default)

Comment on lines 530 to 536
public long getSleepForRetries() {
  long peerSleepForRetries = replicationPeer.getPeerConfig().getSleepForRetries();
  if (peerSleepForRetries > 0) {
    return peerSleepForRetries;
  }
  return this.conf.getLong("replication.source.sleepforretries", 1000);
}

Copilot AI Dec 29, 2025

Missing test coverage for the new getSleepForRetries() method and its fallback logic. While the existing tests mock getSleepForRetries() to return 0L, there's no test that verifies:

  1. When peer config has a positive value (e.g., 2000), it returns that value
  2. When peer config is 0, it falls back to the global configuration value
  3. The integration between peer-level and global configuration

Consider adding a dedicated unit test in TestReplicationSource.java that verifies this fallback behavior, similar to how bandwidth is tested in other parts of the codebase.
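
As a rough idea of what such checks might look like, here is a dependency-free sketch. It does not use TestReplicationSource, Mockito, or any HBase class: `resolve` is a hypothetical model of the fallback in the snippet above, a plain map stands in for the global Configuration, and the third check covers the built-in default of 1000 rather than a full integration scenario.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, dependency-free model of the fallback under test; not HBase code. */
final class SleepForRetriesFallbackCheck {
  static final String KEY = "replication.source.sleepforretries";
  static final long DEFAULT_MS = 1000L;

  /** Mirrors the snippet above: a positive peer value wins, otherwise use the global setting. */
  static long resolve(long peerValue, Map<String, String> globalConf) {
    if (peerValue > 0) {
      return peerValue;
    }
    String raw = globalConf.get(KEY);
    return raw == null ? DEFAULT_MS : Long.parseLong(raw);
  }

  static void expect(long expected, long actual) {
    if (expected != actual) {
      throw new AssertionError("expected " + expected + " but got " + actual);
    }
  }

  /** Runs the fallback cases; throws AssertionError on the first mismatch. */
  static void runChecks() {
    Map<String, String> globalConf = new HashMap<>();
    globalConf.put(KEY, "3000");

    expect(2000L, resolve(2000L, globalConf)); // positive peer value is returned as-is
    expect(3000L, resolve(0L, globalConf));    // zero falls back to the global value
    globalConf.remove(KEY);
    expect(1000L, resolve(0L, globalConf));    // nothing configured: built-in default
  }
}
```

A real test would presumably construct a ReplicationSource with a mocked peer and assert on getSleepForRetries() directly; the cases themselves would stay the same.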

 * Get the sleep time for retries. Check peer config first, if set use it, otherwise fall back to
 * global configuration.
 * @return sleep time in milliseconds
 */

Copilot AI Dec 29, 2025

This method overrides ReplicationSourceInterface.getSleepForRetries; it is advisable to add an Override annotation.

Suggested change
*/
*/
@Override

this.conf.getInt(WAIT_ON_ENDPOINT_SECONDS, DEFAULT_WAIT_ON_ENDPOINT_SECONDS);
decorateConf();
// 1 second
this.sleepForRetries = this.conf.getLong("replication.source.sleepforretries", 1000);

nit: Do you expect sleepForRetries to change dynamically after initialization? IIRC it is only reloaded via refreshSources (e.g. triggered by updateReplicationPeerConfig), so this value would only be created once per peer configuration or refresh.

So can we keep this variable and just call getSleepForRetries() once within ReplicationSource.java?

Contributor

@taklwu taklwu left a comment

LGTM, one minor comment.

Contributor

@taklwu taklwu left a comment

one more comment.

// Check if any replication.source.* keys have changed values
for (Map.Entry<String, String> entry : newReplicationConfigs.entrySet()) {
  String key = entry.getKey();
  if (key.startsWith("replication.source.")) {

Do we need to handle the following keys? If not, then your change looks good to me.

hbase.replication.source.fs.conf.provider
hbase.replication.source.service
hbase.replication.source.maxthreads

Contributor Author

@sidkhillon sidkhillon Dec 30, 2025

As far as I can tell, the first two couldn't be used on a per-peer basis while the last is only used by the InterClusterReplicationEndpoint and not a ReplicationSource. Therefore, I think it is fine to not include those.
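
For context on why those three keys are not picked up, a prefix check on "replication.source." does not match keys that begin with "hbase.replication.source.". The sketch below illustrates this with a hypothetical helper; `SourceConfigDiff` and `sourceConfigChanged` are names invented for this example, and the handling of removed keys is an assumption that the snippet under review does not show.

```java
import java.util.Map;
import java.util.Objects;

/** Hypothetical helper for detecting changed replication.source.* overrides; not HBase code. */
final class SourceConfigDiff {
  static final String PREFIX = "replication.source.";

  /** Returns true if any key starting with PREFIX was added, removed, or given a new value. */
  static boolean sourceConfigChanged(Map<String, String> oldConf, Map<String, String> newConf) {
    // Added or modified keys.
    for (Map.Entry<String, String> entry : newConf.entrySet()) {
      String key = entry.getKey();
      if (key.startsWith(PREFIX) && !Objects.equals(entry.getValue(), oldConf.get(key))) {
        return true;
      }
    }
    // Removed keys (assumption: a removed override should also count as a change).
    for (String key : oldConf.keySet()) {
      if (key.startsWith(PREFIX) && !newConf.containsKey(key)) {
        return true;
      }
    }
    return false;
  }
}
```

Because `"hbase.replication.source.maxthreads".startsWith("replication.source.")` is false, adding or changing that key alone would not be reported as a change by this helper.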

@sidkhillon
Contributor Author

I have updated #7578 to reflect the changes suggested by PR comments as well. It only has minor differences from this branch.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 28s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 buf 0m 0s buf was not available.
+0 🆗 buf 0m 0s buf was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
_ master Compile Tests _
+0 🆗 mvndep 0m 14s Maven dependency ordering for branch
+1 💚 mvninstall 3m 18s master passed
+1 💚 compile 4m 58s master passed
+1 💚 checkstyle 1m 31s master passed
+1 💚 spotbugs 4m 34s master passed
+1 💚 spotless 0m 49s branch has no errors when running spotless:check.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 12s Maven dependency ordering for patch
+1 💚 mvninstall 2m 53s the patch passed
+1 💚 compile 5m 0s the patch passed
+1 💚 cc 5m 0s the patch passed
+1 💚 javac 5m 0s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 31s the patch passed
-0 ⚠️ rubocop 0m 10s /results-rubocop.txt The patch generated 3 new + 504 unchanged - 0 fixed = 507 total (was 504)
+1 💚 spotbugs 4m 50s the patch passed
+1 💚 hadoopcheck 11m 9s Patch does not cause any errors with Hadoop 3.3.6 3.4.1.
+1 💚 hbaseprotoc 1m 47s the patch passed
+1 💚 spotless 0m 42s patch has no errors when running spotless:check.
_ Other Tests _
+1 💚 asflicense 0m 37s The patch does not generate ASF License warnings.
52m 28s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7577/5/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #7577
Optional Tests dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless cc buflint bufcompat hbaseprotoc rubocop
uname Linux d11cb7dbc205 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 72e9655
Default Java Eclipse Adoptium-17.0.11+9
Max. process+thread count 85 (vs. ulimit of 30000)
modules C: hbase-protocol-shaded hbase-client hbase-server hbase-shell U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7577/5/console
versions git=2.34.1 maven=3.9.8 spotbugs=4.7.3 rubocop=1.37.1
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 44s Docker mode activated.
-0 ⚠️ yetus 0m 4s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+0 🆗 mvndep 0m 15s Maven dependency ordering for branch
+1 💚 mvninstall 4m 30s master passed
+1 💚 compile 2m 51s master passed
+1 💚 javadoc 1m 32s master passed
+1 💚 shadedjars 8m 42s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 14s Maven dependency ordering for patch
+1 💚 mvninstall 4m 20s the patch passed
+1 💚 compile 3m 14s the patch passed
+1 💚 javac 3m 14s the patch passed
+1 💚 javadoc 1m 39s the patch passed
+1 💚 shadedjars 8m 52s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
+1 💚 unit 0m 51s hbase-protocol-shaded in the patch passed.
+1 💚 unit 2m 10s hbase-client in the patch passed.
-1 ❌ unit 308m 10s /patch-unit-hbase-server.txt hbase-server in the patch failed.
+1 💚 unit 8m 49s hbase-shell in the patch passed.
361m 51s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7577/5/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
GITHUB PR #7577
Optional Tests javac javadoc unit compile shadedjars
uname Linux 0686d93ef510 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 72e9655
Default Java Eclipse Adoptium-17.0.11+9
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7577/5/testReport/
Max. process+thread count 3428 (vs. ulimit of 30000)
modules C: hbase-protocol-shaded hbase-client hbase-server hbase-shell U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7577/5/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 10s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 buf 0m 0s buf was not available.
+0 🆗 buf 0m 0s buf was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
_ master Compile Tests _
+0 🆗 mvndep 0m 12s Maven dependency ordering for branch
+1 💚 mvninstall 2m 32s master passed
+1 💚 compile 3m 47s master passed
+1 💚 checkstyle 1m 8s master passed
+1 💚 spotbugs 3m 22s master passed
+1 💚 spotless 0m 39s branch has no errors when running spotless:check.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 10s Maven dependency ordering for patch
+1 💚 mvninstall 2m 12s the patch passed
+1 💚 compile 3m 45s the patch passed
+1 💚 cc 3m 45s the patch passed
+1 💚 javac 3m 45s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 7s the patch passed
-0 ⚠️ rubocop 0m 7s /results-rubocop.txt The patch generated 3 new + 504 unchanged - 0 fixed = 507 total (was 504)
+1 💚 spotbugs 3m 35s the patch passed
+1 💚 hadoopcheck 8m 25s Patch does not cause any errors with Hadoop 3.3.6 3.4.1.
+1 💚 hbaseprotoc 1m 14s the patch passed
+1 💚 spotless 0m 32s patch has no errors when running spotless:check.
_ Other Tests _
+1 💚 asflicense 0m 27s The patch does not generate ASF License warnings.
38m 59s
Subsystem Report/Notes
Docker ClientAPI=1.48 ServerAPI=1.48 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7577/6/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #7577
Optional Tests dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless cc buflint bufcompat hbaseprotoc rubocop
uname Linux badf1eb8d21e 6.8.0-1024-aws #26~22.04.1-Ubuntu SMP Wed Feb 19 06:54:57 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 851c737
Default Java Eclipse Adoptium-17.0.11+9
Max. process+thread count 90 (vs. ulimit of 30000)
modules C: hbase-protocol-shaded hbase-client hbase-server hbase-shell U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7577/6/console
versions git=2.34.1 maven=3.9.8 spotbugs=4.7.3 rubocop=1.37.1
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

4 participants