Skip to content

Conversation

@mymeiyi
Copy link
Contributor

@mymeiyi mymeiyi commented Jan 20, 2026

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Copilot AI review requested due to automatic review settings January 20, 2026 09:13
@Thearas
Copy link
Contributor

Thearas commented Jan 20, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes version caching issues in cloud mode for both table and partition version retrieval. The changes address how session variables are accessed and improve thread safety through volatile field declarations.

Changes:

  • Updated CloudPartition and OlapTable to use volatile fields for cache timestamps and versions
  • Modified version cache expiration checks to use ConnectContext-based session variable access
  • Enhanced batch version retrieval to properly update caches
  • Added missing waitForPendingTxn parameter to version requests

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
CloudPartitionTest.java Updated tests to properly initialize ConnectContext with SessionVariable for instance-based access patterns
CloudPartition.java Made cache fields volatile, changed method visibility for testing, updated cache expiration logic to use ConnectContext, added waitForPendingTxn parameter, optimized ArrayList capacity
OlapTable.java Made cache fields volatile, changed cache comparison from > to >=, added cache updates in batch version retrieval, optimized with Collections.emptyList()

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (2)

fe/fe-core/src/test/java/org/apache/doris/cloud/catalog/CloudPartitionTest.java:63

  • The ConnectContext set via ctx.setThreadLocalInfo() at line 51 is not cleaned up after the test completes. This could cause the thread-local context to leak into other tests. Add ConnectContext.remove() in a finally block or use @After/@AfterEach to ensure cleanup.
    @Test
    public void testIsCachedVersionExpired() {
        // Create ConnectContext with SessionVariable
        ConnectContext ctx = new ConnectContext();
        ctx.setSessionVariable(new SessionVariable());
        ctx.setThreadLocalInfo();

        // test isCachedVersionExpired
        CloudPartition part = createPartition(1, 2, 3);
        ctx.getSessionVariable().cloudPartitionVersionCacheTtlMs = 0;
        Assertions.assertTrue(part.isCachedVersionExpired());
        ctx.getSessionVariable().cloudPartitionVersionCacheTtlMs = -10086;
        part.setCachedVisibleVersion(2, 10086L); // update version and last cache time
        ctx.getSessionVariable().cloudPartitionVersionCacheTtlMs = 10000;
        Assertions.assertFalse(part.isCachedVersionExpired()); // not expired due to long expiration duration
        Assertions.assertEquals(2, part.getCachedVisibleVersion());

    }

fe/fe-core/src/test/java/org/apache/doris/cloud/catalog/CloudPartitionTest.java:172

  • The ConnectContext set via ctx.setThreadLocalInfo() at line 70 is not cleaned up after the test completes. This could cause the thread-local context to leak into other tests. Add ConnectContext.remove() in a finally block or use @After/@AfterEach to ensure cleanup, similar to how it's done in OlapTableTest.java at line 405.
    @Test
    public void testCachedVersion() throws RpcException {
        // Create ConnectContext with SessionVariable
        ConnectContext ctx = new ConnectContext();
        ctx.setSessionVariable(new SessionVariable());
        ctx.setThreadLocalInfo();

        CloudPartition part = createPartition(1, 2, 3);
        List<CloudPartition> parts = new ArrayList<>();
        for (long i = 0; i < 3; ++i) {
            parts.add(createPartition(5 + i, 5 + i, 5 + i));
        }
        Assertions.assertEquals(-1, part.getCachedVisibleVersion()); // not initialized FE side version

        // CHECKSTYLE OFF
        final ArrayList<Long> singleVersions = new ArrayList<>(Arrays.asList(2L, 3L, 4L, 5L));
        final ArrayList<ArrayList<Long>> batchVersions = new ArrayList<>(Arrays.asList(
                new ArrayList<>(Arrays.asList(1L, 2L, -1L)),
                new ArrayList<>(Arrays.asList(2L, 3L, -1L)),
                new ArrayList<>(Arrays.asList(3L, 4L, -1L)),
                new ArrayList<>(Arrays.asList(5L, -1L))
        ));
        final Integer[] callCount = {0};

        new MockUp<VersionHelper>(VersionHelper.class) {
            @Mock
            public Cloud.GetVersionResponse getVersionFromMeta(Cloud.GetVersionRequest req) {
                Cloud.GetVersionResponse.Builder builder = Cloud.GetVersionResponse.newBuilder();
                builder.setVersion(singleVersions.get(callCount[0]));
                builder.addAllVersions(batchVersions.get(callCount[0]));
                ++callCount[0];
                return builder.build();
            }
        };
        // CHECKSTYLE ON

        ctx.getSessionVariable().cloudPartitionVersionCacheTtlMs = -1; // disable cache
            {
                // test single get version
                Assertions.assertEquals(2, part.getVisibleVersion()); // should not get from cache
                Assertions.assertEquals(1, callCount[0]); // issue a rpc call to meta-service

                // test snapshot versions
                List<Long> versions = CloudPartition.getSnapshotVisibleVersion(parts); // should not get from cache
                Assertions.assertEquals(2, callCount[0]); // issue a rpc call to meta-service
                for (int i = 0; i < batchVersions.get(1).size(); ++i) {
                    Long exp = batchVersions.get(1).get(i);
                    if (exp == -1) {
                        exp = CloudPartition.PARTITION_INIT_VERSION;
                    }
                    Assertions.assertEquals(exp, versions.get(i));
                }
            }

        // enable change expiration and make it cached in long duration
        ctx.getSessionVariable().cloudPartitionVersionCacheTtlMs = 100000;
            {
                // test single get version
                Assertions.assertEquals(2, part.getVisibleVersion()); // cached version
                Assertions.assertEquals(2, callCount[0]); // issue a rpc call to meta-service

                // test snapshot versions
                List<Long> versions = CloudPartition.getSnapshotVisibleVersion(parts); // should not get from cache
                Assertions.assertEquals(2, callCount[0]); // issue a rpc call to meta-service
                for (int i = 0; i < batchVersions.get(1).size(); ++i) {
                    Long exp = batchVersions.get(1).get(i);
                    if (exp == -1) {
                        exp = CloudPartition.PARTITION_INIT_VERSION;
                    }
                    Assertions.assertEquals(exp, versions.get(i));
                }
            }

        // enable change expiration and make it expired
        ctx.getSessionVariable().cloudPartitionVersionCacheTtlMs = 500;
        try {
            Thread.sleep(550);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        // make some partition not expired, these partitions will not get version from meta-service
        CloudPartition hotPartition = parts.get(0);
        hotPartition.setCachedVisibleVersion(hotPartition.getCachedVisibleVersion(), 10086L);
        Assertions.assertEquals(2, hotPartition.getCachedVisibleVersion());
        Assertions.assertFalse(hotPartition.isCachedVersionExpired());
        Assertions.assertTrue(parts.get(1).isCachedVersionExpired());
        Assertions.assertTrue(parts.get(2).isCachedVersionExpired());

            {
                // test single get version
                Assertions.assertEquals(4, part.getVisibleVersion()); // should not get from cache
                Assertions.assertEquals(3, callCount[0]); // issue a rpc call to meta-service

                // test snapshot versions
                List<Long> versions = CloudPartition.getSnapshotVisibleVersion(parts); // should not get from cache
                Assertions.assertEquals(versions.size(), parts.size());
                Assertions.assertEquals(4, callCount[0]); // issue a rpc call to meta-service
                for (int i = 0; i < batchVersions.get(3).size(); ++i) {
                    Long exp = batchVersions.get(3).get(i);
                    if (exp == -1) {
                        exp = CloudPartition.PARTITION_INIT_VERSION;
                    }
                    Assertions.assertEquals(exp, versions.get(i + 1)); // exclude the first hot partition
                }
                // hot partition version not changed
                Assertions.assertEquals(2, hotPartition.getCachedVisibleVersion());
            }
    }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@mymeiyi
Copy link
Contributor Author

mymeiyi commented Jan 20, 2026

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 31153 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 53d4bf628dd6d3dbe233c813890820176f37bfaa, data reload: false

------ Round 1 ----------------------------------
q1	17593	4263	4079	4079
q2	2054	366	239	239
q3	10163	1285	715	715
q4	10195	795	305	305
q5	7510	2090	1797	1797
q6	191	168	139	139
q7	926	793	659	659
q8	9273	1368	1123	1123
q9	4846	4661	4538	4538
q10	6796	1796	1398	1398
q11	536	305	277	277
q12	778	739	627	627
q13	17799	3822	3126	3126
q14	291	289	269	269
q15	586	511	516	511
q16	754	746	631	631
q17	683	832	442	442
q18	7072	6437	6266	6266
q19	1340	977	619	619
q20	408	357	242	242
q21	3001	2435	2199	2199
q22	1037	1022	952	952
Total cold run time: 103832 ms
Total hot run time: 31153 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4103	4024	4011	4011
q2	322	394	317	317
q3	2109	2595	2271	2271
q4	1304	1743	1353	1353
q5	4128	4059	4017	4017
q6	222	169	128	128
q7	1841	1792	1662	1662
q8	2785	2463	2397	2397
q9	7407	7312	7203	7203
q10	2520	2708	2323	2323
q11	579	496	509	496
q12	708	769	660	660
q13	3549	4133	3500	3500
q14	290	313	287	287
q15	554	512	500	500
q16	655	727	637	637
q17	1212	1339	1382	1339
q18	8311	7847	7853	7847
q19	868	890	895	890
q20	1946	2034	1987	1987
q21	4834	4357	4203	4203
q22	1053	1016	952	952
Total cold run time: 51300 ms
Total hot run time: 48980 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 173859 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 53d4bf628dd6d3dbe233c813890820176f37bfaa, data reload: false

query5	4483	663	493	493
query6	330	237	204	204
query7	4227	491	267	267
query8	359	241	240	240
query9	8757	2879	2898	2879
query10	531	370	359	359
query11	15084	15174	14831	14831
query12	179	120	112	112
query13	1263	496	398	398
query14	6217	3044	2777	2777
query14_1	2741	2651	2669	2651
query15	195	191	174	174
query16	987	495	458	458
query17	1086	672	551	551
query18	2464	441	337	337
query19	229	225	194	194
query20	118	118	114	114
query21	223	144	113	113
query22	3837	4147	3847	3847
query23	15921	15549	15293	15293
query23_1	15378	15530	15369	15369
query24	7181	1612	1169	1169
query24_1	1169	1166	1168	1166
query25	531	454	412	412
query26	1255	282	155	155
query27	2758	470	282	282
query28	4544	2190	2174	2174
query29	787	549	441	441
query30	311	233	199	199
query31	794	643	575	575
query32	90	79	80	79
query33	534	360	324	324
query34	960	919	523	523
query35	724	757	705	705
query36	837	880	806	806
query37	135	99	84	84
query38	2693	2773	2661	2661
query39	770	772	749	749
query39_1	707	741	722	722
query40	222	143	124	124
query41	73	70	66	66
query42	108	103	107	103
query43	482	491	407	407
query44	1372	760	757	757
query45	196	189	220	189
query46	888	993	573	573
query47	1386	1494	1330	1330
query48	311	333	242	242
query49	595	435	335	335
query50	644	273	205	205
query51	3814	3857	3799	3799
query52	103	107	96	96
query53	285	328	274	274
query54	288	264	279	264
query55	81	80	76	76
query56	304	295	298	295
query57	1007	1015	969	969
query58	271	250	255	250
query59	2029	2123	2000	2000
query60	327	326	316	316
query61	147	142	142	142
query62	377	337	300	300
query63	311	267	266	266
query64	4911	1270	917	917
query65	3797	3745	3759	3745
query66	1452	434	334	334
query67	15485	15628	15471	15471
query68	2381	1092	778	778
query69	451	353	328	328
query70	1011	965	871	871
query71	337	311	291	291
query72	5223	3286	3414	3286
query73	605	781	319	319
query74	8620	8654	8592	8592
query75	2842	2852	2454	2454
query76	2295	1140	652	652
query77	377	373	309	309
query78	9826	9995	9112	9112
query79	1074	953	604	604
query80	1315	564	479	479
query81	532	262	235	235
query82	1041	153	111	111
query83	373	264	240	240
query84	254	117	89	89
query85	890	486	414	414
query86	414	302	293	293
query87	2900	2871	2786	2786
query88	3474	2589	2563	2563
query89	395	358	322	322
query90	1954	172	155	155
query91	167	160	137	137
query92	83	73	75	73
query93	987	946	544	544
query94	637	322	310	310
query95	590	344	308	308
query96	647	528	230	230
query97	2366	2410	2342	2342
query98	227	207	196	196
query99	581	568	507	507
Total cold run time: 246270 ms
Total hot run time: 173859 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 26.79 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 53d4bf628dd6d3dbe233c813890820176f37bfaa, data reload: false

query1	0.06	0.05	0.05
query2	0.10	0.04	0.04
query3	0.26	0.09	0.09
query4	1.61	0.12	0.11
query5	0.29	0.26	0.25
query6	1.15	0.66	0.65
query7	0.04	0.02	0.02
query8	0.05	0.04	0.04
query9	0.56	0.50	0.49
query10	0.54	0.54	0.55
query11	0.15	0.09	0.10
query12	0.14	0.10	0.10
query13	0.60	0.58	0.59
query14	0.95	0.95	0.96
query15	0.79	0.78	0.77
query16	0.40	0.40	0.38
query17	1.00	0.97	1.06
query18	0.22	0.21	0.21
query19	1.87	1.90	1.85
query20	0.02	0.02	0.01
query21	15.44	0.28	0.15
query22	5.19	0.05	0.05
query23	15.90	0.28	0.10
query24	1.95	0.25	0.53
query25	0.11	0.06	0.06
query26	0.14	0.13	0.13
query27	0.07	0.06	0.05
query28	4.37	1.07	0.88
query29	12.56	3.90	3.15
query30	0.27	0.14	0.14
query31	2.82	0.63	0.40
query32	3.24	0.56	0.46
query33	3.04	3.06	3.11
query34	16.37	5.11	4.42
query35	4.45	4.45	4.46
query36	0.66	0.51	0.48
query37	0.13	0.07	0.07
query38	0.07	0.04	0.03
query39	0.05	0.03	0.03
query40	0.16	0.14	0.14
query41	0.09	0.04	0.03
query42	0.04	0.03	0.03
query43	0.05	0.04	0.04
Total cold run time: 97.97 s
Total hot run time: 26.79 s

@mymeiyi
Copy link
Contributor Author

mymeiyi commented Jan 20, 2026

run p0

@hello-stephen
Copy link
Contributor

FE Regression Coverage Report

Increment line coverage 76.92% (20/26) 🎉
Increment coverage report
Complete coverage report

1 similar comment
@hello-stephen
Copy link
Contributor

FE Regression Coverage Report

Increment line coverage 76.92% (20/26) 🎉
Increment coverage report
Complete coverage report

Copy link
Contributor

@dataroaring dataroaring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: fix fix table and partition get_version

Summary

This PR fixes several issues with version caching in cloud mode:

  1. Converts cloudPartitionVersionCacheTtlMs from static to instance variable
  2. Adds volatile for thread safety on cache fields
  3. Fixes cache update logic and adds missing RPC parameter
  4. Updates cache after batch version fetching

Analysis of Changes

1. [Good] Static to Instance Variable ✓

// Before
public static long cloudPartitionVersionCacheTtlMs = 0;

// After
public long cloudPartitionVersionCacheTtlMs = 0;

This enables per-session cache TTL configuration instead of a global setting.


2. [Good] Added volatile for Thread Safety ✓

// OlapTable
private volatile long lastTableVersionCachedTimeMs = 0;
private volatile long cachedTableVersion = -1;

// CloudPartition
private volatile long lastVersionCachedTimeMs = 0;

This ensures visibility of cache updates across threads.


3. [Bug Fix] Missing RPC Parameter ✓

// Before - waitForPendingTxns was ignored!
Cloud.GetVersionRequest request = Cloud.GetVersionRequest.newBuilder()
        .setRequestIp(...)
        .setDbId(...)
        // missing: .setWaitForPendingTxn(waitForPendingTxns)
        .build();

// After
Cloud.GetVersionRequest request = Cloud.GetVersionRequest.newBuilder()
        ...
        .setWaitForPendingTxn(waitForPendingTxns)  // Now included
        .build();

This is a real bug fix - the waitForPendingTxns parameter was being passed to the method but never sent to the meta-service!


4. [Good] Cache Update After Batch Fetch ✓

List<Long> versions = getVisibleVersionFromMeta(dbIds, tableIds);

// update cache - NEW
Preconditions.checkState(tables.size() == versions.size());
for (int i = 0; i < tables.size(); i++) {
    tables.get(i).setCachedTableVersion(versions.get(i));
}

This ensures the cache is populated after fetching from meta-service, reducing future RPC calls.


5. [Good] Null Check for ConnectContext ✓

// Before - potential NPE
long cacheExpirationMs = SessionVariable.cloudPartitionVersionCacheTtlMs;

// After - safe
ConnectContext ctx = ConnectContext.get();
if (ctx == null) {
    return true;  // expired if no context
}
long cacheExpirationMs = ctx.getSessionVariable().cloudPartitionVersionCacheTtlMs;

Issues Found

1. [Medium] Potential Race Condition in Cache Update

The volatile keyword provides visibility but doesn't protect the compound check-then-update operation:

protected void setCachedTableVersion(long version) {
    if (version >= cachedTableVersion) {  // read
        cachedTableVersion = version;      // write - race window here
        lastTableVersionCachedTimeMs = System.currentTimeMillis();
    }
}

Two threads could both pass the check and overwrite each other. However, since the version only increases monotonically, this is acceptable - both writes will set valid values. Worth documenting with a comment though.


2. [Low] Change from > to >= Needs Clarification

// Before
if (version > cachedTableVersion) {

// After  
if (version >= cachedTableVersion) {

This allows refreshing the cache timestamp even when the version hasn't changed. If intentional, consider adding a comment:

// Use >= to refresh cache timestamp even if version unchanged,
// extending the cache validity period on repeated reads
if (version >= cachedTableVersion) {

3. [Low] Inconsistent Cache Expiration Check Pattern

CloudPartition.isCachedVersionExpired() checks lastVersionCachedTimeMs == 0:

if (lastVersionCachedTimeMs == 0) {
    return true;
}

But OlapTable.isCachedTableVersionExpired() checks cachedTableVersion == -1:

if (cachedTableVersion == -1) {
    return true;
}

Consider making the pattern consistent between the two classes.


Positive Aspects

  1. Bug fix for setWaitForPendingTxn: This was a real bug where the parameter was ignored
  2. Pre-sized ArrayList allocation: new ArrayList<>(tables.size()) avoids resizing
  3. Collections.emptyList(): Better than new ArrayList<>() for empty results
  4. Test updates: Properly updated tests to use ConnectContext with session variables
  5. @VisibleForTesting annotations: Good practice for test helper methods

Verdict

Looks good to merge

Minor suggestions:

  1. Add comment explaining why >= instead of > for cache update
  2. Consider documenting the race condition acceptability in setCachedTableVersion
  3. Consider making cache expiration check pattern consistent

The setWaitForPendingTxn bug fix alone makes this PR valuable.

@hello-stephen
Copy link
Contributor

FE Regression Coverage Report

Increment line coverage 42.31% (11/26) 🎉
Increment coverage report
Complete coverage report

@mymeiyi
Copy link
Contributor Author

mymeiyi commented Jan 21, 2026

run nonConcurrent

@hello-stephen
Copy link
Contributor

FE Regression Coverage Report

Increment line coverage 42.31% (11/26) 🎉
Increment coverage report
Complete coverage report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants