-
Notifications
You must be signed in to change notification settings - Fork 283
Fix SMT2 output determinism by using ordered maps #8830
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
This commit addresses the issue where SMT2 formula output varied even when semantically equivalent C code was provided with different source formatting (e.g., adding empty lines). Root Cause: ----------- Several key data structures in smt2_conv.h used std::unordered_map, which iterates in hash-dependent order. When source code formatting changes: 1. Line numbers change 2. Expression hashes change 3. Symbols are stored/iterated in different orders 4. SMT2 declarations appear in different orders The ensuing changes to the SMT formula content -- even though the formulae were semantically equivalent -- caused variability in SMT-solver performance. Producing stable output should ensure stability in solver performance. For the test case provided in diffblue#8820 this now yields: `SMT files: 0 lines differ` (where previously it was 8 lines differing). Fixes: diffblue#8820
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Improves determinism of generated SMT2 output (and thus solver performance stability) by making key generation steps order-stable and adding a regression test.
Changes:
- Replaced several
std::unordered_mapmembers in the SMT2 converter with ordered maps and added determinism-oriented comments. - Made
goto_symext::phi_functionprocess SSA delta merges in a deterministic order. - Added a regression test asserting stable ordering of SMT2 declarations.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| src/solvers/smt2/smt2_conv.h | Switches multiple unordered_map members to map to attempt stable ordering. |
| src/solvers/smt2/smt2_conv.cpp | Adds determinism-related commentary around footer emission and symbol discovery. |
| src/goto-symex/symex_goto.cpp | Sorts phi-merge processing by identifier to stabilize downstream equation / SMT2 output order. |
| regression/cbmc/deterministic-smt-output/test.desc | New regression asserting consistent declaration ordering (x before y). |
| regression/cbmc/deterministic-smt-output/main.c | Minimal program to trigger and validate deterministic SMT2 output ordering. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Using std::map instead of std::unordered_map to ensure deterministic | ||
| // iteration order for SMT2 output. This prevents non-deterministic output | ||
| // when source code formatting changes (e.g., adding empty lines). | ||
| typedef std::map<irep_idt, identifiert> identifier_mapt; |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change intends deterministic SMT2 output ordering, but std::map<irep_idt, ...> does not sort by identifier text: irep_idt (dstringt) compares by internal table index (no), which can vary with string-interning insertion order (see src/util/dstring.h:126-128). To make ordering truly stable by identifier name, use a comparator based on irep_idt::compare / id2string (or store std::string keys).
| /// smt2 formula. Using std::map for deterministic iteration order. | ||
| std::map<irep_idt, bool> set_values; |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set_values is only used for lookups (e.g., set_values.find(...) in smt2_dec.cpp), not for iteration-based output. Switching from std::unordered_map to std::map therefore doesn’t improve SMT2 output determinism, but does change expected complexity from average O(1) to O(log n) on a potentially large map.
If determinism is not required here, consider keeping std::unordered_map (or otherwise justify the ordering requirement).
| /// for all symbols (variables) found. Symbols are added to identifier_map, | ||
| /// which uses std::map to ensure deterministic iteration order. | ||
| /// | ||
| /// Determinism: Since identifier_map is ordered by identifier name, symbols | ||
| /// will always be declared in the same order in the SMT2 output, regardless | ||
| /// of the order in which they are discovered during traversal. This ensures | ||
| /// that semantically equivalent C code produces identical SMT2 output even | ||
| /// if source formatting (e.g., empty lines) changes line numbers and hash codes. |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new doc comment is misleading:
identifier_mapbeing astd::map<irep_idt, ...>does not imply ordering by identifier name (irep_idt/dstringtorders by internal table index; seesrc/util/dstring.h:126-128).- More importantly,
find_symbolsemits(declare-fun ...)immediately when a symbol is first inserted, so declaration order still depends on discovery / traversal order, not on iteratingidentifier_map.
Either adjust the documentation to match the actual behavior, or change the implementation to collect symbols first and then output declarations in a sorted order based on identifier text.
| /// for all symbols (variables) found. Symbols are added to identifier_map, | |
| /// which uses std::map to ensure deterministic iteration order. | |
| /// | |
| /// Determinism: Since identifier_map is ordered by identifier name, symbols | |
| /// will always be declared in the same order in the SMT2 output, regardless | |
| /// of the order in which they are discovered during traversal. This ensures | |
| /// that semantically equivalent C code produces identical SMT2 output even | |
| /// if source formatting (e.g., empty lines) changes line numbers and hash codes. | |
| /// for all symbols (variables) it encounters. When a symbol is seen for the | |
| /// first time during traversal, it is inserted into identifier_map and its | |
| /// SMT2 declaration is emitted immediately. | |
| /// | |
| /// Note: identifier_map is a std::map keyed by irep_idt and is used to track | |
| /// which symbols have already been declared. Its key ordering is defined by | |
| /// irep_idt and is not guaranteed to correspond to lexicographic ordering of | |
| /// identifier text, and declaration order in the SMT2 output follows the | |
| /// expression traversal order rather than iteration over identifier_map. |
| std::map<std::string, symex_renaming_levelt::delta_viewt::const_iterator> | ||
| ordered_names_to_merge; | ||
| for(auto it = delta_view.begin(); it != delta_view.end(); ++it) |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::map is used here, but this translation unit doesn’t include <map> (and none of the existing includes appear to). Please add #include <map> to avoid relying on transitive includes / potential build breaks.
| std::map<std::string, symex_renaming_levelt::delta_viewt::const_iterator> | ||
| ordered_names_to_merge; | ||
| for(auto it = delta_view.begin(); it != delta_view.end(); ++it) | ||
| { | ||
| const ssa_exprt &ssa = it->m.first; | ||
| bool inserted = | ||
| ordered_names_to_merge.insert({id2string(ssa.get_identifier()), it}) | ||
| .second; | ||
| CHECK_RETURN(inserted); | ||
| } |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change makes phi_function deterministic, but it also changes complexity from a single O(n) scan to building an ordered std::map (O(n log n)) plus per-entry std::string allocations (id2string(...) copied into the map key). Given this is likely a hot path, consider collecting iterators into a std::vector and sorting them with a comparator over identifier text to reduce allocations/overhead while preserving determinism.
| // Using std::map for deterministic iteration order of converters | ||
| using converterst = std::map<irep_idt, std::function<void(const exprt &)>>; |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
irep_idt ordering in std::map is not lexicographic: dstringt::operator< compares the internal table index (no), which depends on string-interning insertion order (see src/util/dstring.h:126-128). If determinism is meant to be by identifier text (as suggested by the regression test / comments), this map should use a comparator based on irep_idt::compare / id2string rather than the default std::less<irep_idt>.
This affects determinism claims for converter ordering here (and similarly for the other new std::map<irep_idt, ...> members in this header).
| // Using std::map for deterministic iteration order of converters | |
| using converterst = std::map<irep_idt, std::function<void(const exprt &)>>; | |
| // Comparator to ensure lexicographic ordering by identifier text | |
| struct irep_id_lexicographic_less | |
| { | |
| bool operator()(const irep_idt &lhs, const irep_idt &rhs) const | |
| { | |
| return lhs.compare(rhs) < 0; | |
| } | |
| }; | |
| // Using std::map with a lexicographic comparator for deterministic | |
| // iteration order of converters based on identifier text | |
| using converterst = | |
| std::map<irep_idt, std::function<void(const exprt &)>, irep_id_lexicographic_less>; |
| // Using std::map for deterministic order of bindings | ||
| std::map<irep_idt, irept> current_bindings; | ||
|
|
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switching current_bindings to std::map<irep_idt, ...> does not guarantee lexicographic determinism because irep_idt (dstringt) orders by internal table index (no), not by string content (see src/util/dstring.h:126-128). If you need stable ordering by identifier text, use a comparator based on irep_idt::compare / id2string.
| // Using std::map for deterministic order of bindings | |
| std::map<irep_idt, irept> current_bindings; | |
| struct irep_idt_lexicographic_less | |
| { | |
| bool operator()(const irep_idt &lhs, const irep_idt &rhs) const | |
| { | |
| return lhs.compare(rhs) < 0; | |
| } | |
| }; | |
| // Using std::map with lexicographic ordering of identifier text | |
| std::map<irep_idt, irept, irep_idt_lexicographic_less> current_bindings; |
|
Test results on mlkem-native, running on M1 macOS, with MLKEM_K=4 and -j8 Before (release 6.8.0): 139 proofs in 17m13s real, 49m 12s user |
|
Test results on mlkem-native, running on r7g.16xlarge/Ubuntu, with MLKEM_K=4 and -j64 Before (release 6.8.0): 139 proofs in 6m20s real, 47m37s user |
|
Test results on mldsa-native, running on r7g.16xlarge/Ubuntu, with MLD_CONFIG_PARAMETER_SET=87 and -j64 Before (release 6.8.0): 173 proofs in 4m22s real, 51m38s user so a little slower. For example, the proofs of sign_verify_internal slows from 143s to 307s. With stability gained, we can experiment with that to see which ordering of SMT terms is the best. I also hope that PR#8705 will result in a noticeable improvement in performance. |
|
Rod: can you tell what makes that run slower? Is it the time to generate the SMT-LIB instance, or the time to solve it? |
|
For polyvec_add() which slowed a little after this change, the time to generate the SMT is 5 seconds. Time to prove it with Z3 is 520s. Once we have stabilized SMT-generation in CBMC, we've got a rational basis to proceed - we can experiment with Z3 options and tactics, work out what the best ordering of terms is (for Z3 at least), and possible open bug reports on Z3. Onwards! |
|
That Z3 time of 520s is with the option With default options, proof time is 44s on my laptop. There might be a few other cases where we can adjust Z3 options to get the overall performance back to what it was. |
This commit addresses the issue where SMT2 formula output varied even when semantically equivalent C code was provided with different source formatting (e.g., adding empty lines).
Root Cause:
Several key data structures in smt2_conv.h used std::unordered_map, which iterates in hash-dependent order. When source code formatting changes:
The ensuing changes to the SMT formula content -- even though the formulae were semantically equivalent -- caused variability in SMT-solver performance. Producing stable output should ensure stability in solver performance. For the test case provided in #8820 this now yields:
SMT files: 0 lines differ(where previously it was 8 lines differing).Fixes: #8820