Skip to content

Conversation

@PSU3D0
Copy link

@PSU3D0 PSU3D0 commented Jul 7, 2025

Overview

This PR introduces optional formula reading capabilities to python-calamine, allowing users to access the underlying formulas in spreadsheet cells rather than just their calculated values.

Motivation

Previously, python-calamine only provided access to cell values (the calculated results of formulas). Users had no way to access the actual formula expressions, which is essential for:

  • Spreadsheet analysis and auditing
  • Formula debugging and validation
  • Converting between different spreadsheet formats while preserving formulas
  • Building tools that need to understand the calculation logic

Implementation

Core Changes

  1. Opt-in Design: Formula reading is disabled by default (read_formulas=False) to maintain backward compatibility and avoid performance overhead for users who don't need formulas.

  2. New API Parameter: Added read_formulas parameter to all workbook creation methods:

    • CalamineWorkbook.from_object(path_or_filelike, read_formulas=False)
    • CalamineWorkbook.from_path(path, read_formulas=False)
    • CalamineWorkbook.from_filelike(filelike, read_formulas=False)
    • load_workbook(path_or_filelike, read_formulas=False)
  3. Formula Iterator: Added CalamineSheet.iter_formulas() method that returns a CalamineFormulaIterator with consistent dimensions matching the data iterator.

  4. Performance Optimization: Implemented on-demand coordinate mapping instead of pre-allocating expanded ranges, ensuring minimal memory overhead.

Technical Details

  • Coordinate Consistency: Formula iterator returns the same dimensions as data iterator, with empty strings for cells without formulas
  • Graceful Error Handling: Clear error message when attempting to access formulas without enabling the feature
  • Memory Efficient: Uses get_value() lookups rather than expanding formula ranges upfront
  • Iterator Properties: Both CalamineCellIterator and CalamineFormulaIterator expose position, start, width, and height properties for coordinate recovery

File Format Support

Supports formula reading across all major spreadsheet formats:

  • Excel (.xlsx, .xls, .xlsb)
  • OpenDocument (.ods)

Formula syntax varies by format (e.g., ODS uses of:=SUM([.A1:.B1]) vs Excel's SUM(A1:B1)).

API Examples

# Default behavior - formulas disabled
wb = load_workbook("spreadsheet.xlsx")
sheet = wb.get_sheet_by_name("Sheet1")
# sheet.iter_formulas()  # Raises ValueError

# Enable formula reading
wb = load_workbook("spreadsheet.xlsx", read_formulas=True)
sheet = wb.get_sheet_by_name("Sheet1")

# Access both data and formulas
data = list(sheet.iter_rows())      # [[10.0, 15.0, 25.0]]
formulas = list(sheet.iter_formulas())  # [["", "", "SUM(A1:B1)"]]

# Iterator properties for coordinate mapping
formula_iter = sheet.iter_formulas()
print(f"Start: {formula_iter.start}, Size: {formula_iter.width}x{formula_iter.height}")

Backward Compatibility

  • Default behavior unchanged (read_formulas=False)
  • No breaking changes to existing APIs
  • Existing code continues to work without modification

Testing

Comprehensive test suite covering:

  • All supported file formats (.xlsx, .xls, .ods)
  • All workbook creation methods
  • Error handling for disabled formulas
  • Iterator property validation
  • Coordinate consistency between data and formula iterators

Performance Impact

  • Zero overhead when formulas are disabled (default)
  • Minimal memory usage when formulas are enabled due to on-demand lookup strategy
  • No upfront range expansion or pre-allocation

Summary by CodeRabbit

Release Notes

  • New Features
    • Added read_formulas parameter to workbook creation methods (load_workbook, from_path, from_object, from_filelike) to enable formula extraction from spreadsheets
    • Added iter_formulas() method to iterate through formulas in sheets
    • Formula iteration requires enabling read_formulas at workbook creation; attempting to read formulas when disabled raises an error

✏️ Tip: You can customize this high-level summary in your review settings.

@PSU3D0
Copy link
Author

PSU3D0 commented Jul 7, 2025

A human addendum to the AI's diff summary ^:

This lib is much faster than openpyxl but remained unusable for my use-case due to:

  • Lack of formula support
  • Lack of coordinate recovery

This PR should be a big step forward for broader usability and more use-cases. The next thing I'd hit would be a unified CellData iterator that exposes value, coordinate, optional formula, and potentially style information (still need to check Calamine source on this).

Cheers!

@PSU3D0
Copy link
Author

PSU3D0 commented Jul 12, 2025

@dimastbk bumping this

m.add_function(wrap_pyfunction!(load_workbook, m)?)?;
m.add_class::<CalamineWorkbook>()?;
m.add_class::<CalamineSheet>()?;
m.add_class::<CalamineFormulaIterator>()?;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add CalamineCellIterator here?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove, please

@benemanu
Copy link

benemanu commented Aug 21, 2025

This would be awesome!
Is there any way i can help?

@simx11
Copy link

simx11 commented Aug 21, 2025

This is something we urgently need, let me know if we can help!

Copilot AI review requested due to automatic review settings October 21, 2025 09:22

This comment was marked as outdated.

@coderabbitai
Copy link

coderabbitai bot commented Jan 27, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

  • ✅ Full review completed - (🔄 Check again to review again)

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@python/python_calamine/_python_calamine.pyi`:
- Around line 133-137: The type stub for class CalamineWorkbook is missing the
read_formulas property; add a read-only boolean attribute declaration to the
CalamineWorkbook class (e.g., add "read_formulas: bool") so the stub matches the
Rust #[pyo3(get)]-exposed property and satisfies the test that checks
test_formulas_read_flag_exposed.
🧹 Nitpick comments (1)
src/types/sheet.rs (1)

330-356: Consider explicit dimension checks for edge cases.

The width/height calculation uses tuple comparison (data_start <= data_end), which performs lexicographic comparison. While this should work correctly for typical calamine ranges where both row and column of start are ≤ end, consider making the intent explicit:

Suggested change for clarity
     fn from_range_with_data_bounds(
         range: Arc<Range<String>>,
         data_start: (u32, u32),
         data_end: (u32, u32),
     ) -> CalamineFormulaIterator {
-        let width = if data_start <= data_end {
+        let width = if data_start.1 <= data_end.1 {
             (data_end.1 - data_start.1 + 1) as usize
         } else {
             0
         };

-        let height = if data_start <= data_end {
+        let height = if data_start.0 <= data_end.0 {
             (data_end.0 - data_start.0 + 1) as usize
         } else {
             0
         };

Comment on lines 133 to 137
@typing.final
class CalamineWorkbook(contextlib.AbstractContextManager):
path: str | None
sheet_names: list[str]
sheets_metadata: list[SheetMetadata]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Missing read_formulas property in type stub.

The CalamineWorkbook class exposes a read_formulas property (via #[pyo3(get)] in Rust), which is tested in test_formulas_read_flag_exposed, but it's not declared in the type stub.

Proposed fix
 `@typing.final`
 class CalamineWorkbook(contextlib.AbstractContextManager):
     path: str | None
     sheet_names: list[str]
     sheets_metadata: list[SheetMetadata]
+    read_formulas: bool
     `@classmethod`
     def from_object(
🤖 Prompt for AI Agents
In `@python/python_calamine/_python_calamine.pyi` around lines 133 - 137, The type
stub for class CalamineWorkbook is missing the read_formulas property; add a
read-only boolean attribute declaration to the CalamineWorkbook class (e.g., add
"read_formulas: bool") so the stub matches the Rust #[pyo3(get)]-exposed
property and satisfies the test that checks test_formulas_read_flag_exposed.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/types/sheet.rs`:
- Around line 230-245: iter_formulas currently uses unwrap_or((0,0)) for range
bounds which causes a 1×1 empty row for empty ranges and drops leading empty
rows when start.0 > 0; fix by computing bounds like iter_rows: let data_start =
self.range.start().unwrap_or((0,0)); if self.range.is_empty() set data_end to a
value that yields zero height (e.g. make data_end row/col < data_start so
height/width compute to 0), otherwise use self.range.end().unwrap(); keep
data_start unchanged so leading empty rows (when start.0 > 0) are preserved,
then call CalamineFormulaIterator::from_range_with_data_bounds with those
adjusted data_start/data_end values. Ensure you use self.range.is_empty(),
self.range.start(), self.range.end(), iter_formulas, and
CalamineFormulaIterator::from_range_with_data_bounds to locate and implement the
change.

Comment on lines +230 to +245
fn iter_formulas(&self) -> PyResult<CalamineFormulaIterator> {
match &self.formula_range {
Some(formula_range) => {
let data_start = self.range.start().unwrap_or((0, 0));
let data_end = self.range.end().unwrap_or((0, 0));
Ok(CalamineFormulaIterator::from_range_with_data_bounds(
Arc::clone(formula_range),
data_start,
data_end
))
},
None => Err(pyo3::exceptions::PyValueError::new_err(
"Formula iteration is disabled. Use read_formulas=True when creating the workbook to enable formula access."
)),
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Formula iterator row count can diverge from iter_rows for non‑zero starts or empty sheets.

iter_rows() emits leading empty rows when start.0 > 0, but CalamineFormulaIterator’s height is computed from data_end - data_start + 1, so it skips those rows. For empty ranges, unwrap_or((0,0)) produces a 1×1 empty row. This violates the “same dimensions” contract in those cases.

🐛 Suggested fix to align formula dimensions with iter_rows and empty ranges
-        let data_start = self.range.start().unwrap_or((0, 0));
-        let data_end = self.range.end().unwrap_or((0, 0));
+        let data_start = self.range.start();
+        let data_end = self.range.end();
         Ok(CalamineFormulaIterator::from_range_with_data_bounds(
             Arc::clone(formula_range),
             data_start,
             data_end
         ))
-    fn from_range_with_data_bounds(
-        range: Arc<Range<String>>,
-        data_start: (u32, u32),
-        data_end: (u32, u32),
-    ) -> CalamineFormulaIterator {
-        let width = if data_start <= data_end {
-            (data_end.1 - data_start.1 + 1) as usize
-        } else {
-            0
-        };
-
-        let height = if data_start <= data_end {
-            (data_end.0 - data_start.0 + 1) as usize
-        } else {
-            0
-        };
-
-        CalamineFormulaIterator {
-            position: 0,
-            start: data_start,
-            range,
-            width,
-            height,
-        }
-    }
+    fn from_range_with_data_bounds(
+        range: Arc<Range<String>>,
+        data_start: Option<(u32, u32)>,
+        data_end: Option<(u32, u32)>,
+    ) -> CalamineFormulaIterator {
+        let (start, width, height) = match (data_start, data_end) {
+            (Some(start), Some(end)) if start <= end => {
+                let width = (end.1 - start.1 + 1) as usize;
+                // include leading empty rows to match iter_rows
+                let height = (end.0 + 1) as usize;
+                (start, width, height)
+            }
+            _ => ((0, 0), 0, 0),
+        };
+
+        CalamineFormulaIterator {
+            position: 0,
+            start,
+            range,
+            width,
+            height,
+        }
+    }
-        let current_row = slf.start.0 + slf.position;
-        slf.position += 1;
+        let current_row = slf.position;
+        slf.position += 1;
 
         // Create the result row with proper width, filled with empty strings
         let mut result_row = vec!["".to_string(); slf.width];
 
-        // Fill in formulas for this row by checking each column position
-        for (col_idx, result_cell) in result_row.iter_mut().enumerate() {
-            let current_col = slf.start.1 + col_idx as u32;
-            if let Some(formula) = slf.range.get_value((current_row, current_col)) {
-                if !formula.is_empty() {
-                    *result_cell = formula.clone();
-                }
-            }
-        }
+        // Fill in formulas only once we are within the data start row
+        if current_row >= slf.start.0 {
+            for (col_idx, result_cell) in result_row.iter_mut().enumerate() {
+                let current_col = slf.start.1 + col_idx as u32;
+                if let Some(formula) = slf.range.get_value((current_row, current_col)) {
+                    if !formula.is_empty() {
+                        *result_cell = formula.clone();
+                    }
+                }
+            }
+        }

Also applies to: 331-387

🤖 Prompt for AI Agents
In `@src/types/sheet.rs` around lines 230 - 245, iter_formulas currently uses
unwrap_or((0,0)) for range bounds which causes a 1×1 empty row for empty ranges
and drops leading empty rows when start.0 > 0; fix by computing bounds like
iter_rows: let data_start = self.range.start().unwrap_or((0,0)); if
self.range.is_empty() set data_end to a value that yields zero height (e.g. make
data_end row/col < data_start so height/width compute to 0), otherwise use
self.range.end().unwrap(); keep data_start unchanged so leading empty rows (when
start.0 > 0) are preserved, then call
CalamineFormulaIterator::from_range_with_data_bounds with those adjusted
data_start/data_end values. Ensure you use self.range.is_empty(),
self.range.start(), self.range.end(), iter_formulas, and
CalamineFormulaIterator::from_range_with_data_bounds to locate and implement the
change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants