@Yicong-Huang Yicong-Huang commented Jan 17, 2026

What's Changed

Fix ListVector/LargeListVector IPC serialization when valueCount is 0.

Problem

When valueCount == 0, setReaderAndWriterIndex() set offsetBuffer.writerIndex(0), so readableBytes() == 0. The IPC serializer uses readableBytes() to determine the buffer size, so 0 bytes were written to the IPC stream. This crashes IPC readers in other libraries because the Arrow spec requires the offset buffer to have at least one entry, [0].
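
To make the spec requirement concrete, here is a minimal sketch of how an offsets array relates to list lengths. It does not use the Arrow Java API; the class and method names (OffsetSketch, offsetsFor) are hypothetical and exist only for illustration.

```java
import java.util.Arrays;

public class OffsetSketch {
    // Hypothetical helper: builds the offsets for a list column from the
    // length of each list. N lists yield N + 1 offsets, so zero lists
    // still yield the single entry [0].
    static int[] offsetsFor(int[] lengths) {
        int[] offsets = new int[lengths.length + 1];
        for (int i = 0; i < lengths.length; i++) {
            offsets[i + 1] = offsets[i] + lengths[i];
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Three lists of lengths 2, 0 and 3
        System.out.println(Arrays.toString(offsetsFor(new int[] {2, 0, 3}))); // [0, 2, 2, 5]
        // No lists at all: the offsets array is not empty
        System.out.println(Arrays.toString(offsetsFor(new int[] {})));        // [0]
    }
}
```

The empty case is the one this PR is about: a reader that expects N + 1 offsets will look for one entry even when N is 0.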

@viirya:

The offset buffers are allocated properly. But during IPC serialization, they are ignored.

  public long readableBytes() {
      return writerIndex - readerIndex;
  }

So when ListVector.setReaderAndWriterIndex() sets writerIndex(0) and readerIndex(0), readableBytes() returns 0 - 0 = 0.

Then when MessageSerializer.writeBatchBuffers() calls WriteChannel.write(buffer), it writes 0 bytes.

So the flow is:

  1. valueCount=0 → ListVector.setReaderAndWriterIndex() sets offsetBuffer.writerIndex(0)
  2. VectorUnloader.getFieldBuffers() returns the buffer with writerIndex=0
  3. MessageSerializer.writeBatchBuffers() writes the buffer
  4. WriteChannel.write(buffer) checks buffer.readableBytes() which is 0
  5. 0 bytes are written to the IPC stream
  6. PyArrow (or another library) reads the batch with the missing buffer → crash
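
The flow above can be modelled without Arrow at all. The sketch below uses a hypothetical Buf class and write helper as stand-ins for the real buffer and write channel; only the reader/writer index arithmetic is taken from the description.

```java
import java.io.ByteArrayOutputStream;

public class ReadableBytesSketch {
    // Hypothetical stand-in for an Arrow buffer: only the two indices matter here.
    static final class Buf {
        final byte[] data;
        int readerIndex;
        int writerIndex;

        Buf(int capacity) {
            data = new byte[capacity];
        }

        int readableBytes() {
            return writerIndex - readerIndex;
        }
    }

    // Hypothetical writer: copies only the readable region, like a channel
    // that sizes its write from readableBytes(). Returns the bytes written.
    static int write(Buf buf, ByteArrayOutputStream out) {
        int n = buf.readableBytes();
        out.write(buf.data, buf.readerIndex, n);
        return n;
    }

    public static void main(String[] args) {
        Buf offsets = new Buf(4); // allocated, holds offset[0] = 0 (assuming 4-byte offsets)

        // Before the fix: writerIndex stays at 0, so nothing is serialized.
        offsets.writerIndex = 0;
        System.out.println(write(offsets, new ByteArrayOutputStream())); // 0

        // After the fix: writerIndex covers offset[0], so 4 bytes are serialized.
        offsets.writerIndex = 4;
        System.out.println(write(offsets, new ByteArrayOutputStream())); // 4
    }
}
```

The buffer is allocated in both cases; what changes is how much of it the writer considers readable.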

Fix

Simplify setReaderAndWriterIndex() to always use (valueCount + 1) * OFFSET_WIDTH for the offset buffer's writerIndex. When valueCount == 0, this sets writerIndex to OFFSET_WIDTH, so offset[0] is included in serialization.
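
As a quick check of the arithmetic, here is a standalone sketch of the writerIndex formula. It assumes 4-byte offsets for ListVector and 8-byte offsets for LargeListVector; the class, constant, and method names are hypothetical, and the cast to long is a choice made for this sketch rather than something taken from the patch.

```java
public class WriterIndexSketch {
    // Assumed offset widths in bytes (32-bit and 64-bit offsets).
    static final int LIST_OFFSET_WIDTH = 4;
    static final int LARGE_LIST_OFFSET_WIDTH = 8;

    // Same formula for every valueCount, with no special case for 0.
    static long offsetWriterIndex(int valueCount, int offsetWidth) {
        return (long) (valueCount + 1) * offsetWidth;
    }

    public static void main(String[] args) {
        System.out.println(offsetWriterIndex(0, LIST_OFFSET_WIDTH));       // 4
        System.out.println(offsetWriterIndex(3, LIST_OFFSET_WIDTH));       // 16
        System.out.println(offsetWriterIndex(0, LARGE_LIST_OFFSET_WIDTH)); // 8
    }
}
```

With valueCount == 0 the result is exactly one offset width, which is the single [0] entry the spec requires.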

Testing

Added tests for nested empty lists verifying that the offset buffer reports the correct readableBytes().

Closes #343.

@Yicong-Huang Yicong-Huang changed the title from "GH-343 Fix ListVector offset buffer not allocated for nested empty arrays" to "GH-343: Fix ListVector offset buffer not allocated for nested empty arrays" Jan 17, 2026
@lidavidm lidavidm added the bug-fix PRs that fix a big. label Jan 18, 2026
@github-actions github-actions bot added this to the 19.0.0 milestone Jan 18, 2026
@jbonofre
Member

@Yicong-Huang can you please rebase the PR? Thanks!

@Yicong-Huang Yicong-Huang force-pushed the fix/343-empty-nested-list-offset-buffer branch from 8b09237 to 7dbdcc4 January 20, 2026 18:21
Comment on lines 279 to 287
  // Ensure offset buffer has at least one entry for offset[0].
  // According to Arrow specification, offset buffer must have N+1 entries,
  // even when N=0, it should contain [0].
  if (offsetBuffer.capacity() == 0) {
    // Save and restore offsetAllocationSizeInBytes to avoid affecting subsequent allocateNew()
    long savedOffsetAllocationSize = offsetAllocationSizeInBytes;
    offsetBuffer = allocateOffsetBuffer(OFFSET_WIDTH);
    offsetAllocationSizeInBytes = savedOffsetAllocationSize;
  }
Member

This looks unnecessary because once allocateNew is called on the vector, its offset buffer should be allocated, and we shouldn't modify offsetBuffer in a getter method like that.

Author

Makes sense. Reverted the changes here.

Comment on lines 323 to 324
// Even when valueCount is 0, offset buffer should have offset[0] per Arrow spec
offsetBuffer.writerIndex(Math.min(OFFSET_WIDTH, offsetBuffer.capacity()));
Member

I think that we only need offsetBuffer.writerIndex((valueCount + 1) * OFFSET_WIDTH); like below.

Author

I removed the if clause to use offsetBuffer.writerIndex((valueCount + 1) * OFFSET_WIDTH) for all cases.

  if (valueCount == 0) {
    validityBuffer.writerIndex(0);
    offsetBuffer.writerIndex(0);
    // Even when valueCount is 0, offset buffer should have offset[0] per Arrow spec
Member

It is better to explain it clearly:

Suggested change
// Even when valueCount is 0, offset buffer should have offset[0] per Arrow spec
// IPC serializer will determine readable bytes based on `readerIndex` and `writerIndex`.
// Both are set to 0 means 0 bytes are written to the IPC stream which will crash IPC readers
// in other libraries. According to Arrow spec, we should still output the offset buffer which
// is [0].

Author

Thanks! Changed to this comment.

@viirya
Member

viirya commented Jan 21, 2026

When outer array is empty, nested writers are never invoked, so child list's offset buffer remains unallocated (capacity = 0). This violates Arrow spec which requires offset[0] = 0.

I think you are referring to the writers in Spark. That is out of context here and not related to the root cause. We should update the description to explain the issue clearly.

The offset buffers are actually allocated properly. But during IPC serialization, they are ignored.

  public long readableBytes() {
      return writerIndex - readerIndex;
  }

So when ListVector.setReaderAndWriterIndex() sets writerIndex(0) and readerIndex(0), readableBytes() returns 0 - 0 = 0.

Then when MessageSerializer.writeBatchBuffers() calls WriteChannel.write(buffer), it writes 0 bytes.

So the flow is:

  1. valueCount=0 → ListVector.setReaderAndWriterIndex() sets offsetBuffer.writerIndex(0)
  2. VectorUnloader.getFieldBuffers() returns the buffer with writerIndex=0
  3. MessageSerializer.writeBatchBuffers() writes the buffer
  4. WriteChannel.write(buffer) checks buffer.readableBytes() which is 0
  5. 0 bytes are written to the IPC stream
  6. PyArrow (or another library) reads the batch with the missing buffer → crash

@viirya
Member

viirya commented Jan 21, 2026

Hi @lidavidm @jbonofre, do you think this can make it into the Arrow Java 19.0.0 release?

@Yicong-Huang Yicong-Huang changed the title from "GH-343: Fix ListVector offset buffer not allocated for nested empty arrays" to "GH-343: Fix ListVector offset buffer not properly serialized for nested empty arrays" Jan 21, 2026
Development

Successfully merging this pull request may close these issues.

[C++/Java] Error when reading inner lists within a struct in empty outer lists from C++/Python in Java

4 participants