PERF: Speed up DataFrame.apply(axis=1) for ExtensionDtype columns #65097

Draft
jbrockmendel wants to merge 1 commit into pandas-dev:main from jbrockmendel:perf-61747

@jbrockmendel
Member

Summary

  • For multi-block EA DataFrames (e.g. multiple Arrow-typed columns), series_generator now extracts scalars directly from column arrays into a reusable object-dtype Series, avoiding the expensive per-row fast_xs -> _from_sequence round-trip.
  • Single-block EA DataFrames keep the existing fast_xs path, which already preserves EA dtype.
  • Replace is_object_dtype(self.dtype) with self.dtype == object in _can_hold_identifiers_and_holds_name to avoid unnecessary indirection.
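
To illustrate the first bullet, here is a minimal user-level sketch of the "fill one reusable object-dtype row" idea. It is not the code in this PR: `iter_rows_reusing_buffer` is a hypothetical helper, it uses only public pandas API rather than the internal series_generator, and it assumes that `pd.Series(..., dtype=object, copy=False)` shares memory with the NumPy buffer it is given.

```python
import numpy as np
import pandas as pd


def iter_rows_reusing_buffer(df: pd.DataFrame):
    """Yield one object-dtype Series per row, refilling a single buffer.

    Hypothetical helper for illustration only; not the pandas internals.
    """
    # One array per column; for EA columns these are ExtensionArrays.
    columns = [df[name].array for name in df.columns]
    # A single object buffer that is overwritten for every row.
    buffer = np.empty(len(columns), dtype=object)
    # Assumption: copy=False keeps the Series backed by `buffer`.
    row = pd.Series(buffer, index=df.columns, dtype=object, copy=False)
    for i, label in enumerate(df.index):
        for j, arr in enumerate(columns):
            buffer[j] = arr[i]  # scalar lookup, no per-row EA construction
        row.name = label
        yield row  # the same Series object every time


df = pd.DataFrame(
    {
        "a": pd.array([1, 2, 3], dtype="Int64"),
        "b": pd.array([0.5, 1.5, 2.5], dtype="Float64"),
    }
)
sums = [(row.name, row["a"] + row["b"]) for row in iter_rows_reusing_buffer(df)]
```

Because the same Series is yielded on every iteration, a caller that wants to keep a row has to copy it first. The sketch also trades dtype preservation for speed: each row is object dtype, whereas the per-row round-trip builds an EA for every row.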

closes #61747

Performance

500K rows, 2 columns (int64[pyarrow] + double[pyarrow]), df.apply(lambda r: ..., axis=1):

Backend   Before   After
NumPy     1.7s     1.7s
Arrow     8.3s     3.0s
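
A rough way to reproduce this kind of comparison locally is sketched below. The helper names `make_frames` and `time_row_apply` are hypothetical, and the row function `lambda r: r["a"] + r["b"]` is a placeholder assumption, since the description elides the actual lambda. The Arrow dtypes need pyarrow installed; other extension dtypes such as "Int64" and "Float64" can be passed instead.

```python
import time

import numpy as np
import pandas as pd


def make_frames(n_rows, ea_dtypes=("int64[pyarrow]", "double[pyarrow]")):
    """Build a NumPy-backed frame and an equivalent EA-backed frame.

    Hypothetical helper; the default dtypes require pyarrow.
    """
    rng = np.random.default_rng(0)
    numpy_df = pd.DataFrame(
        {"a": rng.integers(0, 100, n_rows), "b": rng.random(n_rows)}
    )
    ea_df = numpy_df.astype({"a": ea_dtypes[0], "b": ea_dtypes[1]})
    return numpy_df, ea_df


def time_row_apply(df, func, repeats=1):
    """Return the best wall-clock time in seconds of df.apply(func, axis=1)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        df.apply(func, axis=1)
        best = min(best, time.perf_counter() - start)
    return best
```

Calling `make_frames(500_000)` and passing each frame to `time_row_apply` with the placeholder lambda gives one number per backend. Absolute timings depend on the machine and the pandas build, so only the ratio between the two backends is meaningful.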

Test plan

  • pytest pandas/tests/apply/ -x -q — 922 passed
  • pytest pandas/tests/extension/test_arrow.py -k apply — 240 passed
  • pytest pandas/tests/frame/ -k apply — 7 passed

🤖 Generated with Claude Code

jbrockmendel added the Performance (Memory or execution speed performance) label on Apr 6, 2026
For multi-block EA DataFrames (e.g. multiple Arrow-typed columns),
series_generator now extracts scalars directly from column arrays into
a reusable object-dtype Series, avoiding the expensive per-row
fast_xs -> _from_sequence round-trip. Single-block EA DataFrames
keep the existing fast_xs path, which already preserves EA dtype.

Also replace is_object_dtype(self.dtype) with self.dtype == object in
_can_hold_identifiers_and_holds_name to avoid unnecessary indirection.
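
As a quick sanity check on that substitution, the sketch below compares the two expressions for a handful of dtypes using only public API. The list of candidates is an arbitrary assumption and not exhaustive, so this shows agreement for these cases only, not a proof of equivalence.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_object_dtype


def checks_agree(dtype) -> bool:
    """True when is_object_dtype(dtype) and dtype == object give the same answer."""
    return bool(is_object_dtype(dtype)) == bool(dtype == object)


# Arbitrary sample of NumPy and extension dtypes (an assumption for illustration).
candidates = [
    np.dtype(object),
    np.dtype("int64"),
    np.dtype("float64"),
    pd.Int64Dtype(),
    pd.CategoricalDtype(),
]
results = {str(dtype): checks_agree(dtype) for dtype in candidates}
```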

closes pandas-dev#61747

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

PERF: Arrow dtypes are much slower than Numpy for DataFrame.apply