⚡️ Speed up function create_position_ids_from_input_ids by 26%
#382
+8 −3
📄 26% speedup for `create_position_ids_from_input_ids` in `src/transformers/models/ibert/modeling_ibert.py`
⏱️ Runtime: 2.11 milliseconds → 1.67 milliseconds (best of 63 runs)
📝 Explanation and details
The optimization achieves a 26% speedup by eliminating unnecessary type conversions and reducing tensor operations in PyTorch.
Key optimizations:
1. **Direct boolean comparison:** Changed `input_ids.ne(padding_idx).int()` to `input_ids != padding_idx`, eliminating the `.int()` conversion. PyTorch's `torch.cumsum` can operate directly on boolean tensors and automatically returns int64 dtype.
2. **Removed redundant type conversion:** Eliminated `.type_as(mask)`, since `torch.cumsum` on a boolean tensor already produces the correct int64 dtype, avoiding an unnecessary tensor copy/conversion.
3. **Conditional addition:** Added an `if past_key_values_length != 0:` check so the addition is only performed when needed, reducing work in the common case where `past_key_values_length` is 0 (which happens in 40 out of 47 test cases, based on profiler hits).
4. **Simplified final operations:** Restructured the computation into separate mask-application and `padding_idx`-addition steps, which is more efficient than the original combined expression.
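Taken together, the changes above would look roughly like this. This is a sketch of the optimized function reconstructed from the description, not the exact committed diff:

```python
import torch


def create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0):
    # Boolean mask directly, no .int() conversion.
    mask = input_ids != padding_idx
    # cumsum on a boolean tensor already yields int64, so no .type_as() is needed.
    incremental_indices = torch.cumsum(mask, dim=1)
    # Only add the offset when there is one (the common case is 0).
    if past_key_values_length != 0:
        incremental_indices = incremental_indices + past_key_values_length
    # Zero out padding positions, then shift everything by padding_idx.
    return incremental_indices * mask + padding_idx
```

Padding positions end up equal to `padding_idx`, and real tokens get `padding_idx + 1, padding_idx + 2, …`, matching the original semantics.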
Why it's faster: fewer tensor operations and dtype conversions per call, and the offset addition is skipped entirely in the common `past_key_values_length = 0` case.
Impact on workloads:
This function is called during model embedding initialization in the forward pass (as shown in function_references), making it part of the critical path for every inference. The optimization particularly benefits the common case with zero `past_key_values_length` (27.4% average speedup across most test cases).
The optimization maintains identical numerical behavior while being consistently 18–59% faster across all test scenarios.
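The "identical numerical behavior" claim can be sanity-checked by comparing the two versions directly. The `original` body below follows the widely copied RoBERTa-style implementation that I-BERT uses; treat both functions as sketches rather than the exact committed code:

```python
import torch


def original(input_ids, padding_idx, past_key_values_length=0):
    # RoBERTa-style original: int mask, type_as, unconditional addition.
    mask = input_ids.ne(padding_idx).int()
    incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask
    return incremental_indices.long() + padding_idx


def optimized(input_ids, padding_idx, past_key_values_length=0):
    # Optimized version as described in this PR.
    mask = input_ids != padding_idx
    incremental_indices = torch.cumsum(mask, dim=1)
    if past_key_values_length != 0:
        incremental_indices = incremental_indices + past_key_values_length
    return incremental_indices * mask + padding_idx


# Compare on random inputs, with and without a past-key-values offset.
ids = torch.randint(0, 100, (8, 512))
assert torch.equal(original(ids, 1), optimized(ids, 1))
assert torch.equal(original(ids, 1, 16), optimized(ids, 1, 16))
```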
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-create_position_ids_from_input_ids-mia5lom5` and push.