I changed the NumPy buffering setup a bit to simplify the code and make it faster in general.
This probably has little or no effect on numexpr (the core loop size may occasionally be smaller, but you don't use GROWINNER, which is where einsum noticed a change: a huge core loop reduced the summation precision there).
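For context, NPY_ITER_GROWINNER is the NumPy iterator flag that lets the inner loop grow past the buffer size whenever no buffering is actually needed. Below is a minimal sketch of a buffered, external-loop summation (my illustration, not numexpr's actual setup; it assumes a non-empty double array):

```cpp
#include <numpy/arrayobject.h>

/* Sum all elements of a double array with a buffered iterator.
 * Without NPY_ITER_GROWINNER, each block is capped at the buffer size,
 * so blocking (and any precision behavior tied to it) is preserved;
 * adding the flag can grow the core to the whole iteration. */
static int
sum_buffered(PyArrayObject *arr, double *out)
{
    npy_uint32 flags = NPY_ITER_BUFFERED | NPY_ITER_EXTERNAL_LOOP
                       | NPY_ITER_READONLY;  /* no NPY_ITER_GROWINNER */
    NpyIter *iter = NpyIter_New(arr, flags, NPY_KEEPORDER,
                                NPY_SAFE_CASTING, NULL);
    if (iter == NULL) {
        return -1;
    }
    NpyIter_IterNextFunc *iternext = NpyIter_GetIterNext(iter, NULL);
    char **dataptr = NpyIter_GetDataPtrArray(iter);
    npy_intp *strideptr = NpyIter_GetInnerStrideArray(iter);
    npy_intp *sizeptr = NpyIter_GetInnerLoopSizePtr(iter);

    double acc = 0.0;
    do {
        char *data = *dataptr;
        npy_intp stride = strideptr[0];
        npy_intp n = *sizeptr;  /* block size; fast paths key off this */
        for (npy_intp i = 0; i < n; i++, data += stride) {
            acc += *(double *)data;
        }
    } while (iternext(iter));

    *out = acc;
    NpyIter_Deallocate(iter);
    return 0;
}
```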
I noticed this "fixed-size" optimization in numexpr/numexpr/interpreter.cpp (lines 638 to 660 at commit 2378606), which assumes the inner loop has a fixed size until the end of the iteration:
```cpp
/*
 * First do all the blocks with a compile-time fixed size.
 * This makes a big difference (30-50% on some tests).
 */
block_size = *size_ptr;
while (block_size == BLOCK_SIZE1) {
#define REDUCTION_INNER_LOOP
#define BLOCK_SIZE BLOCK_SIZE1
#include "interp_body.cpp"
#undef BLOCK_SIZE
#undef REDUCTION_INNER_LOOP
    iternext(iter);
    block_size = *size_ptr;
}

/* Then finish off the rest */
if (block_size > 0) do {
#define REDUCTION_INNER_LOOP
#define BLOCK_SIZE block_size
#include "interp_body.cpp"
#undef BLOCK_SIZE
#undef REDUCTION_INNER_LOOP
} while (iternext(iter));
```
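As a stand-alone illustration of the same pattern (my sketch, not numexpr's code, which inlines the whole virtual machine via interp_body.cpp): run a loop whose trip count is a compile-time constant while the iterator keeps delivering full blocks, then fall back to a variable-size tail. The fixed trip count is what lets the compiler unroll and vectorize the hot loop; BLOCK_SIZE1 and the contiguous-double assumption here are hypothetical.

```cpp
#include <numpy/arrayobject.h>

#define BLOCK_SIZE1 1024  /* hypothetical fixed block size */

/* Sum a stream of contiguous double blocks. Sketch only: assumes
 * stride == sizeof(double) and a non-empty iterator. */
static double
sum_fixed_then_tail(NpyIter *iter, NpyIter_IterNextFunc *iternext,
                    char **dataptr, npy_intp *sizeptr)
{
    double acc = 0.0;
    npy_intp block_size = *sizeptr;

    /* Fast path: every block has the compile-time size BLOCK_SIZE1,
     * so the compiler sees a constant trip count. */
    while (block_size == BLOCK_SIZE1) {
        const double *x = (const double *)*dataptr;
        for (npy_intp i = 0; i < BLOCK_SIZE1; i++) {
            acc += x[i];
        }
        if (!iternext(iter)) {
            return acc;  /* iteration ended exactly on a full block */
        }
        block_size = *sizeptr;
    }

    /* Tail: the remaining blocks, whose sizes may vary. */
    if (block_size > 0) do {
        const double *x = (const double *)*dataptr;
        npy_intp n = *sizeptr;
        for (npy_intp i = 0; i < n; i++) {
            acc += x[i];
        }
    } while (iternext(iter));

    return acc;
}
```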
This fast path may no longer be hit, but only for non-contiguous, non-reduction use-cases (reduce-to-scalar excepted), because:

- NumPy may now shrink the buffersize a bit to align better with the iteration shape.
- NumPy will more often hand out intermittently smaller buffers/chunks that are then grown back to full size. (Previously this was common only in reduction operations.)
For contiguous ops without a reduction (or with a reduction along all axes), you still always get the requested buffersize (until the end).
So in the end, my hope is that the fast path still kicks in for the most relevant use-cases. But if you or someone else notices a performance regression, I can take a closer look.
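If it helps, a hypothetical instrumentation loop like this (names are mine, not numexpr's) would show whether blocks still arrive at the full compile-time size:

```cpp
#include <stdio.h>
#include <numpy/arrayobject.h>

/* Count full-size versus partial blocks delivered by a buffered
 * iterator. Diagnostic sketch; it consumes the iterator without
 * doing any real work. */
static void
count_block_sizes(NpyIter *iter, NpyIter_IterNextFunc *iternext,
                  npy_intp *sizeptr, npy_intp full_size)
{
    npy_intp full = 0, partial = 0;
    do {
        if (*sizeptr == full_size) {
            full++;
        }
        else {
            partial++;
        }
    } while (iternext(iter));
    fprintf(stderr, "full blocks: %" NPY_INTP_FMT
                    ", partial blocks: %" NPY_INTP_FMT "\n",
            full, partial);
}
```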