Discussion:
[PATCH][AArch64] Adjust writeback in non-zero memset
Wilco Dijkstra
2018-11-06 14:42:10 UTC
Permalink
This fixes an inefficiency in the non-zero memset. Delaying the writeback
until the end of the loop is slightly faster on some cores - this shows
~5% performance gain on Cortex-A53 when doing large non-zero memsets.
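For readers less familiar with AArch64 addressing modes, the change can be modelled in C. This is a hypothetical sketch, not the actual memset code: it shows only the loop-addressing idea, where the pointer enters the loop biased by -32, each iteration stores two 32-byte blocks at offsets +32 and +64, and the pointer update (the writeback) happens together with the second store, as in `stp q0, q0, [dst, 64]!`. The function name and the assumption that `count` is a positive multiple of 64 are mine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the patched loop's addressing (assumes count is a positive
   multiple of 64; real memset.S also handles head/tail fragments). */
static void fill64_biased(uint8_t *dst, uint8_t c, size_t count)
{
    dst -= 32;                      /* sub dst, dst, 16 twice-over: bias by -32 */
    long remaining = (long)count;
    do {
        memset(dst + 32, c, 32);    /* stp q0, q0, [dst, 32]  */
        memset(dst + 64, c, 32);    /* stp q0, q0, [dst, 64]! */
        dst += 64;                  /* ...writeback fused with the last store */
        remaining -= 64;            /* subs count, count, 64  */
    } while (remaining > 0);        /* b.hi 1b                */
}
```

The point of the reordering is that the address increment is carried by the final store's pre-index writeback instead of by the first store, which some in-order cores such as Cortex-A53 schedule more efficiently.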

Tested against the GLIBC testsuite.

---

diff --git a/newlib/libc/machine/aarch64/memset.S b/newlib/libc/machine/aarch64/memset.S
index 799e7b7874a397138c5c85cfa2adb85f63c94cef..7c8fe583bf88722d73b90ec470c72b509e5be137 100644
--- a/newlib/libc/machine/aarch64/memset.S
+++ b/newlib/libc/machine/aarch64/memset.S
@@ -142,10 +142,10 @@ L(set_long):
b.eq L(try_zva)
L(no_zva):
sub count, dstend, dst /* Count is 16 too large. */
- add dst, dst, 16
+ sub dst, dst, 16 /* Dst is biased by -32. */
sub count, count, 64 + 16 /* Adjust count and bias for loop. */
-1: stp q0, q0, [dst], 64
- stp q0, q0, [dst, -32]
+1: stp q0, q0, [dst, 32]
+ stp q0, q0, [dst, 64]!
L(tail64):
subs count, count, 64
b.hi 1b
Richard Earnshaw (lists)
2018-11-06 15:01:35 UTC
Permalink
Post by Wilco Dijkstra
This fixes an inefficiency in the non-zero memset. Delaying the writeback
until the end of the loop is slightly faster on some cores - this shows
~5% performance gain on Cortex-A53 when doing large non-zero memsets.
Tested against the GLIBC testsuite.
Thanks, pushed.

R.