I ran some vvsin and vvsqrt tests (on doubles) against a randomized dataset of 100k elements. I broke the work up into chunks of varying size, from 10 to 50,000 element arrays, and then averaged the results over multiple runs.
On my 3 GHz Xeon system, vvsqrt was consistently about twice as fast as a sqrt loop, regardless of array size.
Green lines represent a standard C loop over the math function. Blue lines are several runs of the vForce function in question. Array chunk sizes are along the x-axis, time in milliseconds along the y-axis.
Lin-log plot of vvsqrt
On the next graph, I plotted the timing ratio of vForce to libm. The green line represents a 1:1 ratio. The blue line is from a 3 GHz Xeon Mac Pro (two 4-core CPUs) and the reddish line is from a 2.8 GHz i7 iMac (one 4-core CPU). There is a bump of bad performance right at a chunk size of 1024, and then greater gains from there on, as the algorithm for vvsin turns on GCD for multicore machines. I missed this bump when I first started testing, because my data set sizes were all powers of 10.
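One way to avoid missing a threshold like the one at 1024 is to sweep chunk sizes that include every power of two and its immediate neighbours, in addition to the powers of ten. The sketch below builds such a list; `chunk_sizes` and `cmp_int` are hypothetical helpers, not part of vForce or of the original test code.

```c
#include <stdlib.h>

/* Ascending comparison for qsort. */
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Fills out[] with sorted, unique chunk sizes in [lo, hi]: every power of
   ten, plus every power of two together with the sizes one below and one
   above it. Returns the number of sizes written, or -1 if `cap` entries
   are not enough. */
static int chunk_sizes(int *out, int cap, int lo, int hi) {
    int n = 0;

    /* Powers of ten: 10, 100, 1000, ... */
    for (long long p = 10; p <= hi; p *= 10) {
        if (p >= lo) {
            if (n == cap) return -1;
            out[n++] = (int)p;
        }
    }

    /* Powers of two and their neighbours: p-1, p, p+1. */
    for (long long p = 2; p - 1 <= hi; p *= 2) {
        for (long long s = p - 1; s <= p + 1; s++) {
            if (s >= lo && s <= hi) {
                if (n == cap) return -1;
                out[n++] = (int)s;
            }
        }
    }

    /* Sort, then drop duplicates (for example 3 = 2+1 = 4-1). */
    qsort(out, (size_t)n, sizeof out[0], cmp_int);
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (m == 0 || out[i] != out[m - 1]) {
            out[m++] = out[i];
        }
    }
    return m;
}
```

Calling `chunk_sizes(sizes, 64, 10, 50000)` covers the 10 to 50,000 range used here and includes 1023, 1024, and 1025, so a change in behaviour at exactly 1024 shows up as a step between adjacent points.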