build dps with ARM neon 64 fix for release mode build add in threading protection and seperate out rng added callback function and updates to README update default handle to lock, and add finished handle call cleanup after veiwing diff of changes
4.5 KiB
Building wolfSSL with DSP Use
Intro
This directory is to help with building wolfSSL for use with DSP. It assumes that the Hexagon SDK has been setup on the machine and that the enviorment variables have been set by calling (source ~/Qualcomm/Hexagon_SDK/3.4.3/setup_sdk_env.source). Currently offloading ECC 256 verify operations to the DSP is supported. When WOLFSSL_DSP is defined ECC verify operations are offloaded to the aDSP by default. When not in SINGLE_THREADED mode a call back function must be set for getting the handle or a handle must be set in the ecc_key structure for the operation to make use of multiple threads when offloading to the DSP. This is because creating new handles for new threads must be done.
Building
The directory is divided up into a build for the CPU portion in IDE/HEXAGON and a build for use on the DSP located in IDE/HEXAGON/DSP. Each section has their own Makefile. The Makefile default to an Ubuntu + hexagon v65 release build but can be changed by using V=. An example of building both would be:
cd IDE/HEXAGON
make V=UbuntuARM_Release_aarch64
cd DSP
make V=hexagon_Release_dynamic_toolv83_v65
The results from each build will be placed into the ship directories of each, for example ./UbuntuARM_Release_aarch64/ship/* and ./DSP/hexagon_Release_dynamic_toolv83_v65/ship/*. The Makefile creates a DSP library libwolfssl_dsp_skel.so, library libwolfssl.so, executable benchmark, example ecc-verify, example ecc-verify-benchmark and executable testwolfcrypt. These then need pushed to the device in order to run. An example of pushing the results to the device would be:
cd IDE/HEXAGON
adb push DSP/hexagon_Release_dynamic_toolv83_v65/ship/libwolfssl_dsp_skel.so /data/rfsa/adsp/
adb push UbuntuARM_Release_aarch64/ship/libwolfssl.so /data/
adb push UbuntuARM_Release_aarch64/ship/benchmark /data/
adb push UbuntuARM_Release_aarch64/ship/eccverify /data/
adb push UbuntuARM_Release_aarch64/ship/eccbenchmark /data/
To change the settings wolfSSL is built with macros can be set in IDE/HEXAGON/user_settings.h. It contains a default setting at this point that was used for collecting benchmark values. The macro necessary to turn on use of the DSP is WOLFSSL_DSP.
The script IDE/HEXAGON/build.sh was added to help speed up building and testing. An example of using the script would be:
cd IDE/HEXAGON
./build.sh Release
This will delete the previous build and rebuild for Release mode. Then it will try to push the resulting library and some of the executables to the device.
For increased performance uncomment the -O3 flag in IDE/HEXAGON/Makefile and IDE/HEXAGON/DSP/Makefile.
Use
A default handle is created with the call to wolfCrypt_Init() and is set to use the aDSP. A default mutex is locked for each use of the handle to make the library stable when multiple threads are calling to DSP supported operations. To use wolfSSL with a user created handle it can be done by calling wc_ecc_set_handle or by setting a callback function using wolfSSL_SetHandleCb(). This should be set in the case of multithreaded applications to account for having a handle for each thread being used.
wolfSSL_SetHandleCb
The API wolfSSL_SetHandleCb takes a function pointer of type "int (*wolfSSL_DSP_Handle_cb)(remote_handle64 *handle, int finished void *ctx);". This callback is executed right before the operation is handed off to the DSP (finished set to 0) and right after done with the handle (finished set to 1). With ECC this would be after the ECC verify function has been called but before the information is passed on to the DSP and once again with the finished flag set after the result is returned.
The callback set should return 0 on sucessfully setting the input handle. The ctx argument is for future custom context to be passed in and is currently not used.
Expected Performance
This is the expected results from running ./eccbenchmark using the -O3 flag
benchmarking using default (locks on handle for aDSP) 5000 verifies on 1 threads took 17.481616 seconds 10000 verifies on 2 threads took 35.324308 seconds
benchmarking using software (+NEON if built in) 5000 verifies on 1 threads took 1.398336 seconds 10000 verifies on 2 threads took 1.383992 seconds
benchmarking using threads on aDSP 5000 verifies on 1 threads took 17.616811 seconds 10000 verifies on 2 threads took 19.215413 seconds 15000 verifies on 3 threads took 20.410200 seconds 20000 verifies on 4 threads took 23.261446 seconds
benchmarking 1 thread on cDSP 5000 verifies on 1 threads took 18.560995 seconds