It's clear that float16 can save bandwidth, but can float16 also save compute cycles when computing transcendental functions, like exp()?
Can the float16 data type save compute cycles while computing transcendental functions?
If your hardware has full support for it, not just conversion to float32, then yes, definitely. E.g. on a GPU, or on Intel Alder Lake (with AVX-512 enabled) or Sapphire Rapids; see Half-precision floating-point arithmetic on Intel chips. Or apparently on Apple M2 CPUs.
If you can do two 64-byte SIMD vectors of FMAs per clock on a core, you go twice as fast if that's 32 half-precision FMAs per vector instead of 16 single-precision FMAs.
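As a rough illustration (assuming AVX512-FP16 hardware, e.g. compiling with -mavx512fp16 for a Sapphire Rapids target), the half-precision FMA intrinsic handles twice as many elements per instruction as the single-precision one, on the same FMA execution ports:

```c
#include <immintrin.h>

// Same per-instruction cost for a 512-bit vector, but twice the elements in fp16:
__m512h fma_half(__m512h a, __m512h b, __m512h c) {
    return _mm512_fmadd_ph(a, b, c);   // 32 half-precision FMAs per instruction
}

__m512 fma_single(__m512 a, __m512 b, __m512 c) {
    return _mm512_fmadd_ps(a, b, c);   // 16 single-precision FMAs per instruction
}
```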
Speed vs. precision tradeoff: only enough precision for FP16 is needed
Without hardware ALU support for FP16, you can only save cycles by not requiring as much precision, because you know you're eventually going to round to fp16. So you'd use polynomial approximations of lower degree, and thus fewer FMA operations, even though you're computing with float32.
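For instance, with Horner's rule each extra polynomial term costs roughly one FMA on the critical path, so a lower-precision target directly shortens the dependency chain. A sketch (the coefficient array and degrees are illustrative assumptions, not real minimax fits):

```c
// Horner evaluation: one multiply-add per coefficient after the first.
// Targeting float32 (24-bit significand) might need, say, degree ~6;
// targeting fp16 (11-bit significand), degree ~3 can be enough.
static inline float poly_eval(const float *c, int degree, float x) {
    float r = c[degree];
    for (int i = degree - 1; i >= 0; --i)
        r = r * x + c[i];   // compilers contract this to an FMA with -ffp-contract
    return r;
}
```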
BTW, exp and log are interesting for floating point because the format itself is built around an exponential representation. So you can compute an exponential by converting fp->int and stuffing that integer into the exponent field of an FP bit pattern. Then, with the fractional part of your FP number, you use a polynomial approximation to get the significand of the result. A log implementation is the reverse: extract the exponent field and use a polynomial approximation of log of the mantissa, over a range like 1.0 to 2.0. (A minimal sketch follows the links below.) See:
- Efficient implementation of log2(__m256d) in AVX2
- Fastest Implementation of Exponential Function Using AVX
- Very fast approximate Logarithm (natural log) function in C++?
- vgetmantps vs andpd instructions for getting the mantissa of float
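Here's a minimal scalar sketch of that exp trick (the log version is just the reverse). The polynomial uses plain Taylor coefficients for illustration, not a tuned minimax fit, and there's no range or NaN handling, so it's only good to roughly fp16-level accuracy:

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

float fast_exp2(float x) {
    float xi = floorf(x);      // integer part -> will become the exponent field
    float xf = x - xi;         // fractional part in [0, 1)
    // Low-degree polynomial approximating 2^xf on [0, 1)
    // (Taylor coefficients ln2, ln2^2/2, ln2^3/6 -- illustrative only).
    float p = 1.0f + xf * (0.6931f + xf * (0.2402f + xf * 0.0555f));
    // Stuff the biased integer part into the exponent field of a float.
    uint32_t bits = (uint32_t)((int32_t)xi + 127) << 23;
    float scale;
    memcpy(&scale, &bits, sizeof scale);   // scale = 2^xi
    return scale * p;                      // 2^x = 2^xi * 2^xf
}
```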
Normally you do want some FP operations, so I don't think it would be worth trying to use only 16-bit integer operations to avoid unpacking to float32, even for exp or log, which are somewhat special and intimately connected with floating point's significand * 2^exponent format, unlike sin/cos/tan or other transcendental functions.

So I think your best bet would normally still be to start by converting fp16 to fp32, if you don't have instructions like AVX-512 FP16 that can do actual FP math on it. But you can gain performance from not needing as much precision, since implementing these functions normally involves a speed vs. precision tradeoff.
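If all you have is F16C (conversion only, available on most x86 CPUs since Ivy Bridge / Zen), the loop looks like the sketch below. `approx_exp_ps` is a hypothetical stand-in for a low-degree vectorized approximation like the scalar one above:

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical low-degree exp approximation, accurate only to ~fp16 precision.
__m256 approx_exp_ps(__m256 x);

// Sketch: widen fp16 -> fp32, compute with reduced-precision math, round back.
void exp_fp16(const uint16_t *in, uint16_t *out, size_t n) {
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m128i h = _mm_loadu_si128((const __m128i *)(in + i));
        __m256  f = _mm256_cvtph_ps(h);     // 8 halves -> 8 floats (F16C)
        f = approx_exp_ps(f);               // cheap because the target is fp16
        __m128i r = _mm256_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT);
        _mm_storeu_si128((__m128i *)(out + i), r);
    }   // (remainder elements omitted for brevity)
}
```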