How to calculate cycles/issue for floating point additions for 1 core?

(Not Homework!!)
This is my processor:
Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz
Number of cores : 2
While reading about FLOPS I came across a question that asks me to calculate cycles/issue for floating point additions for 1 core. (I know the number of cores I have.)
Should I study the architecture of my machine and calculate the cycles?
I am not sure what an "issue" is (an instruction?), or how you would calculate the cycles.
I would appreciate it if someone could give me a hint.
Thank you.

According to the Intel Optimization Manual, without SIMD instructions it should be 1 cycle/issue.
The manual lists Intel Core i5 parts under the Westmere microarchitecture (06_25H, 06_2CH and 06_2FH), which supports SSE; your i5-3210M is actually the newer Ivy Bridge, which supports both SSE and AVX. With SSE instructions you should get double the throughput (two double-precision additions per instruction), and with AVX you would get 4x.
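As a back-of-the-envelope check, peak addition throughput for one core is just the clock rate times additions issued per cycle. A minimal sketch (the per-cycle figures below are the commonly quoted double-precision values and are assumptions, not measurements):

    #include <cstdio>

    int main() {
        const double clock_ghz = 2.5;    // i5-3210M base clock

        // Assumed double-precision additions issued per cycle on one core:
        const double scalar_adds = 1.0;  // one scalar add per cycle
        const double sse_adds    = 2.0;  // 128-bit SSE: 2 doubles per instruction
        const double avx_adds    = 4.0;  // 256-bit AVX: 4 doubles per instruction

        printf("scalar: %.1f GFLOP/s\n", clock_ghz * scalar_adds);
        printf("SSE:    %.1f GFLOP/s\n", clock_ghz * sse_adds);
        printf("AVX:    %.1f GFLOP/s\n", clock_ghz * avx_adds);
        return 0;
    }

Under those assumptions that works out to 2.5, 5 and 10 GFLOP/s of additions for a single core at 2.5 GHz.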

Related

Hardware for Deep Learning

I have a couple of questions on hardware for a Deep Learning project I'm starting; I intend to use PyTorch for neural networks.
I am thinking about going for an 8th-gen CPU on a Z390 board (I'll wait a month to see if prices drop after 9th-gen CPUs are available) so I still get a cheaper CPU that can be upgraded later.
Question 1) Are CPU cores going to be beneficial? Would getting the latest Intel chips be worth it for the extra cores, and if CPU cores will be helpful, should I just go AMD?
I am also thinking about getting a 1080 Ti and then later, once I'm more proficient, adding two more 2080 Tis. I would go for more, but it's difficult to find a board that fits 4.
Question 2) Does mixing GPUs affect parallel processing? Should I just get a 2080 Ti now and then buy another two later? And as part b to this question: do the lane speeds matter? Should I spend more on a board that doesn't slow down the PCIe slots when you use more than one?
Question 3) More RAM? 32 GB seems plenty, so 2x16 GB sticks on a board that has 4 slots supporting up to 64 GB.
What also matters when running multiple GPUs is the number of available PCIe lanes. If you may go for up to 4 GPUs, I'd go for an AMD Threadripper for its 64 PCIe lanes.
For machine learning in general, core and thread count is quite important, so Threadripper is still a good option, depending on the budget of course.
A few people mention that running a separate instance on each GPU may be more interesting; if you do so, mixing GPUs is not a problem.
32 GB of RAM seems good; no need to go for 4 sticks if your CPU does not support quad channel.

Laptop requirements with Kinect Xbox 1

I am using a Kinect Xbox 1 for Windows camera to capture skeleton data and RGB data. I am retrieving 30 frames per second, calculating the joint positions of the human body, and then calculating the angles between joints, which I store to a directory. I want my laptop/system to compute the joint and angle values faster, but the laptop I am currently using computes them very slowly.
The specifications of my laptop are:
500 GB hard drive
600 GB RAM
1.7 GHz processor
Kindly tell me which system I should use to get faster calculations. I want a really fast system/laptop. If anyone has an idea, please tell me.
Also, please tell me the complete specifications of such a system. I want to use the latest, fastest technology or any machine that resolves my issue.
Your computer must have the following minimum capabilities:
32-bit (x86) or 64-bit (x64) processors
Dual-core, 2.66-GHz or faster processor
USB 2.0 bus dedicated to the Kinect
2 GB of RAM
Graphics card that supports DirectX 9.0c
Source: MSDN
Anyway, I suggest:
A desktop PC
with a multi-core processor at 3 GHz or faster (more is usually better)
with a GPU compatible with DirectX 11 and C++ AMP
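For what it's worth, the per-frame joint-angle math itself is trivial for any modern CPU; the bottleneck is usually elsewhere (the SDK, skeleton tracking, or the disk writes). A minimal sketch of the angle at a middle joint, with placeholder coordinates rather than real Kinect SDK calls:

    #include <cmath>
    #include <cstdio>

    struct Vec3 { float x, y, z; };

    // Angle (in degrees) at joint b, formed by the bones b->a and b->c.
    float joint_angle(Vec3 a, Vec3 b, Vec3 c) {
        Vec3 u = {a.x - b.x, a.y - b.y, a.z - b.z};
        Vec3 v = {c.x - b.x, c.y - b.y, c.z - b.z};
        float dot  = u.x * v.x + u.y * v.y + u.z * v.z;
        float lens = std::sqrt(u.x * u.x + u.y * u.y + u.z * u.z) *
                     std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
        return std::acos(dot / lens) * 180.0f / 3.14159265f;
    }

    int main() {
        // Placeholder shoulder/elbow/wrist positions (not real sensor data).
        printf("elbow angle: %.1f deg\n",
               joint_angle({0, 1, 0}, {0, 0, 0}, {1, 0, 0}));
        return 0;
    }

At 20 joints x 30 frames per second this is a negligible amount of arithmetic on any of the machines discussed above.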

Crystal/Core/MPU clock rate differences

I have an embedded system which on boot-up shows the following:
Clocking rate (Crystal/Core/MPU): 12.0/400/1000 MHz
Can anybody explain the differences between these three clock rates?
The processor is an ARMv7, OMAP3xxx.
As Clement mentioned, the 12.0 is the frequency in MHz of the external oscillator. Core and MPU are the frequencies of the internal PLLs.
The MPU is the Microprocessor Unit subsystem. This is the actual Cortex-A8 core as well as some closely related peripherals, so your MPU is running at 1000 MHz, or 1 GHz. This is similar to the CPU frequency in your computer.
In the AM335x, the Core PLL is responsible for the following subsystems: SGX, EMAC, L3S, L3F, L4F, L4_PER, L4_WKUP, PRUSS IEP, Debugss. The subsystems may differ slightly based on the particular chip you are working with. Yours is running at 400 MHz. This can be thought of as similar to the Front Side Bus (FSB) frequency in your computer, though the analogy isn't exact.
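As a rough illustration of the relationship (the M/N values below are invented for the example, not taken from the OMAP3 TRM), each internal PLL multiplies the 12 MHz crystal reference up to its output rate:

    #include <cstdio>

    // Illustrative only: output = reference * M / N, with M and N chosen so the
    // 12 MHz crystal reaches the reported Core and MPU rates.
    int main() {
        const double crystal_mhz = 12.0;
        struct Pll { const char* name; int m; int n; };
        const Pll plls[] = {
            {"Core", 100, 3},   // 12 * 100 / 3  =  400 MHz
            {"MPU",  250, 3},   // 12 * 250 / 3  = 1000 MHz
        };
        for (const Pll& p : plls)
            printf("%s PLL: %.0f MHz\n", p.name, crystal_mhz * p.m / p.n);
        return 0;
    }

The actual multiplier/divider registers and their values are chip-specific; check the TRM for your exact OMAP3 part.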
12 MHz is the frequency of the crystal oscillator present on the board to give a time reference.
A TI OMAP contains 2 cores: an ARM and a DSP. The terminology used here is not clear, but it may be the frequencies of these cores. Check your datasheet to be sure.

SSE program takes a lot longer on AMD than on Intel

I am working on optimizing an algorithm using SSE2 instructions, but I ran into this problem when testing the performance:
I) Intel E6750
Running the non-SSE2 algorithm 4 times takes 14.85 seconds
Running the SSE2 algorithm once (processing the same data) takes 6.89 seconds
II) Phenom II X4 2.8 GHz
Running the non-SSE2 algorithm 4 times takes 11.43 seconds
Running the SSE2 algorithm once (processing the same data) takes 12.15 seconds
Can anyone help me understand why this is happening? I'm really confused by the results.
In both cases I'm compiling with g++ using the -O3 flag.
PS: The algorithm doesn't use floating-point math; it uses SSE integer instructions.
Intel has made big improvements to their SSE implementation over the last 5 years or so, which AMD has not really kept up with. Originally both were really just 64-bit execution units, and 128-bit operations were broken down into 2 micro-ops. Ever since Core and Core 2 were introduced though, Intel CPUs have had a full 128-bit SSE implementation, which means that 128-bit operations effectively got a 2x throughput boost (1 micro-op versus 2). More recent Intel CPUs also have multiple SSE execution units, which means you can get more than 1 instruction per clock of throughput for 128-bit SIMD instructions.
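For reference, the kind of 128-bit operation being discussed looks like the loop below; a minimal SSE2 integer sketch (the array names and the scalar tail are illustrative, not taken from the original algorithm). On a CPU with a full 128-bit SSE unit each packed add is a single micro-op, while an implementation with 64-bit-wide units splits it in two:

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstdint>
    #include <cstddef>

    // Adds two arrays of 32-bit integers, 4 elements per instruction.
    void add_arrays_sse2(const int32_t* a, const int32_t* b, int32_t* out, size_t n)
    {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
            __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
            __m128i vr = _mm_add_epi32(va, vb);   // 4 additions in one instruction
            _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), vr);
        }
        for (; i < n; ++i)                        // scalar tail for leftover elements
            out[i] = a[i] + b[i];
    }

Whether such a loop beats the scalar version by 2x, 4x, or not at all depends on exactly this micro-op splitting, plus memory bandwidth and how well the compiler schedules the surrounding code.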

Optimizing CUDA kernels regarding registers

I'm using the CUDA Occupancy Calculator to try to optimize my CUDA kernel. Currently I'm using 34 registers and zero shared memory, so the maximum occupancy is 63% for 310 threads per block. If I could somehow get the register count down to 20 or below (e.g. by passing kernel parameters via shared memory), I could get an occupancy of 100%. Is this a good way to do it, or would you advise another path of optimization?
I'm also wondering if there's a newer version of the occupancy calculator for compute capability 2.1.
Some points to consider:
320 threads per block will give the same occupancy as 310, because occupancy is defined as active warps / maximum warps per SM, and the warp size is always 32 threads. You should never use a block size that is not a round multiple of 32; that just wastes cores and cycles (a small sketch of this warp arithmetic follows these points).
Kernel parameters are passed in constant memory on your compute 2.1 device, and they have no effect on occupancy or register usage.
The GPU design has a pipeline latency of about 21 cycles. So for a Fermi GPU, you need about 43% occupancy to cover all of the internal scheduling latency. Once that is done, you may find that there is relatively little benefit in trying to achieve higher occupancy.
Striving for 100% occupancy is rarely a constructive optimization goal. If you have not done so, I highly recommend looking over Vasily Volkov's presentation from GTC 2010, "Better Performance at Lower Occupancy", where he shows all sorts of surprising results, like code hitting 85% of peak memory bandwidth at 8% occupancy.
The newest occupancy calculator doesn't cover compute 2.1, but the effective occupancy rules for compute 2.0 apply to 2.1 devices too. The extra cores in the compute 2.1 multiprocessor come into play via instruction level parallelism and what is almost out of order execution. That really doesn't change the occupancy characteristics of the device at all compared to compute 2.0 devices.
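To make the warp arithmetic in the first point concrete, here is a rough host-side sketch (the SM limits are the Fermi compute 2.0 values and the resident-block count is an assumption based on 34 registers/thread; it is not a replacement for the official calculator):

    #include <cstdio>

    // Why 310 and 320 threads/block give the same occupancy: the hardware
    // always schedules whole warps of 32 threads.
    int main() {
        const int warp_size        = 32;
        const int max_warps_per_sm = 48;   // Fermi (compute 2.0/2.1) limit

        const int block_sizes[] = {310, 320};
        for (int threads_per_block : block_sizes) {
            int warps_per_block = (threads_per_block + warp_size - 1) / warp_size; // rounds up
            // Assume register pressure allows 3 resident blocks per SM
            // (34 regs x 320 threads x 3 blocks fits in the 32K register file).
            int blocks_per_sm   = 3;
            int active_warps    = warps_per_block * blocks_per_sm;
            double occupancy    = 100.0 * active_warps / max_warps_per_sm;
            printf("%d threads/block -> %d warps/block -> %.1f%% occupancy\n",
                   threads_per_block, warps_per_block, occupancy);
        }
        return 0;
    }

Both block sizes round up to 10 warps, giving roughly the 63% figure the calculator reports; the difference is that 310 threads leaves the last warp partially empty.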
talonmies is correct: occupancy is overrated.
Vasily Volkov had a great presentation at GTC2010 on this topic: "Better Performance at Lower Occupancy."
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf