How to pick target hardware for OpenCL - GPU

How do you tell OpenCL to target a build for a GPU instead of a CPU? Will it automatically pick one over the other?

OpenCL will not automatically pick a device for you. You have to explicitly choose a platform (Intel/AMD/Nvidia) and a device (CPU/GPU) on that platform. Platform #0 and device #0 will not always give you the GPU. This is quite cumbersome when running code on different computers, as you have to select the device manually on each one.
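For illustration, a minimal manual selection with the OpenCL C++ bindings might look like this sketch; it takes the first GPU reported by any platform (not necessarily the fastest), and the header name varies by SDK:
#include <CL/cl.hpp> // may be <CL/opencl.hpp> or <CL/cl2.hpp> depending on your SDK
std::vector<cl::Platform> cl_platforms;
cl::Platform::get(&cl_platforms); // enumerate all platforms (drivers)
cl::Device cl_gpu;
bool gpu_found = false;
for(const cl::Platform& platform : cl_platforms) {
	std::vector<cl::Device> gpus;
	// with exceptions enabled, getDevices throws instead of returning an error code
	if(platform.getDevices(CL_DEVICE_TYPE_GPU, &gpus)==CL_SUCCESS && !gpus.empty()) {
		cl_gpu = gpus[0]; // first GPU found
		gpu_found = true;
		break;
	}
}
// if gpu_found is false, fall back to CL_DEVICE_TYPE_CPU or CL_DEVICE_TYPE_ALL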
However, there is a smart solution for this: a lightweight OpenCL-Wrapper that automatically picks the fastest available GPU (or CPU if no GPU is available) for you. It works by reading out the number of compute units and the clock frequency, and filling in the missing information (number of cores per CU) from the vendor and device name via a small database.
Find the source code with an example here.
Here is just the code for automatically selecting the fastest device:
vector<cl::Device> cl_devices; // get all devices of all platforms
{
	vector<cl::Platform> cl_platforms; // get all platforms (drivers)
	cl::Platform::get(&cl_platforms);
	for(uint i=0u; i<(uint)cl_platforms.size(); i++) {
		vector<cl::Device> cl_devices_available;
		cl_platforms[i].getDevices(CL_DEVICE_TYPE_ALL, &cl_devices_available); // to query only GPUs, use CL_DEVICE_TYPE_GPU here
		for(uint j=0u; j<(uint)cl_devices_available.size(); j++) {
			cl_devices.push_back(cl_devices_available[j]);
		}
	}
}
cl::Device cl_device; // select fastest available device
{
	float best_value = 0.0f;
	uint best_i = 0u; // index of fastest device
	for(uint i=0u; i<(uint)cl_devices.size(); i++) { // find device with highest (estimated) floating-point performance
		const string name = trim(cl_devices[i].getInfo<CL_DEVICE_NAME>()); // device name
		const string vendor = trim(cl_devices[i].getInfo<CL_DEVICE_VENDOR>()); // device vendor
		const uint compute_units = (uint)cl_devices[i].getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(); // compute units (CUs) can contain multiple cores depending on the microarchitecture
		const uint clock_frequency = (uint)cl_devices[i].getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>(); // in MHz
		const bool is_gpu = cl_devices[i].getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU;
		const uint ipc = is_gpu?2u:32u; // IPC (instructions per cycle) is 2 for GPUs and 32 for most modern CPUs
		const bool nvidia_192_cores_per_cu = contains_any(to_lower(name), {" 6", " 7", "ro k", "la k"}) || (clock_frequency<1000u&&contains(to_lower(name), "titan")); // identify Kepler GPUs
		const bool nvidia_64_cores_per_cu = contains_any(to_lower(name), {"p100", "v100", "a100", "a30", " 16", " 20", "titan v", "titan rtx", "ro t", "la t", "ro rtx"}) && !contains(to_lower(name), "rtx a"); // identify P100, Volta, Turing, A100, A30
		const bool amd_128_cores_per_dualcu = contains(to_lower(name), "gfx10"); // identify RDNA/RDNA2 GPUs where dual CUs are reported
		const float nvidia = (float)(contains(to_lower(vendor), "nvidia"))*(nvidia_192_cores_per_cu?192.0f:(nvidia_64_cores_per_cu?64.0f:128.0f)); // Nvidia GPUs have 192 cores/CU (Kepler), 128 cores/CU (Maxwell, Pascal, Ampere) or 64 cores/CU (P100, Volta, Turing, A100)
		const float amd = (float)(contains_any(to_lower(vendor), {"amd", "advanced"}))*(is_gpu?(amd_128_cores_per_dualcu?128.0f:64.0f):0.5f); // AMD GPUs have 64 cores/CU (GCN, CDNA) or 128 cores/dualCU (RDNA, RDNA2), AMD CPUs (with SMT) have 1/2 core/CU
		const float intel = (float)(contains(to_lower(vendor), "intel"))*(is_gpu?8.0f:0.5f); // Intel integrated GPUs usually have 8 cores/CU, Intel CPUs (with HT) have 1/2 core/CU
		const float arm = (float)(contains(to_lower(vendor), "arm"))*(is_gpu?8.0f:1.0f); // ARM GPUs usually have 8 cores/CU, ARM CPUs have 1 core/CU
		const uint cores = to_uint((float)compute_units*(nvidia+amd+intel+arm)); // for CPUs, compute_units is the number of threads (twice the number of cores with hyperthreading)
		const float tflops = 1E-6f*(float)cores*(float)ipc*(float)clock_frequency; // estimated device FP32 floating-point performance in TeraFLOPs/s
		if(tflops>best_value) {
			best_value = tflops;
			best_i = i;
		}
	}
	const string name = trim(cl_devices[best_i].getInfo<CL_DEVICE_NAME>()); // device name
	cl_device = cl_devices[best_i];
	print_info(name); // print device name
}
Alternatively, you can make it automatically choose the device with the most memory rather than the most FLOPs, or a device with a specified ID from the list of all devices across all platforms. There are many more benefits to using this wrapper, for example significantly simpler code for working with arrays and automatic tracking of total device memory allocation, all without impacting performance in any way.
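For instance, the most-memory criterion is just a different figure of merit in the same selection loop; a minimal sketch, assuming cl_devices has been gathered as shown above:
cl::Device cl_device_most_memory;
cl_ulong best_memory = 0ull;
for(uint i=0u; i<(uint)cl_devices.size(); i++) {
	const cl_ulong memory = cl_devices[i].getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>(); // device memory capacity in bytes
	if(memory>best_memory) {
		best_memory = memory;
		cl_device_most_memory = cl_devices[i];
	}
}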

Related

Measure RAM consumption at runtime for Nano 33 BLE Sense

I am currently working with the Nano 33 BLE Sense. I want to check how much SRAM the code is using at runtime. In my code, I simply define an array that stores the analog signal coming from a sensor. I used the approaches discussed in the following links to track memory usage at runtime.
https://forum.arduino.cc/t/solved-measure-free-sram-of-nano-33-ble-sense-at-runtime/697716
Checking memory footprint in Arduino
https://learn.adafruit.com/memories-of-an-arduino/measuring-free-memory
However, when I change the array size from 100 elements to 1000, the reported SRAM and flash usage do not change with any of the solutions above. Here is my code in the Arduino IDE:
namespace
{
  constexpr int ECGBufferSize = 10;
  int counter = 0;
  int16_t ECG_buffer[ECGBufferSize];
}

void setup() {
  // put your setup code here, to run once:
  Serial.begin(9600);
}

void loop()
{
  // put your main code here, to run repeatedly:
  int hello[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
  while(counter < ECGBufferSize)
  {
    ECG_buffer[counter] = analogRead(A7);
    counter++;
    Serial.print(counter);
    Serial.print("\n");
    delay(5);
  }
  counter = 0;
  delay(10000);
}
Please help me find the correct way to track the memory usage of my Nano 33 BLE Sense.
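For what it's worth, a common estimate of free SRAM at runtime on ARM-based Arduino boards is the gap between the current stack top and the heap break. This is only a sketch, assuming the core's C library exposes sbrk; on the mbed-based Nano 33 BLE the heap layout differs, so treat the value as a rough estimate:
extern "C" char *sbrk(int incr); // newlib heap-break function; its availability on this core is an assumption

int freeMemoryEstimate() {
  char top; // the address of a local variable marks the current stack top
  return &top - (char *)sbrk(0); // bytes between stack top and heap break
}

void setup() {
  Serial.begin(9600);
}

void loop() {
  Serial.println(freeMemoryEstimate()); // should drop as buffers grow
  delay(1000);
}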

Reading 32-bit SPI data using 16-bit processor (dsPIC)

I am interfacing an M90E32AS energy meter IC with a dsPIC33F series processor. I successfully read the voltage and current values. I am now trying to read the power values as well; per the datasheet, the power registers are 32 bits wide. I tried the following code to read the 32-bit value but was unsuccessful. Help me rectify the error.
int PmB_read()
{
	CS=0;
	SPI2BUF=SBUF=0x80B2;
	while(SPI2STATbits.SPITBF==1){}
	SPI2BUF=0x0;
	while(SPI2STATbits.SPITBF==1){}
	delay();
	CS=1;
	HiByte = SPI2BUF;
	return HiByte;
}

int PmBLSB_read()
{
	CS=0;
	SPI2BUF=SBUF=0x80C2;
	while(SPI2STATbits.SPITBF==1){}
	SPI2BUF=0x0;
	while(SPI2STATbits.SPITBF==1){}
	delay();
	CS=1;
	LoByte = SPI2BUF;
	TPmB = (HiByte << 16)| LoByte;
	Total = TPmB * 0.00032;
	return Total;
}
Here is the data sheet screenshot for the power register.
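One likely error in the code above: on the 16-bit dsPIC, int is 16 bits wide, so HiByte << 16 shifts the high word out entirely before the OR. (Returning the combined value through a 16-bit int return type would truncate it again.) A sketch of the combining step with explicit 32-bit types; the function name is hypothetical:
#include <stdint.h>

uint32_t combine_power_words(uint16_t hi, uint16_t lo)
{
	/* promote to 32 bits BEFORE shifting; (hi << 16) in 16-bit
	   int arithmetic discards the high word */
	return ((uint32_t)hi << 16) | (uint32_t)lo;
}

/* usage: TPmB = combine_power_words(HiByte, LoByte);
          Total = (float)TPmB * 0.00032f; */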

Distortion in ESP32 I2S audio playback with external DAC for sample frequency higher than 20kSps

Hardware: ESP32 DevKitV1, PCM5102 breakout board, SD-card adapter.
Software: Arduino framework.
For some time I have been struggling with audio playback using an I2S DAC external to the ESP32.
The problem is that I can only play without distortion at low sample frequencies, i.e. below 20 kSps.
I have been studying the documentation, https://docs.espressif.com/projects/esp-idf/en/latest/api-reference/peripherals/i2s.html, and numerous other sources, but still haven't managed to fix this.
I2S configuration function:
esp_err_t I2Smixer::i2sConfig(int bclkPin, int lrckPin, int dinPin, int sample_rate)
{
    // i2s configuration: Tx to ext DAC, 2's complement 16-bit PCM, mono
    const i2s_config_t i2s_config = {
        .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_TX | I2S_CHANNEL_MONO), // only tx, external DAC
        .sample_rate = sample_rate,
        .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
        .channel_format = I2S_CHANNEL_FMT_ONLY_RIGHT, // single channel
        // .channel_format = I2S_CHANNEL_FMT_RIGHT_LEFT, // 2 channels
        .communication_format = (i2s_comm_format_t)(I2S_COMM_FORMAT_I2S | I2S_COMM_FORMAT_I2S_MSB),
        .intr_alloc_flags = ESP_INTR_FLAG_LEVEL3, // highest interrupt priority that can be handled in C
        .dma_buf_count = 128, // 16
        .dma_buf_len = 128,   // 64
        .use_apll = false,
        .tx_desc_auto_clear = true};
    const i2s_pin_config_t pin_config = {
        .bck_io_num = bclkPin,           // this is the BCK pin
        .ws_io_num = lrckPin,            // this is the LRCK pin
        .data_out_num = dinPin,          // this is the DATA output pin
        .data_in_num = I2S_PIN_NO_CHANGE // not used
    };
    esp_err_t ret1 = i2s_driver_install((i2s_port_t)i2s_num, &i2s_config, 0, NULL);
    esp_err_t ret2 = i2s_set_pin((i2s_port_t)i2s_num, &pin_config);
    esp_err_t ret3 = i2s_set_sample_rates((i2s_port_t)i2s_num, sample_rate);
    // i2s_adc_disable((i2s_port_t)i2s_num);
    // esp_err_t ret3 = rtc_clk_apll_enable(1, 15, 8, 5, 6);
    return ret1 + ret2 + ret3;
}
A wave file, which was created in a 16 bit mono PCM, 44.1kHz format, is opened:
File sample_file = SD.open("/test.wav");
In the main loop, the samples are fed to the I2S driver.
esp_err_t I2Smixer::loop()
{
    esp_err_t ret1 = ESP_OK, ret2 = ESP_OK;
    int32_t output = 0;
    if (sample_file.available())
    {
        if (sample_file.size() - sample_file.position() > 2) // bytes left
        {
            int16_t tmp; // 16-bit signed PCM assumed
            sample_file.read((uint8_t *)&tmp, 2);
            output = (int32_t)tmp;
        }
        else
        {
            sample_file.close();
        }
    }
    size_t i2s_bytes_write;
    int16_t int16_t_output = (int16_t)output;
    ret1 = i2s_write((i2s_port_t)i2s_num, &int16_t_output, 2, &i2s_bytes_write, portMAX_DELAY);
    if (i2s_bytes_write != 2)
        ret2 = ESP_FAIL;
    return ret1 + ret2;
}
This works fine for sample rates up to 20 kSps.
For a sample rate of 32k or 44.1k, heavy distortion occurs. I suspect this is caused by the I2S DMA Tx buffer.
If the number of DMA buffers (dma_buf_count) and the buffer length (dma_buf_len) are increased, the sound is played fine at first. Subsequently, after a short time, the distortion kicks in again. I cannot measure this short time span, maybe around a second, but I did notice it depends on dma_buf_count and dma_buf_len.
Next to this, I tried increasing the CPU frequency to 240 MHz: no improvement.
Further, I tried playing a file from SPIFFS: no improvement.
I am out of ideas right now; has anyone else encountered this issue?
Reading one sample at a time and pushing it to the I2S driver will not be the most efficient usage of the driver. You are using just 2 bytes in every 128-byte DMA buffer. That leaves just a single sample period to push the next sample before the DMA buffer is "starved".
Read the file in 128-byte (64-sample) chunks and write the whole chunk to the I2S in order to use the DMA effectively, as in the sketch below.
Depending on the file-system implementation, it may also be a little more efficient to use larger chunks that are sympathetic to the file-system's media, sector size, and DMA buffering.
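A minimal sketch of that chunked loop, reusing sample_file and i2s_num from the question (the chunk size matches the 128-byte dma_buf_len; this is an illustration, not tested on hardware):
esp_err_t I2Smixer::loop()
{
    static int16_t chunk[64]; // 64 samples * 2 bytes = 128 bytes = one DMA buffer
    if (!sample_file.available())
    {
        sample_file.close();
        return ESP_OK;
    }
    int bytes_read = sample_file.read((uint8_t *)chunk, sizeof(chunk)); // read up to one full chunk
    if (bytes_read <= 0)
        return ESP_FAIL;
    size_t bytes_written = 0;
    // block until the whole chunk has been queued into the DMA buffers
    esp_err_t ret = i2s_write((i2s_port_t)i2s_num, chunk, (size_t)bytes_read, &bytes_written, portMAX_DELAY);
    return (bytes_written == (size_t)bytes_read) ? ret : ESP_FAIL;
}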

How to limit FPS for UVC gadget?

I'm developing an application based on the g_webcam template code available at git://git.ideasonboard.org/uvc-gadget.git. I've noticed that the FPS setting supplied in the USB device config structures is not respected. In fact, the gadget attempts the fastest possible frame rate. Moreover, the host tends to lose the pipe to the UVC device, probably because the opportunistic FPS selection floods the low-level USB interface.
So, how can we set a hard-limit on FPS for a UVC gadget?
Thanks!
Kernel module source:
/* Uncompressed Payload - 3.1.2. Uncompressed Video Frame Descriptor */
static const struct UVC_FRAME_UNCOMPRESSED(1)
uvc_frame_uncompressed_360p = {
	.bLength = UVC_DT_FRAME_UNCOMPRESSED_SIZE(1),
	.bDescriptorType = USB_DT_CS_INTERFACE,
	.bDescriptorSubType = UVC_VS_FRAME_UNCOMPRESSED,
	.bFrameIndex = 1,
	.bmCapabilities = 0,
	.wWidth = cpu_to_le16(FRAME_WIDTH),
	.wHeight = cpu_to_le16(FRAME_HEIGHT),
	.dwMinBitRate = cpu_to_le32(FRAME_WIDTH * FRAME_HEIGHT * 8 * FRAME_RATE),
	.dwMaxBitRate = cpu_to_le32(FRAME_WIDTH * FRAME_HEIGHT * 8 * FRAME_RATE),
	.dwMaxVideoFrameBufferSize = cpu_to_le32(FRAME_WIDTH * FRAME_HEIGHT),
	.dwDefaultFrameInterval = cpu_to_le32(FRAME_RATE_USEC),
	.bFrameIntervalType = 1,
	.dwFrameInterval[0] = cpu_to_le32(FRAME_RATE_USEC),
};
UVC gadget source:
static const struct uvc_frame_info uvc_frames_grey[] = {
	{ FRAME_WIDTH, FRAME_HEIGHT, FRAME_RATE_USEC, },
	{ 0, 0, 0, },
};
Common header:
#define STREAMING_MAXPACKET 1024
#define FRAME_WIDTH 160
#define FRAME_HEIGHT 90
#define FRAME_RATE 330 /* 330 FPS */
#define FRAME_RATE_USEC 30303 /* 330 FPS */
I believe Paul added support for setting a limit on the FPS in usb-gadget, which got upstreamed recently.
Please consider looking at the latest version of the repository.
Let us know if you hit any further issues on this.
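One detail worth double-checking in the descriptors above: per the UVC specification, frame intervals (dwDefaultFrameInterval, dwFrameInterval) are expressed in 100 ns units, so a hard FPS limit corresponds to an interval of 10,000,000 / fps. A hypothetical helper macro to keep the two defines consistent:
/* UVC frame intervals are in 100 ns units */
#define FPS_TO_FRAME_INTERVAL(fps) (10000000 / (fps))
/* e.g. FPS_TO_FRAME_INTERVAL(330) == 30303, matching FRAME_RATE_USEC above */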

How do I configure the u-boot video driver for a 320x240 LVDS display on an iMX6 board?

I have a custom hardware device that uses a Variscite i.MX6Q (quad-core) to drive a 320x240 display. Once the linux kernel starts booting, the LCD display works great - no issues at all. However, prior to that the boot loader (u-boot) shows a white screen (sometimes with faint vertical lines) for about 0.25s, then goes black for about 8s until the kernel takes over (reinitializing the display and correctly showing the kernel's own splash screen).
Since the linux kernel can drive the display just fine, I'm sure I've just mis-configured something in my u-boot setup...but I'm tearing my hair out trying to figure out what and where! Resources / things I've tried include:
Porting LVDS LCD With Low Resolution to i.MX6 - This seems highly relevant, but refers to tweaking linux kernel drivers instead of uboot and I'm not experienced enough to port the knowledge to uboot.
U-Boot splash screen - LVDS - This seems soooo close to the problem I'm having, but doesn't list a clear solution. One response in the forum linked to a suggestion to invert the polarity of one of the clocks, which I tried but did not notice any difference.
How to display splash screen on parallel LCD in u-boot - In the same theme as the prior posts, this again hints at an issue with specifying clocks for low-res displays.
i.mx6 33.26MHz LVDS panel cannot display in u-boot - Following these instructions, I modified ...../uboot/drivers/video/ipu_common.c and set the g_ldb_clk struct .rate members to 6400000, but that seemed to have no effect.
Adding Displays to iMX Developer's Kits [Warning - PDF!] - Instructions on how to add support for new displays to iMX boards; section 6.1.4 talks about iMX6Q. However, I've added the proper display timings to the displays[] var (see code below) and I'm still having problems.
From my custom board schematics, I know that I need to configure a PWM backlight display on PWM2 and backlight enable/disable on GPIO 5-13, and I need to provide custom display timings. So, the relevant sections in ..../uboot/board/variscite/mx6var_som.c:
struct display_info_t const displays[] = {{
	.bus = -1,
	.addr = 0,
	.pixfmt = IPU_PIX_FMT_RGB24,
	.detect = detect_MyCustomBoard,
	.enable = lvds_enable_disable,
	.mode = {
		.name = "VAR-QVGA MX6CB-R",
		.refresh = 60, /* optional */
		.xres = 320,
		.yres = 240,
		.pixclock = MHZ2PS(6.4),
		.left_margin = 64,
		.right_margin = 20,
		.upper_margin = 8,
		.lower_margin = 4,
		.hsync_len = 4,
		.vsync_len = 10,
		.sync = FB_SYNC_EXT,
		.vmode = FB_VMODE_NONINTERLACED
	} },
	...
};
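/* Sanity check on the timings above: required pixel clock
 * = htotal * vtotal * refresh = (320+64+20+4) * (240+8+4+10) * 60
 * = 408 * 262 * 60 ≈ 6.41 MHz, which is consistent with MHZ2PS(6.4)
 * (assuming MHZ2PS converts MHz to a picosecond period, i.e. 156250 ps)
 * and with clock-frequency = <6400000> in the device tree below. */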
static void setup_display(void)
{
	...
	/* Turn off backlight until display is ready */
	SETUP_IOMUX_PAD(PAD_DISP0_DAT19__GPIO5_IO13 | MUX_PAD_CTRL(NO_PAD_CTRL));
	gpio_direction_output(IMX_GPIO_NR(5, 13), 0);
	/* Setup the backlight dimmer (via PWM) */
	SETUP_IOMUX_PAD(PAD_DISP0_DAT9__PWM2_OUT | MUX_PAD_CTRL(BACKLIGHT_PWM_CTRL));
	pwm_init(VAR_SOM_BACKLIGHT_PWM_ID, VAR_SOM_BACKLIGHT_PERIOD, 0);
	pwm_config(VAR_SOM_BACKLIGHT_PWM_ID, 0, VAR_SOM_BACKLIGHT_PERIOD);
	...
	/* Turn on LDB0, LDB1, IPU, IPU DI0 clocks */
	reg = readl(&mxc_ccm->CCGR3);
	reg |= MXC_CCM_CCGR3_LDB_DI0_MASK | MXC_CCM_CCGR3_LDB_DI1_MASK;
	writel(reg, &mxc_ccm->CCGR3);
	/* set LDB0, LDB1 clk select to 011/011 */
	reg = readl(&mxc_ccm->cs2cdr);
	reg &= ~(MXC_CCM_CS2CDR_LDB_DI0_CLK_SEL_MASK
		| MXC_CCM_CS2CDR_LDB_DI1_CLK_SEL_MASK);
	reg |= (1 << MXC_CCM_CS2CDR_LDB_DI0_CLK_SEL_OFFSET)
		| (1 << MXC_CCM_CS2CDR_LDB_DI1_CLK_SEL_OFFSET);
	writel(reg, &mxc_ccm->cs2cdr);
	...
}

int splash_screen_prepare(void)
{
	...
	/* Turn on backlight */
	gpio_set_value(IMX_GPIO_NR(5, 13), 1);
	pwm_config(VAR_SOM_BACKLIGHT_PWM_ID, VAR_SOM_BACKLIGHT_PERIOD*127/256, VAR_SOM_BACKLIGHT_PERIOD);
	...
}
For comparison, here are the relevant sections of my linux device tree:
&pwm2 {
	pinctrl-names = "default";
	pinctrl-0 = <&pinctrl_pwm2_1>;
	status = "okay";
};

backlight {
	compatible = "pwm-backlight";
	pwms = <&pwm2 0 50000>;
	brightness-levels = <0 4 8 16 32 64 128 248>;
	default-brightness-level = <7>;
	status = "okay";
};

&ldb {
	status = "okay";
	lvds-channel@0 {
		fsl,data-mapping = "spwg";
		fsl,data-width = <24>;
		status = "okay";
		primary;
		display-timings {
			native-mode = <&timing0r>;
			timing0r: hsd100pxn1 {
				clock-frequency = <6400000>;
				hactive = <320>;
				vactive = <240>;
				hback-porch = <64>;
				hfront-porch = <20>;
				vback-porch = <8>;
				vfront-porch = <4>;
				hsync-len = <4>;
				vsync-len = <10>;
			};
		};
	};
	...
};
&iomuxc {
	pinctrl-names = "default";
	pinctrl-0 = <&pinctrl_hog>;
	imx6qdl-var-som-mx6 {
		pinctrl_hog: hoggrp {
			fsl,pins = <
				...
				/* LCD Enable on GPIO 5-13 */
				MX6QDL_PAD_DISP0_DAT19__GPIO5_IO13 0xc0000000
				...
			>;
		};
In terms of hardware, the LVDS signal from the iMX6 is converted to parallel RGB by a TI SN65LVDS822 FlatLink LVDS receiver, which drives a 320x240 QVGA Okaya RH320240T-3x5AP-A display.
The framework I'm using is Yocto (Krogoth release), which includes:
U-Boot 2015.04-mx6+g535519b: git://github.com/varigit/uboot-imx.git, branch imx_v2015.04_4.1.15_1.1.0_ga_var03, commit 535519
Linux kernel 4.1.15: git://github.com/varigit/linux-2.6-imx.git, branch imx-rel_imx_4.1.15_2.0.0_ga-var01, commit 5a4b34
I do have a Variscite DevKit, and when I boot the SOM in the DevKit (with an appropriate device tree and associated drivers) everything works great and I see both the uboot splash image as well as the linux kernel splash image. This implies that the image I'm using for the uboot splash is valid, can be read by uboot, etc.
There is one other kicker: I do not have serial console access on my production board set :(.
So, the big question here is what am I doing wrong in my uboot display driver initialization? At this point, I'd even welcome strategies on how to go about debugging this (although I don't have access to an oscilloscope).