I'm trying to replicate the GNU coding standard using uncrustify. My program has the following function declarations,
static void connect_to_server_cb1 (GObject *source_object,
GAsyncResult *result,
gpointer user_data);
static gboolean connect_to_server_cb2 (GObject *source_object,
GAsyncResult *result,
gpointer user_data);
static void connect_to_server_cb3 (GObject *source_object,
GAsyncResult *result,
gpointer user_data);
I'm expecting output as follows,
static void     connect_to_server_cb1 (GObject      *source_object,
                                       GAsyncResult *result,
                                       gpointer      user_data);
static gboolean connect_to_server_cb2 (GObject      *source_object,
                                       GAsyncResult *result,
                                       gpointer      user_data);
static void     connect_to_server_cb3 (GObject      *source_object,
                                       GAsyncResult *result,
                                       gpointer      user_data);
Which config option should I try to achieve this?
I'm not sure that uncrustify can do exactly what you want. You could try to post-process the output with the gcu-lineup-parameters program from GNOME C Utils, which Nautilus already uses for example. The Nautilus source tree contains a run-uncrustify.sh script which does this for each file:
# Aligning prototypes is not working yet, so avoid headers
"$UNCRUSTIFY" -c "$DATA/uncrustify.cfg" --no-backup "$FILE"
"$DATA/lineup-parameters" "$FILE" > "$FILE.temp" && mv "$FILE.temp" "$FILE"
This should work if you only want to process the .c files, but the comment suggests that aligning prototypes in headers is more difficult.
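To make the post-processing step concrete, here is a minimal Python sketch of the kind of column alignment such a tool performs. It is not the actual gcu-lineup-parameters implementation: the function name `line_up_parameters` and the pre-split `(type, stars, name)` input format are assumptions made for illustration only.

```python
def line_up_parameters(prefix, params):
    """Return a declaration whose parameter types and names sit in columns.

    prefix: text up to and including the opening parenthesis
    params: list of (type_name, stars, name) tuples (hypothetical input format)
    """
    type_width = max(len(type_name) for type_name, _, _ in params)
    star_width = max(len(stars) for _, stars, _ in params)
    pad = " " * len(prefix)
    lines = []
    for i, (type_name, stars, name) in enumerate(params):
        lead = prefix if i == 0 else pad
        end = ");" if i == len(params) - 1 else ","
        # Types are left-aligned; the '*' is right-aligned so it hugs the name.
        column = f"{type_name.ljust(type_width)} {stars.rjust(star_width)}{name}"
        lines.append(f"{lead}{column}{end}")
    return "\n".join(lines)


print(line_up_parameters(
    "static void connect_to_server_cb1 (",
    [("GObject", "*", "source_object"),
     ("GAsyncResult", "*", "result"),
     ("gpointer", "", "user_data")],
))
```

Aligning the return types and function names across several prototypes would need a second pass over all declarations, which this sketch does not attempt.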
# Whether to align variable definitions in prototypes and functions.
align_func_params = true # true/false
# How to consider (or treat) the '*' in the alignment of variable definitions.
#
# 0: Part of the type 'void * foo;' (default)
# 1: Part of the variable 'void *foo;'
# 2: Dangling 'void *foo;'
# Dangling: the '*' will not be taken into account when aligning.
align_var_def_star_style = 2 # unsigned number
# How to consider (or treat) the '&' in the alignment of variable definitions.
#
# 0: Part of the type 'long & foo;' (default)
# 1: Part of the variable 'long &foo;'
# 2: Dangling 'long &foo;'
# Dangling: the '&' will not be taken into account when aligning.
align_var_def_amp_style = 2 # unsigned number
# The span for aligning function prototypes.
#
# 0: Don't align (default).
align_func_proto_span = 4 # unsigned number
This tutorial demonstrates how to make a C++/CUDA-based Python extension for PyTorch. But for ... reasons ... my use-case is more complicated than this and doesn't fit neatly within the Python setuptools framework described by the tutorial.
Is there a way to use cmake to compile a Python library that extends PyTorch?
Yes.
The trick is to use cmake to combine together all the C++ and CUDA files we'll need and to use PyBind11 to build the interface we want; fortunately, PyBind11 is included with PyTorch.
The code below is collected and kept up-to-date in this Github repo.
Our project consists of several files:
CMakeLists.txt
cmake_minimum_required (VERSION 3.9)
project(pytorch_cmake_example LANGUAGES CXX CUDA)
find_package(Python REQUIRED COMPONENTS Development)
find_package(Torch REQUIRED)
# Modify if you need a different default value
if(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
set(CMAKE_CUDA_ARCHITECTURES 61)
endif()
# List all your code files here
add_library(pytorch_cmake_example SHARED
main.cu
)
target_compile_features(pytorch_cmake_example PRIVATE cxx_std_11)
target_link_libraries(pytorch_cmake_example PRIVATE ${TORCH_LIBRARIES} Python::Python)
# Use if the default GCC version gives issues.
# Similar syntax is used if we need better compilation flags.
target_compile_options(pytorch_cmake_example PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-ccbin g++-9>)
# Use a variant of this if you're on an earlier cmake than 3.18
# target_compile_options(pytorch_cmake_example PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-gencode arch=compute_61,code=sm_61>)
main.cu
#include <c10/cuda/CUDAException.h>
#include <torch/extension.h>
#include <torch/library.h>
using namespace at;
int64_t integer_round(int64_t num, int64_t denom){
return (num + denom - 1) / denom;
}
template<class T>
__global__ void add_one_kernel(const T *const input, T *const output, const int64_t N){
// Grid-strided loop
for(int64_t i=blockDim.x*blockIdx.x+threadIdx.x;i<N;i+=blockDim.x*gridDim.x){
output[i] = input[i] + 1;
}
}
///Adds one to each element of a tensor
Tensor add_one(const Tensor &input){
auto output = torch::zeros_like(input);
// Common values:
// AT_DISPATCH_INDEX_TYPES
// AT_DISPATCH_FLOATING_TYPES
// AT_DISPATCH_INTEGRAL_TYPES
AT_DISPATCH_ALL_TYPES(
input.scalar_type(), "add_one_cuda", [&](){
const auto block_size = 128;
const auto num_blocks = std::min(65535L, integer_round(input.numel(), block_size));
add_one_kernel<<<num_blocks, block_size>>>(
input.data_ptr<scalar_t>(),
output.data_ptr<scalar_t>(),
input.numel()
);
// Always test your kernel launches
C10_CUDA_KERNEL_LAUNCH_CHECK();
}
);
return output;
}
///Note that we can have multiple implementations spread across multiple files, though there should only be one `def`
TORCH_LIBRARY(pytorch_cmake_example, m) {
m.def("add_one(Tensor input) -> Tensor");
m.impl("add_one", c10::DispatchKey::CUDA, TORCH_FN(add_one));
//c10::DispatchKey::CPU is also an option
}
Compilation
Configure the build from inside a build directory with this command, then run ninja to compile:
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_PREFIX_PATH=`python -c 'import torch;print(torch.utils.cmake_prefix_path)'` -GNinja ..
test.py
You can then run the following test script.
import torch
torch.ops.load_library("build/libpytorch_cmake_example.so")
shape = (3,3,3)
a = torch.randint(0, 10, shape, dtype=torch.float).cuda()
a_plus_one = torch.ops.pytorch_cmake_example.add_one(a)
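The launch arithmetic in add_one can be checked on the host without a GPU. The sketch below is a pure-Python emulation, not part of the extension: `ceil_div` mirrors integer_round from main.cu, `grid_stride_indices` is a hypothetical helper that lists which elements each simulated thread would visit, and `max_blocks = 4` is a deliberately small stand-in for the 65535 cap.

```python
def ceil_div(num, denom):
    """Integer ceiling division, the same arithmetic as integer_round in main.cu."""
    return (num + denom - 1) // denom


def grid_stride_indices(n, num_blocks, block_size):
    """List the element indices visited by every simulated thread."""
    stride = num_blocks * block_size
    visited = []
    for block in range(num_blocks):
        for thread in range(block_size):
            start = block * block_size + thread
            # Each thread starts at its global id and jumps by the grid size.
            visited.extend(range(start, n, stride))
    return visited


n = 1000
block_size = 128
max_blocks = 4  # assumed small cap so the loop has to wrap around
num_blocks = min(max_blocks, ceil_div(n, block_size))
visited = grid_stride_indices(n, num_blocks, block_size)
print(num_blocks, len(visited), sorted(visited) == list(range(n)))
```

Because of the stride, every element is visited exactly once even when the number of blocks is capped below what the tensor size would otherwise call for.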
My cuda version is 10.1, and GPU is T4. My code is like this:
#include <iostream>
#include <algorithm>
#include <random>
#include <vector>
#include <numeric>
#include <algorithm>
#include <chrono>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <cooperative_groups.h>
using std::cout;
using std::endl;
void sort_2d_by_row();
thrust::device_vector<float> thrust_2d_by_row_even_odd(
thrust::device_vector<float>&, int, int);
__global__ void even_odd_kernel(float *ptr, int M, int N);
int main() {
cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1UL << 32);
if (err) cout << "errors occur\n";
sort_2d_by_row();
return 0;
}
void sort_2d_by_row() {
std::random_device rd;
std::mt19937 engine;
engine.seed(rd());
std::uniform_real_distribution<float> u(0, 90.);
int M = 19;
int N = 8 * 768 * 768;
/* int N = 10; */
std::vector<float> v(M * N);
std::generate(v.begin(), v.end(), [&](){return u(engine);});
thrust::host_vector<float> hv(v.begin(), v.end());
thrust::device_vector<float> dv = hv;
thrust::device_vector<float> res_even_odd = thrust_2d_by_row_even_odd(dv, M, N);
}
thrust::device_vector<float> thrust_2d_by_row_even_odd(
thrust::device_vector<float>& v, int M, int N) {
thrust::device_vector<float> res(v.begin(), v.end());
thrust::device_vector<int> index(M);
thrust::sequence(thrust::device, index.begin(), index.end(), 0, 1);
int blocky = 1;
while (blocky < M) blocky *= 2;
blocky /= 2;
int blockx = 1;
while (blockx < (N / 2) && blockx < 1024) blockx *= 2;
blockx /= 2;
int gridx = std::min(4096, N / blockx / 2);
dim3 block(blockx, blocky);
dim3 grid(gridx);
even_odd_kernel<<<grid, block, 0>>>(
thrust::raw_pointer_cast(&res[0]), M, N);
cudaDeviceSynchronize();
return res;
}
// descending
__global__ void even_odd_kernel(float *ptr, int M, int N) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int m = threadIdx.y;
int tstride = blockDim.x * gridDim.x * 2;
cooperative_groups::grid_group g = cooperative_groups::this_grid();
g.sync();
}
And CMakeLists.txt is like this:
CMAKE_MINIMUM_REQUIRED(VERSION 2.8)
PROJECT(cuda)
if (NOT CMAKE_BUILD_TYPE)
set(CMAKE_BUILD_TYPE Release)
endif ()
set(CMAKE_CXX_FLAGS "-std=c++14 -Wall -Wextra")
set(CMAKE_CXX_FLAGS_DEBUG "-g3 -O0")
set(CMAKE_CXX_FLAGS_RELEASE "-O2")
set(CUDA_NVCC_FLAGS "-std=c++14 -arch=sm_60 -Xptxas=-v -rdc=true")
set(CUDA_NVCC_FLAGS_DEBUG "-G -O0")
set(CUDA_NVCC_FLAGS_RELEASE "-O2")
set(CUDA_CUDA_FLAGS "-gencode arch=compute_70,code=sm_70 -rdc=true")
message (${CMAKE_BUILD_TYPE})
find_package(CUDA REQUIRED)
cuda_add_executable(sort sort.cu)
target_include_directories(
sort PUBLIC ${CUDA_INCLUDE_DIRS} ${CUDNN_INCLUDE_DIRS})
target_link_libraries(
sort ${CUDA_LIBRARIES})
The error message is:
CMakeFiles/sort.dir/sort_generated_sort.cu.o: In function
`__sti____cudaRegisterAll()':
tmpxft_0004cd04_00000000-5_sort.cudafe1.cpp:(.text.startup+0x15):
undefined reference to
`__cudaRegisterLinkedBinary_39_tmpxft_0004cd04_00000000_6_sort_cpp1_ii_main'
collect2: error: ld returned 1 exit status
CMakeFiles/sort.dir/build.make:963: recipe for target 'sort' failed
How can I make it work? Also, does g.sync() significantly hurt the program's performance, or is the impact trivial?
The cooperative groups are not an issue, IMHO. That's just something requiring a recent version of CUDA. As for your linking trouble - I think it must be some sort of flag mess. I'll suggest an alternative CMakeLists.txt, which itself is not perfect, but is more appropriate for CMake versions of recent years. It also has a bunch of suggestions for you in comments:
cmake_minimum_required(VERSION 3.8.2)
# If you want to properly search for Thrust, you'll need a FindThrust.cmake
# script, which constitutes a "CMake module". You place it under cmake/Modules
# in your source directory and make it available by uncommenting the following
# line:
#list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake/Modules")
project(sort-with-cuda
DESCRIPTION "My project description here"
LANGUAGES CXX CUDA)
# Don't do this. Set your build type explicitly, once; and then it's
# cached and you don't have to worry about it when you run make.
#
#if (NOT CMAKE_BUILD_TYPE)
# set(CMAKE_BUILD_TYPE Release)
#endif ()
# In the future, this should not be necessary, but we need it for
# cuda_select_nvcc_arch_flags
include(FindCUDA)
# This will set the appropriate gencode parameters for the hardware
# on your system (although you could always force it manually)
cuda_select_nvcc_arch_flags(CUDA_ARCH_FLAGS_TMP Auto)
set(CUDA_ARCH_FLAGS ${CUDA_ARCH_FLAGS_TMP} CACHE STRING "CUDA -gencode parameters")
string(REPLACE ";" " " CUDA_ARCH_FLAGS_STR "${CUDA_ARCH_FLAGS}")
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} ${CUDA_ARCH_FLAGS_STR}")
# The above may produce something like:
#
# -gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_70,code=compute_70;-gencode;arch=compute_75,code=compute_75
#
# But it may include older micro-architectures which have been
# deprecated/removed, in which case you'll need to edit that
# with ccmake and only keep what you need.
add_executable(sort-with-cuda sort.cu)
set_target_properties(
sort-with-cuda
PROPERTIES
CXX_STANDARD 14
CXX_STANDARD_REQUIRED YES
CXX_EXTENSIONS NO
)
# Note: I haven't added flags for compiling with warnings
# Thrust is very finicky: It provides a configuration script, but
# only for CMake >= 3.15 . And - it doesn't provide a FindThrust.cmake
# script itself with targets appropriate for CMake >= 3.
#
# See https://github.com/NVIDIA/thrust/blob/main/thrust/cmake/README.md
#
# With CMake 3.15 or later you can enable the following three lines:
#
#find_package(Thrust REQUIRED CONFIG)
#thrust_create_target(Thrust)
#target_link_libraries(sort-with-cuda Thrust)
#
# With earlier CMake versions, get yourself a proper FindThrust.cmake
# script (which creates a Thrust::Thrust target I suppose) and
# then uncomment the following two lines:
#
#find_package(Thrust REQUIRED)
#target_link_libraries(sort-with-cuda Thrust::Thrust)
# The following sets -rdc=true , but you don't actually need that for your example
set_target_properties(sort-with-cuda PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
For two days I have been trying to get printf/sprintf working in my project...
MCU: STM32F722RETx
I tried newlib, heap3, heap4, and so on; nothing works. HardFault_Handler is hit every time.
Now I am trying to use the simple implementation from this link and still have the same problem. I suppose my device has some problem with double-precision numbers, because the program enters HardFault_Handler at the line if (value != value) in the _ftoa function (which is strange, because this STM32 has an FPU).
Do you guys have any idea? (Now I am using heap_4.c)
My compiler options:
target_compile_options(${PROJ_NAME} PUBLIC
$<$<COMPILE_LANGUAGE:CXX>:
-std=c++14
>
-mcpu=cortex-m7
-mthumb
-mfpu=fpv5-d16
-mfloat-abi=hard
-Wall
-ffunction-sections
-fdata-sections
-O1 -g
-DLV_CONF_INCLUDE_SIMPLE
)
Linker options:
target_link_options(${PROJ_NAME} PUBLIC
${LINKER_OPTION} ${LINKER_SCRIPT}
-mcpu=cortex-m7
-mthumb
-mfloat-abi=hard
-mfpu=fpv5-sp-d16
-specs=nosys.specs
-specs=nano.specs
# -Wl,--wrap,malloc
# -Wl,--wrap,_malloc_r
-u_printf_float
-u_sprintf_float
)
Linker script:
/* Highest address of the user mode stack */
_estack = 0x20040000; /* end of RAM */
/* Generate a link error if heap and stack don't fit into RAM */
_Min_Heap_Size = 0x200; /* required amount of heap */
_Min_Stack_Size = 0x400; /* required amount of stack */
/* Specify the memory areas */
MEMORY
{
RAM (xrw) : ORIGIN = 0x20000000, LENGTH = 256K
FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 512K
}
UPDATE:
I don't think it is a stack problem: I have set configCHECK_FOR_STACK_OVERFLOW to 2, but the hook function is never called. I found something strange. This version works:
float d = 23.5f;
char buffer[20];
sprintf(buffer, "temp %f", 23.5f);
but this one does not:
float d = 23.5f;
char buffer[20];
sprintf(buffer, "temp %f",d);
I have no idea why passing a variable by value triggers HardFault_Handler...
You can implement a hard fault handler that at least recovers the stack frame (via the SP) at the point where the fault occurs. This should provide more insight.
https://www.freertos.org/Debugging-Hard-Faults-On-Cortex-M-Microcontrollers.html
It should tell you whether your issue is due to a floating-point error within the MCU or to a branching error, possibly caused by some linking problem.
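Once such a handler is in place, the Configurable Fault Status Register (SCB->CFSR, at address 0xE000ED28 on Cortex-M3/M4/M7) narrows down the fault class. The host-side Python sketch below decodes a CFSR value copied from a debugger session. The helper name `decode_cfsr` is hypothetical, and the bit names are assumed to follow the ARMv7-M architecture manual, so check them against the documentation for your core.

```python
# Bit position -> flag name in the Cortex-M CFSR (MMFSR, BFSR and UFSR fields).
CFSR_FLAGS = {
    0: "IACCVIOL", 1: "DACCVIOL", 3: "MUNSTKERR", 4: "MSTKERR",
    5: "MLSPERR", 7: "MMARVALID",
    8: "IBUSERR", 9: "PRECISERR", 10: "IMPRECISERR", 11: "UNSTKERR",
    12: "STKERR", 13: "LSPERR", 15: "BFARVALID",
    16: "UNDEFINSTR", 17: "INVSTATE", 18: "INVPC", 19: "NOCP",
    24: "UNALIGNED", 25: "DIVBYZERO",
}


def decode_cfsr(value):
    """Return the names of all fault flags set in a raw CFSR value."""
    return [name for bit, name in sorted(CFSR_FLAGS.items())
            if value & (1 << bit)]


# Example values as they might be read from a debugger:
print(decode_cfsr(0x00010000))
print(decode_cfsr(0x00080000))
print(decode_cfsr(0x00008200))
```

For floating-point trouble, UNDEFINSTR (an instruction the core does not implement) and NOCP (coprocessor, that is FPU, access not enabled) are the flags to look for first; UNALIGNED points to an alignment problem instead.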
I also had an error with printf when using FreeRTOS on my SiFive HiFive Rev B.
To solve it, I rewrote the _fstat and _write functions to change the output function used by printf:
/*
* Retarget functions for printf()
*/
#include <errno.h>
#include <sys/stat.h>
int _fstat (int file, struct stat * st) {
/* errno values are positive; ENOSYS means "function not implemented" */
errno = ENOSYS;
return -1;
}
int _write (int file, char * ptr, int len) {
extern int uart_putc(int c);
int i;
/* Output each character to the UART port */
for (i = 0; i < len; i++) uart_putc((int)*ptr++);
/* Report the number of bytes written */
return len;
}
And create another uart_putc function for UART0 of SiFive HiFive Rev B hardware:
#include <stdint.h>
int uart_putc(int c)
{
#define uart0_txdata (*(volatile uint32_t*)(0x10013000)) // uart0 txdata register
#define UART_TXFULL (1u << 31) // uart0 txdata full flag
/* Wait until the transmit FIFO has room */
while ((uart0_txdata & UART_TXFULL) != 0) { }
uart0_txdata = c;
return c;
}
The newlib C-runtime library (used in many embedded toolchains) internally uses its own malloc-family routines. newlib maintains some internal buffers and requires some support for thread safety:
http://www.nadler.com/embedded/newlibAndFreeRTOS.html
A hard fault can also be caused by an unaligned memory access:
https://www.keil.com/support/docs/3777.htm
I have a CGAL::Point_set_3 point set with point normals and colors. I would like to save all properties to a PLY file, using the write_ply_points_with_properties() function.
My goal is to make the full version work (see code below), but even the simple version doesn't compile, with the same error as the full version.
I work on Linux with CGAL release 4.14 and gcc 7.4.0.
Here is the code:
#include <CGAL/Exact_predicates_inexact_constructions_kernel.h>
#include <CGAL/Point_set_3.h>
#include <CGAL/Point_set_3/IO.h>
#include <tuple> // for std::tie
#include <fstream>
typedef CGAL::Exact_predicates_inexact_constructions_kernel Kernel;
typedef Kernel::Point_3 Point;
typedef Kernel::Vector_3 Vector;
typedef CGAL::Point_set_3<Point> Point_set;
int main(int argc, char*argv[])
{
Point_set points;
points.insert(Point(1., 2., 3.));
points.insert(Point(4., 5., 6.));
// add normal map
points.add_normal_map();
auto normal_map = points.normal_map();
// add color map
typedef Point_set::Property_map< Vector > ColorMap;
bool success = false;
ColorMap color_map;
std::tie(color_map, success) =
points.add_property_map< Vector >("color");
assert(success);
// populate normal and color map
for(auto it = points.begin(); it != points.end(); ++it)
{
normal_map[*it] = Vector(10., 11., 12.);
color_map[*it] = Vector(20., 21., 22.);
}
std::ofstream out("out.ply");
#if 1
// simple version
if(!out || !CGAL::write_ply_points_with_properties(
out,
points.points(), // const PointRange
CGAL::make_ply_point_writer(points.point_map())))
#else
// full version
if(!out || !CGAL::write_ply_points_with_properties(
out,
points.points(), // const PointRange
CGAL::make_ply_point_writer(points.point_map()),
CGAL::make_ply_normal_writer(points.normal_map()),
std::make_tuple(color_map,
CGAL::PLY_property< double >("red"),
CGAL::PLY_property< double >("green"),
CGAL::PLY_property< double >("blue"))))
#endif
{
return EXIT_FAILURE;
}
return EXIT_SUCCESS;
}
The compilation error is:
...
/usr/include/boost/property_map/property_map.hpp:303:54: error: no match for ‘operator[]’ (operand types are ‘const CGAL::Point_set_3<CGAL::Point_3<CGAL::Epick> >::Property_map<CGAL::Point_3<CGAL::Epick> >’ and ‘const CGAL::Point_3<CGAL::Epick>’)
Reference v = static_cast<const PropertyMap&>(pa)[k];
CGAL-4.14/include/CGAL/Surface_mesh/Properties.h:567:15: note: candidate: CGAL::Properties::Property_map_base<I, T, CRTP_derived_class>::reference CGAL::Properties::Property_map_base<I, T, CRTP_derived_class>::operator[](const I&) [with I = CGAL::Point_set_3<CGAL::Point_3<CGAL::Epick> >::Index; T = CGAL::Point_3<CGAL::Epick>; CRTP_derived_class = CGAL::Point_set_3<CGAL::Point_3<CGAL::Epick> >::Property_map<CGAL::Point_3<CGAL::Epick> >; CGAL::Properties::Property_map_base<I, T, CRTP_derived_class>::reference = CGAL::Point_3<CGAL::Epick>&]
reference operator[](const I& i)
^~~~~~~~
CGAL-4.14/include/CGAL/Surface_mesh/Properties.h:567:15: note: no known conversion for argument 1 from ‘const CGAL::Point_3<CGAL::Epick>’ to ‘const CGAL::Point_set_3<CGAL::Point_3<CGAL::Epick> >::Index&’
How can I fix it?
The problem in your code is that you are using the method points() of CGAL::Point_set_3, which returns a range of points of type CGAL::Point_set_3::Point_range, whereas the property maps that you use (points.point_map(), etc.) are indexed by the elements of the CGAL::Point_set_3 itself (its Index type, as the compiler error shows), not by points.
So you should simply call the write_ply_points_with_properties() on points, not on points.points().
Note also that if you store your colors on simple types (for example, using three Point_set_3 properties typed unsigned char), you can take advantage of the function CGAL::write_ply_point_set() that will automatically write all the simply-typed properties it finds, which makes it quite straightforward to use (just do CGAL::write_ply_point_set(out, points) and you're done).
One last thing, which is a detail not related to your problem: you should avoid using CGAL::Vector_3 for storing anything other than an actual geometric 3D vector (like colors in your case). It makes your code harder to read and is also quite an inefficient way to store colors if they are encoded as integer values between 0 and 255 (which is what unsigned char is for).
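To illustrate what storing colors as unsigned char means on the file side, here is a small Python sketch that builds an ASCII PLY document with double coordinates and normals and uchar colors. It only shows the layout of the format; it is not CGAL's writer, and the function name `ascii_ply` is hypothetical. The sample values are taken from the question.

```python
def ascii_ply(points, normals, colors):
    """Build an ASCII PLY document with xyz, normals and uchar RGB colors."""
    for rgb in colors:
        # uchar properties can only hold integers in [0, 255].
        if any(not (isinstance(c, int) and 0 <= c <= 255) for c in rgb):
            raise ValueError(f"color {rgb} does not fit in unsigned char")
    header = [
        "ply",
        "format ascii 1.0",
        f"element vertex {len(points)}",
        "property double x", "property double y", "property double z",
        "property double nx", "property double ny", "property double nz",
        "property uchar red", "property uchar green", "property uchar blue",
        "end_header",
    ]
    rows = [
        " ".join(str(v) for v in (*p, *n, *c))
        for p, n, c in zip(points, normals, colors)
    ]
    return "\n".join(header + rows) + "\n"


text = ascii_ply(
    points=[(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
    normals=[(10.0, 11.0, 12.0), (10.0, 11.0, 12.0)],
    colors=[(20, 21, 22), (20, 21, 22)],
)
print(text)
```

Each color channel takes one byte in a binary PLY file this way, instead of the eight bytes a double would need.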
Following the same steps as the CUDA samples to launch a kernel and sync across the grid using cooperative_groups::this_grid().sync() causes every CUDA API call to fail, whereas using
cooperative_groups::this_thread_block().sync() works fine and gives correct results.
I used the following code and CMakeLists.txt (cmake version 3.11.1) to test it using CUDA 10 on TITAN V GPU (Driver Version 410.73) with Ubuntu 16.04.5 LTS. The code is also available on github in order to make it easy to reproduce the error.
The code reads an array and then reverses it (from [0 1 2 ... 9] to [9 8 7 ... 0]). To do this, each thread reads a single element from the array, syncs, and then writes its element to the right destination. The code can easily be modified to confirm that this_thread_block().sync() works fine: simply change arr_size to be less than 1024 and use cg::thread_block barrier = cg::this_thread_block(); instead.
test_cg.cu
#include <cuda_runtime_api.h>
#include <stdio.h>
#include <stdint.h>
#include <cstdint>
#include <numeric>
#include <cuda.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
//********************** CUDA_ERROR
inline void HandleError(cudaError_t err, const char *file, int line) {
//Error handling macro, wrap it around function whenever possible
if (err != cudaSuccess) {
printf("\n%s in %s at line %d\n", cudaGetErrorString(err), file, line);
#ifdef _WIN32
system("pause");
#else
exit(EXIT_FAILURE);
#endif
}
}
#define CUDA_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))
//******************************************************************************
//********************** cg kernel
__global__ void testing_cg_grid_sync(const uint32_t num_elements,
uint32_t *d_arr){
uint32_t tid = threadIdx.x + blockDim.x*blockIdx.x;
if (tid < num_elements){
uint32_t my_element = d_arr[tid];
//to sync across the whole grid
cg::grid_group barrier = cg::this_grid();
//to sync within a single block
//cg::thread_block barrier = cg::this_thread_block();
//wait for all reads
barrier.sync();
uint32_t tar_id = num_elements - tid - 1;
d_arr[tar_id] = my_element;
}
}
//******************************************************************************
//********************** execute
void execute_test(const int sm_count){
//host array
const uint32_t arr_size = 1 << 20; //1M
uint32_t* h_arr = (uint32_t*)malloc(arr_size * sizeof(uint32_t));
//fill with sequential numbers
std::iota(h_arr, h_arr + arr_size, 0);
//device array
uint32_t* d_arr;
CUDA_ERROR(cudaMalloc((void**)&d_arr, arr_size*sizeof(uint32_t)));
CUDA_ERROR(cudaMemcpy(d_arr, h_arr, arr_size*sizeof(uint32_t),
cudaMemcpyHostToDevice));
//launch config
const int threads = 512;
//following the same steps done in conjugateGradientMultiBlockCG.cu
//cuda sample to launch kernel that sync across grid
//https://github.com/NVIDIA/cuda-samples/blob/master/Samples/conjugateGradientMultiBlockCG/conjugateGradientMultiBlockCG.cu#L436
int num_blocks_per_sm = 0;
CUDA_ERROR(cudaOccupancyMaxActiveBlocksPerMultiprocessor(&num_blocks_per_sm,
(void*)testing_cg_grid_sync, threads, 0));
dim3 grid_dim(sm_count * num_blocks_per_sm, 1, 1), block_dim(threads, 1, 1);
if(arr_size > grid_dim.x*block_dim.x){
printf("\n The grid size (numBlocks*numThreads) is less than array size.\n");
exit(EXIT_FAILURE);
}
printf("\n Launching %d blocks, each containing %d threads", grid_dim.x,
block_dim.x);
//argument passed to the kernel
void *kernel_args[] = {
(void *)&arr_size,
(void *)&d_arr, };
//finally launch the kernel
cudaLaunchCooperativeKernel((void*)testing_cg_grid_sync,
grid_dim, block_dim, kernel_args);
//make sure everything went okay
CUDA_ERROR(cudaGetLastError());
CUDA_ERROR(cudaDeviceSynchronize());
//get results on the host
CUDA_ERROR(cudaMemcpy(h_arr, d_arr, arr_size*sizeof(uint32_t),
cudaMemcpyDeviceToHost));
//validate
for (uint32_t i = 0; i < arr_size; i++){
if (h_arr[i] != arr_size - i - 1){
printf("\n Result mismatch in h_arr[%u] = %u\n", i, h_arr[i]);
exit(EXIT_FAILURE);
}
}
}
//******************************************************************************
int main(int argc, char**argv) {
//set to Titan V
uint32_t device_id = 0;
cudaSetDevice(device_id);
//get sm count
cudaDeviceProp devProp;
CUDA_ERROR(cudaGetDeviceProperties(&devProp, device_id));
int sm_count = devProp.multiProcessorCount;
//execute
execute_test(sm_count);
printf("\n Mission accomplished \n");
return 0;
}
CMakeLists.txt
cmake_minimum_required(VERSION 3.8 FATAL_ERROR)
set(PROJECT_NAME "test_cg")
project(${PROJECT_NAME} LANGUAGES CXX CUDA)
#default build type is Release
if (CMAKE_BUILD_TYPE STREQUAL "")
set(CMAKE_BUILD_TYPE Release)
endif ()
SET(CUDA_SEPARABLE_COMPILATION ON)
########## Libraries/flags Starts Here ######################
find_package(CUDA REQUIRED)
include_directories("${CUDA_INCLUDE_DIRS}")
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS}; -lineinfo; -std=c++11; -expt-extended-lambda; -O3; -use_fast_math; -rdc=true;)
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode=arch=compute_70,code=sm_70) #for TITAN V
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS}")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -m64 -Wall -std=c++11")
########## Libraries/flags Ends Here ######################
########## inc/libs/exe/features Starts Here ######################
set(CMAKE_INCLUDE_CURRENT_DIR ON)
CUDA_ADD_EXECUTABLE(${PROJECT_NAME} test_cg.cu)
target_compile_features(${PROJECT_NAME} PUBLIC cxx_std_11)
set_target_properties(${PROJECT_NAME} PROPERTIES POSITION_INDEPENDENT_CODE ON)
set_target_properties(${PROJECT_NAME} PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
target_link_libraries(${PROJECT_NAME} ${CUDA_LIBRARIES} ${CUDA_cudadevrt_LIBRARY})
########## inc/libs/exe/features Ends Here ######################
Running this code gives:
unknown error in /home/ahdhn/test_cg/test_cg.cu at line 67
This is the first line that uses cudaMalloc. I made sure that the code is compiled for the correct architecture by querying __CUDA_ARCH__ from the device, and the result is 700. Kindly let me know if you spot me doing something wrong in the code or the CMakeLists.txt file.
With external help, the solution that got the code working is to add string(APPEND CMAKE_CUDA_FLAGS " -gencode arch=compute_70,code=sm_70 --cudart shared") after the second set(CUDA_NVCC_FLAGS ...) line. The reason for --cudart shared is that I only have libcudadevrt.a under my /usr/local/cuda-10.0/lib64/, so I have to tell CUDA to link the shared (dynamic) runtime library, since the default is to link the static one. The reason for the -gencode part is that the sm_70 flag was otherwise not passed to the linker properly.
Additionally, using only CUDA_NVCC_FLAGS passes the sm_70 information to the compiler but not to the linker, while using only CMAKE_NVCC_FLAGS produces the error: namespace "cooperative_groups" has no member "grid_group".
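Separately from the linking issue, the guard `arr_size > grid_dim.x*block_dim.x` in execute_test limits how large an array this kernel can handle, because a cooperative launch requires all blocks to be resident at once. The Python sketch below reproduces that arithmetic on the host. The values `sm_count = 80` and `blocks_per_sm = 4` are assumptions chosen for illustration; the real numbers come from cudaGetDeviceProperties and cudaOccupancyMaxActiveBlocksPerMultiprocessor on the actual device.

```python
def max_cooperative_threads(sm_count, blocks_per_sm, threads_per_block):
    """Upper bound on threads in a grid whose blocks are all co-resident."""
    return sm_count * blocks_per_sm * threads_per_block


sm_count = 80       # assumed value, query the device for the real one
blocks_per_sm = 4   # assumed occupancy result for 512-thread blocks
threads = 512       # same block size as in execute_test
arr_size = 1 << 20  # same array size as in execute_test

capacity = max_cooperative_threads(sm_count, blocks_per_sm, threads)
fits = arr_size <= capacity
print(capacity, fits)
```

Under these assumed numbers the one-element-per-thread scheme would not cover a 1M-element array, so the guard in execute_test would exit before the kernel is launched.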