I am new to OpenACC. I have a query related to structured data transfer using the #pragma acc data directive. According to the site https://docs.computecanada.ca/wiki/OpenACC_Tutorial_-_Data_movement
The data directive defines a region of code in which GPU arrays remain on the GPU and are shared among all kernels in that region.
I do understand the use of the copy clause. I was wondering whether this directive can be used without any clause at all.
I read the OpenACC 2.7 specification, but it is not clear to me whether a clause is mandatory. My understanding is that if a data region is defined without specifying any data explicitly, then all data used within that region will implicitly remain on the GPU throughout the data region.
#pragma acc data
{
#pragma acc kernels
// Kernel 1
#pragma acc kernels
// Kernel 2
}
That means, for the above code, all data used in Kernel 1 and Kernel 2 will remain on the GPU for the entire duration of the data region.
Please correct me if I'm wrong.
Thank You in advance.
There are implicit data regions as part of a compute construct (i.e. the data region that's part of a "parallel" or "kernels" region), where the compiler will attempt to implicitly copy the data to the device, assuming the size and shape of the data is known. Otherwise, you do need to use data clauses to define the shape and size.
For the other data region constructs - structured, unstructured, and declare - you do need to include the variables that you want on the device in a data clause, where the data clause may be copy, copyin, copyout, create, present, or deviceptr (or delete for exit data directives). The compiler can't assume what data you want on the device, so in general it won't implicitly copy it for you.
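For example, here is a minimal sketch of a structured data region with explicit clauses (the arrays, sizes, and loops are made up for illustration): a and tmp stay on the device across both kernels, only a is copied in and only b is copied out.

#include <stdlib.h>

void scale_and_add(const double *a, double *b, int n)
{
    double *tmp = malloc(n * sizeof(double));

    /* Explicit clauses tell the compiler the shape and size of each array:
       a is only read, tmp lives only on the device, b is only written back. */
    #pragma acc data copyin(a[0:n]) create(tmp[0:n]) copyout(b[0:n])
    {
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            tmp[i] = 2.0 * a[i];          /* Kernel 1 */

        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            b[i] = tmp[i] + a[i];         /* Kernel 2 */
    }

    free(tmp);
}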
I have a uniform buffer which should be updated every frame. In order to avoid big stalls I want to create 3 buffers (by the number of my frame buffers) which should be interleaved every frame (0-1-2-0-1-2-...). But I can't understand how to create descriptors and bind them. This is how I'm doing it now:
I created a VkDescriptorSetLayout where I specified that I want to use a uniform buffer at binding position 0 in some shader stage.
I created a VkDescriptorPool with a size for 3 descriptors for uniform buffers.
Next I need to allocate descriptor sets but how many descriptor sets do I need here? I have only one VkDescriptorSetLayout and I'm expecting to get one VkDescriptorSet.
Next I need to update the created descriptor sets. Since I have only one binding (0) in my shader, I can use only one buffer in VkDescriptorBufferInfo, which will be passed to VkWriteDescriptorSet, which will be passed to vkUpdateDescriptorSets. But what about the other two buffers? Where do I specify them?
When I need to record a command I need to bind my descriptor set. But till now I have a descriptor set which is updated only for one buffer. What about the others?
Do I need to create 3 VkDescriptorSetLayout - one for every frame? Next do I need to allocate and update corresponding descriptor set with a corresponding buffer? And after this do I need to create 3 different command buffers where I should specify corresponding descriptor set?
It seems like a lot of work - the data is almost the same - all bindings and states stay the same, only the buffer changes.
It all sounds very confusing, so please don't hesitate to clarify.
A descriptor set layout defines the contents of a descriptor set - what types of resources (descriptors) a given set contains. When You need several descriptor sets with a single uniform buffer, You can create all of these descriptor sets using the same layout (a layout is only a description, a specification). This way You just tell the driver: "Hey, driver! Give me 3 descriptor sets. But all of them should be exactly the same".
But just because they are created from the same layout doesn't mean they must contain the same resource handles. All of them (in Your case) must contain a uniform buffer, but which resource is used for that uniform buffer is up to You. So each descriptor set can be updated with a separate buffer.
Now when You want to use 3 buffers one after another in three consecutive frames, You can do it in several different ways:
1. You can have a single descriptor set. Then in every frame, before You start preparing command buffers, You update the descriptor set with the next buffer. But when You update a descriptor set, it cannot be used by any submitted (and not yet finished) command buffers. So this would require additional synchronization and wouldn't be much different from using a single buffer. This way You also cannot "pre-record" command buffers.
2. You can have a single descriptor set. To change its contents (use a different buffer in it) You can update it through functions added in the VK_KHR_descriptor_update_template extension. This allows descriptor updates to be recorded in command buffers, so synchronization should be a bit easier, and it should allow You to pre-record command buffers. But it needs an extension to be supported, so it's not an option on platforms that do not support it.
3. The method You probably thought of - You can have 3 separate descriptor sets, all of them allocated using the same layout with a uniform buffer. Then You update each descriptor set with a different buffer (the 1st buffer with the 1st descriptor set, the 2nd buffer with the 2nd descriptor set, and the 3rd buffer with the 3rd descriptor set). Now, when recording a command buffer in which You want to use the first buffer, You just bind the first descriptor set. In the next frame, You just bind the second descriptor set, and so on.
Method 3 is probably the easiest to implement, as it requires no synchronization for descriptors (only per-frame synchronization, if You have such). It also allows You to pre-record command buffers, and it doesn't require any additional extensions to be enabled. But as You noted, it requires more resources to be created and managed.
Don't forget that You need to create a descriptor pool that is big enough to contain 3 uniform buffers, but at the same time You must also specify that You want to allocate 3 descriptor sets from it (one uniform buffer per descriptor set).
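To make method 3 concrete, here is a minimal sketch (names such as device, pool, layout, uniformBuffers and pipelineLayout are assumed to already exist, and error checking is omitted): one layout, three sets allocated from the pool, each set updated with its own buffer.

VkDescriptorSetLayout layouts[3] = { layout, layout, layout };

VkDescriptorSetAllocateInfo allocInfo = {};
allocInfo.sType              = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
allocInfo.descriptorPool     = pool;          // pool sized for 3 uniform buffers / 3 sets
allocInfo.descriptorSetCount = 3;
allocInfo.pSetLayouts        = layouts;

VkDescriptorSet sets[3];
vkAllocateDescriptorSets(device, &allocInfo, sets);

for (uint32_t i = 0; i < 3; ++i) {
    VkDescriptorBufferInfo bufferInfo = {};
    bufferInfo.buffer = uniformBuffers[i];    // a different buffer per set
    bufferInfo.offset = 0;
    bufferInfo.range  = VK_WHOLE_SIZE;

    VkWriteDescriptorSet write = {};
    write.sType           = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
    write.dstSet          = sets[i];
    write.dstBinding      = 0;                // binding 0 from the layout
    write.descriptorCount = 1;
    write.descriptorType  = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
    write.pBufferInfo     = &bufferInfo;

    vkUpdateDescriptorSets(device, 1, &write, 0, nullptr);
}

// At record time for frame i, simply bind sets[i]:
// vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS,
//                         pipelineLayout, 0, 1, &sets[i], 0, nullptr);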
You can read more about descriptor sets in Intel's API without Secrets: Introduction to Vulkan - Part 6 tutorial.
As for Your questions:
Do I need to create 3 VkDescriptorSetLayout - one for every frame?
No - a single layout is enough (as long as all descriptor sets contain the same types of resources in the same bindings).
Next do I need to allocate and update corresponding descriptor set
with a corresponding buffer?
As per option 3 - yes.
And after this do I need to create 3 different command buffers where I
should specify corresponding descriptor set.
It depends on whether You re-record command buffers every frame or pre-record them up front. Usually command buffers are re-recorded each frame. But as having a single command buffer requires waiting until its submission is finished, You will probably need a set of command buffers for each frame, corresponding to Your framebuffer images (and descriptor sets). So in frame 0 You use command buffer #0 (or multiple command buffers reserved for frame 0), in frame 1 You use command buffer #1, etc.
Now You can record a command buffer for a given frame and during recording You provide a descriptor set that is associated with a given frame.
I am trying to find the best-practice (efficient) way of storing a set of List objects against a ReportingDate key.
The List could be serialised as Xml/DataContract or ProtoBuf....
And given that some of the data could be big (for that slice of key):
I was wondering if there is any way of getting data from the redis cache in an IEnumerable/streamed fashion? At the moment we are using ProtoBuf.NET with a file-based cache, and we retrieve data into memory in a streamed fashion (we also have the option of selecting which props/fields we want in that T object, as ProtoBuf allows us to do that).
Is there any way to force (after some inactivity) a certain part of the data to be offloaded from memory back into the file if it is not being used, but load it up again when it is needed?
Tnx
It sounds like you want a sorted set - see https://redis.io/topics/data-types#sorted-sets. You would use the date as the score, perhaps in epoch time (since the score needs to be a number). SE.Redis supports all the operations you would expect to get ranges of values (either positional ranges - the first 20 records, etc. - or absolute ranges based on the score - all items between two dates expressed in the same unit). Look at the methods starting with "SortedSet...".
The value can be binary, so protobuf-net is fine (you would serialize the value for each date separately). Just pass a byte[] as the value. You need to handle serialization separately from the redis library.
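For illustration, here is a minimal sketch with StackExchange.Redis and protobuf-net (the ReportRow type, the "reports" key, and storing one serialized list per date are assumptions, not anything from your existing code):

using System;
using System.Collections.Generic;
using System.IO;
using ProtoBuf;
using StackExchange.Redis;

[ProtoContract]
public class ReportRow                         // hypothetical payload type
{
    [ProtoMember(1)] public string Name { get; set; }
    [ProtoMember(2)] public double Value { get; set; }
}

public static class ReportCache
{
    // One serialized list per ReportingDate; the date (as epoch seconds) is the score.
    public static void Store(IDatabase db, DateTime reportingDate, List<ReportRow> rows)
    {
        byte[] payload;
        using (var ms = new MemoryStream())
        {
            Serializer.Serialize(ms, rows);    // protobuf-net produces the bytes
            payload = ms.ToArray();
        }
        double score = new DateTimeOffset(reportingDate).ToUnixTimeSeconds();
        db.SortedSetAdd("reports", payload, score);
    }

    // Range query: everything between two dates, straight from the sorted set.
    public static List<ReportRow>[] Load(IDatabase db, DateTime from, DateTime to)
    {
        double min = new DateTimeOffset(from).ToUnixTimeSeconds();
        double max = new DateTimeOffset(to).ToUnixTimeSeconds();
        RedisValue[] raw = db.SortedSetRangeByScore("reports", min, max);

        var result = new List<ReportRow>[raw.Length];
        for (int i = 0; i < raw.Length; i++)
        {
            using (var ms = new MemoryStream((byte[])raw[i]))
                result[i] = Serializer.Deserialize<List<ReportRow>>(ms);
        }
        return result;
    }
}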
As for swapping data out: no. Redis has date-based expiration, but doesn't have hot and cold storage. It is either there, or it isn't. You could use scheduled tasks to purge or move data based on date ranges, again using any of the Z* (redis) or SortedSet* (SE.Redis) methods.
For the complete list of Z* operations, see: https://redis.io/commands#sorted_set. They should all be available in SE.Redis.
When a resilient distributed dataset (RDD) is created from a text file or collection (or from another RDD), do we need to call "cache" or "persist" explicitly to store the RDD data into memory? Or is the RDD data stored in a distributed way in the memory by default?
val textFile = sc.textFile("/user/emp.txt")
As per my understanding, after the above step, textFile is an RDD and is available in all/some of the nodes' memory.
If so, why do we need to call "cache" or "persist" on textFile RDD then?
Most RDD operations are lazy. Think of an RDD as a description of a series of operations. An RDD is not data. So this line:
val textFile = sc.textFile("/user/emp.txt")
It does nothing. It creates an RDD that says "we will need to load this file". The file is not loaded at this point.
RDD operations that require observing the contents of the data cannot be lazy. (These are called actions.) An example is RDD.count — to tell you the number of lines in the file, the file needs to be read. So if you write textFile.count, at this point the file will be read, the lines will be counted, and the count will be returned.
What if you call textFile.count again? The same thing: the file will be read and counted again. Nothing is stored. An RDD is not data.
So what does RDD.cache do? If you add textFile.cache to the above code:
val textFile = sc.textFile("/user/emp.txt")
textFile.cache
It does nothing. RDD.cache is also a lazy operation. The file is still not read. But now the RDD says "read this file and then cache the contents". If you then run textFile.count the first time, the file will be loaded, cached, and counted. If you call textFile.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines.
The cache behavior depends on the available memory. If the file does not fit in the memory, for example, then textFile.count will fall back to the usual behavior and re-read the file.
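Putting the pieces together (a minimal sketch using the same file path as above, assuming the file fits in memory):
val textFile = sc.textFile("/user/emp.txt")
textFile.cache      // lazy: nothing is read yet
textFile.count()    // action: reads the file, caches it, returns the line count
textFile.count()    // second action: served from the cache, no re-read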
I think the question would be better formulated as:
When do we need to call cache or persist on a RDD?
Spark processes are lazy, that is, nothing will happen until it's required.
To answer the question quickly: after val textFile = sc.textFile("/user/emp.txt") is issued, nothing happens to the data; only a HadoopRDD is constructed, using the file as its source.
Let's say we transform that data a bit:
val wordsRDD = textFile.flatMap(line => line.split("\\W"))
Again, nothing happens to the data. Now there's a new RDD wordsRDD that contains a reference to textFile and a function to be applied when needed.
Only when an action is called on an RDD, like wordsRDD.count, is the RDD chain, called the lineage, executed. That is, the data, broken down in partitions, will be loaded by the Spark cluster's executors, the flatMap function will be applied, and the result will be calculated.
On a linear lineage, like the one in this example, cache() is not needed. The data will be loaded to the executors, all the transformations will be applied and finally the count will be computed, all in memory - if the data fits in memory.
cache is useful when the lineage of the RDD branches out. Let's say you want to filter the words of the previous example into counts of positive and negative words. You could do it like this:
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
Here, each branch issues a reload of the data. Adding an explicit cache statement will ensure that processing done previously is preserved and reused. The job will look like this:
val textFile = sc.textFile("/user/emp.txt")
val wordsRDD = textFile.flatMap(line => line.split("\\W"))
wordsRDD.cache()
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
For that reason, cache is said to 'break the lineage' as it creates a checkpoint that can be reused for further processing.
Rule of thumb: Use cache when the lineage of your RDD branches out or when an RDD is used multiple times like in a loop.
Do we need to call "cache" or "persist" explicitly to store the RDD data into memory?
Yes, only if needed.
Is the RDD data stored in a distributed way in memory by default?
No!
And these are the reasons why:
Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
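For example, here is a minimal Scala sketch of persisting the question's RDD with an explicit storage level (MEMORY_AND_DISK spills partitions that don't fit in memory to disk rather than recomputing them; MEMORY_ONLY_2 replicates each partition on two nodes):
import org.apache.spark.storage.StorageLevel

val textFile = sc.textFile("/user/emp.txt")
textFile.persist(StorageLevel.MEMORY_AND_DISK)  // or StorageLevel.MEMORY_ONLY_2
textFile.count()                                // first action materializes and persists the data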
For more details please check the Spark programming guide.
Below are three situations in which you should cache your RDDs:
using an RDD many times
performing multiple actions on the same RDD
for long chains of (or very expensive) transformations
Adding another reason to add (or temporarily add) a cache method call:
for debugging memory issues
With the cache method, Spark will give debugging information regarding the size of the RDD, so in the Spark integrated UI you will get RDD memory consumption info. This has proved very helpful for diagnosing memory issues.
I am accessing a PostgreSQL 8.4 database via JDBC, called from MATLAB.
The tables I am interested in basically consist of various columns of different datatypes. They are selected through their timestamps.
Since I want to retrieve large amounts of data, I am looking for a way of making the request faster than it is right now.
What I am doing at the moment is the following:
First I establish a connection to the database and call it DBConn. Next step would be to prepare a Select-Statement and execute it:
QUERYSTRING = ['SELECT * FROM ' TABLENAME ...
    ' WHERE ts BETWEEN ''' TIMESTART ''' AND ''' TIMEEND ''''];
QUERY = DBConn.prepareStatement(QUERYSTRING);
RESULTSET = QUERY.executeQuery();
Then I store the columntypes in variable COLTYPE (1 for FLOAT, -1 for BOOLEAN and 0 for the rest - nearly all columns contain FLOAT). Next step is to process every row, column by column, and retrieve the data by the corresponding methods. FNAMES contains the fieldnames of the table.
m = 0; % Variable containing rownumber
while RESULTSET.next()
    m = m + 1;
    for n = 1:length(FNAMES)
        if COLTYPE(n) == 1       % Columntype is a FLOAT
            DATA{1}.(FNAMES{n})(m,1) = RESULTSET.getDouble(n);
        elseif COLTYPE(n) == -1  % Columntype is a BOOLEAN
            DATA{1}.(FNAMES{n})(m,1) = RESULTSET.getBoolean(n);
        else
            DATA{1}.(FNAMES{n}){m,1} = char(RESULTSET.getString(n));
        end
    end
end
When I am done with my request I close the statement and the connection.
I don't have the MATLAB Database Toolbox, so I am looking for solutions without it.
I understand that it is very inefficient to request the data of every single field. Still, I have failed to find a way to get more data at once - for example multiple rows of the same column. Is there any way to do so? Do you have other suggestions for speeding up the request?
Summary
To speed this up, push the loops, and then your column datatype conversion, down into the Java layer, using the Database Toolbox or custom Java code. The Matlab-to-Java method call overhead is probably what's killing you, and there's no way of doing block fetches (multiple rows in one call) with plain JDBC. Make sure the knobs on the JDBC driver you're using are set appropriately. And then optimize the transfer of expensive column data types like strings and dates.
(NB: I haven't done this with Postgres, but have with other DBMSes, and this will apply to Postgres too because most of it is about the JDBC and Matlab layers above it.)
Details
Push loops down to Java to get block fetching
The most straightforward way to get this faster is to push the loops over the rows and columns down into the Java layer, and have it return blocks of data (e.g. 100 or 1000 rows at a time) to the Matlab layer. There is substantial per-call overhead in invoking a Java method from Matlab, and looping over JDBC calls in M-code is going to incur that overhead repeatedly (see Is MATLAB OOP slow or am I doing something wrong? - full disclosure: that's my answer). If you're calling JDBC from M-code like that, you're incurring that overhead on every single column of every row, and that's probably the majority of your execution time right now.
The JDBC API itself does not support "block cursors" like ODBC does, so you need to get that loop down in to the Java layer. Using the Database Toolbox like Oleg suggests is one way to do it, since they implement their lower-level cursor stuff in Java. (Probably for precisely this reason.) But if you can't have a database toolbox dependency, you can just write your own thin Java layer to do so, and call that from your M-code. (Probably through a Matlab class that is coupled to your custom Java code and knows how to interact with it.) Make the Java code and Matlab code share a block size, buffer up the whole block on the Java side, using primitive arrays instead of object arrays for column buffers wherever possible, and have your M-code fetch the result set in batches, buffering those blocks in cell arrays of primitive column arrays, and then concatenate them together.
Pseudocode for the Matlab layer:
colBufs = repmat( {{}}, [1 nCols] );
while (cursor.hasMore())
    cursor.fetchBlock();
    for iCol = 1:nCols
        colBufs{iCol}{end+1} = cursor.getBlock(iCol); % should come back as primitive
    end
end
for iCol = 1:nCols
    colResults{iCol} = cat(2, colBufs{iCol}{:});
end
Twiddle JDBC DBMS driver knobs
Make sure your code exposes the DBMS-specific JDBC connection parameters to your M-code layer, and use them. Read the doco for your specific DBMS and fiddle with them appropriately. For example, Oracle's JDBC driver defaults to setting the default fetch buffer size (the one inside their JDBC driver, not the one you're building) to about 10 rows, which is way too small for typical data analysis set sizes. (It incurs a network round trip to the db every time the buffer fills.) Simply setting it to 1,000 or 10,000 rows is like turning on the "Go Fast" switch that had shipped set to "off". Benchmark your speed with sample data sets and graph the results to pick appropriate settings.
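For example, with the plain JDBC objects from the question, you can set the fetch size on the prepared statement before executing it (a sketch; 10,000 is an arbitrary value, and some drivers - PostgreSQL's included - only honour it under certain conditions, such as autocommit being disabled):
QUERY.setFetchSize(10000);   % ask the driver to buffer 10,000 rows per network round trip
RESULTSET = QUERY.executeQuery();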
Optimize column datatype transfer
In addition to giving you block fetch functionality, writing custom Java code opens up the possibility of doing optimized type conversion for particular column types. After you've got the per-row and per-cell Java call overhead handled, your bottlenecks are probably going to be in date parsing and passing strings back from Java to Matlab.
Push the date parsing down into Java by having it convert SQL date types to Matlab datenums (as Java doubles, with a column type indicator) as they're being buffered, maybe using a cache to avoid recalculation of repeated dates in the same set. (Watch out for TimeZone issues. Consider Joda-Time.) Convert any BigDecimals to double on the Java side.
And cellstrs are a big bottleneck - a single char column could swamp the cost of several float columns. Return narrow CHAR columns as 2-d chars instead of cellstrs if you can (by returning a big Java char[] and then using reshape()), converting to cellstr on the Matlab side if necessary. (Returning a Java String[] converts to cellstr less efficiently.)
You can also optimize the retrieval of low-cardinality character columns by passing them back as "symbols" - on the Java side, build up a list of the unique string values and map them to numeric codes, and return the strings as a primitive array of numeric codes along with that map of number -> string; convert the distinct strings to cellstr on the Matlab side and then use indexing to expand it to the full array. This will be faster and save you a lot of memory, too, since the copy-on-write optimization will reuse the same primitive char data for repeated string values. Or convert them to categorical or ordinal objects instead of cellstrs, if appropriate. This symbol optimization could be a big win if you use a lot of character data and have large result sets, because then your string columns transfer at about primitive numeric speed, which is substantially faster, and it reduces cellstr's typical memory fragmentation. (Database Toolbox may support some of this stuff now, too. I haven't actually used it in a couple years.)
After that, depending on your DBMS, you could squeeze out a bit more speed by including mappings for all the numeric column type variants your DBMS supports to appropriate numeric types in Matlab, and experimenting with using them in your schema or doing conversions inside your SQL query. For example, Oracle's BINARY_DOUBLE can be a bit faster than their normal NUMERIC on a full trip through a db/Matlab stack like this. YMMV.
You could consider optimizing your schema for this use case by replacing string and date columns with cheaper numeric identifiers, possibly as foreign keys to separate lookup tables to resolve them to the original strings and dates. Lookups could be cached client-side with enough schema knowledge.
If you want to go crazy, you can use multithreading at the Java level to have it asynchronously prefetch and parse the next block of results on separate Java worker thread(s), possibly parallelizing per-column date and string processing if you have a large cursor block size, while you're doing the M-code level processing for the previous block. This really bumps up the difficulty though, and at best yields a small performance win, because you've already pushed the expensive data processing down into the Java layer. Save this for last. And check the JDBC driver doco; it may already effectively be doing this for you.
Miscellaneous
If you're not willing to write custom Java code, you can still get some speedup by changing the syntax of the Java method calls from obj.method(...) to method(obj, ...). E.g. getDouble(RESULTSET, n). It's just a weird Matlab OOP quirk. But this won't be much of a win because you're still paying for the Java/Matlab data conversion on each call.
Also, consider changing your code so you can use ? placeholders and bound parameters in your SQL queries, instead of interpolating strings as SQL literals. If you're doing a custom Java layer, defining your own @connection and @preparedstatement M-code classes is a decent way to do this. So it looks something like this:
QUERYSTRING = ['SELECT * FROM ' TABLENAME ' WHERE ts BETWEEN ? AND ?'];
query = conn.prepare(QUERYSTRING);
rslt = query.exec(startTime, endTime);
This will give you better type safety and more readable code, and may also cut down on the server-side overhead of query parsing. This won't give you much speed-up in a scenario with just a few clients, but it'll make coding easier.
Profile and test your code regularly (at both the M-code and Java level) to make sure your bottlenecks are where you think they are, and to see if there are parameters that need to be adjusted based on your data set size, both in terms of row counts and column counts and types. I also like to build in some instrumentation and logging at both the Matlab and Java layer so you can easily get performance measurements (e.g. have it summarize how much time it spent parsing different column types, how much in the Java layer and how much in the Matlab layer, and how much waiting on the server's responses (probably not much due to pipelining, but you never know)). If your DBMS exposes its internal instrumentation, maybe pull that in too, so you can see where you're spending your server-side time.
It occurs to me that, to speed up the query to the table, you could remove the if/elseif checks: JDBC's ResultSetMetaData gives you the data type and the name of each column, so you save the time spent on the if/else if branching.
ResultSetMetaData rsMd = rs.getMetaData();
int cols = rsMd.getColumnCount();
String[] row = new String[cols + 1];
while (rs.next()) {
    for (int i = 1; i <= cols; i++) {
        row[i] = rs.getString(i);
    }
}
My example is pseudocode because I'm not a MATLAB programmer.
I hope you find JDBC useful; if you need anything, let me know!
I'm receiving variable-sized data in each simulation step in Simulink. However, I need to wait a certain number of simulation steps before I have received the whole data package, and therefore I need some kind of variable-sized buffer. I have no information about the total amount of data that I'm going to receive. The only information I have is the number of simulation steps I have to wait until I have received the whole data.
I've tried to implement it via a MATLAB Function block and several Delay blocks that delay the output data of the MATLAB Function block by one simulation step, but I keep failing on the variable-size constraints (the Delay blocks don't support them), and I haven't found any buffer block that supports the functionality I need here.
Hope you can help me out!
Given that you know your input and output sample rates, I'd suggest writing a c-mex S-function.
It wouldn't be trivial, but you can
set the input and output ports to have different sample rates
set the input and output ports to have variable signal length
store a pointer to a std::vector<...> class in the P work vector
the std::vector<...> gives you the ability to increase its size as new input data arrives, and be emptied when the data is posted to the output.
Update based on comments:
For code generation you need to specify an upper bound for the size of the buffer, which makes a MATLAB Function block suitable.
Specify the maximum size of the buffer, and keep track of how much of it has been filled using an internal persistent variable.
But the only way to have a block with a different sample rate at its input and its output is to write an S-Function. For the MATLAB Function block I can think of two approaches:
a) write the code so that it has an internal buffer that fills and only updates the output when the buffer becomes full.
Of course the output sample rate will be the same as the input sample rate, but the data will only change when you specify that it should.
b) have two outputs, one being the buffer, and one being an "I've just become full" logical signal. Then follow the block by a Triggered Subsystem that feeds the buffer straight through it, and is rising edge triggered by the logical signal. The output of the Triggered Subsystem will then only update at the steps when the buffer becomes full.
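A minimal MATLAB Function block sketch of approach (b) follows (the function and port names, the 1024-element upper bound, and the nSteps input are assumptions; the input u must be declared as a variable-size signal with an upper bound):

function [y, nValid, bufFull] = packageBuffer(u, nSteps)
%#codegen
% Sketch only: accumulates the variable-sized input u into a persistent
% buffer and raises bufFull once nSteps simulation steps have elapsed, so a
% downstream Triggered Subsystem can latch the completed package.
persistent buf count step
if isempty(buf)
    buf   = zeros(1024, 1);   % fixed upper bound required for code generation
    count = 0;
    step  = 0;
end

n = numel(u);
buf(count+1:count+n) = u(:);  % append this step's samples
count = count + n;
step  = step + 1;

y       = buf;                % fixed-size output; only the first nValid entries are meaningful
nValid  = count;
bufFull = (step >= nSteps);   % the "I've just become full" signal for the trigger

if bufFull                    % reset for the next package
    count = 0;
    step  = 0;
end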