I was reading about the downsides of using Java serialization and the need to move to a serialization framework. There are so many frameworks: Avro, Parquet, Thrift, Protobuf.
The question is: what does each framework address, and what parameters should be considered when choosing a serialization framework?
I would like to get hands-on with a practical use case and compare/choose serialization frameworks based on the requirements.
Can somebody please assist on this topic?
There are a lot of factors to consider. I'll go through some of the important ones.
0) Schema first or Code First
If you have a project that'll involve different languages, code-first approaches are likely to be problematic. It's all very well having a Java class that can be serialised, but it might be a nuisance if that has to be deserialised in C.
Generally I favour schema first approaches, just in case.
1) Inter-object Demarcation
Some serialisations produce a byte stream that makes it possible to see where one object stops and another begins. Others do not.
So, if you have a message transport / data store that will separate out batches of bytes for you, e.g. ZeroMQ, or a database field, then you can use a serialisation that doesn't demarcate messages. Examples include Google Protocol Buffers. With demarcation done by the transport / store, the reader can get a batch of bytes knowing for sure that it encompasses one object, and only one object.
If your message transport / data store doesn't demarcate between batches of bytes, e.g. a network stream or a file, then either you invent your own demarcation markers, or you use a serialisation that demarcates for you. Examples include ASN.1 BER and XML.
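Inventing your own demarcation markers usually means a length prefix. As a rough sketch in Java (not any particular framework's wire format; the class and method names here are hypothetical), a minimal framing layer might look like:

```java
import java.io.*;

public class Framing {
    // Write one message with a 4-byte big-endian length prefix.
    static void writeFrame(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Read exactly one message back; the prefix tells us where it ends,
    // so the reader knows it has one object, and only one object.
    static byte[] readFrame(DataInputStream in) throws IOException {
        int len = in.readInt();
        byte[] payload = new byte[len];
        in.readFully(payload);
        return payload;
    }

    // Round-trip helper: frame a message, then read it back.
    static byte[] roundTrip(byte[] msg) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeFrame(new DataOutputStream(buf), msg);
        return readFrame(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
    }
}
```

With framing like this, any non-demarcating serialisation's output can be carried over a raw network stream or appended to a file.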
2) Canonical
This is a property of a serialisation which means that the serialised data describes its own structure. In principle the reader of a canonical message doesn't have to know up front what the message structure is; it can simply work that out as it reads the bytes (even if it doesn't know the field names). This can be useful in circumstances where you're not entirely sure where the data is coming from. If the data is not canonical, the reader has to know in advance what the object structure was, otherwise the deserialisation is ambiguous.
Examples of canonical serialisations include ASN.1 BER, ASN.1 canonical PER, and XML. Ones that aren't include ASN.1 uPER and possibly Google Protocol Buffers (I may have that wrong).
Avro does something different - the data schema is itself part of the serialised data, so it is always possible to reconstruct the object from arbitrary data. As you can imagine, the libraries for this are somewhat clunky in languages like C, but rather better in dynamic languages.
3) Size and Value Constraints
Some serialisation technologies allow the developer to set constraints on the values of fields and the sizes of arrays. The intention is that code generated from a schema file incorporating such constraints will automatically validate objects on serialisation and on deserialisation.
This can be extremely useful - free, schema driven content inspection done automatically. It's very easy to spot out-of-specification data.
This is extremely useful in large, heterogeneous projects (lots of different languages in use), as the single source of truth about what's valid and what's not is the schema, and only the schema, and it is enforced automatically by the auto-generated code. Developers can't ignore or get round the constraints, and when the constraints change everyone can't help but notice.
Examples include ASN.1 (usually done pretty well by tool sets), XML (not often done properly by free / cheap toolsets; MS's xsd.exe purposefully ignores any such constraints) and JSON (down to object validators). Of these three, ASN.1 has by far the most elaborate of constraints syntaxes; it's really very powerful.
Examples that don't - Google Protocol Buffers. In this regard GPB is extremely irritating, because it doesn't have constraints at all. The only way of having value and size constraints is either to write them as comments in the .proto file and hope developers read them and pay attention, or some other sort of non-source-code approach. With GPB being aimed very heavily at heterogeneous systems (literally every language under the sun is supported), I consider this a very serious omission, because value / size validation code has to be written by hand for each language used in a project. That's a waste of time. Google could add syntactical elements to .proto and to the code generators to support this without changing wire formats at all (it's all in the auto-generated code).
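To illustrate the cost: without schema-driven constraints, each language in the project needs hand-written validation code along these lines (the field shape and limits here are hypothetical, standing in for comments someone left in a .proto file):

```java
// Hand-rolled validator for a hypothetical GPB-style message field.
// With ASN.1, the equivalent checks would be generated from the schema;
// with .proto, you write and maintain this by hand in every language used.
public class MessageValidator {
    // Assumed constraints, mirroring comments in a hypothetical .proto:
    // array length must be 4..16, each value must be 0..31.
    static final int MIN_SIZE = 4, MAX_SIZE = 16;
    static final int MIN_VAL = 0, MAX_VAL = 31;

    static boolean isValid(int[] field) {
        if (field.length < MIN_SIZE || field.length > MAX_SIZE) return false;
        for (int v : field) {
            if (v < MIN_VAL || v > MAX_VAL) return false;
        }
        return true;
    }
}
```

Multiply this by every constrained field, every message type, and every language in the project, and the appeal of generating it from the schema is clear.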
4) Binary / Text
Binary serialisations will be smaller, and probably a bit quicker to serialise / deserialise. Text serialisations are more debuggable. But it's amazing what can be done with binary serialisations. For example, one can easily add ASN.1 decoders to Wireshark (you compile them up from your .asn schema file using your ASN.1 tools), et voila - on the wire decoding of programme data. The same is possible with GPB I should think.
ASN.1 uPER is extremely useful in bandwidth-constrained situations; it automatically uses the size / value constraints to economise in bits on the wire. For example, a field valid between 0 and 15 needs only 4 bits, and that's what uPER will use. It's probably no coincidence that uPER features heavily in protocols like 3G, 4G, and 5G. This "minimum bits" approach is a whole lot more elegant than compressing a text wireformat (which is what's done a lot with JSON and XML to make them less bloaty).
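The bit-counting behind this is easy to sketch. The following shows only the core idea - a value constrained to lo..hi needs ceil(log2(hi - lo + 1)) bits - and is not the actual uPER encoding, which also deals with offsets, length determinants, extensibility markers and so on:

```java
public class UperBits {
    // Minimum number of bits a uPER-style encoding needs for an integer
    // constrained to lo..hi inclusive: the smallest n with 2^n >= range size.
    static int bitsNeeded(long lo, long hi) {
        long range = hi - lo + 1;
        int bits = 0;
        while ((1L << bits) < range) bits++;
        return bits;
    }
}
```

So a field constrained to 0..15 costs 4 bits, 16..31 also costs 4 bits (uPER encodes the offset from the lower bound), and a field constrained to a single value costs 0 bits on the wire.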
5) Values
This is a bit of an oddity. In ASN.1 a schema file can define both the structure of objects and also values of objects. With the better tools you end up with classes (in your C++, Java, etc. source code), and pre-defined objects of those classes already filled in with values.
Why is that useful? Well, I use it a lot for defining project constants, and to give access to the limits of constraints. For example, suppose you had an array field with a valid length of 16 in a message. You could have a literal 16 in the field constraint, or you could cite the value of an integer value object in the constraint, with the integer also being available to developers.
--ASN.1 value that, in good tools, is built into the
--generated source code
arraySize INTEGER ::= 16
--A SET that has an array of integers that size
MyMessage ::= SET
{
field [0] SEQUENCE (SIZE(arraySize)) OF INTEGER
}
This is really handy in circumstances where you want to loop over that constraint, because the loop can be
for (int i = 0; i < arraySize; i++) { /* do things with MyMessage.field[i] */ } // arraySize is an integer in the auto-generated code built from the .asn schema file
Clearly this is fantastic if the constraint ever needs to be changed, because the only place it has to be changed is the schema, followed by a project recompile (where every place it's used will pick up the new value). Better still, if it's renamed in the schema file, a recompile identifies everywhere in the project it was used (because the developer-written source code that uses it is still using the old name, which is now an undefined symbol, leading to compiler errors).
ASN.1 constraints can get very elaborate. Here's a tiny taste of what can be done. This is fantastic for system developers, but is pretty complicated for the tool developers to implement.
arraySize INTEGER ::= 16
minSize INTEGER ::= 4
maxVal INTEGER ::= 31
minVal INTEGER ::= 16
oddVal INTEGER ::= 63
MyMessage2 ::= SET
{
field_1 [0] SEQUENCE (SIZE(arraySize)) OF INTEGER, -- 16 elements
field_2 [1] SEQUENCE (SIZE(0..arraySize)) OF INTEGER, -- 0 to 16 elements
field_3 [2] SEQUENCE (SIZE(minSize..arraySize)) OF INTEGER, -- 4 to 16 elements
field_4 [3] SEQUENCE (SIZE(minSize<..<arraySize)) OF INTEGER, -- 5 to 15 elements
field_5 [4] SEQUENCE (SIZE(minSize<..<arraySize)) OF INTEGER(0..maxVal), -- 5 to 15 elements valued 0..31
field_6 [5] SEQUENCE (SIZE(minSize<..<arraySize)) OF INTEGER(minVal..maxVal), -- 5 to 15 elements valued 16..31
field_7 [6] SEQUENCE (SIZE(minSize<..<arraySize)) OF INTEGER(minVal<..maxVal), -- 5 to 15 elements valued 17..31
field_8 [7] SEQUENCE (SIZE(arraySize)) OF INTEGER(minVal<..<maxVal), -- 16 elements valued 17..30
field_9 [8] INTEGER (minVal..maxVal | oddVal), -- valued 16 to 31, and also 63
f8_indx [10] INTEGER (0..<arraySize) -- index into field_8, constrained to be within the bounds of field_8
}
So far as I know, only ASN.1 does this, and even then it's only the more expensive tools that actually pick these elements up out of a schema file. This makes it tremendously useful in a large project, because literally everything related to the data, its constraints, and how to handle it is defined in only the .asn schema, and nowhere else.
As I said, I use this a lot, for the right type of project. Once one has got it pervading an entire project, the amount of time and risk saved is fantastic. It changes the dynamics of a project too; one can make late changes to a schema knowing that the entire project picks those up with nothing more than a recompile. So, protocol changes late in a project go from being high risk to something you might be content to do every day.
6) Wireformat Object Type
Some serialisation wireformats will identify the type of an object in the wireformat bytestream. This helps the reader in situations where objects of many different types may arrive from one or more sources. Other serialisations won't.
ASN.1 varies from wireformat to wireformat (it has several, including a few binary ones as well as XML and JSON). ASN.1 BER uses type, length and value fields in its wireformat, so it is possible for the reader to peek at an object's tag up front and decode the byte stream accordingly. This is very useful.
Google Protocol Buffers doesn't quite do the same thing, but if all the message types in a .proto are bundled up into one final oneof, and it's only ever that which is serialised, then you can achieve the same thing.
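The tag-peeking idea can be sketched as follows. This is a deliberately simplified single-byte tag-length-value layout, not real BER (where both tags and lengths can be multi-byte):

```java
import java.io.*;

public class Tlv {
    // Write a simplified tag-length-value record: a 1-byte tag,
    // a 1-byte length, then the value bytes.
    static void write(ByteArrayOutputStream out, int tag, byte[] value) {
        out.write(tag);
        out.write(value.length);
        out.write(value, 0, value.length);
    }

    // Peek at the tag of the next record without consuming it, so the
    // reader can choose the right decoder before reading the object.
    static int peekTag(ByteArrayInputStream in) {
        in.mark(1);
        int tag = in.read();
        in.reset();
        return tag;
    }
}
```

A dispatcher can call peekTag(), look the tag up in a table of decoders, and only then consume the record - which is exactly what makes multi-type streams manageable.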
7) Tool Cost
ASN.1 tools range from very, very expensive (and really good), to free (and less good). A lot of others are free, though I've found that the best XML tools (paying proper attention to value / size constraints) are quite expensive too.
8) Language Coverage
If you've heard of it, it's likely covered by tools for lots of different languages. If not, less so.
The good commercial ASN.1 tools cover C/C++/Java/C#. There are some free C/C++ ones of varying completeness.
9) Quality
It's no good picking up a serialisation technology if the quality of the tools is poor.
In my experience, GPB is good (it generally does what it says it will). The commercial ASN.1 tools are very good, eclipsing GPB's toolset comprehensively. Avro works. I've heard of occasional problems with Cap'n Proto, but having not used it myself you'd have to check that out. XML works with good tools.
10) Summary
In case you can't tell, I'm quite a fan of ASN.1.
GPB is incredibly useful too, for its widespread support and familiarity, but I do wish Google would add value / size constraints to fields and arrays, and also incorporate a value notation. If they did this, it'd be possible to have the same project workflow as can be achieved with ASN.1. If Google added just these two features, I'd consider GPB to be well nigh "complete", needing only an equivalent of ASN.1's uPER to finish it off for those people with little storage space or bandwidth.
Note that quite a lot of this is driven by a project's circumstances, as well as by how good / fast / mature the technology actually is.
Why does the Java API use int, when short or even byte would be sufficient?
Example: The DAY_OF_WEEK field in class Calendar uses int.
If the difference is too minimal, then why do those datatypes (short, int) exist at all?
Some of the reasons have already been pointed out. For example, the fact that "...(almost) all operations on byte and short will promote these primitives to int". However, the obvious next question would be: WHY are these types promoted to int?
So to go one level deeper: The answer may simply be related to the Java Virtual Machine Instruction Set. As summarized in the Table in the Java Virtual Machine Specification, all integral arithmetic operations, like adding, dividing and others, are only available for the type int and the type long, and not for the smaller types.
(An aside: The smaller types (byte and short) are basically only intended for arrays. An array like new byte[1000] will take 1000 bytes, and an array like new int[1000] will take 4000 bytes)
Now, of course, one could say that "...the obvious next question would be: WHY are these instructions only offered for int (and long)?".
One reason is mentioned in the JVM Spec mentioned above:
If each typed instruction supported all of the Java Virtual Machine's run-time data types, there would be more instructions than could be represented in a byte
Additionally, the Java Virtual Machine can be considered an abstraction of a real processor. Introducing dedicated arithmetic logic units for smaller types would not be worth the effort: they would need additional transistors, yet could still only execute one addition per clock cycle. The dominant architecture when the JVM was designed was 32-bit, just right for a 32-bit int. (Operations that involve a 64-bit long value are implemented as a special case.)
(Note: The last paragraph is a bit oversimplified, considering possible vectorization etc., but should give the basic idea without diving too deep into processor design topics)
EDIT: A short addendum, focussing on the example from the question, but in a more general sense: One could also ask whether it would not be beneficial to store fields using the smaller types. For example, one might think that memory could be saved by storing Calendar.DAY_OF_WEEK as a byte. But here, the Java class file format comes into play: all the fields in a class file occupy at least one "slot", which has the size of one int (32 bits). (The "wide" fields, double and long, occupy two slots.) So explicitly declaring a field as short or byte would not save any memory either.
(Almost) all operations on byte and short will promote them to int. For example, you cannot write:
short x = 1;
short y = 2;
short z = x + y; //error
Arithmetic is easier and more straightforward when using int; there is no need to cast.
In terms of space, it makes very little difference. byte and short would complicate things, and I don't think this micro-optimization is worth it, since we are talking about a fixed number of variables.
byte is relevant and useful when you program for embedded devices or deal with files/networks. Also, these primitives are limited; what if the calculations might exceed their limits in the future? Try to think about an extension of the Calendar class that might evolve to bigger numbers.
Also note that on a 64-bit processor, locals will be saved in registers and won't use any extra resources, so using int, short or the other primitives won't make any difference at all. Moreover, many Java implementations align variables* (and objects).
* byte and short occupy the same space as int if they are local variables, class variables or even instance variables. Why? Because in (most) computer systems, variable addresses are aligned, so for example if you use a single byte, you'll actually end up with two bytes - one for the variable itself and another for the padding.
On the other hand, in arrays, byte takes 1 byte, short takes 2 bytes and int takes 4 bytes, because in arrays only the start (and maybe the end) has to be aligned. This will make a difference if you want to use, for example, System.arraycopy(); then you'll really notice a performance difference.
Because arithmetic operations are easier when using int compared to short. Assume the constants were indeed modeled as short values. Then you would have to use the API in this manner:
short month = Calendar.JUNE;
month = month + (short) 1; // is july
Notice the explicit cast. short values are implicitly promoted to int when they are used in arithmetic operations. (On the operand stack, shorts are even expressed as ints.) This would be quite cumbersome to use, which is why int values are often preferred for constants.
Compared to that, the gain in storage efficiency is minimal, because there only exists a fixed number of such constants. We are talking about 40 constants. Changing their storage from int to short would save you 40 × 16 bit = 80 bytes. See this answer for further reference.
The design complexity of a virtual machine is a function of how many kinds of operations it can perform. It's easier to have four implementations of an instruction like "multiply" - one each for 32-bit integer, 64-bit integer, 32-bit floating-point, and 64-bit floating-point - than to have, in addition to the above, versions for the smaller numerical types as well. A more interesting design question is why there should be four types, rather than fewer (performing all integer computations with 64-bit integers and/or doing all floating-point computations with 64-bit floating-point values). The reason for using 32-bit integers is that Java was expected to run on many platforms where 32-bit types could be acted upon just as quickly as 16-bit or 8-bit types, but operations on 64-bit types would be noticeably slower. Even on platforms where 16-bit types would be faster to work with, the extra cost of working with 32-bit quantities would be offset by the simplicity afforded by only having 32-bit types.
As for performing floating-point computations on 32-bit values, the advantages are a bit less clear. There are some platforms where a computation like float a=b+c+d; could be performed most quickly by converting all operands to a higher-precision type, adding them, and then converting the result back to a 32-bit floating-point number for storage. There are other platforms where it would be more efficient to perform all computations using 32-bit floating-point values. The creators of Java decided that all platforms should be required to do things the same way, and that they should favor the hardware platforms for which 32-bit floating-point computations are faster than longer ones, even though this severely degraded both the speed and the precision of floating-point math on a typical PC, as well as on many machines without floating-point units. Note, btw, that depending upon the values of b, c, and d, using higher-precision intermediate computations when computing expressions like the aforementioned float a=b+c+d; will sometimes yield results which are significantly more accurate than would be achieved if all intermediate operands were computed at float precision, but will sometimes yield a value which is a tiny bit less accurate. In any case, Sun decided everything should be done the same way, and they opted for using minimal-precision float values.
Note that the primary advantages of smaller data types become apparent when large numbers of them are stored together in an array; even if there were no advantage to having individual variables of types smaller than 64 bits, it's worthwhile to have arrays which can store smaller values more compactly; having a local variable be a byte rather than a long saves seven bytes, but having an array of 1,000,000 numbers hold each number as a byte rather than a long saves 7,000,000 bytes. Since each array type only needs to support a few operations (most notably read one item, store one item, copy a range of items within an array, or copy a range of items from one array to another), the added complexity of having more array types is not as severe as the complexity of having more types of directly-usable discrete numerical values.
If you used the philosophy where integral constants are stored in the smallest type that they fit in, then Java would have a serious problem: whenever programmers write code using integral constants, they have to pay careful attention to their code to check if the type of the constants matter, and if so look up the type in the documentation and/or do whatever type conversions are needed.
So now that we've outlined a serious problem, what benefits could you hope to achieve with that philosophy? I would be unsurprised if the only runtime-observable effect of that change would be what type you get when you look the constant up via reflection. (and, of course, whatever errors are introduced by lazy/unwitting programmers not correctly accounting for the types of the constants)
Weighing the pros and the cons is very easy: it's a bad philosophy.
Actually, there'd be a small advantage. If you have a
class MyTimeAndDayOfWeek {
byte dayOfWeek;
byte hour;
byte minute;
byte second;
}
then on a typical JVM it needs as much space as a class containing a single int. The memory consumption gets rounded to a next multiple of 8 or 16 bytes (IIRC, that's configurable), so the cases when there are real saving are rather rare.
This class would be slightly easier to use if the corresponding Calendar methods returned a byte. But there are no such Calendar methods, only get(int), which must return an int because of other fields. Each operation on smaller types promotes to int, so you need a lot of casting.
Most probably, you'll either give up and switch to an int or write setters like
void setDayOfWeek(int dayOfWeek) {
this.dayOfWeek = checkedCastToByte(dayOfWeek);
}
Then the type of DAY_OF_WEEK doesn't matter, anyway.
Using variables smaller than the bus size of the CPU means more cycles are necessary. For example when updating a single byte in memory, a 64-bit CPU needs to read a whole 64-bit word, modify only the changed part, then write back the result.
Also, using a smaller data type incurs overhead when the variable is stored in a register, since the behavior of the smaller data type has to be accounted for explicitly. Since the whole register is used anyway, there is nothing to be gained by using a smaller data type for method parameters and local variables.
Nevertheless, these data types might be useful for representing data structures that require specific widths, such as network packets, or for saving space in large arrays, sacrificing speed.
I am accessing a PostgreSQL 8.4 database via JDBC, called from MATLAB.
The tables I am interested in basically consist of various columns of different datatypes. They are selected through their time-stamps.
Since I want to retrieve big amounts of data I am looking for a way of making the request faster than it is right now.
What I am doing at the moment is the following:
First I establish a connection to the database and call it DBConn. Next step would be to prepare a Select-Statement and execute it:
QUERYSTRING = ['SELECT * FROM ' TABLENAME ...
    ' WHERE ts BETWEEN ''' TIMESTART ''' AND ''' TIMEEND ''''];
QUERY = DBConn.prepareStatement(QUERYSTRING);
RESULTSET = QUERY.executeQuery();
Then I store the column types in the variable COLTYPE (1 for FLOAT, -1 for BOOLEAN and 0 for the rest - nearly all columns contain FLOAT). The next step is to process every row, column by column, and retrieve the data with the corresponding methods. FNAMES contains the field names of the table.
m=0; % Variable containing rownumber
while RESULTSET.next()
m = m+1;
for n = 1:length(FNAMES)
if COLTYPE(n)==1 % Columntype is a FLOAT
DATA{1}.(FNAMES{n})(m,1) = RESULTSET.getDouble(n);
elseif COLTYPE(n)==-1 % Columntype is a BOOLEAN
DATA{1}.(FNAMES{n})(m,1) = RESULTSET.getBoolean(n);
else
DATA{1}.(FNAMES{n}){m,1} = char(RESULTSET.getString(n));
end
end
end
When I am done with my request I close the statement and the connection.
I don't have the MATLAB Database Toolbox, so I am looking for solutions without it.
I understand that it is very inefficient to request the data of every single field. Still, I have failed to find a way to get more data at once - for example, multiple rows of the same column. Is there any way to do so? Do you have other suggestions for speeding up the request?
Summary
To speed this up, push the loops, and then your column datatype conversion, down in to the Java layer, using the Database Toolbox or custom Java code. The Matlab-to-Java method call overhead is probably what's killing you, and there's no way of doing block fetches (multiple rows in one call) with plain JDBC. Make sure the knobs on the JDBC driver you're using are set appropriately. And then optimize the transfer of expensive column data types like strings and dates.
(NB: I haven't done this with Postgres, but have with other DBMSes, and this will apply to Postgres too because most of it is about the JDBC and Matlab layers above it.)
Details
Push loops down to Java to get block fetching
The most straightforward way to make this faster is to push the loops over the rows and columns down into the Java layer, and have it return blocks of data (e.g. 100 or 1000 rows at a time) to the Matlab layer. There is substantial per-call overhead in invoking a Java method from Matlab, and looping over JDBC calls in M-code incurs it on every call (see Is MATLAB OOP slow or am I doing something wrong? - full disclosure: that's my answer). If you're calling JDBC from M-code like that, you're incurring that overhead on every single column of every row, and that's probably the majority of your execution time right now.
The JDBC API itself does not support "block cursors" like ODBC does, so you need to get that loop down into the Java layer. Using the Database Toolbox like Oleg suggests is one way to do it, since they implement their lower-level cursor stuff in Java (probably for precisely this reason). But if you can't have a Database Toolbox dependency, you can write your own thin Java layer to do it, and call that from your M-code (probably through a Matlab class that is coupled to your custom Java code and knows how to interact with it). Make the Java code and Matlab code share a block size, buffer up the whole block on the Java side using primitive arrays instead of object arrays for column buffers wherever possible, and have your M-code fetch the result set in batches, buffering those blocks in cell arrays of primitive column arrays, and then concatenate them together.
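The shape of that Java-side layer might be something like the following sketch, where a plain iterator stands in for the java.sql.ResultSet and the class name is hypothetical:

```java
import java.util.*;

// Sketch of the Java-side block buffer: accumulate up to blockSize rows
// of a numeric column into one primitive double[] per call, instead of
// one Java method call per cell from M-code. A real implementation would
// pull from a java.sql.ResultSet; an Iterator stands in for it here.
public class BlockCursor {
    private final Iterator<Double> rows;   // stand-in for the ResultSet
    private final int blockSize;

    BlockCursor(Iterator<Double> rows, int blockSize) {
        this.rows = rows;
        this.blockSize = blockSize;
    }

    boolean hasMore() { return rows.hasNext(); }

    // Fetch up to blockSize rows as one primitive array; the M-code
    // concatenates these blocks at the end.
    double[] fetchBlock() {
        double[] buf = new double[blockSize];
        int n = 0;
        while (n < blockSize && rows.hasNext()) buf[n++] = rows.next();
        return Arrays.copyOf(buf, n);
    }
}
```

The M-code then makes one Java call per block rather than one per cell, which is where the savings come from.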
Pseudocode for the Matlab layer:
colBufs = repmat( {{}}, [1 nCols] );
while (cursor.hasMore())
cursor.fetchBlock();
for iCol = 1:nCols
colBufs{iCol}{end+1} = cursor.getBlock(iCol); % should come back as primitive
end
end
for iCol = 1:nCols
colResults{iCol} = cat(2, colBufs{iCol}{:});
end
Twiddle JDBC DBMS driver knobs
Make sure your code exposes the DBMS-specific JDBC connection parameters to your M-code layer, and use them. Read the doco for your specific DBMS and fiddle with them appropriately. For example, Oracle's JDBC driver defaults its fetch buffer size (the one inside their JDBC driver, not the one you're building) to about 10 rows, which is way too small for typical data analysis set sizes. (It incurs a network round trip to the db every time the buffer fills.) Simply setting it to 1,000 or 10,000 rows is like turning on a "Go Fast" switch that had shipped set to "off". Benchmark your speed with sample data sets and graph the results to pick appropriate settings.
Optimize column datatype transfer
In addition to giving you block fetch functionality, writing custom Java code opens up the possibility of doing optimized type conversion for particular column types. After you've got the per-row and per-cell Java call overhead handled, your bottlenecks are probably going to be in date parsing and passing strings back from Java to Matlab.

Push the date parsing down into Java by having it convert SQL date types to Matlab datenums (as Java doubles, with a column type indicator) as they're being buffered, maybe using a cache to avoid recalculation of repeated dates in the same set. (Watch out for TimeZone issues. Consider Joda-Time.) Convert any BigDecimals to double on the Java side.

And cellstrs are a big bottleneck - a single char column could swamp the cost of several float columns. Return narrow CHAR columns as 2-D chars instead of cellstrs if you can (by returning a big Java char[] and then using reshape()), converting to cellstr on the Matlab side if necessary. (Returning a Java String[] converts to cellstr less efficiently.)

And you can optimize the retrieval of low-cardinality character columns by passing them back as "symbols" - on the Java side, build up a list of the unique string values and map them to numeric codes, and return the strings as a primitive array of numeric codes along with that map of number -> string; convert the distinct strings to cellstr on the Matlab side and then use indexing to expand it to the full array. This will be faster and save you a lot of memory, too, since the copy-on-write optimization will reuse the same primitive char data for repeated string values. Or convert them to categorical or ordinal objects instead of cellstrs, if appropriate. This symbol optimization could be a big win if you use a lot of character data and have large result sets, because then your string columns transfer at about primitive numeric speed, which is substantially faster, and it reduces cellstr's typical memory fragmentation.
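The "symbols" encoding, sketched on the Java side (the class and field names here are hypothetical):

```java
import java.util.*;

// Encode a low-cardinality string column as an int code per row plus a
// small dictionary of the distinct values, so the bulk of the transfer
// to Matlab happens at primitive numeric speed.
public class SymbolColumn {
    final String[] dictionary;  // distinct values, indexed by code
    final int[] codes;          // one code per row

    SymbolColumn(String[] column) {
        Map<String, Integer> codeOf = new LinkedHashMap<>();
        codes = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            // First occurrence of a value gets the next free code.
            codes[i] = codeOf.computeIfAbsent(column[i], k -> codeOf.size());
        }
        dictionary = codeOf.keySet().toArray(new String[0]);
    }

    // Expand back to the full column; the Matlab side would do the
    // equivalent with cellstr indexing (dictionary(codes + 1)).
    String[] decode() {
        String[] out = new String[codes.length];
        for (int i = 0; i < codes.length; i++) out[i] = dictionary[codes[i]];
        return out;
    }
}
```

For a million-row column with a handful of distinct values, you transfer one small String[] and one int[] instead of a million strings.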
(Database Toolbox may support some of this stuff now, too. I haven't actually used it in a couple years.)
After that, depending on your DBMS, you could squeeze out a bit more speed by including mappings for all the numeric column type variants your DBMS supports to appropriate numeric types in Matlab, and experimenting with using them in your schema or doing conversions inside your SQL query. For example, Oracle's BINARY_DOUBLE can be a bit faster than their normal NUMERIC on a full trip through a db/Matlab stack like this. YMMV.
You could consider optimizing your schema for this use case by replacing string and date columns with cheaper numeric identifiers, possibly as foreign keys to separate lookup tables to resolve them to the original strings and dates. Lookups could be cached client-side with enough schema knowledge.
If you want to go crazy, you can use multithreading at the Java level to have it asynchronously prefetch and parse the next block of results on separate Java worker thread(s), possibly parallelizing per-column date and string processing if you have a large cursor block size, while you're doing the M-code level processing for the previous block. This really bumps up the difficulty, though, and will ideally be only a small performance win, because you've already pushed the expensive data processing down into the Java layer. Save this for last. And check the JDBC driver doco; it may already effectively be doing this for you.
Miscellaneous
If you're not willing to write custom Java code, you can still get some speedup by changing the syntax of the Java method calls from obj.method(...) to method(obj, ...), e.g. getDouble(RESULTSET, n). It's just a weird Matlab OOP quirk. But this won't be much of a win, because you're still paying for the Java/Matlab data conversion on each call.
Also, consider changing your code so you can use ? placeholders and bound parameters in your SQL queries, instead of interpolating strings as SQL literals. If you're writing a custom Java layer, defining your own #connection and #preparedstatement M-code classes is a decent way to do this. It would look like this:
QUERYSTRING = ['SELECT * FROM ' TABLENAME ' WHERE ts BETWEEN ? AND ?'];
query = conn.prepare(QUERYSTRING);
rslt = query.exec(startTime, endTime);
This will give you better type safety and more readable code, and may also cut down on the server-side overhead of query parsing. This won't give you much speed-up in a scenario with just a few clients, but it'll make coding easier.
Profile and test your code regularly (at both the M-code and Java level) to make sure your bottlenecks are where you think they are, and to see if there are parameters that need to be adjusted based on your data set size, both in terms of row counts and column counts and types. I also like to build in some instrumentation and logging at both the Matlab and Java layer so you can easily get performance measurements (e.g. have it summarize how much time it spent parsing different column types, how much in the Java layer and how much in the Matlab layer, and how much waiting on the server's responses (probably not much due to pipelining, but you never know)). If your DBMS exposes its internal instrumentation, maybe pull that in too, so you can see where you're spending your server-side time.
It occurs to me that, to speed up reading the table, you can remove the if/else-if chains: JDBC's ResultSetMetaData gives you the data type and name of each column up front, so you save the time spent in per-column if/else-if dispatch.
ResultSetMetaData rsMd = rs.getMetaData();
int cols = rsMd.getColumnCount();
while (rs.next()) {
    for (int i = 1; i <= cols; i++) {
        row[i] = rs.getString(i); // JDBC column indexes are 1-based
    }
}
My example is plain JDBC rather than Matlab code, because I'm not a Matlab programmer.
I hope you find JDBC useful; if you need anything, let me know!
Adding a field to an XML object takes the length of the field name plus 3 characters (or 7 when nested); for JSON it takes 4 (or 6 when nested):
<xml>xml</xml> xml="xml"
{"json":json,} "json": json,
Assume the overhead average is 4 and the field-name average is 11. To justify the use of XML/JSON over a table in terms of storage, each field must on average appear in fewer than 1/15 of the objects; in other words, there must be ~15 times more distinct fields across the whole related group of objects than one object has on average.
(Yet a table may well still allow faster computation when this ratio is higher, even though it is bigger in storage.) I have not yet seen a use of XML/JSON with a very high ratio.
Aren't most real uses of XML/JSON forced and inefficient?
Shouldn't related data be stored and queried in relations (tables)?
What am i missing?
Example conversion XML to table
Object1
<aaaaahlongfieldname>1</aaaaahlongfieldname>
<b>B
<c>C</c>
</b>
Object2
<aaaaahlongfieldname>2</aaaaahlongfieldname>
<b><c><d>D</d></c></b>
<ba>BA</ba>
<ba "xyz~">BA</ba>
<c>C</c>
Both converted to a CSV-like table (delimiter declaration, head, line 1, line 2):
delimiter=,
aaaaahlongfieldname,b,b/c,b/c/d,ba,ba-xyz~,c
1,B,C,,,,
2,,,D,BA,BA,C
The / and - symbols in values need to be escaped only in the head.
But ,,,, could also be escaped as the escape character followed by the number of delimiters in a row (when an escape symbol or string is declared as well; worth it with large numbers of empty fields). And since the escape character and the delimiter need to be escaped wherever they appear in values, rare symbols that hardly ever appear in the data could be chosen for them automatically:
escape=~
delimiter=°
aaaaahlongfieldname°b°b/c°b/c/d°ba°ba-xyz~~°c
1°B°C~4
2°°°D°BA°BA°C
Validation/additional info: XML/JSON omits all empty fields, so a missing field in a "row" cannot be noticed. A line of a table is only valid when the number of fields is correct, so (faulty) lines are noticed; and since the columns have different data types, missing delimiters can usually be repaired easily.
Edit:
On readability/editability: a good thing, of course. The first time one read XML and JSON it may have been self-explanatory, having read HTML and JS already, but is that all? Most of the time it is machines reading it, and sometimes developers, neither of whom may be entertained by the verbosity.
The CSV in your example is quite an inefficient use of 8-bit encoding. You're hardly using even 5 bits of entropy per character, clearly wasting 3 bits. Why not compress it?
The answer to all of these is that people make mistakes, and stronger typing trades efficiency for safety. It is impossible for machine or human to identify a transposed column in a CSV stream; however, both JSON and XML would automatically handle it, and (assuming no hierarchy boundaries got crossed) everything would still work. Thirty years ago, when storage space was scarce and instruction rates were sometimes measured in the hundreds per second, using minimal amounts of decoration in protocols made sense. These days even embedded systems have relatively vast amounts of power and storage, so the tradeoff for a little extra safety is much easier to make.
For tightly controlled data transfer, say between modules that my development team is working on, JSON works great. But when data needs to go between different groups, I strongly prefer XML, simply because it helps both sides understand what is happening. If the data needs to go across a "slow" pipe, compression will remove 98% of the XML "overhead".
The designers of XML were well aware that there was a high level of redundancy in the representation, and they considered this a good thing (I'm not saying they were right). Essentially (a) redundancy costs nothing if you use data compression, (b) redundancy (within limits) helps human readability, and (c) redundancy makes it easier to detect and diagnose errors, especially important when XML is being hand-authored.
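A quick way to see point (a) in action is to gzip a highly repetitive XML document and compare sizes. This is just an illustrative sketch with a made-up document; the exact ratio will vary with the data:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.*;

public class XmlCompressDemo {
    public static void main(String[] args) throws IOException {
        // Build a verbose, highly redundant XML document.
        StringBuilder sb = new StringBuilder("<records>");
        for (int i = 0; i < 1000; i++) {
            sb.append("<record><aaaaahlongfieldname>").append(i % 10)
              .append("</aaaaahlongfieldname></record>");
        }
        sb.append("</records>");
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);

        // Compress it with gzip.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        byte[] packed = bos.toByteArray();

        // The repeated tag overhead compresses away almost entirely:
        // here, gzip shrinks the document by more than 10x.
        System.out.println(raw.length > 10 * packed.length);
    }
}
```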
Let me ask whether ANTLR3 accepts the following example grammar:
for an input x + y * z,
it is parsed as x+(y*z) if each of {x,y,z} is a number;
it is parsed as (x+y)*z if each of {x,y,z} is an object of a particular type T.
And let me ask whether such grammars are used sometimes, or only very rarely, in computer languages.
Thank you very much.
In general, parsers (produced by parser generators) only check syntax.
A parser (produced by any means) that can explore multiple parses (I believe ANTLR does this by backtracking; other parsing engines [GLR, Earley] do it by parallel exploration of possible parses), if augmented with semantic checking information, could reject parses that didn't meet semantic constraints.
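To make the idea concrete, here is a toy illustration (hand-rolled Java, not ANTLR) of a parse decision driven by a symbol table of operand types, as in the question's example; the names and the fixed "a op b op c" shape are purely for demonstration:

```java
import java.util.*;

public class SemanticParseDemo {
    // Choose a parse of "a+b*c" based on operand types recorded
    // in a symbol table: numbers use normal precedence, type T
    // uses strict left-to-right grouping.
    static String parse(String expr, Map<Character, String> types) {
        char a = expr.charAt(0), b = expr.charAt(2), c = expr.charAt(4);
        boolean allNumbers = "number".equals(types.get(a))
                && "number".equals(types.get(b))
                && "number".equals(types.get(c));
        return allNumbers ? a + "+(" + b + "*" + c + ")"
                          : "(" + a + "+" + b + ")*" + c;
    }

    public static void main(String[] args) {
        Map<Character, String> nums = new HashMap<>();
        nums.put('x', "number"); nums.put('y', "number"); nums.put('z', "number");
        Map<Character, String> ts = new HashMap<>();
        ts.put('x', "T"); ts.put('y', "T"); ts.put('z', "T");
        System.out.println(parse("x+y*z", nums)); // x+(y*z)
        System.out.println(parse("x+y*z", ts));   // (x+y)*z
    }
}
```

A real parser would do this with semantic predicates consulting a symbol table during the parse, which is exactly the part that is awkward to wire up.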
People tend not to build such parsers, in my experience, partly because it is hard to explain to users; if they don't get it, your parser isn't successful, and your example is especially bad IMHO in terms of explainability. They also tend not to do this because they need that type information, and that's not always convenient to collect as you parse. The GCC parsers famously do just this to parse statements such as
X*T;
and the parser is a bit of a mess because of the need to parse and collect this type information as it goes.
I suspect ANTLR can check semantic predicates. How easy it is to get type information of the kind you discuss to those semantic checks is another question; I have no experience here.
The GLR parsing engine used by our DMS Software Reengineering Toolkit does have "semantic" predicates. It isn't particularly easy to get real semantic type information to those predicates, by architectural design; we wanted such predicates to be driven off "syntax". But then, everything (including type inference) is driven off syntax. So we stick to information purely local to the reduction being proposed. This is particularly handy in (not) recognizing as separate types of parses the following peculiar FORTRAN constructs for nested-do-termination vs. shared-do-termination:
DO 10 I=1,10,1
DO 10 J=1,10,1
A(I,J)=0
10 CONTINUE
20 CONTINUE
vs.
DO 20 I=1,10,1
DO 10 J=1,10,1
A(I,J)=0
10 CONTINUE
20 CONTINUE
To the parser, at the pure syntax level, both of these look like:
DO <INT> <VAR>=...
DO <INT> <VAR>=...
<STMTS>
<INT> CONTINUE
<INT> CONTINUE
How can one determine which CONTINUE statement belongs to which DO construct with only this information? You can't.
The DMS FORTRAN parser does exactly this by having two sets of rules for DO loops: one for unshared continues, and one for shared continues. They are differentiated using semantic predicates that check whether the CONTINUE statement's label matches the DO loop's designated label. And thus the DMS FORTRAN parser gets the loop nesting right as it parses. AFAIK, all the other FORTRAN compilers parse the statements individually and then stitch the DO loop nests together in a post pass.
And yes, while FORTRAN has this (confusing) construct, no other modern language that I know copied it.
Very specific issue here…and no this isn’t homework (left that far…far behind). Basically I need to compute a checksum for code being written to an EPROM and I’d like to write this function in an Ada program to practice my bit manipulation in the language.
A section of a firmware data file for an EPROM is being changed by me and that change requires a new valid checksum at the end so the resulting system will accept the changed code. This checksum starts out by doing a modulo 256 binary sum of all data it covers and then other higher-level operations are done to get the checksum which I won’t go into here.
So now how do I do binary addition on a mod type?
I assumed if I use the “+” operator on a mod type it would be summed like an integer value operation…a result I don’t want. I’m really stumped on this one. I don’t want to really do a packed array and perform the bit carry if I don’t have to, especially if that’s considered “old hat”. References I’m reading claim you need to use mod types to ensure more portable code when dealing with binary operations. I’d like to try that if it’s possible. I'm trying to target multiple platforms with this program so portability is what I'm looking for.
Can anyone suggest how I might perform binary addition on a mod type?
Any starting places in the language would be of great help.
Just use a modular type, for which the operators do unsigned arithmetic.
type Word is mod 2 ** 16;
for Word'Size use 16;
Addendum: For modular types, the predefined logical operators operate on a bit-by-bit basis. Moreover, "the binary adding operators + and – on modular types include a final reduction modulo the modulus if the result is outside the base range of the type." The function Update_Crc is an example.
Addendum: §3.5.4 Integer Types, ¶19 notes that for modular types, the results of the predefined operators are reduced modulo the modulus, including the binary adding operators + and –. Also, the shift functions in §B.2 The Package Interfaces are available for modular types. Taken together, the arithmetical, logical and shift capabilities are sufficient for most bitwise operations.
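For comparison, the mod-256 reduction the checksum needs can be sketched in Java, where (unlike an Ada modular type, which wraps automatically) you reduce explicitly with a mask; the data bytes here are arbitrary sample values:

```java
public class Mod256Demo {
    public static void main(String[] args) {
        // Ada:  type Byte is mod 256;  -- "+" wraps automatically.
        // Java: mask with 0xFF after each addition to get the same effect.
        int[] data = {200, 100, 255, 1};
        int sum = 0;
        for (int b : data) {
            sum = (sum + b) & 0xFF; // reduce modulo 256, like an Ada mod type
        }
        System.out.println(sum); // 44
    }
}
```

In Ada the masking disappears entirely: declare the accumulator as the modular type and write an ordinary "+".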