Ada records : detect lost bytes, propose optimisation - record

I'm working on an legacy Ada project with significant RAM constraints.
In order to save memory for additional features, I would like to analyze all records definitions, in order to:
detect holes (i.e. wasted bytes)
propose a record declaration order (or a representation) that minimizes the memory footprint (with some algorithm that should be similar to Knapsack problem)
Please note that I'm not (yet) in the process of saving any lost bits (no pragma pack needed here, nor rep. clause for strictly contiguous records like in this question). Only bytes for now.
Simplified example (real world record are way more complex and may have discriminants, tagged types) :
type My_Record is record
field1 : Foo; -- enum with 3 values
field2 : Bar; -- some other record
field3 : Float; -- 32 bits
field4 : Flex;-- enum with 12 values
end record;
Its -gnatR2s output would look like (32 bits world):
for My_Record'Alignment use 4;
for My_Record use record
field1 at 0 use 0.. 7;
field2 at 4 use 0..47; -- 3 bytes lost from field 1
field3 at 12 use 0..31; -- 2 bytes lost from field 2
field4 at 16 use 0.. 7; -- another 3 bytes lost
end record;
What i would like to do (memory usage optimized record):
-- rewrite record, not the preferred way since record writing order may have some human readable context purpose
type My_Record is record
field2 : Bar; -- 2 unused bytes
field1 : Foo; -- 1 byte left
field4 : Flex; -- 0 byte left
field3 : Float; -- 4 bytes used
-- 0 wasted bytes
end record;
or:
-- preferred way : keep record declaration, but force rep. clause
type My_Record is record
field1 : Foo;
field2 : Bar;
field3 : Float;
field4 : Flex;
end record;
-- optimization is here
for My_Record'alignment use 4;
for My_Record use record
field2 at 0 use 0..47;
field1 at 6 use 0.. 7; -- exploit the Bar unused bytes
field4 at 7 use 0.. 7; -- exploit the Bar unused bytes
field3 at 8 use 0..31;
end record;
(apologies for any mistake in the example, I hope you get the point)
How can I do this ?
ASIS (but I have 0% skill, I'm not even sure it can do what I want)
libadalang (how does it obtain rep. clauses without compiling the units ?)
just use -gnatR2s on all compilation units and write a .rep parser in python
there is a hidden compilation option, a pragma or an existing GNAT tool that can be of help (like pragma component_alignment or pragma optimize_alignment, but I can't say if they address the matter, since it affects alignment, but not necessarily alignment + ordering)
For context on repl clauses, Ada Reference Manual and GNAT small differences, one may read this link

I think you already already state all options: using the implementation specific pragma's you mention, optimize manually and/or write a custom optimizer that analyzes the representation clauses and optimizes them using some (tunable) cost function that trades space, access time, etc.. Writing a custom optimizer will most certainly be most costly in terms of time and effort. I'm not aware of any other "hidden" options (if they are hidden, then it might be for a reason).
I would start by formulating a space budget (if that's possible), then analyze how much padding bytes each record type has in order to have a better understanding of how they are distributed and to estimate the potential maximum amount of bytes that can be recovered by removing all padding bytes (note that instances count: a small type being instantiated a lot might have a larger impact on the memory footprint than large type which is instantiated only once). And only then I would determine the strategy to reduce the padding bytes: use a pragma for all types, or only some types, optimize some records by hand or conclude that you really need a custom optimizer. Of course, time + cost matters here. I wouldn't recommend an extensive analysis if the job must be finished by tomorrow.
For the analysis I would just parse the output of -gnatR2. From what I know (but you might want to check this), libadalang focuses on source code (which may not state an explicit representation clause). Regarding ASIS, I think it might work, but I doubt if it's worth the effort. Parsing the output of -gnatR2 using some scripting language will probably be more time efficient.

Related

Should I define a column type from actual length or nth power of 2(Sql Server )?

Should I define a column type from actual length to nth power of 2?
The first case, I have a table column store no more than 7 charactors,
will I use NVARCHAR(8)? since there maybe implicit convert inside Sql
server, allocate 8 space and truncate automatic(heard some where).
If not, NCHAR(7)/NCHAR(8), which should be(assume the fixed length is 7)
Any performance differ on about this 2 cases?
You should use the actual length of the string. Now, if you know that the value will always be exactly 7 characters, then use CHAR(7) rather than VARCHAR(7).
The reason you see powers-of-2 is for columns that have an indeterminate length -- a name or description that may not be fixed. In most databases, you need to put in some maximum length for the varchar(). For historical reasons, powers-of-2 get used for such things, because of the binary nature of the underlying CPUs.
Although I almost always use powers-of-2 in these situations, I can think of no real performance differences. There is one. . . in some databases the actual length of a varchar(255) is stored using 1 byte whereas a varchar(256) uses 2 bytes. That is a pretty minor difference -- even when multiplied over millions of rows.

Storing trillions of document similarities

I wrote a program to compute similarities among a set of 2 million documents. The program works, but I'm having trouble storing the results. I won't need to access the results often, but will occasionally need to query them and pull out subsets for analysis. The output basically looks like this:
1,2,0.35
1,3,0.42
1,4,0.99
1,5,0.04
1,6,0.45
1,7,0.38
1,8,0.22
1,9,0.76
.
.
.
Columns 1 and 2 are document ids, and column 3 is the similarity score. Since the similarity scores are symmetric I don't need to compute them all, but that still leaves me with 2000000*(2000000-1)/2 ≈ 2,000,000,000,000 lines of records.
A text file with 1 million lines of records is already 9MB. Extrapolating, that means I'd need 17 TB to store the results like this (in flat text files).
Are there more efficient ways to store these sorts of data? I could have one row for each document and get rid of the repeated document ids in the first column. But that'd only go so far. What about file formats, or special database systems? This must be a common problem in "big data"; I've seen papers/blogs reporting similar analyses, but none discuss practical dimensions like storage.
DISCLAIMER: I don't have any practical experience with this, but it's a fun exercise and after some thinking this is what I came up with:
Since you have 2.000.000 documents you're kind of stuck with an integer for the document id's; that makes 4 bytes + 4 bytes; the comparison seems to be between 0.00 and 1.00, I guess a byte would do by encoding the 0.00-1.00 as 0..100.
So your table would be : id1, id2, relationship_value
That brings it to exactly 9 bytes per record. Thus (without any overhead) ((2 * 10^6)^2)*9/2bytes are needed, that's about 17Tb.
Off course that's if you have just a basic table. Since you don't plan on querying it very often I guess performance isn't that much of an issue. So you could go 'creative' by storing the values 'horizontally'.
Simplifying things, you would store the values in a 2 million by 2 million square and each 'intersection' would be a byte representing the relationship between their coordinates. This would "only" require about 3.6Tb, but it would be a pain to maintain, and it also doesn't make use of the fact that the relations are symmetrical.
So I'd suggest to use a hybrid approach, a table with 2 columns. First column would hold the 'left' document-id (4 bytes), 2nd column would hold a string of all values of documents starting with an id above the id in the first column using a varbinary. Since a varbinary only takes the space that it needs, this helps us win back some space offered by the symmetry of the relationship.
In other words,
record 1 would have a string of (2.000.000-1) bytes as value for the 2nd column
record 2 would have a string of (2.000.000-2) bytes as value for the 2nd column
record 3 would have a string of (2.000.000-3) bytes as value for the 2nd column
etc
That way you should be able to get away with something like 2Tb (inc overhead) to store the information. Add compression to it and I'm pretty sure you can store it on a modern disk.
Off course the system is far from optimal. In fact, querying the information will require some patience as you can't approach things set-based and you'll pretty much have to scan things byte by byte. A nice 'benefit' of this approach would be that you can easily add new documents by adding a new byte to the string of EACH record + 1 extra record in the end. Operations like that will be costly though as it will result in page-splits; but at least it will be possible without having to completely rewrite the table. But it will cause quite bit of fragmentation over time and you might want to rebuild the table once in a while to make it more 'aligned' again. Ah.. technicalities.
Selecting and Updating will require some creative use of SubString() operations, but nothing too complex..
PS: Strictly speaking, for 0..100 you only need 7 bytes, so if you really want to squeeze the last bit out of it you could actually store 8 values in 7 bytes and save another ca 300Mb, but it would make things quite a bit more complex... then again, it's not like the data is going to be human-readable anyway =)
PS: this line of thinking is completely geared towards reducing the amount of space needed while remaining practical in terms of updating the data. I'm not saying it's going to be fast; in fact, if you'd go searching for all documents that have a relation-value of 0.89 or above the system will have to scan the entire table and even with modern disks that IS going to take a while.
Mind you that all of this is the result of half an hour brainstorming; I'm actually hoping that someone might chime in with a neater approach =)

[My]SQL VARCHAR Size and Null-Termination

Disclaimer: I'm very new to SQL and databases in general.
I need to create a field that will store a maximum of 32 characters of text data. Does "VARCHAR(32)" mean that I have exactly 32 characters for my data? Do I need to reserve an extra character for null-termination?
I conducted a simple test and it seems that this is a WYSIWYG buffer. However, I wanted to get a concrete answer from people who actually know what they're doing.
I have a C[++] background, so this question is raising alarm bells in my head.
Yes, you have 32 characters at your disposal. SQL does not concern itself with nul terminated strings like some programming languages do.
Your VARCHAR specification size is the max size of your data, so in this case, 32 characters. However, VARCHARS are a dynamic field, so the actual physical storage used is only the size of your data, plus one or two bytes.
If you put a 10-character string into a VARCHAR(32), the physical storage will be 11 or 12 bytes (the manual will tell you the exact formula).
However, when MySQL is dealing with result sets (ie. after a SELECT), 32 bytes will be allocated in memory for that field for every record.

Is there any reason for numeric rather than int in T-SQL?

Why would someone use numeric(12, 0) datatype for a simple integer ID column? If you have a reason why this is better than int or bigint I would like to hear it.
We are not doing any math on this column, it is simply an ID used for foreign key linking.
I am compiling a list of programming errors and performance issues about a product, and I want to be sure they didn't do this for some logical reason. If you follow this link:
http://msdn.microsoft.com/en-us/library/ms187746.aspx
... you can see that the numeric(12, 0) uses 9 bytes of storage and being limited to 12 digits, theres a total of 2 trillion numbers if you include negatives. WHY would a person use this when they could use a bigint and get 10 million times as many numbers with one byte less storage. Furthermore, since this is being used as a product ID, the 4 billion numbers of a standard int would have been more than enough.
So before I grab the torches and pitch forks - tell me what they are going to say in their defense?
And no, I'm not making a huge deal out of nothing, there are hundreds of issues like this in the software, and it's all causing a huge performance problem and using too much space in the database. And we paid over a million bucks for this crap... so I take it kinda seriously.
Perhaps they're used to working with Oracle?
All numeric types including ints are normalized to a standard single representation among all platforms.
There are many reasons to use numeric - for example - financial data and other stuffs which need to be accurate to certain decimal places. However for the example you cited above, a simple int would have done.
Perhaps sloppy programmers working who didn't know how to to design a database ?
Before you take things too seriously, what is the data storage requirement for each row or set of rows for this item?
Your observation is correct, but you probably don't want to present it too strongly if you're reducing storage from 5000 bytes to 4090 bytes, for example.
You don't want to blow your credibility by bringing this up and having them point out that any measurable savings are negligible. ("Of course, many of our lesser-experienced staff also make the same mistake.")
Can you fill in these blanks?
with the data type change, we use
____ bytes of disk space instead of ____
____ ms per query instead of ____
____ network bandwidth instead of ____
____ network latency instead of ____
That's the kind of thing which will give you credibility.
How old is this application that you are looking into?
Previous to SQL Server 2000 there was no bigint. Maybe its just something that has made it from release to release for many years without being changed or the database schema was copied from an application that was this old?!?
In your example I can't think of any logical reason why you wouldn't use INT. I know there are probably reasons for other uses of numeric, but not in this instance.
According to: http://doc.ddart.net/mssql/sql70/da-db_1.htm
decimal
Fixed precision and scale numeric data from -10^38 -1 through 10^38 -1.
numeric
A synonym for decimal.
int
Integer (whole number) data from -2^31 (-2,147,483,648) through 2^31 - 1 (2,147,483,647).
It is impossible to know if there is a reason for them using decimal, since we have no code to look at though.
In some databases, using a decimal(10,0) creates a packed field which takes up less space. I know there are many tables around my work that use that. They probably had the same kind of thought here, but you have gone to the documentation and proven that to be incorrect. More than likely, I would say it will boil down to a case of "that's the way we have always done it, because someone one time said it was better".
It is possible they spend a LOT of time in MS Access and see 'Number' often and just figured, its a number, why not use numeric?
Based on your findings, it doesn't sound like they are the optimization experts, and just didn't know. I'm wondering if they used schema generation tools and just relied on them too much.
I wonder how efficient an index on a decimal value (even if 0 scale is set) for a primary key compares to a pure integer value.
Like Mark H. said, other than the indexing factor, this particular scenario likely isn't growing the database THAT much, but if you're looking for ammo, I think you did find some to belittle them with.
In your citation, the decimal shows precision of 1-9 as using 5 bytes. Your column apparently has 12,0 - using 4 bytes of storage - same as integer.
Moreover, INT, datatype can go to a power of 31:
-2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647)
While decimal is much larger to 38:
- 10^38 +1 through 10^38 - 1
So the software creator was actually providing more while using the same amount of storage space.
Now, with the basics out of the way, the software creator actually limited themselves to just 12 numbers or 123,456,789,012 (just an example for place holders not a maximum number). If they used INT they could not scale this column - it would go up to the full 31 digits. Perhaps there is a business reason to limit this column and associated columns to 12 digits.
An INT is an INT, while a DECIMAL is scalar.
Hope this helps.
PS:
The whole number argument is:
A) Whole numbers are 0..infinity
B) Counting (Natural) numbers are 1..infinity
C) Integers are infinity (negative) .. infinity (positive)
D) I would not cite WikiANYTHING for anything. Come on, use a real source! May as well be http://MyPersonalMathCite.com

How do you handle BLOB and numerical data efficiently in database communication?

SQL databases seem to be the cornerstone of most software. However, it seems optimized for textual data. In fact when doing any queries involving numerical data, integers specifically, it seems inefficient that the numbers are getting converted to text and then back to native formats both ways between the application and the database. This same inefficiency seems to apply to BLOB data as well. My understanding is that even with something like Linq to SQL, this two way conversion is occuring in the background.
Are there general ways to bypass this overhead with SQL? Are there certain database management systems that handle this more efficiently than others (ie, with non-standard extensions/API's)?
Clarification. In the following select statement, the list of numbers after IN could be more easily passed as a raw array of int, but there seems to be no way of achieving that optimization level.
SELECT foo FROM bar WHERE baz IN (23, 34, 45, 9854004, ...)
Don't suppose. Measure.
Format conversion is not likely to be a measurable cost for database work, unless you are misusing the database as an arithmetic engine.
The IO cost for LOBs, especially for CLOBS with character conversion, can become significant; the remedy here, once you know that the simplest thing that might work actually has a noticeable performance impact, is to minimize the number of times you copy the LOB data. Use whatever SQL parameter binding style allows you to transfer the data directly between its point of creation or use, and the database -- often this is binding the LOB to a stream or I/O channel.
But don't do this until you have a way to measure the impact, and have measurements showing that this is your bottleneck.
Numerical data in a database is not stored as text. I guess it depends on the database, but it certainly doesn't have to be and isn't.
BLOBs are stored exactly how you set them -- by definition, the DB has no way to interpret the information -- I guess it could compress if it found that to be useful. BLOBs are not translated into text.
Here's how Oracle stores numbers:
http://download.oracle.com/docs/cd/B28359_01/server.111/b28318/datatype.htm#i16209
Internal Numeric Format
Oracle Database stores numeric data in variable-length format. Each value is stored in scientific notation, with 1 byte used to store the exponent and up to 20 bytes to store the mantissa. The resulting value is limited to 38 digits of precision. Oracle Database does not store leading and trailing zeros. For example, the number 412 is stored in a format similar to 4.12 x 102, with 1 byte used to store the exponent(2) and 2 bytes used to store the three significant digits of the mantissa(4,1,2). Negative numbers include the sign in their length.
MySQL info here:
http://dev.mysql.com/doc/refman/5.0/en/numeric-types.html
Look at the table -- a TINYINT is represented in 1 byte (range -128 - 127), not possible if stored as text.
EDIT: With the clarification -- I would say use the API in your language that looks something like this (pseudocode)
stmt = conn.Prepare("SELECT * FROM TABLE where x in (?, ?, ?)");
stmt.SetInt(0, x);
stmt.SetInt(1, y);
stmt.SetInt(2, z);
I don't believe that the underlying protocols use text for the transport of parameters.