Price for literal vs int in Amazon CloudSearch

I'm designing a search domain and here is my problem: I have a few columns that don't need to be searchable, returnable, or faceted. They have only a few possible values and will be used as filters in the fq parameter. The question is: which data type should I go with? Int or literal?
I'd get a more maintainable solution with literal, because int would introduce an external dependency on another database/enum in my code. But will using literals increase the price, and by how much? I couldn't find the answer in the CloudSearch documentation.

Amazon CloudSearch's pricing model has the following components:
Instance hourly rate
Indexing cost based on the size of your data
Transfer rates
Your instance rate will depend on the amount of data you're storing. Essentially, every one of those components is directly or indirectly dependent on the size of the information you're storing.
Based on that, it would make sense to go with an integer field instead of a literal. A number is cheaper to store than its literal equivalent. The difference might be negligible in the initial stages, but when you have a HUGE amount of information being indexed, it could be the difference between a small and a medium instance, and therefore save you some money.

Related

Redis bitmap split key division strategy

I'm grabbing and archiving A LOT of data from the Federal Election Commission public data API, which has a unique record identifier called "sub_id" that is a 19-digit integer.
I'd like a memory-efficient way to catalog which line items I've already archived, and Redis bitmaps immediately come to mind.
The documentation on Redis bitmaps indicates a maximum length of 2^32 bits (4294967296).
A 19-digit integer could theoretically range anywhere from 0000000000000000001 - 9999999999999999999. Now I know that the data source in question does not actually have nearly 10 quintillion records, so the IDs are clearly sparsely populated and not sequential. Of the data I currently have on file, the maximum ID is 4123120171499720404 and the minimum is 1010320180036112531. (I can tell the IDs are date-based because the 2017 and 2018 in the keys correspond to the dates of the records they refer to, but I can't suss out the rest of the pattern.)
If I wanted to store which line items I've already downloaded, would I need 2328306436 different Redis bitmaps? (9999999999999999999 / 4294967296 = 2328306436.54). I could probably work up a tiny algorithm that, given a 19-digit ID, divides by some constant to determine which split bitmap to check.
There is no way this strategy seems tenable, so I'm thinking I must be fundamentally misunderstanding some aspect of this. Am I?
A Bloom filter, such as the one provided by RedisBloom, would be an optimal solution (RedisBloom filters can even grow if you miscalculated your desired capacity).
After you create your filter with BF.RESERVE, you pass an 'item' to BF.ADD to insert it. The item can be as long as you want; the filter uses hash functions and a modulus to fit it to the filter size. When you want to check whether an item was already added, call BF.EXISTS with the 'item'.
In short, what you describe here is a classic example of when a Bloom filter is a great fit.
How many "items" are there? What is "A LOT"?
Anyway, a linear approach that uses a single bit to track each of the 10^19 potential items requires at least 1,250 petabytes, which makes it impractical (at the moment) to store in memory.
I would recommend that you teach yourself about probabilistic data structures in general, and after having grokked the tradeoffs look into using something from the RedisBloom toolbox.
If the IDs are not sequential and are very spread out, keeping track of which ones you've processed using a bitmap is not the best option, since it would waste a lot of memory.
However, it is hard to point to the best solution without knowing how many distinct sub_ids your data set has. If you are talking about a few tens of millions, a simple set in Redis may be enough.

How to find out how much space a SQL Server table uses?

Is it possible to get the amount of space on disk that a particular table uses? Let's say I have a million users stored in my table and I want to know how much space is required to store all of them and/or a single one.
Update:
I'm planning to use Redis to cache some fields from one particular table in memory so that I can retrieve the needed data quickly. So I need to estimate approximately how much space it will take and whether it will fit in memory. It obviously depends on the data types I use in my table, but if a table consists of several dozen fields, it would take too much time to count this one by one.
There is exactly such an answer for MySQL, though it's not suitable for SQL Server: How can you determine how much disk space a particular MySQL table is taking up? You can check it to see what I mean.
If you have SSMS, you can right-click on the table in the Object Explorer, go to Properties, and then look at the Storage page. The Data space field is the size of the data in that table, but it probably does not include some of the table's overhead.
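If you prefer a query to the UI, the built-in sp_spaceused procedure reports roughly the same numbers; a minimal sketch, assuming a table named dbo.Users:
EXEC sp_spaceused N'dbo.Users';   -- rows, reserved, data, index_size, unused for one table
EXEC sp_spaceused;                -- without arguments: totals for the whole database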
This is really an extended comment, because it does not directly answer the question.
For most purposes, you just take the size of the columns, add them together, and multiply by the number of rows. This lowballs the estimate, but it is reasonable, and (depending on how you handle the types) it might be a reasonable estimate of the size of the exported data.
That said, the storage of tables is a difficult matter. Here are some of the factors you need to take into account:
The size of individual fields. This is made slightly more difficult because some types have varying sizes, so those are entirely data-dependent.
The number of pages occupied by the table (or, equivalently, how full each data page is). Note that this can vary as rows are inserted, updated, and deleted.
The number of pages occupied by "overflow" data types, such as varchar(max).
Whether or not the data pages are compressed or encrypted.
The indexes for the table.
How full each index page is.
And, no doubt, I've left out a bunch of other relevant internal details (here is a place to start on page layouts).
In other words, there isn't a simple answer. Equivalent tables on two different systems could occupy very different amounts of space, and the same is true of the "same" table on the same system at different times.
The general answer when working with databases is that you need a lot more space than number of rows * row size -- I seem to recall using a factor of 3 at one point in time. In general, storage is pretty cheap, so this is usually not the limiting factor in using a database.
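If you'd rather measure actual allocation than estimate it, here is a hedged sketch against the sys.dm_db_partition_stats DMV (dbo.Users is a placeholder table name); pages are 8 KB each, and the CASE keeps rows from being counted once per nonclustered index:
SELECT
    SUM(ps.reserved_page_count) * 8 AS reserved_kb,
    SUM(ps.used_page_count)     * 8 AS used_kb,
    SUM(CASE WHEN ps.index_id IN (0, 1) THEN ps.row_count ELSE 0 END) AS row_count
FROM sys.dm_db_partition_stats AS ps
WHERE ps.object_id = OBJECT_ID(N'dbo.Users');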
We would need to see your full database schema, with tables and columns and all fields' data types. Without that information it's just a lucky guess. Here is a helpful cheat sheet of the sizes of each data type: https://www.connectionstrings.com/sql-server-2012-data-types-reference/
Then you just have to do the math and calculate the space needed for X, where X is your number of records.
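As a purely hypothetical worked example (the column mix and the one-million-row count are made up): a row of INT (4 bytes) + DATETIME (8 bytes) + an NVARCHAR(50) averaging 20 characters (2 bytes per character plus 2 bytes of length overhead, about 42 bytes) is roughly 54 bytes of payload, ignoring row and page overhead:
SELECT 1000000 * (4 + 8 + 42) / 1024 / 1024 AS approx_mb;   -- about 51 MB before overhead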

Different Types saved values to SQL Database

I am currently writing an application which will have a lot of transactions.
Each transaction will have a value, though the value can be an int, bit, short string, large string, etc...
I want to keep processing and storage to a minimum, as I would like to run this in the cloud. Should I have a lot of different fields on the transaction, e.g.
TransactionLine.valueint
TransactionLine.valuestring
TransactionLine.valuedecimal
TransactionLine.valuebool
or should I have separate tables for each transaction value type?
TransactionLine - Table
---------------
TransactionLine.ValueId
ValueInt - Table
-------
ValueInt.ValueId
ValueInt.Value
ValueString - Table
-------
ValueString.ValueId
ValueString.Value
You could store key-value pairs in the database. The only data types that can hold any other data type are VARCHAR(MAX) and BLOB, which means all data must be converted to a string before it can be stored. That conversion takes processing time.
In the opposite direction, when you want to do a SUM or a MAX or an AVG of numeric data, you first have to convert the string back to its real data type. That conversion, too, takes processing time.
Databases are read a lot more than they are written to. The conversion nightmare will bring your system to its knees. There has been a lot of debate on this topic, and the high cost of conversions is the killer.
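To make the conversion cost concrete, here is a hedged sketch (the table and column names are hypothetical, not from the question) of what aggregating string-stored values would look like:
CREATE TABLE TransactionValue (
    ValueId   INT PRIMARY KEY,
    ValueText VARCHAR(MAX) NOT NULL   -- ints, decimals, and bools all serialized to text
);
SELECT SUM(CAST(ValueText AS DECIMAL(18, 2))) AS total
FROM TransactionValue;                -- every row is converted back, and it fails if any value is non-numeric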
There are systems that store the whole database in one single table. But in those cases the whole system is built with one clear goal: to support that system efficiently in a fast compiled programming language, like C(++, #), not in a relational database language like SQL.
I'm not sure I fully understand what you really want. If you only want to store the transactions, this may be worth trying. But why do you want to store them one field at a time? Data is stored in groups, in records, and the data type of each and every column in a record is known when the table is created.
You should really look into Cassandra. When you say a lot of transactions, do you mean millions of records? For Cassandra, handling millions of records is the norm. You will have a column family (in an RDBMS, a table is similar to a column family) storing many rows, and for each row you do not need to predefine the columns. They can be defined on demand, thus reducing storage dramatically, especially if you are dealing with a lot of records.
You do not need to worry about whether the data is an int, string, decimal, or bool, because the default data type for a column value is BytesType. There are other data types you can predefine in the column family's column metadata if you want to. Since you are just starting to write the application, I suggest you spend some time reading up on Cassandra and how it would help in your situation.

Is it sensible to store long, unique text strings in OLAP cubes for drillthrough retrieval (especially in SSAS)?

I'm motivated to store some long text strings in an OLAP cube, long on the order of 1,000s or 10,000s of characters -- but I'm wondering if this will lead me astray. (I'm also curious to learn a little more about how OLAP engines handle strings.) The particular use case I have in mind is that I have a unique, pre-existing "record description" for each of my OLAP facts, and I want to put those descriptions in the cube so that I have the option to get them back when I do a DRILLTHROUGH operation. In contrast, I don't need the record descriptions to appear when doing normal pivot table / aggregate type operations. (The descriptions are too long to display sensibly in a pivot table, plus each fact has a unique description, meaning it doesn't make sense to aggregate over descriptions.) My current dataset has around 700,000 facts, though I'm also curious if the answer would change for larger datasets.
My hope was that an OLAP server could do something sensible if I put these long strings in a cube. In the Sql Server / SSAS case in particular, I thought perhaps I'd put them in a dimension marked as ROLAP, to save memory usage, and use a degenerate dimension (aka a "fact dimension", in SSAS terminology), to avoid needless ETL complexities. But I'm curious if this would be regarded as a horrible practice for some reason, or if there are any hidden gotchas.
Update: My example use case is where you have a string associated with each OLAP fact. But it might also be instructive to consider the case where the strings are instead associated with each particular value of a particular dimension. (e.g. Suppose you had a Company dimension and each company had a somewhat lengthy Company Description string.)
Here's what I've been able to uncover about the implications of storing such strings in SSAS, especially SSAS 2008. Where I consider data structures, it's exclusively focused on MOLAP storage, which is what I've been experimenting with.
First, standard MS ETL (extract/transform/load, i.e. data import) tools like Business Intelligence Development Studio may try to prevent you from importing large textfields, especially varchar(max) fields, but there is a workaround, and it's proven effective for me. (For BIDS it involves manually setting the DataSize element in an XML file, potentially to the magic size of 163315555 bytes. Props to Matija Lah for figuring this out.)
Second, as far as I can tell, storing lots of long, unique strings shouldn't wreak havoc on the on-disk data structures used by SSAS. Also, the size of the string data on disk should be of the same order of magnitude as the string data in your data source. Here's some rough info on how SSAS handles strings:
The core OLAP data structures (e.g. for the attributes of a dimension, or for the facts of a measure group) don't directly contain strings; instead they contain offsets into "string store" files (extensions .ksstore, .asstore, .bsstore, or .string.data), which contain the actual string data.
Within a given string store, each string is represented only once. If several rows in your source data tables contain duplicate strings, then at the SSAS/MOLAP level that will translate into duplicated file offsets, rather than duplicated string values.
If your source string has length n, then the corresponding data structure in the string store has 8-ish bytes of overhead plus 2*n bytes for the characters. (Strings are inherently stored in 2-byte Unicode format in SSAS.)
For some fantastic detail about this stuff, I suggest the book Microsoft SQL Server 2008 Analysis Services Unleashed, in particular chapter 20, "The Physical Data Model".
At least in my experiments, string store files do not seem to be compressed -- at least they're not notably smaller than an uncompressed string store would be.
I've verified experimentally that text data takes the same order of magnitude of bytes whether stored in SSAS MOLAP or in a SQL table. In particular, I did a "select sum(len(myfield)) from mytable" on one of my dimension tables, and then compared the result to the size of the corresponding attribute's files in my SSAS data directory. Size was 172MB in SQL and 304MB in SSAS. (The SQL size was 147MB if I summed all unique strings, rather than all strings.) In my case the size difference was mostly explained by character encoding; my source SQL data is stored with one byte per character, whereas SSAS stores all strings with two bytes per character. I found that the .ksstore file totally dominated all the other files associated with this attribute in size, regardless of whether or not I optimized the attribute via AttributeHierarchyOptimizedState=FullyOptimized.
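For reference, a sketch of that comparison query (myfield and mytable are the placeholder names used above), computing both the all-rows and distinct-values totals:
SELECT
    (SELECT SUM(LEN(myfield)) FROM mytable) AS all_strings_chars,
    (SELECT SUM(LEN(f)) FROM (SELECT DISTINCT myfield AS f FROM mytable) AS d) AS unique_strings_chars;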
Third, there is a 4GB cap on the size of string store files, which limits the amount of unique text that can be associated, say, with a particular dimension/attribute. In my case I'm less than 10% of the way to the limit, but this might affect some people. (Quick order-of-magnitude calculation for the original post: 1M facts * 10,000 bytes per fact = 10GB-ish worth of text.) If you do hit this limit, you'll apparently hit it at cube "processing" time. Apparently it applies even to ROLAP dimensions. There may be some hacks to work around this. See here. Note that Sql Server 2012 may remove this 4GB limitation.
Fourth, it seems that if long unique strings create a problem in SSAS, they do so at the level of the in-memory representation. One potential problem (that I haven't looked into in detail) is that having these extra strings cached in memory will keep SSAS from keeping other important data structures in memory, and thus degrade performance. Another problem, suggested by the book The Microsoft Data Warehouse Toolkit (though I haven't yet found this claim elsewhere), is that SSAS does some expansive string padding on its in-memory data structures:
"The relational database stores variable length string columns ... However, other parts of the SQL Server toolset will fill these columns out to their full width. Notable, Integration Services and Analysis Services pad string columns with spaces as they are loaded into memory. Both Integration Services and Analysis Services love physical memory, so there's a cost to declaring string columns that are far wider than they need to be."
To conclude, so far storing my long string data in the cube seems convenient, and I haven't uncovered any reasons to expect disaster, so I'm giving it a try. I'll try to provide an update if things don't work out.
You could store the values relationally in a table and create an integer surrogate key.
Add the integer surrogate key to your UDM and create an SSRS drillthrough action
http://msdn.microsoft.com/en-US/library/ms174526(v=SQL.90).aspx
that looks up the text field by the key value.
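A minimal sketch of that relational side (the table and column names here are hypothetical):
CREATE TABLE dbo.FactDescription (
    DescriptionKey  INT IDENTITY(1,1) PRIMARY KEY,
    DescriptionText VARCHAR(MAX) NOT NULL
);
The fact table then carries only DescriptionKey, and the drillthrough action looks up DescriptionText by that key at report time.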
I would use a degenerate dimension, but hide it via SSAS until requested via a Drillthrough Action.
I can't guide you on the internal storage of strings for the AS engine, but as for storing them in SQL, I would make sure your varchar(MAX) column is at the end of your column list to speed up the SQL engine's scanning of those rows.
At 700,000 rows, with enough memory and disk I/O, you aren't taxing SQL much.
I haven't worked through all the possibilities described and linked to from it yet, but this thread from 2007 is on the same topic and seems pretty relevant:
http://www.sqldev.org/sql-server-analysis-services/discussion-about-how-to-create-a-fact-drillthrough-dimension-the-best-way-34857.shtml
One new possibility raised here is that, rather than treating text stored in the fact table as a degenerate dimension, you could potentially treat it as a text-valued (vs numeric-valued) measure. Initial googling suggests that SSAS might support this but there are some tricks to getting this right, e.g. you probably want to disable aggregation for that measure, you might need to do something non-standard to get the field to appear in a drillthrough, and it might require SSAS enterprise edition.

Representing Sparse Data in PostgreSQL

What's the best way to represent a sparse data matrix in PostgreSQL? The two obvious methods I see are:
Store the data in a single table with a separate column for every conceivable feature (potentially millions), but with a default value of NULL for unused features. This is conceptually very simple, but I know that with most RDBMS implementations this is typically very inefficient, since the NULL values usually take up some space. However, I read an article (can't find its link, unfortunately) that claimed PG doesn't use space for NULL values, making it better suited for storing sparse data.
Create separate "row" and "column" tables, as well as an intermediate table to link them and store the value for the column at that row. I believe this is the more traditional RDBMS solution, but there's more complexity and overhead associated with it.
I also found PostgreDynamic, which claims to better support sparse data, but I don't want to switch my entire database server to a PG fork just for this feature.
Are there any other solutions? Which one should I use?
I'm assuming you're thinking of sparse matrices in the mathematical sense:
http://en.wikipedia.org/wiki/Sparse_matrix (The storage techniques described there are for memory storage (fast arithmetic operations), not persistent storage (low disk usage).)
Since one usually operates on these matrices on the client side rather than on the server side, a SQL ARRAY[] is the best choice!
The question is how to take advantage of the sparsity of the matrix. Here are the results of some investigations.
Setup:
Postgres 8.4
Matrices w/ 400*400 elements in double precision (8 Bytes) --> 1.28MiB raw size per matrix
33% non-zero elements --> 427kiB effective size per matrix
averaged using ~1000 different randomly populated matrices
Competing methods:
Rely on the automatic server side compression of columns with SET STORAGE MAIN or EXTENDED.
Only store the non-zero elements plus a bitmap (bit varying(xx)) describing where to locate the non-zero elements in the matrix. (One double precision is 64 times bigger than one bit. In theory (ignoring overheads) this method should be an improvement if <=98% are non-zero ;-).) Server side compression is activated.
Replace the zeros in the matrix with NULL. (The RDBMSs are very effective in storing NULLs.) Server side compression is activated.
(Indexing of non-zero elements using a second index ARRAY[] is not very promising and was therefore not tested.)
Results:
Automatic compression
no extra implementation efforts
no reduced network traffic
minimal compression overhead
persistent storage = 39% of the raw size
Bitmap
acceptable implementation effort
network traffic slightly decreased; dependent on sparsity
persistent storage = 33.9% of the raw size
Replace zeros with NULLs
some implementation effort (API needs to know where and how to set the NULLs in the ARRAY[] while constructing the INSERT query)
no change in network traffic
persistent storage = 35% of the raw size
Conclusion:
Start with the EXTENDED/MAIN storage parameter. If you have some free time, investigate your data and use my test setup with your own sparsity level. But the effect may be lower than you expect.
I suggest always using matrix serialization (e.g. row-major order) plus two integer columns for the matrix dimensions NxM. Since most APIs use textual SQL, you save a lot of network traffic and client memory compared to nested "ARRAY[ARRAY[..], ARRAY[..], ARRAY[..], ARRAY[..], ..]"!
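A minimal sketch of that layout (the table and column names are mine, not part of the test setup below):
CREATE TABLE _testschema.matrix_rowmajor
(
nrows   integer NOT NULL,       -- N
ncols   integer NOT NULL,       -- M
matdata double precision[]      -- N*M elements, serialized in row-major order
);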
Test setup:
CREATE TABLE _testschema.matrix_dense
(
matdata double precision[]
);
ALTER TABLE _testschema.matrix_dense ALTER COLUMN matdata SET STORAGE EXTERNAL;
CREATE TABLE _testschema.matrix_sparse_autocompressed
(
matdata double precision[]
);
CREATE TABLE _testschema.matrix_sparse_bitmap
(
matdata double precision[],
bitmap bit varying(8000000)
);
Insert the same matrices into all tables. The concrete data depends on the particular table.
Do not change the data on the server side afterwards, because that leaves unused but allocated pages; or do a VACUUM.
SELECT
pg_total_relation_size('_testschema.matrix_dense') AS dense,
pg_total_relation_size('_testschema.matrix_sparse_autocompressed') AS autocompressed,
pg_total_relation_size('_testschema.matrix_sparse_bitmap') AS bitmap;
A few solutions spring to mind,
1) Separate your features into groups that are usually set together, create a table for each group with a one-to-one foreign key relationship to the main data, only join on tables you need when querying
2) Use the EAV anti-pattern, create a 'feature' table with a foreign key field from your primary table as well as a fieldname and a value column, and store the features as rows in that table instead of as attributes in your primary table
3) Similarly to how PostgreDynamic does it, create a table for each 'column' in your primary table (they use a separate namespace for those tables), and create functions to simplify (as well as efficiently index) accessing and updating the data in those tables
4) create a column in your primary data using XML, or VARCHAR, and store some structured text format within it representing your data, create indexes over the data with functional indexes, write functions to update the data (or use the XML functions if you are using that format)
5) use the contrib/hstore module to create a column of type hstore that can hold key-value pairs, and can be indexed and updated (see the hstore sketch after this list)
6) live with lots of empty fields
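For option 5, a minimal hstore sketch (the table and column names are assumptions; on 9.1+ you can CREATE EXTENSION hstore, while 8.4/9.0 load the contrib SQL script instead):
CREATE EXTENSION hstore;
CREATE TABLE observations
(
id       serial PRIMARY KEY,
features hstore                 -- only the keys that are actually set get stored
);
INSERT INTO observations (features) VALUES ('color => red, weight => 12.5');
CREATE INDEX observations_features_idx ON observations USING gist (features);
SELECT * FROM observations WHERE features ? 'color';   -- rows that have the "color" key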
A NULL value will take up no space when it's NULL. It'll take up one bit in a bitmap in the tuple header, but that will be there regardless.
However, the system can't deal with millions of columns, period. There is a theoretical max of a bit over a thousand, IIRC, but you really don't want to go that far.
If you really need that many, in a single table, you need to go the EAV method, which is basically what you're saying in (2).
If each entry has only a relatively few keys, I suggest you look at the "hstore" contrib module, which lets you store this type of data very efficiently, as a third option. It's been enhanced further in the upcoming 9.0 version, so if you are a bit away from production deployment, you might want to look directly at that one. However, it's well worth it in 8.4 as well, and it supports some pretty efficient index-based lookups. Definitely worth looking into.
I know this is an old thread, but MadLib provides a sparse vector type for Postgres, along with several machine learning and statistical methods.