How to do "GROUP BY" mathematically? - sql

I have a data structure of key value pairs and I want to implement "GROUP BY" value.
Both keys and values are strings.
So what I did was give every value (string) a unique prime number. Then for every key I stored the product of all the prime numbers associated with the different values that the key has.
So if key "Anirudh" has values "x", "y", "z", then I also store the number M(key) = 2*3*5 = 30.
Later, if I want to group by a particular value "x" (say), I just iterate over all the keys and divide each M(key) by the prime number associated with "x". If the remainder is zero, then that particular key is part of the group for value "x".
I know that this is a weird way to do it. Some people sort the key-value pairs (sorted by values). I could also have created another table (a hash table) already grouped by values. So I want to know a better method than mine (there must be many). In my method, as the number of distinct values for a particular key increases, the product of the prime numbers also grows (and grows exponentially at that).
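For concreteness, a minimal SQL sketch of the method described above, assuming a hypothetical table kv(k, m) where m is the stored product of primes and the prime 5 was assigned to value "x":

SELECT k
FROM   kv
WHERE  m % 5 = 0;   -- keys whose prime product is divisible by 5, i.e. keys that have value "x"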

Your method will always perform O(n) to find group members, because you have to iterate through all elements of the collection to find the elements belonging to the target group. Your method also risks overflowing common integer bounds (32, 64 bit) if you have many elements, since you are multiplying a potentially large number of primes together to form your key.
You will find it more efficient, and certainly more predictable, to use a bit mask to track group membership. If you have 16 groups, you can represent that with a 16-bit short using a bit mask. Using primes as you suggest, you would need an integer with enough bits to hold the number 32589158477190044730 (the first 16 primes multiplied together), which would require 65 bits.
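As a rough sketch of the bit-mask idea, assuming at most 16 distinct values, each assigned a bit position, and a hypothetical table kv_mask(k, value_mask):

SELECT k
FROM   kv_mask
WHERE  value_mask & (1 << 2) <> 0;   -- keys whose value set contains the value assigned to bit 2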
Other approaches to grouping are also O(n) for the first iteration (after all, each element must be tested at least once for group membership). However, if you tend to repeat the same group checks, the other methods you refer to (e.g. keeping a list or hash table per target group) are much more efficient, because subsequent group membership tests are O(1).
So to directly answer your question:
If there are multiple queries for group membership (repeating some groups), any solution that stores the groups (including the ones you suggest in your question) will perform better than your method.
If there are no repeat queries for group membership, there is no advantage to storing group membership.
Given that repeat queries seem likely based on your question:
Use a structure such as a list keyed off of a group ID to store group membership if you want to trade memory to get more speed (see the sketch below).
Use a suitably wide bit array to store group membership if you want to trade speed for less memory.
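A minimal sketch of the first option in SQL terms, assuming a hypothetical group_members table keyed by group ID:

CREATE TABLE group_members (
    group_id int  NOT NULL,
    k        text NOT NULL,
    PRIMARY KEY (group_id, k)
);

-- Fetching a whole group, or testing one key's membership, is then an index lookup:
SELECT k FROM group_members WHERE group_id = 3;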

I have no real idea what is being asked here, but this sounds similar to (though much more computationally expensive than) a bit vector or a sum of powers of 2. The first value is "1", the second is "2", the third is "4" and so on. If you get "7", you know it is "first" + "second" + "third".

Related

Best way to save a sorting order for rows

I have a table where the rows have a particular order. Users can insert new rows at any place in the ordering.
Our current strategy is:
Use column called "order_index" with Integer type
Have rows with their "order_index" separated by 10000
When a new row is inserted, assign it the integer halfway between its neighbors
If rows become too tightly packed (i.e. neighbors differ by one), then lock and re-assign "order_index" to all rows, incrementing by 10000
This is obviously somewhat complex and the re-assigning is not optimal since it takes longer than we'd like. Any better approach to this?
If you use a floating point index, there is (nearly) always a number halfway in between. Note, though, that a binary float has a limited number of mantissa bits, so after roughly 50 consecutive splits in the same spot you run out of precision and still need an occasional re-spread, just less often than with integers.
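A minimal sketch of the midpoint insert, assuming a hypothetical table items(id, name, order_index) and that the new row belongs between the rows with id 42 and id 43:

INSERT INTO items (name, order_index)
SELECT 'new row', (a.order_index + b.order_index) / 2   -- midpoint of the two neighbours
FROM   items a, items b
WHERE  a.id = 42 AND b.id = 43;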

Redis bitmap split key division strategy

I'm grabbing and archiving A LOT of data from the Federal Elections Commission public data source API which has a unique record identifier called "sub_id" that is a 19 digit integer.
I'd like to think of a memory efficient way to catalog which line items I've already archived and immediately redis bitmaps come to mind.
Reading the documentation on redis bitmaps indicates a maximum storage length of 2^32 (4294967296).
A 19 digit integer could theoretically range anywhere from 0000000000000000001 - 9999999999999999999. Now I know that the data source in question does not actually have ten quintillion records, so the ids are clearly sparsely populated and not sequential. Of the data I currently have on file, the maximum ID is 4123120171499720404 and the minimum is 1010320180036112531. (I can tell the ids are date-based, because the 2017 and 2018 in the keys correspond to the dates of the records they refer to, but I can't suss out the rest of the pattern.)
If I wanted to store which line items I've already downloaded, would I need 2328306436 different redis bitmaps? (9999999999999999999 / 4294967296 = 2328306436.54.) I could probably work up a tiny algorithm that, given a 19 digit id, divides by some constant to determine which split bitmap index to check.
There is no way this strategy seems tenable so I'm thinking I must be fundamentally misunderstanding some aspect of this. Am I?
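For what it's worth, the split being described is just integer division and modulo on the sub_id; a sketch, using the maximum id from the question:

SELECT 4123120171499720404 / 4294967296 AS bitmap_index,   -- which of the split bitmaps to use
       4123120171499720404 % 4294967296 AS bit_offset;     -- which bit inside that bitmap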
A Bloom Filter such as RedisBloom will be an optimal solution (RedisBloom can even grow if you miscalculated your desired capacity).
After you create your filter with BF.RESERVE, you pass BF.ADD an 'item' to be inserted. The item can be as long as you want. The filter uses hash functions and a modulus to fit it to the filter size. When you want to check whether an item was already added, call BF.EXISTS with the 'item'.
In short, what you describe here is a classic example of when a Bloom Filter is a great fit.
How many "items" are there? What is "A LOT"?
Anyway. A linear approach that uses a single bit to track each of the 10^19 potential items requires at least 1250 petabytes (10^19 bits ≈ 1.25 × 10^18 bytes). This makes it impractical (at the moment) to store in memory.
I would recommend that you teach yourself about probabilistic data structures in general, and after having grokked the tradeoffs look into using something from the RedisBloom toolbox.
If the ids are not sequential and are very spread out, keeping track of which ones you have processed using a bitmap is not the best option, since it would waste a lot of memory.
However, it is hard to point to the best solution without knowing how many distinct sub_ids your data set has. If you are talking about a few tens of millions, a simple set in Redis may be enough.

When should I use CYCLE in a sequence?

I'm using sequences in a PostgreSQL database to insert rows into tables.
When creating the sequences I have never used the CYCLE option on them. I mean, they can generate pretty big numbers (on the order of 2^63 as far as I remember) and I don't really see why I would want a sequence to go back to zero. So my question is:
When should I use CYCLE while creating a sequence?
Do you have an example where it makes sense?
It seems a sequence can use CYCLE for purposes other than primary key generation.
That is, in scenarios where the uniqueness of its values is not required; in fact quite the opposite, when the values are expected to cycle back and repeat themselves after some time.
For example:
When generating numbers that must return to the initial value and repeat themselves at some point, for any reason (e.g. implementing a "Bingo" game; see the sketch after this list).
When the sequence is a temporary identifier that will last for a short period of time and will be unique during its life.
When the field is small -- or can accept a limited number of values -- and it doesn't matter if they repeat themselves.
When there is another field in the entity that will identify it, and the sequence value is used for something else.
When an entity has a composite unique key and the sequence value is only a part of it.
When using the sequence value to generate a uniform distribution of values over a big set, though this is hardly a random assignment of values.
Any other cyclic number generation.
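For instance, a minimal PostgreSQL sketch of the "Bingo"-style case above (the sequence name is hypothetical):

CREATE SEQUENCE bingo_ball
    MINVALUE 1
    MAXVALUE 75
    START WITH 1
    INCREMENT BY 1
    CYCLE;

SELECT nextval('bingo_ball');   -- 1, 2, ..., 75, then 1 again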

What's the database performance improvement from storing as numbers rather than text?

Suppose I have text such as "Win", "Lose", "Incomplete", "Forfeit", etc. I can store the text directly in the database. Instead, if I use numbers such as 0 = Win, 1 = Lose, etc., would I get a material improvement in database performance? Specifically on queries where the field is part of my WHERE clause.
At the CPU level, comparing two fixed-size integers takes just one instruction, whereas comparing variable-length strings usually involves looping through each character. So for a very large dataset there should be a significant performance gain with using integers.
Moreover, a fixed-size integer will generally take less space and can allow the database engine to perform faster algorithms based on random seeking.
Many database systems, however, have an enum type which is meant for cases like yours - in the query you compare the field value against a fixed set of literals, while internally it is stored as an integer.
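For example, in PostgreSQL (type and table names are hypothetical):

CREATE TYPE outcome AS ENUM ('Win', 'Lose', 'Incomplete', 'Forfeit');

CREATE TABLE games (
    id     serial PRIMARY KEY,
    result outcome NOT NULL
);

-- The WHERE clause still reads like text, but the comparison runs on the
-- enum's compact internal representation:
SELECT count(*) FROM games WHERE result = 'Win';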
There might be significant performance gains if the column is used in an index.
It could range anywhere from negligible to extremely beneficial depending on the table size, the number of possible values being enumerated and the database engine / configuration.
That said, it almost certainly will never perform worse to use a number to represent an enumerated type.
Don't guess. Measure.
Performance depends on how selective the index is (how many distinct values are in it), whether critical information is available in the natural key, how long the natural key is, and so on. You really need to test with representative data.
When I was designing the database for my employer's operational data store, I built a testbed with tables designed around natural keys and with tables designed around id numbers. Both those schemas have more than 13 million rows of computer-generated sample data. In a few cases, queries on the id number schema outperformed the natural key schema by 50%. (So a complex query that took 20 seconds with id numbers took 30 seconds with natural keys.) But 80% of the test queries had faster SELECT performance against the natural key schema. And sometimes it was staggeringly faster--a difference of 30 to 1.
The reason, of course, is that lots of the queries on the natural key schema need no joins at all--the most commonly needed information is naturally carried in the natural key. (I know that sounds odd, but it happens surprisingly often. How often is probably application-dependent.) But zero joins is often going to be faster than three joins, even if you join on integers.
Clearly if your data structures are shorter, they are faster to compare AND faster to store and retrieve.
How much faster? 1×, 2×, 1000×? It all depends on the size of the table and so on.
For example: say you have a table with a productId and a varchar text column.
Each row will take roughly 4 bytes for the int and then another 3 to 24 bytes for the text in your example (depending on whether the column is nullable or Unicode).
Compare that to 5 bytes per row for the same data with a byte status column.
This huge space saving means more rows fit in a page, more data fits in the cache, fewer writes happen when you load or store data, and so on.
Also, comparing strings is at best as fast as comparing bytes, and at worst much slower.
There is a second huge issue with storing text where you intended to have an enum. What happens when people start storing Incompete as opposed to Incomplete?
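A sketch of the lookup-table variant (hypothetical names), which gets you the narrow column and lets a foreign key reject misspellings outright:

CREATE TABLE statuses (
    status_id smallint PRIMARY KEY,
    name      varchar(20) NOT NULL UNIQUE
);

INSERT INTO statuses VALUES (0, 'Win'), (1, 'Lose'), (2, 'Incomplete'), (3, 'Forfeit');

CREATE TABLE results (
    id        serial PRIMARY KEY,
    status_id smallint NOT NULL REFERENCES statuses (status_id)
);
-- A row can only reference a status that exists in the lookup table,
-- so a stray "Incompete" can never sneak in.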
Having a skinnier column means that you can fit more rows per page.
There is a HUGE difference between a varchar(20) and an integer.

How can I improve performance of average method in SQL?

I'm having some performance problems where a SQL query calculating the average of a column is progressively getting slower as the number of records grows. Is there an index type that I can add to the column that will allow for faster average calculations?
The DB in question is PostgreSQL and I'm aware that a particular index type might not be available, but I'm also interested in the theoretical answer: whether this is even possible without some sort of caching solution.
To be more specific, the data in question is essentially a log with this sort of definition:
CREATE TABLE log (
    duration int,
    time     date,
    event    text
);
I'm doing queries like
SELECT avg(duration) FROM log WHERE event = 'finished'; -- average time to completion
SELECT avg(duration) FROM log WHERE event = 'finished' AND time > $yesterday; -- average for today
The second one is always fairly fast since it has a more restrictive WHERE clause, but the total average duration is the type of query that is causing the problem. I understand that I could cache the values, using OLAP or something; my question is whether there is a way I can do this entirely with DB-side optimisations such as indexes.
The performance of calculating an average will always get slower the more records you have, as it always has to use the values from every record in the result.
An index can still help, if the index contains less data than the table itself. Creating an index on just the field that you want the average of generally isn't helpful, as you don't want to do a lookup - you just want to get to all the data as efficiently as possible. Typically you would add the field as an included (output) column on an index that is already used by the query, so the whole query can be answered from the index.
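For example, in PostgreSQL (11+ for INCLUDE), something along these lines against the log table from the question; the index names are hypothetical:

-- An index that narrows to the event and carries duration with it,
-- so the average can often be computed from an index-only scan:
CREATE INDEX log_event_incl_duration ON log (event) INCLUDE (duration);

-- Or a partial index covering just the event you aggregate most often:
CREATE INDEX log_finished_duration ON log (duration) WHERE event = 'finished';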
It depends on what you are doing. If you aren't filtering the data then, beyond having the clustered index in order, how else is the database to calculate an average of the column?
There are systems which perform online analytical processing (OLAP) that will do things like keeping running sums and averages of the information you wish to examine. It all depends on what you are doing and your definition of "slow".
If you have a web based program for instance, perhaps you can generate an average once a minute and then cache it, serving the cached value out to users over and over again.
Speeding up aggregates is usually done by keeping additional tables.
Assuming a sizeable table detail(id, dimA, dimB, dimC, value), if you would like to make the performance of AVG (or other aggregate functions) nearly constant time regardless of the number of records, you could introduce a new table
dimAavg(dimA, avgValue)
The size of this table will depend only on the number of distinct values of dimA. (Furthermore, this table could make sense in your design anyway, as it can hold the domain of values available for dimA in detail, along with other attributes related to those domain values; you might/should already have such a table.)
This table is only helpful if you will analyze by dimA only; once you need AVG(value) by dimA and dimB it becomes useless. So, you need to know by which attributes you will want to do fast analysis. The number of rows required for keeping aggregates on multiple attributes is n(dimA) x n(dimB) x n(dimC) x ..., which may or may not grow pretty quickly.
Maintaining this table increases the costs of updates (incl. inserts and deletes), but there are further optimizations that you can employ...
For example, let us assume that the system predominantly does inserts and only occasionally updates and deletes.
Let's further assume that you want to analyze by dimA only and that ids are increasing. Then having a structure such as
dimA_agg(dimA, Total, Count, LastID)
can help without a big impact on the system.
This is because you could have triggers that do not fire on every insert, but, let's say, on every 100 inserts.
This way you can still get accurate aggregates from this table and the detail table with
SELECT a.dimA,
       (SUM(d.value) + MAX(a.Total)) / (COUNT(d.id) + MAX(a.Count)) AS avgDimA
FROM   detail d
       INNER JOIN dimA_agg a ON a.dimA = d.dimA AND d.id > a.LastID
GROUP BY a.dimA;
The above query, with proper indexes, would get one row from dimA_agg and fewer than 100 rows from detail - this would perform in near constant time (~ log_fanout(n)) and would not require an update to dimA_agg for every insert (reducing update penalties).
The value of 100 was just given as an example; you should find the optimal value yourself (or even keep it variable, though triggers alone will not be enough in that case).
The triggers that maintain the aggregates for deletes and updates must fire on each operation, but you can still check whether the id of the record being deleted or updated is already included in the stats, to avoid unnecessary updates (this will save some I/O).
Note: the analysis above is for a domain with discrete attributes; when dealing with time series the situation gets more complicated - you have to decide the granularity of the domain in which you want to keep the summary.
EDIT
There are also materialized views.
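A minimal sketch for the log table from the question (the view has to be refreshed on whatever schedule suits you):

CREATE MATERIALIZED VIEW avg_duration_by_event AS
SELECT event, avg(duration) AS avg_duration
FROM   log
GROUP BY event;

REFRESH MATERIALIZED VIEW avg_duration_by_event;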
Just a guess, but indexes won't help much, since an average must read all the records (in any order); indexes are useful for finding subsets of rows, but if you have to iterate over all rows with no special ordering, indexes don't help...
This might not be what you're looking for, but if your table has some way to order the data (e.g. by date), then you can just do incremental computations and store the results.
For example, if your data has a date column, you could compute the average for records from the start up to Date1, then store the average for that batch along with Date1 and the number of records you averaged. The next time you compute, you restrict your query to the range Date1..Date2, add in the new record count and sum, and update the last date queried. You then have all the information you need to compute the new average.
When doing this, it would obviously be helpful to have an index on the date, or whatever column(s) you are using for the ordering.
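A rough sketch of that idea against the log table from the question, assuming a hypothetical running_avg table that is maintained by a scheduled job (maintenance not shown):

CREATE TABLE running_avg (
    event        text   PRIMARY KEY,
    total        bigint NOT NULL,   -- sum of duration folded in so far
    record_count bigint NOT NULL,   -- number of rows folded in so far
    last_time    date   NOT NULL    -- last date included in the totals
);

-- Combine the stored totals with only the rows that arrived after last_time:
SELECT (r.total + coalesce(sum(l.duration), 0))::numeric
       / (r.record_count + count(l.duration)) AS avg_duration
FROM   running_avg r
       LEFT JOIN log l ON l.event = r.event AND l.time > r.last_time
WHERE  r.event = 'finished'
GROUP BY r.event, r.total, r.record_count;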