My question is more theoretical, and it is about why RDBMSs/drivers return data the way they all do, not how they find the correct set, nor how to find it. I'm pretty familiar with SQL, but there is one thing that has always annoyed my sense of economy.
Consider the following "class" graph:
A {
    field1, ..., field9
    b_items = [ b1, ..., bN ]
}
B {
    field1, ..., field6
    c_items = [ c1, ..., cM ]
}
C {
    field1, field2
}
We have a few A objects, each A object has many B objects, and each B object has lots of C objects: count(A) < count(B) << count(C).
Now I would like to use an RDBMS to store it, because relations are cool and optimizers are smart, so I can get virtually anything in milliseconds, provided there is a good plan and index set.
I'll skip the table creation code, which should be obvious, and go straight to the select:
SELECT *
FROM A
LEFT JOIN B ON B.a_id = A.id
LEFT JOIN C ON C.b_id = B.id
WHERE whatever
The database server returns a result set made up of all columns from all tables, properly joined into a sort-of tree:
A.f1 .... A.f9 B.f1 .... B.f6 C.f1 C.f2
---------------------------------------------------
1 1 1 1 1 1 1 1
1 1 1 1 1 1 2 2
1 1 1 1 1 1 3 3
... more rows...
1 1 1 1 1 1 999 999
↓
1 1 1 2 2 2 1 1
1 1 1 2 2 2 2 2
... more rows...
1 1 1 2 2 2 999 999
... lots of rows ...
1 1 1 99 99 99 999 999
↓
2 2 2 -- oh there it is, A[2]
...
5 5 5 NULL NULL NULL NULL NULL -- A[5] has no b_items
...
9 9 9 ...
The problem is, if A has many columns, especially with text, json, or other heavy data, that data is duplicated thousands of times to match each row of the B+C join product. Why don't SQL servers at least stop sending the same {A,B} values after the first row of a join group? Ideally, I would like to see something like this as a result:
[
    {
        <A-fields>,
        B = [
            {
                <B-fields>,
                C = [
                    {
                        <C-fields>
                    },
                    ... more C rows
                ]
            },
            ... more B rows
        ]
    },
    ... more A rows
]
which pretty much resembles what I actually need to get in memory on the client side. I know I can make more queries to fetch less data, e.g. via A.id IN (ids...) or a stored proc returning nulls on parasite rows, but isn't the relational model intended for one-shot access? Roundtrips are heavy, and so are planner guesses. And real data graphs are rarely only 3 levels deep (consider 5-10). So why not do it all in a single pass, but without the excessive traffic?
I'm fine with duplicate cells in the A and B columns, because usually there are not too many of them, but maybe I'm missing something mainstream, SQL-based and non-hacky that google has been hiding from me for so many years.
Thanks!
The only way to avoid transferring duplicated data is to use aggregate functions like string_agg() or array_agg(). You can also aggregate the data using jsonb functions. You can even get a single json object instead of tabular data. Example:
select jsonb_agg(taba)   -- one json array containing every A object
from (
    -- each A row as json, with its aggregated B children under the key 'tabb'
    select to_jsonb(taba) || jsonb_build_object('tabb', jsonb_agg(tabb)) taba
    from taba
    left join (
        -- each B row as json, with its aggregated C children under the key 'tabc'
        select to_jsonb(tabb) || jsonb_build_object('tabc', jsonb_agg(to_jsonb(tabc))) tabb
        from tabb
        join tabc on tabc.bid = tabb.id
        group by tabb.id
    ) tabb
    on (tabb->>'aid')::int = taba.id
    group by taba.id
) taba
Complete working example.
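For the simpler string_agg()/array_agg() route mentioned above, a one-level sketch (table and column names assumed from the question's schema):

select a.*, array_agg(b.id) as b_ids   -- collapse each A's children into one array
from a
left join b on b.a_id = a.id
group by a.id;   -- grouping by the primary key lets Postgres accept a.*

Note that an A row without any B children yields {NULL}; add FILTER (WHERE b.id IS NOT NULL) to get a plain NULL instead.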
jsonb_agg() may not be the fastest thing. Also, I wonder if your ORM will digest it properly and instantiate the right objects.
The usual way is to simply do:
SELECT ... FROM a WHERE ...
Then you recover the ids, and do:
SELECT ... FROM b WHERE a_id IN (the list you just got)
SELECT ... FROM c WHERE b_id IN (the b.id list you just got)
These are usually autogenerated by an ORM. If the ORM is smart, you get one query per table; if it is dumb, you get one query per object... However, this forces three queries, with network roundtrips, plus some processing. Fortunately, postgres will let you have your cake and eat it too, although that takes a little bit of extra work.
You can create a function in plpgsql which returns SETOF refcursor. Since a refcursor is a cursor, such a function can return several result sets.
Example.
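For reference, a minimal sketch of such a function against the a/b/c schema from the question (names assumed):

create or replace function fetch_a_graph(p_a_id int)
returns setof refcursor
language plpgsql as $$
declare
    cur_a refcursor := 'cur_a';
    cur_b refcursor := 'cur_b';
    cur_c refcursor := 'cur_c';
begin
    -- one result set per table, so no A/B columns are ever duplicated
    open cur_a for select * from a where id = p_a_id;
    return next cur_a;
    open cur_b for select * from b where a_id = p_a_id;
    return next cur_b;
    open cur_c for
        select c.* from c join b on c.b_id = b.id where b.a_id = p_a_id;
    return next cur_c;
end;
$$;

The cursors must be consumed inside the same transaction: begin; select fetch_a_graph(1); then fetch all from cur_a; fetch all from cur_b; fetch all from cur_c; commit;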
Back in the day when I was doing SQL for websites, I used that a few times, mostly when you just want to fetch one object and a few dependencies, so the actual query parsing and planning would take longer than the queries themselves, which return one line or a few. Since it uses a function, everything is already compiled. It's very efficient.
Related
I have a table that looks as follows:
Policy Number   Benefit Code   Transaction Code
1               A              2
1               B              1
2               A              3
3               A              2
1               C              2
For analysis purposes, it would be much more convenient to have the table in the following form:
PN   BC 1   TC 1   BC 2   TC 2   BC 3   TC 3
1    A      2      B      1      C      2
2    A      3      NULL   NULL   NULL   NULL
3    A      2      NULL   NULL   NULL   NULL
I believe this can be done, for example, in R using the tidyverse package, where the concept is basically pivoting the table from long form to wide form. Now, I know that I could possibly use the LEAD function in SQL, but the problem is that I do not know how many benefit codes and transaction codes each policy has (i.e. they are not fixed).
Thus, my query is:
How can I "pivot wider" my table to achieve something like the above?
Other than "pivoting wider", is there a more dynamic form of the LEAD function in SQL that takes all subsequent rows of a group (in my case, each policy number) and puts them in new columns?
Any intuitive explanations or suggestions will be greatly appreciated :)
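For reference, one common approach (a sketch, not from the thread): number each policy's rows with ROW_NUMBER() and spread them out with conditional aggregation. The table name policy_codes and the cap of 3 pairs are assumptions; a truly dynamic number of columns needs dynamic SQL or client-side pivoting.

with numbered as (
    select policy_number, benefit_code, transaction_code,
           row_number() over (partition by policy_number
                              order by benefit_code) as rn
    from policy_codes
)
select policy_number as pn,
       max(case when rn = 1 then benefit_code end)     as bc_1,
       max(case when rn = 1 then transaction_code end) as tc_1,
       max(case when rn = 2 then benefit_code end)     as bc_2,
       max(case when rn = 2 then transaction_code end) as tc_2,
       max(case when rn = 3 then benefit_code end)     as bc_3,
       max(case when rn = 3 then transaction_code end) as tc_3
from numbered
group by policy_number;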
Suppose I have a table with the following structure:
id   measure_1_actual   measure_1_predicted   measure_2_actual   measure_2_predicted
1    1                  0                     0                  0
2    1                  1                     1                  1
3    .                  .                     0                  0
I want to create the following table for each id (shown here for id = 1):
measure   actual   predicted
1         1        0
2         0        0
Here's one way I could solve this problem (I haven't tested this, but you get the general idea, I hope):
SELECT 1 AS measure,
       measure_1_actual AS actual,
       measure_1_predicted AS predicted
FROM tb
WHERE id = 1
UNION
SELECT 2 AS measure,
       measure_2_actual AS actual,
       measure_2_predicted AS predicted
FROM tb
WHERE id = 1
In reality, I have five of these "measures" and tens of millions of people, and subsetting such a large table five times for each member does not seem like the most efficient way of doing this. This is a real-time API receiving tens of requests a minute, so I think I'll need a better way. My other thought was to perhaps create a temp table/view for each member once the request is received, and then UNION based off of that subsetted table.
Does anyone have a more efficient way of doing this?
You can use a lateral join:
select t.id, v.*
from t cross join lateral
(values (1, measure_1_actual, measure_1_predicted),
(2, measure_2_actual, measure_2_predicted)
) v(measure, actual, predicted);
Lateral joins were introduced in Postgres 9.3. You can read about them in the documentation.
I have one large table with ~10,000 rows of data and 100 columns that I want to update continuously. The problem is that the files (.csv) I will use for updates often are in a different column order or contain extra/missing columns. If there are extra columns in the update I am fine discarding them, but I want the remaining columns to match up exactly, even if some are missing or out of order.
I know that there is a solution in creating a select that simply lists all the columns, but I am looking for something more elegant/foolproof. Many of the examples I have seen work well enough using MERGE, UNION, or JOIN, but I can't get them to work for this much larger dataset, which is why it has been giving me so much trouble. I am not very experienced with SQL, so I would appreciate some additional padding in the explanation.
Where a, b, c, d are columns and 1 is data. Here is the master table:
a   b   c   d
1   1   1   1
Here is the update table:
b   c   d   e
1   _   1   1
Only imagine that there are 100 columns and 100 rows to append to the 10,000 stored.
Desired:
a   b   c   d   e
1   1   1   1
_   1   _   1   1
Or even:
a   b   c   d
1   1   1   1
_   1   _   1
Edit:
This answer is exactly what I want, but it doesn't seem possible in T-SQL:
https://stackoverflow.com/a/52524364/11777090
do a union all:
select a, b, c, d, 0 from table
union all
select 0, b, c, d, e from table
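For the "Or even" target above, a minimal T-SQL-compatible sketch (master_table and update_table are assumed names): list the shared columns explicitly and fill the one missing from the update with NULL; extra update columns such as e are simply never referenced.

-- Append the update rows; column a is absent from the .csv, so fill it with NULL.
-- Only the intersection of the two column sets is carried over.
insert into master_table (a, b, c, d)
select NULL, b, c, d
from update_table;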
I have 2 procedures (say A and B). They both return data with a similar column set (Id, Name, Count). To be more concrete, example results of the procedures are listed below:
A:
Id Name Count
1 A 10
2 B 11
B:
Id Name Count
1 E 14
2 F 15
3 G 16
4 H 17
The Ids are generated with ROW_NUMBER(), as I don't have my own identifiers for these records because they are aggregated values.
In code I query over both result sets using the same class NameAndCountView.
And finally my problem. When I look into results after executing both procedures sequentially I get the following:
A:
Id Name Count
1 A 10 ->|
2 B 11 ->|
|
B: |
Id Name Count |
1 A 10 <-|
2 B 11 <-|
3 G 16
4 H 17
As you can see, results in the second set are replaced with the results having the same Ids from the first set. Of course, the problem takes place because I use the same class for retrieving data, right?
The question is how to make this work without creating an additional NameAndCountView2-like class?
If possible, and if you don't really mind losing the original Id values, maybe you can try having the first query return even Ids:
ROW_NUMBER() over (order by .... )*2
while the second returns odd Ids:
ROW_NUMBER() over (order by .... )*2+1
This would also allow you to know where the Ids come from.
I guess this would generalize to N queries by having query number i select:
ROW_NUMBER() over (order by .... )*n+i
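A minimal sketch of the trick, assuming SQL Server syntax and an ordering by Name ([Count] is bracketed only because COUNT is also a builtin function name; the source tables are placeholders for whatever the two procedures aggregate):

-- Procedure A's final select: even Ids (2, 4, 6, ...)
SELECT ROW_NUMBER() OVER (ORDER BY Name) * 2 AS Id,
       Name, [Count]
FROM aggregated_a;   -- hypothetical source of A's aggregated rows

-- Procedure B's final select: odd Ids (3, 5, 7, ...)
SELECT ROW_NUMBER() OVER (ORDER BY Name) * 2 + 1 AS Id,
       Name, [Count]
FROM aggregated_b;   -- hypothetical source of B's aggregated rows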
Hope this will help
I am working on a tag recommendation system that takes metadata strings (e.g. text descriptions) of an object and splits them into 1-, 2- and 3-grams.
The data for this system is kept in 3 tables:
The "object" table (e.g. what is being described),
The "token" table, filled with all 1-, 2- and 3-grams found (examples below), and
The "mapping" table, which maintains associations between (1) and (2), as well as a frequency count for these occurrences.
I am therefore able to construct a table via a LEFT JOIN that looks somewhat like this:
SELECT mapping.object_id, mapping.token_id, mapping.freq, token.token_size, token.token
FROM mapping
LEFT JOIN token ON (mapping.token_id = token.id)
WHERE mapping.object_id = 1;
object_id   token_id   freq   token_size   token
-------------------------------------------------
1           1          1      2            'a big'
1           2          1      1            'a'
1           3          1      1            'big'
1           4          2      3            'a big slice'
1           5          1      1            'slice'
1           6          3      2            'big slice'
Now I'd like to be able to get the relative probability of each term within the context of a single object id, so that I can sort the terms by probability and see which are most probable (e.g. ORDER BY rel_prob DESC LIMIT 25).
For each row, I'm envisioning the addition of a column which gives the result of freq/sum of all freqs for that given token_size. In the case of 'a big', for instance, that would be 1/(1+3) = 0.25. For 'a', that's 1/3 = 0.333, etc.
I can't, for the life of me, figure out how to do this. Any help is greatly appreciated!
If I understood your problem, here's the query you need:
select
    m.object_id, m.token_id, m.freq,
    t.token_size, t.token,
    cast(m.freq as decimal(29, 10))
        / sum(m.freq) over (partition by t.token_size, m.object_id) as rel_prob
from mapping as m
left outer join token as t on m.token_id = t.id
where m.object_id = 1;
sql fiddle example
hope that helps