Subset large table for use in multiple UNIONs - sql

Suppose I have a table with the following structure:
id   measure_1_actual   measure_1_predicted   measure_2_actual   measure_2_predicted
1    1                  0                     0                  0
2    1                  1                     1                  1
3    .                  .                     0                  0
I want to create the following table, for each ID (shown is an example for id = 1):
measure   actual   predicted
1         1        0
2         0        0
Here's one way I could solve this problem (I haven't tested this, but you get the general idea, I hope):
SELECT 1 AS measure,
       measure_1_actual AS actual,
       measure_1_predicted AS predicted
FROM tb
WHERE id = 1
UNION
SELECT 2 AS measure,
       measure_2_actual AS actual,
       measure_2_predicted AS predicted
FROM tb
WHERE id = 1
In reality, I have five of these "measures" and tens of millions of people, and subsetting such a large table five times for each member does not seem like the most efficient approach. This is a real-time API receiving tens of requests a minute, so I think I need something faster. My other thought was to create a temp table/view for each member once a request is received, and then UNION off that subsetted table.
Does anyone have a more efficient way of doing this?

You can use a lateral join:
select t.id, v.*
from t cross join lateral
(values (1, measure_1_actual, measure_1_predicted),
(2, measure_2_actual, measure_2_predicted)
) v(measure, actual, predicted);
Lateral joins were introduced in Postgres 9.3. You can read about them in the documentation.
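For the per-request lookup described in the question, the same lateral VALUES list can be filtered to one member and extended to all five measures, so the wide row is read once and unpivoted in one pass. A sketch, assuming the remaining columns follow the measure_N_actual / measure_N_predicted naming and using the question's table tb:
select t.id, v.*
from tb t cross join lateral
     (values (1, measure_1_actual, measure_1_predicted),
             (2, measure_2_actual, measure_2_predicted),
             (3, measure_3_actual, measure_3_predicted),
             (4, measure_4_actual, measure_4_predicted),
             (5, measure_5_actual, measure_5_predicted)
     ) v(measure, actual, predicted)
where t.id = 1;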

Related

How to delete a STRUCT from an ARRAY in the NESTED field

Is there a simple way to delete a STRUCT from a nested and repeated field in BigQuery (a BQ table column with Type: RECORD, Mode: REPEATED)?
Let's say I have the following tables:
wishlist
name    toy.id   toy.priority
Alice   1        high
        2        medium
        3        low
Kazik   3        high
        1        medium
toys
id   name   available
1    car    0
2    doll   1
3    bike   1
I'd like to DELETE from wishlist the toys that are not available (toys.available == 0). In this case, that's toy.id == 1.
As a result, the wishlist would look like this:
name    toy.id   toy.priority
Alice   2        medium
        3        low
Kazik   3        high
I know how to select it:
WITH `project.dataset.wishlist` AS
(
SELECT 'Alice' name, [STRUCT<id INT64, priority STRING>(1, 'high'), (2, 'medium'), (3, 'low')] toy UNION ALL
SELECT 'Kazik' name, [STRUCT<id INT64, priority STRING>(3, 'high'), (1, 'medium')]
), toys AS (
SELECT 1 id, 'car' name, 0 available UNION ALL
SELECT 2 id, 'doll' name, 1 available UNION ALL
SELECT 3 id, 'bike' name, 1 available
)
SELECT wl.name, ARRAY_AGG(STRUCT(unnested_toy.id, unnested_toy.priority)) as toy
FROM `project.dataset.wishlist` wl, UNNEST (toy) as unnested_toy
LEFT JOIN toys t ON unnested_toy.id=t.id
WHERE t.available != 0
GROUP BY name
But I don't know how to remove structs <toy.id, toy.priority> from wishlist when toys.available==0.
There are very similar questions, like How to delete/update nested data in bigquery or How to Delete rows from Structure in bigquery, but the answers are either unclear to me in terms of deletion or suggest copying the whole wishlist to a new table using the selection statement. My 'wishlist' is huge and 'toys.availability' changes often, so copying it seems very inefficient to me.
Could you please suggest a solution aligned with BQ best practices?
Thank you!
... since row deletion is supported in BQ, I thought that deleting a STRUCT inside a row would also be possible.
You can use the UPDATE DML statement for this (not DELETE, which removes whole rows, whereas UPDATE can modify a row in place):
update `project.dataset.wishlist` wl
set toy = ((
    select array_agg(struct(unnested_toy.id, unnested_toy.priority))
    from unnest(toy) as unnested_toy
    left join `project.dataset.toys` t on unnested_toy.id = t.id
    where t.available != 0
  ))
where true;
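One detail worth being explicit about: if every toy in a wishlist row is unavailable, the ARRAY_AGG subquery yields NULL for that row. If you want such rows to end up with an unambiguously empty array, you can wrap the subquery; a sketch of the same UPDATE with that fallback added:
update `project.dataset.wishlist` wl
set toy = ifnull((
    select array_agg(struct(unnested_toy.id, unnested_toy.priority))
    from unnest(toy) as unnested_toy
    left join `project.dataset.toys` t on unnested_toy.id = t.id
    where t.available != 0
  ), array<struct<id int64, priority string>>[])
where true;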
You can UNNEST() and reaggregate:
SELECT wl.name,
(SELECT ARRAY_AGG(t)
FROM UNNEST(wl.toy) t JOIN
toys
ON toys.id = t.id
WHERE toys.available <> 0
) as available_toys
FROM `project.dataset.wishlist` wl;

How to write a SQL query to calculate percentages based on values across different tables?

Suppose I have a database containing two tables, similar to below:
Table 1:
tweet_id   tweet
1          Scrap the election results
2          The election was great!
3          Great stuff
Table 2:
politician   tweet_id
TRUE         1
FALSE        2
FALSE        3
I'm trying to write a SQL query which returns the percentage of tweets that contain the word 'election' broken down by whether they were a politician or not.
So for instance here, the first 2 tweets in Table 1 contain the word election. By looking at Table 2, you can see that tweet_id 1 was written by a politician, whereas tweet_id 2 was written by a non-politician.
Hence, the result of the SQL query should return 50% for politicians and 50% for non-politicians (i.e. two tweets contained the word 'election', one by a politician and one by a non-politician).
Any ideas how to write this in SQL?
You could do this by creating one subquery to return all election tweets and one subquery to return all election tweets by politicians, then joining the two.
Here is a sample. Note that you may need to cast the totals to decimals before dividing (depending on which SQL engine you are using).
select
    politician_tweets.total / election_tweets.total
from
    (
        select count(tweet) as total
        from table_1
        join table_2 on table_1.tweet_id = table_2.tweet_id
        where tweet like '%election%'
    ) election_tweets
    join
    (
        select count(tweet) as total
        from table_1
        join table_2 on table_1.tweet_id = table_2.tweet_id
        where tweet like '%election%'
          and politician = 1
    ) politician_tweets
    on 1 = 1
You can use aggregation like this:
select t2.politician, avg( case when t.tweet like '%election%' then 1.0 else 0 end) as election_ratio
from tweets t join
table2 t2
on t.tweet_id = t2.tweet_id
group by t2.politician;
Here is a db<>fiddle.
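If what you are after is specifically the split of election tweets between politicians and non-politicians (the 50% / 50% in the example), one possible variation uses a window function over the grouped counts. A sketch, reusing the table names from the first answer and assuming your engine supports window functions:
select t2.politician,
       count(*) * 100.0 / sum(count(*)) over () as pct_of_election_tweets
from table_1 t1
join table_2 t2 on t1.tweet_id = t2.tweet_id
where t1.tweet like '%election%'
group by t2.politician;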

SQL Case with calculation on 2 columns

I have a value table and I need to write a case statement that touches two columns. Below is an example:
Type   State   Min   Max   Value
A      TX      2     15    100
A      TX      16    30    200
A      TX      31+         500
Let's say I have another table with the following:
Type   State   Weight   Value
A      TX      14       ?
So when I join the tables, I need a case statement that looks at Weight, Type and State from table 2, compares them to table 1, recognizes that the weight of 14 falls between 2 and 15 in row 1, and updates Value in table 2 with 100.
Is this doable?
Thanks
This returns 0 if there is no row matching the weight range:
select Type, State, Weight,
       coalesce((select Value
                 from table_b
                 where table_b.Type = table_a.Type
                   and table_b.State = table_a.State
                   and table_a.Weight between table_b.Min and table_b.Max), 0) as Value
from table_a
For an Alteryx solution: (1) run both tables into a Join tool, joining on Type and State; (2) Send the output to a Filter tool where you force Weight to be between Min and Max; (3) Send that output to a Select tool, where you grab only the specific columns you want; (since the Join will give you all columns from all tables). Done.
Caveats: the data running from Join to Filter could be large, since you are joining every Type/State combination in the Lookup table to the other table. Depending on the size of your datasets, that might be cumbersome. Alteryx is very fast though, and at least we're limiting on State and Type, so if your datasets aren't too large, this simple solution will work fine.
With larger data, try to do it as part of your original select, utilizing one of the other solutions given here for your SQL query.
Assuming that the Min and Max columns in the first table are of integer type, you can use an INNER JOIN on the range:
SELECT *
FROM another_table a
JOIN first_table b
ON a.type = b.type
AND a.State = b.State
AND a.Weight BETWEEN b.min AND b.max
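If the goal is to actually write the matched value into the second table, as the question describes, the same range condition can drive an update. A rough sketch in generic SQL, reusing the table names from this answer and assuming the unfilled Value rows are NULL (exact UPDATE syntax varies by database):
update another_table
set Value = (
    select b.Value
    from first_table b
    where b.Type = another_table.Type
      and b.State = another_table.State
      and another_table.Weight between b.Min and b.Max
)
where Value is null;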

Oracle SQL statement to update column values based on specific condition

I have a table with three columns: PID, LOCID, ISMGR. In the existing scenario, a person is set as ISMGR=true for some locations, based on the location ID.
But per the new requirement, we have to set ISMGR=true for every row of any person who has at least one ISMGR=true (meaning that if he is a manager for any one location, he should be a manager for all locations).
Table Data before running the script:
PID   LOCID   ISMGR
1     1       1
1     2       0
1     3       0
2     1       0
2     2       1
Table Data after running the script:
PID   LOCID   ISMGR
1     1       1
1     2       1
1     3       1
2     1       1
2     2       1
Any help will be highly appreciated..
Thanks in advance.
I would be inclined to write this using exists:
update t
set ismgr = 1
where ismgr = 0 and
exists (select 1 from t t2 where t2.pid = t.pid and t2.ismgr = 1);
exists should be more efficient than doing a subquery with an aggregation.
This will work best with indexes on t(pid, ismgr) and t(ismgr).
This is not an answer but a test of the two solutions offered so far - I will call them the "EXISTS" and the "AGGREGATE" solutions or approaches.
Details of the tests are below, but here are two overall conclusions:
Both approaches have comparable execution times; on average the AGGREGATE approach worked a little faster than the EXISTS approach, but by a very small margin (smaller than the differences between running times from one trial to the next). Without indexes on any columns, the run times were (first number for the EXISTS approach, second for AGGREGATE):
Trial 1: 8.19s   8.08s
Trial 2: 8.98s   8.22s
Trial 3: 9.46s   9.55s
Note - estimated optimizer costs should be used only to compare different execution plans for the same statement, not different solutions using different approaches. Even so, someone will inevitably ask; so: for the EXISTS approach the lowest cost the optimizer found was 4766; for AGGREGATE, 2665. Again, though, this is completely meaningless.
If a lot of rows need to be updated, indexes will hurt performance much more than they help it: when rows are updated, the indexes must be updated as well. If only a small number of rows must be updated, the indexes help, because most of the time is spent finding the rows to update and only a little time is spent in the updates themselves. In my example almost 25% of rows had to be updated... so the AGGREGATE solution took 51.2 seconds and the EXISTS solution took 59.3 seconds! RECOMMENDATION: If you expect that a large number of rows may need to be updated, and you already have indexes on the table, you may be better off DROPPING them and re-creating them after the updates (a sketch of that follows below). Or perhaps there are other solutions to this problem; I am not an expert (keep that in mind!)
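For reference, the drop-and-recreate variant would look roughly like this (a sketch only, reusing the EXISTS update and the index names from the test script further down):
-- drop the indexes so the bulk update does not have to maintain them
drop index pid_ismgr_idx;
drop index ismgr_ids;

update tbl t
set    ismgr = 1
where  ismgr = 0
and    exists (select 1 from tbl t2 where t2.pid = t.pid and t2.ismgr = 1);

commit;

-- re-create the indexes once the data is in its final state
create index pid_ismgr_idx on tbl(pid, ismgr);
create index ismgr_ids on tbl(ismgr);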
To test properly, after I created the test table and committed, I ran each solution by itself, then I rolled back and, logged in as SYS (in a different session), I ran alter system flush buffer_cache to make sure performance is not randomly helped by cache hits or hurt by misses. In all cases everything is done from disk storage.
I created a table with id's from 1 to 1.2 million and a random integer between 1 and 3, with probabilities 40%, 40% and 20% respectively (see the use of dbms_random below). Then from this prep data I created the test table: each pid was included one, two or three times based on this random integer; and a random 0 or 1 was added as ismgr (with 50-50 probability) in each row. I also added a random integer between 1 and 4 as locid just to simulate the actual data; I didn't worry about duplicate locid since that column plays no role in the problem.
Of the 1.2 million pids, approximately 480,000 (40%) appear just once in the test table, another ~480,000 appear twice and ~240,000 three times. Total rows should be about 2,160,000. That's the cardinality of the base table (in reality it ended up being 2,160,546). Then: none of the ~480,000 rows with unique pid need to be changed; half of the 480,000 pids with a count of 2 will have the same ismgr (so no change) and the other half will be split, so we will need to change 240,000 rows from these; and a simple combinatorial argument shows that 3/8, or 270,000 rows, of the 720,000 rows for pids that appear three times in the table must be changed. So we should expect that 510,000 rows should be changed. In fact the update statements resulted in 510,132 rows updated (same for both solutions). These sanity checks show that the test was probably set up correctly. Below I show also a small sample from the base table, also as a sanity check.
CREATE TABLE statement:
create table tbl as
with prep ( pid, dup ) as (
select level,
round( dbms_random.value(0.5, 3) ) as dup
from dual
connect by level <= 1200000
)
select pid,
round( dbms_random.value(0.5, 4.5) ) as locid,
round( dbms_random.value(0, 1) ) as ismgr
from prep
connect by level <= dup
and prior pid = pid
and prior sys_guid() is not null
;
commit;
Sanity checks:
select count(*) from tbl;
COUNT(*)
----------
2160546
select * from tbl where pid between 324720 and 324730;
       PID      LOCID      ISMGR
---------- ---------- ----------
    324720          4          1
    324721          1          0
    324721          4          1
    324722          3          0
    324723          1          0
    324723          3          0
    324723          3          1
    324724          3          1
    324724          2          0
    324725          4          1
    324725          2          0
    324726          2          0
    324726          1          0
    324727          3          0
    324728          4          1
    324729          1          0
    324730          3          1
    324730          3          1
    324730          2          0

19 rows selected
UPDATE statements:
update tbl t
set ismgr = 1
where ismgr = 0 and
exists (select 1 from tbl t2 where t2.pid = t.pid and t2.ismgr = 1);
rollback;
update tbl
set ismgr = 1
where ismgr = 0
and pid in ( select pid
from tbl
group by pid
having max(ismgr) = 1);
rollback;
-- statements to create indexes, used in separate testing:
create index pid_ismgr_idx on tbl(pid, ismgr);
create index ismgr_ids on tbl(ismgr);
Why PL/SQL? All you need is a plain SQL statement. For example:
update your_table t -- enter your actual table name here
set ismgr = 1
where ismgr = 0
and pid in ( select pid
from your_table
group by pid
having max(ismgr) = 1)
;
The existing solutions are perfectly fine, but I prefer to use merge any time I'm updating rows from a correlated sub-query. I find it to be more readable and the performance is typically commensurate with the exists method.
MERGE INTO t
USING (SELECT DISTINCT pid
FROM t
WHERE ismgr = 1) src
ON (t.pid = src.pid)
WHEN MATCHED THEN
UPDATE SET ismgr = 1
WHERE ismgr = 0;
As @mathguy pointed out, in this case using group by and having is more efficient than distinct. To use that with merge is just a matter of changing the sub-query:
MERGE INTO t
USING (SELECT pid
FROM t
GROUP BY pid
HAVING MAX(ismgr) = 1) src
ON (t.pid = src.pid)
WHEN MATCHED THEN
UPDATE SET ismgr = 1
WHERE ismgr = 0;

Calculating relative frequencies in SQL

I am working on a tag recommendation system that takes metadata strings (e.g. text descriptions) of an object, and splits it into 1-, 2- and 3-grams.
The data for this system is kept in 3 tables:
The "object" table (e.g. what is being described),
The "token" table, filled with all 1-, 2- and 3-grams found (examples below), and
The "mapping" table, which maintains associations between (1) and (2), as well as a frequency count for these occurrences.
I am therefore able to construct a table via a LEFT JOIN, that looks somewhat like this:
SELECT mapping.object_id, mapping.token_id, mapping.freq, token.token_size, token.token
FROM mapping LEFT JOIN
token
ON (mapping.token_id = token.id)
WHERE mapping.object_id = 1;
object_id   token_id   freq   token_size   token
1           1          1      2            'a big'
1           2          1      1            'a'
1           3          1      1            'big'
1           4          2      3            'a big slice'
1           5          1      1            'slice'
1           6          3      2            'big slice'
Now I'd like to be able to get the relative probability of each term within the context of a single object ID, so that I can sort them by probability and see which terms are most probable (e.g. ORDER BY rel_prob DESC LIMIT 25).
For each row, I'm envisioning the addition of a column which gives the result of freq/sum of all freqs for that given token_size. In the case of 'a big', for instance, that would be 1/(1+3) = 0.25. For 'a', that's 1/3 = 0.333, etc.
I can't, for the life of me, figure out how to do this. Any help is greatly appreciated!
If I understood your problem correctly, here's the query you need:
select
    m.object_id, m.token_id, m.freq,
    t.token_size, t.token,
    cast(m.freq as decimal(29, 10)) / sum(m.freq) over (partition by t.token_size, m.object_id) as rel_prob
from mapping as m
left outer join token as t on m.token_id = t.id
where m.object_id = 1;
sql fiddle example
hope that helps
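If you only need the top terms, the same query can simply be ordered by that ratio; a small sketch (the rel_prob alias matches the query above, and the LIMIT clause may need to become TOP or FETCH FIRST depending on your engine):
select
    m.object_id, m.token_id, m.freq,
    t.token_size, t.token,
    cast(m.freq as decimal(29, 10)) / sum(m.freq) over (partition by t.token_size, m.object_id) as rel_prob
from mapping as m
left outer join token as t on m.token_id = t.id
where m.object_id = 1
order by rel_prob desc
limit 25;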