In Google BigQuery, I have a query where the same field name appears multiple times in various join subqueries. I would like to abstract this field name out into a temporary function, so that changing it inside the function alone updates it in all places.
This is the query I have:
SELECT *
FROM
(SELECT field1, COUNT(*) sq1_total
FROM table
WHERE condition = 1
GROUP BY field1) sq1
LEFT JOIN
(SELECT field1, COUNT(*) sq2_total
FROM table
WHERE condition = 0
GROUP BY field1) sq2
USING(field1)
This is what I would like to have:
CREATE TEMP FUNCTION replace_field_name() AS (...);
SELECT *
FROM
(SELECT replace_field_name(), COUNT(*) sq1_total
FROM table
WHERE condition = 1
GROUP BY replace_field_name()) sq1
LEFT JOIN
(SELECT replace_field_name(), COUNT(*) sq2_total
FROM table
WHERE condition = 0
GROUP BY replace_field_name()) sq2
USING(replace_field_name())
So that when I want to compare many different fields like this, I only need to change the field name in one place as opposed to five places.
Is this possible?
The thoughts and proposals below are in terms of BigQuery Standard SQL.
I would like to abstract out this field name into a temporary function ...
As Tim mentioned in his comment, it is not possible to do it the way you mocked it up.
I want to compare many different fields like this, I only need to change the field name in one place as opposed to five places.
You can try to rewrite your query so that you need to change the field name in fewer places, as in the examples below.
#standardSQL
SELECT * FROM (SELECT field1, COUNT(*) sq1_total FROM `project.dataset.table` WHERE condition = 1 GROUP BY 1) sq1
LEFT JOIN (SELECT field1, COUNT(*) sq2_total FROM `project.dataset.table` WHERE condition = 0 GROUP BY 1) sq2
USING (field1)
OR
#standardSQL
SELECT DISTINCT field1,
COUNTIF(condition = 1) OVER(PARTITION BY field1) sq1_total,
COUNTIF(condition = 0) OVER(PARTITION BY field1) sq2_total
FROM `project.dataset.table`
In both of the above queries there are "just" three places to replace the field name (as opposed to five in the original query).
Obviously, this does not address the problem qualitatively, just quantitatively.
Is this possible?
Good news: there is always a workaround, but it usually requires slightly changing your requirements or expectations.
For example, in the solution below you need to set the field name only once, in the UNNEST(['field1']) field line.
#standardSQL
SELECT DISTINCT field, value,
COUNTIF(condition = 1) OVER(PARTITION BY field, value) sq1_total,
COUNTIF(condition = 0) OVER(PARTITION BY field, value) sq2_total
FROM (
SELECT field, REGEXP_EXTRACT(x, CONCAT(r'"', field, '":"?([^",]+)"?')) value, condition
FROM `project.dataset.table` t,
UNNEST([TO_JSON_STRING(t)]) x,
UNNEST(['field1']) field
)
the "price" is - you will have output in form of (with dummy data)
Row field value sq1_total sq2_total
1 field1 1 1 3
2 field1 2 1 0
instead of the output from the original query:
Row field1 sq1_total sq2_total
1 1 1 3
2 2 1 null
I want to compare many different fields like this ...
The extra value in the above approach is that you can run your comparison (for as many fields as you want) in one shot, by adding the needed fields' names to the UNNEST(['field1']) field list, as in the example below.
#standardSQL
SELECT DISTINCT field, value,
COUNTIF(condition = 1) OVER(PARTITION BY field, value) sq1_total,
COUNTIF(condition = 0) OVER(PARTITION BY field, value) sq2_total
FROM (
SELECT field, REGEXP_EXTRACT(x, CONCAT(r'"', field, '":"?([^",]+)"?')) value, condition
FROM `project.dataset.table` t,
UNNEST([TO_JSON_STRING(t)]) x,
UNNEST(['field1', 'field2']) field
)
-- ORDER BY field, value
so the result could look like:
Row field value sq1_total sq2_total
1 field1 1 1 3
2 field1 2 1 0
3 field2 1 1 1
4 field2 2 0 2
5 field2 3 1 0
Related
I'm looking for an SQL statement similar to the any statement in R. What I have is a time-series data set that begins in 2014 and ends in 2020. I have a column that identifies whether, in 2016, individuals voluntarily or involuntarily changed a drug. What I want to do is completely remove any individuals who involuntarily changed a drug. In R I would group by the individual's ID and delete all IDs from the data set if the DrugChange column is 'Involuntarily'. My R code would look like this:
df<-df%>%group_by(ID)%>%filter(!any(DrugChange=='Involuntarily'))
In SQL I've been searching around for a somewhat simple solution, and (stupidly) thought just using a WHERE statement would work, but all it does is remove one row, not all rows. Is there a way I can use a WHERE statement, or is there a better method?
I think you want something like this:
select id
from t
group by id
having sum(case when DrugChange = 'Involuntarily' then 1 else 0 end) = 0;
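That gives you the IDs to keep. If you also need the full rows rather than just the IDs, one option (a minimal sketch, assuming your table really is called t as above) is to join those IDs back to the table:
-- keep every row whose ID never had an involuntary drug change
select t.*
from t
join (select id
      from t
      group by id
      having sum(case when DrugChange = 'Involuntarily' then 1 else 0 end) = 0
     ) keep_ids on keep_ids.id = t.id;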
My understanding is that you are looking to take a subset of rows such that if any row for an ID has Involuntarily in the DrugChange column, then all rows for that ID should be excluded. So, in the example in the Note at the end, all rows for ID 1 would be excluded and all rows for ID 2 would be kept.
1) windowing function: Using the test data in the Note at the end and an SQL windowing function, create a column ok which is 1 for every row of an ID not having any Involuntarily in the DrugChange column, and then pick only those rows. We have removed the ok column, but if you want it, omit the [-1].
library(sqldf)
sqldf("select * from (
select not max(DrugChange = 'Involuntarily') over (partition by ID) ok, *
from df
) where ok")[-1]
giving:
DrugChange ID
1 X 2
2 X 2
1a) This could be written in terms of a CTE like this:
sqldf("with inner as (
select not max(DrugChange = 'Involuntarily') over (partition by ID) ok, *
from df
)
select * from inner where ok")[-1]
2) join: An alternate approach is to generate one row per ID with the ok value and then join it to df if ok is 1.
sqldf("select a.*
from df a join (select ID, not max(DrugChange = 'Involuntarily') ok
from df
group by ID) b on a.ID = b.ID and b.ok")
giving:
DrugChange ID
1 X 2
2 X 2
2a) We could also write this in terms of a CTE like this:
sqldf("with right as (
select ID, not max(DrugChange = 'Involuntarily') ok
from df
group by ID
)
select a.* from df a join right b on a.ID = b.ID and b.ok")
3) in: A different approach is to use in, as shown here:
sqldf("select *
from df
where id not in (select distinct id from df where DrugChange = 'Involuntarily')")
giving:
DrugChange ID
1 X 2
2 X 2
It will also work without the distinct keyword.
3a) We could also write it with a CTE like this:
sqldf("with ids as (
select distinct id from df where DrugChange = 'Involuntarily'
)
select * from df where id not in ids")
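Note that a bare table name on the right side of not in (as in not in ids) is an SQLite extension, and SQLite is what sqldf uses by default. A more portable spelling (an assumption, relevant only if you later move this to another database) would be:
sqldf("with ids as (
select distinct id from df where DrugChange = 'Involuntarily'
)
select * from df where id not in (select id from ids)")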
Note
Test data used.
df <- data.frame(DrugChange = c("Involuntarily", "X", "X", "X"), ID = c(1,1,2,2))
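If you are working directly in a database rather than through sqldf, a hedged SQL equivalent of this test data (the table and column names simply mirror the data frame) would be:
-- assumed SQL equivalent of the R data frame df
create table df (DrugChange text, ID integer);
insert into df (DrugChange, ID) values
  ('Involuntarily', 1), ('X', 1), ('X', 2), ('X', 2);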
Imagine a table with only one column.
+------+
| v |
+------+
|0.1234|
|0.8923|
|0.5221|
+------+
I want to do the following for row K:
Take row K=1 value: 0.1234
Count how many values in the rest of the table are less than or equal to the value in row 1.
Iterate through all rows
Output should be:
+------+-------+
| v |output |
+------+-------+
|0.1234| 0 |
|0.8923| 2 |
|0.5221| 1 |
+------+-------+
Quick update: I was using this approach to compute a statistic at every value of v in the above table. The cross join approach was way too slow for the size of data I was dealing with, so instead I computed my statistic for a grid of v values and then matched them to the vs in the original data. v_table is the data table from before and stat_comp is the statistics table.
AS SELECT t1.*
,CASE WHEN v<=1.000000 THEN pr_1
WHEN v<=2.000000 AND v>1.000000 THEN pr_2
FROM v_table AS t1
LEFT OUTER JOIN stat_comp AS t2
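For reference, a hedged reconstruction of that excerpt (the CREATE TABLE header, the END of the CASE, and the join condition are missing above; the output table name and the single-row shape of stat_comp are assumptions):
-- assumed: stat_comp is a single row of precomputed statistics pr_1, pr_2
CREATE TABLE v_scored
AS SELECT t1.*
    ,CASE WHEN v<=1.000000 THEN pr_1
          WHEN v<=2.000000 AND v>1.000000 THEN pr_2
     END AS output
FROM v_table AS t1
LEFT OUTER JOIN stat_comp AS t2
ON 1 = 1  -- join condition omitted in the original excerpt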
Window functions were added to ANSI/ISO SQL in 1999 and to Hive in version 0.11, which was released on 15 May 2013.
What you are looking for is a variation on rank with ties high, which in ANSI/ISO SQL:2011 would look like this:
rank () over (order by v with ties high) - 1
Hive currently does not support with ties ... but the logic can be implemented using count(*) over (...)
select v
,count(*) over (order by v) - 1 as rank_with_ties_high_implicit
from mytable
;
or
select v
,count(*) over
(
order by v
range between unbounded preceding and current row
) - 1 as rank_with_ties_high_explicit
from mytable
;
Generate sample data
select 0.1234 as v into #t
union all
select 0.8923
union all
select 0.5221
This is the query:
;with ct as (
select ROW_NUMBER() over (order by v) rn
, v
from #t ot
)
select distinct v, a.cnt
from ct ot
outer apply (select count(*) cnt from ct where ct.rn <> ot.rn and v <= ot.v) a
After seeing your edits, it really does look like you could use a Cartesian product, i.e. a CROSS JOIN, here. I called your table foo and cross joined it to itself as bar:
SELECT foo.v, COUNT(foo.v) - 1 AS output
FROM foo
CROSS JOIN foo bar
WHERE foo.v >= bar.v
GROUP BY foo.v;
This query cross joins the column with itself such that every ordered pair of the column's elements is returned (you can see this yourself by removing the COUNT and GROUP BY clauses and adding bar.v to the SELECT). It then counts the pairs where foo.v >= bar.v and subtracts 1 to exclude the row's match with itself, yielding the final result.
You can take the full Cartesian product of the table with itself and sum a case statement:
select a.x
, sum(case when b.x < a.x then 1 else 0 end) as count_less_than_x
from (select distinct x from T) a
, T b
group by a.x
This will give you one row per unique value in the table with the count of non-unique rows whose value is less than this value.
Notice that there is neither a join condition nor a where clause. In this case, we actually want that. For each row of a we get a full copy of the table aliased as b. We can then check each row of b to see whether or not its value is less than a.x. If it is, we add 1 to the count. If not, we just add 0.
STEP 1
Select data1,name,phone,address from dummyTable limit 4;
From the above query I will get, for example, the following result:
data1 | name | phone | address
fgh | hjk | 567...| CA
ghjkk | jkjii| 555...| NY
Now, after getting the above result, I am supposed to match the data1 records from the query above against another existing table in the database, called existingTable, which has the same column data1 in it. If the result above gives a data1 value of 'fgh', I take that 'fgh' and compare it with the existingTable column data1.
STEP 2
Next, after I am finished comparing, I need to apply some condition as follows:
if((results.data1.value).equals(existingTable.data1.value))
then count --
else
count++
By the condition above I am trying to explain that if the value I got from the result is matched, then I decrement count by 1, and if not, count is incremented by 1.
Summary
I basically want to achieve this in one single query. Is that possible using PostgreSQL?
I think you can translate that to a simple query:
SELECT d.data1, d.name, d.phone, d.address
, count(*) - 2 * count(e.data1)
FROM (
SELECT data1, name, phone, address
FROM dummytable
-- ORDER BY ???
LIMIT 4
) d
LEFT JOIN existingtable e USING (data1)
GROUP BY d.data1, d.name, d.phone, d.address;
The major ingredient is the LEFT [OUTER] JOIN.
count(*) counts all rows from dummytable.
count(e.data1) only counts rows from existingtable where a matching data1 exists (count() does not count NULL values). I subtract that twice to match your formula: each match already contributed +1 via count(*), so subtracting 2 turns it into the -1 you want, while each non-match keeps its +1.
About ORDER BY: There is no natural order in a database table. You need to order by something to get predictable results.
If there can be duplicates in existingtable but you want to count every distinct data1 only once, eliminate dupes before you join or use an EXISTS semi-join:
SELECT data1, name, phone, address
, count(*) - 2 * count(EXISTS (
SELECT 1 FROM existingtable e
WHERE e.data1 = d.data1) OR NULL)
FROM (
SELECT data1, name, phone, address
FROM dummytable
-- ORDER BY ???
LIMIT 4
) d
GROUP BY data1, name, phone, address;
The last count works because (TRUE OR NULL) IS TRUE, but (FALSE OR NULL) IS NULL.
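A quick way to see the trick in isolation (PostgreSQL; the boolean literals are just for demonstration):
SELECT count(TRUE OR NULL)  AS counted       -- 1, because (TRUE OR NULL) IS TRUE
     , count(FALSE OR NULL) AS not_counted;  -- 0, because (FALSE OR NULL) IS NULL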
I have result of two queries like:
Result of query 1
ID Value
1 4
2 0
3 6
4 9
Result of query 2
ID Value
1 6
2 4
3 0
4 1
I want to add the values in column "Value" and show the final result:
Result of Both queries
ID Value
1 10
2 4
3 6
4 10
Please guide me.
select id, sum(value) as value
from (
select id, value from query1
union all
select id, value from query2
) x
group by id
Try using a JOIN:
SELECT
T1.ID,
T1.Value + T2.Value AS Value
FROM (...query1...) AS T1
JOIN (...query2...) AS T2
ON T1.Id = T2.Id
You may also need to consider what should happen if there is an Id present in one result but not in the other. The current query will omit it from the results. You may want to investigate OUTER JOIN as an alternative.
A not particularly nice but fairly easy to comprehend way would be:
SELECT a.ID, SUM(a.Value) AS Value FROM
(
SELECT COALESCE(t1.ID, t2.ID) AS ID,
       COALESCE(t1.Value, 0) + COALESCE(t2.Value, 0) AS Value
FROM (SELECT IDColumn AS ID,ValueColumn AS Value FROM TableA) t1
FULL OUTER JOIN
(SELECT IDColumn AS ID,ValueColumn AS Value FROM TableB) t2
ON t1.ID = t2.ID
) a GROUP BY a.ID
It has the benefits of
a) I don't know your actual table structure so you should be able to work out how to get the two 'SELECT's working from your original queries
b) If an ID appears in only one of the tables, that's fine
I have a table in my database:
Name | Element
1 2
1 3
4 2
4 3
4 5
I need to make a query that, for a given set of arguments, will select the values of Name that have these and only these values on the right side.
E.g.:
If the arguments are 2 and 3, the query should return only 1 and not 4 (because 4 also has 5). For arguments 2, 3, 5 it should return 4.
My query looks like this:
SELECT name FROM aggregations WHERE (element=2 and name in (select name from aggregations where element=3))
What do I have to add to this query to make it not return 4?
A simple way to do it:
SELECT name
FROM aggregations
WHERE element IN (2,3)
GROUP BY name
HAVING COUNT(element) = 2
If you want to add more, you'll need to change both the IN (2,3) part and the HAVING part:
SELECT name
FROM aggregations
WHERE element IN (2,3,5)
GROUP BY name
HAVING COUNT(element) = 3
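If duplicate (name, element) rows can occur in the table, counting distinct elements keeps the count honest; a small variation under that assumption:
SELECT name
FROM aggregations
WHERE element IN (2,3,5)
GROUP BY name
HAVING COUNT(DISTINCT element) = 3  -- duplicates of the same element count once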
A more robust way would be to check that nothing exists for the name that isn't in your set:
SELECT name
FROM aggregations
WHERE NOT EXISTS (
SELECT DISTINCT a.element
FROM aggregations a
WHERE a.element NOT IN (2,3,5)
AND a.name = aggregations.name
)
GROUP BY name
HAVING COUNT(element) = 3
It's not very efficient, though.
Create a temporary table, fill it with your values and query like this:
SELECT name
FROM (
SELECT DISTINCT name
FROM aggregations
) n
WHERE NOT EXISTS
(
SELECT 1
FROM (
SELECT element
FROM aggregations aii
WHERE aii.name = n.name
) ai
FULL OUTER JOIN
temptable tt
ON tt.element = ai.element
WHERE ai.element IS NULL OR tt.element IS NULL
)
This is more efficient than using COUNT(*), since it will stop checking a name as soon as it finds the first row that doesn't have a match (either in aggregations or in temptable).
This isn't tested, but usually I would do this with a query in my where clause for a small amount of data. Note that this is not efficient for large record counts.
SELECT ag1.Name FROM aggregations ag1
WHERE ag1.Element IN (2,3)
AND 0 = (select COUNT(ag2.Name)
FROM aggregations ag2
WHERE ag1.Name = ag2.Name
AND ag2.Element NOT IN (2,3)
)
GROUP BY ag1.name;
This says "Give me all of the names that have the elements I want, but have no records with elements I don't want"