In SQL, how do I create new column of values for each distinct values of another column? - sql

Something like this: SQL How to create a value for a new column based on the count of an existing column by groups?
But I have more than two distinct values. I have a variable n number of distinct values, so I don't always know have many different counts I have.
And then in the original table, I want each row '3', '4', etc. to have the count i.e. all the rows with the '3' would have the same count, all the rows with '4' would have the same count, etc.
edit: Also how would I split the count via different dates i.e. '2017-07-19' for each distinct values?
edit2: Here is how I did it, but now I need to split it via different dates.
edit3: This is how I split via dates.
#standardSQL
SELECT * FROM
(SELECT * FROM table1) main
LEFT JOIN (SELECT event_date, value, COUNT(value) AS count
FROM table1
GROUP BY event_date, value) sub ON main.value=sub.value
AND sub.event_date=SAFE_CAST(main.event_time AS DATE)
edit4: I wish PARTITION BY was documented somewhere better. Nothing seems to be widely written on BigQuery or anything with detailed documentation
#standardSQL
SELECT
*,
COUNT(*) OVER (PARTITION BY event_date, value) AS cnt
FROM table1;

The query that you give would better be written using window functions:
SELECT t1.*, COUNT(*) OVER (PARTITION BY value) as cnt
FROM table1 t1;
I am not sure if this answers your question.
If you have another column that you want to count as well, you can use conditional aggregation:
SELECT t1.*,
COUNT(*) OVER (PARTITION BY value) as cnt,
SUM(CASE WHEN datecol = '2017-07-19' THEN 1 ELSE 0 END) OVER (PARTITION BY value) as cnt_20170719
FROM table1 t1;

Related

How to count the rows of a column grouping by a column but omitting the others columns of the table in Oracle?

I want to do a count grouping by the first column but omitting the others columns in the group by. Let me explain:
I have a table with those columns
So, what I want to get is a new column with the work orders total by Instrument, something like this:
How can I do that? Because if I do a count like this:
SELECT INSTRUMENT, WORKORDER, DATE, COUNT(*)
FROM TABLE1
GROUP BY INSTRUMENT, WORKORDER, DATE;
I get this:
Just use a window function:
select t.*,
count(*) over (partition by instrument) as instrument_count
from table1 t;
Although answer given by Gordon is perfect but there is also another option by using group by and subquery. You can add date column to this query as well
SELECT * FROM
(
SELECT A.INSTRUMENT, B.TOTAL_COUNT_BY_INSTRUMENT
FROM work_order A,
(SELECT COUNT(1) AS TOTAL_COUNT_BY_INSTRUMENT,
INSTRUMENT
FROM WORK_ORDER
GROUP BY INSTRUMENT
) B
WHERE A.INSTRUMENT = B.instrument);

SQL Server Count field values without merge

How do I create a COUNT column to count the repetitive values?
And I want to keep the table EXACTLY as below but add the last column (count_id).
The values at the left come from a JOIN so they are "equal".
Thanks! (I tried a lot)
You just want count(*) as a window function:
select t.*,
count(*) over (partition by id, name, department) as count_id
from t;

Joining data from two sources using bigquery

Can anyone please check whether below code is correct? In cte_1, I’m taking all dimensions and metrics from t1 excpet value1, value2, value3. In cte_2, I’m finding the unique row number for t2. In cte_3, I’m taking all distinct dimensions and metrics using join on two keys such as Date, and Ad. In cte_4, I’m taking the values for only row number 1. I’m getting sum(value1),sum(value2),sum(value3) correct ,but sum(value4) is incorrect
WITH cte_1 AS
(SELECT *except(value1, value2, value3) FROM t1 where Date >"2020-02-16" and Publisher ="fb")
-- Find unique row number from t2--
,cte_2 as(
SELECT ROW_NUMBER() OVER(ORDER BY Date) distinct_row_number, * FROM t2
,cte_3 as
(SELECT cte_2.*,cte_1.*except(Date) FROM cte_2 join cte_1
on cte_2.Date = cte_1. Date
and cte_2.Ad= cte_1.Ad))
,cte_4 AS (
(SELECT *
FROM
(
SELECT *,
row_number() OVER (PARTITION BY distinct_row_number ORDER BY Date) as rn
FROM cte_3 ) T
where rn = 1 ))
select sum(value1),sum(value2),sum(value3),sum(value4) from cte_4
Please see the sample table below:
Whilst your data does not seem compliant with the query you shared, since it is lacking the field named Ad and other fields have different names, such as Date and ReportDate, I was able to identify some issues and propose improvements.
First, within your temp table cte_1, you are only using a filter in the WHERE clause, you could use it within your from statement in your last step, such as :
SELECT * FROM (SELECT field1,field2,field3 FROM t1 WHERE Date > DATE(2020,02,16) )
Second, in cte_2, you need to select all the columns you will need from the table t2. Otherwise, your table will have only the row number and it won't be possible to join it with other tables, once it does not provide any other information. Thus, if you need the row number, you select it together with the other columns, which it has to include your primary key if you will perform any join in the future. The syntax would be as follows:
SELECT field1, field2, ROW_NUMBER() OVER(ORDER BY Date) FROM t2
Third, in cte_3, I assume you want to perform an INNER JOIN. Thus, you need to make sure that the primary keys are present in both tables, in your case Date and Ad, which I could not find within your data. Furthermore, you can not have duplicated names when joining two tables and selecting all the columns. For example, in your case you have Brand, value 1, value 2 and value 3 in both tables, it will cause an error. Thus, you need to specify where these fields should come from by selecting one by one or the using a EXCEPT clause.
Finally, in cte_4 and your final select could be together in one step. Basically, you are selecting only one row of data ordered by Date. Then summing the fields value 1, value 2 and value 3 individually based on the partition by date. Moreover, you are not selecting any identifier for the sum, which means that your table will have only the final sums. In general, when peforming a aggregation, such as SUM(), the primary key(s) is selected as well. Lastly, this step could have been performed in one step such as follows, using only the data from t2:
SELECT ReportDate, Brand, sum(value1) as sum_1,sum(value2) as sum_1,sum(value3) as sum_1, sum(value4) as sum_1 FROM (SELECT t2.*, ROW_NUMBER() OVER(PARTITION BY Date ORDER BY Date) as rn t2)
WHERE rn=1
GROUP BY ReportDate, Brand
UPDATE:
With your explanation in the comment section. I was able to created a more specific query. The fields ReportDate,Brand,Portfolio,Campaign and value1,value2,value3 are from t2. Whilst value4 is from t1. The sum is made based on the row number equals to 1. For this reason, the tables t1 and t2 are joined before being using ROW_NUMBER(). Finally, in the last Select statement rn is not selected and the data is aggregated based on ReportDate, Brand, Portfolio and t2.Campaign.
WITH cte_1 AS (
SELECT t2.ReportDate, t2.Brand, t2.Portfolio, t2.Campaign,
t2.value1, t2.value2, t2.value3, t1.value4
FROM t2 LEFT JOIN t1 on t2.ReportDate = t1.ReportDate and t1.placement=t2.Ad
),
cte_2 AS(
SELECT *, ROW_NUMBER() OVER(PARTITION BY Date ORDER BY ReportDate) as rn FROM cte_1
)
SELECT ReportDate, Brand, Portfolio, Campaign, SUM(value1) as sum1, SUM(value2) as sum2, SUM(value3) as sum3,
SUM(value4) as sum4
FROM cte_2
WHERE rn=1
GROUP BY 1,2,3,4

Get minimum without using row number/window function in Bigquery

I have a table like as shown below
What I would like to do is get the minimum of each subject. Though I am able to do this with row_number function, I would like to do this with groupby and min() approach. But it doesn't work.
row_number approach - works fine
SELECT * FROM (select subject_id,value,id,min_time,max_time,time_1,
row_number() OVER (PARTITION BY subject_id ORDER BY value) AS rank
from table A) WHERE RANK = 1
min() approach - doesn't work
select subject_id,id,min_time,max_time,time_1,min(value) from table A
GROUP BY SUBJECT_ID,id
As you can see just the two columns (subject_id and id) is enough to group the items together. They will help differentiate the group. But why am I not able to use the other columns in select clause. If I use the other columns, I may not get the expected output because time_1 has different values.
I expect my output to be like as shown below
In BigQuery you can use aggregation for this:
SELECT ARRAY_AGG(a ORDER BY value LIMIT 1)[SAFE_OFFSET(1)].*
FROM table A
GROUP BY SUBJECT_ID;
This uses ARRAY_AGG() to aggregate each record (the a in the argument list). ARRAY_AGG() allows you to order the result (by value) and to limit the size of the array. The latter is important for performance.
After you concatenate the arrays, you want the first element. The .* transforms the record referred to by a to the component columns.
I'm not sure why you don't want to use ROW_NUMBER(). If the problem is the lingering rank column, you an easily remove it:
SELECT a.* EXCEPT (rank)
FROM (SELECT a.*,
ROW_NUMBER() OVER (PARTITION BY subject_id ORDER BY value) AS rank
FROM A
) a
WHERE RANK = 1;
Are you looking for something like below-
SELECT
A.subject_id,
A.id,
A.min_time,
A.max_time,
A.time_1,
A.value
FROM table A
INNER JOIN(
SELECT subject_id, MIN(value) Value
FROM table
GROUP BY subject_id
) B ON A.subject_id = B.subject_id
AND A.Value = B.Value
If you do not required to select Time_1 column's value, this following query will work (As I can see values in column min_time and max_time is same for the same group)-
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
--A.time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time
Finally, the best approach is if you can apply something like CAST(Time_1 AS DATE) on your time column. This will consider only the date part regardless of the time part. The query will be
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE) Time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE)
-- Make sure the syntax of CAST AS DATE
-- in BigQuery is as I written here or bit different.
Below is for BigQuery Standard SQL and is most efficient way for such cases like in your question
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY value LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY subject_id
Using ROW_NUMBER is not efficient and in many cases lead to Resources exceeded error.
Note: self join is also very ineffective way of achieving your objective
A bit late to the party, but here is a cte-based approach which made sense to me:
with mins as (
select subject_id, id, min(value) as min_value
from table
group by subject_id, id
)
select distinct t.subject_id, t.id, t.time_1, t.min_time, t.max_time, m.min_value
from table t
join mins m on m.subject_id = t.subject_id and m.id = t.id

SELECT *, COUNT(*) in SQLite

If i perform a standard query in SQLite:
SELECT * FROM my_table
I get all records in my table as expected. If i perform following query:
SELECT *, 1 FROM my_table
I get all records as expected with rightmost column holding '1' in all records. But if i perform the query:
SELECT *, COUNT(*) FROM my_table
I get only ONE row (with rightmost column is a correct count).
Why is such results? I'm not very good in SQL, maybe such behavior is expected? It seems very strange and unlogical to me :(.
SELECT *, COUNT(*) FROM my_table is not what you want, and it's not really valid SQL, you have to group by all the columns that's not an aggregate.
You'd want something like
SELECT somecolumn,someothercolumn, COUNT(*)
FROM my_table
GROUP BY somecolumn,someothercolumn
If you want to count the number of records in your table, simply run:
SELECT COUNT(*) FROM your_table;
count(*) is an aggregate function. Aggregate functions need to be grouped for a meaningful results. You can read: count columns group by
If what you want is the total number of records in the table appended to each row you can do something like
SELECT *
FROM my_table
CROSS JOIN (SELECT COUNT(*) AS COUNT_OF_RECS_IN_MY_TABLE
FROM MY_TABLE)