Not getting unique values in spite of using the distinct function


I am using the code below to return a set of distinct UUIDs, each with the date when the first action was taken on that UUID. The raw data contains non-distinct UUIDs, each with a date when an action was taken. I am trying to extract the unique UUIDs and the first date when an action was taken, represented by date1. Can someone help me see where I am going wrong?
The output I get is the same as the raw data: the UUIDs are non-unique, with many duplicates.
with raw_data as (
select UUID, cast(datestring as timestamp) as date1
from raw)
select
distinct UUID,
date_trunc('week', date1)
from raw_data

Use the min() aggregation function:
select UUID,
min(date_trunc('week', cast(datestring as timestamp)))
from raw
group by UUID;
This should do everything your query is doing. There is no need for a subquery or CTE.
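To see why DISTINCT did not help and why GROUP BY with min() does, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are made up for illustration, and the date_trunc('week', ...) step is left out because SQLite has no such function; only the de-duplication behaviour is shown.

```python
import sqlite3

# Hypothetical sample rows standing in for the `raw` table from the question.
rows = [
    ("u1", "2021-03-01 10:00:00"),
    ("u1", "2021-03-15 09:30:00"),
    ("u2", "2021-02-10 08:00:00"),
    ("u2", "2021-02-10 08:00:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw (UUID TEXT, datestring TEXT)")
conn.executemany("INSERT INTO raw VALUES (?, ?)", rows)

# DISTINCT de-duplicates (UUID, date) pairs, so u1 still appears twice.
distinct_rows = conn.execute(
    "SELECT DISTINCT UUID, datestring FROM raw ORDER BY UUID, datestring"
).fetchall()

# GROUP BY UUID with MIN() keeps exactly one row per UUID: the earliest date.
first_dates = conn.execute(
    "SELECT UUID, MIN(datestring) FROM raw GROUP BY UUID ORDER BY UUID"
).fetchall()

print(distinct_rows)
print(first_dates)
```

DISTINCT applies to the whole select list, so any UUID with more than one distinct date still produces more than one row; grouping by UUID alone is what guarantees uniqueness.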

Related

Select columns from second subquery if first returns NULL

I have two queries that I'm running separately from a PHP script. The first is checking if an identifier (group) has a timestamp in a table.
SELECT
group, MAX(timestamp) AS timestamp, value
FROM table_schema.sample_table
GROUP BY group, value
If there is no timestamp, then it runs this second query that retrieves the minimum timestamp from a separate table:
SELECT
group, MIN(timestamp) as timestamp, value AS value
FROM table_schema.src_table
GROUP BY group, value
And goes on from there.
What I would like to do, for the sake of conciseness, is to have a single query that runs the first statement but falls back to the second if the timestamp is NULL. I've tried coalesce() and CASE statements, but they require subqueries that return a single column (which I hadn't run into as an issue before). I then decided to try a JOIN on the table with the aggregate timestamp to get the whole row, but quickly realized I can't vary the table being joined (not to my knowledge). I opted to try joining both results and getting the max, something like this:
Edit: I am so tired, this should be a UNION, not a JOIN
sorry for any possible confusion :(
SELECT smpl.group, smpl.value, MAX(smpl.timestamp) AS timestamp
FROM table_schema.sample_table as smpl
INNER JOIN
(SELECT src.group, src.value, MIN(src.timestamp) AS timestamp
FROM source_table src
GROUP BY src.group, src.value) AS history
ON
smpl.group = history.group
GROUP BY smpl.group, smpl.value
I don't have a SELECT MAX() on this because it's really slow as is, most likely because my SQL is a bit rusty.
If anyone knows a better approach, I'd appreciate it!

Please try this:
select mx.group,
(case when mx.timestamp is null then mn.timestamp else mx.timestamp end) as timestamp,
(case when mx.timestamp is null then mn.value else mx.value end) as value
from
(
SELECT
group, MAX(timestamp) AS timestamp, value
FROM table_schema.sample_table
GROUP BY group, value
)mx
left join
(
SELECT
group, MIN(timestamp) as timestamp, value AS value
FROM table_schema.src_table
GROUP BY group, value
)mn
on mx.group = mn.group
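The fallback logic of this query can be tried out with a small sketch in Python's sqlite3 module. Everything here is an assumption for illustration: the sample rows are invented, and the columns are renamed to grp, ts and val because group is a reserved word. COALESCE(a, b) is used for the timestamp, which is equivalent to the CASE WHEN a IS NULL THEN b ELSE a END pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-ins for table_schema.sample_table and table_schema.src_table.
conn.execute("CREATE TABLE sample_table (grp TEXT, ts TEXT, val INTEGER)")
conn.execute("CREATE TABLE src_table (grp TEXT, ts TEXT, val INTEGER)")
conn.executemany(
    "INSERT INTO sample_table VALUES (?, ?, ?)",
    [("a", "2021-05-01", 10), ("b", None, 20)],  # group b has no timestamp
)
conn.executemany(
    "INSERT INTO src_table VALUES (?, ?, ?)",
    [("a", "2020-01-01", 1), ("b", "2020-06-01", 2)],
)

query = """
SELECT mx.grp,
       COALESCE(mx.ts, mn.ts) AS ts,
       CASE WHEN mx.ts IS NULL THEN mn.val ELSE mx.val END AS val
FROM (
    SELECT grp, MAX(ts) AS ts, val
    FROM sample_table
    GROUP BY grp, val
) AS mx
LEFT JOIN (
    SELECT grp, MIN(ts) AS ts, val
    FROM src_table
    GROUP BY grp, val
) AS mn
  ON mx.grp = mn.grp
ORDER BY mx.grp
"""
fallback_rows = conn.execute(query).fetchall()
print(fallback_rows)
```

Group a keeps its own timestamp and value, while group b, whose timestamp is NULL in sample_table, takes both from src_table.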

Creating a partitioned table from query in Big Query does not yield same as without partitioning

When I create a table, say "orders", with partitioning as shown below, the result is truncated compared to creating it without partitioning (by commenting and uncommenting rows five and six).
I suspect that it might have something to do with the BQ limits (found here), but I can't figure out what. ts is a timestamp field and order_id is a UUID string.
That is, the count distinct on the last row yields very different results: when the table is partitioned it returns far fewer order_ids than without partitioning.
DROP TABLE IF EXISTS
`project.dataset.orders`;
CREATE OR REPLACE TABLE
`project.dataset.orders`
-- PARTITION BY
-- DATE(ts)
AS
SELECT
ts,
order_id,
SUM(order_value) AS order_value
FROM
`project.dataset.raw_orders`
GROUP BY
1, 2;
SELECT COUNT(DISTINCT order_id) FROM `project.dataset.orders`;

(This is not a valid 'answer'; I just need a better place to write SQL than the comment box. I don't mind if a moderator converts this answer into a comment AFTER it has served its purpose.)
What is the number you'd get if you do query below, and which one does it align with (partitioned or non-partitioned)?
SELECT COUNT(DISTINCT order_id) FROM (
SELECT
ts,
order_id,
SUM(order_value) AS order_value
FROM
`project.dataset.raw_orders`
GROUP BY
1, 2
) t;

It turns out that there's a 60 day partition expiration!
https://cloud.google.com/bigquery/docs/managing-partitioned-tables#partition-expiration
So by updating the partition expiration I could get the full range.

Calculating Datediff from nested json sql query

I am trying to use datediff() to find the age of a person in a Postgres database (from their date of birth). I know I can run datediff() like this:
SELECT DATEDIFF(current_date, '2021-01-24');
The query I use to get the date of birth is below (it's stored as JSON):
select date_of_birth from (select attrs::json->'info'->>'date_of_birth' from users) as date_of_birth;
This gives me output like
date_of_birth
--------------
(2000-11-03)
(2000-06-11)
(2000-05-31)
(2008-11-26)
(2007-11-09)
(2020-03-26)
(2018-06-30)
I tried using
SELECT DATEDIFF(current_date, (select date_of_birth from (select attrs::json->'info'->>'date_of_birth' as date_of_birth from users));
It doesn't work. I tried several permutations, but I can't get it to work.
How should I edit my query to calculate the user age?

This query:
select date_of_birth
from (
select attrs::json->'info'->>'date_of_birth'
from users
) as date_of_birth;
returns the whole row rather than a column, because the column expression for the extracted date value has no alias of its own. It's like using select users from users. You need to make date_of_birth a column alias (not a table alias) and use that in the outer query.
To get the difference between two dates, just subtract them, but you need to cast the value to a date to be able to do that:
select current_date - u.date_of_birth
from (
select (attrs::json->'info'->>'date_of_birth')::date as date_of_birth
from users
) as u;
Or without a derived table:
select current_date - (u.attrs::json->'info'->>'date_of_birth')::date
from users as u
Apparently your dates are stored in a non-standard format. In that case you can't use a cast; you need to use the to_date() function instead:
to_date(u.attrs::json->'info'->>'date_of_birth', 'mm:dd:yyyy')
If you are storing JSON in the attrs column, you should convert it from a text (or varchar) column to a proper json (or better, jsonb) column so you don't need to cast it all the time.

Selecting distinct values from database

I have a table as follows:
ParentActivityID | ActivityID | Timestamp
1 A1 T1
2 A2 T2
1 A1 T1
1 A1 T5
I want to select the unique ParentActivityIDs along with a Timestamp. The timestamp can be either the most recent one or the first one occurring in the table.
I tried to use DISTINCT, but I came to realise that it doesn't work on individual columns. I am new to SQL. Any help in this regard will be highly appreciated.

DISTINCT applies to the whole select list, so with several columns it removes duplicate combinations of values rather than duplicates in one column. The same thing written with GROUP BY:
SELECT ParentActivityID, Timestamp
FROM MyTable
GROUP BY ParentActivityID, Timestamp
Actually, I want only one row per ParentActivityID. Your solution will give each pair of ParentActivityID and Timestamp. For example, if I have [1, T1], [2, T2], [1, T3], then I want the result to be [1, T3] and [2, T2].
You need to decide which of the many timestamps to pick. If you want the earliest one, use MIN:
SELECT ParentActivityID, MIN(Timestamp)
FROM MyTable
GROUP BY ParentActivityID

Try this:
SELECT [ParentActivityId],
MIN([Timestamp]) AS [FirstTimestamp],
MAX([Timestamp]) AS [RecentTimestamp]
FROM [Table]
GROUP BY [ParentActivityId]
This gives you the first timestamp and the most recent timestamp for each ParentActivityId present in your table. You can pick whichever one you need.

"Group by" is what you need here. Just do "group by ParentActivityID" and state that you want the most recent timestamp among all rows with the same ParentActivityID:
SELECT ParentActivityID, MAX(Timestamp) FROM Table GROUP BY ParentActivityID
The "group by" operator is like taking rows from a table and putting them in a map whose key is defined by the group by clause (ParentActivityID in this example). You have to define how the grouping handles rows with duplicate keys. For this you have various aggregate functions, which you apply to the columns you want to select that are not part of the key (not listed in the group by clause; think of them as the values in the map).
Some databases (like MySQL) also allow you to select columns that are not part of the group by clause (not in the key) without applying an aggregate function to them. In that case you get an arbitrary value for the column (like blindly overwriting the value in a map with a new value every time). The SQL standard and most databases out there will not allow this. In that case you can use min(), max(), or, where the database provides them, first() or last() to work around it.
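The map analogy can be made concrete with a few lines of Python. This is only a sketch of the idea, not how a database actually executes the query; the integers stand in for the timestamps T1, T2 and T5 from the question, and group_max is a hypothetical helper name.

```python
# Rows as (ParentActivityID, ActivityID, Timestamp); integers stand in for T1, T2, T5.
rows = [(1, "A1", 1), (2, "A2", 2), (1, "A1", 1), (1, "A1", 5)]


def group_max(rows):
    """Mimic SELECT ParentActivityID, MAX(Timestamp) ... GROUP BY ParentActivityID."""
    latest = {}  # key: ParentActivityID, value: largest timestamp seen so far
    for parent_id, _activity_id, ts in rows:
        # The aggregate decides what happens when the key is already present.
        if parent_id not in latest or ts > latest[parent_id]:
            latest[parent_id] = ts
    return latest


result = group_max(rows)
print(result)
```

Swapping the comparison for `ts < latest[parent_id]` would mimic MIN() instead, and dropping the comparison entirely corresponds to the "arbitrary value" behaviour described above.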

Use a CTE to get the latest row for each parent ID; you can then choose any columns from the entire row in the output.
;With cte_parent
As
(SELECT ParentActivityId,ActivityId,TimeStamp
, ROW_NUMBER() OVER(PARTITION BY ParentActivityId ORDER BY TimeStamp desc) RNO
FROM YourTable )
SELECT *
FROM cte_parent
WHERE RNO =1
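The ROW_NUMBER() approach can be checked against the sample data from the question with Python's sqlite3 module. This is a sketch under a few assumptions: the bundled SQLite must be version 3.25 or newer for window functions, integers stand in for the timestamps T1, T2 and T5, and the leading semicolon from the SQL Server version is dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE YourTable (ParentActivityId INTEGER, ActivityId TEXT, TimeStamp INTEGER)"
)
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    [(1, "A1", 1), (2, "A2", 2), (1, "A1", 1), (1, "A1", 5)],
)

query = """
WITH cte_parent AS (
    SELECT ParentActivityId, ActivityId, TimeStamp,
           -- Number the rows of each parent, newest first.
           ROW_NUMBER() OVER (
               PARTITION BY ParentActivityId ORDER BY TimeStamp DESC
           ) AS RNO
    FROM YourTable
)
SELECT ParentActivityId, ActivityId, TimeStamp
FROM cte_parent
WHERE RNO = 1
ORDER BY ParentActivityId
"""
latest_rows = conn.execute(query).fetchall()
print(latest_rows)
```

Unlike GROUP BY with MAX(), this keeps every column of the chosen row, which matters when the other columns differ between rows of the same parent.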

Getting unique column amongst duplicate columns but returning the complete row

I need help on creating a select statement in sql to get the unique rows.
I need the unique Reference ID but since Call Time is also unique, I only need to get the first row out of the similar rows.
I have this table[Calls]:
The result should be:
When I used:
Select Distinct * FROM Calls
It will return the same table and not the result I want.

This may help you.
min(date) is the first datetime for each individual:
Select referenceid,min(date),number from calls
group by referenceid,number

Perhaps a simple GROUP BY:
SELECT ReferenceID,
MIN(CallTime) AS CallTime,
MIN(Number) AS Number
FROM dbo.TableName t
GROUP BY ReferenceID
