INSERT INTO two columns from a SELECT query - sql

I have a table called VIEWS with Id, Day, Month, name of video, name of browser... but I'm interested only in Id, Day and Month.
The ID can be duplicated because the user (ID) can watch a video on multiple days in multiple months.
This is the query for the minimum date and the maximum date.
SELECT ID, CONCAT(MIN(DAY), '/', MIN(MONTH)) AS MIN_DATE,
       CONCAT(MAX(DAY), '/', MAX(MONTH)) AS MAX_DATE
FROM Views
GROUP BY ID
I want to insert this SELECT's two columns (MIN_DATE and MAX_DATE) into two new columns with INSERT INTO.
What would the INSERT INTO query look like?

To do what you are trying to do (there are some issues with your solution, please read my comments below), first you need to add the new columns to the table.
ALTER TABLE Views ADD MIN_DATE VARCHAR(10)
ALTER TABLE Views ADD MAX_DATE VARCHAR(10)
Then you need to UPDATE your new columns (not INSERT, because you don't want new rows). Determine the min/max for each ID, then join the result back to the table to be able to update each row. You can't update directly from a GROUP BY, as the grouped rows no longer map to the individual rows you want to update.
;WITH MinMax AS
(
    SELECT
        ID,
        CONCAT(MIN(V.DAY), '/', MIN(V.MONTH)) AS MIN_DATE,
        CONCAT(MAX(V.DAY), '/', MAX(V.MONTH)) AS MAX_DATE
    FROM
        Views AS V
    GROUP BY
        ID
)
UPDATE V SET
    MIN_DATE = M.MIN_DATE,
    MAX_DATE = M.MAX_DATE
FROM
    MinMax AS M
    INNER JOIN Views AS V ON M.ID = V.ID
The problems that I see with this design are:
Storing aggregated columns: you usually want to do this only for performance reasons (which I believe is not the case here), as querying the aggregated (grouped) rows is faster because there are fewer rows to read. The problem is that you will have to update the grouped values each time one of the original rows is updated, which adds extra processing time. Another option would be periodically updating the aggregated values, but you would have to accept that for a period of time the grouped values do not really represent the tracking table.
Keeping aggregated columns on the same table as the data they are aggregating: this is a normalization problem. Updating or inserting a row will trigger updating all rows with the same ID, as the min/max values might have changed. Also, the min/max values will always be repeated on all rows that belong to the same ID, which is extra space that you are wasting. If you had to save aggregated data, you would need to save it in a different table, which causes the problems I listed in the previous point.
Using a text data type to store dates: you always want to work with dates using a proper DATETIME data type. This will not only let you use date functions like DATEADD or DATEDIFF, but also save space (varchars that store dates need more bytes than a DATETIME). I don't see a year part in your query; it should be considered when computing a min/max (this might depend on what you are storing in this table). See the sketch after this list.
Computing the min/max incorrectly: If you have the following rows:
ID  DAY  MONTH
1   5    1
1   3    2
The current result of your query would be 3/1 as MIN_DATE and 5/2 as MAX_DATE, which I believe is not what you are trying to find. The lowest here should be the 5th of January and the highest the 3rd of February. This is a consequence of storing date parts as independent values and not the whole date as a DATETIME.
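Illustrating the last two points, here is a minimal sketch, assuming the table also has (or is given) a YEAR column, which the question does not show: building a real DATE with DATEFROMPARTS makes MIN/MAX compare whole dates instead of independent day and month parts.
-- Hypothetical sketch: the YEAR column is an assumption, not shown in the question
SELECT
    ID,
    MIN(DATEFROMPARTS([YEAR], [MONTH], [DAY])) AS MIN_DATE,
    MAX(DATEFROMPARTS([YEAR], [MONTH], [DAY])) AS MAX_DATE
FROM Views
GROUP BY ID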
What you usually want to do in this scenario is to group directly in the query that needs the data grouped, so you do the GROUP BY on the SELECT that needs the min/max. Having an index on ID would make the grouping very fast. Thus, you save the storage space you would use to keep the aggregated values, and the result is always the real grouped result at the time you are querying.
It would be something like the following:
;WITH MinMax AS
(
    SELECT
        ID,
        CONCAT(MIN(V.DAY), '/', MIN(V.MONTH)) AS MIN_DATE, -- Date problem (varchar + min/max computed separately)
        CONCAT(MAX(V.DAY), '/', MAX(V.MONTH)) AS MAX_DATE  -- Date problem (varchar + min/max computed separately)
    FROM
        Views AS V
    GROUP BY
        ID
)
SELECT
    V.*,
    M.MIN_DATE,
    M.MAX_DATE
FROM
    MinMax AS M
    INNER JOIN Views AS V ON M.ID = V.ID
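On the indexing point, a minimal sketch (the index name and INCLUDE columns are illustrative): covering DAY and MONTH lets the aggregation be answered from the index alone.
-- Hypothetical index to speed up the GROUP BY ID aggregation
CREATE NONCLUSTERED INDEX IX_Views_ID ON Views (ID) INCLUDE ([DAY], [MONTH]);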

Related

SQL - Presto: Create a new table that counts values without duplicates having issues (AWS Athena)

I have a table response_data_v0 that contains several duplicates of the UUID values, which I want to ignore, keeping just the row with the oldest date (in other words, I only want the row with the first appearance of that specific UUID).
I built a temporary table using WITH and filtered it by taking min(uuid) as a column just to get unique values; then I created a second one that counts those values.
Given that the values of the UUID must be unique (following my logic), I created a validation column excess_data to test my hypothesis. All the values of excess_data should be = 0 if I am not getting duplicates in the first table, given that
count(uuid) = count(distinct uuid)
in this specific case.
BUT, that is not happening, "excess_data" > 0 in all my results.
What am I doing wrong??
with unique_values as (
    select
        min(uuid) as uuid,
        url,
        day,
        month,
        year
        --response_data
    from "data_lake"."response_data_v0"
    group by url, day, month, year
    --order by uuid
)
select
    count(uuid) as count_uuids,
    count(distinct uuid) as count_unique_uuids,
    count(uuid) - count(distinct uuid) as excess_data,
    month,
    year
from unique_values
group by year, month
order by year, month
I would argue that the problem here is the different GROUP BY clauses used to filter out the duplicates and to verify that filtering. Unless the business logic of the app creating the data prevents the same uuid from appearing on different days, the observed behavior is pretty much bound to happen: the unique_values grouping clause includes day, while the final select's does not. Either add day to the result query's GROUP BY clause, or remove it from unique_values's.
PrestoDB/Trino have a min_by aggregate function that returns the value of a column associated with the minimum value of another column (of the group). min_by(uuid, my_date) would return the value of the uuid column of the row with the minimum value of my_date in the group. Just assemble a string from your year, month, day columns and use that in place of my_date.
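A minimal sketch of that idea, assuming year, month and day are integer columns (an assumption; the question doesn't show the schema): zero-padding each part with lpad makes the assembled string sort chronologically, and grouping by uuid keeps exactly one row per UUID, taken from its earliest appearance.
-- Hypothetical sketch: one row per uuid, attributes taken from its earliest date
select
    uuid,
    min_by(url,   ord) as url,
    min_by(day,   ord) as day,
    min_by(month, ord) as month,
    min_by(year,  ord) as year
from (
    select *,
           lpad(cast(year  as varchar), 4, '0')
        || lpad(cast(month as varchar), 2, '0')
        || lpad(cast(day   as varchar), 2, '0') as ord
    from "data_lake"."response_data_v0"
) as t
group by uuid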

Bigquery - how to aggregate data based on conditions

I have a simple table like the following, which has product, price, cost and category. price and cost can be null.
And this table is being updated from time to time. Now I want to have a daily summary of the table content grouped by category, to see in each category how many products have no price, how many have a price, and how many have a price that is higher than the cost, so the result table would look like the following:
I think I can get a query running every day by setting up a query re-run schedule in BigQuery, so I can have three rows of data appended to the result table every day.
But the problem is, how can I get those three rows? I know I can GROUP BY, but how do I get the counts with those conditions like not null, larger than, etc.?
You seem to want window functions:
select t.*,
    countif(price is null) over (partition by date) as products_no_price,
    countif(price <= cost) over (partition by date) as products_price_lower_than_cost
from t;
You can run this code on the table that has a date column. In fact, you don't need to store the last two columns.
If you want to insert the first table into the second, then there is no date and you can simply use:
select t.*,
    countif(price is null) over () as products_no_price,
    countif(price <= cost) over () as products_price_lower_than_cost
from t;
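If you instead want the summary rows themselves (one per category), a plain GROUP BY with COUNTIF is a minimal sketch; the table name my_dataset.products and the column names are assumptions based on the question.
-- Hypothetical sketch: one summary row per category per day
SELECT
  CURRENT_DATE() AS snapshot_date,
  category,
  COUNTIF(price IS NULL) AS products_no_price,
  COUNTIF(price IS NOT NULL) AS products_with_price,
  COUNTIF(price > cost) AS products_price_above_cost
FROM my_dataset.products
GROUP BY category;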

How to get last value from a table category wise?

I have a problem with retrieving the last value of every category from my table, which should not be sorted. For example, I want the last appearance of the Nov-1 daily inventory value in the table, without sorting the daily inventory column, i.e. "471". Is there a way to achieve this?
Similarly, I need to get the next week's last daily inventory value, and I should be able to do this for multiple items in the table too.
P.S.: Nov-1 represents the Nov 1st week.
Question from the comments of the initial post: will I be able to achieve what I need if I introduce an id column? If so, how can I do it?
Here's a way to do it (no guarantee that it's the most efficient way to do it)...
;WITH SetID AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY Week ORDER BY Week) AS rowid, * FROM <TableName>
),
MaxRow AS
(
SELECT LastRecord = MAX(rowid), Week
FROM SetID
GROUP BY Week
)
SELECT a.*
FROM SetID a
INNER JOIN MaxRow b
ON a.rowid = b.LastRecord
AND b.Week = a.Week
ORDER BY a.Week
I feel like there's more to the table though, and this is also untested on large amounts of data. I'd be afraid that a different rowid could potentially be assigned on each run, since ORDER BY Week inside a partition of the same Week gives no deterministic order. (I haven't used ROW_NUMBER() enough to know if this would produce unexpected results.)
I suppose this example is to reinforce the idea that, if you had a dedicated row ID on the table, it's possible (see the sketch below). Also, I believe #Larnu's comment to you on your original post - that introducing an ID column that retains the current order means reinserting all your data - is a concern too.
Here's a SQLFiddle example here.
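A minimal sketch of that idea, using a hypothetical InventoryLog table: an IDENTITY column records insertion order for future rows, which makes "last per week" well-defined. Note #Larnu's caveat above: the values assigned to pre-existing rows are not guaranteed to follow their original insertion order.
-- Hypothetical sketch: InventoryLog is an assumed table name
ALTER TABLE InventoryLog ADD RowID INT IDENTITY(1,1);

SELECT a.*
FROM InventoryLog AS a
INNER JOIN (
    SELECT Week, MAX(RowID) AS LastRowID
    FROM InventoryLog
    GROUP BY Week
) AS b
    ON a.RowID = b.LastRowID
ORDER BY a.Week;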

Select latest and earliest times within a time group and a pivot statement

I have attendance data that contains a username, time, and status (IN or OUT). I want to show attendance data that contains a name and the check in/out times. I expect a person to check in and out no more than twice a day. The data looks like this:
As you can see, my problem is that one person can have multiple data entries in different seconds for the same login attempt. This is because I get data from a fingerprint attendance scanner, and the machine in some cases makes multiple entries, sometimes just within 5-10 seconds. I want to select the data to be like this:
How can I identify the proper time for the login attempt, and then select the data with a pivot?
First, you need to normalize your data by removing the duplicate entries. In your situation, that's a challenge because the duplicated data isn't easily identified as a duplicate. You can make some assumptions though. Below, I assume that no one will make multiple login attempts in a two-minute window. You can do this by first using a Common Table Expression (CTE, using the WITH clause).
Within the CTE, you can use the LAG function. Essentially what this code is saying is "for each partition of user and entry type, if the previous value was within 2 minutes of this value, then put a number, otherwise put null." I chose null as the flag that keeps the value; the first entry of each partition is compared against a far-past default date, so it always comes out as a distinct attempt (a null flag). Note that a window function such as LAG cannot appear directly in a WHERE clause, so the flag is computed in its own CTE first. Your CTE will then just return the entry events (ID) that were distinct attempts.
Now, you prepare another CTE that a PIVOT will pull from, which has everything from your table, but only for the entry IDs you care about. The PIVOT then takes the MIN/MAX of your IN/OUT times.
WITH FLAGGED AS (
    -- A window function can't be used in a WHERE clause, so flag rows here
    SELECT ID,
           CASE WHEN LAG(TIME, 1, DATE '1900-01-01')
                        OVER (PARTITION BY USERNAME, STATUS ORDER BY TIME)
                     + (2/60/24) < TIME
                THEN NULL ELSE 1 END AS DUP_FLAG -- Times within 2 minutes
    FROM LOGIN_TABLE
),
UNIQUE_LOGINS AS (
    SELECT ID FROM FLAGGED WHERE DUP_FLAG IS NULL
),
TEMP_FOR_PIVOT AS (
    SELECT USERNAME, TIME, STATUS FROM LOGIN_TABLE
    WHERE ID IN (SELECT ID FROM UNIQUE_LOGINS)
)
SELECT * FROM TEMP_FOR_PIVOT
PIVOT (
    MIN(TIME) AS FIRST_T, MAX(TIME) AS LAST_T
    FOR STATUS IN ('IN' AS IN_T, 'OUT' AS OUT_T)
)
From there, if you need to rearrange or rename your columns, then you can just put that last SELECT into yet another CTE and then select your values from it. There is some more about PIVOT here: Rotate/pivot table with aggregation in Oracle

SQL Server: I have multiple records per day and I want to return only the first of the day

I have some records that track inquiries by DATETIME. There is a glitch in the system and sometimes a record will be entered multiple times on the same day. I have a query with a bunch of correlated subqueries attached to these, but the numbers are off because when there were those glitches in the system, these leads show up multiple times. I need the first entry of the day; I tried fooling around with MIN but I couldn't quite get it to work.
I currently have this, I am not sure if I am on the right track though.
SELECT SL.UserID, MIN(SL.Added) OVER (PARTITION BY SL.UserID)
FROM SourceLog AS SL
Here's one approach using row_number():
select *
from (
select *,
row_number() over (partition by userid, cast(added as date) order by added) rn
from sourcelog
) t
where rn = 1
You could use GROUP BY along with MIN to accomplish this.
Depending on how your data is structured: if you are assigning a unique sequential number to each record created, you could just return the lowest number created per day. Otherwise, you would need to return the ID of the record with the earliest DATETIME value per day (see the sketch after the query below).
--Assumes sequential IDs
select
    min(Id)
from
    [YourTable]
group by
    --the conversion is used to strip the time value out of the date/time
    convert(date, [YourDateTime])
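For the non-sequential case, a minimal sketch (same assumed [YourTable] and [YourDateTime] names as above) that returns each day's earliest record; note that exact ties on the datetime would return multiple rows.
--Returns the ID of the record with the earliest DATETIME value per day
select t.Id
from [YourTable] as t
inner join (
    select min([YourDateTime]) as FirstTime
    from [YourTable]
    group by convert(date, [YourDateTime])
) as d
    on t.[YourDateTime] = d.FirstTime;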