Trying to get the greatest value from a customer on a given day - sql

What I need to do: if a customer makes more than one transaction in a day, I need to display the greatest value (and ignore any other values).
The query is pretty big, but the code I inserted below is the focus of the issue. I’m not getting the results I need. The subselect ideally should be reducing the number of rows the query generates since I don’t need all the transactions, just the greatest one, however my code isn’t cutting it. I’m getting the exact same number of rows with or without the subselect.
Note: I don't actually have a t.* in the actual query; there are just a dozen or so other fields being pulled in. I added the t.* just to simplify the code example.
SELECT
t.*,
(SELECT TOP (1)
t1.CustomerGUID
t1.Value
t1.Date
FROM #temp t1
WHERE t1.CustomerGUID = t.CustomerGUID
AND t1.Date = t.Date
ORDER BY t1.Value DESC) AS "Value"
FROM #temp t
Is there an obvious flaw in my code or is there a better way to achieve the result of getting the greatest value transaction per day per customer?
Thanks

You may want to do it as follows:
SELECT
t1.CustomerGUID,
t1.Date,
MAX(t1.Value) AS Value
FROM #temp t1
GROUP BY
t1.CustomerGUID,
t1.Date

You can use row_number() as shown below.
SELECT
*
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY CustomerGUID, Date ORDER BY Value DESC) AS SrNo FROM <YourTable>
)
<YourTable>
WHERE
SrNo = 1
Sample data will be more helpful.
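In the absence of real data, here is a minimal T-SQL sketch with made-up rows. The table contents are hypothetical and an int stands in for the GUID; only the column names come from the question. It partitions by both customer and day and orders by Value descending, so row number 1 is the greatest transaction for that customer on that day.

```sql
-- Hypothetical sample data; only the column names follow the question.
CREATE TABLE #temp (CustomerGUID int, [Date] date, [Value] decimal(10, 2));

INSERT INTO #temp (CustomerGUID, [Date], [Value]) VALUES
    (1, '2020-01-01', 10.00),
    (1, '2020-01-01', 25.00),  -- greatest for customer 1 on 2020-01-01
    (1, '2020-01-02',  5.00),
    (2, '2020-01-01',  7.50);

SELECT CustomerGUID, [Date], [Value]
FROM (
    SELECT t.*,
           -- Number the rows within each customer/day, largest Value first.
           ROW_NUMBER() OVER (PARTITION BY t.CustomerGUID, t.[Date]
                              ORDER BY t.[Value] DESC) AS SrNo
    FROM #temp t
) ranked
WHERE SrNo = 1;
-- Three rows remain (in no guaranteed order):
-- (1, 2020-01-01, 25.00), (1, 2020-01-02, 5.00), (2, 2020-01-01, 7.50)
```

Because the filter is applied in the outer query, the other columns of the winning row come along, which a plain GROUP BY with MAX would not give you.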

Try this window function:
MAX(value) OVER(PARTITION BY date,customer ORDER BY value DESC)
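It's typically cheaper than a correlated subquery. Note that a window function returns a value on every row, so by itself it does not reduce the number of rows; you still need to filter on it to drop the other transactions.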

Probably many other ways to do it, but this one is simple and works
select t.*
from (
select
convert(varchar(8), r.date,112) one_day
,max(r.Value) max_sale
from #temp r
group by convert(varchar(8), r.date,112)
) e
inner join #temp t on t.value = e.max_sale and convert(varchar(8), t.date,112) = e.one_day
If you have two people who spend the exact same amount and that amount is also the max, you'll get two records for that day.
The convert(varchar(8), r.date,112) will perform as desired on date, datetime and datetime2 data types. If your date is a varchar, char, nchar or nvarchar, you'll want to examine the data to find out whether to use left(t.date,10) or left(t.date,8) on it.
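As a quick check of what the conversion does, here is a small T-SQL sketch with a made-up literal. Style 112 formats the value as yyyymmdd, so the time part is dropped; casting to date is an alternative that keeps a date type.

```sql
-- Style 112 formats a date/time value as yyyymmdd, dropping the time part.
SELECT CONVERT(varchar(8), CAST('2020-03-15T14:30:00' AS datetime), 112) AS one_day;
-- one_day = '20200315'

-- Casting to date also drops the time part but keeps a date type.
SELECT CAST(CAST('2020-03-15T14:30:00' AS datetime) AS date) AS one_day;
-- one_day = 2020-03-15
```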

If I've understood your requirement correctly, you have stated "greatest value transaction per day per customer". That suggests to me you don't want one row per customer in the output but one row per day per customer.
To achieve this you can group on the day like this
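-- cast to date keeps the full calendar day; datepart(day, ...) would return only
-- the day of the month and merge, for example, 5 January with 5 February
Select t.CustomerGUID, cast(t.date as date) as Daydate,
max(t.value) as value from #temp t group by
t.CustomerGUID, cast(t.date as date);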

Related

SQL Server: loop once a month value to the whole month

I have a table that gets a value on only one day in each month. I want to carry that value across the whole month until a new value shows up. The result will be a table with data for each day of the month, based on the last known value.
Can someone help me write this query?
This is untested, due to a lack of consumable sample data, but this looks like a gaps-and-islands problem. Here you can count the number of non-NULL values for Yield to assign the group "number" and then get the windowed MAX in the outer SELECT:
WITH CTE AS(
SELECT Yield,
[Date],
COUNT(yield) OVER (ORDER BY [Date]) AS Grp
FROM dbo.YourTable)
SELECT MAX(yield) OVER (PARTITION BY grp) AS yield,
[Date],
DATENAME(WEEKDAY,[Date]) AS [Day]
FROM CTE;
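To see how the grouping works, here is a small T-SQL sketch over made-up rows. The sample values and the CTE names SampleData and Grouped are hypothetical, and it assumes the table already has one row per day with NULL where no value was recorded. COUNT(Yield) skips NULLs, so the running count only increases on days that carry a value, and every following NULL day shares that group number.

```sql
-- Hypothetical sample: a value on some days, NULL on the days in between.
WITH SampleData AS (
    SELECT * FROM (VALUES
        (CAST('2020-01-01' AS date), CAST(1.5 AS decimal(5, 2))),
        (CAST('2020-01-02' AS date), NULL),
        (CAST('2020-01-03' AS date), NULL),
        (CAST('2020-02-01' AS date), 2.0),
        (CAST('2020-02-02' AS date), NULL)
    ) v ([Date], Yield)
),
Grouped AS (
    SELECT [Date], Yield,
           -- Running count of non-NULL values: 1, 1, 1, 2, 2 for the rows above.
           COUNT(Yield) OVER (ORDER BY [Date]) AS Grp
    FROM SampleData
)
SELECT [Date], Grp,
       -- Each group holds exactly one non-NULL value, so MAX picks it up.
       MAX(Yield) OVER (PARTITION BY Grp) AS Yield
FROM Grouped
ORDER BY [Date];
-- Yield comes back as 1.5 for the three January rows and 2.0 for both February rows.
```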
You seem to have data on the first of the month. That suggests an alternative approach:
select t.*, t2.yield as imputed_yield
from t cross apply
(select t2.*
from t t2
where t2.date = datefromparts(year(t.date), month(t.date), 1)
) t2;
This should be able to take advantage of an index on (date, yield). And it does assume that the value you want is on the first date of the month.

SQL: How to select a max record each day?

I found a lot of similar questions, but none fits my case perfectly, and I have been struggling for hours to find a solution. My table is composed of the fields DAY, HOUR, EVENT1, EVENT2, EVENT3, so I have 24 rows each day. EVENT1, EVENT2 and EVENT3 hold some values, and for each day I'd like to select only the row (I mean the record) for which EVENT3 has the maximum value in the day (among the 24 hours). The final outcome will be one row per day.
One method uses a correlated subquery:
select t.*
from t
where t.event3 = (select max(t2.event3)
from t t2
where t2.date = t.date
);
In most databases, this has very good performance with an index on (date, event3).
A more canonical solution uses row_number():
select t.*
from (select t.*,
row_number() over (partition by date order by event3 desc) as seqnum
from t
) t
where seqnum = 1;
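One difference between the two queries above is how they treat ties: the correlated subquery returns every row that shares the daily maximum, while row_number() keeps one of them arbitrarily. If you want the tie behaviour of the first query with the shape of the second, rank() is a possible swap. A minimal sketch with made-up rows (the column names day_no and hour_no are hypothetical stand-ins for DAY and HOUR):

```sql
-- Hypothetical sample rows; day 1 has two hours tied for the maximum.
WITH t (day_no, hour_no, event3) AS (
    SELECT 1, 0, 5 UNION ALL
    SELECT 1, 1, 9 UNION ALL
    SELECT 1, 2, 9 UNION ALL
    SELECT 2, 0, 4
)
SELECT day_no, hour_no, event3
FROM (
    SELECT t.*,
           -- rank() gives tied rows the same number, so both get 1.
           RANK() OVER (PARTITION BY day_no ORDER BY event3 DESC) AS rnk
    FROM t
) ranked
WHERE rnk = 1;
-- Day 1 returns both tied rows (hours 1 and 2); day 2 returns its single row.
```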
Another option, aside from using correlated subqueries, is to write this as a left self-join, something like this:
SELECT t.*
FROM t
LEFT JOIN t AS t2 ON t.day = t2.day AND t2.event3 > t.event3
WHERE t2.id IS NULL
If you want to select an arbitrary matching row each day in the event of multiple rows with the same maximum event3, tack GROUP BY t.day on the end of that.
I'm not sure how performance of this is going to compare to Gordon Linoff's solutions, but they might get assembled into quite similar query plans by the RDBMS anyway.

Get a new column with updated values, where each row change in time depending on the actual column?

I have some data whose columns are an ID, a Date and a Place denoted by a number. I need to simulate a real-time update by creating a new column that says how many different places there are at that moment, so each time a new place appears in the column, the new column changes its value and shows the new count.
This is just a little piece of the original table with hundreds of millions of rows.
Here is an example, the left table is the original one and the right table is what I need.
I tried to do it with this piece of code but I cannot use the function DISTINCT with the OVER clause.
SELECT ID, Dates, Place,
count (distinct(Place)) OVER (PARTITION BY Place ORDER BY Dates) AS
DiffPlaces
FROM #informacion_prendaria_muestra
order by ID;
I think this is possible using DENSE_RANK() in SQL Server.
You can try this:
SELECT ID, Dates, Place,
DENSE_RANK() OVER(ORDER BY Place) AS
DiffPlaces
FROM #informacion_prendaria_muestra
I think you can use a self-join query like this, without using window functions:
select
t.ID, t.[Date], t.Place,
count(distinct tt.Place) diffPlace
from
yourTable t left join
yourTable tt on t.ID = tt.ID and t.[Date] >= tt.[Date]
group by
t.ID, t.[Date], t.Place
order by
Id, [Date];
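If the self join is too heavy for hundreds of millions of rows, another possible approach is to flag the first appearance of each place and take a running sum of the flags. This is only a sketch under assumptions: the sample rows and the CTE names are made up, the count is kept per ID as in the self join above, dates are assumed unique within an ID, and it needs SQL Server 2012 or later for ORDER BY in an aggregate window.

```sql
-- Hypothetical sample rows for a single ID.
WITH SampleData AS (
    SELECT * FROM (VALUES
        (1, CAST('2020-01-01' AS date), 10),
        (1, CAST('2020-01-02' AS date), 10),
        (1, CAST('2020-01-03' AS date), 20),
        (1, CAST('2020-01-04' AS date), 10),
        (1, CAST('2020-01-05' AS date), 30)
    ) v (ID, Dates, Place)
),
Flagged AS (
    SELECT ID, Dates, Place,
           -- rn = 1 marks the first time this Place appears for this ID.
           ROW_NUMBER() OVER (PARTITION BY ID, Place ORDER BY Dates) AS rn
    FROM SampleData
)
SELECT ID, Dates, Place,
       -- Running total of first appearances = distinct places seen so far.
       SUM(CASE WHEN rn = 1 THEN 1 ELSE 0 END)
           OVER (PARTITION BY ID ORDER BY Dates
                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS DiffPlaces
FROM Flagged
ORDER BY ID, Dates;
-- DiffPlaces for the five rows: 1, 1, 2, 2, 3
```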

Get minimum without using row number/window function in Bigquery

I have a table like as shown below
What I would like to do is get the minimum of each subject. Though I am able to do this with row_number function, I would like to do this with groupby and min() approach. But it doesn't work.
row_number approach - works fine
SELECT * FROM (select subject_id,value,id,min_time,max_time,time_1,
row_number() OVER (PARTITION BY subject_id ORDER BY value) AS rank
from table A) WHERE RANK = 1
min() approach - doesn't work
select subject_id,id,min_time,max_time,time_1,min(value) from table A
GROUP BY SUBJECT_ID,id
As you can see, just the two columns (subject_id and id) are enough to group the items together. They help differentiate the group. But why am I not able to use the other columns in the select clause? If I use the other columns, I may not get the expected output because time_1 has different values.
I expect my output to be like as shown below
In BigQuery you can use aggregation for this:
SELECT ARRAY_AGG(a ORDER BY value LIMIT 1)[SAFE_OFFSET(0)].*
FROM table A
GROUP BY SUBJECT_ID;
This uses ARRAY_AGG() to aggregate each record (the a in the argument list). ARRAY_AGG() allows you to order the result (by value) and to limit the size of the array. The latter is important for performance.
After you aggregate the records into the array, you want its first element. The .* transforms the record referred to by a into its component columns.
I'm not sure why you don't want to use ROW_NUMBER(). If the problem is the lingering rank column, you can easily remove it:
SELECT a.* EXCEPT (rank)
FROM (SELECT a.*,
ROW_NUMBER() OVER (PARTITION BY subject_id ORDER BY value) AS rank
FROM A
) a
WHERE RANK = 1;
Are you looking for something like below-
SELECT
A.subject_id,
A.id,
A.min_time,
A.max_time,
A.time_1,
A.value
FROM table A
INNER JOIN(
SELECT subject_id, MIN(value) Value
FROM table
GROUP BY subject_id
) B ON A.subject_id = B.subject_id
AND A.Value = B.Value
If you do not need to select the time_1 column's value, the following query will work (as far as I can see, the values in min_time and max_time are the same within each group):
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
--A.time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time
Finally, the best approach is if you can apply something like CAST(Time_1 AS DATE) on your time column. This will consider only the date part regardless of the time part. The query will be
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE) Time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE)
-- Check that the CAST ... AS DATE syntax in BigQuery
-- matches what I have written here; it may differ slightly.
Below is for BigQuery Standard SQL and is the most efficient way for cases like the one in your question
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY value LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY subject_id
Using ROW_NUMBER is not efficient and in many cases leads to a Resources exceeded error.
Note: a self join is also a very inefficient way of achieving your objective
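To try the ARRAY_AGG pattern without a real table, here is a small BigQuery Standard SQL sketch over made-up rows. The CTE name sample_rows and its values are hypothetical. Note that OFFSET is zero-based, so OFFSET(0) is the first element; ORDINAL(1) would be the one-based equivalent.

```sql
#standardSQL
-- Hypothetical sample rows standing in for the real table.
WITH sample_rows AS (
  SELECT 1 AS subject_id, 'x' AS id, 7 AS value UNION ALL
  SELECT 1, 'y', 3 UNION ALL
  SELECT 2, 'z', 5
)
-- Keep only the row with the smallest value per subject_id.
SELECT AS VALUE ARRAY_AGG(t ORDER BY value LIMIT 1)[OFFSET(0)]
FROM sample_rows t
GROUP BY subject_id;
-- One row per subject_id: (1, 'y', 3) and (2, 'z', 5)
```

The LIMIT 1 inside ARRAY_AGG keeps each array to a single element, so the whole group never has to be held in the array.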
A bit late to the party, but here is a cte-based approach which made sense to me:
with mins as (
select subject_id, id, min(value) as min_value
from table
group by subject_id, id
)
select distinct t.subject_id, t.id, t.time_1, t.min_time, t.max_time, m.min_value
from table t
join mins m on m.subject_id = t.subject_id and m.id = t.id

How to set updating row's field with value of closest to it by date another field?

I have a huge table with 2m+ rows.
The structure is like that:
ThingName (STRING),
Date (DATE),
Value (INT64)
Sometimes Value is null, and I need to fix it by setting it to the non-null Value of the row closest to it by Date for the same ThingName.
And I am not an SQL person at all.
I tried to describe my task with this query (simplified a lot by looking only at previous dates, although I actually need to check future dates too):
update my_tbl as SDP
set SDP.Value = (select SDPI.Value
from my_tbl as SDPI
where SDPI.Date < SDP.Date
and SDP.ThingName = SDPI.ThingName
and SDPI.Value is not null
order by SDPI.Date desc limit 1)
where SDP.Value is null;
Here I try to set the Value of the row being updated to one that I select from the same table for the same ThingName, and with limit 1 I keep only a single result.
But the query editor tells me this:
Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
Actually, I am not sure at all that my task can be solved just with query.
So, can anyone help me? If this is impossible, please tell me so; if it is possible, tell me which SQL constructs may help.
Below is for BigQuery Standard SQL
In many (if not most) cases you don't want to update your table (as it incurs the extra cost and limitations associated with DML statements) but can instead fill in the 'missing' values in-query, as in the example below:
#standardSQL
SELECT
ThingName,
date,
IFNULL(value,
LAST_VALUE(value IGNORE NULLS)
OVER(PARTITION BY thingname ORDER BY date)
) AS value
FROM `project.dataset.my_tbl`
If for some reason you actually need to update the table, the statement above will not help, as DML's UPDATE does not allow the use of analytic functions, so you need another approach. For example, the one below:
#standardSQL
SELECT
t1.ThingName, t1.date,
ARRAY_AGG(t2.Value IGNORE NULLS ORDER BY t2.date DESC LIMIT 1)[OFFSET(0)] AS value
FROM `project.dataset.my_tbl` AS t1
LEFT JOIN `project.dataset.my_tbl` AS t2
ON t2.ThingName = t1.ThingName
AND t2.date <= t1.date
GROUP BY t1.ThingName, t1.date, t1.value
and now you can use it to update your table as in example below
#standardSQL
UPDATE `project.dataset.my_tbl` t
SET value = new_value
FROM (
SELECT TO_JSON_STRING(t1) AS id,
ARRAY_AGG(t2.Value IGNORE NULLS ORDER BY t2.date DESC LIMIT 1)[OFFSET(0)] new_value
FROM `project.dataset.my_tbl` AS t1
LEFT JOIN `project.dataset.my_tbl` AS t2
ON t2.ThingName = t1.ThingName
AND t2.date <= t1.date
GROUP BY id
)
WHERE TO_JSON_STRING(t) = id
In BigQuery, updates are rather rare. The logic you seem to want is:
select t.* except (value),
coalesce(value,
last_value(value ignore nulls) over (partition by thingname order by date)
) as value
from my_tbl t;
I don't really see a reason to save this back in the table.
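The question also mentions that the closest row may lie in the future. A possible in-query sketch for BigQuery Standard SQL that looks in both directions is shown here. The sample rows, the CTE name neighbours and the window names w_prev and w_next are all hypothetical, and ties are broken in favour of the earlier row, which is an assumption rather than something the question specifies.

```sql
#standardSQL
-- Hypothetical sample rows standing in for my_tbl.
WITH my_tbl AS (
  SELECT 'a' AS ThingName, DATE '2020-01-01' AS Date, 10 AS Value UNION ALL
  SELECT 'a', DATE '2020-01-02', NULL UNION ALL
  SELECT 'a', DATE '2020-01-05', NULL UNION ALL
  SELECT 'a', DATE '2020-01-06', 20
),
neighbours AS (
  SELECT
    ThingName, Date, Value,
    -- Nearest earlier non-NULL value and the date it was recorded on.
    LAST_VALUE(Value IGNORE NULLS) OVER w_prev AS prev_value,
    LAST_VALUE(IF(Value IS NULL, NULL, Date) IGNORE NULLS) OVER w_prev AS prev_date,
    -- Nearest later non-NULL value and its date.
    FIRST_VALUE(Value IGNORE NULLS) OVER w_next AS next_value,
    FIRST_VALUE(IF(Value IS NULL, NULL, Date) IGNORE NULLS) OVER w_next AS next_date
  FROM my_tbl
  WINDOW
    w_prev AS (PARTITION BY ThingName ORDER BY Date
               ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
    w_next AS (PARTITION BY ThingName ORDER BY Date
               ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
)
SELECT
  ThingName,
  Date,
  CASE
    WHEN Value IS NOT NULL THEN Value
    WHEN next_date IS NULL THEN prev_value
    WHEN prev_date IS NULL THEN next_value
    -- Pick whichever neighbour is fewer days away; the earlier one wins ties.
    WHEN DATE_DIFF(Date, prev_date, DAY) <= DATE_DIFF(next_date, Date, DAY) THEN prev_value
    ELSE next_value
  END AS Value
FROM neighbours;
-- With the sample rows, 2020-01-02 is filled with 10 and 2020-01-05 with 20.
```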