SQL query where I have worked out the algorithm but cannot write the code - sql

I could not find the right keywords to describe this in the title.
I have a problem that I can only explain with an example. I have a table like this:
user_id | transaction_id | bonus_id | created_at
1         1                4          2021-05-01
1         3                65         2021-05-01
1         4                4          2021-05-02
1         1                5          2021-05-02
1         3                76         2021-05-03
1         2                5          2021-05-03
Due to a mistake I made in PHP, the row with transaction_id 3 got bonus_id 65 when it should have been bonus_id 4.
I need to give every transaction, from one transaction_id 1 row up to (but not including) the next transaction_id 1 row, the bonus_id of that first transaction_id 1 row.
But of course I have to do this for every user. How can I do that?
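One way to express that algorithm in SQL, as a sketch only: it assumes MySQL 8+ (for window functions) and that the table is named transactions with a unique primary-key column id (both names are assumptions; adjust them to your schema). The running SUM turns every transaction_id = 1 row into the start of a new group per user, and FIRST_VALUE then copies that row's bonus_id onto the whole group:
UPDATE transactions t
JOIN (
    SELECT id,
           FIRST_VALUE(bonus_id) OVER (
               PARTITION BY user_id, grp
               ORDER BY created_at, id
           ) AS fixed_bonus_id
    FROM (
        -- grp increases by 1 at every transaction_id = 1 row, so each
        -- group runs from one marker row to just before the next one
        SELECT id, user_id, bonus_id, created_at,
               SUM(transaction_id = 1) OVER (
                   PARTITION BY user_id
                   ORDER BY created_at, id
               ) AS grp
        FROM transactions
    ) g
    WHERE grp > 0  -- leave rows before a user's first marker untouched
) f ON f.id = t.id
SET t.bonus_id = f.fixed_bonus_id;
Try it on a copy of the table (or inside a transaction) first and compare the result against the expected rows above before committing.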

Pandas: to get mean for each data category daily [duplicate]

I am a fairly new programmer learning Python (+pandas) and hope I can explain this well enough. I have a large time-series pandas DataFrame of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many questions like counting records per hour per day and getting the average per hour over several years. However, I run into trouble including the 'Id' variable.
I'm looking to get the mean number of people taking a ticket for each hour, for each day of the week (Mon-Fri), per station.
I have the following, setting datetime to index:
Id    Start_date           Count  Day_name_no
149   2011-12-31 21:30:00  1      5
150   2011-12-31 20:51:00  1      0
259   2011-12-31 20:48:00  1      1
3015  2011-12-31 19:38:00  1      4
28    2011-12-31 19:37:00  1      4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date, giving the following:
Id    Count  Day_name_no  Trip_hour
149   1      2            5
150   1      4            10
153   1      2            15
1867  1      4            11
2387  1      2            7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()
Id  Day_name_no  Trip_hour  Count
1   0            7          24
1   0            8          48
1   0            9          31
1   0            10         28
1   0            11         26
1   0            12         25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Hour']]).mean().reset_index()
However, this does not give the desired result, as the mean values are incorrect.
I hope I have explained this issue in a clear way. I am looking for the mean per hour, per day, per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model to those groups.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either code-wise or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question on a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date        Id    Dow  Hour  Count
12/12/2014  1234  0    9     1
12/12/2014  1234  0    9     1
12/12/2014  1234  0    9     1
12/12/2014  1234  0    9     1
12/12/2014  1234  0    9     1
19/12/2014  1234  0    9     1
19/12/2014  1234  0    9     1
19/12/2014  1234  0    9     1
26/12/2014  1234  0    10    1
27/12/2014  1234  1    11    1
27/12/2014  1234  1    11    1
27/12/2014  1234  1    11    1
27/12/2014  1234  1    11    1
04/01/2015  1234  1    11    1
I now realise I would have to aggregate by the date first and get something like:
Date        Id    Dow  Hour  Count
12/12/2014  1234  0    9     5
19/12/2014  1234  0    9     3
26/12/2014  1234  0    10    1
27/12/2014  1234  1    11    4
04/01/2015  1234  1    11    1
And then calculate the mean per Id, per Dow, per Hour, to get this:
Id    Dow  Hour  Mean
1234  0    9     4
1234  0    10    1
1234  1    11    2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, has 3 million rows and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
You can use groupby on the 'Id' column and then resample('H').sum() to get the hourly counts (resample's old how='sum' argument is deprecated; call .sum() on the resampled object instead).

SQL Compare 2 tables and show only non-matching data

I know this has been asked quite a few times, but I didn't find the answer to my problem. I am trying to compare two tables and see which messages have been read by customers (cust_id).
The message_details condition is split into 4 categories (all, none, single and trade).
It seems to work fine apart from when the condition in message_details is set to all: those rows always display, even if the customer has read the message.
I hope someone can help.
I have 2 tables
message_details
id  date        subject        message                       condition
1   2022-01-18  Testing        This is a test to all people  all
2   2022-01-19  To all single  This is a single test         single
3   2022-01-20  To all none    This is a None test           none
4   2022-01-21  To all trade   This is a Trade test          trade
5   2022-01-19  To all single  This is a single test 2       single
6   2022-01-19  To all single  This is a single test 3       single
message_read
id  date        cust_id  message_id  condition
1   2022-01-18  283      1           read
2   2022-01-21  283      2           read
3   2022-01-18  283      5           read
4   2022-01-21  283      6           read
5   2022-01-18  211      1           read
6   2022-01-21  211      2           read
7   2022-01-18  211      5           read
8   2022-01-21  211      6           read
9   2022-01-18  213      1           read
10  2022-01-21  213      2           read
11  2022-01-18  213      5           read
12  2022-01-21  213      6           read
I am using the following code to check which message_details entries a given cust_id has not read.
SELECT
    id, date, subject, message, condition
FROM
    message_details
WHERE
    NOT EXISTS
    (
        SELECT message_id, cust_id FROM message_read
        WHERE message_read.cust_id = '283' AND
              message_read.message_id = message_details.id
    )
    AND
    message_details.condition = 'single'
    OR
    message_details.condition = 'all'
This seems to work (not sure it's the correct way of doing it) but doesn't hide id 1 from message_details even though cust_id = 283 has read it.
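The likely culprit is operator precedence: AND binds more tightly than OR, so the WHERE clause is evaluated as (NOT EXISTS ... AND condition = 'single') OR condition = 'all', meaning rows with condition = 'all' bypass the NOT EXISTS check entirely, which is exactly the symptom described. Parenthesising the OR, or using IN, should fix it; a sketch:
SELECT id, date, subject, message, condition
FROM message_details
WHERE NOT EXISTS
      (
          SELECT 1 FROM message_read
          WHERE message_read.cust_id = '283'
            AND message_read.message_id = message_details.id
      )
  AND message_details.condition IN ('single', 'all')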

How to write a query to make a report by month in SQL

I have the receiving and sending data for the whole year, and I want to build a monthly report based on that data with a first-in-first-out rule. That means the first receiving will be sent out first ...
DECLARE @ReceivingTbl AS TABLE(Id INT, ProId INT, RecQty INT, ReceivingDate DATETIME)
INSERT INTO @ReceivingTbl
VALUES (1,1001,210,'2019-03-12'),
       (2,1001,315,'2019-06-15'),
       (3,2001,500,'2019-04-01'),
       (4,2001,10,'2019-06-15'),
       (5,1001,105,'2019-07-10')
DECLARE @SendTbl AS TABLE(Id INT, ProId INT, SentQty INT, SendMonth INT)
INSERT INTO @SendTbl
VALUES (1,1001,50,3),
       (2,1001,100,4),
       (3,1001,80,5),
       (4,1001,80,6),
       (5,2001,200,6)
SELECT * FROM @ReceivingTbl ORDER BY ProId, ReceivingDate
SELECT * FROM @SendTbl ORDER BY ProId, SendMonth
Id  ProId  RecQty  ReceivingDate
1   1001   210     2019-03-12
2   1001   315     2019-06-15
5   1001   105     2019-07-10
3   2001   500     2019-04-01
4   2001   10      2019-06-15
Id  ProId  SentQty  SendMonth
1   1001   50       3
2   1001   100      4
3   1001   80       5
4   1001   80       6
5   2001   200      6
--- And the below is what I want:
Id  ProId  RecQty  ReceivingDate  ...  Mar  Apr  May  Jun
1   1001   210     2019-03-12     ...  50   100  60   0
2   1001   315     2019-06-15     ...  0    0    20   80
5   1001   105     2019-07-10     ...  0    0    0    0
3   2001   500     2019-04-01     ...  0    0    0    200
4   2001   10      2019-06-15     ...  0    0    0    0
Thanks!
Your question is not clear to me.
If you want to purely use the FIFO approach, and therefore ignore any dates the table contains, you necessarily need to order by Id, which in your example you are providing, and which looks like it is in order of insert.
The first line inserted should also be the first line appearing in the select (FIFO); to do that you have to use:
ORDER BY Id ASC
which will place the lower Id values first (1, 2, 3, ...).
To me, though, this doesn't make much sense, so pay attention to the meaning of the data you actually have and leverage dates like ReceivingDate, and order by that, maybe even filtering by the month of the date. Below is an example for January data:
WHERE MONTH(ReceivingDate) = 1
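For the per-receipt monthly breakdown shown in the desired output, the classic technique is the running-totals overlap pattern: compute the cumulative received quantity per receipt and the cumulative sent quantity per send (both per ProId), and the quantity a send draws from a receipt under FIFO is the overlap of the two running-total intervals. Below is a sketch against the table variables above; it assumes SQL Server 2012+ for the windowed SUM ... ROWS syntax, and the Mar-Jun columns are hard-coded to match the sample, so a real report would generate them dynamically (e.g. with PIVOT or dynamic SQL).
WITH Rec AS (
    SELECT Id, ProId, RecQty, ReceivingDate,
           SUM(RecQty) OVER (PARTITION BY ProId ORDER BY ReceivingDate, Id
                             ROWS UNBOUNDED PRECEDING) - RecQty AS RecFrom,
           SUM(RecQty) OVER (PARTITION BY ProId ORDER BY ReceivingDate, Id
                             ROWS UNBOUNDED PRECEDING)          AS RecTo
    FROM @ReceivingTbl
), Snd AS (
    SELECT Id, ProId, SentQty, SendMonth,
           SUM(SentQty) OVER (PARTITION BY ProId ORDER BY SendMonth, Id
                              ROWS UNBOUNDED PRECEDING) - SentQty AS SndFrom,
           SUM(SentQty) OVER (PARTITION BY ProId ORDER BY SendMonth, Id
                              ROWS UNBOUNDED PRECEDING)           AS SndTo
    FROM @SendTbl
)
SELECT r.Id, r.ProId, r.RecQty, r.ReceivingDate,
       ISNULL(SUM(CASE WHEN s.SendMonth = 3 THEN o.Qty END), 0) AS Mar,
       ISNULL(SUM(CASE WHEN s.SendMonth = 4 THEN o.Qty END), 0) AS Apr,
       ISNULL(SUM(CASE WHEN s.SendMonth = 5 THEN o.Qty END), 0) AS May,
       ISNULL(SUM(CASE WHEN s.SendMonth = 6 THEN o.Qty END), 0) AS Jun
FROM Rec r
LEFT JOIN Snd s
       ON s.ProId   = r.ProId
      AND s.SndFrom < r.RecTo    -- the send's running-total interval
      AND s.SndTo   > r.RecFrom  -- overlaps this receipt's interval
CROSS APPLY (VALUES (
        -- size of the overlap = qty this send takes from this receipt
        CASE WHEN s.SndTo < r.RecTo THEN s.SndTo ELSE r.RecTo END
      - CASE WHEN s.SndFrom > r.RecFrom THEN s.SndFrom ELSE r.RecFrom END
    )) o(Qty)
GROUP BY r.Id, r.ProId, r.RecQty, r.ReceivingDate
ORDER BY r.ProId, r.ReceivingDate, r.Id
Against the sample data this reproduces the desired rows (50/100/60/0 for receipt 1, 0/0/20/80 for receipt 2, 200 in Jun for receipt 3, zeros for the untouched receipts).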

Insert multiple rows from result of Average by date and id

I have a table with 1 result per day, like this:
id | item_id | date       | amount
-----------------------------------
 1 |       1 | 2019-01-01 |      1
 2 |       1 | 2019-01-02 |      2
 3 |       1 | 2019-01-03 |      3
 4 |       1 | 2019-01-04 |      4
 5 |       1 | 2019-01-05 |      5
 6 |       2 | 2019-01-01 |      1
 7 |       2 | 2019-01-01 |      2
 8 |       2 | 2019-01-01 |      3
 9 |       2 | 2019-01-01 |      4
10 |       2 | 2019-01-01 |      5
11 |       3 | 2019-01-01 |      1
12 |       3 | 2019-01-01 |      2
13 |       3 | 2019-01-01 |      3
14 |       3 | 2019-01-01 |      4
15 |       3 | 2019-01-01 |      5
First I was trying to average the column amount for each day.
SELECT
    x.item_id AS id, AVG(x.amount) AS result
FROM
    (SELECT
         il.item_id, il.amount,
         ROW_NUMBER() OVER (PARTITION BY il.item_id ORDER BY il.date DESC) rn
     FROM
         item_prices il) x
WHERE
    x.rn BETWEEN 1 AND 50
GROUP BY
    x.item_id
The result is going to be the following if calculated on 2019-01-05:
item_id | average
      1 |       3
      2 |       3
      3 |       3
or, if calculated on 2019-01-04:
item_id | average
      1 |     2.5
      2 |     2.5
      3 |     2.5
My goal is to run the average query every day, so that it updates the average automatically and inserts it into a 5th column "average":
id | item_id | date       | amount | average
 5 |       1 | 2019-01-05 |      5 |       3
10 |       2 | 2019-01-05 |      5 |       3
15 |       3 | 2019-01-05 |      5 |       3
The issue is that every example I can find of INSERT with a SELECT only updates one row, and they all target another table; there is also the most-recent-date issue...
Can someone point me in the right direction?
Perhaps you want to see the running average every day. Storing the value as a separate column is bound to cause problems: whenever rows are updated or deleted, the column also needs to be updated, and hence will require complex triggers.
Simply create a view and query it whenever you want to check the average.
CREATE OR REPLACE VIEW v_item_prices AS
SELECT t.*,
       AVG(t.amount) OVER (PARTITION BY item_id ORDER BY date) AS average
FROM item_prices t
ORDER BY item_id, date
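For example, to read back just each item's most recent running average from the view (a usage sketch, assuming the view above):
SELECT v.id, v.item_id, v.date, v.amount, v.average
FROM v_item_prices v
WHERE v.date = (SELECT MAX(i.date)
                FROM item_prices i
                WHERE i.item_id = v.item_id)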

SQL terminology to combine a NOT EXISTS query with latest value

I am a beginner with basic knowledge.
I have a single table from which I am trying to pull all UIDs that have not had a particular code within the past year.
My table looks like this (but much larger, of course):
FACID  DPID  EID  DID  UID      DT         Code   Units  Charge  ET       Ord
1      1     6    2    1002     15-Mar-07  99204  1      180     09:36.7  1
1      1     7    5    10004    15-Mar-07  99213  1      68      02:36.9  1
1      1     24   55   25887    15-Mar-07  99213  1      68      43:55.3  1
1      1     25   2    355688   15-Mar-07  99213  1      68      53:20.2  1
1      1     26   5    555654   15-Mar-07  99213  1      68      42:22.6  1
1      1     27   44   135514   15-Mar-07  99213  1      68      00:36.8  1
1      1     28   2    3244522  15-Mar-07  99214  1      98      34:59.4  1
1      1     29   5    235445   15-Mar-07  99213  1      68      56:42.1  1
1      1     30   3    3214444  15-Mar-07  99213  1      68      54:56.5  1
1      1     33   1    221444   15-Mar-07  99204  1      180     37:44.5  1
I am attempting to use the following, but this is not working for my time frame limits.
select distinct UID from PtProcTbl
where DT<'20120101'
and NOT EXISTS (Select Distinct UID
where Code in ('99203','99204','99205','99213',
'99214','99215','99244','99245'))
I need to know how to make sure the UIDs I am pulling are the ones that don't have a DT after the 1/1/2012 cut-off date with one of the NOT EXISTS codes.
The above query returned UIDs that actually have dates after 1/1/2012 and do contain one of the above codes...
Not sure what I am doing wrong, or if I am totally off base on this.
Thanks in advance.
Are you sure you need the NOT EXISTS? How about instead:
AND Code NOT IN ('99203','99204','99205','99213','99214','99215','99244','99245')
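That removes the rows carrying those codes, but the stated goal is per-UID: keep only the UIDs that have no row after the cut-off date with one of those codes. That calls for a correlated NOT EXISTS whose subquery has its own FROM clause and a join back on UID (the original subquery has neither). A sketch, assuming DT compares correctly against the '20120101' literal as in the original query:
SELECT DISTINCT p.UID
FROM PtProcTbl p
WHERE NOT EXISTS
      (
          SELECT 1
          FROM PtProcTbl q
          WHERE q.UID = p.UID
            AND q.DT >= '20120101'
            AND q.Code IN ('99203','99204','99205','99213',
                           '99214','99215','99244','99245')
      )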