SQL Having/Where clause to compare MAX from current/another table

I have a table that has date information and is being copied to another table, and I am trying to perform an incremental load.
date = date format
hour = int

| person | date       | hour |
|--------|------------|------|
| bob    | 2023-01-01 | 1    |
| bill   | 2023-01-02 | 2    |
select * into test.person_copy from
(select * from original.person)
My thought process for the incremental load is to compare max(date) and max(hour) from the original table against the copied table to identify the gap between the max values of the two tables. However, I'm not entirely sure how to implement the logic, as it doesn't seem straightforward with a WHERE clause. A HAVING clause might make more sense, but that doesn't seem correct either?
select * into test.person_copy from
(select * from original.person org
Having max(org.date, org.hour) > (select max(copy.date,copy.hour) from test.person_copy copy)
)
The other variation I had in mind was to use HAVING NOT IN
Having max(org.date, org.hour) NOT IN (select max(copy.date,copy.hour) from test.person_copy copy)
I wasn't sure if the logic is correct. The hour field is important, but I can live with just the date field.
The expected output is that the logic checks the existing max(date) and only inserts rows that don't already exist. Example below: the 2023-01-03 row.
| person | date | hour |
|--------|------------|------|
| bob | 2023-01-01 | 1 |
| bill | 2023-01-02 | 2 |
| test | 2023-01-03 | 2 |

I don't have access to a Redshift environment, but the following query should work:
insert into test.person_copy  -- append to the existing copy rather than recreating it
select *
from original.person org
where dateadd(hrs, org.hour, org.date) >
      (select max(dateadd(hrs, cpy.hour, cpy.date))
       from test.person_copy cpy)
This assumes that when the previous hourly copy was made, the entire set of source rows for that date and hour was copied (so the new incremental load picks up all rows for the dates and hours not yet copied). It also means you need additional criteria in the select to make sure you only include completed date-hours (i.e. make sure you don't include rows with hour = 10 while the time is still 10:30).
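For example, a minimal sketch of that extra "completed hours only" criterion, assuming Redshift's getdate() and date_trunc() functions and that hour N covers N:00 through N:59 (an assumption about the data):

where dateadd(hrs, org.hour, org.date) >
      (select max(dateadd(hrs, cpy.hour, cpy.date))
       from test.person_copy cpy)
  -- only pick up hours that have fully elapsed
  and dateadd(hrs, org.hour + 1, org.date) <= date_trunc('hour', getdate())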

Related

MAX Function Fails in SQL

I'm trying to get the MOST recent date that comes before tom_temp.Begin_Time out of tbl_Trim_history.Comp. The SQL I'm using is:
SELECT
Tom_Temp2.feeder,
Tom_Temp.CauseType,
Tom_Temp.RootCause,
Tom_Temp.Storm_Name_Thunder,
Tom_Temp.DeviceGroup,
tbl_Trim_History.[COMP],
Tom_Temp.[Begin_Time]
FROM Tom_Temp2
LEFT JOIN (Tom_Temp
LEFT JOIN tbl_Trim_History
ON Tom_Temp.feeder = tbl_Trim_History.CIRCUIT_ID)
ON Tom_Temp2.feeder = Tom_Temp.feeder
WHERE (((tbl_Trim_History.[COMP]) < [Tom_Temp].[Begin_Time]));
I'm having a hard time figuring out where I need to put my max() function in this statement in order to make sure I don't get back every single tbl_Trim_history.[COMP] that occurs prior to the tom_temp.Begin_Time date. I only want the most recent date from tbl_Trim_history.[COMP] that occurs BEFORE the tom_temp.begin_Time .... NOT every historical date record.
Any help you guys could give me would be awesome because I keep getting back sets that I can tell are not what I'm looking for / expecting.
Thanks everyone. I appreciate the feedback.
Edit in regard to the responses below:
Due to the character limits, I just edited the master post for you guys.
I can't really post the data as it is somewhat confidential, so the best I can do is give you an example. Also, this is Access, but my background is MySQL. Sorry for the tags; I wasn't sure what was similar, since the access tag just didn't seem to fit the question.
The data being received is about 168 records. Someone pointed out that there is an inner join occurring here, but I wanted to indicate that I'm actually using 3 different tables:
- One table contains my feeders.
- Another contains a list of all outages, which I join to using the feeders contained in the first table.
- Another table contains all the trim history for each feeder. The outage table is joined to the trim table.
When I run the query above, I get data like this
feeder | comp | Begin_time
___________________________________________
123456 | 10/4/2012 | 3/3/2016 11:26:00AM
123456 | 10/17/2015 | 3/3/2016 11:26:00AM
456789 | 6/28/2008 | 9/20/2013 10:05AM
456789 | 12/1/2012 | 9/20/2013 10:05AM
456789 | 7/3/2013 | 9/20/2013 10:05AM
what I want is data like this:
feeder | comp | Begin_time
___________________________________________
123456 | 10/17/2015 | 3/3/2016 11:26:00AM
456789 | 7/3/2013 | 9/20/2013 10:05AM
where the comp date is the closest date/time occurring BEFORE the Begin_time date.
I tried this query:
SELECT Tom_Temp2.feeder, Tom_Temp.CauseType, Tom_Temp.RootCause, Tom_Temp.Storm_Name_Thunder, Tom_Temp.DeviceGroup, Max(tbl_Trim_History.COMP) AS MaxOfCOMP, Tom_Temp.Begin_Time
FROM Tom_Temp2
LEFT JOIN (Tom_Temp LEFT JOIN tbl_Trim_History ON Tom_Temp.feeder = tbl_Trim_History.CIRCUIT_ID) ON Tom_Temp2.feeder = Tom_Temp.feeder
GROUP BY Tom_Temp2.feeder, Tom_Temp.CauseType, Tom_Temp.RootCause, Tom_Temp.Storm_Name_Thunder, Tom_Temp.DeviceGroup, Tom_Temp.Begin_Time
HAVING (((Max(tbl_Trim_History.COMP))<[Tom_Temp].[Begin_Time]));
But of the 168 records I get back in my first query, I'm only getting back 20 records with this query.
The reason I know this is wrong is because some records are missing between the set of 168 and the set of 20. For example, I'd be missing any records for feeder 456789. However, I know this record should be returned because it's in my table of feeders that should be returned (Tom_Temp2).
After manually deleting unwanted rows of data, I know that I should get a record count of 85. So my most recent attempt to use the Max query is way off.
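One way to express "only the latest COMP before Begin_Time" per feeder is a correlated subquery in the select list; a minimal, untested sketch using the question's table names (the LastCompBeforeBegin alias is made up):

SELECT
    Tom_Temp2.feeder,
    Tom_Temp.Begin_Time,
    (SELECT Max(h.COMP)
     FROM tbl_Trim_History AS h
     WHERE h.CIRCUIT_ID = Tom_Temp.feeder
       AND h.COMP < Tom_Temp.Begin_Time) AS LastCompBeforeBegin
FROM Tom_Temp2
LEFT JOIN Tom_Temp ON Tom_Temp2.feeder = Tom_Temp.feeder;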

The best way to keep count data in postgres

I need to create a statistic for some aggregate data, split by days.
For example:
select
(select count(*) from bananas) as bananas_count,
(select count(*) from apples) as apples_count,
(select count(*) from bananas where color = 'yellow') as yellow_bananas_count;
obviously I will get:
bananas_count | apples_count | yellow_bananas_count
--------------+--------------+----------------------
          123 |          321 |                   15
But I need to get that data grouped by day; we need to know, for example, how many bananas we had yesterday.
The first thought I had was to create a view, but in that case I won't be able to split by dates (or I don't know how to do it).
I need a performant, database-side implementation of this task.
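A minimal sketch of the per-day grouping being asked for, assuming each table has a created_at timestamp column (that column name is an assumption, not something given in the question):

select created_at::date                             as day,
       count(*)                                     as bananas_count,
       count(case when color = 'yellow' then 1 end) as yellow_bananas_count
from bananas
group by created_at::date
order by day;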

PostgreSQL calculate the top places per group and other statistics

I have a table with the following structure
|user_id | place | type_of_place | money_earned| time |
|--------+-------+---------------+-------------+------|
| | | | | |
The table is very large, several millions of rows. The data is in a PostgreSQL 9.1 database.
I want to calculate, per user_id and type_of_place: the mean, the standard deviation, the top 5 places (ordered by counts), and the most frequent hour of the day (the mode).
The resulting data must be in this form:
| user_id | type_of_place | avg | stddev | top5_places | mode |
+---------+---------------+-----+--------+------------------+------+
| 1 | tp1 | 10 | 1 | {p1,p2,p3,p4,p5} | 8 |
| 2 | tp1 | 3 | 2 | {p3,p4} | 23 |
| 1 | tp3 | 1 | 1 | {p1} | 4 |
etc.
Is there a way of doing this efficiently with window functions?
What if I want to group by week? (i.e. another column that represents the week number)
Thank you!
A standard GROUP BY query will get you most of the way:
SELECT
user_id,
type_of_place,
avg(money_earned) AS avg,
stddev(money_earned) AS stddev
FROM
earnings -- I'm not sure what your data table is called...
GROUP BY
user_id,
type_of_place
This leaves the top5_places and mode columns. These are both also aggregates, but not ones which are defined in the standard PostgreSQL installation. Luckily, you can add them.
Here's a page discussing how to define a mode aggregate function: http://wiki.postgresql.org/wiki/Aggregate_Mode
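For reference, a definition along the lines of the one described there (a sketch; check the wiki page for the canonical version):

-- Final function: pick the most frequent value out of the accumulated array.
CREATE OR REPLACE FUNCTION _final_mode(anyarray)
  RETURNS anyelement AS
$$
    SELECT a
    FROM unnest($1) a
    GROUP BY 1
    ORDER BY count(*) DESC, 1
    LIMIT 1;
$$ LANGUAGE sql IMMUTABLE;

-- Aggregate: accumulate values into an array, then hand the array to _final_mode.
CREATE AGGREGATE mode(anyelement) (
  SFUNC     = array_append,
  STYPE     = anyarray,
  FINALFUNC = _final_mode,
  INITCOND  = '{}'
);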
Once you have a mode aggregate function, assuming time is a timestamp of some kind, the expression you will add to the select list will be:
SELECT
...
mode(extract(hour FROM time)) AS mode -- Add this expression
FROM
...
Assuming order by money
For top5_places, there are several approaches, but the quickest is probably to use PostgreSQL's builtin array_agg function, and take the first 5 elements:
SELECT
...
(array_agg(place ORDER BY money_earned DESC))[1:5] AS top5_places -- Add this expression
FROM
...
One alternative is to define another aggregate called (for instance) top5, which performs the same function. This could be more efficient if there are many distinct places for each user/type of place combination, since it can stop accumulating after the first 5, whereas the above expression will generally build a complete array of all places, and then truncate to the first 5.
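A sketch of what such a top5 aggregate could look like (top5 and top5_accum are names made up for illustration):

-- Transition function: stop appending once the array already holds five elements.
CREATE OR REPLACE FUNCTION top5_accum(anyarray, anyelement)
  RETURNS anyarray AS
$$
    SELECT CASE
             WHEN coalesce(array_length($1, 1), 0) >= 5 THEN $1
             ELSE array_append($1, $2)
           END;
$$ LANGUAGE sql IMMUTABLE;

CREATE AGGREGATE top5(anyelement) (
  SFUNC    = top5_accum,
  STYPE    = anyarray,
  INITCOND = '{}'
);
-- Used like the expression above: top5(place ORDER BY money_earned DESC) AS top5_places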
This assumes that a place has a unique earnings entry for each user/type combination. If a place can occur more than once, and you want to sort by sum(money_earned) for each place, then you need to use a subquery like in the examples below...
Order by counts
Ok, so the places should be ordered by how often they occur. Here's a quick way, which uses a couple of subqueries -- add this as an expression to the select-clause of the above query:
(SELECT
(array_agg(place ORDER BY cnt DESC))[1:5]
FROM
(SELECT place, count(*) FROM earnings AS t2
WHERE t2.user_id = earnings.user_id AND t2.type_of_place = earnings.type_of_place
GROUP BY place) AS s (place, cnt)
) AS top5_places
The inner subquery called s evaluates to a table of each place for that user/type combination, and the number of times it occurs (which I've called cnt). These are then fed to array_agg in descending order of that count.
I suspect there could be much neater (and probably more efficient) ways of writing it. If not, then I would recommend trying to move this complicated expression into a function or aggregate, if you can...
Histogram of places in each hour
We'll use a similar expression, which will return the array of counts, ordered by hour:
(SELECT
array_agg(cnt ORDER BY hour DESC)
FROM
(SELECT extract(hour FROM time), count(*) FROM earnings AS t2
WHERE t2.user_id = earnings.user_id AND t2.type_of_place = earnings.type_of_place
GROUP BY 1) AS s (hour, cnt)
) AS hourly_histogram
(Add that to the select-clause of the original query.)

Optimal solution for interview question

Recently in a job interview, I was given the following problem.
Say I have the following table
widget_Name | widget_Costs | In_Stock
---------------------------------------------------------
a | 15.00 | 1
b | 30.00 | 1
c | 20.00 | 1
d | 25.00 | 1
where widget_Name holds the name of the widget, widget_Costs is the price of a widget, and In_Stock is a constant of 1.
Now, for my business insurance I have a certain deductible. I am looking for a SQL statement that will tell me every widget and its price that remains once the deductible has been met. So if my deductible is $50.00, the above would just return
widget_Name | widget_Costs | In_Stock
---------------------------------------------------------
a | 15.00 | 1
d | 25.00 | 1
since widgets b and c were used to meet the deductible.
The closest I could get is the following:
SELECT
*
FROM (
SELECT
widget_name,
widget_price
FROM interview.tbl_widgets
minus
SELECT widget_name,widget_price
FROM (
SELECT
widget_name,
widget_price,
50 - sum(widget_price) over (ORDER BY widget_price ROWS between unbounded preceding and current row) as running_total
FROM interview.tbl_widgets
)
where running_total >= 0
)
;
Which gives me
widget_Name | widget_Costs | In_Stock
---------------------------------------------------------
c | 20.00 | 1
d | 25.00 | 1
because it uses a and b to meet the majority of the deductible
I was hoping someone might be able to show me the correct answer
EDIT: I understood the interview question to be asking this: given a table of widgets and their prices, and given a dollar amount, subtract as many of the widgets as you can up to the dollar amount, and return the widgets and their prices that remain.
I'll put an answer up, just in case it's easier than it looks, but if the idea is just to return any widget that costs more than the deductible then you'd do something like this:
Select
Widget_Name, Widget_Cost, In_Stock
From
Widgets
Where
Widget_Cost > 50 -- SubSelect for variable deductibles?
For your sample data my query returns no rows.
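The "SubSelect for variable deductibles?" comment is just gesturing at pulling the deductible from a table instead of hard-coding 50; a hypothetical sketch (the Policy table and its columns are invented for illustration):

Select
  Widget_Name, Widget_Cost, In_Stock
From
  Widgets
Where
  Widget_Cost > (Select Deductible From Policy Where Policy_Id = 1) -- Policy is hypothetical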
I believe I understand your question, but I'm not 100%. Here is what I'm assuming you mean:
Your deductible is, say, $50. To meet the deductible you "use" two items. (Is this always two? How high can it go? Can it be just one? What if they don't total exactly $50? There is a lot of missing information.) You then want to return the widgets that aren't being used towards the deductible. I have the following:
CREATE TABLE #test
(
widget_name char(1),
widget_cost money
)
INSERT INTO #test (widget_name, widget_cost)
SELECT 'a', 15.00 UNION ALL
SELECT 'b', 30.00 UNION ALL
SELECT 'c', 20.00 UNION ALL
SELECT 'd', 25.00
SELECT * FROM #test t1
WHERE t1.widget_name NOT IN (
SELECT t1.widget_name FROM #test t1
CROSS JOIN #test t2
WHERE t1.widget_cost + t2.widget_cost = 50 AND t1.widget_name != t2.widget_name)
Which returns
widget_name widget_cost
----------- ---------------------
a 15.00
d 25.00
This looks like a bin packing problem; these are really hard to solve, especially with SQL.
If you search SO for Bin Packing + SQL, you'll find how to find Sum(field) in condition ie “select * from table where sum(field) < 150”, which is basically the same problem, except that you want to add a NOT IN to it.
I couldn't get the accepted answer there (by brianegge) to work, but what he wrote about the problem in general was interesting:
...the problem you describe of wanting the selection of users which would most closely fit into a given size, is a bin packing problem. This is an NP-Hard problem, and won't be easily solved with ANSI SQL. However, the above seems to return the right result, but in fact it simply starts with the smallest item, and continues to add items until the bin is full.
A general, more effective bin packing algorithm would be to start with the largest item and continue to add smaller ones as they fit. This algorithm would select users 5 and 4.
So with this advice you could write a cursor to loop over the table to do just this (it just wouldn't be pretty).
Aaron Alton gives a nice link to a series of articles that attempts to solve the bin packing problem with SQL, but it basically concludes that it's probably best to use a cursor to do it.
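As an illustration of that advice, here is a rough T-SQL sketch of the "largest item first" greedy loop as a cursor, reusing the #test temp table from the answer above (a sketch under those assumptions, not a general bin packing solver):

DECLARE @deductible money = 50.00;          -- inline initialization needs SQL Server 2008+
DECLARE @used TABLE (widget_name char(1));
DECLARE @name char(1), @cost money;

DECLARE widget_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT widget_name, widget_cost
    FROM #test
    ORDER BY widget_cost DESC;              -- start with the largest item

OPEN widget_cur;
FETCH NEXT FROM widget_cur INTO @name, @cost;
WHILE @@FETCH_STATUS = 0 AND @deductible > 0
BEGIN
    IF @cost <= @deductible                 -- take the item only if it still fits
    BEGIN
        INSERT INTO @used (widget_name) VALUES (@name);
        SET @deductible = @deductible - @cost;
    END
    FETCH NEXT FROM widget_cur INTO @name, @cost;
END
CLOSE widget_cur;
DEALLOCATE widget_cur;

-- Widgets left over once the deductible has been "packed"
SELECT t.*
FROM #test t
WHERE t.widget_name NOT IN (SELECT widget_name FROM @used);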

Cumulative average number of records created for specific day of week or date range

Yeah, so I'm filling out a requirements document for a new client project and they're asking for growth trends and performance expectations calculated from existing data within our database.
The best source of data for something like this would be our logs table as we pretty much log every single transaction that occurs within our application.
Now, here's the issue: I don't have a whole lot of experience with MySQL when it comes to calculating cumulative sums and running averages. I've thrown together the following query, which kind of makes sense to me, but it just keeps locking up the command console. The thing takes forever to execute, and there are only 80k records in the test sample.
So, given the following basic table structure:
id | action | date_created
1 | 'merp' | 2007-06-20 17:17:00
2 | 'foo' | 2007-06-21 09:54:48
3 | 'bar' | 2007-06-21 12:47:30
... thousands of records ...
3545 | 'stab' | 2007-07-05 11:28:36
How would I go about calculating the average number of records created for each given day of the week?
day_of_week | average_records_created
1 | 234
2 | 23
3 | 5
4 | 67
5 | 234
6 | 12
7 | 36
I have the following query which makes me want to murderdeathkill myself by casting my body down an elevator shaft... and onto some bullets:
SELECT
DISTINCT(DAYOFWEEK(DATE(t1.datetime_entry))) AS t1.day_of_week,
AVG((SELECT COUNT(*) FROM VMS_LOGS t2 WHERE DAYOFWEEK(DATE(t2.date_time_entry)) = t1.day_of_week)) AS average_records_created
FROM VMS_LOGS t1
GROUP BY t1.day_of_week;
Halps? Please, don't make me cut myself again. :'(
How far back do you need to go when sampling this information? This solution works as long as it's less than a year.
Because day of week and week number are constant for a record, create a companion table that has the ID, WeekNumber, and DayOfWeek. Whenever you want to run this statistic, just generate the "missing" records from your master table.
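A minimal sketch of that companion table and the "generate the missing records" step, assuming the log table is VMS_LOGS with an id and a datetime_entry column (the question's column names vary a bit, so treat these as assumptions):

-- Companion table: one row per log record, holding only the parts we group on.
CREATE TABLE MyCompanionTable (
    id         INT PRIMARY KEY,
    WeekNumber INT,
    DayOfWeek  INT
);

-- Top up the companion table with any master rows not yet copied.
INSERT INTO MyCompanionTable (id, WeekNumber, DayOfWeek)
SELECT l.id,
       WEEK(l.datetime_entry),
       DAYOFWEEK(l.datetime_entry)
FROM VMS_LOGS l
LEFT JOIN MyCompanionTable c ON c.id = l.id
WHERE c.id IS NULL;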
Then, your report can be something along the lines of:
select
DayOfWeek
, count(*)/count(distinct(WeekNumber)) as Average
from
MyCompanionTable
group by
DayOfWeek
Of course if the table is too large, then you can instead pre-summarize the data on a daily basis and just use that, and add in "today's" data from your master table when running the report.
I rewrote your query as:
SELECT x.day_of_week,
AVG(x.count) 'average_records_created'
FROM (SELECT DAYOFWEEK(t.datetime_entry) 'day_of_week',
COUNT(*) 'count'
FROM VMS_LOGS t
GROUP BY DAYOFWEEK(t.datetime_entry)) x
GROUP BY x.day_of_week
The reason your query takes so long is the inner select: you are essentially running 6,400,000,000 queries. With a query like this, your best solution may be to develop a timed reporting system, where the user receives an email when the query is done and the report is constructed, or the user logs in and checks the report afterwards.
Even with the optimization written by OMG Ponies (below), you are still looking at around the same number of queries.
SELECT x.day_of_week,
AVG(x.count) 'average_records_created'
FROM (SELECT DAYOFWEEK(t.datetime_entry) 'day_of_week',
COUNT(*) 'count'
FROM VMS_LOGS t
GROUP BY DAYOFWEEK(t.datetime_entry)) x
GROUP BY x.day_of_week