Redshift - Case statement returns duplicates

I have a dataset containing the product name, the order number, and the time the order was placed.
prod_name,order_no,order_time
a,101,2018-05-01
a,102,2018-06-04
a,103,2018-05-03
b,104,2018-01-21
b,105,2018-01-11
I am trying to build a report that shows time since first order (compared against current time) with an output as below:
prod_name,time_since_first_sale,aging
a,64,Less than 3 months back
b,177,Less than 6 months back
Given below is the SQL I am using:
select DISTINCT b.prod_name,case when((CURRENT_TIMESTAMP - min(a.order_time))) < '90' THEN 'Less than 3 months'
when ((CURRENT_TIMESTAMP - min(order_time))) < '180' THEN 'Less than 6 months'
else 'Other' end as aging
from sales a, prod b where a.id=b.prod_id;
The SQL above returns duplicates when executed; I believe it is also considering each sale_id in the sales table. How can I modify the query to get just one record per prod_name? If I remove the CASE expression, the duplicates disappear. Could anyone explain what I am doing wrong that pulls in these duplicates?
I am using Amazon Redshift DB.
Thanks..

Never use commas in the FROM clause. Always use proper, explicit, standard JOIN syntax.
Don't use SELECT DISTINCT when you intend GROUP BY.
So your query should look like:
select p.prod_name,
(case when CURRENT_TIMESTAMP - min(s.order_time) < '90'
then 'Less than 3 months'
when CURRENT_TIMESTAMP - min(s.order_time) < '180' then 'Less than 6 months'
else 'Other'
end) as aging
from sales s join
prod p
on s.id = p.prod_id
group by p.prod_name;
Notice that I also added in reasonable table aliases (abbreviations for the table names) and qualified all column references.
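The expected output in the question also lists the number of days since the first sale. A minimal sketch of how that could look, assuming order_time is a date or timestamp column and using Redshift's DATEDIFF to get whole days instead of comparing an interval to the string '90' (the join condition is copied from the question as-is):

```sql
-- Sketch only: one row per product, with days since its first order.
SELECT p.prod_name,
       DATEDIFF(day, MIN(s.order_time), CURRENT_DATE) AS time_since_first_sale,
       CASE
           WHEN DATEDIFF(day, MIN(s.order_time), CURRENT_DATE) < 90
               THEN 'Less than 3 months back'
           WHEN DATEDIFF(day, MIN(s.order_time), CURRENT_DATE) < 180
               THEN 'Less than 6 months back'
           ELSE 'Other'
       END AS aging
FROM sales s
JOIN prod p
    ON s.id = p.prod_id
GROUP BY p.prod_name;
```

Because MIN() is evaluated once per prod_name group, each product yields a single row regardless of how many orders it has.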

Related

How to write SQL statement to select for data broken up for each month of the year?

I am looking for a way to write an SQL statement that selects data for each month of the year, separately.
In the SQL statement below, I am trying to count the number of instances in the TOTAL_PRECIP_IN and TOTAL_SNOWFALL_IN columns when either column is greater than 0. In my data table, I have information for those two columns ("TOTAL_PRECIP_IN" and "TOTAL_SNOWFALL_IN") for each day of the year (365 total entries).
I want to break up my data by each calendar month, but am not sure of the best way to do this. In the statement below, I am using a UNION statement to break up the months of January and February. If I keep using UNION statements for the remaining months of the year, I can get the answer I am looking for. However, using 11 different UNION statements cannot be the optimal solution.
Can anyone give me a suggestion how I can edit my SQL statement to measure from the first day of the month, to the last day of the month for every month of the year?
select monthname(OBSERVATION_DATE) as "Month", sum(case when TOTAL_PRECIP_IN or TOTAL_SNOWFALL_IN > 0 then 1 else 0 end) AS "Days of Rain" from EMP_BASIC
where OBSERVATION_DATE between '2019-01-01' and '2019-01-31'
and CITY = 'Olympia'
group by "Month"
UNION
select monthname(OBSERVATION_DATE) as "Month", sum(case when TOTAL_PRECIP_IN or TOTAL_SNOWFALL_IN > 0 then 1 else 0 end) from EMP_BASIC
where OBSERVATION_DATE between '2019-02-01' and '2019-02-28'
and CITY = 'Olympia'
group by "Month"

Your table structure is too unclear to give you the exact query you need. The general idea is to aggregate your value and group by the month name and/or the month number. Since you wrote that you only want values greater than 0, you can put that condition in the WHERE clause. Your query will look something like this:
SELECT MONTHNAME(yourdate) AS month,
MONTH(yourdate) AS monthnr,
SUM(yourvalue) AS yoursum
FROM yourtable
WHERE yourvalue > 0
GROUP BY MONTHNAME(yourdate), MONTH(yourdate)
ORDER BY MONTH(yourdate);
I created an example here: db<>fiddle
You might need to adapt this general construct to your concrete case (for example, to handle multiple years or NULL values). Note that this example targets MySQL, because you mentioned MONTHNAME(), which is most commonly used in MySQL. If you are using a different database, some modifications may be needed. To make sure answers match your database, please tag it in your question.
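Applied to the column names from the question, the same pattern might look like the sketch below. It assumes MySQL-style MONTH() and MONTHNAME() functions and that a day counts as rainy when either measurement is above zero; the output aliases month_name and days_of_rain are made up for illustration.

```sql
-- Sketch only: one row per calendar month of 2019, no UNION needed.
SELECT MONTHNAME(OBSERVATION_DATE) AS month_name,
       SUM(CASE
               WHEN TOTAL_PRECIP_IN > 0 OR TOTAL_SNOWFALL_IN > 0 THEN 1
               ELSE 0
           END) AS days_of_rain
FROM EMP_BASIC
WHERE OBSERVATION_DATE >= '2019-01-01'
  AND OBSERVATION_DATE <  '2020-01-01'   -- half-open range covers the whole year
  AND CITY = 'Olympia'
GROUP BY MONTH(OBSERVATION_DATE), MONTHNAME(OBSERVATION_DATE)
ORDER BY MONTH(OBSERVATION_DATE);
```

Note that each column is compared to 0 separately; `TOTAL_PRECIP_IN or TOTAL_SNOWFALL_IN > 0` only applies the comparison to the second column.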

SQL Why am I getting the invalid identifier error?

I am trying to use columns that I created in this query to create another column.
Let me first show my messy query. It looks like this:
SELECT tb.team, tb.player, tb.type, tb.date, ToChar(Current Date-1, 'DD-MON-YY') as yesterday,
CASE WHEN to_date(tb.date) = yesterday then 1 else 0 end dateindicator,
FROM (
COUNT DISTINCT(*)
FROM TABLE_A, dual
where dateindicator = 1
Group by tb.team
)
What I am trying to do here is:
Create a column with yesterday's date.
Use the "yesterday" column to create another column called dateindicator, indicating whether each row is yesterday's data or not.
Then, using dateindicator, count the distinct number of players for each team where dateindicator is 1.
But I am getting the "invalid identifier" error. I am new to Oracle SQL and trying to learn.

You cannot reference a column alias in the same SELECT list that defines it.
See here: SQL: Alias Column Name for Use in CASE Statement
You need to repeat the full TO_CHAR(...) expression in the CASE WHEN.
Also:
Your WHERE condition (line 5) doesn't belong there. It should be:
SELECT DISTINCT ... FROM ... WHERE ... You have to specify the table first; then you can filter it with WHERE.
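One common way to reuse a computed column is to define it in an inline view and filter on it in the outer query. A rough sketch for Oracle, assuming the table is table_a and the date is stored in a DATE column called mydate (a hypothetical name, since DATE itself is a reserved word in Oracle):

```sql
-- Sketch only: the alias dateindicator is defined in the inner query,
-- so the outer query is allowed to reference it.
SELECT team,
       COUNT(DISTINCT player) AS players_yesterday
FROM (
    SELECT tb.team,
           tb.player,
           CASE
               WHEN TRUNC(tb.mydate) = TRUNC(SYSDATE) - 1 THEN 1
               ELSE 0
           END AS dateindicator
    FROM table_a tb
) t
WHERE dateindicator = 1
GROUP BY team;
```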

If I follow your explanation correctly: for each team, you want to count the number of players whose date column is yesterday.
If so, you can just filter and aggregate:
select team, count(*) as cnt
from mytable
where mydate >= trunc(sysdate) - 1 and mydate < trunc(sysdate)
group by team
This assumes that the dates are stored in column mydate, that is of date datatype.
I am unsure what you mean by counting distinct players; presumably, a given player appears just once per team, so I used count(*). If you really need to, you can change that to count(distinct player).
Finally: if you want to allow teams where no player matches, you can move the filtering logic within the aggregate function:
select team,
sum(case when mydate >= trunc(sysdate) - 1 and mydate < trunc(sysdate) then 1 else 0 end) as cnt
from mytable
group by team

how to use count with case when

I'm a newbie to Hive SQL.
I have a raw table with 6 million records like this:
I want to count the number of distinct IP addresses (ip_address) that access each modem_id every week.
The result table I want will be like this:
I did it with a left join, and it worked. But since a join will be time-consuming, I want to do it with a CASE WHEN statement, and I can't write a correct one. Do you have any ideas?
This is the join statement I used:
select a.modem_id,
a.Number_of_IP_in_Day_1,
b.Number_of_IP_in_Day_2
from
(select modem_id,
count(distinct ip_address) as Number_of_IP_in_Day_1
from F_ACS_DEVICE_INFORMATION_NEW
where day=1
group by modem_id) a
left join
(select modem_id,
count(distinct param_value) as Number_of_IP_in_Day_2
from F_ACS_DEVICE_INFORMATION_NEW
where day=2
group by modem_id) b
on a.modem_id= b.modem_id;

You can express your logic using just aggregation:
select a.modem_id,
count(distinct case when day = 1 then ip_address end) as day_1,
count(distinct case when day = 2 then ip_address end) as day_2
from F_ACS_DEVICE_INFORMATION_NEW a
group by a.modem_id;
You can obviously extend this for more days.
Note: As your question and code are written, this assumes that your base table has data for only one week. Otherwise, I would expect some date filtering. Presumably, that is what the _NEW suffix means on the table name.

Based on your question and further comments, you would like
The number of different IP addresses accessed by each modem
In counts by week (as columns) for 4 weeks
e.g., result would be 5 columns
modem_id
IPs_accessed_week1
IPs_accessed_week2
IPs_accessed_week3
IPs_accessed_week4
My answer here is based on knowledge of SQL - I haven't used Hive but it appears to support the things I use (e.g., CTEs). You may need to tweak the answer a bit.
The first key step is to turn the day_number into a week_number. A straightforward way to do this is FLOOR((day_num-1)/7)+1, so days 1-7 become week 1, days 8-14 become week 2, etc.
Note - it is up to you to make sure the day_nums are correct. I would guess you'd actually want info for the last 4 weeks, not the first four weeks of data - and as such you'd probably calculate the day_num as something like SELECT DATEDIFF(day, IP_access_date, CAST(getdate() AS date)) - whatever the equivalent is in Hive.
There are a few ways to do this - I think the clearest is to use a CTE to convert your dataset to what you need e.g.,
convert day_nums to weeknums
get rid of duplicates within the week (your code has COUNT(DISTINCT ...) - I assume this is what you want) - I'm doing this with SELECT DISTINCT (rather than grouping by all fields)
From there, you could PIVOT the data to get it into your table, or just use SUM of CASE statements. I'll use SUM of CASE here as I think it's clearer to understand.
WITH IPs_per_week AS
(SELECT DISTINCT
modem_id,
ip_address,
FLOOR((day-1)/7)+1 AS week_num -- Note I've referred to it as day_num in text for clarity
FROM F_ACS_DEVICE_INFORMATION_NEW
)
SELECT modem_id,
SUM(CASE WHEN week_num = 1 THEN 1 ELSE 0 END) AS IPs_accessed_week1,
SUM(CASE WHEN week_num = 2 THEN 1 ELSE 0 END) AS IPs_accessed_week2,
SUM(CASE WHEN week_num = 3 THEN 1 ELSE 0 END) AS IPs_accessed_week3,
SUM(CASE WHEN week_num = 4 THEN 1 ELSE 0 END) AS IPs_accessed_week4
FROM IPs_per_week
GROUP BY modem_id;
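For comparison, the same weekly counts could probably be written in a single pass by putting the week formula inside COUNT(DISTINCT CASE ...). This is only a sketch: it assumes the column is called day, as in the question's join query, and that day 1 is the first day of week 1.

```sql
-- Sketch only: distinct IP addresses per modem for each of four weeks.
SELECT modem_id,
       COUNT(DISTINCT CASE WHEN FLOOR((day - 1) / 7) + 1 = 1 THEN ip_address END) AS IPs_accessed_week1,
       COUNT(DISTINCT CASE WHEN FLOOR((day - 1) / 7) + 1 = 2 THEN ip_address END) AS IPs_accessed_week2,
       COUNT(DISTINCT CASE WHEN FLOOR((day - 1) / 7) + 1 = 3 THEN ip_address END) AS IPs_accessed_week3,
       COUNT(DISTINCT CASE WHEN FLOOR((day - 1) / 7) + 1 = 4 THEN ip_address END) AS IPs_accessed_week4
FROM F_ACS_DEVICE_INFORMATION_NEW
GROUP BY modem_id;
```

A CASE without an ELSE branch returns NULL for non-matching rows, and COUNT(DISTINCT ...) ignores NULLs, so each column only counts addresses seen in its own week.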

Can someone help me with this join

I need it to give me a total of 0 for weeks 33 - 39, but I'm really bad at joining 3 tables and I can't figure it out.
Right now it only gives me an answer for dates on which there are actual records in the tracker_weld_archive table.
SELECT SUM(tracker_parts_archive.weight),
WEEK(mycal.dt) as week
FROM
tracker_parts_archive, tracker_weld_archive
RIGHT JOIN
(SELECT dt FROM calendar_table WHERE dt >= '2018-7-1' AND dt <= '2018-10-1') as mycal
ON
weld_worker = '133'AND date(weld_dateandtime) = mycal.dt
WHERE
tracker_weld_archive.tracker_partsID = tracker_parts_archive.id
GROUP BY week

I think you are trying for something like this:
SELECT WEEK(c.dt) as week, COALESCE(SUM(tpa.weight), 0)
FROM calendar_table c left join
tracker_weld_archive tw
on date(tw.weld_dateandtime) = c.dt left join
tracker_parts_archive tpa
on tw.tracker_partsID = tpa.id and tpa.weld_worker = 133
WHERE c.dt >= '2018-07-01' AND c.dt <= '2018-10-01'
GROUP BY week
ORDER BY week;
Notes:
You want to keep all (matching) rows in the calendar table, so it should be first.
All subsequent joins should be LEFT JOINs.
Never use commas in the FROM clause. Always use proper, explicit, standard JOIN syntax.
Write out the full proper date constant -- YYYY-MM-DD. This is an ISO-standard format.
I am guessing that weld_worker is a number, so single quotes are not needed for the comparison.

First, let's start with understanding what you want. You want totals per week. This means there will be a "GROUP BY" clause (as with any MIN(), MAX(), AVG(), SUM(), COUNT(), etc. aggregates). What is the GROUP BY basis? In this scenario, you want it per week. That leads to the next part: you want a specific date range, qualified by your calendar table.
I would start with the filtering criteria first. Also, ALWAYS TRY to qualify every column as table(or alias).column in your queries so anyone after you knows where the columns are coming from, especially when multiple tables are involved. In this case "ct" is the ALIAS for "calendar_table".
SELECT
ct.dt
from
calendar_table ct
where
ct.dt >= '2018-07-01'
AND ct.dt <= '2018-10-01'
Now, the above date looks to be INCLUSIVE of October 1 and looks like you are trying to generate a quarterly sum from July, Aug, Sept. I would change to LESS than Oct 1.
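As a small sketch of that change, the same filter with an exclusive upper bound would cover exactly July through September:

```sql
-- Sketch only: half-open range, 2018-07-01 up to but not including 2018-10-01.
SELECT ct.dt
FROM calendar_table ct
WHERE ct.dt >= '2018-07-01'
  AND ct.dt <  '2018-10-01';
```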
Now, your calendar has many days and you want it grouped by week, so the WEEK() function gets you that distinct reference without explicitly checking every date. Also, try NOT to use reserved keywords as final column names... makes for confusion later on sometimes.
I have aliased the column name as "WeekBasis". Here, I did a COUNT(*) just to show the total days and the group by showing it in context.
SELECT
WEEK( ct.dt ) WeekBasis,
MIN( ct.dt ) as FirstDayOfThisWeek,
MAX( ct.dt ) as LastDayOfThisWeek,
COUNT(*) as DaysInThisWeek
from
calendar_table ct
where
ct.dt >= '2018-07-01'
AND ct.dt <= '2018-10-01'
group by
WEEK( ct.dt )
So, at this point, we have 1 record per week within the date period you are concerned with, and I also grabbed the earliest and latest dates just to show other components too.
Now, let's get back to your extra tables. We know the dates in question, and now need to get the details from the other tables, which are lacking in the post. You should post critical components such as how the tables are related via a common / joined column.
How is tracker_parts_archive related to tracker_weld_archive??
To simplify your query, you don't even NEED your calendar table, as the welding table HAS a date field and you know your range. Just query against that directly.
IF your worker's ID is numeric, don't add quotes around it; just leave it as a number.
SELECT
WEEK( twa.Weld_DateAndTime ) WeekBasis,
COUNT(*) WeldingEntriesDone,
SUM(tpa.weight) TotalWeight
from
tracker_weld_archive twa
JOIN tracker_parts_archive tpa
-- GUESSING on the relationship here.
-- may also be on a given date too???
-- all pieces welded by a person on a given date
ON twa.weld_worker = tpa.weld_worker
AND twa.Weld_DateAndTime = tpa.Weld_DateAndTime
where
twa.Weld_Worker = 133
AND twa.Weld_DateAndTime >= '2018-07-01'
AND twa.Weld_DateAndTime <= '2018-10-01'
group by
WEEK( twa.Weld_DateAndTime )
IF you provide the table structures AND sample data, this can be refined a bit more for you.

SQL: Average value per day

I have a table called 'tweets'. It includes (amongst others) the columns 'tweet_id', 'created_at' (dd/mm/yyyy hh/mm/ss), 'classified' and 'processed_text'. Within the 'processed_text' column there are certain strings such as '{TICKER|IBM}', to which I will refer as ticker-strings.
My target is to get the average value of 'classified' per ticker-string per day. The column 'classified' includes the numerical values -1, 0 and 1.
At this moment, I have a working SQL query for the average value of 'classified' for one ticker-string per day. See the script below.
SELECT Date( `created_at` ) , AVG( `classified` ) AS Classified
FROM `tweets`
WHERE `processed_text` LIKE '%{TICKER|IBM}%'
GROUP BY Date( `created_at` )
There are however two problems with this script:
It does not include days on which there were zero 'processed_text' values like {TICKER|IBM}. I would like it to return the value zero in this case.
I have 100+ different ticker-strings and would thus like a script that can process multiple strings at the same time. I could also do them manually, one by one, but this would cost me a lot of time.
When I had a similar question for counting the ‘tweet_id’s per ticker-string, somebody else suggested using the following:
SELECT d.date, coalesce(IBM, 0) as IBM, coalesce(GOOG, 0) as GOOG,
coalesce(BAC, 0) AS BAC
FROM dates d LEFT JOIN
(SELECT DATE(created_at) AS date,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|IBM}%' then tweet_id
END) as IBM,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|GOOG}%' then tweet_id
END) as GOOG,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|BAC}%' then tweet_id
END) as BAC
FROM tweets
GROUP BY date
) t
ON d.date = t.date;
This script worked perfectly for counting the tweet_ids per ticker-string. As I stated, however, I am now looking to find the average classified scores per ticker-string. My question is therefore: could someone show me how to adjust this script so that I can calculate the average classified score per ticker-string per day?

SELECT d.date, t.ticker, COALESCE(COUNT(DISTINCT t.tweet_id), 0) AS tweets
FROM dates d
LEFT JOIN
(SELECT DATE(created_at) AS date,
tweet_id,
-- extract the text between '{TICKER|' and the next '}'
SUBSTR(processed_text,
LOCATE('{TICKER|', processed_text) + 8,
LOCATE('}', processed_text, LOCATE('{TICKER|', processed_text))
- LOCATE('{TICKER|', processed_text) - 8) AS ticker
FROM tweets) t
ON d.date = t.date
GROUP BY d.date, t.ticker
This will put each ticker on its own row, not a column. If you want them moved to columns, you have to pivot the result. How you do this depends on the DBMS. Some have built-in features for creating pivot tables. Others (e.g. MySQL) do not and you have to write tricky code to do it; if you know all the possible values ahead of time, it's not too hard, but if they can change you have to write dynamic SQL in a stored procedure.
See MySQL pivot table for how to do it in MySQL.
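To get the average classified score rather than a tweet count, one possible adaptation is to carry classified through the subquery and aggregate it with AVG. The sketch below assumes MySQL, a dates table with one row per day as in the question, and at most one ticker-string per tweet (only the first one is extracted); the alias avg_classified is made up for illustration.

```sql
-- Sketch only: average classified score per ticker-string per day.
SELECT d.date,
       t.ticker,
       COALESCE(AVG(t.classified), 0) AS avg_classified  -- 0 on days with no matching tweets
FROM dates d
LEFT JOIN
    (SELECT DATE(created_at) AS date,
            classified,
            SUBSTR(processed_text,
                   LOCATE('{TICKER|', processed_text) + 8,
                   LOCATE('}', processed_text, LOCATE('{TICKER|', processed_text))
                     - LOCATE('{TICKER|', processed_text) - 8) AS ticker
     FROM tweets
     WHERE processed_text LIKE '%{TICKER|%') t   -- skip tweets without a ticker-string
    ON d.date = t.date
GROUP BY d.date, t.ticker;
```

A day with no ticker tweets at all appears once, with a NULL ticker and an average of 0. Getting a zero row for every ticker on every day would additionally need a list of all tickers to cross join against the dates.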
