Combining two queries with case statement in BigQuery - sql

I'm new to SQL and have been trying to combine two queries that give me a count of unique uses by day of the week ('weekday' - with days of the week coded 1-7) and by user type ('member_casual' - member or casual user). I managed to use a case statement to combine them into one table, with the following query:
SELECT weekday,
CASE WHEN member_casual = 'member' THEN COUNT (*) END AS member_total,
CASE WHEN member_casual = 'casual' THEN COUNT (*) END AS casual_total,
FROM
`case-study-319921.2020_2021_Trip_Data.2020_2021_Rides_Merged`
GROUP BY weekday, member_casual;
Resulting in a table that looks like this:
Row | weekday | member_total | casual_total
----+---------+--------------+-------------
1   | 1       | null         | 330529
2   | 1       | 308760       | null
3   | 2       | null         | 188687
4   | 2       | 316228       | null
5   | 3       | 330656       | null
6   | 3       | null         | 174799
etc...
I can see that this likely has to do with the fact that I grouped by both 'weekday' and 'member_casual'; however, I get errors if I remove 'member_casual' from the GROUP BY clause. I have tried to play around with a CASE IF statement instead, but have yet to find a solution.

You want countif():
SELECT weekday,
COUNTIF(member_casual = 'member') AS member_total,
COUNTIF(member_casual = 'casual') AS casual_total,
FROM `case-study-319921.2020_2021_Trip_Data.2020_2021_Rides_Merged`
GROUP BY weekday;
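To see why this yields one row per weekday, here is a minimal sketch that runs the same conditional aggregation on a tiny in-memory SQLite table. The table name `rides` and its rows are made up for illustration. SQLite has no COUNTIF, so the sketch uses the portable SUM(CASE ...) form, which counts the same rows.

```python
import sqlite3

# Hypothetical miniature of the rides table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (weekday INTEGER, member_casual TEXT)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?)",
    [
        (1, "member"), (1, "member"), (1, "casual"),
        (2, "casual"), (2, "member"), (2, "casual"),
    ],
)

# SQLite lacks COUNTIF, so SUM(CASE ...) stands in for it here.
# Grouping only by weekday puts both totals on the same row.
rows = conn.execute(
    """
    SELECT weekday,
           SUM(CASE WHEN member_casual = 'member' THEN 1 ELSE 0 END) AS member_total,
           SUM(CASE WHEN member_casual = 'casual' THEN 1 ELSE 0 END) AS casual_total
    FROM rides
    GROUP BY weekday
    ORDER BY weekday
    """
).fetchall()

for row in rows:
    print(row)
```

The condition lives inside the aggregate, so member_casual no longer needs to appear in the GROUP BY, which is what removes the null-filled duplicate rows.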

Related

Using Parameter within timestamp_trunc in SQL Query for DataStudio

I am trying to use a custom parameter within DataStudio. The data is hosted in BigQuery.
SELECT
timestamp_trunc(o.created_at, #groupby) AS dateMain,
count(o.id) AS total_orders
FROM `x.default.orders` o
group by 1
When I try this, it returns an error saying that "A valid date part name is required at [2:35]"
I basically need to group the dates using a parameter (e.g. day, week, month).
I have also included a screenshot of how I have created the parameter in Google DataStudio. There is a default value set which is "day".
A workaround that might do the trick here is to use a rollup in the group by with the different levels of aggregation of the date, since I am not sure you can pass a DS parameter to work like that.
See the following example for clarity:
with default_orders as (
select timestamp'2021-01-01' as created_at, 1 as id
union all
select timestamp'2021-01-01', 2
union all
select timestamp'2021-01-02', 3
union all
select timestamp'2021-01-03', 4
union all
select timestamp'2021-01-03', 5
union all
select timestamp'2021-01-04', 6
),
final as (
select
count(id) as count_orders,
timestamp_trunc(created_at, day) as days,
timestamp_trunc(created_at, week) as weeks,
timestamp_trunc(created_at, month) as months
from
default_orders
group by
rollup(days, weeks, months)
)
select * from final
The output, then, would be similar to the following:
count | days | weeks | months
------+------------+----------+----------
6 | null | null | null <- this, represents the overall (counted 6 ids)
2 | 2021-01-01| null | null <- this, the 1st rollup level (day)
2 | 2021-01-01|2020-12-27| null <- this, the 1st and 2nd (day, week)
2 | 2021-01-01|2020-12-27|2021-01-01 <- this, all of them
And so on.
When visualizing this in Data Studio, you have two options: set the metric to Avg instead of Sum, because the day-level rows are repeated at each rollup stage; or add another step to the query that gets rid of the nulls, like this:
select
*
from
final
where
days is not null and
weeks is not null and
months is not null

Impala: values are in wrong columns in result query

In my result query the values are in wrong columns.
My SQL Query is like:
create table some_database.table_name as
select
extract(year from t.operation_date) operation_year,
extract(month from t.operation_date) operation_month,
extract(day from t.operation_date) operation_day,
d.status_name,
sum(t.operation_amount) operation_amt,
current_timestamp() calculation_moment
from operations t
left join status_dict d on
d.status_id = t.status_id
group by
extract(year from t.operation_date) operation_year,
extract(month from t.operation_date) operation_month,
extract(day from t.operation_date) operation_day,
d.status_name
(In fact, it's more complicated, but the main idea is that I'm aggregating source table and making some joins.)
The result I get is like:
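# | operation_year | operation_month     | operation_day | status_name | operation_amt
--+----------------+---------------------+---------------+-------------+--------------
1 | 2021           | 1                   | 1             | success     | 100
2 | 2021           | 1                   | 1             | success     | 150
3 | 2021           | 1                   | 2             | success     | 120
4 | null           | 2021-01-01 21:53:00 | success       | 120         | null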
The problem is in row 4.
The field t.operation_date is not nullable, but in the result operation_year is null.
operation_month contains an untruncated timestamp.
operation_day contains the string value from d.status_name.
status_name contains the numeric aggregate of t.operation_amount.
operation_amt is null.
It looks very similar to a CSV file being parsed incorrectly, where values jump into other columns, but obviously that can't be the case here. I can't figure out how on earth this is possible. I'm new to Hadoop and apparently I'm not aware of some important concept that causes the problem.

How to run a query for multiple independent date ranges?

I would like to run the query below, which looks like this for week 1:
Select week(datetime), count(customer_call) from table where week(datetime) = 1 and week(orderdatetime) < 7
... but for weeks 2, 3, 4, 5 and 6 as well, all in one query, with 'week(orderdatetime)' still covering the 6 weeks following the 'week(datetime)' value.
This means that for 'week(datetime) = 2', 'week(orderdatetime)' would be between 2 and 7 and so on.
'datetime' is a datetime field denoting registration.
'customer_call' is a datetime field denoting when they called.
'orderdatetime' is a datetime field denoting when they ordered.
Thanks!
I think you want group by:
Select week(datetime), count(customer_call)
from table
where week(datetime) = 1 and week(orderdatetime) < 7
group by week(datetime);
I would also point out that week doesn't take the year into account, so you might want to include that in the group by or in a where filter.
EDIT:
If you want 6 weeks of cumulative counts, then use:
Select week(datetime), count(customer_call),
sum(count(customer_call)) over (order by week(datetime)
rows between 5 preceding and current row) as running_sum_6
from table
group by week(datetime);
Note: If you want to filter this to particular weeks, then make this a subquery and filter in the outer query.
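As a rough illustration of the windowed sum, the sketch below runs a 6-row trailing window over weekly counts in an in-memory SQLite database. The table `calls`, the column `reg_week`, and the data are hypothetical, and the weekly counts are computed in a CTE first rather than nesting the aggregate inside the window function. It assumes SQLite 3.25 or later, which is when window functions were added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: reg_week stands in for week(datetime).
conn.execute("CREATE TABLE calls (reg_week INTEGER, customer_call TEXT)")
# One call per registration week, for weeks 1 through 8 (made-up data).
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [(week, "2021-01-01 10:00:00") for week in range(1, 9)],
)

rows = conn.execute(
    """
    WITH weekly AS (
        SELECT reg_week, COUNT(customer_call) AS calls
        FROM calls
        GROUP BY reg_week
    )
    SELECT reg_week,
           calls,
           SUM(calls) OVER (
               ORDER BY reg_week
               ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
           ) AS running_sum_6
    FROM weekly
    ORDER BY reg_week
    """
).fetchall()

running = [r[2] for r in rows]
print(running)
```

Because the frame is defined in rows, it spans six weeks only if every week has at least one row. A week with no data would silently stretch the window over a longer period.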

how to use count with case when

I'm a newbie to Hive SQL.
I have a raw table with 6 million records like this:
I want to count the number of distinct IP_address values that access each Modem_id every week.
The result table I want will be like this:
I did it with a left join, and it worked. But since a join will be time-consuming, I want to do it with a CASE WHEN statement instead, and I can't write a correct one. Do you have any ideas?
This is the join statement I used:
select a.modem_id,
a.Number_of_IP_in_Day_1,
b.Number_of_IP_in_Day_2
from
(select modem_id,
count(distinct ip_address) as Number_of_IP_in_Day_1
from F_ACS_DEVICE_INFORMATION_NEW
where day=1
group by modem_id) a
left join
(select modem_id,
count(distinct param_value) as Number_of_IP_in_Day_2
from F_ACS_DEVICE_INFORMATION_NEW
where day=2
group by modem_id) b
on a.modem_id= b.modem_id;
You can express your logic using just aggregation:
select a.modem_id,
count(distinct case when day = 1 then ip_address end) as day_1,
count(distinct case when day = 2 then ip_address end) as day_2
from F_ACS_DEVICE_INFORMATION_NEW a
group by a.modem_id;
You can obviously extend this for more days.
Note: As your question and code are written, this assumes that your base table has data for only one week. Otherwise, I would expect some date filtering. Presumably, that is what the _NEW suffix means on the table name.
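A minimal sketch of this pattern on an in-memory SQLite table follows. The table name `device_info` and the rows are invented for illustration, and the day column is assumed to be named `day`, as in the join query from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for F_ACS_DEVICE_INFORMATION_NEW.
conn.execute(
    "CREATE TABLE device_info (modem_id TEXT, ip_address TEXT, day INTEGER)"
)
conn.executemany(
    "INSERT INTO device_info VALUES (?, ?, ?)",
    [
        ("m1", "10.0.0.1", 1),
        ("m1", "10.0.0.1", 1),  # same IP seen twice on day 1
        ("m1", "10.0.0.2", 1),
        ("m1", "10.0.0.1", 2),
        ("m2", "10.0.0.3", 2),  # this modem has no day-1 rows
    ],
)

# CASE yields NULL for rows from other days, and COUNT(DISTINCT ...) ignores NULLs.
rows = conn.execute(
    """
    SELECT modem_id,
           COUNT(DISTINCT CASE WHEN day = 1 THEN ip_address END) AS day_1,
           COUNT(DISTINCT CASE WHEN day = 2 THEN ip_address END) AS day_2
    FROM device_info
    GROUP BY modem_id
    ORDER BY modem_id
    """
).fetchall()

for row in rows:
    print(row)
```

One difference from the left join is worth noting: a modem with no day-1 rows still appears here with a count of 0, whereas the join version starts from the day-1 subquery and would leave it out.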
Based on your question and further comments, you would like
The number of different IP addresses accessed by each modem
In counts by week (as columns) for 4 weeks
e.g., result would be 5 columns
modem_id
IPs_accessed_week1
IPs_accessed_week2
IPs_accessed_week3
IPs_accessed_week4
My answer here is based on knowledge of SQL - I haven't used Hive but it appears to support the things I use (e.g., CTEs). You may need to tweak the answer a bit.
The first key step is to turn the day_number into a week_number. A straightforward way to do this is FLOOR((day_num-1)/7)+1 so days 1-7 become week 1, days 8-14 become week2, etc.
Note - it is up to you to make sure the day_nums are correct. I would guess you'd actually want info for the last 4 weeks, not the first four weeks of data - and as such you'd probably calculate the day_num as something like SELECT DATEDIFF(day, IP_access_date, CAST(getdate() AS date)) - whatever the equivalent is in Hive.
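The day-to-week formula above can be checked quickly outside the database. This small sketch uses a hypothetical helper name, `week_number`, and integer floor division in place of FLOOR:

```python
def week_number(day_num: int) -> int:
    """Map a 1-based day number to a 1-based week number."""
    # Same arithmetic as FLOOR((day_num - 1) / 7) + 1.
    return (day_num - 1) // 7 + 1


# Check the boundaries: the last day of one week and the first day of the next.
for day_num in (1, 7, 8, 14, 15, 28):
    print(day_num, week_number(day_num))
```

Subtracting 1 before dividing is what keeps day 7 in week 1. Without it, day 7 would already fall into week 2.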
There are a few ways to do this - I think the clearest is to use a CTE to convert your dataset to what you need e.g.,
convert day_nums to weeknums
get rid of duplicates within the week (your code has COUNT(DISTINCT ...) - I assume this is what you want) - I'm doing this with SELECT DISTINCT (rather than grouping by all fields)
From there, you could PIVOT the data to get it into your table, or just use SUM of CASE statements. I'll use SUM of CASE here as I think it's clearer to understand.
WITH IPs_per_week AS
(SELECT DISTINCT
modem_id,
ip_address,
FLOOR((day-1)/7)+1 AS week_num -- Note I've referred to it as day_num in text for clarity
FROM F_ACS_DEVICE_INFORMATION_NEW
)
SELECT modem_id,
SUM(CASE WHEN week_num = 1 THEN 1 ELSE 0 END) AS IPs_access_week1,
SUM(CASE WHEN week_num = 2 THEN 1 ELSE 0 END) AS IPs_access_week2,
SUM(CASE WHEN week_num = 3 THEN 1 ELSE 0 END) AS IPs_access_week3,
SUM(CASE WHEN week_num = 4 THEN 1 ELSE 0 END) AS IPs_access_week4
FROM IPs_per_week
GROUP BY modem_id;

SQL Server - Need to obtain duplicate records based on multiple criteria of the same column

I work with a huge dataset of hospital activity records. Each record represents something done on behalf of a patient. My focus is on patients that have experienced 'outpatient' activity, such as attended an appointment or clinic.
In the data, we get records that are duplicates, in that a patient is shown to have attended their first outpatient appointment more than once in a six-month period. This is an error on the part of the hospital that sends the data. We have to identify these records to send back as challenges.
I have the following SQL statement which is finding records where the 'Patient Code' appears more than once.
SELECT * FROM dbo.Z_ForQueries a
JOIN (SELECT PatientCode
FROM dbo.Z_ForQueries
GROUP BY PatientCode
HAVING COUNT (*) > 1 ) b
ON a.PatientCode = b.PatientCode
WHERE [Multiple OPFA in month] = 'y'
I cannot for the life of me figure out the syntax for the next bit. For each set of duplicated patient codes, I only want to see the records where one of the records has a 'Month' of 7 (that's just the current month I'm working on). If none of the records in a group of duplicates has '7' in the month, then I don't need to see that group.
For example, patient code L000066715 has 4 records, I can see that each record represents the same initial outpatient appointment in the same hospital speciality. Obviously you can only 'first attend' once. Each record has a month number; 3,4,6 & 7. Because this patient code has one of their duplicate records in month 7, I need it to be returned in the results along with the other 3 records.
Other patient codes exist in duplicate but none of their records are from month 7, so they don't need to be returned.
I hope I've set the scene properly for some help! Thanks.
Something like this should work:
SELECT *
FROM dbo.Z_ForQueries a
JOIN (
SELECT PatientCode,
MAX(CASE WHEN MONTH(dateColumn) = 7 THEN 1 ELSE 0 END) As InMonth
FROM dbo.Z_ForQueries
GROUP BY PatientCode
HAVING COUNT (*) > 1
) b ON a.PatientCode = b.PatientCode
And InMonth = 1
WHERE [Multiple OPFA in month] = 'y'
Explanation:
The CASE expression returns 1 for rows where Month=7, and 0 in all other cases. The MAX(..) around this CASE expression thus returns 1 if any row in the group has Month=7, and 0 only if none of them do.
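Here is a small sketch of the same flag-per-group idea on an in-memory SQLite table. The table name `activity` and the codes `P2` and `P3` are hypothetical. It assumes the month is stored directly in a numeric `Month` column, as the question describes, rather than derived from a date, and it leaves out the [Multiple OPFA in month] filter for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for dbo.Z_ForQueries with only the two relevant columns.
conn.execute("CREATE TABLE activity (PatientCode TEXT, Month INTEGER)")
conn.executemany(
    "INSERT INTO activity VALUES (?, ?)",
    [
        ("L000066715", 3), ("L000066715", 4), ("L000066715", 6), ("L000066715", 7),
        ("P2", 2), ("P2", 5),  # duplicated, but no record in month 7
        ("P3", 7),             # month 7, but not duplicated
    ],
)

rows = conn.execute(
    """
    SELECT a.PatientCode, a.Month
    FROM activity a
    JOIN (
        SELECT PatientCode,
               MAX(CASE WHEN Month = 7 THEN 1 ELSE 0 END) AS InMonth
        FROM activity
        GROUP BY PatientCode
        HAVING COUNT(*) > 1
    ) b ON a.PatientCode = b.PatientCode AND b.InMonth = 1
    ORDER BY a.PatientCode, a.Month
    """
).fetchall()

for row in rows:
    print(row)
```

Only the patient code that is both duplicated and has a month-7 record comes back, together with all of its other records.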