Impala: values are in the wrong columns in the query result

In my query result, some values end up in the wrong columns.
My SQL query looks like this:
create table some_database.table_name as
select
extract(year from t.operation_date) operation_year,
extract(month from t.operation_date) operation_month,
extract(day from t.operation_date) operation_day,
d.status_name,
sum(t.operation_amount) operation_amt,
current_timestamp() calculation_moment
from operations t
left join status_dict d on
d.status_id = t.status_id
group by
extract(year from t.operation_date),
extract(month from t.operation_date),
extract(day from t.operation_date),
d.status_name
(In fact, it's more complicated, but the main idea is that I'm aggregating the source table and doing some joins.)
The result I get is like:
| # | operation_year | operation_month     | operation_day | status_name | operation_amt |
|---|----------------|---------------------|---------------|-------------|---------------|
| 1 | 2021           | 1                   | 1             | success     | 100           |
| 2 | 2021           | 1                   | 1             | success     | 150           |
| 3 | 2021           | 1                   | 2             | success     | 120           |
| 4 | null           | 2021-01-01 21:53:00 | success       | 120         | null          |
The problem is in row 4.
The field t.operation_date is not nullable, but in the result the column operation_year contains null.
In operation_month we get an untruncated timestamp.
In operation_day we get the string value from d.status_name.
In status_name we get the numeric aggregate from t.operation_amount.
In operation_amt we get null.
It looks very similar to a wrongly parsed CSV file, where values jump into other columns, but obviously that can't be the case here. I can't figure out how on earth it is possible. I'm new to Hadoop and apparently I'm not aware of some important concept that causes the problem.
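One way to start narrowing this down is to check whether the tables involved are delimited text tables, since in Hadoop text-backed tables are re-parsed on every read, and a stray delimiter inside a field shifts all following columns exactly like a badly parsed CSV. A hedged diagnostic sketch, not a fix (table names taken from the question):
SHOW CREATE TABLE operations;                -- stored as a delimited TEXTFILE?
SHOW CREATE TABLE some_database.table_name;  -- same question for the CTAS target
-- inspect raw rows whose date failed to parse at read time
SELECT * FROM operations t WHERE t.operation_date IS NULL;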

Related

Using Parameter within timestamp_trunc in SQL Query for DataStudio

I am trying to use a custom parameter within DataStudio. The data is hosted in BigQuery.
SELECT
timestamp_trunc(o.created_at, #groupby) AS dateMain,
count(o.id) AS total_orders
FROM `x.default.orders` o
group by 1
When I try this, it returns an error saying that "A valid date part name is required at [2:35]"
I basically need to group the dates using a parameter (e.g. day, week, month).
I have also included a screenshot of how I have created the parameter in Google DataStudio. There is a default value set which is "day".
A workaround that might do the trick here is to use a rollup in the group by with the different levels of date aggregation, since I am not sure you can pass a Data Studio parameter to work like that.
See the following example for clarity:
with default_orders as (
select timestamp'2021-01-01' as created_at, 1 as id
union all
select timestamp'2021-01-01', 2
union all
select timestamp'2021-01-02', 3
union all
select timestamp'2021-01-03', 4
union all
select timestamp'2021-01-03', 5
union all
select timestamp'2021-01-04', 6
),
final as (
select
count(id) as count_orders,
timestamp_trunc(created_at, day) as days,
timestamp_trunc(created_at, week) as weeks,
timestamp_trunc(created_at, month) as months
from
default_orders
group by
rollup(days, weeks, months)
)
select * from final
The output, then, would be similar to the following:
count | days | weeks | months
------+------------+----------+----------
6 | null | null | null <- this, represents the overall (counted 6 ids)
2 | 2021-01-01| null | null <- this, the 1st rollup level (day)
2 | 2021-01-01|2020-12-27| null <- this, the 1st and 2nd (day, week)
2 | 2021-01-01|2020-12-27|2021-01-01 <- this, all of them
And so on.
When visualizing this in Data Studio, you have two options: set the metric to Avg instead of Sum, because as you can see there is a kind of duplication at each rollup stage of the day column; or add another step to the query to get rid of the nulls, like this:
select
*
from
final
where
days is not null and
weeks is not null and
months is not null
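With the sample data above, this filter should leave only the fully-grouped rows, one per day (assuming BigQuery's default Sunday-based week truncation):
count | days | weeks | months
------+------------+----------+----------
2 | 2021-01-01|2020-12-27|2021-01-01
1 | 2021-01-02|2020-12-27|2021-01-01
2 | 2021-01-03|2021-01-03|2021-01-01
1 | 2021-01-04|2021-01-03|2021-01-01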

Count Distinct values in one column based on other columns

I have a table that looks like the following:
app_id supplier_reached creation_date platform
10001 1 9/11/2018 iOS
10001 2 9/18/2018 iOS
10002 1 5/16/2018 android
10003 1 5/6/2018 android
10004 1 10/1/2018 android
10004 1 2/3/2018 android
10004 2 2/2/2018 web
10005 4 1/5/2018 web
10005 2 5/1/2018 android
10006 3 10/1/2018 iOS
10005 4 1/1/2018 iOS
The objective is to find the unique number of app_id submitted per month.
If I just do a count(distinct app_id) I will get the following results:
Month      count(distinct app_id)
Jan        1
Feb        1
May        3
September  1
October    2
However, an application is considered unique based on a combination of other fields as well. For example, for the month of January the app_id is the same, but the combination of app_id, supplier_reached, and platform shows different values, so the app_id should be counted twice.
Following the same pattern, the desired result should be:
Month      Desired answer
Jan        2
Feb        2
May        3
September  2
October    2
Lastly, there can be many other columns in the table which may or may not contribute to the uniqueness of an application.
Is there a way to do this type of count in SQL?
I am using Redshift.
As pointed out above, in Redshift count(distinct ...) does not work with multiple fields.
You can first group by the columns that you want to be unique and then count the records like this:
select month, count(1) as app_number
from (
select month, app_id, supplier_reached, platform
from your_table
group by 1, 2, 3, 4
) t
group by 1
I don't think Postgres or Redshift supports COUNT(DISTINCT) with multiple arguments. One workaround is to use concatenation:
count(distinct app_id || ':' || supplier_reached || ':' || platform)
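In context, a minimal sketch of that workaround (your_table and its columns are assumed from the question; note that || yields NULL if any operand is NULL, so non-text columns need casting and possibly coalesce):
select month,
count(distinct coalesce(cast(app_id as varchar), '')
|| ':' || coalesce(cast(supplier_reached as varchar), '')
|| ':' || coalesce(platform, '')) as app_number
from your_table
group by month;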
The way your objective is stated is misleading.
You don't want
to find the unique number of app_id submitted per month
you want
to find the unique number of app_id + supplier_reached + platform submitted per month.
And so, you need to use either a) a combination of columns, like count(distinct col1||col2||col3), or b)
select t1.month, count(*)
from (select distinct
app_id,
supplier_reached,
platform,
month
from sometable) t1
group by t1.month
Actually, you can count distinct ROW values conveniently in Postgres:
SELECT month, count(DISTINCT (app_id, supplier_reached, platform)) AS dist_apps
FROM tbl
GROUP BY 1;
The ROW keyword would be just noise here:
count(DISTINCT ROW(app_id, supplier_reached, platform))
I would discourage concatenating columns for this purpose. It is comparatively expensive, error prone (think of distinct data types and locale-dependent text representations), and introduces corner-case errors if the chosen separator can appear in column values.
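To illustrate the corner case with hypothetical values: two distinct (col1, col2) pairs can collapse to the same concatenated key:
select 'a:b' || ':' || 'c' as key1    -- 'a:b:c'
, 'a' || ':' || 'b:c' as key2;        -- also 'a:b:c', so two distinct rows count as one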
Alas, not supported by Redshift:
...
Value expressions
Subscripted expressions
Array constructors
Row constructors
...

Using crosstab, dynamically loading column names of resulting pivot table in one query?

The gem we have installed (Blazer) on our site limits us to one query.
We are trying to write a query to show how many hours each employee has for the past 10 days. The first column would have employee names and the rest would have hours with the column header being each date. I'm having trouble figuring out how to make the column headers dynamic based on the day. The following is an example of what we have working without dynamic column headers and only using 3 days.
SELECT
pivot_table.*
FROM
crosstab(
E'SELECT
"User",
"Date",
"Hours"
FROM
(SELECT
"q"."qdb_users"."name" AS "User",
to_char("qdb_works"."date", \'YYYY-MM-DD\') AS "Date",
sum("qdb_works"."hours") AS "Hours"
FROM
"q"."qdb_works"
LEFT OUTER JOIN
"q"."qdb_users" ON
"q"."qdb_users"."id" = "q"."qdb_works"."qdb_user_id"
WHERE
"qdb_works"."date" > current_date - 20
GROUP BY
"User",
"Date"
ORDER BY
"Date" DESC,
"User" DESC) "x"
ORDER BY 1, 2')
AS
pivot_table (
"User" VARCHAR,
"2017-10-06" FLOAT,
"2017-10-05" FLOAT,
"2017-10-04" FLOAT
);
This results in
| User | 2017-10-05 | 2017-10-04 | 2017-10-03 |
|------|------------|------------|------------|
| John | 1.5 | 3.25 | 2.25 |
| Jill | 6.25 | 6.25 | 6 |
| Bill | 2.75 | 3 | 4 |
This is correct, but tomorrow the column headers will be off unless we update the query every day. I know we could pivot this table with dates on the left and names on the top, but that would still need updating with each new employee, and we get new ones often.
We have tried using functions and queries in the "AS" section with no luck. For example:
AS
pivot_table (
"User" VARCHAR,
current_date - 0 FLOAT,
current_date - 1 FLOAT,
current_date - 2 FLOAT
);
Is there any way to pull this off with one query?
You could select a row for each user, and then per column sum the hours for one day:
with user_work as
(
select u.name as "user"
, to_char(w.date, 'YYYY-MM-DD') as dt_str
, w.hours
from qdb_works w
join qdb_users u
on u.id = w.qdb_user_id
where w.date >= current_date - interval '2 days'
)
select "user"
, sum(case when dt_str = to_char(current_date,
'YYYY-MM-DD') then hours end) as Today
, sum(case when dt_str = to_char(current_date - interval '1 day',
'YYYY-MM-DD') then hours end) as Yesterday
, sum(case when dt_str = to_char(current_date - interval '2 days',
'YYYY-MM-DD') then hours end) as DayBeforeYesterday
from user_work
group by "user"
It's often easier to return a list and pivot it client side. That also allows you to generate column names with a date.
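For instance, a hedged sketch of the long-format query a client could pivot (same qdb_works/qdb_users tables as above; the 10-day window comes from the question):
-- one row per user and day; the client turns each dt into a column header
select u.name as "user"
, w.date::date as dt
, sum(w.hours) as hours
from qdb_works w
join qdb_users u on u.id = w.qdb_user_id
where w.date >= current_date - interval '9 days'
group by u.name, w.date::date
order by dt desc, "user";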
Is there any way to pull this off with one query?
No, because a fixed SQL query cannot have any variability in its output columns. The SQL engine determines the number, types and names of every column of a query before executing it, without reading any data except in the catalog (for the structure of tables and other objects), execution being just the last of 5 stages.
A single-query dynamic pivot, if such a thing existed, couldn't be prepared, since a prepared query always has the same result structure, whereas by definition a dynamic pivot doesn't, as the rows that pivot into columns can change between executions. That again would be at odds with the Prepare-Bind-Execute model.
You may find some limited workarounds and additional explanations in other questions, for example: Execute a dynamic crosstab query, but since you mentioned specifically:
The gem we have installed (Blazer) on our site limits us to one query
I'm afraid you're out of luck. Whatever the workaround, it always needs at least one step with a query to figure out the columns and generate a dynamic query from them, and a second step executing the query generated in the previous step.
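As a sketch of that two-step approach (hedged; the 10-day window and pivot_table shape come from the question, while string_agg, format, and generate_series are standard Postgres):
-- Step 1: generate the column-definition list for the last 10 days
select string_agg(format('%I FLOAT', to_char(d, 'YYYY-MM-DD')), ', ' order by d desc)
from generate_series(current_date - 9, current_date, interval '1 day') as g(d);
-- Step 2 (in application code): splice that result into the
--   AS pivot_table ("User" VARCHAR, <generated list>)
-- clause of the crosstab query and execute the generated statement.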

SQL EXTRACT(YEAR FROM MYDATE) not a GROUP BY expression

I have table MYTABLE with columns mydate and quantity of VARCHAR2 type.
mydate       quantity
10/15/2010   15
01/20/2010   20
05/16/2005   30
04/29/2005   50
03/30/2008   5
I want to get:
year   quantity
2010   35
2005   80
2008   5
I try:
SELECT
to_char(mydate,'yyyy') YEAR,
SUM(to_number(quantity))
FROM MYTABLE
GROUP BY
to_char(mydate,'yyyy');
But I get an error
ORA-00979: not a GROUP BY expression
What did I do wrong?
You must put all columns of the SELECT in the GROUP BY or apply aggregate functions to them that compress the results to a single value (like MIN, MAX or SUM).
A simple example to understand why this happens: Imagine you have a database like this:
FOO BAR
0 A
0 B
and you run SELECT * FROM table GROUP BY foo. This means the database must return a single row with the first column 0 to fulfill the GROUP BY, but there are now two values of bar to choose from. Which result would you expect - A or B? Or should the database return more than one row, violating the contract of GROUP BY?
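A minimal sketch of the two valid alternatives for that toy table (some_table stands in for the example's table):
-- either aggregate the non-grouped column ...
SELECT foo, MAX(bar) FROM some_table GROUP BY foo;
-- ... or group by every selected column
SELECT foo, bar FROM some_table GROUP BY foo, bar;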
Try this
select extract(year from mydate), sum(to_number(quantity)) from mytable
group by extract(year from mydate);

SQL: sum 2 different columns by different conditions, then subtract and add

What I am trying to do is kind of complex; I will try my best to explain.
I achieved the first part, which is to sum the column by hours.
Example:
ID TIMESTAMP CUSTAFFECTED
1 10-01-2013 01:00:23 23
2 10-01-2013 03:00:23 55
3 10-01-2013 05:00:23 2369
4 10-01-2013 04:00:23 12
5 10-01-2013 01:00:23 1
6 10-01-2013 12:00:23 99
7 10-01-2013 01:00:23 22
8 10-01-2013 02:00:23 3
The output would be:
Hour TotalCALLS CUSTAFFECTED
10/1/2013 01:00 3 46
10/1/2013 02:00 1 3
10/1/2013 03:00 1 55
10/1/2013 04:00 1 12
10/1/2013 05:00 1 2369
10/1/2013 12:00 1 99
Query
SELECT TRUNC(STARTDATETIME, 'HH24') AS hour,
COUNT(*) AS TotalCalls,
sum(CUSTAFFECTED) AS CUSTAFFECTED
FROM some_table
where STARTDATETIME >= To_Date('09-12-2013 00:00:00','MM-DD-YYYY HH24:MI:SS') and
STARTDATETIME <= To_Date('09-13-2013 00:00:00','MM-DD-YYYY HH24:MI:SS')
GROUP BY TRUNC(STARTDATETIME, 'HH24')
What I need
What I need is to sum the 2 queries and group by timestamp/hour. The 2nd query is exactly the same as the first; just the where clause is different.
2nd query
SELECT TRUNC(RESTOREDDATETIME , 'HH24') AS hour,
COUNT(*) AS TotalCalls,
SUM(CUSTAFFECTED) AS CUSTRESTORED
FROM some_table
where RESTOREDDATETIME >= To_Date('09-12-2013 00:00:00','MM-DD-YYYY HH24:MI:SS') and
RESTOREDDATETIME <= To_Date('09-13-2013 00:00:00','MM-DD-YYYY HH24:MI:SS')
GROUP BY TRUNC(RESTOREDDATETIME , 'HH24')
So I need to subtract custaffected - custrestored, and display that total.
I added a link to an Excel file: http://goo.gl/ioo9hg
Thanks
OK, now that the correct SQL is in the question text, try this:
SELECT TRUNC(STARTDATETIME, 'HH24') AS hour,
COUNT(*) AS TotalCalls,
Sum(case when RESTOREDDATETIME is null Then 0 else 1 end) RestoredCount,
Sum(CUSTAFFECTED) as CUSTAFFECTED,
Sum(case when RESTOREDDATETIME is null Then 0 else CUSTAFFECTED end) CustRestored,
SUM(CUSTAFFECTED) -
Sum(case when RESTOREDDATETIME is null Then 0 else CUSTAFFECTED end) AS CUSTNotRestored
FROM some_table
where STARTDATETIME >= To_Date('09-12-2013 00:00:00','MM-DD-YYYY HH24:MI:SS')
and STARTDATETIME <= To_Date('09-13-2013 00:00:00','MM-DD-YYYY HH24:MI:SS')
GROUP BY TRUNC(STARTDATETIME, 'HH24')
I recently needed to do this and had to play with it some to get it to work.
The challenge is to link the results of one query to another inside the same query, and then manipulate the returned values so that a field in one resultset, call it FieldA, is subtracted from a field in a different resultset, call it FieldB. It doesn't matter whether the values come from an aggregate function like COUNT(...); they could be any numeric field, grouped or not. Values from aggregate functions just mean you need to adjust your query logic to GROUP BY the proper fields. The approach is to create in-line views in the query and use them as the source of data for the subtraction.
A red herring when dealing with this kind of thing is the MINUS operator (assuming you are using an Oracle database), but that will not work here: MINUS does not subtract field values inside a resultset from one another, it subtracts one set of matching records from another set of records in the final result. In addition, MINUS is not a standard SQL operator, so if you are not on Oracle your database probably won't support it. Still, it's awfully nice to have around when you need it.
OK, enough prelude. Here's the query form you will want to use, taking for example a date range we want grouped by YYYY-MM:
select inlineview1.year_mon, (inlineview1.CNT - inlineview2.CNT) as finalcnt from
(SELECT TO_CHAR(*date_field*, 'YYYY-MM') AS year_mon, count(*any_field_name*) as CNT
FROM *schemaname.tablename*
WHERE *date_field* > TO_DATE('*{a year}-{a month}-{a day}*', 'YYYY-MM-DD') and
*date_field* < TO_DATE('*{a year}-{a month}-{a day}*', 'YYYY-MM-DD') and
*another_field* = *{value_of_some_kind}* -- ... etc. ...
GROUP BY TO_CHAR(*date_field*, 'YYYY-MM')) inlineview1,
(SELECT TO_CHAR(*date_field*, 'YYYY-MM') AS year_mon, count(*any_field_name*) as CNT
FROM *schemaname.tablename*
WHERE *date_field* > TO_DATE('*{a year}-{a month}-{a day}*', 'YYYY-MM-DD') and
*date_field* < TO_DATE('*{a year}-{a month}-{a day}*', 'YYYY-MM-DD') and
*another_field* = *{value_of_some_kind}* -- ... etc. ...
GROUP BY TO_CHAR(*date_field*, 'YYYY-MM')) inlineview2
WHERE
inlineview1.year_mon = inlineview2.year_mon
order by *either or any of the final resultset's fields* -- optional
A bit less abstractly, an example wherein a bookseller wants to see the net number of books sold in any given month of 2013. To do this, the seller must subtract the number of books returned for refund from the number sold. He does not care when a returned book was originally sold, as he feels a return represents a loss of a sale and income no matter when it occurs. Example:
select bookssold.year_mon, (bookssold.CNT - booksreturned.CNT) as netsalescount from
(SELECT TO_CHAR(SALE_DATE, 'YYYY-MM') AS year_mon, count(TITLE) as CNT
FROM RETAILOPS.ACTIVITY
WHERE SALE_DATE > TO_DATE('2012-12-31', 'YYYY-MM-DD') and
SALE_DATE < TO_DATE('2014-01-01', 'YYYY-MM-DD') and
OPERATION = 'sale'
GROUP BY TO_CHAR(SALE_DATE, 'YYYY-MM')) bookssold,
(SELECT TO_CHAR(SALE_DATE, 'YYYY-MM') AS year_mon, count(TITLE) as CNT
FROM RETAILOPS.ACTIVITY
WHERE SALE_DATE > TO_DATE('2012-12-31', 'YYYY-MM-DD') and
SALE_DATE < TO_DATE('2014-01-01', 'YYYY-MM-DD') and
OPERATION = 'return'
GROUP BY TO_CHAR(SALE_DATE, 'YYYY-MM')) booksreturned
WHERE
bookssold.year_mon = booksreturned.year_mon
order by bookssold.year_mon desc
Note that to be sure the query returns as expected, the two in-line views must be equijoined based as shown above on some criteria, as in:
bookssold.year_mon = booksreturned.year_mon
or the subtraction of the counted records can't be done on a 1:1 basis, as the query parser will not know which of the records returned with a grouped count value is to be subtracted from which. Failing to specify an equijoin condition will yield a Cartesian join result, probably not what you want (though you may indeed want that). For example, adding 'booksreturned.year_mon' right after 'bookssold.year_mon' to the returned fields list in the top-level select statement in the above example and eliminating the
bookssold.year_mon = booksreturned.year_mon
criteria in its WHERE clause will produce a working query that does the subtraction on the CNT values for the YYYY-MM values in the first two columns of the resultset and shows the result in the third column. Handy to know if you need it, as it has solid application in business trend analysis when you compare sales and returns not just within a given timeframe but across timeframes in a 1:N fashion.
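One caveat worth adding: the plain equijoin drops any month that has sales but no returns. A hedged variant using a left join and coalesce (same RETAILOPS.ACTIVITY table as above) keeps those months:
select s.year_mon, (s.cnt - coalesce(r.cnt, 0)) as netsalescount
from (select to_char(SALE_DATE, 'YYYY-MM') as year_mon, count(TITLE) as cnt
      from RETAILOPS.ACTIVITY
      where OPERATION = 'sale'
      group by to_char(SALE_DATE, 'YYYY-MM')) s
left join (select to_char(SALE_DATE, 'YYYY-MM') as year_mon, count(TITLE) as cnt
           from RETAILOPS.ACTIVITY
           where OPERATION = 'return'
           group by to_char(SALE_DATE, 'YYYY-MM')) r
on s.year_mon = r.year_mon
order by s.year_mon desc;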