Any easier way to group by individual columns in Hive/Impala? - hive

I need to output a report of users by their age, gender, education, income, etc. from our database. However, there are about 40 variables. It seems silly to group by each variable one by one, but I'm not aware of other ways and I don't know how to write a UDF to solve it yet. I'd appreciate your help.
It's not that complicated but it does come up a lot in daily work. My work environment is Hive/Impala.

A 'Group By' over input rows cannot be implemented in a UDF, UDAF, or UDTF.
A UDF takes a single input row and produces a single output row.
A UDAF only aggregates column values; it does not group rows.
A UDTF transforms a single input row into multiple output rows.
The simplest solution in plain SQL is to write one query per column, combine them with UNION ALL, and display the result or insert it into a table.
Sample query:
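SELECT *
FROM
(
SELECT 'column1' AS variable, CAST(column1 AS STRING) AS value, COUNT(column1) AS cnt FROM table GROUP BY column1
UNION ALL
SELECT 'column2' AS variable, CAST(column2 AS STRING) AS value, COUNT(column2) AS cnt FROM table GROUP BY column2
UNION ALL
SELECT 'column3' AS variable, CAST(column3 AS STRING) AS value, COUNT(column3) AS cnt FROM table GROUP BY column3
) s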
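With about 40 variables, the UNION ALL branches do not have to be typed by hand; the query text can be generated from a list of column names. The following is only a minimal sketch: it uses Python's standard-library sqlite3 module so that it runs anywhere, and the table name users, its columns, and the helper build_union_query are all hypothetical. In Hive/Impala the cast would be CAST(col AS STRING) rather than CAST(col AS TEXT).

```python
import sqlite3

def build_union_query(table, columns):
    """Build one GROUP BY branch per column and join them with UNION ALL."""
    branches = [
        # The literal column name labels each row with the variable it counts.
        # COUNT(*) also counts rows where the column is NULL.
        f"SELECT '{col}' AS variable, CAST({col} AS TEXT) AS value, COUNT(*) AS n "
        f"FROM {table} GROUP BY {col}"
        for col in columns
    ]
    return "\nUNION ALL\n".join(branches)

# Hypothetical sample data standing in for the real user table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (age INTEGER, gender TEXT, education TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(30, "F", "BSc"), (30, "M", "MSc"), (41, "F", "BSc")],
)

query = build_union_query("users", ["age", "gender", "education"])
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The generated string can also be printed and pasted into a Hive or Impala client instead of being executed from Python.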

Related

Need SQL with subquery to get distinct values for VBA code

I have a table BAR_DATA with two fields: LongDate, Time. Both are long integers. No Access Date/Time involved here.
For each distinct LongDate value there are hundreds of records, each with Time value which may be distinct or duplicate within that LongDate.
I need to create an SQL statement that will group by LongDate and give me a count of distinct Times within each LongDate.
The following SQL statement (built by an Access query) does NOT work (some LongDates are omitted):
Query A
SELECT DISTINCT BAR_DATA.LongDate, Count(BAR_DATA.Time) AS CountOfTime
FROM BAR_DATA
GROUP BY BAR_DATA.LongDate
HAVING (((Count(BAR_DATA.Time))<>390 And (Count(BAR_DATA.Time))<>210));
However, if I use Query B to reference Query DistinctDateTime, it does work:
Query B
SELECT DistinctDateTime.LongDate, Count(DistinctDateTime.Time) AS CountOfTime
FROM DistinctDateTime
GROUP BY DistinctDateTime.LongDate
HAVING (((Count(DistinctDateTime.Time))<>390 And (Count(DistinctDateTime.Time))<>210));
Query DistinctDateTime
SELECT DISTINCT BAR_DATA.LongDate, BAR_DATA.Time
FROM BAR_DATA;
My problem:
I need to get Query B and Query DistinctDateTime wrapped into a single SQL statement so I can paste it into a VBA function. I presume there is some subquery technique, but I have failed at every attempt and can find no pertinent example.
Any help will be greatly appreciated. Thanks!
Put the DISTINCT query inside as a subquery and perform the aggregates outside:
SELECT DistinctDateTime.LongDate, Count(DistinctDateTime.Time) AS CountOfTime
FROM
(
SELECT DISTINCT BAR_DATA.LongDate, BAR_DATA.Time
FROM BAR_DATA
) AS DistinctDateTime
GROUP BY DistinctDateTime.LongDate
HAVING (((Count(DistinctDateTime.Time))<>390 And (Count(DistinctDateTime.Time))<>210));
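To see why the derived table matters, here is a small sketch that compares a plain count with a count over the DISTINCT subquery. It runs against SQLite through Python's sqlite3 module rather than Access, the sample rows are made up, and the HAVING filter on 390 and 210 is left out because the sample is tiny.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BAR_DATA (LongDate INTEGER, Time INTEGER)")
conn.executemany(
    "INSERT INTO BAR_DATA VALUES (?, ?)",
    [
        (20240102, 930), (20240102, 930), (20240102, 931),  # one duplicate Time
        (20240103, 930), (20240103, 931), (20240103, 932),  # all Times distinct
    ],
)

# Counting directly on the base table counts duplicate Times as well.
raw_counts = conn.execute(
    "SELECT LongDate, COUNT(Time) FROM BAR_DATA GROUP BY LongDate ORDER BY LongDate"
).fetchall()

# Counting over the DISTINCT derived table counts each Time once per LongDate.
distinct_counts = conn.execute(
    """
    SELECT d.LongDate, COUNT(d.Time) AS CountOfTime
    FROM (SELECT DISTINCT LongDate, Time FROM BAR_DATA) AS d
    GROUP BY d.LongDate
    ORDER BY d.LongDate
    """
).fetchall()

print(raw_counts)
print(distinct_counts)
```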

Is it possible to combine or reduce these sql statements?

I'm not particularly familiar with SQL, but my team asked me to look at this series of SQL statements and see whether it can be reduced to just 1 or 2. I looked at it and I don't believe so, but I don't have the experience or knowledge of the tricks of SQL.
So all of the statements have pretty much the same format
select
id, count(*)
from
table
where
x_date between to_date(start_date) and to_date(end_date)
group by
id
where x_date is the only thing that changes. (Start_date and end_date are just what I typed here to make it a bit more readable). There are 10 statements total, 7 of which are exactly this format.
Of the 3 different ones, one of them looks like this:
select
id, count(*)
from
table
where
x_date between to_date(start_date) and to_date(end_date)
and userid not like 'AUTOPEND'
group by
id
and the other 2 look like this:
select
id, count(*)
from
table
where
x_date between to_date(start_date) and to_date(end_date)
group by
id, x_code
Where x_code differs between them.
They want to use this data for statistical analysis, but they insist on manually using a query and typing it in. The way I see it is that I can't really combine these statements because they are all grouping by the same field (except the last 2), so it all gets combined in the results, making the results useless for the analysis.
Am I thinking about it right, or is there some way to do what they asked? Can I make 1 or 2 SQL statements output more than 1 table each?
Oh, I almost forgot. I believe that this is Oracle PL/SQL using SQL developer.
You are trying to get multiple aggregates with different grouping sets in a single query. Oracle supports this through GROUPING SETS and the related ROLLUP and CUBE extensions to GROUP BY. There are a few ways to solve your specific problem, but the extended grouping functions are the right tool for the job. Going forward, it will be more maintainable and faster.
http://www.oracle-base.com/articles/misc/rollup-cube-grouping-functions-and-grouping-sets.php
Since the first and second examples group by the same thing, you can use a CASE expression nested in an aggregate function:
SELECT id, COUNT(*),
SUM(CASE WHEN userid NOT LIKE 'AUTOPEND' THEN 1 ELSE 0 END) AS NotAUTOPEND
FROM table
WHERE x_date between to_date(start_date) and to_date(end_date)
GROUP BY id
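As a rough illustration of the conditional aggregation, the sketch below runs the same pattern against SQLite through Python's sqlite3 module. The table name events, the sample rows, and the ISO-formatted text dates are assumptions made for the example; Oracle's to_date calls are replaced by plain string comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, userid TEXT, x_date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "alice", "2024-01-05"),
        (1, "AUTOPEND", "2024-01-06"),
        (1, "bob", "2024-01-07"),
        (2, "AUTOPEND", "2024-01-08"),
        (2, "carol", "2024-03-01"),  # outside the date range, so ignored
    ],
)

# One pass over the table yields both the total count and the filtered count.
results = conn.execute(
    """
    SELECT id,
           COUNT(*) AS total,
           SUM(CASE WHEN userid NOT LIKE 'AUTOPEND' THEN 1 ELSE 0 END) AS not_autopend
    FROM events
    WHERE x_date BETWEEN ? AND ?
    GROUP BY id
    ORDER BY id
    """,
    ("2024-01-01", "2024-01-31"),
).fetchall()

print(results)
```

Each id appears once, with the unfiltered count and the count excluding AUTOPEND side by side, so two of the original statements collapse into one.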

counting rows in select clause with DB2

I would like to query a DB2 table and get all the results of a query, plus the number of rows returned by the select statement in a separate column.
E.g., if the table contains columns 'id' and 'user_id', assuming 100 rows, the result of the query would appear in this format: (id) | (user_id) | 100.
I do not wish to use a 'group by' clause in the query. (Just in case you are confused about what I am asking.) Also, I could not find an example here: http://mysite.verizon.net/Graeme_Birchall/cookbook/DB2V97CK.PDF.
Also, if there is a more efficient way of getting both these results (values + count), I would welcome any ideas. My environment uses zend framework 1.x, which does not have an ODBC adapter for DB2. (See issue http://framework.zend.com/issues/browse/ZF-905.)
If I understand what you are asking for, then the answer should be
select t.*, g.tally
from mytable t,
(select count(*) as tally
from mytable
) as g;
If this is not what you want, then please give an actual example of desired output, supposing there are 3 to 5 records, so that we can see exactly what you want.
You would use window/analytic functions for this:
select t.*, count(*) over() as NumRows
from table t;
This will work for whatever kind of query you have.
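Both answers can be compared side by side. The sketch below uses SQLite through Python's sqlite3 module instead of DB2, with a made-up three-row mytable, and it assumes a SQLite build recent enough to support window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, user_id INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

# Window function: COUNT(*) OVER () repeats the total row count on every row.
windowed = conn.execute(
    "SELECT t.*, COUNT(*) OVER () AS NumRows FROM mytable t ORDER BY t.id"
).fetchall()

# Cross join against a one-row subquery that holds the total.
joined = conn.execute(
    "SELECT t.*, g.tally FROM mytable t, "
    "(SELECT COUNT(*) AS tally FROM mytable) AS g ORDER BY t.id"
).fetchall()

print(windowed)
print(joined)
```

The window version has the practical advantage that the count follows any WHERE clause automatically, whereas the cross join needs the filter repeated inside the subquery.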

Column manipulation

I have a query that generates 8 columns worth of data from a list of unique IDs. These columns are then copied into an excel sheet for holding. I am attempting to find a way to either write a query or find an excel function to get it to generate 2 columns: 1 with the list of unique IDs and the other with the number of times it appears in the initial 8 columns. Any thoughts or comments would be most welcome.
Thanks for the help.
If I understand your problem correctly, you can write a query that returns your two columns. Unfortunately, it's a little tedious, but it should work. This is generic enough to work in any RDBMS. There are probably more elegant solutions using specific functions of a particular RDBMS.
SELECT A.UniqueID, SUM(A.IDCounter) AS IDCount FROM
(
SELECT UniqueIDCol1 AS UniqueID, Count(UniqueIDCol1) AS IDCounter
FROM MyTable
GROUP BY UniqueIDCol1
UNION ALL
SELECT UniqueIDCol2 AS UniqueID, Count(UniqueIDCol2) AS IDCounter
FROM MyTable
GROUP BY UniqueIDCol2
UNION ALL
.
.
.
SELECT UniqueIDCol8 AS UniqueID, Count(UniqueIDCol8) AS IDCounter
FROM MyTable
GROUP BY UniqueIDCol8
) AS A
GROUP BY A.UniqueID
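A scaled-down version of this query can be tried out as follows. It is only a sketch: it uses SQLite through Python's sqlite3 module, three ID columns instead of eight, and invented sample values, and it builds the repeated branches from a list of column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable (UniqueIDCol1 TEXT, UniqueIDCol2 TEXT, UniqueIDCol3 TEXT)"
)
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?)",
    [("A", "B", "C"), ("A", "A", "B"), ("C", "B", "A")],
)

# One GROUP BY branch per ID column, stacked with UNION ALL.
columns = ["UniqueIDCol1", "UniqueIDCol2", "UniqueIDCol3"]
branches = "\nUNION ALL\n".join(
    f"SELECT {col} AS UniqueID, COUNT({col}) AS IDCounter FROM MyTable GROUP BY {col}"
    for col in columns
)

# The outer query adds up the per-column counts for each ID.
query = f"""
SELECT A.UniqueID, SUM(A.IDCounter) AS IDCount
FROM ({branches}) AS A
GROUP BY A.UniqueID
ORDER BY A.UniqueID
"""
id_counts = conn.execute(query).fetchall()
print(id_counts)
```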

Why does the number of rows increase in a SELECT statement with INNER JOIN when a second column is selected?

I am writing some queries with self-joins in SQL Server. When I have only one column in the SELECT clause, the query returns a certain number of rows. When I add another column, from the second instance of the table, to the SELECT clause, the results increase by 1000 rows!
How is this possible?
Thanks.
EDIT:
I have a subquery in the FROM clause, which is also a self-join on the same table.
The only thing I can think of is that you have SELECT DISTINCT, and the additional column makes some rows distinct that weren't before.
For example, I would expect the second query to return many more rows:
SELECT DISTINCT First_name FROM Table
vs
SELECT DISTINCT First_name, Last_name FROM Table
But if we had the actual SQL, something else might come to mind.
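The effect is easy to reproduce on a toy table. This sketch uses SQLite through Python's sqlite3 module rather than SQL Server, and the people table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ann", "Lee"), ("Ann", "Kim"), ("Bob", "Lee"), ("Ann", "Lee")],
)

# DISTINCT applies to the whole select list, not just to the first column.
one_column = conn.execute("SELECT DISTINCT first_name FROM people").fetchall()
two_columns = conn.execute(
    "SELECT DISTINCT first_name, last_name FROM people"
).fetchall()

print(len(one_column), len(two_columns))
```

Adding last_name turns the single "Ann" row into two distinct rows, so the result grows even though the underlying data and joins are unchanged.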