Grouping by a large number of columns with ordinal reference - SQL

I have a lengthy query in Postgres that requires grouping by 30 different columns referenced in the query.
At the moment, I have manually listed the GROUP BY columns by their ordinal position in the query:
SELECT ...
GROUP BY 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30
Does a better way of doing this exist? I have tried:
SELECT...
GROUP BY array(generate_series(1, 30))
But this method fails. Is there a way to use generate_series in the GROUP BY clause to reference columns by ordinal position in cases where a query requires a large number of GROUP BY columns?
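For what it's worth, a hedged sketch: generate_series cannot be expanded inside the GROUP BY clause itself (it produces rows, not a column list), but assuming a Postgres version with string_agg (9.0+), it can build the ordinal list once, for pasting into the query:

SELECT string_agg(i::text, ',' ORDER BY i) AS group_by_list
FROM generate_series(1, 30) AS i;
-- returns the text '1,2,3,...,30', ready to paste after GROUP BY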

Related

Using Distinct in Aggregate Select query

I am using an Oracle DB. I have an aggregation script. We found that some of the rows in the table are repeated and unwanted, and hence should not be added to the sum.
Now suppose I use the DISTINCT keyword just after the SELECT statement: will DISTINCT be applied before the aggregation or after it?
If you use SELECT DISTINCT, then the result set will have no duplicate rows.
If you use SELECT COUNT(DISTINCT), then the count will only count distinct values.
If you are thinking of using SUM(DISTINCT) (or DISTINCT with any other aggregation function), be warned: I have never used it (except perhaps as a demonstration), and I have written a fair number of queries.
You really need to solve the problem at the source. For instance, if accounts are being repeated, then SUM(DISTINCT) does not distinguish between accounts, only between the values assigned to the accounts. You need to get the logic right.
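To illustrate that warning with a hypothetical payments table holding the rows (account A, 100), (account B, 100), (account B, 200):

SELECT SUM(amount) AS total,            -- 400
       SUM(DISTINCT amount) AS total_d  -- 300: the two 100s collapse into one,
                                        -- even though they belong to different accounts
FROM payments;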
When you say that you have repeated rows, you must have a clear idea of uniqueness for the combination of some specific columns.
If you expect that certain column combinations are unique within specified groups, you can detect the groups deviating from that expectation using queries following the pattern below.
select <your group by columns>
from <your table name>
group by <your group by predicate>
having (max(A)!=min(A) or max(B)!=min(B) or max(C)!=min(C))
Then you have to decide what to do with the problem. I would suggest cleaning up and adding unique constraints to the table.
The aggregate query you mention would run successfully for the rows in your table that do not have duplicate values for the combination of columns that needs to be unique. Using my example, you could get the aggregates for that part of your data using the inverted HAVING predicate.
It would be something like this:
select <your aggregate functions, counts, sums, averages and so on>
from <your table name>
group by <your group by predicate>
having (max(A)=min(A) and max(B)=min(B) and max(C)=min(C))
If you must include the groups that break the uniqueness expectation, you have to make a qualified selection of which of the variants in the group to use - for example the first or the last one, if one of your columns happens to express something about when the row was created.
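A sketch of that selection using a window function, following the same placeholder convention as above (created_at is a hypothetical column recording when the row was created):

select <your aggregate functions, counts, sums, averages and so on>
from (
    select t.*,
           row_number() over (partition by <columns that should be unique>
                              order by created_at desc) as rn
    from <your table name> t
) deduped
where rn = 1
group by <your group by predicate>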

exclude a column from group by statement

I would like to exclude a column from the GROUP BY statement, because it results in some redundant records. Are there any recommendations?
I use Oracle and have a complex query which joins 6 tables together, and I want to use the SQL aggregate function COUNT without duplicate results.
You can't.
When using aggregate functions every column/column expression which is not an aggregate must be in the GROUP BY.
This is completely logical. If you're not aggregating the column, then excluding it from the GROUP BY would force Oracle to choose an arbitrary value, which is not very useful.
If you don't want this column in your GROUP BY then you must decide what aggregation to apply to this column in order to return the appropriate data for your situation. You can't hand this responsibility off to the database engine.
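For instance (a hypothetical sketch - orders, dept and updated_at are made-up names), pick the aggregate that expresses what you actually want from the column:

select dept,
       count(*)        as order_count,  -- the count you are after
       max(updated_at) as last_update   -- a deliberate choice: the latest value per group
from orders
group by dept;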

Is the GROUP BY clause applied after the WHERE clause in Hive?

Suppose I have the following SQL:
select user_group, count(*)
from table
where user_group is not null
group by user_group
Suppose further that 99% of the data has null user_group.
Will this discard the rows with null before the GROUP BY, or will one poor reducer end up with 99% of the rows that are later discarded?
I hope it is the former. That would make more sense.
Bonus points if you say what will happen by Hive version. We are using 0.11 and migrating to 0.13.
Bonus points if you can point to any documentation that confirms.
Sequence
FROM & JOINs determine & filter rows
WHERE filters the rows further
GROUP BY combines those rows into groups
HAVING filters groups
SELECT evaluates the expressions in the select list
ORDER BY arranges the remaining rows/groups
The first step is always the FROM clause. In your case, this is pretty straight-forward, because there's only one table, and there aren't any complicated joins to worry about. In a query with joins, these are evaluated in this first step. The joins are assembled to decide which rows to retrieve, with the ON clause conditions being the criteria for deciding which rows to join from each table. The result of the FROM clause is an intermediate result. You could think of this as a temporary table, consisting of combined rows which satisfy all the join conditions. (In your case the temporary table isn't actually built, because the optimizer knows it can just access your table directly without joining to any others.)
The next step is the WHERE clause. In a query with a WHERE clause, each row in the intermediate result is evaluated according to the WHERE conditions, and either discarded or retained. So the NULL rows will be discarded before the GROUP BY clause is processed.
Next comes the GROUP BY. If there's a GROUP BY clause, the intermediate result is now partitioned into groups, one group for every combination of values in the columns in the GROUP BY clause.
Now comes the HAVING clause. The HAVING clause operates once on each group, and all rows from groups which do not satisfy the HAVING clause are eliminated.
Next comes the SELECT. From the rows of the new intermediate result produced by the GROUP BY and HAVING clauses, the SELECT now assembles the columns it needs.
Finally, the last step is the ORDER BY clause.
This query discards the rows with NULL before the GROUP BY operation.
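If you want to make the filtering explicit regardless of Hive version, you can spell out the same query with a subquery (semantically identical, just more defensive):

select user_group, count(*)
from (
    select user_group
    from table
    where user_group is not null
) t
group by user_group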
Hope this link will be useful:
http://dev.hortonworks.com.s3.amazonaws.com/HDPDocuments/HDP2/HDP-2.2.0/bk_dataintegration/content/hive-013-feature-subqueries-in-where-clauses.html

Apply the same aggregate to every column in a table

I am using a proprietary MPP database that was forked from PostgreSQL 8.3. I am trying to apply a simple count to a wide table (around 450 columns), and I was wondering about the best way to do this in terms of a simple SQL function. I just want to count the number of distinct values in a given column as well as the number of NULL values in the column.
For example, if I want to run the query against the column names, I write:
select
count(distinct names) d_names,
sum(case when names is not null then 1 else 0 end) n_s_ip
from table;
How do I generalize the query above to iterate through every column in the table, given that the number of columns is 450, without writing out each column name by hand?
First, since COUNT() only counts non-null values, your query can be simplified:
SELECT count(DISTINCT names) AS unique_names
,count(names) AS names_not_null
FROM table;
But that's the number of non-null values and contradicts your description:
count of the number of null values in the column
For that you would use:
count(*) - count(names) AS names_null
Since count(*) counts all rows and count(names) only rows with non-null names.
To automate that for all columns, build an SQL statement off of the catalog table pg_attribute dynamically (a rough sketch follows after the links). You can use EXECUTE in a PL/pgSQL function to execute it immediately. Find full code examples with links to the manual and explanations under these closely related questions:
How to perform the same aggregation on every column, without listing the columns?
postgresql - count (no null values) of each column in a table
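A minimal sketch of that dynamic approach, in modern PostgreSQL syntax (DO blocks and format() postdate 8.3, so a fork of that vintage would need a CREATE FUNCTION ... LANGUAGE plpgsql wrapper instead; mytable is a placeholder):

DO $$
DECLARE
   sql text;
BEGIN
   SELECT 'SELECT '
          || string_agg(
                 format('count(DISTINCT %I) AS d_%s, count(*) - count(%I) AS null_%s',
                        attname, attname, attname, attname),
                 ', ')
          || ' FROM mytable'
   INTO   sql
   FROM   pg_attribute
   WHERE  attrelid = 'mytable'::regclass
   AND    attnum > 0          -- skip system columns
   AND    NOT attisdropped;   -- skip dropped columns

   RAISE NOTICE '%', sql;     -- inspect the generated statement
   -- EXECUTE sql;            -- a DO block discards results; use a function
                              -- with RETURN QUERY EXECUTE to fetch the rows
END
$$;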
You can generate the repetitive part of the query by using information_schema.columns.
select 'count(distinct '||column_name||') d_'||column_name||', sum(case when '||column_name||' is not null then 1 else 0 end) n_'||column_name||','
from information_schema.columns where table_name='table'
order by ordinal_position;
The above query will generate a count(...) and a sum(...) term for each column of the table, each with a column-specific alias (the single-column aliases from the original example would otherwise repeat for every column). This result can be used as the select-list for your query. You can cut & paste the result into the following query:
select
-- paste here
from table;
After pasting, you have to remove the trailing comma.
In this way, you can avoid writing a select-list for 450 columns by hand.

How to get other columns in this query

I am using a GROUP BY clause in my query. I want to get other columns not specified in the GROUP BY parameters:
SELECT un.user, un.role
FROM [Unique] un
group by user, role
In the query above, [Unique] has 7 columns altogether. How do I get the other columns?
In most databases (MySQL and SQLite are the exceptions I know of), you cannot include a column in a GROUP BY SELECT unless:
The column is included in the GROUP BY clause.
The column is aggregated in one of the supported aggregate functions.
In MySQL and SQLite, the rows inside the aggregate groups from which the extra values get taken are undefined.
If you want extra columns in any other engine, you can wrap the column names in MAX():
SELECT un.user, un.role, MAX(un.city), MAX(un.bday)
FROM [Unique] un
GROUP BY user, role
In this case, the values for the extra columns are likely to come from different rows in the input record set. If this is important (sometimes it isn't since the extra columns come from the one side of a one-to-many JOIN), you can't use this technique.
Just to be clear: if you use GROUP BY in a SELECT, then each row you get back is constructed out of groups of multiple rows in the table you're SELECTing against. If you include columns that are not part of the GROUP BY clause, you're not giving the engine any instructions on which row from the table you want that value read from. Most engines, therefore, do not allow you to run this kind of SQL. MySQL does, with undefined results, but I personally consider it bad practice to do this.
You have to choose on what basis you want the other columns. If multiple entries exist for the same user/role combination, do you want the first, the last, or a random one? You have to make choices about the other columns, by aggregating them or by including them in the GROUP BY statement.
Some RDBMSs do provide a default behaviour for this, but since the question is just tagged SQL, we do not know whether that applies.
Have you tried just specifying them (and adding them to the GROUP BY as well)?
SELECT un.user, un.role, un.col3, un.col4
FROM [Unique] un
group by user, role, col3, col4
You need to use an ORDER BY to get the extra columns, or you end up specifying every column in your GROUP BY.
Use a LEFT JOIN to self-join [Unique], or use the SELECT with GROUP BY as a sub-query.
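A sketch of the sub-query variant (city, bday and the join keys are illustrative; adjust them to your actual columns):

SELECT un.user, un.role, un.city, un.bday, agg.cnt
FROM [Unique] un
JOIN (
    SELECT user, role, COUNT(*) AS cnt
    FROM [Unique]
    GROUP BY user, role
) agg ON agg.user = un.user AND agg.role = un.role;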