Hive LIMIT changes GROUP BY result - hive

SELECT `col1`
, `col2`
, count(*)
FROM `tab1`
GROUP BY `col1`
, `col2`
limit 10;
+-------+-------+--------+
| col1 | col2 | _c2 |
+-------+-------+--------+
| A | A | 1 |
| A | B | 34241 |
| A | C | 12345 |
| A | D | 145 |
| A | E | 26 |
| A | F | 224547 |
| B | A | 1429 |
| B | B | 25 |
| B | C | 94 |
| B | D | 1 |
+-------+-------+--------+
If I take one of the results from that, and do a specific query for that combination, the result changes.
SELECT `col1`
, `col2`
, count(*)
FROM `tab1`
WHERE `col1`='A'
AND `col2`='B'
GROUP BY `col1`
, `col2`;
+-------+-------+--------+
| col1 | col2 | _c2 |
+-------+-------+--------+
| A | B | 38944 |
+-------+-------+--------+
If I run set hive.map.aggr=true; then I get a different count, somewhere in between the two.
Any ideas why or how to fix?
If I run the same query with LIMIT 20 then it gives the right count. Or, I should say, the same count as the WHERE query; I haven't counted the rows myself to check that it is correct.

Related

SQL return only rows where value exists multiple times and other value is present

I have a table like this in MS SQL SERVER
+------+------+
| ID | Cust |
+------+------+
| 1 | A |
| 1 | A |
| 1 | B |
| 1 | B |
| 2 | A |
| 2 | A |
| 2 | A |
| 2 | B |
| 3 | A |
| 3 | B |
| 3 | B |
| 3 | C |
| 3 | C |
+------+------+
I don't know the values in column "Cust" and I want to return all rows where the value of "Cust" appears multiple times and where at least one of the "ID" values is "1".
Like this:
+------+------+
| ID | Cust |
+------+------+
| 1 | A |
| 1 | A |
| 1 | B |
| 1 | B |
| 2 | A |
| 2 | A |
| 2 | A |
| 2 | B |
| 3 | A |
| 3 | B |
| 3 | B |
+------+------+
Any ideas? I can't find it.
You can use the COUNT window function as follows:
SELECT ID, Cust
FROM
(
SELECT ID, Cust,
COUNT(*) OVER (PARTITION BY Cust) cn,
COUNT(CASE WHEN ID=1 THEN 1 END) OVER (PARTITION BY Cust) cn2
FROM table_name
) T
WHERE cn>1 AND cn2>0
ORDER BY ID, Cust
COUNT(*) OVER (PARTITION BY Cust) checks whether the value of "Cust" appears multiple times.
COUNT(CASE WHEN ID=1 THEN 1 END) OVER (PARTITION BY Cust) checks that at least one of the "ID" values for that "Cust" is "1".
See a demo.
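As a rough check of the window-function query above, here is a minimal sketch that runs it against the sample rows from the question. It assumes SQLite (version 3.25 or later, for window functions) as a stand-in for SQL Server, driven from Python's sqlite3 module; only the table name table_name and the data come from the thread.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (ID INTEGER, Cust TEXT)")

# Sample rows copied from the question.
rows = [(1, "A"), (1, "A"), (1, "B"), (1, "B"),
        (2, "A"), (2, "A"), (2, "A"), (2, "B"),
        (3, "A"), (3, "B"), (3, "B"), (3, "C"), (3, "C")]
conn.executemany("INSERT INTO table_name VALUES (?, ?)", rows)

query = """
SELECT ID, Cust
FROM (
    SELECT ID, Cust,
           COUNT(*) OVER (PARTITION BY Cust) AS cn,
           COUNT(CASE WHEN ID = 1 THEN 1 END) OVER (PARTITION BY Cust) AS cn2
    FROM table_name
) AS T
WHERE cn > 1 AND cn2 > 0
ORDER BY ID, Cust
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

With this data, "C" appears twice but never with ID 1, so its rows are filtered out by cn2 > 0 while every "A" and "B" row is kept.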

Select minimum of two rows and then sort (with no complete grouping)

I have a Microsoft Access table with the following values:
id | C | D | ED | T |
---+-------+-----+------------+---+
1 | 33105 | ABC | 2020/01/04 | 1 |
2 | 33105 | ABC | 2020/01/08 | 2 |
3 | 33102 | DEF | 2020/02/01 | 2 |
4 | 34145 | GHI | 2020/02/09 | 1 |
5 | 34145 | GHI | 2020/02/10 | 2 |
6 | 34162 | JKL | 2020/02/08 | 1 |
For each value of C, I would like to extract the row with the lowest T (with this precedence) and finally sort the results by date (ED) descending. So my expected result is the following:
id | C | D | ED | T |
---+-------+-----+------------+---+
4 | 34145 | GHI | 2020/02/09 | 1 |
6 | 34162 | JKL | 2020/02/08 | 1 |
3 | 33102 | DEF | 2020/02/01 | 2 |
1 | 33105 | ABC | 2020/01/04 | 1 |
What's the fastest way in SQL to do so (the table is actually pretty large)?
You can do it with NOT EXISTS:
SELECT t.*
FROM tablename AS t
WHERE NOT EXISTS (SELECT 1 FROM tablename WHERE C = t.C AND T < t.T)
ORDER BY t.ED DESC
Or with a correlated subquery:
SELECT t.*
FROM tablename AS t
WHERE t.T = (SELECT MIN(T) FROM tablename WHERE C = t.C)
ORDER BY t.ED DESC
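To see how the NOT EXISTS approach behaves on the sample data, here is a minimal sketch using SQLite through Python's sqlite3 module as a stand-in for Access. The aliases cur and other, the explicit ORDER BY on ED, and storing ED as YYYY/MM/DD text (which sorts correctly as a string) are assumptions made for this illustration.

```python
import sqlite3

# Assumption: SQLite stands in for Microsoft Access.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename (id INTEGER, C INTEGER, D TEXT, ED TEXT, T INTEGER)"
)

# Sample rows copied from the question.
rows = [
    (1, 33105, "ABC", "2020/01/04", 1),
    (2, 33105, "ABC", "2020/01/08", 2),
    (3, 33102, "DEF", "2020/02/01", 2),
    (4, 34145, "GHI", "2020/02/09", 1),
    (5, 34145, "GHI", "2020/02/10", 2),
    (6, 34162, "JKL", "2020/02/08", 1),
]
conn.executemany("INSERT INTO tablename VALUES (?, ?, ?, ?, ?)", rows)

# Keep a row only if no other row with the same C has a smaller T,
# then sort the survivors by date, newest first.
query = """
SELECT cur.id, cur.C, cur.D, cur.ED, cur.T
FROM tablename AS cur
WHERE NOT EXISTS (
    SELECT 1 FROM tablename AS other
    WHERE other.C = cur.C AND other.T < cur.T
)
ORDER BY cur.ED DESC
"""
result = conn.execute(query).fetchall()
ids = [row[0] for row in result]
print(ids)
```

On a large table, an index covering (C, T) would probably help either form of the query, though that depends on the engine and the data.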

Get new Id + 1 for each group in SQL

Please help me figure out a way to get, from a data set, the first Number of each group that has not already been taken by another group. I don't really know how to explain it, so here is an example:
Id | Col1 | Col2 | Value | Number
------+-------+------+----------+-------
17525 | A | B | 1086.00 | 1
17525 | A | B | 1086.00 | 2
17525 | A | B | 1086.00 | 3
17526 | A | B | 1378.00 | 1
17526 | A | B | 1378.00 | 2
17526 | A | B | 1378.00 | 3
17527 | A | B | 1498.00 | 1
17527 | A | B | 1498.00 | 2
17527 | A | B | 1498.00 | 3
For each Id or Value (it doesn't matter which; they identify the same groups), I want the first Number that has not already been taken by an earlier group.
Something like this:
Id | Col1 | Col2 | Value | Number
------+-------+------+----------+-------
17525 | A | B | 1086.00 | 1
17526 | A | B | 1378.00 | 2
17527 | A | B | 1498.00 | 3
So for the first value, 1086.00, I'll take Number 1; for the second value, 1378.00, I'll take Number 2, because 1 is already taken by the first value.
I tried for 3 hours: ROW_NUMBER doesn't work, and a recursive CTE couldn't get past the max recursion limit of 100 error.
Please HELP!
Thanks.
Have you considered using dense_rank()?
select distinct Id, Col1, Col2, Value
, dr = dense_rank() over (order by Id)
from t
returns:
+-------+------+------+---------+----+
| Id | Col1 | Col2 | Value | dr |
+-------+------+------+---------+----+
| 17525 | A | B | 1086,00 | 1 |
| 17526 | A | B | 1378,00 | 2 |
| 17527 | A | B | 1498,00 | 3 |
+-------+------+------+---------+----+
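The dense_rank() idea can be tried out with a minimal sketch like the one below. It assumes SQLite 3.25 or later via Python's sqlite3 module instead of SQL Server, so the alias is written as AS dr rather than dr = ...; the table name t is taken from the answer and the rows from the question.

```python
import sqlite3

# Assumption: SQLite (with window-function support) stands in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (Id INTEGER, Col1 TEXT, Col2 TEXT, Value REAL, Number INTEGER)"
)

# Three groups, each with Number 1..3, as in the question.
groups = [(17525, 1086.00), (17526, 1378.00), (17527, 1498.00)]
rows = [(id_, "A", "B", value, n) for id_, value in groups for n in (1, 2, 3)]
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?)", rows)

# Every row of a group gets the same dense rank, so DISTINCT collapses
# each group to a single row carrying its position among the Ids.
query = """
SELECT DISTINCT Id, Col1, Col2, Value,
       DENSE_RANK() OVER (ORDER BY Id) AS dr
FROM t
ORDER BY Id
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Note that this ranks the groups by Id rather than reading the Number column, which matches the expected output only because each group has consecutive Numbers starting at 1.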

select from table using sql query

Table
id | name | type | x | y | z | refer
-----+------------+---------------+---------------+-------------+------------------+-----------------
1001 | A | 4 | | | | 0
2000 | B | 2 | -1062731776 | | -65536 | 1001
2001 | C | 2 | 167772160 | | -16777216 | 1001
2002 | D | 2 | -1408237568 | | -1048576 | 1001
I need to select the columns name, x, y, z for each row whose refer column points to another row's id, and name must be the name of the row that is referred to.
Is it possible with a single query? Can anyone please help?
Here, the output should be:
name| x | y | z
----+-----------------+-------------+-----------------
A | -1062731776 | | -65536
A | 167772160 | | -16777216
A | -1408237568 | | -1048576
SELECT t1.name, t2.x, t2.y, t2.z FROM TABLENAME t1
JOIN TABLENAME t2 on t1.id = t2.refer
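Here is a minimal sketch of that self-join run against the sample table, assuming SQLite through Python's sqlite3 module and a table called tablename. The empty cells are assumed to be NULL, and the ORDER BY t2.id is added only to make the output order deterministic.

```python
import sqlite3

# Assumption: empty cells in the question's table are NULL (None here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename "
    "(id INTEGER, name TEXT, type INTEGER, x INTEGER, y INTEGER, z INTEGER, refer INTEGER)"
)
rows = [
    (1001, "A", 4, None, None, None, 0),
    (2000, "B", 2, -1062731776, None, -65536, 1001),
    (2001, "C", 2, 167772160, None, -16777216, 1001),
    (2002, "D", 2, -1408237568, None, -1048576, 1001),
]
conn.executemany("INSERT INTO tablename VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# t1 is the row being referred to (supplies name);
# t2 is the referring row (supplies x, y, z).
query = """
SELECT t1.name, t2.x, t2.y, t2.z
FROM tablename AS t1
JOIN tablename AS t2 ON t1.id = t2.refer
ORDER BY t2.id
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Row 1001 has refer = 0, which matches no id, so it contributes only its name to the other rows and does not appear as a referring row itself.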

Joining two tables and calculating divide-SUM from the resulting table in SQL Server

I have one table that looks like this:
+---------------+---------------+-----------+-------+------+
| id_instrument | id_data_label | Date | Value | Note |
+---------------+---------------+-----------+-------+------+
| 1 | 57 | 1.10.2010 | 200 | NULL |
| 1 | 57 | 2.10.2010 | 190 | NULL |
| 1 | 57 | 3.10.2010 | 202 | NULL |
| | | | | |
+---------------+---------------+-----------+-------+------+
And the other that looks like this:
+----------------+---------------+---------------+--------------+-------+-----------+------+
| id_fundamental | id_instrument | id_data_label | quarter_code | value | AnnDate | Note |
+----------------+---------------+---------------+--------------+-------+-----------+------+
| 1 | 1 | 20 | 20101 | 3 | 28.2.2010 | NULL |
| 2 | 1 | 20 | 20102 | 4 | 1.8.2010 | NULL |
| 3 | 1 | 20 | 20103 | 5 | 2.11.2010 | NULL |
| | | | | | | |
+----------------+---------------+---------------+--------------+-------+-----------+------+
What I would like to do is to merge/join these two tables in one in a way that I get something like this:
+------------+--------------+--------------+----------+--------------+
| Date | Table1.Value | Table2.Value | AnnDate | quarter_code |
+------------+--------------+--------------+----------+--------------+
| 1.10.2010. | 200 | 3 | 1.8.2010 | 20102 |
| 2.10.2010. | 190 | 3 | 1.8.2010 | 20102 |
| 3.10.2010. | 202 | 3 | 1.8.2010 | 20102 |
| | | | | |
+------------+--------------+--------------+----------+--------------+
So the idea is to order the rows by Date from Table1 and, since the Table2 values only change when AnnDate changes, to populate the resulting table with the same Table2 values until the next AnnDate.
After that I would like to go through the resulting table and create another one (the final table) as follows.
For Date 1.10.2010, take the last 4 AnnDates (so 1.8.2010 and, for example, 20.3.2010, 30.1.2010 and 15.11.2009) and the Table2 values on those AnnDates. Sum those 4 values and then divide the Table1 Value by that sum.
So we would get something like:
+-----------+---------------------------------------------------------------+
| Date | FinalValue |
+-----------+---------------------------------------------------------------+
| 1.10.2010 | 200/(Table2.Value on 1.8.2010+Table2.Value on 20.3.2010 +...) |
| | |
+-----------+---------------------------------------------------------------+
Is there any way this can be done?
EDIT:
Hmm, yes, now I see that I really didn't do a good job of explaining it.
What I wanted to say is that I tried an INNER JOIN like this:
SELECT TableOne.Date, TableOne.Value, TableTwo.Value, TableTwo.AnnDate, TableTwo.quarter_code
FROM TableOne
INNER JOIN TableTwo ON TableOne.id_instrument = TableTwo.id_instrument
WHERE TableOne.id_data_label = somevalue
AND TableTwo.id_data_label = somevalue
AND date > xxx
AND date < yyy
This inner join returns 2620*40 rows, which means that for every AnnDate from Table2 it returns all Dates from Table1.
What I want is to return 2620 rows: the Dates from Table1, the Values from Table1 on those dates, and the Values from Table2 that correspond to that period of dates.
For example:
Table1:
+-------+-------+
| Date | Value |
+-------+-------+
| 1 | a |
| 2 | b |
| 3 | c |
| 4 | d |
+-------+-------+
Table2
+-------+---------+
| Value | AnnDate |
+-------+---------+
| x | 1 |
| y | 4 |
+-------+---------+
Resulting table:
+-------+---------+---------+
| Date | ValueT1 | ValueT2 |
+-------+---------+---------+
| 1 | a | x |
| 2 | b | x |
| 3 | c | x |
| 4 | d | y |
+-------+---------+---------+
You need a JOIN statement for your first query. Try:
SELECT TableOne.Date, TableOne.Value, TableTwo.Value, TableTwo.AnnDate, TableTwo.quarter_code FROM TableOne
INNER JOIN TableTwo
ON TableOne.id_instrument = TableTwo.id_instrument;
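A plain join on id_instrument pairs every Table1 row with every Table2 row, which is the 2620*40 result described in the question. One possible way to get a single Table2 row per date is to join each date to the latest AnnDate that is not after it. The sketch below illustrates this on the simplified tables from the question's edit, assuming SQLite through Python's sqlite3 module; the aliases and the subquery are assumptions for this illustration, not part of the answer above, and the id_instrument and id_data_label conditions are left out for brevity.

```python
import sqlite3

# Assumption: the simplified tables from the question's edit,
# with Date and AnnDate as plain integers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableOne (Date INTEGER, Value TEXT)")
conn.execute("CREATE TABLE TableTwo (Value TEXT, AnnDate INTEGER)")
conn.executemany(
    "INSERT INTO TableOne VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")],
)
conn.executemany(
    "INSERT INTO TableTwo VALUES (?, ?)",
    [("x", 1), ("y", 4)],
)

# For each TableOne row, pick the TableTwo row whose AnnDate is the
# most recent one on or before that row's Date.
query = """
SELECT t1.Date, t1.Value AS ValueT1, t2.Value AS ValueT2
FROM TableOne AS t1
JOIN TableTwo AS t2
  ON t2.AnnDate = (
      SELECT MAX(latest.AnnDate)
      FROM TableTwo AS latest
      WHERE latest.AnnDate <= t1.Date
  )
ORDER BY t1.Date
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

In the real tables the subquery and the join would presumably also need to match on id_instrument and filter on id_data_label, so that each instrument only sees its own announcements.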