When using ROW_NUMBER() on Hive tables with more than 1024 rows, ROW_NUMBER() duplicates rows, giving each copy a separate row number. If rank() or dense_rank() is used instead, the issue is not present.
To reproduce, take a table with more than 1024 rows of distinct values and run a query involving ROW_NUMBER(). The result then contains the same distinct values twice, with row numbers 1024 apart.
SELECT cell
    ,mobilearea
    ,day
    ,row_number() OVER (
        PARTITION BY mobilearea
        ,day
        ) AS ind
FROM mk.sd_areadetails_tmp;
It shows the result below, with erroneously duplicated ind values:
cell mobilearea day ind
...
13e98 Tiankong 2022-04-11 1016
86e19 Tiankong 2022-04-11 1017
73a7e Tiankong 2022-04-11 1018
1dafd Tiankong 2022-04-11 1019
3e59e Tiankong 2022-04-11 1020
a3aaf Tiankong 2022-04-11 1021
0b3d1 Tiankong 2022-04-11 1022
ad190 Tiankong 2022-04-11 1023
994aa Tiankong 2022-04-11 *1024*
d552b Tiankong 2022-04-11 *1*
61623 Tiankong 2022-04-11 2
01869 Tiankong 2022-04-11 3
a5478 Tiankong 2022-04-11 4
4c63b Tiankong 2022-04-11 5
7f90b Tiankong 2022-04-11 6
6e3ab Tiankong 2022-04-11 7
294ad Tiankong 2022-04-11 8
4f739 Tiankong 2022-04-11 9
Related
Sample table:
emp   date         sal
698   28/11/2021   9200
724   02/01/2022   8700
The output should be:
emp   date         sal
698   28/11/2021   1314
698   29/11/2021   1314
698   30/11/2021   1314
698   01/12/2021   1314
698   02/12/2021   1314
698   03/12/2021   1314
698   04/12/2021   1314
724   02/01/2022   1242
724   03/01/2022   1242
724   04/01/2022   1242
724   05/01/2022   1242
724   06/01/2022   1242
724   07/01/2022   1242
724   08/01/2022   1242
Here, for each input row I should display 7 rows: the date increases by +1 day for up to 7 days, and sal should be divided by 7. Each row from the sample input should result in 7 rows in the sample output.
I need a query in Oracle SQL.
You can CROSS JOIN with a row generator to create the 7 days:
SELECT t.emp,
       t."DATE" + d.days AS "DATE",
       TRUNC(t.sal / 7) AS sal
FROM   table_name t
       CROSS JOIN (
         SELECT LEVEL - 1 AS days
         FROM   DUAL
         CONNECT BY LEVEL <= 7
       ) d
ORDER BY emp, "DATE"
Which, for the sample data:
CREATE TABLE table_name (emp, "DATE", sal) AS
SELECT 698, DATE '2021-11-28', 9200 FROM DUAL UNION ALL
SELECT 724, DATE '2021-01-02', 8700 FROM DUAL;
Outputs:
EMP   DATE                  SAL
698   2021-11-28 00:00:00   1314
698   2021-11-29 00:00:00   1314
698   2021-11-30 00:00:00   1314
698   2021-12-01 00:00:00   1314
698   2021-12-02 00:00:00   1314
698   2021-12-03 00:00:00   1314
698   2021-12-04 00:00:00   1314
724   2021-01-02 00:00:00   1242
724   2021-01-03 00:00:00   1242
724   2021-01-04 00:00:00   1242
724   2021-01-05 00:00:00   1242
724   2021-01-06 00:00:00   1242
724   2021-01-07 00:00:00   1242
724   2021-01-08 00:00:00   1242
Here's a compact way to do this - using a simple XQuery expression to generate numbers between 0 and 6:
select emp,
       date_ + xmlcast(column_value as number) as date_,
       round(sal/7, 2) as sal
from   table_name cross join xmltable('0 to 6');
Note - I changed the date column name to date_ (with an underscore). date is a reserved keyword, so it can't be used as a column name unless it is quoted. Also, obviously, you will need to use your actual table name.
I'm trying to do a weekly forecast in FBProphet for just 5 weeks ahead. The make_future_dataframe method doesn't seem to be working right: it makes the correct one-week intervals except for one gap of only two days between Jul 3 and Jul 5; every other interval is correct at 7 days, i.e. a week. Code and output below:
INPUT DATAFRAME
ds y
548 2010-01-01 3117
547 2010-01-08 2850
546 2010-01-15 2607
545 2010-01-22 2521
544 2010-01-29 2406
... ... ...
4 2020-06-05 2807
3 2020-06-12 2892
2 2020-06-19 3012
1 2020-06-26 3077
0 2020-07-03 3133
CODE
future = m.make_future_dataframe(periods=5, freq='W')
future.tail(9)
OUTPUT
ds
545 2020-06-12
546 2020-06-19
547 2020-06-26
548 2020-07-03
549 2020-07-05
550 2020-07-12
551 2020-07-19
552 2020-07-26
553 2020-08-02
All you need to do is create a dataframe with the dates you need for the predict method; using the make_future_dataframe method is not necessary. The jump from Jul 3 to Jul 5 happens because freq='W' is the pandas alias for 'W-SUN', so the generated future dates are anchored to Sundays (2020-07-05 is a Sunday) regardless of the weekday of your history.
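As a minimal sketch (assuming m is the Prophet model already fitted in the question and that the history ends on Friday 2020-07-03), the five future dates can be built with pd.date_range at a fixed 7-day spacing and passed straight to predict:
import pandas as pd

# Assumption: m is the fitted Prophet model from the question and the
# last observed date is Friday 2020-07-03.
last_date = pd.Timestamp('2020-07-03')

# Five future dates, exactly 7 days apart, keeping the Friday cadence.
future = pd.DataFrame({
    'ds': pd.date_range(start=last_date + pd.Timedelta(weeks=1),
                        periods=5, freq='7D')
})

forecast = m.predict(future)
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']])
If you also want the in-sample dates in the forecast, concatenate the history's ds column with these future dates before calling predict.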
I have the following data in a table:
qincId ID lc1 lc2 Time SP
963 544 22.3000526428 73.1743087769 2019-03-31 17:00:46.000 15
965 544 22.2998828888 73.1746368408 2019-03-31 17:01:07.000 2
968 544 22.2998828888 73.1746368408 2019-03-31 17:01:40.000 2
997 544 22.3010215759 73.1744003296 2019-03-31 17:06:11.000 15
998 544 22.3011436462 73.1747131348 2019-03-31 17:06:21.000 17
1010 544 22.3034667969 73.1747512817 2019-03-31 17:08:04.000 0
1011 544 22.3032741547 73.1747512817 2019-03-31 17:08:03.000 0
1565 544 22.3032035828 73.1748123169 2019-03-31 18:45:26.000 0
1571 544 22.3028964996 73.1748123169 2019-03-31 18:46:03.000 16
1573 544 22.3023796082 73.1747131348 2019-03-31 18:46:21.000 15
1575 544 22.3021774292 73.1746444702 2019-03-31 18:46:37.000 0
1577 544 22.3019657135 73.1747665405 2019-03-31 18:46:50.000 15
1586 544 22.3009243011 73.1742477417 2019-03-31 18:47:33.000 5
1591 544 22.2998828888 73.1745300293 2019-03-31 18:48:19.000 5
1592 544 22.2998828888 73.1745300293 2019-03-31 18:48:28.000 5
1593 544 22.2998981476 73.1746063232 2019-03-31 18:48:29.000 4
1597 544 22.3000450134 73.1744232178 2019-03-31 18:49:08.000 0
1677 544 22.3000450134 73.1744232178 2019-03-31 19:03:28.000 0
Now I want to calculate the time difference between a row and its next record, but only for rows with SP = 0.
Expected output:
qincId ID lc1 lc2 Time SP TimeDiff (Minute)
963 544 22.3000526428 73.1743087769 2019-03-31 17:00:46.000 15 NULL
965 544 22.2998828888 73.1746368408 2019-03-31 17:01:07.000 2 NULL
968 544 22.2998828888 73.1746368408 2019-03-31 17:01:40.000 2 NULL
997 544 22.3010215759 73.1744003296 2019-03-31 17:06:11.000 15 NULL
998 544 22.3011436462 73.1747131348 2019-03-31 17:06:21.000 17 NULL
1010 544 22.3034667969 73.1747512817 2019-03-31 17:08:04.000 0 0.01
1011 544 22.3032741547 73.1747512817 2019-03-31 17:08:03.000 0 97
1565 544 22.3032035828 73.1748123169 2019-03-31 18:45:26.000 0 1
1571 544 22.3028964996 73.1748123169 2019-03-31 18:46:03.000 16 NULL
1573 544 22.3023796082 73.1747131348 2019-03-31 18:46:21.000 15 NULL
1575 544 22.3021774292 73.1746444702 2019-03-31 18:46:37.000 0 0.21
1577 544 22.3019657135 73.1747665405 2019-03-31 18:46:50.000 15 NULL
1586 544 22.3009243011 73.1742477417 2019-03-31 18:47:33.000 5 NULL
1591 544 22.2998828888 73.1745300293 2019-03-31 18:48:19.000 5 NULL
1592 544 22.2998828888 73.1745300293 2019-03-31 18:48:28.000 5 NULL
1593 544 22.2998981476 73.1746063232 2019-03-31 18:48:29.000 4 NULL
1597 544 22.3000450134 73.1744232178 2019-03-31 18:49:08.000 0 14
1677 544 22.3000450134 73.1744232178 2019-03-31 19:03:28.000 0 NULL
So basically I just want to calculate the time difference in minutes. How can I do this?
If by next record you mean the row that has the minimum time that is greater than the current time:
select t.*,
       round(case
                when t.sp = 0 then
                   datediff(second, t.time,
                      (select min(time) from tablename where time > t.time)
                   )
                else null
             end / 60.0, 2) timediff
from tablename t
You can try using lead() (SQL Server version >= 2012). lag() would look at the previous row, but the requirement is the difference from the next record, and dividing by 60.0 converts seconds to minutes:
select *, case when sp = 0 then
       datediff(second, time, lead(time) over (order by time)) / 60.0
       else null end as timediff
from table_name
I have a very large time series dataset. I would like to do a count() on close_p but a sum() on prd_vlm.
open_p high_p low_p close_p tot_vlm prd_vlm
datetime
2005-09-06 16:33:00 1234.25 1234.50 1234.25 1234.25 776 98
2005-09-06 16:34:00 1234.50 1234.75 1234.25 1234.50 1199 423
2005-09-06 16:35:00 1234.50 1234.50 1234.25 1234.50 1330 131
...
2017-06-25 18:41:00 2431.75 2432.00 2431.75 2432.00 5436 189
2017-06-25 18:42:00 2431.75 2432.25 2431.75 2432.25 5654 218
2017-06-25 18:43:00 2432.25 2432.75 2432.25 2432.75 5877 223
2017-06-25 18:44:00 2432.75 2432.75 2432.50 2432.75 5894 17
2017-06-25 18:45:00 2432.50 2432.50 2432.25 2432.25 6098 204
I can achieve this using the following code, but was wondering if there is a better way to achieve this using an apply function.
group_count = df['close_p'].groupby(pd.TimeGrouper('D')).count()
group_volume = df['prd_vlm'].groupby(pd.TimeGrouper('D')).sum()
grouped = pd.concat([group_count,group_volume], axis=1)
print(grouped)
close_p prd_vlm
datetime
2005-09-06 232 4776.0
2005-09-07 1039 631548.0
2005-09-08 999 544112.0
2005-09-09 810 595044.0
You can use agg and apply different functions to different columns.
df.groupby(pd.TimeGrouper('D')).agg({'close_p':'count','prd_vlm':'sum'})
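As a self-contained sketch with made-up data: on newer pandas versions pd.TimeGrouper has been removed, so pd.Grouper(freq='D') plays the same role.
import numpy as np
import pandas as pd

# Made-up minute-level data for illustration only.
idx = pd.date_range('2005-09-06 16:33', periods=6, freq='T')
df = pd.DataFrame({'close_p': 1234 + np.random.rand(6),
                   'prd_vlm': np.random.randint(1, 500, size=6)},
                  index=idx)

# One pass over the daily groups: count close_p and sum prd_vlm.
# pd.Grouper(freq='D') is the modern equivalent of pd.TimeGrouper('D').
grouped = df.groupby(pd.Grouper(freq='D')).agg({'close_p': 'count',
                                                'prd_vlm': 'sum'})
print(grouped)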
I have a table with columns and value like
ID Values FirstCol 2ndCol 3rdCol 4thCol 5thCol
1 1stValue 5466 34556 53536 54646 566
1 2ndValue 3544 957 667 1050 35363
1 3rdValue 1040 1041 4647 6477 1045
1 4thValue 1048 3546 1095 1151 65757
2 1stValue 845 5466 86578 885 859
2 2ndValue 35646 996 1300 7101 456467
2 3rdValue 102 46478 565 657 107
2 4thValue 5509 55110 1411 1152 1144
3 1stValue 845 854 847 884 675
3 2ndValue 984 994 4647 1041 1503
3 3rdValue 1602 1034 1034 1055 466
3 4thValue 1069 1610 6111 1124 1144
Now I want a result set in the form below. Is this possible with a PIVOT or CASE statement?
ID Cols 1stValue 2ndValue 3rdValue 4thValue
1 FirstCol 5466 3544 1040 1048
1 2ndCol 34556 957 1041 3546
1 3rdCol 53536 667 4647 1095
1 4thCol 54646 1050 6477 1151
1 5thCol 566 35363 1045 65757
2 FirstCol 845 35646 102 5509
2 2ndCol 5466 996 46478 55110
2 3rdCol 86578 1300 565 1411
2 4thCol 885 7101 657 1152
2 5thCol 859 456467 107 1144
3 FirstCol 845 984 1602 1069
3 2ndCol 854 994 1034 1610
3 3rdCol 847 4647 1034 6111
3 4thCol 884 1041 1055 1124
3 5thCol 675 1503 466 1144
Assuming the table name is t1 this should do the trick:
SELECT * FROM t1
UNPIVOT (val FOR name IN ([FirstCol], [2ndCol], [3rdCol], [4thCol], [5thCol])) unpiv
PIVOT (SUM(val) FOR [Values] IN ([1stValue], [2ndValue], [3rdValue], [4thValue])) piv
There's a sorting issue: it would be good to rename FirstCol to 1stCol; then ORDER BY ID, name would put the result in the required order.