Hive - sum & max within date group

I want to use SUM and MAX depending on the date value.
I have table 1:
Cal_day  | Date2    | value |
---------|----------|-------|
20160505 | 20160505 |    50 |
20160505 | 20160505 |    15 |
20160505 | 20160507 |     1 |
20160505 | 20160507 |     3 |
20160505 | 20160508 |     2 |
Output:
Depending on date2: in this case there are 2 entries where date2 = 20160505 (the first date2, so I would order by date2 to get this sorted). When there is more than one entry for that date2 I need the max of those values, and then the sum of all values for the remaining date2 values.
The final value for cal_day 20160505 = max(50, 15) + sum(1, 3, 2) = 50 + 6 = 56.
How can I achieve this in Hive ?
Thanks in advance.
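One possible way to write this in Hive (a sketch only; it assumes the table is called table1, that "first date2" means the smallest date2 within each cal_day, and Hive 0.11+ for window functions):
SELECT cal_day,
       MAX(CASE WHEN date2 = min_date2 THEN `value` END)                 -- max of the first date2 group
       + COALESCE(SUM(CASE WHEN date2 <> min_date2 THEN `value` END), 0) -- plus the sum of the remaining values
       AS final_value
FROM (
  SELECT cal_day, date2, `value`,
         MIN(date2) OVER (PARTITION BY cal_day) AS min_date2             -- earliest date2 per cal_day
  FROM table1
) t
GROUP BY cal_day;
For the sample data this gives 50 + (1 + 3 + 2) = 56 for cal_day 20160505.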

Related

SQL to Determine Unpaid Balance Within Periods

I have the following table:
/**
| NAME  | DELTA (PAID - EXPECTED) | PERIOD |
|-------|-------------------------|--------|
| SMITH |                     -50 |      1 |
| SMITH |                       0 |      2 |
| SMITH |                     150 |      3 |
| SMITH |                    -200 |      4 |
| DOE   |                     300 |      1 |
| DOE   |                       0 |      2 |
| DOE   |                    -200 |      3 |
| DOE   |                    -200 |      4 |
**/
DROP TABLE delete_me;
CREATE TABLE delete_me (
"NAME" varchar(255),
"DELTA (PAID - EXPECTED)" numeric(15,2),
"PERIOD" Int
);
INSERT INTO delete_me("NAME", "DELTA (PAID - EXPECTED)", "PERIOD")
VALUES
('SMITH', -50, 1),
('SMITH', 0, 2),
('SMITH', 150, 3),
('SMITH', -200, 4),
('DOE', 300, 1),
('DOE', 0, 2),
('DOE', -200, 3),
('DOE', -200, 4);
Where period represents time, with 1 being the newest and 4 being the oldest. In each time period the person was charged an amount, and they could pay off that amount or more. A negative delta means that they owe for that time period. A positive delta means that they paid over the expected amount and have a credit for that time period that can be applied to other time periods. If there's a credit we'd want to pay off the oldest time period first. I want to get how much unpaid debt is still looming for each time period.
So in the example above I'd want to see:
| NAME  | DELTA (PAID - EXPECTED) | PERIOD | PERIOD BALANCE |
|-------|-------------------------|--------|----------------|
| SMITH |                     -50 |      1 |            -50 |
| SMITH |                       0 |      2 |              0 |
| SMITH |                     150 |      3 |              0 |
| SMITH |                    -200 |      4 |            -50 |
| DOE   |                     300 |      1 |              0 |
| DOE   |                       0 |      2 |              0 |
| DOE   |                    -200 |      3 |           -100 |
| DOE   |                    -200 |      4 |              0 |
How can I use Postgres SQL to show the unpaid debt within periods?
Additional description: for Doe, initially (in the oldest period, 4) 200 was owed; in the next period they owed the original 200 plus another 200 (400 total owed). In period 2 the monthly charge was paid, but not the past balances. In the most recent period (1), 300 over the monthly amount was paid: 200 of this was applied to the oldest debt in period 4, paying it off; the remaining 100 was applied to period 3's debt, after which 100 was still owed.
For the Smith family, initially (in period 4) they underpaid by 200. The next period they overpaid the month by 150, and this was applied to the oldest debt of 200, leaving 50 still to be paid. In period 2 the monthly bill was paid exactly, so they still owed the 50 dollars from period 4. Then in period 1 they underpaid by 50. They owe 100 in total: 50 for period 1 and 50 for period 4.
According to what I understood, you want to distribute the sum of positive DELTA values (credit) among the negative DELTA values starting from the oldest period.
WITH cte AS (
  SELECT name_, delta, period_,
         SUM(CASE WHEN delta < 0 THEN delta ELSE 0 END)
           OVER (PARTITION BY name_ ORDER BY period_ DESC) +
         SUM(CASE WHEN delta > 0 THEN delta ELSE 0 END)
           OVER (PARTITION BY name_) AS positive_credit_to_negativ_delta
  FROM delete_me
)
SELECT name_, delta, period_, positive_credit_to_negativ_delta,
       CASE
         WHEN delta >= 0 THEN 0
         WHEN positive_credit_to_negativ_delta >= 0 THEN 0
         ELSE GREATEST(delta, positive_credit_to_negativ_delta)
       END AS period_balance
FROM cte
ORDER BY name_, period_;
See a demo from db-fiddle.
The idea in this query is to find the sum of all positive DELTA values for each user, then add that sum to the cumulative sum of the negative values starting from the oldest period. The result of this addition is stored in positive_credit_to_negativ_delta in the query.
Of course, for DELTA values >= 0 the result will be 0, since there is no debt for that period.
For negative DELTA values:
If positive_credit_to_negativ_delta is >= 0, the result is 0; that means the period's delta is covered by the positive credit.
If positive_credit_to_negativ_delta is < 0, the result is the greater of positive_credit_to_negativ_delta and DELTA.
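For example, tracing SMITH through the cte: the sum of the positive deltas is 150, and the running sum of negative deltas from the oldest period is -200 (period 4), -200 (period 3), -200 (period 2) and -250 (period 1). Adding 150 gives positive_credit_to_negativ_delta values of -50, -50, -50 and -100. Period 4 then yields greatest(-200, -50) = -50 and period 1 yields greatest(-50, -100) = -50, matching the expected output.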
The query below uses a recursive CTE with several JSON structures. The latter allows accurate tracking of negative-to-positive balance intervals, even when more than one positive balance follows a negative one:
with recursive cte(n, l, p, js, js1) as (
  select d1.name, d5.delta, d1.m,
         jsonb_build_object(d1.m, d5.delta),
         jsonb_build_array(d1.m)
  from (select d.name, max(d.period) m
        from delete_me d
        where d.delta < 0
        group by d.name) d1
  join delete_me d5 on d5.name = d1.name and d5.period = d1.m
  union all
  select c.n, d.delta, d.period,
         case when d.delta < 0 then c.js || jsonb_build_object(d.period, d.delta)
              when d.delta = 0 then c.js
              else (select jsonb_object_agg(k.v3,
                             least((c.js -> k.v3::text)::int +
                                   greatest(d.delta + coalesce((select sum((c.js -> v2.value::text)::int)
                                                                from jsonb_array_elements(c.js1) v2
                                                                where v2.value::int > k.v3), 0), 0), 0))
                    from (select v.value::int v3
                          from jsonb_array_elements(c.js1) v
                          order by v.value::int desc) k)
                   || jsonb_build_object(d.period,
                        greatest(d.delta + (select sum((c.js -> v2.value::text)::int)
                                            from jsonb_array_elements(c.js1) v2), 0))
         end,
         case when d.delta < 0 then (case when c.l <= 0 then c.js1 else '[]'::jsonb end) || ('[' || d.period || ']')::jsonb
              else c.js1
         end
  from cte c
  join delete_me d on d.period = c.p - 1 and d.name = c.n
)
select d.*, coalesce((c.js -> d.period::text)::int, 0)
from delete_me d
join cte c on c.n = d.name
where c.p = 1
order by d.name desc, d.period asc;

How to return all records with the latest datetime value [PostgreSQL]

How can I return only the records with the latest upload_date(s) from the data below?
My data is as follows:
upload_date |day_name |rows_added|row_count_delta|days_since_last_update|
-----------------------+---------+----------+---------------+----------------------+
2022-05-01 00:00:00.000|Sunday | 526043| | |
2022-05-02 00:00:00.000|Monday | 467082| -58961| 1|
2022-05-02 15:58:54.094|Monday | 421427| -45655| 0|
2022-05-02 18:19:22.894|Monday | 421427| 0| 0|
2022-05-03 16:54:04.136|Tuesday | 496021| 74594| 1|
2022-05-03 18:17:27.502|Tuesday | 496021| 0| 0|
2022-05-04 18:19:26.392|Wednesday| 487154| -8867| 1|
2022-05-05 18:18:15.277|Thursday | 489713| 2559| 1|
2022-05-06 16:15:39.518|Friday | 489713| 0| 1|
2022-05-07 16:18:00.916|Saturday | 482955| -6758| 1|
My desired results should be:
upload_date |day_name |rows_added|row_count_delta|days_since_last_update|
-----------------------+---------+----------+---------------+----------------------+
2022-05-01 00:00:00.000|Sunday | 526043| | |
2022-05-02 18:19:22.894|Monday | 421427| 0| 0|
2022-05-03 18:17:27.502|Tuesday | 496021| 0| 0|
2022-05-04 18:19:26.392|Wednesday| 487154| -8867| 1|
2022-05-05 18:18:15.277|Thursday | 489713| 2559| 1|
2022-05-06 16:15:39.518|Friday | 489713| 0| 1|
2022-05-07 16:18:00.916|Saturday | 482955| -6758| 1|
NOTE only the latest upload_date for 2022-05-02 and 2022-05-03 should be in the result set.
You can use a window function to PARTITION by day (casting the timestamp to a date) and sort within each day by upload_date descending, so the most recent record comes first. ROW_NUMBER() will then assign 1 to the most recent record per date, and you just filter on that row number. Note that I am assuming the datatype for upload_date is TIMESTAMP in this case.
SELECT *
FROM (
  SELECT
    your_table.*,
    ROW_NUMBER() OVER (PARTITION BY CAST(upload_date AS DATE)
                       ORDER BY upload_date DESC) AS rownum
  FROM your_table
) sub  -- Postgres requires an alias on the derived table
WHERE rownum = 1;
demo
WITH cte AS (
  SELECT
    max(upload_date) OVER (PARTITION BY upload_date::date) AS max_upload_date,
    upload_date,
    day_name,
    rows_added,
    row_count_delta,
    days_since_last_update
  FROM test101
)
SELECT
  upload_date,
  day_name,
  rows_added,
  row_count_delta,
  days_since_last_update
FROM cte
WHERE max_upload_date = upload_date;
This is more verbose but I find it easier to read and build:
SELECT *
FROM mytable t1
JOIN (
  SELECT CAST(upload_date AS DATE) AS day_date, MAX(upload_date) AS max_date
  FROM mytable
  GROUP BY day_date
) t2
  ON t1.upload_date = t2.max_date
 AND CAST(t1.upload_date AS DATE) = t2.day_date;
I don't know about performance offhand, but I suspect the window function is worse because you need an ORDER BY, which is usually a slow operation unless your table already has an index for it.
Use DISTINCT ON:
SELECT DISTINCT ON (date_trunc('day', upload_date))
to_char(upload_date, 'Day') AS weekday, * -- added weekday optional
FROM tbl
ORDER BY date_trunc('day', upload_date), upload_date DESC;
db<>fiddle here
For few rows per day (like your sample data suggests) it's the simplest and fastest solution possible. See:
Select first row in each GROUP BY group?
I dropped the column day_name from the table; it's just a redundant representation of the timestamp. Storing it only adds cost, noise, and opportunities for inconsistent data. If you need the weekday displayed, use to_char(upload_date, 'Day') AS weekday as demonstrated above.
The query works for any number of days, not restricted to 7 weekdays.

How to validate a particular column in a DataFrame without affecting other columns using spark-sql?

set.createOrReplaceTempView("input1");
String look = "select case when length(date)>0 then 'Y' else 'N' end as date from input1";
Dataset<Row> Dataset_op = spark.sql(look);
Dataset_op.show();
In the above code the dataframe 'set' has 10 columns and I've done the validation for one of them, i.e. 'date'. It returns the date column alone.
My question is: how do I return all the columns along with the validated date column in a single dataframe?
Is there any way to get all the columns in the dataframe without manually selecting all of them in the select statement? Please share your suggestions. TIA
Data
df = spark.createDataFrame([
    (1, '2022-03-01'),
    (2, '2022-04-17'),
    (3, None)
], ('id', 'date'))
df.show()
+---+----------+
| id| date|
+---+----------+
| 1|2022-03-01|
| 2|2022-04-17|
| 3| null|
+---+----------+
You have two options.
Option 1: select without projecting a new column with N and Y.
df.createOrReplaceTempView("input1");
String_look = "select id, date from input1 where length(date)>0";
Dataset_op = spark.sql(String_look).show()
+---+----------+
| id| date|
+---+----------+
| 1|2022-03-01|
| 2|2022-04-17|
+---+----------+
Or, as option 2, project Y and N into a new column. Remember that the WHERE clause is applied before column projection, so you can't use the newly created column in the WHERE clause:
String_look = "select id, date, case when length(date)>0 then 'Y' else 'N' end as status from input1 where length(date)>0";
+---+----------+------+
| id| date|status|
+---+----------+------+
| 1|2022-03-01| Y|
| 2|2022-04-17| Y|
+---+----------+------+
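If the goal is simply to keep every existing column plus the validation flag without listing the columns by hand, a minimal sketch (assuming the same input1 view as above; the column name date_status is made up here) is to combine select * with the derived column:
# date_status is a hypothetical name for the derived flag column
String_look = "select *, case when length(date)>0 then 'Y' else 'N' end as date_status from input1"
spark.sql(String_look).show()
This keeps id and date unchanged and adds Y for the first two sample rows and N for the row where date is null.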

Join tables by timestamp and date columns?

I need to retrieve data from two log tables (BALHDR and ZIF_LOG_XML_CONTENT). My problem is that the only commonality between the log tables is the time when the entries were created. The query has to work for a PERIOD and not for a TIME POINT.
However, the time for the entries is not stored in the same format in the two tables. In ZIF_LOG_XML_CONTENT it is stored in a single TIMESTAMP column; in the other log table, BALHDR, it is stored in two columns, where DATE and TIME are kept separately.
I tried to transform all the times to STRING, but it is still not working...
What am I doing wrong?
DATA: GV_DATEANDTIMETO TYPE STRING,
GV_DATETO TYPE STRING,
GV_TIMETO TYPE STRING,
GV_DATEANDTIMEFROM TYPE STRING,
GV_DATEFROM TYPE STRING,
GV_TIMEFROM TYPE STRING,
GV_DATUM TYPE STRING.
SELECT * FROM BALHDR INTO #GS_MSG_STRUKT WHERE
EXTNUMBER = #P_EXTID AND
OBJECT = #P_OBJ AND
SUBOBJECT = #P_SUBOBJ AND
ALUSER = #P_USER AND
( ALDATE_BALHDR >= #GV_INPUT_DATETO AND ALTIME_BALHDR >= #GV__INPUT_TIMETO ) AND
( ALDATE_BALHDR <= #GV_INPUT_DATEFROM AND ALTIME_BALHDR <= #GV__INPUT_TIMEFROM ) AND
MSG_CNT_E >= 1 OR MSG_CNT_AL IS ZERO.
concatenate GS_MSGTABLE-DATE GS_MSGTABLE-TIME into GV_DATUM.
SELECT RES_CONTENT, REQ_CONTENT
FROM zif_log_content
INTO #GS_MSG_STRUKT
WHERE TIMESTAMP >= #Gv_date AND TIMESTAMP <= #Gv_date.
ENDSELECT.
ENDSELECT.
Concatenating works, you just need to pass timestamp into your SELECT, not string.
Here is a working simplified example based on standard BALHDR and MBEW tables:
TYPES: BEGIN OF struct,
         lognumber TYPE balhdr-lognumber,
         aldate    TYPE balhdr-aldate,
         altime    TYPE balhdr-altime,
         timestamp TYPE mbew-timestamp,
       END OF struct.
DATA: gs_msg_strukt TYPE struct.
DATA: gt_msg_strukt TYPE TABLE OF struct.
SELECT *
FROM balhdr
INTO CORRESPONDING FIELDS OF #gs_msg_strukt
WHERE aldate >= #gv_input_dateto AND altime <= #gv_input_timeto.
CONCATENATE gs_msg_strukt-aldate gs_msg_strukt-altime INTO gv_datum.
DATA(gv_date) = CONV timestamp( gv_datum ).
SELECT timestamp
FROM mbew
INTO CORRESPONDING FIELDS OF #gs_msg_strukt
WHERE timestamp >= #gv_date AND timestamp <= #gv_date.
ENDSELECT.
APPEND gs_msg_strukt TO gt_msg_strukt. "<---move APPEND here
ENDSELECT.
This won't work, as the TIMESTAMP in SAP is a decimal type and does not equal a concatenation of the date and time in any way.
You should create your time stamp using the following statement.
CONVERT DATE gs_msgtable-date TIME gs_msgtable-time INTO TIME STAMP DATA(gv_timestamp) TIME ZONE sy-zonlo.
Be careful also with the time zone. I do not know in which time zone your entries in the Z-table are; in the BAL table they should be stored in UTC. Be sure to check it beforehand.
QUESTION
I haven't got a fully working minimal example yet, but I can give you an example of the two tables which I would like to join together. The third table shows the wanted result. THX
-----------------------------------------------------------------------------
BALHDR
-----------------------------------------------------------------------------
  | EXTNUMBER | DATE       | TIME     | OBJECT | SUBOBJECT | USER | MSG_ALL | MSG_ERROR
-----------------------------------------------------------------------------
A |      1236 | 2000.10.10 | 12:33:24 | KAT    | LEK       | NEK  | NULL    | NULL
B |      1936 | 2010.02.20 | 02:33:44 | KAT    | MOK       | NEK  | 3       | 1
C |      1466 | 2010.10.10 | 11:35:34 | KAT    | LEK       | NEK  | 2       | 0
D |      1156 | 2011.08.03 | 02:13:14 | KAT    | MOK       | NEK  | 3       | 0
E |      1466 | 2014.10.10 | 11:35:34 | KAT    | LEK       | NEK  | NULL    | NULL
F |      1156 | 2019.08.03 | 02:13:14 | KAT    | MOK       | NEK  | 1       | 1
-----------------------------------------------------------------------------
ZIF_LOG
-----------------------------------------------------------------------------
  | TIMESTAMP      | REQ  | RES
-----------------------------------------------------------------------------
1 | 20100220023344 | he   | hello
2 | 20101010113534 | bla  | blala
3 | 20110803021314 | to   | toto
4 | 20190803021314 | macs | ka
The following table shows the wanted result. The numbers from 1 to 4 and the letters from A to F are there to help show how the rows correspond with each other.
-----------------------------------------------------------------------------
WANTED RESULT TABLE
-----------------------------------------------------------------------------
   | EXTNUMBER | DATE       | TIME     | OBJECT | SUBOBJECT | USER | REQ  | RES
-----------------------------------------------------------------------------
A  |      1236 | 2000.10.10 | 12:33:24 | KAT    | LEK       | NEK  | NULL | NULL
B2 |      1936 | 2010.02.20 | 02:33:44 | KAT    | MOK       | NEK  | he   | hello
E  |      1466 | 2014.10.10 | 11:35:34 | KAT    | LEK       | NEK  | NULL | NULL
F6 |      1156 | 2019.08.03 | 02:13:14 | KAT    | MOK       | NEK  | macs | ka
THX

Spark: how to perform a loop function on dataframes

I have two dataframes, shown below. I'm trying to search the second df using the foreign key and then generate a new data frame. I was thinking of doing spark.sql("""select history.value as previous_year_1 from df1, history where df1.key=history.key and history.date=add_months($currentdate,-1*12)"""), but then I would need to do it multiple times, say for 10 previous years, and join the results back together. How can I create a function for this? Many thanks. Quite new here.
dataframe one:
+---+---+-----------+
|key|val| date |
+---+---+-----------+
| 1|100| 2018-04-16|
| 2|200| 2018-04-16|
+---+---+-----------+
dataframe two : historical data
+---+---+-----------+
|key|val| date |
+---+---+-----------+
| 1|10 | 2017-04-16|
| 1|20 | 2016-04-16|
+---+---+-----------+
The result I want to generate is
+---+----------+-----------------+-----------------+
|key|date | previous_year_1 | previous_year_2 |
+---+----------+-----------------+-----------------+
| 1|2018-04-16| 10 | 20 |
| 2|null | null | null |
+---+----------+-----------------+-----------------+
To solve this, the following approach can be applied:
1) Join the two dataframes by key.
2) Filter out all the rows where previous dates are not exactly years before reference dates.
3) Calculate the years difference for the row and put the value in a dedicated column.
4) Pivot the DataFrame around the column calculated in the previous step and aggregate on the value of the respective year.
private def generateWhereForPreviousYears(nbYears: Int): Column =
  (-1 to -nbYears by -1) // loop on each backwards year value
    .map(yearsBack =>
      /*
       * Each year-back count is transformed into an expression
       * to be included in the WHERE clause.
       * This is equivalent to "history.date=add_months($currentdate,-1*12)"
       * in your comment in the question.
       */
      add_months($"df1.date", 12 * yearsBack) === $"df2.date"
    )
    /*
     * The previous .map call produces a sequence of Column expressions;
     * we need to combine them with "or" in order to obtain a single
     * Spark Column reference. The .reduce() function is most appropriate here.
     */
    .reduce(_ or _) or $"df2.date".isNull // the last "or" is added to include empty lines in the result.

val nbYearsBack = 3

val result = sourceDf1.as("df1")
  .join(sourceDf2.as("df2"), $"df1.key" === $"df2.key", "left")
  .where(generateWhereForPreviousYears(nbYearsBack))
  .withColumn("diff_years", concat(lit("previous_year_"), year($"df1.date") - year($"df2.date")))
  .groupBy($"df1.key", $"df1.date")
  .pivot("diff_years")
  .agg(first($"df2.value"))
  .drop("null") // drop the unwanted extra column with null values
The output is:
+---+----------+---------------+---------------+
|key|date |previous_year_1|previous_year_2|
+---+----------+---------------+---------------+
|1 |2018-04-16|10 |20 |
|2 |2018-04-16|null |null |
+---+----------+---------------+---------------+
Let me "read through the lines" and give you a "similar" solution to what you are asking:
val df1Pivot = df1.groupBy("key").pivot("date").agg(max("val"))
val df2Pivot = df2.groupBy("key").pivot("date").agg(max("val"))
val result = df1Pivot.join(df2Pivot, Seq("key"), "left")
result.show
+---+----------+----------+----------+
|key|2018-04-16|2016-04-16|2017-04-16|
+---+----------+----------+----------+
| 1| 100| 20| 10|
| 2| 200| null| null|
+---+----------+----------+----------+
Feel free to manipulate the data a bit if you really need to change the column names.
Or even better:
df1.union(df2).groupBy("key").pivot("date").agg(max("val")).show
+---+----------+----------+----------+
|key|2016-04-16|2017-04-16|2018-04-16|
+---+----------+----------+----------+
| 1| 20| 10| 100|
| 2| null| null| 200|
+---+----------+----------+----------+