How to apply a ROW_NUMBER window function to incremental data - SQL

I have a table on which a row_number window function runs, partitioned by ID.
Every time new data arrives, the table is rebuilt as a full load and row numbers are assigned to the entire data set again. This is quite inefficient: it consumes a lot of resources and is CPU intensive, and the table is rebuilt every 15 to 30 minutes. I would like to achieve the same result incrementally, adding each incremental row number to the last row count of the corresponding customer_ID.
When new records arrive, I want to look up the saved max row_num for that customer_ID. Say max_row_num = 4 and two new records come in for that ID, so the incremental row numbers are 1 and 2. The final output should be 4+1 = 5 and 4+2 = 6, so that the full sequence for that ID reads 1, 2, 3, 4, 5, 6.
I actually want to implement the logic in PySpark, but I am open to a Python solution that I could convert to a PySpark DataFrame later.
Please help and suggest possible solutions.
Full load -- initial table

Row_num  customer_ID
1        ABC123
2        ABC123
3        ABC123
1        ABC125
2        ABC125
1        ABC225
2        ABC225
3        ABC225
4        ABC225
5        ABC225
Incremental load

Row_num  customer_ID
1        ABC123
2        ABC123
1        ABC125
1        ABC225
2        ABC225
1        ABC330
Desired output

Row_num  customer_ID
1        ABC123
2        ABC123
3        ABC123
4        ABC123
5        ABC123
1        ABC125
2        ABC125
3        ABC125
1        ABC225
2        ABC225
3        ABC225
4        ABC225
5        ABC225
6        ABC225
7        ABC225
1        ABC330

If you are trying to insert the values with the new row numbers, you can join in the maximum existing row number per customer (defaulting to 0 for customers not yet in the full table):
insert into full (row_num, customer_id)
select i.row_num + coalesce(f.max_row_num, 0), i.customer_id
from incremental i left join
     (select customer_id, max(row_num) as max_row_num
      from full
      group by customer_id
     ) f
     on i.customer_id = f.customer_id;
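Since the question mentions being open to a Python prototype before converting to PySpark, here is a minimal plain-Python sketch of the same join-and-offset idea (the function name and sample rows are mine, not from the question):

```python
from collections import defaultdict

def append_incremental(full_rows, incremental_rows):
    """full_rows / incremental_rows: lists of (row_num, customer_id).

    Returns full_rows plus the incremental rows, with each incremental
    row_num shifted by that customer's current max row_num (0 if new).
    """
    max_row = defaultdict(int)
    for row_num, cust in full_rows:
        max_row[cust] = max(max_row[cust], row_num)
    shifted = [(row_num + max_row[cust], cust)
               for row_num, cust in incremental_rows]
    return full_rows + shifted

full = [(1, "ABC123"), (2, "ABC123"), (3, "ABC123"),
        (1, "ABC125"), (2, "ABC125")]
incr = [(1, "ABC123"), (2, "ABC123"), (1, "ABC330")]
result = append_incremental(full, incr)
print(result[-3:])   # [(4, 'ABC123'), (5, 'ABC123'), (1, 'ABC330')]
```

In PySpark the same shape works: aggregate max(row_num) per customer_ID on the full table, left-join that onto the incremental batch, compute row_number() over a window partitioned by customer_ID, and add the coalesced offset.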

Related

How to query the n rows before and after a random row in Hive

After selecting a random row, I want to be able to select the n records preceding and following it.
Example:

id  content
1   add
2   bob
3   cdf
4   asd

If the random row's id is 3 (and n = 1), the result I need is:

2   bob
3   cdf
4   asd
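One common approach is to number the rows with ROW_NUMBER() (available in Hive) and keep every row whose position is within n of the chosen row's position. A sketch run through Python's sqlite3, which shares this window syntax with Hive (the table name `t` and n = 1 are my choices for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INTEGER, content TEXT);
INSERT INTO t VALUES (1,'add'), (2,'bob'), (3,'cdf'), (4,'asd');
""")

# Number the rows, then keep every row whose position is within
# n of the randomly chosen row's position (here: id = 3, n = 1).
rows = conn.execute("""
WITH numbered AS (
  SELECT id, content, ROW_NUMBER() OVER (ORDER BY id) AS rn
  FROM t
)
SELECT a.id, a.content
FROM numbered a
JOIN numbered b ON b.id = 3          -- the chosen row
WHERE ABS(a.rn - b.rn) <= 1          -- n = 1 before and after
ORDER BY a.id;
""").fetchall()
print(rows)   # [(2, 'bob'), (3, 'cdf'), (4, 'asd')]
```

Numbering the rows first makes this robust even when the id column has gaps.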

CTE hierarchy question - 1 table, 2 columns

I'm new to CTEs, and I am slowly starting to understand the basics.
I have a table that essentially follows this pattern:

CUSTOMER X  CUSTOMER Y
1           1
1           2
2           3
3           4
3           5
4           5
5           6

I was wondering if a CTE could return the numbers 1 through 6 (CUSTOMER Y) when number 1 in CUSTOMER X has a specific column flagged for relevancy.
Customer 1 would be considered the main customer, while 2 - 6 could be stores related to that customer.
My end goal is propagating this relevancy flag down to customers 2 - 6 if customer 1 has it, and I'm currently trying to figure out how to get that full list.
I'd want the CTE to return a distinct list of customers:

CUSTOMER
1
2
3
4
5
6
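The usual tool for walking parent/child pairs like this is a recursive CTE. A sketch in SQLite syntax, run from Python (the table and column names are my guesses based on the question's pattern):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rel (customer_x INTEGER, customer_y INTEGER);
INSERT INTO rel VALUES (1,1),(1,2),(2,3),(3,4),(3,5),(4,5),(5,6);
""")

# Starting from the flagged main customer (1), repeatedly follow
# customer_x -> customer_y links and collect every reachable customer.
rows = conn.execute("""
WITH RECURSIVE reachable(customer) AS (
  SELECT 1                              -- the flagged main customer
  UNION                                 -- UNION (not UNION ALL) dedupes
  SELECT r.customer_y
  FROM rel r
  JOIN reachable ON r.customer_x = reachable.customer
)
SELECT customer FROM reachable ORDER BY customer;
""").fetchall()
print([c for (c,) in rows])   # [1, 2, 3, 4, 5, 6]
```

Using UNION rather than UNION ALL also makes the recursion terminate despite the (1, 1) self-reference in the data.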

Select maximum value where another column is used for the grouping

I'm trying to join several tables, where one of the tables is acting as a
key-value store, and then after the joins find the maximum value in a
column less than another column. As a simplified example, I have the following three tables:
Documents:

DocumentID  Filename      LatestRevision
1           D1001.SLDDRW  18
2           P5002.SLDPRT  10
Variables:

VariableID  VariableName
1           DateReleased
2           Change
3           Description
VariableValues:

DocumentID  VariableID  Revision  Value
1           2           1         Created
1           3           1         Drawing
1           2           3         Changed Dimension
1           1           4         2021-02-01
1           2           11        Corrected typos
1           1           16        2021-02-25
2           3           1         Generic part
2           3           5         Screw
2           1           4         2021-02-24
I can use the LEFT JOIN/IS NULL thing to get the latest version of
variables relatively easily (see http://sqlfiddle.com/#!7/5982d/3/0).
What I want is the latest version of variables that are less than or equal
to a revision which has a DateReleased, for example:
DocumentID  Filename      Variable     Value              VariableRev  DateReleased  ReleasedRev
1           D1001.SLDDRW  Change       Changed Dimension  3            2021-02-01    4
1           D1001.SLDDRW  Description  Drawing            1            2021-02-01    4
1           D1001.SLDDRW  Description  Drawing            1            2021-02-25    16
1           D1001.SLDDRW  Change       Corrected typos    11           2021-02-25    16
2           P5002.SLDPRT  Description  Generic part       1            2021-02-24    4
How do I do this?
I figured this out. Add another JOIN at the start to bring in a second copy of the VariableValues table that selects only the DateReleased rows, then make sure that every VariableValues revision selected is less than or equal to the released revision. I think the LEFT JOIN has to be added after this table.
The example at http://sqlfiddle.com/#!9/bd6068/3/0 shows this better.
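A runnable sketch of that shape, using SQLite through Python purely for illustration (the join/subquery structure is my reading of the approach, not the exact sqlfiddle code; the last VariableValues row is entered under VariableID 1 = DateReleased, since its value is a date):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Documents (DocumentID INT, Filename TEXT, LatestRevision INT);
CREATE TABLE Variables (VariableID INT, VariableName TEXT);
CREATE TABLE VariableValues (DocumentID INT, VariableID INT, Revision INT, Value TEXT);
INSERT INTO Documents VALUES (1,'D1001.SLDDRW',18),(2,'P5002.SLDPRT',10);
INSERT INTO Variables VALUES (1,'DateReleased'),(2,'Change'),(3,'Description');
INSERT INTO VariableValues VALUES
 (1,2,1,'Created'),(1,3,1,'Drawing'),(1,2,3,'Changed Dimension'),
 (1,1,4,'2021-02-01'),(1,2,11,'Corrected typos'),(1,1,16,'2021-02-25'),
 (2,3,1,'Generic part'),(2,3,5,'Screw'),(2,1,4,'2021-02-24');
""")

# For every released revision (VariableID 1 = DateReleased), pick each
# other variable's value with the highest Revision <= the released one.
rows = conn.execute("""
SELECT d.Filename, v.VariableName, vv.Value, rel.Value AS DateReleased
FROM VariableValues rel
JOIN Documents d ON d.DocumentID = rel.DocumentID
JOIN VariableValues vv ON vv.DocumentID = rel.DocumentID
                      AND vv.VariableID <> 1
                      AND vv.Revision <= rel.Revision
JOIN Variables v ON v.VariableID = vv.VariableID
WHERE rel.VariableID = 1
  AND vv.Revision = (SELECT MAX(Revision) FROM VariableValues x
                     WHERE x.DocumentID = vv.DocumentID
                       AND x.VariableID = vv.VariableID
                       AND x.Revision <= rel.Revision)
ORDER BY d.DocumentID, rel.Revision, v.VariableName;
""").fetchall()
```

This yields the five rows of the desired output; note that 'Screw' (revision 5) is correctly excluded because it postdates the released revision 4.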

Percentage of variable corresponding to percentage of other variable

I have two numerical variables, and would like to calculate the share of one variable's sum contributed by the rows that make up at least 50% of the other variable's sum.
For example:
A | B
__________
2 | 8
1 | 20
3 | 12
5 | 4
2 | 7
1 | 11
4 | 5
Here, the sum of column B is 67, so I'm looking for the rows (taken in B's descending order) whose cumulative sum first reaches at least half of that (33.5).
In this case those are rows 2, 3 & 6 (cumulative sum of 43). The sum of those rows' column A is 5, which I want to compare to the total sum of column A (18).
Therefore, the result I'm looking for is 5 / 18 * 100 ≈ 27.78%.
I'm looking for a way to implement this in QlikSense, or in SQL.
Here's one way you can do it - there is probably some optimisation to be done, but this gives what you want.
Source:
LOAD
*,
RowNo() as RowNo_Source
Inline [
A , B
2 , 8
1 , 20
3 , 12
5 , 4
2 , 7
1 , 11
4 , 5
];
SourceSorted:
NoConcatenate LOAD *,
RowNo() as RowNo_SourceSorted
Resident Source
Order by B asc;
drop table Source;
BTotal:
LOAD sum(B) as BTotal
Resident SourceSorted;
let BTotal=peek('BTotal',0);
SourceWithCumu:
NoConcatenate LOAD
*,
rangesum(peek('BCumu'),B) as BCumu,
$(BTotal) as BTotal,
rangesum(peek('BCumu'),B)/$(BTotal) as BCumuPct,
if(rangesum(peek('BCumu'),B)/$(BTotal)>=0.5,A,0) as AFiltered
Resident SourceSorted;
Drop Table SourceSorted;
I added a few debug fields that might be useful, but you could of course remove them.
Then in the front end you do your calculation of sum(AFiltered)/sum(A) to get the stat you want and format it as a percentage.
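Since the question also asks about SQL, the same cumulative-sum logic can be sanity-checked in a few lines of plain Python (using the question's sample data; variable names are mine):

```python
# Pair up the sample data, sort by B descending, and walk the
# cumulative sum of B until it reaches half of B's total.
a = [2, 1, 3, 5, 2, 1, 4]
b = [8, 20, 12, 4, 7, 11, 5]

rows = sorted(zip(a, b), key=lambda r: r[1], reverse=True)
half_b = sum(b) / 2          # 67 / 2 = 33.5
cumu, a_selected = 0, 0
for av, bv in rows:
    a_selected += av
    cumu += bv
    if cumu >= half_b:       # stop once >= 50% of B is covered
        break

pct = a_selected / sum(a) * 100
print(round(pct, 2))   # 27.78
```

Note that sorting descending and stopping at 50% selects the same rows as the Qlik script's ascending sort with its >= 0.5 cumulative filter.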

DB query matching IDs and summing data in columns

Here is the info I have in my tables. What I need is to create a report based on certain dates that sums every stock movement for the same ID.

Table one        Table two
Items            Stocks
----------       ---------------------------
ID - NAME        items_id - altas - bajas - created_at
1    White       4          5       0       8/10/2016
2    Black       2          1       5       8/10/2016
3    Red         3          3       2       8/11/2016
4    Blue        4          1       4       8/11/2016
                 2          10      2       8/12/2016
So, based on the customer's choice of dates (in this case let's say it selects all the data available in the table), I need to group the rows by items_id and then SUM all altas and all bajas for each items_id, ending up with the following:

items_id  altas  bajas
1         0      0
2         11     7
3         3      2
4         6      4
Any help solving this?
Hope this will help:
Stock.select("sum(altas) as altas, sum(bajas) as bajas").group("items_id")
Note that grouping only the stocks table will not return item 1, which has no movements; to include it you would need to start from Item with a LEFT JOIN.
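The ActiveRecord call above compiles down to a grouped SUM. To also get item 1 (which has no stock rows) with zeroes, start from the items table with a LEFT JOIN; a sketch of that SQL, run here through Python's sqlite3 (dates rewritten ISO-style for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INT, name TEXT);
CREATE TABLE stocks (items_id INT, altas INT, bajas INT, created_at TEXT);
INSERT INTO items VALUES (1,'White'),(2,'Black'),(3,'Red'),(4,'Blue');
INSERT INTO stocks VALUES
 (4,5,0,'2016-08-10'),(2,1,5,'2016-08-10'),(3,3,2,'2016-08-11'),
 (4,1,4,'2016-08-11'),(2,10,2,'2016-08-12');
""")

# LEFT JOIN so items with no movements (id 1) still appear with 0/0;
# a created_at date-range filter would slot into the ON clause.
rows = conn.execute("""
SELECT i.id, COALESCE(SUM(s.altas), 0) AS altas,
             COALESCE(SUM(s.bajas), 0) AS bajas
FROM items i
LEFT JOIN stocks s ON s.items_id = i.id
GROUP BY i.id
ORDER BY i.id;
""").fetchall()
print(rows)   # [(1, 0, 0), (2, 11, 7), (3, 3, 2), (4, 6, 4)]
```

Putting the date filter in the ON clause (rather than WHERE) keeps itemless rows in the result when a range is applied.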