Forward fill in Spark SQL based on a column value condition - sql

Can someone help me forward-fill values in a CASE statement, based on another column's value, in Spark SQL?
I am trying to detect outliers in the dataset; so far I have flagged as outliers the values that fall an unusual number of standard deviations from the mean.
Now, wherever an outlier occurs, I have to fill a new column with the last valid/authentic value.
For example: after the 1 in the first column, I want to append 556 in the third column, and for the 3 in the first column, I want to append 561 in the third column.
So far I have identified the outliers, and I am guessing I could use the LAG function to go back one row. But I also know this is not a good approach: if I get 10 outliers in a sequence, I would have to write 10 CASE statements.
If someone has a better/more efficient approach, please help.
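The usual window-function answer is to null out the outlier values and then carry the last non-null value forward, so a run of ten outliers needs no more code than a run of one. Spark SQL supports this directly with `LAST(value, TRUE)` (ignore nulls) over a running window. Below is a runnable sketch of the same logic using Python's built-in sqlite3 (which has no IGNORE NULLS, so it uses the count-of-non-nulls grouping trick); the `readings` table and column names are made up for illustration.

```python
import sqlite3

# Hypothetical table: id orders the rows; value holds NULL wherever the
# outlier-detection CASE statement has discarded a reading.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE readings (id INTEGER, value REAL);
INSERT INTO readings VALUES
  (1, 556), (2, NULL),             -- outlier right after the 556 row
  (3, 561), (4, NULL), (5, NULL),  -- a run of outliers: no extra CASEs needed
  (6, 570);
""")

# COUNT(value) ignores NULLs, so every outlier row shares a group id with
# the last valid row before it; MAX() then pulls that valid value forward.
# In Spark SQL you can skip the trick and write directly:
#   LAST(value, TRUE) OVER (ORDER BY id
#                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
rows = con.execute("""
SELECT id, value,
       MAX(value) OVER (PARTITION BY grp) AS filled
FROM (
  SELECT id, value,
         COUNT(value) OVER (ORDER BY id
                            ROWS BETWEEN UNBOUNDED PRECEDING
                            AND CURRENT ROW) AS grp
  FROM readings
)
ORDER BY id
""").fetchall()
for r in rows:
    print(r)
```

Because the window expression is evaluated per row, arbitrarily long runs of consecutive outliers are filled by the one expression.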

Related

Looping/iterating a table query in BigQuery

I am using BigQuery and I don't know how to loop over a table in a database there. For example, let's suppose we have schema_A.tableA with the following information:
Table A
Originally, TableA.columnA holds the starting value for each row, and columnE is calculated from the other three columns. What I am looking for is to iterate/loop so that the result of columnE (LAG(columnE)) feeds into the calculation for the second row; the third row would then take the result of columnE from row 2, and so on.
The desired output is like this:
For example, in row 2 columnA is 500 because the result of the previous row is 500; in the third row it is 300 because that was the result of columnE in row 2, and so on. I don't know how looping works in BigQuery; I would really appreciate your knowledge.
Please help!!!
So far I have read some threads, but none of them shows how to set a variable from a query; they all loop from 0. https://towardsdatascience.com/loops-in-bigquery-db137e128d2d
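This kind of row-to-row chaining doesn't need an imperative loop: a recursive CTE can feed each row's columnE into the next row's columnA, and BigQuery's GoogleSQL supports `WITH RECURSIVE`. Since the actual calculation isn't shown in the question, the sketch below assumes a hypothetical formula `columnE = columnA - columnB`; it runs against sqlite3 purely for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tableA (rn INTEGER, columnA REAL, columnB REAL);
-- Only row 1's columnA is "real"; later columnA values are to be derived.
INSERT INTO tableA VALUES (1, 1000, 500), (2, NULL, 200), (3, NULL, 100);
""")

# Hypothetical formula: columnE = columnA - columnB.  The recursive member
# replaces each later row's columnA with the previous row's columnE,
# which is the LAG(columnE)-style chaining the question describes.
rows = con.execute("""
WITH RECURSIVE calc(rn, columnA, columnB, columnE) AS (
  SELECT rn, columnA, columnB, columnA - columnB
  FROM tableA WHERE rn = 1
  UNION ALL
  SELECT t.rn, c.columnE, t.columnB, c.columnE - t.columnB
  FROM tableA t JOIN calc c ON t.rn = c.rn + 1
)
SELECT rn, columnA, columnE FROM calc ORDER BY rn
""").fetchall()
print(rows)
```

With this seed data, row 2's columnA becomes 500 (row 1's columnE) and row 3's becomes 300 (row 2's columnE), matching the pattern in the question.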

Leading a value into another row based on scenario

I want to lead a row value into another row depending on the scenario.
Here is my input in Hive table:
Output:
There should be only one entry per second; the Lat-Lang, V1 and V2 column values can be derived from the latest millisecond having a valid (non-null) value.
Please suggest a windowing function or Spark API to achieve this.
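One way to sketch the requested behaviour: for each second, pick each column's value from the latest millisecond where that column is non-null. In Spark SQL that is `LAST(col, TRUE)` over a window partitioned by the second and ordered by milliseconds (Hive offers `LAST_VALUE(col) IGNORE NULLS`). The runnable sqlite3 sketch below uses correlated subqueries instead of the window; the table and column names are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE pings (sec INTEGER, ms INTEGER, v1 REAL, v2 REAL);
INSERT INTO pings VALUES
  (1, 100, 10.0, NULL),
  (1, 900, NULL, 7.0),   -- latest valid v2 for second 1
  (2, 200, NULL, NULL),
  (2, 500, 12.0, 8.0);
""")

# For each second, take each column from the latest millisecond where that
# column is not NULL.  In Spark SQL the same result comes from:
#   LAST(v1, TRUE) OVER (PARTITION BY sec ORDER BY ms
#     ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
rows = con.execute("""
SELECT sec,
       (SELECT v1 FROM pings p2 WHERE p2.sec = p1.sec AND v1 IS NOT NULL
        ORDER BY ms DESC LIMIT 1) AS v1,
       (SELECT v2 FROM pings p2 WHERE p2.sec = p1.sec AND v2 IS NOT NULL
        ORDER BY ms DESC LIMIT 1) AS v2
FROM pings p1
GROUP BY sec
""").fetchall()
print(rows)
```

Note how second 1 combines v1 from millisecond 100 with v2 from millisecond 900, collapsing to one row per second as required.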

SQL query - how to update subsequent columns by summing up the current row's value in a single SELECT query (need to avoid a WHILE loop)

The logic we are trying to achieve in a single query is as follows. We need to loop based on the row-number column:
1. On each loop, sum the remaining value and the new value; the resulting value goes in the "by summing up" column, and its decimal part goes in the decimal-value column.
2. Sum the decimal-value column grouped by row number; that result becomes the remaining value of the next row number.
3. Continue steps 1-2 until the last record.
We achieved this with a WHILE loop, but we are trying to do it without one.
Can someone please suggest an approach?
Please refer to the attached image for the example table.
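Without the image the exact table layout is guesswork, but the step-by-step description (running total, whole part kept, decimal part carried to the next row number) maps naturally onto a recursive CTE, which replaces a WHILE loop in any database that supports one. Here is a sqlite3 sketch assuming a hypothetical one-value-per-row-number table; if several rows share a row number, aggregate them per row number first, as the description's grouping step implies.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE vals (rn INTEGER, new_value REAL);
INSERT INTO vals VALUES (1, 1.5), (2, 2.25), (3, 0.5);
""")

# Each step: total = carried decimal + new value; the whole part goes to
# the "by summing up" column, the decimal part is carried to the next row.
rows = con.execute("""
WITH RECURSIVE calc(rn, total, whole_part, dec_part) AS (
  SELECT rn, new_value, CAST(new_value AS INTEGER),
         new_value - CAST(new_value AS INTEGER)
  FROM vals WHERE rn = 1
  UNION ALL
  SELECT v.rn, c.dec_part + v.new_value,
         CAST(c.dec_part + v.new_value AS INTEGER),
         (c.dec_part + v.new_value) - CAST(c.dec_part + v.new_value AS INTEGER)
  FROM vals v JOIN calc c ON v.rn = c.rn + 1
)
SELECT rn, total, whole_part, dec_part FROM calc ORDER BY rn
""").fetchall()
print(rows)
```

The recursion terminates on its own when no row with the next row number exists, so there is no explicit loop counter to manage.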

How to normalise inconsistent category labels in teradata

The table has text labels for category identification, and the spelling has changed over time. I want to normalise the text labels when I count the records in each category.
So, for example, I have the category labels 'Ready to go' and 'Readytogo'.
But I also have another text value, abc, that I want to replace with Abcd.
The rest I want to keep as-is in my GROUP BY and COUNT.
How can I count these under the new group names in Teradata?
At the moment I'm using CASE statements for the ones I'm okay with, then using OREPLACE to switch one of the values. But how do I nest it so that I can OREPLACE two or more values with two or more new ones? Is OREPLACE the best function to use here?
Thanks
Apologies, I have managed to solve it using multiple WHEN/THEN clauses in a single CASE statement.
This changes the text value ax to bn, along with Fx to BM, for example, as required.
The OREPLACE function was not recognised in the release of Teradata that I am using.
I will specify the question better next time.
Thanks
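For reference, the multiple-WHEN approach can be written as a single CASE inside the GROUP BY query, with everything not listed falling through unchanged via ELSE. A runnable sketch follows (sqlite3 here, but the same CASE works in Teradata, where you may need to repeat the CASE expression in the GROUP BY or use `GROUP BY 1` instead of the alias):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cats (label TEXT);
INSERT INTO cats VALUES
  ('Ready to go'), ('Readytogo'), ('abc'), ('Something else');
""")

# One CASE expression maps every variant spelling to its canonical
# label; anything unlisted falls through unchanged via ELSE.
rows = con.execute("""
SELECT CASE
         WHEN label IN ('Ready to go', 'Readytogo') THEN 'Ready to go'
         WHEN label = 'abc' THEN 'Abcd'
         ELSE label
       END AS norm_label,
       COUNT(*) AS n
FROM cats
GROUP BY norm_label
ORDER BY norm_label
""").fetchall()
print(rows)
```

Adding another normalisation is just one more WHEN branch; no nesting of OREPLACE calls is needed.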

MS SQL 2000 - How to efficiently walk through a set of previous records and process them in groups. Large table

I'd like to consult on one thing. I have a table in the DB. It has 2 columns and looks like this:
Name    bilance
Jane    +3
Jane    -5
Jane     0
Jane    -8
Jane    -2
Paul    -1
Paul     2
Paul     9
Paul     1
...
I have to walk through this table, and when I find a record with a different "name" than the previous row, I process all rows with the previous "name". (When I step on the first Paul row, I process all the Jane rows.)
The processing goes like this:
I now work only with the Jane records and walk through them one by one. On each record I stop and compare it with all previous Jane rows, one by one.
The task is to summarize the "bilance" column (within the scope of the current person) where values have different signs.
Summary:
I loop through this table at 3 levels in parallel (nested loops):
1st level = search for changes of "name" column
2nd level = if change was found, get all rows with previous "name" and walk through them
3rd level = on each row stop and walk through all previous rows with current "name"
Can this be solved using only a CURSOR and FETCH, or is there a smoother solution?
My real table has 30,000 rows and 1,500 people, and when I do the logic in PHP it takes long minutes and then times out. So I would like to rewrite it in MS SQL 2000 (no other DB is allowed). Are cursors a fast solution, or is it better to use something else?
Thank you for your opinions.
UPDATE:
There are lots of questions about my "summarization". The problem is a little more difficult than I explained; I simplified it just to describe my algorithm.
Each row of my table contains many more columns, the most important being the month. That's why there are several rows per person: one for each month.
The "bilances" are workers' overtime hours and arrear hours, and I need to summarize the + and - bilances to neutralize them using values from previous months. I want to end up with as many zeroes as possible. The table must stay as it is; just the bilances must be changed toward zero.
Example:
Row (Jane -5) will be netted against row (Jane +3): instead of 3 I will get 0, and instead of -5 I will get -2, because I used the -5 to reduce the +3.
The next row (Jane 0) won't be affected.
The next row (Jane -8) cannot be used, because no positive bilance remains in the previous rows.
etc.
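To pin down the netting semantics before attempting a set-based SQL version, here is a plain-Python sketch of the greedy pass described above: each negative bilance consumes the earliest still-positive earlier rows of the same person. The function name and row layout are made up for illustration; the point is that the rule is deterministic and order-dependent, which is why a plain GROUP BY sum cannot reproduce it.

```python
def net_bilances(rows):
    """Greedily net each negative bilance against the earliest
    still-positive earlier rows for the same person (rows arrive
    in month order).  Returns the adjusted (name, bilance) list."""
    out = []
    for name, bilance in rows:
        if bilance < 0:
            # consume earlier positives of the same person, oldest first
            for i, (n, b) in enumerate(out):
                if n == name and b > 0:
                    used = min(b, -bilance)
                    out[i] = (n, b - used)
                    bilance += used
                    if bilance == 0:
                        break
        out.append((name, bilance))
    return out

data = [("Jane", 3), ("Jane", -5), ("Jane", 0), ("Jane", -8), ("Jane", -2)]
print(net_bilances(data))
```

Running this on the Jane rows reproduces the worked example: +3 becomes 0, -5 becomes -2, 0 is untouched, and -8 stays as-is because no positives remain.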
You can sum all the values per name using a single SQL statement:
select
name,
sum(bilance) as bilance_sum
from
my_table
group by
name
order by
name
On the face of it, it sounds like this should do what you want:
select Name, sum(bilance)
from my_table
group by Name
order by Name
If not, you might need to elaborate on how the Names are sorted and what you mean by "summarize".
I'm not sure what you mean by the line "The task is to summarize the 'bilance' column (in the scope of the actual person) if they have different signs".
But, it may be possible to use a group by query to get a lot of what you need.
select name,
       case when bilance < 0 then 'negative' else 'positive' end as sign,
       count(*)
from my_table
group by name,
         case when bilance < 0 then 'negative' else 'positive' end
That might not be perfect syntax for the case statement, but it should get you really close.