Copy value in previous row if not set - sql

I would like to write a SELECT query for BigQuery that sets the value of a column to the value in the previous row, if in the current row it is set to NULL.
I have something like this for now:
SELECT *, IFNULL(tag, LAG(tag) OVER(ORDER BY id)) as new_tag FROM tags
...but it only fills a NULL row that directly follows a non-NULL row; longer runs of NULLs stay NULL. Is there some way of filling them all?

The LAG window function doesn't support the IGNORE NULLS clause, so use the LAST_VALUE function with IGNORE NULLS instead. Applied to your query:
SELECT *, LAST_VALUE(tag IGNORE NULLS) OVER(ORDER BY id) as new_tag FROM tags
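To see the difference on a small table, here is a minimal Python sketch using the standard-library sqlite3 module. The sample rows are made up. SQLite has no IGNORE NULLS, so the forward fill is emulated with a running COUNT(tag) plus MAX per group; that emulation is a substitution for illustration, not the BigQuery query above. It assumes the bundled SQLite is 3.25 or newer, which window functions require.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (id INTEGER PRIMARY KEY, tag TEXT)")
# Made-up sample data: two runs of NULLs, one of them two rows long.
conn.executemany(
    "INSERT INTO tags VALUES (?, ?)",
    [(1, "a"), (2, None), (3, None), (4, "b"), (5, None)],
)

# One-step LAG: only a NULL directly after a non-NULL row gets filled.
lag_fill = conn.execute(
    "SELECT id, IFNULL(tag, LAG(tag) OVER (ORDER BY id)) FROM tags ORDER BY id"
).fetchall()

# Forward fill without IGNORE NULLS: COUNT(tag) skips NULLs, so the running
# count stays constant across a run of NULLs and labels each group of rows
# that should share the same value.
forward_fill = conn.execute(
    """
    WITH grouped AS (
        SELECT id, tag, COUNT(tag) OVER (ORDER BY id) AS grp
        FROM tags
    )
    SELECT id, MAX(tag) OVER (PARTITION BY grp) AS new_tag
    FROM grouped
    ORDER BY id
    """
).fetchall()

print(lag_fill)
print(forward_fill)
```

Row 3 stays NULL with the LAG version because its immediate predecessor is itself NULL, while the forward fill carries "a" through the whole run.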

Related

How to get the preceding values in Redshift based on Where condition?

I have three columns: student_name, column_1, and column_2. I want to print the preceding value wherever 0s exist in column_2.
I want output like the one below. I used the lag function, but I am probably using it the wrong way.
From what I can tell, you want to count the number of 0 values up to and including each row. If this interpretation is correct, you would use a conditional cumulative sum:
select t.*,
       sum( (column_2 = 0)::int ) over (partition by student_name
                                        order by <ordering column>
                                        rows between unbounded preceding and current row
                                       )
from t;
Note: This assumes that you have an ordering column which you have not included in the question.
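A runnable sketch of the conditional cumulative sum, using Python's sqlite3 module rather than Redshift. The ordering column seq, the alias zeros_so_far, and the sample rows are all hypothetical, and CASE replaces the ::int cast because SQLite does not have that cast syntax. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# seq is a hypothetical ordering column; the question did not include one.
conn.execute("CREATE TABLE t (student_name TEXT, seq INTEGER, column_2 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("ann", 1, 5), ("ann", 2, 0), ("ann", 3, 0), ("ann", 4, 3),
     ("bob", 1, 0), ("bob", 2, 7)],
)

# Count the zeros seen so far within each student, up to and including
# the current row.
running_zero_counts = conn.execute(
    """
    SELECT student_name, seq,
           SUM(CASE WHEN column_2 = 0 THEN 1 ELSE 0 END) OVER (
               PARTITION BY student_name
               ORDER BY seq
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS zeros_so_far
    FROM t
    ORDER BY student_name, seq
    """
).fetchall()

print(running_zero_counts)
```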

Copy column values using a partition by statement in BigQuery

In BigQuery, I am trying to copy column values into other rows using a PARTITION BY statement anchored to a particular ID number.
Here is an example:
Right now, I am trying to use:
MIN(col_a) OVER (PARTITION BY CAST(id AS STRING), date ORDER BY date) AS col_b
It doesn't seem like the aggregate function is working properly. As in, the "col_b" still has null values when I try this method. Am I misunderstanding how aggregate functions work?
You can use this:
MIN(col_a) OVER (PARTITION BY id) AS col_b
If you have one value per id, this will return that value.
Note that converting id to a string is unnecessary. Also, you don't need a cumulative minimum, hence no ORDER BY.
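A small sketch of why the original expression leaves NULLs, run against SQLite through Python's sqlite3 module instead of BigQuery. The table contents are invented for illustration. Partitioning by both id and date puts every row in its own partition here, so MIN has nothing else to look at; partitioning by id alone lets the single non-NULL value spread to the other rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, date TEXT, col_a INTEGER)")
# Invented rows: one non-NULL col_a per id.
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [(1, "2020-01-01", 10), (1, "2020-01-02", None),
     (2, "2020-01-01", None), (2, "2020-01-02", 20)],
)

# The question's window: each (id, date) pair is its own partition.
per_id_and_date = conn.execute(
    """
    SELECT id, date, MIN(col_a) OVER (PARTITION BY id, date ORDER BY date) AS col_b
    FROM my_table
    ORDER BY id, date
    """
).fetchall()

# The answer's window: all rows of an id share one partition.
per_id = conn.execute(
    """
    SELECT id, date, MIN(col_a) OVER (PARTITION BY id) AS col_b
    FROM my_table
    ORDER BY id, date
    """
).fetchall()

print(per_id_and_date)
print(per_id)
```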
Another option uses coalesce:
select *, coalesce(col_a, (select min(col_a) from my_table b where a.id=b.id)) col_b
from my_table a;

Is there an easy way to use the lag function for many iterations on the same column?

I am working with Impala (but could do the same in Oracle SQL), and I have a column in which I need to fill the nulls with the previous non-null value for each merchandise (even if it is 25 rows back).
I wrote a query that gets me to my end result, but I would need to write 30 case when statements (as many as there are days in a month).
Is there an easier way to do it?
I used the lag function, but could only make it work by getting the previous value of the column. If that value is null, I have to redo the lag function on the new column I had just created:
select a.*,
       case when new_value is null then LAG(new_value, 1) OVER (partition by merchandise ORDER BY date_mec)
            else new_value end as new_value_2
from (
    SELECT merchandise, date_mec, value,
           case when value is null then LAG(value, 1) OVER (partition by merchandise ORDER BY date_mec)
                else value end AS new_value
    FROM mer_try_value
) a
My table looks like this
The table that I created with 2 case when statements looks like this
Is there a better way to reach my required end result?
In this case you don't want the previous value when the current row has a non-null value, so use last_value with the ignore nulls clause:
select merchandise, date_mec, value,
last_value(value) ignore nulls over (partition by merchandise order by date_mec) new_val
from mer_try_value
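With an order by clause, the default window for last_value() ends at the current row, so it returns the current row's value if that is non-null and otherwise the most recent non-null value before it.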
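To make that behaviour concrete, here is a plain-Python model of what the query computes. It is only an illustration of the semantics, not how Impala or Oracle execute it; the function name and the sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def last_value_ignore_nulls(rows):
    """Model of last_value(value) ignore nulls over
    (partition by merchandise order by date_mec).

    rows: iterable of (merchandise, date_mec, value) tuples.
    Returns the rows in partition/date order with new_val appended.
    """
    out = []
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _, partition in groupby(ordered, key=itemgetter(0)):
        last_seen = None  # reset at every partition boundary
        for merchandise, date_mec, value in partition:
            if value is not None:
                last_seen = value  # the current row wins when it is non-null
            out.append((merchandise, date_mec, value, last_seen))
    return out


# Hypothetical sample: a two-row gap for "m1", and "m2" starting with a null.
sample = [
    ("m1", "2020-01-01", 10), ("m1", "2020-01-02", None),
    ("m1", "2020-01-03", None), ("m1", "2020-01-04", 12),
    ("m2", "2020-01-01", None), ("m2", "2020-01-02", 7),
]
filled = last_value_ignore_nulls(sample)
print([row[3] for row in filled])
```

A leading null in a partition stays null because there is no earlier non-null value to carry forward, and values never leak from one merchandise to the next.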

SQL to detect adjacent duplicate rows (duplicates indicated by values in subset of columns)?

I'm trying to hunt down a potential bug in my application, and I'd like to see if there are duplicate events recorded for a user. These would be evidenced by duplicate data in two columns, user and value, whereas the other columns could have slightly different metadata. Is there a way to identify such duplicates in a single SQL query?
Window functions can be used here, specifically lag or lead, depending on whether you want the previous or the next occurrence of the duplicate.
This query uses lag, but lead can safely be substituted:
WITH event_with_lag_data AS (
    SELECT user, value, value_ts,
           lag(user) over (order by value_ts) as prev_user,
           lag(value) over (order by value_ts) as prev_value
    FROM event_data
)
SELECT user, value, value_ts
FROM event_with_lag_data
WHERE user = prev_user AND value = prev_value
value_ts is the ordering column. I assume events are ordered by date/time.
If you have more columns to test for equality, it is just a matter of adding more lines to the lag part and to the final where clause.
In case you are just looking for any duplicates based on the two columns (it isn't clear what you mean by "adjacent"), here is a solution:
WITH duplicates AS (
    SELECT
        user,
        value,
        COUNT(*) AS cnt
    FROM event_data
    GROUP BY
        user,
        value
)
SELECT
    user,
    value
FROM duplicates
WHERE cnt > 1
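The two readings of "duplicate" give different answers, which a small sketch makes visible. This runs both queries against SQLite through Python's sqlite3 module; the event rows are made up, and the grouped query uses HAVING instead of a CTE purely to keep the example short. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_data (user TEXT, value TEXT, value_ts INTEGER)")
# Made-up events: (u1, x) occurs three times, but only the first two
# occurrences are adjacent in time.
conn.executemany(
    "INSERT INTO event_data VALUES (?, ?, ?)",
    [("u1", "x", 1), ("u1", "x", 2), ("u2", "y", 3), ("u1", "x", 4)],
)

# Adjacent duplicates: the row repeats the row immediately before it.
adjacent = conn.execute(
    """
    WITH event_with_lag_data AS (
        SELECT user, value, value_ts,
               LAG(user)  OVER (ORDER BY value_ts) AS prev_user,
               LAG(value) OVER (ORDER BY value_ts) AS prev_value
        FROM event_data
    )
    SELECT user, value, value_ts
    FROM event_with_lag_data
    WHERE user = prev_user AND value = prev_value
    ORDER BY value_ts
    """
).fetchall()

# Any duplicates: the pair occurs more than once anywhere in the table.
any_duplicates = conn.execute(
    """
    SELECT user, value, COUNT(*) AS cnt
    FROM event_data
    GROUP BY user, value
    HAVING COUNT(*) > 1
    """
).fetchall()

print(adjacent)
print(any_duplicates)
```

The lag query flags only the event at value_ts 2; the event at value_ts 4 is not reported because a different user's event sits between it and the earlier ones. The grouped query counts all three.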

Return only the newest rows from a BigQuery table with a duplicate items

I have a table with many duplicate items – Many rows with the same id, perhaps with the only difference being a requested_at column.
I'd like to do a select * from the table, but only return one row with the same id – the most recently requested.
I've looked into group by id but then I need to do an aggregate for each column. This is easy with requested_at – max(requested_at) as requested_at – but the others are tough.
How do I make sure I get the value for title, etc that corresponds to that most recently updated row?
I suggest a form that avoids a sort in the window function by using MAX instead of ROW_NUMBER:
SELECT *
FROM (
  SELECT
    *,
    MAX(<timestamp_column>)
      OVER (PARTITION BY <id_column>)
      AS max_timestamp,
  FROM <table>
)
WHERE <timestamp_column> = max_timestamp
Try something like this:
SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER()
      OVER (
        PARTITION BY <id_column>
        ORDER BY <timestamp_column> DESC)
      row_number,
  FROM <table>
)
WHERE row_number = 1
Note it will add a row_number column, which you might not want. To fix this, you can select individual columns by name in the outer select statement.
In your case, it sounds like the requested_at column is the one you want to use in the ORDER BY.
You will also want to use allow_large_results, set a destination table, and specify no flattening of results (if you have a schema with repeated fields).
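Here is a sketch comparing the MAX and ROW_NUMBER forms on a tiny table, run against SQLite through Python's sqlite3 module rather than BigQuery. The table name items, the alias rn, the extra title tie-breaker, and the sample rows are assumptions for illustration. The outer queries list columns by name so the helper column does not appear in the output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, title TEXT, requested_at INTEGER)")
# Made-up rows; id 3 has two rows sharing the newest timestamp.
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(1, "old", 1), (1, "new", 2), (2, "only", 5),
     (3, "tie-a", 7), (3, "tie-b", 7)],
)

# MAX form: keeps every row whose timestamp equals the per-id maximum.
by_max = conn.execute(
    """
    SELECT id, title, requested_at
    FROM (
        SELECT *, MAX(requested_at) OVER (PARTITION BY id) AS max_timestamp
        FROM items
    )
    WHERE requested_at = max_timestamp
    ORDER BY id, title
    """
).fetchall()

# ROW_NUMBER form: keeps exactly one row per id. The title tie-breaker is
# added here only to make the choice among tied rows deterministic.
by_row_number = conn.execute(
    """
    SELECT id, title, requested_at
    FROM (
        SELECT *, ROW_NUMBER() OVER (
                      PARTITION BY id
                      ORDER BY requested_at DESC, title) AS rn
        FROM items
    )
    WHERE rn = 1
    ORDER BY id
    """
).fetchall()

print(by_max)
print(by_row_number)
```

If several rows for an id share the newest timestamp, the MAX form returns all of them, while the ROW_NUMBER form returns only one. Which of the two you want depends on whether such ties can occur in your data.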