I'm trying to migrate some poorly formed data into a database. The data comes from a CSV, and is first loaded into a staging table of all varchar columns (as I cannot enforce type safety at this stage).
The data might look like
COL1 | COL2 | COL3
Name 1 | |
2/11/16 | $350 | $230
2/12/16 | $420 | $387
2/13/16 | $435 | $727
Name 2 | |
2/11/16 | $121 | $144
2/12/16 | $243 | $658
2/13/16 | $453 | $214
The first column is a mixture of company names acting as pseudo-headers, and dates for which the column 2 and 3 data is relevant. I'd like to start transforming the data by creating a 'StoreBrand' column, where 'StoreBrand' is the value of Col1 if Col2 is NULL, or the previous row's StoreBrand otherwise. Something like:
COL1 | COL2 | COL3 | StoreBrand
Name 1 | | | Name 1
2/11/16 | $350 | $230 | Name 1
2/12/16 | $420 | $387 | Name 1
2/13/16 | $435 | $727 | Name 1
Name 2 | | | Name 2
2/11/16 | $121 | $144 | Name 2
2/12/16 | $243 | $658 | Name 2
2/13/16 | $453 | $214 | Name 2
I wrote this:
SELECT
t.*,
CASE
WHEN t.COL2 IS NULL THEN COL1
ELSE LAG(StoreBrand) OVER ()
END AS StoreBrand
FROM
(
SELECT
ROW_NUMBER() OVER () AS i,
*
FROM
Staging_Data
) t;
But the database (postgres in this case, but we're considering alternatives so the most diverse answer is preferred) chokes on LAG(StoreBrand) because that's the derived column I'm creating. Invoking LAG(Col1) only populates the first row's real data:
COL1 | COL2 | COL3 | StoreBrand
Name 1 | | | Name 1
2/11/16 | $350 | $230 | Name 1
2/12/16 | $420 | $387 | 2/11/16
2/13/16 | $435 | $727 | 2/12/16
Name 2 | | | Name 2
2/11/16 | $121 | $144 | Name 2
2/12/16 | $243 | $658 | 2/11/16
2/13/16 | $453 | $214 | 2/12/16
My goal would be a StoreBrand column which is the first value of COL1 for all date values before the next brand name:
COL1 | COL2 | COL3 | StoreBrand
Name 1 | | | Name 1
2/11/16 | $350 | $230 | Name 1
2/12/16 | $420 | $387 | Name 1
2/13/16 | $435 | $727 | Name 1
Name 2 | | | Name 2
2/11/16 | $121 | $144 | Name 2
2/12/16 | $243 | $658 | Name 2
2/13/16 | $453 | $214 | Name 2
The value of StoreBrand when Col2 and Col3 are null is inconsequential - that row will be dropped as part of the conversion process. The important thing is associating the data rows (i.e. those with dates) with their brand.
Is there a way to reference the previous value for the column that I'm missing?
Edit for people who find this question through a search engine:
The trick was to use WITH, which lets a temporary result be reused in several places.
I think this does what you want and discards the null rows at the same time (if you want to). We basically select all the brands before the row we are currently looking at and if no "brand row" exists between it and the current row then we take it.
WITH t AS
(SELECT
ROW_NUMBER() OVER () AS i,
*
FROM
Staging_Data
)
SELECT
a.COL1,
a.COL2,
a.COL3,
(SELECT b.COL1 FROM t b WHERE b.COL2 IS NULL AND b.i <= a.i AND NOT EXISTS(
SELECT * FROM t c WHERE c.COL2 IS NULL AND c.i <= a.i AND c.i > b.i)
) StoreBrand
FROM
t a
WHERE -- I don't think you need those rows? Otherwise remove it.
a.COL2 IS NOT NULL
It can be a bit confusing. t is a temporary table we defined with your query. And a, b and c are aliases for t. We could also write FROM t AS a to make it more obvious.
I think I understand what you want. Technically, you want the ignore nulls option on lag(), so it would look like this:
select lag(case when col1 not like '%/%/%' then col1 end ignore nulls) over (order by linenumber) as brandname
The only problem? Postgres doesn't support ignore nulls.
But, you can do pretty much the same thing with a subquery. The idea is to assign a grouping identifier to each group. And this is a cumulative count of the valid brand names. Then a simple max() aggregation works:
select t.*,
max(case when col1 not like '%/%/%' then col1 end) over (partition by grp) as brand
from (select t.*,
sum(case when col1 not like '%/%/%' then 1 end) over
(order by linenumber) as grp
from t
) t;
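The cumulative-count idea above can be tried end to end with a small script. This is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module (SQLite 3.25 or later, for window functions) as a stand-in for Postgres, the table name staging and the linenumber column are hypothetical (some column has to record the file order), and header rows are detected with col2 IS NULL, as in the question, rather than with the LIKE pattern.

```python
import sqlite3

# Hypothetical staging data; linenumber is an assumed column recording file order.
rows = [
    (1, "Name 1", None, None),
    (2, "2/11/16", "$350", "$230"),
    (3, "2/12/16", "$420", "$387"),
    (4, "Name 2", None, None),
    (5, "2/11/16", "$121", "$144"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staging (linenumber INTEGER, col1 TEXT, col2 TEXT, col3 TEXT)"
)
conn.executemany("INSERT INTO staging VALUES (?, ?, ?, ?)", rows)

query = """
SELECT s.col1, s.col2, s.col3,
       -- each group contains exactly one header row, so MAX returns its name
       MAX(CASE WHEN s.col2 IS NULL THEN s.col1 END)
           OVER (PARTITION BY s.grp) AS storebrand
FROM (
    SELECT staging.*,
           -- running count of header rows: goes up by one at each brand name
           SUM(CASE WHEN col2 IS NULL THEN 1 ELSE 0 END)
               OVER (ORDER BY linenumber) AS grp
    FROM staging
) AS s
ORDER BY s.linenumber
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Every data row ends up carrying the name from the most recent header row above it, and the header rows can then be filtered out with col2 IS NOT NULL.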
I hope everyone is doing well. I have a dilemma that I cannot quite figure out.
I am trying to find a unique value for a field that is not a duplicate.
For example:
Table 1
|Col1 | Col2| Col3 |
| 123 | A | 1 |
| 123 | A | 2 |
| 12 | B | 1 |
| 12 | B | 2 |
| 12 | C | 3 |
| 12 | D | 4 |
| 1 | A | 1 |
| 2 | D | 1 |
| 3 | D | 1 |
Col1 is the field that may contain duplicate values. Col2 is the owner of the value in Col1. Col3 is generated with ROW_NUMBER() OVER (PARTITION BY ...) to number the rows in ascending order.
The goal I am trying to accomplish is to remove the value in Col1 if it is not truly unique when looking at Col2.
Example:
Col1 has the value 123, Col2 has the value A. Although there are two instances of 123 being owned by A, i can determine that it is indeed unique.
Now look at Col1 that has the value 12 with values in Col2 of B,C,D.
Value 12 is associated with three different owners thus eliminating 12 from our result list.
So in the end i would like to see a result table such as this :
|Col1 | Col2|
| 123 | A |
| 1 | A |
| 2 | D |
| 3 | D |
To summarize, I would like to first use the partition numbers to identify whether the value in Col1 is repeated. From there I want to verify that the values in Col2 are the same. If so, the Col1 and Col2 values remain as one single entry. However, if the values in Col2 do not match, all records for that Col1 value are removed.
I will provide the syntax code for my query if needed.
Update:
I failed to mention that table 1 is the result of inner joining two tables.
So Col1 comes from table a and Col2 comes from table b.
The values in table a for col2 are hard to interpret, so I had to make sense of them and assign them proper name values.
The join query I used to combine the two is:
Select a.Col1, B.Col2 FROM Table a INNER JOIN Table b on a.Colx = b.Colx
Update:
Table a:
|Col1 | Colx| Col3 |
| 123 | SMS | 1 |
| 123 | S9W | 2 |
| 12 | NAV | 1 |
| 12 | NFR | 2 |
| 12 | ABC | 3 |
| 12 | DEF | 4 |
| 1 | SMS | 1 |
| 2 | DEF | 1 |
| 3 | DES | 1 |
Table b:
|Colx | Col2|
| SMS | A |
| S9W | A |
| DEF | D |
| DES | D |
| NAV | B |
| NFR | B |
| ABC | C |
Above are sample data for both tables that get joined in order to create the first table displayed in this body.
Thank you all so much!
The NOT EXISTS operator can be used for this:
SELECT distinct Col1 , Col2
FROM table t
WHERE NOT EXISTS(
SELECT 1 FROM table t1
WHERE t.col1=t1.col1 AND t.col2 <> t1.col2
)
If I understand correctly, you want:
select col1, min(col2) as col2
from t
group by col1
having min(col2) = max(col2);
I think the third column is confusing you. It doesn't seem to play any role in the logic you want.
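To check the two approaches against the sample data, here is a small sketch using Python's built-in sqlite3 module as a stand-in for whatever database is actually in use. The table name t is an assumption, and the aggregate version is written as GROUP BY with HAVING MIN(col2) = MAX(col2), which keeps a col1 value only when all of its owners are identical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INTEGER, col2 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(123, "A"), (123, "A"), (12, "B"), (12, "B"), (12, "C"),
     (12, "D"), (1, "A"), (2, "D"), (3, "D")],
)

# Aggregate form: a col1 value survives only if its smallest and largest
# owner are the same, i.e. it has a single distinct owner.
having_rows = conn.execute("""
    SELECT col1, MIN(col2) AS col2
    FROM t
    GROUP BY col1
    HAVING MIN(col2) = MAX(col2)
    ORDER BY col1
""").fetchall()

# NOT EXISTS form: drop a row if any other row shares col1 with a different owner.
not_exists_rows = conn.execute("""
    SELECT DISTINCT col1, col2
    FROM t
    WHERE NOT EXISTS (
        SELECT 1 FROM t AS t1
        WHERE t1.col1 = t.col1 AND t1.col2 <> t.col2
    )
    ORDER BY col1
""").fetchall()

print(having_rows)
print(not_exists_rows)
```

Both queries return the same four rows here. Note that neither treats NULL owners specially, so data with NULLs in col2 would need an explicit decision.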
I have a scenario where I need to pick the greatest value in each row from three columns. There is a function called GREATEST, but it doesn't work in my version of Hive (0.13).
Please suggest a better way to accomplish this.
Example table:
+---------+------+------+------+
| Col1 | Col2 | Col3 | Col4 |
+---------+------+------+------+
| Group A | 1 | 2 | 3 |
+---------+------+------+------+
| Group B | 4 | 5 | 1 |
+---------+------+------+------+
| Group C | 4 | 2 | 1 |
+---------+------+------+------+
expected Result:
+---------+------------+------------+
| Col1 | output_max | max_column |
+---------+------------+------------+
| Group A | 3 | Col4 |
+---------+------------+------------+
| Group B | 5 | col3 |
+---------+------------+------------+
| Group C | 4 | col2 |
+---------+------------+------------+
select col1
,tuple.col1 as output_max
,concat('Col',tuple.col2) as max_column
from (select Col1
,sort_array(array(struct(Col2,2),struct(Col3,3),struct(Col4,4)))[2] as tuple
from t
) t
;
sort_array(Array)
Sorts the input array in ascending order according to the natural ordering of the array elements and returns it
(as of version 0.9.0).
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
hive> select col1
> ,tuple.col1 as output_max
> ,concat('Col',tuple.col2) as max_column
>
> from (select Col1
> ,sort_array(array(struct(Col2,2),struct(Col3,3),struct(Col4,4)))[2] as tuple
> from t
> ) t
> ;
OK
Group A 3 Col4
Group B 5 Col3
Group C 4 Col2
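The reason the sort_array trick works is that structs sort field by field, so pairing each value with its column number lets the largest value carry its origin along. The same idea can be sketched outside Hive; the snippet below is plain Python with tuples standing in for structs, and the function name greatest_with_column is made up for illustration.

```python
# Sample rows from the question: (Col1, Col2, Col3, Col4)
rows = [("Group A", 1, 2, 3), ("Group B", 4, 5, 1), ("Group C", 4, 2, 1)]


def greatest_with_column(row):
    """Return (name, max value, source column) for one row."""
    name, *values = row
    # Tag each value with its column number (Col2 -> 2, Col3 -> 3, Col4 -> 4).
    # Tuples compare element by element, like Hive structs in sort_array.
    tagged = sorted((value, index) for index, value in enumerate(values, start=2))
    top_value, top_index = tagged[-1]  # last element after an ascending sort
    return name, top_value, f"Col{top_index}"


result = [greatest_with_column(r) for r in rows]
for line in result:
    print(line)
```

One consequence of comparing the pairs as a whole: on a tie, the column with the higher number wins, because the column number is the second sort key.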
Let's say I have the following table:
| User_id | COL1 | COL2 |
+---------+----------+------+
| 1 | | 1 |
| 1 | | 2 |
| 1 | 2421 | |
| 1 | | 1 |
| 1 | 3542 | |
| 2 | | 1 |
I need another column indicating the next non-null COL1 value for each row, so the result would look like the below:
| User_id | COL1 | COL2 | COL3 |
+---------+----------+------+------+
| 1 | | 1 | 2421 |
| 1 | | 2 | 2421 |
| 1 | 2421 | | |
| 1 | | 1 | 3542 |
| 1 | 3542 | | |
| 2 | | 1 | |
SELECT
first_value(COL1 ignore nulls) over (partition by user_id order by COL2 rows unbounded following)
FROM table;
would work but I'm using PostgreSQL which doesn't support the ignore nulls clause.
Any suggested workarounds?
You can still do it with a window function if you add a CASE WHEN criterion to the ORDER BY, like this:
select
first_value(COL1)
over (
partition by user_id
order by case when COL1 is not null then 0 else 1 end ASC, COL2
rows unbounded following
)
from table
This will use non null values first.
However, performance will probably not be great compared to ignore nulls, because the database has to sort on the additional criterion.
I also had the same problem. The other solutions may work, but I have to build multiple windows for each row I need.
You can try these snippets: https://wiki.postgresql.org/wiki/First/last_(aggregate)
If you create the aggregates you can use them:
SELECT
first(COL1) over (partition by user_id order by COL2 rows unbounded following)
FROM table;
There is always the tried and true approach of using a correlated subquery:
select t.*,
(select t2.col1
from t t2
where t2.id >= t.id and t2.col1 is not null
order by t2.id asc
fetch first 1 row only
) as nextcol1
from t;
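Here is a runnable sketch of the correlated-subquery idea on the question's data. It makes several assumptions that are not in the original: an id column recording row order (the sample table has none), Python's sqlite3 standing in for PostgreSQL (so LIMIT 1 replaces FETCH FIRST), a strict t2.id > t.id comparison, a user_id match inside the subquery, and a CASE that leaves the result empty on rows that already have a COL1 value, to match the expected output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# id is an assumed ordering column; the question's table does not show one.
conn.execute("CREATE TABLE t (id INTEGER, user_id INTEGER, col1 INTEGER, col2 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(1, 1, None, 1), (2, 1, None, 2), (3, 1, 2421, None),
     (4, 1, None, 1), (5, 1, 3542, None), (6, 2, None, 1)],
)

query = """
SELECT t.id,
       CASE WHEN t.col1 IS NULL THEN (
           SELECT t2.col1
           FROM t AS t2
           WHERE t2.user_id = t.user_id      -- stay within the same user
             AND t2.id > t.id                -- only look at later rows
             AND t2.col1 IS NOT NULL
           ORDER BY t2.id                    -- nearest later row first
           LIMIT 1
       ) END AS next_col1
FROM t
ORDER BY t.id
"""
next_values = [row[1] for row in conn.execute(query)]
print(next_values)
```

The ascending sort is what makes this the next non-null value rather than the last one in the table.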
Hope this helps,
SELECT * FROM TABLE ORDER BY COALESCE(colA, colB);
which orders by colA and if colA has NULL value it orders by colB.
You can use COALESCE() function. For your query:
SELECT
first_value(COALESCE(COL1)) over (partition by user_id order by COL2 rows unbounded following)
FROM table;
But I don't understand the reason for sorting by COL2, because these rows have a null value for COL2:
| User_id | COL1 | COL2 |
+---------+----------+------+
| 1 | | 1 |
| 1 | | 2 |
| 1 | 2421 | | <<--- null?
| 1 | | 1 |
| 1 | 3542 | | <<--- null?
| 2 | | 1 |
I have an SQL table with some data like this, it is sorted by date:
+----------+------+
| Date | Col2 |
+----------+------+
| 12:00:01 | a |
| 12:00:02 | a |
| 12:00:03 | b |
| 12:00:04 | b |
| 12:00:05 | c |
| 12:00:06 | c |
| 12:00:07 | a |
| 12:00:08 | a |
+----------+------+
So, I want my select result to be the following:
+----------+------+
| Date | Col2 |
+----------+------+
| 12:00:01 | a |
| 12:00:03 | b |
| 12:00:05 | c |
| 12:00:07 | a |
+----------+------+
I have used the distinct clause but it removes the last two rows with Col2 = 'a'
You can use lag (SQL Server 2012+) to get the value in the previous row and then compare it with the current row value. If they are equal assign them to one group (1 here) and a different group (0 here) otherwise. Finally select the required rows.
select dt,col2
from (
select dt,col2,
case when lag(col2) over(order by dt) = col2 then 1 else 0 end as somecol
from t) x
where somecol=0
If you are using Microsoft SQL Server 2012 or later, you can do this:
select date, col2
from (
select date, col2,
case when isnull(lag(col2) over (order by date, col2), '') = col2 then 1 else 0 end as ignore
from (yourtable)
) x
where ignore = 0
This should work as long as col2 cannot contain nulls and if the empty string ('') is not a valid value for col2. The query will need some work if either assumption is not valid.
Same as the accepted answer (+1), just moving the conditions.
Assumes col2 is not null.
select dt, col2
from ( select dt, col2,
lag(col2, 1) over(order by dt) as lagCol2
from t
) x
where x.lagCol2 is null or x.lagCol2 <> x.col2
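The lag comparison can be checked quickly against the sample data. This is a sketch only: it runs on Python's built-in sqlite3 module (SQLite 3.25 or later for LAG) rather than SQL Server, and the table name t and column name dt are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (dt TEXT, col2 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("12:00:01", "a"), ("12:00:02", "a"), ("12:00:03", "b"), ("12:00:04", "b"),
     ("12:00:05", "c"), ("12:00:06", "c"), ("12:00:07", "a"), ("12:00:08", "a")],
)

query = """
SELECT dt, col2
FROM (
    SELECT dt, col2,
           LAG(col2) OVER (ORDER BY dt) AS prev_col2   -- value on the previous row
    FROM t
) AS x
-- keep the very first row (no predecessor) and every row where the value changes
WHERE prev_col2 IS NULL OR prev_col2 <> col2
ORDER BY dt
"""
run_starts = conn.execute(query).fetchall()
for row in run_starts:
    print(row)
```

Unlike DISTINCT, this keeps the second run of 'a' because it only compares each row with its immediate predecessor.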
I have a table like this:
+------+------+------+------+
| col1 | col2 | col3 | rank |
+------+------+------+------+
| 1 | A | X | 4 |
| 2 | C | Y | 3 |
| 2 | C | Y | 3 |
| | A | X | 3 |
| 1 | B | Z | 2 |
+------+------+------+------+
(5 rows)
I need output like this:
+------+------+------+------+
| col1 | col2 | col3 | rank |
+------+------+------+------+
| 1 | A | X | 4 |
| 2 | C | Y | 3 |
| 1 | B | Z | 2 |
+------+------+------+------+
So I wrote the query below:
select col1,col2,col3,rank,dense_rank() over(order by rank desc) from table1;
But it's not giving the proper output.
Try this:
select a.col1,a.col2,a.col3,max(a.rank) as rank
from [dbo].[5] a join [dbo].[5] b
on a.col1=b.col1 group by a.col1,a.col2,a.col3
Looks like you need aggregation with max():
select
col1,col2,col3,
max(rnk)
from table1
group by col1,col2,col3
If you could have different values of col1 for one combination of col2, col3, then distinct on is what you need:
select distinct on (col2, col3)
col1,col2,col3,
rnk
from table1
order by col2, col3, rnk desc
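DISTINCT ON is specific to PostgreSQL. For other databases, a common substitute is ROW_NUMBER() partitioned by the same columns, keeping only the first row of each partition. The sketch below shows that substitute on the sample data using Python's sqlite3 module (SQLite 3.25 or later); the table name table1 and the column name rnk follow the answer above, and the final ordering by rnk is an assumption made to match the desired output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 INTEGER, col2 TEXT, col3 TEXT, rnk INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [(1, "A", "X", 4), (2, "C", "Y", 3), (2, "C", "Y", 3),
     (None, "A", "X", 3), (1, "B", "Z", 2)],
)

query = """
SELECT col1, col2, col3, rnk
FROM (
    SELECT table1.*,
           -- number the rows of each (col2, col3) pair, highest rnk first
           ROW_NUMBER() OVER (PARTITION BY col2, col3 ORDER BY rnk DESC) AS rn
    FROM table1
) AS ranked
WHERE rn = 1          -- keep only the top row per pair
ORDER BY rnk DESC
"""
top_rows = conn.execute(query).fetchall()
for row in top_rows:
    print(row)
```

The row with the empty col1 is dropped because another (A, X) row has a higher rnk, and the duplicated (C, Y) row collapses to one.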
The following should match what you are looking for:
select col1,col2,col3,rank,dense_rank() over(order by rank desc) from table1
WHERE col1 IS NOT NULL
GROUP BY 1, 2, 3, 4;
You can also use numeric aliases in your order by clause if you want one.