SQL - suppressing duplicate *adjacent* records

I need to run a SELECT statement (DB2 SQL) that does not pull adjacent duplicate rows based on a certain field. Specifically, I am trying to find out when data changes, which is made difficult because it might change back to its original value.
That is to say, I have a table that vaguely resembles the below, sorted by Letter and then by Date:
A, 5, 2009-01-01
A, 12, 2009-02-01
A, 12, 2009-03-01
A, 12, 2009-04-01
A, 9, 2009-05-01
A, 9, 2009-06-01
A, 5, 2009-07-01
And I want to get the results:
A, 5, 2009-01-01
A, 12, 2009-02-01
A, 9, 2009-05-01
A, 5, 2009-07-01
discarding adjacent duplicates but keeping the last row (despite it having the same number as the first row). The obvious:
Select Letter, Number, Min(Update_Date) from Table group by Letter, Number
does not work -- it doesn't include the last row.
Edit: As there seems to have been some confusion, I have clarified the month column into a date column. It was meant as a human-parseable short form, not as actual valid data.
Edit: The last row is not important BECAUSE it is the last row, but because it has a "new value" that is also an "old value". Grouping by NUMBER would wrap it in with the first row; it needs to remain a separate entity.

Depending on which DB2 you're on, there are analytic functions which can make this problem easy to solve. An example in Oracle is below, but the select syntax appears to be pretty similar.
create table t1 (c1 char, c2 number, c3 date);
insert into t1 VALUES ('A', 5, DATE '2009-01-01');
insert into t1 VALUES ('A', 12, DATE '2009-02-01');
insert into t1 VALUES ('A', 12, DATE '2009-03-01');
insert into t1 VALUES ('A', 12, DATE '2009-04-01');
insert into t1 VALUES ('A', 9, DATE '2009-05-01');
insert into t1 VALUES ('A', 9, DATE '2009-06-01');
insert into t1 VALUES ('A', 5, DATE '2009-07-01');
SELECT C1, C2, C3
FROM (SELECT C1, C2, C3,
             LAG(C2) OVER (PARTITION BY C1 ORDER BY C3) AS PRIOR_C2,
             LEAD(C2) OVER (PARTITION BY C1 ORDER BY C3) AS NEXT_C2
      FROM T1
     )
WHERE C2 <> PRIOR_C2
   OR PRIOR_C2 IS NULL -- to pick up the first value
ORDER BY C1, C3;
C C2 C3
- ---------- -------------------
A 5 2009-01-01 00:00:00
A 12 2009-02-01 00:00:00
A 9 2009-05-01 00:00:00
A 5 2009-07-01 00:00:00

This is not possible with set based commands (i.e. group by and such).
You may be able to do this by using cursors.
Personally, I would get the data into my client application and do the filtering there.
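A minimal sketch of that client-side filtering in Python, assuming the rows have already been fetched sorted by letter and then by date. The rows list is just the sample data from the question, and drop_adjacent_duplicates is a hypothetical helper name:

```python
from itertools import groupby

# Sample rows from the question, already sorted by letter, then date.
rows = [
    ("A", 5, "2009-01-01"),
    ("A", 12, "2009-02-01"),
    ("A", 12, "2009-03-01"),
    ("A", 12, "2009-04-01"),
    ("A", 9, "2009-05-01"),
    ("A", 9, "2009-06-01"),
    ("A", 5, "2009-07-01"),
]


def drop_adjacent_duplicates(sorted_rows):
    """Keep the first row of every run of equal (letter, number) pairs."""
    # groupby only merges *consecutive* equal keys, so a value that
    # reappears later (5 in July) starts a new group of its own.
    runs = groupby(sorted_rows, key=lambda r: (r[0], r[1]))
    return [next(run) for _, run in runs]


changes = drop_adjacent_duplicates(rows)
for row in changes:
    print(row)
```

Because only consecutive rows are merged, the final 5 survives as its own entry, which is exactly what GROUP BY could not do.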

The first thing you'd have to do is identify the sequence within which you wish to view/consider the data. Values of "Jan, Feb, Mar" don't help, because the data's not in alphabetical order. And what happens when you flip from Dec to Jan? Step 1: identify a sequence that uniquely defines each row with regard to your problem.
Next, you have to be able to compare item #x with item #x-1, to see if it has changed. If changed, include; if not changed, exclude. Trivial when using procedural code loops (cursors in SQL), but would you want to use those? They tend not to perform too well.
One SQL-based way to do this is to join the table to itself under two aliases, with the join clause being "Prev.SequenceVal = Cur.SequenceVal - 1". Throw in a comparison, make sure you don't toss the very first row of the set (where there is no x-1), and you're done. Note that performance may suck if "SequenceVal" is not indexed.
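A rough sketch of that self-join, run against SQLite from Python purely for convenience (the original question is about DB2). The table name my_table and the gap-free seq column are assumptions for illustration; seq is the primary key, so it is indexed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table ("
    "seq INTEGER PRIMARY KEY, letter TEXT, number INTEGER, update_date TEXT)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [
        (1, "A", 5, "2009-01-01"),
        (2, "A", 12, "2009-02-01"),
        (3, "A", 12, "2009-03-01"),
        (4, "A", 12, "2009-04-01"),
        (5, "A", 9, "2009-05-01"),
        (6, "A", 9, "2009-06-01"),
        (7, "A", 5, "2009-07-01"),
    ],
)

query = """
SELECT cur.letter, cur.number, cur.update_date
FROM my_table AS cur
LEFT JOIN my_table AS prev
       ON prev.seq = cur.seq - 1
      AND prev.letter = cur.letter
WHERE prev.seq IS NULL            -- no predecessor: first row for this letter
   OR prev.number <> cur.number   -- value differs from the previous row
ORDER BY cur.seq
"""
changed_rows = conn.execute(query).fetchall()
for row in changed_rows:
    print(row)
```

The LEFT JOIN is what keeps the very first row: it has no x-1 partner, so prev.seq comes back as NULL.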

Using an "EXCEPT" clause is one way to do it. See below for the solution. I've included all of my test steps here. First, I created a session table (this will go away after I disconnect from my database).
CREATE TABLE session.sample (
letter CHAR(1),
number INT,
update_date DATE
);
Then I imported your sample data:
IMPORT FROM sample.csv OF DEL INSERT INTO session.sample;
Verified that your sample information is in the database:
SELECT * FROM session.sample;
LETTER NUMBER UPDATE_DATE
------ ----------- -----------
A 5 01/01/2009
A 12 02/01/2009
A 12 03/01/2009
A 12 04/01/2009
A 9 05/01/2009
A 9 06/01/2009
A 5 07/01/2009
7 record(s) selected.
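I wrote this with an EXCEPT clause, and used "WITH" to try to make it clearer. Basically, I select all rows that have a matching entry (same letter and number) exactly one month later, and then exclude those rows from a select on the whole table. Note that this keeps the last row of each run of duplicates; to keep the first row of each run, as the question asks, the join condition would be s.update_date = s2.update_date + 1 MONTH instead. Either way, it assumes consecutive rows are exactly one month apart.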
WITH rows_with_previous AS (
SELECT s.*
FROM session.sample s
JOIN session.sample s2
ON s.letter = s2.letter
AND s.number = s2.number
AND s.update_date = s2.update_date - 1 MONTH
)
SELECT *
FROM session.sample
EXCEPT ALL
SELECT *
FROM rows_with_previous;
Here is the result:
LETTER NUMBER UPDATE_DATE
------ ----------- -----------
A 5 01/01/2009
A 12 04/01/2009
A 9 06/01/2009
A 5 07/01/2009
4 record(s) selected.
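To see how the direction of the one-month comparison changes which row of each run survives, here is a small sketch using SQLite from Python. This is an assumption-laden stand-in for DB2: SQLite only has EXCEPT (not EXCEPT ALL) and uses date(x, '+1 month') rather than "+ 1 MONTH". The helper name rows_without_neighbour is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (letter TEXT, number INTEGER, update_date TEXT)")
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?)",
    [
        ("A", 5, "2009-01-01"),
        ("A", 12, "2009-02-01"),
        ("A", 12, "2009-03-01"),
        ("A", 12, "2009-04-01"),
        ("A", 9, "2009-05-01"),
        ("A", 9, "2009-06-01"),
        ("A", 5, "2009-07-01"),
    ],
)


def rows_without_neighbour(offset):
    """Rows with no same-valued row at update_date shifted by `offset`."""
    query = """
    WITH rows_with_neighbour AS (
        SELECT s.*
        FROM sample s
        JOIN sample s2
          ON s.letter = s2.letter
         AND s.number = s2.number
         AND s2.update_date = date(s.update_date, ?)
    )
    SELECT * FROM sample
    EXCEPT
    SELECT * FROM rows_with_neighbour
    ORDER BY 3
    """
    return conn.execute(query, (offset,)).fetchall()


# No same-valued row one month earlier: the first row of each run.
first_of_each_run = rows_without_neighbour("-1 month")
# No same-valued row one month later: the last row of each run.
last_of_each_run = rows_without_neighbour("+1 month")

print([r[2] for r in first_of_each_run])
print([r[2] for r in last_of_each_run])
```

Like the query above, this only works when the rows are spaced exactly one month apart.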

Related

Translating an Excel concept into SQL

Let's say I have the following range in Excel named MyRange:
This isn't a table by any means, it's more a collection of Variant values entered into cells. Excel makes it easy to sum these values doing =SUM(B3:D6) which gives 25. Let's not go into the details of type checking or anything like that and just figure that sum will easily skip values that don't make sense.
If we were translating this concept into SQL, what would be the most natural way to do this? The few approaches that came to mind are (ignore type errors for now):
MyRange returns an array of values:
-- myRangeAsList = [1,1,1,2, ...]
SELECT SUM(elem) FROM UNNEST(myRangeAsList) AS r (elem);
MyRange returns a table-valued function of a single column (basically the opposite of a list):
-- myRangeAsCol = (SELECT 1 UNION ALL SELECT 1 UNION ALL ...
SELECT SUM(elem) FROM myRangeAsCol as r (elem);
Or, perhaps more 'correctly', return a 3-columned table such as:
-- myRangeAsTable = (SELECT 1,1,1 UNION ALL SELECT 2,'other',2 UNION ALL ...
SELECT SUM(a+b+c) FROM myRangeAsTable AS r (a,b,c);
Unfortunately, I think this makes things the most difficult to work with, as we now have to combine an unknown number of columns.
Perhaps returning a single column is the easiest of the above to work with, but even that takes a very simple concept -- SUM(myRange) -- and converts it into something that is anything but that: SELECT SUM(elem) FROM myRangeAsCol as r (elem).
Perhaps this could also just be rewritten to a function for convenience, for example:
Just a possible direction to think about:
create temp function extract_values (input string)
returns array<string> language js as """
return Object.values(JSON.parse(input));
""";
with myrangeastable as (
select '1' a, '1' b, '1' c union all
select '2', 'other', '2' union all
select 'true', '3', '3' union all
select '4', '4', '4'
)
select sum(safe_cast(value as float64)) range_sum
from myrangeastable t,
unnest(extract_values(to_json_string(t))) value
with output
Note: no columns are explicitly used, so this should work for any sized range without any changes in code.
Depending on the specific use case, I think the above can be wrapped into something friendlier for someone who knows Excel.
I'll try to lay out atomic, pure SQL principles that start with the obvious items and move to the more complicated ones. The intention is that all items can be used in any RDBMS:
1. SQL is basically designed to query tabular data that has relations. (Hence the name Structured Query Language.)
2. A range in Excel is a table in SQL. (Yes, you can have some other types in different DBs, but keeping it simple lets you use the concept across different types of DBs.)
3. Now that we accept that a range in Excel is a table in a database, the next step is to map the columns and rows of an Excel range to a DB table. It is straightforward: an Excel range column is a column in the DB, and a row is a row. So why is this a separate item? Because the main difference between the two is that in DBs adding a new column is usually a pain; DB tables are almost exclusively designed for new rows, not new columns. (Of course there are methods to add new columns, and there are even column-based DBs, but these are outside the scope of this answer.)
Items 2 and 3 in Excel and in a DB:
/*
Item 2: Table
the range in the excel is modeled as the below test_table
Item 3: Columns
id keeps the excel row number
b, c, d are the corresponding b, c, d columns of the excel
*/
create table test_table
(
id integer,
b varchar(20),
c varchar(20),
d varchar(20)
);
-- Item 3: Adding the rows in the DB
insert into test_table values (3 /* same as excel row number */ , '1', '1', '1');
insert into test_table values (4 /* same as excel row number */ , '2', 'other', '2');
insert into test_table values (5 /* same as excel row number */ , 'TRUE', '3', '3');
insert into test_table values (6 /* same as excel row number */ , '4', '4', '4');
4. Now we have a similar structure. The first thing we want to do is to have an equal number of rows in the Excel range and the DB table. On the DB side this is called filtering, and your tool is the where condition. The where condition goes through all rows (or indexes, for the sake of speed, but this is beyond this answer's scope) and filters out those that do not satisfy the boolean test in the condition. (So, for example, where 1 = 1 brings back all rows, because the condition is true for every row.)
5. The next thing to do is to sum the related columns. For this purpose you have two options: sum(column_a + column_b) (row-by-row summation) or sum(a) + sum(b) (column-by-column summation). If we assume none of the data is null, both give the same output.
Items 4 and 5 in Excel and in a DB:
select sum(b + c + d) -- Item 5, first option: We sum row by row
from test_table
where id between 3 and 6; -- Item 4: This simply returns all rows, because every id above is between 3 and 6; a row with id 7 would be filtered out
+----------------+
| sum(b + c + d) |
+----------------+
| 25 |
+----------------+
select sum(b) + sum(c) + sum(d) -- Item 5, second option: We sum column by column
from test_table
where id between 3 and 6; -- Item 4: This simply returns all rows, because every id above is between 3 and 6; a row with id 7 would be filtered out
+--------------------------+
| sum(b) + sum(c) + sum(d) |
+--------------------------+
| 25 |
+--------------------------+
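The "not null" assumption behind the two summation options matters in practice. A small sketch using SQLite from Python, with integer columns and a made-up NULL cell (both are assumptions for illustration; the table above uses varchar columns and has no NULLs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_table (id INTEGER, b INTEGER, c INTEGER, d INTEGER)")
conn.executemany(
    "INSERT INTO test_table VALUES (?, ?, ?, ?)",
    [(3, 1, 1, 1), (4, 2, None, 2), (5, 4, 4, 4)],  # row 4 has a NULL in column c
)

# Row by row: b + c + d is NULL for row 4, and SUM skips NULLs,
# so the entire row drops out of the total.
row_by_row = conn.execute("SELECT SUM(b + c + d) FROM test_table").fetchone()[0]

# Column by column: each SUM skips only the single NULL cell.
col_by_col = conn.execute(
    "SELECT SUM(b) + SUM(c) + SUM(d) FROM test_table"
).fetchone()[0]

print(row_by_row, col_by_col)
```

If the range may contain empty cells, the column-by-column form is closer to what Excel's SUM does, since it ignores only the empty cell rather than the whole row.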
6. At this point it is better to go one step further. In Excel you have the "pivot table" structure. The corresponding structure in SQL is the powerful group by mechanism. group by basically groups a table according to its condition, and each group behaves like a sub-table. For example, if you say group by column_a for a table, the rows are grouped according to the values of column_a.
7. SQL is so powerful that you can even filter the subgroups using having clauses, which act the same as where but work over the columns in group by or the functions over those columns.
Items 6 and 7 in Excel and in a DB:
-- Item 6: We can have group by clause to simulate a pivot table
insert into test_table values (7 /* same as excel row */ , '4', '2', '2');
select b, sum(d), min(d), max(d), avg(d)
from test_table
where id between 3 and 7
group by b;
+------+--------+--------+--------+--------+
| b | sum(d) | min(d) | max(d) | avg(d) |
+------+--------+--------+--------+--------+
| 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 |
| TRUE | 3 | 3 | 3 | 3 |
| 4 | 6 | 2 | 4 | 3 |
+------+--------+--------+--------+--------+
Beyond this point, the following are details that are not directly related to the question's purpose:
8. SQL has the ability to join tables (the relations). Joins can be thought of like the VLOOKUP functionality in Excel.
9. RDBMSs have indexing mechanisms to fetch rows as quickly as possible. (This is where RDBMSs start to go beyond the purpose of Excel.)
10. RDBMSs can keep huge amounts of data (whereas in Excel the maximum number of rows is limited).
11. Both RDBMSs and Excel can be used by most frameworks as a persistent data layer. But of course Excel is not the one you would pick, because its reason for existence is more on the presentation layer.
The excel file and the SQL used in this answer can be found in this github repo: https://github.com/MehmetKaplan/stackoverflow-72135212/
PS: I used SQL for more than two decades, then used it less and started using Excel much more frequently because of job changes. Each time I use Excel I still think of DBs and "relational algebra", which is the mathematical foundation of RDBMSs.
So in Snowflake:
Strings as input:
If you have your data in an "order" table represented by this CTE,
and the data is strings of comma-separated values:
WITH data(raw) as (
select * from values
('null,null,null,null,null,null'),
('null,null,null,null,null,null'),
('null,1,1,1,null,null'),
('null,2, other,2,null,null'),
('null,true,3,3,null,null'),
('null,4,4,4,null,null')
)
this SQL selects the sub-range, tries to parse each value, and sums the valid ones:
select sum(nvl(try_to_double(r.value::text), try_to_number(r.value::text))) as sum_total
from data as d
,table(split_to_table(d.raw,',')) r
where r.index between 2 and 4 /* the column B,C,D filter */
and r.seq between 3 and 6 /* the row 3-6 filter */
;
giving:
SUM_TOTAL
25
Arrays as input:
If you already have arrays (here I am smashing those strings through STRTOK_TO_ARRAY in the CTE to make some arrays):
WITH data(_array) as (
select STRTOK_TO_ARRAY(column1, ',') from values
('null,null,null,null,null,null'),
('null,null,null,null,null,null'),
('null,1,1,1,null,null'),
('null,2, other,2,null,null'),
('null,true,3,3,null,null'),
('null,4,4,4,null,null')
)
then again with almost the same SQL, but note the array indexes are 0-based, and I have used FLATTEN:
select sum(nvl(try_to_double(r.value::text), try_to_number(r.value::text))) as sum_total
from data as d
,table(flatten(input=>d._array)) r
where r.index between 1 and 3 /* the column B,C,D filter */
and r.seq between 3 and 6 /* the row 3-6 filter */
;
gives:
SUM_TOTAL
25
With JSON driven data:
This time using semi-structured data, we can include the filter ranges with the data, plus some extra "out of bounds" values just to show we are not simply converting everything.
WITH data as (
select parse_json('{ "col_from":2,
"col_to":4,
"row_from":3,
"row_to":6,
"data":[[101,102,null,104,null,null],
[null,null,null,null,null,null],
[null,1,1,1,null,null],
[null,2, "other",2,null,null],
[null,true,3,3,null,null],
[null,4,4,4,null,null]
]}') as json
)
select
sum(try_to_double(c.value::text)) as sum_total
from data as d
,table(flatten(input=>d.json:data)) r
,table(flatten(input=>r.value)) c
where r.index+1 between d.json:row_from::number and d.json:row_to::number
and c.index+1 between d.json:col_from::number and d.json:col_to::number
;
Here is another solution using Snowflake Scripting (Snowsight format). This code can easily be wrapped as a stored procedure.
declare
table_name := 'xl_concept'; -- input
column_list := 'a,b,c'; -- input
total resultset; -- result output
pos int := 0; -- position for delimiter
sql := ''; -- sql to be generated
col := ''; -- individual column names
begin
sql := 'select sum('; -- initialize sql
loop -- repeat until column list is empty
col := replace(split_part(:column_list, ',', 1), ',', ''); -- get the column name
pos := position(',' in :column_list); -- find the delimiter
sql := sql || 'coalesce(try_to_number('|| col ||'),0)'; -- add to the sql
if (pos > 0) then -- more columns in the column list
sql := sql || ' + ';
column_list := right(:column_list, len(:column_list) - :pos); -- update column list
else -- last entry in the columns list
break;
end if;
end loop;
sql := sql || ') total from ' || table_name||';'; -- finalize the sql
total := (execute immediate :sql); -- run the sql and store total value
return table(total); -- return total value
end;
Only these two variables need to be set: table_name and column_list.
It generates the following SQL to sum up the values:
select sum(coalesce(try_to_number(a),0) + coalesce(try_to_number(b),0) + coalesce(try_to_number(c),0)) total from xl_concept;
Prep steps:
create or replace temp table xl_concept (a varchar,b varchar,c varchar)
;
insert into xl_concept
with cte as (
select '1' a, '1' b, '1' c union all
select '2', 'other', '2' union all
select 'true', '3', '3' union all
select '4', '4', '4'
)
select * from cte
;
result for the run with no change
TOTAL
25
result after changing column list to column_list := 'a,c';
TOTAL
17
Also, this can be enhanced by setting column_list to * and reading the column names from information_schema.columns to include all the columns from the table.
In PostgreSQL, a regular expression can be used to filter out non-numeric values before the sum:
select sum(e::Numeric) from (
select e
from unnest((Array[['1','2w','1.2e+4'],['-1','2.232','zz']])) as t(e)
where e ~ '^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$'
) a
The expression for validating numeric values was taken from the post "Return Just the Numeric Values from a PostgreSQL Database Column".
A more secure option is to define a function, as in the post "PostgreSQL alternative to SQL Server's try_cast function".
Function (simplified for this example):
create function try_cast_numeric(p_in text)
returns Numeric
as
$$
begin
begin
return $1::Numeric;
exception
when others then
return 0;
end;
end;
$$
language plpgsql;
Select
select
sum(try_cast_numeric(e))
from
unnest((Array[['1','2w','1.2e+4'],['-1','2.232','zz']])) as t(e)
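For checking which strings the regular expression above lets through, here is a quick sketch in Python. It assumes Python's re module treats this particular pattern the same way PostgreSQL does (it only uses basic syntax), and it uses Decimal as a stand-in for Numeric:

```python
import re
from decimal import Decimal

# Same pattern as in the query above; fullmatch supplies the ^...$ anchoring.
NUMERIC = re.compile(r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?")

# The flattened 2x3 array from the example.
values = ["1", "2w", "1.2e+4", "-1", "2.232", "zz"]

# Keep only strings that look numeric, then convert and sum them.
numeric_values = [Decimal(v) for v in values if NUMERIC.fullmatch(v)]
total = sum(numeric_values)

print(numeric_values)
print(total)
```

Here "2w" and "zz" are filtered out, and the remaining four values are summed.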
Most modern RDBMSs support lateral joins and table value constructors. You can use them together to convert arbitrary columns to rows (3 columns per row become 3 rows with 1 column) and then sum. In SQL Server you would write:
create table t (
id int not null primary key identity,
a int,
b int,
c int
);
insert into t(a, b, c) values
( 1, 1, 1),
( 2, null, 2),
(null, 3, 3),
( 4, 4, 4);
select sum(value)
from t
cross apply (values
(a),
(b),
(c)
) as x(value);
Below is the implementation of this concept in some popular RDBMS:
SQL Server
PostgreSQL
MySQL
Generic solution, ANSI SQL
Unpivot solution, Oracle
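As a rough illustration of the columns-to-rows idea on an engine without CROSS APPLY, the sketch below swaps in a plain UNION ALL subquery and runs it on SQLite from Python. This is a substituted technique, not the lateral join shown above, and the choice of SQLite is an assumption made only so the example is runnable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO t (a, b, c) VALUES (?, ?, ?)",
    [(1, 1, 1), (2, None, 2), (None, 3, 3), (4, 4, 4)],
)

# Stack the three columns into one column, then sum it.
# UNION ALL (not UNION) is needed so repeated values are kept.
query = """
SELECT SUM(value)
FROM (
    SELECT a AS value FROM t
    UNION ALL
    SELECT b FROM t
    UNION ALL
    SELECT c FROM t
) AS x
"""
total = conn.execute(query).fetchone()[0]
print(total)
```

The drawback compared with the lateral join is that the table is read once per column.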
Using regular expression to extract all number values from a row could be another option, I guess.
DECLARE rectangular_table ARRAY<STRUCT<A STRING, B STRING, C STRING>> DEFAULT [
('1', '1', '1'), ('2', 'other', '2'), ('TRUE', '3', '3'), ('4', '4', '4')
];
SELECT SUM(SAFE_CAST(v AS FLOAT64)) AS `sum`
FROM UNNEST(rectangular_table) t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r':"?([-0-9.]*)"?[,}]')) v
output:
+------+------+
| Row | sum |
+------+------+
| 1 | 25.0 |
+------+------+
You could use a CTE with a SELECT FROM VALUES
with xlary as
(
select val from (values
('1')
,('1')
,('1')
,('2')
,('OTHER')
,('2')
,('TRUE')
,('3')
,('3')
,('4')
,('4')
,('4')
) as tbl (val)
)
select sum(try_cast(val as number)) from xlary;

Oracle SQL Regexp

I have comma-separated string values in two different columns and need to match a specific value between these two columns.
Example:
Column A: A123,B234,I555,K987
Column B: AAA1,A123,B234,I555,K987
I want to check whether the value B234 in Column A (which starts at the 6th position) and B234 in Column B (which starts at the 11th position) match. I have a few hundred such records and need to check whether these values match.
The way you put it, you'd compare "words" - the 2nd one in column A against the 3rd one in column B (sample data in lines #1 - 4; query you might be interested in begins at line #5):
SQL> with test (cola, colb) as
2 (select 'A123,B234,I555,K987', 'AAA1,A123,B234,I555,K987' from dual union all
3 select 'XYZ' , 'DEF' from dual
4 )
5 select *
6 from test
7 where regexp_substr(cola, '\w+', 1, 2) = regexp_substr(colb, '\w+', 1, 3);
COLA COLB
------------------- ------------------------
A123,B234,I555,K987 AAA1,A123,B234,I555,K987
SQL>
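The same "n-th word" comparison can be sketched outside the database, for example to spot-check a few records. The Python below mirrors regexp_substr(col, '\w+', 1, n) with a hypothetical helper nth_word, and treats a missing word as "no match" the way a NULL comparison would in Oracle:

```python
import re


def nth_word(text, n):
    """Return the n-th (1-based) run of word characters, or None if absent."""
    words = re.findall(r"\w+", text)
    return words[n - 1] if len(words) >= n else None


# Same sample data as in the query above.
rows = [
    ("A123,B234,I555,K987", "AAA1,A123,B234,I555,K987"),
    ("XYZ", "DEF"),
]

# Compare the 2nd word of column A with the 3rd word of column B.
matches = [
    (cola, colb)
    for cola, colb in rows
    if nth_word(cola, 2) is not None and nth_word(cola, 2) == nth_word(colb, 3)
]
print(matches)
```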

Oracle SQL - How to "pivot" one row to many

In Oracle 12c, I have a view, which takes a little time to run. When I add the where clause, it will return exactly one row of interest. The row has columns/value like this...
I need this flipped so that I can see one row per EACH "set". I need the SQL to return something like
I know I can do a UNION ALL for each of the entry sets, but the view takes a little while to run and there are about 30 different sets (I only showed 3: Car, Boat, and Truck).
Is there a better way of doing this? I have looked at PIVOT/UNPIVOT, but I didn't see how to make this work.
I think you are looking for UNPIVOT
WITH TEMP_DATA (ID1, CarPrice, CarTax, BoatPrice, BoatTax, TruckPrice, TruckTax)
AS (
select 'AAA', 1, 2, 3, 4, 5, 6 from dual )
select TYPE, PRICE, TAX
from temp_data
unpivot
(
(PRICE, TAX)
for TYPE IN
(
(CarPrice, CarTax) as 'CAR',
(BoatPrice, BoatTax) as 'BOAT',
(TruckPrice, TruckTax) as 'TRUCK'
)
)
;
OUTPUT:
TYPE PRICE TAX
----- ---------- ----------
CAR 1 2
BOAT 3 4
TRUCK 5 6
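To make the reshaping concrete, here is a small Python sketch of what the multi-column UNPIVOT does to the single row: each (Price, Tax) column pair becomes its own output row, labelled with the set name. The dictionary stands in for the one row returned by the view, and the column naming pattern (set name followed by Price or Tax) is taken from the sample above:

```python
# The single wide row, as in the TEMP_DATA CTE above.
row = {
    "ID1": "AAA",
    "CarPrice": 1, "CarTax": 2,
    "BoatPrice": 3, "BoatTax": 4,
    "TruckPrice": 5, "TruckTax": 6,
}

# One entry per set; with about 30 sets this list simply gets longer.
set_names = ["Car", "Boat", "Truck"]

# Each set contributes one (TYPE, PRICE, TAX) row.
unpivoted = [
    (name.upper(), row[f"{name}Price"], row[f"{name}Tax"])
    for name in set_names
]

for type_, price, tax in unpivoted:
    print(type_, price, tax)
```

The wide row is read only once, just as the UNPIVOT query reads the view once rather than once per set.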

SQL - How to return values outside a date range query

Hoping someone can help out here, I have the following data
Field 1   Field 2   Date       Data
1         1         12/09/14   1
2         2         12/09/14   1
3         1         11/09/14   1
4         3         11/09/14   1
I need to write an sql query that sums all "Data" based on a date range and then anything that matches in Field 2. So if a line is out of the date range but the value in Field 2 matches another line that is within the date range, it should be included
For example, if I was to query everything for the 12/09/14, I want to see the sum of line 1, 2 and 3.... as line 3 is outside of the date range but it matches line 1 in the "Field 2" column. Line 4 should not be included as it is outside the range and does not have a matching value in "Field 2"
Any ideas?
I've been playing around with variations of queries but it either selects only the date range values or everything :(
EDIT:
Ok, I've given Rajesh's answer a try and it doesn't seem to include the data outside the date range. I was expecting the final sum in this example to equal 3 but it's only showing 2.
select sum(a) from (
select sum(batch_m2_nett) as a
from batch_inf
where batch_date = to_date('30/09/15','DD/MM/RR')
union
select sum(f2.batch_m2_nett) as a
from batch_inf f1
inner join batch_inf f2
on f1.batch_date = to_date('30/09/15','DD/MM/RR')
and f1.batch_opt_start_batch = f2.batch_opt_start_batch
and f2.batch_date != to_date('30/09/15','DD/MM/RR')
);
SUM(A)
------
2
SQL> select batch_no, batch_opt_start_batch, batch_date, batch_m2_nett from batch_inf where batch_no in (8811,8812,8814);
BATCH_NO BATCH_OPT_START_BATCH BATCH_DATE BATCH_M2_NETT
-------- --------------------- --------------- -------------
8811 8814 30-SEP-15 1
8812 8814 30-SEP-15 1
8814 8814 01-OCT-15 1
The first statement gets the sum of the data values where the date matches.
The second statement gets the sum of the data values from other rows whose field2 matches a row on the matched date, using a self join.
select SUM(s)
from
(
select SUM(data) as s
from fields
where date ='12/09/14'
union
select sum(f2.data) as s
from fields f1
inner join fields f2
on f1.date ='12/09/14'
and f1.field2 = f2.field2
and f2.date != '12/09/14'
) T
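One likely reason the asker saw 2 instead of 3: UNION removes duplicate rows, so when both partial sums happen to be equal they collapse into one row, and the self join can also count an out-of-range row once per matching in-range row. A possible alternative, sketched here on SQLite from Python with the sample data from the question (this is not the method above, just one way to count each row once), is an IN subquery:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fields (field1 INTEGER, field2 INTEGER, date TEXT, data INTEGER)")
conn.executemany(
    "INSERT INTO fields VALUES (?, ?, ?, ?)",
    [
        (1, 1, "12/09/14", 1),
        (2, 2, "12/09/14", 1),
        (3, 1, "11/09/14", 1),  # out of range, but field2 matches line 1
        (4, 3, "11/09/14", 1),  # out of range, no matching field2
    ],
)

# Sum every row whose field2 value occurs on the requested date.
# Rows on the date itself qualify trivially, so each row is counted once.
query = """
SELECT SUM(data)
FROM fields
WHERE field2 IN (SELECT field2 FROM fields WHERE date = ?)
"""
total = conn.execute(query, ("12/09/14",)).fetchone()[0]
print(total)
```

For a real date range, the equality in the subquery would become a BETWEEN on a proper date column.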

Oracle: Select multiple values from a column while satisfying condition for some values

I have a column COL in a table which has integer values like: 1, 2, 3, 10, 11 ... and so on. Uniqueness in the table is created by an ID. Each ID can be associated with multiple COL values. For example
ID | COL
---+----
1  | 2
1  | 3
1  | 10
is valid.
What I want to do is select only the COL values from the table that are greater than 3, AND (the problematic part) also select the value that is the MAX of 1, 2, and 3, if they exist at all. So in the table above, I would want to select values [3, 10] because 10 is greater than 3 and 3 = MAX(3, 2).
I know I can do this with two SQL statements, but it's sort of messy. Is there a way of doing it with one statement only?
SELECT col FROM table
WHERE
col > 3
UNION
SELECT MAX(col) FROM table
WHERE
col <= 3
This query does not assume you want the results per id, because you don't explicitly mention it.
I don't think you need pl/sql for this, SQL is enough.
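A quick way to check the UNION query against the example table is to run it on SQLite from Python. The table is named t here (an assumption, since table is a reserved word), and ORDER BY 1 is added only to make the output order predictable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, col INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 2), (1, 3), (1, 10)])

query = """
SELECT col FROM t
WHERE col > 3
UNION
SELECT MAX(col) FROM t
WHERE col <= 3
ORDER BY 1
"""
selected = [row[0] for row in conn.execute(query)]
print(selected)
```

One caveat: if no value is 3 or less, MAX still returns a single NULL row, so that case may need an extra filter.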