How to create an observation with 0 in the column - sql

I am using the code below to get the quarterly wages for individuals from 2010Q1-2020Q4. If an individual did not work in a particular quarter, they do not have an observation for that quarter. Instead, I would like there to be an observation with a quarterly wage of 0. For example:
What is currently happening:
| MPI     | Quarter | Wage |
|---------|---------|------|
| PersonA | 2010Q1  | 100  |
| PersonA | 2010Q2  | 100  |
| PersonA | 2010Q3  | 100  |
| PersonB | 2010Q1  | 100  |
Desired output:
| MPI     | Quarter | Wage |
|---------|---------|------|
| PersonA | 2010Q1  | 100  |
| PersonA | 2010Q2  | 100  |
| PersonA | 2010Q3  | 100  |
| PersonA | 2010Q4  | 0    |
| PersonB | 2010Q1  | 100  |
| PersonB | 2010Q2  | 0    |
| PersonB | 2010Q3  | 0    |
| PersonB | 2010Q4  | 0    |
ws_data AS (
SELECT
MASTER_PERSON_INDEX AS mpi
,SUBSTR(cast(wg.naics as string), 1, 2) AS NAICS_2
,SUBSTR(cast(wg.yrqtr as string), 0,5) AS quarter
,wg.yrqtr
,wg.employer
,wg.wages
,SUBSTR(cast(wg.yrqtr as string), 0,4) AS YEAR
FROM
( SELECT
*
FROM
`ws.ws_ui_wage_records_di` wsui
WHERE
wsui.MASTER_PERSON_INDEX IN (SELECT mpi FROM rc_table_ra16_all_grads_1b)
AND
wsui.yrqtr IN (20101, 20102, 20103, 20104,
20111, 20112, 20113, 20114,
20121, 20122, 20123, 20124,
20131, 20132, 20133, 20134,
20141, 20142, 20143, 20144,
20151, 20152, 20153, 20154,
20161, 20162, 20163, 20164,
20171, 20172, 20173, 20174,
20181, 20182, 20183, 20184,
20191, 20192, 20193, 20194,
20201, 20202, 20203, 20204)
)wg
),
ws_agg AS (
SELECT
mpi
-- ,STATS_MODE(NAICS_2) AS NAICS_2
-- ,STATS_MODE(NAICS_DESC) AS NAICS_DESC
,quarter
,SUM(wages) AS wages_quart
FROM
ws_data
GROUP BY
mpi, quarter
),
ws_annot AS (
SELECT
dagg.*
,row_number() OVER(PARTITION BY dagg.mpi, cast(wages_quart as string) ORDER BY dagg.wages_quart DESC) AS rn
FROM
ws_agg dagg
)

Try using this data to create a CTE at the top as a quarter table, then use that as the starting point in your main FROM clause. You should also be able to replace the hard-coded quarter list in the wg subquery's WHERE clause with that top CTE.
(20101, 20102, 20103, 20104,
20111, 20112, 20113, 20114,
20121, 20122, 20123, 20124,
20131, 20132, 20133, 20134,
20141, 20142, 20143, 20144,
20151, 20152, 20153, 20154,
20161, 20162, 20163, 20164,
20171, 20172, 20173, 20174,
20181, 20182, 20183, 20184,
20191, 20192, 20193, 20194,
20201, 20202, 20203, 20204)
Your db may have a DateDimension table with quarters in it that you could use as well.

Since you want all quarters and all individuals, one way to achieve this is to start by building all individual-quarter combinations in your data and use that as a 'driver' in a left join, like this:
select
Pers.MPI
, Qtr.Quarter
, coalesce(W.Wage,0) as Wage
, ...
from
(select distinct MPI from YourTable) Pers
cross join
(select distinct Quarter from DateDimensionTable) Qtr
left join
YourTable W
on w.MPI=Pers.MPI
and w.Quarter=Qtr.Quarter
If your table has all the periods you are interested in, you can use YourTable instead of DateDimensionTable. But if it doesn't, and I guess that can't be guaranteed, then you can use a Date/Calendar table here, if you have one, or you can dynamically generate the quarters between the min and max quarter in YourTable (just search for those terms). You can also hardcode them as you have in your query (as JBontje recommended).
If a combination is missing from YourTable then the Wage for that combo will be null, you can use coalesce to treat it as zero.


Matching strings between columns based on position

I have a view that aggregates data about customers and shows the products they have access to, along with whether they use those products on a trial basis or not (both as comma-separated strings):
+----------+----------+-----------------------+
| customer | products | products_trial_status |
+----------+----------+-----------------------+
| 234253 | A,B,C | false,true,false |
| 923403 | A,C | true,true |
| 123483 | B | true |
| 239874 | B,C | false,false |
+----------+----------+-----------------------+
and I would like to write a query that returns a list of customers who are using a certain product on a trial.
e.g. if I want to see which customers using product B are on a trial, I would get something like this:
+----------+
| customer |
+----------+
| 234253 |
| 123483 |
+----------+
The only way I can think of doing this is by checking the products column for the position of the product in the string (if it exists there), then checking the corresponding value at the same position in the products_trial_status column and whether it is equal to true.
i.e. for customer 234253, product B is in position 2 (after the first comma), so its corresponding trial status in the other column is also in position 2, after the first comma there.
How would I go about doing this?
I am aware that storing such data as a string of values is not good practice, but it is not something I can change, so I need to work with the format as it is.
You could split the string, but it will be quicker to do some (fairly hideous) string manipulations:
1. Replace your comma-delimited true/false string with a non-delimited string of 1s and 0s.
2. Count the number of terms before the B term by counting the number of preceding commas.
3. Return the appropriate substring of your 1/0 list.
Like this:
SELECT customer,
COALESCE(
SUBSTR(
status,
LENGTH(preceding_terms) - COALESCE(LENGTH(REPLACE(preceding_terms, ',')), 0),
1
),
'0'
) AS hasB
FROM (
SELECT customer,
SUBSTR(','||products, 1, INSTR(','||products||',', ',B,')) AS preceding_terms,
TRANSLATE(products_trial_status, 'tfrueals,', '10') AS status
FROM table_name
)
Which, for the sample data:
CREATE TABLE table_name ( customer, products, products_trial_status ) AS
SELECT 234253, 'A,B,C', 'false,true,false' FROM DUAL UNION ALL
SELECT 923403, 'A,C', 'true,true' FROM DUAL UNION ALL
SELECT 123483, 'B', 'true' FROM DUAL UNION ALL
SELECT 239874, 'B,C', 'false,false' FROM DUAL;
Outputs:
| CUSTOMER | HASB |
|----------|------|
| 234253   | 1    |
| 923403   | 0    |
| 123483   | 1    |
| 239874   | 0    |
If you only want the customer numbers then you can add filters:
SELECT customer
FROM (
SELECT customer,
SUBSTR(','||products, 1, INSTR(','||products||',', ',B,')) AS preceding_terms,
TRANSLATE(products_trial_status, 'tfrueals,', '10') AS status
FROM table_name
WHERE INSTR(','||products||',', ',B,') > 0
)
WHERE SUBSTR(
status,
LENGTH(preceding_terms) - COALESCE(LENGTH(REPLACE(preceding_terms, ',')), 0),
1
) = '1'
Which outputs:
| CUSTOMER |
|----------|
| 234253   |
| 123483   |
You can use a hierarchical query, in which you split the strings on commas and count the number of terms, with regular expression functions, such as:
WITH t1 AS
(
SELECT customer,
REGEXP_SUBSTR(products,'[^,]+',1,level) AS products,
REGEXP_SUBSTR(products_trial_status,'[^,]+',1,level) AS products_ts
FROM t -- your data source
CONNECT BY level <= REGEXP_COUNT(products,',')+1
AND PRIOR customer = customer
AND PRIOR sys_guid() IS NOT NULL
)
SELECT customer
FROM t1
WHERE products = 'B'
AND products_ts = 'true'
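Outside the database, the same position-matching idea is just a split and zip. A quick Python sketch over the sample data, to sanity-check the expected result:

```python
def trial_customers(rows, product):
    """Return customers whose flag at the product's comma position is 'true'."""
    result = []
    for customer, products, trial_status in rows:
        # Pair each product with the status at the same comma position.
        flags = dict(zip(products.split(','), trial_status.split(',')))
        if flags.get(product) == 'true':
            result.append(customer)
    return result

data = [
    (234253, 'A,B,C', 'false,true,false'),
    (923403, 'A,C',   'true,true'),
    (123483, 'B',     'true'),
    (239874, 'B,C',   'false,false'),
]
```

For example, `trial_customers(data, 'B')` gives the two customers from the desired output above.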

Get specific row from each group

My question is very similar to this, except I want to be able to filter by some criteria.
I have a table "DOCUMENT" which looks something like this:
|ID|CONFIG_ID|STATE |MAJOR_REV|MODIFIED_ON|ELEMENT_ID|
+--+---------+----------+---------+-----------+----------+
| 1|1234 |Published | 2 |2019-04-03 | 98762 |
| 2|1234 |Draft | 1 |2019-01-02 | 98762 |
| 3|5678 |Draft | 3 |2019-01-02 | 24244 |
| 4|5678 |Published | 2 |2017-10-04 | 24244 |
| 5|5678 |Draft | 1 |2015-05-04 | 24244 |
It's actually a few more columns, but I'm trying to keep this simple.
For each CONFIG_ID, I would like to select the latest (MAX(MAJOR_REV) or MAX(MODIFIED_ON)) - but I might want to filter by additional criteria, such as state (e.g., the latest published revision of a document) and/or date (the latest revision, published or not, as of a specific date; or: all documents where a revision was published/modified within a specific date interval).
To make things more interesting, there are some other tables I want to join in.
Here's what I have so far:
SELECT
allDocs.ID,
d.CONFIG_ID,
d.[STATE],
d.MAJOR_REV,
d.MODIFIED_ON,
d.ELEMENT_ID,
f.ID FILE_ID,
f.[FILENAME],
et.COLUMN1,
e.COLUMN2
FROM DOCUMENT allDocs -- Get all document revisions
CROSS APPLY ( -- Then for each config ID, only look at the latest revision
SELECT TOP 1
ID,
MODIFIED_ON,
CONFIG_ID,
MAJOR_REV,
ELEMENT_ID,
[STATE]
FROM DOCUMENT
WHERE CONFIG_ID=allDocs.CONFIG_ID
ORDER BY MAJOR_REV desc
) as d
LEFT OUTER JOIN ELEMENT e ON e.ID = d.ELEMENT_ID
LEFT OUTER JOIN ELEMENT_TYPE et ON e.ELEMENT_TYPE_ID=et.ID
LEFT OUTER JOIN TREE t ON t.NODE_ID = d.ELEMENT_ID
OUTER APPLY ( -- This is another optional 1:1 relation, but it's wrongfully implemented as m:n
SELECT TOP 1
FILE_ID
FROM DOCUMENT_FILE_RELATION
WHERE DOCUMENT_ID=d.ID
ORDER BY MODIFIED_ON DESC
) as df -- There should never be more than 1, but we're using TOP 1 just in case, to avoid duplicates
LEFT OUTER JOIN [FILE] f on f.ID=df.FILE_ID
WHERE
allDocs.CONFIG_ID = '5678' -- Just for testing purposes
and d.state ='Released' -- One possible filter criterion, there may be others
It looks like the results are correct, but multiple identical rows are returned.
My guess is that for documents with 4 revisions, the same values are found 4 times and returned.
A simple SELECT DISTINCT would solve this, but I'd prefer to fix my query.
This would be a classic row_number & partition by question I think.
;with rows as
(
select <your-columns>,
row_number() over (partition by config_id order by <whatever you want>) as rn
from document
join <anything else>
where <whatever>
)
select * from rows where rn=1
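A concrete, runnable version of that pattern against the question's sample data, sketched with Python's sqlite3 (window functions need SQLite ≥ 3.25; the joins and extra filters from the real query are omitted):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE document (id INT, config_id TEXT, state TEXT,
                           major_rev INT, modified_on TEXT);
    INSERT INTO document VALUES
        (1, '1234', 'Published', 2, '2019-04-03'),
        (2, '1234', 'Draft',     1, '2019-01-02'),
        (3, '5678', 'Draft',     3, '2019-01-02'),
        (4, '5678', 'Published', 2, '2017-10-04'),
        (5, '5678', 'Draft',     1, '2015-05-04');
""")

# Number revisions within each config_id, newest first; rn = 1 is the latest.
latest = con.execute("""
    WITH ranked AS (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY config_id
                                     ORDER BY major_rev DESC) AS rn
        FROM document
    )
    SELECT id, config_id, state FROM ranked WHERE rn = 1
    ORDER BY config_id
""").fetchall()
```

Each config_id yields exactly one row, so the duplicate-row problem from the CROSS APPLY version cannot occur.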

Getting the latest entry per day / SQL Optimizing

Given the following database table, which records events (status) for different objects (id) with their timestamps:
ID | Date | Time | Status
-------------------------------
7 | 2016-10-10 | 8:23 | Passed
7 | 2016-10-10 | 8:29 | Failed
7 | 2016-10-13 | 5:23 | Passed
8 | 2016-10-09 | 5:43 | Passed
I want to get a result table using plain SQL (MS SQL) like this:
ID | Date | Status
------------------------
7 | 2016-10-10 | Failed
7 | 2016-10-13 | Passed
8 | 2016-10-09 | Passed
where the "status" is the latest entry on a day, given that at least one event for this object has been recorded.
My current solution is using "Outer Apply" and "TOP(1)" like this:
SELECT DISTINCT rn.id,
tmp.date,
tmp.status
FROM run rn OUTER apply
(SELECT rn2.date, tmp2.status AS 'status'
FROM run rn2 OUTER apply
(SELECT top(1) rn3.id, rn3.date, rn3.time, rn3.status
FROM run rn3
WHERE rn3.id = rn.id
AND rn3.date = rn2.date
ORDER BY rn3.id ASC, rn3.date + rn3.time DESC) tmp2
WHERE tmp2.status <> '' ) tmp
As far as I understand this outer apply command works like:
For every id
For every recorded day for this id
Select the newest status for this day and this id
But I'm facing performance issues, so I think this solution is not adequate. Any suggestions on how to solve this problem or optimize the SQL?
Your code seems too complicated. Why not just do this?
SELECT r.id, r.date, r2.status
FROM run r OUTER APPLY
(SELECT TOP 1 r2.*
FROM run r2
WHERE r2.id = r.id AND r2.date = r.date AND r2.status <> ''
ORDER BY r2.time DESC
) r2;
For performance, I would suggest an index on run(id, date, status, time).
Using a CTE will probably be the fastest:
with cte as
(
select ID, Date, Status, row_number() over (partition by ID, Date order by Time desc) rn
from run
)
select ID, Date, Status
from cte
where rn = 1
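The CTE version can be checked quickly with Python's sqlite3 (times zero-padded so text ordering matches time ordering; SQLite ≥ 3.25 for window functions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE run (id INT, date TEXT, time TEXT, status TEXT);
    INSERT INTO run VALUES
        (7, '2016-10-10', '08:23', 'Passed'),
        (7, '2016-10-10', '08:29', 'Failed'),
        (7, '2016-10-13', '05:23', 'Passed'),
        (8, '2016-10-09', '05:43', 'Passed');
""")

# One row per (id, date): the row with the latest time wins.
latest = con.execute("""
    WITH cte AS (
        SELECT id, date, status,
               ROW_NUMBER() OVER (PARTITION BY id, date
                                  ORDER BY time DESC) AS rn
        FROM run
    )
    SELECT id, date, status FROM cte WHERE rn = 1
    ORDER BY id, date
""").fetchall()
```

This reproduces the three-row result table from the question.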
Do not SELECT from a log table; instead, write a trigger that maintains a latest_run table. Note the original UPDATE-then-IF approach only handles single-row inserts; a set-based version handles multi-row inserts too:
CREATE TRIGGER tr_run_insert ON run FOR INSERT AS
BEGIN
    UPDATE lr
    SET Status = i.Status
    FROM latest_run lr
    JOIN INSERTED i ON lr.ID = i.ID AND lr.Date = i.Date
    INSERT INTO latest_run (ID, Date, Status)
    SELECT i.ID, i.Date, i.Status
    FROM INSERTED i
    WHERE NOT EXISTS (SELECT 1 FROM latest_run lr
                      WHERE lr.ID = i.ID AND lr.Date = i.Date)
END
Then perform reads from the much shorter latest_run table.
This will add a performance penalty on writes because you'll need two writes instead of one, but it will give you much more stable response times on reads. And if you do not need to SELECT from the run table, you can avoid indexing it, so the penalty of the extra write is partly compensated by less index maintenance.
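As a sanity check of the pattern (a SQLite analogue, not the T-SQL above; INSERT OR REPLACE plays the role of the update-or-insert, and "latest" means most recently inserted, as in the original trigger):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE run (id INT, date TEXT, time TEXT, status TEXT);
    CREATE TABLE latest_run (id INT, date TEXT, status TEXT,
                             PRIMARY KEY (id, date));
    -- Keep latest_run in sync on every insert into the log table.
    CREATE TRIGGER tr_run_insert AFTER INSERT ON run
    BEGIN
        INSERT OR REPLACE INTO latest_run (id, date, status)
        VALUES (NEW.id, NEW.date, NEW.status);
    END;
    INSERT INTO run VALUES
        (7, '2016-10-10', '08:23', 'Passed'),
        (7, '2016-10-10', '08:29', 'Failed'),
        (7, '2016-10-13', '05:23', 'Passed'),
        (8, '2016-10-09', '05:43', 'Passed');
""")

# Reads now hit the small, pre-aggregated table instead of the log.
latest = con.execute(
    "SELECT id, date, status FROM latest_run ORDER BY id, date").fetchall()
```

The second insert for (7, 2016-10-10) overwrites the first, so latest_run ends up with one row per object per day.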

Finding gaps in serial numbers in a SQL table

I'm Nguyen Van Dung.
I am having difficulty checking for gaps in SERIAL numbers, where each number is a mix of characters and digits.
I have a table with the following data:
SERIAL NUMBER
3LBCF007787
3LBCF007788
3LBCF007789
3LBCF007790
3LBCF007792
3LBCF007793
3LBCF007794
3LBCF007795
Now I would like to display an output table with the structure below:
START_SERIAL END_SERIAL
3LBCF007787 3LBCF007790
3LBCF007792 3LBCF007795
Please help me write this query for SQL Server.
Thank you very much
Assuming that the format of serial_number is fixed, here is one way of doing it
WITH split AS
(
SELECT serial_number,
LEFT(serial_number, 5) prefix,
CONVERT(INTEGER, RIGHT(serial_number, 6)) num
FROM table1
), ordered AS
(
SELECT *,
ROW_NUMBER() OVER (ORDER BY prefix, num) rn,
MIN(num) OVER (PARTITION BY prefix) mnum
FROM split
)
SELECT MIN(serial_number) start_serial,
MAX(serial_number) end_serial
FROM ordered
GROUP BY num - mnum - rn
Output:
| START_SERIAL | END_SERIAL |
|--------------|-------------|
| 3LBCF007787 | 3LBCF007790 |
| 3LBCF007792 | 3LBCF007795 |
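The same islands logic can be sketched in plain Python, which makes the grouping trick easier to see (fixed 5-character prefix assumed, as in the answer):

```python
def serial_ranges(serials):
    """Collapse consecutive serial numbers into (start, end) ranges."""
    ranges = []
    for s in sorted(serials):
        prefix, num = s[:5], int(s[5:])  # e.g. '3LBCF', 7787
        if ranges and ranges[-1]["prefix"] == prefix and num == ranges[-1]["last"] + 1:
            # Still consecutive: extend the current island.
            ranges[-1]["end"], ranges[-1]["last"] = s, num
        else:
            # A gap (or new prefix) starts a new island.
            ranges.append({"prefix": prefix, "start": s, "end": s, "last": num})
    return [(r["start"], r["end"]) for r in ranges]

serials = ['3LBCF007787', '3LBCF007788', '3LBCF007789', '3LBCF007790',
           '3LBCF007792', '3LBCF007793', '3LBCF007794', '3LBCF007795']
```

The SQL version achieves the same island detection arithmetically: within an island, num and the row number both increase by 1 per row, so their difference is constant and can serve as the GROUP BY key.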

How to write Oracle query to find a total length of possible overlapping from-to dates

I'm struggling to find the query for the following task.
I have the following data and want to find the total network days for each unique ID:
ID From To NetworkDay
1 03-Sep-12 07-Sep-12 5
1 03-Sep-12 04-Sep-12 2
1 05-Sep-12 06-Sep-12 2
1 06-Sep-12 12-Sep-12 5
1 31-Aug-12 04-Sep-12 3
2 04-Sep-12 06-Sep-12 3
2 11-Sep-12 13-Sep-12 3
2 05-Sep-12 08-Sep-12 3
The problem is that the date ranges can overlap, and I can't come up with SQL that will give me the following results:
ID From To NetworkDay
1 31-Aug-12 12-Sep-12 9
2 04-Sep-12 08-Sep-12 4
2 11-Sep-12 13-Sep-12 3
and then
ID Total Network Day
1 9
2 7
In case the network day calculation is not possible, just getting to the second table would be sufficient.
Hope my question is clear
We can use Oracle analytics, namely the OVER ... PARTITION BY clause, to do this. The PARTITION BY clause is kind of like a GROUP BY but without the aggregation part. That means we can group rows together (i.e. partition them) and then perform an operation on them as separate groups. As we operate on each row we can then access the columns of the previous row above. This is the feature PARTITION BY gives us. (PARTITION BY is not related to partitioning of a table for performance.)
So then how do we output the non-overlapping dates? We first order the query based on the (ID, DFROM) fields, then we use the ID field to make our partitions (row groups). We then test the previous row's TO value and the current row's FROM value for overlap using an expression like (in pseudocode):
max(previous.DTO, current.DFROM) as DFROM
This basic expression will return the original DFROM value if it doesn't overlap, but will return the previous TO value if there is overlap. Since our rows are ordered, we only need to be concerned with the last row. In cases where a previous row completely overlaps the current row, we want the row to end up with a 'zero' date range. So we do the same thing for the DTO field to get:
max(previous.DTO, current.DFROM) as DFROM, max(previous.DTO, current.DTO) as DTO
Once we have generated the new results set with the adjusted DFROM and DTO values, we can aggregate them up and count the range intervals of DFROM and DTO.
Be aware that most date calculations in databases are not inclusive the way your data is. Something like DATEDIFF(dto,dfrom) will not include the day dto actually refers to, so we will want to adjust dto up a day first.
I don't have access to an Oracle server anymore, but I know this is possible with Oracle analytics. The query should go something like this:
(Please update my post if you get this to work.)
SELECT id,
       GREATEST(dfrom, NVL(MAX(dto) OVER (PARTITION BY id ORDER BY dfrom
                           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), dfrom)) AS dfrom,
       GREATEST(dto, NVL(MAX(dto) OVER (PARTITION BY id ORDER BY dfrom
                           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), dto)) AS dto
FROM (
    SELECT id, dfrom, dto + 1 AS dto -- adjust the table so that dto becomes non-inclusive
    FROM my_sample
) sample;
The secret here is the MAX(dto) OVER (... ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) window, which returns the largest dto seen in the rows before the current one; the NVL supplies a fallback for the first row of each partition, and GREATEST replaces the two-argument max of the pseudocode. So this query should output new dfrom/dto values which don't overlap. It's then a simple matter of sub-querying this, taking (dto - dfrom), and summing the totals.
Using MySQL
I did have access to a MySQL server, so I got it working there. MySQL doesn't have result-set partitioning (analytics) like Oracle, so we have to use result-set variables. This means we use #var:=xxx type expressions to remember the last date value and adjust dfrom/dto accordingly. Same algorithm, just a little longer and with more complex syntax. We also have to forget the last date value any time the ID field changes!
So here is the sample table (same values you have):
create table sample(id int, dfrom date, dto date, networkDay int);
insert into sample values
(1,'2012-09-03','2012-09-07',5),
(1,'2012-09-03','2012-09-04',2),
(1,'2012-09-05','2012-09-06',2),
(1,'2012-09-06','2012-09-12',5),
(1,'2012-08-31','2012-09-04',3),
(2,'2012-09-04','2012-09-06',3),
(2,'2012-09-11','2012-09-13',3),
(2,'2012-09-05','2012-09-08',3);
On to the query, we output the un-grouped result set like above:
The variable #ldt is "last date", and the variable #lid is "last id". Any time #lid changes, we reset #ldt to null. FYI, in MySQL the := operator is where the assignment happens; a bare = is just an equality comparison.
This is a 3-level query, but it could be reduced to 2. I went with an extra outer query to keep things more readable. The innermost query is simple: it adjusts the dto column to be non-inclusive and does the proper row ordering. The middle query adjusts the dfrom/dto values to make them non-overlapping. The outer query simply drops the unused fields and calculates the interval range.
set #ldt=null, #lid=null;
select id, no_dfrom as dfrom, no_dto as dto, datediff(no_dto, no_dfrom) as days from (
select if(#lid=id,#ldt,#ldt:=null) as last, dfrom, dto, if(#ldt>=dfrom,#ldt,dfrom) as no_dfrom, if(#ldt>=dto,#ldt,dto) as no_dto, #ldt:=if(#ldt>=dto,#ldt,dto), #lid:=id as id,
datediff(dto, dfrom) as overlapped_days
from (select id, dfrom, dto + INTERVAL 1 DAY as dto from sample order by id, dfrom) as sample
) as nonoverlapped
order by id, dfrom;
The above query gives the results (notice dfrom/dto are non-overlapping here):
+------+------------+------------+------+
| id | dfrom | dto | days |
+------+------------+------------+------+
| 1 | 2012-08-31 | 2012-09-05 | 5 |
| 1 | 2012-09-05 | 2012-09-08 | 3 |
| 1 | 2012-09-08 | 2012-09-08 | 0 |
| 1 | 2012-09-08 | 2012-09-08 | 0 |
| 1 | 2012-09-08 | 2012-09-13 | 5 |
| 2 | 2012-09-04 | 2012-09-07 | 3 |
| 2 | 2012-09-07 | 2012-09-09 | 2 |
| 2 | 2012-09-11 | 2012-09-14 | 3 |
+------+------------+------------+------+
How about constructing a query which merges intervals by removing holes and keeping only maximal intervals? It goes like this (not tested):
SELECT DISTINCT F.ID, F.From, L.To
FROM Temp AS F, Temp AS L
WHERE F.From < L.To AND F.ID = L.ID
AND NOT EXISTS (SELECT *
FROM Temp AS T
WHERE T.ID = F.ID
AND F.From < T.From AND T.From < L.To
AND NOT EXISTS ( SELECT *
FROM Temp AS T1
WHERE T1.ID = F.ID
AND T1.From < T.From
AND T.From <= T1.To)
)
AND NOT EXISTS (SELECT *
FROM Temp AS T2
WHERE T2.ID = F.ID
AND (
(T2.From < F.From AND F.From <= T2.To)
OR (T2.From < L.To AND L.To < T2.To)
)
)
with t_data as (
select 1 as id,
to_date('03-sep-12','dd-mon-yy') as start_date,
to_date('07-sep-12','dd-mon-yy') as end_date from dual
union all
select 1,
to_date('03-sep-12','dd-mon-yy'),
to_date('04-sep-12','dd-mon-yy') from dual
union all
select 1,
to_date('05-sep-12','dd-mon-yy'),
to_date('06-sep-12','dd-mon-yy') from dual
union all
select 1,
to_date('06-sep-12','dd-mon-yy'),
to_date('12-sep-12','dd-mon-yy') from dual
union all
select 1,
to_date('31-aug-12','dd-mon-yy'),
to_date('04-sep-12','dd-mon-yy') from dual
union all
select 2,
to_date('04-sep-12','dd-mon-yy'),
to_date('06-sep-12','dd-mon-yy') from dual
union all
select 2,
to_date('11-sep-12','dd-mon-yy'),
to_date('13-sep-12','dd-mon-yy') from dual
union all
select 2,
to_date('05-sep-12','dd-mon-yy'),
to_date('08-sep-12','dd-mon-yy') from dual
),
t_holidays as (
select to_date('01-jan-12','dd-mon-yy') as holiday
from dual
),
t_data_rn as (
select rownum as rn, t_data.* from t_data
),
t_model as (
select distinct id,
start_date
from t_data_rn
model
partition by (rn, id)
dimension by (0 as i)
measures(start_date, end_date)
rules
( start_date[for i
from 1
to end_date[0]-start_date[0]
increment 1] = start_date[0] + cv(i),
end_date[any] = start_date[cv()] + 1
)
order by 1,2
),
t_network_days as (
select t_model.*,
case when
mod(to_char(start_date, 'j'), 7) + 1 in (6, 7)
or t_holidays.holiday is not null
then 0 else 1
end as working_day
from t_model
left outer join t_holidays
on t_holidays.holiday = t_model.start_date
)
select id,
sum(working_day) as network_days
from t_network_days
group by id;
t_data - your initial data
t_holidays - contains list of holidays
t_data_rn - just adds unique key (rownum) to each row of t_data
t_model - expands t_data date ranges into a flat list of dates
t_network_days - marks each date from t_model as working day or weekend based on day of week (Sat and Sun) and holidays list
final query - calculates the number of network days for each ID.
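To cross-check the expected totals from the question, here is a small Python sketch that merges each ID's overlapping inclusive date ranges and counts weekdays (no holiday table, matching the sample data):

```python
from datetime import date, timedelta

def merge(intervals):
    """Merge overlapping inclusive (start, end) date ranges."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # overlap: extend the range
        else:
            merged.append([start, end])
    return merged

def network_days(start, end):
    """Count Mon-Fri days in the inclusive range."""
    return sum(1 for i in range((end - start).days + 1)
               if (start + timedelta(days=i)).weekday() < 5)

data = {
    1: [(date(2012, 9, 3), date(2012, 9, 7)),
        (date(2012, 9, 3), date(2012, 9, 4)),
        (date(2012, 9, 5), date(2012, 9, 6)),
        (date(2012, 9, 6), date(2012, 9, 12)),
        (date(2012, 8, 31), date(2012, 9, 4))],
    2: [(date(2012, 9, 4), date(2012, 9, 6)),
        (date(2012, 9, 11), date(2012, 9, 13)),
        (date(2012, 9, 5), date(2012, 9, 8))],
}
totals = {i: sum(network_days(s, e) for s, e in merge(v))
          for i, v in data.items()}
```

This reproduces the totals from the question: 9 network days for ID 1 and 7 for ID 2.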