Group or Sum the data based on overlapping period - sql

I'm working on migrating legacy system data to a new system. I'm trying to migrate the data with its history, based on the changed date. My current query produces the output below.
Since it's a legacy system, some of the data falls within the same period. I want to group the data by id and name, and flag each row's value as an active or inactive record depending on whether it falls within the same period as another row.
My expected output:
For example, let's take 119 and walk through it. One row is marked yellow because it does not fall within any period that overlaps the other rows, but the other two rows overlap during the period 01-Nov-18 to 30-Sep-19.
I need to split the data for the overlapping period and add the values only for the overlapped portion. So I need to look at combinations based on date, which means introducing two rows: one for the non-overlapped portion, which gives the two rows below.
And another row for the overlapped portion.
The same scenario applies to 148324: two rows are introduced, one for the overlapped portion and another for the non-overlapped portion.
Also, is it possible to get the non-overlapped data alone based on some condition? I want to move only the overlapping data to a temp table, so I can move the non-overlapped data directly to the output table.

I don't think I have a 100% solution, but it's hard to decide which data are correct and how to sort them.
This query is based on the lead/lag analytic functions. I had to replace NULL values with adequate sentinel values in the sequence (far future and far past).
Please try and modify this query; I hope it fits your case.
My table:
Query:
SELECT id, name, value, startdate, enddate,
       -- NVL supplies far-future / far-past sentinel values for the first and last rows in each group
       CASE WHEN nvl(next_startdate, 29993112) > nvl(prev_enddate, 19900101)
            THEN 'Y' ELSE 'N' END AS active
FROM
(
    SELECT datatable.*,
           lag(enddate)    OVER (PARTITION BY id, name ORDER BY startdate, value DESC) AS prev_enddate,
           lead(startdate) OVER (PARTITION BY id, name ORDER BY startdate, value DESC) AS next_startdate
    FROM datatable
) dt
Results:
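If you need to route overlapping and non-overlapping rows to different targets (the follow-up question above), one hedged sketch is to wrap the query above and filter on the active flag. The names output_table, temp_overlaps and flagged_rows are placeholders, and this assumes 'Y' marks the rows you treat as non-overlapping:

-- flagged_rows stands for the query above, saved as a view (hypothetical name)
INSERT INTO output_table (id, name, value, startdate, enddate)
SELECT id, name, value, startdate, enddate
FROM flagged_rows
WHERE active = 'Y';

INSERT INTO temp_overlaps (id, name, value, startdate, enddate)
SELECT id, name, value, startdate, enddate
FROM flagged_rows
WHERE active = 'N';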

Related

Query another table with the results of another query that includes a CSV column

Brief Summary:
I am currently trying to get a count of completed parts that fall within a specific time range and match the machine number, operation number, and tool number.
For example:
SELECT Sequence, Serial, Operation, Machine, DateTime, value AS Tool
FROM tbPartProfile
CROSS APPLY STRING_SPLIT(Tool_Used, ',')
ORDER BY DateTime DESC
is running a query which pulls all the instances where a tool has been changed. I am splitting the CSV from the Tool_Used column because there can be multiple changes during one operation.
Objective:
This is where the production count comes into play. For example, record 1 has a tool change of 36 on 12/12/2022. I will need to go back into the table and get the number of parts completed that match the OPERATION/MACHINE/TOOL and fall within the date range.
For example:
SELECT *
FROM tbPartProfile
WHERE Operation = 20 AND Machine = 1 AND Tool_Used LIKE '%36%'
ORDER BY DateTime desc
For example, this query will give me the datetimes when tools LIKE 36 were changed. I will need to take each datetime, compare it against the previous query, and get the sum of all parts that were run for that TimeRange/Operation/Machine/Tool Used.
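A hedged sketch of one way to combine the two steps: take each tool-change row from the CROSS APPLY query, use LEAD to find the next change of the same tool on the same machine/operation, and count completed parts in that window. The production table name (tbProduction) and its columns (Serial, Operation, Machine, Tool, CompletedDateTime) are assumptions, since the completed-parts table isn't shown in the question:

WITH ToolChanges AS (
    SELECT Operation, Machine, value AS Tool, DateTime AS ChangeTime,
           -- next change of the same tool on the same machine/operation
           LEAD(DateTime) OVER (PARTITION BY Operation, Machine, value
                                ORDER BY DateTime) AS NextChangeTime
    FROM tbPartProfile
    CROSS APPLY STRING_SPLIT(Tool_Used, ',')
)
SELECT tc.Operation, tc.Machine, tc.Tool, tc.ChangeTime,
       COUNT(p.Serial) AS PartsCompleted
FROM ToolChanges tc
LEFT JOIN tbProduction p              -- hypothetical completed-parts table
  ON  p.Operation = tc.Operation
  AND p.Machine   = tc.Machine
  AND p.Tool      = tc.Tool
  AND p.CompletedDateTime >= tc.ChangeTime
  AND (p.CompletedDateTime < tc.NextChangeTime OR tc.NextChangeTime IS NULL)
GROUP BY tc.Operation, tc.Machine, tc.Tool, tc.ChangeTime
ORDER BY tc.ChangeTime DESC;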

SQL - Update groups of data based on start and end dates

I have a table with dates of service for various hospital stays and want to update the starting and end dates for each claim to match the length of the entire stay. The table below has seven inpatient stays and dates of service for each of those stays. A min_max flag of 1 or 2 means that the dates in that row cover the entire length of that specific stay (each stay is color-coded).
Current table image here
I need to update the dates for all rows within each colored grouping to match the starting and end dates of the row which has a min_max flag of 1 or 2 within the same group, to ultimately find the sum of claims in each stay. I could do this manually here or in Excel, but I need it done on a much larger scale with thousands of hospital stays.
Goal table here
TIA!
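Without the column names from the images this can only be a hedged sketch, but the usual pattern is a self-join from each row to the flagged row in its group, then an UPDATE from that join (SQL Server-style UPDATE ... FROM shown; dbo.claims, stay_id, svc_start, svc_end, and min_max are assumed names):

UPDATE t
SET t.svc_start = m.svc_start,
    t.svc_end   = m.svc_end
FROM dbo.claims AS t
JOIN dbo.claims AS m
  ON  m.stay_id = t.stay_id          -- whatever column identifies the color-coded stay
  AND m.min_max IN (1, 2);           -- the row holding the full-stay date range

After that, summing claims per stay is a plain GROUP BY on the stay identifier.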

SQL query for percentage change compared to previous date

I have a table within Access containing the performance of departments on different reference dates. All data is within one table, "tblmain". The table contains the following fields:
reference date (called "ref_date", formatted dd.mm.yyyy)
department identifier (called "dep_id")
performance value (called "val")
Each reference date covers around 100 departments, and every week I import a new reference date.
My goal now is to build a query which calculates the percentage change from one reference date to the previous reference date. Furthermore, it should only show the departments with a change bigger than 5%.
I am currently stuck. I have created a query that gives me the val from the previous reference date, but only for one specific department, and I do not know how to continue. This query looks as follows:
SELECT TOP 1 t.val
FROM (SELECT TOP 2 tblmain.val, tblmain.ref_date
      FROM tblmain
      WHERE dep_id = 1
      ORDER BY tblmain.ref_date DESC) AS t
ORDER BY t.ref_date;
I would appreciate any feedback. After finishing this query, I plan to use it in a form where I can choose a reference date and threshold.
Many thanks in advance!
Query to pull prior val for each record:
SELECT tblMain.ID, tblMain.ref_date, tblMain.dep_id, tblMain.val,
       (SELECT TOP 1 val
        FROM tblMain AS Dupe
        WHERE Dupe.dep_id = tblMain.dep_id AND Dupe.ref_date < tblMain.ref_date
        ORDER BY Dupe.ref_date DESC) AS PriorVal
FROM tblMain;
Now use that query to calculate percentage:
SELECT Query1.*, Abs(([PriorVal]-[val])/[PriorVal]*100) AS P
FROM Query1
WHERE Abs(([PriorVal]-[val])/[PriorVal]*100) > 5;
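For the planned form, a hedged sketch of a parameterized variant: Access will prompt for (or a form can supply) [SelectedDate] and [Threshold], which are placeholder parameter names:

SELECT Query1.*, Abs(([PriorVal]-[val])/[PriorVal]*100) AS P
FROM Query1
WHERE Query1.ref_date = [SelectedDate]
  AND Abs(([PriorVal]-[val])/[PriorVal]*100) > [Threshold];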

Use SQL to ensure I have data for each day of a certain time period

I'm looking to select only one data point from each date in my report. I want to ensure each day is accounted for and has at least one row of information, as we had to do a few different things to move a large data file into our data warehouse (import one large Google Sheet for some data, use Python for daily pulls of some of the other data - want to make sure no date was left out), and this data goes from now back through last summer. I could use a COUNT DISTINCT clause to check that the number of distinct days matches the number of days between the first data point and yesterday (the latest data point), but I want to verify that each specific day is accounted for. I should mention I am in BigQuery. Also, an example of the created_at format is: 2021-02-09 17:05:44.583 UTC
This is what I have so far:
SELECT FIRST(created_at)
FROM 'large_table'
ORDER BY created_at
I know FIRST is probably not the best clause for this case; it's currently just grabbing the very first data point in created_at, but it's a jumping-off point.
You can use aggregation:
select any_value(lt).*
from large_table lt
group by created_at
order by min(created_at);
Note: This assumes that created_at is a date -- or at least only has one value per date. You might need to convert it to a date:
select any_value(lt).*
from large_table lt
group by date(created_at)
order by min(created_at);
BigQuery equivalent of the query in your question
SELECT created_at
FROM `large_table`
ORDER BY created_at
LIMIT 1
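If the goal is specifically to verify that no day is missing, a hedged sketch of one way to list the gaps in BigQuery, assuming created_at is a TIMESTAMP column in `large_table` (the table name simply mirrors the question):

WITH bounds AS (
  SELECT DATE(MIN(created_at)) AS first_day,
         DATE(MAX(created_at)) AS last_day
  FROM `large_table`
),
days AS (
  -- one row per calendar day between the first and last data points
  SELECT day
  FROM bounds, UNNEST(GENERATE_DATE_ARRAY(first_day, last_day)) AS day
),
present AS (
  SELECT DISTINCT DATE(created_at) AS day
  FROM `large_table`
)
SELECT d.day AS missing_day
FROM days d
LEFT JOIN present p ON p.day = d.day
WHERE p.day IS NULL
ORDER BY missing_day;

An empty result means every day between the first data point and the latest one has at least one row.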

Is there a way to handle immutability that's robust and scalable?

Since BigQuery is append-only, I was thinking about stamping each record I upload to it with an 'effective date', similar to how PeopleSoft works, if anybody is familiar with that pattern.
Then, I could issue a select statement and join on the max effective date
select UTC_USEC_TO_MONTH(timestamp) as month, sum(amt)/100 as sales
from foo.orders as all
join (select id, max(effdt) as max_effdt from foo.orders group by id) as latest
on all.effdt = latest.max_effdt and all.id = latest.id
group by month
order by month;
Unfortunately, I believe this won't scale because of the BigQuery 'small joins' restriction, so I wanted to see if anyone else had thought around this use case.
Yes, adding a timestamp for each record (or in some cases, a flag that captures the state of a particular record) is the right approach. The small side of a BigQuery "Small Join" can actually return at least 8MB (this value is compressed on our end, so is usually 2 to 10 times larger), so for "lookup" table type subqueries, this can actually provide a lot of records.
In your case, it's not clear to me what exact query you are trying to run... it looks like you are trying to return the most recent sales times of every individual item, and then JOIN this information with the SUM of sales amt per month for each item? Can you provide more info about the query?
It might be possible to do this all in one query. For example, in our wikipedia dataset, an example might look something like...
SELECT contributor_username, UTC_USEC_TO_MONTH(timestamp * 1000000) AS month,
       SUM(num_characters) AS total_characters_used
FROM [publicdata:samples.wikipedia]
WHERE (contributor_username != '' OR contributor_username IS NOT NULL)
  AND timestamp > 1133395200 AND timestamp < 1157068800
GROUP BY contributor_username, month
ORDER BY contributor_username DESC, month DESC;
...to provide wikipedia contributions per user per month (like sales per month per item). This result is actually really large, so you would have to limit by date range.
UPDATE (based on comments below) a similar query that finds "num_characters" for the latest wikipedia revisions by contributors after a particular time...
SELECT current.contributor_username, current.num_characters
FROM
  (SELECT contributor_username, num_characters, timestamp AS time
   FROM [publicdata:samples.wikipedia]
   WHERE contributor_username != '' AND contributor_username IS NOT NULL) AS current
JOIN
  (SELECT contributor_username, MAX(timestamp) AS time
   FROM [publicdata:samples.wikipedia]
   WHERE contributor_username != '' AND contributor_username IS NOT NULL
     AND timestamp > 1265073722
   GROUP BY contributor_username) AS latest
ON current.contributor_username = latest.contributor_username
AND current.time = latest.time;
If your query requires you to first build a large aggregate (for example, you need to run what is essentially an accurate COUNT DISTINCT), another option is to break this up into two queries. The first query could provide the max effective date along with a count and save its result as a new table. Then you could run a sum query on the resulting table.
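One reading of that two-step approach, adapted to the orders example from the question (legacy-SQL style; the intermediate table holds the latest effective date per id, and foo.latest_orders is a hypothetical name for it):

-- Step 1: latest effective date per id; save this result as foo.latest_orders
-- via a destination table.
SELECT id, MAX(effdt) AS max_effdt
FROM foo.orders
GROUP BY id;

-- Step 2: join the saved lookup table back and aggregate by month.
SELECT UTC_USEC_TO_MONTH(o.timestamp) AS month, SUM(o.amt)/100 AS sales
FROM foo.orders AS o
JOIN foo.latest_orders AS latest
  ON o.id = latest.id AND o.effdt = latest.max_effdt
GROUP BY month
ORDER BY month;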
You could also store monthly sales records in separate tables, and only query the particular table for the months you are interested in, simplifying your monthly sales summaries (this could also be a more economical use of BigQuery). When you need to find aggregates across all tables, you could run your queries with multiple tables listed after the FROM clause.
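For example (legacy-SQL style, with hypothetical per-month table names), listing several tables after FROM, separated by commas, queries their union:

SELECT UTC_USEC_TO_MONTH(timestamp) AS month, SUM(amt)/100 AS sales
FROM [foo.orders_201212], [foo.orders_201301], [foo.orders_201302]
GROUP BY month
ORDER BY month;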