SQL script to produce the output shown in the screenshot - sql

I want to write a SQL script to produce the output shown in the screenshot image. Thank you.
I've tried the MAX() function to aggregate the ESSBASE_MONTH field to make it distinct and display a single month in the output instead of multiple months. I have yet to figure out how to put 0 in any month in which the EMPID did not make any sale, like December under "Total GreaterThan 24 HE Account" and "Total_HE_Accounts".

The fields of the table are not very informative; however, based on the screenshot, this is the best answer I could come up with.
Assuming the table name is SALES:
select
    ADJ_EMPID,
    ESSBASE_MONTH,
    MAX(YTD_COUNT) AS YTD_COUNT,
    SUM(TOTAL_24) AS TOTAL_24,
    SUM(TOTAL_ACC) AS TOTAL_ACC
from SALES
group by
    ADJ_EMPID,
    ESSBASE_MONTH;
The above will aggregate the monthly 'sales' data as expected.
To add the 'missing' rows, such as December, it is possible to union the above query with a virtual table.
select
    MAX(MONTH_NUMBER) AS MONTH_NUMBER,
    ADJ_EMPID,
    ESSBASE_MONTH,
    MAX(YTD_COUNT) AS YTD_COUNT,
    SUM(TOTAL_24) AS TOTAL_24,
    SUM(TOTAL_ACC) AS TOTAL_ACC
from (
    select
        1 as MONTH_NUMBER,
        *
    from SALES
    union all
    select * from (values
        (1, '300014366', 'January', 0, 0, 0),
        (2, '300014366', 'February', 0, 0, 0),
        -- add the other missing months as required
        (11, '300014366', 'November', 0, 0, 0),
        (12, '300014366', 'December', 0, 0, 0)
    ) TEMP_TABLE (MONTH_NUMBER, ADJ_EMPID, ESSBASE_MONTH, YTD_COUNT, TOTAL_24, TOTAL_ACC)
) as AGGREGATED_DATA
group by
    ADJ_EMPID,
    ESSBASE_MONTH
order by MONTH_NUMBER;
TEMP_TABLE is a virtual table which contains all the months with sales of zero. A special field, MONTH_NUMBER, is added so the months can be sorted in the proper order.
It is not the easiest query to understand, but the requirement is not exactly straightforward either.
Link to a db fiddle for a working solution with PostgreSQL 15.
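An alternative pattern is to generate the twelve months as a derived table and LEFT JOIN the sales onto it, so missing months come out as zero via COALESCE. This is only a minimal sketch in the same PostgreSQL dialect, reusing the SALES columns above and hard-coding the employee ID from the example for illustration:
select m.MONTH_NUMBER,
       '300014366' as ADJ_EMPID,
       m.ESSBASE_MONTH,
       coalesce(max(s.YTD_COUNT), 0) as YTD_COUNT,
       coalesce(sum(s.TOTAL_24), 0) as TOTAL_24,
       coalesce(sum(s.TOTAL_ACC), 0) as TOTAL_ACC
from (values
        (1, 'January'), (2, 'February'), (3, 'March'), (4, 'April'),
        (5, 'May'), (6, 'June'), (7, 'July'), (8, 'August'),
        (9, 'September'), (10, 'October'), (11, 'November'), (12, 'December')
     ) m (MONTH_NUMBER, ESSBASE_MONTH)   -- month list replaces the virtual table
left join SALES s                        -- LEFT JOIN fills the gaps with NULLs, COALESCE turns them into 0
       on s.ESSBASE_MONTH = m.ESSBASE_MONTH
      and s.ADJ_EMPID = '300014366'
group by m.MONTH_NUMBER, m.ESSBASE_MONTH
order by m.MONTH_NUMBER;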

Related

Parse JSON value having dynamic keys in column and convert JSON to record column structure in BigQuery

I have a problem where one column in the table contains a JSON string. The JSON has some fixed keys, while other keys are added dynamically and their count is not fixed either. I need help with two problems here:
- Parse this JSON and extract the value of keys which are available and dynamic.
- Convert the JSON column to record structure, using the pattern and value of the key.
E.g., if the JSON below is present in the column:
{"y2019m08w35": 0, "total": 0, "y2019m08w33": 0, "y2019m08": 0, "y2019m08w34": 0}
then the keys are y2019m08w35, y2019m08w33, etc., and they could be anything, as each one consists of a year, month and week combination.
Now let's say I get the value of y2019m08w33; it should go into the record column, which should be structured like below:
Year - record column (2019)
Month - record column inside the year (m08)
Week - record column inside the month (w33), which will hold the value of y2019m08w33, which is 0.
See the attached screenshots for details: the initial value in the table, and the expected output.
Below is for BigQuery Standard SQL
#standardSQL
SELECT id, name, product_id,
  ARRAY(
    SELECT AS STRUCT year, ARRAY_AGG(STRUCT(month, weeks)) months
    FROM (
      SELECT year, month, ARRAY_AGG(STRUCT(week, value)) weeks
      FROM (
        SELECT
          REGEXP_EXTRACT(kv, r'y(\d{4})') year,
          REGEXP_EXTRACT(kv, r'm(\d{2})') month,
          IFNULL(REGEXP_EXTRACT(kv, r'w\d{2}'), 'w0') week,
          REGEXP_EXTRACT(kv, r': (\d*)') value
        FROM UNNEST(REGEXP_EXTRACT_ALL(json, r'"y\d{4}m\d{2}(?:w\d{2})?": \d*')) kv
      )
      GROUP BY year, month
    )
    GROUP BY year
  ) AS json
FROM `project.dataset.table`
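To see what the innermost REGEXP_EXTRACT layer produces for a single key/value pair, here is a minimal standalone sketch; the two string literals are lifted from the sample JSON in the question:
#standardSQL
SELECT
  REGEXP_EXTRACT(kv, r'y(\d{4})') year,               -- '2019'
  REGEXP_EXTRACT(kv, r'm(\d{2})') month,              -- '08'
  IFNULL(REGEXP_EXTRACT(kv, r'w\d{2}'), 'w0') week,   -- 'w33', or the 'w0' fallback for month-level keys
  REGEXP_EXTRACT(kv, r': (\d*)') value                -- '20' / '0'
FROM UNNEST(['"y2019m08w33": 20', '"y2019m08": 0']) kv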
You can test and play with the full query using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 2 id, 'Test2' name, 1234 product_id, '{"y2019m08w35": 0, "total": 0, "y2019m08w33": 0, "y2019m08": 0, "y2019m08w34": 0}' json UNION ALL
  SELECT 1 id, 'Test' name, 8338 product_id, '{"y2018m08w35": 10,"y2019m08w35": 10, "y2019m08w33": 20, "y2019m08": 0, "y2019m09w34": 30, "y2019m10w34": 30, "y2019m10w35": 40}' json
)
SELECT id, name, product_id,
  ARRAY(
    SELECT AS STRUCT year, ARRAY_AGG(STRUCT(month, weeks)) months
    FROM (
      SELECT year, month, ARRAY_AGG(STRUCT(week, value)) weeks
      FROM (
        SELECT
          REGEXP_EXTRACT(kv, r'y(\d{4})') year,
          REGEXP_EXTRACT(kv, r'm(\d{2})') month,
          IFNULL(REGEXP_EXTRACT(kv, r'w\d{2}'), 'w0') week,
          REGEXP_EXTRACT(kv, r': (\d*)') value
        FROM UNNEST(REGEXP_EXTRACT_ALL(json, r'"y\d{4}m\d{2}(?:w\d{2})?": \d*')) kv
      )
      GROUP BY year, month
    )
    GROUP BY year
  ) AS json
FROM `project.dataset.table`
with the result as shown.
Hope you can adjust the above to whatever naming you really need. Note: the naming of the output columns in your example is not doable unless you did the mockup in Excel or Sheets, where you are obviously free to name things as you wish :o)

Find two local averages within one SQL Server data set

In the plant at our company there is a physical process that has a two-stage start and a two-stage finish. As a widget starts to enter the process a new record is created containing the widget ID and a timestamp (DateTimeCreated) and once the widget fully enters the process another timestamp is logged in a different field for the same record (DateTimeUpdated). The interval is a matter of minutes.
Similarly, as a widget starts to exit the process another record is created containing the widget ID and the DateTimeCreated, with the DateTimeUpdated being populated when the widget has fully exited the process. In the current table design an "exiting" record is indistinguishable from an "entering" record (although a given widget ID occurs only either once or twice so a View could utilise this fact to make the distinction, but let's ignore that for now).
The overall time a widget is in the process is several days but that's not really of importance to the discussion. What is important is that the interval when exiting the process is always longer than when entering. So a very simplified, imaginary set of sorted interval values might look like this:
1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 6, 7, 7, 7, 7, 8, 8, 8, 8, 10, 10, 10
You can see there is a peak in the occurrences of intervals around the 3-minute-mark (the "enters") and another peak around the 7/8-minute-mark (the "exits"). I've also excluded intervals of 5 minutes to demonstrate that enter-intervals and exit-intervals can be considered mutually exclusive.
We want to monitor the performance of each stage in the process daily by using a query to determine the local averages of the entry and exit data point clusters. So conceptually the two data sets could be split either side of an overall average (in this case 5.375) and then an average calculated for the values below the split (2.75) and another average above the split (8). Using the data above (in a random distribution) the averages are depicted as the dotted lines in the chart below.
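As a quick sanity check of those figures, a minimal T-SQL sketch over the sample intervals above reproduces the 5.375, 2.75 and 8 values (the CTE and column names are illustrative):
WITH cte_Sample AS
(
    SELECT 1.0 * v AS [Interval]
    FROM (VALUES (1),(2),(2),(3),(3),(3),(3),(3),(3),(3),(3),(4),
                 (6),(7),(7),(7),(7),(8),(8),(8),(8),(10),(10),(10)) t(v)
)
SELECT
    AVG([Interval]) AS OverallAverage,  -- 5.375
    AVG(CASE WHEN [Interval] < (SELECT AVG([Interval]) FROM cte_Sample) THEN [Interval] END) AS AverageEntryInterval,  -- 2.75
    AVG(CASE WHEN [Interval] > (SELECT AVG([Interval]) FROM cte_Sample) THEN [Interval] END) AS AverageExitInterval    -- 8
FROM cte_Sample;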
My current approach is to use two Common Table Expressions followed by a final three-table-join query. It seems okay, but I can't help feeling it could be better. Would anybody like to offer an alternative approach or other observations?
WITH cte_Raw AS
(
SELECT
DATEDIFF(minute, DateTimeCreated, DateTimeUpdated) AS [Interval]
FROM
MyTable
WHERE
DateTimeCreated > CAST(CAST(GETDATE() AS date) AS datetime) -- Today
)
, cte_Midpoint AS
(
SELECT
AVG(Interval) AS Interval
FROM
cte_Raw
)
SELECT
AVG([Entry].Interval) AS AverageEntryInterval
, AVG([Exit].Interval) AS AverageExitInterval
FROM
cte_Raw AS [Entry]
INNER JOIN
cte_Midpoint
ON
[Entry].Interval < cte_Midpoint.Interval
INNER JOIN
cte_Raw AS [Exit]
ON
[Exit].Interval > cte_Midpoint.Interval
I don't think your query produces accurate results. Your two JOINs are producing a proliferation of rows, which throws the averages off. They might look correct (because one is less than the other), but if you did counts, you would see that the counts in your query have little to do with the sample data.
If you are just looking for the average of values that are less than the overall average and greater than the overall average, then you can use window functions:
WITH t AS (
SELECT t.*, v.[Interval],
AVG(v.[Interval]) OVER () as avg_interval
FROM MyTable t CROSS JOIN
(VALUES (DATEDIFF(minute, DateTimeCreated, DateTimeUpdated))
) v(Interval)
WHERE DateTimeCreated > CAST(CAST(GETDATE() AS date) AS datetime)
)
SELECT AVG(CASE WHEN t.[Interval] < t.avg_interval THEN t.[Interval] END) AS AverageEntryInterval,
AVG(CASE WHEN t.[Interval] > t.avg_interval THEN t.[Interval] END) AS AverageExitInterval
FROM t;
I decided to post my own answer, as at the time of writing neither of the two proposed answers would run. I have, however, removed the JOIN statements and used the CASE expression approach proposed by Gordon.
I've also multiplied the DATEDIFF result by 1.0 to prevent the AVG function from rounding the results.
WITH cte_Raw AS
(
SELECT
1.0 * DATEDIFF(minute, DateTimeCreated, DateTimeUpdated) AS [Interval]
FROM
MyTable
WHERE
DateTimeCreated > CAST(CAST(GETDATE() AS date) AS datetime) -- Today
)
, cte_Midpoint AS
(
SELECT
AVG(Interval) AS Interval
FROM
cte_Raw
)
SELECT AVG(CASE WHEN cte_Raw.Interval < cte_Midpoint.Interval THEN cte_Raw.[Interval] END) AS AverageEntryInterval,
AVG(CASE WHEN cte_Raw.Interval > cte_Midpoint.Interval THEN cte_Raw.[Interval] END) AS AverageExitInterval
FROM cte_Raw CROSS JOIN cte_Midpoint
This solution does not cater for the theoretical pitfall indicated by Vladimir of uneven dispersions of Entry vs Exit intervals, as in practice we can be confident this does not occur.

How to get the column sums as rows?

I have a table Expense where monthly expense is stored.
Now I want to get a result like "output". Here the ID will be set according to the month sequence, hence December will get 12.
How can I achieve that? I tried UNPIVOT but can't achieve it.
You can use CROSS APPLY:
select tt.id, sum(tt.monval) as TotalExpense
from Expense t cross apply
( values (1, January), (2, February), (3, March) ) tt(id, monval)
group by tt.id;
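If all twelve months need to be summed, the same VALUES list can simply be extended; a sketch assuming the Expense table has one column per month (the April through December column names are an assumption):
select tt.id, sum(tt.monval) as TotalExpense
from Expense t cross apply
     -- one (id, value) pair per month column; the later months are assumed to exist on Expense
     ( values (1, January), (2, February), (3, March), (4, April),
              (5, May), (6, June), (7, July), (8, August),
              (9, September), (10, October), (11, November), (12, December)
     ) tt(id, monval)
group by tt.id;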

SQL query involving count, group by and substring

I would like to group rows of this table according to dates which form the start of SessionID and for each day, I would like to count how many rows there are for each set of ReqPhone values. Each set of ReqPhone values will be defined by the first four digits of ReqPhone. In other words, I would like to know how many rows there are for ReqPhone starting with 0925, 0927 and 0940, how many rows there are for ReqPhone starting with 0979, 0969 and 0955, etc etc.
I have been trying all kinds of group by and count but still haven't arrived at the right query.
Can anybody enlighten me?
Update:
In my country, the government assigns telecoms phone numbers starting with certain digits. Therefore, if you know the starting digits, you know which telecom someone is using. I am trying to count how many messages are sent each day via each telecom.
SELECT SUBSTRING(ReqPhone, 1, 4),
DATEADD(DAY,0, DATEDIFF(DAY, 0, SessionID)) AS dayCreated,
COUNT(*) AS tally
FROM yourTable
GROUP BY SUBSTRING(ReqPhone, 1, 4),
DATEADD(DAY, 0, DATEDIFF(DAY, 0, SessionID))
Or, equivalently, using LEFT:
SELECT LEFT(ReqPhone, 4),
DATEADD(DAY,0, DATEDIFF(DAY, 0, SessionID)) AS dayCreated,
COUNT(*) AS tally
FROM yourTable
GROUP BY LEFT(ReqPhone,4),
DATEADD(DAY, 0, DATEDIFF(DAY, 0, SessionID))
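If several prefixes need to be rolled up into a single carrier, as the question describes, one option is to map each prefix to a label first and then group on that label. A sketch in the same SQL Server style; the telecom names are purely illustrative, and the prefix groupings are taken from the question:
SELECT telecom,
       DATEADD(DAY, 0, DATEDIFF(DAY, 0, SessionID)) AS dayCreated,
       COUNT(*) AS tally
FROM (
    SELECT SessionID,
           CASE WHEN SUBSTRING(ReqPhone, 1, 4) IN ('0925', '0927', '0940') THEN 'TelecomA'  -- illustrative label
                WHEN SUBSTRING(ReqPhone, 1, 4) IN ('0979', '0969', '0955') THEN 'TelecomB'  -- illustrative label
                ELSE 'Other'
           END AS telecom
    FROM yourTable
) t
GROUP BY telecom,
         DATEADD(DAY, 0, DATEDIFF(DAY, 0, SessionID))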
This will help you calculate the count of rows grouped by the ReqPhone prefix. This query works in Oracle DB.
SELECT COUNT(SESSIONID), REQP
FROM (SELECT SESSIONID,SUBSTR(REQPHONE,1,4) AS REQP FROM SCHEMA_NAME.TABLE_NAME)
GROUP BY REQP
Note: please use a column that is unique in the COUNT expression.

Count rows based on a pair of distinct values

We use hive to run queries on AB test data. The problem here is that we have some duplicate data we are trying to ignore. Luckily we have a means to ignore duplicate data. Our conversion_meta column contains an indicator for this duplicate data.
I'd like to find distinct (conversion_meta, conversion_type). I can't really figure out the correct syntax though. Here is what I have so far:
select conversion_type, day, sum(if(is_control='true', 1, 0)) as Control,
sum(if(is_control='false', 1, 0)) as Test from Actions
where day > "2013-12-20" and experiment_key='xyz' group by conversion_type, day
The columns in the end result should look like:
Conversion Type, Day, Control (count), Test (count)
I think you can solve this problem with UNION ALL:
select conversion_type, day, sum(if(is_control='true', 1, 0)) as Control,
sum(if(is_control='false', 1, 0)) as Test from Actions
where day > "2013-12-20" and experiment_key='xyz' and conversion_meta = false
group by conversion_type, day
UNION ALL
select conversion_type, day, sum(if(is_control='true', 1, 0)) as Control,
sum(if(is_control='false', 1, 0)) as Test from Actions
where day > "2013-12-20" and experiment_key='xyz' and conversion_meta = true
group by conversion_type, day
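If the intent is for rows sharing the same conversion_meta to be counted only once per conversion type, day and test arm (the question says that column flags the duplicates), another option is to de-duplicate first and then apply the same conditional sums. A sketch in Hive under that assumption:
select conversion_type, day,
       sum(if(is_control = 'true', 1, 0)) as Control,
       sum(if(is_control = 'false', 1, 0)) as Test
from (
    -- collapse duplicate rows that carry the same conversion_meta indicator
    select distinct conversion_type, day, is_control, conversion_meta
    from Actions
    where day > "2013-12-20" and experiment_key = 'xyz'
) deduped
group by conversion_type, day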