Sum of Absolute Differences - Dynamic base and collection - sql

Long-time reader, first-time poster. tl;dr at bottom.
I need to calculate the Sum of Absolute Differences (SAD) between a static set of data points and a dynamically chosen combination base set.
SQL Server 2005
Problem:
Think of 10 circular magnets with Serial Numbers : 1,3,6,7,8,11,12,13,18,20
Each Magnet has 7 data points representing gauss measurements taken around the magnet using a gauss meter.
Because these measurements do not vary much WITHIN a given magnet, but can vary greatly BETWEEN two different magnets, I am trying to calculate SUM of absolute differences between a grouping of magnets.
I started by summing the 7 data points for each magnet and then comparing those sums to each other. You can quickly identify which TWO magnets are the closest by sorting by the smallest SAD: SUM(ABS(Magnet1_DataPoints - Magnet2_DataPoints)).
The tricky part began when trying to find the group of N magnets with the smallest SAD. This turned into a combinations C(n,r) problem with added complexity: I could have 40 magnets and need to pull a group of 8 with the smallest SAD, or I could have 30 and need to pull 4. So the dynamic nature of the problem enters as well.
I'm looking for some guidance on how to write this in SQL. So far, I've hit several walls, especially when trying to do it recursively via an SP. The data points were giving me grief when trying to process this through a CTE.
I've found many examples of recursive combination algorithms, but I'm having trouble adapting them to my very specific use-case.
Any and all help is greatly appreciated!
More important details:
1.) Magnet IDs will be unique integers; once consumed, a magnet will be flagged as "checked out" via a boolean 0/1 field.
2.) The gauss numbers will be decimal(18,2) to begin with.
3.) It's possible I will need separate procedures for each "quantity/grouping" of magnets if that's easier, i.e., 4, 8, 12 (most likely those are the increments they will be pulled in).
4.) Magnet/ID order does not matter: 1,2,3,4 is the same as 2,3,1,4 (identical combinations).
5.) Ideally, the output would be a list representing the grouping and its total SAD, e.g. for a request of 3 magnets:
ID1, ID2, ID3, SAD
36, 28, 29, 20.00
36, 28, 31, 22.00
34, 28, 29, 24.00
Table Example as requested:
CREATE TABLE #TmpMagTable (
    IDOne int,
    TopMiddleOne decimal(18,2),
    TopLeftOne decimal(18,2),
    TopRightOne decimal(18,2),
    MiddleOne decimal(18,2),
    BotLeftOne decimal(18,2),
    BotRightOne decimal(18,2),
    BotMiddleOne decimal(18,2),
    checked int
)
INSERT INTO #TmpMagTable
SELECT 1,1,2,3,4,5,6,7,0
UNION ALL SELECT 10,7,5,6,4,2,3,1,0
UNION ALL SELECT 19,4,2,5,7,3,6,1,0
UNION ALL SELECT 28,1,2,2,3,2,2,1,0
UNION ALL SELECT 2,2,3,4,5,6,7,8,0
UNION ALL SELECT 11,8,6,7,5,3,4,2,0
UNION ALL SELECT 20,8,3,6,5,4,7,2,0
UNION ALL SELECT 29,3,2,2,1,2,2,3,0
UNION ALL SELECT 3,3,4,5,6,7,8,9,0
UNION ALL SELECT 12,9,7,8,6,4,5,3,0
UNION ALL SELECT 21,3,4,7,9,5,8,6,0
UNION ALL SELECT 30,5,3,3,2,5,5,4,0
UNION ALL SELECT 4,4,5,6,7,8,9,10,0
UNION ALL SELECT 13,10,8,9,7,5,6,4,0
UNION ALL SELECT 22,7,5,8,4,6,9,10,0
UNION ALL SELECT 31,4,1,3,5,4,3,1,0
UNION ALL SELECT 5,5,6,7,8,9,10,11,0
UNION ALL SELECT 14,11,9,10,8,6,7,5,0
UNION ALL SELECT 23,5,6,9,11,7,10,8,0
UNION ALL SELECT 32,3,4,6,2,4,1,2,0
UNION ALL SELECT 6,6,7,8,9,10,11,12,0
UNION ALL SELECT 15,12,10,11,9,7,8,6,0
UNION ALL SELECT 24,9,7,10,6,8,11,12,0
UNION ALL SELECT 33,5,3,3,6,1,1,2,0
UNION ALL SELECT 7,7,8,9,10,11,12,13,0
UNION ALL SELECT 16,13,11,12,10,8,9,7,0
UNION ALL SELECT 25,7,8,11,13,9,12,10,0
UNION ALL SELECT 34,1,3,3,0,2,1,5,0
UNION ALL SELECT 8,8,9,10,11,12,13,14,0
UNION ALL SELECT 17,14,12,13,11,9,10,8,0
UNION ALL SELECT 26,8,9,12,14,10,13,11,0
UNION ALL SELECT 35,6,1,3,1,0,0,2,0
UNION ALL SELECT 9,9,10,11,12,13,14,15,0
UNION ALL SELECT 18,15,13,14,12,10,11,9,0
UNION ALL SELECT 27,12,10,13,9,11,14,15,0
UNION ALL SELECT 36,1,3,2,5,3,2,1,0
tl;dr: I need a recursive combination algorithm that sums the data-point differences of each unique combination of magnets, so I can return the grouping with the smallest SAD.
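Before tackling this in T-SQL, it can help to pin down the computation itself. Here is a brute-force sketch in Python, assuming the SAD of a grouping means the sum of pairwise SADs over every pair in the group (the example output rows above are consistent with that reading: 36, 28, 29 gives 20 and 36, 28, 31 gives 22 under it). The readings come from the temp-table rows below.

```python
from itertools import combinations

def pair_sad(a, b):
    """Sum of absolute differences between two magnets' 7 gauss readings."""
    return sum(abs(x - y) for x, y in zip(a, b))

def group_sad(magnets, ids):
    """Group score: sum of pairwise SADs over every pair in the group."""
    return sum(pair_sad(magnets[i], magnets[j]) for i, j in combinations(ids, 2))

def best_group(magnets, r):
    """Return (ids, sad) of the r-magnet combination with the smallest SAD."""
    return min(
        ((ids, group_sad(magnets, ids)) for ids in combinations(magnets, r)),
        key=lambda t: t[1],
    )

# A few magnets from the sample table (ID -> 7 gauss readings).
magnets = {
    28: (1, 2, 2, 3, 2, 2, 1),
    29: (3, 2, 2, 1, 2, 2, 3),
    34: (1, 3, 3, 0, 2, 1, 5),
    36: (1, 3, 2, 5, 3, 2, 1),
    10: (7, 5, 6, 4, 2, 3, 1),
}
ids, sad = best_group(magnets, 3)  # best triple is 28, 29, 36 with SAD 20
```

Note this enumerates all C(n, r) combinations, so 40-choose-8 is about 77 million groups; that is feasible in a compiled language but is the reason a naive recursive CTE in SQL Server struggles.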

Related

Calculate a field with date from previous rows

I have a table like the one shown in the image below. In this table, I need to calculate two new fields (the red and yellow fields), but these fields depend on the previous values. I have to calculate these values in BigQuery/SQL. In Excel it is very easy, but I don't know how to do it in SQL.
I've tried doing a join with the same table shifted by one week, and it works, but only for one "future week" (and there are about 100 future weeks).
How can I calculate this in BigQuery?
I was thinking of a cursor, but as far as I know there are no cursors in BigQuery.
Thanks
This is the example data:
WITH Data as ( Select '2021-01-03' as Week, 1000 as InboundReal, 10000 as StockReal, 1190 as SellReal, 1200 as InboundPpto UNION ALL
Select '2021-01-31',1000, 10000 , 1190 , 1200 UNION ALL Select '2021-02-07',1000, 10000 , 1190 , 1200 UNION ALL
Select '2021-02-14',1000, 10000 , 1200 , 1200 UNION ALL Select '2021-02-21',NULL,NULL,NULL,1200 UNION ALL
Select '2021-02-28',NULL,NULL,NULL,1200 UNION ALL Select '2021-03-07',NULL,NULL,NULL,1200 UNION ALL Select '2021-03-14',NULL,NULL,NULL,1200 )
Select *, NULL as ForecastSell,NULL as StockForecast FROM Data
I don't think this type of problem really falls into the SQL domain.
It is more of an iterative problem, where the state of each iteration is maintained and immediately available to the next one (which is not the case in SQL). You can, however, run multiple SQL queries in series to achieve this. Also, explore the scripting options in BigQuery.
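To illustrate the iterative, state-carrying pattern described above (sketched in Python; the recurrence itself is purely hypothetical, since the post's Excel formulas are not shown):

```python
# Hypothetical recurrence for the NULL weeks: carry stock forward,
# add the inbound plan, subtract a forecast of sales. The real formulas
# live in the asker's spreadsheet and are not visible in the post.
rows = [  # (week, inbound_real, stock_real, sell_real, inbound_ppto)
    ("2021-02-14", 1000, 10000, 1200, 1200),
    ("2021-02-21", None, None, None, 1200),
    ("2021-02-28", None, None, None, 1200),
]

stock = None       # state carried from one iteration to the next
forecast = []
for week, inbound, stock_real, sell_real, ppto in rows:
    if stock_real is not None:
        stock = stock_real              # actual values reset the state
    else:
        sell = 1200                     # hypothetical: reuse the last plan
        stock = stock + ppto - sell     # hypothetical recurrence
    forecast.append((week, stock))
```

The point is simply that each row's result feeds the next row's computation, which is exactly what a single set-based SQL query cannot express and BigQuery scripting (or serial queries) can.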

In SQL, Is there any way to construct a variable that tracks historical data within multiple groups?

I have a question about "variable construction" in SQL, more specifically BigQuery on GCP (Google Cloud Platform). I do not have a deep understanding of SQL, so I am having a hard time manipulating and constructing the variables I intend to make. Any comment would be very appreciated.
I'm thinking of constructing two variables, which seems quite tricky to me. I'd like to briefly introduce the structure of the dataset before I ask about constructing those variables. The dataset is the historical record of game matches played by around 25,000 users, totaling around 100 million matches. 10 players participate in a single match, and each player chooses their own hero. Due to technical constraints, I can only manipulate and construct those two variables through BigQuery on GCP.
Constructing “Favorite Hero” Variable
First, I am planning to construct a "favorite hero" variable at the match-user level. As shown in the tables below, the baseline variables are 1) match_id (which identifies each match), 2) user_id (which identifies each user), 3) day (the date the match was played), and 4) hero_type (which hero each user chose).
Let me make clear what I intend to construct. As shown below, the user "3258 (blue)" played four times within the observation period. So for the fourth match of user 3258, his/her favorite hero_type is 36, because that is his/her cumulative most-chosen hero_type. Please note that "cumulative" does not include that very day. For example, the user "2067 (red)" played three times: 20190208, 20190209, 20190212, choosing heroes 81, 81, and 34, respectively. So the "favorite_hero" for the third match is 81, not 34. Also, I'd like to cap the number of favorite heroes at 2.
The important thing to note is that the data lives in consecutive but split tables, as shown below. Although those tables are split, the timeline should not be discontinued but linked across them.
Constructing “Familiarity” Variable
The second variable I intend to make is trickier than the previous one. I am planning to construct a "met_before" variable that counts the total number of times each user has previously met the other players in the match. For example, in match_id 2, the users 3258 (blue) and 2067 (red) previously met each other in match_id 1, so each user has a value of 1 for "met_before". The concept of match_id becomes more important for this variable than for the previous one, because it is primarily built from the match_id. Another example: for match_id 5, the user 3258 (blue) has a value of 4 for "met_before", because the player met user 2386 (green) twice (match_id 1 and 3) and user 2067 (red) twice (match_id 1 and 2).
Again, the data lives in consecutive but split tables; although the tables are split, the timeline should not be discontinued but linked across them.
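The "met_before" logic can be pinned down as pairwise co-occurrence counting. Here is a small Python sketch (with hypothetical rosters chosen to match the examples in the text), processing matches in order and accumulating how often each pair of users has already shared a match:

```python
from collections import Counter
from itertools import permutations

# Hypothetical rosters consistent with the examples above: in match 2,
# users 3258 and 2067 have met once before; in match 5, user 3258 has
# met_before = 4 (2386 twice, 2067 twice).
matches = [
    (1, [3258, 2386, 2067]),
    (2, [3258, 2067]),
    (3, [3258, 2386]),
    (4, [2386, 2067]),
    (5, [3258, 2386, 2067]),
]

seen = Counter()    # (user, other) -> number of past co-occurrences
met_before = {}     # (match_id, user) -> value of the variable
for match_id, users in matches:
    # Score each user against the *past* state, before counting this match.
    for u in users:
        met_before[(match_id, u)] = sum(seen[(u, v)] for v in users if v != u)
    # Then fold this match's pairs into the running state.
    for u, v in permutations(users, 2):
        seen[(u, v)] += 1
```

The key detail, matching the question's definition, is that the current match is scored against the counts accumulated strictly before it.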
As stated in the comments, it would be better if you could provide sample data.
Also, there are 2 separate problems in this question; it would be better to create 2 different threads for them.
I prepared sample data from your screenshots, along with the code you need.
Try the code and give feedback according to the output; if anything is wrong, we can iterate on it.
CREATE TEMP FUNCTION find_fav_hero(heroes ARRAY<INT64>) AS
((
  SELECT STRING_AGG(CAST(hero AS STRING) ORDER BY hero)
  FROM (
    SELECT *, MAX(cnt) OVER () AS max_cnt
    FROM (
      SELECT hero, COUNT(*) AS cnt
      FROM UNNEST(heroes) AS hero
      GROUP BY 1
    )
  )
  WHERE cnt = max_cnt
));
WITH
rawdata as (
SELECT 2386 AS user_id, 20190208 as day, 30 as hero_type UNION ALL
SELECT 3268 AS user_id, 20190208 as day, 30 as hero_type UNION ALL
SELECT 2067 AS user_id, 20190208 as day, 81 as hero_type UNION ALL
SELECT 3268 AS user_id, 20190209 as day, 36 as hero_type UNION ALL
SELECT 2067 AS user_id, 20190209 as day, 81 as hero_type UNION ALL
SELECT 2386 AS user_id, 20190210 as day, 3 as hero_type UNION ALL
SELECT 3268 AS user_id, 20190210 as day, 36 as hero_type UNION ALL
SELECT 2386 AS user_id, 20190212 as day, 203 as hero_type UNION ALL
SELECT 3268 AS user_id, 20190212 as day, 36 as hero_type UNION ALL
SELECT 2067 AS user_id, 20190212 as day, 34 as hero_type
)
SELECT *,
count(*) over (partition by user_id order by day) - 1 as met_before,
find_fav_hero(array_agg(hero_type) over (partition by user_id order by day rows between unbounded preceding and 1 preceding )) as favourite_hero
from rawdata
order by day, user_id
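For clarity, the cumulative favorite-hero logic in find_fav_hero combined with the ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING frame can be mirrored in plain Python: for each match, count only the heroes the user chose in earlier matches, and report the most frequent (ties joined in ascending order), which is how the SQL above behaves:

```python
from collections import Counter

def favourite_hero(prior_heroes):
    """Most frequent hero(es) among earlier picks, ties joined ascending."""
    if not prior_heroes:
        return None  # the window frame is empty for a user's first match
    counts = Counter(prior_heroes)
    top = max(counts.values())
    return ",".join(str(h) for h in sorted(h for h, c in counts.items() if c == top))

# User 3268's hero picks in day order, from the rawdata sample above.
history = [30, 36, 36, 36]
favs = [favourite_hero(history[:i]) for i in range(len(history))]
```

By the fourth match the favourite is "36", matching the expected result in the question.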

Find two local averages within one SQL Server data set

In the plant at our company there is a physical process that has a two-stage start and a two-stage finish. As a widget starts to enter the process a new record is created containing the widget ID and a timestamp (DateTimeCreated) and once the widget fully enters the process another timestamp is logged in a different field for the same record (DateTimeUpdated). The interval is a matter of minutes.
Similarly, as a widget starts to exit the process another record is created containing the widget ID and the DateTimeCreated, with the DateTimeUpdated being populated when the widget has fully exited the process. In the current table design an "exiting" record is indistinguishable from an "entering" record (although a given widget ID occurs only either once or twice so a View could utilise this fact to make the distinction, but let's ignore that for now).
The overall time a widget is in the process is several days but that's not really of importance to the discussion. What is important is that the interval when exiting the process is always longer than when entering. So a very simplified, imaginary set of sorted interval values might look like this:
1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 6, 7, 7, 7, 7, 8, 8, 8, 8, 10, 10, 10
You can see there is a peak in the occurrences of intervals around the 3-minute-mark (the "enters") and another peak around the 7/8-minute-mark (the "exits"). I've also excluded intervals of 5 minutes to demonstrate that enter-intervals and exit-intervals can be considered mutually exclusive.
We want to monitor the performance of each stage in the process daily by using a query to determine the local averages of the entry and exit data point clusters. So conceptually the two data sets could be split either side of an overall average (in this case 5.375) and then an average calculated for the values below the split (2.75) and another average above the split (8). Using the data above (in a random distribution) the averages are depicted as the dotted lines in the chart below.
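The split-around-the-overall-mean idea can be checked directly against the sample intervals. A quick Python verification of the three averages quoted above (overall 5.375, entry 2.75, exit 8):

```python
# Sorted sample intervals from the question: a cluster of "enters"
# around 3 minutes and a cluster of "exits" around 7-8 minutes.
intervals = [1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4,
             6, 7, 7, 7, 7, 8, 8, 8, 8, 10, 10, 10]

overall = sum(intervals) / len(intervals)          # the split point
entry = [i for i in intervals if i < overall]      # values below the mean
exits = [i for i in intervals if i > overall]      # values above the mean

avg_entry = sum(entry) / len(entry)
avg_exit = sum(exits) / len(exits)
```

This mirrors what the SQL below attempts: one pass to get the overall average, then conditional averages on either side of it.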
My current approach is to use two Common Table Expressions followed by a final three-table-join query. It seems okay, but I can't help feeling it could be better. Would anybody like to offer an alternative approach or other observations?
WITH cte_Raw AS
(
SELECT
DATEDIFF(minute, DateTimeCreated, DateTimeUpdated) AS [Interval]
FROM
MyTable
WHERE
DateTimeCreated > CAST(CAST(GETDATE() AS date) AS datetime) -- Today
)
, cte_Midpoint AS
(
SELECT
AVG(Interval) AS Interval
FROM
cte_Raw
)
SELECT
AVG([Entry].Interval) AS AverageEntryInterval
, AVG([Exit].Interval) AS AverageExitInterval
FROM
cte_Raw AS [Entry]
INNER JOIN
cte_Midpoint
ON
[Entry].Interval < cte_Midpoint.Interval
INNER JOIN
cte_Raw AS [Exit]
ON
[Exit].Interval > cte_Midpoint.Interval
I don't think your query produces accurate results. Your two JOINs produce a proliferation of rows, which throws the averages off. The results might look correct (because one is less than the other), but if you did counts, you would see that the counts in your query have little to do with the sample data.
If you are just looking for the average of the values below the overall average and the average of those above it, then you can use window functions:
WITH t AS (
SELECT t.*, v.[Interval],
AVG(v.[Interval]) OVER () as avg_interval
FROM MyTable t CROSS JOIN
(VALUES (DATEDIFF(minute, DateTimeCreated, DateTimeUpdated))
) v(Interval)
WHERE DateTimeCreated > CAST(CAST(GETDATE() AS date) AS datetime)
)
SELECT AVG(CASE WHEN t.[Interval] < t.avg_interval THEN t.[Interval] END) AS AverageEntryInterval,
AVG(CASE WHEN t.[Interval] > t.avg_interval THEN t.[Interval] END) AS AverageExitInterval
FROM t;
I decided to post my own answer because, at the time of writing, neither of the two proposed answers would run. I have, however, removed the JOINs and used the CASE-expression approach proposed by Gordon.
I've also multiplied the DATEDIFF result by 1.0 to prevent the AVG function from rounding its results via integer division.
WITH cte_Raw AS
(
SELECT
1.0 * DATEDIFF(minute, DateTimeCreated, DateTimeUpdated) AS [Interval]
FROM
MyTable
WHERE
DateTimeCreated > CAST(CAST(GETDATE() AS date) AS datetime) -- Today
)
, cte_Midpoint AS
(
SELECT
AVG(Interval) AS Interval
FROM
cte_Raw
)
SELECT AVG(CASE WHEN cte_Raw.Interval < cte_Midpoint.Interval THEN cte_Raw.[Interval] END) AS AverageEntryInterval,
AVG(CASE WHEN cte_Raw.Interval > cte_Midpoint.Interval THEN cte_Raw.[Interval] END) AS AverageExitInterval
FROM cte_Raw CROSS JOIN cte_Midpoint
This solution does not cater for the theoretical pitfall indicated by Vladimir of uneven dispersions of Entry vs Exit intervals, as in practice we can be confident this does not occur.

Setting a maximum value for a variable. Cognos

I started working in BI and was given a brain teaser, since I came from C# and not SQL/Cognos.
I get a number between 0 and a very large number. When it's below 1,000 everything is dandy, but if it's greater than or equal to 1,000, I should use 1,000 instead.
I am not allowed to use conditions; it needs to be pure math, or failing that, an efficient method.
I thought it would be easy and just used MIN(), but that apparently works differently in Cognos and SQL (it aggregates over rows, not over two values).
Use the LEAST() function:
Oracle Setup:
CREATE TABLE data ( value ) AS
SELECT 1 FROM DUAL UNION ALL
SELECT 999 FROM DUAL UNION ALL
SELECT 1000 FROM DUAL UNION ALL
SELECT 1001 FROM DUAL;
Query:
SELECT value, LEAST( value, 1000 ) AS output FROM data
Output:
VALUE OUTPUT
----- ------
1 1
999 999
1000 1000
1001 1000
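If LEAST() is unavailable, the cap can also be done with "pure math", as the asker hoped, using the identity min(a, b) = (a + b - |a - b|) / 2, which needs only addition, subtraction, and absolute value. A quick Python check:

```python
def cap_at(value, limit=1000):
    """Cap value at limit without a conditional:
    min(a, b) == (a + b - |a - b|) / 2."""
    return (value + limit - abs(value - limit)) // 2

# Same inputs as the Oracle sample data above.
results = [cap_at(v) for v in (1, 999, 1000, 1001)]
```

The identity works because |a - b| cancels the larger operand and doubles the smaller one.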

SQL statement, selecting data from specific range of hours

Intro: I have a database with a table which contains a column of times, for example
08:30:00
08:45:00
09:30:00
12:20:00
...
18:00:00
the datatype in this column is "time".
Question: is it possible with SQL to select all the times [e.g. 10:00:00, 10:30:00, 11:00:00, etc.] in the range 08:00:00 to 18:00:00 which are not on the list?
edit:
Well, after thinking it over, the solutions you gave [thanks!] are not entirely perfect [well, they are perfect given the details I gave you :)]. My application lets users make appointments that take half an hour [in this version of the app]. The list of times I gave in my first post is quite optimistic, because I didn't include meetings at times such as 11:45:00 or 13:05:00, which would end at 12:15:00 and 13:35:00, so I don't really know up front which times I would be able to put in this extra table.
First, create a table T_REF with a column COL_HR containing all the hours you want reported when they are not found in the original table.
Note that MySQL does not support MINUS/EXCEPT, so the set difference has to be written with NOT IN (or a LEFT JOIN ... IS NULL):
SELECT DATE_FORMAT(COL_HR, '%H') FROM T_REF
WHERE DATE_FORMAT(COL_HR, '%H') NOT IN
  (SELECT DATE_FORMAT(COLUMN_NAME, '%H') FROM ORIG_TABLE)
This compares only the hours and reports the ones that are not
in your table (ORIG_TABLE).
Ugly, but possible:
CREATE TABLE thours (h time);
INSERT INTO thours VALUES
('08:30:00'), ('08:45:00'), ('09:30:00'),
('12:20:00'), ('18:00:00');
CREATE VIEW hrange AS
SELECT 8 as h UNION ALL SELECT 9 UNION ALL SELECT 10 UNION ALL
SELECT 11 UNION ALL SELECT 12 UNION ALL SELECT 13 UNION ALL
SELECT 14 UNION ALL SELECT 15 UNION ALL SELECT 16 UNION ALL
SELECT 17 UNION ALL SELECT 18;
CREATE VIEW hours AS
SELECT cast(concat(cast(h as char), ':00:00') as time) AS h FROM hrange
UNION ALL
SELECT cast(concat(cast(h as char), ':30:00') as time)
FROM hrange WHERE h < 18
ORDER BY 1;
SELECT h.h AS ranged, t.h AS scheduled
FROM hours h
LEFT join thours t ON h.h = t.h;
If you'll add WHERE t.h IS NULL to the query, you'll get a list of wanted hours.
I created views as MySQL cannot dynamically generate series.
Try out here.
Are you only expecting to find every hour and half hour? Without knowing which times you're expecting to find or not find, I'd suggest the following:
Create a separate table with all possible times in it and then run a query to find which times don't exist in the original table.
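The calendar-table idea above (generate all candidate slots, then keep the ones absent from the bookings) can be sketched outside SQL as well. A Python version, assuming half-hour slots between 08:00:00 and 18:00:00 as in the question:

```python
from datetime import datetime, timedelta

def free_slots(booked, start="08:00:00", end="18:00:00", step_minutes=30):
    """Generate every slot from start to end inclusive, then drop the
    booked ones -- the anti-join against a generated series."""
    fmt = "%H:%M:%S"
    t = datetime.strptime(start, fmt)
    stop = datetime.strptime(end, fmt)
    slots = []
    while t <= stop:
        slots.append(t.strftime(fmt))
        t += timedelta(minutes=step_minutes)
    taken = set(booked)
    return [s for s in slots if s not in taken]

# Booked times from the question (off-grid bookings like 08:45:00
# simply never match a generated slot).
booked = ["08:30:00", "08:45:00", "09:30:00", "12:20:00", "18:00:00"]
free = free_slots(booked)
```

The same shape in SQL is the LEFT JOIN ... WHERE t.h IS NULL query shown in the earlier answer; the series table or view plays the role of the generator loop here.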