Comparing every row in a table with the master row - SQL

I have a Redshift table with a single VARCHAR column named "Test" and several float columns. The "Test" column has unique values; one of them is "Control", the others are not hardcoded.
The table has ~10 rows (not static) and ~10 columns.
I need to generate a Looker report that shows the original data plus the difference between the corresponding float columns of "Control" and each other Test.
Input example:
Test    | Metric_1 | Metric_2
--------|----------|---------
Control | 10       | 100
A       | 12       | 120
B       | 8        | 80
The desired report:
         | Control | A   | A-Control | B  | B-Control
---------|---------|-----|-----------|----|----------
Metric_1 | 10      | 12  | 2         | 8  | -2
Metric_2 | 100     | 120 | 20        | 80 | -20
To calculate the difference of each row against "Control", I tried:
SELECT T.test,
       T.metric_1 - Control.metric_1 AS DIFF1,
       T.metric_2 - Control.metric_2 AS DIFF2,
       ...
FROM T, (SELECT * FROM T WHERE test = 'Control') AS Control
I can do part of the work in Looker (it can transpose) and part in SQL, but I still cannot figure out how to build this report.

You could transpose the test dimension in Looker, which gets you part of the way there:
         | Control | A   | B
---------|---------|-----|----
Metric_1 | 10      | 12  | 8
Metric_2 | 100     | 120 | 80
Then operate on top of this result using table calculations.
You can use the functions pivot_where() or pivot_index().
For example: pivot_where(test = 'A', metric) - pivot_where(test = 'Control', metric)
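For the SQL side, the asker's cross-join idea does work in Redshift. A minimal sketch, assuming the table is named T as in the attempt above and has exactly the two metric columns from the example:

SELECT T.test,
       T.metric_1,
       T.metric_1 - c.metric_1 AS metric_1_diff,
       T.metric_2,
       T.metric_2 - c.metric_2 AS metric_2_diff
FROM T
CROSS JOIN (SELECT metric_1, metric_2 FROM T WHERE test = 'Control') AS c;

This returns one row per test with the original values and their differences from "Control"; Looker can then transpose that result instead of computing the differences itself.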

Related

Select Value based on Multiple Value Range in SQL

I have multiple criteria for giving incentives to my employees, captured in a grid like the one reproduced below.
The grid table is dynamic in nature; it keeps changing based on business conditions.
I have a table of employee IDs for which I have calculated a Resolution % and a Normalization %. Now I need to assign the % Incentives based on the grid, using a SQL query, and write them to an output table.
I assume the grid table is also stored as a database table (so you can update it):
+-----------------+---------------+--------------------+------------------+-----------+
|                                   INCENTIVES                                        |
+-----------------+---------------+--------------------+------------------+-----------+
| from_resolution | to_resolution | from_normalization | to_normalization | incentive |
+-----------------+---------------+--------------------+------------------+-----------+
|               0 |            70 |                  0 |                5 |         9 |
|               0 |            70 |                  5 |               10 |        11 |
|               0 |            70 |                 10 |              100 |        13 |
|              71 |            75 |                  0 |                5 |        10 |
| ... I hope you get the idea                                                         |
+-----------------+---------------+--------------------+------------------+-----------+
And the update query can be:
UPDATE employee E
SET E.incentive = (SELECT I.incentive
                   FROM incentives I
                   WHERE E.resolution >= I.from_resolution
                     AND E.resolution < I.to_resolution
                     AND E.normalization >= I.from_normalization
                     AND E.normalization < I.to_normalization)
UPDATE: the TO values are not included in the range. By making each range's TO value equal to the next range's FROM value, we ensure all values are covered (including floating point). Thanks to Gordon.
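As a sanity check on that range convention, a sketch like the following flags employees matching zero or multiple grid rows (emp_id is a hypothetical key column; adjust to the real schema):

SELECT E.emp_id, COUNT(I.incentive) AS matching_rows
FROM employee E
LEFT JOIN incentives I
       ON E.resolution >= I.from_resolution
      AND E.resolution < I.to_resolution
      AND E.normalization >= I.from_normalization
      AND E.normalization < I.to_normalization
GROUP BY E.emp_id                -- emp_id is assumed, not from the question
HAVING COUNT(I.incentive) <> 1;  -- 0 = gap in the grid, >1 = overlapping ranges

Any row returned indicates a gap or an overlap in the incentive grid.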

How to find two consecutive rows sorted by date, containing a specific value?

I have a table with the following structure and data in it:
| ID | Date       | Result |
|----|------------|--------|
| 1  | 30/04/2020 | +      |
| 1  | 01/05/2020 | -      |
| 1  | 05/05/2020 | -      |
| 2  | 03/05/2020 | -      |
| 2  | 04/05/2020 | +      |
| 2  | 05/05/2020 | -      |
| 2  | 06/05/2020 | -      |
| 3  | 01/05/2020 | -      |
| 3  | 02/05/2020 | -      |
| 3  | 03/05/2020 | -      |
| 3  | 04/05/2020 | -      |
I'm trying to write an SQL query (I'm using SQL Server) which returns the date of the first two consecutive negative results for a given ID.
For example, for ID no. 1, the first two consecutive negative results are on 01/05 and 05/05.
The first two consecutive negative results for ID No. 2 are on 05/05 and 06/05.
The first two consecutive negative results for ID No. 3 are on 01/05 and 02/05.
So the query should produce the following result:
| ID | FirstNegativeDate |
|----|-------------------|
| 1  | 01/05             |
| 2  | 05/05             |
| 3  | 01/05             |
Please note that the dates aren't necessarily one day apart. Sometimes, two consecutive negative tests may be several days apart. But they should still be considered as "consecutive negative tests". In other words, two negative tests are not 'consecutive' only if there is a positive test result in between them.
How can this be done in SQL? I've done some reading and it looks like maybe the PARTITION BY statement is required but I'm not sure how it works.
This is a gaps-and-islands problem, where you want the start of the first island of '-'s that contains at least two rows.
I would recommend lead() and aggregation:
select id, min(date) as first_negative_date
from (
    select t.*, lead(result) over (partition by id order by date) as lead_result
    from mytable t
) t
where result = '-' and lead_result = '-'
group by id
Use the LEAD or LAG function over an ID partition ordered by your Date column.
Then simply check where the LEAD/LAG column equals the Result column.
You'll also need to filter down to the first match per ID.
(The image attached to the original answer just showed what LEAD/LAG returns.)
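Since that image is not reproduced here, a minimal sketch of the intermediate result it showed, assuming the same table name as above:

select id, date, result,
       lead(result) over (partition by id order by date) as next_result
from mytable
order by id, date;

Rows where result = '-' and next_result = '-' mark the start of two consecutive negative results; taking min(date) per id over those rows, as in the previous answer, produces the final report.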

Creating a view that joins multiple tables on an ID and a timestamp that needs to be rounded

I have a web application that sends data to my SQLite database, into different tables depending on the information. I would like to make a view that merges multiple tables together based on cownumber and TS [timestamp]. (There are no updates to my tables; a change to the same cownumber sends the full record as a new entry with a new timestamp.) The AJAX calls are made table by table, so the TS values do not exactly sync up; generally they can be 5-20 seconds off depending on the connection.
Here is a sample of the three tables:
master_animal:
+-----------+--------+--------+---------------------+
| cownumber | height | weight | ts                  |
+-----------+--------+--------+---------------------+
| 1         | 150    | ...    | 2017-12-01 12:28:00 |
| 2         | 170    | ...    | 2017-12-03 17:16:00 |
| 3         | 60     | ...    | 2017-12-03 08:09:00 |
| 4         | 109    | ...    | 2017-12-04 23:23:00 |
+-----------+--------+--------+---------------------+
animal_inventory:
+-----------+---------------+--------------+---------------------+
| cownumber | brandlocation | dateacquired | ts                  |
+-----------+---------------+--------------+---------------------+
| 1         | ...           | ...          | 2017-12-01 12:28:50 |
| 2         | ...           | ...          | 2017-12-03 17:16:30 |
| 3         | ...           | ...          | 2017-12-03 08:09:12 |
| 4         | ...           | ...          | 2017-12-04 23:23:23 |
+-----------+---------------+--------------+---------------------+
experiment:
+-----------+-----------+-------------+---------------------+
| cownumber | ageatwean | birthweight | ts                  |
+-----------+-----------+-------------+---------------------+
| 1         | ...       | ...         | 2017-12-01 12:28:20 |
| 2         | ...       | ...         | 2017-12-03 17:16:41 |
| 3         | ...       | ...         | 2017-12-03 08:09:24 |
| 4         | ...       | ...         | 2017-12-04 23:23:11 |
+-----------+-----------+-------------+---------------------+
The view I wrote:
CREATE VIEW testing AS
SELECT a.height, a.weight, a.cownumber,
       b.brandlocation, b.dateacquired,
       c.ageatwean, c.birthweight
FROM master_animal a, animal_inventory b, experiment c
WHERE a.cownumber = b.cownumber
  AND ROUND(a.ts/10000) = ROUND(b.ts/10000)
  AND a.cownumber = c.cownumber
  AND ROUND(a.ts/10000) = ROUND(c.ts/10000);
The query I wrote:
SELECT * FROM testing WHERE cownumber = 1;
What I was hoping to get back was:
+-----------+--------+--------+---------------+--------------+-----------+-------------+
| cownumber | height | weight | brandlocation | dateacquired | ageatwean | birthweight |
+-----------+--------+--------+---------------+--------------+-----------+-------------+
| 941       | 0      | ...    | ...           | ...          | ...       | ...         |
+-----------+--------+--------+---------------+--------------+-----------+-------------+
There should be one row for cownumber 941, as long as all the correlated records were within a few seconds of each other. I am not exactly sure whether I need to divide by 10000 or something smaller. Matching records should be no more than 50 seconds apart from each other; anything more than 50 seconds apart should be considered a new record.
When I test this where there is only one record for that cownumber, it works fine. But let's say I change some information in each table: I provide a new height and a new brandlocation. Instead of getting two rows (the first being the initial data entry and the second showing the same cownumber with the changed values), I get back 8 rows with partial changes:
height|weight|cownumber|brandlocation|dateacquired|ageatwean|birthweight|
0.0|0.0|941|0|0|0.0|0
0.0|0.0|941|0|0|0.0|0
0.0|0.0|941|Left Hip|0|0.0|0
0.0|0.0|941|Left Hip|0|0.0|0
50.0|0.0|941|0|0|0.0|0
50.0|0.0|941|0|0|0.0|0
50.0|0.0|941|Left Hip|0|0.0|0
50.0|0.0|941|Left Hip|0|0.0|0
I assume the issue is in my WHERE clause, but I am not sure exactly how to fix it.
The timestamps are stored as strings. When you try to divide one, the database tries to convert it to a number, which results in 2017. So all the timestamps end up being the same.
Dividing cannot determine the distance anyway; the values 9999 and 10000 would end up different although they are right next to each other. (And an integer division results in an integer, so the ROUND() has no effect.)
To compute the distance, convert the timestamps into a number of seconds first, and then use abs():
SELECT ...
FROM master_animal m
JOIN animal_inventory i
  ON m.cownumber = i.cownumber
 AND abs(strftime('%s', m.ts) - strftime('%s', i.ts)) <= 50
JOIN experiment e
  ON m.cownumber = e.cownumber
 AND abs(strftime('%s', m.ts) - strftime('%s', e.ts)) <= 50;
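Putting it together, a sketch of the corrected view, keeping the original column list and the asker's 50-second window:

CREATE VIEW testing AS
SELECT m.cownumber, m.height, m.weight,
       i.brandlocation, i.dateacquired,
       e.ageatwean, e.birthweight
FROM master_animal m
JOIN animal_inventory i
  ON m.cownumber = i.cownumber
 AND abs(strftime('%s', m.ts) - strftime('%s', i.ts)) <= 50
JOIN experiment e
  ON m.cownumber = e.cownumber
 AND abs(strftime('%s', m.ts) - strftime('%s', e.ts)) <= 50;

strftime('%s', ...) converts each timestamp string to Unix seconds, so rows join only when their timestamps differ by at most 50 seconds; entries further apart are treated as separate records, as required.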

Find a subset of numbers that matches a target weighted average and target sum

There is a SQL Server table containing 1 million rows. Sample data is shown below.
The Percentage column is computed as ((Y/X) * 100).
+----+--------+-------------+-----+-----+-------------+
| ID | Amount | Percentage | X | Y | Z |
+----+--------+-------------+-----+-----+-------------+
| 1 | 10 | 9.5 | 100 | 9.5 | 95 |
| 2 | 20 | 9.5 | 100 | 9.5 | 190 |
| 3 | 40 | 5 | 100 | 5 | 200 |
| 4 | 50 | 5.555555556 | 90 | 5 | 277.7777778 |
| 5 | 70 | 8.571428571 | 70 | 6 | 600 |
| 6 | 100 | 9.230769231 | 65 | 6 | 923.0769231 |
| 7 | 120 | 7.058823529 | 85 | 6 | 847.0588235 |
| 8 | 60 | 10.52631579 | 95 | 10 | 631.5789474 |
| 9 | 80 | 10 | 100 | 10 | 800 |
| 10 | 95 | 10 | 100 | 10 | 950 |
+----+--------+-------------+-----+-----+-------------+
Now I need to find rows whose Amount values add up to a given Amount and whose weighted average matches a given Percentage.
For example, if the target Amount = 365 and the target Percentage = 9.84, then from the given dataset we can say that the rows with ID = 1, 2, 6, 8, 9, 10 form a subset matching the given targets.
Amount     = 10 + 20 + 100 + 60 + 80 + 95
           = 365
Percentage = Sum of (Amount * Percentage) / Sum of (Amount)
(I am using the Z column to store the products of Amount and Percentage to make the calculations easier)
           = ((10*9.5) + (20*9.5) + (100*9.23077) + (60*10.5264) + (80*10) + (95*10)) / (10+20+100+60+80+95)
           = 9.834673618
So rows 1, 2, 6, 8, 9, 10 match the given target sum and target weighted average.
The proposed algorithm should work on the 1 million rows; the main objective is to match the weighted average (Percentage), with the Amount as close as possible to the target Amount.
I found a few questions on Stack Overflow related to matching a target sum, but my problem is to match two target attributes: the sum and the weighted average.
Which algorithm can be used to achieve this?
Since the target "Percentage" is only approximate (therefore not an actual constraint), let's try removing it and find a solution for Amount. This can only make the problem easier.
What's left is the Subset Sum Problem, which is NP-complete. There are simple exponential-time solutions, and sneaky pseudo-polynomial-time solutions, but I don't think any of them will be practical for a table with 10^6 rows.
If this is an academic exercise, I suggest you write up the cleverest pseudo-polynomial-time solution you can come up with. If it's a task in the real world, I suggest you go back to the person who gave it to you, explain that an exact solution is impractical, and negotiate for an approximate solution.
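For intuition only, here is a sketch of the exponential-time enumeration in T-SQL, using a recursive CTE to walk all subsets. The table name MyTable is an assumption (the question never names it); the targets 365 and 9.84 are the question's example values. This will not scale to 10^6 rows:

WITH subsets AS (
    -- anchor: every single row is a subset of size 1
    SELECT ID AS last_id,
           Amount AS total_amount,   -- assumes Amount is an integer column
           Z AS total_z,
           CAST(CAST(ID AS VARCHAR(10)) AS VARCHAR(MAX)) AS members
    FROM MyTable
    UNION ALL
    -- extend each subset only with rows of larger ID, so no subset is built twice
    SELECT t.ID,
           s.total_amount + t.Amount,
           s.total_z + t.Z,
           CAST(s.members + ',' + CAST(t.ID AS VARCHAR(10)) AS VARCHAR(MAX))
    FROM subsets s
    JOIN MyTable t ON t.ID > s.last_id
)
SELECT TOP (1) members, total_amount,
       total_z / total_amount AS weighted_percentage
FROM subsets
WHERE total_amount = 365                      -- exact sum match
ORDER BY ABS(total_z / total_amount - 9.84);  -- closest weighted average

Relaxing the WHERE clause to a tolerance band on total_amount and ordering on a combined distance would match the "as close as possible" objective, but the combinatorial explosion of subsets remains the fundamental obstacle.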

Merge computed data from two tables back into one of them

I have the following situation (as a reduced example): two tables, Measures1 and Measures2, each of which stores an ID, a Weight in grams, and optionally a Volume in fluid ounces. (In reality, Measures1 has a good deal of other data that is irrelevant here.)
Contents of Measures1:
+----+----------+--------+
| ID | Weight | Volume |
+----+----------+--------+
| 1 | 100.0000 | NULL |
| 2 | 200.0000 | NULL |
| 3 | 150.0000 | NULL |
| 4 | 325.0000 | NULL |
+----+----------+--------+
Contents of Measures2:
+----+----------+----------+
| ID | Weight | Volume |
+----+----------+----------+
| 1 | 75.0000 | 10.0000 |
| 2 | 400.0000 | 64.0000 |
| 3 | 100.0000 | 22.0000 |
| 4 | 500.0000 | 100.0000 |
+----+----------+----------+
These tables describe equivalent weights and volumes of a substance. E.g. 10 fluid ounces of substance 1 weighs 75 grams. The IDs are related: ID 1 in Measures1 is the same substance as ID 1 in Measures2.
What I want to do is fill in the NULL volumes in Measures1 using the information in Measures2, but keeping the weights from Measures1 (then, ultimately, I can drop the Measures2 table, as it will be redundant). For the sake of simplicity, assume that all volumes in Measures1 are NULL and all volumes in Measures2 are not.
I can compute the volumes I want to fill in with the following query:
SELECT Measures1.ID, Measures1.Weight,
       (Measures2.Volume * (Measures1.Weight / Measures2.Weight)) AS DesiredVolume
FROM Measures1
JOIN Measures2 ON Measures1.ID = Measures2.ID;
Producing:
+----+----------+-----------------+
| ID | Weight | DesiredVolume |
+----+----------+-----------------+
| 4 | 325.0000 | 65.000000000000 |
| 3 | 150.0000 | 33.000000000000 |
| 2 | 200.0000 | 32.000000000000 |
| 1 | 100.0000 | 13.333333333333 |
+----+----------+-----------------+
But I am at a loss for how to actually insert these computed values into the Measures1 table.
Preferably, I would like to be able to do it with a single query, rather than writing a script or stored procedure that iterates through every ID in Measures1. But even then I am worried that this might not be possible because the MySQL documentation says that you can't use a table in an UPDATE query and a SELECT subquery at the same time, and I think any solution would need to do that.
I know that one workaround might be to create a new table with the results of the above query (also selecting all of the other non-Volume fields in Measures1) and then drop both tables and replace Measures1 with the newly-created table, but I was wondering if there was any better way to do it that I am missing.
The multi-table UPDATE syntax does this in a single query. Note that UPDATE ... SET ... FROM is SQL Server syntax; in MySQL the join goes directly into the UPDATE:
UPDATE Measures1
JOIN Measures2 ON Measures1.ID = Measures2.ID
SET Measures1.Volume = Measures2.Volume * (Measures1.Weight / Measures2.Weight);
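Because the join replaces the SELECT subquery entirely, this also sidesteps the documentation restriction the asker mentions. A quick check afterwards (a sketch):

SELECT ID, Weight, Volume
FROM Measures1
WHERE Volume IS NULL;

An empty result confirms every volume was filled in, after which Measures2 is redundant and can be dropped.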