I'm trying to write a query that can generate a chosen number of new columns, say 100 or 200, where each new column uses data from the previously created columns.
I have data like below:
IF NOT EXISTS
(
SELECT * FROM sysobjects WHERE name = 'test' AND xtype = 'U'
)
CREATE TABLE test
(
[id] INT,
[a] NUMERIC(3, 2),
[b] NUMERIC(3, 2)
);
INSERT INTO test
VALUES (1, 0.1, 0.7),
(2, 0.5, 0.5),
(3, 0.5, 0.3),
(4, 0.6, 0.5),
(5, 0.5, 0.5),
(6, 0.5, 0.67),
(7, 0.5, 0.5),
(8, 0.46, 0.5),
(9, 0.5, 0.5),
(10, 0.37, 0.52),
(11, 0.37, 0.37),
(12, 0.28, 0.2);
I have id, a, and b as input, and I want to create new columns such that c = a + b, then d = a + b + c, and so on, up to 100 or 200 new columns.
I could use queries like the one below, but if I need 100 columns, it will take forever to write.
select
t.*,
t.a + t.b + t.c d
from
(select
*,
a + b c
from test) t;
I know that SQL is not good at loops, but I still want to find out whether it is even possible. Thank you.
The specifics of your problem suggest the database design is flawed; you would rarely want to sum across columns like that. But as to your actual question, whether SQL can sort of write itself so you're saved from typing it all out, the answer is generally no.
In some cases, database systems support something like EXCEPT (or, since Snowflake likes to be different, EXCLUDE).
That is helpful because you can write SELECT * EXCEPT col1, which is close to what you're looking for, though you want the equivalent for a SUM.
In general, no, the language does not support such things. You might get lucky in particular cases where an RDBMS adds a helpful function for something that lots of users want to do (as with EXCEPT).
Therefore, if you really want to do it, you have to look outside of SQL and build it yourself. Personally, I run my SQL through a Jinja engine that renders the SQL for me.
I actually have a custom wrapper that does exactly what you want:
SELECT *,
({{ variables | join(' + ') }}) AS SUM_OF_ALL
FROM {{ source_table }}
This is essentially a function that I pass a list of column names into, so if I had 200 columns it would still be tedious. However, using macros and other features, it is definitely possible to have it find all the column names dynamically so they don't have to be declared explicitly.
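To make the templating idea concrete, here is what that template renders, sketched in plain Python. This is purely illustrative: `source_table` and `variables` are the same placeholder names the Jinja template uses, not a real library API.

```python
# Sketch of what the Jinja template above renders, in plain Python.
# Generating repetitive SQL text from a loop beats typing it by hand.

def render_sum_query(source_table, variables):
    summed = " + ".join(variables)   # e.g. "a + b"
    return (f"SELECT *,\n"
            f"    ({summed}) AS SUM_OF_ALL\n"
            f"FROM {source_table}")

print(render_sum_query("test", ["a", "b"]))
# SELECT *,
#     (a + b) AS SUM_OF_ALL
# FROM test
```

From here, the same loop could just as easily emit 200 derived-column expressions instead of one sum.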
TL;DR - no, not possible without going outside of SQL.
Here you go:
IF NOT EXISTS
(
SELECT * FROM sysobjects WHERE name = 'test' AND xtype = 'U'
)
CREATE TABLE test
(
[id] INT,
[a] NUMERIC(3, 2),
[b] NUMERIC(3, 2)
);
INSERT INTO test
VALUES (1, 0.1, 0.7),
(2, 0.5, 0.5),
(3, 0.5, 0.3),
(4, 0.6, 0.5),
(5, 0.5, 0.5),
(6, 0.5, 0.67),
(7, 0.5, 0.5),
(8, 0.46, 0.5),
(9, 0.5, 0.5),
(10, 0.37, 0.52),
(11, 0.37, 0.37),
(12, 0.28, 0.2);
DECLARE @sql nvarchar(max) = 'select t.*'
;WITH cte AS (
SELECT TOP 100 row_number() OVER(ORDER BY (SELECT NULL)) AS cnt
FROM sys.objects so CROSS JOIN sys.columns sc
)
SELECT @sql = @sql + '
, 0' + '+ (t.a + t.b) * (POWER(2.0, ' + CAST(c.cnt as varchar(30)) + '-1)) as c' + cast(c.cnt AS varchar(300))
FROM cte c
SELECT @sql = @sql + ' from test t'
SELECT @sql
EXEC (@sql)
EDIT:
This code generates 100 or so columns, each holding the sum of all the previous columns. The sums get pretty big after a while; that's expected rather than a mistake, since each new column doubles the previous one.
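The exponential growth follows directly from the definition: c = a + b, d = a + b + c = 2(a + b), e = 4(a + b), so the n-th generated column collapses to (a + b) * 2^(n-1), which is exactly the POWER(2.0, n-1) expression in the dynamic SQL. A quick check outside SQL (Python here, purely illustrative):

```python
# Verify that "each new column = a + b + sum of all previous columns"
# collapses to (a + b) * 2**(n - 1).
a, b = 0.1, 0.7
cols = []
for _ in range(10):                  # first 10 generated columns: c, d, e, ...
    cols.append(a + b + sum(cols))   # the recurrence from the question
for n, value in enumerate(cols, start=1):
    assert abs(value - (a + b) * 2 ** (n - 1)) < 1e-9
```

So by column 100 the values reach (a + b) * 2^99, which is why the sums blow up so fast.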
Related
Here's a situation that I can model easily in Excel, but I'm having trouble with in SQL Server. I experimented heavily with window functions (like lag() ) because I feel like they're part of the answer here, but there must be a component to them that I'm overlooking. I'm trying to get a running percentage, but the calculation which produces the new percentage has the previous percentage rolled into it.
Below is a screenshot from Excel with data and the desired result. I've added some explanation below it. I'm trying to produce column C in SQL Server (percent of additive).
Row 1: We start with a vat containing 200 gallons of some solution. We add 200 gallons of additive. So the percentage of additive to the original solution is 50%.
Row 2: After some consumption, we're left with 300 gallons of 50% solution. Now we add 200 gallons of additive. So our new additive percentage is 50% of the 300 (150), plus the 200 we just added (150+200=350). And we divide that by the total (300+200=500). So 350/500 = 0.7 or 70%.
And so on.
As you can see, as we keep adding additive and consuming the solution, the percentage approaches 100%.
Here's some code to produce a temp table with the data shown above. Appreciate any help.
create table #x (starting_vat_level int, additive_added int, pct_of_additive float null)
insert into #x values (200, 200, NULL)
insert into #x values (300, 150, NULL)
insert into #x values (100, 50, NULL)
insert into #x values (100, 100, NULL)
insert into #x values (150, 50, NULL)
insert into #x values (150, 100, NULL)
insert into #x values (200, 150, NULL)
insert into #x values (300, 50, NULL)
insert into #x values (300, 100, NULL)
insert into #x values (150, 50, NULL)
insert into #x values (100, 80, NULL)
insert into #x values (50, 10, NULL)
You need to calculate the pct_of_additive recursively. Note that the query below assumes #x has an id column defining row order (e.g. id int identity(1,1)); temp-table rows have no inherent order, so add one when creating the table.
with rcte as
(
select id, starting_vat_level, additive_added,
pct_of_additive = convert(float, additive_added * 1.0
/ (starting_vat_level + additive_added))
from #x
where id = 1
union all
select x.id, x.starting_vat_level, x.additive_added,
pct_of_additive = convert(float,
((r.pct_of_additive * x.starting_vat_level) + x.additive_added)
/ (x.starting_vat_level + x.additive_added))
from rcte r
inner join #x x on r.id = x.id - 1
)
select *
from rcte
order by id
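The recurrence itself is easy to sanity-check outside the database. In Python (illustrative only), each row's percentage is pct = (previous_pct * starting_level + additive_added) / (starting_level + additive_added), exactly what the recursive CTE computes:

```python
# Running additive percentage: the previous percentage is rolled
# into each new calculation, mirroring the recursive CTE.
rows = [(200, 200), (300, 150), (100, 50), (100, 100), (150, 50),
        (150, 100), (200, 150), (300, 50), (300, 100), (150, 50),
        (100, 80), (50, 10)]

pcts = []
pct = None
for start, added in rows:
    existing = 0.0 if pct is None else pct * start  # additive already in the vat
    pct = (existing + added) / (start + added)
    pcts.append(pct)

print(pcts[0])   # 0.5 for the first row, as in the question
```

The first row gives 200 / 400 = 0.5, the second (0.5 * 300 + 150) / 450 ≈ 0.667, and the percentage creeps toward 100% as additive keeps being added.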
I have a query where multiple factors determine the actual, active row. Can I do this in real time and still be performant, or is a bit field the generally recommended approach, where the currently active row is flagged, indexed, and queried?
My real-time solution involves an intermediate step in a view (a temporary table in my example below). I am therefore concerned about performance, because I will have to deal with hundreds of thousands to millions of records.
To illustrate:
DECLARE @grades TABLE (
person int,
grade int,
attempt int,
correction int)
INSERT @grades VALUES (1, 80, 1, 0)
INSERT @grades VALUES (1, 90, 2, 0)
INSERT @grades VALUES (1, 100, 3, 0)
INSERT @grades VALUES (2, 95, 1, 0)
INSERT @grades VALUES (2, 80, 1, 1)
INSERT @grades VALUES (2, 90, 1, 2)
INSERT @grades VALUES (2, 89, 1, 3)
SELECT b.*
INTO #grades_corrected
FROM @grades AS b
RIGHT JOIN (
SELECT person, attempt, MAX(correction) AS last_correction
FROM @grades as b
GROUP BY person, attempt
)
AS last_corrections
ON (b.attempt = last_corrections.attempt
AND b.correction = last_corrections.last_correction
AND b.person = last_corrections.person
)
SELECT g.*
FROM #grades_corrected g
LEFT OUTER JOIN #grades_corrected g2 ON (
g.person = g2.person
AND g.grade < g2.grade)
WHERE g2.grade is null
DROP TABLE #grades_corrected
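For reference, the intended two-step logic can be checked with a quick script outside the database (Python, illustrative only): first keep the row with the highest correction per (person, attempt), then the highest grade per person.

```python
# (person, grade, attempt, correction) — the sample data from above
grades = [(1, 80, 1, 0), (1, 90, 2, 0), (1, 100, 3, 0),
          (2, 95, 1, 0), (2, 80, 1, 1), (2, 90, 1, 2), (2, 89, 1, 3)]

# Step 1: per (person, attempt), keep the row with the highest correction.
latest = {}
for person, grade, attempt, correction in grades:
    key = (person, attempt)
    if key not in latest or correction > latest[key][3]:
        latest[key] = (person, grade, attempt, correction)

# Step 2: per person, keep the row with the highest corrected grade.
best = {}
for person, grade, attempt, correction in latest.values():
    if person not in best or grade > best[person][1]:
        best[person] = (person, grade, attempt, correction)

print(sorted(best.values()))
# [(1, 100, 3, 0), (2, 89, 1, 3)]
```

Person 2's best raw grade is 95, but only the latest correction of each attempt counts, so 89 is the active row; any SQL formulation should reproduce that.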
Performance is going to be much more nuanced than just the SQL you have. In any case, the SQL above will break down most quickly on the GROUP BY with the MAX and on the temp-table copy. Both depend heavily on how many records are in the table and how much power (mostly CPU and RAM) your SQL Server has. If you copy "millions" of records into the temp table, it will most likely be slow by most standards.
I want to find out whether it is possible to find cycles in Hierarchical or Chain data with SQL.
E.g. I have following schema:
http://sqlfiddle.com/#!3/27269
create table node (
id INTEGER
);
create table edges (
id INTEGER,
node_a INTEGER,
node_b INTEGER
);
create table graph (
id INTEGER,
edge_id INTEGER);
INSERT INTO node VALUES (1) , (2), (3), (4);
INSERT INTO edges VALUES (1, 1, 2), (2, 2, 3) , (3, 3, 4) , (4, 4, 1);
-- first graph [id = 1] with cycle (1 -> 2 -> 3 -> 4 -> 1)
INSERT INTO graph VALUES (1, 1), (1, 2), (1, 3), (1, 4);
-- second graph [id =2] without cycle (1 -> 2 -> 3)
INSERT INTO graph VALUES (2, 1), (2, 2), (2, 3);
In graph table records with same ID belong to one graph.
I need a query that will return IDs of all graphs that have at least one cycle.
So, for the example above, the query should return 1, the id of the first graph.
First, I assume this is a directed graph; an undirected graph trivially contains a cycle as soon as it has a single edge.
The only tricky part to the recursive CTE is stopping when you've hit a cycle -- so you don't get infinite recursion.
Try this:
with cte as (
select e.node_a, e.node_b, iscycle = 0
from edges e
union all
select cte.node_a, e.node_b,
(case when cte.node_a = e.node_b then 1 else 0 end) as iscycle
from cte join
edges e
on cte.node_b = e.node_a
where iscycle = 0
)
select max(iscycle)
from cte;
I wrote an SQL query based on @gordon-linoff's answer. In some cases it went into an infinite loop, so I added a nodes_path column and checked whether the current connection had already appeared in that path.
Here is the script:
create table edges (
node_a varchar(20),
node_b varchar(20)
);
INSERT INTO edges VALUES ('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'D'), ('D', 'K'), ('K', 'A')
GO
with cte as (
SELECT
e.node_a
, e.node_b
, 0 as depth
, iscycle = 0
, CAST(e.node_a +' -> '+ e.node_b AS varchar(MAX)) as nodes_path
FROM edges e
UNION ALL
SELECT
cte.node_a
, e.node_b
, depth + 1
, (case when cte.node_a = e.node_b then 1 else 0 end) as iscycle
, CAST(cte.nodes_path+' -> '+ e.node_b AS varchar(MAX)) as nodes_path
FROM cte
JOIN edges e ON cte.node_b = e.node_a AND cte.nodes_path NOT LIKE '%' + CAST(cte.node_a+' -> '+ e.node_b AS varchar(500)) + '%'
where iscycle = 0
)
SELECT * -- max(iscycle)
FROM cte
option (maxrecursion 300) --just for safety :)
I don't know how efficient this is when there are millions of records, so if you see a way to write this query in a more optimized form, please share your opinion.
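To sanity-check results on small graphs, the same cycle test can be run outside SQL with a depth-first search. This Python sketch is illustrative only; it takes the edge list directly rather than the node/edges/graph tables:

```python
from collections import defaultdict

def has_cycle(edges):
    """Detect a cycle in a directed graph given as (node_a, node_b) pairs."""
    graph = defaultdict(list)
    for a, b in edges:
        graph[a].append(b)

    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / on current DFS path / finished
    color = defaultdict(int)

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:                 # back edge -> cycle
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(graph))

# The edges from the script above contain the cycle A -> D -> K -> A.
print(has_cycle([("A", "B"), ("A", "C"), ("A", "D"),
                 ("B", "D"), ("D", "K"), ("K", "A")]))   # True
```

The GRAY marking plays the same role as the nodes_path check in the CTE: it stops the walk as soon as a node on the current path is revisited, so the search terminates even on cyclic input.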
I've been trying to work out how to do a particular query for a day or so now and it has gotten to the point where I need some outside help. Hence my question.
Given the following data;
DECLARE @Data AS TABLE
(
OrgId INT,
ThingId INT
)
DECLARE @ReplacementData AS TABLE
(
OldThingId INT,
NewThingId INT
)
INSERT INTO @Data (OrgId, ThingId)
VALUES (1, 2), (1, 3), (1, 4),
(2, 1), (2, 4),
(3, 3), (3, 4)
INSERT INTO @ReplacementData (OldThingId, NewThingId)
VALUES (3, 4), (2, 5)
I want to find any organisation that has a "thing" that has been replaced as denoted in the #ReplacementData table variable. I'd want to see the org id, the thing it is that they have that has been replaced and the id of the thing that should replace it. So for example given the data above, I should see;
OrgId, ThingId, NewThingId (the replacement the org doesn't have, but should)
1, 2, 5 -- as Org 1 has thing 2, but not thing 5
I've had many attempts at trying to get this working, and I just can't seem to get my head around how to go about it. The following are a couple of my attempts, but I think I am just way off;
-- Attempt using correlated subqueries and EXISTS clauses
-- Show all orgs that have the old thing, but not the new thing
-- Ideally, limit results to OrgId, OldThingId and the NewThingId that they should now have too
SELECT *
FROM @Data d
WHERE EXISTS (SELECT *
FROM @Data oldstuff
WHERE oldstuff.OrgId = d.OrgId
AND oldstuff.ThingId IN
(SELECT OldThingID
FROM @ReplacementData))
AND NOT EXISTS (SELECT *
FROM @Data oldstuff
WHERE oldstuff.OrgId = d.OrgId
AND oldstuff.ThingId IN
(SELECT NewThingID
FROM @ReplacementData))
-- Attempt at using a JOIN to only include those old things that the org has (via the where clause)
-- Also try exists to show missing new things.
SELECT *
FROM @Data d
LEFT JOIN @ReplacementData rd ON rd.OldThingId = d.ThingId
WHERE NOT EXISTS (
SELECT *
FROM @Data dta
INNER JOIN @ReplacementData rep ON rep.NewThingId = dta.ThingId
WHERE dta.OrgId = d.OrgId
)
AND rd.OldThingId IS NOT NULL
Any help on this is much appreciated. I may well be going about it completely wrong, so please let me know if there is a better way of tackling this type of problem.
Try this out and let me know.
DECLARE @Data AS TABLE
(
OrgId INT,
ThingId INT
)
DECLARE @ReplacementData AS TABLE
(
OldThingId INT,
NewThingId INT
)
INSERT INTO @Data (OrgId, ThingId)
VALUES (1, 2), (1, 3), (1, 4),
(2, 1), (2, 4),
(3, 3), (3, 4)
INSERT INTO @ReplacementData (OldThingId, NewThingId)
VALUES (3, 4), (2, 5)
SELECT D.OrgId, RD.*
FROM @Data D
JOIN @ReplacementData RD
ON D.ThingId = RD.OldThingId
LEFT OUTER JOIN @Data EXCLUDE
ON D.OrgId = EXCLUDE.OrgId
AND RD.NewThingId = EXCLUDE.ThingId
WHERE EXCLUDE.OrgId IS NULL
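The join logic is easy to validate outside SQL. In Python (illustrative only), the query above amounts to keeping every (org, old, new) triple where the org has the old thing but not its replacement:

```python
# Same sample data as the table variables above.
data = {(1, 2), (1, 3), (1, 4), (2, 1), (2, 4), (3, 3), (3, 4)}
replacements = [(3, 4), (2, 5)]

result = sorted(
    (org, old, new)
    for (org, thing) in data
    for (old, new) in replacements
    if thing == old                    # the org has the replaced thing...
    and (org, new) not in data         # ...but not its replacement
)
print(result)   # [(1, 2, 5)]
```

Org 1 and Org 3 both have thing 3, but they also already have its replacement 4, so only (1, 2, 5) survives, matching the expected output in the question.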
I am trying to eliminate outliers in SQL Server 2008 by standard deviation. I would like only records that contain a value in a specific column within +/- 1 standard deviation of that column's mean.
How can I accomplish this?
If you are assuming a bell curve distribution of events, then only 68% of values will be within 1 standard deviation away from the mean (95% are covered by 2 standard deviations).
I would load variables with the mean and the standard deviation of your range (derived using the STDEV / STDEVP functions) and then select the values that fall within the appropriate number of standard deviations.
declare @stdtest table (colname varchar(20), colvalue int)
insert into @stdtest (colname, colvalue) values ('a', 2)
insert into @stdtest (colname, colvalue) values ('b', 4)
insert into @stdtest (colname, colvalue) values ('c', 4)
insert into @stdtest (colname, colvalue) values ('d', 4)
insert into @stdtest (colname, colvalue) values ('e', 5)
insert into @stdtest (colname, colvalue) values ('f', 5)
insert into @stdtest (colname, colvalue) values ('g', 7)
insert into @stdtest (colname, colvalue) values ('h', 9)
declare @std decimal(10, 4)    -- give these a scale: a bare "decimal"
declare @mean decimal(10, 4)   -- is decimal(18, 0) and truncates the bounds
declare @lower decimal(10, 4)
declare @higher decimal(10, 4)
declare @noofstds int
select @std = STDEV(colvalue), @mean = AVG(colvalue) from @stdtest
--68%
set @noofstds = 1
select @lower = @mean - (@noofstds * @std)
select @higher = @mean + (@noofstds * @std)
select @lower, @higher, * from @stdtest where colvalue between @lower and @higher
--returns the rows with colvalue between ~2.86 and ~7.14 (4, 4, 4, 5, 5, 7)
--95%
set @noofstds = 2
select @lower = @mean - (@noofstds * @std)
select @higher = @mean + (@noofstds * @std)
select @lower, @higher, * from @stdtest where colvalue between @lower and @higher
--returns all rows (between ~0.72 and ~9.28)
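The same filter is straightforward to verify outside the database. Python's statistics.stdev computes the sample standard deviation, which is what SQL Server's STDEV returns (STDEVP is the population variant):

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]     # same data as the table variable
mean = statistics.mean(values)        # 5.0
std = statistics.stdev(values)        # sample std dev, ~2.14 (matches STDEV)

within_1 = [v for v in values if mean - std <= v <= mean + std]
within_2 = [v for v in values if mean - 2 * std <= v <= mean + 2 * std]
print(within_1)   # [4, 4, 4, 5, 5, 7]
print(within_2)   # [2, 4, 4, 4, 5, 5, 7, 9]
```

One standard deviation keeps six of the eight rows (2 and 9 are dropped as outliers); two standard deviations keep everything, matching the comments in the SQL above.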
There is an aggregate function called STDEV in SQL Server that will give you the standard deviation; that is the hard part. Then just keep the rows whose value lies within one STDEV of the mean.
This is one way you could go about doing it -
create table #test
(
testNumber int
)
INSERT INTO #test (testNumber)
VALUES (2), (4), (4), (4), (5), (5), (7), (9)
SELECT testNumber FROM #test t
JOIN (
SELECT STDEV (testnumber) as [STDEV], AVG(testnumber) as mean
FROM #test
) X on t.testNumber >= X.mean - X.STDEV AND t.testNumber <= X.mean + X.STDEV
I'd be careful and think about what you're doing. Throwing away outliers may mean discarding information simply because it doesn't fit a preconceived world view, and that view could be quite wrong. Those outliers might be "black swans": rare, though not as rare as you'd think, and quite significant.
You give no context or explanation of what you're doing. It's easy to cite a function or technique that fulfills the needs of your particular case, but I thought it appropriate to post this caution until more information is supplied.