SQL - Distribute Same Values equally across X number of tables

I want to see if someone knows a way to evenly distribute multiple like values across "x" number of temp tables ensuring that the 'like' values (same team name in this example) never get lumped into one particular table. What I am trying to do is create heats for a race and evenly distribute teams across tables. Ex:
**Teams**
-----------
Los Angeles
New York
New York
Los Angeles
Florida
Florida
Arizona
Texas
Alabama
Alaska
New York
New York
New York
I would like the distribution to end up something like this, where all the duplicate teams are evenly distributed across 2 (or 3 or 4) heats:
**Heat One**
-------------
Los Angeles
New York
Florida
Arizona
Alabama
New York
New York
**Heat Two**
------------
Los Angeles
New York
Florida
Texas
Alaska
New York

Starting with SQL Server 2005, there is native functionality for bucketing data: NTILE().
The NTILE function is the fourth of four windowing functions introduced in SQL Server 2005. NTILE takes a different approach to partitioning data. ROW_NUMBER, RANK and DENSE_RANK will generate variable sized buckets of data based on the partition key(s). NTILE attempts to split the data into equal, fixed size buckets. BOL has a comprehensive page comparing the ranking functions if you want a quick visual reference on their effects.
Syntax
The syntax for NTILE differs slightly from the other window functions. It's NTILE(#BUCKET_COUNT) OVER ([PARTITION BY _] ORDER BY _) , where #BUCKET_COUNT is a positive integer or bigint value.
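The challenge is ensuring we get a good distribution, and that part is subject to the vagaries of the ordering: NEWID() gives a random order, while (SELECT NULL) leaves the order up to the server. Neither one looks at TeamName, so like teams can still land in the same heat.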
Leveraging Rhys's setup
CREATE table dbo.Teams (TeamId int, TeamName varchar(32));
insert dbo.Teams values
( 1, 'Los Angeles'),
( 2, 'New York'),
( 3, 'New York'),
( 4, 'Los Angeles'),
( 5, 'Florida'),
( 6, 'Florida'),
( 7, 'Arizona'),
( 8, 'Texas'),
( 9, 'Alabama'),
(10, 'Alaska'),
(11, 'New York'),
(12, 'New York'),
(13, 'New York');
SELECT
NTILE(2) OVER (ORDER BY NEWID()) AS Heat
, NTILE(2) OVER (ORDER BY (SELECT NULL)) AS HeatAlternate
, T.TeamName
, T.TeamId
FROM
dbo.Teams AS T
ORDER BY
1,3;
One of the nicer things about this approach is that it can be switched out to make whatever bucketing size you want by simply changing the value passed to ntile. It also ought to scale better as it would only take one pass through the source table.
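To make the bucket sizes concrete, here is a small Python sketch (not T-SQL) of how NTILE sizes its buckets when the row count does not divide evenly: the larger buckets come first. The helper name `ntile` is hypothetical and only mirrors the documented sizing rule. It says nothing about which team ends up in which bucket, since that depends entirely on the ORDER BY.

```python
from collections import Counter


def ntile(row_count, bucket_count):
    """Return the 1-based bucket number for each row, in row order."""
    base, extra = divmod(row_count, bucket_count)
    buckets = []
    for bucket in range(1, bucket_count + 1):
        # The first `extra` buckets receive one more row than the rest.
        size = base + 1 if bucket <= extra else base
        buckets.extend([bucket] * size)
    return buckets


# 13 teams split into 2 heats, as in the question.
heat_sizes = Counter(ntile(13, 2))
print(heat_sizes)
```

With 13 rows and 2 buckets this gives one heat of 7 and one of 6, which matches the sizes in the question's desired output.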

This approach doesn't sound right (having separate tables called Heat1, Heat2 etc) so you might want to re-think what you're doing, but if your circumstances dictate this is a good approach then how about allocating a random unique (but sequential) number to each team then use MOD to split the teams across heats? In order to get the 'like' teams (same teamname) into different heats they just need to be randomised together and the MOD will separate them.
create table dbo.Teams (TeamId int, TeamName varchar(32))
go
insert dbo.Teams values
( 1, 'Los Angeles'),
( 2, 'New York'),
( 3, 'New York'),
( 4, 'Los Angeles'),
( 5, 'Florida'),
( 6, 'Florida'),
( 7, 'Arizona'),
( 8, 'Texas'),
( 9, 'Alabama'),
(10, 'Alaska'),
(11, 'New York'),
(12, 'New York'),
(13, 'New York')
go
-- First get a random number per unique team name
; with cte as (
select row_number() over (order by newid()) as lrn, t.TeamName
from dbo.Teams t
group by t.TeamName
)
-- Second get a unique random number per team with like teams ordered together
select row_number() over (order by lrn, newid()) - 1 as rn, t.*
into #teams
from dbo.Teams t
join cte c on c.TeamName = t.TeamName
select 'Heat1', *
from #teams
where rn % 4 = 0
select 'Heat2', *
from #teams
where rn % 4 = 1
select 'Heat3', *
from #teams
where rn % 4 = 2
select 'Heat4', *
from #teams
where rn % 4 = 3
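The same idea can be sketched outside the database. The Python below is an illustration only, not a replacement for the T-SQL: `assign_heats` is a hypothetical helper that follows the same three steps (random rank per team name, like names ordered together, modulo to pick the heat).

```python
import random

TEAMS = [
    (1, "Los Angeles"), (2, "New York"), (3, "New York"), (4, "Los Angeles"),
    (5, "Florida"), (6, "Florida"), (7, "Arizona"), (8, "Texas"),
    (9, "Alabama"), (10, "Alaska"), (11, "New York"), (12, "New York"),
    (13, "New York"),
]


def assign_heats(teams, heat_count, rng):
    # Step 1: one random rank per distinct team name (the CTE's lrn).
    names = sorted({name for _, name in teams})
    rng.shuffle(names)
    name_rank = {name: rank for rank, name in enumerate(names)}
    # Step 2: order teams so like names sit together, random within a name.
    ordered = sorted(teams, key=lambda team: (name_rank[team[1]], rng.random()))
    # Step 3: deal the ordered teams out round-robin, as rn % heat_count does.
    heats = {heat: [] for heat in range(1, heat_count + 1)}
    for rn, team in enumerate(ordered):
        heats[rn % heat_count + 1].append(team)
    return heats


heats = assign_heats(TEAMS, 4, random.Random())
for heat, members in heats.items():
    print(heat, [name for _, name in members])
```

Because like names occupy consecutive positions before the modulo is applied, the number of entries for any one team name differs by at most one between heats.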

Related

How to self join only a subset of rows in PostgreSQL?

Given the following table:
CREATE TABLE people (
name TEXT PRIMARY KEY,
age INT NOT NULL
);
INSERT INTO people VALUES
('Lisa', 30),
('Marta', 27),
('John', 32),
('Sam', 41),
('Alex', 12),
('Aristides',43),
('Cindi', 1)
;
I am using a self join to compare each value of a specific column with all the other values of the same column. My query looks something like this:
SELECT DISTINCT A.name as child
FROM people A, people B
WHERE A.age + 16 < B.age;
This query aims to spot potential sons/daughters based on age difference. More specifically, my goal is to identify the set of people that may have stayed in the same house as one of their parents (ordered by name), assuming that there must be an age difference of at least 16 years between a child and their parents.
Now I would like to combine this kind of logic with the information that is in another table.
The other table looks something like that:
CREATE TABLE houses (
house_name TEXT NOT NULL,
house_member TEXT NOT NULL REFERENCES people(name)
);
INSERT INTO houses VALUES
('house Smith', 'Lisa'),
('house Smith', 'Marta'),
('house Smith', 'John'),
('house Doe', 'Lisa'),
('house Doe', 'Marta'),
('house Doe', 'Alex'),
('house Doe', 'Sam'),
('house McKenny', 'Aristides'),
('house McKenny', 'John'),
('house McKenny', 'Cindi')
;
The two tables can be joined ON houses.house_member = people.name.
More specifically I would like to spot the children only within the same house. It does not make sense to compare the age of each person with the age of all the others, but instead it would be more efficient to compare the age of each person with all the other people in the same house.
My idea is to perform the self join from above but only within a PARTITION BY house_name. However, I don't think this is a good idea since I do not have an aggregate function. The same applies to GROUP BY as well. What could I do here?
The expected output should be the following, ordered by house_member:
house_member
Alex
Cindi
For simplicity I have created a fiddle.

First, join the two tables to build one table that has all three bits of info: house_name, house_member, age.
Then join it with itself just as you did originally, and add one extra filter to look only within the same household.
WITH
CTE_All
AS
(
SELECT
houses.house_name
,houses.house_member
,people.age
FROM
houses
INNER JOIN people ON people.name = houses.house_member
)
SELECT DISTINCT
Children.house_name
,Children.house_member AS child_name
FROM
CTE_All AS Children
INNER JOIN CTE_All AS Parents
ON Children.age + 16 < Parents.age
-- this is our age difference
AND Children.house_name = Parents.house_name
-- within the same house
;
All this is one single query. You don't have to use CTE, you can inline it as a subquery, but it is more readable with CTE.
Result
house_name | child_name
:------------ | :---------
house Doe | Alex
house McKenny | Cindi
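For anyone who wants to try this without a PostgreSQL instance, here is a minimal sketch that runs the same query against an in-memory SQLite database from Python. SQLite is a stand-in chosen only for convenience, and the ORDER BY is added so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (name TEXT PRIMARY KEY, age INT NOT NULL);
INSERT INTO people VALUES
  ('Lisa', 30), ('Marta', 27), ('John', 32), ('Sam', 41),
  ('Alex', 12), ('Aristides', 43), ('Cindi', 1);
CREATE TABLE houses (
  house_name TEXT NOT NULL,
  house_member TEXT NOT NULL REFERENCES people(name)
);
INSERT INTO houses VALUES
  ('house Smith', 'Lisa'), ('house Smith', 'Marta'), ('house Smith', 'John'),
  ('house Doe', 'Lisa'), ('house Doe', 'Marta'), ('house Doe', 'Alex'),
  ('house Doe', 'Sam'), ('house McKenny', 'Aristides'),
  ('house McKenny', 'John'), ('house McKenny', 'Cindi');
""")

# Self join of the combined rows, restricted to pairs in the same house.
children = conn.execute("""
WITH cte_all AS (
  SELECT h.house_name, h.house_member, p.age
  FROM houses AS h
  INNER JOIN people AS p ON p.name = h.house_member
)
SELECT DISTINCT c.house_name, c.house_member AS child_name
FROM cte_all AS c
INNER JOIN cte_all AS par
  ON c.age + 16 < par.age
 AND c.house_name = par.house_name
ORDER BY child_name
""").fetchall()

print(children)  # [('house Doe', 'Alex'), ('house McKenny', 'Cindi')]
```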

Re-format table, placing multiple column headers as rows

I have a table of fishing catches, showing the number of fish and total kg for all the fishing days. The current format of the data is shown below.
In the other reference table is a list of all the official fish species with codes and names.
How can I re-format the first table so that the rows are repeated for each day, with each row showing one species and its corresponding number of fish and total kg? So instead of each species having its own n and kg columns, the species would go in rows and there would be only one n column and one kg column. I am thinking of looping through the list of all species and duplicating the rows with the right n and kg values for each species. This is the final format I need. My database is SQL Server.

You may use a union query here:
SELECT Day, 'Albacore' AS Species, ALB_n AS n, ALB_kg AS kg FROM yourTable
UNION ALL
SELECT Day, 'Big eye tuna', BET_n, BET_kg FROM yourTable
UNION ALL
SELECT Day, 'Sword fish', SWO_n, SWO_kg FROM yourTable
ORDER BY Day, Species;
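As a quick check of the shape this produces, here is a sketch that runs the union query from Python against an in-memory SQLite database. The table name `catches` and the two sample days are assumptions made up for illustration; the column names follow the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE catches (
  Day INT, ALB_n INT, ALB_kg INT, BET_n INT, BET_kg INT, SWO_n INT, SWO_kg INT
);
-- Hypothetical sample data: two fishing days.
INSERT INTO catches VALUES
  (1, 10, 120, 4, 60, 2, 55),
  (2, 15, 170, 8, 100, 1, 30);
""")

# One SELECT per species; each turns a pair of columns into a row.
long_rows = conn.execute("""
SELECT Day, 'Albacore' AS Species, ALB_n AS n, ALB_kg AS kg FROM catches
UNION ALL
SELECT Day, 'Big eye tuna', BET_n, BET_kg FROM catches
UNION ALL
SELECT Day, 'Sword fish', SWO_n, SWO_kg FROM catches
ORDER BY Day, Species
""").fetchall()

for row in long_rows:
    print(row)
```

Two input rows with three species each come back as six rows, one per day and species.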

You can also use a cross apply here, e.g.:
/*
* Data setup...
*/
create table dbo.Source (
Day int,
ALB_n int,
ALB_kg int,
BET_n int,
BET_kg int,
SWO_n int,
SWO_kg int
);
insert dbo.Source (Day, ALB_n, ALB_kg, BET_n, BET_kg, SWO_n, SWO_kg) values
(1, 10, 120, 4, 60, 2, 55),
(2, 15, 170, 8, 100, 1, 30);
create table dbo.Species (
Sp_id int,
Sp_name nvarchar(20)
);
insert dbo.Species (Sp_id, Sp_name) values
(1, N'Albacore'),
(2, N'Big eye tuna'),
(3, N'Sword fish');
/*
* Unpivot data using cross apply...
*/
select Day, Sp_name as Species, n, kg
from dbo.Source
cross apply dbo.Species
cross apply (
select
case
when Sp_name=N'Albacore' then ALB_n
when Sp_name=N'Big eye tuna' then BET_n
when Sp_name=N'Sword fish' then SWO_n
else null end as n,
case
when Sp_name=N'Albacore' then ALB_kg
when Sp_name=N'Big eye tuna' then BET_kg
when Sp_name=N'Sword fish' then SWO_kg
else null end as kg
) unpivotted (n, kg);

How to specify a linear programming-like constraint (i.e. max number of rows for a dimension's attributes) in SQL server?

I'm looking to assign unique person IDs to a marketing program, but need to optimize based on each person's Probability Score (some people can be sent to multiple programs, some only one) and have two constraints such as budgeted mail quantity for each program.
I'm using SQL Server and am able to put IDs into their highest scoring program using the row_number() over(partition by person_ID order by Prob_Score), but I need to return a table where each ID is assigned to a program, but I'm not sure how to add the max mail quantity constraint specific to each individual program. I've looked into the Check() constraint functionality, but I'm not sure if that's applicable.
create table test_marketing_table(
PersonID int,
MarketingProgram varchar(255),
ProbabilityScore real
);
insert into test_marketing_table (PersonID, MarketingProgram, ProbabilityScore)
values (1, 'A', 0.07)
,(1, 'B', 0.06)
,(1, 'C', 0.02)
,(2, 'A', 0.02)
,(3, 'B', 0.08)
,(3, 'C', 0.13)
,(4, 'C', 0.02)
,(5, 'A', 0.04)
,(6, 'B', 0.045)
,(6, 'C', 0.09);
--this section assigns everyone to their highest scoring program,
--but this isn't necessarily what I need
with x
as
(
select *, row_number()over(partition by PersonID order by ProbabilityScore desc) as PersonScoreRank
from test_marketing_table
)
select *
from x
where PersonScoreRank = 1;
I also need to specify some constraints: two max C packages, one max A & one max B package can be sent. How can I reassign the IDs to a program while also using the highest probability score left available?
The final result should look like:
PersonID MarketingProgram ProbabilityScore PersonScoreRank
3 C 0.13 1
6 C 0.09 1
1 A 0.07 1
6 B 0.045 2

You need to rethink your ROW_NUMBER() formula based on your actual need, and you should also have a table of Marketing Programs to make this work efficiently. This covers the basic ideas you need to incorporate to efficiently perform the filtering you need.
MarketingPrograms Table
CREATE TABLE MarketingPrograms (
ProgramID varchar(10),
PeopleDesired int
)
Populate the MarketingPrograms Table
INSERT INTO MarketingPrograms (ProgramID, PeopleDesired) Values
('A', 1),
('B', 1),
('C', 2)
Use the MarketingPrograms Table
-- Rank people within each program; the column in test_marketing_table is MarketingProgram
with x as (
select *,
row_number() over (partition by MarketingProgram order by ProbabilityScore desc) as ProgramScoreRank
from test_marketing_table
)
select *
from x
INNER JOIN MarketingPrograms m
ON x.MarketingProgram = m.ProgramID
WHERE x.ProgramScoreRank <= m.PeopleDesired;
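To see what the per-program cap returns on the sample data, here is a sketch that runs the ranking query from Python against an in-memory SQLite database. SQLite is only a stand-in for SQL Server, and the window partitions on the MarketingProgram column of test_marketing_table. It assumes a SQLite build with window function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test_marketing_table (
  PersonID INT, MarketingProgram TEXT, ProbabilityScore REAL
);
INSERT INTO test_marketing_table VALUES
  (1, 'A', 0.07), (1, 'B', 0.06), (1, 'C', 0.02), (2, 'A', 0.02),
  (3, 'B', 0.08), (3, 'C', 0.13), (4, 'C', 0.02), (5, 'A', 0.04),
  (6, 'B', 0.045), (6, 'C', 0.09);
CREATE TABLE MarketingPrograms (ProgramID TEXT, PeopleDesired INT);
INSERT INTO MarketingPrograms VALUES ('A', 1), ('B', 1), ('C', 2);
""")

# Rank people inside each program, then keep only as many as the program wants.
capped = conn.execute("""
WITH x AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY MarketingProgram ORDER BY ProbabilityScore DESC
         ) AS ProgramScoreRank
  FROM test_marketing_table
)
SELECT x.PersonID, x.MarketingProgram, x.ProgramScoreRank
FROM x
INNER JOIN MarketingPrograms AS m ON x.MarketingProgram = m.ProgramID
WHERE x.ProgramScoreRank <= m.PeopleDesired
ORDER BY x.MarketingProgram, x.ProgramScoreRank
""").fetchall()

print(capped)  # [(1, 'A', 1), (3, 'B', 1), (3, 'C', 1), (6, 'C', 2)]
```

Note that this only enforces the mail quantity per program. Person 3 appears in both B and C, so limiting each person to one program would need an extra step on top of this.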

Postgresql check for double entries

I searched for nearly an hour to solve my problem, but I can't find anything.
So:
I created a table named s (Suppliers) where some suppliers for parts are listed. It looks like this:
insert into S(sno, sname, status, city)
values ('S1', 'Smith', 20, 'London'),
('S2', 'Jones', 10, 'Paris'),
('S3', 'Blake', 30, 'Paris'),
('S4', 'Clark', 20, 'London'),
('S5', 'Adams', 30, 'Athens');
Now I want to check this table for duplicate entries in the column "city" (here that would be London and Paris), sort the result by sno, and print it out.
I know that it's a bit harder in Postgres than in MySQL, and I tried it like this:
SELECT sno, COUNT(city) AS NumOccurencies FROM s GROUP BY sno HAVING ( COUNT (city) > 1 );
But all I get is an empty table :(. I tried different ways but it's always the same, and I don't know what to do, to be honest. I hope some of you can help me out here :).
Greetings Max

You're thinking about it a little backwards. By grouping by the sno you're finding all of those rows with the same sno, not the same city. Try this instead:
SELECT
city
FROM
S
GROUP BY
city
HAVING
COUNT(*) > 1
You can then use that as a subquery to find the rows that you want:
SELECT
sno, sname, status, city
FROM
S
WHERE
city IN
(
SELECT
city
FROM
S
GROUP BY
city
HAVING
COUNT(*) > 1
)
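Here is a small sketch that runs the subquery version from Python against an in-memory SQLite database, with an ORDER BY sno added since the question asks for that ordering. The column types in the CREATE TABLE are assumptions, because the question only shows the INSERT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Column types are assumed; only the INSERT appears in the question.
CREATE TABLE S (sno TEXT PRIMARY KEY, sname TEXT, status INT, city TEXT);
INSERT INTO S (sno, sname, status, city) VALUES
  ('S1', 'Smith', 20, 'London'),
  ('S2', 'Jones', 10, 'Paris'),
  ('S3', 'Blake', 30, 'Paris'),
  ('S4', 'Clark', 20, 'London'),
  ('S5', 'Adams', 30, 'Athens');
""")

# Inner query: cities that occur more than once. Outer query: their suppliers.
duplicates = conn.execute("""
SELECT sno, sname, status, city
FROM S
WHERE city IN (
  SELECT city FROM S GROUP BY city HAVING COUNT(*) > 1
)
ORDER BY sno
""").fetchall()

for row in duplicates:
    print(row)
```

Only S5 (Athens) is left out, because Athens is the one city that appears a single time.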

Algorithm to do a summation over a column with random selection of data from other column

I have a table like this:
CREATE TABLE Table1
([IdeaNr] int, [SubmitterName] varchar(4), [SubmitterDepartment] varchar(4))
;
INSERT INTO Table1
([IdeaNr], [SubmitterName], [SubmitterDepartment])
VALUES
(1, 'Joe', 'Org1'),
(1, 'Bill', 'Org2'),
(1, 'Kate', 'Org1'),
(1, 'Tom', 'Org3'),
(2, 'Sue', 'Org2'),
(3, 'Bill', 'Org2'),
(3, 'Fred', 'Org1'),
(4, 'Ted', 'Org3'),
(4, 'Kate', 'Org1'),
(4, 'Hank', 'Org3')
;
I want to get the following result from a query:
IdeaNr SubmitterCount SubmitterRndName SubmitterRndDepartment
1 4 Joe or ... Org1 (if Joe)
2 1 Sue Org2
3 2 Bill or ... Org2 (if Bill)
4 3 Ted or ... Org3 (if Ted)
I have tried a lot of things with all kind of JOINs of Table1 with itself, derived tables and GROUP BY, e.g.:
SELECT COUNT(IdeaNr) AS SubmitterCount,IdeaNr,SubmitterName,SubmitterDepartment
FROM Table1
GROUP BY IdeaNr,SubmitterName,SubmitterDepartment
I think the problem is to create an algorithm that takes just the first (or a random) name and department appearing in a group of IdeaNr. It is absolutely clear that you can get to misleading interpretations of that kind of data, e. g.:
Org1 has 2 Ideas
Org2 has 1 Idea
Org3 has 1 Idea
But this kind of "wrong averaging" is OK for the task. Can you help?
EDIT: The expected result evolved during the discussion. The desired result changed to:
IdeaNr SubmitterCount SubmitterRndName SubmitterRndDepartment
1 4 Joe, Bill, ... GroupIdea
2 1 Sue Org2
3 2 Bill, Fred GroupIdea
4 3 Ted, ... GroupIdea

Try it like this:
CREATE TABLE #Table1 ([IdeaNr] int, [SubmitterName] varchar(4), [SubmitterDepartment] varchar(4));
INSERT INTO #Table1([IdeaNr], [SubmitterName], [SubmitterDepartment])
VALUES
(1, 'Joe', 'Org1'),
(1, 'Bill', 'Org2'),
(1, 'Kate', 'Org1'),
(1, 'Tom', 'Org3'),
(2, 'Sue', 'Org2'),
(3, 'Bill', 'Org2'),
(3, 'Fred', 'Org1'),
(4, 'Ted', 'Org3'),
(4, 'Kate', 'Org1'),
(4, 'Hank', 'Org3');
SELECT x.IdeaNr
,Count(x.IdeaNr)
,MAX(Submitter.SubmitterName) AS SubmitterRndName
,MAX(Submitter.SubmitterDepartment) AS SubmitterRndDepartment
FROM #Table1 AS x
CROSS APPLY
(
SELECT TOP 1 SubmitterName, SubmitterDepartment
FROM #Table1 AS y
WHERE y.IdeaNr=x.IdeaNr
) AS Submitter
GROUP BY x.IdeaNr
Here is one more idea, in case you need it:
SELECT x.IdeaNr
,Count(x.IdeaNr)
,STUFF(
(
SELECT ', ' + y.SubmitterName --maybe with DISTINCT
FROM #Table1 AS y
WHERE y.IdeaNr=x.IdeaNr
FOR XML PATH('')
),1,2,'') AS AllSubmitters
,STUFF(
(
SELECT ', ' + z.SubmitterDepartment --maybe with DISTINCT
FROM #Table1 AS z
WHERE z.IdeaNr=x.IdeaNr
FOR XML PATH('')
),1,2,'') AS AllDepartments
FROM #Table1 AS x
GROUP BY x.IdeaNr
This comes back with
IdeaNr AllSubmitters AllDepartments
1 4 Joe, Bill, Kate, Tom Org1, Org2, Org1, Org3
2 1 Sue Org2
3 2 Bill, Fred Org2, Org1
4 3 Ted, Kate, Hank Org3, Org1, Org3
EDIT: Following your idea from the last comment:
SELECT x.IdeaNr
,COUNT(x.IdeaNr)
,STUFF(
(
SELECT DISTINCT ', ' + y.SubmitterName
FROM #Table1 AS y
WHERE y.IdeaNr=x.IdeaNr
FOR XML PATH('')
),1,2,'') AS AllSubmitters
,CASE WHEN COUNT(x.IdeaNr)=1 THEN (SELECT TOP 1 z.SubmitterDepartment FROM #Table1 AS z WHERE z.IdeaNr=x.IdeaNr)
ELSE 'GroupIdea' END AS Departments
FROM #Table1 AS x
GROUP BY x.IdeaNr

If you want to read more about this topic search for top-N-per-group. In SQL Server it is easy to do using CROSS APPLY.
WITH
CTE
AS
(
SELECT
IdeaNr
,COUNT(*) AS SubmitterCount
FROM #Table1
GROUP BY IdeaNr
)
SELECT
CTE.IdeaNr
,CTE.SubmitterCount
,CA.SubmitterName
,CA.SubmitterDepartment
FROM
CTE
CROSS APPLY
(
SELECT TOP(1)
T.SubmitterName
,T.SubmitterDepartment
FROM #Table1 AS T
WHERE T.IdeaNr = CTE.IdeaNr
--ORDER BY T.SubmitterName
--ORDER BY T.SubmitterDepartment
--ORDER BY CRYPT_GEN_RANDOM(4)
) AS CA
ORDER BY CTE.IdeaNr;
If you don't put any ORDER BY in the CROSS APPLY part the server will pick one "random" row. It is not random as such, but results may be the same or may differ when you run this query several times. In practice, results will most likely differ if you create or drop indexes on the table, but if the table is large they may differ every time the query runs.
If you want to pick some specific row for each IdeaNr, then use ORDER BY Name or Department or some ID, etc.
If you want to pick a really random row, then ORDER BY CRYPT_GEN_RANDOM(4).
I get the following result without any ORDER BY when I use table variable for this test without any indexes:
IdeaNr SubmitterCount SubmitterName SubmitterDepartment
1 4 Joe Org1
2 1 Sue Org2
3 2 Bill Org2
4 3 Ted Org3
It looks as if it picked the "first" row for each IdeaNr in the order in which they were added to the table. But don't be fooled: the order is not guaranteed without an explicit ORDER BY. If you want to get the first row for each IdeaNr in the order in which they were added, you need to store information about this order somehow. For example, add a column ID int IDENTITY to the table that increments automatically as new rows are added. Then ORDER BY ID returns the first-added row for each IdeaNr (and ORDER BY ID DESC the last-added one) with guaranteed results.
Play with SQL Fiddle to see how it works.
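The point about storing the insertion order can be tried out with a small sketch in Python against an in-memory SQLite database. Two things here are swapped in and are not part of the answer above: SQLite has no CROSS APPLY, so the sketch uses ROW_NUMBER() to pick one row per group, and an `id INTEGER PRIMARY KEY AUTOINCREMENT` column stands in for IDENTITY. It assumes a SQLite build with window function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (
  id INTEGER PRIMARY KEY AUTOINCREMENT,  -- stand-in for ID int IDENTITY
  IdeaNr INT, SubmitterName TEXT, SubmitterDepartment TEXT
);
INSERT INTO Table1 (IdeaNr, SubmitterName, SubmitterDepartment) VALUES
  (1, 'Joe', 'Org1'), (1, 'Bill', 'Org2'), (1, 'Kate', 'Org1'),
  (1, 'Tom', 'Org3'), (2, 'Sue', 'Org2'), (3, 'Bill', 'Org2'),
  (3, 'Fred', 'Org1'), (4, 'Ted', 'Org3'), (4, 'Kate', 'Org1'),
  (4, 'Hank', 'Org3');
""")

# Ordering by id inside each IdeaNr makes "first added" well defined.
first_rows = conn.execute("""
WITH ranked AS (
  SELECT IdeaNr, SubmitterName, SubmitterDepartment,
         COUNT(*) OVER (PARTITION BY IdeaNr) AS SubmitterCount,
         ROW_NUMBER() OVER (PARTITION BY IdeaNr ORDER BY id) AS rn
  FROM Table1
)
SELECT IdeaNr, SubmitterCount, SubmitterName, SubmitterDepartment
FROM ranked
WHERE rn = 1
ORDER BY IdeaNr
""").fetchall()

for row in first_rows:
    print(row)
```

Changing `ORDER BY id` to `ORDER BY id DESC` inside the window would pick the last-added submitter for each idea instead.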
