Pure PostgreSQL replacement for PL/R sample() function?

Our new database does not (and will not) support PL/R usage, which we rely on extensively to implement a random weighted sample function:
CREATE OR REPLACE FUNCTION sample(
    ids bigint[],
    size integer,
    seed integer DEFAULT 1,
    with_replacement boolean DEFAULT false,
    probabilities numeric[] DEFAULT NULL::numeric[])
RETURNS bigint[]
LANGUAGE 'plr'
COST 100
VOLATILE
AS $BODY$
set.seed(seed)
ids = as.integer(ids)
if (length(ids) == 1) {
    s = rep(ids, size)
} else {
    s = sample(ids, size, with_replacement, probabilities)
}
return(s)
$BODY$;
Is there a purely SQL approach to this same function? This post shows an approach that selects a single random row, but does not have the functionality of sampling multiple groups at once.
As far as I know, SQL Fiddle does not support PL/R, so see below for a quick replication example:
CREATE TABLE test
(category text, uid integer, weight numeric)
;
INSERT INTO test
(category, uid, weight)
VALUES
('a', 1, 45),
('a', 2, 10),
('a', 3, 25),
('a', 4, 100),
('a', 5, 30),
('b', 6, 20),
('b', 7, 10),
('b', 8, 80),
('b', 9, 40),
('b', 10, 15),
('c', 11, 20),
('c', 12, 10),
('c', 13, 80),
('c', 14, 40),
('c', 15, 15)
;
SELECT category,
       unnest(diffusion_shared.sample(array_agg(uid ORDER BY uid),
                                      1,     -- size
                                      1,     -- seed
                                      True,  -- with_replacement
                                      array_agg(weight ORDER BY uid))
       ) AS uid
FROM test
WHERE category IN ('a', 'b')
GROUP BY category;
Which outputs:
category uid
'a' 4
'b' 8
Any ideas?
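For reference, one pure-SQL direction (a sketch only, not a drop-in replacement for the function above): order each group by the Efraimidis-Spirakis key random() ^ (1.0 / weight) and keep the top row per group, which draws one weighted sample per category without replacement. It does not reproduce the seed or with_replacement arguments (setseed() can at least make random() repeatable within a session), and extending it to size > 1 would need row_number() per group instead of DISTINCT ON:

-- A minimal sketch: one weighted draw per category, without replacement.
-- Larger weights tend to produce larger random() ^ (1/weight) keys.
SELECT DISTINCT ON (category)
       category,
       uid
FROM test
WHERE category IN ('a', 'b')
ORDER BY category,
         random() ^ (1.0 / weight::double precision) DESC;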

Related

How does the reduce_agg function in Presto work?

The docs explain reduce_agg here: https://prestodb.io/docs/current/functions/aggregate.html
reduce_agg(inputValue T, initialState S, inputFunction(S, T, S), combineFunction(S, S, S)) → S
Reduces all input values into a single value. inputFunction will be invoked for each input value. In addition to taking the input value, inputFunction takes the current state, initially initialState, and returns the new state. combineFunction will be invoked to combine two states into a new state. The final state is returned:
SELECT id, reduce_agg(value, 0, (a, b) -> a + b, (a, b) -> a + b)
FROM (
    VALUES
        (1, 2),
        (1, 3),
        (1, 4),
        (2, 20),
        (2, 30),
        (2, 40)
) AS t(id, value)
GROUP BY id;
-- (1, 9)
-- (2, 90)

SELECT id, reduce_agg(value, 1, (a, b) -> a * b, (a, b) -> a * b)
FROM (
    VALUES
        (1, 2),
        (1, 3),
        (1, 4),
        (2, 20),
        (2, 30),
        (2, 40)
) AS t(id, value)
GROUP BY id;
-- (1, 24)
-- (2, 24000)
What exactly does the combineFunction argument do?
Changing the combineFunction to any of the following still produces the same output:
SELECT id, reduce_agg(value, 0, (a, b) -> a + b, (a, b) -> 0)
FROM (
    VALUES
        (1, 2),
        (1, 3),
        (1, 4),
        (2, 20),
        (2, 30),
        (2, 40)
) AS t(id, value)
GROUP BY id;
-- (1, 9)
-- (2, 90)

SELECT id, reduce_agg(value, 0, (a, b) -> a + b, (a, b) -> b)
FROM (
    VALUES
        (1, 2),
        (1, 3),
        (1, 4),
        (2, 20),
        (2, 30),
        (2, 40)
) AS t(id, value)
GROUP BY id;
-- (1, 9)
-- (2, 90)
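As I understand the documentation quoted above (context, not part of the original question): inputFunction folds each input value into a per-split partial state, and combineFunction merges partial states that were built on different splits or workers. The VALUES inputs above are tiny enough to live in a single state, so a bogus combiner is never meaningfully exercised; on a large distributed table it would silently corrupt the result. A correct combiner must merge two partial states of the same shape, e.g. for a sum:

SELECT id, reduce_agg(value, 0,
                      (state, v) -> state + v,  -- fold one value into a state
                      (s1, s2) -> s1 + s2)      -- merge two partial states
FROM (
    VALUES
        (1, 2),
        (1, 3),
        (1, 4)
) AS t(id, value)
GROUP BY id;
-- (1, 9)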

How much unique data is there? Put it all in a table

I would like to query in SQL how many unique values there are and how many rows there are in total. In Python, I could do it like this. But how do I do this in SQL so that I get a result like the one at the bottom?
In Python I could do the following:
import pandas as pd

d = {'sellerid': [1, 1, 1, 2, 2, 3, 3, 3],
     'modelnumber': [85, 45, 85, 12, 85, 74, 85, 12],
     'modelgroup': [2, 3, 2, 1, 2, 3, 2, 1]}
df = pd.DataFrame(data=d)
display(df.head(10))
df['Dataframe'] = 'df'
unique_sellerid = df['sellerid'].nunique()
print("unique_sellerid", unique_sellerid)
unique_modelnumber = df['modelnumber'].nunique()
print("unique_modelnumber", unique_modelnumber)
unique_modelgroup = df['modelgroup'].nunique()
print("unique_modelgroup", unique_modelgroup)
total_rows = df.shape[0]
print("total_rows", total_rows)
[OUT]
unique_sellerid 3
unique_modelnumber 4
unique_modelgroup 3
total_rows 8
I want a query that produces the same result. Here is the dummy table:
CREATE TABLE cars (
    sellerid INT NOT NULL,
    modelnumber INT NOT NULL,
    modelgroup INT
);
INSERT INTO cars
(sellerid , modelnumber, modelgroup )
VALUES
(1, 85, 2),
(1, 45, 3),
(1, 85, 2),
(2, 12, 1),
(2, 85, 2),
(3, 74, 3),
(3, 85, 2),
(3, 12, 1);
You could use the count(distinct column) aggregate function, like:
select
    count(distinct sellerid) as nunique_sellerid,
    count(distinct modelnumber) as nunique_modelnumber,
    count(distinct modelgroup) as nunique_modelgroup,
    count(*) as total_rows
from cars;
Also, in pandas you can apply the nunique() function to the whole DataFrame rather than doing it column by column: df.nunique()
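Against the cars table above this should match the pandas output shown in the question, i.e.:

nunique_sellerid | nunique_modelnumber | nunique_modelgroup | total_rows
3                | 4                   | 3                  | 8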

Deriving the first instance of a specific node type in a tree structure query using SQL

I am designing electrical design software that will model an electrical utility system, from the incoming power utility right down to individual circuits such as computers and coffee machines.
I want to give each component of the system a dedicated table, e.g. transformers, loads, cables, power panels (called buses in this example).
Each component can be connected to one or many other components. I am using a parent/child table to manage the connections and plan to use a CTE to derive the hierarchical tree structure for a given component.
The voltage supplying any component in the system is derived by finding the first instance of a transformer or a utility in the tree.
I have developed a query that can handle this, as demonstrated below.
However, it only works for selecting one component at a time in the CTE. I am looking for a way to select all buses and their connected voltage (nearest trafo or utility) at once. The only solution I can come up with is to wrap the query in a table function. Is there a better way of doing this? (One possible alternative is sketched after the expected result below.)
CREATE TABLE #componentConnection
(componentConnectionID int, parentComponentID varchar(4), childComponentID int)
;
INSERT INTO #componentConnection
(componentConnectionID, parentComponentID, childComponentID)
VALUES
(1, '13', 18),
(2, '13', 19),
(3, '13', 20),
(4, '13', 21),
(5, '13', 22),
(6, '13', 23),
(7, '14', 24),
(8, '14', 25),
(9, '14', 26),
(10, '14', 27),
(11, '14', 28),
(12, '14', 29),
(13, '15', 30),
(14, '15', 31),
(15, '15', 32),
(16, '15', 33),
(17, '15', 34),
(18, '15', 35),
(19, '16', 36),
(20, '16', 37),
(21, '16', 38),
(22, '16', 39),
(23, '16', 40),
(24, '16', 41),
(25, '1', 5),
(27, '5', 13),
(28, NULL, 1),
(29, '18', 6),
(30, '6', 11),
(31, '11', 7),
(32, '7', 14)
;
CREATE TABLE #component
(componentID int, componentName varchar(8), componentType varchar(7))
;
INSERT INTO #component
(componentID, componentName, componentType)
VALUES
(1, 'Utility1', 'utility'),
(2, 'Utility2', 'utility'),
(3, 'utility3', 'utility'),
(4, 'utility4', 'utility'),
(5, 'Cable1', 'cable'),
(6, 'Cable2', 'cable'),
(7, 'Cable3', 'cable'),
(8, 'Cable4', 'cable'),
(9, 'Cable5', 'cable'),
(10, 'Cable6', 'cable'),
(11, 'Trafo1', 'trafo'),
(12, 'Trafo2', 'trafo'),
(13, 'Bus1', 'bus'),
(14, 'Bus2', 'bus'),
(15, 'Bus3', 'bus'),
(16, 'Bus4', 'bus'),
(17, 'Bus5', 'bus'),
(18, 'cub1', 'cir'),
(19, 'cub2', 'cir'),
(20, 'cub3', 'cir'),
(21, 'cub4', 'cir'),
(22, 'cub5', 'cir'),
(23, 'cub6', 'cir'),
(24, 'cub1', 'cir'),
(25, 'cub2', 'cir'),
(26, 'cub3', 'cir'),
(27, 'cub4', 'cir'),
(28, 'cub5', 'cir'),
(29, 'cub6', 'cir'),
(30, 'cub1', 'cir'),
(31, 'cub2', 'cir'),
(32, 'cub3', 'cir'),
(33, 'cub4', 'cir'),
(34, 'cub5', 'cir'),
(35, 'cub6', 'cir'),
(36, 'cub1', 'cir'),
(37, 'cub2', 'cir'),
(38, 'cub3', 'cir'),
(39, 'cub4', 'cir'),
(40, 'cub5', 'cir'),
(41, 'cub6', 'cir')
;
CREATE TABLE #utility
([utilityID] int, [componentID] int, [utlityKV] float)
;
INSERT INTO #utility
([utilityID], [componentID], [utlityKV])
VALUES
(1, 1, 0.4),
(2, 2, 0.208),
(4, 3, 0.48),
(5, 4, 0.208)
;
CREATE TABLE #transformer
([transformerID] int, [componentID] int, [facilityID] int, [transformerName] varchar(4), [transformerPrimaryTapKv] float, [transformerSecondaryTapKv] float, [transformerPrimaryKv] float, [transformerSecondaryKv] float)
;
INSERT INTO #transformer
([transformerID], [componentID], [facilityID], [transformerName], [transformerPrimaryTapKv], [transformerSecondaryTapKv], [transformerPrimaryKv], [transformerSecondaryKv])
VALUES
(3, 11, 1, NULL, 0.48, 0.208, 0.48, 0.208),
(4, 12, 2, NULL, 0.48, 0.4, 0.48, 0.4)
;
CREATE TABLE #Bus
([busID] int, [busTypeID] int, [componentID] int, [bayID] int, [busName] varchar(4), [busConductorType] varchar(6), [busRatedCurrent] int)
;
INSERT INTO #Bus
([busID], [busTypeID], [componentID], [bayID], [busName], [busConductorType], [busRatedCurrent])
VALUES
(8, 1, 13, 1, 'bus1', 'Copper', 60),
(9, 1, 14, 1, 'bus2', 'copper', 50),
(10, 2, 15, 1, 'bus3', 'copper', 35),
(11, 2, 16, 1, 'bus4', 'copper', 35),
(13, 1, 17, 1, 'bus5', 'copper', 50)
;
WITH CTE AS (
    SELECT childComponentID AS SourceID, childComponentID, 0 AS depth
    FROM #ComponentConnection
    UNION ALL
    SELECT C1.SourceID, C.childComponentID, C1.depth + 1 AS depth
    FROM #ComponentConnection AS C
    INNER JOIN CTE AS C1
        ON C.parentComponentID = C1.childComponentID
)
SELECT childComponentID,
       b.busName,
       MIN(depth)
       --, c.componentType
       , ISNULL(t.transformerSecondaryKv, u.utlityKV) AS kV
FROM CTE AS CTE1
JOIN #component AS c
    ON CTE1.SourceID = c.componentID
LEFT JOIN #utility AS u
    ON CTE1.SourceID = u.componentID
LEFT JOIN #transformer AS t
    ON CTE1.SourceID = t.componentID
LEFT JOIN #Bus AS b
    ON CTE1.childComponentID = b.componentID
WHERE busName IS NOT NULL
  AND c.componentType IN ('Utility', 'trafo')
GROUP BY childComponentID, b.busName, ISNULL(t.transformerSecondaryKv, u.utlityKV)
ORDER BY depth
The desired result would be as follows. I want to list all buses and their associated voltage: I would select everything from the Bus table and derive the voltage from the hierarchical structure.
Result:
BusName | Voltage
Bus 1   | 0.4
Bus 2   | 0.208
Bus 3   | etc
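For what it's worth, here is a sketch of one way to cover all buses in a single query (T-SQL against the temp tables above; the CAST is an assumption needed because parentComponentID is varchar while componentID is int): anchor the recursion at every bus, walk upward through the connections, and keep only the nearest trafo or utility per bus with ROW_NUMBER():

-- Walk upward from every bus at once and pick the first trafo/utility hit.
WITH walk AS (
    SELECT b.componentID AS busID, cc.parentComponentID, 0 AS depth
    FROM #Bus AS b
    JOIN #componentConnection AS cc
        ON cc.childComponentID = b.componentID
    UNION ALL
    SELECT w.busID, cc.parentComponentID, w.depth + 1
    FROM walk AS w
    JOIN #componentConnection AS cc
        ON cc.childComponentID = CAST(w.parentComponentID AS int)
)
SELECT busName, kV
FROM (
    SELECT bu.busName,
           COALESCE(t.transformerSecondaryKv, u.utlityKV) AS kV,
           ROW_NUMBER() OVER (PARTITION BY w.busID ORDER BY w.depth) AS rn
    FROM walk AS w
    JOIN #component AS c
        ON c.componentID = CAST(w.parentComponentID AS int)
    LEFT JOIN #transformer AS t ON t.componentID = c.componentID
    LEFT JOIN #utility AS u ON u.componentID = c.componentID
    JOIN #Bus AS bu ON bu.componentID = w.busID
    WHERE c.componentType IN ('trafo', 'utility')
) AS ranked
WHERE rn = 1;

With the sample data this yields 0.4 for bus1 (nearest source is Utility1) and 0.208 for bus2 (nearest source is Trafo1); buses with no upward connection in #componentConnection simply drop out.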

Insert trigger SQL: missing FROM-clause entry for table

I'm using Postgres and I'm trying to create a trigger that runs when new values are inserted or updated in a table.
Here are the trigger and the function:
create or replace function trigf1() returns trigger as $$
begin
    if (ballotbox.totvoters > votes.nofvotes) then
        raise notice 'more voters than allowed';
        return old;
    else
        return new;
    end if;
end;
$$ language plpgsql;

create trigger T1
before insert or update on votes
for each row
execute procedure trigf1();
When I try to update the tables "votes" and "ballotBox", I get this error:
ERROR: missing FROM-clause entry for table "ballotbox"
LINE 1: SELECT (ballotbox.totvoters > votes.nofvotes)
^
QUERY: SELECT (ballotbox.totvoters > votes.nofvotes)
CONTEXT: PL/pgSQL function trigf1() line 3 at IF
I don't know if it's needed, but here are the CREATE TABLE statements and the inserted values:
create table ballotBox (
    bno integer,
    cid numeric(4,0),
    street varchar(20),
    hno integer,
    totvoters integer,
    primary key (bno),
    foreign key (cid) references city);

create table votes (
    cid numeric(4,0),
    bno integer,
    pid numeric(3,0),
    nofvotes integer,
    foreign key (cid) references city,
    foreign key (bno) references ballotBox,
    foreign key (pid) references party,
    check (nofvotes >= 0));
insert into ballotBox values
(1, 1, 'street1', 10, 1500),
(2, 1, 'street2', 15, 490),
(3, 1, 'street2', 15, 610),
(4, 1, 'street2', 15, 650),
(5, 2, 'street3', 10, 900),
(6, 2, 'street3', 55, 800),
(7, 2, 'street4', 67, 250),
(8, 2, 'street4', 67, 990),
(9, 2, 'street5', 5, 600),
(10, 3, 'street1', 72, 1000),
(11, 3, 'street6', 25, 610),
(12, 3, 'street6', 25, 600),
(13, 4, 'street2', 3, 550),
(14, 4, 'street7', 15, 500),
(15, 5, 'street8', 44, 1100),
(16, 5, 'street9', 7, 710),
(17, 5, 'street10', 13, 950);
insert into votes values
(1, 1, 200, 100),
(1, 1, 210, 220),
(1, 1, 220, 2),
(1, 1, 230, 400),
(1, 1, 240, 313),
(1, 1, 250, 99),
(2, 1, 200, 55),
(2, 1, 210, 150),
(2, 1, 220, 2),
(2, 1, 230, 16),
(2, 1, 240, 210);
Try this, correlating the lookup to the row being written instead of naming the tables directly:
if ((select totvoters from ballotbox where bno = new.bno) > new.nofvotes) then
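And for completeness, a minimal sketch of a trigger function that actually runs (assuming bno identifies the ballot box for the incoming votes row; note the comparison here follows the notice text, which reads as the opposite of the original condition):

create or replace function trigf1() returns trigger as $$
declare
    tot integer;
begin
    -- Inside a trigger function only NEW and OLD are visible directly;
    -- other tables must be queried explicitly.
    select totvoters into tot
    from ballotbox
    where bno = new.bno;

    if new.nofvotes > tot then
        raise notice 'more voters than allowed';
        return old;  -- NULL on INSERT, which skips the row
    end if;
    return new;
end;
$$ language plpgsql;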

Convert pandas Series/DataFrame to numpy matrix, unpacking coordinates from index

I have a pandas Series like so:
A 1
B 2
C 3
AB 4
AC 5
BA 4
BC 8
CA 5
CB 8
I'd like simple code to convert it to a matrix like this:
1 4 5
4 2 8
5 8 3
Ideally something fairly dynamic and built in, rather than many loops for this 3x3 problem.
You can do it this way.
import pandas as pd
# your raw data
raw_index = 'A B C AB AC BA BC CA CB'.split()
values = [1, 2, 3, 4, 5, 4, 8, 5, 8]
# reformat index
index = [(a[0], a[-1]) for a in raw_index]
multi_index = pd.MultiIndex.from_tuples(index)
df = pd.DataFrame(values, columns=['values'], index=multi_index)
df.unstack()
Out[47]:
values
A B C
A 1 4 5
B 4 2 8
C 5 8 3
For a pd.DataFrame, use the .values member or else the .to_records(...) method.
For a pd.Series, use the .unstack() method, as Jianxun Li said.
import numpy as np
import pandas as pd
# note: 'val' is listed first so the records below unpack as (val, var)
d = pd.DataFrame(data={
    'val': [1, 2, 3, 4, 5, 4, 8, 5, 8],
    'var': ['A', 'B', 'C', 'AB', 'AC', 'BA', 'BC', 'CA', 'CB']})
# Here are some options for converting to np.matrix ...
np.matrix( d.to_records(index=False) )
# matrix([[(1, 'A'), (2, 'B'), (3, 'C'), (4, 'AB'), (5, 'AC'), (4, 'BA'),
# (8, 'BC'), (5, 'CA'), (8, 'CB')]],
# dtype=[('val', '<i8'), ('var', 'O')])
# Here you can add code to rearrange it, e.g.
[(val, idx[0], idx[-1]) for val,idx in d.to_records(index=False) ]
# [(1, 'A', 'A'), (2, 'B', 'B'), (3, 'C', 'C'), (4, 'A', 'B'), (5, 'A', 'C'), (4, 'B', 'A'), (8, 'B', 'C'), (5, 'C', 'A'), (8, 'C', 'B')]
# and if you need numeric row- and col-indices:
[ (val, 'ABCDEF...'.index(idx[0]), 'ABCDEF...'.index(idx[-1]) ) for val,idx in d.to_records(index=False) ]
# [(1, 0, 0), (2, 1, 1), (3, 2, 2), (4, 0, 1), (5, 0, 2), (4, 1, 0), (8, 1, 2), (5, 2, 0), (8, 2, 1)]
# you can sort by the (row, col) indices:
sorted([(val, 'ABCDEF...'.index(idx[0]), 'ABCDEF...'.index(idx[-1])) for val, idx in d.to_records(index=False)], key=lambda x: x[1:3])
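Putting those pieces together, a minimal end-to-end sketch (values and labels taken from the question, the MultiIndex trick from the first answer):

import pandas as pd

# original Series with concatenated row/col labels
s = pd.Series([1, 2, 3, 4, 5, 4, 8, 5, 8],
              index='A B C AB AC BA BC CA CB'.split())

# split each label into a (row, col) pair; single letters land on the diagonal
s.index = pd.MultiIndex.from_tuples([(k[0], k[-1]) for k in s.index])

# unstack into a square DataFrame, then take the underlying numpy array
mat = s.unstack().values
print(mat)
# [[1 4 5]
#  [4 2 8]
#  [5 8 3]]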