How many unique values are there? Put it all in a table - SQL

I would like to query in SQL how many unique values there are per column and how many rows there are in total, so that I get a result like the one at the bottom. In Python I could do the following:
import pandas as pd

d = {'sellerid': [1, 1, 1, 2, 2, 3, 3, 3],
     'modelnumber': [85, 45, 85, 12, 85, 74, 85, 12],
     'modelgroup': [2, 3, 2, 1, 2, 3, 2, 1]}
df = pd.DataFrame(data=d)
display(df.head(10))
df['Dataframe']='df'
unique_sellerid = df['sellerid'].nunique()
print("unique_sellerid", unique_sellerid)
unique_modelnumber = df['modelnumber'].nunique()
print("unique_modelnumber", unique_modelnumber)
unique_modelgroup = df['modelgroup'].nunique()
print("unique_modelgroup", unique_modelgroup)
total_rows = df.shape[0]
print("total_rows", total_rows)
[OUT]
unique_sellerid 3
unique_modelnumber 4
unique_modelgroup 3
total_rows 8
I want a query like
Here is the dummy table
CREATE TABLE cars (
sellerid INT NOT NULL,
modelnumber INT NOT NULL,
modelgroup INT
);
INSERT INTO cars
(sellerid , modelnumber, modelgroup )
VALUES
(1, 85, 2),
(1, 45, 3),
(1, 85, 2),
(2, 12, 1),
(2, 85, 2),
(3, 74, 3),
(3, 85, 2),
(3, 12, 1);

You could use the count(distinct column) aggregate function, like:
select
count(distinct sellerid) as nunique_sellerid,
count(distinct modelnumber) as nunique_modelnumber,
count(distinct modelgroup) as nunique_modelgroup,
count(1) as nb_rows
from cars
In pandas, you can also apply nunique() to the whole DataFrame rather than column by column: df.nunique()
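The whole query can be checked end to end against the dummy table; a minimal sketch using Python's stdlib sqlite3 with an in-memory database:

```python
import sqlite3

# Build the dummy "cars" table from the question in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (sellerid INT, modelnumber INT, modelgroup INT)")
conn.executemany("INSERT INTO cars VALUES (?, ?, ?)", [
    (1, 85, 2), (1, 45, 3), (1, 85, 2), (2, 12, 1),
    (2, 85, 2), (3, 74, 3), (3, 85, 2), (3, 12, 1),
])

# COUNT(DISTINCT col) per column, plus the total row count.
row = conn.execute("""
    SELECT COUNT(DISTINCT sellerid),
           COUNT(DISTINCT modelnumber),
           COUNT(DISTINCT modelgroup),
           COUNT(*)
    FROM cars
""").fetchone()
print(row)  # (3, 4, 3, 8)
```

This matches the nunique/total_rows output of the pandas snippet above.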

Related

Pure PostgreSQL replacement for PL/R sample() function?

Our new database does not (and will not) support PL/R usage, which we rely on extensively to implement a random weighted sample function:
CREATE OR REPLACE FUNCTION sample(
ids bigint[],
size integer,
seed integer DEFAULT 1,
with_replacement boolean DEFAULT false,
probabilities numeric[] DEFAULT NULL::numeric[])
RETURNS bigint[]
LANGUAGE 'plr'
COST 100
VOLATILE
AS $BODY$
set.seed(seed)
ids = as.integer(ids)
if (length(ids) == 1) {
s = rep(ids,size)
} else {
s = sample(ids,size, with_replacement,probabilities)
}
return(s)
$BODY$;
Is there a pure-SQL approach to this same function? This post shows an approach that selects a single random row, but it lacks the ability to sample multiple groups at once.
As far as I know, SQL Fiddle does not support PL/R, so see below for a quick replication example:
CREATE TABLE test
(category text, uid integer, weight numeric)
;
INSERT INTO test
(category, uid, weight)
VALUES
('a', 1, 45),
('a', 2, 10),
('a', 3, 25),
('a', 4, 100),
('a', 5, 30),
('b', 6, 20),
('b', 7, 10),
('b', 8, 80),
('b', 9, 40),
('b', 10, 15),
('c', 11, 20),
('c', 12, 10),
('c', 13, 80),
('c', 14, 40),
('c', 15, 15)
;
SELECT category,
unnest(diffusion_shared.sample(array_agg(uid ORDER BY uid),
1,
1,
True,
array_agg(weight ORDER BY uid))
) as uid
FROM test
WHERE category IN ('a', 'b')
GROUP BY category;
Which outputs:
category uid
'a' 4
'b' 8
Any ideas?
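For reference, what the PL/R body above computes can be sketched in Python with the stdlib random module. This is a rough analogue for experimentation, not a drop-in replacement; the without-replacement branch here ignores the weights:

```python
import random

def sample_ids(ids, size, seed=1, with_replacement=False, probabilities=None):
    """Rough Python analogue of the PL/R sample() function above."""
    rng = random.Random(seed)
    if len(ids) == 1:
        return list(ids) * size  # single id: repeat it, as the R body does
    if with_replacement:
        # Weighted sampling with replacement, like R's sample(..., prob=...).
        return rng.choices(ids, weights=probabilities, k=size)
    # Caveat: unweighted without replacement; weights are not applied here.
    return rng.sample(ids, size)

picked = sample_ids([1, 2, 3, 4, 5], size=2, seed=1,
                    with_replacement=True, probabilities=[45, 10, 25, 100, 30])
print(picked)
```

A per-group SQL equivalent would still need a lateral join or window-function trick on the Postgres side.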

pandas Multiindex - set_index with list of tuples

I ran into the following issue: I have an existing MultiIndex and want to replace a single level with a list of tuples, but I get a strange error.
Code to reproduce:
idx = pd.MultiIndex.from_tuples([(1, u'one'), (1, u'two'),
(2, u'one'), (2, u'two')],
names=['foo', 'bar'])
idx.set_levels([3, 5], level=0) # works fine
idx.set_levels([(1,2),(3,4)], level=0) #TypeError: Levels must be list-like
Can anyone comment:
1) What's the issue?
2) What's the best method to replace the index values (int values -> tuple values)?
Thanks!
For me, constructing a new index works:
idx = pd.MultiIndex.from_product([[(1,2),(3,4)], idx.levels[1]], names=idx.names)
print (idx)
MultiIndex(levels=[[(1, 2), (3, 4)], ['one', 'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
names=['foo', 'bar'])
EDIT1:
df = pd.DataFrame({'A':list('abcdef'),
'B':[1,2,1,2,2,1],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')}).set_index(['B','C'])
# dynamically generate the dictionary from a list of tuples
new = [(1, 2), (3, 4)]
d = dict(zip(df.index.levels[0], new))
print (d)
{1: (1, 2), 2: (3, 4)}
# explicitly define the dictionary
d = {1:(1,2), 2:(3,4)}
# rename the first level of the MultiIndex
df = df.rename(index=d, level=0)
print (df)
            A  D  E  F
B      C
(1, 2) 7    a  1  5  a
(3, 4) 8    b  3  3  a
(1, 2) 9    c  5  6  a
(3, 4) 4    d  7  9  b
       2    e  1  2  b
(1, 2) 3    f  0  4  b
EDIT:
import numpy as np

new = [(1, 2), (3, 4)]
lvl0 = list(map(tuple, np.array(new)[pd.factorize(idx.get_level_values(0))[0]].tolist()))
print (lvl0)
[(1, 2), (1, 2), (3, 4), (3, 4)]
idx = pd.MultiIndex.from_arrays([lvl0, idx.get_level_values(1)], names=idx.names)
print (idx)
MultiIndex(levels=[[(1, 2), (3, 4)], ['one', 'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
names=['foo', 'bar'])
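The mapping idea in the EDIT can be condensed; a sketch in current pandas (note the constructor now reports codes= rather than the labels= shown above):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, 'one'), (1, 'two'), (2, 'one'), (2, 'two')],
    names=['foo', 'bar'])

# Map the old level-0 integers to tuples, then rebuild the index from arrays.
mapping = {1: (1, 2), 2: (3, 4)}
lvl0 = [mapping[v] for v in idx.get_level_values(0)]
idx2 = pd.MultiIndex.from_arrays([lvl0, idx.get_level_values(1)],
                                 names=idx.names)
print(list(idx2.get_level_values(0)))
# [(1, 2), (1, 2), (3, 4), (3, 4)]
```

The dict comprehension avoids the numpy/factorize round trip when an explicit mapping is already at hand.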

Oracle SQL error: maximum number of expressions

Could you please help me with this issue? I'm getting an error in Oracle SQL:
ORA-01795: maximum number of expressions in a list is 1000
I'm passing values like:
and test in (1, 2, 3.....1000)
Try splitting your query into multiple IN clauses, like below:
SELECT *
FROM table_name
WHERE test IN (1,2,3,....500)
OR test IN (501, 502, ......1000);
You can try workarounds:
Split single in into several ones:
select ...
from ...
where test in (1, ..., 999) or
test in (1000, ..., 1999) or
...
test in (9000, ..., 9999)
Put values into a (temporary?) table, say TestTable:
select ...
from ...
where test in (select TestField
from TestTable)
Edit: As I can see, the main difficulty is building such a query. Let's implement it in C#. We are given a collection of ids:
// Test case ids are in [1..43] range
IEnumerable<int> Ids = Enumerable.Range(1, 43);
// Test case: 7; in an actual Oracle query you would probably set it to 100 or 1000
int chunkSize = 7;
string fieldName = "test";
string filterText = string.Join(" or " + Environment.NewLine, Ids
.Select((value, index) => new {
value = value,
index = index
})
.GroupBy(item => item.index / chunkSize)
.Select(chunk =>
$"{fieldName} in ({string.Join(", ", chunk.Select(item => item.value))})"));
if (!string.IsNullOrEmpty(filterText))
filterText = $"and \r\n({filterText})";
string sql =
$@"select MyField
from MyTable
where (1 = 1) {filterText}";
Test:
Console.Write(sql);
Outcome:
select MyField
from MyTable
where (1 = 1) and
(test in (1, 2, 3, 4, 5, 6, 7) or
test in (8, 9, 10, 11, 12, 13, 14) or
test in (15, 16, 17, 18, 19, 20, 21) or
test in (22, 23, 24, 25, 26, 27, 28) or
test in (29, 30, 31, 32, 33, 34, 35) or
test in (36, 37, 38, 39, 40, 41, 42) or
test in (43))
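The same chunking idea can be sketched in Python; build_in_filter is a hypothetical helper name, not part of any library:

```python
def build_in_filter(field, ids, chunk_size=1000):
    """Split ids into IN (...) clauses of at most chunk_size values,
    staying under Oracle's 1000-expression limit (ORA-01795)."""
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]
    clauses = [f"{field} in ({', '.join(map(str, c))})" for c in chunks]
    return "(" + " or ".join(clauses) + ")"

# Small chunk size to make the splitting visible.
print(build_in_filter("test", list(range(1, 8)), chunk_size=3))
# (test in (1, 2, 3) or test in (4, 5, 6) or test in (7))
```

In production code you would bind the values as parameters rather than interpolating them, but the string form mirrors the C# snippet above.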

Insert trigger SQL: missing FROM-clause entry for table

I'm using Postgres and I'm trying to create a trigger that fires when new values are inserted into or updated in a table.
Here are the trigger and the function:
create or replace function trigf1() returns trigger as $$
begin
if (ballotbox.totvoters>votes.nofvotes) then
raise notice 'more voters than allowed';
return old;
else return new;
end if;
end;
$$ language plpgsql;
create trigger T1
Before insert or update on votes
for each row
execute procedure trigf1();
When I'm trying to update the tables "votes" and "ballotBox" I'm getting the error:
ERROR: missing FROM-clause entry for table "ballotbox"
LINE 1: SELECT (ballotbox.totvoters > votes.nofvotes)
^
QUERY: SELECT (ballotbox.totvoters > votes.nofvotes)
CONTEXT: PL/pgSQL function trigf1() line 3 at IF
I don't know if it's needed but here are the create tables and the insert values:
create table ballotBox
(bno integer,
cid numeric(4,0),
street varchar(20),
hno integer,
totvoters integer,
primary key (bno),
foreign key (cid) references city);
create table votes
(cid numeric(4,0),
bno integer,
pid numeric(3,0),
nofvotes integer,
foreign key (cid) references city,
foreign key (bno) references ballotBox,
foreign key (pid) references party,
check (nofvotes >= 0));
insert into ballotBox values
(1, 1, 'street1', 10, 1500),
(2, 1, 'street2', 15, 490),
(3, 1, 'street2', 15, 610),
(4, 1, 'street2', 15, 650),
(5, 2, 'street3', 10, 900),
(6, 2, 'street3', 55, 800),
(7, 2, 'street4', 67, 250),
(8, 2, 'street4', 67, 990),
(9, 2, 'street5', 5, 600),
(10, 3, 'street1', 72, 1000),
(11, 3, 'street6', 25, 610),
(12, 3, 'street6', 25, 600),
(13, 4, 'street2', 3, 550),
(14, 4, 'street7', 15, 500),
(15, 5, 'street8', 44, 1100),
(16, 5, 'street9', 7, 710),
(17, 5, 'street10', 13, 950);
insert into votes values
(1, 1, 200, 100),
(1, 1, 210, 220),
(1, 1, 220, 2),
(1, 1, 230, 400),
(1, 1, 240, 313),
(1, 1, 250, 99),
(2, 1, 200, 55),
(2, 1, 210, 150),
(2, 1, 220, 2),
(2, 1, 230, 16),
(2, 1, 240, 210);
Try this:
if ((select totvoters from ballotbox) > (select nofvotes from votes))
You cannot reference other tables by name inside a trigger condition; you have to select from them. Note you will also need WHERE clauses so each subquery returns the single relevant row (e.g. where ballotbox.bno = new.bno).
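The same guard can be sketched as an SQLite trigger driven from Python's stdlib; the syntax differs from plpgsql (no trigger function, and RAISE(ABORT, ...) instead of RAISE NOTICE), but the shape of the check is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ballotBox (bno INTEGER PRIMARY KEY, totvoters INTEGER);
CREATE TABLE votes (bno INTEGER, pid INTEGER, nofvotes INTEGER);
INSERT INTO ballotBox VALUES (1, 100);

-- Reject an insert whose vote count exceeds the ballot box capacity,
-- correlating the subquery to the incoming row via NEW.bno.
CREATE TRIGGER t1 BEFORE INSERT ON votes
WHEN (SELECT totvoters FROM ballotBox WHERE bno = NEW.bno) < NEW.nofvotes
BEGIN
    SELECT RAISE(ABORT, 'more votes than allowed');
END;
""")

conn.execute("INSERT INTO votes VALUES (1, 200, 50)")       # fits: accepted
try:
    conn.execute("INSERT INTO votes VALUES (1, 210, 150)")  # exceeds 100: rejected
except sqlite3.IntegrityError as e:
    print(e)
```

The key point carried over to the Postgres version: the trigger body must SELECT the value it compares against, correlated to the NEW row.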

How to select unique subsequences in SQL?

In generic terms, I have a sequence of events from which I'd like to select unique non-repeating sequences using MS SQL Server 2008 R2.
Specifically in this case, each test has a series of recordings, each of which have a specific sequence of stimuli. I'd like to select the unique sequences of stimuli from inside the recordings of one test, insert them into another table and assign the sequence group id to the original table.
DECLARE #Sequence TABLE
([ID] INT
,[TestID] INT
,[StimulusID] INT
,[RecordingID] INT
,[PositionInRecording] INT
,[SequenceGroupID] INT
)
INSERT #Sequence
VALUES
(1, 1, 101, 1000, 1, NULL),
(2, 1, 102, 1000, 2, NULL),
(3, 1, 103, 1000, 3, NULL),
(4, 1, 103, 1001, 1, NULL),
(5, 1, 103, 1001, 2, NULL),
(6, 1, 101, 1001, 3, NULL),
(7, 1, 102, 1002, 1, NULL),
(8, 1, 103, 1002, 2, NULL),
(9, 1, 101, 1002, 3, NULL),
(10, 1, 102, 1003, 1, NULL),
(11, 1, 103, 1003, 2, NULL),
(12, 1, 101, 1003, 3, NULL),
(13, 2, 106, 1004, 1, NULL),
(14, 2, 107, 1004, 2, NULL),
(15, 2, 107, 1005, 1, NULL),
(16, 2, 106, 1005, 2, NULL)
After correctly identifying the unique sequences, the results should look like this
DECLARE #SequenceGroup TABLE
([ID] INT
,[TestID] INT
,[SequenceGroupName] NVARCHAR(50)
)
INSERT #SequenceGroup VALUES
(1, 1, '101-102-103'),
(2, 1, '103-103-101'),
(3, 1, '102-103-101'),
(4, 2, '106-107'),
(5, 2, '107-106')
DECLARE #OutcomeSequence TABLE
([ID] INT
,[TestID] INT
,[StimulusID] INT
,[RecordingID] INT
,[PositionInRecording] INT
,[SequenceGroupID] INT
)
INSERT #OutcomeSequence
VALUES
(1, 1, 101, 1000, 1, 1),
(2, 1, 102, 1000, 2, 1),
(3, 1, 103, 1000, 3, 1),
(4, 1, 103, 1001, 1, 2),
(5, 1, 103, 1001, 2, 2),
(6, 1, 101, 1001, 3, 2),
(7, 1, 102, 1002, 1, 3),
(8, 1, 103, 1002, 2, 3),
(9, 1, 101, 1002, 3, 3),
(10, 1, 102, 1003, 1, 3),
(11, 1, 103, 1003, 2, 3),
(12, 1, 101, 1003, 3, 3),
(13, 2, 106, 1004, 1, 4),
(14, 2, 107, 1004, 2, 4),
(15, 2, 107, 1005, 1, 5),
(16, 2, 106, 1005, 2, 5)
This is fairly easy to do in MySQL and other databases that support some version of GROUP_CONCAT functionality. It's apparently a good deal harder in SQL Server. Here's a stackoverflow question that discusses one technique. Here's another with some information about SQL Server 2008 specific solutions that might also get you started.
This will do it. I had to add a column to #SequenceGroup.
DECLARE #Sequence TABLE
([ID] INT
,[TestID] INT
,[StimulusID] INT
,[RecordingID] INT
,[PositionInRecording] INT
,[SequenceGroupID] INT
)
INSERT #Sequence
VALUES
(1, 1, 101, 1000, 1, NULL),
(2, 1, 102, 1000, 2, NULL),
(3, 1, 103, 1000, 3, NULL),
(4, 1, 103, 1001, 1, NULL),
(5, 1, 103, 1001, 2, NULL),
(6, 1, 101, 1001, 3, NULL),
(7, 1, 102, 1002, 1, NULL),
(8, 1, 103, 1002, 2, NULL),
(9, 1, 101, 1002, 3, NULL),
(10, 1, 102, 1003, 1, NULL),
(11, 1, 103, 1003, 2, NULL),
(12, 1, 101, 1003, 3, NULL),
(13, 2, 106, 1004, 1, NULL),
(14, 2, 107, 1004, 2, NULL),
(15, 2, 107, 1005, 1, NULL),
(16, 2, 106, 1005, 2, NULL)
DECLARE #SequenceGroup TABLE
([ID] INT IDENTITY(1, 1)
,[TestID] INT
,[SequenceGroupName] NVARCHAR(50)
,[RecordingID] INT
)
insert into #SequenceGroup
select TestID, (stuff((select '-' + cast([StimulusID] as nvarchar(100))
from #Sequence t1
where t2.RecordingID = t1.RecordingID
order by t1.PositionInRecording
for xml path('')), 1, 1, '')), RecordingID
from #Sequence t2
group by RecordingID, TestID
order by RecordingID
select * from #SequenceGroup
update #Sequence
set SequenceGroupID = sg.ID
from #Sequence s
join #SequenceGroup sg on s.RecordingID=sg.RecordingID and s.TestID=sg.testid
select * from #Sequence
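For comparison, the per-recording concatenation that GROUP_CONCAT gives you in MySQL can be done client-side when ordering guarantees inside the aggregate are unclear; a sketch using stdlib sqlite3 for the data and itertools.groupby for the join:

```python
import sqlite3
from itertools import groupby

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seq (TestID INT, StimulusID INT, RecordingID INT, Pos INT)")
conn.executemany("INSERT INTO seq VALUES (?, ?, ?, ?)", [
    (1, 101, 1000, 1), (1, 102, 1000, 2), (1, 103, 1000, 3),
    (2, 106, 1004, 1), (2, 107, 1004, 2),
])

# Fetch rows ordered by recording and position, then join each
# recording's stimuli with '-' to form the sequence-group name.
rows = conn.execute(
    "SELECT RecordingID, StimulusID FROM seq ORDER BY RecordingID, Pos").fetchall()
groups = {rec: '-'.join(str(s) for _, s in grp)
          for rec, grp in groupby(rows, key=lambda r: r[0])}
print(groups)  # {1000: '101-102-103', 1004: '106-107'}
```

Deduplicating identical sequence strings across recordings (so '102-103-101' gets one group id) is then a simple dict keyed on the string.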