With pandas, find the row with max(col) in each group

With pandas, given this table:
CREATE TABLE ForgeRock
(`id` int, `productName` varchar(7), `score` int)
;
INSERT INTO ForgeRock
(`id`, `productName`, `score`)
VALUES
(1, 'OpenIDM', '8'),
(2, 'OpenAM', '3'),
(3, 'OpenDJ', '7'),
(4, 'OpenDJ', '4'),
(5, 'OpenAM', '9')
;
the wanted result is:
1 OpenIDM 8
3 OpenDJ 7
5 OpenAM 9
To get the max score in each group:
df.groupby('productName')['score'].max()
Result is:
OpenAM 9
OpenDJ 7
OpenIDM 8
The result is right, but I need the full columns, including id.
How can I get the max score together with its id and productName?

You want to use idxmax instead of max. That way, you get the index values at which the maximums occur, and you can then use them to access the entire rows of the dataframe.
max_idx = df.groupby('productName')['score'].idxmax()  # index label of each group's max
print(df.loc[max_idx])
id productName score
4 5 OpenAM 9
2 3 OpenDJ 7
0 1 OpenIDM 8
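For reference, since the question frames the data in SQL: the SQL-side equivalent of this groupby/idxmax pattern is a join against the per-group maxima (a sketch against the ForgeRock table from the question):
-- Keep the rows whose score equals their group's maximum score.
SELECT f.id, f.productName, f.score
FROM ForgeRock AS f
JOIN (
    SELECT productName, MAX(score) AS max_score
    FROM ForgeRock
    GROUP BY productName
) AS m
  ON m.productName = f.productName
 AND m.max_score = f.score;
One difference: the join returns every tied row in a group, whereas idxmax returns only the first index at which the maximum occurs.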


One number in each column in PostgreSQL [duplicate]

I have a question about how to put one number in each row.
Right now, the data looks like this:
| Column A | Column B |
| --- | --- |
| 1 | 1, 2, 3, 4, 5, 6, 8, 10 |
| 2 | 4, 6, 7, 9, 11, 12 |
My goal is to make the table look like this:
| Column A | Column B |
| --- | --- |
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |

etc.
create table pp
(
id int,
toppings int);
insert into pp
(id, toppings)
values
(1,1),
(1,2),
(1,3),
(1,4),
(1,5),
(1,6),
(1,8),
(1,10),
(2,4),
(2,6),
(2,7),
(2,9),
(2,11),
(2,12);
I know this works, but I'm looking for an easier way.
select a, unnest(b)
from pp;
unnest() transforms an array into a set of rows.
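Note that the query above assumes the source table has a text column a and an array column b. If Column B is instead stored as a plain comma-separated string, convert it to an array first; a minimal sketch, assuming a source table named raw_pp (a hypothetical name) with a text column b:
-- string_to_array() turns the comma-separated text into an array,
-- which unnest() then expands into one row per element
select a, unnest(string_to_array(b, ', '))::int as b
from raw_pp;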

INSERT rows into SQL Server by looping through a column with numbers

Let's say I have a very basic table:
| DAY_ID | Value | Inserts |
| --- | --- | --- |
| 5 | 8 | 2 |
| 4 | 3 | 0 |
| 3 | 3 | 0 |
| 2 | 4 | 1 |
| 1 | 8 | 0 |
I want to be able to "loop" through the Inserts column and add that many new rows.
For each added row, I want DAY_ID to be decreased by 1 and Value to remain the same, Inserts column is irrelevant we can set to 0.
So 2 new rows should be added from DAY_ID = 5 and Value = 8, and 1 new row with DAY_ID = 2 and Value = 4. The final output of the new rows would be:
| DAY_ID | Value | Inserts |
| --- | --- | --- |
| (5-1) | 8 | 0 |
| (5-2) | 8 | 0 |
| (2-1) | 4 | 0 |
I haven't tried much in SQL Server; I was able to create a solution in R and Python using arrays, but I'm really hoping I can make something work in SQL Server for this project.
I think this can be done using a loop in SQL.
Looping is generally not the way you solve any problems in SQL - SQL is designed and optimized to work with sets, not one row at a time.
Consider this source table:
CREATE TABLE dbo.src(DAY_ID int, Value int, Inserts int);
INSERT dbo.src VALUES
(5, 8, 2),
(4, 3, 0),
(3, 3, 0),
(2, 4, 1),
(1, 8, 0);
There are many ways to "explode" a set based on a single value. One is to split a set of commas (replicated to the length of the value, less 1).
-- INSERT dbo.src(DAY_ID, Value, Inserts)
SELECT
DAY_ID = DAY_ID - ROW_NUMBER() OVER (PARTITION BY DAY_ID ORDER BY @@SPID),
src.Value,
Inserts = 0
FROM dbo.src
CROSS APPLY STRING_SPLIT(REPLICATE(',', src.Inserts-1), ',') AS v
WHERE src.Inserts > 0;
Output:
| DAY_ID | Value | Inserts |
| --- | --- | --- |
| 1 | 4 | 0 |
| 4 | 8 | 0 |
| 3 | 8 | 0 |
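On SQL Server 2022 and later, GENERATE_SERIES offers a more direct way to explode each row, with no string trickery; a sketch under that version assumption:
-- GENERATE_SERIES(1, Inserts) yields one row per value 1..Inserts,
-- so each source row is replicated exactly Inserts times.
SELECT
    DAY_ID = s.DAY_ID - n.value,
    s.Value,
    Inserts = 0
FROM dbo.src AS s
CROSS APPLY GENERATE_SERIES(1, s.Inserts) AS n
WHERE s.Inserts > 0;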

Bigquery query to get sum of values of one column based on another column

I want to write a BigQuery query that fills a column with the sum of the values of another column, based on a "like" condition.
In the table below, the starts_with_count column is what I want to fill. I have added the expected values for this column manually to show my expectation; the other column values are already present.
The starts_with_count value for a row is sum(full_count) over all rows whose link starts with that row's link.
| company | link | full_count | starts_with_count (expected) |
| --- | --- | --- | --- |
| abc | http://www.abc.net1 | 1 | 15 (= sum(full_count) where link like 'http://www.abc.net1%') |
| abc | http://www.abc.net1/page1 | 2 | 9 (= sum(full_count) where link like 'http://www.abc.net1/page1%') |
| abc | http://www.abc.net1/page1/folder1 | 3 | 3 (= sum(full_count) where link like 'http://www.abc.net1/page1/folder1%') |
| abc | http://www.abc.net1/page1/folder2 | 4 | 4 |
| abc | http://www.abc.net1/page2 | 5 | 5 |
| xyz | http://www.xyz.net1/ | 6 | 21 |
| xyz | http://www.xyz.net1/page1/ | 7 | 15 |
| xyz | http://www.xyz.net1/page1/file1 | 8 | 8 |
Try this:
WITH sample AS (
  SELECT * FROM UNNEST([
    STRUCT('abc' AS company, 'http://www.abc.net1' AS link, 1 AS full_count),
    ('abc', 'http://www.abc.net1/page1', 2),
    ('abc', 'http://www.abc.net1/page1/folder1', 3),
    ('abc', 'http://www.abc.net1/page1/folder2', 4),
    ('abc', 'http://www.abc.net1/page2', 5),
    ('xyz', 'http://www.xyz.net1/', 6),
    ('xyz', 'http://www.xyz.net1/page1/', 7),
    ('xyz', 'http://www.xyz.net1/page1/file1', 8)
  ])
)
SELECT first.company, first.link, SUM(second.full_count) AS starts_with_count
FROM sample first, sample second
WHERE STARTS_WITH(second.link, first.link)
GROUP BY 1, 2;
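STARTS_WITH(second.link, first.link) expresses the same condition as the like formulas in the question; for comparison, here is the same join written with LIKE and CONCAT (a sketch reusing the sample CTE above):
-- Equivalent predicate spelled the way the question describes it.
SELECT first.company, first.link, SUM(second.full_count) AS starts_with_count
FROM sample AS first
JOIN sample AS second
  ON second.link LIKE CONCAT(first.link, '%')
GROUP BY 1, 2;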
Output: the expected starts_with_count values shown above.
Another option avoids the self-join by collecting each company's (link, full_count) pairs into an array and summing within it:
select * except(links),
  ( select sum(full_count)
    from t.links
    where starts_with(link, t.link)
  ) starts_with_count
from (
  select *,
    array_agg(struct(link, full_count)) over(partition by company) links
  from your_table
) t
If applied to the sample data in your question, it produces the same output. For the simple/dummy example you provided, the performance improvement is significant, but which one to use really depends on your real data. To analyze, use the EXECUTION DETAILS tab.

Dynamic pivoting with Informix SQL

This is my data:
date id value
1/1/2021 a 5
1/1/2021 b 10
1/1/2021 c 7
1/1/2021 d 5
1/1/2021 e 6
1/2/2021 a 4
1/2/2021 b 8
1/2/2021 c 12
1/2/2021 d 3
1/2/2021 e 5
What I want to get is this:
> 1/1/2021 1/2/2021
> a 5 4
> b 10 8
> c 7 12
> d 5 3
> e 6 5
I found a solution for doing this when the date column values are fixed, but they aren't; there can be other values next time. Also, I found some solutions with dynamic SQL, but none of them works with Informix (at least, I wasn't able to replicate their results).
How can this be done in Informix?
You can use dynamic SQL — or text manipulation of SQL results — to build a moderately complex SQL statement that returns the data you are after.
The answer below assumes that the table name is Data and that there is a primary key (unique) constraint on the combination of the date and id columns — assumptions that address the questions in my comment:
How many dates might you be working with? You show 2, but is it just 2 or could it be 7, 31, 365, …? Do you always have all 5 of the ID entries a .. e for each date? Is there ever any repetition of the ID values on a given date?
and answers in your response:
I don't know how many dates I might be working with, but probably from 2 to 12, shouldn't be more than 12 dates. ID's will vary too, and some dates might have them all, others don't.
Note: Informix allows you to create this table:
CREATE TABLE data
(
date DATE NOT NULL,
id CHAR(1) NOT NULL,
value INTEGER NOT NULL,
PRIMARY KEY(DATE, id)
);
Many DBMS would require the date column name to be presented as a delimited identifier enclosed in double quotes (and case-sensitive — "date"), or use a proprietary extension such as enclosing the identifier in square brackets ([date]), both in the CREATE TABLE statement and in the subsequent SQL. Informix does not — and manages to distinguish between the letters DATE as column name, data type and function name correctly.
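For comparison, a sketch of the same DDL in a DBMS where DATE is reserved and the column name must be delimited (Oracle is one example, named here as an assumption, not something the answer above tested):
-- "date" must be written as a delimited, case-sensitive identifier here.
CREATE TABLE data
(
    "date" DATE NOT NULL,
    id CHAR(1) NOT NULL,
    value INTEGER NOT NULL,
    PRIMARY KEY("date", id)
);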
This answer uses what I call TDQD — Test-Driven Query Design.
Relevant dates
SELECT UNIQUE date FROM data
This gives you the dates that will appear as columns. It is probable that you'll filter the data more — such as:
SELECT UNIQUE date
FROM data
WHERE date BETWEEN (TODAY - 7) AND (TODAY - 1)
ORDER BY date
You might format the results to give strings usable as column names (and using a different date range):
SELECT UNIQUE
date AS column_date,
TO_CHAR(date, 'd%Y_%m_%d') AS column_name
FROM data
WHERE date BETWEEN DATE('2021-01-01') AND DATE('2021-01-31')
ORDER BY column_date
This assumes you have set the Informix-specific environment variable DBDATE="Y4MD-" so that DATE values are presented and interpreted like DATETIME YEAR TO DAY values are.
Relevant ID values
SELECT UNIQUE id
FROM data
WHERE date BETWEEN DATE('2021-01-01') AND DATE('2021-01-31')
ORDER BY id
This will give you the list of ID values in column 1 of the final result. However, it isn't crucial to the generated SQL.
Generate SQL for Result Table
SELECT id,
MAX(CASE WHEN date = DATE('2021-01-01') THEN value ELSE NULL END) AS d2021_01_01,
MAX(CASE WHEN date = DATE('2021-01-02') THEN value ELSE NULL END) AS d2021_01_02,
MAX(CASE WHEN date = DATE('2021-01-03') THEN value ELSE NULL END) AS d2021_01_03,
MAX(CASE WHEN date = DATE('2021-01-04') THEN value ELSE NULL END) AS d2021_01_04,
MAX(CASE WHEN date = DATE('2021-01-05') THEN value ELSE NULL END) AS d2021_01_05,
MAX(CASE WHEN date = DATE('2021-01-06') THEN value ELSE NULL END) AS d2021_01_06,
MAX(CASE WHEN date = DATE('2021-01-07') THEN value ELSE NULL END) AS d2021_01_07,
MAX(CASE WHEN date = DATE('2021-01-08') THEN value ELSE NULL END) AS d2021_01_08
FROM data
GROUP BY id
ORDER BY id;
This SQL is built using the column date and column name values from the 'relevant dates' query to generate the MAX(CASE … END) AS dYYYY_MM_DD clauses in the select-list. That has to be done outside SQL — using some program to read the relevant date information and produce the corresponding SQL.
For example, if the output of the last 'relevant dates' query is in the file date.columns, this shell script would generate the requisite SQL:
printf "SELECT id"
while read column_date column_name
do
printf ",\n MAX(CASE WHEN date = DATE('%s') THEN value ELSE NULL END) AS %s" $column_date $column_name
done < date.columns
printf "\n FROM data\n GROUP BY id\n ORDER BY id;\n"
The only difference here is that the column for the date 2021-01-08 is omitted, because that date is never selected by the SQL (it is not present in the date.columns file).
You can use any appropriate tools to run some SQL to generate the required list of dates and give the appropriate values for column_date and column_name and then format the data into an SQL statement as shown.
Sample Data
INSERT INTO data VALUES('2021-01-01', 'a', 5);
INSERT INTO data VALUES('2021-01-01', 'b', 10);
INSERT INTO data VALUES('2021-01-01', 'c', 7);
INSERT INTO data VALUES('2021-01-01', 'd', 5);
INSERT INTO data VALUES('2021-01-01', 'e', 6);
INSERT INTO data VALUES('2021-01-02', 'a', 4);
INSERT INTO data VALUES('2021-01-02', 'b', 8);
INSERT INTO data VALUES('2021-01-02', 'c', 12);
INSERT INTO data VALUES('2021-01-02', 'd', 3);
INSERT INTO data VALUES('2021-01-02', 'e', 5);
INSERT INTO data VALUES('2021-01-03', 'b', 18);
INSERT INTO data VALUES('2021-01-03', 'c', 112);
INSERT INTO data VALUES('2021-01-03', 'd', 13);
INSERT INTO data VALUES('2021-01-03', 'e', 15);
INSERT INTO data VALUES('2021-01-04', 'a', 24);
INSERT INTO data VALUES('2021-01-04', 'c', 212);
INSERT INTO data VALUES('2021-01-04', 'd', 23);
INSERT INTO data VALUES('2021-01-04', 'e', 25);
INSERT INTO data VALUES('2021-01-05', 'a', 34);
INSERT INTO data VALUES('2021-01-05', 'b', 38);
INSERT INTO data VALUES('2021-01-05', 'd', 33);
INSERT INTO data VALUES('2021-01-05', 'e', 35);
INSERT INTO data VALUES('2021-01-06', 'a', 44);
INSERT INTO data VALUES('2021-01-06', 'b', 48);
INSERT INTO data VALUES('2021-01-06', 'c', 412);
INSERT INTO data VALUES('2021-01-06', 'e', 45);
INSERT INTO data VALUES('2021-01-07', 'a', 54);
INSERT INTO data VALUES('2021-01-07', 'c', 512);
INSERT INTO data VALUES('2021-01-07', 'd', 53);
Sample output
Using a Stack Overflow Markdown table:
| id | d2021_01_01 | d2021_01_02 | d2021_01_03 | d2021_01_04 | d2021_01_05 | d2021_01_06 | d2021_01_07 | d2021_01_08 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CHAR(1) | INTEGER | INTEGER | INTEGER | INTEGER | INTEGER | INTEGER | INTEGER | INTEGER |
| a | 5 | 4 | | 24 | 34 | 44 | 54 | |
| b | 10 | 8 | 18 | | 38 | 48 | | |
| c | 7 | 12 | 112 | 212 | | 412 | 512 | |
| d | 5 | 3 | 13 | 23 | 33 | | 53 | |
| e | 6 | 5 | 15 | 25 | 35 | 45 | | |
Tested on a MacBook Pro running macOS 10.14.6 Mojave (yes, antique), using IBM Informix Dynamic Server Version 12.10.FC6 (yes, also antique).

SQL Server: duplicate key from creating unique index

I have a table of stock quotes, and dates:
StockID QuoteID QuoteDay QuoteClose
---------------------------------------
5 95 2018-01-03 1.080
5 96 2018-01-04 1.110
5 97 2018-01-05 1.000
5 98 2018-01-06 1.030
5 99 2018-01-07 1.010
5 100 2018-01-08 0.899
5 101 2018-01-09 0.815
I create a clustered index to manipulate the data, but I am running into duplicate key errors with the index:
CREATE UNIQUE CLUSTERED INDEX MACD_IDX ON #TBL_MACD_LOOP (StockId, QuoteId)
Different combinations of StockID and QuoteID will result in the same output:
For example (StockID, QuoteID) of (5, 11) and (51, 1) both produce an index of 511.
My solution is to add "-" between StockId and QuoteId.
Now (5, 11) produces 5-11 and (51, 1) produces 51-1.
How do I combine strings with values?
No, you are definitely mistaken.
The combinations for (StockId, QuoteId) of (5, 11) and (51, 1) are two DISTINCTLY different pairs of values.
They are NOT combined into a single value (of 511 as you assume) when creating the index entry. Those are two different values and therefore can co-exist in that table - no problem.
To prove this - just run this INSERT statement:
INSERT INTO #TBL_MACD_LOOP(StockId, QuoteId, QuoteDay, QuoteClose)
VALUES (5, 11, '20180505', 42.76), (51, 1, '20180505', 128.07)
Even with your unique index in place, this INSERT works without any trouble at all (assuming you don't already have one of these two pairs of values in your table, of course).
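To see both rows side by side afterwards, a quick verification query (a sketch, assuming the temp table from the question):
-- Index keys are compared column by column, never concatenated,
-- so (5, 11) and (51, 1) are distinct key values and both rows coexist.
SELECT StockId, QuoteId, QuoteDay, QuoteClose
FROM #TBL_MACD_LOOP
WHERE (StockId = 5 AND QuoteId = 11)
   OR (StockId = 51 AND QuoteId = 1);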