Displaying one value for multiple rows in a MultiIndexed dataframe - pandas

I'm interested in presenting the following data in pandas:
metric1 | metric2 || % occurrence | total
-----------------------------------------
A       | 1       || 20           |
        | 2       || 10           | 35
        | 3       || 5            |
-----------------------------------------
B       | 1       || 40           |
        | 2       || 10           | 65
        | 3       || 15           |
(For text search, I'd describe this as presenting a breakdown of a groupby together with the aggregate values of the outer level of a MultiIndex)
I can create all the columns except for the total column: assuming df is a flat table like
metric1 | metric2 | percentage
--------------------------------
A | 1 | 20
A | 2 | 10
A | 3 | 5
B | 1 | 40
B | 2 | 10
B | 3 | 15
I can get most of what I want using
aggregate_df = df.groupby(['metric1', 'metric2']).sum()
And I can get the total values using
aggregate_df.groupby(level=0).sum()
My question is, is there any way to display them together in a single DataFrame?

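You can do this with a MultiIndex, using crosstab followed by stack:
(pd.crosstab(index=df.metric1, columns=df.metric2, values=df.percentage,
             aggfunc='sum', margins=True)
   .set_index('All', append=True)  # move the per-row totals into the index
   .iloc[:-1]                      # drop the 'All' totals row at the bottom
   .stack())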
Out[59]:
metric1  All  metric2
A        35   1          20
              2          10
              3           5
B        65   1          40
              2          10
              3          15
dtype: int64
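As an alternative to crosstab, here is a minimal sketch that computes the total with groupby and transform, then moves it into the index. It is not the method from the answer above. It assumes df already has one row per (metric1, metric2) pair, as in the flat table from the question; the column name total and the variable result are made up for the illustration.

```python
import pandas as pd

# Flat table from the question.
df = pd.DataFrame({
    "metric1": ["A", "A", "A", "B", "B", "B"],
    "metric2": [1, 2, 3, 1, 2, 3],
    "percentage": [20, 10, 5, 40, 10, 15],
})

# transform("sum") returns the group total aligned with the original rows,
# so every row of a metric1 group carries that group's total.
df["total"] = df.groupby("metric1")["percentage"].transform("sum")

# Putting the total into the index makes pandas print it once per group.
result = df.set_index(["metric1", "total", "metric2"])["percentage"]
print(result)
```

If the table can contain several rows per (metric1, metric2) pair, it would need to be aggregated first, for example with the groupby(['metric1', 'metric2']).sum() call from the question.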

Related

How to get columns when using buckets (width_bucket)

I would like to know which rows were placed in each bucket.
SELECT
width_bucket(s.score, sl.mins, sl.maxs, 9) as buckets,
COUNT(*)
FROM scores s
CROSS JOIN scores_limits sl
GROUP BY 1
ORDER BY 1;
My actual return:
buckets | count
---------+-------
1 | 182
2 | 37
3 | 46
4 | 15
5 | 29
7 | 18
8 | 22
10 | 11
| 20
What I expect to return:
SELECT buckets FROM buckets_table [...] WHERE scores.id = 1;
How can I get, for example, the column 'id' of table scores?
I believe you can include the id in an array with array_agg. If I recreate your case with
create table test (id serial, score int);
insert into test(score) values (10),(9),(5),(4),(10),(2),(5),(7),(8),(10);
The data is
id | score
----+-------
1 | 10
2 | 9
3 | 5
4 | 4
5 | 10
6 | 2
7 | 5
8 | 7
9 | 8
10 | 10
(10 rows)
The following query aggregates the ids of each bucket with array_agg:
SELECT
width_bucket(score, 0, 10, 11) as buckets,
COUNT(*) nr_ids,
array_agg(id) agg_ids
FROM test s
GROUP BY 1
ORDER BY 1;
You get
buckets | nr_ids | agg_ids
---------+--------+----------
3 | 1 | {6}
5 | 1 | {4}
6 | 2 | {3,7}
8 | 1 | {8}
9 | 1 | {9}
10 | 1 | {2}
12 | 3 | {1,5,10}
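To see why the rows land in these buckets, here is a rough Python sketch of the same grouping. The width_bucket function below is a hand-written imitation of the PostgreSQL function for the case low < high, not a library call, and ids_by_bucket is a name made up for the illustration.

```python
import math
from collections import defaultdict

def width_bucket(value, low, high, count):
    """Imitate PostgreSQL's width_bucket(value, low, high, count) for low < high."""
    if value < low:
        return 0              # below the range
    if value >= high:
        return count + 1      # at or above the upper bound
    return math.floor((value - low) * count / (high - low)) + 1

# id -> score, same data as the test table above.
scores = {1: 10, 2: 9, 3: 5, 4: 4, 5: 10, 6: 2, 7: 5, 8: 7, 9: 8, 10: 10}

# Equivalent of GROUP BY bucket with array_agg(id).
ids_by_bucket = defaultdict(list)
for row_id, score in scores.items():
    ids_by_bucket[width_bucket(score, 0, 10, 11)].append(row_id)

for bucket in sorted(ids_by_bucket):
    print(bucket, len(ids_by_bucket[bucket]), ids_by_bucket[bucket])
```

Scores equal to the upper bound (10 here) fall into bucket count + 1, which is why ids 1, 5 and 10 end up in bucket 12.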

full outer join in redshift

I have 2 tables A and B with columns, containing some details of students (all columns are integer):
A:
st_id,
st_subject_id,
B:
st_id,
st_subject_id,
st_count1,
st_count2
st_id means student id, st_subject_id is subject id.
For student id 15, there are the following entries:
A:
15 | 1
15 | 2
15 | 3
B:
15 | 1 | 31 | 11
15 | 2 | 30 | 14
15 | 4 | 21 | 6
15 | 5 | 26 | 9
That is 3 subjects in table A and 4 subjects in table B (2 matching table A and 2 extra).
I want to display the final result as:
15 | 1 | 31 | 11
15 | 2 | 30 | 14
15 | 3 | null | null
15 | 4 | 21 | 6
15 | 5 | 26 | 9
Can this be done using full outer join in SQL, or by another method?
I think something like this would suffice, but I can't test it right now.
coalesce returns its first non-null argument, so st_id and st_subject_id are taken from A when the row exists there and from B otherwise.
select
coalesce(A.st_id, B.st_id) st_id,
coalesce(A.st_subject_id, B.st_subject_id) st_subject_id,
B.st_count1,
B.st_count2
from A
full outer join B
on A.st_id = B.st_id and A.st_subject_id = B.st_subject_id
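The effect of the full outer join plus coalesce can be sketched in plain Python. This is only an illustration of the logic, assuming (st_id, st_subject_id) is unique in each table; the variable names are made up, and the sort is added just to get a stable display order (the SQL above would need an ORDER BY for that).

```python
# Rows for student 15, copied from the question.
table_a = [(15, 1), (15, 2), (15, 3)]
table_b = [(15, 1, 31, 11), (15, 2, 30, 14), (15, 4, 21, 6), (15, 5, 26, 9)]

# Index B by the join key (st_id, st_subject_id).
counts_by_key = {(st_id, subj): (c1, c2) for st_id, subj, c1, c2 in table_b}

# A full outer join keeps every key that appears in either table;
# taking the union of the keys plays the role of coalesce on the key columns.
all_keys = sorted(set(table_a) | set(counts_by_key))

# Keys missing from B get (None, None), mirroring the NULL counts.
result = [key + counts_by_key.get(key, (None, None)) for key in all_keys]

for row in result:
    print(row)
```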

Postgres Query Based on Previous and Next Rows

I'm trying to solve the bus routing problem in postgresql which requires visibility of previous and next rows. Here is my solution.
Step 1) Have one edges table which represents all the edges (source and target represent vertices, i.e. bus stops):
postgres=# select id, source, target, cost from busedges;
id | source | target | cost
----+--------+--------+------
1 | 1 | 2 | 1
2 | 2 | 3 | 1
3 | 3 | 4 | 1
4 | 4 | 5 | 1
5 | 1 | 7 | 1
6 | 7 | 8 | 1
7 | 1 | 6 | 1
8 | 6 | 8 | 1
9 | 9 | 10 | 1
10 | 10 | 11 | 1
11 | 11 | 12 | 1
12 | 12 | 13 | 1
13 | 9 | 15 | 1
14 | 15 | 16 | 1
15 | 9 | 14 | 1
16 | 14 | 16 | 1
Step 2) Have a table which represents bus details like from time, to time, edge etc.
NOTE: I have used integer format for "from" and "to" column for faster results as I can do an integer query, but I can replace it with any better format if available.
postgres=# select id, "busedgeId", "busId", "from", "to" from busedgetimes;
id | busedgeId | busId | from | to
----+-----------+-------+-------+-------
18 | 1 | 1 | 33000 | 33300
19 | 2 | 1 | 33300 | 33600
20 | 3 | 2 | 33900 | 34200
21 | 4 | 2 | 34200 | 34800
22 | 1 | 3 | 36000 | 36300
23 | 2 | 3 | 36600 | 37200
24 | 3 | 4 | 38400 | 38700
25 | 4 | 4 | 38700 | 39540
Step 3) Use Dijkstra's algorithm to find the shortest path.
Step 4) Get the upcoming buses from the busedgetimes table, earliest first, for the path found by Dijkstra's algorithm.
Problem: I am finding it difficult to write the query for Step 4.
For example, suppose I get the path as edges 2, 3, 4 to travel from source vertex 2 to target vertex 5 in the above records. Getting the first bus for the first edge is not hard, as I can simply query with from < 'expected departure' order by from desc. For the second edge, however, the "from" condition depends on the "to" time of the first result row. The query also needs to filter by edge ids.
How can I achieve this in a single query?
I am not sure I understood your problem correctly, but getting values from other rows can be done with window functions (https://www.postgresql.org/docs/current/static/tutorial-window.html):
SELECT
id,
lag("to") OVER (ORDER BY id) as prev_to,
"from",
"to",
lead("from") OVER (ORDER BY id) as next_from
FROM bustimes;
The lag function moves the value of the previous row into the current one. The lead function does the same with the next row. So you are able to calculate a difference between last arrival and current departure or something like that.
Result:
id | prev_to | from  | to    | next_from
---+---------+-------+-------+----------
18 |         | 33000 | 33300 | 33300
19 | 33300   | 33300 | 33600 | 33900
20 | 33600   | 33900 | 34200 | 34200
21 | 34200   | 34200 | 34800 | 36000
22 | 34800   | 36000 | 36300 |
Please note that "from" and "to" are reserved words in PostgreSQL. It would be better to choose other names.
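For readers who want to see what lag and lead compute, here is a small Python sketch of the same idea over the five rows used in the result. It only imitates the window functions with list indexing; the names rows and result are made up for the illustration.

```python
# The five rows from the result, as (id, from, to) records.
rows = [
    {"id": 18, "from": 33000, "to": 33300},
    {"id": 19, "from": 33300, "to": 33600},
    {"id": 20, "from": 33900, "to": 34200},
    {"id": 21, "from": 34200, "to": 34800},
    {"id": 22, "from": 36000, "to": 36300},
]
rows.sort(key=lambda r: r["id"])  # plays the role of OVER (ORDER BY id)

result = []
for i, row in enumerate(rows):
    # lag("to"): the "to" of the previous row, None (NULL) on the first row.
    prev_to = rows[i - 1]["to"] if i > 0 else None
    # lead("from"): the "from" of the next row, None (NULL) on the last row.
    next_from = rows[i + 1]["from"] if i + 1 < len(rows) else None
    result.append({**row, "prev_to": prev_to, "next_from": next_from})

for r in result:
    print(r["id"], r["prev_to"], r["from"], r["to"], r["next_from"])
```

With prev_to next to "from" in the same record, the waiting time between the last arrival and the current departure is just the difference of the two values.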

SQL: Add values according to index columns only for lines sharing an id

Yesterday I asked this question: SQL: How to add values according to index columns but I found out that my problem is a bit more complicated:
I have a table like this
id | value| position | relates_to_position |type
19 | 100 | 2 | NULL | 1
19 | 50 | 6 | NULL | 2
19 | 20 | 7 | 6 | 3
20 | 30 | 3 | NULL | 2
20 | 10 | 4 | 3 | 3
From this I need to create the resulting table, which adds all the lines where the relates_to_position value matches the position value, but only for lines sharing the same id!
The resulting table should be
id | value| position |type
19 | 100 | 2 | 1
19 | 70 | 6 | 2
20 | 40 | 3 | 2
I am using Oracle 11. There is only one level of recursion, meaning a line would not refer to a line which itself has the relates_to_position field set.
I think the following query will do this:
select id, coalesce(relates_to_position, position) as position,
sum(value) as value, min(type) as type
from t
group by id, coalesce(relates_to_position, position);
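The grouping key is the interesting part: a line that relates to another position is folded into that position, and every other line keeps its own. A rough Python sketch of the same logic, using the sample rows from the question (the variable names are made up for the illustration):

```python
from collections import defaultdict

rows = [
    # (id, value, position, relates_to_position, type)
    (19, 100, 2, None, 1),
    (19, 50, 6, None, 2),
    (19, 20, 7, 6, 3),
    (20, 30, 3, None, 2),
    (20, 10, 4, 3, 3),
]

groups = defaultdict(lambda: {"value": 0, "type": None})
for row_id, value, position, relates_to, row_type in rows:
    # coalesce(relates_to_position, position), scoped to the same id.
    key = (row_id, relates_to if relates_to is not None else position)
    group = groups[key]
    group["value"] += value                                  # sum(value)
    group["type"] = (row_type if group["type"] is None
                     else min(group["type"], row_type))      # min(type)

result = [(row_id, g["value"], position, g["type"])
          for (row_id, position), g in sorted(groups.items())]

for row in result:
    print(row)
```

Taking min(type) relies on the referring lines having a larger type than the line they refer to, which holds in the sample data.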

Can I order by multiple columns and somehow keep the ordering related between columns in MySQL?

I know the title doesn't explain my question very well (if someone can come up with a better title then please edit it). Here's what I want to do, say I have the following table:
id | a | b | c
------------------
1 | 3 | 3 | 3
2 | 20 | 40 | 30
3 | 40 | 30 | 10
4 | 30 | 10 | 15
5 | 10 | 15 | 6
6 | 15 | 6 | 20
This is a slightly truncated version; I have a few more columns to sort by, but the principle behind the data and my question is the same.
What I would like is to get the data ordered in the following way:
The row with the highest value in col a
The row with the highest value in col b
The row with the highest value in col c
Followed by all remaining rows ordered by their value in col c
So, the result set would look like:
id | a | b | c
------------------
3 | 40 | 30 | 10
2 | 20 | 40 | 30
6 | 15 | 6 | 20
4 | 30 | 10 | 15
5 | 10 | 15 | 6
1 | 3 | 3 | 3
Doing a
SELECT id, a, b, c
FROM table
ORDER BY a DESC, b DESC, c DESC
obviously orders by a first, then b, and finally c, which gives the following (not what I need):
id | a | b | c
------------------
3 | 40 | 30 | 10
4 | 30 | 10 | 15
2 | 20 | 40 | 30
6 | 15 | 6 | 20
5 | 10 | 15 | 6
1 | 3 | 3 | 3
I'm not familiar with the MySQL dialect, but you would first SELECT the row with the highest 'A' value, UNION ALL it (i.e. without the duplicate removal a plain UNION performs) with the row with the highest 'B' value, then with the row with the highest 'C' value, and finally UNION ALL the remaining rows ordered by 'C', excluding the 3 rows (by id) already selected.
I've just tested the following which appears to work (does involve 3 subqueries however):
SELECT id, a, b, c
FROM test
ORDER BY FIELD(a,(SELECT MAX(a) FROM test)) DESC,
FIELD(b,(SELECT MAX(b) FROM test)) DESC,
FIELD(c,(SELECT MAX(c) FROM test)) DESC,
c DESC
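With a single candidate value, FIELD(x, y) returns 1 when x equals y and 0 otherwise, so each FIELD(...) DESC term pushes the row holding that column's maximum to the front. The same ordering can be sketched in Python with a tuple sort key; this is only an illustration of the logic on the sample table, with made-up variable names.

```python
rows = [
    # (id, a, b, c)
    (1, 3, 3, 3),
    (2, 20, 40, 30),
    (3, 40, 30, 10),
    (4, 30, 10, 15),
    (5, 10, 15, 6),
    (6, 15, 6, 20),
]

max_a = max(r[1] for r in rows)
max_b = max(r[2] for r in rows)
max_c = max(r[3] for r in rows)

# Each boolean mirrors one FIELD(col, MAX(col)) term; the last element is c itself.
# reverse=True corresponds to DESC on every term.
ordered = sorted(
    rows,
    key=lambda r: (r[1] == max_a, r[2] == max_b, r[3] == max_c, r[3]),
    reverse=True,
)

for row in ordered:
    print(row)
```

In this data the row with the highest b (id 2) also holds the highest c, so only two rows are pulled to the front and the rest fall back to c descending.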