Finding the column with the highest value among other columns in SQL

I have a table that looks like this:
| Customer      | Category 1 | Category 2 | Category 3 | Category 4 |
|---------------|------------|------------|------------|------------|
| aaaaa#aaa.com | 0          | 563        | 0          | 0          |
| bbbb#bbb.com  | 33         | 31         | 38         | 13         |
| cccc#ccc.com  | 108        | 0          | 0          | 0          |
| dddd#ddd.com  | 0          | 7          | 0          | 11         |
I am trying to insert a new column named "BestCategory" that will show the name of the category with the highest value among them.
I have tried to use GREATEST, but it is not supported on my system.
Can you guys help me?

First you have to use UNPIVOT to calculate the max value for each row.
Then use a CASE to select the BestCategory.
WITH maxValues AS
(
    SELECT [Customer], MAX(Amount) AS TheMax
    FROM Customer
    UNPIVOT (Amount FOR AmountCol IN
        ([Category 1], [Category 2], [Category 3], [Category 4])) AS unpvt
    GROUP BY [Customer]
)
SELECT
    Customer.[Customer], [Category 1], [Category 2], [Category 3], [Category 4],
    TheMax,
    CASE
        WHEN [Category 1] = TheMax THEN '[Category 1]'
        WHEN [Category 2] = TheMax THEN '[Category 2]'
        WHEN [Category 3] = TheMax THEN '[Category 3]'
        ELSE '[Category 4]'
    END AS BestCategory
FROM Customer
INNER JOIN maxValues
    ON Customer.[Customer] = maxValues.[Customer]
Output:
| Customer | Category 1 | Category 2 | Category 3 | Category 4 | TheMax | BestCategory |
|---------------|------------|------------|------------|------------|--------|--------------|
| aaaaa#aaa.com | 0 | 563 | 0 | 0 | 563 | [Category 2] |
| bbbb#bbb.com | 33 | 31 | 38 | 13 | 38 | [Category 3] |
| cccc#ccc.com | 108 | 0 | 0 | 0 | 108 | [Category 1] |
| dddd#ddd.com | 0 | 7 | 0 | 11 | 11 | [Category 4] |

If you want the column name in the "greatest" field, you could use the idea in this SQL Fiddle example.
It selects the column name with the greatest value per row:
SELECT CASE
         WHEN [Category 1] > [Category 2]
          AND [Category 1] > [Category 3]
          AND [Category 1] > [Category 4] THEN '[Category 1]'
         ELSE CASE
           WHEN [Category 2] > [Category 1]
            AND [Category 2] > [Category 3]
            AND [Category 2] > [Category 4] THEN '[Category 2]'
           ELSE CASE
             WHEN [Category 3] > [Category 1]
              AND [Category 3] > [Category 2]
              AND [Category 3] > [Category 4] THEN '[Category 3]'
             ELSE '[Category 4]'
           END
         END
       END
FROM Customer


Swap values between columns based on third column

I have a table like this:
src_id | src_source | dst_id | dst_source | metadata
--------------------------------------------------------
123 | A | 345 | B | some_string
234 | B | 567 | A | some_other_string
498 | A | 432 | A | another_one # this line should be ignored
765 | B | 890 | B | another_two # this line should be ignored
What I would like is:
A_id | B_id | metadata
-----------------------
123 | 345 | some_string
567 | 234 | some_other_string
Here's the data to replicate (this assumes an active SparkSession named spark):
from pyspark.sql import functions as F

data = [
    ("123", "A", "345", "B", "some_string"),
    ("234", "B", "567", "A", "some_other_string"),
    ("498", "A", "432", "A", "another_one"),
    ("765", "B", "890", "B", "another_two"),
]
cols = ["src_id", "src_source", "dst_id", "dst_source", "metadata"]
df = spark.createDataFrame(data).toDF(*cols)
I am a bit confused as to how to do this - I got this far:
output = (
    df
    .filter(F.col("src_source") != F.col("dst_source"))
    .withColumn("A_id",
                F.when(F.col("src_source") == "A", F.col("src_id")))
    .withColumn("B_id",
                F.when(F.col("src_source") == "B", F.col("src_id")))
)
I think I figured it out - I need to split the df and union again!
ab_df = (
    df
    .filter(F.col("src_source") != F.col("dst_source"))
    .filter((F.col("src_source") == "A") & (F.col("dst_source") == "B"))
    .select(F.col("src_id").alias("A_id"),
            F.col("dst_id").alias("B_id"),
            "metadata")
)
ba_df = (
    df
    .filter(F.col("src_source") != F.col("dst_source"))
    .filter((F.col("src_source") == "B") & (F.col("dst_source") == "A"))
    .select(F.col("src_id").alias("B_id"),
            F.col("dst_id").alias("A_id"),
            "metadata")
)
all = ab_df.unionByName(ba_df)
You can do it without a union, in just one select, and without writing the same filter twice.
output = (
    df
    .filter(F.col("src_source") != F.col("dst_source"))
    .select(
        F.when(F.col("src_source") == "A", F.col("src_id")).otherwise(F.col("dst_id")).alias("A_id"),
        F.when(F.col("src_source") == "A", F.col("dst_id")).otherwise(F.col("src_id")).alias("B_id"),
        "metadata"
    )
)
output.show()
# +----+----+-----------------+
# |A_id|B_id| metadata|
# +----+----+-----------------+
# | 123| 345| some_string|
# | 567| 234|some_other_string|
# +----+----+-----------------+

Preserve table cell line breaks in Sphinx processing of reStructuredText

I have a reStructuredText table with a row like this:
+------+-----------------------------+
| Mask | The bit mask: |
| | [bit 0] Description of bit0 |
| | [bit 1] And bit1 |
+------+-----------------------------+
When produced by Sphinx (HTML output as an example), the cell is this:
<td><p>The bit mask:
[bit 0] Description of bit0
[bit 1] And bit1</p></td>
What I would like to be produced is this (or similar), where a line break is forced at least before every new line:
<td><p>The bit mask:
<br>[bit 0] Description of bit0
<br>[bit 1] And bit1</p></td>
Is there a way I can configure Sphinx to respect the lines in a reStructuredText table cell?
(For reference, here is the whole table as currently produced:)
<table class="docutils align-default">
<colgroup>
<col style="width: 17%" />
<col style="width: 83%" />
</colgroup>
<tbody>
<tr class="row-odd">
<td>
<p>Mask</p>
</td>
<td>
<p>The bit mask:
[bit 0] Description of bit0
[bit 1] And bit1
</p>
</td>
</tr>
</tbody>
</table>
Generally there are two easy ways to guarantee a line break or alignment in reST.
1. Using Paragraphs, the following:
+------+-----------------------------+
| Mask | The bit mask: |
| | |
| | [bit 0] Description of bit0 |
| | |
| | [bit 1] And bit1 |
| | |
+------+-----------------------------+
Will give:
<table class="docutils align-default">
<tbody>
<tr class="row-odd">
<td>
<p>Mask</p>
</td>
<td>
<p>The bit mask:</p>
<p>[bit 0] Description of bit0</p>
<p>[bit 1] And bit1</p>
</td>
</tr>
</tbody>
</table>
2. Using Line Blocks, the following:
+------+-------------------------------+
| Mask | | The bit mask: |
| | | [bit 0] Description of bit0 |
| | | [bit 1] And bit1 |
+------+-------------------------------+
Will give:
<table class="docutils align-default">
<tbody>
<tr class="row-odd">
<td>
<p>Mask</p>
</td>
<td>
<div class="line-block">
<div class="line">The bit mask:</div>
<div class="line">[bit 0] Description of bit0</div>
<div class="line">[bit 1] And bit1</div>
</div>
</td>
</tr>
</tbody>
</table>
The resulting <div class="line"></div> will work like a paragraph and also keep alignment. This is guaranteed by the reST specification, so even if your output is not HTML there should be mechanisms in place to guarantee the result will be consistent.

How to convert dictionary with list to dataframe with default index and column names

How to convert dictionary to dataframe with default index and column names
The dictionary:
d = {0: [1, 'Sports', 222], 1: [2, 'Tools', 11], 2: [3, 'Clothing', 23]}
Expected df:
   id      type  value
0   1    Sports    222
1   2     Tools     11
2   3  Clothing     23
Use DataFrame.from_dict with the orient='index' parameter:
d = {0: [1, 'Sports', 222], 1: [2, 'Tools', 11], 2: [3, 'Clothing', 23]}
df = pd.DataFrame.from_dict(d, orient='index', columns=['id','type','value'])
print (df)
   id      type  value
0   1    Sports    222
1   2     Tools     11
2   3  Clothing     23
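If you want to see pandas' actual defaults first, a minimal sketch (assuming pandas is imported as pd, as above) is to build the frame without columns= and rename afterwards; the column names then default to 0, 1, 2:
df = pd.DataFrame.from_dict(d, orient='index')  # default column names 0, 1, 2
df.columns = ['id', 'type', 'value']            # then assign the desired names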

How can I select the column names where a condition is met

I need to select column names where the count is greater than 2. I have this dataset:
Index | col_1 | col_2 | col_3 | col_4
-------------------------------------
0 | 5 | NaN | 4 | 2
1 | 2 | 2 | NaN | 2
2 | NaN | 3 | NaN | 1
3 | 3 | NaN | NaN | 1
The expected result is a list: ['col_1', 'col_4']
When I use
df.count() > 2
I get
col_1 True
col_2 False
col_3 False
col_4 True
Length: 4, dtype: bool
This is the code for testing:
import pandas as pd
import numpy as np

data = {'col_1': [5, 2, np.nan, 3],
        'col_2': [np.nan, 2, 3, np.nan],
        'col_3': [4, np.nan, np.nan, np.nan],
        'col_4': [2, 2, 1, 1]}
frame = pd.DataFrame(data)
frame.count() > 2
You can do it this way.
import pandas as pd
import numpy as np

data = {'col_1': [5, 2, np.nan, 3],
        'col_2': [np.nan, 2, 3, np.nan],
        'col_3': [4, np.nan, np.nan, np.nan],
        'col_4': [2, 2, 1, 1]}
frame = pd.DataFrame(data)

expected_list = []
for col in list(frame.columns):
    if frame[col].count() > 2:
        expected_list.append(col)
Using dict can easily solve this:
frame[[key for key, value in dict(frame.count() > 2).items() if value]]
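Note that this returns the matching columns of the frame itself; if you just want the list of names (as in the expected result), a minimal sketch is the comprehension on its own:
expected_list = [key for key, value in dict(frame.count() > 2).items() if value]
print(expected_list)  # ['col_1', 'col_4']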
Try:
frame.columns[(frame.count() > 2).values].to_list()

Append columns from rows in pandas

Convert rows into new columns, like:
original dataframe:
attr_0 attr_1 attr_2 attr_3
0 day_0 -0.032546 0.161111 -0.488420 -0.811738
1 day_1 -0.341992 0.779818 -2.937992 -0.236757
2 day_2 0.592365 0.729467 0.421381 0.571941
3 day_3 -0.418947 2.022934 -1.349382 1.411210
4 day_4 -0.726380 0.287871 -1.153566 -2.275976
...
after conversion:
day_0_attr_0 day_0_attr_1 day_0_attr_2 day_0_attr_3 day_1_attr_0 \
0 -0.032546 0.144388 -0.992263 0.734864 -0.936625
day_1_attr_1 day_1_attr_2 day_1_attr_3 day_2_attr_0 day_2_attr_1 \
0 -1.717135 -0.228005 -0.330573 -0.28034 0.834345
day_2_attr_2 day_2_attr_3 day_3_attr_0 day_3_attr_1 day_3_attr_2 \
0 1.161089 0.385277 -0.014138 -1.05523 -0.618873
day_3_attr_3 day_4_attr_0 day_4_attr_1 day_4_attr_2 day_4_attr_3
0 0.724463 0.137691 -1.188638 -2.457449 -0.171268
If MultiIndex use:
print (df.index)
MultiIndex(levels=[[0, 1, 2, 3, 4], ['day_0', 'day_1', 'day_2', 'day_3', 'day_4']],
labels=[[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]])
df = df.reset_index(level=0, drop=True).stack().reset_index()
level_0 level_1 0
0 day_0 attr_0 -0.032546
1 day_0 attr_1 0.161111
2 day_0 attr_2 -0.488420
3 day_0 attr_3 -0.811738
4 day_1 attr_0 -0.341992
5 day_1 attr_1 0.779818
6 day_1 attr_2 -2.937992
7 day_1 attr_3 -0.236757
8 day_2 attr_0 0.592365
9 day_2 attr_1 0.729467
10 day_2 attr_2 0.421381
11 day_2 attr_3 0.571941
12 day_3 attr_0 -0.418947
13 day_3 attr_1 2.022934
14 day_3 attr_2 -1.349382
15 day_3 attr_3 1.411210
16 day_4 attr_0 -0.726380
17 day_4 attr_1 0.287871
18 day_4 attr_2 -1.153566
19 day_4 attr_3 -2.275976
df = pd.DataFrame([df[0].values], columns = df['level_0'] + '_' + df['level_1'])
print (df)
day_0_attr_0 day_0_attr_1 ... day_4_attr_2 day_4_attr_3
0 -0.032546 0.161111 ... -1.153566 -2.275976
[1 rows x 20 columns]
Another solution with product:
from itertools import product
cols = ['{}_{}'.format(a,b) for a, b in product(df.index.get_level_values(1), df.columns)]
print (cols)
['day_0_attr_0', 'day_0_attr_1', 'day_0_attr_2', 'day_0_attr_3',
'day_1_attr_0', 'day_1_attr_1', 'day_1_attr_2', 'day_1_attr_3',
'day_2_attr_0', 'day_2_attr_1', 'day_2_attr_2', 'day_2_attr_3',
'day_3_attr_0', 'day_3_attr_1', 'day_3_attr_2', 'day_3_attr_3',
'day_4_attr_0', 'day_4_attr_1', 'day_4_attr_2', 'day_4_attr_3']
df = pd.DataFrame([df.values.ravel()], columns=cols)
print (df)
day_0_attr_0 day_0_attr_1 ... day_4_attr_2 day_4_attr_3
0 -0.032546 0.161111 ... -1.153566 -2.275976
[1 rows x 20 columns]
If there is no MultiIndex, the solutions are a bit changed:
print (df.index)
Index(['day_0', 'day_1', 'day_2', 'day_3', 'day_4'], dtype='object')
df = df.stack().reset_index()
df = pd.DataFrame([df[0].values], columns = df['level_0'] + '_' + df['level_1'])
Another solution with product:
from itertools import product
cols = ['{}_{}'.format(a,b) for a, b in product(df.index, df.columns)]
df = pd.DataFrame([df.values.ravel()], columns=cols)
print (df)
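For a self-contained check of the stack approach above, here is a minimal sketch for the plain-Index case (the day labels, attribute names, and random values are only placeholders):
import numpy as np
import pandas as pd

# Placeholder frame shaped like the question's data
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5, 4),
                  index=['day_{}'.format(i) for i in range(5)],
                  columns=['attr_{}'.format(i) for i in range(4)])

# Stack the rows into one long Series indexed by (day, attr),
# then build a single-row frame with day_i_attr_j column names
s = df.stack()
out = pd.DataFrame([s.values],
                   columns=['{}_{}'.format(d, a) for d, a in s.index])
print(out.shape)  # (1, 20)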
You can use a melt and string-concatenation approach, i.e.
import numpy as np

idx = df.index
temp = df.melt()
# Repeat the index once per original column (melt stacks the columns one after another)
temp['variable'] = pd.Series(np.concatenate([idx] * len(df.columns))) + '_' + temp['variable']
# Set index and transpose to get a single row
temp.set_index('variable').T
variable day_0_attr_0 day_1_attr_0 day_2_attr_0 day_3_attr_0 day_4_attr_0 . . . .
value -0.032546 -0.341992 0.592365 -0.418947 -0.72638 . . . .