I want to take the values from multiple rows of a single column and spread them across different columns in a single row, based on the value of another column.
Specifically, I want the values of field_value_name to end up in the same row but in different columns, determined by the values in the field_name column, since they belong to the same id.
How can I do this in SQL or PySpark?
I tried using CASE WHEN, but it scans every row and returns an output for every row.
I'd rather have those values in a single row for every id.
You need pivoting logic here, e.g.
SELECT
parent_id,
MAX(CASE WHEN properties_field_name = 'Status'
THEN properties_field_value_name END) AS REQ_STATUS,
MAX(CASE WHEN properties_field_name = 'Type'
THEN properties_field_value_name END) AS REQ_TYPE,
MAX(CASE WHEN properties_field_name = 'Description'
THEN properties_field_value_name END) AS REQ_DESC
FROM yourTable
GROUP BY parent_id
ORDER BY parent_id;
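If you are on the PySpark side, one possible sketch (assuming your data is already in a DataFrame called df) is to register a temporary view and run the same query through spark.sql:
# expose the DataFrame to Spark SQL under the table name used in the query above
df.createOrReplaceTempView('yourTable')
spark.sql("""
    SELECT
        parent_id,
        MAX(CASE WHEN properties_field_name = 'Status'
                 THEN properties_field_value_name END) AS REQ_STATUS,
        MAX(CASE WHEN properties_field_name = 'Type'
                 THEN properties_field_value_name END) AS REQ_TYPE,
        MAX(CASE WHEN properties_field_name = 'Description'
                 THEN properties_field_value_name END) AS REQ_DESC
    FROM yourTable
    GROUP BY parent_id
    ORDER BY parent_id
""").show()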
You can do it with MAX like Tim shows or you can do it with joins like this:
SELECT
    base.parent_id,
    status.properties_field_value_name as status,
    type.properties_field_value_name as type,
    descr.properties_field_value_name as description
FROM (
    SELECT distinct parent_id
    FROM thetableyoudidnotname
) as base
LEFT JOIN thetableyoudidnotname as status on base.parent_id = status.parent_id and status.properties_field_name = 'Status'
LEFT JOIN thetableyoudidnotname as type on base.parent_id = type.parent_id and type.properties_field_name = 'Type'
LEFT JOIN thetableyoudidnotname as descr on base.parent_id = descr.parent_id and descr.properties_field_name = 'Description'
Simple pivot problem I think.
from pyspark.sql import functions as f

data = [
[7024549, 'Status', 'Approved'],
[7024549, 'Type', 'Jama Design'],
[7024549, 'Description', 'null']
]
df = spark.createDataFrame(data, ['parent_id', 'properties_field_name', 'properties_field_value_name'])
df.withColumn('properties_field_name', f.concat(f.lit('REQ_'), f.upper(f.col('properties_field_name')))) \
    .groupBy('parent_id') \
    .pivot('properties_field_name') \
    .agg(f.first('properties_field_value_name')) \
    .show()
+---------+---------------+----------+-----------+
|parent_id|REQ_DESCRIPTION|REQ_STATUS|   REQ_TYPE|
+---------+---------------+----------+-----------+
|  7024549|           null|  Approved|Jama Design|
+---------+---------------+----------+-----------+
I've created a query in Apache Spark hoping to take multiple rows of customer data and roll them up into one row that shows which types of products each customer has open. So data that looks like this:
Customer Product
1 Savings
1 Checking
1 Auto
Ends up looking like this:
Customer Product
1 Savings/Checking/Auto
The query currently still returns multiple rows. I tried GROUP BY, but that doesn't show the multiple products that a customer has; instead, it just shows one product.
Is there a way to do this in Apache Spark or in SQL (which Spark's SQL is really similar to)? Unfortunately, I don't have MySQL, nor do I think IT will install it for me.
SELECT
"ACCOUNT"."account_customerkey" AS "account_customerkey",
max(
concat(case when Savings=1 then ' Savings'end,
case when Checking=1 then ' Checking 'end,
case when CD=1 then ' CD /'end,
case when IRA=1 then ' IRA /'end,
case when StandardLoan=1 then ' SL /'end,
case when Auto=1 then ' Auto /'end,
case when Mortgage=1 then ' Mortgage /'end,
case when CreditCard=1 then ' CreditCard 'end)) AS Description
FROM "ACCOUNT" "ACCOUNT"
inner join (
SELECT
"ACCOUNT"."account_customerkey" AS "customerkey",
CASE WHEN "ACCOUNT"."account_producttype" = 'Savings' THEN 1 ELSE NULL END AS Savings,
CASE WHEN "ACCOUNT"."account_producttype" = 'Checking' THEN 1 ELSE NULL END AS Checking,
CASE WHEN "ACCOUNT"."account_producttype" = 'CD' THEN 1 ELSE NULL END AS CD,
CASE WHEN "ACCOUNT"."account_producttype" = 'IRA' THEN 1 ELSE NULL END AS IRA,
CASE WHEN "ACCOUNT"."account_producttype" = 'Standard Loan' THEN 1 ELSE NULL END AS StandardLoan,
CASE WHEN "ACCOUNT"."account_producttype" = 'Auto' THEN 1 ELSE NULL END AS Auto,
CASE WHEN "ACCOUNT"."account_producttype" = 'Mortgage' THEN 1 ELSE NULL END AS Mortgage,
CASE WHEN "ACCOUNT"."account_producttype" = 'Credit Card' THEN 1 ELSE NULL END AS CreditCard
FROM "ACCOUNT" "ACCOUNT"
)a on "account_customerkey" =a."customerkey"
GROUP BY
"ACCOUNT"."account_customerkey"
Please try this.
scala> df.show()
+--------+--------+
|Customer| Product|
+--------+--------+
| 1| Savings|
| 1|Checking|
| 1| Auto|
| 2| Savings|
| 2| Auto|
| 3|Checking|
+--------+--------+
scala> df.groupBy($"Customer").agg(collect_list($"Product").as("Product")).select($"Customer",concat_ws(",",$"Product").as("Product")).show(false)
+--------+---------------------+
|Customer|Product |
+--------+---------------------+
|1 |Savings,Checking,Auto|
|3 |Checking |
|2 |Savings,Auto |
+--------+---------------------+
See https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/collect_list and related functions
You need to use collect_list, which is also available from SQL or a %sql cell.
%sql
select id, collect_list(num)
from t1
group by id
I used my own data, so you will need to tailor it to yours; this just demonstrates the more native SQL form.
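If you want the slash-separated string from the earlier question rather than an array, one possible sketch (sticking with the demo columns id and num above; substitute your own) is to wrap collect_list in concat_ws:
%sql
select id, concat_ws('/', collect_list(cast(num as string))) as combined
from t1
group by id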
I am looking to run a SQL expression that checks for the next event that is either 'DELIVERED' or 'ORDER-CANCELED' and returns a different result depending on which comes first.
df = spark.createDataFrame([["ORDER", "2009-11-23", "1"], ["DELIVERED", "2009-12-17", "1"], ["ORDER-CANCELED", "2009-11-25", "1"], ["ORDER", "2009-12-03", "1"]]).toDF("EVENT", "DATE", "ID")
+--------------+----------+---+
| EVENT| DATE| ID|
+--------------+----------+---+
| ORDER|2009-11-23| 1|
|ORDER-CANCELED|2009-11-25| 1|
| ORDER|2009-12-03| 1|
| DELIVERED|2009-12-17| 1|
+--------------+----------+---+
I have written a statement that works for just a DELIVERED event using this code:
df = df.withColumn("NEXT", f.expr("""
case when EVENT = 'ORDER' then
first(if(EVENT in ('DELIVERED'), 'SUCCESS', null), True)
over (Partition By ID ORDER BY ID, DATE ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
else null end
"""))
This works, but I don't know how to add a second condition for the 'ORDER-CANCELED' case:
df = df.withColumn("NEXT", f.expr("""
case when EVENT = 'ORDER' then
first(if(EVENT in ('DELIVERED'), 'SUCCESS', null)
**elseif(EVENT in ('ORDER-CANCELED'), 'CANCELED'), True)**
over (Partition By ID ORDER BY ID, DATE ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
else null end
"""))
Something like this, maybe? (Keeping the ignoreNulls flag from your original first() call, so that any intervening ORDER rows are skipped.)
df = df.withColumn(
"NEXT",
f.expr("""
case when EVENT = 'ORDER' then
first(
    case when EVENT in ('DELIVERED') then
        'SUCCESS'
    when EVENT in ('ORDER-CANCELED') then
        'CANCELED'
    else
        NULL
    end,
    true
) over (Partition By ID ORDER BY ID, DATE ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
else NULL
end
"""))
I need to do something like 'majority voting' of columns in an SQL database. That means that, given columns c0, c1, ..., cn, I would like some other column to contain, for each row, the most frequent value among those columns (and null or a random value otherwise; it doesn't really matter). For example, if we have the following table:
+--+--+--+------+
|c0|c1|c2|result|
+--+--+--+------+
| 0| 1| 0|     0|
| 0| 1| 1|     1|
| 2| 2| 0|     2|
| 0| 3| 1|  null|
+--+--+--+------+
That is what I mean by majority voting over columns c0, c1, c2: in the first row we have two columns with value 0 and one with value 1, so result = 0. In the second we have one 0 vs two 1s, so result = 1, and so on. We assume that all the columns have the same type.
It would be great if the query were concise (it can be built dynamically). Native SQL is preferred, but PL/SQL or psql will also do.
Thank you in advance.
This can easily be done by creating a table out of the three columns and using an aggregate function on that:
The following works in Postgres:
select c0,c1,c2,
(select c
from unnest(array[c0,c1,c2]) as t(c)
group by c
having count(*) > 1
order by count(*) desc
limit 1)
from the_table;
If you don't want to hard-code the column names, you can use Postgres' JSON function as well:
select t.*,
(select t.v
from jsonb_each_text(to_jsonb(t)) as t(c,v)
group by t.v
having count(*) > 1
order by count(*) desc
limit 1) as result
from the_table t;
Note that the above takes all columns into account. If you want to remove specific columns (e.g. an id column) you need to use to_jsonb(t) - 'id' to remove that key from the JSON value.
Neither of those solutions deals with ties (two different values appearing the same number of times).
Online example: https://rextester.com/PJR58760
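If a deterministic winner for ties matters to you, one possible tweak (just a sketch) to the first query is to add a secondary sort key, e.g. so that the smallest of the tied values wins:
select c0, c1, c2,
       (select c
        from unnest(array[c0,c1,c2]) as t(c)
        group by c
        having count(*) > 1
        order by count(*) desc, c   -- tie-breaker: pick the smallest tied value
        limit 1)
from the_table;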
The first solution can be "adapted" somewhat to Oracle, especially if you can build the SQL on the fly:
select t.*,
(select c
from (
-- this part would need to be done dynamically
-- if you don't know the columns
select t.c0 as c from dual union all
select t.c1 from dual union all
select t.c2 from dual
) x
group by c
having count(*) > 1
order by count(*) desc
fetch first 1 rows only) as result
from the_table t;
In Postgres use jsonb functions. You need a primary key or unique column(s); id is unique in the example:
with my_table(id, c0, c1, c2) as (
values
(1, 0, 1, 0),
(2, 0, 1, 1),
(3, 2, 2, 0),
(4, 0, 3, 1)
)
select distinct on (id) id, value
from (
select id, value, count(*)
from my_table t
cross join jsonb_each_text(to_jsonb(t)- 'id')
group by id, value
) s
order by id, count desc
id | value
----+-------
1 | 0
2 | 1
3 | 2
4 | 1
(4 rows)
The query works well regardless of the number of columns.
Here's a solution for Postgres.
SELECT t1.c0,
t1.c1,
t1.c2,
(SELECT y.c
FROM (SELECT x.c,
count(*) OVER (PARTITION BY x.rn) ct
FROM (SELECT v.c,
rank() OVER (ORDER BY count(v.c) DESC) rn
FROM (VALUES (t1.c0),
(t1.c1),
(t1.c2)) v(c)
GROUP BY v.c) x
WHERE x.rn = 1) y
WHERE y.ct = 1) result
FROM elbat t1;
db<>fiddle
In the subquery, all the values with the maximum count are first selected using rank(). The windowed version of count() is then used to check whether there is only one value with the maximum count.
If you need to do this over more columns, just add them to the SELECT and the VALUES.
THIS ANSWERS THE ORIGINAL VERSION OF THE QUESTION.
You can just compare the values. For your example with two values neither of which is NULL:
select t.*,
       (case when ((case when c0 = 0 then 1 else -1 end) +
                   (case when c1 = 0 then 1 else -1 end) +
                   (case when c2 = 0 then 1 else -1 end)
                  ) > 0
             then 0 else 1
        end) as result
from t;
There is a Postgres database and the table has three columns. The data structure lives in an external system, so I cannot modify it.
Every object is represented by three rows (identified by the element_id column; rows with the same value in this column represent the same object), for example:
key value element_id
-----------------------------------
status active 1
name exampleNameAAA 1
city exampleCityAAA 1
status inactive 2
name exampleNameBBB 2
city exampleCityBBB 2
status inactive 3
name exampleNameCCC 3
city exampleCityCCC 3
In the query, I want to pass a list of names, check whether the row with the key 'status' for the same object has the value 'active', and return the name of an object only if its status is 'active'.
So for this example, there are three objects in the database table. I want to pass two names into the query:
a)exampleNameAAA
b)exampleNameCCC
and the result should be:
exampleNameAAA (because I asked for two objects and only one of them has the value 'active' in its status row).
You can use an EXISTS query:
select e1.*
from element e1
where (e1.key, e1.value) in ( ('name', 'exampleNameAAA'), ('name', 'exampleNameCCC'))
and exists (select *
from element e2
where e2.element_id = e1.element_Id
and (e2.key, e2.value) = ('status', 'active'));
Online example: https://rextester.com/JOWED21150
One option uses aggregation:
SELECT
MAX(CASE WHEN "key" = 'name' THEN "value" END) AS name
FROM yourTable
GROUP BY element_id
HAVING
MAX(CASE WHEN "key" = 'name' THEN "value" END) IN
('exampleNameAAA', 'exampleNameCCC') AND
SUM(CASE WHEN "key" = 'status' AND "value" = 'active' THEN 1 ELSE 0 END) > 0;
name
exampleNameAAA
This is the pivot approach, where we isolate individual keys and values for each element_id group.
I like expressing this as:
SELECT MAX(t.value) FILTER (WHERE t.key = 'name') AS name
FROM t
GROUP BY t.element_id
HAVING MAX(t.value) FILTER (WHERE t.key = 'name') IN ('exampleNameAAA', 'exampleNameCCC') AND
MAX(t.value) FILTER (WHERE t.key = 'status') = 'active';
All that said, the exists solution is probably more performant in this case. The advantage of the aggregation approach is that you can easily bring additional columns into the select, such as the city:
SELECT MAX(t.value) FILTER (WHERE t.key = 'name') AS name,
MAX(t.value) FILTER (WHERE t.key = 'city') as city
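Spelled out under the same assumptions as the query above (same FROM, GROUP BY and HAVING), that variant would look something like:
SELECT MAX(t.value) FILTER (WHERE t.key = 'name') AS name,
       MAX(t.value) FILTER (WHERE t.key = 'city') AS city
FROM t
GROUP BY t.element_id
HAVING MAX(t.value) FILTER (WHERE t.key = 'name') IN ('exampleNameAAA', 'exampleNameCCC') AND
       MAX(t.value) FILTER (WHERE t.key = 'status') = 'active';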
(Note: key is a bad name for a column because it is a SQL keyword.)
I have two queries:
Select count(*) as countOne where field = '1'
Select count(*) as countTwo where field = '2'
What I want to see after executing these queries in my results viewer:
countOne | countTwo
23 | 123
How can I get the results from both queries by only running one query?
SELECT COUNT(CASE WHEN field = '1' THEN 1 END) AS countOne,
COUNT(CASE WHEN field = '2' THEN 1 END) AS countTwo
FROM YourTable
WHERE field IN ( '1', '2' )
The simplest way is to run each as a subselect, e.g. (using YourTable as the table name, as above):
SELECT
    (SELECT count(*) FROM YourTable WHERE field = '1') AS countOne,
    (SELECT count(*) FROM YourTable WHERE field = '2') AS countTwo
But this is not necessarily the best way.
Another way to do it would be to GROUP BY field and then use PIVOT to select each group out as a separate column, as sketched below.
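For example, in SQL Server's PIVOT syntax, a rough sketch (again assuming the table is called YourTable) could look like this:
SELECT [1] AS countOne, [2] AS countTwo
FROM (SELECT field, field AS field_value FROM YourTable) AS src   -- duplicate the column: one copy to pivot on, one to count
PIVOT (COUNT(field_value) FOR field IN ([1], [2])) AS p;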