Avoid Cartesian Join for the spark SQL query - sql

I am trying to calculate the processRate from the total count of two temp tables but I'm getting the error "Detected implicit cartesian product for INNER join between logical plans" where I am not even performing joins. I am sure this error can be resolved by restructuring the query in correct format and I need your help on it. Below is the query,
spark.sql("""
CREATE OR REPLACE TEMPORARY VIEW final_processRate AS
SELECT
((a.total - b.total)/a.total))* 100 AS processRate
FROM
(select count (*) as total from sales) a,
(select count (*) as total from sales where status = 'PENDING') b
""")
I'm getting this error while trying to view the data using,
spark.sql("select * from processRate limit 10").show(false)
Can you please help on formatting the above query to resolve this issue and view the data of final_processRate?

You don't need subquery for this. Just use a conditional aggregation:
spark.sql("""
CREATE OR REPLACE TEMPORARY VIEW final_processRate AS
SELECT
((count(*) - count(case when status='PENDING' then 1 end)) / count(*)) * 100 AS processRate
FROM sales
""")
Then you can query the temp view using:
spark.sql("select * from final_processRate")
which should give you a single number/percentage calculated above.

I would write this as:
select avg(case when status = 'PENDING' then 0.0 else 1 end)
from sales;
This returns the proportion of rows that are not pending.

Related

How to calculate metrics between two tables

How to calculate metrics between two tables? In addition, I noticed that when using FROM tbl1, tbl2, there are noises, the WHERE filters did not work, a total count(*) was returned
Query:
select
count(*) filter(WHERE tb_get_gap.system in ('LINUX','UNIX')) as gaps,
SUM(CAST(srvs AS INT)) filter(WHERE tb_getcountsrvs.type = 'LZ') as total,
100 - (( gaps / total ) * 100)
FROM tb_get_gap, tb_getcountsrvs
Error:
SQL Error [42703]: ERROR: column "gaps" does not exist
I need to count in the tb_get_gap table by fields = ('LINUX', 'UNIX'), then a SUM ()in thesrvs field in the
tb_getcountsrvs table by fields = 'LZ' in type, right after
making this formula 100 - ((gaps / total) * 100)
It would seem that you cannot define gaps and also use it in the same query. In SQL Server you would have to use the logic twice. Maybe a subquery would work better.
select 100 - (t.gaps / t.total) * 100)
from
(
select
count(*) filter(WHERE tb_get_gap.system in ('LINUX','UNIX')) as gaps,
SUM(CAST(srvs AS INT)) filter(WHERE tb_getcountsrvs.type = 'LZ') as total
FROM tb_get_gap, tb_getcountsrvs
) t

How do I count the rows with a where clause in SQL Server?

I am pretty much stuck with a problem I am facing with SQL Server. I want to show in a query the amount of times that specific value occurs. This is pretty easy to do, but I want to take it a step further and I think the best way to explain on what I am trying to achieve is to explain it using images.
I have two tables:
Plant and
Chest
As you can see with the chest the column 'hoeveelheid' tells how full the chest is, 'vol' == 1 and 3/4 is == 0,75. In the plant table there is a column 'Hoeveelheidperkist' which tells how much plants there can be in 1 chest.
select DISTINCT kist.Plantnaam, kist.Plantmaat, count(*) AS 'Amount'
from kist
group by kist.plantnaam, kist.Plantmaat
This query counts all the chests, but it does not seperate the count of 'Vol' chests and '3/4' chests. It only does This. What I want to achieve is this. But I have no idea how. Any help would be much appreciated.
If you use group by you don't need distinct
and if you want the seprated count for hoeveelheid you ust add to the group by clause
select DISTINCT kist.Plantnaam, kist.Plantmaat, kist.hoeveelheid, count(*) AS 'Amount'
from kist
group by kist.plantnaam, kist.Plantmaat, hoeveelheid
or if you want all the 3 count ond the samw rowx you could use a condition aggreagtion eg:
select DISTINCT kist.Plantnaam, kist.Plantmaat
, sum(case when kist.hoeveelheid ='Vol' then 1 else 0 end) vol
, sum(case when kist.hoeveelheid ='3/3' then 1 else 0 end) 3_4
, count(*) AS 'Amount'
from kist
group by kist.plantnaam, kist.Plantmaat
When you want to filter the data on the counts you have to use having clause. When ever you are using aggregate functions(sum, count, min, max) and you want to filter them on aggregation basis, use having clause
select DISTINCT kist.Plantnaam, kist.Plantmaat, count(*) AS 'Amount'
from kist
group by kist.plantnaam, kist.Plantmaat having count(*) = 1 -- or provide necessary conditions

using correlated subquery in the case statement

I’m trying to use a correlated subquery in my sql code and I can't wrap my head around what I'm doing wrong. A brief description about the code and what I'm trying to do:
The code consists of a big query (ALIASED AS A) which result set looks like a list of customer IDs, offer IDs and response status name ("SOLD","SELLING","IRRELEVANT","NO ANSWER" etc.) of each customer to each offer. The customers IDs and the responses in the result set are non-unique, since more than one offer can be made to each customer, and a customer can have different response for different offers.
The goal is to generate a list of distinct customer IDs and to mark each ID with 0 or 1 flag :
if the ID has AT LEAST ONE offer with status name is "SOLD" or "SELLING" the flag should be 1 otherwise 0. Since each customer has an array of different responses, what I'm trying to do is to check if "SOLD" or "SELLING" appears in this array for each customer ID, using correlated subquery in the case statement and aliasing the big underlying query named A with A1 this time:
select distinct
A.customer_ID,
case when 'SOLD' in (select distinct A1.response from A as A1
where A.customer_ID = A1.customer_ID) OR
'SELLING' in (select distinct A1.response from A as A1
where A.customer_ID = A1.customer_ID)
then 1 else 0 end as FLAG
FROM
(select …) A
What I get is a mistake alert saying there is no such object as A or A1.
Thanks in advance for the help!
You can use exists with cte :
with cte as (
<query here>
)
select c.*,
(case when exists (select 1
from cte c1
where c1.customer_ID = c.customer_ID and
c1.response in ('sold', 'selling')
)
then 1 else 0
end) as flag
from cte c;
You can also do aggregation :
select customer_id,
max(case when a.response in ('sold', 'selling') then 1 else 0 end) as flag
from < query here > a;
group by customer_id;
With statement as suggested by Yogesh is a good option. If you have any performance issues with "WITH" statement. you can create a volatile table and use columns from volatile table in your select statement .
create voltaile table as (select response from where response in ('SOLD','SELLING').
SELECT from customer table < and join voltaile table>.
The only disadvantge here is volatile tables cannot be accessed after you disconnect from session.

Count query giving wrong column name error

select COUNT(analysed) from Results where analysed="True"
I want to display count of rows in which analysed value is true.
However, my query gives the error: "The multi-part identifier "Results.runId" could not be bound.".
This is the actual query:
select ((SELECT COUNT(*) AS 'Count'
FROM Results
WHERE Analysed = 'True')/failCount) as PercentAnalysed
from Runs
where Runs.runId=Analysed.runId
My table schema is:
The value I want for a particular runId is: (the number of entries where analysed=true)/failCount
EDIT : How to merge these two queries?
i) select runId,Runs.prodId,prodDate,prodName,buildNumber,totalCount as TotalTestCases,(passCount*100)/(passCount+failCount) as PassPercent,
passCount,failCount,runOwner from Runs,Product where Runs.prodId=Product.prodId
ii) select (cast(counts.Count as decimal(10,4)) / cast(failCount as decimal(10,4))) as PercentAnalysed
from Runs
inner join
(
SELECT COUNT(*) AS 'Count', runId
FROM Results
WHERE Analysed = 'True'
GROUP BY runId
) counts
on counts.runId = Runs.runId
I tried this :
select runId,Runs.prodId,prodDate,prodName,buildNumber,totalCount as TotalTestCases,(passCount*100)/(passCount+failCount) as PassPercent,
passCount,failCount,runOwner,counts.runId,(cast(counts.Count as decimal(10,4)) / cast(failCount as decimal(10,4))) as PercentAnalysed
from Runs,Product
inner join
(
SELECT COUNT(*) AS 'Count', runId
FROM Results
WHERE Analysed = 'True'
GROUP BY runId
) counts
on counts.runId = Runs.runId
where Runs.prodId=Product.prodId
but it gives error.
Your problems are arising from improper joining of tables. You need information from both Runs and Results, but they aren't combined properly in your query. You have the right idea with a nested subquery, but it's in the wrong spot. You're also referencing the Analysed table in the outer where clause, but it hasn't been included in the from clause.
Try this instead:
select (cast(counts.Count as decimal(10,4)) / cast(failCount as decimal(10,4))) as PercentAnalysed
from Runs
inner join
(
SELECT COUNT(*) AS 'Count', runId
FROM Results
WHERE Analysed = 'True'
GROUP BY runId
) counts
on counts.runId = Runs.runId
I've set this up as an inner join to eliminate any runs which don't have analysed results; you can change it to a left join if you want those rows, but will need to add code to handle the null case. I've also added casts to the two numbers, because otherwise the query will perform integer division and truncate any fractional amounts.
I'd try the following query:
SELECT COUNT(*) AS 'Count'
FROM Results
WHERE Analysed = 'True'
This will count all of your rows where Analysed is 'True'. This should work if the datatype of your Analysed column is either BIT (Boolean) or STRING(VARCHAR, NVARCHAR).
Use CASE in Count
SELECT COUNT(CASE WHEN analysed='True' THEN analysed END) [COUNT]
FROM Results
Click here to view result
select COUNT(*) from Results where analysed="True"

How to perform running sum (balance) in SQL

I have 2 SQL Tables
unit_transaction
unit_detail_transactions
(tables schema here: http://sqlfiddle.com/#!3/e3204/2 )
What I need is to perform an SQL Query in order to generate a table with balances. Right now I have this SQL Query but it's not working fine because when I have 2 transactions with the same date then the balance is not calculated correctly.
SELECT
ft.transactionid,
ft.date,
ft.reference,
ft.transactiontype,
CASE ftd.isdebit WHEN 1 THEN MAX(ftd.debitaccountid) ELSE MAX(ftd.creditaccountid) END as financialaccountname,
CAST(COUNT(0) as tinyint) as totaldetailrecords,
ftd.isdebit,
SUM(ftd.amount) as amount,
balance.amount as balance
FROM unit_transaction_details ftd
JOIN unit_transactions ft ON ft.transactionid = ftd.transactionid
JOIN
(
SELECT DISTINCT
a.transactionid,
SUM(CASE b.isdebit WHEN 1 THEN b.amount ELSE -ABS(b.amount) END) as amount
--SUM(b.debit-b.credit) as amount
FROM unit_transaction_details a
JOIN unit_transactions ft ON ft.transactionid = a.transactionid
CROSS JOIN unit_transaction_details b
JOIN unit_transactions ft2 ON ft2.transactionid = b.transactionid
WHERE (ft2.date <= ft.date)
AND ft.unitid = 1
AND ft2.unitid = 1
AND a.masterentity = 'CONDO-A'
GROUP BY a.transactionid,a.amount
) balance ON balance.transactionid = ft.transactionid
WHERE
ft.unitid = 1
AND ftd.isactive = 1
GROUP BY
ft.transactionid,
ft.date,
ft.reference,
ft.transactiontype,
ftd.isdebit,
balance.amount
ORDER BY ft.date DESC
The result of the query is this:
Any clue on how to perform a correct SQL that will show me the right balances ordered by transaction date in descendant mode?
Thanks a lot.
EDIT: THINK OF 2 POSSIBLE SOLUTIONS
The problem is generated when you have the same date in 2 transactions, so here is what Im going to do:
Save Date and Time into "date" column. That way there won't be 2 exact dates.
OR
Create a "priority" column and set the priority for each record. So if I found that the date already exists and it has priority = 1 then the current priority will be 2.
What do you think?
There are two ways to do a running sum. I am going to show the syntax on a simpler table, to give you an idea.
Some databases (Oracle, PostgreSQL, SQL Server 2012, Teradata, DB2 for instance) support cumulative sums directly. For this you use the following function:
select sum(<val>) over (partition by <column> order by <ordering column>)
from t
This is a windows function that will calculate the running sum of for each group of records identified by . The order of the sum is .
Alas, many databases don't support this functionality, so you would need to do a self join to do this in a single SELECT query in the database:
select t.column, sum(tprev.<val>) as cumsum
from t left join
t tprev
where t.<column> = tprev.<column> and
t.<ordering column> >= tprev.<ordering column>
group by t.column
There is also the possibility of creating another table and using a cursor to assign the cumulative sum, or of doing the sum at the application level.