How Do I Programmatically Use "Count" in PySpark?

I'm trying to do a simple count in PySpark programmatically but keep getting errors. .count() works at the end of the statement if I drop AS (count(city)), but I need the count to appear inside the query, not outside it.
result = spark.sql("SELECT city AS (count(city)) AND business_id FROM business WHERE city = 'Reading'")
One of many errors
Py4JJavaError: An error occurred while calling o24.sql.
: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '(' expecting ')'(line 1, pos 21)
== SQL ==
SELECT city AS (count(city)) AND business_id FROM business WHERE city = 'Reading'
---------------------^^^

Your syntax is incorrect. Maybe you want to do this instead:
result = spark.sql("""
SELECT
count(city) over(partition by city),
business_id
FROM business
WHERE city = 'Reading'
""")
You need to provide a window if you use count next to a non-aggregated column without GROUP BY. In this case, you probably want a count for each city.
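To make the window-versus-GROUP BY difference concrete, here is a minimal sketch. It is an assumption-laden stand-in: it uses Python's built-in sqlite3 module instead of Spark (window functions need SQLite 3.25 or newer), and the table contents are made up for illustration.

```python
import sqlite3

# Hypothetical stand-in for the `business` table; SQLite replaces Spark here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (business_id TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO business VALUES (?, ?)",
    [("b1", "Reading"), ("b2", "Reading"), ("b3", "Oxford")],
)

# Window form: every matching row is kept and carries the count for its city.
windowed = conn.execute(
    """
    SELECT business_id, COUNT(city) OVER (PARTITION BY city) AS city_count
    FROM business
    WHERE city = 'Reading'
    ORDER BY business_id
    """
).fetchall()

# GROUP BY form: the rows collapse to one row per city.
grouped = conn.execute(
    """
    SELECT city, COUNT(*) AS city_count
    FROM business
    WHERE city = 'Reading'
    GROUP BY city
    """
).fetchall()

print(windowed)  # [('b1', 2), ('b2', 2)]
print(grouped)   # [('Reading', 2)]
```

The same two SQL strings should be usable with spark.sql(...), but that has not been checked here.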

Here is my own solution to the problem I'm trying to solve. The solution above is where I would like to end up.
result = spark.sql("SELECT count(*) FROM business WHERE city='Reading'")

Related

Any and All SQL operator in bigquery

Does BigQuery support any and all operators in Standard SQL?
I am trying to find all users who are getting higher than the minimum salary in a department, and the query below doesn't work. I keep getting an "Unexpected keyword ANY" message.
select ENAME_, JOB_ from `tescomobile---bigquery.internal.testing1`
where SAL_ = ANY(
select min(sal_) from `tescomobile---bigquery.internal.testing1`
group by DEPTNO_)
group by JOB_,ENAME_
You could use IN. Something like this:
SELECT
ENAME_,
JOB_
FROM `tescomobile---bigquery.internal.testing1`
WHERE
sal_ IN (
SELECT
MIN(sal_)
FROM `tescomobile---bigquery.internal.testing1`
GROUP BY DEPTNO_
)
GROUP BY JOB_,ENAME_
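As a rough illustration of the IN form, here is a small sketch using Python's sqlite3 module rather than BigQuery. The table name `emp` and its rows are hypothetical. Note that, like `= ANY`, this matches a salary equal to the minimum of any department, not only the employee's own.

```python
import sqlite3

# Hypothetical employee table; the names and values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (ename TEXT, job TEXT, sal INTEGER, deptno INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?, ?)",
    [
        ("SMITH", "CLERK", 800, 20),
        ("JONES", "MANAGER", 2975, 20),
        ("JAMES", "CLERK", 950, 30),
        ("BLAKE", "MANAGER", 2850, 30),
    ],
)

# The subquery yields one minimum per department (800 and 950);
# IN keeps the rows whose salary equals any of those values.
lowest_paid = conn.execute(
    """
    SELECT ename, job
    FROM emp
    WHERE sal IN (SELECT MIN(sal) FROM emp GROUP BY deptno)
    ORDER BY ename
    """
).fetchall()

print(lowest_paid)  # [('JAMES', 'CLERK'), ('SMITH', 'CLERK')]
```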

BigQuery Query working with multiple "likes" but not working with "in"

I would like to isolate some emails with specific titles. I can use multiple "like"s connected with ORs in the where clause. This gives me a number of results. However, if I try to do a ____ in ('____', '____', etc), the code suddenly returns nothing.
This does not work.
select DATE_TRUNC(DATE(send_time,"America/Los_Angeles"), week(monday)) as week,
status,
settings_title,
sum(emails_sent) as emails_sent,
sum(report_summary_opens) as report_summary_opens,
sum(report_summary_unique_opens) as report_summary_unique_opens,
sum(report_summary_subscriber_clicks) as report_summary_subscriber_clicks
from mailchimp.campaigns_view
where status = 'sent'
and settings_title in ('%_LL_%', '%_IC_%', '%_AC_%', '%_CC_%', '%_PC_%')
group by 1,2,3
order by 1 desc
However, this works.
select DATE_TRUNC(DATE(send_time,"America/Los_Angeles"), week(monday)) as week,
status,
settings_title,
sum(emails_sent) as emails_sent,
sum(report_summary_opens) as report_summary_opens,
sum(report_summary_unique_opens) as report_summary_unique_opens,
sum(report_summary_subscriber_clicks) as report_summary_subscriber_clicks
from mailchimp.campaigns_view
where status = 'sent'
and (settings_title like '%_LL_%'
or settings_title like '%_IC_%'
or settings_title like '%_AC_%'
or settings_title like '%_CC_%'
or settings_title like '%_PC_%')
group by 1,2,3
order by 1 desc
I have already tried to include a subquery in my "from" that eliminates all null settings_title. Any ideas why this is not working? Am I missing some small syntax error?
Thanks for the help!
The % wildcard only works with LIKE. IN only tests for equality. Try REGEXP_CONTAINS too.
As in:
SELECT REGEXP_CONTAINS("abcdefg", '(xxx|zzz|yyy|cd)')
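A small sketch of the difference, using Python's sqlite3 module as a stand-in for BigQuery (the `campaigns` table and its titles are hypothetical). With IN, the string '%_LL_%' is compared literally, so nothing matches; with LIKE, % and _ act as wildcards.

```python
import sqlite3

# Hypothetical campaign titles, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE campaigns (settings_title TEXT)")
conn.executemany(
    "INSERT INTO campaigns VALUES (?)",
    [("Spring_LL_Promo",), ("Winter_IC_Sale",), ("Newsletter",)],
)

# IN compares for equality: no title is literally '%_LL_%' or '%_IC_%'.
in_rows = conn.execute(
    """
    SELECT settings_title FROM campaigns
    WHERE settings_title IN ('%_LL_%', '%_IC_%')
    """
).fetchall()

# LIKE treats % as "any run of characters" and _ as "any single character".
like_rows = conn.execute(
    """
    SELECT settings_title FROM campaigns
    WHERE settings_title LIKE '%_LL_%'
       OR settings_title LIKE '%_IC_%'
    ORDER BY settings_title
    """
).fetchall()

print(in_rows)    # []
print(like_rows)  # [('Spring_LL_Promo',), ('Winter_IC_Sale',)]
```

Because _ is itself a wildcard in LIKE, '%_LL_%' also matches titles where the characters around LL are not underscores.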
Thanks Felipe, very useful! In my case I used REGEXP_CONTAINS to match against multiple patterns stored in a table. With the pattern column aggregated into "strPattern" and passed as the second argument, the select searches for and correctly finds every portion of the pattern:
WITH CTE_PatternCovid as (
Select STRING_AGG(Pattern,'|') as strPattern from xxxxxxxxx.TEMP.Temp_patternsearch_covid 
)
--this converts the multiple patterns into a single line:
--.*MASK.FFP.|.*MASK.KN.|.*TEST.ANTIG.|.*MASK.QUI.|.*SP.*H.DRO.AL.
--then use in this way:
Select ProductName FROM xxxxxxxxx.TEMP.Table_ProductsName_covid 
where
regexp_contains (upper(ProductName),(SELECT strPattern FROM CTE_PatternCovid ))
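The same idea can be sketched outside BigQuery: join the patterns with | and run one partial-match search per row. The sketch below uses Python's re module, which is an assumption on my part (BigQuery uses RE2, whose syntax differs in places), and the product names are made up. The patterns are taken from the comment above.

```python
import re

# Patterns as they might be stored one per row in the pattern table.
patterns = [".*MASK.FFP.", ".*MASK.KN.", ".*TEST.ANTIG."]

# Equivalent of STRING_AGG(Pattern, '|'): one alternation of all patterns.
combined = "|".join(patterns)
matcher = re.compile(combined)

# Hypothetical product names, for illustration only.
product_names = ["Mask FFP2 box", "Test antigen kit", "Hand cream"]

# re.search is a partial match, similar in spirit to REGEXP_CONTAINS;
# upper() mirrors the upper(ProductName) call in the query.
matches = [name for name in product_names if matcher.search(name.upper())]

print(combined)  # .*MASK.FFP.|.*MASK.KN.|.*TEST.ANTIG.
print(matches)   # ['Mask FFP2 box', 'Test antigen kit']
```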

Code works fine till COALESCE

Hi, this code works fine until the last statement. There is more to it, but I was wondering what is incorrect here.
This is on the IBM i (AS/400).
'SQL0199 Keyword Select Not Expected. Valid Tokens: For Use Skip Wait With Fetch Order Union Except Optimize' Can you explain this issue to me?
SELECT COUNT(*)
FROM DLIB.ORDHEADR,DLIB.TRANCODE,DLIB.TRA11
WHERE OHCOM# = TSCOM# AND OHORD# = TSORD#
AND (otCOM# = OHCOM# AND OTORD#= OHORD# AND ottrnc = 'AQC')
AND OHORDT IN('RTR','INT','SAM')
AND OHREQD = replace(char(current date, iso), '-', '')
AND OHHLDC = ' '
AND ( ( TSTATS IN('AEP','SPJ')
AND OHORD# NOT in (SELECT a.TSORD#
FROM DLIB.TRANCODE a
WHERE a.TSTATS IN('EEP','SPC')
)
)
OR TSTATS IN('EEP','SPC')
AND OHORD# IN (SELECT DISTINCT(C.TSORD#)
FROM DLIB.TRANCODE C
JOIN (SELECT DISTINCT (B.TSORD#), MAX(B.TSUTIM) AS C_TSUTIM,
MAX(B.TSUDAT) AS C_TSUDAT
FROM DLIB.TRANCODE B
WHERE B.TSTATS IN ('EEP','SPC','ECM','ECT',
'ECA','CEL','BOC','COM',
'COO','REV','MCO','CPA',
'ECV','ECC','EPT','EPM',
'CAT','CAC','CAM','CAS',
'MAC','004','006','600',
'MEP','EPC','CPK')
GROUP BY B.TSORD#
) q1
ON C.TSORD# = q1.TSORD#
AND C.TSUDAT = q1.C_TSUDAT
AND C.TSUTIM = q1.C_TSUTIM
WHERE C.TSORD# NOT IN (SELECT F.TSORD#
FROM DLIB.TRANCODE F
WHERE F.TSTATS IN ('SPJ','REL','EAS','REV',
'STP','SPT','PPC','SPM',
'BPA','BPB','BPC','BPD','BPE',
'BPF','BPG','BPH','BPI','BPJ',
'BPK','BPL','BPM','BPN','CBM',
'BPO','BPP','BAT','BCM',
'BAM','WAT','WAM','LBL','012',
'006','600','004','SCP','CBA',
'CBB','CBC','CBD','CBE',
'CBF','CBG','CBH','CBI','CBJ',
'CBK','CBL','CBM','CBN','CBO',
'CBP','CBQ','CBR','CBS',
'CBT','CBU','CBV','CBW',
'CBX','CBY','CBZ','CB1',
'CB2','CB3','CB4','CB5')
)
AND C.TSTATS IN('EEP','SPC')
)
)
-- till here it's fine.
SELECT COALESCE(SUM(OdQty#),0)
You need to use GROUP BY to SUM.
SELECT COALESCE(SUM(Goals),0) AS TeamGoals
FROM Players
GROUP BY TeamId
After formatting your code so that we can see better where the various parts of the statement begin and end, we can see what matches up with what.
Everything up to "till here it's fine" is one SQL SELECT statement. You need a semicolon to terminate it before your next query, which starts with SELECT COALESCE() but is incomplete since there is no FROM clause. Once you've put the terminator on the first statement, it should run.
The second query is another question. You didn't show us the rest of the code. TeKapa says you need a GROUP BY clause whenever you use an aggregate function, but that is only required if you are also including a non-aggregate column in the results.
SELECT TeamID, COALESCE(SUM(Goals),0) AS TeamGoals
FROM Players
GROUP BY TeamId
That will give you each TeamID in Players, and the total Goals for each team. You would probably also include ORDER BY TeamID.
But if you simply want the combined total of all Players, it is completely valid to say
SELECT SUM(Goals) AS TotalGoals
FROM Players
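A quick way to check both forms is a throwaway in-memory database. This sketch uses Python's sqlite3 module rather than Db2 for i, and the Players rows are invented. It also shows what COALESCE buys you: SUM over zero rows is NULL, which COALESCE turns into 0.

```python
import sqlite3

# Hypothetical Players data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Players (TeamId INTEGER, Goals INTEGER)")
conn.executemany("INSERT INTO Players VALUES (?, ?)", [(1, 3), (1, 2), (2, 4)])

# Aggregate with no GROUP BY: one row for the whole table.
total = conn.execute("SELECT SUM(Goals) AS TotalGoals FROM Players").fetchone()[0]

# Aggregate next to a non-aggregate column: GROUP BY is required.
per_team = conn.execute(
    """
    SELECT TeamId, COALESCE(SUM(Goals), 0) AS TeamGoals
    FROM Players
    GROUP BY TeamId
    ORDER BY TeamId
    """
).fetchall()

# No rows match, so SUM returns NULL and COALESCE replaces it with 0.
empty_total = conn.execute(
    "SELECT COALESCE(SUM(Goals), 0) FROM Players WHERE TeamId = 99"
).fetchone()[0]

print(total)        # 9
print(per_team)     # [(1, 5), (2, 4)]
print(empty_total)  # 0
```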
Taking a step back, it seems like your query has gotten so complex that even you may be having difficulty managing it. Hopefully others won't be asked to maintain something like this.
If such code is going into production, I recommend finding ways to modularize portions of the complexity, such as with views, or common table expressions. It may also be a good idea to store those lists of values in a table, rather than hardcoding them.
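One possible shape for that refactoring, sketched with Python's sqlite3 module instead of Db2 for i. Everything named here is hypothetical: the simplified `trancode` columns, the `status_group` lookup table, and the group names 'open' and 'closed'. It mirrors only the first branch of the original WHERE clause (status in AEP/SPJ and no EEP/SPC row for the same order).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical lookup table replacing the hardcoded status lists.
conn.execute("CREATE TABLE status_group (group_name TEXT, tstats TEXT)")
conn.executemany(
    "INSERT INTO status_group VALUES (?, ?)",
    [("open", "AEP"), ("open", "SPJ"), ("closed", "EEP"), ("closed", "SPC")],
)

# Simplified, made-up transaction rows (order number, status code).
conn.execute("CREATE TABLE trancode (tsord TEXT, tstats TEXT)")
conn.executemany(
    "INSERT INTO trancode VALUES (?, ?)",
    [("A1", "AEP"), ("A2", "SPJ"), ("A2", "EEP"), ("A3", "SPC")],
)

# The CTE names one piece of the logic so the main query stays readable.
open_orders = conn.execute(
    """
    WITH excluded_orders AS (
        SELECT t.tsord
        FROM trancode t
        JOIN status_group g ON g.tstats = t.tstats
        WHERE g.group_name = 'closed'
    )
    SELECT DISTINCT tsord
    FROM trancode
    WHERE tstats IN (SELECT tstats FROM status_group WHERE group_name = 'open')
      AND tsord NOT IN (SELECT tsord FROM excluded_orders)
    ORDER BY tsord
    """
).fetchall()

print(open_orders)  # [('A1',)]
```

Adding or retiring a status code then becomes a row change in the lookup table instead of an edit to every query that lists the codes.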

Only a single result allowed for a SELECT that is part of an expression

I have the following SQL statement. It throws the following error: "Only a single result allowed for a SELECT that is part of an expression". The goal of my SQL statement is to get the name of the employee who made the 'cheapest' bribe.
The part between the brackets returns the employee_id and the money it costs per day (of the relatively cheapest bribe). These are two results, while I only want the employee_id. So I just want to use the MIN part to get the right employee_id. How can I do this?
SELECT Voornaam, Achternaam
FROM Medewerker m JOIN
(
SELECT Medewerker_id
FROM Steekpenning
ORDER BY -1*Bedrag/(julianday(Begindatum) - julianday(Einddatum))
limit 1
) s
on m.Medewerker_id = s.Medewerker_id;
EDITED the answer. How can I expand this query to only show the bribes started this month? I think I need to use something like this? (julianday(Begindatum) - julianday('now')) > 31 but where?
Regards.
Cas
I think the following will work in SQLite:
select Firstname, Surname
from Employee e join
(select employee_id
from bribe
order by -1*Amount/(julianday(Startdate) - julianday(Enddate))
limit 1
) b
on e.employee_id = b.employee_id;
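To see the query behave as intended, here is a minimal sketch run through Python's sqlite3 module. The rows are invented for illustration. Since Startdate minus Enddate is negative, multiplying by -1 gives a positive cost per day, and the ascending ORDER BY with LIMIT 1 picks the lowest one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (employee_id INTEGER, Firstname TEXT, Surname TEXT)")
conn.execute(
    "CREATE TABLE bribe (employee_id INTEGER, Amount INTEGER, Startdate TEXT, Enddate TEXT)"
)

# Hypothetical data: cost per day is 10, 5 and 25 respectively.
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [(1, "Ann", "Smith"), (2, "Bob", "Jones"), (3, "Cas", "Visser")],
)
conn.executemany(
    "INSERT INTO bribe VALUES (?, ?, ?, ?)",
    [
        (1, 100, "2020-01-01", "2020-01-11"),  # 10 days
        (2, 300, "2020-01-01", "2020-03-01"),  # 60 days
        (3, 50, "2020-01-01", "2020-01-03"),   # 2 days
    ],
)

# The derived table returns a single column, so it can be joined on directly.
cheapest = conn.execute(
    """
    SELECT Firstname, Surname
    FROM Employee e JOIN
         (SELECT employee_id
          FROM bribe
          ORDER BY -1*Amount/(julianday(Startdate) - julianday(Enddate))
          LIMIT 1
         ) b
      ON e.employee_id = b.employee_id
    """
).fetchall()

print(cheapest)  # [('Bob', 'Jones')]
```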

multiple count(distinct)

I get an error unless I remove one of the count(distinct ...) expressions. Can someone tell me why and how to fix it?
I'm in VFP (Visual FoxPro). iif([condition],[if true],[else]) is equivalent to case when.
SELECT * FROM dpgift where !nocalc AND rectype = "G" AND sol = "EM112" INTO CURSOR cGift
SELECT
list_code,
count(distinct iif(language != 'F' AND renew = '0' AND type = 'IN',donor,0)) as d_Count_E_New_Indiv,
count(distinct iif(language = 'F' AND renew = '0' AND type = 'IN',donor,0)) as d_Count_F_New_Indiv /*it works if i remove this*/
FROM cGift gift
LEFT JOIN
(select didnumb, language, type from dp) d
on cast(gift.donor as i) = cast(d.didnumb as i)
GROUP BY list_code
ORDER by list_code
edit:
Apparently, you can't use multiple distinct commands on the same level. Is there any way around this?
VFP does NOT support two "DISTINCT" clauses in the same query... PERIOD... I've even tested on a simple table of my own, DIRECTLY from within VFP such as
select count( distinct Col1 ) as Cnt1, count( distinct col2 ) as Cnt2 from MyTable
causes a crash. I don't know why you are trying to do DISTINCT as you are just testing a condition... It more accurately appears you just want a COUNT of entries per category of criteria instead of an actual DISTINCT.
Because you are not "alias.field" referencing your columns in your query, I don't know which column is the basis of what. However, to help handle your DISTINCT, and it appears you are running from WITHIN a VFP app as you are using the "INTO CURSOR" clause (which would not be associated with any OleDB .net development), I would pre-query and group those criteria, something like...
select list_code,
donor,
max( iif( language != 'F' and renew = '0' and type = 'IN', 1, 0 )) as EQualified,
max( iif( language = 'F' and renew = '0' and type = 'IN', 1, 0 )) as FQualified
from
list_code
group by
list_code,
donor
into
cursor cGroupedByDonor
So the above will ONLY get a count of 1 per donor per list code, no matter how many records qualify. In addition, if one record has an "F" and another does NOT, then you'll have a value of 1 in EACH of the columns... Then you can do something like...
select
list_code,
sum( EQualified ) as DistEQualified,
sum( FQualified ) as DistFQualified
from
cGroupedByDonor
group by
list_code
into
cursor cDistinctByListCode
then run from that...
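The two-step approach can be tried out with Python's sqlite3 module standing in for VFP. Several things here are assumptions: the single `gift` table (already joined, so language and type sit next to donor), the sample rows, CASE WHEN in place of iif, and a temp table in place of INTO CURSOR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, already-joined gift rows: donor 1 appears twice in L1.
conn.execute(
    "CREATE TABLE gift (list_code TEXT, donor INTEGER, language TEXT, renew TEXT, type TEXT)"
)
conn.executemany(
    "INSERT INTO gift VALUES (?, ?, ?, ?, ?)",
    [
        ("L1", 1, "E", "0", "IN"),
        ("L1", 1, "E", "0", "IN"),
        ("L1", 2, "F", "0", "IN"),
        ("L1", 3, "E", "0", "IN"),
        ("L2", 4, "F", "0", "IN"),
        ("L2", 4, "F", "1", "IN"),
    ],
)

# Step 1: one row per (list_code, donor), flagged 1 if any record qualifies.
conn.execute(
    """
    CREATE TEMP TABLE cGroupedByDonor AS
    SELECT list_code,
           donor,
           MAX(CASE WHEN language != 'F' AND renew = '0' AND type = 'IN' THEN 1 ELSE 0 END) AS EQualified,
           MAX(CASE WHEN language = 'F' AND renew = '0' AND type = 'IN' THEN 1 ELSE 0 END) AS FQualified
    FROM gift
    GROUP BY list_code, donor
    """
)

# Step 2: summing the flags counts each donor at most once per list code.
distinct_by_list_code = conn.execute(
    """
    SELECT list_code, SUM(EQualified), SUM(FQualified)
    FROM cGroupedByDonor
    GROUP BY list_code
    ORDER BY list_code
    """
).fetchall()

print(distinct_by_list_code)  # [('L1', 2, 1), ('L2', 0, 1)]
```

Unlike count(distinct iif(..., donor, 0)), this does not count the placeholder 0 as an extra distinct value when some rows fail the condition.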
You can try using either another derived table or two to do the calculations you need, or using projections (queries in the field list). Without seeing the schema, it's hard to know which one will work for you.