Compute postgresql query using Spark SQL

Compute postgresql query using Spark SQL - apache-spark-sql

So basically i need to run the two following queries using Spark SQL but i can not figure out how to to id without my compiler throwing some random errors.
Query 1:
SELECT DISTINCT c.name, count(p.pid)FROM clubs c
JOIN teams t on c.cid = t.cid
JOIN tournaments d on t.tid = t.tid
JOIN players p on p.ncid = c.ncid
WHERE c.cid = 45 AND d.tyear = 2014
GROUP BY c.name
ORDER BY count DESC
Query 2:
SELECT DISTINCT t.tyear, c.name, (SELECT max(m.matchdate) - min(m.matchdate) FROM matches m WHERE t.tyear = date_part('year', m.matchdate)) AS days FROM tournaments t
JOIN hosts h ON t.tyear = h.tyear
JOIN countries c on c.cid = h.cid
JOIN stadiums s on s.cid = c.cid
JOIN matches m on m.sid = s.sid
GROUP BY t.tyear, c.name, s.sid
ORDER BY days DESC
ive already tried to run query 1 like this but it didnt work:
spark1 = SparkSession.builder.appName('spark').getOrCreate()
teams = spark1.read.csv("teams.csv", header = True, mode="DROPMALFORMED").cache()
clubs = spark1.read.csv("clubs.csv", header=True, mode="DROPMALFORMED").cache()
tournaments = spark1.read.csv("tournaments.csv", header = True, mode="DROPMALFORMED")
players = spark1.read.csv("players.csv", header = True, mode="DROPMALFORMED")
spark1.sql('SELECT DISTINCT c.name, count(p.pid)FROM clubs JOIN teams t on c.cid = t.cid JOIN tournaments d on t.tid = t.tid JOIN players p on p.ncid = c.ncid WHERE c.cid = 45 AND d.tyear = 2014 GROUP BY c.name ORDER BY count DESC').show()
any help would be appreciated!

Related

How to get rid of duplicating data in every row?

SELECT distinct AD.ReferenceNumber, AD.ProjectTitle, Z.ZoneCode, C.CompanyName,SS.AssignedTo, ZG.ZoneGroupName,au.Amount
FROM ApplicationDetails AD
LEFT JOIN ApplicationFormsDetails AS b ON (AD.referencenumber = b.referencenumber)
LEFT JOIN ScheduleSummaries AS SS ON (AD.ReferenceNumber = SS.ReferenceNo)
INNER JOIN AppTypes as at on ss.ItemCode = at.Category
INNER JOIN Companies AS C ON (AD.CompanyId = C.CompanyID)
INNER JOIN Zones Z ON (C.ZoneCode = Z.ZoneCode)
INNER JOIN ZoneGroups ZG ON (Z.ZoneGroup = ZG.ZoneGroupId)
LEFT JOIN AssessmentUsedItems au on ah.AssessmentHeaderId = au.HeaderId
WHERE AD.ApplicationDate BETWEEN '2017-10-01' AND '2017-10-31' AND ZG.ZoneGroupCode = 'HO' and ah.referencenumber = 'N-101317-A1-02'
GROUP BY AD.ReferenceNumber, AD.ProjectTitle, Z.ZoneCode, C.CompanyName,SS.AssignedTo, ZG.ZoneGroupName,au.Amount--, ah.ApplicationForm,au.Amount
The output of this query is its duplicating the amount for every AssignTO.
Output :

Maybe you want to try using SUM(ISNULL(au.amount, 0)) AS amount instead of au.amount and remove au.amount from the GROUP BY as well...

Try this query:
SELECT AD.ReferenceNumber,
AD.ProjectTitle,
Z.ZoneCode,
C.CompanyName,
SS.AssignedTo,
ZG.ZoneGroupName,
SUM(COALESCE(au.Amount,0)) AS Amount
FROM ApplicationDetails AD
LEFT JOIN ApplicationFormsDetails AS b
ON (AD.referencenumber = b.referencenumber)
LEFT JOIN ScheduleSummaries AS SS
ON (AD.ReferenceNumber = SS.ReferenceNo)
INNER JOIN AppTypes AS at
ON ss.ItemCode = at.Category
INNER JOIN Companies AS C
ON (AD.CompanyId = C.CompanyID)
INNER JOIN Zones Z
ON (C.ZoneCode = Z.ZoneCode)
INNER JOIN ZoneGroups ZG
ON (Z.ZoneGroup = ZG.ZoneGroupId)
LEFT JOIN AssessmentUsedItems au
ON ah.AssessmentHeaderId = au.HeaderId
WHERE AD.ApplicationDate BETWEEN '2017-10-01' AND '2017-10-31'
AND ZG.ZoneGroupCode = 'HO'
AND ah.referencenumber = 'N-101317-A1-02'
GROUP BY
AD.ReferenceNumber,
AD.ProjectTitle,
Z.ZoneCode,
C.CompanyName,
SS.AssignedTo,
ZG.ZoneGroupName

Finding the count

I have the following SQL query and need to know the count of companyid as I can see repeating data. How do I find the count of it. Following is the query
SELECT a.companyId 'companyId'
, i.orgDebtType 'orgDebtType'
, d.ratingTypeName 'ratingTypeName'
, c.currentRatingSymbol 'currentRatingSymbol'
, c.ratingStatusIndicator 'ratingStatusIndicator'
, g.qualifierValue 'qualifierValue'
, c.ratingdate 'ratingDate'
, h.value 'outlook'
FROM ciqRatingEntity a
JOIN ciqcompany com
on com.companyId = a.companyId
JOIN ciqratingobjectdetail b ON a.entitySymbolValue = b.objectSymbolValue
JOIN ciqRatingData c ON b.ratingObjectKey = c.ratingObjectKey
JOIN ciqRatingType d ON b.ratingTypeId = d.ratingTypeId
JOIN ciqRatingOrgDebtType i ON i.orgDebtTypeId=b.orgDebtTypeId
JOIN ciqRatingEntityData red ON red.entitySymbolValue=a.entitySymbolValue
AND red.ratingDataItemId='1' ---CoName
LEFT JOIN ciqRatingDataToQualifier f ON f.ratingDataId = c.ratingDataId
LEFT JOIN ciqRatingQualifiervalueType g ON g.qualifiervalueid = f.qualifierValueId
LEFT JOIN ciqRatingValueType h ON h.ratingValueId = c.outlookValueId
WHERE 1=1
AND b.ratingTypeId IN ( '130', '131', '126', '254' )
-- and a.companyId = #companyId
AND a.companyId IN
(SELECT distinct TOP 2000000
c.companyId
FROM ciqCompany c
inner join ciqCompanyStatusType cst on cst.companystatustypeid = c.companystatustypeid
inner join ciqCompanyType ct on ct.companyTypeId = c.companyTypeId
inner join refReportingTemplateType rep on rep.templateTypeId = c.reportingtemplateTypeId
inner join refCountryGeo rcg on c.countryId = rcg.countryId
inner join refState rs on rs.stateId = c.stateId
inner join ciqSimpleIndustry sc on sc.simpleIndustryId = c.simpleIndustryId
ORDER BY companyid desc)
ORDER BY companyId DESC, c.ratingdate, b.ratingTypeId, c.ratingStatusIndicator

This will list where there are duplicate companyID's
SELECT companyId, count(*) as Recs
FROM ciqCompany
GROUP BY ciqCompany
HAVING count(*) > 1

I understand that you wish to add a column to the query with the count of each companyId, you can use COUNT() OVER():
select count(a.companyId) over (partition by a.companyId) as companyCount,
<rest of the columns>
from ciqRatingEntity a
join <rest of the query>
This would return in each row the count of the companyId of that row without grouping the results.

How to retrieve count of records in SELECT statement

I am trying to retrieve the right count of records to mitigate an issue I am having. The below query returns 327 records from my database:
SELECT DISTINCT COUNT(at.someid) AS CountOfStudentsInTable FROM tblJobSkillAssessment AS at
INNER JOIN tblJobSkills j ON j.jobskillid = at.skillid
LEFT JOIN tblStudentPersonal sp ON sp.someid2 = at.someid
INNER JOIN tblStudentSchool ss ON ss.monsterid = at.someid
INNER JOIN tblSchools s ON s.schoolid = ss.schoolid
INNER JOIN tblSchoolDistricts sd ON sd.schoolid = s.schoolid
INNER JOIN tblDistricts d ON d.districtid = sd.districtid
INNER JOIN tblCountySchools cs ON cs.schoolid = s.schoolid
INNER JOIN tblCounties cty ON cty.countyid = cs.countyid
INNER JOIN tblRegionUserRegionGroups rurg ON rurg.districtid = d.districtid
INNER JOIN tblGroups g ON g.groupid = rurg.groupid
WHERE ss.graduationyear IN (SELECT Items FROM FN_Split(#gradyears, ',')) AND sp.optin = 'Yes' AND g.groupname = #groupname
Where I run into trouble is trying to reconcile that with the below query. One is for showing just a count of all the particular students the other is showing pertinent information for a set of students as needed but the total needs to be the same and it is not. The below query return 333 students - the reason is because the school the student goes to is in two separate counties and it counts that student twice. I can't figure out how to fix this.
SELECT DISTINCT #TableName AS TableName, d.district AS LocationName, cty.county AS County, COUNT(DISTINCT cc.monsterid) AS CountOfStudents, d.IRN AS IRN FROM tblJobSkillAssessment AS cc
INNER JOIN tblJobSkills AS c ON c.jobskillid = cc.skillid
INNER JOIN tblStudentPersonal sp ON sp.monsterid = cc.monsterid
INNER JOIN tblStudentSchool ss ON ss.monsterid = cc.monsterid
INNER JOIN tblSchools s ON s.schoolid = ss.schoolid
INNER JOIN tblSchoolDistricts sd ON sd.schoolid = s.schoolid
INNER JOIN tblDistricts d ON d.districtid = sd.districtid
INNER JOIN tblCountySchools cs ON cs.schoolid = s.schoolid
INNER JOIN tblCounties cty ON cty.countyid = cs.countyid
INNER JOIN tblRegionUserRegionGroups rurg ON rurg.districtid = d.districtid
INNER JOIN tblGroups g ON g.groupid = rurg.groupid
WHERE ss.graduationyear IN (SELECT Items FROM FN_Split(#gradyears, ',')) AND sp.optin = 'Yes' AND g.groupname = #groupname
GROUP BY cty.county, d.IRN, d.district
ORDER BY LocationName ASC

If you just want the count, then perhaps count(distinct) will solve the problem:
select count(distinct at.someid)
I don't see what at.someid refers to, so perhaps:
select count(distinct cc.monsterid)

Using a sum with a distinct in SQL

I have a query that returns the data i'm looking for using a distinct, but when I SUM that data I get a wrong amount for a my hierachy point '4-2-0-0-5-2'. 4-2-0-0-5-2 has multiple rows so when I sum it, it doesn't add up correctly. What would be the best way to incorporate a distinct into a SUM statement. Any help would be appreicated. Thanks.
First query :
Select distinct B.Proj_Nbr,c.proj_cc,h.proj_cc, h.Proj_Hier, B.Proj_Nm, D.Fscl_Per, A.Amount
from acct_bal a
inner join dim_proj b on a.dim_proj_id = b.dim_proj_id
inner join essbase_fcs.projects_hier_map c on c.proj_nbr = b.proj_nbr
inner join dim_per_mo d on d.dim_per_mo_id = a.dim_per_mo_id
Inner Join Dim_Acct F On A.Dim_Acct_Id = F.Dim_Acct_Id
Inner Join Dim_Org G On A.Dim_Org_Id = G.Dim_Org_Id
inner join essbase_fcs.projects_hier_map h on h.proj_cc = g.cost_ctr
inner join dim_org g1 on c.proj_cc = g1.cost_ctr
Where F.Fin_Lee_Nbr = 500
and c.proj_hier like '4-2-0-0-5-2%'
And A.Dim_Scnro_Id = '45'
And D.Fscl_Yr = '2014'
And b.Proj_Nbr = '9005459'
and fscl_per ='1'
RESULT of 2 rows:
9005459 0358080 0358080 4-2-0-0-5-2 Global Sales.com (iSell) 179777.09
9005459 0358080 0358057 4-2-0-0-5-5 Global Sales.com (iSell) 2257.3**
When I want to sum the data I use this query below. This gives me the two rows i'm looking for, but proj_hier 4-2-0-0-5-2 has the wrong amount because it has multiple rows.
Select B.Proj_Nbr,c.proj_cc, h.Proj_Hier, B.Proj_Nm, D.Fscl_Per, sum(A.Amount)
from acct_bal a
inner join dim_proj b on a.dim_proj_id = b.dim_proj_id
inner join essbase_fcs.projects_hier_map c on c.proj_nbr = b.proj_nbr
inner join dim_per_mo d on d.dim_per_mo_id = a.dim_per_mo_id
Inner Join Dim_Acct F On A.Dim_Acct_Id = F.Dim_Acct_Id
Inner Join Dim_Org G On A.Dim_Org_Id = G.Dim_Org_Id
inner join essbase_fcs.projects_hier_map h on h.proj_cc = g.cost_ctr
inner join dim_org g1 on c.proj_cc = g1.cost_ctr
Where F.Fin_Lee_Nbr = 500
and c.proj_hier like '4-2-0-0-5-2%'
And A.Dim_Scnro_Id = '45'
And D.Fscl_Yr = '2014'
And b.Proj_Nbr = '9005459'
and fscl_per ='1'
group by B.Proj_Nbr,c.proj_cc,f.dim_acct_id, h.Proj_Hier, B.Proj_Nm, D.Fscl_Per
having Sum(A.Amount) <> 0
Order By H.Proj_Hier, B.Proj_Nbr, D.Fscl_Per

Please Generalize the Question and then ask, If i understood your problem Here is solution:
General Query :
select sum(a.amountColumn) from your_table
group by agrrColumnName;
If i change your query :
Select distinct B.Proj_Nbr,c.proj_cc,h.proj_cc, h.Proj_Hier, B.Proj_Nm, D.Fscl_Per, sum(A.Amount)
from acct_bal a
inner join dim_proj b on a.dim_proj_id = b.dim_proj_id
inner join essbase_fcs.projects_hier_map c on c.proj_nbr = b.proj_nbr
inner join dim_per_mo d on d.dim_per_mo_id = a.dim_per_mo_id
Inner Join Dim_Acct F On A.Dim_Acct_Id = F.Dim_Acct_Id
Inner Join Dim_Org G On A.Dim_Org_Id = G.Dim_Org_Id
inner join essbase_fcs.projects_hier_map h on h.proj_cc = g.cost_ctr
inner join dim_org g1 on c.proj_cc = g1.cost_ctr
Where F.Fin_Lee_Nbr = 500
and c.proj_hier like '4-2-0-0-5-2%'
And A.Dim_Scnro_Id = '45'
And D.Fscl_Yr = '2014'
And b.Proj_Nbr = '9005459'
and fscl_per ='1' group by B.Proj_Nbr;

sql query sum bringing back different results

I have the following two queries below, the Total is coming back different, but I am adding the sums in each of the query the same way. Why is the total coming back different?
select [Total Children] = (SUM(demo.NumberOfPreschoolers) + SUM(demo.NumberOfToddlers) + SUM(demo.NumberOfInfants)),
County = co.Description
from ClassroomDemographics as demo
inner join Classrooms as c on demo.Classroom_Id = c.Id
inner join Sites as s on c.Site_Id = s.Id
inner join Profiles as p on s.Profile_Id = p.Id
inner join Dictionary.Counties as co on p.County_Id = co.Id
where co.Description = 'MyCounty'
Group By co.Description
select [Number Of DLL Children] = SUM(cd.NumberOfLanguageSpeakers),
[Total Children] = (SUM(demo.NumberOfPreschoolers) + SUM(demo.NumberOfToddlers) + SUM(demo.NumberOfInfants)),
County = co.Description
from ClassroomDLL as cd
inner join Classrooms as c on cd.Classroom_Id = c.Id
inner join Sites as s on c.Site_Id = s.Id
inner join Profiles as p on s.Profile_Id = p.Id
inner join Dictionary.Counties as co on p.County_Id = co.Id
inner join ClassroomDemographics as demo on c.Id = demo.Classroom_Id
where co.Description = 'MyCounty'
Group by co.Description

Just a quick glance over the two querties, I would presume that:
inner join ClassroomDemographics as demo on c.Id = demo.Classroom_Id
in the second query is excluding results that are in the first query, therefor the aggregated values will be different.

Your join to the Classrooms table is joining with an extra table in the 2nd query.
Query 1:
from ClassroomDemographics as demo
inner join Classrooms as c on demo.Classroom_Id = c.Id
Query 2:
from ClassroomDLL as cd
inner join Classrooms as c on cd.Classroom_Id = c.Id
...
inner join ClassroomDemographics as demo on c.Id = demo.Classroom_Id
My bet is that the ClassroomDLL table has less data in it, or has rows with a null for one of the join criteria columns, either of which could exclude rows from the results and throw your aggregate totals off.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Compute postgresql query using Spark SQL - apache-spark-sql

Related

How to get rid of duplicating data in every row?

Finding the count

How to retrieve count of records in SELECT statement

Using a sum with a distinct in SQL

sql query sum bringing back different results

Categories

Resources