Flattening Join in PostgreSQL

Is it possible to join a table so that only a specific row at a specific ordered offset is joined, instead of every matching record in the table?
I have two tables, Customer and MonthlyRecommendation. MonthlyRecommendation points to Customer and tracks one product recommendation made by the customer on some day in each month.
I'm trying to write a query that retrieves each customer, along with the last 12 months of recommendations. Simply doing:
SELECT c.id, m.date, m.product
FROM Customer AS c
INNER JOIN MonthlyRecommendation AS m ON m.customer_id = c.id
will get me the data I want, but I need it flattened so that each customer's data is in one row, and the result signature looks like:
id, date_01, product_01, date_02, product_02, ..., date_12, product_12
Is there any way to do this in PostgreSQL? For similar problems, I would normally just make 12 separate JOINs, joining on a specific sub-condition for each one, but in this case the condition is relative to the order of the date values in the table. I'd like to be able to specify an ORDER BY, with maybe a LIMIT and OFFSET, but I don't believe any SQL dialect supports that.

Some databases support the pivot operation directly. In Postgres, you could use a function such as crosstab (from the tablefunc extension). But the aggregation method is simple enough:
SELECT c.id,
'2013-01' as date_01, max(case when m.date = '2013-01' then m.product end) as product_01,
'2013-02' as date_02, max(case when m.date = '2013-02' then m.product end) as product_02,
. . .
'2013-12' as date_12, max(case when m.date = '2013-12' then m.product end) as product_12
FROM Customer c INNER JOIN
MonthlyRecommendation m
ON m.customer_id = c.id
GROUP BY c.id;
Of course, the above query is just guessing at a format for date. You'll need to put the right comparison in for your data.
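Since the question asks for slots that are relative to date order rather than fixed months, the same conditional aggregation can pivot on a row number instead of a literal date. The following is a minimal sketch, not a definitive implementation: it runs the SQL through Python's sqlite3 module (SQLite 3.25 or later for window functions) purely so it is self-contained, the sample rows are invented, and only 2 of the 12 slots are written out. The ROW_NUMBER() subquery should carry over to PostgreSQL unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; table and column names follow the question.
conn.executescript("""
    CREATE TABLE Customer (id INTEGER PRIMARY KEY);
    CREATE TABLE MonthlyRecommendation (customer_id INTEGER, date TEXT, product TEXT);
    INSERT INTO Customer VALUES (1), (2);
    INSERT INTO MonthlyRecommendation VALUES
        (1, '2013-01-15', 'apples'),
        (1, '2013-02-03', 'pears'),
        (1, '2013-03-20', 'plums'),
        (2, '2013-02-11', 'kiwis');
""")

# Number each customer's rows from newest to oldest, then pivot on that
# number rather than on a hard-coded date. Slots with no row stay NULL.
rows = conn.execute("""
    SELECT c.id,
           MAX(CASE WHEN r.rn = 1 THEN r.date END)    AS date_01,
           MAX(CASE WHEN r.rn = 1 THEN r.product END) AS product_01,
           MAX(CASE WHEN r.rn = 2 THEN r.date END)    AS date_02,
           MAX(CASE WHEN r.rn = 2 THEN r.product END) AS product_02
    FROM Customer AS c
    INNER JOIN (
        SELECT customer_id, date, product,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY date DESC) AS rn
        FROM MonthlyRecommendation
    ) AS r ON r.customer_id = c.id AND r.rn <= 2
    GROUP BY c.id
    ORDER BY c.id
""").fetchall()

for row in rows:
    print(row)
```

With this sample data, customer 1 gets its two most recent recommendations and customer 2 gets one filled slot and one empty one. Extending to 12 slots means repeating the CASE pair up to rn = 12 and raising the rn limit.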

Related

Sum fields of an Inner join

How can I sum two fields that come from an inner join?
I have this code:
select
SUM(ACT.NumberOfPlants ) AS NumberOfPlants,
SUM(ACT.NumOfJornales) AS NumberOfJornals
FROM dbo.AGRMastPlanPerformance MPR (NOLOCK)
INNER JOIN GENRegion GR ON (GR.intGENRegionKey = MPR.intGENRegionLink )
INNER JOIN AGRDetPlanPerformance DPR (NOLOCK) ON
(DPR.intAGRMastPlanPerformanceLink =
MPR.intAGRMastPlanPerformanceKey)
INNER JOIN vwGENPredios P (NOLOCK) ON ( DPR.intGENPredioLink =
P.intGENPredioKey )
INNER JOIN AGRSubActivity SA (NOLOCK) ON (SA.intAGRSubActivityKey =
DPR.intAGRSubActivityLink)
LEFT JOIN (SELECT RA.intGENPredioLink, AR.intAGRActividadLink,
AR.intAGRSubActividadLink, SUM(AR.decNoPlantas) AS
intPlantasTrabajads, SUM(AR.decNoPersonas) AS NumOfJornales,
SUM(AR.decNoPlants) AS NumberOfPlants
FROM AGRRecordActivity RA WITH (NOLOCK)
INNER JOIN AGRActividadRealizada AR WITH (NOLOCK) ON
(AR.intAGRRegistroActividadLink = RA.intAGRRegistroActividadKey AND
AR.bitActivo = 1)
INNER JOIN AGRSubActividad SA (NOLOCK) ON (SA.intAGRSubActividadKey
= AR.intAGRSubActividadLink AND SA.bitEnabled = 1)
WHERE RA.bitActive = 1 AND
AR.bitActive = 1 AND
RA.intAGRTractorsCrewsLink IN(2)
GROUP BY RA.intGENPredioLink,
AR.decNoPersons,
AR.decNoPlants,
AR.intAGRAActivityLink,
AR.intAGRSubActividadLink) ACT ON (ACT.intGENPredioLink IN(
DPR.intGENPredioLink) AND
ACT.intAGRAActivityLink IN( DPR.intAGRAActivityLink) AND
ACT.intAGRSubActivityLink IN( DPR.intAGRSubActivityLink))
WHERE
MPR.intAGRMastPlanPerformanceKey IN(4) AND
DPR.intAGRSubActivityLink IN( 1153)
GROUP BY
P.vchRegion,
ACT.NumberOfFloors,
ACT.NumOfJournals
ORDER BY ACT.NumberOfFloors DESC
However, it does not perform the complete sum. It returns the column values row by row instead of a single total for the whole column.
For example, the query returns these results:
What I expect is the final sums. For NumberOfPlants the result of the sum would be 163,237 and for NumberOfJornals it would be 61.
How can I do this?

First of all the (nolock) hints are probably not accomplishing the benefit you hope for. It's not an automatic "go faster" option, and if such an option existed you can be sure it would be already enabled. It can help in some situations, but the way it works allows the possibility of reading stale data, and the situations where it's likely to make any improvement are the same situations where risk for stale data is the highest.
That out of the way, with that much code in the question we're better served with a general explanation and solution for you to adapt.
The issue here is GROUP BY. When you use a GROUP BY in SQL, you're telling the database you want to see separate results per group for any aggregate functions like SUM() (and COUNT(), AVG(), MAX(), etc).
So if you have this:
SELECT Sum(ColumnB) As SumB
FROM [Table]
GROUP BY ColumnA
You will get a separate row per ColumnA group, even though it's not in the SELECT list.
If you don't really care about that, you can do one of two things:
1. Remove the GROUP BY. If there are no grouped columns in the SELECT list, the GROUP BY clause is probably not accomplishing anything important.
2. Nest the query. If option 1 is somehow not possible (say, the original is actually a view), you could do this:
SELECT SUM(SumB)
FROM (
SELECT Sum(ColumnB) As SumB
FROM [Table]
GROUP BY ColumnA
) t
Note in both cases any JOIN is irrelevant to the issue.
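To make the difference concrete, here is a small sketch that runs the three variants side by side. It uses Python's sqlite3 module only so that it is runnable on its own; the table t and its three rows are made up for illustration, and the SQL itself is not specific to SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: two rows in group 'x', one row in group 'y'.
conn.executescript("""
    CREATE TABLE t (ColumnA TEXT, ColumnB INTEGER);
    INSERT INTO t VALUES ('x', 10), ('x', 5), ('y', 7);
""")

# With GROUP BY: one SUM per ColumnA value, even though ColumnA is not selected.
grouped = conn.execute(
    "SELECT SUM(ColumnB) AS SumB FROM t GROUP BY ColumnA ORDER BY ColumnA"
).fetchall()

# Option 1: drop the GROUP BY to get a single total.
total = conn.execute("SELECT SUM(ColumnB) AS SumB FROM t").fetchall()

# Option 2: keep the grouped query and sum its rows in an outer query.
nested = conn.execute("""
    SELECT SUM(SumB)
    FROM (SELECT SUM(ColumnB) AS SumB FROM t GROUP BY ColumnA) AS sub
""").fetchall()

print(grouped)  # [(15,), (7,)]
print(total)    # [(22,)]
print(nested)   # [(22,)]
```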

Aggregation of a single data type on a joined table

I have a manual report that costs 2-3 hours of manual aggregation work every week.
The "a" table gives me the length I want to sum, and the join to the "b" table brings in the name of the activity I need to aggregate by. The issue is that the query does not fully aggregate the value. The output I am looking for is a.dtlExcode_ and b.print_name_ with the total sum per a.dtlExcode_ for the selected dates.
Can anyone provide some pointers? I am fairly new to writing queries but am determined to eliminate manual work from canned reports by using the database.
select sum(a.dtlLength_) as Total_Minutes,
a.dtlExcode_,
b.print_name_
from V_SCHEDULE a
left join V_EXCEPT b on a.dtlExcode_ = b.code_
where sched_date_ is between '2021-07-12' and '2021-07-18'
Group by
a.sched_date_,
a.dtlLength_,
a.dtlExcode_,
b.print_name_

Just a few small errors in the logic. Here are the adjustments:
select sum(a.dtlLength_) as Total_Minutes
, a.dtlExcode_
, b.print_name_
from V_SCHEDULE a
left join V_EXCEPT b
on a.dtlExcode_ = b.code_
where sched_date_ between '2021-07-12' and '2021-07-18'
Group by a.sched_date_, a.dtlExcode_, b.print_name_
;
You had included a.dtlLength_ in the GROUP BY terms, leading to groups for each separate a.dtlLength_ value. That was the main problem.
The other GROUP BY terms could be fine, depending on your requirement.
If you want each date separately in the result, that's fine. If not, remove the date term from the GROUP BY clause, like so:
select sum(a.dtlLength_) as Total_Minutes
, a.dtlExcode_
, b.print_name_
from V_SCHEDULE a
left join V_EXCEPT b
on a.dtlExcode_ = b.code_
where sched_date_ between '2021-07-12' and '2021-07-18'
Group by a.dtlExcode_, b.print_name_
;
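The effect of grouping by the column being summed can be seen on a few rows. This is only a sketch under assumptions: the view names and columns are taken from the question, but they are recreated here as plain tables with invented values, and the SQL is run through Python's sqlite3 module so the example is self-contained. The date term is left out of both GROUP BY lists to isolate the effect of a.dtlLength_.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-ins for the V_SCHEDULE and V_EXCEPT views.
conn.executescript("""
    CREATE TABLE V_SCHEDULE (sched_date_ TEXT, dtlExcode_ TEXT, dtlLength_ INTEGER);
    CREATE TABLE V_EXCEPT (code_ TEXT, print_name_ TEXT);
    INSERT INTO V_EXCEPT VALUES ('BRK', 'Break'), ('TRN', 'Training');
    INSERT INTO V_SCHEDULE VALUES
        ('2021-07-12', 'BRK', 15),
        ('2021-07-12', 'BRK', 45),
        ('2021-07-13', 'BRK', 15),
        ('2021-07-13', 'TRN', 60);
""")

base = """
    SELECT SUM(a.dtlLength_) AS Total_Minutes, a.dtlExcode_, b.print_name_
    FROM V_SCHEDULE a
    LEFT JOIN V_EXCEPT b ON a.dtlExcode_ = b.code_
    WHERE a.sched_date_ BETWEEN '2021-07-12' AND '2021-07-18'
    GROUP BY {cols}
    ORDER BY a.dtlExcode_, Total_Minutes
"""

# Grouping by the measure splits 'BRK' into one group per distinct length.
split = conn.execute(
    base.format(cols="a.dtlLength_, a.dtlExcode_, b.print_name_")
).fetchall()

# Grouping only by the code and its name gives one total per code.
per_code = conn.execute(
    base.format(cols="a.dtlExcode_, b.print_name_")
).fetchall()

print(split)     # [(30, 'BRK', 'Break'), (45, 'BRK', 'Break'), (60, 'TRN', 'Training')]
print(per_code)  # [(75, 'BRK', 'Break'), (60, 'TRN', 'Training')]
```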

How do I get all values for Store_ID pulled into my Snowflake Query?

I have a query below and am trying to get all the week_id's, upc_id's and upc_dsc's pulled in even if there is no net_amt or item_qty for them. I'm successfully pulling in all stores, but I also want to show a upc and week id for these stores so that they can see if they have 0 sales for a upc. I tried doing a right join with my date table under the right join of the s table as well as a right join for the upc table, but it messes up my data and doesn't pull in the columns I need. Does anyone know how to fix this?
Thank you
select
a.week_id
, s.district_cd
, s.store_id
, a.upc_id
, a.upc_dsc
, sum(a.net_amt) as current_week_sales
, sum(t.net_amt) as last_week_sales
, sum(a.item_qty) as current_week_units
, sum(t.item_qty) as last_week_units
from (
select
week_id
, district_cd
, str.store_id
, txn.upc_id
, upc_dsc
, dense_rank() over (order by week_id) as rank
, sum(txn.net_amt) as net_amt
, sum(txn.item_qty) as item_qty
from dw_dss.txn_facts txn
left join dw_dss.lu_store_finance_om STR
on str.store_id = txn.store_id
join dw_dss.lu_upc upc
on upc.upc_id = txn.upc_id
join lu_day_merge day
on day.d_date = txn.txn_dte
where district_cd in (72,73)
and txn.upc_id in (27610100000
,27610200000
,27610300000
,27610400000
)
and division_id = 19
and txn_dte between '2021-07-25' and current_date - 1
group by 1,2,3,4,5
) a
left join temp_tables.ab_week_ago t
on t.rank = a.rank and a.store_id = t.store_id and a.upc_id = t.upc_id
right join dw_dss.lu_store_finance_om s
on s.store_id = a.store_id
where s.division_id = 19
and s.district_cd in (72,73)
group by 1,2,3,4,5

As stated in a previous comment, the example is too long to debug, especially since the source tables are not provided.
However, as a general rule, when adding zeroes for missing dimensions, I follow these steps:
1. Construct the main query. This is the query with all the complexity that pulls the data you need: just the available data, without missing dimensions. Test this query to make sure it gives correct results, aggregated correctly by each dimension.
2. Then use this query as a CTE in a WITH statement, and to this query you can right join all dimensions for which you want to add zero values for missing data.
3. Be sure to double-check filtering on the dimensions to ensure that you don't filter out too much in your WHERE conditions. For example, instead of filtering with WHERE on the final query, like in your example:
right join dw_dss.lu_store_finance_om s
on s.store_id = a.store_id
where s.division_id = 19
and s.district_cd in (72,73)
I might rather filter the dimension itself in a subquery:
right join (select store_id, district_cd from dw_dss.lu_store_finance_om
where division_id = 19 and district_cd in (72,73)) s
on s.store_id = a.store_id
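The risk of filtering "too much" in the final WHERE is easiest to see when a condition touches the side of the outer join that may be NULL. The sketch below is an illustration under assumptions, not the asker's schema: stores and sales are hypothetical tables with invented rows, the SQL runs through Python's sqlite3 module so it is self-contained, and the join is written as a LEFT JOIN starting from the dimension, which is the same join as the RIGHT JOIN above with the table order swapped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical dimension and fact tables; store 2 has no sales at all.
conn.executescript("""
    CREATE TABLE stores (store_id INTEGER, district_cd INTEGER);
    CREATE TABLE sales (store_id INTEGER, week_id INTEGER, net_amt REAL);
    INSERT INTO stores VALUES (1, 72), (2, 72), (3, 80);
    INSERT INTO sales VALUES (1, 202130, 10.0), (1, 202131, 5.0), (3, 202131, 8.0);
""")

# Filter on the fact side in the final WHERE: for store 2 the joined
# week_id is NULL, the comparison fails, and the zero-sales row disappears.
where_filtered = conn.execute("""
    SELECT s.store_id, COALESCE(SUM(a.net_amt), 0) AS net_amt
    FROM stores s
    LEFT JOIN sales a ON a.store_id = s.store_id
    WHERE s.district_cd = 72 AND a.week_id >= 202130
    GROUP BY s.store_id
    ORDER BY s.store_id
""").fetchall()

# Each side filtered in its own subquery before the outer join:
# store 2 survives and shows up with a zero.
subquery_filtered = conn.execute("""
    SELECT s.store_id, COALESCE(SUM(a.net_amt), 0) AS net_amt
    FROM (SELECT store_id FROM stores WHERE district_cd = 72) s
    LEFT JOIN (SELECT store_id, net_amt FROM sales WHERE week_id >= 202130) a
           ON a.store_id = s.store_id
    GROUP BY s.store_id
    ORDER BY s.store_id
""").fetchall()

print(where_filtered)     # [(1, 15.0)]
print(subquery_filtered)  # [(1, 15.0), (2, 0)]
```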

I have a query below and am trying to get all the week_id's, upc_id's and upc_dsc's pulled in even if there is no net_amt or item_qty for them.
You want to generate the rows that you want using cross join and then use left join to bring in the data you want. In your case, you also want aggregation.
You have not explained the tables, and I find your query quite hard to follow. But the idea is:
select c.weekid, s.store_id, u.upc_id,
count(f.dayid) as num_sales,
sum(f.net_amt) as total_amt
from calendar c cross join
stores s cross join
upcs u left join
facts f
using (dayid, store_id, upc_id) -- or whatever the right conditions are
group by c.weekid, s.store_id, u.upc_id;
Obviously, you have additional filters. You would apply these filters in the where clause to the dimension tables (or use a subquery if the logic is more complicated).
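As a rough, runnable illustration of this cross join scaffold, the sketch below builds every week/store/UPC combination first and only then attaches the facts. The tables, their columns, and all values are hypothetical (the real schema was not given), the join keys are written out with ON instead of USING, and Python's sqlite3 module is used only to make the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical dimensions (2 weeks x 2 stores x 1 UPC) and a small fact table.
conn.executescript("""
    CREATE TABLE calendar (weekid INTEGER);
    CREATE TABLE stores (store_id INTEGER);
    CREATE TABLE upcs (upc_id INTEGER);
    CREATE TABLE facts (weekid INTEGER, store_id INTEGER, upc_id INTEGER, net_amt REAL);
    INSERT INTO calendar VALUES (1), (2);
    INSERT INTO stores VALUES (10), (20);
    INSERT INTO upcs VALUES (100);
    INSERT INTO facts VALUES (1, 10, 100, 4.5), (1, 10, 100, 1.5), (2, 20, 100, 3.0);
""")

# CROSS JOIN produces all combinations; LEFT JOIN keeps the ones without
# facts, and COALESCE turns their NULL sums into zeros.
rows = conn.execute("""
    SELECT c.weekid, s.store_id, u.upc_id,
           COUNT(f.net_amt)            AS num_sales,
           COALESCE(SUM(f.net_amt), 0) AS total_amt
    FROM calendar c
    CROSS JOIN stores s
    CROSS JOIN upcs u
    LEFT JOIN facts f
           ON f.weekid = c.weekid
          AND f.store_id = s.store_id
          AND f.upc_id = u.upc_id
    GROUP BY c.weekid, s.store_id, u.upc_id
    ORDER BY c.weekid, s.store_id, u.upc_id
""").fetchall()

for row in rows:
    print(row)
```

All four combinations come back, two of them with a count and total of zero.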

Bigquery SQL code to pull earliest contact

I have a copy of our Salesforce data in BigQuery, and I'm trying to join the contact table with the account table.
I want to return every account in the dataset, but I only want the contact that was created first for each account.
I've gone around in circles today googling and trying to cobble a query together, but all roads lead to either no accounts, a single account, or loads of contacts per account (ignoring the earliest requirement).
Here's the latest query, which produces no results. I think I'm nearly there but still struggling. Any help would be most appreciated.
SELECT distinct
c.accountid as Acct_id
,a.id as a_Acct_ID
,c.id as Cont_ID
,a.id AS a_CONT_ID
,c.email
,c.createddate
FROM `sfdcaccounttable` a
INNER JOIN `sfdccontacttable` c
ON c.accountid = a.id
INNER JOIN
(SELECT a2.id, c2.accountid, c2.createddate AS MINCREATEDDATE
FROM `sfdccontacttable` c2
INNER JOIN `sfdcaccounttable` a2 ON a2.id = c2.accountid
GROUP BY 1,2,3
ORDER BY c2.createddate asc LIMIT 1) c3
ON c.id = c3.id
ORDER BY a.id asc
LIMIT 10

The solution shared above is very BigQuery specific: it does have some quirks you need to work around like the memory error you got.
I once answered a similar question here that is more portable and easier to maintain.
Essentially you need to create a smaller table (even better, make it a view) with the ID and its first transaction. It's similar to what you shared but slightly different, as you need to group ONLY in the topmost query.
It looks something like this
select
# contact ids that are first time contacts
b.id as cont_id,
b.accountid
from `sfdccontacttable` as b inner join
( select accountid,
min(createddate) as first_tx_time
FROM `sfdccontacttable`
group by 1) as a on (a.accountid = b.accountid and b.createddate = a.first_tx_time)
group by 1, 2
You need to do it this way because otherwise you can end up with multiple IDs per account (if there are any other dimensions associated with it). This approach is also fairly future-proof: more dimensions can be added to the underlying tables without affecting the result, and you can use a WHERE clause in the inner query to define a "valid" contact, and so on. You can then save it as a view and simply reference it in any subquery or join operation.
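For the original goal of returning every account with its earliest contact, the "first contact" query can be used as the right-hand side of a LEFT JOIN from the account table. The following is a sketch under assumptions: accounts and contacts are hypothetical stand-ins for the two Salesforce tables, the rows are invented, and Python's sqlite3 module is used only so the SQL can be run as-is. The join logic does not rely on anything specific to SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: A1 has two contacts, A2 has one, A3 has none.
conn.executescript("""
    CREATE TABLE accounts (id TEXT);
    CREATE TABLE contacts (id TEXT, accountid TEXT, email TEXT, createddate TEXT);
    INSERT INTO accounts VALUES ('A1'), ('A2'), ('A3');
    INSERT INTO contacts VALUES
        ('C1', 'A1', 'first@a1.example',  '2020-01-05'),
        ('C2', 'A1', 'second@a1.example', '2020-03-01'),
        ('C3', 'A2', 'only@a2.example',   '2021-06-10');
""")

# Inner subquery m: earliest createddate per account.
# Subquery c: the contact rows that match that earliest date.
# Outer LEFT JOIN: keeps accounts that have no contact at all.
rows = conn.execute("""
    SELECT a.id AS account_id, c.id AS contact_id, c.createddate
    FROM accounts a
    LEFT JOIN (
        SELECT b.id, b.accountid, b.createddate
        FROM contacts b
        INNER JOIN (
            SELECT accountid, MIN(createddate) AS first_created
            FROM contacts
            GROUP BY accountid
        ) m ON m.accountid = b.accountid
           AND b.createddate = m.first_created
    ) c ON c.accountid = a.id
    ORDER BY a.id
""").fetchall()

for row in rows:
    print(row)
```

Note that if two contacts of one account share the same earliest createddate, both are returned by this approach.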

Set up a view/subquery for client_first or client_last as:
SELECT * except(_rank) from (
select rank() over (partition by accountid order by createddate ASC) as _rank,
*
FROM `prj.dataset.sfdccontacttable`
) where _rank=1
Basically, it uses a window function to number the rows and returns the first row: with ASC that is the first client entry, with DESC the last.
You can do the same for accounts as well; then the two can be joined simply, since there will be exactly one record for each entity.
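One detail worth checking on real data is ties: RANK() assigns the same number to rows with an identical createddate, so more than one row per account can come back. The sketch below compares it with ROW_NUMBER() plus a tie-breaking column. It is only an illustration: the contacts table and its rows are invented, using id as the tie-breaker is an assumption, and the SQL is run through Python's sqlite3 module (SQLite 3.25 or later) to keep it self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: C1 and C2 share the earliest date for account A1.
conn.executescript("""
    CREATE TABLE contacts (id TEXT, accountid TEXT, createddate TEXT);
    INSERT INTO contacts VALUES
        ('C1', 'A1', '2020-01-05'),
        ('C2', 'A1', '2020-01-05'),
        ('C3', 'A1', '2020-03-01'),
        ('C4', 'A2', '2021-06-10');
""")

# RANK(): tied rows both get rank 1, so A1 yields two rows.
with_rank = conn.execute("""
    SELECT id, accountid FROM (
        SELECT id, accountid,
               RANK() OVER (PARTITION BY accountid
                            ORDER BY createddate ASC) AS _rank
        FROM contacts
    ) WHERE _rank = 1
    ORDER BY id
""").fetchall()

# ROW_NUMBER() with a tie-breaker: exactly one row per account.
with_row_number = conn.execute("""
    SELECT id, accountid FROM (
        SELECT id, accountid,
               ROW_NUMBER() OVER (PARTITION BY accountid
                                  ORDER BY createddate ASC, id ASC) AS _rn
        FROM contacts
    ) WHERE _rn = 1
    ORDER BY id
""").fetchall()

print(with_rank)        # [('C1', 'A1'), ('C2', 'A1'), ('C4', 'A2')]
print(with_row_number)  # [('C1', 'A1'), ('C4', 'A2')]
```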
UPDATE
You could also try using ARRAY_AGG, which has a smaller memory footprint.
#standardSQL
SELECT e.* FROM (
SELECT ARRAY_AGG(
t ORDER BY t.createddate ASC LIMIT 1
)[OFFSET(0)] e
FROM `dataset.sfdccontacttable` t
GROUP BY t.accountid
)

How to join multiple tables, without omitting values without a match

This small example is based on a question that I have run into countless times, but I have never found the best answer.
I would like to create a report on incidents logged per type, per month. I have written the following query.
SELECT
d.MonthPeriod
,i.[Type]
,COUNT(*) AS [Count of Calls]
FROM
[dbo].[FactIncident] as i
LEFT JOIN
[dbo].[DimDate] as d on i.DateLoggedKey = d.DateKey
GROUP BY
d.[MonthPeriod],
i.[Type]
This results in the following:
Although correct, I would like to visualize earlier months with 0 logged calls. DimDate contains the following.
What is the best and/or most efficient way of showing the count of calls per month, per type, for all months, even if the count is 0?
I thought of using CROSS APPLY, but the resulting query gets huge quickly. Just think of a dataset requiring the count of calls per customer, per category, per month over the last 3 years.
Any ideas?

Do the left join starting with the calendar table, so you keep all the months:
SELECT d.MonthPeriod, i.[Type], COUNT(i.type) AS [Count of Calls]
FROM [dbo].[DimDate] d LEFT JOIN
[dbo].[FactIncident] i
ON i.DateLoggedKey = d.DateKey
GROUP BY d.[MonthPeriod], i.[Type];
This will, of course, return the type as NULL for the months with no data.
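The switch from COUNT(*) to COUNT(i.type) in the query above matters: after a LEFT JOIN, a month with no incidents still produces one row (with NULLs on the fact side), and COUNT(*) would count that row as 1. A minimal sketch of the difference, using made-up rows and Python's sqlite3 module so it can be run directly; the table and column names follow the question, but the data and the simplified one-date-per-month DimDate are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: no incidents in 2023-01, two in 2023-02.
conn.executescript("""
    CREATE TABLE DimDate (DateKey INTEGER, MonthPeriod TEXT);
    CREATE TABLE FactIncident (DateLoggedKey INTEGER, Type TEXT);
    INSERT INTO DimDate VALUES (20230115, '2023-01'), (20230210, '2023-02');
    INSERT INTO FactIncident VALUES (20230210, 'Outage'), (20230210, 'Outage');
""")

query = """
    SELECT d.MonthPeriod, {count_expr} AS calls
    FROM DimDate d
    LEFT JOIN FactIncident i ON i.DateLoggedKey = d.DateKey
    GROUP BY d.MonthPeriod
    ORDER BY d.MonthPeriod
"""

# COUNT(*) counts the NULL-padded row of the empty month.
count_star = conn.execute(query.format(count_expr="COUNT(*)")).fetchall()

# COUNT(i.Type) skips NULLs, so the empty month correctly shows 0.
count_col = conn.execute(query.format(count_expr="COUNT(i.Type)")).fetchall()

print(count_star)  # [('2023-01', 1), ('2023-02', 2)]
print(count_col)   # [('2023-01', 0), ('2023-02', 2)]
```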
If you want all types present, then use CROSS JOIN on the types. This example gets the data from the fact table, but you might have another reference table containing each type:
SELECT d.MonthPeriod, t.[Type], COUNT(i.type) AS [Count of Calls]
FROM [dbo].[DimDate] d CROSS JOIN
(select distinct type from factincident) t LEFT JOIN
[dbo].[FactIncident] i
ON i.DateLoggedKey = d.DateKey and i.type = t.type
GROUP BY d.[MonthPeriod], t.[Type];
