SQL - How to eliminate duplicates from the below query in POSTGRES - sql

I have been working on the below query. Basically there are two tables. Realtime_Input and Realtime_Output. When I join the two tables and take the necessary columns, I made this a view and when i query against the view I get duplicates.
What am I doing wrong? When I tested using distinct keyword, I get 60 unique rows but intermittently i get duplicates. My db is on cloud foundry cloud (postgres). Is is because of that? Please help !
select i2.key_ts_long,
case
when i2.revenue_activepower = 'NA'
then (-1 * CAST(io.min5_forecast as real))
else (CAST(i2.revenue_activepower AS real) - CAST(io.min5_forecast as real))
end as diff
from realtime_analytic_input i2,
(select i.farm_id,
i.key_ts_long,
o.min5_forecast,
o.min5_timestamp_seconds
from realtime_analytic_input i,
realtime_analytic_output o
where i.farm_id = o.farm_id
and i.key_ts_long = o.key_ts_long
and o.farm_id = 'MW1'
) io
where i2.key_ts_long = CAST(io.min5_timestamp_seconds AS bigint)
and i2.farm_id = io.farm_id
and i2.farm_id = 'MW1'
and io.key_ts_long between 1464738953169 and 1466457841
order by io.key_ts_long desc

Related

Can't find a way to improve my PostgreSQL query

In my PostgreSQL database I have 6 tables named storeAPrices, storeBprices etc., holding the same columns and indexes as follows:
item_code (string, primary_key)
item_name (string, btree index)
is_whigthed (number : 0|1, betree index)
item_price (number )
My desire is to join each storePrices table to other by item_code or item_name similarity but "OR" should act as in programming language (check right side only if left is false).
Currently, my query has low performance.
select
*
FROM "storeAprices" sap
left JOIN LATERAL (
SELECT * FROM "storeBPrices" sbp
WHERE
similarity(sap.item_name,sbp.item_name) >= 0.45
ORDER BY similarity(sap.item_name,sbp.item_name) DESC
limit 1
) bp ON case when sap.item_code = bp.item_code then true else sap.item_name % bp.item_name end
left JOIN LATERAL (
select * FROM "storeCPrices" scp
WHERE similarity(sap.item_name,scp.item_name) >= 0.45
ORDER BY similarity(sap.item_name,scp.item_name) desc
limit 1
) rp ON case when sap.item_code = rp.item_code then true else sap.item_name % rp.item_name end
This is part of my query and it took too much time to response. My data is not so large (15k items per table)
Also I have another index "is_whigthed" that I'm not sure how to use it. (I don't want set it as variable because I want to get all "is_whigthed" results)
Any suggestions?
OR should be faster than using case
bp ON sap.item_code = bp.item_code OR sap.item_name % bp.item_name
also you can create Trigram index on item_name columns as mentioned in pg_trgm module docs, since you are using it's % operator for similarity

TPC-DS Query 6: Why do we need 'where j.i_category = i.i_category' condition?

I'm going through TPC-DS for Amazon Athena.
It was fine until query 5.
I got some problem on query 6. (which is below)
select a.ca_state state, count(*) cnt
from customer_address a
,customer c
,store_sales s
,date_dim d
,item i
where a.ca_address_sk = c.c_current_addr_sk
and c.c_customer_sk = s.ss_customer_sk
and s.ss_sold_date_sk = d.d_date_sk
and s.ss_item_sk = i.i_item_sk
and d.d_month_seq =
(select distinct (d_month_seq)
from date_dim
where d_year = 2002
and d_moy = 3 )
and i.i_current_price > 1.2 *
(select avg(j.i_current_price)
from item j
where j.i_category = i.i_category)
group by a.ca_state
having count(*) >= 10
order by cnt, a.ca_state
limit 100;
It took more than 30 minutes so it failed with timeout.
I tried to find which part cause problem, so I checked the where conditions and I found where j.i_category = i.i_category for the last part of where condition.
I don't know why this condition is needed so I deleted this part and the query ran Ok.
can you guys tell me why this part is needed?
The j.i_category = i.i_category is subquery correlation condition.
If you remove it from the subquery
select avg(j.i_current_price)
from item j
where j.i_category = i.i_category)
the subquery becomes uncorrelated, and becomes a global aggregation on the item table, which is easy to calculate and the query engine needs to do it once.
If you want a fast, performant query engine on AWS, i can recommend Starburst Presto (disclaimer: i am from Starburst). See https://www.concurrencylabs.com/blog/starburst-presto-vs-aws-redshift/ for a related comparison (note: this is not a comparison with Athena).
If it doesn't have to be that fast, you can use PrestoSQL on EMR (note that "PrestoSQL" and "Presto" components on EMR are not the same thing).

SQL Server - Need to SUM values in across multiple returned records

In the following query I am trying to get TotalQty to SUM across both the locations for item 6112040, but so far I have been unable to make this happen. I do need to keep both lines for 6112040 separate in order to capture the different location.
This query feeds into a Jasper ireport using something called Java.Groovy. Despite this, none of the PDFs printed yet have been either stylish or stained brown. Perhaps someone could address that issue as well, but this SUM issue takes priority
I know Gordon Linoff will get on in about an hour so maybe he can help.
DECLARE #receipt INT
SET #receipt = 20
SELECT
ent.WarehouseSku AS WarehouseSku,
ent.PalletId AS [ReceivedPallet],
ISNULL(inv.LocationName,'') AS [ActualLoc],
SUM(ISNULL(inv.Qty,0)) AS [LocationQty],
SUM(ISNULL(inv.Qty,0)) AS [TotalQty],
MAX(CAST(ent.ReceiptLineNumber AS INT)) AS [LineNumber],
MAX(ent.WarehouseLotReference) AS [WarehouseLot],
LEFT(SUM(ent.WeightExpected),7) AS [GrossWeight],
LEFT(SUM(inv.[Weight]),7) AS [NetWeight]
FROM WarehouseReceiptDetail AS det
INNER JOIN WarehouseReceiptDetailEntry AS ent
ON det.ReceiptNumber = ent.ReceiptNumber
AND det.FacilityName = ent.FacilityName
AND det.WarehouseName = ent.WarehouseName
AND det.ReceiptLineNumber = ent.ReceiptLineNumber
LEFT OUTER JOIN Inventory AS inv
ON inv.WarehouseName = det.WarehouseName
AND inv.FacilityName = det.FacilityName
AND inv.WarehouseSku = det.WarehouseSku
AND inv.CustomerLotReference = ent.CustomerLotReference
AND inv.LotReferenceOne = det.ReceiptNumber
AND ISNULL(ent.CaseId,'') = ISNULL(inv.CaseId,'')
WHERE
det.WarehouseName = $Warehouse
AND det.FacilityName = $Facility
AND det.ReceiptNumber = #receipt
GROUP BY
ent.PalletId
, ent.WarehouseSku
, inv.LocationName
, inv.Qty
, inv.LotReferenceOne
ORDER BY ent.WarehouseSku
The lines I need partially coalesced are 4 and 5 in the above return.
Create a second dataset with a subquery and join to that subquery - you can extrapolate from the following to apply to your situation:
First the Subquery:
SELECT
WarehouseSku,
SUM(Qty)
FROM
Inventory
GROUP BY
WarehouseSku
Now apply to your query - insert into the FROM clause:
...
LEFT JOIN (
SELECT
WarehouseSKU,
SUM(Qty)
FROM
Inventory
GROUP BY
WarehouseSKU
) AS TotalQty
ON Warehouse.WarehouseSku = TotalQty.WarehouseSku
Without seeing the actual schema DDL it is hard to know the exact cardinality, but I think this will point you in the right direction.

How to improve query performance in Oracle

Below sql query is taking too much time for execution. It might be due to repetitive use of same table in from clause. I am not able to find out how to fix this query so that performance would be improve.
Can anyone help me out with this?
Thanks in advance !!
select --
from t_carrier_location act_end,
t_location end_loc,
t_carrier_location act_start,
t_location start_loc,
t_vm_voyage_activity va,
t_vm_voyage v,
t_location_position lp_start,
t_location_position lp_end
where act_start.carrier_location_id = va.carrier_location_id
and act_start.carrier_id = v.carrier_id
and act_end.carrier_location_id =
decode((select cl.carrier_location_id
from t_carrier_location cl
where cl.carrier_id = act_start.carrier_id
and cl.carrier_location_no =
act_start.carrier_location_no + 1),
null,
(select cl2.carrier_location_id
from t_carrier_location cl2, t_vm_voyage v2
where v2.hire_period_id = v.hire_period_id
and v2.voyage_id =
(select min(v3.voyage_id)
from t_vm_voyage v3
where v3.voyage_id > v.voyage_id
and v3.hire_period_id = v.hire_period_id)
and v2.carrier_id = cl2.carrier_id
and cl2.carrier_location_no = 1),
(select cl.carrier_location_id
from t_carrier_location cl
where cl.carrier_id = act_start.carrier_id
and cl.carrier_location_no =
act_start.carrier_location_no + 1))
and lp_start.location_id = act_start.location_id
and lp_start.from_date <=
nvl(act_start.actual_dep_time, act_start.actual_arr_time)
and (lp_start.to_date is null or
lp_start.to_date >
nvl(act_start.actual_dep_time, act_start.actual_arr_time))
and lp_end.location_position_id = act_end.location_id
and lp_end.from_date <=
nvl(act_end.actual_dep_time, act_end.actual_arr_time)
and (lp_end.to_date is null or
lp_end.to_date >
nvl(act_end.actual_dep_time, act_end.actual_arr_time))
and act_end.location_id = end_loc.location_id
and act_start.location_id = start_loc.location_id;
There is no Stright forward one answer for your question and the query you've mentioned.
In order to get a better response time of any query, you need to keep few things in mind while writing your queries. I will mention few here which appeared to be important for your query
Use joins instead of subqueries.
Use EXPLAIN to determine queries are functioning appropriately.
Use the columns which are having indexes with your where clause else create an index on those columns. here use your common sense which are the columns to be indexed ex: foreign key columns, deleted, orderCreatedAt, startDate etc.
Keep the order of the select columns as they appear at the table instead of arbitrarily selecting columns.
The above four points are enough for the query you've provided.
To dig deep about SQL optimization and tuning refer this https://docs.oracle.com/database/121/TGSQL/tgsql_intro.htm#TGSQL130

Database (Oracle 11g) query optimization for joins

So I am trying to optimize a bunch of queries which are taking a lot of time. What I am trying to figure out is how to create an index on columns from different tables.
Here is a simple version of my problem.
What I did
After Googling I looked into bitmap index but I am not sure if this is the right way to solve the issue
Issue
There is a many to many relationship b/w Student(sid,...) and Report(rid, year, isdeleted)
StudentReport(id, sid, rid) is the join table
Query
Select *
from Report
inner join StudentReport on Report.rid = StudentReport.rid
where Report.isdeleted = 0 and StudentReport.sid = x and Report.year = y
What is the best way to create an index?
Please try this:
with TMP_REP AS (
Select * from Report where Report.isdeleted = 0 AND Report.year = y
)
,TMP_ST_REP AS(
Select *
from StudentReport where StudentReport.sid = x
)
SELECT * FROM TMP_REP R, TMP_ST_REP S WHERE S.rid = R.rid