Adjust R Data Frame with Element Wise Function - sql

I am using RODBC to pull down server data into a data frame using this statement:
df <- data.frame(
  sqlQuery(
    channel = ODBC_channel_string,
    query = SQLquery_string
  )
)
The resultant data set has the following grouping attributes of interest:
Scenario
Other Group By 1
Other Group By 2
...
Other Group By K
With key variables:
[Time] = Future Year(i)
[Spot] = Projected Effective Discount Rate For Year(i-1) to Year(i)
(Abbreviated table snip: image omitted)
What I would like to do is transform the [Spot] column into a discount factor that depends on the preceding values within the same grouping:
Scenario
Other Group By 1
Other Group By 2
...
Other Group By K
With key variables:
[Time] = Future Year(i)
[Spot] = Projected Effective Discount Rate For Year(i-1) to Year(i)
[Disc_0] = prod([Spot]), for [All Grouping] = [This Grouping] and [Time] <= [This Time]
(Excel version of abbreviated goal snip: image omitted)
I could code the solution using a for loop, but I suspect that will be very inefficient in R if there are significant row counts in the original data frame.
What I am hoping is to use some creative implementation of dplyr's mutate:
df %>% mutate(Disc_0 = objective_function{?})
I think R should be able to do this kind of data wrangling quickly, but I am not sure whether that is the case. I am more familiar with SQL and may attempt to produce the necessary variable there instead.
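For what it's worth, a grouped cumulative product is the whole trick here: in dplyr it would be group_by() over the grouping attributes, arrange() by [Time] within each group, then mutate(Disc_0 = cumprod(Spot)). Since the question also considers doing the work server-side, below is a minimal sketch of the SQL route, assuming a database that supports window functions (SQL Server 2012+ syntax shown; Postgres would use LN() rather than LOG()), assuming [Spot] is strictly positive, and using hypothetical table and column names:
SELECT
    Scenario,
    OtherGroupBy1,   -- ...through OtherGroupByK
    [Time],
    [Spot],
    -- running product expressed as EXP of a running SUM of logs
    EXP(SUM(LOG([Spot])) OVER (
        PARTITION BY Scenario, OtherGroupBy1   -- ...and the remaining grouping columns
        ORDER BY [Time]
        ROWS UNBOUNDED PRECEDING)) AS Disc_0
FROM projection_table;
Either route avoids an explicit for loop, so it should stay fast even at significant row counts.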

Related

BigQuery - Adwords Data Transfer - AccountStats vs AccountBasicStats

For many tables there is always an AccountStats vs an AccountBasicStats variant.
The same SQL query might return different values from Stats vs BasicStats, for example:
SELECT
  cs.Date,
  SUM(cs.Impressions) AS Sum_Impressions,
  SUM(cs.Clicks) AS Sum_Clicks,
  SUM(cs.Interactions) AS Sum_Interactions,
  (SUM(cs.Cost) / 1000000) AS Sum_Cost,
  SUM(cs.Conversions) AS Sum_Conversions
FROM
  `{dataset_id}.Customer_{customer_id}` c
LEFT JOIN
  `{dataset_id}.AccountBasicStats_{customer_id}` cs
  -- or, alternatively:
  -- `{dataset_id}.AccountStats_{customer_id}` cs
ON
  c.ExternalCustomerId = cs.ExternalCustomerId
WHERE
  c._DATA_DATE = c._LATEST_DATE
  AND c.ExternalCustomerId = {customer_id}
GROUP BY
  1
ORDER BY
  1
It seems the main difference is the ClickType column, which might cause double counting according to the documentation: ClickType.
BasicStats seems the most accurate and matches AdWords exactly, while Stats shows roughly a 2x-3x increase in impressions.
Is there a way to transform the data so that both queries would return the same results? I ask because there are no basic stats for the hourly data that I'm interested in.
According to
https://groups.google.com/forum/#!topic/adwords-api/QiY_RT9aNlM
it seems there is no way to de-segment the data once ClickType has been brought in.

TSQL Show greatest Parcel# from row

I have data that looks like this.
SELECT DISTINCT
    x.[PropertyBasics.PK Resnet Property ID], x.filname,
    c.mv_30day_value '30 Day Value As Is',
    c.mv_30day_repvalue '30 Day Value Repaired',
    t.parcel 'Parcel#', x.*
FROM
    Resnet_Reporting_ops.dbo.Ops_FullExportFLATV3 AS x (NOLOCK)
LEFT JOIN
    resnet_mysql.dbo.form_tax t (NOLOCK) ON x.[PropertyBasics.PK Resnet Property ID] = t.property_id
LEFT JOIN
    resnet_mysql.dbo.form_fm c (NOLOCK) ON t.task_id = c.task_id
WHERE
    x.[PropertyBasics.Property Basics - ResID] = 217
    AND x.[PropertyBasics.PK Resnet Property ID] = 1153829
How do you get this data to only show 1 record for Parcel #?
select distinct tests the ENTIRE row for uniqueness; the smallest difference in any column will make a row "distinct". The following two values are different, so they cause two rows:
741722064100000
741-722-06-41-00000
and the following two value pairs are different, which causes two more rows:
500000 800000
435000 850000
In all, that combines to make 4 rows.
So it looks like we could strip the dashes from data item 2 above, and that would make it equal to data item 1:
replace(t.parcel,'-','') AS [Parcel#]
But is that always true? Could there be other differences in other rows not shown here?
And how do we decide between the two value pairs? MAX() won't work, e.g.
MAX(c.mv_30day_value), MAX(c.mv_30day_repvalue)
would produce
500000 850000
and that combination doesn't exist in the source data
The logic required to meet the expected result isn't well defined.
Try the following:
SELECT
x.[PropertyBasics.PK Resnet Property ID]
, x.filname
, MAX(c.mv_30day_value) "30 Day Value As Is"
, MAX(c.mv_30day_repvalue) "30 Day Value Repaired"
, replace(t.parcel,'-','') "Parcel#"
-- , x.* NO WAY !!
FROM Resnet_Reporting_ops.dbo.Ops_FullExportFLATV3 AS X
LEFT JOIN resnet_mysql.dbo.form_tax t ON x.[PropertyBasics.PK Resnet Property ID] = t.property_id
LEFT JOIN resnet_mysql.dbo.form_fm c ON t.task_id = c.task_id
WHERE X.[PropertyBasics.Property Basics - ResID] = 217
AND x.[PropertyBasics.PK Resnet Property ID] = 1153829
GROUP BY
x.[PropertyBasics.PK Resnet Property ID]
, x.filname
, replace(t.parcel,'-','')
;
Note: x.* isn't feasible with GROUP BY, as you need to specify the columns that define each unique row. x.* is also counterproductive with select distinct: each additional output column increases the possibility of more rows (i.e. more columns generally = more differences = more rows). Also, as mentioned, I doubt MAX() produces a good result here.
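If, instead of a per-column MAX(), you need both values to come from one and the same source row, a ROW_NUMBER() window (SQL Server 2005+) is a common alternative. Here is a hedged sketch; the tie-breaker ORDER BY c.task_id DESC (latest task wins) is an assumption to be replaced with whatever rule actually defines the "right" row:
SELECT *
FROM (
    SELECT
        x.[PropertyBasics.PK Resnet Property ID]
        , x.filname
        , c.mv_30day_value AS [30 Day Value As Is]
        , c.mv_30day_repvalue AS [30 Day Value Repaired]
        , replace(t.parcel,'-','') AS [Parcel#]
        , ROW_NUMBER() OVER (
              PARTITION BY replace(t.parcel,'-','')
              ORDER BY c.task_id DESC   -- assumed tie-breaker: latest task wins
          ) AS rn
    FROM Resnet_Reporting_ops.dbo.Ops_FullExportFLATV3 AS x
    LEFT JOIN resnet_mysql.dbo.form_tax t ON x.[PropertyBasics.PK Resnet Property ID] = t.property_id
    LEFT JOIN resnet_mysql.dbo.form_fm c ON t.task_id = c.task_id
    WHERE x.[PropertyBasics.Property Basics - ResID] = 217
      AND x.[PropertyBasics.PK Resnet Property ID] = 1153829
) AS ranked
WHERE rn = 1;
Because the whole row is ranked and then filtered, the value pair always travels together, avoiding the 500000 / 850000 mix-up shown above.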

Query for getting value from another record in same table and filter by difference greater than a gap threshold

I have data imported into a temporary table in MS Access (table snip omitted), to which I have added the "Gap" and "Previous/Current" columns that I need to calculate with an SQL query. The "Gap Threshold" is user input, supplied to the query as a PARAMETER; for this example it is 300. A GlobalID groups ItemIDs, whereas each ItemID is a unique number.
What I want to do is calculate the Gap
(GAP = TEMPORARY_1![VERSION DATE] - TEMPORARY![VERSION DATE])
between ItemIDs sharing the same GlobalID and identify the items having Gap > Gap Threshold. Based on this Gap, within each GlobalID's group of ItemIDs, I want to determine which is the "Previous" ItemID and which is the "Current" ItemID, i.e. which is the Previous item and which is the Current item when there is a gap of more than 300 days between them.
Finally, I want to CREATE ANOTHER TABLE that imports only these Current/Previous pairs for each GlobalID, but displays them as one record each (layout snip omitted). OR is it a better design to create two separate tables AFTER calculating Gap > Gap Threshold, called tblPrevious and tblCurrent, from the Temporary table (layout snip omitted)?
I need someone to point me in the right direction to get a better normalized design and achieve this with an SQL query. Note: all the tables need to be regenerated dynamically every time a new data extract is imported.
The query below gives an error on the Gap column and doesn't calculate Previous/Current:
PARAMETERS Threshold Long;
SELECT TEMPORARY.GlobalID, TEMPORARY.ItemID, TEMPORARY.[Version Date],
    IIf([TEMPORARY]![GlobalID]=[TEMPORARY_1]![GlobalID],
        Max([TEMPORARY]![Version Date])-Min([TEMPORARY_1]![Version Date])=0,
        "Previous") AS Previous,
    TEMPORARY_1.ItemID, TEMPORARY_1.[Version Date],
    IIf([TEMPORARY]![GlobalID]=[TEMPORARY_1]![GlobalID],
        Max([TEMPORARY]![Version Date])-Min([TEMPORARY_1]![Version Date])>[Threshold],
        "Current") AS [Current],
    IIf(([TEMPORARY]![Version Date]-[TEMPORARY_1]![Version Date])>[Threshold],
        [TEMPORARY]![Version Date]-[TEMPORARY_1]![Version Date],
        "") AS GAP
FROM TEMPORARY, TEMPORARY AS TEMPORARY_1
GROUP BY TEMPORARY.GlobalID, TEMPORARY.ItemID, TEMPORARY.[Version Date], TEMPORARY_1.GlobalID, TEMPORARY_1.ItemID, TEMPORARY_1.[Version Date];
Any help would be most appreciated.
I'll offer one more contribution: an option with VBA code to get the Current/Previous pairs. This does require saving records to a table. Tested, and it runs in a snap.
Sub GetGap()
    ' Walks Temporary in GlobalID / ItemID DESC order and writes one
    ' Current/Previous pair per GlobalID to table Temp.
    Dim intTH As Integer, x As Integer, strGID As String
    Dim rsT1 As DAO.Recordset, rsT2 As DAO.Recordset
    CurrentDb.Execute "DELETE FROM Temp"
    ' Newest ItemID first within each GlobalID, so the first row seen is "Current"
    Set rsT1 = CurrentDb.OpenRecordset("SELECT * FROM Temporary ORDER BY GlobalID, ItemID DESC;")
    Set rsT2 = CurrentDb.OpenRecordset("SELECT * FROM Temp;")
    strGID = rsT1!GlobalID
    x = 1
    While Not rsT1.EOF
        If strGID = rsT1!GlobalID Then
            If x = 1 Then
                ' First (newest) item of the group is the Current item
                rsT2.AddNew
                rsT2!GlobalID = strGID
                rsT2!CurItemID = rsT1!ItemID
                rsT2!CurDate = rsT1![Version Date]
                x = 2
            ElseIf x = 2 Then
                ' Second item of the group is the Previous item
                rsT2!PreItemID = rsT1!ItemID
                rsT2!PreDate = rsT1![Version Date]
                x = 3
            End If
            If Not rsT1.EOF Then rsT1.MoveNext
        Else
            ' New GlobalID: save the completed pair, then reprocess this row
            ' as the first (Current) item of the next group
            If x = 3 Then rsT2.Update
            strGID = rsT1!GlobalID
            x = 1
        End If
        If rsT1.EOF Then rsT2.Update
    Wend
End Sub
Then a query can easily calculate the Gap and filter records.
SELECT Temp.GlobalID, Temp.CurItemID, Temp.CurDate, Temp.PreDate, Temp.PreItemID, [CurDate]-[PreDate] AS Gap
FROM Temp
WHERE ((([CurDate]-[PreDate])>Int([Enter Threshold])));
Or the code can be expanded to also calculate the Gap and save only records that meet the threshold requirement; just a bit more complicated.
Review Allen Browne Subquery.
Requirements described in narrative differ from the title. Here are suggestions for both.
Queries pulling Current/Previous pairs.
Query 1:
SELECT [GlobalID], [ItemID] AS CurItemID, [Version Date] AS CurDate,
    (SELECT TOP 1 [Version Date] FROM Temporary AS Dupe
     WHERE Dupe.GlobalID = Temporary.GlobalID AND Dupe.ItemID < Temporary.ItemID
     ORDER BY Dupe.GlobalID, Dupe.ItemID DESC) AS PreDate,
    (SELECT TOP 1 [ItemID] FROM Temporary AS Dupe
     WHERE Dupe.GlobalID = Temporary.GlobalID AND Dupe.ItemID < Temporary.ItemID
     ORDER BY Dupe.GlobalID, Dupe.ItemID DESC) AS PreItemID
FROM [Temporary];
Query 2:
SELECT Query1.GlobalID, Query1.CurItemID, Query1.CurDate, Query1.PreDate, Query1.PreItemID,
    DateDiff("d",[PreDate],[CurDate]) AS Gap
FROM Query1
WHERE ((([GlobalID] & [CurItemID]) In
    (SELECT TOP 1 GlobalID & CurItemID FROM Query1 AS Dupe
     WHERE Dupe.GlobalID = Query1.GlobalID
     ORDER BY GlobalID, CurItemID DESC)))
  AND DateDiff("d",[PreDate],[CurDate]) > Int([Enter Threshold]);
Final output:
GlobalID CurItemID CurDate PreDate PreItemID Gap
00109086 2755630 2/26/2015 3/11/2014 2130881 352
00114899 2785590 3/13/2015 3/25/2014 2093191 353
00154635 2755623 2/26/2015 4/4/2014 2176453 328
Here is a query that addresses the Minimum/Maximum requirement as stated in your title. It's not as slow as the Current/Previous queries, but if the dataset gets significantly larger I expect it will get very slow.
SELECT Maximum.GlobalID,
    Maximum.ItemID AS MaxItem, Maximum.[Version Date] AS MaxItemDate,
    Minimum.ItemID AS MinItem, Minimum.[Version Date] AS MinItemDate,
    Maximum.[Version Date]-Minimum.[Version Date] AS Gap
FROM
    (SELECT T1.GlobalID, T1.ItemID, T1.[Version Date] FROM [Temporary] AS T1
     WHERE T1.ItemID In (SELECT Min([ItemID]) AS MinItem FROM Temporary GROUP BY GlobalID)) AS Minimum
INNER JOIN
    (SELECT T1.GlobalID, T1.ItemID, T1.[Version Date] FROM [Temporary] AS T1
     WHERE T1.ItemID In (SELECT Max([ItemID]) AS MaxItem FROM Temporary GROUP BY GlobalID)) AS Maximum
    ON Minimum.GlobalID = Maximum.GlobalID
WHERE Maximum.[Version Date]-Minimum.[Version Date] > Int([Enter Threshold]);
Also, your dates are in international format. If you encounter issues with that, review Allen Browne's International Dates.

SQL Filtering duplicate rows due to bad ETL

The database is Postgres but any SQL logic should help.
I am retrieving the set of sales quotations that contain a given product within the bill of materials. I'm doing that in two steps: step 1 retrieves all DISTINCT quote numbers that contain a given product (by product number); step 2 retrieves the full quote, with all products listed, for each unique quote number.
So far, so good. Now the tough bit. Some rows are duplicates, some are not. Those that are duplicates (quote number & quote version & line number) might or might not have maintenance on them. I want to pick the row that has maintenance greater than 0. The duplicate rows I want to exclude are those that have a 0 maintenance. The problem is that some rows, which have no duplicates, have 0 maintenance, so I can't just filter on maintenance.
To make this exciting, the database holds quotes going back 20+ years. And the data science guys have just admitted that maybe the ETL process has some bugs...
--- step 0
--- cleanup the workspace
SET CLIENT_ENCODING TO 'UTF8';
DROP TABLE IF EXISTS product_quotes;
--- step 1
--- get list of Product Quotes
CREATE TEMPORARY TABLE product_quotes AS (
    SELECT DISTINCT master_quote_number
    FROM w_quote_line_d
    WHERE item_number IN ( << model numbers >> )
);
--- step 2
--- Now join on that list
SELECT
    d.quote_line_number,
    d.item_number,
    d.item_description,
    d.item_quantity,
    d.unit_of_measure,
    f.ref_list_price_amount,
    f.quote_amount_entered,
    f.negtd_discount,
    --- need to calculate discount rate based on list price and negtd discount (%)
    CASE
        WHEN ref_list_price_amount > 0
        THEN 100 - (ref_list_price_amount + negtd_discount) / ref_list_price_amount * 100
        ELSE 0
    END AS discount_percent,
    f.warranty_months,
    f.master_quote_number,
    f.quote_version_number,
    f.maintenance_months,
    f.territory_wid,
    f.district_wid,
    f.sales_rep_wid,
    f.sales_organization_wid,
    f.install_at_customer_wid,
    f.ship_to_customer_wid,
    f.bill_to_customer_wid,
    f.sold_to_customer_wid,
    d.net_value,
    d.deal_score,
    f.transaction_date,
    f.reporting_date
FROM w_quote_line_d d
INNER JOIN product_quotes pq ON (pq.master_quote_number = d.master_quote_number)
INNER JOIN w_quote_f f ON
    (f.quote_line_number = d.quote_line_number
     AND f.master_quote_number = d.master_quote_number
     AND f.quote_version_number = d.quote_version_number)
WHERE d.net_value >= 0 AND item_quantity > 0
ORDER BY f.master_quote_number, f.quote_version_number, d.quote_line_number
The logic to filter the duplicate rows is like this:
For each master_quote_number / version_number pair, check to see if there are duplicate line numbers. If so, pick the one with maintenance > 0.
Even in a CASE statement, I'm not sure how to write that.
Thoughts?
I think you will want to use Window Functions. They are, in a word, awesome.
Here is a query that would "dedupe" based on your criteria:
select *
from (
    select
        * -- simplifying here to show the important parts
        ,row_number() over (
            partition by d.master_quote_number, d.quote_version_number, d.quote_line_number
            order by f.maintenance_months desc) as seqnum
    from w_quote_line_d d
    inner join product_quotes pq
        on (pq.master_quote_number = d.master_quote_number)
    inner join w_quote_f f
        on (f.quote_line_number = d.quote_line_number
            and f.master_quote_number = d.master_quote_number
            and f.quote_version_number = d.quote_version_number)
) x
where seqnum = 1
The use of row_number() with those partition by and order by criteria guarantees that only ONE row for each combination of quote number / version number / line number gets the value 1, and it will be the one with the highest maintenance_months (if your colleagues are right, there would only be one with a value > 0 anyway).
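Since the database is Postgres specifically, here is a hedged alternative sketch using DISTINCT ON, a Postgres extension that expresses the same pick-one-row-per-group rule more compactly (column names taken from the question's query):
SELECT DISTINCT ON (f.master_quote_number, f.quote_version_number, d.quote_line_number)
    d.*, f.*
FROM w_quote_line_d d
INNER JOIN product_quotes pq ON pq.master_quote_number = d.master_quote_number
INNER JOIN w_quote_f f
    ON f.quote_line_number = d.quote_line_number
    AND f.master_quote_number = d.master_quote_number
    AND f.quote_version_number = d.quote_version_number
-- DISTINCT ON keeps the first row per group under this ORDER BY,
-- i.e. the duplicate with the largest maintenance_months
ORDER BY f.master_quote_number, f.quote_version_number, d.quote_line_number,
    f.maintenance_months DESC;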
Can you do something like...
select
*
from
w_quote_line_d d
inner join
(
select
...
,max(maintenance) as maintenance
from
w_quote_line_d
group by
...
) d1
on
d1.id = d.id
and d1.maintenance = d.maintenance;
Am I understanding your problem correctly?
Edit: Forgot the group by!
I'm not sure, but maybe you could Group By all other columns and use MAX(Maintenance) to get only the greatest.
What do you think?
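A minimal sketch of that suggestion, assuming the duplicates differ only in their maintenance value (the column list is abbreviated; every non-aggregated output column has to be repeated in the GROUP BY):
SELECT
    d.master_quote_number,
    d.quote_version_number,
    d.quote_line_number,
    d.item_number,   -- ...plus every other column you need
    MAX(f.maintenance_months) AS maintenance_months
FROM w_quote_line_d d
INNER JOIN product_quotes pq ON pq.master_quote_number = d.master_quote_number
INNER JOIN w_quote_f f
    ON f.quote_line_number = d.quote_line_number
    AND f.master_quote_number = d.master_quote_number
    AND f.quote_version_number = d.quote_version_number
GROUP BY d.master_quote_number, d.quote_version_number, d.quote_line_number, d.item_number;
The caveat: if the duplicates differ in any other selected column as well, the rows will not collapse at all, since every GROUP BY column must match, so this only works when maintenance really is the sole difference.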

Handling negative values with sql

I have a data set that lists the date and quantity of future stock of products. Occasionally our demand outstrips our future supply and we wind up with a negative future quantity. I need to factor that future negative quantity into previous supply so we don't compound the problem by overselling our supply.
In the following data set, I need to prepare for the demand on 10-19 by applying the negative quantity up the chain until I'm left with a positive quantity:
"ID","SKU","DATE","SEASON","QUANTITY"
"1","001","2012-06-22","S12","1656"
"2","001","2012-07-13","F12","1986"
"3","001","2012-07-27","F12","-283"
"4","001","2012-08-17","F12","2718"
"5","001","2012-08-31","F12","-4019"
"6","001","2012-09-14","F12","7212"
"7","001","2012-09-21","F12","782"
"8","001","2012-09-28","F12","2073"
"9","001","2012-10-12","F12","1842"
"10","001","2012-10-19","F12","-12159"
I need to get it to this:
"ID","SKU","DATE","SEASON","QUANTITY"
"1","001","2012-06-22","S12","1656"
"2","001","2012-07-13","F12","152"
I have looked at using a while loop as well as an outer apply, but cannot seem to find a way to do this yet. Any help would be much appreciated. This needs to work on SQL Server 2008 R2.
Here's another example:
"1","002","2012-07-13","S12","1980"
"2","002","2012-08-10","F12","-306"
"3","002","2012-09-07","F12","826"
Would become:
"1","002","2012-07-13","S12","1674"
"3","002","2012-09-07","F12","826"
You don't seem to be getting a lot of answers, so here's something in case the proper "how to do it in pure SQL" never arrives. Ignore this solution if anything SQLish shows up; it's just defensive coding, not elegant.
If you want a sum of all rows sharing a season, while deleting the now-redundant records, just pull the data out, loop over it, sum the rows with the same season value, update the table with the right totals, and delete the unnecessary entries. Here's one way to do it (pseudocode):
productsArray = SELECT * FROM products ORDER BY id
processed = array (associative)  # season -> (keepId, total)
foreach product in productsArray:
    if product[season] not in processed:
        # first row seen for this season: keep it, it will carry the total
        processed[season] = (keepId: product[id], total: product[quantity])
    else:
        # later row for the same season: fold its quantity in, then drop it
        processed[season].total = processed[season].total + product[quantity]
        DELETE FROM products WHERE id = product[id]
# finally write each season's total back to its surviving row
foreach season in processed:
    UPDATE products SET quantity = processed[season].total WHERE id = processed[season].keepId
Here is a CROSS APPLY - tested
SELECT b.ID, SKU, b.DATE, SEASON, QUANTITY
FROM (
    SELECT SKU, SEASON, SUM(QUANTITY) AS QUANTITY
    FROM T1
    GROUP BY SKU, SEASON
) a
CROSS APPLY (
    SELECT TOP 1 b.ID, b.Date
    FROM T1 b
    WHERE a.SKU = b.SKU AND a.SEASON = b.SEASON
    ORDER BY b.ID ASC
) b
ORDER BY ID ASC