Simplified version of the dataset I have is:
DATA HAVE;
INPUT ID match1 $ match2 $ not_relevant;
DATALINES;
1 "ABC" "ABC" 4
1 "XYZ" "XYZ" 29
2 "QQQ" "AAA" 5
2 "ABC" "ABC" 9
3 "EFG" "EFG" 7
3 "DEF" "DEF" 12
3 "LMK" "LMK" 16
3 "LMK" . 29
;RUN;
I am looking to compare match1 and match2, and if match1 does not equal match2 on any row for a given ID, I would like to remove all of the rows with that ID. So for this example dataset I want to remove all of ID 2 (rows 3 and 4), since row 3 does not have a match between match1 and match2. All I can figure out how to do so far is to delete the rows where they don't match, which isn't terribly helpful for this application. I assume it would be easier to make it a new data set with some WHERE conditions, but I am unsure how to begin there. Any ideas / advice?
EDIT:
Apologies, I dumbed down my dataset too much and forgot about an important exception. Note in my new dataset (I only added one row to the end). I do NOT want to delete group 3, since match2 is blank. I only want to delete a group where match2 is not blank and match1 does not equal match2.
Thanks
There are a few ways to do this. One would be to just construct a dataset of IDs that have non-matching rows, then do a merge or a SQL join and remove anything that matched this list.
However, my preferred option (partly because of speed, but also it's more straightforward once you understand how it works) is the DoW loop.
data want;
  id_nonmatch = 0;
  do _n_ = 1 by 1 until (last.id);
    set have;
    by id;
    *set the flag to 1 if match2 is not blank and differs from match1;
    if not missing(match2) and match1 ne match2 then id_nonmatch = 1;
  end;
  do _n_ = 1 by 1 until (last.id);
    set have;
    by id;
    if id_nonmatch = 0 then output;
  end;
run;
There are two set statements on the data step, each of which runs through the same dataset separately. If it doesn't make sense, throw a put _all_; inside each of the do loops - that will show you what it's doing. The first loop goes over all of the rows for one ID, checks if any violate the constraint, and if none do, the flag variable (id_nonmatch) stays 0. If one does, it becomes a 1 (and stays that way). Then, when it hits an ID boundary, it stops pulling records from the first set statement, and goes onto the second - re-pulling those same rows. Now, it outputs only when the flag is a zero.
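For readers who do not use SAS, the same two-pass idea can be sketched in Python. This is only an illustration, not a translation of the data step: the function name `keep_clean_groups` is made up, `None` stands in for a blank match2, and the rule applied is the one from the question's EDIT (a blank match2 does not count as a mismatch).

```python
from itertools import groupby
from operator import itemgetter

# Rows mirror the HAVE dataset: (id, match1, match2, not_relevant).
# None stands in for a blank match2 (an assumption of this sketch).
have = [
    (1, "ABC", "ABC", 4),
    (1, "XYZ", "XYZ", 29),
    (2, "QQQ", "AAA", 5),
    (2, "ABC", "ABC", 9),
    (3, "EFG", "EFG", 7),
    (3, "DEF", "DEF", 12),
    (3, "LMK", "LMK", 16),
    (3, "LMK", None, 29),
]


def keep_clean_groups(rows):
    """Keep only ID groups with no non-blank mismatch (rows must be sorted by id)."""
    want = []
    for _, group in groupby(rows, key=itemgetter(0)):
        group = list(group)  # hold one ID group so it can be walked twice
        # First pass: raise the flag if any row in the group violates the rule.
        id_nonmatch = any(m2 is not None and m1 != m2 for _, m1, m2, _ in group)
        # Second pass: output the group's rows only if the flag stayed off.
        if not id_nonmatch:
            want.extend(group)
    return want


want = keep_clean_groups(have)
print(sorted({row[0] for row in want}))
```

As with the BY statement in the data step, the input has to be sorted (or at least grouped) by id for this to work.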
This is very efficient because of buffering - unless your id groups are very large, the data step may be able to use buffers to keep the same rows in memory and not have to reread them from disk. (This will depend on your disk and buffers - and seems to help much less on flash than on physical disks [since there is not the additional benefit of the disk head not having to move] - so your mileage may vary here.)
Just to show this difference, here is a log showing that there isn't much additional time needed for the second read when the record is reasonably sized. This benefit is less when the record is very small; I imagine there is more overhead involved. Note that the second read adds only about 0.37 seconds to the total processing time, roughly 15% of the 2.37 seconds taken by the first read!
69 data have;
70 call streaminit(7);
71 length strvar $1000;
72 do id = 1 to 100000;
73 do iter = 1 to 50;
74 x = rand('Uniform');
75 output;
76 end;
77 end;
78 run;
NOTE: Variable strvar is uninitialized.
NOTE: The data set WORK.HAVE has 5000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 5.20 seconds
cpu time 5.20 seconds
79
80
81 data _null_;
82 do _n_ = 1 by 1 until (last.id);
83 set have;
84 by id;
85 end;
86 run;
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 2.37 seconds
cpu time 2.37 seconds
87
88
89 data _null_;
90 do _n_ = 1 by 1 until (last.id);
91 set have;
92 by id;
93 end;
94 do _n_ = 1 by 1 until (last.id);
95 set have;
96 by id;
97 end;
98 run;
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 2.74 seconds
cpu time 2.73 seconds
It is easy to do this with an SQL query with a GROUP BY and HAVING clause.
proc sql;
create table want as
select *
from have
group by id
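having max( (match1 ne match2) and not missing(match2) ) = 0
;
quit;
SAS evaluates boolean expressions as 1/0 for TRUE/FALSE, so the MAX() of a series of TRUE/FALSE values will be 1 if ANY of them are TRUE. Requiring the MAX() to equal 0 therefore keeps only the IDs where no row has a non-blank match2 that differs from match1.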
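The MAX-of-booleans trick is not specific to SAS. As a rough sketch, here is the same filter run against SQLite from Python. Two things differ from PROC SQL and are assumptions of this example: SQLite does not remerge grouped results onto the detail rows, so the GROUP BY goes in a subquery, and a blank match2 is stored as NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table have (id integer, match1 text, match2 text, not_relevant integer)"
)
conn.executemany(
    "insert into have values (?, ?, ?, ?)",
    [
        (1, "ABC", "ABC", 4),
        (1, "XYZ", "XYZ", 29),
        (2, "QQQ", "AAA", 5),
        (2, "ABC", "ABC", 9),
        (3, "EFG", "EFG", 7),
        (3, "DEF", "DEF", 12),
        (3, "LMK", "LMK", 16),
        (3, "LMK", None, 29),  # None is stored as NULL, i.e. a blank match2
    ],
)

# The subquery keeps the IDs whose MAX() of the 1/0 mismatch flag is 0,
# meaning no row has a non-NULL match2 that differs from match1.
query = """
    select *
    from have
    where id in (
        select id
        from have
        group by id
        having max((match2 is not null) and (match1 <> match2)) = 0
    )
    order by id, not_relevant
"""
kept = conn.execute(query).fetchall()
for row in kept:
    print(row)
```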
I'm no expert in MSAS cubes, so maybe this is obvious, but it is blocking an important feature for our team.
We have a fact table of "Indicators" (basically values from a calculator) that are computed for a specific date. Indicators have a VersionId, which groups them according to a functional rule.
It goes like :
From Date, Value, NodeId, VersionId
D0 - 1.45 - N2 - V0
We have a fact table of "VersionsAssociation" that lists all the versions (the very same versions as the ones in the "Indicator" fact table) that are valid and visible and for what date.
To fit with a customer need, some versions are visible at multiple dates.
For instance, a version computed for date D0, may be visible/recopied for date D1, D2, ...; so for a specific version V0, we would have in "VersionAssociation" :
VersionId , Date From (computed), Date To (Visible at what date)
V0 - D0 - D0
V0 - D0 - D1
V0 - D0 - D2
V0 - D0 - D3
...
In our cube model, "Indicators" facts have a "From Date", the date they are computed for, but no "To Date", because when they are visible is not up to the indicator but is decided by "VersionAssociation".
This means that in our "Dimension Usage" panel, we have a many-to-many relation from "Indicator" pointing to "VersionAssociation" on the dimension "To Date".
So far, this part works as expected. When we select "To Date" = D1 in Excel, we see indicators recopied from D0, with right values (no duplicate).
Then we have a thing called projection, where we split an indicator value alongside a specific dimension. For that we have a third measure group called "Projection", with values called "Weight".
Weights have a "To Date", because the weights are computed for a specific date, and even if an indicator is copied from D0 into D1, when projected, it is projected using the D1 weights.
We also duplicate the weights for every available From Date. That's strange, but without it the results are pure chaos.
Meaning we would have in the weights:
NodeId,From Date, To Date, Projection Axis, Weight
N2 , D0 , D0 , P1 , 0.75
N2 , D0 , D0 , P2 , 0.25 (a value on node N2 would be split into 2 different values, where the sum is still the same)
N2 , D0 , D1 , P1 , 0.70
N2 , D0 , D1 , P2 , 0.30
Here goes the issue:
The Measure Group "Projection" and "Indicator" are directly linked to the dimension "Projection".
"Projection" has a direct link to the "From Date" and the "To Date" dimension.
"Indicator" has a direct link to the "From Date" dimension, but only a m2m reference to the "To Date" dimension, through the "Version Association" measure group.
To apply the projection weights, we use a measure expression on the measures from the "Indicator" measure group, something like "[Value Unit] * [Weight]".
For reasons we don't understand, this causes MSAS to fail to discriminate which weights are eligible to apply to a given value in the "Indicator" measure group.
For instance, if we look in Excel and ask for date D1 (same behavior for all dates), on the projection axis P1 we get:
Value Weight
1.45 * 0.75 (Weight: From Date D0, To Date D0, P1)
+ 1.45 * 0.70 (Weight: From Date D0, To Date D1, P1)
for D1 and P2 we have :
Value Weight
1.45 * 0.25 (Weight: From Date D0, To Date D0, P2)
+ 1.45 * 0.30 (Weight: From Date D0, To Date D1, P2)
This makes the values meaningless and unreadable.
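To make the double counting concrete, here is a small arithmetic sketch in Python using only the numbers from the tables above. It does not model the cube; the helper `projected` is invented for illustration and simply shows what happens when the "To Date" filter is or is not applied to the weights.

```python
indicator_value = 1.45  # value on node N2, computed for From Date D0

# (from_date, to_date, projection_axis, weight), copied from the weights table
weights = [
    ("D0", "D0", "P1", 0.75),
    ("D0", "D0", "P2", 0.25),
    ("D0", "D1", "P1", 0.70),
    ("D0", "D1", "P2", 0.30),
]


def projected(axis, to_date=None):
    """Sum value * weight for one axis; to_date=None mimics the unfiltered behavior."""
    return sum(
        indicator_value * weight
        for _, weight_to_date, weight_axis, weight in weights
        if weight_axis == axis and (to_date is None or weight_to_date == to_date)
    )


observed_p1 = projected("P1")        # weights for both To Dates leak in
expected_p1 = projected("P1", "D1")  # only the D1 weight applies
observed_p2 = projected("P2")
expected_p2 = projected("P2", "D1")

print("P1 observed vs expected:", round(observed_p1, 4), round(expected_p1, 4))
print("P2 observed vs expected:", round(observed_p2, 4), round(expected_p2, 4))
```

With the filter applied, the P1 and P2 values for D1 add back up to the original 1.45, which is the behavior we want from the cube.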
So the point of all this is to ask for a way to limit the weights that can be applied in the measure expression. We tried using SCOPE on "From Date" and "To Date" with the "Weight" measure or the "Value" measure, but the cube never enters our SCOPE instructions.
This is very long, and complicated, but we're stuck.
I am not sure that I understood your problem completely, but what I understood is that since there is no projection axis in the Indicator fact, the values repeat for the same FromDate and ToDate when Projection is selected.
example from your data
D0 , D0 , P1 , 0.75
D0 , D0 , P2 , 0.25
For these rows the indicator value 1.45 is repeated on both, whereas it should be 1.45*0.75 for the first row and 1.45*0.25 for the second.
If this is the issue try the below query
with member Measures.IndicatorTest
as
([DimFromDate].[FromDate].CurrentMember,
[DimToDate].[ToDate].CurrentMember,
[Value Unit])
member Measures.ProjectionTest
as
([DimFromDate].[FromDate].CurrentMember,
[DimToDate].[ToDate].CurrentMember,
[DimProjection].[Projection].CurrentMember,
[Weight])
member Measures.WeightedIndicator
as
Measures.IndicatorTest*Measures.ProjectionTest
select Measures.WeightedIndicator
on columns,
nonempty
(
[DimFromDate].[FromDate].[FromDate],
[DimToDate].[ToDate].[ToDate],
[DimProjection].[Projection].[Projection]
)
on rows
from yourCube
For closure: as it turns out, the expected behavior is not possible (as far as our team could tell), so we reverted to merging two of the three tables together and having only one many-to-many join in the measure groups.
If I have a data set like the following :
type| min | max
-----------------
a | 25 | 30
b | 20 | 30
c | 15 | 20
My goal is to match an input with a type, and to do that while taking into account that my types have overlapping values.
So let's say I have an input in my system that is 25, and I want to match my input to a type (either a, b, or c). My input is most likely b, since the average of the min and max of b is 25, and could possibly be a, but that is less likely. I've tried implementing this and have had no luck, and have also thought of using p-values, but am not sure how I can do it.
What would be the best way to implement this?
Something like this fits your description:
select t.*
from t
where ? >= min and ? <= max
order by abs( ? - (min + max) / 2.0 )
fetch first 1 row only;
This identifies the ranges where the value matches. It then chooses the range where the value is closest to the middle of the range.
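The same logic can be sketched outside the database. This is a minimal Python version, assuming the midpoint of a range is (min + max) / 2; the function name `best_type` is made up for the example, and ties are broken alphabetically by type name, which is an arbitrary choice.

```python
ranges = {"a": (25, 30), "b": (20, 30), "c": (15, 20)}


def best_type(value, ranges):
    """Return the type whose range contains value and whose midpoint is nearest, or None."""
    candidates = [
        (abs(value - (low + high) / 2), name)
        for name, (low, high) in ranges.items()
        if low <= value <= high
    ]
    # min() compares the distance first, then the name as a tie-breaker.
    return min(candidates)[1] if candidates else None


print(best_type(25, ranges))
```

For an input of 25, both a and b contain the value, but b's midpoint is exactly 25, so b is chosen.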
I'm trying to automate billing for my boss. I have to choose the highest quantity for an invoice date and client, then print that quantity in a separate column and a 0 (or blank) for the second row associated with that client. I'm trying to recreate this example:
Billing Snippet
I'm having trouble using Pandas to do this. I used a pivot table to get the max quantity for each client, then merged that data with the original to get a "max" column. That looks like this:
Dataframe snippet
My plan is to use indexes to essentially say "if the Qty is not equal to Max, then change the value to 0"
Here's my code, but I get the error "A value is trying to be set on a copy of a slice from a DataFrame" :
ad2[ad2['Qty'] != ad2['max']]['Qtrly Billing Count']=0
Any advice on how to tackle this?
Update: Tried turning off the setting that gives me the index error, but the column I want to update isn't changing. Help!
Recreating your df:
ad2 = pd.DataFrame({'Qty':[33, 47],'max':[47,47], 'Qtrly':[47,47] })
Qtrly Qty max
0 47 33 47
1 47 47 47
using loc:
ad2.loc[ad2['Qty'] != ad2['max'], 'Qtrly']=0
result:
Qtrly Qty max
0 0 33 47
1 47 47 47
I am trying to fill column D and column E.
Column A: varchar(64) - unique for each trip
Column B: smallint
Column C: timestamp without time zone (Excel messed it up in the image below, but you can assume this is a timestamp column)
Column D: numeric - need to find out time from origin in minutes
column E: numeric - time to destination in minutes.
Each trip has got different intermediate stations and I am trying to figure out the time it has been since origin and time to destination
Cell D2 = C2 - C2 = 0
Cell D3 = C3 - C2
Cell D4 = C4 - C2
Cell E2 = C6 - C2
Cell E3 = C6 - C3
Cell E6 = C6 - C6 = 0
The main issue is that each trip_id contains a different number of stations. I can think of using PARTITION BY on a column, but I can't figure out how to implement it.
Another sub-question: I am dealing with a very large table (100 million rows). What is the best way for PostgreSQL experts to implement data modifications? Do you create a sample table from the original data and test everything on the sample before applying the modifications to the original table, or do you use something like "BEGIN TRANSACTION" on the original data so that you can roll back in case of any error?
PS: Help with question title appreciated.
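You don't need to know the number of stops. Use extract(epoch ...) / 60 to get the total elapsed minutes; extract(minutes ...) would return only the 0-59 minutes field of the interval.
with a as (
  select *,
    extract(epoch from c - min(c) over (partition by a)) / 60 dd,
    extract(epoch from max(c) over (partition by a) - c) / 60 ee
  from td)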
update td set d=dd, e=ee
from a
where a.a = td.a and a.b=td.b
;
http://sqlfiddle.com/#!17/c9112/1
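To show what the two window functions compute, here is a small Python sketch on made-up rows (the trip ids and timestamps are hypothetical, standing in for columns A, B and C). It takes each trip's first and last timestamp, then measures the total elapsed minutes from origin and to destination for every stop.

```python
from datetime import datetime

# Hypothetical sample rows: (trip_id, stop_sequence, timestamp) for columns A, B, C.
rows = [
    ("t1", 1, datetime(2018, 1, 1, 8, 0)),
    ("t1", 2, datetime(2018, 1, 1, 8, 25)),
    ("t1", 3, datetime(2018, 1, 1, 9, 30)),
    ("t2", 1, datetime(2018, 1, 1, 10, 0)),
    ("t2", 2, datetime(2018, 1, 1, 10, 45)),
]

# Per-trip first and last timestamps: the role played by
# min(c) / max(c) over (partition by a) in the SQL.
bounds = {}
for trip, _, ts in rows:
    first, last = bounds.get(trip, (ts, ts))
    bounds[trip] = (min(first, ts), max(last, ts))

result = []
for trip, seq, ts in rows:
    first, last = bounds[trip]
    from_origin = (ts - first).total_seconds() / 60    # column D, in minutes
    to_destination = (last - ts).total_seconds() / 60  # column E, in minutes
    result.append((trip, seq, from_origin, to_destination))

for line in result:
    print(line)
```

Note that the third stop of trip t1 is 90 minutes from the origin, so the calculation has to use the whole interval rather than just its minutes component. The number of stops per trip never appears in the code, which is why it can differ between trips.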