Using SPSS Reference Variables and an Index to Create a New Variable

Essentially, I have a log containing a unique identifier for a subject that is tracked through multiple cases. I then used the following code, suggested previously by the great community here, to create an index. Unfortunately, I've run into a new challenge that I can't seem to figure out. Here is a sample of the current data set to provide perspective.
Indexing function
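* Restart Index at 1 whenever Unique_Modifier changes; otherwise add 1 to the previous case's Index.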
sort cases by Unique_Modifier.
if $casenum=1 or Unique_Modifier<>lag(Unique_Modifier) Index=1.
if Unique_Modifier=lag(Unique_Modifier) Index=lag(Index)+1.
format Index(f2).
execute.
Unique Identifier  Index  Variable of interest
A                  1      101
A                  2      101
A                  3      607
A                  4      607
A                  5      101
A                  6      101
B                  1      108
B                  2      210
C                  1      610
C                  2      987
C                  3      1100
C                  4      610
What I'd like to do is create a new variable containing a running count of the distinct entries seen so far in the variable-of-interest column for each identifier. The expected output would be the following:
Unique Identifier  Index  Variable of interest  Intended Output
A                  1      101                   1
A                  2      101                   1
A                  3      607                   2
A                  4      607                   2
A                  5      101                   2
A                  6      101                   2
B                  1      108                   1
B                  2      210                   2
C                  1      610                   1
C                  2      987                   2
C                  3      1100                  3
C                  4      610                   3
I've tried a few different approaches. One was a similar index function, which works when the variable of interest changes between consecutive lines, but fails when a value recurs later, say five lines down. My next idea was the AGGREGATE function, but I looked through the IBM manual and there doesn't seem to be a function within AGGREGATE that would produce the intended output here. Anyone have any ideas? I think a loop is the best bet, but loops within SPSS are a bit funky and hard to get working.

Try this: keep a running, comma-delimited string of the distinct values seen so far for each subject, and increment the counter only when the current value is not already in that string (the identifiers are entered as numbers here to keep the DATA LIST simple):
data list list/Unique_Identifier Index VOI (3f) .
begin data.
1 1 101
1 2 101
1 3 607
1 4 607
1 5 101
1 6 101
2 1 108
2 2 210
3 1 610
3 2 987
3 3 1100
3 4 610
end data.
* Track the distinct VOI values seen so far for each subject in a
* comma-delimited string, and bump the counter only for new values.
string voiT (a1000).
compute voiT=concat(ltrim(string(VOI,f10)),",").
compute Intended_Output=1.
do if index>1.
* Search for ",value," inside ",list," so that a value such as 1 cannot
* falsely match inside a longer value such as 101.
do if index(concat(",",rtrim(lag(voiT))),concat(",",rtrim(voiT)))>0.
compute Intended_Output=lag(Intended_Output).
compute voiT=lag(voiT).
else.
compute Intended_Output=lag(Intended_Output)+1.
compute voiT=concat(rtrim(lag(voiT)),rtrim(voiT)).
end if.
end if.
exe.

Related

How to count unique values across two columns with certain types in PostgreSQL

I've got data feeding into a database like below. I can use a single query many times to achieve this, but I'm looking for a way to query once and return all results. The columns are: time, ID (the totally unique field), ReqType (can be 1 value out of 30), and ReqName (can be 1 out of 20).
time  ID  ReqType  ReqName
1345  12  1        test
1346  13  2        test
1352  14  1        hello world
1352  15  3        fith
1354  16  1        hello world
1357  17  4        apple
Without constructing a query for every value of ReqType and ReqName, I would like to return counts within a time window (the last 30 minutes in the example above). I think this will actually be two queries. Expected output:
count  ReqName
2      test
2      hello world
1      fith
1      apple
count  ReqType
3      1
1      2
1      3
1      4
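A minimal sketch of one way to get both result sets in a single pass each, using GROUP BY, assuming a table named requests with the columns above (the table name and the window predicate are assumptions; adjust the time comparison to whatever unit the time column actually uses):

-- Counts per ReqName within the window (hypothetical table name).
SELECT COUNT(*) AS count, ReqName
FROM requests
WHERE time >= 1330  -- window start, e.g. "last 30 minutes" in this unit
GROUP BY ReqName;

-- Same idea for ReqType.
SELECT COUNT(*) AS count, ReqType
FROM requests
WHERE time >= 1330
GROUP BY ReqType;

GROUP BY counts every distinct value at once, so no per-value queries are needed.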

Select maximum value where another column is used for the grouping

I'm trying to join several tables, where one of the tables is acting as a key-value store, and then after the joins find the maximum value in a column less than another column. As a simplified example, I have the following three tables:
Documents:
DocumentID  Filename      LatestRevision
1           D1001.SLDDRW  18
2           P5002.SLDPRT  10
Variables:
VariableID  VariableName
1           DateReleased
2           Change
3           Description
VariableValues:
DocumentID  VariableID  Revision  Value
1           2           1         Created
1           3           1         Drawing
1           2           3         Changed Dimension
1           1           4         2021-02-01
1           2           11        Corrected typos
1           1           16        2021-02-25
2           3           1         Generic part
2           3           5         Screw
2           2           4         2021-02-24
I can use the LEFT JOIN / IS NULL pattern to get the latest version of the variables relatively easily (see http://sqlfiddle.com/#!7/5982d/3/0). What I want is the latest version of each variable at a revision less than or equal to a revision that has a DateReleased, for example:
DocumentID  Filename      Variable     Value              VariableRev  DateReleased  ReleasedRev
1           D1001.SLDDRW  Change       Changed Dimension  3            2021-02-01    4
1           D1001.SLDDRW  Description  Drawing            1            2021-02-01    4
1           D1001.SLDDRW  Description  Drawing            1            2021-02-25    16
1           D1001.SLDDRW  Change       Corrected Typos    11           2021-02-25    16
2           P5002.SLDPRT  Description  Generic Part       1            2021-02-24    4
How do I do this?
I figured this out: add another JOIN at the start bringing in a second copy of the VariableValues table restricted to the DateReleased rows, then require that every VariableValues revision selected is less than or equal to that released revision. I think the LEFT JOIN has to be added after this table.
The example at http://sqlfiddle.com/#!9/bd6068/3/0 shows this better.
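A sketch of the query shape described above, as I read it (the aliases, the scalar subquery for the DateReleased variable, and the greatest-per-group LEFT JOIN condition are my assumptions; the fiddle has the working version):

SELECT d.DocumentID, d.Filename, v.VariableName AS Variable,
       vv.Value, vv.Revision AS VariableRev,
       rel.Value AS DateReleased, rel.Revision AS ReleasedRev
FROM Documents d
-- Second copy of VariableValues holding only the DateReleased rows.
JOIN VariableValues rel
  ON rel.DocumentID = d.DocumentID
 AND rel.VariableID = (SELECT VariableID FROM Variables
                       WHERE VariableName = 'DateReleased')
JOIN VariableValues vv
  ON vv.DocumentID = d.DocumentID AND vv.Revision <= rel.Revision
JOIN Variables v
  ON v.VariableID = vv.VariableID AND v.VariableName <> 'DateReleased'
-- LEFT JOIN / IS NULL: keep vv only when no later revision of the same
-- variable exists at or below the released revision.
LEFT JOIN VariableValues newer
  ON newer.DocumentID = vv.DocumentID
 AND newer.VariableID = vv.VariableID
 AND newer.Revision > vv.Revision
 AND newer.Revision <= rel.Revision
WHERE newer.DocumentID IS NULL;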

Labeling rows in pandas using multiple boolean conditions without chaining

I'm trying to label data in the original dataframe based on multiple boolean conditions. This is easy enough when labeling based on one or two conditions, but as I begin requiring multiple conditions the code becomes difficult to manage. The obvious solution seems to be breaking the code down into copies, but that causes chained-assignment errors. Here is one example of the issue...
This is a simplified version of what my data looks like:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([['ABC',1,3,3,4], ['std',0,0,2,4], ['std',2,1,2,4], ['std',4,4,2,4], ['std',2,6,2,6]]), columns=['Note', 'Na','Mg','Si','S'])
df
Note Na Mg Si S
0 ABC 1 3 3 4
1 std 0 0 2 4
2 std 2 1 2 4
3 std 4 4 2 4
4 std 2 6 2 6
A standard (std) is located throughout the dataframe. I would like to create a label when the instrument fails. This occurs in the data when:
String condition met (Note = standard/std)
Na>0 & Mg>0
The row falls outside of a calculated range for more than 2 elements.
For requirement 3 - Here is an example of a range:
maxMin=pd.DataFrame(np.array([['Max',3,3,3,7], ['Min',1,1,2,2]]), columns=['Note', 'Na','Mg','Si','S'])
maxMin
Note Na Mg Si S
0 Max 3 3 3 7
1 Min 1 1 2 2
Calculating the out-of-bounds standard:
elements=['Na','Mg','Si','S']
std=df[(df['Note'].str.contains('std|standard'))&(df['Na']>0)&(df['Mg']>0)]
df.loc[(std[elements].lt(maxMin.loc[1, :])|std[elements].gt(maxMin.loc[0, :]).select_dtypes(include=['bool'])).sum(axis=1)>2]
Note Na Mg Si S
3 std 4 4 2 4
Now, I would like to label this datapoint within the original dataframe. Desired result:
Note Na Mg Si S Error
0 ABC 1 3 3 4 False
1 std 0 0 2 4 False
2 std 2 1 2 4 False
3 std 4 4 2 4 True
4 std 2 6 2 6 False
I've tried things like:
df['Error'].loc[std.loc[(std[elements].lt(maxMin.loc[1, :])|std[elements].gt(maxMin.loc[0, :]).select_dtypes(include=['bool'])).sum(axis=1)>5].index.values.copy()]=True
That unfortunately causes a chain error.
How would you accomplish this without causing a chained-assignment error? Most books/tutorials revolve around creating one long expression, but as I dive deeper, I feel there might be a simpler solution. Any input would be appreciated.
I figured out a solution that works for me.
The solution was to use .index.values to create an array of the indices that pass the boolean conditions. That array can then be used to edit the original dataframe.
##These two conditions can probably be combined
condition1=df[(df['Note'].str.contains('std|standard'))&(df['Na']>.01)&(df['Mg']>.01)]
##where condition1 is greater/less than the bounds of the known value.
##provides array where condition is true
OutofBounds=condition1.loc[(condition1[elements].lt(maxMin.loc[1, :])|condition1[elements].gt(maxMin.loc[0, :]).select_dtypes(include=['bool'])).sum(axis=1)>5].index.values
OutofBounds
out:array([ 3], dtype=int64)
Now I can pass the array into the original dataframe:
df.loc[OutofBounds, 'Error']=True
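For completeness, a sketch of the mask-based alternative that avoids chained assignment entirely: build one boolean mask and write through a single column assignment. This assumes the df, maxMin, and elements defined above, casts the element columns to numbers (the np.array constructor made them strings), and uses >= 2 as the threshold since the flagged row 3 has exactly two out-of-range elements (adjust to taste):

# Cast element columns to numeric (np.array above made them strings).
df[elements] = df[elements].astype(float)
maxMin[elements] = maxMin[elements].astype(float)

# Per row, count how many elements fall outside the Min/Max bounds.
out_of_range = (df[elements].lt(maxMin.loc[1, elements])
                | df[elements].gt(maxMin.loc[0, elements])).sum(axis=1)

# One mask, one assignment: no chained indexing, no SettingWithCopyWarning.
df['Error'] = (df['Note'].str.contains('std|standard')
               & (df['Na'] > 0) & (df['Mg'] > 0)
               & (out_of_range >= 2))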

iterrows() of 2 columns and save results in one column

In my dataframe I want to use iterrows() over two columns but save the result in one column. For example, df is:
x y
5 10
30 445
70 32
expected output is
points sequence
5 1
10 2
30 1
445 2
I know about iterrows(), but it saves the output in two different columns. How can I get the expected output, and is there any way to generate the sequence number according to a condition? Any help will be appreciated.
First, never use iterrows, because it is really slow.
If you want a 1, 2 sequence by number of columns, convert the values to a numpy array with DataFrame.to_numpy, flatten with numpy.ravel, then build the sequence with numpy.tile:
import numpy as np
import pandas as pd
df = pd.DataFrame({'points': df.to_numpy().ravel(),
                   'sequence': np.tile([1, 2], len(df))})
print(df)
points sequence
0 5 1
1 10 2
2 30 1
3 445 2
4 70 1
5 32 2
Or, if you do want iterrows, flatten each row and number its values:
>>> pd.DataFrame([(v, j + 1) for _, row in df.iterrows() for j, v in enumerate(row)], columns=['points', 'sequence'])
   points  sequence
0       5         1
1      10         2
2      30         1
3     445         2
4      70         1
5      32         2

Removing duplicates from many Excel sheets

I have a question: is there any fast way to remove duplicate rows across two Excel spreadsheets? After searching, I can do it by comparing rows at the same position in the two spreadsheets (VBA). But I want to check whether a row from sheet one is included anywhere in sheet two. If exactly the same row exists in sheet two, it should be removed. So far I can only do it when the rows are in the same position (e.g. row 1 against row 1).
Thanks in advance for any kind of help.
I can think of a workaround for this:
Create a column at the end of each row which is a concatenation of all the columns of that particular row. Let's say below are the two tables on the two Excel sheets:
sheet1
A B C D(Concat)
1 2 3 123
4 5 6 456
7 8 9 789
1 3 5 135
4 3 2 432
sheet2
A B C D(Concat)
2 3 4 234
1 1 1 111
1 2 3 123
2 2 2 222
4 5 6 456
We will now identify the duplicate rows based on the concatenated column. Using the formula =IF(ISNUMBER(MATCH(D4,Sheet1!D:D,0)),"DUP","NONDUP") in the second sheet, we can identify the rows that are already present in sheet1, irrespective of their position in sheet1 relative to sheet2.
Result on Sheet2 shows up as below:
A B C D E(Result)
2 3 4 234 NONDUP
1 1 1 111 NONDUP
1 2 3 123 DUP
2 2 2 222 NONDUP
4 5 6 456 DUP
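One caveat worth noting with the concatenation trick: plain concatenation can collide, e.g. the rows (1, 23, 4) and (12, 3, 4) both concatenate to 1234. Joining the cells with a delimiter avoids this; a sketch of the helper column, assuming the data starts in row 2:

D2: =A2&"|"&B2&"|"&C2

Fill this down on both sheets and point the MATCH formula at column D as before.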