Looping through variables in spss - variables

I'm looking for a way to loop through variables (e.g. week01 to week52) and count the number of times the value changes across them. For example:
week01 to week18 may be coded as 1
week19 to week40 may be coded as 4
and week41 to week52 may be coded as 3
That would be 2 transitions within the data.
How could I go about writing code that finds this information? I'm rather new to this, and some help to get me in the right direction would be much appreciated.

You can use the DO REPEAT command to loop through variable lists. Below is an example of using this command to create a before date and after date to compare, and increment a count variable whenever these two variables are different.
data list fixed / observation (A1).
begin data
1
2
3
4
5
end data.
*making random data.
vector week(52).
do repeat week = week1 to week52.
compute week = RND(RV.UNIFORM(0.5,4.4)).
end repeat.
execute.
*initialize count to zero.
compute count = 0.
do repeat week_after = week2 to week52 / week_before = week1 to week51.
if week_after <> week_before count = count + 1.
end repeat.
execute.
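For readers who want to check the before/after comparison outside SPSS, here is a minimal sketch of the same idea in Python with pandas. It is not part of the SPSS answer; the column names week01 to week52 and the single example row (taken from the question) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical column names week01 .. week52, as in the question.
cols = [f"week{i:02d}" for i in range(1, 53)]

# One example row: weeks 1-18 coded 1, weeks 19-40 coded 4, weeks 41-52 coded 3.
values = [1] * 18 + [4] * 22 + [3] * 12
df = pd.DataFrame([values], columns=cols)

# Compare each week with the week before it, like the paired
# week_before / week_after lists in the DO REPEAT loop.
before = df[cols[:-1]].to_numpy()
after = df[cols[1:]].to_numpy()
df["count"] = (after != before).sum(axis=1)

print(df["count"].tolist())  # [2]
```

The result of 2 matches the two transitions described in the question.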

Related

Find max and last value from a googlesheet query skipping x rows

I have a data set in google sheets, for each week of data I have 3 rows. I wish to query the data in every second row to calculate the max value and the last value.
For instance:
ROW DATA
1 800
2 Text
3 500
4 More text
5 600
6 Blah
7 700
8 Blah
For Max value I have the following which will return 800
MAX(FILTER(QUERY(A1:A,"Select * skipping 2"), QUERY(A1:A,"Select * skipping 2") <> 0))
How do I change it up to return the last value? Which should return 700
try:
=LOOKUP(2^99,FILTER(A:A,A:A<>0))
@rockinfreakshow's answer will successfully find the last number.
To keep only every nth row of a range, you can use:
=FILTER(A:A,MOD(ROW(A:A),n)=1)
Replace n with your desired step, and 1 with the position of the row you want within each group of n rows: 1 for the first, 2 for the second, but 0 if you want the nth one. To find the maximum, just wrap the formula in MAX().
To find the last value, whether it is text or a number, you can use SORTN and SEQUENCE:
=SORTN(FILTER(A:A,MOD(ROW(A:A),n)=1,A:A<>""),1,1, SEQUENCE(COUNTA(FILTER(A:A,MOD(ROW(A:A),n)=1,A:A<>""))),0)
This sorts the filtered elements in reverse order and keeps only the first one.
Remember to replace n with the number of rows per group and =1 with the position of the row you want to choose.

updating the next several row values based on the value of a row in another column

I'm trying to figure out how to add the values of one column (the amount column) to the next few rows based on the condition of another column (the days column). If the condition of the days column is greater than 1, for each day greater than 1 I add the amount column to that many following rows. So if days is three, I add the amount to the next two rows (the first day is just the current row). I actually think this is easier if I make a copy of the amount column, so I made a copy called backlog.
So let's say I have an amount column that represents the amount of support tickets that need to be resolved each day. Each amount has a number of days it takes for the amount to be resolved. I need the total amount to be a sum of the value today and the sum of the outstanding tickets. So if I have an amount of 1 for 2 days, I have 1 ticket amount today and I add that same 1 tomorrow to the ticket amount of tomorrow. If this doesn't make sense, the below examples will. I have a solution as well, but my main issue is doing this efficiently.
Here is a sample dataframe to use:
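import random

import numpy as np
import pandas as pd

# 10 zero amounts plus 15 random amounts between 1 and 3, shuffled
amount = list(np.zeros(10)) + [random.randint(1,3) for val in range(15)]
random.shuffle(amount)
ex = pd.DataFrame({
    'Amount': amount
})
# only rows with a non-zero amount get a random number of days
ex.loc[ex['Amount']>0, 'Days'] = [random.randint(0,4) for val in range(15)]
ex.loc[ex['Amount']==0, 'Days'] = 0
ex['Days'] = ex['Days'].astype(int)
ex['Backlog'] = ex['Amount']
ex.head(10)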
Input Dataframe:
Amount Days Backlog
2 0 2
1 3 1
2 2 2
3 0 3
Desired Output Dataframe:
Amount Days Backlog
2 0 2
1 3 1
2 2 3
3 0 6
In the last two values of the backlog column, I have a value of 3 (2 from the current day amount plus 1 from the prior day amount) and a value of 6 (3 for the current day + 2 from the previous day amount + 1 from two days ago).
I have made code for this below, which I think achieves the outcome:
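backlog_col = ex.columns.get_loc('Backlog')
for i in range(len(ex)):
    days = ex['Days'].iloc[i]
    if days >= 2:
        # carry this row's amount into the next (days - 1) rows
        for j in range(1, days):
            if i + j >= len(ex):
                break
            # single .iloc call avoids chained assignment on a copy
            ex.iloc[i + j, backlog_col] += ex['Amount'].iloc[i]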
The problem is that I'm already using two for loops to slice the data frame for two features first, so when this code is used as a function for a very large data frame it runs far too slowly, and my main goal has been to implement a faster way to do this. Is there a more efficient pandas method to achieve the same outcome? Possibly without having to use slow iteration or a nested for loop? I'm at a loss.
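One possible way to avoid the nested loop, offered as a sketch rather than a definitive answer, is a difference array: mark where each carried amount starts and stops, then take a cumulative sum. The function name add_backlog is hypothetical, and the sketch assumes the rule stated above (a row with Days = d adds its Amount to the following d - 1 rows, cut off at the end of the frame).

```python
import numpy as np
import pandas as pd

def add_backlog(ex):
    """Return a copy of ex with Backlog = Amount + amounts carried from earlier rows."""
    amount = ex["Amount"].to_numpy(dtype=float)
    days = ex["Days"].to_numpy(dtype=int)
    n = len(ex)

    idx = np.flatnonzero(days >= 2)        # rows that carry over
    start = np.minimum(idx + 1, n)         # first row receiving the amount
    stop = np.minimum(idx + days[idx], n)  # one past the last receiving row

    # +amount where a carry-over begins, -amount where it ends
    diff = np.zeros(n + 1)
    np.add.at(diff, start, amount[idx])
    np.add.at(diff, stop, -amount[idx])
    carried = np.cumsum(diff)[:n]

    out = ex.copy()
    out["Backlog"] = amount + carried
    return out

# The four rows from the input table above.
sample = pd.DataFrame({"Amount": [2, 1, 2, 3], "Days": [0, 3, 2, 0]})
result = add_backlog(sample)
print(result["Backlog"].tolist())  # [2.0, 1.0, 3.0, 6.0]
```

np.add.at is used instead of plain indexed addition so that several rows starting or stopping at the same position are all accumulated.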

Pandas Cumulative sum over 1 indice but not the other 3

I have a dataframe with 4 variables DIVISION, QTR, MODEL_SCORE, MONTH with the sum of variable X aggregated by those 4.
I would like to effectively partition the data by DIVISION, QTR, and MODEL_SCORE and keep a running total of X, ordered by the MONTH field from smallest to largest. The idea is that the total would reset whenever it reaches a new combination of the other 3 columns.
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I'm trying
df['cumsum'] = df.groupby(level=3)['X'].cumsum()
I have tried every number I can think of in the level argument, and it seems to work every way other than the one I want.
EDIT: I know the below isn't formatted ideally, but basically, as long as the only variable changing is MONTH, the cumulative sum continues; a change in any other variable causes it to reset.
DIVISION QTR MODEL MONTHS X CUMSUM
A 1 1 1 10 10
A 1 1 2 20 30
A 1 2 1 5 5
Sorry for all the trouble; I believe the answer was much simpler than I was making it out to be.
After
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I needed to reset the index, since I did not want a multi-index. This appears to have worked:
df = df.reset_index()
df['cumsum'] = df.groupby(['DIVISION','MODEL','QTR'])['X'].cumsum()
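As a self-contained check of the steps above, here is a small sketch that reproduces the three-row example from the question. The raw rows in it are made up so that the aggregated X values come out as 10, 20 and 5; only the column names come from the original code.

```python
import pandas as pd

# Made-up raw rows; the first two collapse into one (A, 1, 1, 1) group.
raw = pd.DataFrame({
    "DIVISION": ["A", "A", "A", "A"],
    "MODEL":    [1, 1, 1, 2],
    "QTR":      [1, 1, 1, 1],
    "MONTHS":   [1, 1, 2, 1],
    "X":        [4, 6, 20, 5],
})

# Aggregate, then flatten the MultiIndex back into ordinary columns.
agg = raw.groupby(["DIVISION", "MODEL", "QTR", "MONTHS"])["X"].sum().reset_index()

# Running total within each DIVISION/MODEL/QTR group, in MONTHS order.
agg = agg.sort_values(["DIVISION", "MODEL", "QTR", "MONTHS"])
agg["cumsum"] = agg.groupby(["DIVISION", "MODEL", "QTR"])["X"].cumsum()

print(agg["cumsum"].tolist())  # [10, 30, 5]
```

The explicit sort_values call is a precaution: groupby already returns sorted keys by default, but the running total depends on MONTHS being in ascending order within each group.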

Count past relative number of rows that meet criteria similar to excel's COUNTIF

I have a stock data csv with the following info
Open High Low Close
0 154.55 155.54 152.90 153.41
1 156.82 158.75 155.42 156.76
2 150.21 157.44 150.15 156.33
3 147.78 149.38 146.88 149.11
4 144.25 147.28 143.90 146.27
5 142.90 144.05 140.79 143.73
I want to count how many of the previous two highs were above the current open.
In excel I can do this
I would like to calculate this using pandas, but so far I have not been able to get anything working.
The closest I have gotten is with the following function and then using apply
def high_counter(day_open):
    count = 0
    for i in data['High'][:+2]:
        if i > day_open:
            count += 1
    return count
data['Number of previous highs above'] = data['Open'].apply(high_counter)
However, this leads to the comparison always starting from the top down instead of down from the relative cell like in excel.
To summarize, I need to compare the Open with the previous N highs and get a count of how many are above the relative open whether that Open is the first row or the 50th row. The Highs to be compared would begin at the row prior the Open and would be of N size.
IIUC:
df["above"] = pd.DataFrame([df["High"].shift(-1)>df["Open"],
df["High"].shift(-2)>df["Open"]]).T.sum(axis=1)
# or [df["High"].shift(n)>df["Open"] for n in range(-1,-5,-1)] if you want to generalize the number n
print (df)
Open High Low Close above
0 154.55 155.54 152.90 153.41 2
1 156.82 158.75 155.42 156.76 1
2 150.21 157.44 150.15 156.33 0
3 147.78 149.38 146.88 149.11 0
4 144.25 147.28 143.90 146.27 0
5 142.90 144.05 140.79 143.73 0
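The shift-based comparison above can be wrapped in a small function that takes the window size as a parameter. This is a sketch under the same assumption as the answer: the "previous" highs sit in the rows below the current one, so negative shifts are used. The function name count_prior_highs_above is hypothetical.

```python
import pandas as pd

def count_prior_highs_above(df, n=2):
    """Count how many of the n rows below each row have High > this row's Open."""
    # shift(-k) moves the High from k rows below up to the current row;
    # comparisons against the NaN padding at the bottom evaluate to False.
    flags = [df["High"].shift(-k) > df["Open"] for k in range(1, n + 1)]
    return pd.concat(flags, axis=1).sum(axis=1)

df = pd.DataFrame({
    "Open": [154.55, 156.82, 150.21, 147.78, 144.25, 142.90],
    "High": [155.54, 158.75, 157.44, 149.38, 147.28, 144.05],
})
df["above"] = count_prior_highs_above(df, n=2)
print(df["above"].tolist())  # [2, 1, 0, 0, 0, 0]
```

If the data were sorted oldest-first instead, the same function would need positive shifts (shift(k)) to look at earlier rows.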

DAX - Reference measure in calculated column?

I have data like this
EmployeeID Value
1 7
2 6
3 5
4 3
I would like to create a DAX calculated column (or do I need a measure?) that gives me for each row, Value - AVG() of selected rows.
So if the AVG() of the above 4 rows is 5.25, I would get results like this
EmployeeID Value Diff
1 7 1.75
2 6 0.75
3 5 -0.25
4 3 -1.75
Still learning DAX, I cannot figure out how to implement this?
Thanks
I figured this out with the help of some folks on MSDN forums.
This will only work as a measure, because measures are selection-aware while calculated columns are not.
Storing the average in a variable is critical. ALLSELECTED() gives you the current selection in a pivot table.
AVERAGEX then evaluates the row value minus the average of the selection for each row.
Diff:=
VAR ptAVG = CALCULATE(AVERAGE(Employee[Value]), ALLSELECTED())
RETURN AVERAGEX(Employee, Employee[Value] - ptAVG)
You can certainly do this with a calculated column. It's simply
Diff = TableName[Value] - AVERAGE(TableName[Value])
Note that this averages over all employees. If you want to average over only specific groups, then more work needs to be done.