How to create new columns using groupby based on logical expressions - pandas

I have this CSV file
http://www.sharecsv.com/s/2503dd7fb735a773b8edfc968c6ae906/whatt2.csv
I want to create three columns, 'MT_Value', 'M_Value', and 'T_Data'. The first one should hold the mean of the data grouped by year and month, which I accomplished by doing this.
data.groupby(['Year','Month']).mean()
But for M_Value I need the mean of only the values different from zero, and for T_Data I need the count of the values that are zero divided by the total number of values. I guess that for the last one I have to divide the number of zero values by the total number of values in each group, but honestly I am a bit lost. I looked on Google and they mention something about transform, but I didn't understand it very well.
Thank you.

You could do something like this:
(data.assign(M_Value=data.Valor.where(data.Valor != 0),
             T_Data=data.Valor.eq(0))
     .groupby(['Year', 'Month'])
     [['Valor', 'M_Value', 'T_Data']]
     .mean()
)
Explanation: assign creates new columns with the respective names. data.Valor.where(data.Valor != 0) replaces 0 values with NaN, which is ignored when we call mean(). data.Valor.eq(0) yields True for zeros and False for everything else, which mean() treats as 1 and 0, so calling mean() effectively computes count(Valor == 0) / total_count().
Output:
Valor M_Value T_Data
Year Month
1970 1 2.306452 6.500000 0.645161
2 1.507143 4.688889 0.678571
3 2.064516 7.111111 0.709677
4 11.816667 13.634615 0.133333
5 7.974194 11.236364 0.290323
... ... ... ...
1997 10 3.745161 7.740000 0.516129
11 11.626667 21.800000 0.466667
12 0.564516 4.375000 0.870968
1998 1 2.000000 15.500000 0.870968
2 1.545455 5.666667 0.727273
[331 rows x 3 columns]
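If you prefer to keep the groupby call explicit, a similar result can be expressed with named aggregation. This is just a sketch, assuming data has already been read from the CSV and has the same Year, Month and Valor columns:
# sketch: same idea via named aggregation instead of assign
out = (data.groupby(['Year', 'Month'])
           .agg(Valor=('Valor', 'mean'),
                M_Value=('Valor', lambda s: s[s != 0].mean()),
                T_Data=('Valor', lambda s: s.eq(0).mean())))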

Related

How do you iterate through a data frame based on the value in a row

I have a data frame which I am trying to iterate through, not based on time, but on an increase of 10, for example:
Column A  Column B
12:05     1
13:05     6
14:05     11
15:05     16
So in this case it would return a new data frame with the rows containing 1 and 11. How am I able to do this? The different methods I have tried, such as asfreq and resample, don't seem to work; they say invalid frequency, which I think is because this isn't time based. What is the function that allows me to do this based on a numerical value such as 10 or 7 rather than on time? I don't want every nth value, but every row where the column value has changed by 10 from the last selected value, e.g. 1 to 11; then, if the next values were 12, 15, 17, 21, it would be 21.
Here is one way to do it:
# do a remainder division, and choose rows where remainder is zero
# offset by the first value, to make calculation simpler
first_val = df.loc[0, 'Column B']
df.loc[((df['Column B'] - first_val) % 10).eq(0)]
Column A Column B
0 12:05 1
2 14:05 11
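If the requirement is strictly "every time the value has changed by 10 or more since the last selected row" (as in the 12, 15, 17, 21 example), the remainder trick above only catches values that land exactly on a multiple, so a small loop may be needed. A rough sketch, assuming the same 'Column B' column and values that only increase:
# sketch: keep the first row, then any row whose value has moved by 10 or
# more since the last row we kept
keep = []
last_kept = None
for idx, val in df['Column B'].items():
    if last_kept is None or val - last_kept >= 10:
        keep.append(idx)
        last_kept = val
result = df.loc[keep]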

Pandas Cumulative sum over 1 indice but not the other 3

I have a dataframe with 4 variables, DIVISION, QTR, MODEL_SCORE, MONTH, with the sum of variable X aggregated by those 4.
I would like to effectively partition the data by DIVISION, QTR, and MODEL_SCORE and keep a running total ordered by the MONTH field, smallest to largest. The idea is that it would reset once it gets to a new permutation of the other 3 columns.
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I'm trying
df['cumsum'] = df.groupby(level=3)['X'].cumsum()
having tried every number I can think of for the level argument. It seems to work every way other than the one I want.
EDIT: I know the below isn't formatted ideally, but basically, as long as the only variable changing is MONTH the cumulative sum should continue, while a change in any other variable should cause it to reset.
DIVISION  QTR  MODEL  MONTHS  X   CUMSUM
A         1    1      1       10  10
A         1    1      2       20  30
A         1    2      1       5   5
I'm sorry for all the trouble; I believe the answer was way simpler than I was making it out to be.
After
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I was supposed to reset the index, since I did not want a multi-index, and this appears to have worked:
df = df.reset_index()
df['cumsum'] = df.groupby(['DIVISION','MODEL','QTR'])['X'].cumsum()
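For reference, a minimal sketch of that final approach on a toy frame built from the rows in the question:
import pandas as pd

df = pd.DataFrame({'DIVISION': ['A', 'A', 'A'],
                   'QTR':      [1, 1, 1],
                   'MODEL':    [1, 1, 2],
                   'MONTHS':   [1, 2, 1],
                   'X':        [10, 20, 5]})

df = df.groupby(['DIVISION', 'MODEL', 'QTR', 'MONTHS'])['X'].sum().reset_index()
df['cumsum'] = df.groupby(['DIVISION', 'MODEL', 'QTR'])['X'].cumsum()
# cumsum runs 10, 30 while only MONTHS changes, then resets to 5 when MODEL changes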

How to use double groupby in Pandas and filter based on if condition?

I have a data frame called df that looks like this in Pandas:
id amt date seq
SB 450,000,000 2020-05-11 1
OM 430,000,000 2020-05-11 1
SB 450,000,000 2020-05-12 1
OM 450,000,000 2020-05-12 1
OM 130,000,000 2020-05-12 2
I need to find the value in amt for each id for each day. The issue is that on some days there are multiple cycles, as indicated by "seq".
If there are 2 cycles (i.e. seq=2) for any one day, I need to take the value where seq=2 for that id on that day, and drop any values for seq=1 with the same day and id. On some days there is only 1 cycle for a given id, and on those days I can just stick with the value where seq=1.
My goal is to groupby day and then groupby id again, then apply an if statement: if the seq column contains a 2 for that id and that day, filter that groupby object down to only the row where seq=2 for that day and id. The end result would be a data frame with only the rows where seq=2 for any day with multiple cycles, and the rows where seq=1 for days where there is only one cycle for that id.
So far I have tried:
for day in df.groupby(df['date']):
    for id in day[1].groupby(['id']):
        if 2 in id[1]['seq']:
            id[1] = id[1].apply(lambda g: g[g['seq'] == 2])
Which gives me:
KeyError: 'seq'
and I have also tried:
for day in df.groupby(df['date']):
    for id in day[1].groupby(['id']):
        id = list(id)
        if 2 in id[1]['seq']:
            id[1] = id[1][id[1]['seq'] == 2]
Which runs fine but then doesn't actually change or do anything to df (the same number of rows remain).
Can anyone help me with how I can accomplish this?
Thank you in advance!
You can do this if you groupby date + id, then get the indices of the rows where seq is at its maximum for those groupings. Once you have those indices, you can slice back into the original dataframe to get your desired subset:
max_seq_indices = df.groupby(["date", "id"])["seq"].idxmax()
print(max_seq_indices)
date        id
2020-05-11  OM    1
            SB    0
2020-05-12  OM    4
            SB    2
Name: seq, dtype: int64
Looking at the values of this Series, you can see that the maximum seq for ["2020-05-11", "OM"] is at row 1. Likewise, the maximum seq for ["2020-05-11", "SB"] is at row 0, and so on. If we use this to slice back into our original dataframe, we end up with the subset you described in your question:
new_df = df.loc[max_seq_indices]
print(new_df)
  id          amt        date  seq
1 OM 430,000,000 2020-05-11 1
0 SB 450,000,000 2020-05-11 1
4 OM 130,000,000 2020-05-12 2
2 SB 450,000,000 2020-05-12 1
This approach will encounter issues if you have a seq greater than 2 but only want the rows where seq is 2. However, if that is the case, leave a comment and I can update my answer with a more robust (but probably more complex) solution.
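One possible shape of that more robust variant (an assumption, not part of the original answer): if seq can exceed 2 but only seq=2 rows are wanted, with a fallback to seq=1 where no 2 exists, you could cap the candidates before taking idxmax:
# hypothetical variant: ignore seq values above 2, then keep the highest
# remaining seq per (date, id) group
candidates = df[df['seq'] <= 2]
max_seq_indices = candidates.groupby(['date', 'id'])['seq'].idxmax()
new_df = df.loc[max_seq_indices]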
You can also work with a sorted dataframe like:
df.sort_values(['date', 'id', 'seq'], inplace=True)
Then you can use groupby to take just the last of each group
df.reset_index(drop=True).groupby(['date', 'id'])['amt'].agg('last')
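Note that agg('last') returns only the amt values; if the full rows are wanted instead, one option (an assumption, not spelled out in the answer) is to drop duplicates on the already sorted frame:
# keep the last (highest-seq) row per (date, id) after the sort above
new_df = df.drop_duplicates(subset=['date', 'id'], keep='last')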

DAX - Reference measure in calculated column?

I have data like this
EmployeeID Value
1 7
2 6
3 5
4 3
I would like to create a DAX calculated column (or do I need a measure?) that gives me, for each row, Value minus the AVG() of the selected rows.
So if the AVG() of the above 4 rows is 5.25, I would get results like this
EmployeeID Value Diff
1 7 1.75
2 6 0.75
3 5 -0.25
4 3 -1.75
I am still learning DAX and cannot figure out how to implement this.
Thanks
I figured this out with the help of some folks on MSDN forums.
This will only work as a measure because measures are selection aware while calculated columns are not.
The Average stored in a variable is critical. ALLSELECTED() gives you the current selection in a pivot table.
AVERAGEX does the row value - avg of selection.
Diff :=
VAR ptAVG = CALCULATE(AVERAGE(Employee[Value]), ALLSELECTED())
RETURN AVERAGEX(Employee, Employee[Value] - ptAVG)
You can certainly do this with a calculated column. It's simply
Diff = TableName[Value] - AVERAGE(TableName[Value])
Note that this averages over all employees. If you want to average over only specific groups, then more work needs to be done.

Business Objects CountIf by cell reference

So I have a column with this data
1
1
1
2
3
4
5
5
5
How can I do a count-if where the value at any given location in the above table is equal to a cell I select? I.e. doing Count([NUMBER]) Where([NUMBER] = Coordinates(0,0)) would return 3, because there are 3 rows whose value equals the 1 in the 0 position.
It's basically like in Excel, where you can do COUNTIF(A:A, 1) and it would give you the total number of rows where the value in A:A is 1. Is this possible to do in Business Objects Web Intelligence?
Functions in WebI operate on rows, so you have to think about it a little differently.
If your intent is to create a cell outside of the report block and display the count of specific values, you can use Count() with Where():
=Count([NUMBER];All) Where ([NUMBER] = "1")
In a freestanding cell, the above will produce a value of "3" for your sample data.
If you want to put the result in the same block and have it count up the occurrences of values on that row, for example:
NUMBER    NUMBER Total
1         3
1         3
1         3
2         1
3         1
4         1
5         3
5         3
5         3
it gets a little more complicated. You have to have at least one other dimension in the query to reference. It can be anything, but you have to be counting something in conjunction with the NUMBER dimension. So, the following would work, assuming there's another dimension in the query named [Duh]:
=Count([NUMBER];All) ForAll([Duh])