Expanding a bit on this question, I want to capture changes in values specifically when the previous column value is 0 or when the next column value is 0.
Given the following dataframe, I can track value changes from one column to the next using diff and aggregate these fluctuations into a new set of values.
Item Jan_20 Apr_20 Aug_20 Oct_20
Apple 3 4 4 4
Orange 5 5 1 2
Grapes 0 0 4 4
Berry 5 3 0 0
Banana 0 2 0 0
However, how would I capture such differences only when the value changes from one column to the next either specifically from 0 or to 0, tracking those as new fruit or lost fruit, respectively?
Desired outcome:
Type Jan_20 Apr_20 Aug_20 Oct_20
New Fruits 0 2 4 0
Lost Fruits 0 0 5 0
Put another way, in the example, since Grapes go from a value of 0 in Apr_20 to 4 in Aug_20, I want 4 to be captured and stored in New Fruits. Similarly, since Banana and Berry both go from a value higher than zero in Apr_20 to 0 in Aug_20, I want to aggregate those values in Lost Fruits.
How could this be achieved?
This can be achieved using masks to hide the non-relevant data, combined with diff and sum.
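For a self-contained run, the example dataframe can first be rebuilt (a reconstruction of the table above, assuming pandas is available):

import pandas as pd

df = pd.DataFrame({'Item':   ['Apple', 'Orange', 'Grapes', 'Berry', 'Banana'],
                   'Jan_20': [3, 5, 0, 5, 0],
                   'Apr_20': [4, 5, 0, 3, 2],
                   'Aug_20': [4, 1, 4, 0, 0],
                   'Oct_20': [4, 2, 4, 0, 0]})

Then: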
d = df.set_index('Item')
# mask selecting the values that are equal to zero
m = d.eq(0)
# difference from the previous date
d = d.diff(axis=1)
# "New": increases out of a zero (the mask shifted one column to the right);
# "Lost": drops into a zero (the mask in place), negated so the totals are positive
out = pd.DataFrame({'New':  d.where(m.shift(axis=1)).sum(),
                    'Lost': -d.where(m).sum()}).T
Output:
Jan_20 Apr_20 Aug_20 Oct_20
New 0 2 4 0
Lost 0 0 5 0
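If you want the exact labels from the desired outcome, an optional finishing step (the 'Type' axis name is just cosmetic; cast with .astype(int) if the sums come out as floats):

out = (out.rename(index={'New': 'New Fruits', 'Lost': 'Lost Fruits'})
          .rename_axis('Type')
          .astype(int))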
I have a CSV with around 8 million rows, something like this:
a b c
0 2 3
and I wanted to generate from it new rows based on the second and the third value so I will get:
a b c
0 2 3
0 3 3
0 4 3
0 5 3
which is basically just iterating through every row (in this example, one row) and then creating a new row with a value of b+i, where i runs from 0 to the value of c, including c itself.
The c column is irrelevant after the rows have been generated. The problem is that the file has millions of rows, and doing this may generate many more, so how can I do it efficiently? (Loops are too slow for that amount of data.)
Thanks
You can reindex on the repeated index:
# repeat each row c+1 times (once for each i in 0..c)
out = df.loc[df.index.repeat(df['c'] + 1)]
# within each repeated group, add the running counter 0, 1, ..., c to b
out['b'] += out.groupby(level=0).cumcount()
print(out)
Output (reset index if you want):
a b c
0 0 2 3
0 0 3 3
0 0 4 3
0 0 5 3
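Since c is irrelevant once the rows have been generated, you can also drop it and renumber in one step:

out = out.drop(columns='c').reset_index(drop=True)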
Note that since you blow your data up by the c column and you already have 8 million rows, the new dataframe may be too big to hold in memory on its own.
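If memory does become the bottleneck, the same idea can be applied chunk by chunk; a rough sketch (the file names and chunk size are hypothetical, not from the question):

import pandas as pd

# stream the input a million rows at a time and append the expanded rows
# to the output file instead of materializing everything at once
chunks = pd.read_csv('data.csv', chunksize=1_000_000)  # hypothetical input file
with open('expanded.csv', 'w') as f:                   # hypothetical output file
    for i, chunk in enumerate(chunks):
        out = chunk.loc[chunk.index.repeat(chunk['c'] + 1)].copy()
        out['b'] += out.groupby(level=0).cumcount()
        # c is irrelevant once the rows are generated, so drop it before writing
        out.drop(columns='c').to_csv(f, index=False, header=(i == 0))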
I have the following dataframe:
d = {'value': [1,1,1,1,1,1,1,1,1,1], 'flag_1': [0,1,0,1,1,1,0,1,1,1],'flag_2':[1,0,1,1,1,1,1,0,1,1],'index':[1,2,3,4,5,6,7,8,9,10]}
df = pd.DataFrame(data=d)
I need to perform the following filter on it:
If flag_1 and flag_2 are equal, keep only the row with the maximum index from each run of consecutive indices. Below, flag_1 and flag_2 are equal for rows 4, 5, 6 and for rows 9, 10. From the group of consecutive indices 4, 5, 6 I therefore wish to keep only row 6 and drop rows 4 and 5; from the group of rows 9 and 10 I wish to keep only row 10. The rows where flag_1 and flag_2 are not equal should all be retained. I want my final output to look as shown below:
value flag_1 flag_2 index
0 1 0 1 1
1 1 1 0 2
2 1 0 1 3
5 1 1 1 6
6 1 0 1 7
7 1 1 0 8
9 1 1 1 10
I am really not sure how to achieve this, so I would be grateful for any advice on how to do it.
IIUC, you can compare consecutive rows with shift. This solution requires a sorted index.
In [5]: df[~df[['flag_1', 'flag_2']].eq(df[['flag_1', 'flag_2']].shift(-1)).all(axis=1)]
Out[5]:
value flag_1 flag_2 index
0 1 0 1 1
1 1 1 0 2
2 1 0 1 3
5 1 1 1 6
6 1 0 1 7
7 1 1 0 8
9 1 1 1 10
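If the one-liner is hard to parse, here is the same logic broken into named steps (purely a restatement of the code above):

# a row is dropped when the following row repeats both of its flags
flags = df[['flag_1', 'flag_2']]
same_as_next = flags.eq(flags.shift(-1)).all(axis=1)
out = df[~same_as_next]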
I am trying to create a chart from a large data set. The structure is as below.
Sample dataframe:
df = pd.DataFrame({'climate': ['hot', 'hot', 'hot', 'cold', 'cold'],
                   0: ['none', 'apple', 'apple', 'orange', 'grape'],
                   1: ['orange', 'none', 'grape', 'apple', 'banana'],
                   2: ['grape', 'kiwi', 'tomato', 'none', 'tomato']})
I need to plot how many of each fruit exist in each climate; I need two charts, one for hot and one for cold.
A pivot table with aggregation is not possible because there are no numerical values.
What method do you recommend?
First do melt, then pd.crosstab:
# long format: one row per (climate, original column, fruit)
s = df.melt('climate')
# count how many times each fruit appears in each original column
s = pd.crosstab(s.variable, s.value)
value apple banana grape kiwi none orange tomato
variable
0 2 0 1 0 1 1 0
1 1 1 1 0 1 1 0
2 0 0 1 1 1 0 2
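Since the goal is counts per climate, you can also crosstab on the climate column and plot each row separately; a small sketch (the plotting part assumes matplotlib is available):

import matplotlib.pyplot as plt

s = df.melt('climate')
counts = pd.crosstab(s['climate'], s['value'])
# one bar chart per climate
fig, axes = plt.subplots(1, 2)
counts.loc['hot'].plot.bar(ax=axes[0], title='hot')
counts.loc['cold'].plot.bar(ax=axes[1], title='cold')
plt.show()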
I have a list of index numbers that represent index locations for a DF: list_index = [2, 7, 12].
I want to sum over a single column in the DF by rolling through each number in list_index, totaling the counts between the index points (and restarting the count at 0 at each index point). Here is a mini example.
The desired output is in the OUTPUT column, which increments every time there is another 1 in COL 1 and RESTARTS the count at 0 at the location after each number in list_index.
I was able to get it to work with a loop, but there are millions of rows in the DF and the loop takes a while to run. It seems like I need a lambda function with a sum, but I would need to supply the start and end points of each interval.
Something like lambda x: x.rolling(start_index, end_index).sum()? Can anyone help me out with this?
You can try a cumulative sum and keep only the information related to the 1 values; a rolling sum with varying intervals is not possible:
# running count of the 1s over the whole column
a = df['col'].eq(1).cumsum()
# subtract the running count as of the most recent 0, so the count restarts there
df['output'] = a - a.mask(df['col'].eq(1)).ffill().fillna(0).astype(int)
Out:
col output
0 0 0
1 1 1
2 1 2
3 0 0
4 1 1
5 1 2
6 1 3
7 0 0
8 0 0
9 0 0
10 0 0
11 1 1
12 1 2
13 0 0
14 0 0
15 1 1
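The code above restarts the count at every 0, which reproduces the table shown. If you instead need the count to reset only on the row after each position in list_index, as the question's prose describes, a grouped cumulative sum is one option; a sketch, assuming the default RangeIndex:

list_index = [2, 7, 12]
# True on the rows named in list_index
boundary = df.index.to_series().isin(list_index)
# the group id increases on the row *after* each boundary
group = boundary.shift(fill_value=False).cumsum()
# cumulative count of the 1s, restarting at the start of every group
df['output'] = df['col'].eq(1).groupby(group).cumsum()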
I have a super strange problem which I spent the last hour trying to solve, but with no success. It is even more strange since I can't replicate it on a small scale.
I have a large DataFrame (150,000 entries). I took out a subset of it and did some manipulation. The subset was saved as a different variable, x.
x is smaller than the df, but its index is in the same range as the df's. I'm now trying to assign x back to the DataFrame, replacing values in the same column:
rep_Callers['true_vpID'] = x.true_vpID
This inserts all the different values of x into the right places in df, but instead of keeping the df.true_vpID values that are not in x, it fills them with NaNs. So I tried a different approach:
df.ix[x.index,'true_vpID'] = x.true_vpID
But instead of filling the x values into the right places in df, df.true_vpID gets filled with the first value of x, and only it! I changed the first value of x several times to make sure this is indeed what is happening, and it is. I tried to replicate it on a small scale, but it didn't work:
df = pd.DataFrame({'a': np.ones(5), 'b': range(5)})
a b
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
z = pd.Series(np.random.random(5), index=range(5))
0 0.812561
1 0.862109
2 0.031268
3 0.575634
4 0.760752
df.ix[z.index[[1,3]],'b'] = z[[1,3]]
a b
0 1 0.000000
1 1 0.812561
2 1 2.000000
3 1 0.575634
4 1 4.000000
I really tried it all, need some new suggestions...
Try using df.update(updated_df_or_series). update works in place, aligns on the index, and only overwrites positions that are present (and non-NA) in its argument, which is the behavior you are after.
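A minimal sketch of what that looks like (the column name matches the question; the values and index are made up):

import pandas as pd

df = pd.DataFrame({'true_vpID': [1.0, 2.0, 3.0, 4.0, 5.0]})
x = pd.Series([20.0, 40.0], index=[1, 3], name='true_vpID')

df.update(x)  # in place: rows 1 and 3 are overwritten, all other rows keep their values
print(df)
#    true_vpID
# 0        1.0
# 1       20.0
# 2        3.0
# 3       40.0
# 4        5.0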
Also, as a simple example, you can modify a DataFrame by selecting rows with .loc and assigning to the selection in place. (.ix, used in older answers, is deprecated and has been removed from recent pandas, and modifying a separate slice object no longer reliably writes through to the original frame, so assign through the original directly.)
df_1
a b
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
# .loc slices are label-based and inclusive; on this index the slice covers rows 3 and 4
df_1.loc[3:5, 'b'] += 2
df_1
a b
0 1 0
1 1 1
2 1 2
3 1 5
4 1 6