Iterating over a dataframe twice: which is the ideal way?

I am trying to create a dataframe for a Sankey chart in Power BI, which needs source and destination columns like this:
id  Source         Destination
1   Starting a     next point b
1   next point b   final point c
1   final point c  end
2   Starting a     next point b
2   next point b
3   Starting a     next point b
3   next point b   final point c
3   final point c  end
I have a dataframe like this:
ID  flow
1   Starting a
1   next point b
1   final point c
2   Starting a
2   next point b
3   Starting a
3   next point b
3   final point c
I tried doing this by iterating over the dataframe twice, like below:
for index, row in df.iterrows():
    for j, r in df.iterrows():
        if row['ID'] == r['ID']:
            if (index + 1 == j) & ("final point c" not in row['flow']):
                df['Destination'][index] = df['flow'][j]
            elif "final point c" in row['flow']:
                df['Destination'][index] = 'End of flow'
Because this iterates over the same dataframe twice, it takes a long time to process when there are many records.
Is there a better way to do this? I looked at all the similar questions but couldn't find anything that relates to mine.

You could use groupby+shift and a bit of masking:
end = df['flow'].str.startswith('final point')
df2 = (df.assign(destination=df.groupby('ID')['flow'].shift(-1)
                               .mask(end, end.map({True: 'end'}))
                 )
         .rename(columns={'flow': 'source'})
       )
output:
   ID         source    destination
0   1     Starting a   next point b
1   1   next point b  final point c
2   1  final point c            end
3   2     Starting a   next point b
4   2   next point b            NaN
5   3     Starting a   next point b
6   3   next point b  final point c
7   3  final point c            end
Alternative with combine_first to fill the NaNs:
end = df['flow'].str.startswith('final point').map({True: 'end', False: ''})
df2 = (df.assign(destination=df.groupby('ID')['flow'].shift(-1).combine_first(end))
         .rename(columns={'flow': 'source'})
       )
output:
   ID         source    destination
0   1     Starting a   next point b
1   1   next point b  final point c
2   1  final point c            end
3   2     Starting a   next point b
4   2   next point b
5   3     Starting a   next point b
6   3   next point b  final point c
7   3  final point c            end
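For anyone who wants to paste and run this, here is a minimal self-contained sketch of the same groupby + shift idea. The sample data is copied from the question; the names nxt, is_final and sankey are only illustrative, and it uses a scalar 'end' in mask rather than a mapped Series, which should behave the same way here.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3],
    "flow": ["Starting a", "next point b", "final point c",
             "Starting a", "next point b",
             "Starting a", "next point b", "final point c"],
})

# Next step within each ID; the last row of every group gets NaN.
nxt = df.groupby("ID")["flow"].shift(-1)

# Rows that are a terminal "final point ..." step.
is_final = df["flow"].str.startswith("final point")

# Label terminal steps 'end'; unfinished flows (ID 2 here) keep NaN.
sankey = (
    df.rename(columns={"flow": "Source"})
      .assign(Destination=nxt.mask(is_final, "end"))
)
print(sankey)
```

If the chart should not show links without a destination, sankey.dropna(subset=["Destination"]) would drop them, assuming that is the behaviour you want for incomplete flows.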

Related

Pandas Dataframe Checking Consecutive Values in a column

I have a pandas DataFrame like the one below.
EventOccurrence  Month
1                4
1                5
1                6
1                9
1                10
1                12
I need to add an identifier column to the above pandas DataFrame such that whenever Month is consecutive three times in a row, the value True is filled in, else False. I explored a few options like shift and window without luck. Any pointer is appreciated.
EventOccurrence  Month  Flag
1                4      F
1                5      F
1                6      T
1                9      F
1                10     F
1                12     F
Thank You.

You can check whether the diff between rows is one, and the diff shifted by 1 is one as well:
df['Flag'] = (df.Month.diff() == 1) & (df.Month.diff().shift() == 1)
   EventOccurrence  Month   Flag
0                1      4  False
1                1      5  False
2                1      6   True
3                1      9  False
4                1     10  False
5                1     12  False
Note that this will also return True for the fourth and later rows of a longer consecutive run, but that behaviour wasn't specified in the question, so I'll assume it's OK.
If it needs to only flag the third one, and not for example the fourth consecutive instance, you could add a condition:
df['Flag'] = (df.Month.diff() == 1) & (df.Month.diff().shift() == 1) & (df.Month.diff().shift(2) !=1)
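If the run length might change later, the same diff idea can be generalised with a rolling sum. This is only a sketch: flag_run_end and its parameter n are hypothetical names, not part of pandas, and it assumes n is at least 2. Like the first expression above, it also flags the fourth and later rows of a longer run.

```python
import pandas as pd


def flag_run_end(months: pd.Series, n: int = 3) -> pd.Series:
    """True where a row closes a run of at least n consecutive values."""
    # 1 where the value is exactly one more than the previous row, else 0.
    step_is_one = months.diff().eq(1).astype(int)
    # A run of n values needs n - 1 unit steps in a row.
    return step_is_one.rolling(n - 1).sum().eq(n - 1)


# Sample data copied from the question.
df = pd.DataFrame({"EventOccurrence": [1] * 6,
                   "Month": [4, 5, 6, 9, 10, 12]})
df["Flag"] = flag_run_end(df["Month"])
print(df)
```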

Replace value in column based on value in another column

I have a dataframe with 3240 rows and 3 columns. Column Block represents the block in which the values in columns A and B appeared. There are 6 unique blocks, and they repeat in sequence from 1-6 throughout the whole dataframe. Values in column A repeat in exact order from 1-10 throughout the whole dataframe (in every block). Values in column B run from a-j (n = 10), but they appear in random order within each block, and they are never duplicated within a block.
So in each of the 6 blocks, the values in column A (1-10) repeat in exact order from 1-10, while the values in column B (a-j) appear in random order.
The dataframe looks like this:
Block  A   B  ID
1      1   a  XY
1      2   b  XY
1      3   c  XY
1      4   d  XY
1      5   e  XY
1      6   f  XY
1      7   g  XY
1      8   h  XY
1      9   i  XY
1      10  j  XY
....
6      1   d  XY
...
6      6   j  XY
....
1      1   g  XX
1      2   a  XX
Throughout the dataframe, I would like to replace all values in column B based on the corresponding value in column A, separately for each block. The logic is to swap the values in column B according to this pattern in column A: 1=6, 2=7, 3=8, 4=9, 5=10.
The result would look like this:
Block  A   B  ID
1      1   f  XY
1      2   g  XY
1      3   h  XY
1      4   i  XY
1      5   j  XY
1      6   a  XY
1      7   b  XY
1      8   c  XY
1      9   d  XY
1      10  e  XY
....
6      1   j  XY
...
6      6   d  XY
....
1      1   g  XX
1      2   a  XX
What would be an efficient way to do this?

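You want to identify the two halves of 5 rows within each chunk of 10 rows and swap them. This is my solution:
import numpy as np

df['B'] = (df.assign(blk_5=(np.arange(len(df)) // 5 + 1) % 2,  # 1 for rows 1-5 of a chunk, 0 for rows 6-10
                     blk_10=np.arange(len(df)) // 10           # running number of each chunk of 10 rows
                     )
             # sort on the helper columns only: Block numbers repeat, so sorting on
             # 'Block' first would move whole chunks away from their original position
             .sort_values(['blk_10', 'blk_5'])
             ['B'].values
           )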
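The same swap can also be written as plain index arithmetic instead of a sort. This is a sketch of a different technique, not the code above: it assumes the rows are ordered so that every consecutive run of 10 rows is one block with A going from 1 to 10, and the names pos and src are only illustrative. The two-block sample frame is made up for the demonstration.

```python
import numpy as np
import pandas as pd

letters = list("abcdefghij")

# Made-up sample: two blocks of 10 rows, the second with B reversed.
df = pd.DataFrame({
    "Block": np.repeat([1, 2], 10),
    "A": np.tile(np.arange(1, 11), 2),
    "B": letters + letters[::-1],
})

pos = np.arange(len(df))
# For each row, the position to read from: same chunk of 10,
# but shifted by 5 rows (1<->6, 2<->7, ..., 5<->10).
src = (pos // 10) * 10 + (pos % 10 + 5) % 10

df["B"] = df["B"].to_numpy()[src]
print(df)
```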

Python: Add column to panda data frame with different column length

I have a pandas dataframe and would like to add data columns using one common column as the index. If the new data does not have a given index value, a 0 should be entered. The new column will have a different length. Is there a better way than using a loop? Example below.
main Dataframe:
index_column  date  value
1             1     A
2             2     B
3             3     C
4             4     D
add new column:
date  value
2     G
3     J
Result:
index_column  date  value  new value
1             1     A      0
2             2     B      G
3             3     C      J
4             4     D      0
Many thanks!
Rolf
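One possible way to do this without a loop is to turn the new data into a lookup Series and use map, then fill the gaps with 0. This is only a sketch: it assumes date is the common column and that each date appears at most once in the new data, and the variable names main, new and lookup are illustrative. The astype(object) step is there so that the mixed column of strings and the number 0 is allowed regardless of the string dtype in use.

```python
import pandas as pd

# Sample data copied from the question.
main = pd.DataFrame({
    "index_column": [1, 2, 3, 4],
    "date": [1, 2, 3, 4],
    "value": ["A", "B", "C", "D"],
})
new = pd.DataFrame({"date": [2, 3], "value": ["G", "J"]})

# Series indexed by date, used as a lookup table.
lookup = new.set_index("date")["value"]

# Dates missing from the lookup become NaN, then 0.
main["new value"] = main["date"].map(lookup).astype(object).fillna(0)
print(main)
```

A left merge on date followed by fillna(0) should give the same result if more than one column has to be brought over.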

Find the cell with more common subcells

I haven't seen any similar question to this one. Thank you in advance for your help!
I have these two columns:
Final Product - Subcomponent
A - 1
B - 1
C - 1
D - 1
A - 2
C - 2
B - 3
C - 3
A - 4
C - 4
D - 4
A - 5
B - 5
Final product A is made with the subcomponents 1, 2, 4 and 5.
B is made with the subcomponents 1, 3 and 5.
C is made with the subcomponents 1, 2, 3 and 4.
D is made with the subcomponents 1 and 4.
What I am looking for is an algorithm in VBA or pivot tables that optimizes the final production in this way:
1 repeats 4 times.
2 repeats 2 times.
3 repeats 2 times.
4 repeats 3 times.
5 repeats 2 times.
First A should be made because it has the most common components. Then B should be made because there is just 1 component missing compared with A. Then C because there is just one component to be replaced, and last D because it has the same two components as C.
I know this is not easy at all... Thank you!

Try this code
Sub Test()
    Dim d As Object
    Dim v As Variant
    Dim m As Long
    Dim r As Long
    Dim i As Long
    m = Range("A" & Rows.Count).End(xlUp).Row
    v = Range("A1:B" & m).Value
    Set d = CreateObject("Scripting.Dictionary")
    For r = 1 To m
        If d.Exists(v(r, 1)) Then
            d(v(r, 1)) = d(v(r, 1)) & ", " & v(r, 2)
        Else
            d(v(r, 1)) = v(r, 2)
        End If
    Next r
    Range("E1").Resize(d.Count).Value = Application.Transpose(d.Keys)
    Range("F1").Resize(d.Count).Value = Application.Transpose(d.Items)
End Sub

Assigning one column to another column between pandas DataFrames (like vector to vector assignment)

I have a super strange problem that I spent the last hour trying to solve, with no success. It is even stranger because I can't replicate it on a small scale.
I have a large DataFrame (150,000 entries). I took out a subset of it and did some manipulation. The subset was saved as a different variable, x.
x is smaller than the df, but its index is in the same range as the df. I'm now trying to assign x back to the DataFrame, replacing values in the same column:
rep_Callers['true_vpID'] = x.true_vpID
This inserts all the different values in x to the right place in df, but instead of keeping the df.true_vpID values that are not in x, it is filling them with NaNs. So I tried a different approach:
df.ix[x.index,'true_vpID'] = x.true_vpID
But instead of filling x values in the right place in df, the df.true_vpID gets filled with the first value of x and only it! I changed the first value of x several times to make sure this is indeed what is happening, and it is. I tried to replicate it on a small scale but it didn't work:
df = DataFrame({'a':ones(5),'b':range(5)})
a b
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
z =Series([random() for i in range(5)],index = range(5))
0 0.812561
1 0.862109
2 0.031268
3 0.575634
4 0.760752
df.ix[z.index[[1,3]],'b'] = z[[1,3]]
a b
0 1 0.000000
1 1 0.812561
2 1 2.000000
3 1 0.575634
4 1 4.000000
5 1 5.000000
I really tried it all, need some new suggestions...

Try using df.update(updated_df_or_series)
Also using a simple example, you can modify a DataFrame by doing an index query and modifying the resulting object.
df_1
a b
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
df_2 = df_1.ix[3:5]
df_2.b = df_2.b + 2
df_2
a b
3 1 5
4 1 6
df_1
a b
0 1 0
1 1 1
2 1 2
3 1 5
4 1 6
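Note that .ix was removed in pandas 1.0, so on a current version the examples above need .loc instead. Here is a minimal sketch of both routes on a small frame. The data is made up, x stands in for the modified subset from the question, and the column is kept as float so the assignment does not change its dtype. It also avoids relying on a slice being a view of the original frame, which pandas does not guarantee.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0] * 5, "b": [0.0, 1.0, 2.0, 3.0, 4.0]})

# Stand-in for the modified subset: new values for index labels 1 and 3 only.
x = pd.Series([10.0, 30.0], index=[1, 3], name="b")

# Route 1: label-based assignment; rows not in x.index are left alone.
df_loc = df.copy()
df_loc.loc[x.index, "b"] = x

# Route 2: DataFrame.update aligns on index and column name, works in place,
# and only overwrites where the other object has non-NA values.
df_upd = df.copy()
df_upd.update(x.to_frame())

print(df_loc)
print(df_upd)
```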
