For every row in pandas, do until sample ID change - pandas

How can I iterarate over rows in a dataframe until the sample ID change?
my_df:
ID loc_start
sample1 10
sample1 15
sample2 10
sample2 20
sample3 5
Something like:
samples = ["sample1", "sample2" ,"sample3"]
out = pd.DataFrame()
for sample in samples:
if my_df["ID"] == sample:
my_list = []
for index, row in my_df.iterrows():
other_list = [row.loc_start]
my_list.append(other_list)
my_list = pd.DataFrame(my_list)
out = pd.merge(out, my_list)
Expected output:
sample1 sample2 sample3
10 10 5
15 20
I realize of course that this could be done easier if my_df really would look like this. However, what I'm after is the principle to iterate over rows until a certain column value change.

Based on the input & output provided, this should work.
You need to provide more info if you are looking for something else.
df.pivot(columns='ID', values = 'loc_start').rename_axis(None, axis=1).apply(lambda x: pd.Series(x.dropna().values))
output
sample1 sample2 sample3
0 10.0 10.0 5.0
1 15.0 20.0 NaN

Ben.T is correct that a pivot works here. Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(0, 5, (10, 2)), columns=list("AB"))
# what does the df look like? Here, I consider column A to be analogous to your "ID" column
In [5]: df
Out[5]:
A B
0 3 1
1 2 1
2 4 2
3 4 1
4 0 4
5 4 2
6 4 1
7 3 1
8 1 1
9 4 0
# now do a pivot and see what it looks like
df2 = df.pivot(columns="A", values="B")
In [8]: df2
Out[8]:
A 0 1 2 3 4
0 NaN NaN NaN 1.0 NaN
1 NaN NaN 1.0 NaN NaN
2 NaN NaN NaN NaN 2.0
3 NaN NaN NaN NaN 1.0
4 4.0 NaN NaN NaN NaN
5 NaN NaN NaN NaN 2.0
6 NaN NaN NaN NaN 1.0
7 NaN NaN NaN 1.0 NaN
8 NaN 1.0 NaN NaN NaN
9 NaN NaN NaN NaN 0.0
Not quite what you wanted. With a little help from Jezreal's answer
df2 = df2.apply(lambda x: pd.Series(x.dropna().values))
In [20]: df3
Out[20]:
A 0 1 2 3 4
0 4.0 1.0 1.0 1.0 2.0
1 NaN NaN NaN 1.0 1.0
2 NaN NaN NaN NaN 2.0
3 NaN NaN NaN NaN 1.0
4 NaN NaN NaN NaN 0.0
The empty spots in the dataframe have to be filled with something, and NaN is used by default. Is this what you wanted?
If, on the other hand, you wanted to perform an operation on your data you would use the groupby instead.
df2 = df.groupby(by="A", as_index=False).mean()

Related

Pandas Long to Wide conversion

I am new to Pandas. I have a data set with in this format.
UserID ISBN BookRatings
0 276725.0 034545104U 0.0
1 276726.0 155061224 5.0
2 276727.0 446520802 0.0
3 276729.0 052165615W 3.0
4 276729.0 521795028 6.0
I would like to create this
ISBN 276725 276726 276727 276729
UserID
0 034545104U
1 0 155061224 0 0 0
2 0 0 446520802 0 0
3 0 0 0 052165615W 0
4 0 0 0 521795028 0
I tried pivot but was not successful. Any kind advice please?
I think that pivot() is the right approach here. The most difficult part is to get the arguments correctly. I think we need to keep the original index and the new columns should be the values in column UserID. Also, we want to fill the new dataframe with the values from column ISBN.
For this, I firstly extract the original index as column and then apply the pivot() function:
df = df.reset_index()
result = df.pivot(index='index', columns='UserID', values='ISBN')
# Make your float columns to integers (only works if all user ids are numbers, drop nan values first)
result.columns = map(int,result.columns)
Output:
276725 276726 276727 276729
index
0 034545104U NaN NaN NaN
1 NaN 155061224 NaN NaN
2 NaN NaN 446520802 NaN
3 NaN NaN NaN 052165615W
4 NaN NaN NaN 521795028
Edit: If you want the same appearance as in the original dataframe you have to apply the following line as well:
result = result.rename_axis(None, axis=0)
Output:
276725 276726 276727 276729
0 034545104U NaN NaN NaN
1 NaN 155061224 NaN NaN
2 NaN NaN 446520802 NaN
3 NaN NaN NaN 052165615W
4 NaN NaN NaN 521795028

In pandas replace consecutive 0s with NaN

I want to clean some data by replacing only CONSECUTIVE 0s in a data frame
Given:
import pandas as pd
import numpy as np
d = [[1,np.NaN,3,4],[2,0,0,np.NaN],[3,np.NaN,0,0],[4,np.NaN,0,0]]
df = pd.DataFrame(d, columns=['a', 'b', 'c', 'd'])
df
a b c d
0 1 NaN 3 4.0
1 2 0.0 0 NaN
2 3 NaN 0 0.0
3 4 NaN 0 0.0
The desired result should be:
a b c d
0 1 NaN 3 4.0
1 2 0.0 NaN NaN
2 3 NaN NaN NaN
3 4 NaN NaN NaN
where column c & d are affected but column b is NOT affected as it only has 1 zero (and not consecutive 0s).
I have experimented with this answer:
Replacing more than n consecutive values in Pandas DataFrame column
which is along the right lines but the solution keeps the first 0 in a given column which is not desired in my case.
Let us do shift with mask
df=df.mask((df.shift().eq(df)|df.eq(df.shift(-1)))&(df==0))
Out[469]:
a b c d
0 1 NaN 3.0 4.0
1 2 0.0 NaN NaN
2 3 NaN NaN NaN
3 4 NaN NaN NaN

Removing nan from pandas dataframe and reshaping dataframe

I have a pandas dataframe df which looks as following:
0 1 2 3 4 5 6
0 3 74
1 4 2
2 -9
3 -1 2 -16 -21
4 1
5 28
I want to remove all the nan from the above and realign the data in each row to get the following:
0 1 2 3
0 3 74
1 4 2
2 -9
3 -1 2 -16 -21
4 1
5 28
Basically I am trying to left align all the data in each row after removing nan. I am not sure how to proceed with this.
First shift all non missing values by justify and then use DataFrame.dropna for remove only NaNs columns:
arr = justify(df.to_numpy(), invalid_val=np.nan)
df = pd.DataFrame(arr).dropna(axis=1, how='all')
print (df)
0 1 2 3
0 3.0 74.0 NaN NaN
1 4.0 2.0 NaN NaN
2 -9.0 NaN NaN NaN
3 -1.0 2.0 -16.0 -21.0
4 1.0 NaN NaN NaN
5 28.0 NaN NaN NaN
#https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
"""
Justifies a 2D array
Parameters
----------
A : ndarray
Input array to be justified
axis : int
Axis along which justification is to be made
side : str
Direction of justification. It could be 'left', 'right', 'up', 'down'
It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
"""
if invalid_val is np.nan:
mask = ~np.isnan(a)
else:
mask = a!=invalid_val
justified_mask = np.sort(mask,axis=axis)
if (side=='up') | (side=='left'):
justified_mask = np.flip(justified_mask,axis=axis)
out = np.full(a.shape, invalid_val)
if axis==1:
out[justified_mask] = a[mask]
else:
out.T[justified_mask.T] = a.T[mask.T]
return out
This solution takes the data into numpy territory, run some computations with numpy isnan and numpy compress, create individual dataframes, and lump back into one dataframe with pandas concat :
data = """ 0 1 2 3 4 5 6
3 74 None None None None None
4 2 None None None None None
None None -9 None None None None
None None None -1 2 -16 -21
None None 1 None None None None
None None 28 None None None None """
df = pd.read_csv(StringIO(data), sep='\s{2,}',engine='python', na_values=["None"])
df
0 1 2 3 4 5 6
0 3.0 74.0 NaN NaN NaN NaN NaN
1 4.0 2.0 NaN NaN NaN NaN NaN
2 NaN NaN -9.0 NaN NaN NaN NaN
3 NaN NaN NaN -1.0 2.0 -16.0 -21.0
4 NaN NaN 1.0 NaN NaN NaN NaN
5 NaN NaN 28.0NaN NaN NaN NaN
#convert to numpy array
M = df.to_numpy()
#get True or False depending on the null status of each entry
condition = ~np.isnan(M)
#for each array, get entries that are not null
step1 = [np.compress(ent,arr) for ent,arr in zip(condition,M)]
step1
#concatenate each dataframe
step2 = pd.concat([pd.DataFrame(ent).T for ent in step1],ignore_index=True)
print(step2)
0 1 2 3
0 3.0 74.0 NaN NaN
1 4.0 2.0 NaN NaN
2 -9.0 NaN NaN NaN
3 -1.0 2.0 -16.0 -21.0
4 1.0 NaN NaN NaN
5 28.0 NaN NaN NaN
#alternatively, from step1 we could find the longest array and use that value to resize all the other arrays :
reshape = max(len(arr) for arr in step1)
#this happens in place
[arr.resize(reshape,refcheck=False) for arr in step1]
outcome = pd.DataFrame(step1).where(lambda x: x.ne(0),np.nan)

Compare 2 columns and replace to None if found equal

The following command will replace all values for matching row to None.
ndf.iloc[np.where(ndf.path3=='sys_bck_20190101.tar.gz')] = np.nan
What I really need to do is to replace the value of a single column called path4 if it matches with column path3. This does not work:
ndf.iloc[np.where(ndf.path3==ndf.path4), ndf.path3] = np.nan
Update:
There is a pandas method "fillna" that can be used with axis = 'columns'.
Is there a similar method to write "NA" values to the duplcate columns?
I can do this, but it does not look like pythonic.
ndf.loc[ndf.path1==ndf.path2, 'path1'] = np.nan
ndf.loc[ndf.path2==ndf.path3, 'path2'] = np.nan
ndf.loc[ndf.path3==ndf.path4, 'path3'] = np.nan
ndf.loc[ndf.path4==ndf.filename, 'path4'] = np.nan
Update 2
Let me explain the issue:
Assuming this dataframe:
ndf = pd.DataFrame({
'path1':[4,5,4,5,5,4],
'path2':[4,5,4,5,5,4],
'path3':list('abcdef'),
'path4':list('aaabef'),
'col':list('aaabef')
})
The expected results :
0 NaN 4.0 NaN NaN a
1 NaN 5.0 b NaN a
2 NaN 4.0 c NaN a
3 NaN 5.0 d NaN b
4 NaN 5.0 NaN NaN e
5 NaN 4.0 NaN NaN f
As you can see this is reverse of fillna. And I guess there is no easy way to do this in pandas. I have already mentioned the commands I can use. I will like to know if there is a better way to achieve this.
Use:
for c1, c2 in zip(ndf.columns, ndf.columns[1:]):
ndf.loc[ndf[c1]==ndf[c2], c1] = np.nan
print (ndf)
path1 path2 path3 path4 col
0 NaN 4.0 NaN NaN a
1 NaN 5.0 b NaN a
2 NaN 4.0 c NaN a
3 NaN 5.0 d NaN b
4 NaN 5.0 NaN NaN e
5 NaN 4.0 NaN NaN f

Boxplot with pandas and groupby

I have the following dataset sample:
0 1
0 0 0.040158
1 2 0.500642
2 0 0.005694
3 1 0.065052
4 0 0.034789
5 2 0.128495
6 1 0.088816
7 1 0.056725
8 0 -0.000193
9 2 -0.070252
10 2 0.138282
11 2 0.054638
12 2 0.039994
13 2 0.060659
14 0 0.038562
And need a box and whisker plot, grouped by column 0. I have the following:
plt.figure()
grouped = df.groupby(0)
grouped.boxplot(column=1)
plt.savefig('plot.png')
But I end up with three subplots. How can place all three on one plot?
Thanks.
In 0.16.0 version of pandas, you could simply do this:
df.boxplot(by='0')
Result:
I don't believe you need to use groupby.
df2 = df.pivot(columns=df.columns[0], index=df.index)
df2.columns = df2.columns.droplevel()
>>> df2
0 0 1 2
0 0.040158 NaN NaN
1 NaN NaN 0.500642
2 0.005694 NaN NaN
3 NaN 0.065052 NaN
4 0.034789 NaN NaN
5 NaN NaN 0.128495
6 NaN 0.088816 NaN
7 NaN 0.056725 NaN
8 -0.000193 NaN NaN
9 NaN NaN -0.070252
10 NaN NaN 0.138282
11 NaN NaN 0.054638
12 NaN NaN 0.039994
13 NaN NaN 0.060659
14 0.038562 NaN NaN
df2.boxplot()