Data manipulation with NA values - pandas

My code builds the df2 DataFrame like this (toy data):
import pandas as pd
import numpy as np

end = '20201117'
np.random.seed(107)
df = pd.DataFrame()
for i in range(10):
    start = np.random.choice(['20000101', '20100101', '20160101', '20121010'])
    df_tempo = pd.DataFrame({'product': 'p'+str(i),
                             'category': 'cat'+str(np.random.choice(4))},
                            index=pd.bdate_range(start=pd.to_datetime(start),
                                                 end=end,
                                                 freq=np.random.choice(['D', 'W'])))
    df_tempo['valeur1'] = np.random.randint(50, 100, df_tempo.shape[0])
    df_tempo['valeur2'] = np.random.randint(100, 200, df_tempo.shape[0])
    df = pd.concat([df, df_tempo])
df = df.reset_index().rename(columns={'index': 'date'})
df2 = df.pivot(index='date', columns=['category', 'product'], values=['valeur1', 'valeur2'])
From this I would like to calculate the growth rate for 2020, 2019, and so on.
I tried:
df2.apply([
    lambda x: 100*x.valeur1[x.valeur1['2020'].last_valid_index()]/x.valeur1[x.valeur1['2019'].last_valid_index()]-100,
])
but got this error (I would like to compute different statistics for valeur1 and valeur2):
AttributeError: 'Series' object has no attribute 'valeur1'
so I did this instead (it is not what I want because, as I said, I would like to apply different statistics to valeur1 and valeur2):
df2['valeur1'].apply([
    lambda x: 100*x[x['2020'].last_valid_index()]/x[x['2019'].last_valid_index()]-100,
    lambda x: 100*x[x['2010'].last_valid_index()]/x[x['2009'].last_valid_index()]-100,
])
It works except for
lambda x: 100*x[x['2010'].last_valid_index()]/x[x['2009'].last_valid_index()]-100
because some of my series have no data for 2010, so last_valid_index() returns None and the lookup raises an error.
I then tried dropna() but it did not help.
Any ideas on these two problems?

Since date is already the index of df2, pd.Grouper(freq="Y").last() can be used to retrieve the value on the last available date of each year. Is this what you are expecting?
Note: df2 = df2.groupby(pd.Grouper(freq="Y")).ffill() will fill the last date of each year with previous valid value(s). For this sample dataset, however, no difference was introduced.
Code
df_last = df2.groupby(pd.Grouper(freq="Y")).last()
out = df_last.diff() / df_last.shift() * 100
Result
print(out)
valeur1 ... valeur2
category cat0 cat2 ... cat0 cat2
product p0 p1 p2 ... p7 p8 p9
date ...
2000-12-31 NaN NaN NaN ... NaN NaN NaN
2001-12-31 NaN 62.000000 -35.869565 ... NaN -37.113402 NaN
2002-12-31 NaN 8.641975 28.813559 ... NaN 13.934426 NaN
2003-12-31 NaN -40.909091 19.736842 ... NaN 30.935252 NaN
2004-12-31 NaN 25.000000 -30.769231 ... NaN -18.131868 NaN
2005-12-31 NaN -20.000000 57.142857 ... NaN -18.791946 NaN
2006-12-31 NaN 19.230769 -15.151515 ... NaN 14.049587 NaN
2007-12-31 NaN 25.806452 -40.476190 ... NaN -5.072464 NaN
2008-12-31 NaN -1.282051 4.000000 ... NaN 32.824427 NaN
2009-12-31 NaN 7.792208 38.461538 ... NaN -9.770115 NaN
2010-12-31 NaN -33.734940 -26.388889 ... NaN 12.738854 NaN
2011-12-31 NaN 29.090909 83.018868 ... NaN -12.994350 43.442623
2012-12-31 NaN -19.718310 -40.206186 ... NaN -5.194805 13.142857
2013-12-31 NaN 17.543860 34.482759 ... NaN 12.328767 -19.696970
2014-12-31 NaN 8.955224 23.076923 ... NaN -22.560976 -18.867925
2015-12-31 NaN -6.849315 -14.583333 ... NaN 43.307087 23.255814
2016-12-31 NaN 33.823529 -7.317073 ... NaN -22.527473 -6.289308
2017-12-31 -22.580645 -35.164835 -2.631579 ... -8.176101 10.638298 -22.818792
2018-12-31 -6.944444 55.932203 5.405405 ... -30.821918 1.923077 35.652174
2019-12-31 14.925373 -8.695652 -1.282051 ... 70.297030 -6.918239 19.230769
2020-12-31 2.597403 8.333333 -20.779221 ... -28.488372 -6.081081 -17.204301
[21 rows x 20 columns]
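As a side note, the same year-over-year growth can be written with pct_change; a sketch of an equivalent formulation on the same df_last as above. This also addresses the second problem from the question: years with no data simply produce NaN instead of raising.
# Equivalent to df_last.diff() / df_last.shift() * 100.
# fill_method=None keeps missing years as NaN rather than forward-filling them.
out = df_last.pct_change(fill_method=None) * 100
A single year can then be pulled out with partial string indexing on the DatetimeIndex, e.g. out.loc['2020'].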

Related

Adding NA data for future dates in pandas dataframe

I have a pandas dataframe with monthly date index till the current month. I would like to impute NA values n periods into the future (in my case 1 year). I tried adding future dates into the existing index in the following manner:
recentDate = inputFileDf.index[-1]
outputFileDf.index = outputFileDf.index.append(pd.date_range(recentDate , periods=12, freq="M"))
This throws ValueError: Length mismatch: Expected axis has 396 elements, new values have 408 elements.
Would appreciate any help to "extend" the dataframe by adding the dates and NA values.
You can use df.reindex here; assigning to df.index requires the new labels to have the same length as the existing axis, which is why your append attempt raised the ValueError.
Example data:
df = pd.DataFrame(
    {'num': [*range(5)]},
    index=pd.date_range('2022-10-10', periods=5, freq='D'))
print(df)
num
2022-10-10 0
2022-10-11 1
2022-10-12 2
2022-10-13 3
2022-10-14 4
recentDate = df.index[-1]
new_data = pd.date_range(recentDate, periods=4, freq="M")
new_idx = df.index.append(new_data)
new_df = df.reindex(new_idx)
print(new_df)
num
2022-10-10 0.0
2022-10-11 1.0
2022-10-12 2.0
2022-10-13 3.0
2022-10-14 4.0
2022-10-31 NaN
2022-11-30 NaN
2022-12-31 NaN
2023-01-31 NaN
Use concat, which works whether or not the indices are unique:
recentDate = inputFileDf.index[-1]
df = pd.DataFrame(index=pd.date_range(recentDate, periods=12, freq="M"))
outputFileDf = pd.concat([inputFileDf, df])
If the indices in idx are unique, use DataFrame.reindex:
recentDate = inputFileDf.index[-1]
idx = inputFileDf.index.append(pd.date_range(recentDate, periods=12, freq="M"))
outputFileDf = outputFileDf.reindex(idx)
EDIT: If the original DataFrame already ends on a month-end date and you append new month-ends, add one month to the last index so that the original DataFrame's last date is not duplicated:
inputFileDf = pd.DataFrame(columns=['col'],
                           index=pd.date_range('2022-10-31', periods=4, freq='M'))
print(inputFileDf)
col
2022-10-31 NaN
2022-11-30 NaN
2022-12-31 NaN
2023-01-31 NaN
recentDate = inputFileDf.index[-1]
idx = inputFileDf.index.append(pd.date_range(recentDate + pd.DateOffset(months=1), periods=12, freq="M"))
outputFileDf = inputFileDf.reindex(idx)
print(outputFileDf)
col
2022-10-31 NaN
2022-11-30 NaN
2022-12-31 NaN
2023-01-31 NaN
2023-02-28 NaN
2023-03-31 NaN
2023-04-30 NaN
2023-05-31 NaN
2023-06-30 NaN
2023-07-31 NaN
2023-08-31 NaN
2023-09-30 NaN
2023-10-31 NaN
2023-11-30 NaN
2023-12-31 NaN
2024-01-31 NaN
Or use first index value:
firstDate = inputFileDf.index[0]
idx = pd.date_range(firstDate, periods=12 + len(inputFileDf), freq="M")
outputFileDf = inputFileDf.reindex(idx)
print(outputFileDf)
col
2022-10-31 NaN
2022-11-30 NaN
2022-12-31 NaN
2023-01-31 NaN
2023-02-28 NaN
2023-03-31 NaN
2023-04-30 NaN
2023-05-31 NaN
2023-06-30 NaN
2023-07-31 NaN
2023-08-31 NaN
2023-09-30 NaN
2023-10-31 NaN
2023-11-30 NaN
2023-12-31 NaN
2024-01-31 NaN
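Note: in recent pandas (2.2 and later) the month-end frequency alias is 'ME' and the bare 'M' used above is deprecated, so the same calls would read, for example:
# Month-end alias in pandas >= 2.2 ('M' is deprecated there):
idx = pd.date_range('2022-10-31', periods=4, freq='ME')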

Plot stacked (100%) bar chart for multiple categories on multiple dates

I have the following initial dataframe:
  Post ID Submission_Date             Flair
0    row1      01.12.2020               NaN
1    row2      03.12.2020        Discussion
2    row3      03.12.2020              News
3    row4      03.12.2020        Discussion
4    row5      06.12.2020     Due Diligence
5    row6      07.12.2020        Discussion
6    row7      31.12.2020        Discussion
7    row8      01.01.2021  Hedge Fund Tears
There are multiple dates with missing dates in between, and multiple categories per date.
I grouped the dataframe with:
import pandas as pd
import numpy as np  # for test data

data = {'Post ID': ['row1', 'row2', 'row3', 'row4', 'row5', 'row6', 'row7', 'row8'],
        'Submission_Date': ['01.12.2020', '03.12.2020', '03.12.2020', '03.12.2020', '06.12.2020', '07.12.2020', '31.12.2020', '01.01.2021'],
        'Flair': [np.nan, 'Discussion', 'News', 'Discussion', 'Due Diligence', 'Discussion', 'Discussion', 'Hedge Fund Tears']}
df = pd.DataFrame(data)
# parse and re-format the dates (kept as %d.%m.%Y strings, as in the answer below)
df['Submission_Date'] = pd.to_datetime(df['Submission_Date'], format='%d.%m.%Y').dt.strftime('%d.%m.%Y')
df = df.groupby('Submission_Date')['Flair'].value_counts(normalize=True).unstack()
The result is the normalized table shown in the answer below. I want to fill the missing dates with "empty" bars so the plot looks something like this (desired figure omitted). I already tried:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(20, 10))
df.plot(kind='bar', ax=ax, stacked=True, width=1)
plt.xlabel('Submission_Date', fontsize=16)
plt.ylabel('Ratio of Flairs used', fontsize=16)
But the dates are incorrect, since the empty days are not displayed.
Assuming this input as df2 (the output of your groupby operation):
Flair Discussion Due Diligence Hedge Fund Tears News
Submission_Date
01.01.2021 NaN NaN 1.0 NaN
03.12.2020 0.666667 NaN NaN 0.333333
06.12.2020 NaN 1.0 NaN NaN
07.12.2020 1.000000 NaN NaN NaN
31.12.2020 1.000000 NaN NaN NaN
You can reindex from pd.date_range:
df2.index = pd.to_datetime(df2.index, format='%d.%m.%Y')
df2 = df2.reindex(pd.date_range(df2.index.min(), df2.index.max()))
df2.index = df2.index.strftime('%Y-%m-%d')
Flair Discussion Due Diligence Hedge Fund Tears News
2020-12-03 0.666667 NaN NaN 0.333333
2020-12-04 NaN NaN NaN NaN
2020-12-05 NaN NaN NaN NaN
2020-12-06 NaN 1.0 NaN NaN
2020-12-07 1.000000 NaN NaN NaN
...
2020-12-30 NaN NaN NaN NaN
2020-12-31 1.000000 NaN NaN NaN
2021-01-01 NaN NaN 1.0 NaN
Graphical outcome (figure omitted).
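For completeness, re-running the plotting code from the question on the reindexed df2 should now show the empty days as gaps; a minimal sketch:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(20, 10))
df2.plot(kind='bar', ax=ax, stacked=True, width=1)  # all-NaN rows appear as empty slots
ax.set_xlabel('Submission_Date', fontsize=16)
ax.set_ylabel('Ratio of Flairs used', fontsize=16)
plt.show()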

Is it possible to turn quarterly data into monthly?

I'm struggling with this problem and I'm not sure if I'm approaching it correctly.
I have this dataset:
        ticker        date filing_date_x currency_symbol_x  researchdevelopment  ...  cashandshortterminvestments  commonstocksharesoutstanding
116638  JNJ.US  2019-12-31    2020-02-18               USD         3.232000e+09  ...                 1.928700e+10                  2.632507e+09
116569  JNJ.US  2020-03-31    2020-04-29               USD         2.580000e+09  ...                 1.802400e+10                  2.632392e+09
116420  JNJ.US  2020-06-30    2020-07-24               USD         2.707000e+09  ...                 1.913500e+10                  2.632377e+09
116235  JNJ.US  2020-09-30    2020-10-23               USD         2.840000e+09  ...                 3.078100e+10                  2.632167e+09
116135  JNJ.US  2020-12-31    2021-02-22               USD         4.032000e+09  ...                 2.518500e+10                  2.632512e+09
Then I have this dataframe of daily prices:
ticker date open high low close adjusted_close volume
0 JNJ.US 2021-08-02 172.470 172.840 171.300 172.270 172.2700 3620659
1 JNJ.US 2021-07-30 172.540 172.980 171.840 172.200 172.2000 5346400
2 JNJ.US 2021-07-29 172.740 173.340 171.090 172.180 172.1800 4214100
3 JNJ.US 2021-07-28 172.730 173.380 172.080 172.180 172.1800 5750700
4 JNJ.US 2021-07-27 171.800 172.720 170.670 172.660 172.6600 7089300
I have daily data in the price dataframe but quarterly data in the first dataframe. I want to merge them so that, for example, all the prices between Jan-01-2020 and Mar-01-2020 are merged with the correct quarterly row.
I'm not sure exactly how to do this. I thought of extracting the date to month-year, but I still don't know how to merge based on a range of values.
Any suggestions would be welcome; if I'm not clear, please let me know and I can clarify.
If I understand correctly, you could create common year and quarter columns in each DataFrame and merge on those columns. I used a left merge so that only rows of the left dataset (the daily data) are kept.
If this is not what you are looking for, could you please clarify with a sample input/output?
import pandas as pd

# Dummy data of daily values
dt = pd.Series(['2020-08-02', '2020-07-30', '2020-07-29',
                '2020-07-28', '2020-07-27'])
# Convert the underlying data to datetime
dt = pd.to_datetime(dt)
dt_df = pd.DataFrame(dt, columns=['date'])
dt_df['quarter_1'] = dt_df['date'].dt.quarter
dt_df['year_1'] = dt_df['date'].dt.year
print(dt_df)
date quarter_1 year_1
0 2020-08-02 3 2020
1 2020-07-30 3 2020
2 2020-07-29 3 2020
3 2020-07-28 3 2020
4 2020-07-27 3 2020
# Dummy data of quarterly values
dt2 = pd.Series(['2019-12-31', '2020-03-31', '2020-06-30',
                 '2020-09-30', '2020-12-31'])
# Convert the underlying data to datetime
dt2 = pd.to_datetime(dt2)
dt2_df = pd.DataFrame(dt2, columns=['date2'])
dt2_df['quarter_2'] = dt2_df['date2'].dt.quarter
dt2_df['year_2'] = dt2_df['date2'].dt.year
print(dt2_df)
date2 quarter_2 year_2
0 2019-12-31 4 2019
1 2020-03-31 1 2020
2 2020-06-30 2 2020
3 2020-09-30 3 2020
4 2020-12-31 4 2020
Then you can merge however you want:
dt_df.merge(dt2_df, how='left', left_on=['quarter_1', 'year_1'], right_on=['quarter_2', 'year_2'], validate="many_to_many")
OUTPUT:
date quarter_1 year_1 date2 quarter_2 year_2
0 2020-08-02 3 2020 2020-09-30 3 2020
1 2020-07-30 3 2020 2020-09-30 3 2020
2 2020-07-29 3 2020 2020-09-30 3 2020
3 2020-07-28 3 2020 2020-09-30 3 2020
4 2020-07-27 3 2020 2020-09-30 3 2020
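An alternative that avoids the helper columns is pd.merge_asof, which matches each daily row to the nearest quarterly date. A sketch, assuming the quarterly frame is named fundamentals and both date columns are datetime64; direction='forward' picks the next quarter-end on or after each price date, replicating the quarter/year join above:
# Both inputs must be sorted on the merge key.
prices = prices.sort_values('date')
fundamentals = fundamentals.sort_values('date')

# For each daily row, take the first quarterly row whose date is >= the price date,
# matching within each ticker.
merged = pd.merge_asof(prices, fundamentals, on='date', by='ticker',
                       direction='forward')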

From 15 object variables to final target variable (0 or 1)

Can I go from 15 object variables to one final binary target variable?
Those 15 variables have ~10,000 different codes, and my dataset has about 21,000,000 records. What I'm trying to do is first replace the codes I want with 1 and all others with 0; then, if any of the fifteen variables is 1, the target variable will be 1, and if all fifteen are 0, the target will be 0.
I have tried to_replace, astype, to_numeric and infer_objects without good results. For example, my dataset looks like this (head(5)):
D P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 P15
41234 1234 4367 874 NAN NAN NAN 789 NAN NAN NAN NAN NAN NAN NAN NAN
42345 7657 4367 874 NAN NAN NAN 789 NAN NAN NAN NAN NAN NAN NAN NAN
34212 7654 4347 474 NAN NAN NAN 789 NAN NAN NAN NAN NAN NAN NAN NAN
34212 8902 4317 374 NAN 452 NAN 719 NAN NAN NAN NAN NAN NAN NAN NAN
19374 2564 4387 274 NAN 452 NAN 799 NAN NAN NAN NAN NAN NAN NAN NAN
I want to transform all NaN to 0 and the selected codes to 1, so that P1-P15 all become binary; then I will create a final P variable from them.
For example, if P1-P15 contain '3578', '9732', '4734', ... (I'm using about 200 codes), the value should become 1.
All other values should become 0.
The D variable should stay as it is.
The final dataset will be (D, P); then I will add the training variables.
Any ideas? The following code gives me wrong results.
selCodes=['3722','66']
dfnew['P']=(dfnew.loc[:,'PR1':].astype(str).isin(selCodes).any(axis=1).astype(int))
Take a look at a test dataset and the resulting new P column (screenshots omitted): with the example code 3722, P should be 1.
IIUC, use DataFrame.isin:
# example select codes
selCodes = ['1234', '9732', '719']

df['P'] = (
    df.loc[:, 'P1':].astype(str)
      .isin(selCodes).any(axis=1).astype(int)
)
df = df[['D', 'P']]
Result:
D P
0 41234 1
1 42345 0
2 34212 0
3 34212 1
4 19374 0
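If you also want the individual P1-P15 columns recoded to 0/1 first (as described in the question), rather than only the final P, a sketch along the same lines:
pcols = df.loc[:, 'P1':].columns
# 1 where the code is in selCodes, 0 everywhere else (NaN never matches, so it becomes 0).
df[pcols] = df[pcols].astype(str).isin(selCodes).astype(int)
df['P'] = df[pcols].any(axis=1).astype(int)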

For every row in pandas, do until sample ID change

How can I iterate over rows in a dataframe until the sample ID changes?
my_df:
ID loc_start
sample1 10
sample1 15
sample2 10
sample2 20
sample3 5
Something like:
samples = ["sample1", "sample2" ,"sample3"]
out = pd.DataFrame()
for sample in samples:
if my_df["ID"] == sample:
my_list = []
for index, row in my_df.iterrows():
other_list = [row.loc_start]
my_list.append(other_list)
my_list = pd.DataFrame(my_list)
out = pd.merge(out, my_list)
Expected output:
sample1 sample2 sample3
10 10 5
15 20
I realize, of course, that this could be done more easily if my_df really looked like this. However, what I'm after is the principle of iterating over rows until a certain column value changes.
Based on the input & output provided, this should work.
You need to provide more info if you are looking for something else.
(df.pivot(columns='ID', values='loc_start')
   .rename_axis(None, axis=1)
   .apply(lambda x: pd.Series(x.dropna().values)))
Output:
sample1 sample2 sample3
0 10.0 10.0 5.0
1 15.0 20.0 NaN
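A variant that avoids the apply is to number the rows within each ID with groupby.cumcount and pivot on that counter; a sketch:
out = (my_df.assign(pos=my_df.groupby('ID').cumcount())  # 0, 1, 2, ... within each ID
            .pivot(index='pos', columns='ID', values='loc_start')
            .rename_axis(None, axis=1))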
Ben.T is correct that a pivot works here. Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(0, 5, (10, 2)), columns=list("AB"))
# what does the df look like? Here, I consider column A to be analogous to your "ID" column
In [5]: df
Out[5]:
A B
0 3 1
1 2 1
2 4 2
3 4 1
4 0 4
5 4 2
6 4 1
7 3 1
8 1 1
9 4 0
# now do a pivot and see what it looks like
df2 = df.pivot(columns="A", values="B")
In [8]: df2
Out[8]:
A 0 1 2 3 4
0 NaN NaN NaN 1.0 NaN
1 NaN NaN 1.0 NaN NaN
2 NaN NaN NaN NaN 2.0
3 NaN NaN NaN NaN 1.0
4 4.0 NaN NaN NaN NaN
5 NaN NaN NaN NaN 2.0
6 NaN NaN NaN NaN 1.0
7 NaN NaN NaN 1.0 NaN
8 NaN 1.0 NaN NaN NaN
9 NaN NaN NaN NaN 0.0
Not quite what you wanted. With a little help from jezrael's answer:
df3 = df2.apply(lambda x: pd.Series(x.dropna().values))
In [20]: df3
Out[20]:
A 0 1 2 3 4
0 4.0 1.0 1.0 1.0 2.0
1 NaN NaN NaN 1.0 1.0
2 NaN NaN NaN NaN 2.0
3 NaN NaN NaN NaN 1.0
4 NaN NaN NaN NaN 0.0
The empty spots in the dataframe have to be filled with something, and NaN is used by default. Is this what you wanted?
If, on the other hand, you wanted to perform an operation on your data you would use the groupby instead.
df2 = df.groupby(by="A", as_index=False).mean()
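For the example frame above this gives the mean of B per value of A (computed from the rows shown):
   A    B
0  0  4.0
1  1  1.0
2  2  1.0
3  3  1.0
4  4  1.2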