I have a pandas DataFrame like the one below, where some of the columns contain lists:
import pandas as pd
L1 = [['ID1', 0, [0, 1, 1] , [0, 1]],
['ID2', 2, [1, 2, 3], [0, 1]]
]
df1 = pd.DataFrame(L1,columns=['ID', 't', 'Key','Value'])
Can the lists be expanded into separate columns, like this?
import pandas as pd
L1 = [['ID1', 0, 0, 1, 1 , 0, 1],
['ID2', 2, 1, 2, 3, 0, 1]
]
df1 = pd.DataFrame(L1,columns=['ID', 't', 'Key_0','Key_1','Key_2','Value_0', 'Value_1'])
After converting each Series to a list of lists, you can call the DataFrame constructor to expand it into multiple columns. Using pop in a list comprehension inside concat removes the original columns from your DataFrame, so only the expanded versions are joined back. Note that pop modifies df1 in place.
This will work regardless of the number of elements in each list, and even if the lists have varying numbers of elements across rows.
df2 = pd.concat(
    [df1] + [pd.DataFrame(df1.pop(col).tolist(), index=df1.index).add_prefix(f'{col}_')
             for col in ['Key', 'Value']],
    axis=1)
print(df2)
    ID  t  Key_0  Key_1  Key_2  Value_0  Value_1
0  ID1  0      0      1      1        0        1
1  ID2  2      1      2      3        0        1
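To illustrate the point about rows whose lists have different lengths, here is a minimal sketch. The frame `ragged` and its values are made up for illustration and are not from the question. Shorter lists should simply be padded with NaN:

```python
import pandas as pd

# Hypothetical data: the second row has a shorter list than the first
ragged = pd.DataFrame({"ID": ["ID1", "ID2"], "Key": [[0, 1, 1], [1, 2]]})

# Expand the list column into one column per position, then prefix the names
expanded = pd.DataFrame(ragged["Key"].tolist(), index=ragged.index).add_prefix("Key_")

# Join the expanded columns back onto the remaining columns
result = pd.concat([ragged.drop(columns="Key"), expanded], axis=1)
print(result)
```

Unlike the pop version, this sketch uses drop, so the original frame is left unchanged.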
You can flatten L1 before constructing the data frame:
L2 = [ row[0:2] + row[2] + row[3] for row in L1 ]
df2 = pd.DataFrame(L2,columns=['ID', 't', 'Key_0','Key_1','Key_2','Value_0', 'Value_1'])
You can also explode the DataFrame column-wise:
df3 = df1.apply(pd.Series.explode, axis=1)
df3.columns = ['ID', 't', 'Key_0','Key_1','Key_2','Value_0', 'Value_1']
I have a pandas DataFrame with continuous sequences of ones and zeros, as follows:
import numpy as np
import pandas as pd
m = np.array([[1, 1, 1, 1], [1, 1, 1, 0], [1, 0, 1, 0], [1, 0, 0, 0]])
df = pd.DataFrame(m, columns=["C1", "C2", "C3", "C4"])
df.insert(0, "Sy", ["r1", "r2", "r3", "r4"])
Which gives me the following df:
   Sy  C1  C2  C3  C4
0  r1   1   1   1   1
1  r2   1   1   1   0
2  r3   1   0   1   0
3  r4   1   0   0   0
I'm trying to color only the run of ones in each column, with a different color for each column. The run starts at row 0 and continues until the first zero appears. I used this Stack Overflow post to color the columns.
However, this code colors the whole column and not just the cells containing the consecutive sequence of 1's:
def f(dat, c="red"):
    return [f"background-color: {c}" for i in dat]
columns_with_color_dictionary = {
"C1": "red",
"C2": "blue",
"C3": "orange",
"C4": "yellow",
}
style = df.style
for column, color in columns_with_color_dictionary.items():
    style = style.apply(f, axis=0, subset=column, c=color)
with open("dd.html", "w") as fh:
    fh.write(style.to_html())  # Styler.render() was removed in pandas 2.0
The Html output:
Can anyone help me with this? Alternative ideas are welcome too. The actual matrix is around 200x200, and I don't want colored output printed to the console.
Thanks
Here is one way to do it.
Replace:
style = df.style
for column, color in columns_with_color_dictionary.items():
    style = style.apply(f, axis=0, subset=column, c=color)
with:
style = df.style
for column, color in columns_with_color_dictionary.items():
    style = style.apply(f, axis=0, subset=column, c=color).applymap(
        lambda x: "background-color: white",
        subset=(
            df[df[column] != 1].index,
            [column],
        ),
    )
And here is the Html output:
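The white-out approach above recolors every cell that is not 1, so a 1 that appears after the first 0 would stay colored. As a rough alternative sketch (the helper name `leading_ones_style` and the example column are my own, not from the question), a cumulative product can mark only the initial run of ones:

```python
import pandas as pd

def leading_ones_style(col, color="red"):
    """Return one CSS string per cell, colored only within the initial run of 1s."""
    # cumprod stays 1 until the first non-1 value and is 0 from there on
    in_leading_run = col.eq(1).astype(int).cumprod().astype(bool)
    return [f"background-color: {color}" if flag else "" for flag in in_leading_run]

# Illustrative column with a 1 after the first 0 (not the question's data)
example = pd.Series([1, 1, 0, 1], name="C1")
styles = leading_ones_style(example, color="blue")
print(styles)
```

Inside the question's loop this could presumably be used as style.apply(leading_ones_style, axis=0, subset=[column], color=color), with no second white-out step.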
I have a table like this:
  Scen  F1  F2  F3  F4
0   S1   1   0   1   0
1   S2   0   1   0   1
I want to look up a row by Scen and return the names of the columns whose value is 1 in that row; e.g. for S1 the result should be F1, F3.
I've tried the following and can get the result by hard-coding df_col[0], but I need to be able to do this dynamically.
What's the best way to do this?
import pandas as pd
d = {'Scen': ["S1", "S2"],
'F1': [1, 0],
'F2': [0, 1],
'F3': [1, 0],
'F4': [0, 1]
}
df = pd.DataFrame(data=d)
def get_features(df, col_name):
    df_col = df[(df.Scen == col_name)].T
    feats = (df_col[(df_col[0] == 1)]).index.to_list()
    print(feats)
    return feats
get_features(df, "S1")
get_features(df, "S2")
EDIT:
Based on RichieV's answer, this works (with numpy imported as np):
def get_features(df, col_name):
    df = df.replace(0, np.nan)
    df = df.melt('Scen')
    df_scen = df['variable'].loc[(df['Scen'] == col_name) & (df['value'] == 1)]
    return list(df_scen)
This is a one-hot decoding operation. When you encode to ones you pivot a column, so now we need to melt it back.
df = df.replace(0, np.nan)  # turn zeros into NaN so they can be dropped
df = df.melt('Scen').dropna(subset=['value']).drop('value', axis=1)  # melt keeps NaN rows, so drop them explicitly
Now df has four rows and two columns (Scen and variable), with one row per scenario and matching feature. You can use df as it is, or group by scenario and gather the features into a list.
df = df.groupby('Scen')['variable'].apply(list)
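For comparison, the same decoding can be sketched without melting, by indexing the frame on Scen and using a boolean mask over the feature columns. This assumes that each Scen value appears exactly once; the helper `all_features` is a name I made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"Scen": ["S1", "S2"],
                   "F1": [1, 0], "F2": [0, 1], "F3": [1, 0], "F4": [0, 1]})

def get_features(df, scen):
    # Assumes Scen values are unique, so .loc returns a single row as a Series
    flags = df.set_index("Scen").loc[scen]
    return flags.index[flags == 1].tolist()

def all_features(df):
    # Hypothetical helper: map every scenario to its list of active features
    flags = df.set_index("Scen").eq(1)
    return {scen: flags.columns[row.to_numpy()].tolist()
            for scen, row in flags.iterrows()}

print(get_features(df, "S1"))
print(all_features(df))
```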
This works, though I'm not sure it's the most efficient way:
import pandas as pd
d = {'Scen': ["S1", "S2"],
'F1': [1, 0],
'F2': [0, 1],
'F3': [1, 0],
'F4': [0, 1]
}
df = pd.DataFrame(data=d)
def get_features(df, col_name):
    df_col = df[(df.Scen == col_name)].T
    df_feats = df_col.loc[df_col[df_col.columns.values[0]] == 1]
    return list(df_feats.index)
s1_list = get_features(df, "S1")
s2_list = get_features(df, "S2")
print(s1_list)
print(s2_list)
['F1', 'F3']
['F2', 'F4']
How should I utilize the RangeIndex provided by pandas.DataFrame.rolling in custom_function?
Current implementation gives a ValueError.
At first x.index = RangeIndex(start=0, stop=2, step=1), and tmp_df correctly selects the first and second row in df (index 0 and 1). For the last x.index = RangeIndex(start=6, stop=8, step=1) it seems like iloc tries to select index 8 in df which is out of range (df has index 0 to 7).
Basically, I want the custom function to count consecutive numbers in the window. Given the positive values 1,0,1,1,1,0 in the window, the custom function should return 3, as the longest run of consecutive 1s has length 3.
import numpy as np
import pandas as pd
df = pd.DataFrame({'open': [7, 5, 10, 11,6,13,17,12],
'close': [6, 6, 11, 10,7,15,18,10],
'positive': [0, 1, 1, 0,1,1,1,0]},
)
def custom_function(x, df):
    print("index:", x.index)
    tmp_df = df.iloc[x.index] # raises "ValueError: cannot set using a slice indexer with a different length than the value" when x.index = RangeIndex(start=6, stop=8, step=1) as df index goes from 0 to 7 only
    # do calculations on any column in tmp_df, get result
    result = 1  # dummy result
    return result
intervals = range(2, 10)
for i in intervals:
    df['result_' + str(i)] = np.nan
    res = df.rolling(i).apply(custom_function, args=(df,), raw=False)
    df['result_' + str(i)][1:] = res
print(df)
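If only the positive column is needed inside the window, one possible sketch avoids the index lookup altogether by rolling over that single column with raw=True, so the function receives the window's values directly. This sidesteps the ValueError rather than explaining it, and the function name `max_consecutive_ones` is my own:

```python
import pandas as pd

def max_consecutive_ones(window):
    """Length of the longest run of 1s in a sequence of values."""
    best = run = 0
    for value in window:
        run = run + 1 if value == 1 else 0  # extend the run or reset it
        best = max(best, run)
    return best

positive = pd.Series([0, 1, 1, 0, 1, 1, 1, 0], name="positive")

# raw=True passes each window as a NumPy array instead of a Series
rolled = positive.rolling(3).apply(max_consecutive_ones, raw=True)
print(rolled)
```

The first window - 1 entries are NaN because the window is not yet full; the same call could be repeated for each window size in intervals.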
Consider the following MWE.
from pandas import DataFrame
from bokeh.plotting import figure
data = dict(x = [0,1,2,0,1,2],
y = [0,1,2,4,5,6],
g = [1,1,1,2,2,2])
df = DataFrame(data)
p = figure()
p.line( 'x', 'y', source=df[ df.g == 1 ] )
p.line( 'x', 'y', source=df[ df.g == 2 ] )
Ideally, I would like to compress the last two lines into one:
p.line( 'x', 'y', source=df.groupby('g') )
(Real life examples have a large and variable number of groups.) Is there any concise way to do this?
I just found out that the following works
gby = df.groupby('g')
gby.apply( lambda d: p.line( 'x', 'y', source=d ) )
(it has some drawbacks, though).
Any better idea?
I couldn't come up with a df.groupby solution, so I used df.loc, but maybe multi_line is what you are after:
from pandas import DataFrame
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
data = dict(x = [0, 1, 2, 0, 1, 2],
y = [0, 1, 2, 4, 5, 6],
g = [1, 1, 1, 2, 2, 2])
df = DataFrame(data, index = data['g'])
dfs = [DataFrame(df.loc[i].values, columns = df.columns) for i in df['g'].unique()]
source = ColumnDataSource(dict(x = [df['x'].values for df in dfs], y = [df['y'].values for df in dfs]))
p = figure()
p.multi_line('x', 'y', source = source)
show(p)
Result:
This is Tony's solution slightly simplified.
import pandas as pd
from bokeh.plotting import figure
data = dict(x = [0, 1, 2, 0, 1, 2],
y = [0, 1, 2, 4, 5, 6],
g = [1, 1, 1, 2, 2, 2])
df = pd.DataFrame(data)
####################### So far as in the OP
gby = df.groupby('g')
p = figure()
x = [list( sdf['x'] ) for i,sdf in gby]
y = [list( sdf['y'] ) for i,sdf in gby]
p.multi_line( x, y )
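As a small variation on the two list comprehensions, the per-group lists can also be gathered with groupby(...).apply(list). This sketch covers only the pandas side and leaves the plotting call out, so it runs without Bokeh; the names `xs` and `ys` are my own:

```python
import pandas as pd

data = dict(x=[0, 1, 2, 0, 1, 2],
            y=[0, 1, 2, 4, 5, 6],
            g=[1, 1, 1, 2, 2, 2])
df = pd.DataFrame(data)

# One list per group, ordered by the group key
xs = df.groupby("g")["x"].apply(list).tolist()
ys = df.groupby("g")["y"].apply(list).tolist()
print(xs, ys)
```

The resulting lists of lists could then be handed to p.multi_line(xs, ys) as in the snippet above.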
from pandas import DataFrame
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
data = dict(x = [0, 1, 2, 0, 1, 2],
y = [0, 1, 2, 4, 5, 6],
g = [1, 1, 1, 2, 2, 2])
df = DataFrame(data)
plt = figure()
for i, group in df.groupby(['g']):
    source = ColumnDataSource(group)
    plt.line('x','y', source=source, legend_group='g')
show(plt)
I have a dataframe and would like to have the values in one column being set through an iterative function as below.
import pandas as pd
import numpy as np
d = {'col1': [0.4444, 25.4615],
'col2': [0.5, 0.7],
'col3': [7, 7]}
df = pd.DataFrame(data=d)
df['col4'] = df['col1'] * df['col3']/4
def func(df):
    a = np.exp(-df['col4'])
    n = 1
    while df['col2'] < a:
        a = a + df['col4'] * 4 / n
        n += 1
    return n
df['col5'] = func(df)
I get an error message "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." How can I run the function per row to solve the series/ambiguity problem?
EDIT: Added expected output.
out = {'col1': [0.4444, 25.4615],
'col2': [0.5, 0.7],
'col3': [7, 7],
'col4': [0.7777, 44.557625],
'col5': [0, 49]}
dfout = pd.DataFrame(out)
I am not sure what the values in col4 and col5 will be, but according to the calculation I am trying to replicate, those will be the values.
EDIT2: I had missed n += 1 in the while loop. Added it now.
EDIT3: I am trying to apply
f(0) = e^-col4
f(n) = col4 * f(n-1) / n for n > 0
until f > col2 and then return the value of n per row.
Using the information you provided, this seems to be the solution:
import pandas as pd
import numpy as np
d = {'col1': [0.4444, 25.4615],
'col2': [0.5, 0.7],
'col3': [7, 7]}
df = pd.DataFrame(data=d)
df['col4'] = df['col1'] * df['col3']/4
def func(df):
    n = 1
    return n
df['col5'] = func(df)
For what it is worth, here is an inefficient solution: after each iteration, keep track of which coefficient starts satisfying the condition.
import pandas as pd
import numpy as np
d = {'col1': [0.4444, 25.4615],
'col2': [0.5, 0.7],
'col3': [7, 7]}
df = pd.DataFrame(data=d)
df['col4'] = df['col1'] * df['col3']/4
def func(df):
    a = np.exp(-df['col4'])
    n = 1
    ns = [None] * len(df['col2'])
    status = a > df['col2']
    for i in range(len(status)):
        if ns[i] is None and status[i]:
            ns[i] = n
    # stops when all coefficients satisfy the condition
    while not status.all():
        a = a * df['col4'] * n
        status = a > df['col2']
        n += 1
        for i in range(len(status)):
            if ns[i] is None and status[i]:
                ns[i] = n
    return ns
df['col5'] = func(df)
print(df['col5'])
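On the original question of running the function per row: the ValueError comes from comparing two whole Series inside while, and df.apply with axis=1 avoids that by passing one row at a time, so the comparison is between scalars. The sketch below is only one reading of EDIT3. It assumes the intent is to add up the terms f(0), f(1), ... until the running total exceeds col2, and it adds a hypothetical `max_iter` cap so the loop cannot run forever. It is not guaranteed to reproduce the expected output given in the question.

```python
import numpy as np
import pandas as pd

def count_terms(row, max_iter=1000):
    """Row-wise loop: returns n once the running total of f(0..n) exceeds col2."""
    lam, threshold = row["col4"], row["col2"]
    term = np.exp(-lam)   # f(0) = e^-col4
    total = term          # assumption: compare the running total, not a single term
    n = 0
    while total <= threshold and n < max_iter:  # max_iter is a made-up safety cap
        n += 1
        term = term * lam / n  # f(n) = col4 * f(n-1) / n
        total += term
    return n

df = pd.DataFrame({"col1": [0.4444, 25.4615], "col2": [0.5, 0.7], "col3": [7, 7]})
df["col4"] = df["col1"] * df["col3"] / 4

# axis=1 calls count_terms once per row, so every comparison is scalar
df["col5"] = df.apply(count_terms, axis=1)
print(df)
```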