reindex (1,N) dimension dataframe - pandas

import pandas

A = pandas.DataFrame({"A": [1, 4], "Output1": [6, 8]}).set_index(["A"]).fillna(0)
new_A = A.reindex(pandas.MultiIndex.from_tuples([['Output1', "-"]]), axis="columns")
I'm expecting to get
  Output1
        -
A
1       6
4       8
But instead I get
  Output1
        -
A
1     NaN
4     NaN
Is there anything wrong with my code?

Don't use reindex, which aligns the columns by names. Just reassign the columns:
A.columns = pd.MultiIndex.from_tuples([['Output1', "-"]])
Output:
  Output1
        -
A
1       6
4       8
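Equivalently, set_axis attaches the new MultiIndex without mutating A in place (a small sketch, assuming pandas is imported as pd):
new_A = A.set_axis(pandas.MultiIndex.from_tuples([("Output1", "-")]), axis="columns")
print(new_A)
Either way the values are kept, because the labels are assigned positionally instead of being aligned by name as reindex does.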

Related

How to do a conditional rolling mean in Pandas?

I have this data frame available. It has a timestamp for start, a timestamp for end and a duration column.
start  end  duration
1      5    4
2      5    3
3      4    1
4      6    2
5      9    4
6      7    1
7      10   3
I'd like to add a column 'rolling_mean' to the dataframe that calculates a rolling average on all previous rows (ordered by start) with this condition: only previous rows can be used for mean calculation where the event has already ended (so end date should be equal to or lower than the start date of the row for which the rolling mean is being calculated). So for row number 4, the rolling_mean is 1 because we look at all previous rows and only the previous one fulfills the condition of the event having ended.
This is the dataframe I'd like to get with a Pandas rolling mean:
start  end  duration  rolling_mean
1      5    4         NaN
2      5    3         NaN
3      4    1         NaN
4      6    2         1
5      9    4         2.666667
6      7    1         2.500000
7      10   3         2.200000
Here is the code to reproduce my example:
import pandas as pd

d = [[1, 5],
     [2, 5],
     [3, 4],
     [4, 6],
     [5, 9],
     [6, 7],
     [7, 10]]
df = pd.DataFrame(d, columns=['start_time', 'end_time'])
df['duration'] = df.end_time - df.start_time
I've tried to merge the dataframe with itself to then filter out the irrelevant rows, but the data frame is too big to take this approach.
So I'm looking for a rolling mean but where I can specify the extra condition.
Does anyone have any ideas for this one?
A for loop will do the job:
import numpy as np

rolling_mean = np.repeat(np.nan, len(df))
start, end, duration = df[["start_time", "end_time", "duration"]].to_numpy().T

for i in range(len(df)):
    # earlier rows whose event has already ended by this row's start
    matches = duration[:i][end[:i] <= start[i]]
    if matches.size:  # .size instead of .any(): .any() would skip the row if all matching durations were 0
        rolling_mean[i] = matches.mean()

df["rolling_mean"] = rolling_mean
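An equivalent pandas-only sketch (not faster, just a cross-check), assuming df is the frame built above with its default RangeIndex; it expresses the same condition with apply:
def mean_of_finished(row):
    # rows above this one (position equals label for the default RangeIndex)
    prev = df.iloc[:row.name]
    finished = prev.loc[prev['end_time'] <= row['start_time'], 'duration']
    return finished.mean()  # NaN when no earlier event has finished

df['rolling_mean_apply'] = df.apply(mean_of_finished, axis=1)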

Find common values within groupby in pandas Dataframe based on two columns

I have following dataframe:
period symptoms recovery
1 4 2
1 5 2
1 6 2
2 3 1
2 5 2
2 8 4
2 12 6
3 4 2
3 5 2
3 6 3
3 8 5
4 5 2
4 8 4
4 12 6
I'm trying to find the common values of the df['period'] groups (1, 2, 3, 4) based on the values of the two columns 'symptoms' and 'recovery'.
Result should be :
symptoms recovery period
5 2 [1, 2, 3, 4]
8 4 [2, 4]
where each identical pair of the two column values has the periods it occurs in collected into a list or column.
Am I approaching the problem in the wrong way? I appreciate your help.
I tried to turn each period into a dict and loop through to find the values, but that didn't work for me. I also tried to use groupby().apply(), but I'm not getting a meaningful data frame.
Tried sorting values based on 3 columns but couldn't get the common ones between each period section.
Last attempt :
df2 = df[['period', 'how_long', 'days_to_ex']].copy()
#s = df.groupby(["period", "symptoms", "recovery"]).size()
s = df.groupby(["symptoms", "recovery"]).size()
You were almost there:
from io import StringIO
import pandas as pd
# setup sample data
data = StringIO("""
period;symptoms;recovery
1;4;2
1;5;2
1;6;2
2;3;1
2;5;2
2;8;4
2;12;6
3;4;2
3;5;2
3;6;3
3;8;5
4;5;2
4;8;4
4;12;6
""")
df = pd.read_csv(data, sep=";")
# collect the periods for each (symptoms, recovery) pair
df.groupby(['symptoms','recovery'])[['period']].agg(list).reset_index()
This gives
symptoms recovery period
0 3 1 [2]
1 4 2 [1, 3]
2 5 2 [1, 2, 3, 4]
3 6 2 [1]
4 6 3 [3]
5 8 4 [2, 4]
6 8 5 [3]
7 12 6 [2, 4]
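To look up the periods for one particular pair from that result, the two columns can be kept as the group index (a small usage sketch; res is just an illustrative name):
res = df.groupby(['symptoms', 'recovery'])['period'].agg(list)
print(res.loc[(5, 2)])   # [1, 2, 3, 4]
print(res.loc[(8, 4)])   # [2, 4]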

rolling windows defined by backward cumulative sums

I have got a pandas DataFrame like this:
A B
0 3 ...
1 2
2 4
3 4
4 1
5 7
6 5
7 3
I would like to compute a rolling window along column A, summing its elements backwards until the sum reaches at least 10. The resulting windows should be:
A B window_indices
0 3 ... NA
1 2 NA
2 4 NA
3 4 --> [3,2,1]
4 1 [4,3,2,1]
5 7 [5,4,3]
6 5 [6,5]
7 3 [7,6,5]
Next, I want to compute some statistics on column B, something like that:
df.my_rolling(on='A', func='sum', threshold=10).B.mean()
I have got an idea: we could think of the elements of column A as seconds, transform A into a datetime column and perform a standard rolling on it. But I don't know how to do that.
This is not doable with rolling, since the rolling window size is not fixed:
import numpy as np

l = [[df.index[(df.A.loc[:x].iloc[::-1].cumsum() >= 10).idxmax():x + 1].tolist()[::-1]
      if df.A.loc[:x].sum() >= 10 else np.nan]
     for x in df.A.index]
Out[46]:
[[nan],
[nan],
[nan],
[[3, 2, 1]],
[[4, 3, 2, 1]],
[[5, 4, 3]],
[[6, 5]],
[[7, 6, 5]]]
df['new'] = l
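For readability, the same backward accumulation can also be written as a plain loop (a sketch, assuming df is the frame from the question with its default RangeIndex; the threshold of 10 comes from the question):
import numpy as np

def backward_windows(a, threshold=10):
    # for each position i, walk backwards summing a until the total reaches
    # the threshold; record the visited indices (newest first), else NaN
    windows = []
    for i in range(len(a)):
        total, j = 0, i
        while j >= 0 and total < threshold:
            total += a[j]
            j -= 1
        windows.append(list(range(i, j, -1)) if total >= threshold else np.nan)
    return windows

df['window_indices'] = backward_windows(df['A'].to_numpy())
A statistic on B per window could then be computed with something like [df.loc[w, 'B'].mean() for w in df['window_indices'] if isinstance(w, list)].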

Pandas .corr() returning an empty DataFrame

It was working great until it wasn't, and I have no idea what I'm doing wrong. I've reduced it to a very simple dataset t:
1 2 3 4 5 6 7 8
0 3 16 3 2 17 2 3 2
1 3 16 3 2 19 4 3 2
2 3 16 3 2 9 2 3 2
3 3 16 3 2 19 1 3 2
4 3 16 3 2 17 2 3 1
5 3 16 3 2 17 1 17 1
6 3 16 3 2 19 1 17 2
7 3 16 3 2 19 4 3 1
8 3 16 3 2 19 1 3 2
9 3 16 3 2 7 2 17 1
corr = t.corr()
corr
returns an empty DataFrame
and
sns.heatmap(corr)
throws the following error "zero-size array to reduction operation minimum which has no identity"
I have no idea what's wrong. I've tried it with more rows etc., and double checked that I don't have any missing values... what's going on? I had such a pretty heatmap earlier, and I've been trying to get it back.
As mentioned above, change the type to float. Simply:
corr = t.astype('float64').corr()
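If it is not obvious which column is the culprit, checking the dtypes first makes the problem visible (a quick sketch; t is the frame shown above):
print(t.dtypes)   # any 'object' column is silently dropped by corr()
corr = t.astype('float64').corr()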
The problem here is not the dataframe itself but where it comes from. I ran into the same problem when using drop or iloc on a dataframe. The key is the global dtype of the dataframe.
Let's say we have the following dataframe:
list_ex = [[1.1,2.1,3.1,4,5,6,7,8],[1.2,2.2,3.3,4.1,5.5,6,7,8],
[1.3,2.3,3,4,5,6.2,7,8],[1.4,2.4,3,4,5,6.2,7.3,8.1]]
list_ex_new=pd.DataFrame(list_ex)
you can calculate list_ex_new.corr() with no problem. If you check the attributes of the dataframe with vars(list_ex_new), you'll obtain:
{'_is_copy': None, '_data': BlockManager
Items: RangeIndex(start=0, stop=8, step=1)
Axis 1: RangeIndex(start=0, stop=4, step=1)
FloatBlock: slice(0, 8, 1), 8 x 4, dtype: float64, '_item_cache': {}}
where dtype is float64.
A new dataframe can be defined by list_new_new = list_ex_new.iloc[1:,:] and the correlations can be evaluated successfully. A check of the dataframe's attributes shows:
{'_is_copy': ,
'_data': BlockManager
Items: RangeIndex(start=0, stop=8, step=1)
Axis 1: RangeIndex(start=1, stop=4, step=1)
FloatBlock: slice(0, 8, 1), 8 x 3, dtype: float64,
'_item_cache': {}}
where dtype is still float64.
A third dataframe can be defined:
list_ex_w = [['a','a','a','a','a','a','a','a'],[1.1,2.1,3.1,4,5,6,7,8],
[1.2,2.2,3.3,4.1,5.5,6,7,8],[1.3,2.3,3,4,5,6.2,7,8],
[1.4,2.4,3,4,5,6.2,7.3,8.1]]
list_ex_new_w=pd.DataFrame(list_ex_w)
An evaluation of the dataframe's correlation will result in an empty dataframe, since list_ex_new_w's attributes look like:
{'_is_copy': None, '_data': BlockManager
Items: RangeIndex(start=0, stop=8, step=1)
Axis 1: Index(['a', 1, 2, 3, 4], dtype='object')
ObjectBlock: slice(0, 8, 1), 8 x 5, dtype: object, '_item_cache': {}}
where the dtype is now 'object', since the dataframe is not consistent in its types: there are strings and floats together. Finally, a fourth dataframe can be generated:
list_new_new_w = list_ex_new_w.iloc[1:,:]
This generates the same dataframe but without the 'a's, apparently a perfectly valid dataframe for calculating correlations. However, corr() again returns an empty dataframe. A final check of the dataframe's attributes shows:
vars(list_new_new_w)
{'_is_copy': None, '_data': BlockManager
Items: Index([1, 2, 3, 4], dtype='object')
Axis 1: RangeIndex(start=0, stop=8, step=1)
ObjectBlock: slice(0, 4, 1), 4 x 8, dtype: object, '_item_cache': {}}
where dtype is still object, thus the method corr returns an empty dataframe.
This problem can be solved by using astype(float)
list_new_new_w.astype(float).corr()
In summary, it seems that when corr or cov (among other methods) are called, pandas generates a new dataframe with the same attributes, regardless of whether the new dataframe has a consistent global type. I've been checking the pandas source code and I understand this is the correct interpretation of pandas' implementation.
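If the frame might also contain values that cannot be parsed as numbers at all, pd.to_numeric with errors='coerce' is a more forgiving conversion than astype(float) (a sketch; t is the frame from the question):
t_num = t.apply(pd.to_numeric, errors='coerce')   # unparsable entries become NaN
corr = t_num.corr()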

How to calculate multiple columns from multiple columns in pandas

I am trying to calculate multiple columns from multiple columns in a pandas dataframe using a function.
The function takes three arguments -a-, -b- and -c- and returns three calculated values -sum-, -prod- and -quot-. In my pandas data frame I have three columns -a-, -b- and -c- from which I want to calculate the columns -sum-, -prod- and -quot-.
The mapping that I do works only when I have exactly three rows. I do not know what is going wrong, although I expect that it has something to do with selecting the correct axis. Could someone explain what is happening and how I can calculate the values that I would like to have?
Below are the situations that I have tested.
INITIAL VALUES
def sum_prod_quot(a, b, c):
    sum = a + b + c
    prod = a * b * c
    quot = a / b / c
    return (sum, prod, quot)
df = pd.DataFrame({ 'a': [20, 100, 18],
'b': [ 5, 10, 3],
'c': [ 2, 10, 6],
'd': [ 1, 2, 3]
})
df
a b c d
0 20 5 2 1
1 100 10 10 2
2 18 3 6 3
CALCULATION STEPS
Using exactly three rows
When I calculate three columns from this dataframe using the function, I get:
df['sum'], df['prod'], df['quot'] = \
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
df
a b c d sum prod quot
0 20 5 2 1 27.0 120.0 27.0
1 100 10 10 2 200.0 10000.0 324.0
2 18 3 6 3 2.0 1.0 1.0
This is exactly the result that I want to have: The sum-column has the sum of the elements in the columns a,b,c; the prod-column has the product of the elements in the columns a,b,c and the quot-column has the quotients of the elements in the columns a,b,c.
Using more than three rows
When I expand the dataframe by one row, I get an error!
The data frame is defined as:
df = pd.DataFrame({ 'a': [20, 100, 18, 40],
'b': [ 5, 10, 3, 10],
'c': [ 2, 10, 6, 4],
'd': [ 1, 2, 3, 4]
})
df
a b c d
0 20 5 2 1
1 100 10 10 2
2 18 3 6 3
3 40 10 4 4
The call is
df['sum'], df['prod'], df['quot'] = \
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
The result is
...
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
ValueError: too many values to unpack (expected 3)
while I would expect an extra row:
df
a b c d sum prod quot
0 20 5 2 1 27.0 120.0 27.0
1 100 10 10 2 200.0 10000.0 324.0
2 18 3 6 3 2.0 1.0 1.0
3 40 10 4 4 54.0 1600.0 1.0
Using less than three rows
When I reduce the dataframe by one row I also get an error.
The dataframe is defined as:
df = pd.DataFrame({ 'a': [20, 100],
'b': [ 5, 10],
'c': [ 2, 10],
'd': [ 1, 2]
})
df
a b c d
0 20 5 2 1
1 100 10 10 2
The call is
df['sum'], df['prod'], df['quot'] = \
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
The result is
...
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
ValueError: need more than 2 values to unpack
while I would expect a row less:
df
a b c d sum prod quot
0 20 5 2 1 27.0 120.0 27.0
1 100 10 10 2 200.0 10000.0 324.0
QUESTIONS
The questions I have:
1) Why do I get these errors?
2) How do I have to modify the call such that I get the desired data frame?
NOTE
In this link a similar question is asked, but the given answer did not work for me.
The result shown doesn't seem correct even for 3 rows. Check values other than the first row and first column: the product 20*5*2 is NOT 120, it's 200, and it ends up one row below, in the sum column. map produces one (sum, prod, quot) tuple per row, so the list has as many elements as there are rows, and unpacking it into three names only happens to work when there are exactly three rows. You need to transpose the list into three columns before assigning; zip(*...) does exactly that:
df['sum'], df['prod'], df['quot'] = zip(*map(sum_prod_quot, df['a'], df['b'], df['c']))
For details follow the link
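As an alternative to zip, apply with result_type='expand' lets pandas do the transposition (a sketch; result_type='expand' needs a reasonably recent pandas version):
out = df.apply(lambda r: sum_prod_quot(r['a'], r['b'], r['c']),
               axis=1, result_type='expand')
out.columns = ['sum', 'prod', 'quot']   # apply returns columns 0, 1, 2 by default
df = df.join(out)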