Retain values in a Pandas dataframe

Consider the following Pandas Dataframe:
df = pd.DataFrame([
[4.0, "Diastolic Blood Pressure", 1.0, "2017-01-15", 68],
[4.0, "Diastolic Blood Pressure", 5.0, "2017-04-15", 60],
[4.0, "Diastolic Blood Pressure", 8.0, "2017-06-18", 68],
[4.0, "Heart Rate", 1.0, "2017-01-15", 85],
[4.0, "Heart Rate", 5.0, "2017-04-15", 72],
[4.0, "Heart Rate", 8.0, "2017-06-18", 81],
[6.0, "Diastolic Blood Pressure", 1.0, "2017-01-18", 114],
[6.0, "Diastolic Blood Pressure", 6.0, "2017-02-18", 104],
[6.0, "Diastolic Blood Pressure", 9.0, "2017-03-18", 124]
], columns = ['ID', 'VSname', 'Visit', 'VSdate', 'VSres'])
I'd like to create the 'Flag' variable in this df: for each ID and VSname, show the difference from baseline (visit 1) at each visit.
I tried different approaches and I'm stuck.
I come from a SAS programming background, where it is very easy to retain values from one row to another and then subtract. I'm sure my mind is polluted by SAS (and the title is clearly wrong), but this has to be doable with Pandas, one way or another. Any ideas?
Thanks a lot for your help.
Kind regards,
Nicolas

Assuming the DataFrame is ordered by ID and VSname, with each group starting at visit 1 (i.e. visits 5 and 8 come directly after visit 1), you can build a group counter with cumsum:
c = (df.Visit == 1).cumsum()
You can then subtract the first VSres entry of each group from VSres:
df.VSres - df.groupby(c).VSres.transform("first")
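As a minimal end-to-end sketch of this idea, using the question's data and column names, with the counter built by cumsum. It assumes every ID/VSname group starts with its visit 1 row; the label name "block" is only illustrative, and Flag is the column name the question asks for.

```python
import pandas as pd

df = pd.DataFrame([
    [4.0, "Diastolic Blood Pressure", 1.0, "2017-01-15", 68],
    [4.0, "Diastolic Blood Pressure", 5.0, "2017-04-15", 60],
    [4.0, "Diastolic Blood Pressure", 8.0, "2017-06-18", 68],
    [4.0, "Heart Rate", 1.0, "2017-01-15", 85],
    [4.0, "Heart Rate", 5.0, "2017-04-15", 72],
    [4.0, "Heart Rate", 8.0, "2017-06-18", 81],
    [6.0, "Diastolic Blood Pressure", 1.0, "2017-01-18", 114],
    [6.0, "Diastolic Blood Pressure", 6.0, "2017-02-18", 104],
    [6.0, "Diastolic Blood Pressure", 9.0, "2017-03-18", 124],
], columns=["ID", "VSname", "Visit", "VSdate", "VSres"])

# Each visit-1 row starts a new block, so the running sum labels the blocks 1, 2, 3.
block = (df["Visit"] == 1).cumsum().rename("block")

# Broadcast the first VSres of each block (the baseline) to every row of that block.
baseline = df.groupby(block)["VSres"].transform("first")

df["Flag"] = df["VSres"] - baseline
print(df[["ID", "VSname", "Visit", "VSres", "Flag"]])
```

The result only makes sense if the ordering assumption holds; sorting by ID, VSname and Visit beforehand is one way to enforce it.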

I tried the answers kindly given, but none worked: I got errors I couldn't fix. Not sure why... I managed to produce something close using the following:
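# keep only the merge keys and the visit-1 result, so the merge adds a single column
baseline = df.loc[df["Visit"] == 1.0, ["ID", "VSname", "VSres"]]
baseline = baseline.rename(columns={'VSres': 'baseline'})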
df = pd.merge(df, baseline, on = ["ID", "VSname"], how='left')
df["chg"] = df["VSres"] - df["baseline"]
That's not very beautiful, I know...
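For comparison, here is a sketch of a variant that avoids the merge and does not depend on row order. It assumes each (ID, VSname) pair has exactly one visit 1 row. The data is a deliberately shuffled subset of the question's rows, and the intermediate name visit1_only is illustrative.

```python
import pandas as pd

# A shuffled subset of the question's rows, to show that order does not matter.
df = pd.DataFrame([
    [4.0, "Heart Rate", 5.0, "2017-04-15", 72],
    [4.0, "Heart Rate", 1.0, "2017-01-15", 85],
    [6.0, "Diastolic Blood Pressure", 9.0, "2017-03-18", 124],
    [6.0, "Diastolic Blood Pressure", 1.0, "2017-01-18", 114],
    [4.0, "Heart Rate", 8.0, "2017-06-18", 81],
], columns=["ID", "VSname", "Visit", "VSdate", "VSres"])

# Keep VSres only on visit-1 rows; every other row becomes NaN.
visit1_only = df["VSres"].where(df["Visit"] == 1)

# "first" returns the first non-null value per group, i.e. the visit-1 result,
# and transform broadcasts it back to every row of the group.
baseline = visit1_only.groupby([df["ID"], df["VSname"]]).transform("first")

df["chg"] = df["VSres"] - baseline
print(df)
```

Groups without a visit 1 row would get NaN in chg, which may or may not be the desired behaviour.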

Related

Reindex pandas DataFrame to match index with another DataFrame

I have two pandas DataFrames with different (float) indices.
I want to update the second dataframe to match the first dataframe's index, updating its values to be interpolated using the index.
This is the code I have:
from pandas import DataFrame
df1 = DataFrame([
{'time': 0.2, 'v': 1},
{'time': 0.4, 'v': 2},
{'time': 0.6, 'v': 3},
{'time': 0.8, 'v': 4},
{'time': 1.0, 'v': 5},
{'time': 1.2, 'v': 6},
{'time': 1.4, 'v': 7},
{'time': 1.6, 'v': 8},
{'time': 1.8, 'v': 9},
{'time': 2.0, 'v': 10}
]).set_index('time')
df2 = DataFrame([
{'time': 0.25, 'v': 1},
{'time': 0.5, 'v': 2},
{'time': 0.75, 'v': 3},
{'time': 1.0, 'v': 4},
{'time': 1.25, 'v': 5},
{'time': 1.5, 'v': 6},
{'time': 1.75, 'v': 7},
{'time': 2.0, 'v': 8},
{'time': 2.25, 'v': 9}
]).set_index('time')
df2 = df2.reindex(df1.index.union(df2.index)).interpolate(method='index').reindex(df1.index)
print(df2)
Output:
v
time
0.2 NaN
0.4 1.6
0.6 2.4
0.8 3.2
1.0 4.0
1.2 4.8
1.4 5.6
1.6 6.4
1.8 7.2
2.0 8.0
That's correct and as I need - however it seems a more complicated statement than it needs to be.
Is there a more concise way to do the same, requiring fewer intermediate steps?
Also, is there a way to both interpolate and extrapolate? For example, in the example data above, the linearly extrapolated value for index 0.2 could be 0.8 instead of NaN. I know I could use curve_fit, but again I feel that's more complicated than it needs to be.
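One idea is numpy.interp, if the values in both indices are increasing and only the single column v needs processing. Note that np.interp does not extrapolate: outside the range of df2.index it returns the nearest endpoint value, so index 0.2 gets 1.0 below.
import numpy as np
df1['v1'] = np.interp(df1.index, df2.index, df2['v'])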
print(df1)
v v1
time
0.2 1 1.0
0.4 2 1.6
0.6 3 2.4
0.8 4 3.2
1.0 5 4.0
1.2 6 4.8
1.4 7 5.6
1.6 8 6.4
1.8 9 7.2
2.0 10 8.0
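For the extrapolation part of the question, one possible sketch is to let np.interp handle the interior and extend the first and last segments linearly by hand. The helper name interp_extrap and the column name v2 are hypothetical (neither is a NumPy or pandas function), and the helper assumes xp is increasing with at least two points.

```python
import numpy as np
import pandas as pd


def interp_extrap(x, xp, fp):
    """Linear interpolation that also extends the end segments linearly.

    Hypothetical helper: xp must be increasing and hold at least two points.
    """
    x = np.asarray(x, dtype=float)
    xp = np.asarray(xp, dtype=float)
    fp = np.asarray(fp, dtype=float)

    # np.interp clamps to fp[0] / fp[-1] outside [xp[0], xp[-1]].
    y = np.interp(x, xp, fp)

    # Overwrite the clamped values using the slope of the nearest end segment.
    left, right = x < xp[0], x > xp[-1]
    y[left] = fp[0] + (x[left] - xp[0]) * (fp[1] - fp[0]) / (xp[1] - xp[0])
    y[right] = fp[-1] + (x[right] - xp[-1]) * (fp[-1] - fp[-2]) / (xp[-1] - xp[-2])
    return y


df1 = pd.DataFrame(
    {"v": range(1, 11)},
    index=pd.Index([0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0], name="time"),
)
df2 = pd.DataFrame(
    {"v": range(1, 10)},
    index=pd.Index([0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.25], name="time"),
)

df1["v2"] = interp_extrap(df1.index, df2.index, df2["v"])
print(df1)
```

With the question's data this gives 0.8 at index 0.2, the value the question suggests, while the interior values match the np.interp result.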

Create new data frame from unique values of certain columns [duplicate]

Say my data looks like this:
date,name,id,dept,sale1,sale2,sale3,total_sale
1/1/17,John,50,Sales,50.0,60.0,70.0,180.0
1/1/17,Mike,21,Engg,43.0,55.0,2.0,100.0
1/1/17,Jane,99,Tech,90.0,80.0,70.0,240.0
1/2/17,John,50,Sales,60.0,70.0,80.0,210.0
1/2/17,Mike,21,Engg,53.0,65.0,12.0,130.0
1/2/17,Jane,99,Tech,100.0,90.0,80.0,270.0
1/3/17,John,50,Sales,40.0,50.0,60.0,150.0
1/3/17,Mike,21,Engg,53.0,55.0,12.0,120.0
1/3/17,Jane,99,Tech,80.0,70.0,60.0,210.0
I want a new column average, which is the average of total_sale for each name,id,dept tuple
I tried
df.groupby(['name', 'id', 'dept'])['total_sale'].mean()
And this does return a series with the mean:
name id dept
Jane 99 Tech 240.000000
John 50 Sales 180.000000
Mike 21 Engg 116.666667
Name: total_sale, dtype: float64
but how would I reference the data? The series is a one dimensional one of shape (3,). Ideally I would like this put back into a dataframe with proper columns so I can reference properly by name/id/dept.
If you call .reset_index() on the series that you have, it will get you a dataframe like you want (each level of the index will be converted into a column):
df.groupby(['name', 'id', 'dept'])['total_sale'].mean().reset_index()
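To illustrate "referencing the data" both ways, here is a small sketch using only the columns from the question that matter here. The variable names means, means_df, john and mike are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Mike", "Jane"] * 3,
    "id": [50, 21, 99] * 3,
    "dept": ["Sales", "Engg", "Tech"] * 3,
    "total_sale": [180.0, 100.0, 240.0, 210.0, 130.0, 270.0, 150.0, 120.0, 210.0],
})

means = df.groupby(["name", "id", "dept"])["total_sale"].mean()

# The Series is indexed by (name, id, dept), so one group is a tuple lookup.
john = means.loc[("John", 50, "Sales")]

# reset_index turns each index level into an ordinary column,
# which allows the usual boolean filtering on name/id/dept.
means_df = means.reset_index()
mike = means_df.loc[means_df["name"] == "Mike", "total_sale"].iloc[0]

print(john, mike)
print(means_df)
```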
EDIT: to respond to the OP's comment, adding this column back to your original dataframe is a little trickier. You don't have the same number of rows as in the original dataframe, so you can't assign it as a new column yet. However, if you set the index the same, pandas is smart and will fill in the values properly for you. Try this:
cols = ['date','name','id','dept','sale1','sale2','sale3','total_sale']
data = [
['1/1/17', 'John', 50, 'Sales', 50.0, 60.0, 70.0, 180.0],
['1/1/17', 'Mike', 21, 'Engg', 43.0, 55.0, 2.0, 100.0],
['1/1/17', 'Jane', 99, 'Tech', 90.0, 80.0, 70.0, 240.0],
['1/2/17', 'John', 50, 'Sales', 60.0, 70.0, 80.0, 210.0],
['1/2/17', 'Mike', 21, 'Engg', 53.0, 65.0, 12.0, 130.0],
['1/2/17', 'Jane', 99, 'Tech', 100.0, 90.0, 80.0, 270.0],
['1/3/17', 'John', 50, 'Sales', 40.0, 50.0, 60.0, 150.0],
['1/3/17', 'Mike', 21, 'Engg', 53.0, 55.0, 12.0, 120.0],
['1/3/17', 'Jane', 99, 'Tech', 80.0, 70.0, 60.0, 210.0]
]
df = pd.DataFrame(data, columns=cols)
mean_col = df.groupby(['name', 'id', 'dept'])['total_sale'].mean() # don't reset the index!
df = df.set_index(['name', 'id', 'dept']) # make the same index here
df['mean_col'] = mean_col
df = df.reset_index() # to take the hierarchical index off again
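As an alternative sketch for adding the per-group mean as a column, groupby(...).transform("mean") returns a result aligned with the original rows, so no index juggling is needed. The column name average comes from the question; the data below keeps only the columns needed for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["1/1/17"] * 3 + ["1/2/17"] * 3 + ["1/3/17"] * 3,
    "name": ["John", "Mike", "Jane"] * 3,
    "id": [50, 21, 99] * 3,
    "dept": ["Sales", "Engg", "Tech"] * 3,
    "total_sale": [180.0, 100.0, 240.0, 210.0, 130.0, 270.0, 150.0, 120.0, 210.0],
})

# transform("mean") computes the mean per (name, id, dept) group and
# repeats it on every row of that group, keeping the original row count.
df["average"] = df.groupby(["name", "id", "dept"])["total_sale"].transform("mean")

print(df)
```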
Adding to_frame
df.groupby(['name', 'id', 'dept'])['total_sale'].mean().to_frame()
You are very close. You simply need to add a second set of brackets, [['total_sale']], to tell pandas to select a DataFrame and not a Series:
df.groupby(['name', 'id', 'dept'])[['total_sale']].mean()
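If you want the group keys as regular columns as well (selecting total_sale before calling mean avoids averaging the non-numeric date column, which raises an error in recent pandas versions):
df.groupby(['name', 'id', 'dept'], as_index=False)['total_sale'].mean()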
The answer is in two lines of code:
The first line creates the hierarchical frame.
df_mean = df.groupby(['name', 'id', 'dept'])[['total_sale']].mean()
The second line converts it to a dataframe with four columns ('name', 'id', 'dept', 'total_sale'):
df_mean = df_mean.reset_index()

Merge similar columns and add extracted values to dict

Given this input:
df = pd.DataFrame({'C1': [6, np.nan, 16, np.nan], 'C2': [17, np.nan, 1, np.nan],
'D1': [8, np.nan, np.nan, 6], 'D2': [15, np.nan, np.nan, 12]}, index=[1,1,2,2])
I'd like to combine columns beginning with the same letter (the Cs and Ds), as well as rows with the same index (1 and 2), and extract the non-null values to the simplest representation without duplicates, which I think is something like:
{1: {'C': [6.0, 17.0], 'D': [8.0, 15.0]}, 2: {'C': [16.0, 1.0], 'D': [6.0, 12.0]}}
Using stack or groupby gets me part of the way there, but I feel like there is a more efficient way to do it.
You can rename the columns to their first letter with a lambda function, reshape with DataFrame.stack, aggregate the values of each (index, letter) pair into lists, and then build the nested dictionary with a dict comprehension:
s = df.rename(columns=lambda x: x[0]).stack().groupby(level=[0,1]).agg(list)
d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}
print(d)
{1: {'C': [6.0, 17.0], 'D': [8.0, 15.0]}, 2: {'C': [16.0, 1.0], 'D': [6.0, 12.0]}}
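The same result can be sketched with plain loops, which makes the NaN handling explicit instead of relying on how stack treats missing values. This is only an illustration: the names result and by_letter are made up, values are collected column by column (which matches the stacked order for this input but need not in general), and no de-duplication is attempted.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"C1": [6, np.nan, 16, np.nan], "C2": [17, np.nan, 1, np.nan],
     "D1": [8, np.nan, np.nan, 6], "D2": [15, np.nan, np.nan, 12]},
    index=[1, 1, 2, 2],
)

result = {}
for idx, rows in df.groupby(level=0):      # rows sharing the same index label
    by_letter = {}
    for col in rows.columns:
        values = rows[col].dropna().tolist()  # non-null values of this column
        # Columns with the same first letter share one list.
        by_letter.setdefault(col[0], []).extend(values)
    result[int(idx)] = by_letter

print(result)
```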

pandas 0.20.3 DataFrame behavior changes for pyspark.ml.vectors object in a column

The following code works in pandas 0.20.0 but not in 0.20.3:
import pandas as pd
from pyspark.ml.linalg import Vectors
df = pd.DataFrame({'A': [1,2,3,4],
'B': [1,2,3,4],
'C': [1,2,3,4],
'D': [1,2,3,4]},
index=[0, 1, 2, 3])
df.apply(lambda x: pd.Series(Vectors.dense([x["A"], x["B"]])), axis=1)
This produces from pandas 0.20.0:
0
0 [1.0, 1.0]
1 [2.0, 2.0]
2 [3.0, 3.0]
3 [4.0, 4.0]
but it is different in pandas 0.20.3:
0 1
0 1.0 1.0
1 2.0 2.0
2 3.0 3.0
3 4.0 4.0
How can I achieve the first behavior in 0.20.3?

Odoo 10 ORM API - Create seller ids for ProductTemplate

I have
_seller_ids = [(0, 0, {'min_qty': 1.0, 'product_code': u'1006004', 'price': 1.0, 'name': res.partner(84,)})]
and
my_product_template = ProductTemplate(34,)
How can I create those seller_ids for my_product_template, i.e. ProductTemplate(34,)? The example has just one supplier, but since it is a list there might be more.
I have tried:
my_product_template.seller_ids.create(_seller_ids)
without success
Thanks,
You can try below:
my_product_template.seller_ids = [(0, 0 , {'min_qty': 1.0, 'product_code': u'1006004', 'price': 1.0, 'name': seller.name}) for seller in sellers]