I've been trying to create a column of variability relative to the long-term mean of the column data values for 'A' and 'B' below. I don't understand how to divide each row's value in the pandas column by the matching long-term average element-wise. For example, imagine I have data that looks like this in pandas df1:
Year Name Data
1999 A 2
2000 A 4
1999 B 6
2000 B 8
And I have a DF with the long-term means, called "LTmean", which in this case are 3 and 7.
mean_df =
  Name  mean
0    A     3
1    B     7
So, the result would look like this for a new df, with something like dfnew['Var'] = df1['Data'] / mean_df[???] - 1:
Year Name Var
1999 A -0.33
2000 A 0.33
1999 B -0.14
2000 B 0.14
Thank you for any suggestions on this! Would a loop be the best idea, to loop through each column by the "Name" in each DF somehow?
df['Var'] = df1['Data']/LTmean - 1
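A loop isn't needed here. One approach (a sketch, assuming the mean column in mean_df is called 'mean') is to merge the long-term means onto df1 by Name, then divide element-wise:

import pandas as pd

df1 = pd.DataFrame({'Year': [1999, 2000, 1999, 2000],
                    'Name': ['A', 'A', 'B', 'B'],
                    'Data': [2, 4, 6, 8]})
mean_df = pd.DataFrame({'Name': ['A', 'B'], 'mean': [3, 7]})

# Broadcast each Name's long-term mean onto its rows, then divide
dfnew = df1.merge(mean_df, on='Name', how='left')
dfnew['Var'] = dfnew['Data'] / dfnew['mean'] - 1
dfnew = dfnew[['Year', 'Name', 'Var']]
print(dfnew)

Equivalently, df1['Name'].map(mean_df.set_index('Name')['mean']) produces the same aligned series without keeping the merged columns around.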
I'm trying to calculate variability statistics from two DataFrames - one with current data and one with average data for the month. Suppose I have a df "DF1" that looks like this:
Name year month output
0 A 1991 1 10864.8
1 A 1997 2 11168.5
2 B 1994 1 6769.2
3 B 1998 2 3137.91
4 B 2002 3 4965.21
and a df called "DF2" that contains monthly averages from multiple years such as:
Name month output_average
0 A 1 11785.199
1 A 2 8973.991
2 B 1 8874.113
3 B 2 6132.176667
4 B 3 3018.768
and I need a new DF, call it "DF3", that should look like this, with the calculation done per "Name" and for each "month":
Name year month Variability
0 A 1991 1 -0.078097875
1 A 1997 2 0.24454103
2 B 1994 1 -0.237197002
3 B 1998 2 -0.488287737
4 B 2002 3 0.644782
I have tried options like the one below, but I get errors about a duplicated axis or key errors -
DF3['variability'] = ((DF1.output / DF2.set_index('month')['output_average'].reindex(DF1['name']).values) - 1)
Thank you for your help in learning Python row calculations, coming from MATLAB!
When matching on two key columns, you are better off using merge instead of set_index:
df3 = df1.merge(df2, on=['Name','month'], how='left')
df3['variability'] = df3['output']/df3['output_average'] - 1
Output:
Name year month output output_average variability
0 A 1991 1 10864.80 11785.199000 -0.078098
1 A 1997 2 11168.50 8973.991000 0.244541
2 B 1994 1 6769.20 8874.113000 -0.237197
3 B 1998 2 3137.91 6132.176667 -0.488288
4 B 2002 3 4965.21 3018.768000 0.644780
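If DF3 should match the requested layout exactly, the helper columns can be dropped afterwards (a small follow-up beyond the original output):

DF3 = df3.drop(columns=['output', 'output_average']).rename(columns={'variability': 'Variability'})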
Say I have a pandas dataframe df1 as follows:
OpDay Rid Tid Sid Dist
0 18Sep 1 1 1 10
1 18Sep 1 1 1 15
2 18Sep 1 1 1 20
3 18Sep 1 5 4 5
4 18Sep 1 5 4 50
and df2 like:
S_Day R_ID T_ID S_ID ABC XYZ
0 18Sep 1 1 1 100 60
1 18Sep 1 5 4 125 100
Number of rows in df2 is equal to the total number of unique combinations of OpDay+Rid+Tid+Sid in df1.
Now, I want the values of columns ABC and XYZ from df2 corresponding to each unique combination. But I don't want to store these values in df1; I just need them for some computation, after which I want to store the result in df2 only, by creating a new column.
To summarize, let's say I want to do some computation using df1.Dist[3], for which I also need values from columns df2.ABC and df2.XYZ, so first find the row index in df2 where,
S_Day = OpDay[3],
R_ID = Rid[3],
T_ID = Tid[3] and
S_ID = Sid[3]
(In this case its row#1),
so use df2.ABC[1] and df2.XYZ[1] and store results in df2.RESULT[1].
So now df2 will look something like:
S_Day R_ID T_ID S_ID ABC XYZ RESULT
0 18Sep 1 1 1 100 60 NaN
1 18Sep 1 5 4 125 100 some computed value
Basically I guess I need a lookup kind of function, but I don't know how to proceed further.
Please help as I am new to the world of python and programming. Many thanks in advance.
You can use .loc and Boolean indices to do what you want. Let's say that you're after the ith row of df1:
i = 3
Next, you can use Boolean indexing to find the corresponding rows in df2:
bool_index = ((df1.loc[i, 'OpDay'] == df2.loc[:, 'S_Day']) &
              (df1.loc[i, 'Rid'] == df2.loc[:, 'R_ID']) &
              (df1.loc[i, 'Tid'] == df2.loc[:, 'T_ID']) &
              (df1.loc[i, 'Sid'] == df2.loc[:, 'S_ID']))
You might want to include a check to verify that you found one and only one combination:
sum(bool_index) == 1
And finally, you can use the boolean index to call the right values from df2:
ABC_for_computation = df2.loc[bool_index, 'ABC']
XYZ_for_computation = df2.loc[bool_index, 'XYZ']
Note that I'm not too sure about the speed of this operation if you have large datasets. In my experience, if speed is affected you should switch to numpy arrays instead of dataframes, particularly when writing data into your dataframe.
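If speed does become an issue, a merge-based version avoids the per-row loop entirely. A sketch, with the caveat that the RESULT formula below is a hypothetical stand-in for the unspecified computation:

import pandas as pd

# Aggregate df1 once per combination, then align with df2 in a single step.
# Summing Dist is an assumption; substitute your actual computation.
agg = (df1.groupby(['OpDay', 'Rid', 'Tid', 'Sid'], as_index=False)['Dist']
          .sum()
          .rename(columns={'OpDay': 'S_Day', 'Rid': 'R_ID',
                           'Tid': 'T_ID', 'Sid': 'S_ID'}))
df2 = df2.merge(agg, on=['S_Day', 'R_ID', 'T_ID', 'S_ID'], how='left')
df2['RESULT'] = df2['Dist'] * df2['ABC'] / df2['XYZ']  # hypothetical formula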
I am working with some family data which holds records on caregivers and the number of children each caregiver has. Currently, demographic information for the caregiver and all children that caregiver has is in the caregiver record. I want to take children's demographic information and put it into the children's respective record/row. Here is an example of the data I am working with:
Vis POS FAMID G1ID G2ID G1B G2B1 G2B2 G2B3 G1R G2R1 G2R2 G2R3
1 0 1 100011 1979 2010 White White
1 1 1 200011
1 0 2 100021 1969 2011 2009 AA AA White
1 1 2 200021
1 2 2 200022
1 0 3 100031 1966 2008 2010 2011 White White AA AA
1 1 3 200031
1 2 3 200032
1 3 3 200033
G1 = caregiver data
G2 = child data
GxBx = birthyear
GxRx = race
OUTPUT
Visit POS FAMID G1 G2 G1Birth G2Birth G1Race G2Race
1 0 1 100011 1979 White
1 1 1 200011 2010 White
1 0 2 100021 1969 AA
1 1 2 200021 2011 AA
1 2 2 200022 2009 White
1 0 3 100031 1966 White
1 1 3 200031 2008 White
1 2 3 200032 2010 AA
1 3 3 200033 2011 AA
From these two tables you can see I want all G2Bx columns to fall into a new G2Birth column, and same principle for G2Rx columns. (I actually have several more instances like race and birthyear in my actual data)
I have been looking into pivots and stacking functions in the pandas dataframe, but I haven't quite got what I wanted. The closest I have gotten was using the melt function, but the issue I had with melt was that I couldn't get it to map to indexes without taking all values from that column, i.e. it wants to create a row for child2 and child3 for people who only have child1. I might just be using the melt function incorrectly.
What I want are all values from g2Birthdate1 to map onto POS when POS=1, and all g2Birthdate2 values to the POS=2 index, etc. Is there a function which can help accomplish this? Or does this require some additional coding solution?
You can do this with a row and a column MultiIndex and a left join:
import pandas as pd

# df is your initial dataframe
# Make a baseline dataframe to hold the IDs
id_df = df.drop(columns=[c for c in df.columns if c not in ["G1ID", "G2ID", "Vis", "FAMID", "POS"]])
# Make a rows MultiIndex to join on at the end
id_df = id_df.set_index(["Vis", "FAMID", "POS"])
# Make a data dataframe holding only the demographic columns
data_df = df.drop(columns=[c for c in df.columns if c in ["G1ID", "G2ID", "POS"]])
# Make the first two parts of the MultiIndex required for the join at the end
data_df = data_df.set_index(["Vis","FAMID"])
# Make the columns also have a MultiIndex
data_df.columns = pd.MultiIndex.from_tuples([("G1Birth",0),("G2Birth",1),("G2Birth",2),("G2Birth",3),
("G1Race",0),("G2Race",1),("G2Race",2),("G2Race",3)])
# Name the columnar index levels
data_df.columns.names = (None, "POS")
# Stack the newly formed lower-level into the rows MultiIndex to complete it in prep for joining
data_df = data_df.stack("POS")
# Join to the id dataframe on the full MultiIndex
final = id_df.join(data_df)
final = final.reset_index()
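Since the real data has several more column families than birth year and race, the column tuples can be built programmatically rather than hardcoded. A sketch, assuming every demographic column follows the GxBx/GxRx pattern shown above (FIELD_NAMES is a hypothetical helper mapping):

import re

FIELD_NAMES = {"B": "Birth", "R": "Race"}  # extend for the other demographics

def column_tuple(col):
    # G1B -> ("G1Birth", 0), G2B2 -> ("G2Birth", 2), G2R3 -> ("G2Race", 3)
    gen, field, pos = re.match(r"^G([12])([A-Z])(\d?)$", col).groups()
    return ("G" + gen + FIELD_NAMES[field], int(pos) if pos else 0)

data_df.columns = pd.MultiIndex.from_tuples([column_tuple(c) for c in data_df.columns])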
I have a dataframe with many columns in it; some of them contain prices and the rest contain volumes, as below:
year_month 0_fx_price_gy 0_fx_volume_gy 1_fx_price_yuy 1_fx_volume_yuy
1990-01 2 10 3 30
1990-01 2 20 2 40
1990-02 2 30 3 50
I need to do group by year_month and do mean on price columns and sum on volume columns.
Is there any quick way to do this in one statement, like do an average if the column name contains price and a sum if it contains volume?
df.groupby('year_month').?
Note: this is just sample data with fewer columns, but the format is similar.
output
year_month 0_fx_price_gy 0_fx_volume_gy 1_fx_price_yuy 1_fx_volume_yuy
1990-01 2 30 2.5 70
1990-02 2 30 3 50
Create a dictionary from the matched column names and pass it to DataFrameGroupBy.agg; last, add reindex in case the order of the output columns changed:
d1 = dict.fromkeys(df.columns[df.columns.str.contains('price')], 'mean')
d2 = dict.fromkeys(df.columns[df.columns.str.contains('volume')], 'sum')
#merge dicts together
d = {**d1, **d2}
print (d)
{'0_fx_price_gy': 'mean', '1_fx_price_yuy': 'mean',
'0_fx_volume_gy': 'sum', '1_fx_volume_yuy': 'sum'}
Another way to build the dictionary:
d = {}
for c in df.columns:
    if 'price' in c:
        d[c] = 'mean'
    if 'volume' in c:
        d[c] = 'sum'
And the solution can be simplified if there are only price and volume columns, with the first column filtered out by df.columns[1:]:
d = {x:'mean' if 'price' in x else 'sum' for x in df.columns[1:]}
df1 = df.groupby('year_month', as_index=False).agg(d).reindex(columns=df.columns)
print (df1)
year_month 0_fx_price_gy 0_fx_volume_gy 1_fx_price_yuy 1_fx_volume_yuy
0 1990-01 2 30 2.5 70
1 1990-02 2 30 3 50
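Since the question asks for a single statement, the dictionary comprehension can also be inlined (same logic as above, just compacted):

df1 = (df.groupby('year_month', as_index=False)
         .agg({x: 'mean' if 'price' in x else 'sum' for x in df.columns[1:]})
         .reindex(columns=df.columns))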
I'm struggling with a bit of a complex (to me) lookup-type problem.
I have a dataframe df1 that looks like this:
Grade Value
0 A 16
1 B 12
2 C 5
And another dataframe (df2) where the values in one of the columns('Grade') from df1 forms the index:
Tier 1 Tier 2 Tier 3
A 20 17 10
B 16 11 3
C 7 6 2
I've been trying to write a bit of code that, for each row in df1, looks up the row corresponding to 'Grade' in df2, finds the smallest value in that df2 row greater than 'Value', and returns the name of that column.
E.g. for the second row of df1, it looks up the row with index 'B' in df2: 16 is the smallest value greater than 12, so it returns 'Tier 1'. Ideal output would be:
Grade Value Tier
0 A 16 Tier 2
1 B 12 Tier 1
2 C 5 Tier 2
My novice, downloaded-Python-last-week attempt so far has been as follows, which is throwing up all manner of errors and doesn't even try returning the column name. Sorry also about the micro-ness of the question: any help appreciated!
for i, row in input_df1.iterrows():
    Tier = np.argmin(df1['Value'] < df2.loc[row, 0:df2.shape[1]])

df2.loc[df1.Grade].eq(df1.Value, 0).idxmax(1)
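One way to get the Tier column without fighting iterrows (a sketch, using only the sample frames above): for each row of df1, select the matching row of df2, keep only the entries greater than 'Value', and let idxmin return the column label of the smallest one:

import pandas as pd

df1 = pd.DataFrame({'Grade': ['A', 'B', 'C'], 'Value': [16, 12, 5]})
df2 = pd.DataFrame({'Tier 1': [20, 16, 7],
                    'Tier 2': [17, 11, 6],
                    'Tier 3': [10, 3, 2]}, index=['A', 'B', 'C'])

def smallest_greater(row):
    # Entries in this Grade's row of df2 that exceed Value
    candidates = df2.loc[row['Grade']]
    candidates = candidates[candidates > row['Value']]
    # idxmin gives the column label of the smallest qualifying value
    return candidates.idxmin() if not candidates.empty else None

df1['Tier'] = df1.apply(smallest_greater, axis=1)
print(df1)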