Repeated Measures ANOVA in Pandas, dependent variable values in different columns

I am quite new to data science, so maybe this will be quite easy for more advanced coders.
I want to do a repeated measures ANOVA based on pre and post measurements of a test in different groups (experimental group vs. control group). Every subject participated in only one group.
In my pandas DataFrame I have the following columns:
"Subject ID" (unique), "Condition" (Experimental or Control), "Pre-Measure Value", "Post-Measure Value", ...
import pandas as pd

subject_id = [1, 2, 3, 4]
condition = [1, 2, 1, 2]
pre = [1.1, 2.1, 3.1, 4.1]
post = [1.2, 2.2, 3.2, 4.2]
sample_df = pd.DataFrame({"Subject ID": subject_id, "Condition": condition, "Pre": pre, "Post": post})
sample_df
How can I analyze this using ANOVA?
The packages I've seen use dataframes where the dependent variable is in one column, whereas in my dataframe the dependent measures I want to evaluate are in two columns. Would I need to add another column specifying, for every value and condition, whether the value is pre or post?
Is there a "handy" function to do something like this?
Specifically the output would need to look like:
subject_id_new = [1,1,2,2,3,3,4,4]
condition_new = [1,1,2,2,1,1,2,2]
measurement = ["pre", "post","pre", "post","pre", "post","pre", "post"]
value = [1.1, 1.2,2.1,2.2,3.1,3.2,4.1,4.2]
new_df = pd.DataFrame({"Subject ID":subject_id_new, "Condition": condition_new, "Measurement": measurement, "Value": value})
Thanks a lot.

Actually, what I looked for is:
sample_df.melt(id_vars=['Subject ID', "Condition"])
This produces a long-format dataframe with a column specifying which measurement point each value refers to.
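For reference, a minimal sketch of that melt call with explicit column names (var_name and value_name just rename the generated columns), followed by a stable sort so each subject's pre/post rows sit together, matching the desired output above:

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Subject ID": [1, 2, 3, 4],
    "Condition": [1, 2, 1, 2],
    "Pre": [1.1, 2.1, 3.1, 4.1],
    "Post": [1.2, 2.2, 3.2, 4.2],
})

# Wide -> long: one row per (subject, measurement point)
long_df = sample_df.melt(
    id_vars=["Subject ID", "Condition"],
    var_name="Measurement",
    value_name="Value",
).sort_values("Subject ID", kind="stable", ignore_index=True)
```

The long format is what repeated-measures ANOVA routines (e.g. statsmodels' AnovaRM or pingouin's mixed_anova) typically expect.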

Related

Unable to create pandas DataFrame from itertools.product output

I want to create a Dataframe of all possible combinations:
import itertools
import pandas as pd

Primary_function=['Office','Hotel','Hospital(General Medical & Surgical)','Other - Education']
City=['Miami','Houston','Phoenix','Atlanta','Las Vegas','San Francisco','Baltimore','Chicago','Boulder','Minneapolis']
Gross_Floor_Area=[50,100,200]
Years_Built=[1950,1985,2021]
Floors_above_grade=[2,6,15]
Heat=['Electricity - Grid Purchase','Natural Gas','District Steam']
WWR=[30,50,70]
Buildings=[Primary_function,City,Gross_Floor_Area,Years_Built,Floors_above_grade,Heat,WWR]
a=list((itertools.product(*Buildings)))
df=pd.DataFrame(a,columns=Buildings)
The error that I am getting is :
ValueError: Length of columns passed for MultiIndex columns is different
Pass a list of column-name strings instead, i.e.
columns = ["Primary Function", "City", "Gross Floor Area", "Year Built", "Floors Above Grade", "Heat", "WWR"]
df = pd.DataFrame(a, columns = columns)
As Mr. T suggests, if you do this frequently you will be better off using a dict.
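A sketch of that dict-based version (with shortened, illustrative value lists): keeping each column name next to its values in one dict means the columns argument can never fall out of sync with the data:

```python
import itertools
import pandas as pd

# Map each column name to its candidate values (shortened here for brevity)
spec = {
    "Primary Function": ["Office", "Hotel"],
    "City": ["Miami", "Houston", "Phoenix"],
    "WWR": [30, 50, 70],
}

# Cartesian product of all value lists, one tuple per row
rows = list(itertools.product(*spec.values()))
df = pd.DataFrame(rows, columns=list(spec.keys()))
```

The row count is the product of the list lengths (2 × 3 × 3 = 18 here), so the full spec in the question would generate a much larger frame.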

Adding columns in panda with and without filter

I'm new to pandas. I'm trying to add columns to my df. There are multiple columns in the CSV. The column names include "Name", "Date", ..., "Problem", "Problem.1", "Problem.2", etc. The user is going to be downloading the files at different times and the number of problems will change, so I can't just list the problems.
I only want the columns: Name, Date, and all columns whose name contains the word "Problem".
I know this isn't correct but the idea is...
df=df['Name', 'Date', df.filter (regex = 'Problem')]
Any help is appreciated. Thank you in advance!
Use this:
df[['Name', 'Date'] + [col for col in df.columns if 'Problem' in col]]
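An equivalent sketch using DataFrame.filter (its like parameter does a substring match on column names) combined with pd.concat, shown on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["a", "b"],
    "Date": ["2021-01-01", "2021-01-02"],
    "Score": [1, 2],        # no "Problem" in the name -> dropped
    "Problem": [0, 1],
    "Problem.1": [1, 0],
})

# Keep Name/Date plus every column whose name contains "Problem"
result = pd.concat([df[["Name", "Date"]], df.filter(like="Problem")], axis=1)
```

This is close to what the question's pseudocode was reaching for: df.filter returns a sub-frame, so it has to be concatenated rather than placed inside a column list.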

Issues using mergeDynamicFrame on AWS Glue

I need to do a merge between two dynamic frames in Glue.
I tried to use the mergeDynamicFrame function, but I keep getting the same error:
AnalysisException: "cannot resolve 'id' given input columns: [];;\n'Project ['id]\n+- LogicalRDD false\n"
Right now, I have two DynamicFrames:
df_1(id, col1, salary_src) and df_2(id, name, salary)
I want to merge df_2 into df_1 by the "id" column.
df_1 = glueContext.create_dynamic_frame.from_catalog(......)
df_2 = glueContext.create_dynamic_frame.from_catalog(....)
merged_frame = df_1.mergeDynamicFrame(df_2, ["id"])
applymapping1 = ApplyMapping.apply(frame = merged_frame, mappings = [("id", "long", "id", "long"), ("col1", "string", "name", "string"), ("salary_src", "long", "salary", "long")], transformation_ctx = "applymapping1")
datasink2 = glueContext.write_dynamic_frame.from_options(....)
As a test I tried to pass a column from both DFs (salary and salary_src), and the error was:
AnalysisException: "cannot resolve 'salary_src' given input columns: [id, name, salary];;\n'Project [salary#2, 'salary_src]\n+- LogicalRDD [id#0, name#1, salary#2], false\n"
In this case, it seems to recognize the columns from df_2 (id, name, salary), but whether I pass just one of the columns or all three, it keeps failing.
It doesn't appear to be a mergeDynamicFrame issue.
Based on the information you provided, it looks like df_1, df_2, or both are not reading data correctly and are returning an empty DynamicFrame, which is why you see an empty list of input columns ("input columns: []").
If you are reading from S3, you must crawl your data before you can use glueContext.create_dynamic_frame.from_catalog.
You can also call df_1.show() or df_1.printSchema() after you create each DynamicFrame as a troubleshooting step, to make sure you are reading your data correctly before merging.

pandas merge produce duplicate columns

n1 = DataFrame({'zhanghui': [1, 2, 3, 4], 'wudi': [17, 'gx', 356, 23], 'sas': [234, 51, 354, 123]})
n2 = DataFrame({'zhanghui_x': [1, 2, 3, 5], 'wudi': [17, 23, 'sd', 23], 'wudi_x': [17, 23, 'x356', 23], 'wudi_y': [17, 23, 'y356', 23], 'ddd': [234, 51, 354, 123]})
The code above defines two DataFrame objects. I want to use the 'zhanghui' field from n1 and the 'zhanghui_x' field from n2 as the "on" fields to merge n1 and n2, so my code looks like this:
n1.merge(n2,how = 'inner',left_on = 'zhanghui',right_on='zhanghui_x')
and then result columns given like this :
sas wudi_x zhanghui ddd wudi_y wudi_x wudi_y zhanghui_x
Some duplicated columns appeared, such as 'wudi_x' and 'wudi_y'.
So is this a pandas internal problem, or am I using pd.merge incorrectly?
From the pandas documentation, the merge() function has the following signature:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None)
where suffixes denotes the suffix strings to be attached to overlapping columns, with defaults '_x' and '_y'.
I'm not sure if I understood your follow-up question correctly, but;
#case 1
If the first dataframe has column 'column_name_x' and the second dataframe has column 'column_name', then there are no overlapping columns and therefore no suffixes are attached.
#case 2
If the first dataframe has columns 'column_name' and 'column_name_x', and the second dataframe also has column 'column_name', the default suffixes attach to the overlapping columns; the first frame's 'column_name' therefore becomes 'column_name_x', duplicating the already existing column.
You can, however, pass None for one (not both) of the suffixes to ensure that the column names of that dataframe remain as-is.
Your approach is right; after a merge, pandas automatically appends suffixes (_x, _y, etc.) to columns that are "duplicated" against the original headers.
You can first select which columns to merge and then proceed:
cols_to_use = n2.columns - n1.columns
n1.merge(n2[cols_to_use],how = 'inner',left_on = 'zhanghui',right_on='zhanghui_x')
result columns:
sas wudi zhanghui ddd wudi_x wudi_y zhanghui_x
When I tried to run cols_to_use = n2.columns - n1.columns, it gave me a TypeError like this:
cannot perform __sub__ with this index type: <class 'pandas.core.indexes.base.Index'>
Then I tried the code below:
cols_to_use = [i for i in list(n2.columns) if i not in list(n1.columns)]
It worked fine; the result columns were:
sas wudi zhanghui ddd wudi_x wudi_y zhanghui_x
So @S Ringne's method really resolved my problem.
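On newer pandas versions, the same column selection can be written with Index.difference, which is what the - operator used to do before it was removed (the toy frames below just illustrate the call):

```python
import pandas as pd

n1 = pd.DataFrame({"zhanghui": [1, 2], "wudi": [17, 23], "sas": [234, 51]})
n2 = pd.DataFrame({"zhanghui_x": [1, 2], "wudi": [17, 23], "ddd": [3, 4]})

# Columns of n2 that do not already exist in n1 (returned sorted)
cols_to_use = n2.columns.difference(n1.columns)

merged = n1.merge(n2[cols_to_use], how="inner",
                  left_on="zhanghui", right_on="zhanghui_x")
```

Because no column name overlaps after the selection, no _x/_y suffixes are generated at all.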
=============================================
Pandas simply adds a suffix such as '_x' to resolve duplicate column names when merging two frames.
But what happens if a name of the form 'a-column-name' + '_x' already exists in one of the frames? I used to think pandas would check whether such a name already appears, but does it actually perform this check?
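Pandas does not rename around a pre-existing _x column; depending on the version, such a collision either produces duplicate labels or raises a MergeError. A sketch of the usual workaround, passing explicit suffixes so the generated names cannot collide with anything already present:

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2], "val": [10, 20]})
right = pd.DataFrame({"key": [1, 2], "val": [30, 40]})

# Explicit suffixes instead of the default ('_x', '_y'):
# overlapping 'val' becomes val_left / val_right
merged = left.merge(right, on="key", suffixes=("_left", "_right"))
```

Choosing suffixes that you know do not occur in either frame sidesteps the question of how pandas handles the collision.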

Create a dataframe from MultiLCA results in Brightway2

I am trying to create a pandas dataframe from the results of a MultiLCA calculation, using the methods as columns and the functional units as rows. I did find a sort of solution, but it is a bit cumbersome (I am not very good with dictionaries).
...
mlca = MultiLCA("my_calculation_setup")
dfresults = pd.DataFrame(mlca.results, columns=mlca.methods)
fu_names = []
for d in mlca.func_units:
    for key in d:
        fu_names.append(str(key))
dfresults['fu'] = fu_names
dfresults.set_index('fu', inplace=True)
Is there a more elegant way of doing this? The names are also very long, but that's a different story...
Your code seems relatively elegant to me. If you want to stick with str(key), then you could simplify it somewhat with a list comprehension:
mlca=MultiLCA("my_calculation_setup")
dfresults = pd.DataFrame(mlca.results, columns=mlca.methods)
dfresults['fu'] = [str(key) for demand in mlca.func_units for key in demand]
dfresults.set_index('fu', inplace=True)
Note that this only works if your demand dictionaries have one activity each. You could have situations where one demand dictionary would have two activities (like LCA({'foo': 1, 'bar': 2})), where this would fail because there would be too many elements in the fu list.
If you do know that you only have one activity per demand, then you can make a slightly nicer dataframe as follows:
mlca=MultiLCA("my_calculation_setup")
scores = pd.DataFrame(mlca.results, columns=mlca.methods)
as_activities = [
    (get_activity(key), amount)
    for dct in mlca.func_units
    for key, amount in dct.items()
]
nicer_fu = pd.DataFrame(
    [
        (x['database'], x['code'], x['name'], x['location'], x['unit'], y)
        for x, y in as_activities
    ],
    columns=('Database', 'Code', 'Name', 'Location', 'Unit', 'Amount')
)
nicer = pd.concat([nicer_fu, scores], axis=1)
However, in the general case dataframes are not a perfect match for calculation setups: when a demand dictionary has multiple activities, there is no nice way to "squish" this into one dimension or one row.