Flatten JSON array column in Dask DataFrame - dask-distributed

I have a Dask DataFrame that contains a column holding a JSON array, alongside the usual scalar columns. For example:
account_id, account_name, Audiences
22304923, Test, [{"id":1231230234, "name": "Audience 1"},{"id":972937423, "name": "Audience 2"}]
I want to flatten it to
account_id, account_name, audience_id, audience_name
22304923, Test, 1231230234, Audience 1
22304923, Test, 972937423, Audience 2
How should I do this with Dask, so that it still works on a large dataset processed with Dask's distributed execution?
Solution:
I was able to make it work:
import json

import pandas as pd


def flatten_data(input_df):
    """Flatten the data by the custom audience list under each record."""

    def flatten_audiences(df):
        # Parse the JSON string into a list of dicts, then give each audience its own row.
        df["audiences"] = df["audiences"].apply(json.loads)
        exploded = df.explode("audiences")
        result = pd.concat(
            [
                exploded[["account_name", "account_id"]].reset_index(drop=True),
                pd.json_normalize(exploded["audiences"]),
            ],
            axis=1,
        )
        # json_normalize produces "id" and "name"; rename them to the desired output columns.
        return result.rename(columns={"id": "audience_id", "name": "audience_name"})

    return input_df.map_partitions(
        flatten_audiences,
        meta={
            "account_name": "object",
            "account_id": "int64",
            "audience_id": "int64",   # ids are integers in the example JSON
            "audience_name": "object",
        },
    ).compute()
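For illustration, here is a small, hypothetical round trip through the function above, using the single account from the question as a one-partition Dask DataFrame (the column is assumed to be named audiences, matching the code):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame(
    {
        "account_id": [22304923],
        "account_name": ["Test"],
        "audiences": [
            '[{"id": 1231230234, "name": "Audience 1"},'
            ' {"id": 972937423, "name": "Audience 2"}]'
        ],
    }
)
ddf = dd.from_pandas(pdf, npartitions=1)
print(flatten_data(ddf))  # two rows: one per audience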

Related

Select cells in a pandas DataFrame by a Series of its column labels

Say we have a DataFrame and a Series of its column labels, both (almost) sharing a common index:
df = pd.DataFrame(...)
s = df.idxmax(axis=1).shift(1)
How can I obtain the cells given a series of column labels, taking the value from each row using the corresponding label from the joined series? I'd imagine it would be something like:
values = df[s] # either
values = df.loc[s] # or
In my example I'd like to get, for each row, the value sitting under the previous row's biggest value (I'm doing a poor man's ML :))
However I cannot find any interface selecting cells by series of columns. Any ideas folks?
Meanwhile I use this monstrous snippet:
def get_by_idxs(df: pd.DataFrame, idxs: pd.Series) -> pd.Series:
    ts_v_pairs = [
        (ts, row[row['idx']])
        for ts, row in df.join(idxs.rename('idx'), how='inner').iterrows()
        if isinstance(row['idx'], str)
    ]
    return pd.Series([v for ts, v in ts_v_pairs], index=[ts for ts, v in ts_v_pairs])
I think you need a DataFrame lookup:
v = s.dropna()
v[:] = df.to_numpy()[df.index.get_indexer_for(v.index), df.columns.get_indexer_for(v)]
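A minimal sketch on made-up data (random frame, hypothetical column names) showing the idea end to end:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 3), columns=list("abc"))
s = df.idxmax(axis=1).shift(1)  # previous row's argmax column; first entry is NaN

v = s.dropna()
v[:] = df.to_numpy()[df.index.get_indexer_for(v.index), df.columns.get_indexer_for(v)]
print(v)  # for each remaining row, the value under the previous row's maximum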

Repeated Measures ANOVA in Pandas, dependent variable values in different Columns

I am quite new to data science, so maybe this will be quite easy for more advanced coders.
I want to do a repeated measures ANOVA based on pre & post measurements of a test in different groups (Experimental Group vs. Control Group). Every subject only participated in one group.
In my Pandas - df I have the following columns:
"Subject ID" (unique), "Condition" (Experimental or Control), "Pre-Measure Value", "Post-Measure Value" ...
import pandas as pd

subject_id = [1, 2, 3, 4]
condition = [1, 2, 1, 2]
pre = [1.1, 2.1, 3.1, 4.1]
post = [1.2, 2.2, 3.2, 4.2]
sample_df = pd.DataFrame({"Subject ID": subject_id, "Condition": condition,
                          "Pre": pre, "Post": post})
sample_df
How can I analyze this using ANOVA?
The packages I've seen use dataframes where the dependent variable is in one column, whereas in my dataframe the dependent measures I want to evaluate are spread across two columns. Would I need to add another column specifying whether each value is pre or post, for every value and condition?
Is there a "handy" function to do something like this?
Specifically the output would need to look like:
subject_id_new = [1,1,2,2,3,3,4,4]
condition_new = [1,1,2,2,1,1,2,2]
measurement = ["pre", "post","pre", "post","pre", "post","pre", "post"]
value = [1.1, 1.2,2.1,2.2,3.1,3.2,4.1,4.2]
new_df = pd.DataFrame({"Subject ID":subject_id_new, "Condition": condition_new, "Measurement": measurement, "Value": value})
Thanks a lot.
Actually, what I was looking for is:
sample_df.melt(id_vars=['Subject ID', "Condition"])
This results in the dataframe with a column specifying which measurement point the value is referring to.
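For reference, a minimal sketch of the same melt with explicit column names (the standard var_name/value_name arguments), so the result matches the layout sketched above:

long_df = sample_df.melt(
    id_vars=["Subject ID", "Condition"],
    var_name="Measurement",
    value_name="Value",
).sort_values("Subject ID")

This is the long format that repeated-measures ANOVA helpers (e.g., statsmodels' AnovaRM or pingouin's rm_anova) typically expect.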

Issues using mergeDynamicFrame on AWS Glue

I need to do a merge between two DynamicFrames on Glue.
I tried to use the mergeDynamicFrame function, but I keep getting the same error:
AnalysisException: "cannot resolve 'id' given input columns: [];;\n'Project ['id]\n+- LogicalRDD false\n"
Right now, I have two DynamicFrames:
df_1(id, col1, salary_src) and df_2(id, name, salary)
I want to merge df_2 into df_1 by the "id" column.
df_1 = glueContext.create_dynamic_frame.from_catalog(......)
df_2 = glueContext.create_dynamic_frame.from_catalog(....)
merged_frame = df_1.mergeDynamicFrame(df_2, ["id"])
applymapping1 = ApplyMapping.apply(frame = merged_frame, mappings = [("id", "long", "id", "long"), ("col1", "string", "name", "string"), ("salary_src", "long", "salary", "long")], transformation_ctx = "applymapping1")
datasink2 = glueContext.write_dynamic_frame.from_options(....)
As a test I tried to pass a column from both frames (salary and salary_src), and the error was:
AnalysisException: "cannot resolve 'salary_src' given input columns: [id, name, salary];;\n'Project [salary#2, 'salary_src]\n+- LogicalRDD [id#0, name#1, salary#2], false\n"
In this case, it seems to recognize the columns from df_2 (id, name, salary), but if I pass just one of the columns, or even all three, it keeps failing.
It doesn't appear to be a mergeDynamicFrame issue.
Based on the information you provided, it looks like df_1, df_2, or both are not reading data correctly and are returning an empty DynamicFrame, which is why you see an empty list of input columns ("input columns: []").
If you are reading from S3, you must crawl your data before you can use glueContext.create_dynamic_frame.from_catalog.
You can also call df_1.show() or df_1.printSchema() after you create the DynamicFrame as a troubleshooting step, to make sure you are reading your data correctly before merging.
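A minimal sketch of that troubleshooting step (the database and table names below are placeholders, not from the question):

df_1 = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",   # hypothetical catalog database
    table_name="table_1"      # hypothetical table
)
df_2 = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="table_2"
)

# Both frames should show a non-empty schema and a non-zero count before merging.
df_1.printSchema()
df_2.printSchema()
print(df_1.count(), df_2.count())

merged_frame = df_1.mergeDynamicFrame(df_2, ["id"])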

Looping over only a certain range of object indices in python

I am trying to add a new column to a pandas dataframe, computed from two pre-existing columns in that dataframe. The issue I'm having is that the index of the dataframe is of object type, not integer type. To make things more complicated, I only want to fill a certain range of the dataframe, leaving the remaining cells in the new column as NaN. In order to cover only a certain range of the dataframe, I will have to use a "for" loop.
Here is my question: How do I loop over a certain range of the dataframe when I have an object index?
My initial pandas dataframe is simply...
import pandas as pd

dates = ['2005Q4', '2006Q1', '2006Q2', '2006Q3', '2006Q4', '2007Q1', '2007Q2']
col1 = [5.9805, 6.2181, 6.3508, 6.7878, 6.6212, 6.4583, 6.4068]
col2 = ['NaN', -0.001054985938, -0.121731711952, 0.046275331889,
        -0.017517211963, -0.023422842422, 0.009072170884]
data = pd.DataFrame(
    {
        'col1': col1,
        'col2': col2
    },
    columns=[
        'col1',
        'col2'
    ],
    index=dates
)
All I'm trying to do is something like this...
data['col3'] = 'NaN'
for i in range('2006Q1', '2006Q4', 1):
    data['col3'][i] = data['col1'][i-1] + \
        data['col2'][i]
Naively, I had hoped that Python would be able to correlate the name in the index with the integer position associated with that particular entry. For example, if I define the index as given, Python would know that '2005Q4' is index 0, '2006Q1' is index 1, etc. That way, I could use the strings in the range() function and it would still know which integers I'm referring to. However, this appears not to be the case.
I also need to avoid converting the index to datetime format. It is important that I keep the index in the 'YearQuarter' format, and I have yet to find a simple way of using pd.to_datetime that can do this.
Does anyone have any suggestions on how to loop over only a certain range of object-based indices in python?
Calling .index() on a list returns the position of the item you're looking for. Try this tweak to your for loop:
for i in range(dates.index('2006Q1'),dates.index('2006Q4'),1):
There are, of course, much more efficient ways to do this. .shift() shifts an entire column up or down by however many rows you want:
data['col3'] = data['col1'].shift(1) + pd.to_numeric(data['col2'], errors='coerce')
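If the range restriction matters (filling only 2006Q1 through 2006Q4 and leaving the other rows as NaN), one hedged sketch is to compute the shifted sum once and assign only a label slice of it; Series assignment aligns on the index, so the remaining rows stay NaN. This assumes col2 has been converted to numeric, since arithmetic on the 'NaN' strings would fail.

import pandas as pd

dates = ['2005Q4', '2006Q1', '2006Q2', '2006Q3', '2006Q4', '2007Q1', '2007Q2']
col1 = [5.9805, 6.2181, 6.3508, 6.7878, 6.6212, 6.4583, 6.4068]
col2 = ['NaN', -0.001054985938, -0.121731711952, 0.046275331889,
        -0.017517211963, -0.023422842422, 0.009072170884]
data = pd.DataFrame({'col1': col1, 'col2': col2}, index=dates)
data['col2'] = pd.to_numeric(data['col2'], errors='coerce')

# col1 of the previous row plus col2 of the current row, for every row...
candidate = data['col1'].shift(1) + data['col2']
# ...then keep only the labels of interest; rows outside the slice stay NaN.
data['col3'] = candidate.loc['2006Q1':'2006Q4']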

Create a dataframe from MultiLCA results in Brightway2

I am trying to create a pandas dataframe from the results of a MultiLCA calculation, using as columns the methods and as rows the functional units. I did find a sort of solution, but it is a bit cumbersome (I am not very good with dictionaries)
...
mlca = MultiLCA("my_calculation_setup")
dfresults = pd.DataFrame(mlca.results, columns=mlca.methods)
fu_names = []
for d in mlca.func_units:
    for key in d:
        fu_names.append(str(key))
dfresults['fu'] = fu_names
dfresults.set_index('fu', inplace=True)
Is there a more elegant way of doing this? The names are also very long, but that's a different story...
Your code seems relatively elegant to me. If you want to stick with str(key), then you could simplify it somewhat with a list comprehension:
mlca = MultiLCA("my_calculation_setup")
dfresults = pd.DataFrame(mlca.results, columns=mlca.methods)
dfresults['fu'] = [str(key) for demand in mlca.func_units for key in demand]
dfresults.set_index('fu', inplace=True)
Note that this only works if your demand dictionaries have one activity each. You could have situations where one demand dictionary would have two activities (like LCA({'foo': 1, 'bar': 2})), where this would fail because there would be too many elements in the fu list.
If you do know that you only have one activity per demand, then you can make a slightly nicer dataframe as follows:
mlca = MultiLCA("my_calculation_setup")
scores = pd.DataFrame(mlca.results, columns=mlca.methods)
as_activities = [
    (get_activity(key), amount)
    for dct in mlca.func_units
    for key, amount in dct.items()
]
nicer_fu = pd.DataFrame(
    [
        (x['database'], x['code'], x['name'], x['location'], x['unit'], y)
        for x, y in as_activities
    ],
    columns=('Database', 'Code', 'Name', 'Location', 'Unit', 'Amount')
)
nicer = pd.concat([nicer_fu, scores], axis=1)
However, in the general case dataframes are not a perfect match for calculation setups. When a demand dictionary has multiple activities, there is no nice way to "squish" this into one dimension or one row.