Issues using mergeDynamicFrame on AWS Glue - dataframe

I need to do a merge between two dynamic frames in Glue.
I tried to use the mergeDynamicFrame function, but I keep getting the same error:
AnalysisException: "cannot resolve 'id' given input columns: [];;\n'Project ['id]\n+- LogicalRDD false\n"
Right now, I have two DynamicFrames:
df_1(id, col1, salary_src) and df_2(id, name, salary)
I want to merge df_2 into df_1 by the "id" column.
df_1 = glueContext.create_dynamic_frame.from_catalog(......)
df_2 = glueContext.create_dynamic_frame.from_catalog(....)
merged_frame = df_1.mergeDynamicFrame(df_2, ["id"])
applymapping1 = ApplyMapping.apply(frame = merged_frame, mappings = [("id", "long", "id", "long"), ("col1", "string", "name", "string"), ("salary_src", "long", "salary", "long")], transformation_ctx = "applymapping1")
datasink2 = glueContext.write_dynamic_frame.from_options(....)
As a test I tried to pass a column from both DFs (salary and salary_src), and the error was:
AnalysisException: "cannot resolve 'salary_src' given input columns: [id, name, salary];;\n'Project [salary#2, 'salary_src]\n+- LogicalRDD [id#0, name#1, salary#2], false\n"
In this case, it seems to recognize the columns from df_2 (id, name, salary), but if I pass just one of the columns, or even all three, it keeps failing.

It doesn't appear to be a mergeDynamicFrame issue.
Based on the information you provided, it looks like df_1, df_2, or both are not reading the data correctly and are returning an empty DynamicFrame, which is why the error reports an empty list of input columns ("input columns: []").
If you are reading from S3, you must crawl your data before you can use glueContext.create_dynamic_frame.from_catalog.
You can also call df_1.show() or df_1.printSchema() after you create each DynamicFrame, as a troubleshooting step, to make sure you are reading your data correctly before merging.
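For instance, a quick sanity check along these lines (just a sketch; df_1 and df_2 are the frames created above) will show whether either frame came back empty before you attempt the merge:
# Sketch: confirm both DynamicFrames actually contain data before merging
for name, frame in [("df_1", df_1), ("df_2", df_2)]:
    print(name, "row count:", frame.count())  # 0 rows means nothing was read from the catalog/S3
    frame.printSchema()                        # an empty schema explains "input columns: []"

# Only merge once both frames show the expected columns
merged_frame = df_1.mergeDynamicFrame(df_2, ["id"])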

Related

How to Parse nested json column to two columns called key and value

I have a source table with 3 columns. One of the columns contains JSON values. Some of the rows contain simple JSON, but some of the rows contain nested JSON, as in the image's source table. I want the target table to look like the attached image. Could someone help with PySpark or SQL code to do this in Databricks?
This JSON doesn't have a fixed schema. It can vary in different ways, but ultimately it is JSON.
source and target tables
I am expecting PySpark code for the above question.
Here is the sample code used to achieve this.
%py
from pyspark.sql.functions import from_json, explode_outer
from pyspark.sql.types import MapType, StringType

df1 = spark.sql("select eventId, AppId, eventdata from tableA")
df1 = df1.withColumn("EventData", from_json(df1.eventdata, MapType(StringType(), StringType())))
df1 = df1.select(df1.eventId, df1.AppId, explode_outer(df1.EventData))
display(df1)
This resulted in the output below:
[output][1]
Here is a sample of the JSON:
{
"brote":"AKA",
"qFilter":"{\"xfilters\":[{\"Molic\":\"or\",\"filters\":[{\"logic\":\"and\",\"field\":\"Name\",\"operator\":\"contains\",\"value\":\"*R-81110\"},{\"logic\":\"and\",\"field\":\"Title\",\"operator\":\"contains\",\"value\":\"*R-81110\"}]}],\"pSize\":200,\"page\":1,\"ignoreConfig\":false,\"relatedItemFilters\":[],\"entityType\":\"WAFADocuments\"}",
"config":"[\"PR_NMO\"]",
"title":"All Documents",
"selected":"PR_NMO",
"selectedCreateConfig":"PR_NMO",
"selectedQueryConfigs":[
"PR_CVO"
],
"selectedRoles":[
"RL_ZAC_Planner"
]
}
[1]: https://i.stack.imgur.com/Oftvr.png
The requirement is hard to achieve as the schema of the nested values is not fixed. To do it with the sample you have given, you can use the following code:
from pyspark.sql.functions import from_json, explode_outer, concat, lit
from pyspark.sql.types import MapType, StringType, ArrayType, StructType, StructField

# Parse the JSON column into a map of key/value strings and explode it into rows
df1 = df.withColumn("EventData", from_json(df.EventData, MapType(StringType(), StringType())))
df1 = df1.select(df1.eventID, df1.AppID, explode_outer(df1.EventData))
# df1.show()

# Handle the nested 'orders' entries separately with an explicit schema
df2 = df1.filter(df1.key == 'orders')
user_schema = ArrayType(
    StructType([
        StructField("id", StringType(), True),
        StructField("type", StringType(), True)
    ])
)
df3 = df2.withColumn("value", from_json("value", user_schema)).selectExpr("eventID", "AppID", "key", "inline(value)")
# DataFrame.melt is available in Spark 3.4+
df3 = df3.melt(['eventID', 'AppID', 'key'], ['id', 'type'], 'sub_order', 'val')
req = df3.withColumn('key', concat(df3.key, lit('.'), df3.sub_order))

# Union the untouched keys back with the expanded nested keys
final_df = df1.filter(df1.key != 'orders').union(req.select('eventID', 'AppID', 'key', 'val'))
final_df.show()
This might not be possible if the schema is constantly changing.

Repeated Measures ANOVA in Pandas, dependent variable values in different Columns

I am quite new to data science, so maybe this will be quite easy for more advanced coders.
I want to do a repeated measures ANOVA based on pre & post measurements of a test in different groups (Experimental Group vs. Control Group). Every subject only participated in one group.
In my Pandas - df I have the following columns:
"Subject ID" (unique), "Condition" (Experimental or Control), "Pre-Measure Value", "Post-Measure Value" ...
import pandas as pd

subject_id = [1, 2, 3, 4]
condition = [1, 2, 1, 2]
pre = [1.1, 2.1, 3.1, 4.1]
post = [1.2, 2.2, 3.2, 4.2]
sample_df = pd.DataFrame({"Subject ID": subject_id, "Condition": condition, "Pre": pre, "Post": post})
sample_df
How can I analyze this using ANOVA?
The packages I've seen use dataframes where the dependent variable is in one column, whereas in my dataframe the dependent measures I want to evaluate are in two columns. Would I need to add another column specifying whether the value is pre or post for every value and condition?
Is there a "handy" function to do something like this?
Specifically the output would need to look like:
subject_id_new = [1,1,2,2,3,3,4,4]
condition_new = [1,1,2,2,1,1,2,2]
measurement = ["pre", "post","pre", "post","pre", "post","pre", "post"]
value = [1.1, 1.2,2.1,2.2,3.1,3.2,4.1,4.2]
new_df = pd.DataFrame({"Subject ID":subject_id_new, "Condition": condition_new, "Measurement": measurement, "Value": value})
Thanks a lot.
Actually, what I was looking for is:
sample_df.melt(id_vars=['Subject ID', "Condition"])
This results in a dataframe with a column specifying which measurement point each value refers to.
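For completeness, the same melt call can name the generated columns explicitly so the result matches the new_df layout from the question (standard pandas arguments, shown here as a sketch):
long_df = sample_df.melt(
    id_vars=['Subject ID', 'Condition'],
    value_vars=['Pre', 'Post'],
    var_name='Measurement',   # holds 'Pre' / 'Post'
    value_name='Value'
)
print(long_df)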

Adding columns in pandas with and without filter

I'm new to pandas. I'm trying to add columns to my df. There are multiple columns in the CSV. The names of the columns include "Name", "Date", ..., "Problem", "Problem.1", "Problem.2", etc. The user is going to be downloading the files at different times and the number of problems will change, so I can't just list the problems.
I only want the columns: Name, Date, and all columns whose name contains the word "Problem".
I know this isn't correct, but the idea is...
df = df['Name', 'Date', df.filter(regex='Problem')]
Any help is appreciated. Thank you in advance!!!
Use this:
df[['Name', 'Date'] + [col for col in df.columns if 'Problem' in col]]
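Since the question already mentions df.filter, a roughly equivalent variant (a sketch, not part of the original answer) is to concatenate the fixed columns with the regex-filtered ones:
import pandas as pd

# Keep 'Name' and 'Date', plus every column whose name contains 'Problem'
subset = pd.concat([df[['Name', 'Date']], df.filter(regex='Problem')], axis=1)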

Generate a pyarrow schema in the format of a list of pa.fields?

Is there a way for me to generate a pyarrow schema in this format from a pandas DF? I have some files which have hundreds of columns so I can't type it out manually.
fields = [
    pa.field('id', pa.int64()),
    pa.field('date', pa.timestamp('ns')),
    pa.field('name', pa.string()),
    pa.field('status', pa.dictionary(pa.int8(), pa.string(), ordered=False)),
]
I'd like to save it in a file and then refer to it explicitly when I save data with to_parquet.
I tried to use schema = pa.Schema.from_pandas(df) but when I print out schema it is in a different format (I can't save it as a list of data type tuples like the fields example above).
Ideally, I would take a pandas dtype dictionary and then remap it into the fields list above. Is that possible?
schema = {
    'id': 'int64',
    'date': 'datetime64[ns]',
    'name': 'object',
    'status': 'category',
}
Otherwise, I will make the dtype schema, print it out and paste it into a file, make any required corrections, and then do a df = df.astype(schema) before saving the file to Parquet. However, I know I can run into issues with fully null columns in a partition or object columns with mixed data types.
I really don't understand why pa.Schema.from_pandas(df) doesn't work for you.
As far as I understand it, what you need is this:
schema = pa.Schema.from_pandas(df)
fields = []
for col_name, col_type in zip(schema.names, schema.types):
    fields.append(pa.field(col_name, col_type))
or using list comprehension:
schema = pa.Schema.from_pandas(df)
fields = [pa.field(col_name, col_type) for col_name, col_type in zip(schema.names, schema.types)]
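If you also want to persist that schema and point to it when writing, one possible sketch (the file name here is made up, and passing schema= through pandas' to_parquet assumes the pyarrow engine) is:
import pyarrow as pa

schema = pa.schema(fields)

# Save the schema to disk using Arrow's IPC serialization
with open("schema.arrow", "wb") as f:
    f.write(schema.serialize())

# Later: load it back and hand it to to_parquet, which forwards it to pyarrow
with open("schema.arrow", "rb") as f:
    saved_schema = pa.ipc.read_schema(pa.BufferReader(f.read()))

df.to_parquet("data.parquet", engine="pyarrow", schema=saved_schema)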

Data slicing a pandas frame - I'm having problems with unique

I am facing issues trying to select a subset of columns and running unique on it.
Source Data:
import pandas as pd

df_raw = pd.read_csv('data/master.csv', nrows=10000)
df_raw.shape
Produces:
(10000, 86)
Process Data:
df = df_raw[['A','B','C']]
df.shape
Produces:
(10000, 3)
Furthermore, doing:
df_raw.head()
df.head()
produces a correct list of rows and columns.
However,
print('RAW:',sorted(df_raw['A'].unique()))
works perfectly
Whilst:
print('PROCESSED:',sorted(df['A'].unique()))
produces:
AttributeError: 'DataFrame' object has no attribute 'unique'
What am I doing wrong? If the shape and head output are exactly what I want, I'm confused why my processed dataset is throwing errors. I did read Pandas 'DataFrame' object has no attribute 'unique' on SO, which correctly states that unique needs to be applied to columns, which is what I am doing.
This was a case of a duplicate column. Given this is proprietary data, I abstracted it as 'A', 'B', 'C' in this question, which masked the problem. (The real data set had 86 columns; I had duplicated one of those columns in my subset and was trying to run unique on it.)
My problem was this:
df_raw = pd.read_csv('data/master.csv', nrows=10000)
df = df_raw[['A','B','C', 'A']] # <-- I did not realize until later that I had duplicated 'A' here.
This was causing problems when doing a unique on 'A'.
To extract a subset of data from the entire dataframe based on a column Id, this works:
df = df.drop_duplicates(subset=['Id'])  # where 'Id' is the column used to filter
print(df)
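For anyone hitting the same thing: when a column label is duplicated, selecting it returns a DataFrame rather than a Series, which is why .unique() disappears. A small illustrative sketch (the data here is made up):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [3, 4, 5], 'C': [6, 7, 8]})
subset = df[['A', 'B', 'C', 'A']]                      # 'A' selected twice
print(type(subset['A']))                               # DataFrame, not Series -> no .unique()

deduped = subset.loc[:, ~subset.columns.duplicated()]  # drop the repeated column label
print(sorted(deduped['A'].unique()))                   # works again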