PySpark: remove a field from a struct column in a dataframe

I want to remove a part of a value in a struct and save that version of the value as a new column in my dataframe, which looks something like this:
column
{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"}
I want to remove the field C and its values and save the rest as one new column, without splitting the A, B, D fields into separate columns. What I want should look like this:
column
{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"}
newColumn
{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "D": "XD"}
I have successfully removed C by converting my dataframe to a dict, but now I can't manage to convert it back into ONE column. My attempt at removing C looks like this:
dfTemp = df.select('column').collect()[0][0].asDict(True)
dfDict = {}
for k in dfTemp:
    if k != 'C':
        dfDict[k] = dfTemp[k]
If you have a better way to remove part of a struct like mine while keeping the result in one column (without adding more rows), or if you know how to convert a dict back to a dataframe without splitting the key/value pairs into separate columns, please leave a suggestion.

Assuming your column is of type string and contains JSON, you can first parse it into a StructType using from_json like this:
from pyspark.sql import functions as F

df = spark.createDataFrame([
    ('{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"}',)
], ["column"])
df = df.withColumn(
    "parsed_column",
    F.from_json("column", "struct<A:string,B:int,C:struct<CA:int,CB:string>,D:string>")
)
Now removing the field C from the struct column:
Spark >=3.1
Use dropFields method:
result = df.withColumn("newColumn", F.to_json(F.col("parsed_column").dropFields("C"))).drop("parsed_column")
result.show(truncate=False)
#+-----------------------------------------------------------------------------------------+----------------------------------------------------+
#|column |newColumn |
#+-----------------------------------------------------------------------------------------+----------------------------------------------------+
#|{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"}|{"A":"2022-01-26T14:21:32.214+0000","B":69,"D":"XD"}|
#+-----------------------------------------------------------------------------------------+----------------------------------------------------+
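As a side note (worth checking against your version's docs), dropFields also accepts dot-separated nested field names, so you could drop just one subfield instead of all of C; a hypothetical variant:
# Drop only C.CB, keeping C.CA (Spark >= 3.1):
partial = df.withColumn(
    "newColumn",
    F.to_json(F.col("parsed_column").dropFields("C.CB"))
).drop("parsed_column")
# Expected newColumn: {"A":"...","B":69,"C":{"CA":42},"D":"XD"}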
Spark <3.1
Recreate the struct column, filtering out the field C:
result = df.withColumn(
    "newColumn",
    F.to_json(
        F.struct(*[
            F.col(f"parsed_column.{c}").alias(c)
            for c in df.selectExpr("parsed_column.*").columns if c != 'C'
        ])
    )
).drop("parsed_column")
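For reference, this should print the same result as the dropFields version above:
result.show(truncate=False)
# newColumn: {"A":"2022-01-26T14:21:32.214+0000","B":69,"D":"XD"}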
Another method: parse the JSON string into MapType, then apply map_filter to remove the key C:
result = df.withColumn(
    "newColumn",
    F.to_json(
        F.map_filter(
            F.from_json("column", "map<string,string>"),
            lambda k, v: k != "C"
        )
    )
)
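Two caveats, hedged: the Python lambda form of map_filter only arrived in PySpark 3.1, so on Spark 3.0 you would go through F.expr instead; and parsing with map<string,string> coerces every value to a string, so B is re-serialized as "69" rather than 69. A sketch of the expr variant:
# Spark 3.0 fallback: map_filter as a SQL higher-order function.
result = df.withColumn(
    "newColumn",
    F.to_json(F.expr(
        "map_filter(from_json(column, 'map<string,string>'), (k, v) -> k != 'C')"
    ))
)
# Note: with map<string,string>, numeric values come back quoted,
# e.g. "B":"69" instead of "B":69.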

Well, it's not as trivial as it would seem. First, your approach is not meant for Spark, unless you're working with very little data (in which case you don't need Spark) and you're better off using pure Python, as you tried. Using collect() fetches all the data onto the driver, which would not work with large data.
The distributed approach for this is as follows:
1. infer the schema on part of your JSON data (unless you want to write it by hand, which is tedious)
2. transform your dataframe with this schema to get access to named attributes
3. select the attributes you need and convert back to JSON
I tried to decompose as much as I could here:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import json
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
# Create input data
data = [json.dumps({"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"})]
df = spark.createDataFrame(data, "string").toDF("colA")
df.show()
+-----------------------------------------------------------------------------------------+
|colA |
+-----------------------------------------------------------------------------------------+
|{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"}|
+-----------------------------------------------------------------------------------------+
# Infer schema - inferring on the first 10 rows
s = df.select(F.col("colA").alias("s")).rdd.map(lambda x: x.s).take(10)
schema = spark.read.json(sc.parallelize(s)).schema
print(schema)
# StructType(List(StructField(A,StringType,true),StructField(B,LongType,true),StructField(C,StructType(List(StructField(CA,LongType,true),StructField(CB,StringType,true))),true),StructField(D,StringType,true)))
# read JSON string with schema
new_df = df.withColumn("colB", F.from_json("colA", schema))
new_df.show(truncate=False)
+-----------------------------------------------------------------------------------------+---------------------------------------------------+
|colA |colB |
+-----------------------------------------------------------------------------------------+---------------------------------------------------+
|{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"}|{2022-01-26T14:21:32.214+0000, 69, {42, Hello}, XD}|
+-----------------------------------------------------------------------------------------+---------------------------------------------------+
# Finally ...
new_df.select(F.to_json(F.struct("colB.A", "colB.B", "colB.D")).alias("colC")).show(truncate=False)
+----------------------------------------------------+
|colC |
+----------------------------------------------------+
|{"A":"2022-01-26T14:21:32.214+0000","B":69,"D":"XD"}|
+----------------------------------------------------+
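If you'd rather not hard-code the kept fields, the names can also be derived from the inferred schema; a small sketch using the schema object from above:
# Build the keep-list from the inferred schema, excluding "C".
keep = [f"colB.{field.name}" for field in schema.fields if field.name != "C"]
new_df.select(F.to_json(F.struct(*keep)).alias("colC")).show(truncate=False)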

Related

Pandas Convert Data Type of List inside a Dictionary

I have the following data structure:
import pandas as pd
names = {'A': [20, 5, 20],
         'B': [18, 7, 13],
         'C': [19, 6, 18]}
I was able to convert the data type for A, B, C from object to string as follows:
df = df.astype({'A': 'string', 'B': 'string', 'C': 'string'}, errors='raise')
How can I convert the data types in the list to float64?
You can convert the dictionary to a dataframe and then cast the dataframe to float:
import pandas as pd
names = {'A': [20, 5],
         'B': [18, 7],
         'C': [19, 6]}
df = pd.DataFrame(names)
df = df.astype('float64')
If you don't want to use a dataframe, you can do it like this:
names = {k: [float(i) for i in v] for k, v in names.items()}
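A quick check of the resulting dtypes, using the df built above:
print(df.dtypes)
# A    float64
# B    float64
# C    float64
# dtype: object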

How do columns work in a Pandas Dataframe after using GroupBy

Basically, I want to use the iterrows method to loop through my grouped dataframe, but I can't figure out how the columns work. In the example below, it does not create columns called "Group1" and "Group2" like one might expect. Is one of the columns a dtype itself?
import pandas as pd
df = pd.DataFrame(columns=["Group1", "Group2", "Amount"])
df = df.append({"Group1": "Apple", "Group2": "Red Delicious", "Amount": 15}, ignore_index=True)
df = df.append({"Group1": "Apple", "Group2": "McIntosh", "Amount": 20}, ignore_index=True)
df = df.append({"Group1": "Apple", "Group2": "McIntosh", "Amount": 30}, ignore_index=True)
df = df.append({"Group1": "Apple", "Group2": "Fuju", "Amount": 7}, ignore_index=True)
df = df.append({"Group1": "Orange", "Group2": "Navel", "Amount": 9}, ignore_index=True)
df = df.append({"Group1": "Orange", "Group2": "Navel", "Amount": 5}, ignore_index=True)
df = df.append({"Group1": "Orange", "Group2": "Mandarin", "Amount": 12}, ignore_index=True)
print(df.dtypes)
print(df.to_string())
df_sum = df.groupby(['Group1', 'Group2']).sum(['Amount'])
print("---- Sum Results----")
print(df_sum.dtypes)
print(df_sum.to_string())
for index, row in df_sum.iterrows():
    # The line below is what I want to do conceptually.
    # print(row.Group1, row.Group2, row.Amount)  # raises: 'Series' object has no attribute 'Group1'
    print(row.Amount)
The part of the output we are interested in is here. I noticed that "Group1" and "Group2" are on a line below "Amount".
---- Sum Results----
Amount int64
dtype: object
Amount
Group1 Group2
Apple Fuju 7
McIntosh 50
Red Delicious 15
Orange Mandarin 12
Navel 14
After a groupby, the grouping keys become the index rather than regular columns, which is why Group1 and Group2 print on a line below Amount. Calling reset_index() turns them back into columns. Simply try:
df_sum = df.groupby(['Group1', 'Group2'])['Amount'].sum().reset_index()
OR
df_sum = df.groupby(['Group1', 'Group2'])['Amount'].agg('sum').reset_index()
It can even be as simple as the following, since we are summing based on Group1 & Group2 only:
df_sum = df.groupby(['Group1', 'Group2']).sum().reset_index()
Another way:
df_sum = df.groupby(['Group1', 'Group2']).agg({'Amount': 'sum'}).reset_index()
Try to reset_index:
df_sum = df.groupby(['Group1', 'Group2']).sum(['Amount']).reset_index()
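With the index reset, the grouping keys are ordinary columns again, so the loop from the question works as intended (using the df_sum from any of the variants above):
for index, row in df_sum.iterrows():
    print(row.Group1, row.Group2, row.Amount)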

Efficient column MultiIndex ordering

I have this dataframe:
import pandas as pd

df = pd.DataFrame({'A': [2000, 2000, 2000, 2000, 2000, 2000],
                   'B': ["A+", "B+", "A+", "B+", "A+", "B+"],
                   'C': ["M", "M", "M", "F", "F", "F"],
                   'D': [1, 5, 3, 4, 2, 6],
                   'Value': [11, 12, 13, 14, 15, 16]}).set_index(['A', 'B', 'C', 'D'])
df = df.unstack(['C', 'D']).fillna(0)
And I'm wondering if there is a more elegant way to order the columns MultiIndex than the following code:
# rows ordering
df = df.sort_values(by = ['A', "B"], ascending = [True, True])
# col ordering
df = df.transpose().sort_values(by = ["C", "D"], ascending = [False, False]).transpose()
Especially, I feel like the last line with the two transposes is far more complex than it should be. I tried using sort_index but wasn't able to use it in a MultiIndex context (for both rows and columns).
You can use sort_index on both levels:
out = df.sort_index(level=[0, 1], axis=1, ascending=[True, False])

I can use axis=1, and therefore the last line becomes:
df = df.sort_values(axis=1, by=["C", "D"], ascending=[True, False])
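A further sketch, assuming the level names 'C' and 'D' from the unstack above: both axes can be handled with sort_index alone, avoiding the transpose round-trip entirely.
# Rows: sort the A/B index levels ascending.
df = df.sort_index(axis=0, ascending=True)
# Columns: sort the C and D levels descending, as in the original code.
df = df.sort_index(axis=1, level=["C", "D"], ascending=[False, False])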

How can I convert my dataset into JSON format like my required format

I want to convert my dataset (shown as a screenshot in the original post) into this JSON format using pandas:
y = {'name':['a','b','c'],"rollno":[1,2,3],"teacher":'xyz',"year":1998}
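Since the dataset itself is only available as a screenshot, the answer below assumes a reconstruction along these lines (hypothetical, inferred from the desired output):
import pandas as pd
# Hypothetical reconstruction of the screenshot data:
df = pd.DataFrame({'name': ['a', 'b', 'c'],
                   'rollno': [1, 2, 3],
                   'teacher': ['xyz', 'xyz', 'xyz'],
                   'year': [1998, 1998, 1998]})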
First create a dictionary with DataFrame.to_dict and collapse lists with a single distinct value to scalars, using a dictionary comprehension with if-else that checks the length of the set:
d = {k: v if len(set(v)) > 1 else v[0] for k, v in df.to_dict('list').items()}
print (d)
{'name': ['a', 'b', 'c'], 'rollno': [1, 2, 3], 'teacher': 'xyz', 'year': 1998}
And then convert to json:
import json
j = json.dumps(d)
print (j)
{"name": ["a", "b", "c"], "rollno": [1, 2, 3], "teacher": "xyz", "year": 1998}
If the values should stay duplicated:
import json
j = json.dumps(df.to_dict(orient='list'))
print (j)
{"name": ["a", "b", "c"], "rollno": [1, 2, 3], "teacher": ["xyz", "xyz", "xyz"], "year": [1998, 1998, 1998]}

Convert a dictionary within a list to rows in pandas

I currently have a data frame like this (shown as a screenshot in the original post), and I would like to explode the "listing" column into rows, using the keys in the dictionaries as column names. Ideally, the data frame would look like this:
eventId listingId currentPrice
103337923 1307675567 ...
103337923 1307675567 ...
103337923 1307675567 ...
This is what I get with print(listing_df.head(3).to_dict()).
Definitely there should be a better way to do this. But this works. :)
import pandas as pd
df1 = pd.DataFrame(
    {"a": [1, 2, 3, 4],
     "b": [5, 6, 7, 8],
     "c": [[{"x": 17, "y": 18, "z": 19}, {"x": 27, "y": 28, "z": 29}],
           [{"x": 37, "y": 38, "z": 39}, {"x": 47, "y": 48, "z": 49}],
           [{"x": 57, "y": 58, "z": 59}, {"x": 27, "y": 68, "z": 69}],
           [{"x": 77, "y": 78, "z": 79}, {"x": 27, "y": 88, "z": 89}]]})
Now you can create a new DataFrame from the above:
df2 = pd.DataFrame(columns=df1.columns)
df2_index = 0
for row in df1.iterrows():
    one_row = row[1]
    for list_value in row[1]["c"]:
        one_row["c"] = list_value
        df2.loc[df2_index] = one_row
        df2_index += 1
Output is the way you need it. Now that we have expanded the list into separate rows, you can further expand the JSON dicts into columns with:
df2[list(df2["c"].head(1).tolist()[0].keys())] = df2["c"].apply(
    lambda x: pd.Series([x[key] for key in x.keys()]))
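On newer pandas, a shorter route is explode plus json_normalize; a sketch, assuming pandas >= 1.1 (for ignore_index) and the df1 defined above:
# Expand each list element of "c" into its own row.
df2 = df1.explode("c", ignore_index=True)
# Expand the dicts into x/y/z columns and join them back by position.
flat = pd.json_normalize(df2.pop("c").tolist())
df2 = df2.join(flat)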
Hope it helps!