Pandas - Break nested json into multiple rows

I have my Dataframe in the below structure. I would like to break it into multiple rows based on the nested values within the details column.
cust_id, name, details
101, Kevin, [{"id":1001,"country":"US","state":"OH"}, {"id":1002,"country":"US","state":"GA"}]
102, Scott, [{"id":2001,"country":"US","state":"OH"}, {"id":2002,"country":"US","state":"GA"}]
Expected output
cust_id, name, id, country, state
101, Kevin, 1001, US, OH
101, Kevin, 1002, US, GA
102, Scott, 2001, US, OH
102, Scott, 2002, US, GA

df = df.explode('details').reset_index(drop=True)
df = df.merge(pd.json_normalize(df['details']), left_index=True, right_index=True).drop('details', axis=1)
df.explode("details") basically duplicates each row in the details N times, where N is the number of items in the array (if any) of details of that row
Since explode duplicates the rows, the original rows' indices (0 and 1) are copied to the new rows, so their indices are 0, 0, 1, 1, which messes up later processing. reset_index() creates a fresh new column for the index, starting at 0. drop=True is used because by default pandas will keep the old index column; this removes it.
pd.json_normalize(df['details']) converts the column (where each row contains a JSON object) to a new dataframe where each key unique of all the JSON objects is new column
df.merge() merges the new dataframe into the original one
left_index=True and right_index=True tell pandas to align the two dataframes on their indexes rather than on a column, so row 0 of the normalized dataframe is merged onto row 0 of the original, row 1 onto row 1, and so on.
.drop('details', axis=1) gets rid of the old details column containing the old objects
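For reference, here is a minimal end-to-end sketch built from the sample rows above:
import pandas as pd
# sample data from the question
df = pd.DataFrame({
    "cust_id": [101, 102],
    "name": ["Kevin", "Scott"],
    "details": [
        [{"id": 1001, "country": "US", "state": "OH"},
         {"id": 1002, "country": "US", "state": "GA"}],
        [{"id": 2001, "country": "US", "state": "OH"},
         {"id": 2002, "country": "US", "state": "GA"}],
    ],
})
# one row per item in details, with a fresh 0..n-1 index
df = df.explode("details").reset_index(drop=True)
# turn each dict into its own columns and align on the index
df = df.merge(pd.json_normalize(df["details"]),
              left_index=True, right_index=True).drop("details", axis=1)
print(df)
#    cust_id   name    id country state
# 0      101  Kevin  1001      US    OH
# 1      101  Kevin  1002      US    GA
# 2      102  Scott  2001      US    OH
# 3      102  Scott  2002      US    GA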

Related

merging to find only unique values

I have a simple requirement to insert data into MongoDB only if the data is not already present, based on a primary key (my primary key is a combination of 3 attributes: item_id, process_name and actor), using a Lambda script. Explanation of what I am doing currently:
1). df is the dataframe loaded from the input csv file; it will have only the headers needed for the process. This df data has to be inserted into my db.
df = wr.s3.read_csv(path=part_file_url, low_memory=False)
df = df[header]
NOTE: header will have the primary key attributes along with additional default attributes for insertion.
2). I query my collection by process name and id. Further, I use project_keys to bring in only the primary key attributes of all records from the db, as follows:
project_keys ={'_id': 0, 'item_id': 1, 'process_name': 1, 'actor': 1}
"queryString": [ "item_id", "process_name","actor"]
curr_data_df = pd.DataFrame(list(audit_collection.find({"process_name":process_name, "processId":process_id}, project_keys)), columns=queryString)
3). Now to find the unique values, I do the following:
result = df.merge(curr_data_df,how='left', indicator=True) #step 3.a
result = result[result['_merge']=='left_only'] # step 3.b
df = result.drop(columns='_merge') # step 3.c
Once I fetch the unique values in df, I convert every item in df and do insert_many into my collection.
Queries:
Is my above solution of finding the unique values in the input df with a left merge on the primary key right?
In my merge function, I don't specify any column names, since from my database I bring in only the primary key columns and my df will already have those 3 keys. So by default will it take the common columns "item_id", "process_name", "actor" as the merge keys, or do I have to specify them explicitly? During my testing with small data in the Python shell the above worked, but I wanted this confirmed.
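For what it's worth, here is a small sketch of that default: when on= is not given, pd.merge uses the intersection of the two frames' column names (here the three key columns), though being explicit with on=[...] makes the intent clearer. The frames below are made up for illustration:
import pandas as pd
# hypothetical input frame with the three keys plus an extra column
df = pd.DataFrame({"item_id": [1, 2], "process_name": ["p", "p"],
                   "actor": ["a", "b"], "extra": [10, 20]})
# hypothetical frame of keys already present in the database
curr_data_df = pd.DataFrame({"item_id": [1], "process_name": ["p"], "actor": ["a"]})
# no on= given, so the merge keys default to the common columns
result = df.merge(curr_data_df, how="left", indicator=True)
new_only = result[result["_merge"] == "left_only"].drop(columns="_merge")
print(new_only)  # only item_id 2 survives, since (1, "p", "a") already exists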
I saw one weird case today that I am unable to figure out:
I uploaded the same file a second time to check whether this duplication check works. See the logs:
Number of records in original input df: 148 (df from step 1)
Original MONGODB source data-> Number of records: 684, Number of columns: 3 (curr_data_df from step 2)
Merge input*mongodb -> Number of records: 4166, Number of columns: 53 (step 3.a)
Unique new records -> Number of records after removing duplicates: 0, Number of columns: 53 (step 3.b)
Input df Unique records -> Number of records in final df: 0, Number of columns: 53 (step 3.c)
So although the df correctly identified in the last step that there are 0 unique new records, why did the merge in step 3.a produce 4166 records when the input had only 148 records and the database data only 684? Any insights here?

split content of a column pandas

I have the following Pandas Dataframe
Which can also be generated using this list of dictionaries:
list_of_dictionaries = [
{'Project': 'A', 'Hours': 2, 'people_ids': [16986725, 17612732]},
{'Project': 'B', 'Hours': 2, 'people_ids': [17254707, 17567393, 17571668, 17613773]},
{'Project': 'C', 'Hours': 3, 'people_ids': [17097009, 17530240, 17530242, 17543865, 17584457, 17595079]},
{'Project': 'D', 'Hours': 2, 'people_ids': [17097009, 17584457, 17702185]}]
I have implemented something close to what I need, but by adding a separate column for each id:
df['people_id1']=[x[0] for x in df['people_ids'].tolist()]
df['people_id2']=[x[1] for x in df['people_ids'].tolist()]
That gives me a separate column for each people_id, but only up to the second element: when I try to extract the 3rd element into a third column it crashes, because there is no 3rd element in the first row.
What I am actually trying to do is to extract every people_id from the people_ids column into its own row, with each one keeping its associated values from the Project and Hours columns, so I get a dataset like this one:
Any idea on how could I get this output?
I think what you are looking for is explode on the 'people_ids' column.
df = df.explode('people_ids', ignore_index=True)
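For reference, a minimal sketch built from the first two entries of list_of_dictionaries above:
import pandas as pd
list_of_dictionaries = [
    {'Project': 'A', 'Hours': 2, 'people_ids': [16986725, 17612732]},
    {'Project': 'B', 'Hours': 2, 'people_ids': [17254707, 17567393, 17571668, 17613773]},
]
df = pd.DataFrame(list_of_dictionaries)
# one row per people_id; Project and Hours are repeated for each id,
# and ignore_index=True renumbers the index 0..n-1
df = df.explode('people_ids', ignore_index=True)
print(df)
#   Project  Hours  people_ids
# 0       A      2    16986725
# 1       A      2    17612732
# 2       B      2    17254707
# 3       B      2    17567393
# 4       B      2    17571668
# 5       B      2    17613773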

comparing and removing rows in pandas

I am trying to create a new object by comparing two lists. If the rows match, the row should be removed from splitted_row_list, or else appended to a new list containing only the differences between both lists.
results = []
for row in splitted_row_list:
    print(row)
    for row1 in all_rows:
        if row1 == row:
            splitted_row_list.remove(row)
        else:
            results.append(row)
print(results)
However, this code just returns all the rows. Does anyone have a suggestion?
Sample data
all_rows[0]:'1390', '139080', '13980', '1380', '139080', '13080'
splitted_row_list[0]:'35335','53527','353529','242424','5222','444'
As I understand it, you want to compare two lists by index and keep the differences, and you want to do it with pandas (because of the tag):
So here are two lists for example:
ls1=[0,10,20,30,40,50,60,70,80,90]
ls2=[0,15,20,35,40,55,60,75,80,95]
I make a pandas dataframe with these lists, and build a mask to filter out the matching values:
df= pd.DataFrame(data={'ls1':ls1, 'ls2':ls2})
mask= df['ls1']!=df['ls2']
I can then call the different values for each list using the mask:
# list 1
df[mask]['ls1'].values
out: array([10, 30, 50, 70, 90])
and
# list 2
df[mask]['ls2'].values
out: array([15, 35, 55, 75, 95])
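Putting the pieces above into one self-contained snippet; df[mask] also keeps the mismatching pairs side by side, which is handy if you want the differences from both lists together:
import pandas as pd
ls1 = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
ls2 = [0, 15, 20, 35, 40, 55, 60, 75, 80, 95]
df = pd.DataFrame(data={'ls1': ls1, 'ls2': ls2})
mask = df['ls1'] != df['ls2']  # True wherever the lists disagree at that index
print(df[mask])
#    ls1  ls2
# 1   10   15
# 3   30   35
# 5   50   55
# 7   70   75
# 9   90   95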

Adding columns from every other row in a pandas dataframe

In the picture (linked below) you can see the start of my data frame. I would like to make two new columns that consist of the values (confirmed_cases and deaths) and to get rid of the 'Type' column. Essentially I want there to be one row of data for each county, with confirmed_cases and deaths columns added using the values already in the data. I tried the code below, but obviously the length of values does not match the length of the index.
Any suggestions?
apidata['Confirmed_Cases'] = apidata['values'].iloc[::2].values
apidata['Deaths'] = apidata['values'].iloc[1::2].values
(Sorry about the link to the photo, I am too new to the site to be able to just include the photo in the post)
Maybe if there's a way to double how many times each value is posted in the new column? So first five deaths would be [5, 5, 26, 26, 0] and then I can just delete every other row?
I ended up figuring it out, by creating a second dataframe that deleted half of the rows (every other one) from the first dataframe and then adding the values from the original dataframe into the new one.
# keep every other row (the confirmed_cases rows); .copy() avoids the
# SettingWithCopyWarning when adding columns to the slice
apidata2 = apidata.iloc[::2].copy()
apidata2['Confirmed_Cases'] = apidata['values'].iloc[::2].values
apidata2['Deaths'] = apidata['values'].iloc[1::2].values
apidata2.head()
Finished output

PySpark Create new column from transformations in another dataframe

Looking for a more functional and computationally efficient approach in PySpark:
I have a master table (containing billions of rows); the columns of interest are:
id - (String),
tokens - (Array(string))- ex, ['alpha', 'beta', 'gamma']
-- (Calling it dataframe, df1)
I have another summary table which contains the top 25 tokens, like:
-- (Calling it dataframe, df2)
Ex:
Token
Alpha
Beta
Zi
Mu
Now to this second table (or dataframe) I wish to append a column which contains, for each token, the list of ids from the first table, so that the result looks like:
Token Ids
Alpha [1, 2, 3]
Beta [3, 5, 6, 8, 9]
Zi [2, 8, 12]
Mu [1, 15, 16, 17]
Present Approach:
From the df2, figure out the distinct tokens and store it as a list (say l1).
(For every token from list, l1):
Filter df1 to extract the unique ids as a list, call it l2
Add this new list (l2) as a new column (Ids) to the dataframe (df2) to create a new dataframe (df3)
persist df3 to a table
I agree this is a terrible approach, and for any given l1 with 100k records it will run forever. Can anyone help me rewrite the code (for PySpark)?
You can alternatively join both tables on a new column that essentially contains the tokens exploded into individual rows. That would help with computational efficiency, allocated resources and the required processing time.
Additionally, Spark offers several out-of-the-box join optimizations, including the map-side (broadcast) join, which would further help here.
Explode the tokens array column of df1, then left join with df2 on the lower-cased tokens and token, then groupBy token and collect the ids as a set:
from pyspark.sql import functions as f
#explode tokens column for joining with df2
df1 = df1.withColumn('tokens', f.explode('tokens'))
#case-insensitive left join, collecting the ids as a set for each token
df2.join(df1, f.lower(df1.tokens) == f.lower(df2.token), 'left')\
.groupBy('token')\
.agg(f.collect_set('id').alias('ids'))\
.show(truncate=False)
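For illustration, here is a self-contained sketch of the same approach on toy data (the ids and tokens below are made up for the example):
from pyspark.sql import SparkSession, functions as f
spark = SparkSession.builder.appName("token_ids_demo").getOrCreate()
# toy versions of the two tables described in the question
df1 = spark.createDataFrame(
    [(1, ["alpha", "beta"]), (2, ["alpha", "zi"]), (3, ["beta"])],
    ["id", "tokens"],
)
df2 = spark.createDataFrame([("Alpha",), ("Beta",), ("Zi",), ("Mu",)], ["token"])
# explode tokens so each (id, token) pair becomes its own row
exploded = df1.withColumn("tokens", f.explode("tokens"))
# case-insensitive left join, then collect the ids per token
result = (df2.join(exploded, f.lower(exploded.tokens) == f.lower(df2.token), "left")
          .groupBy("token")
          .agg(f.collect_set("id").alias("ids")))
result.show(truncate=False)
# Mu has no matching ids in df1, so its set comes back empty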
I hope the answer is helpful