Split and merge nested DataFrame in Python - pandas

I have a dataframe with two columns, one of which itself contains dataframes. It looks like below:
I want a dataframe with 3 columns, containing "Date_Region", "transformed_weight" and "Barcode", which would repeat each "Date_Region" value once for every row of its nested "Weights-Barcode" dataframe. The final dataframe should look like below:

This will do:
pd.concat(
    iter(final_df.apply(
        lambda row: row['Weights-Barcode'].assign(
            Date_Region=row['Date_Region'],
        ),
        axis=1,
    )),
    ignore_index=True,
)[['Date_Region', 'transformed_weight', 'Barcode']]
From the inside out:
final_df.apply(..., axis=1) will call the lambda function on each row.
The lambda function uses assign() to return the nested DataFrame from that row with an addition of the Date_Region column with the value from the outside.
Calling iter(...) on the resulting series results in an iterable of the DataFrames already including the added column.
Then pd.concat(...) concatenates that iterable into a single DataFrame. I'm using ignore_index=True here to just reindex everything again (your index doesn't seem meaningful, and not ignoring it would leave you with duplicates).
Finally, I reorder the columns so the added Date_Region column becomes the leftmost one.
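For reference, here is the whole pipeline as a minimal runnable sketch. The nested frames and region labels are made-up stand-ins for the structure described above:

```python
import pandas as pd

# Hypothetical toy data: each row holds a Date_Region label and a
# nested DataFrame of transformed_weight / Barcode pairs.
nested_a = pd.DataFrame({'transformed_weight': [1.0, 2.0],
                         'Barcode': ['b1', 'b2']})
nested_b = pd.DataFrame({'transformed_weight': [3.0],
                         'Barcode': ['b3']})
final_df = pd.DataFrame({'Date_Region': ['2020-01_EU', '2020-01_US'],
                         'Weights-Barcode': [nested_a, nested_b]})

# Flatten: add Date_Region to each nested frame, then concatenate them.
flat = pd.concat(
    iter(final_df.apply(
        lambda row: row['Weights-Barcode'].assign(
            Date_Region=row['Date_Region'],
        ),
        axis=1,
    )),
    ignore_index=True,
)[['Date_Region', 'transformed_weight', 'Barcode']]
print(flat)
```

Each Date_Region value is repeated once per row of its nested frame, so the toy data above yields three rows.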


Pandas splitting a column with new line separator

I am extracting tables from pdf using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True), and that splits it into two columns; however, I want the new column names to be A and B rather than 0 and 1. Also, I need to pass a generalized column label instead of the actual column name, since I need to implement this for several docs which may have different column names. I can find such a column in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However, when I pass colNew to the split function, it throws an attribute error:
df[colNew].str.split('\n', 2, expand=True)
AttributeError: 'DataFrame' object has no attribute 'str'
You can take advantage of the Pandas split function.
import pandas as pd

# Recreate the example column above.
df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})

# First, make sure the column is of string type.
# Second, split the column on the separator '\n'.
# Third, pass expand=True so the split produces two new columns.
test = df['A\nB'].astype('str').str.split('\n', expand=True)

# Rename the new columns.
test.columns = ['A', 'B']
I hope this is helpful.
I reproduced the error on my side. I think the issue is that df[colNew] is still a DataFrame, because colNew is an Index of labels rather than a single label.
But .str.split() only works on a Series. So, starting from your code, I would convert it to a Series using iloc[:,0].
Then another line to split the column headers:
df2 = df[colNew].iloc[:, 0].str.split('\n', n=2, expand=True)
df2.columns = 'A\nB'.split('\n')
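Putting the pieces together, here is a short sketch that finds the newline-bearing column generically and reuses its own header for the new column names (the sample frame is made up):

```python
import pandas as pd

# Hypothetical example frame with a merged 'A\nB' column.
df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})

# Find the column whose label contains a newline, as a single label.
col = df.columns[df.columns.str.contains('\n')][0]

# Split the values, name the new columns from the old header,
# and swap them in for the original column.
split = df[col].str.split('\n', expand=True)
split.columns = col.split('\n')
df = df.drop(columns=col).join(split)
print(df)
```

Because the new names come from splitting the old header itself, this works for any document where the merged header uses the same '\n' separator as the values.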

How do you split All columns in a large pandas data frame?

I have a very large data frame, and I want to split ALL of the columns except the first two on a comma delimiter. So I need to reference the column names logically, in a loop or some other way, to split all the columns in one pass.
In my testing of the split method:
I have been able to explicitly refer to (i.e. hard-code) a single column name (rs145629793) as one of the required parameters, and the result was 2 new columns, as I wanted.
See python code below
HARDCODED COLUMN NAME --
df[['rs1','rs2']] = df.rs145629793.str.split(",", expand = True)
The problem:
It is not feasible to write out each actual column name and repeat the code.
I then replaced the actual column name rs145629793 with columns[2] in the split method's parameter list.
It results in an ERROR:
'str' object has no attribute 'str'
You can index columns by position rather than name using iloc. For example, to get the third column:
df.iloc[:, 2]
Thus you can easily loop over the columns you need.
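For example, here is a sketch that loops positionally over everything after the first two columns. The column names below are invented for illustration:

```python
import pandas as pd

# Hypothetical frame: two ID columns followed by comma-separated columns.
df = pd.DataFrame({'chrom': ['1', '2'],
                   'pos': ['100', '200'],
                   'rs145629793': ['0,1', '1,1'],
                   'rs12345': ['1,0', '0,0']})

# Split every column after the first two, keeping the original name
# as a prefix so the new columns stay identifiable.
for col in df.columns[2:]:
    parts = df[col].str.split(',', expand=True).add_prefix(col + '_')
    df = df.drop(columns=col).join(parts)
print(df.columns.tolist())
```

df.columns[2:] is evaluated once before the loop starts, so reassigning df inside the loop does not disturb the iteration.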
I know what you are asking, but it's still helpful to provide some sample input data and expected output data. I have included random input data in my code below, so you can just copy, paste and run this, and then adapt it to your dataframe:
import pandas as pd

your_dataframe = pd.DataFrame({'a': ['1,2,3', '9,8,7'],
                               'b': ['4,5,6', '6,5,4'],
                               'c': ['7,8,9', '3,2,1']})

def split_cols(df):
    # Loop over a snapshot of the column labels, since df changes inside the loop.
    for col in df.columns.to_list():
        new = df[col].str.split(',', expand=True).add_prefix(col)
        df = pd.merge(df, new, how='left',
                      left_index=True, right_index=True).drop(col, axis=1)
    return df

split_cols(your_dataframe)
Essentially, this solution loops through the list of columns that you want to split. For each column it runs split(), merges the new columns back together on the index, and drops the original. I also:
included a prefix of the column name, so the new columns do not have duplicate names and are more easily identifiable
dropped the old column that we did the split on.
If you need to keep the first two columns intact, slice the list instead, e.g. df.columns.to_list()[2:]. Then just call the split_cols() function with your dataframe.

Pandas Dataframe: How to get the cell instead of its value

I have a task to compare two dataframes with the same column names but different sizes; call them previous and current. I am trying to get the difference between previous and current in the Quantity and Booked columns and highlight it in yellow. The common key between the two dataframes is the 'SN' column.
I have coded out the following:
for idx, rows in df_n.iterrows():
    if rows["Quantity"] == rows['Available'] + rows['Booked']:
        continue
    else:
        rows["Quantity"] = rows["Quantity"] - rows['Available'] - rows['Booked']
        df_n.loc[idx, 'Quantity'].style.applymap('background-color: yellow')
    # pdb.set_trace()
    if (df_o['Booked'][df_o['SN'] == rows["SN"]] != rows['Booked']).bool():
        df_n.loc[idx, 'Booked'].style.apply('background-color: yellow')
I realise I have a few problems here and need some help
df_n.loc[idx, 'Quantity'] returns a value instead of a dataframe type. How can I get a dataframe from one cell? Do I have to use pd.DataFrame(data=df_n.loc[idx, 'Quantity'], index=idx, columns='Quantity')? Will this create a copy or will it update the reference?
How do I compare the SN of both dataframes? I'm looking for a better way to compare. One thing I could think of is to set 'SN' as the index for both dataframes and, when finished using them, reset them back?
My dataframe:
Previous dataframe
Current Dataframe
df_n.loc[idx, 'Quantity'] returns value instead of a dataframe type. How can I get a dataframe from one cell. Do I have to pd.DataFrame(data=df_n.loc[idx, 'Quantity'], index=idx, columns='Quantity'). Will this create a copy or will update the reference?
To create a DataFrame from one cell you can try: df_n.loc[idx, ['Quantity']].to_frame().T
How do I compare the SN of both dataframe, looking for a better way to compare. One thing I could think of is to use set index for both dataframe and when finished using them, reset them back?
You can use df_n.merge(df_o, on='SN') to merge the dataframes and compare the columns side by side.
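As a minimal sketch of that merge-and-compare idea, with invented values for the two frames:

```python
import pandas as pd

# Hypothetical previous (df_o) and current (df_n) frames keyed on 'SN'.
df_o = pd.DataFrame({'SN': ['s1', 's2'], 'Quantity': [10, 5], 'Booked': [2, 1]})
df_n = pd.DataFrame({'SN': ['s1', 's2'], 'Quantity': [10, 6], 'Booked': [3, 1]})

# Merge on the common key; suffixes distinguish the two versions.
merged = df_n.merge(df_o, on='SN', suffixes=('_new', '_old'))

# Boolean mask of rows whose Booked value changed.
changed = merged['Booked_new'] != merged['Booked_old']
print(merged.loc[changed, 'SN'].tolist())
```

The same mask could then be fed to a Styler function to color the changed cells, instead of styling row by row inside iterrows().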

How do I stack 3-D arrays in a grouped pandas dataframe?

I have a pandas dataframe that consists of two columns: a column of string identifiers and a column of 3-D arrays. The arrays have been grouped by the ID. How can I stack all the arrays for each group so that there is a single stacked array for each ID? The code I have is as follows:
df1 = pd.DataFrame({'IDs': ids})
df2 = pd.DataFrame({'arrays':arrays})
df = pd.concat([df1, df2], axis=1)
grouped = df['arrays'].groupby(df['IDs'])
(I attempted np.dstack(grouped), but this was unsuccessful.)
I believe this is what you want:
df.groupby('IDs')['arrays'].apply(np.dstack).to_frame().reset_index()
It will apply the np.dstack(...) function to each group of arrays sharing an ID.
The apply() function returns a pd.Series (with IDs as index), so we then use to_frame() to create a DataFrame from it and reset_index() to put IDs back as a column.
(Note: The documentation for apply() talks about using agg() for efficiency, but unfortunately it doesn't seem to be possible to use agg() with a function that returns an ndarray, such as np.dstack. In that case, agg() wants to treat that array as multiple objects, as a series, rather than as a single object... My attempts with it resulted in an exception saying "function does not reduce".)
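A small runnable sketch of the approach, using made-up IDs and arrays:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two IDs, each owning two (2, 2, 1) arrays.
ids = ['a', 'a', 'b', 'b']
arrays = [np.ones((2, 2, 1)) * i for i in range(4)]
df = pd.DataFrame({'IDs': ids, 'arrays': arrays})

# Depth-stack all arrays within each ID group into one array per ID.
stacked = df.groupby('IDs')['arrays'].apply(np.dstack).to_frame().reset_index()

# Each ID now maps to a single (2, 2, 2) array.
print(stacked.loc[0, 'arrays'].shape)
```

np.dstack concatenates along the third axis, so each group's two (2, 2, 1) arrays become one (2, 2, 2) array.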

pandas DataFrame.corr returns a one-by-one DF

I can't work out what is wrong in the way I use the df.corr() function.
For a DF with 2 columns it returns only a 1×1 resulting DF.
In:
merged_df[['Citable per Capita','Citations']].corr()
Out:
[screenshot of the 1×1 resulting DF]
What can be the problem here? I expected to see as many rows and columns as there were columns in the original DF.
I found the problem: the first column's values had the wrong dtype.
To convert all the columns to numeric, use:
df = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))
Note that apply returns a new DataFrame rather than modifying df in place, which is why the reassignment is necessary.
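A small made-up reproduction of the symptom and the fix. This sketch uses errors='coerce', which newer pandas versions prefer over the deprecated errors='ignore':

```python
import pandas as pd

# One column was read in as strings, so corr() silently drops it.
df = pd.DataFrame({'Citable per Capita': ['1.0', '2.0', '3.0'],
                   'Citations': [10, 20, 30]})
print(df.corr(numeric_only=True).shape)   # only one numeric column -> (1, 1)

# Convert every column to numeric, then corr() sees both columns.
df = df.apply(pd.to_numeric, errors='coerce')
print(df.corr().shape)                    # -> (2, 2)
```

After the conversion the correlation matrix has the expected one row and one column per column of the original DF.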