How do I stack 3-D arrays in a grouped pandas dataframe? - pandas

I have a pandas dataframe that consists of two columns: a column of string identifiers and a column of 3-D arrays. The arrays have been grouped by the ID. How can I stack all the arrays for each group so that there is a single stacked array for each ID? The code I have is as follows:
df1 = pd.DataFrame({'IDs': ids})
df2 = pd.DataFrame({'arrays':arrays})
df = pd.concat([df1, df2], axis=1)
grouped = df['arrays'].groupby(df['IDs'])
(I attempted np.dstack(grouped), but this was unsuccessful.)

I believe this is what you want:
df.groupby('IDs')['arrays'].apply(np.dstack).to_frame().reset_index()
It will apply the np.dstack(...) function to each group of arrays sharing an ID.
The apply() function returns a pd.Series (with IDs as index), so we then use to_frame() to create a DataFrame from it and reset_index() to put IDs back as a column.
(Note: The documentation for apply() talks about using agg() for efficiency, but unfortunately it doesn't seem possible to use agg() with a function that returns an ndarray, such as np.dstack. In that case, agg() wants to treat the returned array as a series of separate objects rather than as a single object; my attempts with it resulted in an exception saying "function does not reduce".)
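For concreteness, here is a minimal, self-contained sketch of that call with made-up IDs and tiny 3-D arrays (the names, shapes and values are purely illustrative, and the exact wrapping of the result may vary slightly between pandas versions):
import numpy as np
import pandas as pd
# toy data: two IDs, each with two small 3-D arrays
ids = ['a', 'a', 'b', 'b']
arrays = [np.zeros((2, 2, 1)), np.ones((2, 2, 1)),
          np.zeros((2, 2, 1)), np.ones((2, 2, 1))]
df = pd.DataFrame({'IDs': ids, 'arrays': arrays})
# depth-stack every group's arrays into one array per ID
stacked = df.groupby('IDs')['arrays'].apply(np.dstack).to_frame().reset_index()
print(stacked.loc[0, 'arrays'].shape)  # (2, 2, 2): two (2, 2, 1) arrays stacked along axis 2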

Related

How do I combine multiple dataframes using a repeating index system

I have multiple dataframes that I want to combine and only want to use the indexing system of the first dataframe. The problem is the indices I want to use are repeating and I want to keep it that way.
df = pd.concat([df1, df2, df3], axis=1, join='inner')
This gives me InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Just so it's clear, df1 has repeating indices (0-9, repeating multiple times), whereas df2 and df3 are single-column dataframes with non-repeating indices. The number of rows does match, though.
From what I understand, your index repeats itself on df1. That is what is causing the given error InvalidIndexError: Reindexing only valid with uniquely valued Index objects: since the index loops through the values 0-9 over and over, pandas can never tell which row to join with which row, because the indexes are repeated and therefore not unique. My approach would be to just use join, but hey, if you want to use concat for your own reasons, there is a workaround too.
A few ways to do this:
Just use the join function:
df1.join([df2,df3])
But if you insist on using concat, I would do:
x = df1.index                       # keep the original (repeating) index
df1 = df1.reset_index(drop=True)    # reset_index returns a copy, so reassign it
df = pd.concat([df1, df2, df3], axis=1, join='inner')
df.index = x                        # restore the repeating index
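To make the concat route concrete, here is a small sketch with toy frames (the column names and values are invented for illustration):
import pandas as pd
# df1 has a repeating index; df2 and df3 have default unique indexes but the same row count
df1 = pd.DataFrame({'a': [10, 11, 12, 13]}, index=[0, 1, 0, 1])
df2 = pd.DataFrame({'b': [20, 21, 22, 23]})
df3 = pd.DataFrame({'c': [30, 31, 32, 33]})
x = df1.index                      # keep the repeating index aside
df1 = df1.reset_index(drop=True)   # align everything positionally
df = pd.concat([df1, df2, df3], axis=1, join='inner')
df.index = x                       # restore the repeating index
print(df)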

Groupby does return previous df without changing it

df=pd.read_csv('../input/tipping/tips.csv')
df_1 = df.groupby(['day','time'])
df_1.head()
What am I missing here? It returns the previous dataframe to me, without the groupby applied.
We can print each group using the following:
df_1 = df.groupby(['day','time']).apply(print)
groupby doesn't work the way you are assuming, by the sounds of it. Calling head on the grouped dataframe returns the first 5 rows of each group, stitched back together in the original row order, so the output can look just like the ungrouped dataframe. You can use #tlentali's approach to print out each group, but df_1 will not be assigned the grouped dataframe that way; it will hold None (once per group), since that is what print returns.
The way below gives a lot of control over how to show/display the groups and their keys.
This might also help you understand more about how the grouped dataframe structure in pandas works.
df_1 = df.groupby(['day', 'time'])
# for each (day, time) key and its grouped data
for key, group in df_1:
    # show the (day, time) key
    print(key)
    # display the head of the grouped data
    print(group.head())
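If you only need one particular group, get_group fetches it directly (the ('Sun', 'Dinner') key below is just an example value assumed to exist in the tips data):
# fetch a single group by its (day, time) key
sun_dinner = df_1.get_group(('Sun', 'Dinner'))
print(sun_dinner.head())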

Find dates and difference between extreme observations

The function passed to apply must take a dataframe as its first argument and return a DataFrame, Series or scalar. apply will then take care of combining the results back together into a single dataframe or series. apply is therefore a highly flexible grouping method.
While apply is a very flexible method, its downside is that using it can be quite a bit slower than using more specific methods like agg or transform. Pandas offers a wide range of methods that will be much faster than using apply for their specific purposes, so try to use them before reaching for apply.
The easiest approach is an aggregation with groupby and then a select:
# make index a column
df = df.reset_index()
# get min of holdings for each ticker
lowest = df[['ticker','holdings']].groupby('ticker').min()
print(lowest)
# select lowest by performing a left join (lowest with original)
# this gives only the matching rows of df in return
lowest_dates = lowest.merge(df, on=['ticker','holdings'], how='left')
print(lowest_dates)
If you just want a Series of Date you can use this function:
def getLowest(df):
    df = df.reset_index()
    lowest = df[['ticker', 'holdings']].groupby('ticker').min()
    lowest_dates = lowest.merge(df, on=['ticker', 'holdings'], how='left')
    return lowest_dates['Date']
From my point of view it would be better to return the entire dataframe, so you know which ticker was lowest and when. In this case you can:
    return lowest_dates
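As a quick illustration of what the function above returns, here is a tiny made-up frame (tickers, holdings and dates invented) run through getLowest:
import pandas as pd
# toy data indexed by Date, mirroring the layout implied by the question
df = pd.DataFrame({
    'Date': ['2021-03-01', '2021-03-02', '2021-03-01', '2021-03-02'],
    'ticker': ['AAA', 'AAA', 'BBB', 'BBB'],
    'holdings': [10, 5, 7, 9],
}).set_index('Date')
print(getLowest(df))  # the Date on which each ticker hit its minimum holdings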

Split and merge nested DataFrame in Python

I have a dataframe, which has two columns. One of the columns is also another dataframe. It looks like below:
I want to have a dataframe with 3 columns, containing "Date_Region", "transformed_weight" and "Barcode", which would replicate each "Date_Region" row as many times as the length of its "Weight-Barcode" dataframe. The final dataframe should look like below:
This will do:
pd.concat(
    iter(final_df.apply(
        lambda row: row['Weights-Barcode'].assign(
            Date_Region=row['Date_Region'],
        ),
        axis=1,
    )),
    ignore_index=True,
)[['Date_Region', 'transformed_weight', 'Barcode']]
From the inside out:
final_df.apply(..., axis=1) will call the lambda function on each row.
The lambda function uses assign() to return the nested DataFrame from that row with an addition of the Date_Region column with the value from the outside.
Calling iter(...) on the resulting series results in an iterable of the DataFrames already including the added column.
pd.concat(...) on that iterable then concatenates them all together. I'm using ignore_index=True here to just reindex everything again (your index doesn't seem meaningful to me, and not ignoring it means you'd end up with duplicates).
Finally, I'm reordering the columns so the added Date_Region column becomes the leftmost one.
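To see the whole pattern end to end, here is a self-contained sketch with a tiny made-up nested frame (all values are placeholders; the behaviour of apply returning DataFrames can differ slightly between pandas versions):
import pandas as pd
# two nested frames, one per Date_Region, as described in the question
nested_a = pd.DataFrame({'transformed_weight': [1.0, 2.0], 'Barcode': ['x1', 'x2']})
nested_b = pd.DataFrame({'transformed_weight': [3.0], 'Barcode': ['y1']})
final_df = pd.DataFrame({'Date_Region': ['2020-01_US', '2020-01_EU'],
                         'Weights-Barcode': [nested_a, nested_b]})
flat = pd.concat(
    iter(final_df.apply(
        lambda row: row['Weights-Barcode'].assign(Date_Region=row['Date_Region']),
        axis=1,
    )),
    ignore_index=True,
)[['Date_Region', 'transformed_weight', 'Barcode']]
print(flat)  # three rows: the US region twice, the EU region once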

Does a DataFrame with a single row have all the attributes of a DataFrame?

I am slicing a DataFrame from a large DataFrame, and the daughter df has only one row. Does a daughter df with a single row have the same attributes as the parent df?
import numpy as np
import pandas as pd
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,2),index=dates,columns=['col1','col2'])
df1=df.iloc[1]
type(df1)
>> pandas.core.series.Series
df1.columns
>>'Series' object has no attribute 'columns'
Is there a way I can use all the attributes of pd.DataFrame on a pd.Series?
Possibly what you are looking for is a dataframe with one row:
>>> pd.DataFrame(df1).T # T -> transpose
col1 col2
2013-01-02 -0.428913 1.265936
What happens when you do df.iloc[1] is that pandas converts that row to a Series, which is one-dimensional, and the columns become the index. You can still do df1['col1'], but you can't do df1.columns, because a Series is basically a single column, and the old columns are now the new index.
As a result, you can retrieve the former columns like this:
>>> df1.index.tolist()
['col1', 'col2']
This used to confuse me quite a bit. I also expected df.iloc[1] to be a dataframe with one row, but it has always been the default behavior of pandas to automatically convert any one-dimensional dataframe slice (whether row or column) to a Series. It's pretty natural for a column, but less so for a row (since the columns become the index); it really is not a problem once you understand what is happening.
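As a side note (not part of the answer above), if you want to skip the Series conversion altogether, passing a list to iloc keeps the slice two-dimensional:
df1 = df.iloc[[1]]   # list selector -> one-row DataFrame instead of a Series
print(type(df1))     # <class 'pandas.core.frame.DataFrame'>
print(df1.columns)   # Index(['col1', 'col2'], dtype='object')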