Reassigning pandas column in place from a slice of another dataframe

So I learned from this answer to this question that in pandas 0.20.3 and above, reassigning the values of a dataframe column works without raising the SettingWithCopyWarning in any of the following ways:
df = pd.DataFrame(np.ones((5, 6)), columns=['one', 'two', 'three',
                                            'four', 'five', 'six'])
df.one *= 5
df.two = df.two * 5
df.three = df.three.multiply(5)
df['four'] = df['four'] * 5
df.loc[:, 'five'] *= 5
df.iloc[:, 5] = df.iloc[:, 5] * 5
HOWEVER
If I were to take a part of that dataframe like this for example:
df1 = df[(df.index > 1) & (df.index < 5)]
And then try one of the above methods for reassigning a column like so:
df1.one *= 5
then I get the SettingWithCopyWarning.
So is this a bug, or am I just missing something about the way pandas expects this kind of operation to work?
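One common way to resolve this, sketched below with the same frame as above: the warning appears because df1 may be a view into df, so pandas cannot tell whether you mean to modify df too. Taking an explicit .copy() of the slice removes the ambiguity and the warning.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((5, 6)),
                  columns=['one', 'two', 'three', 'four', 'five', 'six'])

# .copy() makes df1 an independent frame, so assigning into it
# cannot silently fail to propagate back to df -- and the
# SettingWithCopyWarning goes away.
df1 = df[(df.index > 1) & (df.index < 5)].copy()
df1.one *= 5
```

After this, df1['one'] holds 5.0 in every row while df['one'] is still all 1.0.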

Related

One to One mapping of data in the row in pandas

I have a dataset that looks like this,
and I want the output of the data frame to look like this. So it's a kind of one-to-one mapping of row values. Assume option1 and option2 have the same number of comma-separated values.
Please let me know how I can achieve this.
You can use the zip() function from the standard Python library and the explode() method of a pandas DataFrame, like this:
df["option1"] = df["option1"].str.split(",")
df["option2"] = df["option2"].str.split(",")
df["option3"] = df["option3"] * max(df["option1"].str.len().max(), df["option2"].str.len().max())
new_df = pd.DataFrame(
    df.apply(lambda x: list(zip(x["option1"], x["option2"], x["option3"])), axis=1)
      .explode()
      .to_list(),
    columns=df.columns,
)
new_df
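Since the original question's data isn't shown, here is a runnable sketch with made-up values (the column names and contents are assumptions). It uses the same zip-then-explode idea; the only variation is that option3 is repeated with Series.repeat() rather than string multiplication, which also works when option3 is not a string.

```python
import pandas as pd

# Hypothetical input: option1/option2 hold comma-separated values of
# equal length per row, option3 is a single value per row.
df = pd.DataFrame({
    'option1': ['a,b', 'c,d,e'],
    'option2': ['1,2', '3,4,5'],
    'option3': ['x', 'y'],
})

df['option1'] = df['option1'].str.split(',')
df['option2'] = df['option2'].str.split(',')

# Pair up the list elements row-wise, then flatten to one pair per row.
pairs = df.apply(lambda r: list(zip(r['option1'], r['option2'])), axis=1)
new_df = pd.DataFrame(pairs.explode().tolist(), columns=['option1', 'option2'])

# Repeat each row's option3 once per element of its option1 list.
new_df['option3'] = (df['option3']
                     .repeat(df['option1'].str.len())
                     .reset_index(drop=True))
```

This yields five rows: ('a','1','x'), ('b','2','x'), ('c','3','y'), ('d','4','y'), ('e','5','y').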

Combine two dataframes to send an automated message [duplicate]

Is there a way to conveniently merge two data frames side by side?
Both data frames have 30 rows; they have different numbers of columns, say, df1 has 20 columns and df2 has 40 columns.
How can I easily get a new data frame of 30 rows and 60 columns?
df3 = pd.someSpecialMergeFunct(df1, df2)
Or maybe there is some special parameter in append:
df3 = pd.append(df1, df2, left_index=False, right_index=False, how='left')
PS: if possible, I hope the duplicated column names could be resolved automatically.
Thanks!
You can use the concat function for this (axis=1 is to concatenate as columns):
pd.concat([df1, df2], axis=1)
See the pandas docs on merging/concatenating: http://pandas.pydata.org/pandas-docs/stable/merging.html
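A quick sanity check of the shapes, using throwaway data (the a0…/b0… column names are made up for the example):

```python
import numpy as np
import pandas as pd

# Two frames sharing the same 30-row default index but with
# different columns, matching the question's 20 + 40 setup.
df1 = pd.DataFrame(np.zeros((30, 20)), columns=[f'a{i}' for i in range(20)])
df2 = pd.DataFrame(np.ones((30, 40)), columns=[f'b{i}' for i in range(40)])

# axis=1 concatenates side by side, aligning rows on the index.
df3 = pd.concat([df1, df2], axis=1)
```

df3 ends up with shape (30, 60), as requested.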
I came across your question while I was trying to achieve something like the following:
So once I sliced my dataframes, I first ensured that their indexes are the same. In your case both dataframes need to be indexed from 0 to 29. Then merge both dataframes on the index:
df1.reset_index(drop=True).merge(df2.reset_index(drop=True), left_index=True, right_index=True)
If you want to combine 2 data frames with common column name, you can do the following:
df_concat = pd.merge(df1, df2, on='common_column_name', how='outer')
I found that the other answers didn't cut it for me when coming in from Google.
What I did instead was to set the new columns in place in the original df.
# list(df2) gives you the column names of df2
# you then use these as the column names for df
df[list(df2)] = df2
There is a way: you can do it via a Pipeline.
Use a pipeline to transform your numerical data, for example:
num_pipeline = Pipeline([
    ("select_numeric", DataFrameSelector([columns with numerical values])),
    ("imputer", SimpleImputer(strategy="median")),
])
And for categorical data:
cat_pipeline = Pipeline([
    ("select_cat", DataFrameSelector([columns with categorical data])),
    ("cat_encoder", OneHotEncoder(sparse=False)),
])
Then use a FeatureUnion to chain these transformations together:
preprocess_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
Read more here - https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
This solution also works if df1 and df2 have different indices:
df1.loc[:, df2.columns] = df2.to_numpy()
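To see why .to_numpy() matters here, a minimal sketch (reduced to a single column to keep it short; the frames and index values are made up): plain assignment between frames aligns on the index, so mismatched labels would produce NaN, whereas a NumPy array has no index and is placed positionally.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]}, index=[10, 11, 12])
df2 = pd.DataFrame({'b': [4, 5, 6]}, index=[0, 1, 2])   # different index

# df1['b'] = df2['b'] would align labels 0/1/2 against 10/11/12
# and fill the column with NaN. Stripping the index with
# .to_numpy() pairs the rows up by position instead.
df1['b'] = df2['b'].to_numpy()
```

df1 now has both columns, with [4, 5, 6] in 'b' and its original index intact.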

Creating new new data frame with so many columns from existing dataframe by dropping one column

I have one dataframe called df, and I need to drop the column amount from df and assign the result to a new data frame called df1, but I get the error below. How can I fix this? Please note that this is a small example; my real dataset has 28 columns, so I can't type all their names to create a new one.
Thank you.
df= [['A',100,10000]
,['A',120,15000]
,['A',300,50000]
,['B',100,180000]
,['B',80,200000]]
df = pd.DataFrame(df, columns = ['group','size','amount'])
df1=df.drop('amount',axis=1,inplace=True)
df1.head()
"AttributeError: 'NoneType' object has no attribute 'head'"
inplace operations return None, so you need to omit the assignment:
df.drop('amount',axis=1,inplace=True)
df.head()
Or remove inplace=True and assign back:
df = df.drop('amount',axis=1)
df.head()
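Putting the second variant together with the question's data (trimmed to three rows here): drop() without inplace returns a new frame, so df keeps all its columns and df1 gets everything except 'amount', with no need to type out the other 27 column names.

```python
import pandas as pd

df = pd.DataFrame([['A', 100, 10000],
                   ['A', 120, 15000],
                   ['B', 100, 180000]],
                  columns=['group', 'size', 'amount'])

# columns='amount' is equivalent to ('amount', axis=1); the call
# returns a NEW DataFrame and leaves df untouched.
df1 = df.drop(columns='amount')
```

After this, df still has all three columns while df1 has only 'group' and 'size'.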

Preferred pandas code for selecting all rows and a subset of columns

Suppose that you have a pandas DataFrame named df with columns ['a','b','c','d','e'] and you want to create a new DataFrame newdf with columns 'b' and 'd'. There are two possible ways to do this:
newdf = df[['b','d']]
or
newdf = df.loc[:,['b','d']]
The first is using the indexing operator. The second is using .loc. Is there a reason to prefer one over the other?
Thanks to @coldspeed, it seems that newdf = df.loc[:,['b','d']] is preferred, to avoid the dreaded SettingWithCopyWarning.
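A small illustration of the difference (with toy data): both spellings select the same columns, but .loc also works unambiguously on the assignment side, where the bracket form would invite chained indexing.

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'd': [4], 'e': [5]})

# Selection: both forms give the same two-column frame.
newdf = df.loc[:, ['b', 'd']]

# Assignment: .loc addresses rows and columns in one indexing step,
# so there is no view-vs-copy ambiguity.
df.loc[:, ['b', 'd']] = 0
```

Here df's 'b' and 'd' columns become 0 while newdf keeps the originally selected values.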

convert pandas groupby object to dataframe while preserving group semantics

I have miserably failed extrapolating from any answers I have found for grouping a dataframe, and then merging back the group semantics computed by groupby into the original dataframe. Seems documentation is lacking and SO answers are not applicable to current pandas versions.
This code:
grouped = df.groupby(pd.Grouper(
    key=my_time_column,
    freq='15Min',
    label='left',
    sort=True)).apply(pd.DataFrame)
Yields back a dataframe, but I have found no way of making the transition to a dataframe having the same data as the original df, while also populating a new column with the start datetime, of the group that each row belonged to in the groupby object.
Here's my current hack that solves it:
grouped = df.groupby(pd.Grouper(
    key=my_datetime_column,
    freq='15Min',
    label='left',
    sort=True))
sorted_df = grouped.apply(pd.DataFrame)
interval_starts = []
for group_idx, group_member_indices in grouped.indices.items():
    for group_member_index in group_member_indices:
        interval_starts.append(group_idx)
sorted_df['interval_group_start'] = interval_starts
Wondering if there's an elegant pandas way.
pandas version: 0.23.0
IIUC, this should do what you're looking for:
grouped = df.groupby(pd.Grouper(key=my_time_column,
                                freq='15Min',
                                label='left',
                                sort=True)) \
            .apply(pd.DataFrame)
grouped['start'] = grouped.loc[:, my_time_column] \
                          .groupby(level=0) \
                          .transform('min')
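For this particular case there may be an even shorter route, sketched below with made-up data (the 'ts' column name is an assumption): Series.dt.floor('15min') maps every timestamp directly to the left edge of its 15-minute bin, which is the same label Grouper(freq='15Min', label='left') would assign, without any groupby/apply round trip. Note this gives the bin edge, not the group's minimum timestamp, so it differs from the transform('min') answer when the earliest row in a bin is not exactly on the edge.

```python
import pandas as pd

# Hypothetical frame with a datetime column named 'ts'.
df = pd.DataFrame({
    'ts': pd.to_datetime(['2020-01-01 00:03', '2020-01-01 00:07',
                          '2020-01-01 00:20']),
    'val': [1, 2, 3],
})

# Each timestamp is floored to the start of its 15-minute interval:
# 00:03 -> 00:00, 00:07 -> 00:00, 00:20 -> 00:15.
df['interval_group_start'] = df['ts'].dt.floor('15min')
```

The original row order and data are preserved; only the new column is added.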