concat series onto dataframe with column name - pandas

I want to add a Series (s) to a Pandas DataFrame (df) as a new column. The series has more values than there are rows in the dataframe, so I am using the concat method along axis 1.
df = pd.concat((df, s), axis=1)
This works, but the new column of the dataframe representing the series is given an arbitrary numerical column name, and I would like this column to have a specific name instead.
Is there a way to add a series to a dataframe, when the series is longer than the rows of the dataframe, and with a specified column name in the resulting dataframe?

You can try Series.rename:
df = pd.concat((df, s.rename('col')), axis=1)

One option is simply to specify the name when creating the series:
example_scores = pd.Series([1,2,3,4], index=['t1', 't2', 't3', 't4'], name='example_scores')
Using the name attribute when creating the series is all I needed.

Try:
df = pd.concat((df, s.rename('CoolColumnName')), axis=1)

Related

DataFrame Groupby apply on second dataframe?

I have 2 dataframes df1, df2. Both have id as a column. I want to compute a new column, weighted_average, in df1 that is a function of the values in df2 with the same id.
First, I think I should do df1.groupby("id"). Is it possible to use GroupBy.apply(...) and have it use values from df2? In the examples I've seen, it usually just operates on df1 values.
If they have same id positions and length, you can do some like:
df2["new column name"] = df1["column name"].apply(...)

Combine two dataframe to send a automated message [duplicate]

is there a way to conveniently merge two data frames side by side?
both two data frames have 30 rows, they have different number of columns, say, df1 has 20 columns and df2 has 40 columns.
how can i easily get a new data frame of 30 rows and 60 columns?
df3 = pd.someSpecialMergeFunct(df1, df2)
or maybe there is some special parameter in append
df3 = pd.append(df1, df2, left_index=False, right_index=false, how='left')
ps: if possible, i hope the replicated column names could be resolved automatically.
thanks!
You can use the concat function for this (axis=1 is to concatenate as columns):
pd.concat([df1, df2], axis=1)
See the pandas docs on merging/concatenating: http://pandas.pydata.org/pandas-docs/stable/merging.html
I came across your question while I was trying to achieve something like the following:
So once I sliced my dataframes, I first ensured that their index are the same. In your case both dataframes needs to be indexed from 0 to 29. Then merged both dataframes by the index.
df1.reset_index(drop=True).merge(df2.reset_index(drop=True), left_index=True, right_index=True)
If you want to combine 2 data frames with common column name, you can do the following:
df_concat = pd.merge(df1, df2, on='common_column_name', how='outer')
I found that the other answers didn't cut it for me when coming in from Google.
What I did instead was to set the new columns in place in the original df.
# list(df2) gives you the column names of df2
# you then use these as the column names for df
df[list(df2)] = df2
There is way, you can do it via a Pipeline.
** Use a pipeline to transform your numerical Data for ex-
Num_pipeline = Pipeline
([("select_numeric", DataFrameSelector([columns with numerical value])),
("imputer", SimpleImputer(strategy="median")),
])
**And for categorical data
cat_pipeline = Pipeline([
("select_cat", DataFrameSelector([columns with categorical data])),
("cat_encoder", OneHotEncoder(sparse=False)),
])
** Then use a Feature union to add these transformations together
preprocess_pipeline = FeatureUnion(transformer_list=[
("num_pipeline", num_pipeline),
("cat_pipeline", cat_pipeline),
])
Read more here - https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
This solution also works if df1 and df2 have different indices:
df1.loc[:, df2.columns] = df2.to_numpy()

Numpy where perform multiple actions

I have two dataframe columns where I want to check if the element of one are inside the other one. I perform this using the pandas isin method.
However, if the element is present in the second dataframe, I also want to subtract is from both:
attivo['S'] = np.where(attivo['SKU'].isin(stampate['SKU-S']), attivo['S'] - 1, attivo['S'])
In this example, if an item in the column S of attivo dataframe is present in the column SKU-S of the stampate dataframe, the S column will decrease by one unit, however, I also want that the same column S will decrease in the stampate dataframe.
How is it possible to achieve this?
EDIT with sample data:
df1 = pd.DataFrame({'SKU': 'productSKU', 'S': 5}, index=[0])
df2 = pd.DataFrame({'SKU-S': 'productSKU', 'S': 5}, index=[0])
Currently, I am achieving this:
df1['S'] = np.where(df1['SKU'].isin(df2['SKU-S']), df1['S'] - 1, df1['S'])
However, I would like that both dataframes are updated, in this case, both of them will display 4 in the S column.
IIUC:
s = df1['SKU'].isin(df2['SKU-S'])
# modify df1
df1['S'] -= s
# count the SKU in df1 that belongs to df2 by values
counts = df1['SKU'].where(s).value_counts()
# modify df2
df2['S'] -= df2['SKU-S'].map(counts).fillna(0)

Streamlit - Applying value_counts / groupby to column selected on run time

I am trying to apply value_counts method to a Dataframe based on the columns selected dynamically in the Streamlit app
This is what I am trying to do:
if st.checkbox("Select Columns To Show"):
all_columns = df.columns.tolist()
selected_columns = st.multiselect("Select", all_columns)
new_df = df[selected_columns]
st.dataframe(new_df)
The above lets me select columns and displays data for the selected columns. I am trying to see how could I apply value_counts/groupby method on this output in Streamlit app
If I try to do the below
st.table(new_df.value_counts())
I get the below error
AttributeError: 'DataFrame' object has no attribute 'value_counts'
I believe the issue lies in passing a list of columns to a dataframe. When you pass a single column in [] to a dataframe, you get back a pandas.Series object (which has the value_counts method). But when you pass a list of columns, you get back a pandas.DataFrame (which doesn't have value_counts method defined on it).
Can you try st.table(new_df[col_name].value_counts())
I think the error is because value_counts() is applicable on a Series and not dataframe.
You can try Converting ".value_counts" output to dataframe
If you want to apply on one single column
def value_counts_df(df, col):
"""
Returns pd.value_counts() as a DataFrame
Parameters
----------
df : Pandas Dataframe
Dataframe on which to run value_counts(), must have column `col`.
col : str
Name of column in `df` for which to generate counts
Returns
-------
Pandas Dataframe
Returned dataframe will have a single column named "count" which contains the count_values()
for each unique value of df[col]. The index name of this dataframe is `col`.
Example
-------
>>> value_counts_df(pd.DataFrame({'a':[1, 1, 2, 2, 2]}), 'a')
count
a
2 3
1 2
"""
df = pd.DataFrame(df[col].value_counts())
df.index.name = col
df.columns = ['count']
return df
val_count_single = value_counts_df(new_df, selected_col)
If you want to apply for all object columns in the dataframe
def valueCountDF(df, object_cols):
c = df[object_cols].apply(lambda x: x.value_counts(dropna=False)).T.stack().astype(int)
p = (df[object_cols].apply(lambda x: x.value_counts(normalize=True,
dropna=False)).T.stack() * 100).round(2)
cp = pd.concat([c,p], axis=1, keys=["Count", "Percentage %"])
return cp
val_count_df_cols = valueCountDF(df, selected_columns)
And Finally, you can use st.table or st.dataframe to show the dataframe in your streamlit app

pandas: appending a row to a dataframe with values derived using a user defined formula applied on selected columns

I have a dataframe as
df = pd.DataFrame(np.random.randn(5,4),columns=list('ABCD'))
I can use the following to achieve the traditional calculation like mean(), sum()etc.
df.loc['calc'] = df[['A','D']].iloc[2:4].mean(axis=0)
Now I have two questions
How can I apply a formula (like exp(mean()) or 2.5*mean()/sqrt(max()) to column 'A' and 'D' for rows 2 to 4
How can I append row to the existing df where two values would be mean() of the A and D and two values would be of specific formula result of C and B.
Q1:
You can use .apply() and lambda functions.
df.iloc[2:4,[0,3]].apply(lambda x: np.exp(np.mean(x)))
df.iloc[2:4,[0,3]].apply(lambda x: 2.5*np.mean(x)/np.sqrt(max(x)))
Q2:
You can use dictionaries and combine them and add it as a row.
First one is mean, the second one is some custom function.
ad = dict(df[['A', 'D']].mean())
bc = dict(df[['B', 'C']].apply(lambda x: x.sum()*45))
Combine them:
ad.update(bc)
df = df.append(ad, ignore_index=True)