Apply Series function to the whole dataframe - pandas

Well, I know that a function can be applied to each cell of the whole dataframe using applymap().
However, is there any way to apply a Series function, e.g. str.upper(), to the whole dataframe?

Yes, it can be passed directly to the dataframe's applymap method.
Demo:
df = pd.DataFrame([['a', 'b'], ['c', 'd'], ['e', 'f']])
df
Various possibilities:
1) applymap dataframe:
df.applymap(str.upper)
2) stack + unstack combo:
df.stack().str.upper().unstack()
3) apply series:
df.apply(lambda x: x.str.upper())
All produce the same result:
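For the demo frame above, the uppercased output should look like this:
   0  1
0  A  B
1  C  D
2  E  F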


How to calculate mean in a dataframe for a single year [duplicate]

I'm trying to use the groupby function. Is there a way to group by a specific element rather than the column name? Example code:
df.groupby(['Month', 'Place'])['Number'].sum()
This is what I want to do.
df.groupby(['April', 'Place'])['Number'].sum()
You need to filter your DataFrame first:
df.loc[df['Month'].eq('April')].groupby('Place')['Number'].sum()
#df[df['Month'].eq('April')].groupby('Place')['Number'].sum()
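A minimal runnable sketch of the filter-then-group approach, using a made-up frame with the Month, Place and Number columns from the question:
import pandas as pd

df = pd.DataFrame({
    'Month': ['April', 'April', 'April', 'May'],
    'Place': ['NY', 'NY', 'LA', 'NY'],
    'Number': [1, 2, 3, 4],
})

# keep only the April rows, then aggregate per place
print(df.loc[df['Month'].eq('April')].groupby('Place')['Number'].sum())
# Place
# LA    3
# NY    3
# Name: Number, dtype: int64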
Yes, you can pass as many columns as you like to groupby:
df = pd.DataFrame([['a', 'x', 1], ['a', 'x', 2], ['b', 'y',3], ['b', 'z',4]])
df.columns = ['c1', 'c2', 'c3']
df.groupby(['c1', 'c2'])['c3'].mean()
results in
c1  c2
a   x     1.5
b   y     3.0
    z     4.0

My pandas merge is not bringing over data from the right df. Why?

The code runs without error, but the right data is not populating into the resulting dataframe.
I've tried with and without the index and neither seems to work. I looked into dtypes, but they appear to match on the fields I'm using as the index. I noted that the indicator says left_only, making me think the merge is not actually bringing anything over. It clearly must not be, because fields that are not null in the right df are showing up as null in the resulting dataframe.
df = df[(df['A'].notna())]
group = df.groupby(['A', 'B', 'Period', 'D'])
df2 = group['Monthly_Need'].sum()
df2 = df2.reset_index()
df = df.set_index(['A', 'B', 'Period', 'D'])
df2 = df2.set_index(['A', 'B', 'Period', 'D'])
df = df.merge(df2, how='left', left_index=True, right_index=True, indicator=True)
df = df.reset_index()
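Since the indicator reports left_only, one quick diagnostic (a sketch, not part of the original post; the key columns are taken from the code above, and it should be run right after df2 = df2.reset_index(), before the set_index/merge calls) is to compare the key dtypes and values on both sides:
keys = ['A', 'B', 'Period', 'D']

# the merge can only match rows whose key values and dtypes are identical,
# so compare both sides before setting the index
print(df[keys].dtypes)
print(df2[keys].dtypes)

# key combinations present on the left but absent on the right explain the
# rows that come back as left_only
missing = df.set_index(keys).index.difference(df2.set_index(keys).index)
print(missing[:10])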

How can I change the fill color of a stacked area plot of a DataFrame?

I want to change the fill color in stacked area plots drawn with pandas.DataFrame.plot.area.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
ax = df.plot.area(linewidth=0)
The area plot example
Now I guess that the object returned by the plot function offers access to attributes like the colors.
But the axes classes are too complicated to learn quickly, and I couldn't find a similar question on Stack Overflow.
Can anyone help?
Use colormap (see the documentation for more details):
ax = df.plot.area(linewidth=0, colormap="Pastel1")
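If you prefer to pass the colormap itself rather than its name, the colormap argument also accepts a matplotlib Colormap object (same df as above):
import matplotlib.pyplot as plt

ax = df.plot.area(linewidth=0, colormap=plt.cm.Pastel1)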
The trick is using the 'color' parameter:
Soln 1: dict
Simply pass a dict of {column name: color}
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'], )
ax = df.plot.area(color={'b':'0', 'c':'#17A589', 'a':'#9C640C', 'd':'#ECF0F1'})
Soln 2: sequence
Simply pass a sequence of color codes (it will match the order of your columns).
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'], )
ax = df.plot.area(color=('0', '#17A589', '#9C640C', '#ECF0F1'))
There is no need to set linewidth (the colors are adjusted automatically). Also, this won't mess with the legend.
The matplotlib API is really complex, but the artist module documentation gives a very plain illustration. For bar/barh plots, the attributes can be accessed and modified through .patches, but for area plots they live in .collections.
To achieve a specific modification, use code like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
ax = df.plot.area(linewidth=0)

# recolor every stacked area, then highlight one of them
for collection in ax.collections:
    collection.set_facecolor('#888888')

highlight = 0
ax.collections[highlight].set_facecolor('#aa3333')
Other methods of the collections can be found by running
dir(ax.collections[highlight])
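For instance, two other Collection methods that are often useful here (same ax as above; the values are arbitrary):
for collection in ax.collections:
    collection.set_alpha(0.6)          # make the filled areas translucent
    collection.set_edgecolor('black')  # draw an outline around each area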

From Pandas groupBy to PySpark groupBy

Consider a Spark DataFrame with a few columns. The goal is to perform a groupBy operation on it without converting it to a pandas DataFrame. The equivalent pandas groupBy code looks something like this:
def compute_metrics(x):
    return pd.Series({
        'a': x['a'].values[0],
        'new_b': np.sum(x['b']),
        'c': np.mean(x['c']),
        'cnt': len(x)
    })

data.groupby([
    'col_1',
    'col_2'
]).apply(compute_metrics).reset_index()
And I'm intending to write this in PySpark. So far I have come up with something like this in PySpark:
gdf = df.groupBy([
    'col_1',
    'col_2'
]).agg({
    'c': 'avg',
    'b': 'sum'
}).withColumnRenamed('sum(b)', 'new_b')
However, I am not sure how to go about 'a': x['a'].values[0] and 'cnt': len(x). I thought about using collect_list from pyspark.sql.functions, but that slaps me with 'Column object is not callable'. Any idea how to accomplish the aforementioned conversion? Thanks!
[UPDATE] Would it make sense to perform a count operation on any column in order to get cnt? Say I do this:
gdf = df.groupBy([
    'col_1',
    'col_2'
]).agg({
    'c': 'avg',
    'b': 'sum',
    'some_column': 'count'
}).withColumnRenamed('sum(b)', 'new_b') \
  .withColumnRenamed('count(some_column)', 'cnt')
I have this toy solution using the PySpark functions sum, avg, count and first. Note that I use Spark 2.1 in this solution. Hope this helps a bit!
from pyspark.sql.functions import sum, avg, count, first

# create a toy example dataframe with columns 'A', 'B' and 'C'
ls = [['a', 'b', 3], ['a', 'b', 4], ['a', 'c', 3], ['b', 'b', 5]]
df = spark.createDataFrame(ls, schema=['A', 'B', 'C'])

# group by columns 'A' and 'B', then apply the aggregation functions
group_df = df.groupby(['A', 'B'])
df_grouped = group_df.agg(sum("C").alias("sumC"),
                          avg("C").alias("avgC"),
                          count("C").alias("countC"),
                          first("C").alias("firstC"))
df_grouped.show()  # print out the spark dataframe
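For the columns in the question, a sketch of the full equivalent could look like this (assuming df has columns col_1, col_2, a, b and c; first covers 'a': x['a'].values[0], and count covers cnt):
from pyspark.sql import functions as F

df_grouped = (df.groupBy('col_1', 'col_2')
                .agg(F.first('a').alias('a'),
                     F.sum('b').alias('new_b'),
                     F.avg('c').alias('c'),
                     F.count(F.lit(1)).alias('cnt')))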

Conditional on pandas DataFrames

Let df1, df2, and df3 be pandas DataFrames having the same structure but different numerical values. I want to perform, element-wise:
res=if df1>1.0: (df2-df3)/(df1-1) else df3
res should have the same structure as df1, df2, and df3 have.
numpy.where() generates the result as a plain NumPy array, without the index.
Edit 1:
res should have the same indices as df1, df2, and df3 have.
For example, I can access df2 as df2["instanceA"]["parameter1"]["parameter2"]. I want to access the newly calculated DataFrame/Series res as res["instanceA"]["parameter1"]["parameter2"].
Actually numpy.where should work fine there. Output here is 4x2 (same as df1, df2, df3).
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(4, 2), columns=list('xy'))
df2 = pd.DataFrame(np.random.randn(4, 2), columns=list('xy'))
df3 = pd.DataFrame(np.random.randn(4, 2), columns=list('xy'))

res = df3.copy()
res[:] = np.where(df1 > 1, (df2 - df3) / (df1 - 1), df3)
x y
0 -0.671787 -0.445276
1 -0.609351 -0.881987
2 0.324390 1.222632
3 -0.138606 0.955993
Note that this should work on both series and dataframes. The [:] is slicing syntax that preserves the index and columns. Without that res will come out as an array rather than series or dataframe.
Alternatively, for a series you could write it as @Kadir does in his answer:
res = pd.Series(np.where( df1>1, (df2-df3)/(df1-1), df3 ), index=df1.index)
Or similarly for a dataframe you could write:
res = pd.DataFrame(np.where( df1>1, (df2-df3)/(df1-1), df3 ), index=df1.index,
columns=df1.columns)
Integrating the idea in this question into JohnE's answer, I have come up with this solution:
res = pd.Series(np.where( df1 > 1, (df2-df3)/(df1-1), df3 ), index=df1.index)
A better answer using DataFrames would be appreciated.
Say df is your initial dataframe and res is the new column. Use a combination of setting values and boolean indexing.
Set res to be a copy of df3:
df['res'] = df['df3']
Then adjust the values where your condition holds, using .loc so the assignment writes back to df (chained indexing like df[df['df1']>1.0]['res'] = ... would only modify a copy):
df.loc[df['df1'] > 1.0, 'res'] = (df['df2'] - df['df3']) / (df['df1'] - 1)
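A runnable sketch of this approach with made-up data, where df1, df2 and df3 are columns of a single frame df (as this answer assumes):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3), columns=['df1', 'df2', 'df3'])

# start from df3, then overwrite the rows where the condition holds
df['res'] = df['df3']
df.loc[df['df1'] > 1.0, 'res'] = (df['df2'] - df['df3']) / (df['df1'] - 1)
print(df)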