Pandas Map on some rows only?

Is there a way to use pandas map on only some rows and ignore all others?
Example DF :
import datetime as dt
import pandas as pd

df = pd.DataFrame({'ProductID': ['Playstation', 'Playstation', 'Playstation',
                                 'Sony Playstation', 'Sony Playstation', 'Sony Playstation'],
                   'date': [dt.date(2022, 11, 1), dt.date(2022, 11, 5), dt.date(2022, 11, 1),
                            dt.date(2022, 11, 10), dt.date(2022, 11, 15), dt.date(2022, 11, 1)],
                   'customerID': ['01', '01', '01', '01', '02', '01'],
                   'brand': ['Cash', 'Cash', 'Game', 'Cash', 'Cash', 'Game'],
                   'gmv': [10, 50, 30, 40, 50, 60]})
As you can see, I have similar products that, for some reason, sometimes appear as "Playstation" and sometimes as "Sony Playstation".
How can I use a pandas map to replace "Playstation" with "Sony Playstation"?
Take into account that this is an example with only 2 brands. In my DF I have several brands, so building a dict of them all is not viable (and I might need to change many brands at once).
Can I apply map on a filtered df? I've tried applying map on a partial DF:
Gift.loc[(Gift.brand == 'PlayStation')].brand.map({'PlayStation': 'Sony PlayStation'}, na_action='ignore')
If so, how do I move the new data to the original df?
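One approach that might work, sketched on the example df above: select the matching rows with .loc, map them, and assign the result back to the same slice (assignment aligns on the index, so the other rows are untouched):

# Map only the matching rows, then write the result back into the same slice
mask = df['ProductID'] == 'Playstation'
df.loc[mask, 'ProductID'] = df.loc[mask, 'ProductID'].map({'Playstation': 'Sony Playstation'})

# Equivalent without map, since every filtered row gets the same value:
# df.loc[mask, 'ProductID'] = 'Sony Playstation'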

Related

pandas: split pandas columns of unequal length list into multiple columns

I have a dataframe with one column of unequal-length lists which I want to split into multiple columns (the item values will be the column names). An example is given below.
I have done this through iterrows, iterating through the rows and examining the list from each row. It seems workable as my dataframe has few rows. However, I wonder if there is any cleaner method.
I have also tried additional_df = pd.DataFrame(venue_df.location.values.tolist())
However, the list breaks down as below.
Thanks for your help.
Can you try this code? It is built assuming venue_df.location contains the lists you have shown in the cells.
# 1 if the venue type appears in the row's location list, else 0 (the + 0 turns the boolean into an int)
venue_df['school'] = venue_df.location.apply(lambda x: ('school' in x) + 0)
venue_df['office'] = venue_df.location.apply(lambda x: ('office' in x) + 0)
venue_df['home'] = venue_df.location.apply(lambda x: ('home' in x) + 0)
venue_df['public_area'] = venue_df.location.apply(lambda x: ('public_area' in x) + 0)
Hope this helps!
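For context, a small hypothetical venue_df (the question's actual lists are not shown) illustrates roughly what those four lines produce:

import pandas as pd

# Hypothetical input, since the real data isn't shown in the question
venue_df = pd.DataFrame({'location': [['school', 'home'], ['office'], ['home', 'public_area']]})

venue_df['school'] = venue_df.location.apply(lambda x: ('school' in x) + 0)
venue_df['office'] = venue_df.location.apply(lambda x: ('office' in x) + 0)
venue_df['home'] = venue_df.location.apply(lambda x: ('home' in x) + 0)
venue_df['public_area'] = venue_df.location.apply(lambda x: ('public_area' in x) + 0)

print(venue_df)
#               location  school  office  home  public_area
# 0       [school, home]       1       0     1            0
# 1             [office]       0       1     0            0
# 2  [home, public_area]       0       0     1            1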
First, let's explode your location column so we can get your wanted end result.
s = venue_df['location'].explode()
Then let's use crosstab on that series so we can get your end result.
import pandas as pd
pd.crosstab(s.index, s)
I didn't test it out because I don't know your base df.
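A runnable version of that idea on the same kind of hypothetical data; note that pd.crosstab needs both an index and a columns argument, so the exploded series is paired with its own index:

import pandas as pd

# Hypothetical input, mirroring the structure described in the question
venue_df = pd.DataFrame({'location': [['school', 'home'], ['office'], ['home', 'public_area']]})

s = venue_df['location'].explode()
dummies = pd.crosstab(s.index, s)   # one row per original index, one column per item value

print(venue_df.join(dummies))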

Groupby does return previous df without changing it

import pandas as pd

df = pd.read_csv('../input/tipping/tips.csv')
df_1 = df.groupby(['day', 'time'])
df_1.head()
Guys, what am I missing here? It just returns the previous dataframe to me, without any grouping applied.
We can print it using the following:
df_1 = df.groupby(['day','time']).apply(print)
groupby doesn't work the way you are assuming, by the sounds of it. Calling head on the grouped dataframe returns the first rows of each group, taken straight from the original dataframe, so the output looks like the ungrouped data. You can use #tlentali's approach to print out each group, but df_1 will not be assigned the grouped dataframe that way; it gets the collected output of print, which is None for each group.
The approach below gives a lot of control over how to show/display the groups and their keys. This might also help you understand more about how the grouped dataframe structure in pandas works.
df_1 = df.groupby(['day', 'time'])

# for each (day, time) key and its grouped data
for key, group in df_1:
    # show the (day, time) key
    print(key)
    # display the head of the grouped data
    print(group.head())
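If the goal was actually a new, aggregated dataframe rather than just displaying the groups, an aggregation step is what produces one. A minimal sketch, continuing from df above and assuming tips.csv has the usual total_bill column:

# The aggregation (mean, sum, size, ...) is what turns the groupby into a new dataframe
df_agg = df.groupby(['day', 'time'], as_index=False)['total_bill'].mean()
print(df_agg.head())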

Analogy of Excel's SUMIFS function in Pandas

I am having difficulty applying an Excel SUMIFS-type function in Pandas.
I have a table similar to the one in the picture.
I need to find the sum of each product sold each day. But I don't need it in a summary table; I need it written in a column next to each entry, as shown in the red column. In Excel I use the SUMIFS function, but in Pandas I can't find any analogy.
Once again, I don't need just a count or sum shown as a summary in another table. I need it next to each entry, in a new column. Is there any way I can do it?
P.S. VERY important thing: writing a groupby where I would need to spell out each condition isn't a solution, because I want the result next to each cell. My data will have thousands of entries and I don't know every entry, so I can't write =="apple", =="orange" each time. I need the same logic as in Excel.
You can do this with groupby and its transform method.
Creating something that looks like your dataframe, but abbreviated:
import pandas as pd
df = pd.DataFrame({
    'date': ["22.10.2021", "22.10.2021", "22.10.2021", "22.10.2021", "23.10.2021"],
    'Product': ["apple", "apple", "orange", "orange", "cherry"],
    'sold_kg': [2, 3, 1, 4, 2]})
Then we group by and apply the sum as transformation to the sold_kg column and assign the result back as a new column:
df['Sold that day'] = df.groupby(['date', 'Product']).sold_kg.transform("sum")
In your words, we often use groupby to create "summaries" or aggregations. But transform is also useful to know since it allows us to splat the result back into the data frame it came from, just like in your example.
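For the abbreviated frame above, the result should look roughly like this:

print(df)
#          date Product  sold_kg  Sold that day
# 0  22.10.2021   apple        2              5
# 1  22.10.2021   apple        3              5
# 2  22.10.2021  orange        1              5
# 3  22.10.2021  orange        4              5
# 4  23.10.2021  cherry        2              2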
If we consider the image as dataframe df, simply do:
>>> pd.merge(df.groupby(['Date', 'Product']).sum().reset_index(), df, on=['Date', 'Product'], how='left')
You will just need to rename some columns later on, but that should do it.
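As a sketch of that renaming step (the column names here are assumptions, since the real ones come from the image): when both inputs carry a quantity column such as sold_kg, merge suffixes the duplicates with _x and _y, and the aggregated one can then be renamed:

merged = pd.merge(df.groupby(['Date', 'Product']).sum().reset_index(),
                  df, on=['Date', 'Product'], how='left')
merged = merged.rename(columns={'sold_kg_x': 'Sold that day', 'sold_kg_y': 'sold_kg'})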

How to select a value in a dataframe with MultiIndex?

I use the Pandas library to analyze data coming from an Excel file.
I used pivot_table to get a pivot table with the information I'm interested in, and I end up with a multi-index dataframe.
For "OPE-2016-0001", I would like to obtain the figures for 2017, for example. I've tried lots of things and nothing works. What is the correct method to use? Thank you.
import pandas as pd
import numpy as np
from math import *
import tkinter as tk
pd.set_option('display.expand_frame_repr', False)
df = pd.read_csv('datas.csv')
def tcd_op_dataExcercice():
    global df
    new_df = df.assign(Occurence=1)
    tcd = new_df.pivot_table(index=['Numéro opération',
                                    'Libellé opération'],
                             columns=['Exercice'],
                             values=['Occurence'],
                             aggfunc=[np.sum],
                             margins=True,
                             fill_value=0,
                             margins_name='Total')
    print(tcd)
    print(tcd.xs('ALSTOM 8', level='Libellé opération', drop_level=False))

tcd_op_dataExcercice()
I get the following table (image).
How do I get the value framed in red?
You can use .loc to select rows by a DataFrame's Index labels. If the Index is a MultiIndex, it will index into the first level of the MultiIndex ('Numéro opération' in your case), though you can pass a tuple to index into both levels (e.g. if you specifically wanted ("OPE-2016-0001", "ALSTOM 8")).
It's worth noting that the columns of your pivoted data are also a MultiIndex, because you specified the aggfunc, values and columns as lists rather than as individual values (i.e. without the []). Pandas creates a MultiIndex because of these lists, even though each list has only one element.
So you'll also need to pass a tuple to index into the columns to get the value for 2017:
tcd.loc["OPE-2016-0001", ('sum', 'Occurence', 2017)]
If you had instead just specified the aggfunc etc as individual strings, the columns would just be the years and you could select the values by:
tcd.loc["OPE-2016-0001", 2017]
Or if you specifically wanted the value for ALSTOM 8:
tcd.loc[("OPE-2016-0001", "ALSTOM 8"), 2017]
An alternative to indexing into a MultiIndex would also be to just .reset_index() after pivoting -- in which case the levels of the MultiIndex just become columns in the data, and you can then select rows based on the values of those columns. E.g. (assuming you specified aggfunc etc. as strings):
tcd = tcd.reset_index()
tcd.query("`Numéro opération` == 'OPE-2016-0001'")[2017]
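To make the string-aggfunc case concrete, here is a tiny self-contained sketch with made-up data (the values are hypothetical; only the indexing pattern matters):

import pandas as pd

# Hypothetical miniature of the data, just to illustrate the indexing
data = pd.DataFrame({'Numéro opération': ['OPE-2016-0001', 'OPE-2016-0001', 'OPE-2017-0002'],
                     'Libellé opération': ['ALSTOM 8', 'ALSTOM 8', 'AUTRE'],
                     'Exercice': [2016, 2017, 2017],
                     'Occurence': [1, 1, 1]})

# aggfunc/values/columns given as plain strings, so the columns are simply the years
tcd = data.pivot_table(index=['Numéro opération', 'Libellé opération'],
                       columns='Exercice', values='Occurence',
                       aggfunc='sum', fill_value=0)

print(tcd.loc[("OPE-2016-0001", "ALSTOM 8"), 2017])   # -> 1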

Making Many Empty Columns in PySpark

I have a list of many dataframes, each with a subset schema of a master schema. In order to union these dataframes, I need to construct a common schema among all of them. My thought is that I need to create empty columns for all of the missing columns in each dataframe. I have about 80 missing features on average and 100s of dataframes.
This is somewhat of a duplicate or inspired by Concatenate two PySpark dataframes
I am currently implementing things this way:
from pyspark.sql.functions import lit
for df in dfs:  # list of dataframes
    for feature in missing_features:  # list of strings
        df = df.withColumn(feature, lit(None).cast("string"))
This seems to be taking a significant amount of time. Is there a faster way to concat these dataframes with null in place of missing features?
You might be able to cut time a little by replacing your code with:
cols = ["*"] + [lit(None).cast("string").alias(f) for f in missing_features]
dfs_new = [df.select(cols) for df in dfs]
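Once every frame exposes the full set of columns, the union itself might then be done with unionByName; a minimal sketch, continuing from dfs_new above:

from functools import reduce
from pyspark.sql import DataFrame

# Union all of the padded dataframes, matching columns by name
combined = reduce(DataFrame.unionByName, dfs_new)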