I'm having difficulty applying an Excel SUMIFS-type function in Pandas.
I have a table similar to the one in the picture.
I need to find the sum of each product sold each day. But I don't need it in a summary table; I need it written in a column next to each entry, as shown in the red column. In Excel I would use the SUMIFS function, but in Pandas I can't find an analogue.
Once again, I don't need just a count or sum shown as a summary in another table. I need it next to each entry, in a new column. Is there any way I can do this?
P.S. Very important: writing a groupby where I have to spell out each condition isn't a solution, because I want the result next to each cell. My data will have thousands of entries, and I don't know each entry, so I can't write =="apple", =="orange" every time. I need the same logic as in Excel.
You can do this with groupby and its transform method.
Creating something that looks like your dataframe, but abbreviated:
import pandas as pd

df = pd.DataFrame({
    'date': ["22.10.2021", "22.10.2021", "22.10.2021", "22.10.2021", "23.10.2021"],
    'Product': ["apple", "apple", "orange", "orange", "cherry"],
    'sold_kg': [2, 3, 1, 4, 2]})
Then we group by, apply the sum as a transformation to the sold_kg column, and assign the result back as a new column:
df['Sold that day'] = df.groupby(['date', 'Product']).sold_kg.transform("sum")
In your words, we often use groupby to create "summaries" or aggregations. But transform is also useful to know since it allows us to splat the result back into the data frame it came from, just like in your example.
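For reference, after the assignment the abbreviated frame above should look like this:

         date Product  sold_kg  Sold that day
0  22.10.2021   apple        2              5
1  22.10.2021   apple        3              5
2  22.10.2021  orange        1              5
3  22.10.2021  orange        4              5
4  23.10.2021  cherry        2              2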
If we treat the table in the image as a dataframe df, simply do
>>> pd.merge(df.groupby(['Date', 'Product']).sum().reset_index(), df, on=['Date', 'Product'], how='left')
You will just need to rename some columns afterwards, but that should do it.
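A minimal sketch of that approach, assuming the column names date, Product and sold_kg from the other answer (renaming the aggregated column before merging avoids the _x/_y suffix clash):

# sum sold_kg per (date, Product) pair, keeping the keys as columns
summary = df.groupby(['date', 'Product'], as_index=False)['sold_kg'].sum()
# rename before merging so the aggregate doesn't collide with the original column
summary = summary.rename(columns={'sold_kg': 'Sold that day'})
df = df.merge(summary, on=['date', 'Product'], how='left')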
My problem is that I have a dataframe with null values in one column, but those nulls should be filled from another column of the same dataframe. I would like to know how to take that other column and use its values to fill in the missing data. I'm using Deepnote.
link:
https://deepnote.com
For example:
Column A    Column B
Cell 1      Cell 2
NULL        Cell 4
My desired output:
Column A
Cell 1
Cell 4
I think it should be done with subqueries and some WHERE clauses; any ideas?
Thanks for the question, and welcome to Stack Overflow.
It is not 100% clear which direction you need your solution to go in, so I am offering two alternatives which should get you going.
Pandas way
You seem to be working with Pandas dataframes. The usual way to work with them is to use Pandas' built-in functions, and in this case there is literally a function for filling null values: fillna. We can use it to fill values from another column like this:
import pandas as pd

df_raw = pd.DataFrame(data={'Column A': ['Cell 1', None], 'Column B': ['Cell 2', 'Cell 4']})

# copy the original dataframe to a clean one
df_clean = df_raw.copy()

# fill null values in Column A from Column B
df_clean['Column A'] = df_clean['Column A'].fillna(df_clean['Column B'])
This will make your df_clean look like you need:
Column A
Cell 1
Cell 4
Dataframe SQL way
You mentioned "queries" and "WHERE" in your question, which suggests you might be combining the Python and SQL worlds. Enter DuckDB, which supports exactly this; in Deepnote we call these Dataframe SQL blocks.
You can query e.g. CSV files directly from these Dataframe SQL blocks, but you can also use a previously defined dataframe:
select * from df_raw
To fill the null values as you are requesting, we can use standard SQL and a function called coalesce, as Paul correctly pointed out:
select coalesce("Column A", "Column B") as "Column A" from df_raw
This will also create what you need in the SQL world. In Deepnote specifically, this will also give you a dataframe:
Column A
Cell 1
Cell 4
Feel free to check out my project in Deepnote with these examples, and duplicate it if you want to iterate on the code a bit. There are also plenty of alternatives: if you're in a real SQL database and want to update existing columns, you would use an UPDATE statement, and in pure Python this is of course also possible with a loop or a lambda function.
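If you want to run the same SQL outside Deepnote, the duckdb Python package can query a local dataframe by name; a minimal sketch, assuming df_raw from the Pandas section above:

import duckdb

# DuckDB resolves df_raw from the surrounding Python scope
df_clean = duckdb.query('select coalesce("Column A", "Column B") as "Column A" from df_raw').df()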
I have a dataframe with one column of unequal-length lists, which I want to split into multiple columns (the list items will become the column names). An example is given below.
I have done it with iterrows, iterating through the rows and examining the list in each row. That seems workable since my dataframe has few rows; however, I wonder if there is a cleaner method.
I have also tried additional_df = pd.DataFrame(venue_df.location.values.tolist()), but the list breaks down as shown below.
Thanks for your help.
Can you try this code? It is built assuming venue_df.location contains the lists you have shown in the cells.
venue_df['school'] = venue_df.location.apply(lambda x: int('school' in x))
venue_df['office'] = venue_df.location.apply(lambda x: int('office' in x))
venue_df['home'] = venue_df.location.apply(lambda x: int('home' in x))
venue_df['public_area'] = venue_df.location.apply(lambda x: int('public_area' in x))
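To illustrate, a minimal sketch with made-up data (the column values are assumptions based on the screenshot):

import pandas as pd

venue_df = pd.DataFrame({'location': [['school', 'home'], ['office'], ['home', 'public_area']]})
venue_df['school'] = venue_df.location.apply(lambda x: int('school' in x))
print(venue_df)
#               location  school
# 0       [school, home]       1
# 1             [office]       0
# 2  [home, public_area]       1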
Hope this helps!
First let's explode your Location column, so we can get your wanted end result:
s = df['Location'].explode()
Then let's use crosstab on that series to get your end result:
import pandas as pd

pd.crosstab(s.index, s)
I didn't test it out because I don't know your base_df.
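For completeness, a runnable sketch of the whole idea, with a made-up frame standing in for your base_df:

import pandas as pd

df = pd.DataFrame({'Location': [['school', 'home'], ['office'], ['home', 'public_area']]})

# one row per list element, keeping the original row index
s = df['Location'].explode()

# cross-tabulate the original row index against the exploded values
dummies = pd.crosstab(s.index, s)

# attach the indicator columns back to the original frame
out = df.join(dummies)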
I am new to Pandas. Sorry for using images instead of tables here; I tried to follow the instructions for inserting a table, but I couldn't.
Pandas version: '1.3.2'
Given this dataframe with Close and Volume for stocks, I've managed to calculate OBV, using pandas, like this:
df.groupby('Ticker').apply(lambda x: (np.sign(x['Close'].diff().fillna(0)) * x['Volume']).cumsum())
The above gave me the correct values for OBV, as shown here.
However, I'm not able to assign the calculated values to a new column.
I would like to do something like this:
df['OBV'] = df.groupby('Ticker').apply(lambda x: (np.sign(x['Close'].diff().fillna(0)) * x['Volume']).cumsum())
But simply running the expression above will of course throw the error:
ValueError: Columns must be same length as key
What am I missing?
How can I insert the calculated values into the original dataframe as a single column, df['OBV']?
I've checked this thread so I'm sure I should use apply.
This discussion looked promising, but it does not cover my case.
Use Series.droplevel to remove the first level of the MultiIndex:
df['OBV'] = df.groupby('Ticker').apply(lambda x: (np.sign(x['Close'].diff().fillna(0)) * x['Volume']).cumsum()).droplevel(0)
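The droplevel(0) is needed because groupby(...).apply here returns a Series indexed by (Ticker, original row index); dropping the Ticker level restores an index that aligns with df. A minimal sketch with made-up data (column names taken from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Ticker': ['AAPL', 'AAPL', 'MSFT', 'MSFT'],
                   'Close': [10.0, 11.0, 20.0, 19.0],
                   'Volume': [100, 200, 300, 400]})

obv = df.groupby('Ticker').apply(lambda x: (np.sign(x['Close'].diff().fillna(0)) * x['Volume']).cumsum())
# obv.index is a MultiIndex: (AAPL, 0), (AAPL, 1), (MSFT, 2), (MSFT, 3)
df['OBV'] = obv.droplevel(0)  # back to 0..3, so the assignment aligns row by row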
I'm working in a Jupyter notebook, and I would like to get the average pcnt_change based on day_of_week. How do I do this?
A simple groupby call would do the trick here.
If df is the pandas dataframe:
df.groupby('day_of_week').mean()
would return a dataframe with the average of all numeric columns in the dataframe, with day_of_week as the index. If you want only certain column(s) to be returned, select only the needed columns in the groupby call, for example:
df[['open_price', 'high_price', 'day_of_week']].groupby('day_of_week').mean()
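Equivalently, since you only need pcnt_change, you can select the column right after the groupby, which avoids averaging columns you don't care about:

df.groupby('day_of_week')['pcnt_change'].mean()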
I am trying to preset the dimensions of my data frame in pandas so that I can have 500 rows by 300 columns. I want to set it before I enter data into the dataframe.
I am working on a project where I need to take a column of data, copy it, shift it one to the right and shift it down by one row.
I am having trouble with the last row being cut off when I shift it down by one row (eg: I started with 23 rows and it remains at 23 rows despite the fact that I shifted down by one and should have 24 rows).
Here is what I have done so far:
bolusCI = pd.DataFrame()
## set index to a very high number to accommodate shifting the row down by 1
bolusCI = bolus_raw[["Activity (mCi)"]].copy()
activity_copy = bolusCI.shift(1)
activity_copy
pd.concat([bolusCI, activity_copy], axis=1)
Thanks!
There might be a more efficient way to achieve what you are looking to do, but to directly answer your question, you could do something like this to initialize the DataFrame with certain dimensions:
pd.DataFrame(columns=range(300), index=range(500))
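Note that a frame created this way is filled with NaN, so you can write into it by position afterwards; a quick sketch:

import pandas as pd

df = pd.DataFrame(columns=range(300), index=range(500))
print(df.shape)   # (500, 300)

# fill a single cell by position
df.iloc[0, 0] = 42.0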
You just need to define the index and columns in the constructor. The simplest way is to use pandas.RangeIndex, which mimics np.arange and range in syntax. You can also pass a name parameter to name it.
For details, see the documentation for pd.DataFrame and pd.Index.
df = pd.DataFrame(
index=pd.RangeIndex(500),
columns=pd.RangeIndex(300)
)
print(df.shape)
(500, 300)