I am trying to sum the total of unique values in a pandas data frame, but for some reason I am having difficulty getting just one number for each unique value.
calmuni 30,000.00 CA 1-3 year paper
calmuni 95,000.00 CA 1-3 year paper
massmuni 25,000.00 MA 1-3 year paper
massmuni 30,000.00 RI 1-3 year paper
massmuni 175,000.00 MA 1-3 year paper
I am trying to sum column B based on the unique values in column A, but my groupby function isn't working. I would like a single summed line item for each unique value, but instead I get the values concatenated:
calmuni 30,000.0095,000.00125,000.0020,000.0020,000.00...
massmuni 25,000.0030,000.00175,000.0025,000.0050,000.00..
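The run-together output above is what summing strings looks like. If column B was read as text (an assumption, suggested by the thousands separators), groupby concatenates the values instead of adding them. A minimal sketch, with the column names A and B taken from the description and the frame rebuilt from the sample rows:

```python
import pandas as pd

# Sample rows from the question; B is deliberately kept as text to mimic the problem.
df = pd.DataFrame({
    "A": ["calmuni", "calmuni", "massmuni", "massmuni", "massmuni"],
    "B": ["30,000.00", "95,000.00", "25,000.00", "30,000.00", "175,000.00"],
})

# Strip the thousands separators and convert to numbers before grouping.
df["B"] = pd.to_numeric(df["B"].str.replace(",", "", regex=False))

# One total per unique value in A.
totals = df.groupby("A")["B"].sum()
print(totals)
```

If the data comes from a CSV file, passing thousands="," to pd.read_csv may avoid the conversion step altogether.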


Expanding group by window to count nunique

I have the following df:
I want to create an expanding window to count the number of unique customers at any point.
The output for the following df should be:
because in the first month we had 4 different customers, in the 2nd one 3 were added (the other one was already in the first month), and in the last month only one was added (number 10).
You can first drop the duplicated customers (only keep the first ones that appeared) and then cumulatively sum the number of (now unique) customers per month:
counts = df.drop_duplicates("customer").groupby("month").size().cumsum().to_dict()
to get
>>> counts
{1: 4, 2: 7, 3: 8}
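Since the original frame is not shown, here is a sketch of the same one-liner on made-up data that matches the description (4 customers in month 1, 3 new ones in month 2, only customer 10 new in month 3). The customer numbers are hypothetical; only the column names month and customer come from the answer.

```python
import pandas as pd

# Hypothetical data: one row per purchase, already ordered by month.
df = pd.DataFrame({
    "month":    [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "customer": [1, 2, 3, 4, 4, 5, 6, 7, 10, 2],
})

# Keep only the first row in which each customer appears.
first_seen = df.drop_duplicates("customer")

# New customers per month, then a running total.
new_per_month = first_seen.groupby("month").size()
counts = new_per_month.cumsum().to_dict()
print(counts)
```

This relies on the rows being sorted by month, so that the first occurrence is also the earliest one. A month that brings no new customers would be missing from the result and would have to be filled in separately.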
Since there are repeated customers, you can drop them with drop_duplicates on the customer column.
By default it will keep the first occurrence of each customer number and drop later occurrences. To count the number of unique customers each month:
df['customer'] = df.groupby('month')['customer'].transform('count')
df = df.drop_duplicates(ignore_index=True)
To roll the window over the customer column, calculate the cumulative sum of that column:
df['customer'] = df['customer'].cumsum()
It will give the desired output
month customers
1 4
2 7
3 8

How to create new columns using groupby based on logical expressions

I have this CSV file
I want to create three columns, 'MT_Value', 'M_Value', and 'T_Data'. The first is the mean of the data grouped by year and month, which I accomplished by doing this.
For M_Value I need the mean of only the non-zero values, and for T_Data I need the count of zero values divided by the total number of values. I guess that for the last one I need to divide the number of zeros by the total amount of data in each group, but honestly I am a bit lost. I looked on Google and people mention transform, but I didn't understand it very well.
Thank you.
You could do something like this:
Explanation: assign will create new columns with the respective names. Now:
data.Valor.where(data.Valor!=0) will replace 0 values with NaN, which will be ignored when we call mean().
data.Valor.eq(0) will return True where the value is 0 and False otherwise. True counts as 1 and False as 0, so when you do mean(), you compute count(Valor==0)/total_count().
               Valor    M_Value    T_Data
Year Month
1970 1      2.306452   6.500000  0.645161
     2      1.507143   4.688889  0.678571
     3      2.064516   7.111111  0.709677
     4     11.816667  13.634615  0.133333
     5      7.974194  11.236364  0.290323
...              ...        ...       ...
1997 10     3.745161   7.740000  0.516129
     11    11.626667  21.800000  0.466667
     12     0.564516   4.375000  0.870968
1998 1      2.000000  15.500000  0.870968
     2      1.545455   5.666667  0.727273
[331 rows x 3 columns]
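The approach explained above can be sketched end to end as follows. The CSV is not available here, so the frame below is a small made-up stand-in; only the column names Year, Month and Valor and the where/eq expressions come from the explanation, and the exact shape of the original snippet is an assumption.

```python
import pandas as pd

# Made-up stand-in for the CSV: three readings per month.
data = pd.DataFrame({
    "Year":  [1970, 1970, 1970, 1970, 1970, 1970],
    "Month": [1, 1, 1, 2, 2, 2],
    "Valor": [0.0, 4.0, 8.0, 0.0, 0.0, 3.0],
})

result = (
    data.assign(
        # zeros become NaN, so mean() skips them
        M_Value=data.Valor.where(data.Valor != 0),
        # True where the value is zero; its mean is the share of zeros
        T_Data=data.Valor.eq(0),
    )
    .groupby(["Year", "Month"])[["Valor", "M_Value", "T_Data"]]
    .mean()
)
print(result)
```

Here the Valor column of the result is the plain monthly mean, M_Value is the mean of the non-zero readings, and T_Data is the fraction of readings that are zero.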

Pandas matching algorithm with itself

I'm trying to create a matching algo in pandas that does the following with a given table:
A table contains purchases and sales of products by date, item, quantity (+ for purchases and - for sales) and price.
Create an algorithm that matches purchases and sales per item and the corresponding average profit for each item in total.
Matches can only be on the same date, otherwise they are not matched at all.
Remaining positive or negative inventories per day are ignored
Negative inventories are allowed.
Example with a single product:
date product quantity price
1 X +2 1
1 X -1 2
1 X -2 4
2 X +1 1
2 X +1 2
3 X -1 4
The result would be that only on day 1 the 3 trades are matched, with a profit of -2+2+4=4, because the inventory goes +2, -1, and then -1 again. The remaining inventory of -1 is ignored. Day 2 and 3 have no matches because the trades are not closed on the same day.
Correct output:
product Profit
X +4
Is there any elegant way to get to this result without having to loop over the table multiple times with iterrows?
For reproducing the df:
import pandas as pd
df = pd.DataFrame({'date':[1,1,1,2,2,3],'product': ['X']*6,'quantity':[2,-1,-2,1,1,-1],'price':[1,2,4,1,2,4]})
The process you are describing could use groupby and aggregate, something like this:
But I don't fully understand your rules for matching, so for day 1 I got a different total profit. Price * quantity is (+2*1)+(-1*2)+(-2*4)=-8, so profit seems to be 8.
Using iterrows() is rather bad practice. Not only are you writing excessive code, it is also likely to be much slower (check a comparison here).
Most jobs of this type can be accomplished by combining groupby(), aggregate() and apply(). Check out this great tutorial.
I hope this helps you or future answers :)
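For what it's worth, one reading of the matching rule that reproduces the +4 from the question is first-in-first-out within each day: only as many units as were both bought and sold that day are matched, taking rows in table order. This is an assumption about the intended rule, not something the question states outright, and the helper names first_units_value and daily_profit are made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "date": [1, 1, 1, 2, 2, 3],
    "product": ["X"] * 6,
    "quantity": [2, -1, -2, 1, 1, -1],
    "price": [1, 2, 4, 1, 2, 4],
})

def first_units_value(qty, price, cap):
    """Value of the first `cap` units, taking rows in their given order."""
    overshoot = (qty.cumsum() - cap).clip(lower=0)  # units beyond the cap
    used = (qty - overshoot).clip(lower=0)          # units of each row that count
    return (used * price).sum()

def daily_profit(group):
    buys = group[group["quantity"] > 0]
    sells = group[group["quantity"] < 0]
    buy_qty, sell_qty = buys["quantity"], -sells["quantity"]
    matched = min(buy_qty.sum(), sell_qty.sum())    # units closed the same day
    revenue = first_units_value(sell_qty, sells["price"], matched)
    cost = first_units_value(buy_qty, buys["price"], matched)
    return revenue - cost

# One pass over the (date, product) groups, then a total per product.
rows = [
    {"date": d, "product": p, "profit": daily_profit(g)}
    for (d, p), g in df.groupby(["date", "product"])
]
profit = pd.DataFrame(rows).groupby("product")["profit"].sum()
print(profit)
```

On day 1 two units are matched (bought for 2, sold for 2 + 4), and days 2 and 3 contribute nothing because one side of the trade is missing.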

Understanding Correlation Between Columns Pandas DataFrame

I have a dataset with daily sales of two products for the first 10 days of their release. The dataframe below shows singles and dozens of items sold per day for each product. It's believed that no dozens of a product were sold before a single item of that product had been sold. For each Period_ID, the two products have an expected number of dozens sales.
import pandas as pd
d = {'Period_ID':['A12']*10, 'Prod_A_Doz':[1.2]*10, 'Prod_B_Doz':[2.4]*10, 'A_Singles':[0,0,0,1,1,2,2,3,3,4], 'B_Singles':[0,0,1,1,2,2,3,3,4,4],
'A_Dozens':[0,0,0,0,0,0,0,1,1,1], 'B_Dozens':[0,0,0,0,0,0,1,1,2,2]}
df = pd.DataFrame(data=d)
I want to construct a descriptive analysis in which one of my questions is to figure out how many single items of each product were sold on average before a dozen was sold the 1st time, 2nd time, ..., 10th time.
Given that df.Period_ID.nunique() = 1568
I modified the dataset to hold sales per day, as opposed to the cumulative sales above, and used Pankaj Joshi's solution with a small alteration:
print(f'Average number of single items before {index + 1} dozen = {df1.A_Singles[:val+1].mean():0.2f}')
d = {'Period_ID':['A12']*10, 'Prod_A_Doz':[1.2]*10, 'Prod_B_Doz':[2.4]*10, 'A_Singles':[0,0,0,1,0,1,0,1,0,1], 'B_Singles':[0,0,1,0,1,0,1,0,1,0],
'A_Dozens':[0,0,0,0,0,0,0,1,0,0], 'B_Dozens':[0,0,0,0,0,0,1,0,1,0]}
df1 = pd.DataFrame(data=d)
# For product A
Average number of single items before 1 dozen = 0.38
# For product B
Average number of single items before 1 dozen = 0.43
Average number of single items before 2 dozen = 0.44
But I want this to be counted from the last dozens sale, so rather than 0.44 it should be 0.5.
The aim is that once I have this information for each Period_ID, I will take the average over all df.Period_ID.nunique() (= 1568) periods and try to optimise the expected number of 'Dozens' sales for each product, given in the columns Prod_A_Doz and Prod_B_Doz.
I would appreciate all the help.
Here is how I would go about it:
d = {'Period_ID':['A12']*10, 'Prod_A_Doz':[1.2]*10, 'Prod_B_Doz':[2.4]*10, 'A_Singles':[0,0,0,1,1,2,2,3,3,4], 'B_Singles':[0,0,1,1,2,2,3,3,4,4],
     'A_Dozens':[0,0,0,0,0,0,0,1,1,1], 'B_Dozens':[0,0,0,0,0,0,1,1,2,2]}
df1 = pd.DataFrame(data=d)
for per_id in set(df1.Period_ID):
    df_temp = df1[df1.Period_ID == per_id]
    for prod in ('A', 'B'):
        singles = df_temp[f'{prod}_Singles']
        dozens = df_temp[f'{prod}_Dozens']
        # Rows where this product has a non-zero dozens figure
        for index, val in enumerate(dozens.index[dozens > 0]):
            print(f'Product {prod}: average number of single items before {index} dozen = {singles[:val].mean():0.2f}')
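For the variant asked about in the question, where each average is counted from the day after the previous dozens sale, one possible sketch is to label every row with the number of dozens sales that came strictly before it and group on that label. It uses the per-day figures for product B from the question; the variable names are illustrative.

```python
import pandas as pd

# Per-day (not cumulative) figures for product B, as in the question.
df1 = pd.DataFrame({
    "B_Singles": [0, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "B_Dozens":  [0, 0, 0, 0, 0, 0, 1, 0, 1, 0],
})

sold = df1["B_Dozens"] > 0

# Number of dozens sales strictly before each row, so that every segment
# ends on, and includes, the day of a dozens sale.
segment = sold.cumsum().shift(fill_value=0)

# Drop the trailing segment that does not end in a dozens sale.
n_sales = int(sold.sum())
means = df1.groupby(segment)["B_Singles"].mean().iloc[:n_sales]

for k, value in enumerate(means, start=1):
    print(f"Average number of single items before dozen {k} = {value:0.2f}")
```

With many Period_IDs, the same idea could be applied per period by grouping on Period_ID first and computing the segment label within each group.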

Create new column on pandas DataFrame in which the entries are randomly selected entries from another column

I have a DataFrame with the following structure.
import pandas as pd
df = pd.DataFrame({'tenant_id': [1,1,1,2,2,2,3,3,7,7], 'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
'text':['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})
df = df[['tenant_id', 'user_id', 'text']]
This gives the following output:
tenant_id user_id text
1 ab1 apple
1 avc1 ball
1 bc2 card
2 iuyt toy
2 fvg sleep
2 fbh happy
3 bcv sad
3 bcb be
7 yth u
7 ytn pop
I would like to groupby on tenant_id and create a new column which is a random selection of strings from the user_id column.
Thus, I would like my output to look like the following:
tenant_id user_id text new_column
1 ab1 apple [ab1, bc2]
1 avc1 ball [ab1]
1 bc2 card [avc1]
2 iuyt toy [fvg, fbh]
2 fvg sleep [fbh]
2 fbh happy [fvg]
3 bcv sad [bcb]
3 bcb be [bcv]
7 yth u [pop]
7 ytn pop [u]
Here, random ids from the user_id column have been selected. These ids can be repeated, as "fvg" is repeated for tenant_id=2. I would like to have a threshold of not more than ten ids. This data is just a sample and has only 10 ids to start with, so generally any number much less than the total number of user_ids. In this case, say, one less than the total number of user_ids that belong to a tenant.
I first tried figuring out how to select a random subset of varying length with
new_column = df.user_id.sample(n=np.random.randint(1, 10))
I am kinda lost after this. Assigning it to my df results in NaNs, probably because the samples are of variable lengths. Please help.
Per my comment:
Your 'new column' is not a new column, it's a new cell for a single row.
If you want to assign the result to a new column, you need to create a new column and apply the cell computation to it.
df['new column'] = df['user_id'].apply(lambda x: df.user_id.sample(n=np.random.randint(1, 10)).tolist())
It doesn't really matter which column you use for the apply, since the variable is not used in the computation.
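The line above samples from all user_ids, regardless of tenant. If the samples should come only from the same tenant, as the question's groupby wording suggests, a sketch along these lines might do. The cap of "one less than the tenant's user count, and never more than ten" is taken from the question; the helper name pick, the fixed seed, and sampling without replacement within a row are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "tenant_id": [1, 1, 1, 2, 2, 2, 3, 3, 7, 7],
    "user_id": ["ab1", "avc1", "bc2", "iuyt", "fvg", "fbh", "bcv", "bcb", "yth", "ytn"],
})

rng = np.random.default_rng(0)  # fixed seed only so the sketch is repeatable

# tenant_id -> list of that tenant's user_ids
pools = df.groupby("tenant_id")["user_id"].agg(list)

def pick(tenant):
    """Random subset of one tenant's user_ids (hypothetical helper)."""
    pool = pools[tenant]
    max_n = min(max(len(pool) - 1, 1), 10)   # at most n-1 ids, never more than 10
    n = rng.integers(1, max_n + 1)           # upper bound is exclusive
    return rng.choice(pool, size=n, replace=False).tolist()

# One independent draw per row, stored as a list in each cell.
df["new_column"] = pd.Series([pick(t) for t in df["tenant_id"]], index=df.index)
print(df)
```

Each cell holds a Python list, so the column has object dtype. The same id can still show up in several rows of a tenant, since every row is drawn independently.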