How to count the total price in a dataframe

I have retail data from which I created a retail dataframe:
from pyspark import SparkFiles
from pyspark.sql.functions import col, collect_list, struct, transform

spark.sparkContext.addFile('https://raw.githubusercontent.com/databricks/Spark-The-Definitive-Guide/master/data/retail-data/all/online-retail-dataset.csv')
retail_df = spark.read.csv(SparkFiles.get('online-retail-dataset.csv'), header=True, inferSchema=True)\
    .withColumn('OverallItems', struct('StockCode', 'Description', 'UnitPrice', 'Quantity', 'InvoiceDate', 'CustomerID', 'Country'))
Then I created retail_array, which has two columns, InvoiceNo and Items:
retail_array = retail_df.groupBy('InvoiceNo')\
.agg(collect_list(col('OverallItems')).alias('Items'))
I want to compute the total price of the invoice items and add it into the Items column in retail_array.
So far I have written this code:
transformer = lambda x: struct(x['UnitPrice'], x['Quantity'], x['UnitPrice'] * x['Quantity']).cast("struct<UnitPrice:double,Quantity:double,TotalPrice:double>")
TotalPrice_df = retail_array\
.withColumn('TotalPrice', transform("items", transformer))
TotalPrice_df.show(truncate=False)
But with this code I'm adding a new column to retail_array, while I want this new column to be part of the Items column in retail_array.
For one invoice the output looks like this:
+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+
|InvoiceNo|Items                                                                                                                                                                |TotalPrice                                                         |
+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+
|536366   |[{22633, HAND WARMER UNION JACK, 1.85, 6, 12/1/2010 8:28, 17850, United Kingdom}, {22632, HAND WARMER RED POLKA DOT, 1.85, 6, 12/1/2010 8:28, 17850, United Kingdom}]|[{1.85, 6.0, 11.100000000000001}, {1.85, 6.0, 11.100000000000001}]|
+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+
I want it to sum 11.100000000000001 + 11.100000000000001 and add the result into the Items column, with no extra column. Also, for other invoices there are sometimes more than two total prices that need to be added together.

Use the aggregate function instead of transform to calculate the total price, like this:
from pyspark.sql import functions as F
retail_array = retail_df.groupBy("InvoiceNo").agg(
    F.collect_list(F.col("OverallItems")).alias("Items")
).withColumn(
    "TotalPrice",
    F.aggregate("Items", F.lit(0.0), lambda acc, x: acc + (x["Quantity"] * x["UnitPrice"]))
)
Note however that you can actually calculate this TotalPrice in the same aggregation where you collect the list of structs, and thus avoid additional calculations iterating over the array elements:
retail_array = retail_df.groupBy("InvoiceNo").agg(
    F.collect_list(F.col("OverallItems")).alias("Items"),
    F.sum(F.col("Quantity") * F.col("UnitPrice")).alias("TotalPrice")
)
retail_array.show(1)
#+---------+--------------------+------------------+
#|InvoiceNo| Items| TotalPrice|
#+---------+--------------------+------------------+
#| 536366|[{22633, HAND WAR...|22.200000000000003|
#+---------+--------------------+------------------+
But with this code I'm adding a new column to retail_array, while I want this new column to be part of the Items column in retail_array
Not sure I correctly understood this part. The Items column is an array of structs; it does not make much sense to replicate the total price of an InvoiceNo in each of its items.
That said, if you really want to do this, you can use transform after calculating the total price (step above):
result = retail_array.withColumn(
    "Items",
    F.transform("Items", lambda x: x.withField("TotalPrice", F.col("TotalPrice")))
).drop("TotalPrice")
result.show(1, False)
#+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|InvoiceNo|Items |
#+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|536366 |[{22633, HAND WARMER UNION JACK, 1.85, 6, 12/1/2010 8:28, 17850, United Kingdom, 22.200000000000003}, {22632, HAND WARMER RED POLKA DOT, 1.85, 6, 12/1/2010 8:28, 17850, United Kingdom, 22.200000000000003}]|
#+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
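As a side note, the asker's original transformer computed a per-item total (UnitPrice * Quantity). If that per-item value is what should live inside each struct, rather than the invoice total, a minimal sketch along the same lines (Spark 3.1+ for withField; the field name ItemTotal is illustrative, not from the original post):

# Hedged sketch: add each item's own line total to its struct.
result_items = retail_array.withColumn(
    "Items",
    F.transform("Items", lambda x: x.withField("ItemTotal", x["Quantity"] * x["UnitPrice"]))
)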

Related

How to separate entries, and count the occurrences

I'm trying to count which country most celebrities come from. However the csv that I'm working with has multiple countries for a single celeb. e.g. "France, US" for someone with a double nationality.
To count the above, I can use .count() on the entries in the "nationality" column. But I want to count France, US, and any other country separately.
I cannot figure out a way to separate all the entries in the column and then count the occurrences.
I want to be able to reorder my dataframe with these counts, so I want to count this inside the structure:
data.groupby(by="nationality").count()
This returns faulty counts such as:
"France, US" 1
Assuming this type of data:
import pandas as pd
data = pd.DataFrame({'nationality': ['France', 'France, US', 'US', 'France']})
nationality
0 France
1 France, US
2 US
3 France
You need to split and explode, then use value_counts to get the sorted counts per country:
out = (data['nationality']
       .str.split(', ')
       .explode()
       .value_counts()
)
Output:
France 3
US 2
Name: nationality, dtype: int64
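Since the question also mentions reordering the dataframe with these counts, a small follow-up sketch (assuming the out Series above; the column name count is just illustrative) maps each country's count back onto the exploded rows so they can be sorted:

# Hedged sketch: explode nationalities, attach each country's count, then sort by it.
exploded = (data
            .assign(nationality=data['nationality'].str.split(', '))
            .explode('nationality'))
exploded['count'] = exploded['nationality'].map(out)
exploded = exploded.sort_values('count', ascending=False)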

Applying a function to list of columns of a dataframe?

I scraped this table from this URL:
"https://www.patriotsoftware.com/blog/accounting/average-cost-living-by-state/"
Which looks like this:
State Annual Mean Wage (All Occupations) Median Monthly Rent Value of a Dollar
0 Alabama $44,930 $998 $1.15
1 Alaska $59,290 $1,748 $0.95
2 Arizona $50,930 $1,356 $1.04
3 Arkansas $42,690 $953 $1.15
4 California $61,290 $2,518 $0.87
And then I wrote this function to help me turn the strings into ints:
def money_string_to_int(s):
    return int(s.replace(",", "").replace("$", ""))

money_string_to_int("$1,23")
My function works when I apply it to only one column. I found this answer about using it on multiple columns: How to apply a function to multiple columns in Pandas
But my code below does not work and produces no errors:
ls = ['Annual Mean Wage (All Occupations)', 'Median Monthly Rent',
'Value of a Dollar']
ppe_table[ls] = ppe_table[ls].apply(money_string_to_int)
Let's try:
df.set_index('State').apply(lambda x: x.str.replace('[$,]', '', regex=True).astype(float)).reset_index()
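If you would rather keep the question's helper-function approach, note that DataFrame.apply hands the function a whole column (a Series), and Series.replace does value replacement rather than substring replacement, which is why the original attempt never cleaned the strings. A hedged element-wise sketch, assuming the same ppe_table and ls from the question (money_string_to_float is an illustrative name, since 'Value of a Dollar' holds decimals like $1.15 that int() would reject):

# Hedged sketch: clean each cell element-wise; float handles values like "$1.15".
def money_string_to_float(s):
    return float(s.replace(",", "").replace("$", ""))

ppe_table[ls] = ppe_table[ls].apply(lambda col: col.map(money_string_to_float))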

Taking mean of N largest values of group by absolute value

I have some DataFrame:
import numpy as np
import pandas as pd
d = {'fruit': ['apple', 'pear', 'peach'] * 6, 'values': np.random.uniform(-5, 5, 18), 'values2': np.random.uniform(-5, 5, 18)}
df = pd.DataFrame(data=d)
I can take the mean of each fruit group as such:
df.groupby('fruit').mean()
However, for each fruit group, I'd like to take the mean of the N largest values, as ranked by absolute value.
So for example, if my values were as follows and N=3:
[ 0.7578507 , 3.81178045, -4.04810913, 3.08887538, 2.87999752, 4.65670954]
The desired outcome would be (4.65670954 + -4.04810913 + 3.81178045) / 3 = ~1.47
Edit - to clarify that sign is preserved in outcome:
(4.65670954 + -20.04810913 + 3.81178045) / 3 = -3.859
Updating with a new approach that I think is simpler. I was avoiding apply like the plague, but maybe this is one of the more acceptable uses. Plus it fixes the fact that you want the mean of the original values as ranked by their absolute values:
def foo(d):
    return d[d.abs().nlargest(3).index].mean()

out = df.groupby('fruit')['values'].apply(foo)
So you index each group by its 3 largest absolute values, then take the mean.
And for the record my original, incorrect, and slower code was:
df['values'].abs().groupby(df['fruit']).nlargest(3).groupby("fruit").mean()
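If you want to avoid a Python-level function entirely, one hedged alternative is to order the rows by absolute value, keep the top 3 per fruit, and then average the original (signed) values within each group:

# Hedged sketch: sort by |values|, keep the 3 largest per fruit, then take group means.
order = df['values'].abs().sort_values(ascending=False).index
top3 = df.reindex(order).groupby('fruit').head(3)
out = top3.groupby('fruit')['values'].mean()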

Understanding Correlation Between Columns Pandas DataFrame

I have a dataset with daily sales of two products for the first 10 days of their release. The dataframe below shows single items and dozens being sold per day for each product. It's believed that no dozen of a product was sold before a single item of that product had been sold. The two products (per Period_ID) have an expected number of dozen sales.
d = {'Period_ID':['A12']*10, 'Prod_A_Doz':[1.2]*10, 'Prod_B_Doz':[2.4]*10, 'A_Singles':[0,0,0,1,1,2,2,3,3,4], 'B_Singles':[0,0,1,1,2,2,3,3,4,4],
'A_Dozens':[0,0,0,0,0,0,0,1,1,1], 'B_Dozens':[0,0,0,0,0,0,1,1,2,2]}
df = pd.DataFrame(data=d)
QUESTION
I want to construct a descriptive analysis in which one of my questions is to figure out how many single items of each product were sold, on average, before a dozen was sold for the 1st time, 2nd time, ..., 10th time.
Given that df.Period_ID.nunique() = 1568
Modifying the dataset to sales per day, as opposed to the cumulative sales above, and using Pankaj Joshi's solution with a small alteration:
print(f'Average number of single items before {index + 1} dozen = {df1.A_Singles[:val+1].mean():0.2f}')
d = {'Period_ID':['A12']*10, 'Prob_A_Doz':[1.2]*10, 'Prod_B_Doz':[2.4]*10, 'A_Singles':[0,0,0,1,0,1,0,1,0,1], 'B_Singles':[0,0,1,0,1,0,1,0,1,0],
'A_Dozens':[0,0,0,0,0,0,0,1,0,0], 'B_Dozens':[0,0,0,0,0,0,1,0,1,0]}
df1 = pd.DataFrame(data=d)
# For product A
Average number of single items before 1 dozen = 0.38
# For product B (the 6 and 8 below are the printed row indices, val)
6
Average number of single items before 1 dozen = 0.43
8
Average number of single items before 2 dozen = 0.44
But I want this last value counted from the last dozen sale, so rather than 0.44 it should be 0.5.
The aim is that once I have this information for each Period_ID, I will take the average over all df.Period_ID.nunique() (= 1568) periods and try to optimise the expected number of dozen sales for each product, given in the columns Prod_A_Doz and Prod_B_Doz.
I would appreciate all the help.
Here is how I will go about it:
d = {'Period_ID':['A12']*10, 'Prob_A_Doz':[1.2]*10, 'Prod_B_Doz':[2.4]*10, 'A_Singles':[0,0,0,1,1,2,2,3,3,4], 'B_Singles':[0,0,1,1,2,2,3,3,4,4],
'A_Dozens':[0,0,0,0,0,0,0,1,1,1], 'B_Dozens':[0,0,0,0,0,0,1,1,2,2]}
df1 = pd.DataFrame(data=d)
for per_id in set(df1.Period_ID):
    print(per_id)
    df_temp = df1[df1.Period_ID == per_id]
    # Product A
    for index, val in enumerate(df_temp.index[df_temp.A_Dozens > 0]):
        print(val)
        print(f'Average number of single items before {index} dozen = {df_temp.A_Singles[:val].mean():0.2f}')
    # Product B
    for index, val in enumerate(df_temp.index[df_temp.B_Dozens > 0]):
        print(val)
        print(f'Average number of single items before {index} dozen = {df_temp.B_Singles[:val].mean():0.2f}')
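To address the "counted from the last dozen sale" point in the question, a hedged sketch, assuming the per-day df1 from the edit above (not the cumulative one): reset the window after every dozen sale instead of always slicing from the start (shown for product A; product B is analogous):

# Hedged sketch: average the single sales observed since the previous dozen sale.
# Assumes the default RangeIndex, so positions and index labels coincide.
prev = -1
for nth, idx in enumerate(df1.index[df1.A_Dozens > 0], start=1):
    window = df1.A_Singles.iloc[prev + 1: idx + 1]
    print(f'Average number of single items before dozen #{nth} = {window.mean():0.2f}')
    prev = idx

With the per-day data this should print 0.38 for product A's first dozen, and the same pattern for product B gives 0.43 and then 0.50 rather than 0.44.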

Create new column on pandas DataFrame in which the entries are randomly selected entries from another column

I have a DataFrame with the following structure.
import pandas as pd
df = pd.DataFrame({'tenant_id': [1,1,1,2,2,2,3,3,7,7], 'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
                   'text': ['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})
This gives the following output:
df = df[['tenant_id', 'user_id', 'text']]
tenant_id user_id text
1 ab1 apple
1 avc1 ball
1 bc2 card
2 iuyt toy
2 fvg sleep
2 fbh happy
3 bcv sad
3 bcb be
7 yth u
7 ytn pop
I would like to groupby on tenant_id and create a new column which is a random selection of strings from the user_id column.
Thus, I would like my output to look like the following:
tenant_id user_id text new_column
1 ab1 apple [ab1, bc2]
1 avc1 ball [ab1]
1 bc2 card [avc1]
2 iuyt toy [fvg, fbh]
2 fvg sleep [fbh]
2 fbh happy [fvg]
3 bcv sad [bcb]
3 bcb be [bcv]
7 yth u [pop]
7 ytn pop [u]
Here, random ids from the user_id column have been selected; these ids can be repeated, as "fvg" is repeated for tenant_id=2. I would like a threshold of not more than ten ids. This data is just a sample and has only 10 ids to start with, so in general the number should be much less than the total number of user_ids for a tenant; in this case, say, 1 less than the total user_ids that belong to the tenant.
I tried first figuring out how to select a random subset of varying length with df.sample:
new_column = df.user_id.sample(n=np.random.randint(1, 10))
I am kinda lost after this; assigning it to my df results in NaNs, probably because they are of variable lengths. Please help.
Thanks.
Per my comment:
Your 'new column' is not a new column, it's a new cell for a single row.
If you want to assign the result to a new column, you need to create a new column, and apply the cell computation to it.
df['new column'] = df['user_id'].apply(lambda x: list(df.user_id.sample(n=np.random.randint(1, 10))))
It doesn't really matter which column you use for the apply, since the variable is not used in the computation.
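The one-liner above samples from the whole user_id column; to match the desired output more closely (sampling only within each tenant, excluding the row's own id, and capping at ten), here is a hedged sketch, where sample_peers is an illustrative helper name:

import numpy as np

# Hedged sketch: per-row random sample of other user_ids from the same tenant.
def sample_peers(row, pool_by_tenant, max_ids=10):
    pool = [u for u in pool_by_tenant[row['tenant_id']] if u != row['user_id']]
    if not pool:  # tenant with a single user
        return []
    k = np.random.randint(1, min(max_ids, len(pool)) + 1)
    return list(np.random.choice(pool, size=k, replace=False))

pool_by_tenant = df.groupby('tenant_id')['user_id'].apply(list).to_dict()
df['new_column'] = df.apply(sample_peers, axis=1, pool_by_tenant=pool_by_tenant)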