ggplot bar chart of frequency multiplied by sales - ggplot2

My data is a list of orders, each consisting of an order number, a product category, and a number of sales; a single order can contain multiple sales.
I need to create a bar chart of sales by product category. Instead of just counting the number of orders for each product category, I need to count the number of orders multiplied by their respective number of sales.
How do I write the code? Thanks!

Let's start with a reproducible example.
library(tidyverse)
d <- data.frame(category = c(rep("Product1", 5), rep("Product2", 5), rep("Product3", 5)),
                order = round(rnorm(15, 10, 5)),
                sales = round(rnorm(15, 5, 4)))
category order sales
Product1    13    12
Product1    10     7
Product1     3     8
Product1    12     7
Product1    11     1
Product2    22     7
Product2     9     8
Product2     0     3
Product2    15     5
Product2    13     5
We can aggregate by group, and sum up the results in a new variable called total.
d1 <- d |>
  group_by(category) |>
  summarise(total = sum(order * sales))

ggplot() +
  geom_col(data = d1,
           aes(y = total,
               x = category,
               fill = category)) +
  theme_bw()
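If you'd rather skip the manual aggregation step, geom_bar() can do the summing for you through the weight aesthetic; each row then contributes order * sales to its category's bar:
# Equivalent chart without the group_by()/summarise() step:
# the weight aesthetic makes geom_bar() sum order * sales per category.
ggplot(d, aes(x = category, weight = order * sales, fill = category)) +
  geom_bar() +
  theme_bw()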
Is that what you had in mind?

Related

Postgres calculate average using distinct IDs, values also distinct

I have a Postgres query that is supposed to calculate an average value based on a set of values. This set of values should be based on DISTINCT IDs.
The query is the following:
#{context.answers_base}
SELECT
stores.name as store_name,
answers_base.question_name as question_name,
answers_base.question_id as question_id,
(sum(answers_base.answer_value) / NULLIF(count(answers_base.answer_id),0)) as score, -- <--- this line is calculating the wrong value
sum(answers_base.answer_value) as score_sum,
count(answers_base.answer_id) as question_answer_count,
count(DISTINCT answers_base.answer_id) as answer_count
FROM answers_base
INNER JOIN stores ON stores.id = answers_base.store_id
WHERE answers_base.answer_value IS NOT NULL AND answers_base.question_type_id = :question_type_id
AND answers_base.scale = TRUE
#{context.filter_answers}
GROUP BY stores.name, answers_base.question_name, answers_base.question_id, answers_base.sort_order
ORDER BY stores.name, answers_base.sort_order
The thing is that on the indicated line, (sum(answers_base.answer_value) / NULLIF(count(answers_base.answer_id),0)), some values are counted more than once.
Part of the solution is making it DISTINCT based on ID, like so:
(sum(answers_base.answer_value) / NULLIF(count(DISTINCT answers_base.answer_id),0))
This results in an average that divides by the right number, but the sum being divided is still wrong.
Making sum() DISTINCT as well does not work, because the values are not unique. The values are either 0 / 25 / 50 / 75 / 100, so different IDs may hold the 'same' value.
(sum(DISTINCT answers_base.answer_value) / NULLIF(count(DISTINCT answers_base.answer_id),0))
How would I go about making this work?
Here are simplified versions of the table structures.
Table Answer
ID | answer_date
---+-------------
 1 | Feb 01, 2022
 2 | Mar 02, 2022
 3 | Mar 13, 2022
 4 | Mar 21, 2022
Table AnswerRow
ID | answer_id | answer_value
---+-----------+--------------
 1 |         1 |           25
 2 |         1 |           50
 3 |         1 |           50
 4 |         2 |           75
 5 |         2 |          100
 6 |         2 |            0
 7 |         3 |           25
 8 |         4 |           25
 9 |         4 |          100
10 |         4 |           50
Answer 1's answer_rows:
25 + 50 + 50 -> average = 125 / 3
Answer 2's answer_rows:
75 + 100 + 0 -> average = 175 / 3
Answer 3's answer_rows:
25 -> average = 25 / 1
Answer 4's answer_rows:
25 + 100 + 50 -> average = 175 / 3
For some reason, we get duplicate answer_rows in the calculation.
Example of the problem: for answer_id = 1 we have the following answer_rows in the calculation, giving us a different average:
ID | answer_id | answer_value
---+-----------+--------------
 1 |         1 |           25
 2 |         1 |           50
 3 |         1 |           50
 3 |         1 |           50
 3 |         1 |           50
 3 |         1 |           50
Result: 25 + 50 + 50 + 50 + 50 + 50 -> 275 / 6
Desired result: 25 + 50 + 50 -> 125 / 3
Making answer_row_id distinct (see beginning of post) makes it possible for me to get:
25 + 50 + 50 + **50 + 50 + 50** -> 275 / **3**
But not
25 + 50 + 50 -> 125 / 3
What I would like is a calculation that selects answer_rows distinctly by their ID, and uses those same rows for both x and y in the average x / y.
answers_base is the following (simplified):
WITH answers_base as (
SELECT
answers.id as answer_id,
answers.store_id as store_id,
answer_rows.id as answer_row_id,
question_options.answer_value as answer_value
FROM answers
INNER JOIN answer_rows ON answers.id = answer_rows.answer_id
INNER JOIN stores ON stores.id = answers.store_id
WHERE answers.status = 0
)
I think this would be best solved with a window function. Something along the lines of
SELECT
    ROW_NUMBER() OVER (PARTITION BY answer_rows.id ORDER BY answer_rows.created_at DESC) AS duplicate_answers
    ...
WHERE
    duplicate_answers = 1
This would filter out multiple rows with the same id, and only keep one entry. (I chose the "first by created_at", but you could change this to whatever logic suits you best.)
A benefit to this approach is that it makes the rationale behind the logic clear, contained and re-usable.
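Concretely, reusing the answers_base CTE from the question: the ROW_NUMBER() has to be computed before you can filter on it, so it goes inside the CTE and the outer query keeps only the first copy of each answer_row (created_at is assumed to exist on answer_rows, as in the sketch above):
WITH answers_base as (
    SELECT
        answers.id as answer_id,
        answers.store_id as store_id,
        answer_rows.id as answer_row_id,
        question_options.answer_value as answer_value,
        -- number the duplicate copies of each answer_row; copy 1 is kept
        ROW_NUMBER() OVER (PARTITION BY answer_rows.id
                           ORDER BY answer_rows.created_at DESC) as duplicate_answers
    FROM answers
    INNER JOIN answer_rows ON answers.id = answer_rows.answer_id
    INNER JOIN stores ON stores.id = answers.store_id
    WHERE answers.status = 0
)
SELECT ...
FROM answers_base
WHERE answers_base.duplicate_answers = 1
With the duplicates gone, the original sum(...) / NULLIF(count(...), 0) divides the right sum by the right count, and no DISTINCT is needed.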

MS-Access: sum up 2 values before building a pivot table

So somebody ordered cheese, beer, bread and cigarettes, and table tblItems looks like this:
iID  iName        iAmount  iPrice  iProductID  iOrderID
--------------------------------------------------------
1    cheese       1        2.99    11          7
2    can of beer  6        0.99    14          7
3    bread        1        2.25    15          7
4    cigarettes   1        6.99    16          7
Before feeding this order into a TRANSFORM query, I need to
summarise all items with iProductID = 11 OR iProductID = 15,
name the result "food",
and assign to it an iProductID = 99.
iProductIDs other than 14 or 99 are to be filtered out.
I do not want to change the data in tblItems though.
I need it to look like this:
iID  iName        iAmount  iPrice  iProductID  iOrderID
--------------------------------------------------------
1    food         1        5.24    99          7
2    can of beer  6        0.99    14          7
I've been fiddling around for more than an hour and just can't seem to grasp it.
The filtering-part is easy.
SELECT * FROM tblItems WHERE iProductID = 14 OR iProductID = 99
It is the aggregating part that gives me headaches.
Can somebody please help and point me in the right direction?
You want an aggregation query:
select iif(i.iProductID in (11, 15), 99, i.iProductID) as iProductID,
       iif(i.iProductID in (11, 15), "food", i.iName) as iName,
       sum(i.iPrice) as iPrice,
       i.iOrderID
from tblItems as i
where i.iProductID in (11, 15, 14)
group by iif(i.iProductID in (11, 15), 99, i.iProductID),
         iif(i.iProductID in (11, 15), "food", i.iName),
         i.iOrderID;
The iif() maps products 11 and 15 onto the synthetic product 99 labelled "food", while beer keeps its own ID and name; grouping on the mapped values then collapses the two food rows into one.
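Note that this leaves out iID and iAmount. If you also need iAmount, pick its aggregate deliberately: adding sum(i.iAmount) to the select would give 2 for the combined food row (1 + 1), whereas the desired output above shows 1, so decide first what the amount of a combined row should mean.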

Sequence of numbers per category given first entry (Python, Pandas)

Suppose I have 5 categories {A, B, C, D, E} and several date entries of purchases with distinct dates (for instance, A may range from 01/01/1900 to 31/01/1901 and B from 02/02/1930 to 03/03/1933).
I want to create a new column 'day of occurrence' containing a sequence of numbers 1...N, starting from the first date on which the number of purchases is >= 5.
I want this in order to compare how similar categories are from the day they achieved 5 purchases (dates are irrelevant here, but product lifetime is).
Thanks!
Here is how you can label rows from 1 to N depending on a column value.
import pandas as pd

df = pd.DataFrame(data=[3, 6, 9, 3, 6], columns=['data'])
df['day of occurrence'] = 0
# count the qualifying rows, then number them 1..N in row order
values_count = df.loc[df['data'] >= 5].shape[0]
df.loc[df['data'] >= 5, 'day of occurrence'] = range(1, values_count + 1)
The initial DataFrame:
   data
0     3
1     6
2     9
3     3
4     6
Output DataFrame:
   data  day of occurrence
0     3                  0
1     6                  1
2     9                  2
3     3                  0
4     6                  3
Your data should be sorted by date, for example, df = df.sort_values(by='your-datetime-column')
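If, as in the question, the numbering should run separately per category and keep counting from the first day the threshold is reached, here is a groupby-based sketch (the category/date/purchases column names and the sample values are invented for illustration):
import pandas as pd

# Hypothetical data; column names and values are assumptions for illustration.
df = pd.DataFrame({
    'category':  ['A', 'A', 'A', 'B', 'B', 'B'],
    'date':      pd.to_datetime(['1900-01-01', '1900-01-02', '1900-01-03',
                                 '1930-02-02', '1930-02-03', '1930-02-04']),
    'purchases': [3, 6, 2, 1, 7, 9],
})

df = df.sort_values(['category', 'date'])
# True from the first row per category where purchases >= 5, onwards
started = df.groupby('category')['purchases'].transform(lambda s: (s >= 5).cummax())
# number those rows 1..N within each category; earlier rows stay 0
df['day of occurrence'] = started.astype(int).groupby(df['category']).cumsum().where(started, 0)
Rows before a category's first day with purchases >= 5 get 0, and the sequence keeps counting on later days even if they fall below the threshold again, which is what makes lifetimes comparable across categories.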

How to apply different aggregate functions to different columns in pandas?

I have a dataframe with many columns; some contain prices and the rest contain volumes, as below:
year_month  0_fx_price_gy  0_fx_volume_gy  1_fx_price_yuy  1_fx_volume_yuy
1990-01     2              10              3               30
1990-01     2              20              2               40
1990-02     2              30              3               50
I need to group by year_month, taking the mean of the price columns and the sum of the volume columns.
Is there any quick way to do this in one statement, i.e. average if the column name contains price and sum if it contains volume?
df.groupby('year_month').?
Note: this is just sample data with fewer columns, but the format is similar.
Output:
year_month  0_fx_price_gy  0_fx_volume_gy  1_fx_price_yuy  1_fx_volume_yuy
1990-01     2              30              2.5             70
1990-02     2              30              3               50
Create a dictionary mapping matched column names to aggregation functions and pass it to DataFrameGroupBy.agg; finally, reindex in case the order of the output columns changed:
d1 = dict.fromkeys(df.columns[df.columns.str.contains('price')], 'mean')
d2 = dict.fromkeys(df.columns[df.columns.str.contains('volume')], 'sum')
#merge dicts together
d = {**d1, **d2}
print (d)
{'0_fx_price_gy': 'mean', '1_fx_price_yuy': 'mean',
'0_fx_volume_gy': 'sum', '1_fx_volume_yuy': 'sum'}
Another way to build the dictionary:
d = {}
for c in df.columns:
    if 'price' in c:
        d[c] = 'mean'
    if 'volume' in c:
        d[c] = 'sum'
The solution simplifies to a dict comprehension if, apart from the first column (excluded via df.columns[1:]), there are only price and volume columns:
d = {x: 'mean' if 'price' in x else 'sum' for x in df.columns[1:]}
df1 = df.groupby('year_month', as_index=False).agg(d).reindex(columns=df.columns)
print (df1)
  year_month  0_fx_price_gy  0_fx_volume_gy  1_fx_price_yuy  1_fx_volume_yuy
0    1990-01              2              30             2.5              70
1    1990-02              2              30             3.0              50

Percentage of variable corresponding to percentage of other variable

I have two numerical variables, and would like to calculate the percentage of one variable corresponding to at least 50% of the other variable's sum.
For example:
A | B
__________
2 | 8
1 | 20
3 | 12
5 | 4
2 | 7
1 | 11
4 | 5
Here, the sum of column B is 67, so I'm looking for the rows (in B's descending order) whose cumulative sum reaches at least half of that, 33.5.
In this case, those are rows 2, 3 & 6 (cumulative sum of 43). The sum of these rows' column A is 5, which I want to compare to the total sum of column A (18).
Therefore, the result I'm looking for is 5 / 18 * 100 = 27.78%.
I'm looking for a way to implement this in QlikSense, or in SQL.
Here's one way you can do it - there is probably some optimisation to be done, but this gives what you want.
Source:
LOAD
    *,
    RowNo() as RowNo_Source
Inline [
A , B
2 , 8
1 , 20
3 , 12
5 , 4
2 , 7
1 , 11
4 , 5
];

SourceSorted:
NoConcatenate LOAD
    *,
    RowNo() as RowNo_SourceSorted
Resident Source
Order by B asc;

drop table Source;

BTotal:
LOAD sum(B) as BTotal
Resident SourceSorted;

let BTotal = peek('BTotal', 0);

SourceWithCumu:
NoConcatenate LOAD
    *,
    rangesum(peek('BCumu'), B) as BCumu,
    $(BTotal) as BTotal,
    rangesum(peek('BCumu'), B) / $(BTotal) as BCumuPct,
    if(rangesum(peek('BCumu'), B) / $(BTotal) >= 0.5, A, 0) as AFiltered
Resident SourceSorted;

Drop Table SourceSorted;
I included some debug fields that might be useful, but you could of course remove these.
Then in the front end you do your calculation of sum(AFiltered)/sum(A) to get the stat you want and format it as a percentage.
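Since the question also asks about SQL: the same calculation can be sketched with window functions (assuming a table t with columns A and B, and a dialect with window-function support such as Postgres):
WITH cumu AS (
    SELECT
        A,
        B,
        -- running total of B, largest values first (ROWS frame so ties accumulate row by row)
        SUM(B) OVER (ORDER BY B DESC ROWS UNBOUNDED PRECEDING) AS b_running,
        SUM(B) OVER () AS b_total,
        SUM(A) OVER () AS a_total
    FROM t
)
SELECT 100.0 * SUM(A) / MAX(a_total) AS pct
FROM cumu
-- a row is kept while the running total before it is still below half of B's sum
WHERE b_running - B < b_total / 2.0;
For the sample data this keeps the rows with B = 20, 12 and 11, giving 100 * 5 / 18 ≈ 27.78.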