SELECT random 10% of rows for each category on SQL Server - sql

There is a table of products sold.
row_id  customer     product         date_sold
------  -----------  --------------  ----------
1       customer_1   thingamajig     01.01.2023
2       customer_12  whosi-whatsi    03.01.2023
3       customer_1   watchamacallit  04.01.2023
4       customer_4   whosi-whatsi    06.01.2023
...     ...          ...             ...
There is always one row per one item.
Let's say customer_1 ordered 100 items total. customer_2 ordered 50 items total. customer_3 ordered 17 items total. How do you select random 10% of rows for each customer? The fraction of rows selected should be rounded up (for example 12 rows total results in 2 selected). That means every customer that bought at least one item should appear in the resulting table. In this case the resulting table for customer_1, customer_2 and customer_3 would have 10 + 5 + 2 = 17 rows.
My initial approach would be to create a temp table, calculate desired row counts for each customer and then loop through the temp table and select rows for each customer. Then insert them to another table and select from that one:
drop table if exists #row_counts
select
    customer,
    ceiling(convert(decimal(10, 2), count(product)) / 10) as row_count
into #row_counts
from products_sold
group by customer
-- then use cursor to loop over #row_counts and insert into the final table
-- for randomness an 'order by newid()' will be used
But this just doesn't feel like the right solution...

You need to know the total row count per customer and the number of rows you want from it.
Something like this can perhaps be of service (edited because the first version was not randomized properly):
select *
from (
    -- table and column names follow the question (products_sold, customer)
    select row_number() over (partition by customer order by newid()) as sortOrder
         , count(*) over (partition by customer) as cnt
         , *
    from products_sold
) p
-- Now, we want 10% of total count rounded upwards
where sortOrder <= ceiling(cnt * 0.1)
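To check the row counts this approach produces, here is a minimal sketch that runs the same idea against an in-memory SQLite database from Python. SQLite is only a stand-in for SQL Server: `random()` replaces `newid()`, and the integer expression `(cnt + 9) / 10` replaces `CEILING(cnt * 0.1)`. The generated rows are made-up test data matching the counts in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products_sold (row_id INTEGER PRIMARY KEY, customer TEXT, product TEXT)"
)

# Hypothetical data: 100, 50 and 17 rows per customer, as in the question.
orders = {"customer_1": 100, "customer_2": 50, "customer_3": 17}
for customer, n in orders.items():
    conn.executemany(
        "INSERT INTO products_sold (customer, product) VALUES (?, ?)",
        [(customer, f"item_{i}") for i in range(n)],
    )

sample = conn.execute(
    """
    SELECT customer, row_id
    FROM (
        SELECT customer, row_id,
               ROW_NUMBER() OVER (PARTITION BY customer ORDER BY random()) AS sort_order,
               COUNT(*)     OVER (PARTITION BY customer)                   AS cnt
        FROM products_sold
    ) p
    -- integer division: (cnt + 9) / 10 equals 10% of cnt rounded up
    WHERE sort_order <= (cnt + 9) / 10
    """
).fetchall()

# Count how many rows were picked for each customer.
picked = {}
for customer, _ in sample:
    picked[customer] = picked.get(customer, 0) + 1
print(picked)
```

Which rows are picked changes on every run, but the number of rows per customer stays fixed at 10, 5 and 2.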

Related

Sum over Partition By Only when Value is Greater than 0

I only want to sum the applied amount when a ledger amount in another table is positive
Example
Table A
Statement # ID
500 1
500 2
500 3
500 4
Table B
Ledger_Amount Type ID
-389.41 Credit 1
-1218.9 Credit 2
-243.63 Credit 3
3485.19 Invoice 4
Table C
Applied_Amount ID
389.41 1
1218.9 2
243.63 3
1633.25 4
The current code is
(sum(applied_amount) over (partition by statement_number),0)
It comes up with a total of $3485.19 because it sums by statement number only, and all IDs have the same statement number. The value I want is $1633.25: nothing should be summed where the ledger_amount in Table B is less than 0, so IDs 1, 2 and 3 should be excluded and the only valid value is ID 4.
There is one approach:
Assuming ID is a unique column, first get the IDs to work on based on the statement number and save them in a temp table:
select Id
into #Ids
from tableA
where StatementNumber = @yourStatementNumber
Then eliminate the IDs that have a negative number in table B:
select Id
into #IdsWithPositiveLedger
from #Ids
where Id in (
    select ID
    from tableB
    where Ledger_Amount > 0
)
Finally, use the IDs that are left to get your sum:
select sum(applied_amount)
from tableC
where Id in (select Id from #IdsWithPositiveLedger)
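The same filter can probably also be expressed without temp tables by putting a CASE expression inside the windowed SUM. The following is a sketch only, run against SQLite from Python with the sample rows from the question; the table names `tableA`, `tableB`, `tableC` and the column name `statement_number` are assumptions, since the real schema is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tableA (statement_number INTEGER, id INTEGER);
    CREATE TABLE tableB (ledger_amount REAL, type TEXT, id INTEGER);
    CREATE TABLE tableC (applied_amount REAL, id INTEGER);
    INSERT INTO tableA VALUES (500, 1), (500, 2), (500, 3), (500, 4);
    INSERT INTO tableB VALUES (-389.41, 'Credit', 1), (-1218.9, 'Credit', 2),
                              (-243.63, 'Credit', 3), (3485.19, 'Invoice', 4);
    INSERT INTO tableC VALUES (389.41, 1), (1218.9, 2), (243.63, 3), (1633.25, 4);
    """
)

rows = conn.execute(
    """
    SELECT a.id,
           -- only rows with a positive ledger amount contribute to the sum
           SUM(CASE WHEN b.ledger_amount > 0 THEN c.applied_amount ELSE 0 END)
               OVER (PARTITION BY a.statement_number) AS applied_total
    FROM tableA a
    JOIN tableB b ON b.id = a.id
    JOIN tableC c ON c.id = a.id
    ORDER BY a.id
    """
).fetchall()
print(rows)
```

Every row of statement 500 carries the same windowed total, and only ID 4 contributes to it.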

Search data in SQL based on a list of parameters

Hi all, I am unable to write a SQL query that returns every employee (delivery boy) who has all of the products I pass in as parameters.
For example, if I pass product IDs 11 and 12, it should return only DeliveryBoyId 1 and 2. If I pass product ID 11, it should return DeliveryBoyId 1, 2 and 3. If I pass product IDs 11, 12 and 16, it should return no records, because no delivery boy is assigned all of products 11, 12 and 16.
I don't know what your table is called so I'm calling it dbo.Delivery. Try this out:
;with CTE as (
    select distinct DeliveryBoyId --get all the boys who delivered product 11
    from dbo.Delivery
    where ProductId = 11
    union all --combine the above results with the below
    select distinct DeliveryBoyId --get all the boys who delivered product 12
    from dbo.Delivery
    where ProductId = 12
)
select DeliveryBoyId
from CTE
group by DeliveryBoyId
having count(1) = 2 --get all the boys who appear twice in the above table, once for each product
If you want to match additional products, you can add to the CTE section for other product IDs.
If you want a single solution for any number of products, you can use this instead (which may be a little less readable but more concise and easier to maintain):
select DeliveryBoyId
from (
    select distinct DeliveryBoyId, ProductId
    from dbo.Delivery
    where ProductId in (11, 12)
) s
group by DeliveryBoyId
having count(1) = 2 --number matches number of products you need to match
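To see how the second query behaves for the three cases in the question, here is a small sketch that runs it against SQLite from Python with the product list passed as a parameter. The table contents are an assumption reconstructed from the examples in the question (boys 1 and 2 have products 11 and 12, boy 3 has only product 11), and `boys_with_all` is a hypothetical helper name. `COUNT(DISTINCT ProductId)` is used in place of the inner `select distinct`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Delivery (DeliveryBoyId INTEGER, ProductId INTEGER)")
# Assumed sample data, inferred from the examples in the question.
conn.executemany(
    "INSERT INTO Delivery VALUES (?, ?)",
    [(1, 11), (1, 12), (2, 11), (2, 12), (3, 11)],
)


def boys_with_all(product_ids):
    """Return the delivery boys assigned every product in product_ids."""
    ids = sorted(set(product_ids))
    placeholders = ", ".join("?" for _ in ids)
    sql = f"""
        SELECT DeliveryBoyId
        FROM Delivery
        WHERE ProductId IN ({placeholders})
        GROUP BY DeliveryBoyId
        -- a boy qualifies only if he matches as many distinct products as were requested
        HAVING COUNT(DISTINCT ProductId) = ?
        ORDER BY DeliveryBoyId
    """
    return [row[0] for row in conn.execute(sql, (*ids, len(ids)))]


print(boys_with_all([11, 12]))
print(boys_with_all([11]))
print(boys_with_all([11, 12, 16]))
```

The count compared in HAVING is derived from the parameter list itself, so the query does not need editing when the number of products changes.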

Query to list every line item but only list total once

We have a sales report that pulls data from multiple tables and my query shows correct data except orders that have multiple line items, i.e., the Total from the Orders table is listed on every line item row.
How can we list the Order Total only once on the row that has the smallest line item ID (for that order) but still list every line item row? Thanks!
Data Structure:
Orders Table:
Order_ID
Total
Line Items Table:
ID
Order_ID
Line_Item_Price
Line_Item_Qty
Result should be:
Order_ID   Total   Line_Item_Price   Line_Item_Qty   Line_Item_ID
--------   -----   ---------------   -------------   ------------
10001      200     100               2               32001
10002      150     150               1               32002
10003      210     55                1               32003
10003              30                2               32004
10003              95                1               32005
10004      125     125               1               32006
This should be done in the application not in SQL.
But you can do that using window functions
select o.order_id,
       case row_number() over (partition by o.order_id order by line_item_id)
         when 1 then o.total
       end as total,
       li.line_item_price,
       li.line_item_qty,
       li.line_item_id
from orders o
  join line_item li on o.order_id = li.order_id
order by o.order_id, li.line_item_id;
row_number() assigns a unique row number for each line item for every order. When the number is 1 the total is displayed, otherwise it's not.
In a relational database there is no such thing as "the first row" unless you specify an order by - in this case the "first row" is the line item with the smallest line_item_id
Online example: http://rextester.com/TQOIX50171
Unrelated, but: storing the total in the orders table is not a terribly good idea. In a normalized design you shouldn't store information that can easily be derived from existing data.
What this does is get the orders with their line items and number them, so that the line item with the smallest ID is number 1 for that order and the one with the largest ID gets the last number.
ROW_NUMBER() is the function that numbers the rows in the output of the query (row 1 = 1, row 2 = 2). Combined with OVER (PARTITION BY ...), it numbers the rows within a certain window, a partition. In this case we want to number the rows within each Order_ID, and we use ORDER BY to say how the rows are ordered within the window.
With that result, we can then write a query on it that shows the total only on the rows where order_the_items_IDs = 1:
WITH rank_items_for_orders AS (
    SELECT
        o.Order_ID,
        li.Line_Item_Price,
        li.Line_Item_Qty,
        li.ID AS Line_Item_ID,
        o.Total,
        -- number the line items within each order, smallest ID first
        ROW_NUMBER()
            OVER (PARTITION BY o.Order_ID ORDER BY li.ID ASC)
            AS order_the_items_IDs
    FROM orders o
    LEFT JOIN line_items li ON o.order_id = li.order_id
)
SELECT
    Order_ID,
    Line_Item_Price,
    Line_Item_Qty,
    Line_Item_ID,
    -- show the order total only on the first line item of each order
    CASE WHEN order_the_items_IDs = 1
         THEN Total ELSE NULL END
        AS Total
FROM
    rank_items_for_orders
-- sort the final result here rather than inside the CTE
ORDER BY Order_ID ASC, Line_Item_ID ASC
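As a quick check of the window-function approach, the sketch below loads the sample rows from the question into an in-memory SQLite database from Python and runs a query of the same shape. The table and column names (`orders`, `line_items`, `id`) follow the data structure in the question; SQLite is only used as a convenient test bed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, total INTEGER);
    CREATE TABLE line_items (id INTEGER, order_id INTEGER,
                             line_item_price INTEGER, line_item_qty INTEGER);
    INSERT INTO orders VALUES (10001, 200), (10002, 150), (10003, 210), (10004, 125);
    INSERT INTO line_items VALUES
        (32001, 10001, 100, 2), (32002, 10002, 150, 1), (32003, 10003, 55, 1),
        (32004, 10003, 30, 2),  (32005, 10003, 95, 1),  (32006, 10004, 125, 1);
    """
)

rows = conn.execute(
    """
    SELECT o.order_id,
           -- total only on the line item with the smallest id per order
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY o.order_id ORDER BY li.id) = 1
                THEN o.total END AS total,
           li.line_item_price,
           li.line_item_qty,
           li.id AS line_item_id
    FROM orders o
    JOIN line_items li ON li.order_id = o.order_id
    ORDER BY o.order_id, li.id
    """
).fetchall()

for row in rows:
    print(row)
```

Order 10003 keeps all three of its line items, and its total appears only next to line item 32003.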

Calculate percentages of columns in Oracle SQL

I have three columns, all consisting of 1's and 0's. For each of these columns, how can I calculate the percentage of people (one person is one row/ id) who have a 1 in the first column and a 1 in the second or third column in oracle SQL?
For instance:
id   marketing_campaign   personal_campaign   sales
--   ------------------   -----------------   -----
1    1                    0                   0
2    1                    1                   0
1    0                    1                   1
4    0                    0                   1
So in this case, of all the people who were subjected to a marketing_campaign, 50 percent were subjected to a personal campaign as well, but zero percent is present in sales (no one bought anything).
Ultimately, I want to find out the order in which people get to the sales moment. Do they first go from marketing campaign to a personal campaign and then to sales, or do they buy anyway regardless of these channels.
This is a fictional example, so I realize that in this example there are many other ways to do this, but I hope anyone can help!
The outcome that I'm looking for is something like this:
percentage marketing_campaign/ personal campaign = 50 %
percentage marketing_campaign/sales = 0%
etc (for all the three column combinations)
Use COUNT, SUM and CASE expressions, together with the basic arithmetic operators +, / and *.
COUNT(*) gives the total count of people in the table.
SUM(column) gives the number of 1s in the given column.
CASE expressions make it possible to implement more complex conditions.
The common pattern is X / COUNT(*) * 100, which calculates a value as a percentage of the total (val / total * 100%).
An example:
SELECT
    -- percentage of people that have 1 in marketing_campaign column
    SUM( marketing_campaign ) / COUNT(*) * 100 As marketing_campaign_percent,
    -- percentage of people that have 1 in sales column
    SUM( sales ) / COUNT(*) * 100 As sales_percent,
    -- complex condition:
    -- percentage of people (one person is one row/ id) who have a 1
    -- in the first column and a 1 in the second or third column
    COUNT(
        CASE WHEN marketing_campaign = 1
              AND ( personal_campaign = 1 OR sales = 1 )
             THEN 1 END
    ) / COUNT(*) * 100 As complex_condition_percent
FROM the_table; -- placeholder name; TABLE itself is a reserved word
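The query above divides by all rows, while the percentages asked for in the question are relative to the people who received a marketing campaign. One possible way to get those is to restrict the rows with a WHERE clause before dividing. The sketch below tries this on the four sample rows, using SQLite from Python as a stand-in for Oracle; the table name `the_table` is an assumption, and `100.0` is used because SQLite would otherwise do integer division. It counts rows, so the duplicate id in the sample is not collapsed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE the_table (id INTEGER, marketing_campaign INTEGER,
                            personal_campaign INTEGER, sales INTEGER);
    INSERT INTO the_table VALUES (1, 1, 0, 0), (2, 1, 1, 0), (1, 0, 1, 1), (4, 0, 0, 1);
    """
)

row = conn.execute(
    """
    SELECT 100.0 * SUM(personal_campaign) / COUNT(*) AS pct_marketing_personal,
           100.0 * SUM(sales)             / COUNT(*) AS pct_marketing_sales
    FROM the_table
    -- the denominator is now only the rows that had a marketing campaign
    WHERE marketing_campaign = 1
    """
).fetchone()
print(row)
```

On the sample data this gives 50 percent for marketing_campaign/personal_campaign and 0 percent for marketing_campaign/sales, matching the outcome described in the question.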
You can get your percentages like this :
SELECT COUNT(*),
       ROUND(100*(SUM(personal_campaign) / sum(count(*)) over ()),2) perc_personal_campaign,
       ROUND(100*(SUM(sales) / sum(count(*)) over ()),2) perc_sales
FROM (
    SELECT ID,
           CASE
               WHEN SUM(personal_campaign) > 0 THEN 1
               ELSE 0
           end AS personal_campaign,
           CASE
               WHEN SUM(sales) > 0 THEN 1
               ELSE 0
           end AS sales
    FROM the_table
    WHERE ID IN
          (SELECT ID FROM the_table WHERE marketing_campaign = 1)
    GROUP BY ID
)
I have overcomplicated things a bit because your data is still unclear to me. The subquery ensures that all duplicates are cleaned up and that you only have a 1 or 0 in personal_campaign and sales for each person.
About your second question :
Ultimately, I want to find out the order in which people get to the
sales moment. Do they first go from marketing campaign to a personal
campaign and then to sales, or do they buy anyway regardless of these
channels.
This is impossible to do in this state, because your table has neither:
a unique row identifier that would keep the order in which the rows were inserted, nor
a timestamp column that would tell when the rows were inserted.
Without one of these, the order of rows returned from your table will be unpredictable or, if you prefer, purely random.

How to loop through a table and look for adjacent rows with identical values in one field and update another column conditionally in SQL?

I have a table with a field called 'group_quartile', which uses the SQL ntile() function to calculate which quartile each customer lies in on the basis of their activity scores. However, using this ntile() function I find that some customers have the same activity score but are in different quartiles. I need to modify the 'group_quartile' column so that all customers with the same activity score lie in the same group_quartile.
A view of the table values :
Customer_id   Product   Activity_Score   Group_Quartile
-----------   -------   --------------   --------------
CH002         T         2328             1
CR001         T         268              1
CN001         T         178              1
MS006         T         45               2
ST001         T         21               2
CH001         T         0                2
CX001         T         0                3
KH001         T         0                3
MH002         T         0                4
SJ003         T         0                4
CN001         S         439              1
AC002         S         177              1
SC001         S         91               2
PV001         S         69               3
TS001         S         0                4
I used a CTE expression but it did not work.
My query only updates (from the above example):
CX001 T 0 3
modified to
CX001 T 0 2
So only the first repeating activity score is checked and that row’s group_quartile is updated to 2.
I need to update all the below rows as well.
CX001 T 0 3
KH001 T 0 3
MH002 T 0 4
SJ003 T 0 4
I cannot use DENSE_RANK() instead of quartiles to segregate the records, as arranging the customers per product in approximately 4 quartiles is a business requirement.
From my understanding I need to loop through the table:
Find a row which has the same activity score and the same product as its predecessor but a different group_quartile.
Update the selected row's group_quartile to its predecessor's quartile value.
Then loop through the updated table again to look for any row meeting the above condition, and update that row similarly.
The loop continues until all rows with the same activity score (for the same product) are in the same group_quartile.
--
THIS IS THE TABLE STRUCTURE I AM WORKING ON:
CREATE TABLE #custs
(
    customer_id NVARCHAR(50),
    PRODUCT NVARCHAR(50),
    ACTIVITYSCORE INT,
    GROUP_QUARTILE INT,
    RANKED int,
    rownum int
)
INSERT INTO #custs
-- adding a column to give row numbers (unique id) for each row
SELECT customer_id, PRODUCT, ACTIVITYSCORE, GROUP_QUARTILE, RANKED,
       Row_Number() OVER (partition by product ORDER BY activityscore desc) N
FROM
    -- rows derived from a parent table based on 'segmentation' column value
    (SELECT customer_id, PRODUCT, ACTIVITYSCORE,
            DENSE_RANK() OVER (PARTITION BY PRODUCT ORDER BY ACTIVITYSCORE DESC) AS RANKED,
            NTILE(4) OVER (PARTITION BY PRODUCT ORDER BY ACTIVITYSCORE DESC) AS GROUP_QUARTILE
     FROM #parent_score_table WHERE (SEGMENTATION = 'Large')
    ) as temp
ORDER BY PRODUCT
The method I used to achieve this partially is as follows:
-- The query finds the rows which have the same activity score as the previous row but a different Group_Quartile value.
-- I need to use a query to update this row.
-- Next, find any rows in this newly updated table that have the same activity score as the previous row but a different Group_Quartile value.
-- Continue to update the table in the above manner until all rows with the same activity score have been updated to have the same quartile value.
I managed to find only the rows which have the same activity score as the previous row but a different Group_Quartile value, but I cannot loop through to find new rows that may match this updated row.
select t1.customer_id, t1.ACTIVITYSCORE, t1.PRODUCT, t1.RANKED, t1.GROUP_QUARTILE, t2.GROUP_QUARTILE as modified_quartile
from #custs t1, #custs t2
where (
    t1.rownum = t2.rownum + 1
    and t1.ACTIVITYSCORE = t2.ACTIVITYSCORE
    and t1.PRODUCT = t2.PRODUCT
    and not(t1.GROUP_QUARTILE = t2.GROUP_QUARTILE))
Can anyone help with what should be the t-sql statement for the above?
Cheers!
Assuming you've already worked out a base Group_Quartile as indicated above, you can update the table with a query similar to the following:
update a
set Group_Quartile = coalesce(topq.Group_Quartile, a.Group_Quartile)
from activityScores a
outer apply
(
    select top 1 Group_Quartile
    from activityScores topq
    where a.Product = topq.Product
      and a.Activity_Score = topq.Activity_Score
    order by Group_Quartile
) topq
SQL Fiddle with demo.
Edit after comment:
I think you did a lot of the work already by getting the Group_Quartile working.
For each row in the table, the statement above will join another row to it using the outer apply statement. Only one row will be joined back to the original table due to the top 1 clause.
So for each row, we are returning one more row. The extra row will be matched on Product and Activity_Score, and will be the row with the lowest Group_Quartile (order by Group_Quartile). Finally, we update the original row with this lowest Group_Quartile value, so each row with the same Product and Activity_Score will now have the same, lowest possible Group_Quartile.
So SJ003, MH002, etc will all be matched to CH001 and be updated with the Group_Quartile value of CH001, i.e. 2.
It's hard to explain code! Another thing that might help is looking at the join without the update statement:
select a.*
     , TopCustomer_id = topq.Customer_Id
     , NewGroup_Quartile = topq.Group_Quartile
from activityScores a
outer apply
(
    select top 1 *
    from activityScores topq
    where a.Product = topq.Product
      and a.Activity_Score = topq.Activity_Score
    order by Group_Quartile
) topq
SQL Fiddle without update.
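For anyone who wants to try the idea without a SQL Server instance, here is a rough sketch using SQLite from Python with the sample rows from the question. SQLite has no `outer apply` or `top 1`, so a correlated subquery with `MIN(Group_Quartile)` is swapped in; it picks the same lowest quartile per Product and Activity_Score, but it is not the statement from the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activityScores "
    "(Customer_id TEXT, Product TEXT, Activity_Score INTEGER, Group_Quartile INTEGER)"
)
# Sample rows copied from the table in the question.
conn.executemany(
    "INSERT INTO activityScores VALUES (?, ?, ?, ?)",
    [
        ("CH002", "T", 2328, 1), ("CR001", "T", 268, 1), ("CN001", "T", 178, 1),
        ("MS006", "T", 45, 2), ("ST001", "T", 21, 2), ("CH001", "T", 0, 2),
        ("CX001", "T", 0, 3), ("KH001", "T", 0, 3), ("MH002", "T", 0, 4),
        ("SJ003", "T", 0, 4), ("CN001", "S", 439, 1), ("AC002", "S", 177, 1),
        ("SC001", "S", 91, 2), ("PV001", "S", 69, 3), ("TS001", "S", 0, 4),
    ],
)

# Give every row the lowest quartile found among rows with the
# same product and the same activity score.
conn.execute(
    """
    UPDATE activityScores
    SET Group_Quartile = (
        SELECT MIN(t.Group_Quartile)
        FROM activityScores t
        WHERE t.Product = activityScores.Product
          AND t.Activity_Score = activityScores.Activity_Score
    )
    """
)

quartiles = {
    (customer, product): quartile
    for customer, product, quartile in conn.execute(
        "SELECT Customer_id, Product, Group_Quartile FROM activityScores"
    )
}
print(quartiles)
```

All five product T customers with a score of 0 end up in quartile 2 in a single statement, with no loop. The quartiles are then no longer equally sized, which follows directly from keeping tied scores together.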