Count records by groups while drilling down in the recordset - sql

I am working on a website which has a drill-down feature for a large recordset.
Let's say I have the following recordset:
tblBrand (ID, Name, PriceFrom, PriceTo, BHPFrom, BHPTo)
1, Audi, 170000, 340000, 100, 250
2, BMW, 250000, 290000, 110, 400
3, Ford, 275000, 500000, 75, 150
4, Kia, 110000, 250000, 50, 100
5, VW, 135000, 460000, 50, 200
By default I will show the user all records. And in my drill-down feature, I am presenting URLs to the user like this:
Price:
0-100000 (0 brands)
100000-200000 (3 brands)
200000-300000 (5 brands)
300000-> (3 brands)
BHP:
0-100 (3 brands)
100-200 (5 brands)
200-> (3 brands)
So far so good. I got the solution to the count-by-ranges problem here: Group a range towards a range
But my problem occurs when the user clicks on one of my drill-down links. Let's say he clicks on "BHP 0-100". Then only tblBrand IDs 3, 4, and 5 should be displayed. This is of course easy with a simple WHERE clause. But here is my question:
How do I present to the user an updated count of Brands per Price-range and Brands per BHP-range after the user has done the first drill-down by selecting "BHP 0-100"?
Currently my solution (which works) is to insert the recordset resulting from the drill-down into a temp table and do all the new counting on that table. This is an easy way, but with 100k+ records and roughly 20 types of drill-down possibilities, each having 5-20 ranges, it becomes a very heavy task!
UPDATE: This is what I am doing (pseudocode):
SELECT ID, Name, PriceFrom, PriceTo, BHPFrom, BHPTo  -- carry the range columns along for the re-counts
INTO #DrillDownRecordset
FROM tblBrand
WHERE BHPFrom > @SelBHPFrom AND BHPTo < @SelBHPTo

SELECT COUNT(ID)
FROM #DrillDownRecordset
GROUP BY PriceRanges  -- pseudocode: the price buckets

SELECT COUNT(ID)
FROM #DrillDownRecordset
GROUP BY BHPRanges    -- pseudocode: the BHP buckets
This works, but it is very heavy.
Could I be looking at using ROLLUP?
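For reference, here is a minimal temp-table-free sketch (my illustration, not from the original post): it assumes a hypothetical tblPriceRange(RangeLabel, RangeFrom, RangeTo) lookup table plus @SelBHPFrom/@SelBHPTo parameters, and counts per price range the brands whose price span overlaps that range, using the same overlap test for the BHP filter that the example counts above imply:

SELECT r.RangeLabel,
       COUNT(b.ID) AS Brands
FROM tblPriceRange AS r
LEFT JOIN tblBrand AS b
       ON b.PriceFrom <= r.RangeTo      -- brand's price span overlaps the range
      AND b.PriceTo   >= r.RangeFrom
      AND b.BHPFrom   <= @SelBHPTo      -- current BHP drill-down, as an overlap test
      AND b.BHPTo     >= @SelBHPFrom
GROUP BY r.RangeLabel

Repeating this pattern once per drill-down type avoids materializing the filtered recordset for every recount.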

Related

SQL: removing duplicates based on different criteria, actually creates new records

I have a database (dbo) with duplicates. In particular, one employee can work two roles (Role Number) in the same business (Business Code), or work two roles / the same role within different businesses in the same or a different area (Area Code); see below:
What I want is to remove duplicate records. Thus, I created this code:
SELECT
    dbo.year,
    MIN(dbo.RoleNumber) AS Role,
    MIN(dbo.AreaCode) AS Area,
    MIN(dbo.BusinessCode) AS BCode,
    dbo.EmployeeNumber
FROM dbo
GROUP BY dbo.year, dbo.EmployeeNumber
This code works well when an individual works the lowest role in the business with the lowest number and in the lowest area (e.g., rows 3 and 4 in my example), or where the area code and business code are the same in the duplicate records (e.g., rows 1 and 2).
However, I have some cases where an individual's lowest role is associated with a higher business code and/or area code. In this case, SQL creates new records combining these elements; see the examples below:
rows 5-10: 2018, 651, 5110, 3, 17;
rows 11-13: 2018, 649, 6215, 4, 20;
rows 14-15: 2018, 750, 5101, 5, 24.
This is not a problem per se, but it is problematic when I join tables to get additional data for these employees. The key elements for joining the tables are the area and business codes and the employee's number; with my code, however, SQL is creating new records that do not exist in the other tables, which leads to the additional data being NULL.
Is there a way to fix this? I need SQL to always select the lowest Role number first; if the role number is the same, then the lowest establishment (business) number should be selected, and if that is also the same, the lowest Area code should finally be selected.
So for instance, I would expect that the three records creating problems would be retrieved like this:
rows 5-10: 2018, 651, 6319, 3, 17;
rows 11-13: 2018, 650, 6215, 4, 20;
rows 14-15: 2018, 750, 8076, 5, 24.
Thank you
Silvia
You can use a window function:
select *
from (
    select *,
           row_number() over (partition by year, employeenumber
                              order by rolenumber, businesscode, areacode) rn
    from yourtable
) t
where rn = 1
You can play with the ORDER BY inside the window function to choose which row you want.
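For example (a hypothetical variation, not part of the original answer), to keep the row with the highest role number instead, flip the first sort key:

row_number() over (partition by year, employeenumber
                   order by rolenumber desc, businesscode, areacode) rn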

Calculating and displaying customer lifetime value histogram with BigQuery and Data Studio

Consider a table in Google BigQuery containing purchase records for customers. For the sake of simplicity, let's focus on the following properties:
customer_id, product_id, amount
I'd like to create a Google Data Studio report from the above data set showing a customer lifetime value histogram. The customer lifetime value is the sum of amount for any given customer. The histogram would show how many customers fall into a certain bucket by their total amount; I would define the buckets as value ranges like 0-10, 10-20, 20-30, etc.
Like this:
Finally, I'd also like to filter the histogram by product_id. When the filter is active, the histogram would show the totals for customers who - at least once - purchased the given product.
As of this moment, I think this is not possible to implement in Data Studio, but I hope I am wrong.
Things I've tried so far:
Displaying an average customer lifetime value for the whole dataset is easy via a calculated field in Data Studio: SUM(amount) / COUNT(customer_id).
For creating a histogram, I don't see any way to do it purely in Data Studio (based on the above data set). I think I need to create a view of the original table, consisting of a single row for each customer with the total amount. The bucket assignment could be implemented either in BigQuery or in Data Studio with CASE ... WHEN.
However, for the final step, i.e. creating a product filter that restricts the histogram to those customers who purchased the given product, I have no clue how to approach this.
Any thoughts?
I was able to reproduce something similar to what you describe, but it's not straightforward, so I'll try to detail everything. The main idea is to have two data sources from the same table: one contains customer_id and product_id so that we can filter on it, while the other contains customer_id and the already calculated amount_bucket field. This way we can join them (blend data) on customer_id and filter by product_id, which won't change the amount_bucket calculations.
I used the following script to create some data in BigQuery:
CREATE OR REPLACE TABLE data_studio.histogram
(
customer_id STRING,
product_id STRING,
amount INT64
);
INSERT INTO data_studio.histogram (customer_id, product_id, amount)
VALUES ('John', 'Game', 60),
('John', 'TV', 800),
('John', 'Console', 300),
('Paul', 'Sofa', 1200),
('George', 'TV', 750),
('Ringo', 'Movie', 20),
('Ringo', 'Console', 250)
;
Then I connect directly to the BigQuery table and get the following fields. Data source is called histogram:
We add our second data source (BigQuery) using a custom query:
SELECT
  customer_id,
  CASE
    WHEN SUM(amount) < 500 THEN '0-500'
    WHEN SUM(amount) < 1000 THEN '500-1000'
    WHEN SUM(amount) < 1500 THEN '1000-1500'
    ELSE '1500+'
  END AS amount_bucket
FROM
  data_studio.histogram
GROUP BY
  customer_id
With only the latter, we could already build a basic histogram with the following configuration:
The dimension is amount_bucket and the metric is Record count. I made a bucket_order custom field to sort it, since lexicographically '1000-1500' comes before '500-1000':
CASE
  WHEN amount_bucket = '0-500' THEN 0
  WHEN amount_bucket = '500-1000' THEN 1
  WHEN amount_bucket = '1000-1500' THEN 2
  ELSE 3
END
Now we add the product_id filter on top and a new chart with the following configuration:
Note that the metric is CTD (Count Distinct) of customer_id, and the blended data source is implemented as:
An example where I filter by TV, so only George and John appear, but their other purchases are still counted in the total amount calculation:
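For reference (my reading of the setup, not part of the original answer), the blend roughly corresponds to the following BigQuery query, with the report's interactive product filter playing the role of the WHERE clause:

SELECT b.amount_bucket,
       COUNT(DISTINCT h.customer_id) AS customers
FROM data_studio.histogram AS h
JOIN (
  SELECT customer_id,
         CASE
           WHEN SUM(amount) < 500 THEN '0-500'
           WHEN SUM(amount) < 1000 THEN '500-1000'
           WHEN SUM(amount) < 1500 THEN '1000-1500'
           ELSE '1500+'
         END AS amount_bucket
  FROM data_studio.histogram
  GROUP BY customer_id
) AS b
ON h.customer_id = b.customer_id
WHERE h.product_id = 'TV'  -- the interactive filter; omit for the unfiltered histogram
GROUP BY b.amount_bucket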
I hope it works for you.

How to make a query that returns data of rows related to each row in a table

I have some tables for double-entry bookkeeping.
The VoucherDetail table contains the accounting entries for each voucher, and
the other tables are the account Group/Ledger/Definitive tables.
Here are the diagrams of the tables.
I'm trying to get the opposite side of each entry and show it in a custom column that matches the entry's debit/credit amount (ref. image 2).
I did some Google searching and found nothing. Here is the query I've made so far (ref. image 1):
SELECT
dbo.Vouchers.VoucherId,
vd.VoucherDetailIndex AS ind,
vd.Debit,
vd.Credit,
vd.Description,
CONCAT ( ag.Name, '_', al.Name, '_', ad.Name ) AS names,
CONCAT ( ag.GroupId, '_', al.LedgerId, '_', ad.DefinitiveId ) AS ids
FROM dbo.Vouchers
JOIN dbo.VoucherDetails AS vd ON vd.Voucher_VoucherIndex = dbo.Vouchers.VoucherIndex
JOIN dbo.AccDefinitives AS ad ON vd.AccDefinitive_DefinitiveIndex = ad.DefinitiveIndex
JOIN dbo.AccLedgers AS al ON ad.AccLedger_LedgerIndex = al.LedgerIndex
JOIN dbo.AccGroups AS ag ON al.AccGroup_GroupIndex = ag.GroupIndex
Here is the result I'm getting:
The result I want:
Here is an example to explain what I need:
EVENT:
We put $10 in the bank as our equity; now we need to create a voucher for this:
INSERT INTO Vouchers (VoucherIndex, VoucherId, VoucherDate, Description) VALUES
(1, 1, '2019-01-01', 'initial investment');
And now we need to add the entries for this event to the VoucherDetails of voucher 1,
which will have 2 entries: 1 for cash and 1 for equity:
INSERT INTO VoucherDetails (VoucherDetailIndex, Debit, Credit, Description, AccDefinitive_DefinitiveIndex, AccLedger_LedgerIndex, Voucher_VoucherIndex, EntityOrder) VALUES
(1, 10, 0, 'Put Cash on Bank as initial Investment', 10101, 101, 1, 1),
(2, 0, 10, 'initial Investment', 50101, 501, 1, 2);
Now we run the first query I provided; here is the result:
Now we have our usual result, so let's get to the problem.
Imagine someone filled these tables with 10,000 rows of data,
and we need to find voucher no. 10, which has 20 entries inside VoucherDetails.
We get these entries with a simple query,
but we don't know which entry relates to which (like in the example above, where Cash with a $10 debit relates to Equity with a $10 credit).
If we want to know, we have to spend time on it every time we need to find something:
the query needs to search the whole table and find the opposite side related to each row, based on the row's Debit or Credit value.
This is the result I want (mocked up in Excel):
As you can see in the image above, there are 2 new columns added:
"Account in opposite side" and "Account ID in opposite side".
The first row refers to Equity, which relates to Cash, and
the second row refers to Cash, which relates to Equity.
As far as I can see, what you need to be able to do is join two VoucherDetail records that have the same Voucher_VoucherIndex value (let's call this VoucherID for brevity). However, the only two things these records have in common are their VoucherID and the fact that each one's Debit value equals the other's Credit value, and vice versa.
In the comments you mentioned that multiple VoucherDetail rows with the same VoucherID can have the same Debit value (and I presume Credit value). If this wasn't the case, you could add something like this to your query:
JOIN dbo.VoucherDetails AS vd_opposite
ON vd.Voucher_VoucherIndex = vd_opposite.Voucher_VoucherIndex
AND (vd.Debit = vd_opposite.Credit OR vd.Credit = vd_opposite.Debit)
You can't do this though, because Debit/Credit and VoucherID together are not enough to be unique, so you might pick up extra rows in the join that you don't want.
Therefore, your only option is to add a new ID field to your table (maybe called SaleID or something) that definitively links the two rows that represent opposite sides of the same "sale" with a common ID. Then, the above JOIN would look like this:
JOIN dbo.VoucherDetails AS vd_opposite
ON vd.Voucher_VoucherIndex = vd_opposite.Voucher_VoucherIndex
AND vd.SaleID = vd_opposite.SaleID
In addition to adding that JOIN, you would need to join the new vd_opposite table against all of the dbo.Acc* tables again to get access to the data you want, and obviously add the fields from those tables that you want in the results to your SELECT fields.
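Putting that together, here is a sketch of what the final query might look like (assuming the hypothetical SaleID column has been added; the _o aliases are the second set of account-table joins that fetch the opposite side):

SELECT
    dbo.Vouchers.VoucherId,
    vd.VoucherDetailIndex AS ind,
    vd.Debit,
    vd.Credit,
    vd.Description,
    CONCAT ( ag.Name, '_', al.Name, '_', ad.Name ) AS names,
    CONCAT ( ag.GroupId, '_', al.LedgerId, '_', ad.DefinitiveId ) AS ids,
    CONCAT ( ag_o.Name, '_', al_o.Name, '_', ad_o.Name ) AS opposite_names,
    CONCAT ( ag_o.GroupId, '_', al_o.LedgerId, '_', ad_o.DefinitiveId ) AS opposite_ids
FROM dbo.Vouchers
JOIN dbo.VoucherDetails AS vd ON vd.Voucher_VoucherIndex = dbo.Vouchers.VoucherIndex
JOIN dbo.AccDefinitives AS ad ON vd.AccDefinitive_DefinitiveIndex = ad.DefinitiveIndex
JOIN dbo.AccLedgers AS al ON ad.AccLedger_LedgerIndex = al.LedgerIndex
JOIN dbo.AccGroups AS ag ON al.AccGroup_GroupIndex = ag.GroupIndex
JOIN dbo.VoucherDetails AS vd_opposite
     ON vd.Voucher_VoucherIndex = vd_opposite.Voucher_VoucherIndex
    AND vd.SaleID = vd_opposite.SaleID                              -- hypothetical linking column
    AND vd.VoucherDetailIndex <> vd_opposite.VoucherDetailIndex     -- don't match a row to itself
JOIN dbo.AccDefinitives AS ad_o ON vd_opposite.AccDefinitive_DefinitiveIndex = ad_o.DefinitiveIndex
JOIN dbo.AccLedgers AS al_o ON ad_o.AccLedger_LedgerIndex = al_o.LedgerIndex
JOIN dbo.AccGroups AS ag_o ON al_o.AccGroup_GroupIndex = ag_o.GroupIndex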

Expand-collapse report with data set based on GROUPING SETS

I've used the Expand/Collapse feature in SSRS reports before, but in all those cases it was Reporting Services that was doing the grouping and totalling. This time around I utilize GROUPING SETS in my dataset query to let SQL Server handle aggregating the data. I want to create a report that has Expand/Collapse features for the groups, but can't seem to get it to work.
Repro
First up, here's a way to get a small repro simulating my actual situation. Use the following query for a dataset:
-- Simulating with already denormalized data for sake of simplicity
DECLARE @Order TABLE (Category VARCHAR(20), Product VARCHAR(20), PersonId INT);
INSERT INTO @Order
(Category, Product, PersonId)
VALUES ('Fruit', 'Banana', 1)
,('Fruit', 'Banana', 1)
,('Cakes', 'Chocolate', 1)
,('Fruit', 'Apple', 2)
,('Cakes', 'Chocolate', 2)
,('Cakes', 'Berry Jam', 3)
,('Cakes', 'Chocolate', 3)
,('Cakes', 'Chocolate', 3)
,('Fruit', 'Banana', 4)
,('Cakes', 'Berry Jam', 5)
SELECT Category,
Product,
COUNT(DISTINCT PersonId) AS NrOfBuyers
FROM @Order AS o
GROUP BY GROUPING SETS ((), (Category), (Category, Product))
This will provide this output (I've manually ordered the output to illustrate my intentions):
Category  Product    NrOfBuyers
--------  ---------  ----------
Fruit     Apple      1
Fruit     Banana     2
Fruit     NULL       3
Cakes     Berry Jam  2
Cakes     Chocolate  3
Cakes     NULL       4
NULL      NULL       5
To foreshadow what I'm aiming for, here's what I want to get in Excel.
Expanded version of intended result:
Collapsed version of intended result:
What I've tried so far:
While writing this question and creating the repro, I realized that my first approach of just dumping my dataset into a tablix was wrong.
So what I tried to fix this was recreating the tablix with proper Row Groups like so:
In addition to that I need a column on the left hand side outside the main group to hold the toggle "+" for the grand total row.
However, this gives incorrect numbers for the collapsed version:
These should be different: Cakes and Fruit have a "Subtotal" of 3 and 4, respectively.
This seems like a problem with ordering the rows, so I've checked the sorting for the tablix, and that should order the rows as they appear in the "intended result" screenshots. It doesn't, and after a bit I understood why: the groups do sorting as well. So I've added sorting for the groups too; e.g., this is the one for the Product row group:
This seems to improve things (it does the sorting bit I needed anyway), but it doesn't fix having the wrong numbers in the collapsed state.
What do I need to do to finish this last stretch and complete the report?
The approach can work, but one last step is needed to get the correct numbers for the collapsed state. Note that with the example from the question, this design:
Shows the following expression for this cell:
=Fields!NrOfBuyers.Value
But this sneakily seems to come down to this:
=First(Fields!NrOfBuyers.Value)
When it is evaluated in the context of a collapsed row.
So, one way to "fix" this and get the correct sub totals is to change that expression to:
=Last(Fields!NrOfBuyers.Value)
Which will give the desired output in collapsed state:
Or semi-collapsed:
And finally, expanded:
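As an aside (not part of the original answer), if you would rather not rely on First/Last semantics at all, the question's dataset query could expose GROUPING() flags; these mark the subtotal and grand-total rows explicitly and give the tablix concrete values to sort and filter on:

SELECT Category,
       Product,
       COUNT(DISTINCT PersonId) AS NrOfBuyers,
       GROUPING(Category) AS IsGrandTotal,  -- 1 only on the () grand-total row
       GROUPING(Product) AS IsSubtotal      -- 1 on (Category) subtotals and the grand total
FROM @Order AS o
GROUP BY GROUPING SETS ((), (Category), (Category, Product))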

Find a series of data using non-exact measurements (fuzzy logic)

This is a more complex follow-up question to: Efficient way to look up sequential values
Each Product can have many Segment rows (thousands). Each segment has a position column that starts at 1 for each product (1, 2, 3, 4, 5, etc.) and a value column that can contain any values, such as 323.113, 5423.231, 873.42, 422.64, 763.1, etc. The data is read-only.
It may help to think of the product as a song and the segments as a set of musical notes in the song.
Given a subset of contiguous segments, like a snippet of a song, I would like to identify potential matches for products. However, due to potential errors in measurements, the segments in the subset may not match the segments in the database exactly.
How can I identify product candidates by finding the segments of products which most closely match the subset of segments I have measured? Also, is a database the best medium for this type of data?
-
Here are just some thoughts on how I was planning to approach this problem. Please don't take these as exact requirements; I am open to any kind of algorithm that makes this work as well as possible. I was thinking there need to be multiple threshold variables for determining closeness. One possibility might be to implement a proximity threshold and a match threshold.
For example, given these values:
Product A contains these segments: 11,21,13,13,15.
Measurement 1 has captured: 20,14,14,15.
Measurement 2 has captured: 11,21,78,13.
Measurement 3 has captured: 15,13,21,13,11.
If a proximity threshold allowed the measured segment to be 1 above or below the actual segment, then Measurement 1 may match Product A because, although many segments do not match exactly, they are within the proximity threshold relative to the actual values.
If a match threshold allowed for measurements with 3 or more matches, Measurement 2 may return Product A because, although one of the segments (78) far exceeds the proximity threshold, it still matches 3 segments in the correct order and so is within the match threshold.
Measurement 3 would not match Product A because, although all measured segments exist in the actual segments, they are not within the proximity or match thresholds.
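To make the two thresholds concrete, here is a minimal T-SQL sketch (my illustration, not the asker's; it reuses the Segment and #Measurement test tables defined in the last answer below, and it ignores the sliding-alignment problem for brevity). It counts the positions whose measured value lies within the proximity threshold of the stored value, and keeps products that reach the match threshold:

DECLARE @Proximity INT = 1   -- allowed +/- deviation per position
DECLARE @MatchCount INT = 3  -- minimum number of in-proximity positions

SELECT s.[ProductId], COUNT(*) AS [MatchingPositions]
FROM #Measurement m
INNER JOIN [dbo].[Segment] s
    ON m.[Position] = s.[Position]
   AND ABS(m.[Value] - s.[Value]) <= @Proximity
GROUP BY s.[ProductId]
HAVING COUNT(*) >= @MatchCount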
Update: One of the answers asked me to define what I mean by most closely match. I'm not exactly sure how to answer that, but I'll try to explain by continuing with the song analogy. Let's say the segments represent maximum frequencies of a recorded song. If I record that same song again it will be similar, but due to background noise and other limitations of recording equipment, some of the frequencies will match, some will be close, and a few will be way off. In this scenario, how would you define when one recording "matches" another? That's the same kind of matching logic I'm looking for to use in this problem.
From the information you posted, this can be solved with Edmonds' Blossom V perfect matching algorithm. You can either minimize or maximize the objective function, and it will always find the best match. Maybe you can use a brute-force solution with 2 loops. The Wikipedia article about Edmonds' matching algorithm: http://en.wikipedia.org/wiki/Edmonds%27s_matching_algorithm
You need to come up with a definition for "most closely match". I don't know how anyone here can help you with that since no one here is going to know the business requirements or intricacies of the data. Your two methods both sound reasonable, but I have no idea if they actually are or not.
As for whether or not a database is the correct medium for this kind of data, I'd say that a database is probably the perfect medium for storing the data, but it is very likely not the correct medium for processing it. Whether it's possible or not will depend on your final definition of what constitutes "most closely match".
As a quick note, SSIS has some fuzzy match capabilities built into it for processing data. I've only played around with it though and that was a couple of years ago, so I don't know if it would work for what you're doing or not.
If you take your song example literally, one approach is to boil down your input to a bit-vector fingerprint, and then look up that fingerprint in a database as an exact match. You can increase the chances of finding a good match by extracting several fingerprints from your input and/or trying e.g. all bit-vectors that are only 1 or 2 bit-errors away from your fingerprint.
If you have access to the ACM digital library, you can read a description of this sort of approach in "The Shazam Music Recognition Service" at http://delivery.acm.org/10.1145/1150000/1145312/p44-wang.pdf?ip=94.195.253.182&acc=ACTIVE%20SERVICE&CFID=53180383&CFTOKEN=41480065&acm=1321038137_73cd62cf2b16cd73ca9070e7d5ea0744. There is also some information at http://www.music.mcgill.ca/~alastair/621/porter11fingerprint-summary.pdf.
The input format you describe suggests that you might be able to do something with the random projection method described in http://en.wikipedia.org/wiki/Locality_sensitive_hashing.
To answer your second question: depending on exactly what a position corresponds to, you might consider boiling the numbers down to hash fingerprints made up of bits or characters, and storing these in a text search database, such as Apache Lucene.
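If it helps, here is a rough T-SQL sketch of that fingerprint idea (my illustration, reusing the Segment table from the last answer): quantize each value, concatenate the per-product sequence in position order, and hash it, so that candidates can be looked up by exact match on an indexed fingerprint column:

SELECT p.[ProductId],
       HASHBYTES('MD5', (
           SELECT CAST(s.[Value] / 10 AS VARCHAR(20)) + ','  -- integer division quantizes away small errors
           FROM [dbo].[Segment] s
           WHERE s.[ProductId] = p.[ProductId]
           ORDER BY s.[Position]
           FOR XML PATH('')
       )) AS [Fingerprint]
FROM (SELECT DISTINCT [ProductId] FROM [dbo].[Segment]) p

Values sitting right on a bucket boundary can still hash differently, which is one reason the fingerprinting papers above extract several fingerprints and probe near neighbours.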
Could you take the approach of matching the measurements against each product position by position, calculating the difference for each position? Then slide the measurements along one position and calculate the difference again, and find which slide position scored the lowest total difference. Do this for every product, and then you know which product the measurements match most closely.
Test tables and data:
CREATE TABLE [dbo].[Segment]
(
[ProductId] INT,
[Position] INT,
[Value] INT
)
INSERT [dbo].[Segment]
VALUES (1, 1, 300),
(1, 2, 5000),
(1, 3, 900),
(1, 4, 400),
(1, 5, 800),
(2, 1, 400),
(2, 2, 6000),
(2, 3, 1000),
(2, 4, 500),
(2, 5, 900),
(3, 1, 400),
(3, 2, 5400),
(3, 3, 900),
(3, 4, 400),
(3, 5, 900)
CREATE TABLE #Measurement
(
[Position] INT,
[Value] INT
)
INSERT #Measurement
VALUES (1, 5400),
(2, 900),
(3, 400)
As you can see, the measurements match (a subset of) the third product exactly.
Some helpers:
CREATE TABLE #ProductSegmentCount
(
[ProductId] INT,
[SegmentCount] INT
)
INSERT #ProductSegmentCount
SELECT [ProductId], MAX([Position])
FROM [dbo].[Segment]
GROUP BY [ProductId]
DECLARE #MeasurementSegmentCount INT = (SELECT MAX([Position]) FROM #Measurement)
A recursive common table expression to show the products ordered by closest match:
;WITH [cteRecursive] AS
(
SELECT s.[ProductId],
0 AS [RecursionId],
m.[Position] AS [MeasurementPosition],
s.[Position] AS [SegmentPosition],
ABS(m.[Value] - s.[Value]) AS [Difference]
FROM #Measurement m
INNER JOIN [dbo].[Segment] s
ON m.[Position] = s.[Position]
UNION ALL
SELECT s.[ProductId],
[RecursionId] + 1 AS [RecursionId],
m.[Position],
s.[Position],
ABS(m.[Value] - s.[Value]) AS [Difference]
FROM [cteRecursive] r
INNER JOIN #Measurement m
ON m.[Position] = r.[MeasurementPosition]
INNER JOIN [dbo].[Segment] s
ON r.[ProductId] = s.[ProductId]
AND m.[Position] + (r.[RecursionId]) = s.[Position]
INNER JOIN #ProductSegmentCount psc
ON s.[ProductId] = psc.[ProductId]
WHERE [RecursionId] <= ABS(#MeasurementSegmentCount - psc.[SegmentCount])
)-- select * from [cteRecursive] where [ProductId] = 3 order by RecursionId, SegmentPosition
, [cteDifferences] AS
(
SELECT [ProductId], [RecursionId], SUM([Difference]) AS [Difference]
FROM [cteRecursive]
GROUP BY [ProductId], [RecursionId]
)-- select * from [cteDifferences]
SELECT [ProductId], MIN([Difference]) AS [Difference]
FROM [cteDifferences]
GROUP BY [ProductId]
ORDER BY MIN([Difference])
OPTION (MAXRECURSION 0)