Determine whether a "Long tail" exists in a dataset - sql

I am working with a 200MB database that stores feedback from users about products in different years (Assume that 1 record = 1 feedback). The DB has the following schema
asin is a unique identifier of a product.
I am trying to determine whether, in any given year, 20% of the products ("asin") account for more than 65% of all the feedback in that year (the "long tail" phenomenon).
I tried treating the ASIN and the Year together as a unique key and assigning the value 1 each time the key appeared. I would then aggregate these values and check whether the feedback count of the top 20% of ASINs is bigger than 65% of the total feedback count.
The following is an example of my code in PySpark:
reviews_rdd = reviews_df_split_date.rdd
print(type(reviews_rdd))
reviews_rdd\
.map(lambda r: ((r.asin, r.Year),1))\
.reduceByKey(lambda v1, v2: v1 + v2)\
.collect()
which produced the following output:
That's where I got stuck: I didn't know how to group all ASINs in a given year and find the ASINs with the most feedback.
How can I check whether fewer than 20% of the products in any given year account for more than 65% of all the feedback given in that year?
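The check described above can be sketched in plain Python. This is a minimal sketch, not the asker's PySpark pipeline: it assumes the reviews are available as (asin, Year) pairs, and the function name `long_tail_years`, its parameters, and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def long_tail_years(reviews, top_fraction=0.20, share_threshold=0.65):
    """Return {year: share} for years in which the top `top_fraction` of
    products account for more than `share_threshold` of the feedback."""
    # Count feedback per product within each year (1 record = 1 feedback).
    per_year = defaultdict(Counter)
    for asin, year in reviews:
        per_year[year][asin] += 1

    result = {}
    for year, counts in per_year.items():
        ranked = sorted(counts.values(), reverse=True)
        # Number of products in the top 20%, rounded down, at least one.
        top_n = max(1, int(len(ranked) * top_fraction))
        share = sum(ranked[:top_n]) / sum(ranked)
        if share > share_threshold:
            result[year] = share
    return result


# Hypothetical sample: in 2015 one of five products dominates, in 2016 none does.
sample = (
    [("A", 2015)] * 8
    + [("B", 2015), ("C", 2015), ("D", 2015), ("E", 2015)]
    + [(asin, 2016) for asin in "ABCDE" for _ in range(2)]
)
long_tail = long_tail_years(sample)
print(long_tail)
```

The same three steps (count per (asin, Year), rank within each year, compare the top slice with the yearly total) carry over to the (key, count) pairs that the `reduceByKey` step already yields.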

Related

Query table and return values with the highest confidence only and remove duplicates

I want to find a Microsoft SQL query to only get the maximum customer_confidence for a specific customer.
Scenario 1
In the example table you will see that number 454 for customer 2603396 has two confidence percentages, one at 75% and one at 50%. I would like only the 75% row to be returned.
Scenario 2
For number 918 I have two customer numbers, 2744398 and 4866968. I only want the customer with the highest customer_confidence to be returned; in the example table that should be 2744398 with 75% confidence.
How can I achieve this in Microsoft SQL?
Here is what I have tried:
select
distinct no, customer
,max(customer_confidence) as 'customer_confidence'
from (Table)
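The intended result for both scenarios can be sketched in Python as "keep one row per number, the one with the highest confidence". This only illustrates the logic, it is not a T-SQL solution. The function name is hypothetical, and the 50% confidence for customer 4866968 is an assumed value, since the question only implies that it is lower than 75%.

```python
def best_customer_per_number(rows):
    """For each `no`, keep the (customer, confidence) pair with the
    highest customer_confidence."""
    best = {}
    for no, customer, confidence in rows:
        if no not in best or confidence > best[no][1]:
            best[no] = (customer, confidence)
    return best


rows = [
    # (no, customer, customer_confidence)
    (454, 2603396, 75),
    (454, 2603396, 50),
    (918, 2744398, 75),
    (918, 4866968, 50),  # assumed value, not given in the question
]
best = best_customer_per_number(rows)
print(best)
```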

Join multiple tables in Microsoft SQL Server where there is only one line match from table 1 and multiple lines from table 2 and 3

I am stuck on something that I have never needed in my 10 years of SQL. I thought it would be useful if there was some way of doing this. Firstly, I am running SQL Server Express (latest free version) on Windows. To talk to the database I am using SSMS.
There are three tables/queries.
1 table (A) has one data value I want to pull through.
2 tables (B)/(C) have multiple values.
Column common to all tables is CAMPAIGN NAME
Column common to (B)/(C) is PRODUCT NAME
This is an example of the data:
OUTPUT GOAL
I have tried the following:
UNION ALL (but this does not help when I want to calculate AMOUNT - MARKETING - TOTAL INVESTMENT).
I tried PARTITION (but I simply could not get it to work).
If I use joins, it brings through a head count / total investment and marketing cost per product, which when using SUM brings through the incorrect values for head count / total investment and marketing cost vs total amount, quantity.
I tried splitting the costs based on Quantity / Total Quantity or Amount / Total Amount, but the cost associated with the product is not correct or directly relating to the product this way.
Am I trying to do something impossible, or is there a way to do this in SQL?
The following comes pretty close to what you want:
select . . . -- select the columns you want here
from a
join b
  on b.campaign_name = a.campaign_name
join c
  on c.campaign_name = b.campaign_name and
     c.product_name = b.product_name;
This produces a result set with a separate row for each campaign/product.
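To see why summing over such a result inflates the campaign-level figures, here is a small Python sketch of the same join. All data, campaign and product names, and column names are hypothetical; they are not taken from the question's tables.

```python
# Table A: one row per campaign (hypothetical campaign-level value).
a = {"Spring": {"total_investment": 1000}}
# Table B: (campaign_name, product_name, amount)
b = [("Spring", "Widget", 300), ("Spring", "Gadget", 500)]
# Table C: (campaign_name, product_name, quantity)
c = [("Spring", "Widget", 3), ("Spring", "Gadget", 5)]

# Index C by the two join columns, then join A, B and C like the query does.
c_index = {(campaign, product): qty for campaign, product, qty in c}
joined = [
    {
        "campaign": campaign,
        "product": product,
        "amount": amount,
        "quantity": c_index[(campaign, product)],
        "total_investment": a[campaign]["total_investment"],
    }
    for campaign, product, amount in b
    if campaign in a and (campaign, product) in c_index
]

# Product-level columns sum correctly over the joined rows ...
total_amount = sum(row["amount"] for row in joined)
# ... but the campaign-level value is repeated on every product row.
naive_investment = sum(row["total_investment"] for row in joined)
# Counting the campaign-level value once per campaign avoids the inflation.
investment_once = sum(
    {row["campaign"]: row["total_investment"] for row in joined}.values()
)
print(total_amount, naive_investment, investment_once)
```

In SQL terms, this suggests aggregating the product-level tables per campaign before combining them with the campaign-level value, or taking the campaign-level value once per campaign rather than summing it.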

Counting latest instance of multiple only based on filter context

I've got a large table of events that have occurred in an inventory of vehicles, which affect whether they are in service or out of service. I would like to create a measure that would be able to count the number of vehicles in the various inventories at any point in time, based on the events in this table.
This table is pulled from a SQL database into an Excel 2016 sheet, and I'm using PowerPivot to try to come up with the DAX measure.
Here is some example data event_list:
vehicle_id  event_date  event        event_sequence  inventory
100         2018-01-01  purchase     1               in-service
101         2018-01-01  purchase     1               in-service
102         2018-02-04  purchase     1               in-service
100         2018-02-07  maintenance  2               out-of-service
101         2018-02-14  damage       2               out-of-service
101         2018-02-18  repaired     3               in-service
100         2018-03-15  repaired     3               in-service
102         2018-05-01  damage       2               out-of-service
103         2018-06-03  purchase     1               in-service
I'd like to be able to create a pivot table in Excel (or use CUBE functions, etc) to get an output table like this:
date        in-service  out-of-service
2018-02-04  3           0
2018-02-14  1           2
2018-03-15  3           0
2018-06-03  3           1
Essentially, I want to be able to calculate the inventory based on any date in time. The example only has a few dates, but hopefully provides enough of a picture.
I've basically come up with this so far, but it counts more vehicles than desired - I can't figure out how to only take the latest event_sequence or event_date and use that to count the inventory.
cumulative_vehicles_at_date:=CALCULATE(
COUNTA([vehicle_id]),
IF(IF(HASONEVALUE (event_list[event_date]), VALUES (event_list[event_date]))>=event_list[event_date],event_list[event_date])
)
I tried using MAX() and EARLIER() functions, but they don't seem to work.
Edit: Added the PowerBI tag as I'm now using that software to attempt to solve this as well. See comments on Alexis Olson's answer.
I think I've found a much cleaner method than I gave previously.
Let's add two columns onto the event_list table. One which counts vehicles "in-service" on that date and one which counts vehicles "out-of-service" on that date.
InService =
VAR Summary =
    SUMMARIZE(
        FILTER(event_list,
            event_list[event_date] <= EARLIER(event_list[event_date])),
        event_list[vehicle_id],
        "MaxSeq", MAX(event_list[event_sequence]))
VAR Filtered =
    FILTER(event_list,
        event_list[event_sequence] =
            MAXX(
                FILTER(Summary,
                    event_list[vehicle_id] = EARLIER(event_list[vehicle_id])),
                [MaxSeq]))
RETURN
    SUMX(Filtered, 1 * (event_list[inventory] = "in-service"))
You can create an analogous calculated column for OutOfService or you can just take the total minus the InService count.
OutOfService =
CALCULATE(
    DISTINCTCOUNT(event_list[vehicle_id]),
    FILTER(event_list,
        event_list[event_date] <= EARLIER(event_list[event_date])))
    - event_list[InService]
Now all you have to do is put event_date on the matrix visual rows section and add the InService and OutOfService columns to the values section (use Maximum or Minimum for the aggregation option rather than Sum).
Here's the logic behind the calculated column InService:
We first create a Summary table which calculates the maximal event_sequence value for each vehicle. (We filter the event_date to only consider dates up to the current one we are working with.)
Now that we know what the last event_sequence value is for each vehicle, we use that to filter the entire table down to just the rows that correspond to those vehicles and sequence values. The filter goes through the table row by row and checks to see if the sequence value matches the one we calculated in the Summary table. Note that when we filter the Summary table to just the vehicle we are currently working with, we only get a single row. I'm just using MAXX to extract the [MaxSeq] value. (It's kind of like using LOOKUPVALUE, but you can't use that on a variable.)
Now that we've filtered the table just to the most recent events for each vehicle, all we need to do is count how many of them are "in-service". I used a SUMX here where the 1*(True/False) coerces the boolean value to return 1 or 0.
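The same three steps can be sketched outside DAX. The Python below is only an illustration of the logic, using the example rows from the question; the function name `inventory_at` is hypothetical. It keeps the highest event_sequence per vehicle up to a given date and then counts the inventory values.

```python
from collections import Counter

event_list = [
    # (vehicle_id, event_date, event, event_sequence, inventory)
    (100, "2018-01-01", "purchase", 1, "in-service"),
    (101, "2018-01-01", "purchase", 1, "in-service"),
    (102, "2018-02-04", "purchase", 1, "in-service"),
    (100, "2018-02-07", "maintenance", 2, "out-of-service"),
    (101, "2018-02-14", "damage", 2, "out-of-service"),
    (101, "2018-02-18", "repaired", 3, "in-service"),
    (100, "2018-03-15", "repaired", 3, "in-service"),
    (102, "2018-05-01", "damage", 2, "out-of-service"),
    (103, "2018-06-03", "purchase", 1, "in-service"),
]


def inventory_at(events, as_of):
    """Return (in_service, out_of_service) counts as of the given ISO date."""
    latest = {}  # vehicle_id -> (event_sequence, inventory) of its latest event
    for vehicle_id, event_date, _event, sequence, inventory in events:
        # ISO-formatted dates compare correctly as plain strings.
        if event_date <= as_of:
            if vehicle_id not in latest or sequence > latest[vehicle_id][0]:
                latest[vehicle_id] = (sequence, inventory)
    counts = Counter(inventory for _sequence, inventory in latest.values())
    return counts["in-service"], counts["out-of-service"]


for day in ("2018-02-04", "2018-02-14", "2018-03-15", "2018-06-03"):
    print(day, inventory_at(event_list, day))
```

Run against the four dates in the question's desired output table, this reproduces the in-service and out-of-service counts listed there.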
This is pretty difficult. I don't have a great answer, but here's something that kind of works.
You'll create a new calculated table where you'll calculate the status for each vehicle on each date. Start with the base cross join for each vehicle and each date:
= CROSSJOIN(VALUES(event_list[vehicle_id]), VALUES(event_list[event_date]))
Then add a calculated column to find the max sequence number for each vehicle on that date.
Sequence = MAXX(
    FILTER(event_list,
        event_list[event_date] <= Cross[event_date] &&
        event_list[vehicle_id] = Cross[vehicle_id]),
    event_list[event_sequence])
Now you can lookup the inventory value for each vehicle/sequence pair with another calculated column:
Inventory = LOOKUPVALUE(
    event_list[inventory],
    event_list[vehicle_id], Cross[vehicle_id],
    event_list[event_sequence], Cross[Sequence])
The result should look something like this:
Once you have this, you can create a matrix using this calculated table. Put the event_date on the rows and Inventory on the columns. Filter out blank inventory values in the visual level filter and put the vehicle_id in the values field, using a count or distinct count as the aggregation method (instead of the default sum).
It should look like this:

counting and numbering in a select statement in Access SQL

Could you please help me figure out how to accomplish the following?
I have a table containing the number of products available between one date and another as per below:
TABLE MyProducts
DateProduct  ProductId  Quantity  Price
26/02/2016   7          2         100
27/02/2016   7          3         100
28/02/2016   7          4         100
I have created a form where users need to select a date range and the number of products they are looking for (in my example, the number of products is going to be 1).
In this example, let's say that a user makes the following selection:
SELECT SUM(MyProducts.Price) As TotalPrice
FROM MyProducts WHERE MyProducts.DateProduct
Between #2/26/2016# And #2/29/2016#-1 AND MyProducts.Quantity>=1
Now the user can see the total amount that 1 product costs: 300
For this date range, however, I also want to let users select from a combobox the number of products that they can still buy. If you look at the Quantity for this date range, a user can buy a maximum of 2 products, because 2 is the lowest quantity available across all the dates listed in the query.
First question: how can I feed the combobox with a "1 to 2" list (in this case), considering that 2 is the lowest quantity available across all the dates queried by this user?
Second question: how can I manage the products that a user has purchased?
Let's say that a user has purchased 1 product within this date range, and a second user has purchased the same quantity (1) for the very same date range, for a total of 2 products already purchased in this date range. How can I see that, for this date range and in this case, the number of products actually available is:
DateProduct  ProductId  Quantity  Price
26/02/2016   7          0         100
27/02/2016   7          1         100
28/02/2016   7          2         100
Thank you in advance and please let me know should you need further information.
You could create a table with an integer field counting from 1 to whatever max qty you could expect. Then create a query that will only return rows from your new table up to the min() qty in the MyProducts table. Use that query as the row source of your combobox.
EDIT: You will actually need two queries. The first should be:
SELECT Min(MyProducts.Quantity) AS MinQty FROM MyProducts;
which I called "qryMinimumProductQty". I create the table called "Numbering" with a single integer field called "Sequence". The second query:
SELECT Numbering.Sequence FROM Numbering, qryMinimumProductQty WHERE Numbering.Sequence<=qryMinimumProductQty.MinQty;
AFAIK there is no Access function/feature that will fill in a series of numbers in a combobox row source. You have to build the row source yourself. (Anyone with more VBA experience might have a solution to solve this, but I do not.)
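The logic behind the two queries can be sketched in Python: take the minimum quantity and list the integers from 1 up to it. This is only an illustration, not Access code. Unlike the query above, the sketch restricts the minimum to the selected date range, which is an assumption about what the form needs; the function name is hypothetical.

```python
from datetime import date

my_products = [
    # (DateProduct, ProductId, Quantity, Price)
    (date(2016, 2, 26), 7, 2, 100),
    (date(2016, 2, 27), 7, 3, 100),
    (date(2016, 2, 28), 7, 4, 100),
]


def quantity_choices(products, start, end):
    """Return the list 1..N, where N is the lowest Quantity in [start, end]."""
    quantities = [
        quantity
        for day, _product_id, quantity, _price in products
        if start <= day <= end
    ]
    if not quantities:
        return []  # nothing on offer in this range
    return list(range(1, min(quantities) + 1))


choices = quantity_choices(my_products, date(2016, 2, 26), date(2016, 2, 28))
print(choices)
```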
It makes me ache thinking of an entire table with a single integer column only being used for a combobox, though. A simpler approach than the combobox would be to just show the qty available in a control on your form, give the user an unbound text box to enter their order qty, and add a validation rule to stop the order and notify them if they have chosen a number greater than the qty on hand. (Just a thought)
As for your second question, I don't really understand what you're looking for either. It sounds like there may be another table of purchases? It should be a simple query to relate MyProducts to Purchases and take the difference between your MyProducts!qty and the Purchases!qty. If you don't have a table to store Purchases, it might be warranted based on my cursory understanding of your system.
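A sketch of that difference, assuming a Purchases table that stores one row per order with the date range it covers. The Purchases layout and the function name are hypothetical, since the question does not show such a table.

```python
from datetime import date

# MyProducts keyed by (DateProduct, ProductId) -> Quantity
my_products = {
    (date(2016, 2, 26), 7): 2,
    (date(2016, 2, 27), 7): 3,
    (date(2016, 2, 28), 7): 4,
}

# Hypothetical Purchases table: (first_day, last_day, ProductId, Quantity)
purchases = [
    (date(2016, 2, 26), date(2016, 2, 28), 7, 1),  # first user
    (date(2016, 2, 26), date(2016, 2, 28), 7, 1),  # second user
]


def remaining_quantities(products, orders):
    """Subtract every purchased quantity from each day the purchase covers."""
    left = dict(products)
    for first_day, last_day, product_id, quantity in orders:
        for day, pid in left:
            if pid == product_id and first_day <= day <= last_day:
                left[(day, pid)] -= quantity
    return left


remaining = remaining_quantities(my_products, purchases)
for (day, pid), quantity in sorted(remaining.items()):
    print(day.strftime("%d/%m/%Y"), pid, quantity)
```

With the two purchases from the question, the remaining quantities come out as 0, 1 and 2, matching the table the asker expects.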

MDX query to count number of rows that match a certain condition (newest row for each question, client group)

I have the following fact table:
response_history_id  client_id  question_id  answer
1                    1          2            24
2                    1          2            27
3                    1          3            12
4                    1          2            43
5                    2          2            39
It holds history of client answers to some questions. The largest response_history_id for each client_id,question_id combination is the latest answer for that question and client.
What I want to do is to count the number of clients whose latest answer falls within a specific range.
I have some dimensions:
question associated with question_id
client associated with client_id
response_history_id associated with response_history_id
range associated with answer: 0-20 = low, 20-40 = medium, >40 = high
and some measures:
max_history_id as max(response_history_id)
clients_count as distinct count(client_id)
Now, I want to group only the latest answers by range:
select
[ranges].members on 0,
{[Measures].[clients_count]} on 1
from (select [question].[All].[2] on 1 from [Cube])
What I get is:
Measures       All  low  medium  high
clients_count  2    0    2       1
But what I wanted (and I can't get) is the calculation based on the latest answer:
Measures       All  low  medium  high
clients_count  2    0    1       1
I understand why my query doesn't give me the desired result, it's more for demonstration purpose. But I have tried a lot of more complex MDX queries and still couldn't get any good result.
Also, I can't generate a static view from my fact table, because later on I would like to limit the search by another column in the fact table, a timestamp. My queries must eventually be able to get the number of clients whose latest answer to a question before a given timestamp falls within a specific range.
Can anyone help me with this please?
I can define other dimensions and measures, and I am using icCube.
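For reference, the calculation I am after can be sketched in Python (this is not MDX, only the intended logic). The range boundaries at exactly 20 and 40 are an assumption, since the ranges above overlap. The function names are hypothetical, and the optional `max_history_id` cut-off merely stands in for the timestamp filter.

```python
from collections import Counter

facts = [
    # (response_history_id, client_id, question_id, answer)
    (1, 1, 2, 24),
    (2, 1, 2, 27),
    (3, 1, 3, 12),
    (4, 1, 2, 43),
    (5, 2, 2, 39),
]


def answer_range(answer):
    # Assumed boundaries: 20 counts as low, 40 counts as medium.
    if answer <= 20:
        return "low"
    if answer <= 40:
        return "medium"
    return "high"


def clients_by_range(rows, question_id, max_history_id=None):
    """Count clients per range, using only each client's latest answer."""
    latest = {}  # client_id -> (response_history_id, answer)
    for history_id, client_id, q_id, answer in rows:
        if q_id != question_id:
            continue
        if max_history_id is not None and history_id > max_history_id:
            continue  # stand-in for "before a given timestamp"
        if client_id not in latest or history_id > latest[client_id][0]:
            latest[client_id] = (history_id, answer)
    counts = Counter(answer_range(answer) for _id, answer in latest.values())
    return {name: counts[name] for name in ("low", "medium", "high")}


print(clients_by_range(facts, question_id=2))
```

For question 2 this gives low 0, medium 1, high 1, which is the result I want the MDX query to return.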