How can I reduce complexity? Data preparation, SQL + Tableau

I need to prepare some data to connect to Tableau, and I'm struggling because the data is too large for Tableau to handle, so I'm looking for ideas to code this efficiently in SQL.
Setup:
I have 2 million users
There are 30 different categories, and each user can fall into many. For example:
User 1 - Category A, B and C
User 2 - Category F
User 3 - Category A, B
What I want:
Select three categories and assign priority 1, priority 2 and priority 3
This selection is not static: today I may choose A, B, C, but tomorrow the categories could be D, G, A
So if I have:
Priority 1: A
Priority 2: B
Priority 3: C
I want the number of users who fall into category A
I want the number of users who fall into category B AND are not in category A
I want the number of users who fall into category C AND are not in category A or B
My original idea was to create a table with one row per user and one yes/no column per category, and then aggregate, but the final table is still too large for Tableau to handle.
Any ideas?
Update: My idea is to prepare a table with aggregated numbers and a few thousand rows max, so that it can be processed with tableau

You can assign each of the 30 categories a unique bit position from 1 to 30. Each user is then assigned a 30-digit binary number based on the categories they fall into. This binary number can be converted to a decimal number, the largest of which is 2^30-1 = 1,073,741,823, a 10-digit number that fits in a signed 32-bit integer (maximum 2^31-1) and can be stored without exponent notation.
To see which categories a user falls into, apply the reverse conversion: decimal to binary, then to a string left-padded with zeros. In this string you can look for 1s at the desired positions.
I think you can try this approach.
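To make the bitmask idea concrete, here is a minimal Python sketch (not SQL, just to show the logic). It assumes the aggregated table is "one row per distinct bitmask, with a user count", which is usually much smaller than one row per user, although that is not guaranteed. The names `CATEGORIES`, `to_mask` and `priority_counts` are made up for illustration, and only five categories are used instead of 30.

```python
from collections import Counter

# Hypothetical category labels; the question has 30, five are enough for a demo.
CATEGORIES = ["A", "B", "C", "D", "F"]
BIT = {name: 1 << i for i, name in enumerate(CATEGORIES)}


def to_mask(categories):
    """Encode a set of category names as an integer bitmask."""
    mask = 0
    for name in categories:
        mask |= BIT[name]
    return mask


def priority_counts(mask_counts, priorities):
    """Count users per priority, skipping users already matched by a higher priority."""
    result = {}
    seen = 0  # bits of all higher-priority categories
    for name in priorities:
        bit = BIT[name]
        result[name] = sum(
            n for mask, n in mask_counts.items()
            if mask & bit and not mask & seen
        )
        seen |= bit
    return result


# Sample users from the question.
users = {1: {"A", "B", "C"}, 2: {"F"}, 3: {"A", "B"}}

# The aggregated table: one entry per distinct mask with a user count.
mask_counts = Counter(to_mask(cats) for cats in users.values())

print(priority_counts(mask_counts, ["A", "B", "C"]))  # {'A': 2, 'B': 0, 'C': 0}
print(priority_counts(mask_counts, ["B", "F", "A"]))  # {'B': 2, 'F': 1, 'A': 0}
```

Because the priorities are only applied to the small mask table, changing the selection from A, B, C to D, G, A does not require touching the 2 million user rows again.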

Related

Generate random records from the table tblFruit based on the field Type

I need your help to generate random records from the table tblFruit based on the field Type (without duplicates).
As per the above table.
There are 4 types of fruit, numbered 1, 2, 3 and 4.
I want to generate x records dynamically from the table tblFruit (e.g. 7 records).
Let's say I need to get 7 random fruit records.
My result should contain fruit of the different types. However, we need to ensure that the result contains only 7 records.
i.e.
2 records of type 1,
2 records of type 2,
2 records of type 3,
1 record of type 4
e.g
Note: If I want to generate 10 records (without duplicates),
then I will get 2 records of each type and the two remaining records randomly of any type.
Many thanks for your help.
I might suggest:
select top (7) f.*
from tblfruit f
order by row_number() over (partition by type order by newid());
This produces a result with approximately the same number of rows of each type (off by at most 1, provided each type has enough rows), which meets your needs.
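If it helps to see why ordering by a per-type row number balances the result, here is a rough Python equivalent. It is only a sketch: `sample_balanced` and the sample rows are made up, and the random tiebreaker is my own choice (in the SQL above, the order among rows with the same row number is left to the engine).

```python
import random
from collections import defaultdict


def sample_balanced(rows, n, rng=random):
    """Pick n rows, spreading them as evenly as possible across types."""
    by_type = defaultdict(list)
    for row in rows:
        by_type[row["type"]].append(row)

    ranked = []
    for group in by_type.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        # rank plays the role of row_number() within each type
        for rank, row in enumerate(shuffled, start=1):
            # the random tiebreaker decides which types get the leftover rows
            ranked.append((rank, rng.random(), row))

    # Sort by rank first, so every type contributes one row before any contributes two.
    ranked.sort(key=lambda item: item[:2])
    return [row for _, _, row in ranked[:n]]


# Hypothetical stand-in for tblFruit: 20 rows, 5 of each type 1..4.
rows = [{"id": i, "type": i % 4 + 1} for i in range(20)]
picked = sample_balanced(rows, 7)
print(sorted(row["type"] for row in picked))
```

With 7 rows and 4 types, three types end up with 2 rows and one type with 1. With 10 rows, two types get 3 rows and two get 2.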

How to combine a row of cells in VBA if certain column values are the same

I have a database where all of the input from the user (through a userform) gets stored. In the database, each column is a different category for the type of data (e.g. date, shift, quantity, etc.), and the data from the userform input gets put into its corresponding category. Some rows are identical except for the quantity. I was wondering how I could combine these rows into one and add their quantities together for the whole database (e.g. combining the first and third data entries). I have tried playing around with a couple of different loops but can't seem to figure anything out.
Period  Date    Line  Shift  Type  Quantity
4 x 2   4/3/18  A     3      14    18
4 x 2   4/3/18  A     3      13    12
4 x 2   4/3/18  A     3      14    15
Thank you!
If you're looking to modify the underlying database, you can query the data into the format you want by summing Quantity and listing all the other columns in a GROUP BY clause, save the result to another table, then replace the original table with the properly formatted one.
If you have the data in Excel and you just want to view it with the duplicate rows summed, a Pivot Table would be a good choice. You can select all the other columns as rows for the Pivot Table and sum of Quantity as the values.
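Both suggestions boil down to "group by every column except Quantity and sum Quantity". A small Python sketch of that logic, using the three rows from the question (I am assuming "4 x 2" is the Period value, since the column boundaries are not entirely clear from the post):

```python
from collections import defaultdict

# Rows from the question: (period, date, line, shift, type, quantity)
rows = [
    ("4 x 2", "4/3/18", "A", 3, 14, 18),
    ("4 x 2", "4/3/18", "A", 3, 13, 12),
    ("4 x 2", "4/3/18", "A", 3, 14, 15),
]

# Key on every column except the last one and add up the quantities.
totals = defaultdict(int)
for *key, quantity in rows:
    totals[tuple(key)] += quantity

combined = [key + (quantity,) for key, quantity in totals.items()]
for row in combined:
    print(row)
```

The first and third entries collapse into one row with a quantity of 33, and the second entry is left as it is. The same loop translates fairly directly to VBA with a Scripting.Dictionary keyed on the concatenated columns.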

What is the best way to reassign ordinal number of a move operation

I have a column in SQL Server called "Ordinal" that indicates the display order of the rows. It starts at 0 and increases by 10 for each row, so we have something like this:
Id Ordinal
1 0
2 20
3 10
It skips by 10 so that we can move an item between two other items (based on ordinal) without having to reassign ordinal numbers for the entire table.
As you can imagine, ordinal numbers will eventually need to be reassigned for a move-in-between operation, either on the surrounding rows or for the entire table, once the unused ordinal numbers between the target items are used up.
Is there an algorithm I can use to reorder the ordinal numbers efficiently for the move operation, taking into consideration the long-term maintainability of the table and minimizing update operations?
You can re-number the sequences using a somewhat complicated UPDATE statement:
UPDATE u
SET u.sequence = 10 * (c.num_below - 1)
FROM test u
JOIN (
    SELECT t.id, COUNT(*) AS num_below
    FROM test t
    JOIN test tr ON tr.sequence <= t.sequence
    GROUP BY t.id
) c ON c.id = u.id
The idea is to count the items whose sequence is lower than or equal to that of the current row (this includes the row itself, hence the -1), multiply the result by ten, and assign it as the new sequence.
The content of test before the UPDATE:
ID Sequence
__ ________
1 0
2 20
3 10
4 12
The content of test after the UPDATE:
ID Sequence
__ ________
1 0
2 30
3 10
4 20
Now the sequence numbers are evenly spread again, so you can continue inserting in the middle until you run out of new sequence numbers; then you can re-number again.
Demo.
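For anyone who wants to check the arithmetic outside the database, the same renumbering can be sketched in a few lines of Python. This is only an illustration: `renumber` is a made-up helper, and it uses the position in sorted order rather than a self-join count, so rows with duplicate sequence values would be treated differently from the UPDATE above. The input rows match the sample in the question (ids 2 and 3 at 20 and 10) plus a row inserted at 12.

```python
def renumber(rows, step=10):
    """Return {id: new_sequence} with sequences respaced to 0, step, 2*step, ..."""
    ordered = sorted(rows, key=lambda item: item[1])
    return {row_id: step * position for position, (row_id, _) in enumerate(ordered)}


# (id, sequence) pairs before renumbering
before = [(1, 0), (2, 20), (3, 10), (4, 12)]
after = renumber(before)
print(after)  # {1: 0, 3: 10, 4: 20, 2: 30}
```

The display order stays the same (1, 3, 4, 2) and every gap is back to 10.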
These won't answer your question directly--I just thought I might suggest some other approaches:
One possibility--don't try to do it by hand. Have your software manage the numbers. If they need re-writing, just save them with new numbers.
A second--use a "linked list" instead. In each record, store the index of the next record you want displayed, then have your code load the records directly into a linked list.
Yet another simple approach. Let's say you're inserting a new record with an ordinal equal to x.
First, check whether there's a row with an ordinal value equal to x. If there is, update all the records with an ordinal value equal to or greater than x, increasing them by y. Then you can safely insert the new record.
This way you won't run an update on every insert and, of course, you'll keep the order.
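A minimal sketch of that last approach in Python, with a plain dict standing in for the table (`insert_at` and the default shift `y=10` are my own assumptions, not something from the question):

```python
def insert_at(ordinals, new_id, x, y=10):
    """Insert new_id at ordinal x; shift rows at or above x by y only if x is taken."""
    if x in ordinals.values():
        for row_id, ordinal in list(ordinals.items()):
            if ordinal >= x:
                ordinals[row_id] = ordinal + y
    ordinals[new_id] = x
    return ordinals


table = {1: 0, 2: 20, 3: 10}  # id -> ordinal, as in the question

insert_at(table, 4, 15)  # 15 is free, so nothing else is updated
print(table)             # {1: 0, 2: 20, 3: 10, 4: 15}

insert_at(table, 5, 10)  # 10 is taken, so rows at or above 10 move up by 10
print(table)             # {1: 0, 2: 30, 3: 20, 4: 25, 5: 10}
```

In SQL this would be one conditional UPDATE followed by the INSERT, ideally inside a single transaction.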

Row blocks in Hive (how to group rows by certain criteria and count these groups)

Here is a sample of the data I have:
Date_key UserID
20140401 a
20140402 a
20140406 a
20140407 a
20140408 a
20140409 a
20140404 b
20140408 b
20140409 b
20140414 b
20140415 b
... ...
Each row has a Date, User ID couple which indicates that that user was active on that day. A user can appear on multiple dates and a date will have multiple users -- just like in the example.
I want to get the number of consecutive day groups (i.e. blocks of activity). For example, this value for 'User a' would be 2 because they were active on 20140401 and 20140402, which is the first group of consecutive days. After 20140402, they waited for a while before becoming active again (i.e. they were not active the following day). On 20140406, their second block of activity started and continued without any break up until 20140409. For 'User b', this value would be 3 because they have been active during three consecutive day periods: 1) 20140404; 2) 20140408, 20140409; 3) 20140414, 20140415.
I use Hive. I am not sure if this is possible in Hive, but if the data needs to be carried over to a RDBMS to perform this task, I can do that too. Your recommendations are greatly appreciated. Thank you!
Cheers
When you use the distribute by clause, i.e. .......distribute by user_id sort by user_id,date_key desc......, all the records for a particular user go to a particular reducer, where they are then sorted by date_key descending. From there, you could write a UDF that iterates through the records, increments a counter by 1 whenever there is a break in continuity, and returns the result along with the user_id.
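The counting logic such a UDF would need is short. Here is a plain Python sketch of it (not Hive code; `parse_key` and `count_blocks` are made-up names, and it sorts ascending rather than descending, which makes no difference to the count):

```python
from datetime import date, timedelta
from itertools import groupby


def parse_key(date_key):
    """Turn a yyyymmdd string such as '20140401' into a date."""
    return date(int(date_key[:4]), int(date_key[4:6]), int(date_key[6:8]))


def count_blocks(rows):
    """Return {user_id: number of runs of consecutive active days}."""
    result = {}
    # Sorting by (user, day) mimics distribute by user / sort by user, date.
    ordered = sorted((user, parse_key(key)) for key, user in rows)
    for user, group in groupby(ordered, key=lambda item: item[0]):
        blocks = 0
        previous = None
        for _, day in group:
            # A gap of more than one day starts a new block.
            if previous is None or day - previous > timedelta(days=1):
                blocks += 1
            previous = day
        result[user] = blocks
    return result


# (Date_key, UserID) rows from the question
rows = [
    ("20140401", "a"), ("20140402", "a"), ("20140406", "a"),
    ("20140407", "a"), ("20140408", "a"), ("20140409", "a"),
    ("20140404", "b"), ("20140408", "b"), ("20140409", "b"),
    ("20140414", "b"), ("20140415", "b"),
]
print(count_blocks(rows))  # {'a': 2, 'b': 3}
```

Parsing the keys into real dates matters here: comparing the raw integers would report a false break across month boundaries (20140430 to 20140501).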

SQL View Summarizing One Table, Columns Based on Unknown Categories Entered

I have a table in the form:
date / category (string) / count (integer)
--------------------------------------------
7/15 A 3
7/15 B 7
7/15 C 2
7/16 A 9
7/16 B 1
7/16 C 2
Basically, for each day, each category will have a count associated with it.
The problem is, I don't necessarily know what these categories will end up being. Say I know they are A, B, and C, but next week, there is a D, E, and F.
And this is the view that I want to build:
Date / A / B / C / .. (however many categories found)
---------------------------------------------------------
7/15 3 5 2 3 4
7/16 9 5 9 6 4
...
..
.
I usually know enough SQL to get by, but this one is racking my brain. I don't think I am using the right vocabulary when trying to google it, because I'm not finding the answers I am looking for.
The answer is simple: you cannot build a view to do what you would like. A view has its columns pre-defined.
You could do one of the following:
1. Create a stored procedure that creates a view every week. This stored procedure would analyze the data, determine the columns, and then use dynamic SQL to alter the view.
2. Change the definition of what you want and put the values in a single column, separated by commas (or some other character).
3. Predefine a list of acceptable columns, create the view (using pivot, say) and then periodically go through and modify it when new values arise.
4. Do the pivoting at the application layer. This is particularly easy in Excel.
One big caveat with (1) and (3): if anything uses the view as "select * from view", you need to be sure that those queries/stored procedures/user defined functions/etc. are recompiled. Otherwise, they will have the wrong list of columns (this may only apply to SQL Server).
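As an illustration of option (4), pivoting in application code is only a few lines. The sketch below uses Python and the sample rows from the question. `pivot` is a made-up helper, and filling missing date/category combinations with 0 is my assumption.

```python
from collections import defaultdict

# (date, category, count) rows from the question
rows = [
    ("7/15", "A", 3), ("7/15", "B", 7), ("7/15", "C", 2),
    ("7/16", "A", 9), ("7/16", "B", 1), ("7/16", "C", 2),
]


def pivot(rows):
    """Build a header and one row per date, with one column per category found."""
    categories = sorted({category for _, category, _ in rows})
    by_date = defaultdict(dict)
    for day, category, count in rows:
        by_date[day][category] = by_date[day].get(category, 0) + count
    header = ["Date"] + categories
    # Missing date/category combinations are filled with 0 (an assumption).
    body = [
        [day] + [counts.get(c, 0) for c in categories]
        for day, counts in by_date.items()
    ]
    return header, body


header, body = pivot(rows)
print(header)  # ['Date', 'A', 'B', 'C']
for line in body:
    print(line)
```

Since the column list is rebuilt from the data on every run, new categories such as D, E and F show up without any change to the database.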