Dynamically determine and categorize duplicates in Tableau

I have a data set with the following structure:
ID | Date | DollarAmount
1  | Jan  | 50
1  | Jan  | 20
2  | Jan  | 10
1  | Feb  | 20
2  | Feb  | 10
I am trying to dynamically determine whether, for a particular period of time, there are duplicates based on the ID column.
For example, based on the data above, the two Jan rows for ID 1 would ideally be flagged as duplicates.
I have tried filtering based on Number of Records, but it filters based on the TOTAL observations across the dataset, not per date range.
Any help is much appreciated
Thanks!

Apparently you define duplicate records as those that have the same value for the ID and Date fields, where Date is really a string containing the abbreviated month name.
In that case, define a (Boolean-valued) LOD calculated field called [Duplicates] as {FIXED [ID], [Date] : COUNT(1)} > 1
Place [Duplicates] on the Color shelf, SUM([DollarAmount]) on Rows and [Date] on Columns.
You will see the values True and False in the color legend. You can edit the aliases for those values if you want to display clearer labels such as Duplicates and Non-Duplicates.
If you have a true date-valued field instead of a string, you may want to use DATETRUNC() to define your duplicate test at the level of granularity that matches your problem.
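For example, a month-level duplicate test on a real date field might look like the following sketch (the 'month' date part is an assumption; pick whichever granularity matches your problem):
// True when more than one row shares the same ID and calendar month
{ FIXED [ID], DATETRUNC('month', [Date]) : COUNT(1) } > 1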

Related

Get the difference in time between multiple rows with the same column name

I need to get the time difference between two dates on different rows. That part is okay, but I can have multiple instances of the same title. A quick example will explain things some more.
Let's say we have a table with the following records:
| ID | Title | Date                |
| -- | ----- | ------------------- |
| 1  | Down  | 2021-03-07 12:05:00 |
| 2  | Up    | 2021-03-07 13:05:00 |
| 3  | Down  | 2021-03-07 10:30:00 |
| 4  | Up    | 2021-03-07 11:00:00 |
I basically need to get the time difference between the first "Down" and "Up". So ID 1 & 2 = 1 hour.
Then ID 3 & 4 = 30 mins, and so on for the amount of "Down" and "Up" rows there are.
(These will always be grouped together one after another)
It doesn't matter if the results are separate or a SUM of all the differences.
I'm trying to get this done without a temp table.
Thank you.
This can be done using analytic (window) functions, whose availability depends on your SQL engine. The idea is to bring the next row's value onto the current row so you can compute the difference or sum.
In the case above it would look something like this:
SELECT
    id,
    title,
    Date AS startdate,
    LEAD(Date, 1) OVER (ORDER BY id) AS enddate
FROM your_table;
Once you have both timestamps on the same row, you can carry out your time difference operation, as sketched below.
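For example (a sketch assuming MySQL 8+; TIMESTAMPDIFF and the table name your_table are assumptions — other engines use DATEDIFF or interval subtraction instead):
-- Pair each row with the next row's timestamp, then keep only the "Down"
-- rows, whose enddate is the matching "Up" timestamp.
SELECT
    id,
    title,
    startdate,
    enddate,
    TIMESTAMPDIFF(MINUTE, startdate, enddate) AS diff_minutes
FROM (
    SELECT
        id,
        title,
        Date AS startdate,
        LEAD(Date, 1) OVER (ORDER BY id) AS enddate
    FROM your_table
) AS paired
WHERE title = 'Down';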

Structuring Month-Based Data in SQL

I'm curious about the best way to structure data in a SQL database when I need to keep track of certain fields and how they differ month to month.
For example, suppose I had a users table storing three different values: name, email, and how many times each user has logged in each month. Would it be best practice to create a new column for each month and store the number of logins for that month under that column? Or would it be better to create a new row/table for each month?
My instinct says creating new columns is the best way to reduce redundancy, but I can see it getting a little unwieldy as the number of columns in the table grows over time. (I was also thinking that if I did it by column, it would warrant a total column that keeps track of all months at once.)
Thanks!
In my opinion, the best approach is to store each login for each user.
Use a query to summarize the data the way you need it when you query it.
You should only be thinking about other structures if summarizing the detail doesn't meet performance requirements -- which for a monthly report don't seem so onerous.
Whatever you do, storing counts in separate columns is not the right thing to do. Every month, you would need to add another column to the table.
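A minimal sketch of that approach (table and column names are illustrative, not from the question):
-- One row per login event; nothing is pre-aggregated.
CREATE TABLE user_logins (
    user_id    INT       NOT NULL,
    login_time TIMESTAMP NOT NULL
);

-- Monthly login counts per user, derived at query time.
SELECT
    user_id,
    EXTRACT(YEAR FROM login_time)  AS login_year,
    EXTRACT(MONTH FROM login_time) AS login_month,
    COUNT(*) AS login_count
FROM user_logins
GROUP BY user_id, EXTRACT(YEAR FROM login_time), EXTRACT(MONTH FROM login_time);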
I'm not an expert, but in my opinion it is best to store the data in a separate table (in your case). That way you can manipulate the data easily and you don't have to modify the table design in the future.
PK: UserID & Date or New Column (Ex: RowNo with auto increment)
+--------+------------+-----------+
| UserID | Date | NoOfTimes |
+--------+------------+-----------+
| 01 | 2018.01.01 | 1 |
| 01 | 2018.01.02 | 3 |
| 01 | 2018.01.03 | 5 |
| .. | | |
| 02 | 2018.01.01 | 2 |
| 02 | 2018.01.02 | 6 |
+--------+------------+-----------+
Or
PK: UserID, Year & Month or New Column (Ex: RowNo with auto increment)
+--------+------+-------+-----------+
| UserID | Year | Month | NoOfTimes |
+--------+------+-------+-----------+
| 01 | 2018 | Jan | 10 |
| 01 | 2018 | Feb | 13 |
+--------+------+-------+-----------+
Before you create the table, please take a look at database normalization, especially the first (1NF), second (2NF), and third (3NF) normal forms.
https://www.tutorialspoint.com/dbms/database_normalization.htm
https://www.lifewire.com/database-normalization-basics-1019735
https://www.essentialsql.com/get-ready-to-learn-sql-database-normalization-explained-in-simple-english/
https://www.studytonight.com/dbms/database-normalization.php
https://medium.com/omarelgabrys-blog/database-normalization-part-7-ef7225150c7f
Either approach is valid, depending on query patterns and join requirements.
One row for each month
For a user, the row containing the login count for a month is inserted when data for that month becomes available. There will be one row per month per user. This design makes it easier to join on the month column (see the sketch after the example entries below). However, multiple rows must be read to get a user's data for the whole year.
-- column list
name
email
month
login_count
-- example entries
'user1', 'user1#email.com','jan',100
'user2', 'user2#email.com','jan',65
'user1', 'user1#email.com','feb',90
'user2', 'user2#email.com','feb',75
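For instance, joining on the month column is straightforward in this design (a sketch; monthly_logins and the campaigns table are hypothetical names used only for illustration):
-- Month-keyed join: one row per user per month matches cleanly
-- against any other month-keyed table.
SELECT m.name, m.month, m.login_count, c.campaign_name
FROM monthly_logins m
JOIN campaigns c ON c.month = m.month;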
One row for all months
You do not need to dynamically add columns, since the number of months is known in advance. The table can be created up front to accommodate all months, with every month_login_count column defaulting to 0. The row is then updated as the login count for each month is populated. There will be one row per user. This design is not the best for joins by month, but only one row needs to be read to get a user's data for the whole year.
-- column list
name
email
jan_login_count
feb_login_count
mar_login_count
apr_login_count
may_login_count
jun_login_count
jul_login_count
aug_login_count
sep_login_count
oct_login_count
nov_login_count
dec_login_count
-- example entries
'user1','user1#email.com',100,90,0,0,0,0,0,0,0,0,0,0
'user2','user2#email.com',65,75,0,0,0,0,0,0,0,0,0,0
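To see the trade-off, here is a sketch of retrieving one user's yearly total under each design (monthly_logins and user_logins_wide are hypothetical table names for the two layouts above):
-- One row per month: aggregate across up to twelve rows.
SELECT name, SUM(login_count) AS yearly_logins
FROM monthly_logins
WHERE name = 'user1'
GROUP BY name;

-- One row for all months: read a single row and add the columns.
SELECT name,
       jan_login_count + feb_login_count + mar_login_count + apr_login_count
     + may_login_count + jun_login_count + jul_login_count + aug_login_count
     + sep_login_count + oct_login_count + nov_login_count + dec_login_count
       AS yearly_logins
FROM user_logins_wide
WHERE name = 'user1';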

DAX Query to get average of a column within the same table

We have a table named MetricsTable which simply has the columns A1 and Group.
We want to add a calculated column AvgA1 to this table which calculates the average of column A1 filtered by the value of Group. What should our DAX expression be? The point is that we want to calculate the average from the values within the same table.
| id | A1 | Group  | AvgA1 |
| -- | -- | ------ | ----- |
| 1  | 20 | Group1 | 20    |
| 2  | 10 | Group2 | 30    |
| 3  | 50 | Group2 | 30    |
| 4  | 30 | Group2 | 30    |
| 5  | 35 | Group3 | 35    |
Regards
Likely you should use a measure and put that measure into a pivot table's 'Values' section:
AverageA1:=
AVERAGE( Metrics[A1] )
Then it will be updated based on filter and slicer selections in the pivot table, and subtotaled appropriately across various dimension categories.
If it strictly needs to be a column in the table for reasons not enumerated in your question, then the following will work:
AverageA1 =
CALCULATE(
    AVERAGE( Metrics[A1] ),
    ALLEXCEPT( Metrics, Metrics[Group] )
)
CALCULATE() takes an expression and a list of 0-N arguments to modify the filter context in which that expression is evaluated.
ALLEXCEPT() takes a table, and a list of 1-N fields from which to preserve context. The current (row) context in the evaluation of this column definition is the value of every field on that row. We remove the context from ALL fields EXCEPT those named in arguments 2-N of ALLEXCEPT(). Thus we preserve the row context of [Group], and calculate an average across the table where that [Group] is the same as in the current context.
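An equivalent calculated column that avoids ALLEXCEPT(), using FILTER() with EARLIER() to compare each row's [Group] against the row being computed (a sketch, assuming the same Metrics table):
AverageA1 =
AVERAGEX(
    FILTER( Metrics, Metrics[Group] = EARLIER( Metrics[Group] ) ),
    Metrics[A1]
)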

SSRS report table including Averages

I need a report that mixes calculated rows in with the regular data rows. For example:
Name | Age | Salary
HR   | 35  | $1300
John | 30  | $1000
Mark | 40  | $1600
Law  | 45  | $1500
Bill | 40  | $1000
Sara | 50  | $2000
The idea is to group rows by a field and then add a row with average numbers for each group.
Is it possible? I also have 2 date parameters (start and end), so I need to get all the records to SSRS and then filter them out...
Yes, this is possible and very straightforward.
Create your report with the data rows, then create a group on the Department field. You can do this a few ways: right click on the detail rows and select Add Group... or drag the department field to the Row groups pane in the design window.
Add a row to the group by right-clicking on the details group and choosing to add a total before the details. In the new row, set your formula to =Avg(Fields!AgeFieldName.Value)
Take a look at the tutorials available on MSDN, especially the Grouping and Totals section

Find last (first) instance in table but exclude most recent (oldest) date

I have a table that reflects a monthly census of a certain population. Each month, on an unpredictable day early in that month, the population is polled. Any member who existed at that point is included in that month's poll; any member who didn't is not.
My task is to look through an arbitrary date range and determine which members were added or lost during that time period. Consider the sample table:
ID | Date
2 | 1/3/2010
3 | 1/3/2010
1 | 2/5/2010
2 | 2/5/2010
3 | 2/5/2010
1 | 3/3/2010
3 | 3/3/2010
In this case, member with ID "1" was added between Jan and Feb, and member with ID 2 was lost between Feb and Mar.
The problem I am having is that if I just query for the most recent entry, I will capture all the members that were dropped, but also all the members that still exist on the last date. For example, I could run this query:
SELECT
    ID,
    MAX(Date)
FROM tableName
WHERE Date BETWEEN '1/1/2010' AND '3/27/2010'
GROUP BY ID
This would return:
ID | Date
1 | 3/3/2010
2 | 2/5/2010
3 | 3/3/2010
What I actually want, however, is just:
ID | Date
2 | 2/5/2010
Of course I can manually filter out the last date, but since the start and end date are parameters I want to generalize that. One way would be to run sequential queries. In the first query I'd find the last date, and then use that to filter in the second query. It would really help, however, if I could wrap this logic into a single query.
I'm also having a related problem when I try to find when a member was first added to the population. In that case I'm using a different type of query:
SELECT
    ID,
    Date
FROM tableName i
WHERE Date BETWEEN '1/1/2010' AND '3/27/2010'
AND NOT EXISTS (
    SELECT 1
    FROM tableName ii
    WHERE ii.ID = i.ID
    AND ii.Date < i.Date
    AND ii.Date BETWEEN '1/1/2010' AND '3/27/2010'
)
This returns:
ID | Date
1 | 2/5/2010
2 | 1/3/2010
3 | 1/3/2010
But what I want is:
ID | Date
1 | 2/5/2010
I would like to know:
1. Which approach (the MAX() or the subquery with NOT EXISTS) is more efficient and
2. How to fix the queries so that they only return the rows I want, excluding the first (last) date.
Thanks!
You could do something like this:
SELECT
    ID,
    MAX(Date)
FROM tableName
WHERE Date BETWEEN '1/1/2010' AND '3/27/2010'
GROUP BY ID
HAVING MAX(Date) < '3/1/2010'
This filters out anyone polled in March.
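Since the start and end dates are parameters, you can generalize the cutoff by comparing against the last (or, for additions, the first) poll date inside the range, all in a single query (a sketch; @startDate and @endDate are placeholder parameter names):
-- Members lost: last seen before the final poll in the range.
SELECT
    ID,
    MAX(Date) AS last_seen
FROM tableName
WHERE Date BETWEEN @startDate AND @endDate
GROUP BY ID
HAVING MAX(Date) < (SELECT MAX(Date)
                    FROM tableName
                    WHERE Date BETWEEN @startDate AND @endDate);

-- Members added: first seen after the first poll in the range.
SELECT
    ID,
    MIN(Date) AS first_seen
FROM tableName
WHERE Date BETWEEN @startDate AND @endDate
GROUP BY ID
HAVING MIN(Date) > (SELECT MIN(Date)
                    FROM tableName
                    WHERE Date BETWEEN @startDate AND @endDate);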