Structuring Month-Based Data in SQL

I'm curious about the best way to structure data in a SQL database when I need to track how certain fields change from month to month.
For example, if I had a users table in which I was trying to store 3 different values: name, email, and how many times they've logged in each month. Would it be best practice to create a new column for each month and store the number of times they logged in that month under that column? Or would it be better to create a new row/table for each month?
My instinct says creating new columns is the best way to reduce redundancy; however, I can see it getting unwieldy as the number of columns in the table grows over time. (I was also thinking that if I did it by column, it would warrant having a total column that tracks all months combined.)
Thanks!

In my opinion, the best approach is to store each login for each user.
Use a query to summarize the data the way you need it when you query it.
You should only be thinking about other structures if summarizing the detail doesn't meet performance requirements -- which for a monthly report don't seem so onerous.
Whatever you do, storing counts in separate columns is not the right thing to do. Every month, you would need to add another column to the table.
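To make that concrete, here is a minimal sketch of the detail table and the summarizing query; the table and column names (user_logins, login_time) are illustrative assumptions, not from the question, and YEAR/MONTH are MySQL/SQL Server functions (other engines spell date extraction differently):
-- One row per login event.
CREATE TABLE user_logins (
    user_id    INT      NOT NULL,  -- references users(id)
    login_time DATETIME NOT NULL
);
-- Monthly login counts per user, computed on demand.
SELECT user_id,
       YEAR(login_time)  AS login_year,
       MONTH(login_time) AS login_month,
       COUNT(*)          AS login_count
FROM user_logins
GROUP BY user_id, YEAR(login_time), MONTH(login_time);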

I'm not an expert, but in my opinion it is best (in your case) to store the data in a separate table. That way you can manipulate the data easily and you don't have to modify the table design in the future.
PK: UserID & Date, or a new column (e.g. RowNo with auto-increment)
+--------+------------+-----------+
| UserID | Date       | NoOfTimes |
+--------+------------+-----------+
| 01     | 2018.01.01 | 1         |
| 01     | 2018.01.02 | 3         |
| 01     | 2018.01.03 | 5         |
| ..     |            |           |
| 02     | 2018.01.01 | 2         |
| 02     | 2018.01.02 | 6         |
+--------+------------+-----------+
Or
PK: UserID, Year & Month, or a new column (e.g. RowNo with auto-increment)
+--------+------+-------+-----------+
| UserID | Year | Month | NoOfTimes |
+--------+------+-------+-----------+
| 01     | 2018 | Jan   | 10        |
| 01     | 2018 | Feb   | 13        |
+--------+------+-------+-----------+
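As a rough sketch, the second (monthly) variant could be declared like this; the names mirror the example above, but the types and the numeric month column are assumptions:
CREATE TABLE UserMonthlyLogins (
    UserID     INT NOT NULL,   -- references the users table
    LoginYear  INT NOT NULL,
    LoginMonth INT NOT NULL,   -- 1-12; numbers sort more easily than 'Jan'
    NoOfTimes  INT NOT NULL DEFAULT 0,
    PRIMARY KEY (UserID, LoginYear, LoginMonth)
);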
Before you create the table, please take a look at database normalization, especially the first (1NF), second (2NF), and third (3NF) normal forms.
https://www.tutorialspoint.com/dbms/database_normalization.htm
https://www.lifewire.com/database-normalization-basics-1019735
https://www.essentialsql.com/get-ready-to-learn-sql-database-normalization-explained-in-simple-english/
https://www.studytonight.com/dbms/database-normalization.php
https://medium.com/omarelgabrys-blog/database-normalization-part-7-ef7225150c7f

Either approach is valid, depending on query patterns and join requirements.
One row for each month
For a user, the row containing the login count for a month is inserted once data for that month is available, so there will be one row per month per user. This design makes joins on the month column easy; however, multiple rows must be read to get a user's data for the whole year.
-- column list
name
email
month
login_count
-- example entries
'user1', 'user1@email.com', 'jan', 100
'user2', 'user2@email.com', 'jan', 65
'user1', 'user1@email.com', 'feb', 90
'user2', 'user2@email.com', 'feb', 75
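With this layout, yearly figures come from summing a user's twelve rows; a minimal sketch, assuming the table is named monthly_logins (the answer doesn't name it):
-- Year total for one user: one row per month must be summed.
SELECT name, SUM(login_count) AS year_total
FROM monthly_logins
WHERE name = 'user1'
GROUP BY name;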
One row for all months
You do not need to dynamically add columns, since the number of months is known in advance. The table can be created up front to accommodate all twelve months, with every month_login_count column defaulting to 0. The row is then updated as the login count for each month is populated, so there is one row per user. This design is not the best for joins by month; however, only one row needs to be read to get a user's data for the whole year.
-- column list
name
email
jan_login_count
feb_login_count
mar_login_count
apr_login_count
may_login_count
jun_login_count
jul_login_count
aug_login_count
sep_login_count
oct_login_count
nov_login_count
dec_login_count
-- example entries
'user1', 'user1@email.com', 100, 90, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
'user2', 'user2@email.com', 65, 75, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
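In the wide layout, the monthly update touches one column, but the yearly total has to list every column by hand; a sketch assuming the table is named user_logins_wide (an illustrative name):
-- Record February's count for one user.
UPDATE user_logins_wide
SET feb_login_count = 90
WHERE name = 'user1';
-- Yearly total: one row read, but twelve columns spelled out.
SELECT name,
       jan_login_count + feb_login_count + mar_login_count + apr_login_count +
       may_login_count + jun_login_count + jul_login_count + aug_login_count +
       sep_login_count + oct_login_count + nov_login_count + dec_login_count AS year_total
FROM user_logins_wide
WHERE name = 'user1';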

Related

How to aggregate number of customers in SQL Server?

I have transaction data like this:
| Time_Stamp | Customer_ID | Amount | Department | Pay_Method | Channel |
|---------------------|-------------|--------|------------|-------------|------------|
| 2018-03-07 14:23:33 | 374856829 | 14.63 | Fruit | Credit Card | Mobile App |
I have written an aggregation procedure like this:
INSERT INTO Days
(
    Year,
    Month,
    Day,
    Department,
    Pay_Method,
    Total_Dollars,
    Total_Transactions,
    Total_Customers
)
SELECT
    YEAR(Time_Stamp),
    MONTH(Time_Stamp),
    DAY(Time_Stamp),
    Department,
    Pay_Method,
    SUM(Amount),
    COUNT(*),
    COUNT(DISTINCT Customer_ID)
FROM
    Transactions
GROUP BY
    YEAR(Time_Stamp),
    MONTH(Time_Stamp),
    DAY(Time_Stamp),
    Department,
    Pay_Method
Which populates a data mart table like this:
| Year | Month | Day | Department | Pay_Method | Total_Dollars | Total_Transactions | Total_Customers |
|------|-------|-----|------------|------------|---------------|--------------------|-----------------|
| 2018 | 3 | 7 | Home | Cash | 2398540.57 | 543084 | 325783 |
| 2018 | 3 | 7 | Home | Credit | 7458392.47 | 1587695 | 758643 |
So far, so good.
I then have procedures which feed the charts UI like this:
SELECT
    Year,
    Month,
    Day,
    SUM(Total_Dollars),
    SUM(Total_Transactions),
    SUM(Total_Customers)
FROM
    Days
WHERE
    Department = IIF(@Department IS NULL, Department, @Department) AND
    Pay_Method = IIF(@Pay_Method IS NULL, Pay_Method, @Pay_Method)
GROUP BY
    Year,
    Month,
    Day
This all works great for Total_Transactions and Total_Dollars, but not for Total_Customers.
The Total_Customers numbers in the Days table are correct in each row, for that specific combination of Year, Month, Day, Department and Pay_Method, but when two of those rows are summed together, the total becomes inaccurate, because the same customer may have made multiple transactions using different Department(s) and Pay_Method(s) on the same date. The numbers become even more inaccurate when adding days together to get monthly customer counts, etc...
I thought the solution would be to trick SQL Server into considering "all" as a possible value for the various "group by" fields, and I played around with GROUP BY and CASE quite a bit but couldn't figure it out. Essentially, in addition to my Days table containing every specific combination of Year, Month, Day, Department and Pay_Method, I also need to generate rows where Year, Month, Day, Department and Pay_Method are treated as "any" or "all". Lastly, I don't need rows where Year is "any" while Month and Day are specified (although it wouldn't hurt), as no one cares about totals for March 7th across all years, etc...
Can someone help me write the query to properly populate my Days table?
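(An aside before the answer: the "any"/"all" rows described above are what SQL Server's GROUP BY GROUPING SETS produces, and COUNT(DISTINCT ...) stays correct because each set is computed from the raw rows. A hedged sketch, reusing the asker's Transactions columns; the answer below argues for a different design instead:)
SELECT
    YEAR(Time_Stamp)            AS [Year],
    MONTH(Time_Stamp)           AS [Month],
    DAY(Time_Stamp)             AS [Day],
    Department,
    Pay_Method,
    SUM(Amount)                 AS Total_Dollars,
    COUNT(*)                    AS Total_Transactions,
    COUNT(DISTINCT Customer_ID) AS Total_Customers
FROM Transactions
GROUP BY GROUPING SETS (
    (YEAR(Time_Stamp), MONTH(Time_Stamp), DAY(Time_Stamp), Department, Pay_Method),
    (YEAR(Time_Stamp), MONTH(Time_Stamp), DAY(Time_Stamp)),  -- per day, all departments/pay methods
    (YEAR(Time_Stamp), MONTH(Time_Stamp))                    -- per month, all departments/pay methods
);
-- Columns absent from a grouping set come back as NULL, which plays the role of "all".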
Your problem is because the "grain" of your model is wrong. Grain is the term given to the level of detail in a fact table.
You always want to store your facts at the finest level of detail, then you can aggregate your data correctly. You were already at that point with your first table.
Rather than aggregating the data (incorrectly) into your second table, simply rewrite or amend that table to break your date/time into the fields you require for reporting.
By the way, if this is truly representative of your data, I suspect that you might actually be hiding an error in your transaction count. You may need a finer level of detail than "department", and I suspect it might be a concept like "product". What would happen to your model if a customer bought both apples and oranges?
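In that spirit, distinct-customer numbers stay correct as long as every report aggregates straight from the detail table; a minimal sketch of a monthly roll-up, reusing the asker's Transactions columns:
-- COUNT(DISTINCT ...) is safe here because it runs against raw transactions,
-- not against pre-aggregated per-day counts that double-count customers.
SELECT
    YEAR(Time_Stamp)            AS [Year],
    MONTH(Time_Stamp)           AS [Month],
    SUM(Amount)                 AS Total_Dollars,
    COUNT(*)                    AS Total_Transactions,
    COUNT(DISTINCT Customer_ID) AS Total_Customers
FROM Transactions
GROUP BY YEAR(Time_Stamp), MONTH(Time_Stamp);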

Dynamically determine and categorize duplicates in Tableau

I have a set that has the following structure:
ID | Date | DollarAmount
 1 | Jan  |           50
 1 | Jan  |           20
 2 | Jan  |           10
 1 | Feb  |           20
 2 | Feb  |           10
I am trying to determine dynamically whether, for a particular period in time, there is a duplicate based on the ID column.
For example, based on the data above, I would ideally have
I have tried to filter based on Number of Records, but it filters based on the TOTAL observations across the dataset, not date ranges.
Any help is much appreciated
Thanks!
Apparently you define duplicate records as those that have the same values for the ID and Date fields, where Date is really a string containing the abbreviated month name.
In that case, define a (Boolean valued) LOD calculated field called [Duplicates] as {FIXED [ID], [Date] : Count(1) > 1}
Place [Duplicates] on the color shelf, Sum([Dollar Amount]) on rows and [Date] on Columns.
You will see the values True and False in the Color Legend. You can edit the aliases for those values if you want to display clearer labels, such as Duplicates and Non-Duplicates.
If you have a true date valued field instead of a string, you may want to use DateTrunc() to define your duplicate test at the level of granularity that matches your problem.

Structure of a relational database for comparing multiple dates

We have a Microsoft Access database at work to track an ongoing list of customers. Each customer has to sign a contract with several departments (13 in total!), and for each customer we want to track when a contract was sent to and received from each department. The structure looks similar to this:
Table 1
-------------------------------------------------------------------------------------------------------------------
CUSTOMER_ID | ... | DEP_A_SENT | DEP_A_RECEIVED | DEP_B_SENT | DEP_B_RECEIVED | DEP_C_SENT | DEP_C_RECEIVED | ... |
-------------------------------------------------------------------------------------------------------------------
          1 | ... | 2015-05-01 | 2015-05-03     | 2015-05-04 | 2015-05-09     | 2015-05-01 | 2015-05-05     | ... |
          2 | ... | 2015-05-01 | 2015-05-05     | 2015-05-01 | 2015-05-03     | 2015-05-13 | ---            | ... |
...
I want to be able to calculate the timespan between DEP_X_SENT and DEP_X_RECEIVED per customer and department (such as "department A: 2 days, department B: 5 days, ..." for customer ID 1).
More importantly, I want to compare all the DEP_X_RECEIVED dates with each other for one customer: determining the first (MIN) and the last (MAX) date a contract was received, to find out how many days it takes for each customer until all contracts are received (such as "the contracts were received within 6 days" for customer ID 1, because the first was received on May 3rd and the last on May 9th). Furthermore, I want to calculate the average timespan this takes across all customers. If a contract has not been received yet, there is no value in that field.
In MySQL I could work with functions such as GREATEST and LEAST to compare values across different columns, but in Access I would have to rely on VBA for now, which I think is considered bad practice. How can I normalize and restructure my table to achieve my goals with rather simple MAX, MIN and AVG operations? Many thanks!
Simply fold your existing table into this structure:
create table TABLE_1 (
    CUSTOMER_ID int,
    DEPARTMENT_ID int,  -- foreign key reference to DEPARTMENT table
    SENT date,
    RECEIVED date
);
Now you can perform the required analysis simply, and retrieve the original layout as either a Pivot report or LEFT OUTER JOIN from the DEPARTMENT table to the new TABLE_1.
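For instance, the two calculations the asker describes become simple aggregates; a sketch in Access SQL (DateDiff with "d" counts days):
-- Days from sent to received, per customer and department.
SELECT CUSTOMER_ID, DEPARTMENT_ID,
       DateDiff("d", SENT, RECEIVED) AS TurnaroundDays
FROM TABLE_1;
-- Days between the first and last received contract, per customer.
SELECT CUSTOMER_ID,
       DateDiff("d", MIN(RECEIVED), MAX(RECEIVED)) AS SpreadDays
FROM TABLE_1
WHERE RECEIVED IS NOT NULL
GROUP BY CUSTOMER_ID;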

SQL payments matrix

I want to combine two tables into one:
The first table: Payments
id | 2010_01 | 2010_02 | 2010_03
 1 |   3.000 |     500 |       0
 2 |   1.000 |     800 |       0
 3 |     200 |   2.000 |     300
 4 |     700 |   1.000 |     100
The second table holds an ID and a date (different for every ID):
id | date       |
 1 | 2010-02-28 |
 2 | 2010-03-01 |
 3 | 2010-01-31 |
 4 | 2011-02-11 |
What I'm trying to achieve is to create a table which contains all payments made before the date in the ID table, something like this:
id | date       | T_00  | T_01  | T_02
 1 | 2010-02-28 |   500 | 3.000 |
 2 | 2010-03-01 |     0 |   800 | 1.000
 3 | 2010-01-31 |   200 |       |
 4 | 2010-02-11 | 1.000 |   700 |
Where T_00 means payment in the same month as 'date' value, T_01 payment in previous month and so on.
Is there a way to do this?
EDIT:
I'm trying to achieve this in MS Access.
The problem is that I cannot connect the name of the first table's column with the date in the second (the easiest way would be to treat it as a variable).
I added T_00 to T_24 columns to the second (ID) table and was trying to UPDATE those fields with something like
set T_00 =
iif(year(date)&"_"&month(date)=2010_10,
but I realized that that would be too much code for Access to handle if I wanted to do this for every payment period and every T_xx column.
Even if I wrote the code for T_00, I would have to repeat it for the next 23 periods.
Your Payments table is de-normalized. Those date columns are repeating groups, meaning you've violated First Normal Form (1NF). It's especially difficult because your field names are actually data. As you've found, repeating groups are a complete pain in the ass when you want to relate the table to something else. This is why 1NF is so important, but knowing that doesn't solve your problem.
You can normalize your data by creating a view that UNIONs your Payments table.
Like so:
CREATE VIEW NormalizedPayments (id, Year, Month, Amount) AS
SELECT id,
       2010 AS Year,
       1 AS Month,
       [2010_01] AS Amount
FROM Payments
UNION ALL
SELECT id,
       2010 AS Year,
       2 AS Month,
       [2010_02] AS Amount
FROM Payments
UNION ALL
SELECT id,
       2010 AS Year,
       3 AS Month,
       [2010_03] AS Amount
FROM Payments
And so on if you have more. This is how the Payments table should have been designed in the first place.
It may be easier to use a date field with the value '2010-01-01' instead of separate Year and Month fields; it depends on your data. You may also want to add WHERE Amount IS NOT NULL to each query in the UNION, or use Nz([2010_01], 0.000) AS Amount. Again, it depends on your data and other queries.
It's hard for me to understand how you're joining from here, particularly how the id fields relate because I don't see how they do with the small amount of data provided, so I'll provide some general ideas for what to do next.
Next you can join your second table with this normalized Payments table using a method similar to this or a method similar to this. To actually produce the result you want, include a calculated field in this view with the difference in months. Then, create an actual Pivot Table to format your results (like this or like this) which is the proper way to display data like your tables do.
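As a rough sketch of that calculated field in Access SQL: build a real date from the view's Year and Month with DateSerial, then let DateDiff("m", ...) give the T_xx offset. The date table's name isn't given in the question, so IdDates below is an assumption:
SELECT d.id,
       d.[date],
       DateDiff("m", DateSerial(p.[Year], p.[Month], 1), d.[date]) AS MonthOffset,  -- 0 = T_00, 1 = T_01, ...
       p.Amount
FROM IdDates AS d
INNER JOIN NormalizedPayments AS p
        ON p.id = d.id
WHERE DateSerial(p.[Year], p.[Month], 1) <= d.[date];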

Best Way To Run Length Encode Data

I've created a table that tracks the various attributes of objects over time.
Id | Attribute1 | Attribute2 | Attribute3 | StartDate  | EndDate
------------------------------------------------------------------
01 | 100        | Null       | Null       | 2004-02-03 | 2006-04-30
01 | 100        | Null       | D          | 2006-05-01 | 2010-11-06
01 | 150        | Null       | D          | 2010-11-07 | Null
02 | 700        | 5600       | Null       | 1998-09-27 | 2002-01-27
New data (~10s of thousands of records) come in each day. What I want to do is compare each record to the current data for that id, and then:
a) Do nothing if the attributes match.
b) If the attributes are different, update the current record so that the EndDate is the current date, and create a new record with the new attributes.
c) Create a new record if there isn't any data for that id.
My question is, what is the most efficient way to do this?
I can write a script that goes through each record, does the comparison, and then updates the table as appropriate, but I feel like this is brute force rather than an intelligent solution.
Would this be a good place to use a cursor?
How do you process the data: as it comes in, or in batches?
If it is as it comes in, then I would run a set of checks from the attribute most likely to change to the least likely (just to optimize the checking a bit) and update as needed. Tens of thousands of records is not enough data to worry much about slowing down. This is the straightforward approach.
If you process in batches (say, at end of business each day), sort the data by ID, then by descending end date. Delete all other instances of each ID and only consider the latest one; no intermediate data would matter.
Example: you have two entries for ID 1, one with end date Jan 1, the other with end date Jan 25. Look at the Jan 25 entry first and update if needed; the Jan 1 entry is too old to care about at that point.
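For the batch case, the close-then-insert logic described above (a classic type 2 slowly-changing-dimension pattern) can be done in two set-based statements instead of a cursor. A T-SQL-flavored sketch; object_history and staging are illustrative names, and the -1/'' sentinels assume those values never occur in real data:
-- 1) Close the open row when any attribute changed.
UPDATE h
SET h.EndDate = CAST(GETDATE() AS date)
FROM object_history AS h
JOIN staging AS s ON s.Id = h.Id
WHERE h.EndDate IS NULL
  AND (   ISNULL(s.Attribute1, -1) <> ISNULL(h.Attribute1, -1)
       OR ISNULL(s.Attribute2, -1) <> ISNULL(h.Attribute2, -1)
       OR ISNULL(s.Attribute3, '') <> ISNULL(h.Attribute3, ''));
-- 2) Insert a new open row for changed or brand-new ids.
INSERT INTO object_history (Id, Attribute1, Attribute2, Attribute3, StartDate, EndDate)
SELECT s.Id, s.Attribute1, s.Attribute2, s.Attribute3, CAST(GETDATE() AS date), NULL
FROM staging AS s
LEFT JOIN object_history AS h
       ON h.Id = s.Id AND h.EndDate IS NULL
WHERE h.Id IS NULL;  -- no open row left: either a new id or one just closed above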