How to assign event counts to relative date values in SQL? - sql

I want to line up multiple series so that all milestone dates are set to month zero, allowing me to measure the before-and-after effect of the milestone. I'm hoping to be able to do this using SQL Server.
You can see an approximation of what I'm starting with at this data.stackexchange.com query. This sample query returns a table that basically looks like this:
+------------+-------------+---------+---------+---------+---------+---------+
| UserID | BadgeDate | 2014-01 | 2014-02 | 2014-03 | 2014-04 | 2014-05 |
+------------+-------------+---------+---------+---------+---------+---------+
| 7 | 2014-01-02 | 232 | 22 | 19 | 77 | 11 |
+------------+-------------+---------+---------+---------+---------+---------+
| 89 | 2014-04-02 | 345 | 45 | 564 | 13 | 122 |
+------------+-------------+---------+---------+---------+---------+---------+
| 678 | 2014-03-11 | 55 | 14 | 17 | 222 | 109 |
+------------+-------------+---------+---------+---------+---------+---------+
| 897 | 2014-03-07 | 234 | 56 | 201 | 19 | 55 |
+------------+-------------+---------+---------+---------+---------+---------+
| 789 | 2014-02-22 | 331 | 33 | 67 | 108 | 111 |
+------------+-------------+---------+---------+---------+---------+---------+
| 989 | 2014-01-09 | 12 | 89 | 97 | 125 | 323 |
+------------+-------------+---------+---------+---------+---------+---------+
This is not what I'm ultimately looking for. Values in month columns are counts of answers per month. What I want is a table with counts under relative month numbers as defined by BadgeDate (with BadgeDate month set to month 0 for each user, earlier months set to negative relative month #s, and later months set to positive relative month #s).
Is this possible in SQL? Or is there a way to do it in Excel with the above table?
After generating this table I plan on averaging relative month totals to plot a line graph that will hopefully show a noticeable inflection point at relative month zero. If there's no apparent bend, I can probably assume the milestone has a negligible effect on the Y-axis metric. (I'm not even quite sure what this kind of chart is called. I think Google might have been more helpful if I knew the proper terms for what I'm talking about.)
Any ideas?

This is precisely what the aggregate functions and case when ... then ... else ... end construct are for:
select
UserID
,BadgeDate
,sum(case when AnswerDate = '2014-01' then 1 else 0 end) as '2014-01'
-- etc.
group by
userid
,BadgeDate
The PIVOT clause is also available in some flavours and versions of SQL, but it is generally less flexible, so the traditional mechanism is worth understanding.
Likewise, a PivotTable in Excel can produce the same report, but there is value in aggregating the data as much as possible on the server when bandwidth is constrained.
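The question asks for counts keyed by a month number relative to BadgeDate rather than by calendar month. A minimal sketch of that step follows, run through Python's sqlite3 module so it is self-contained. The Badges and Answers tables, their columns, and the sample rows are assumptions made for illustration; they are not the schema behind the original query. The relative month is computed as the difference in (year * 12 + month), which counts month boundaries the same way DATEDIFF(month, BadgeDate, AnswerDate) does in SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema and data, for illustration only.
    CREATE TABLE Badges  (UserID INTEGER, BadgeDate  TEXT);
    CREATE TABLE Answers (UserID INTEGER, AnswerDate TEXT);
    INSERT INTO Badges VALUES (7, '2014-03-02'), (89, '2014-04-02');
    INSERT INTO Answers VALUES
        (7, '2014-01-15'), (7, '2014-03-01'), (7, '2014-03-20'), (7, '2014-05-05'),
        (89, '2013-12-31'), (89, '2014-03-30'), (89, '2014-04-01'), (89, '2014-06-10');
""")

# Inner query: one row per answer with its month offset from the badge month.
# Outer query: count answers per user and relative month.
rows = conn.execute("""
    SELECT UserID, RelMonth, COUNT(*) AS AnswerCount
    FROM (
        SELECT b.UserID,
               (CAST(strftime('%Y', a.AnswerDate) AS INTEGER) * 12
                    + CAST(strftime('%m', a.AnswerDate) AS INTEGER))
             - (CAST(strftime('%Y', b.BadgeDate) AS INTEGER) * 12
                    + CAST(strftime('%m', b.BadgeDate) AS INTEGER)) AS RelMonth
        FROM Badges b
        JOIN Answers a ON a.UserID = b.UserID
    )
    GROUP BY UserID, RelMonth
    ORDER BY UserID, RelMonth
""").fetchall()

for row in rows:
    print(row)  # (UserID, RelMonth, AnswerCount)
```

Keeping the result in this long form (one row per user and relative month) makes the later averaging step a plain GROUP BY RelMonth; the case-when pivot above can then be applied to RelMonth instead of the calendar month if a wide table is still wanted.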

Related

Merging some columns from two postgres tables into a new table based on row value

Hello PostgreSQL experts (and maybe this is also a task for Perl's DBI since I also happen to be working with it, but...) I might also have misused some terminology here, so bear with me.
I have a set of 32 tables, all with exactly the same structure. The first column of every table always contains a date, while the second column contains values (integers) that can change once every 24 hours; some samples get back-dated. In many cases, a table may never contain data for a particular date. So here's an example of two such tables:
date_list   | sum        date_list   | sum
------------+-----       ------------+-----
2020-03-12  |   4        2020-03-09  |   1
2020-03-14  |   5        2020-03-11  |   3
                         2020-03-12  |   5
                         2020-03-13  |   9
                         2020-03-14  |  12
The idea is to merge the separate tables into one, sort of like a grid, but with the samples placed in the correct row in its own column and ensuring that the date column (always the first column) is not missing any dates, looking like this:
date_list | sum1 | sum2 | sum3 .... | sum32
---------------------------------------------------------
2020-03-08 | | |
2020-03-09 | | 1 |
2020-03-10 | | | 5
2020-03-11 | | 3 | 25
2020-03-12 | 4 | 5 | 35
2020-03-13 | | 9 | 37
2020-03-14 | 5 | 12 | 40
And so on, so 33 columns by 2020-01-01 to date.
Now, I have tried doing a FULL OUTER JOIN and it succeeds. It's the subsequent attempts that get me into trouble, creating a long, cascading table with the values in the wrong place or accidentally clobbering data. I know the following works if I take a one-column table containing a date sequence and join the first data table to it, just as a test of my theory using baby steps:
SELECT date_table.date_list, sums_1.sum FROM date_table FULL OUTER JOIN sums_1 ON date_table.date_list = sums_1.date_list
2020-03-07 | 1
2020-03-08 |
2020-03-09 |
2020-03-10 | 2
2020-03-11 |
2020-03-12 | 4
Encouraged, I got a little more ambitious with my testing, this time trying USING as an alternative. But that places some rows out of sequence at the bottom of the table, and I'm not sure whether I'm losing data:
SELECT * FROM sums_1 FULL OUTER JOIN sums_2 USING (date_list);
Result:
fecha_sintomas | sum | sum
----------------+-------+-------
2020-03-09 | | 1
2020-03-11 | | 3
2020-03-12 | 4 | 5
2020-03-13 | | 9
2020-03-14 | 5 | 12
2020-03-15 | 6 | 15
2020-03-16 | 8 | 20
: : :
2020-10-29 | 10053 | 22403
2020-10-30 | 10066 | 22407
2020-10-31 | 10074 | 22416
2020-11-01 | 10076 | 22432
2020-11-02 | 10077 | 22434
2020-03-07 | 1 |
2020-03-10 | 2 |
(240 rows)
I think I'm getting close. In any case, how do I get to what I want, which is the grid of data described above? Maybe this is an iterative process that could benefit from using DBI?
Thanks,
You can full join like so:
select date_list, s1.sum as sum1, s2.sum as sum2, s3.sum as sum3
from sums_1 s1
full join sums_2 s2 using (date_list)
full join sums_3 s3 using (date_list)
order by date_list;
The using syntax makes the unqualified column date_list unambiguous in the select and order by clauses. Then we need to enumerate the sum columns, providing an alias for each of them.
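The question also requires that the date column has no gaps, which a full join of the data tables alone does not guarantee. One way to get that, sketched here as a swapped-in variant rather than the answer's own method, is to start from a complete date sequence and LEFT JOIN each data table to it. The sketch runs in SQLite through Python's sqlite3 module and builds the sequence with a recursive CTE; in PostgreSQL, generate_series could produce the same sequence. Only two of the 32 tables are shown, and the date range is an assumption taken from the example grid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Sample data copied from the two example tables in the question.
    CREATE TABLE sums_1 (date_list TEXT, sum INTEGER);
    CREATE TABLE sums_2 (date_list TEXT, sum INTEGER);
    INSERT INTO sums_1 VALUES ('2020-03-12', 4), ('2020-03-14', 5);
    INSERT INTO sums_2 VALUES ('2020-03-09', 1), ('2020-03-11', 3),
                              ('2020-03-12', 5), ('2020-03-13', 9),
                              ('2020-03-14', 12);
""")

# The CTE yields one row per day in the (assumed) range; each LEFT JOIN then
# drops a table's value into its own column, leaving NULL where a date is missing.
rows = conn.execute("""
    WITH RECURSIVE date_table(date_list) AS (
        SELECT '2020-03-08'
        UNION ALL
        SELECT date(date_list, '+1 day') FROM date_table
        WHERE date_list < '2020-03-14'
    )
    SELECT d.date_list, s1.sum AS sum1, s2.sum AS sum2
    FROM date_table d
    LEFT JOIN sums_1 s1 ON s1.date_list = d.date_list
    LEFT JOIN sums_2 s2 ON s2.date_list = d.date_list
    ORDER BY d.date_list
""").fetchall()

for row in rows:
    print(row)  # missing values come back as None
```

Because every join hangs off the date sequence, adding the remaining tables is a matter of repeating the LEFT JOIN line and the matching column alias; the row count stays fixed at one per date.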

Query'd top 15 faults, need the accumulated downtime from another column

I'm currently trying to query up a list of the top 15 occurring faults on a PLC in the warehouse. I've gotten that part down:
Select top 15 fault_number, fault_message, count(*) FaultCount
from Faults_Stator
where T_stamp> dateadd(hour, -18, getdate())
Group by Fault_number, Fault_Message
Order by Faultcount desc
However, I now need to find the accumulated downtime of the faults in that top 15 list; this information is in another column, "Fault_duration". How would I go about doing this? Thanks in advance, you've all helped me so much already.
+--------------+---------------------------------------------+------------+
| Fault Number | Fault Message | FaultCount |
+--------------+---------------------------------------------+------------+
| 122 | ST10: Part A&B Failed | 23 |
| 4 | ST16: Part on Table B | 18 |
| 5 | ST7: No Spring Present on Part A | 15 |
| 6 | ST7: No Spring Present on Part B | 12 |
| 8 | ST3: No Pin Present B | 8 |
| 1 | ST5: No A Housing | 5 |
| 71 | ST4: Shuttle Right Not Loaded | 4 |
| 144 | ST15: Vertical Cylinder did not Retract | 3 |
| 98 | ST8: Plate Loader Can not Retract | 3 |
| 72 | ST4: Shuttle Left Not Loaded | 2 |
| 94 | ST8: Spring Gripper Cylinder did not Extend | 2 |
| 60 | ST8: Plate Loader Can not Retract | 1 |
| 83 | ST6: No A Spring Present | 1 |
| 2 | ST5: No B Housing | 1 |
| 51 | ST4: Vertical Cylinder did not Extend | 1 |
+--------------+---------------------------------------------+------------+
I know I wouldn't be using the same query, but I'm at a loss at how to do this next step.
Fault_duration is a column that records how long the fault lasted, in ms. I'm trying to have those durations accumulated next to the corresponding fault. So the first offender would have its 23 individual fault occurrences summed next to it, in another column.
You should be able to use the SUM aggregate:
Select top 15 fault_number, fault_message, count(*) FaultCount, SUM (Fault_duration) as FaultDuration
from Faults_Stator
where T_stamp> dateadd(hour, -18, getdate())
Group by Fault_number, Fault_Message
Order by Faultcount desc
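To see that COUNT and SUM can sit side by side over the same groups, here is a small sketch run in SQLite through Python's sqlite3 module. The rows are invented sample data, the T_stamp filter is left out, and LIMIT 15 stands in for SQL Server's TOP 15 because SQLite has no TOP.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Invented sample rows; durations are in ms.
    CREATE TABLE Faults_Stator (
        Fault_number INTEGER, Fault_message TEXT, Fault_duration INTEGER
    );
    INSERT INTO Faults_Stator VALUES
        (122, 'ST10: Part A&B Failed', 1000),
        (122, 'ST10: Part A&B Failed', 2500),
        (122, 'ST10: Part A&B Failed', 500),
        (4,   'ST16: Part on Table B', 300),
        (4,   'ST16: Part on Table B', 700);
""")

# Both aggregates are evaluated per (Fault_number, Fault_message) group.
top_faults = conn.execute("""
    SELECT Fault_number, Fault_message,
           COUNT(*) AS FaultCount,
           SUM(Fault_duration) AS FaultDuration
    FROM Faults_Stator
    GROUP BY Fault_number, Fault_message
    ORDER BY FaultCount DESC
    LIMIT 15
""").fetchall()

for row in top_faults:
    print(row)  # (Fault_number, Fault_message, FaultCount, FaultDuration)
```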

How do I make this calculated measure axis independent and portable?

So I am a beginner at MDX and I have an MDX query that works the way I want it to so long as I put the set on either the columns or rows. If I put the same set on the filter axis it doesn't work. I'd like to make this calculated measure independent of where this set lives. I'm guaranteed to always have some form of a set included, but I'm not guaranteed which axis the user will place it on (e.g. rows, columns, filter).
Here is the query that works:
WITH MEMBER Measures.avgApplicants as
Avg([applicationDate].[yearMonth].[month].Members, [Measures].[applicants])
SELECT
{[Measures].[applicants],[Measures].[avgApplicants]} ON 0,
{[applicationDate].[yearMonth].[year].[2015]:[applicationDate].[yearMonth].[year].[2016]} ON 1
FROM [applicants]
And results:
| | applicants | avgMonthlyApplicants |
+------+------------+----------------------+
| 2015 | 367 | 33 |
| 2016 | 160 | 33 |
However, if I shift this query around to move the set onto the filter axis I get nothing:
WITH MEMBER Measures.avgApplicants as
Avg([applicationDate].[yearMonth].[month].Members, [Measures].[applicants])
SELECT
{[Measures].[applicants],[Measures].[avgApplicants]} ON 0,
{[Gender].Members} ON 1
FROM [applicants]
WHERE ([applicationDate].[yearMonth].[year].[2015]:[applicationDate].[yearMonth].[year].[2016])
I get this:
|             |             | applicants | avgApplicants |
+-------------+-------------+------------+---------------+
| All Genders |             | 478        |               |
|             | Female      | 172        |               |
|             | Male        | 183        |               |
|             | Not Known   | 61         |               |
|             | Unspecified | 62         |               |
So how do I create this calculated measure so that it isn't dependent on which axis the set is placed on?

Find a subset of numbers that equals to the target weighted average and target sum

There is a SQL Server table containing 1 million rows. Sample data is shown below.
The Percentage column is computed as (Y/X) * 100.
+----+--------+-------------+-----+-----+-------------+
| ID | Amount | Percentage | X | Y | Z |
+----+--------+-------------+-----+-----+-------------+
| 1 | 10 | 9.5 | 100 | 9.5 | 95 |
| 2 | 20 | 9.5 | 100 | 9.5 | 190 |
| 3 | 40 | 5 | 100 | 5 | 200 |
| 4 | 50 | 5.555555556 | 90 | 5 | 277.7777778 |
| 5 | 70 | 8.571428571 | 70 | 6 | 600 |
| 6 | 100 | 9.230769231 | 65 | 6 | 923.0769231 |
| 7 | 120 | 7.058823529 | 85 | 6 | 847.0588235 |
| 8 | 60 | 10.52631579 | 95 | 10 | 631.5789474 |
| 9 | 80 | 10 | 100 | 10 | 800 |
| 10 | 95 | 10 | 100 | 10 | 950 |
+----+--------+-------------+-----+-----+-------------+
Now I need to find a set of rows whose Amount values add up to a given Amount and whose weighted average matches a given Percentage.
For example, if the target Amount = 365 and the target Percentage = 9.84, then from the given dataset we can say that the rows with ID = 1, 2, 6, 8, 9, 10 form a subset that matches the given targets.
Amount = 10+20+100+60+80+95
= 365
Percentage = Sum of (product of Amount and Percentage)/Sum of (Amount)
(I am using Z column to store the products of Amount and Percentage to make the calculations easier)
= ((10*9.5)+(20*9.5)+(100*9.23077)+(60*10.5264)+(80*10)+(95*10))/ (10+20+100+60+80+95)
= 9.834673618
So the rows 1, 2, 6, 8, 9, 10 match the given target sum and target weighted average.
The proposed algorithm should work on the 1 million rows, and the main objective is to match the weighted average (Percentage) with the Amount as close as possible to the target Amount.
I found a few questions on Stack Overflow about matching a target sum. But my problem is to match two target attributes: sum and weighted average.
Which algorithm can be used to achieve this?
Since the target "Percentage" is only approximate (therefore not an actual constraint), let's try removing it and find a solution for Amount. This can only make the problem easier.
What's left is the Subset Sum Problem, which is NP-complete. There are simple exponential-time solutions, and sneaky pseudo-polynomial-time solutions, but I don't think any of them will be practical for a table with 10^6 rows.
If this is an academic exercise, I suggest you write up the cleverest pseudo-polynomial-time solution you can come up with. If it's a task in the real world, I suggest you go back to the person who gave it to you, explain that an exact solution is impractical, and negotiate for an approximate solution.
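For reference, a pseudo-polynomial-time approach of the kind mentioned above can be sketched as a dynamic program over reachable sums. This is only an illustration on the ten sample rows, assuming positive integer amounts; it solves the Amount target alone and then reports the weighted average of whatever subset it happens to find, so it is not a solution to the full two-target problem. Its running time grows with (number of rows) x (target amount). The function name is made up for this example.

```python
def subset_with_sum(amounts, target):
    """Return indices of one subset of `amounts` summing to `target`, or None.

    Assumes every amount is a positive integer.
    """
    # parent[s] = (previous_sum, index_used) for the first way sum s was reached.
    parent = {0: None}
    for i, amount in enumerate(amounts):
        # Iterate over a snapshot so each item is used at most once per subset.
        for s in list(parent):
            t = s + amount
            if t <= target and t not in parent:
                parent[t] = (s, i)
    if target not in parent:
        return None
    # Walk the back-pointers from the target down to zero.
    chosen = []
    s = target
    while parent[s] is not None:
        s, i = parent[s]
        chosen.append(i)
    return sorted(chosen)


# Amount and Percentage columns of the sample table, in ID order.
amounts = [10, 20, 40, 50, 70, 100, 120, 60, 80, 95]
percentages = [9.5, 9.5, 5, 5.555555556, 8.571428571,
               9.230769231, 7.058823529, 10.52631579, 10, 10]

chosen = subset_with_sum(amounts, 365)
total = sum(amounts[i] for i in chosen)
weighted_avg = sum(amounts[i] * percentages[i] for i in chosen) / total

print("IDs:", [i + 1 for i in chosen])
print("Amount:", total, "Weighted average:", weighted_avg)
```

The subset found this way matches the Amount but is not guaranteed to be the one from the question, which is why the weighted average still has to be checked or optimised separately.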

Sybase select distinct on one column, do not care about others

I have seen many similar questions but none that meet my needs exactly, and I cannot seem to deduce a solution on my own from inspecting the other questions.
I have the following (mock) table below. My actual table has many more columns.
TableA:
ID | color | feel | size | alive | age
------------------------------------------
1 | blue | soft | large | true | 36
2 | red | soft | large | true | 36
2 | blue | hard | small | false | 37
2 | blue | soft | large | true | 36
2 | blue | soft | small | false | 39
15 | blue | soft | medium | true | 04
15 | blue | soft | large | true | 04
15 | green | soft | large | true | 15
40 | pink | sticky | large | true | 83
51 | brown | rough | tiny | false | 01
51 | gray | soft | tiny | true | 59
34 | blue | soft | large | true | 02
I want the result to look like:
Result of query on TableA:
ID | color | feel | size | alive | age
-------------------------------------------
1 | blue | soft | large | true | 36
2 | red | soft | large | true | 36
15 | blue | soft | medium | true | 04
40 | pink | sticky | large | true | 83
51 | brown | rough | tiny | false | 01
34 | blue | soft | large | true | 02
I want one row for every unique ID column, but I do not want to check the other columns. I need the other columns returned in my result set, but I do not want to filter on them. I just need one row for every unique ID - I do not care which row.
In my example, I selected the first row of every unique ID.
I have tried variations of
select *
from TableA
group by ID having ID = max(ID)
Most examples I have seen with group by and max and/or min functions involve only 2 columns. I have many more columns, however.
I have also seen examples using CTE, but I am not using SQL Server (I am using Sybase).
How can I achieve the result set described?
EDIT
We are using Sybase version 15.1.
Your solution with MIN has a drawback: it does not return a specific row, but the MIN of each column across the group of rows. The result can therefore contain rows that do not exist in the table. Is that acceptable to you?
row_number() is supported in Sybase 15.2:
http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc38151.1520/html/iqrefbb/iqrefbb262.htm
If it is not supported in 15.1, you can instead use an identity column and a temporary table to achieve what you want.
There are a variety of ways to do this. If you have a more recent version of Sybase, you can use row_number():
select t.*
from (select t.*, row_number() over (partition by id order by id) as seqnum
from TableA t
) t
where seqnum = 1;
The solution I have come up with is below.
It "feels" like a poor solution - I am still open to new answers:
SELECT
ID,
min(color),
min(feel),
min(size),
min(alive),
min(age)
FROM TableA
group by ID
I do not like how verbose I am with the application of the min function to every column, but this returns the desired result set.
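The caveat raised in the comment about MIN can be checked on an excerpt of the mock table. The sketch below uses SQLite through Python's sqlite3 module, not Sybase, so it only illustrates the logic. It first shows that min() applied per column can assemble a row that is not in the table, then picks one real row per ID by joining back on the smallest row identifier. SQLite's implicit rowid stands in here for the identity column suggested above; that substitution is an assumption, and on Sybase the identity column would have to be added explicitly, for example in a temporary table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Excerpt of the mock TableA from the question (IDs 1, 2 and 15).
    CREATE TABLE TableA (ID INTEGER, color TEXT, feel TEXT, size TEXT,
                         alive TEXT, age INTEGER);
    INSERT INTO TableA VALUES
        (1,  'blue',  'soft', 'large',  'true',  36),
        (2,  'red',   'soft', 'large',  'true',  36),
        (2,  'blue',  'hard', 'small',  'false', 37),
        (2,  'blue',  'soft', 'large',  'true',  36),
        (2,  'blue',  'soft', 'small',  'false', 39),
        (15, 'blue',  'soft', 'medium', 'true',  4),
        (15, 'blue',  'soft', 'large',  'true',  4),
        (15, 'green', 'soft', 'large',  'true',  15);
""")

all_rows = set(conn.execute("SELECT * FROM TableA"))

# min() per column: each value is real, but the combination may not be.
min_rows = conn.execute("""
    SELECT ID, min(color), min(feel), min(size), min(alive), min(age)
    FROM TableA GROUP BY ID ORDER BY ID
""").fetchall()
synthetic = [r for r in min_rows if r not in all_rows]
print("Rows that do not exist in TableA:", synthetic)

# One existing row per ID: keep the row with the smallest row identifier.
picked = conn.execute("""
    SELECT a.*
    FROM TableA a
    JOIN (SELECT ID, min(rowid) AS first_row FROM TableA GROUP BY ID) k
      ON a.rowid = k.first_row
    ORDER BY a.ID
""").fetchall()
for row in picked:
    print(row)
```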