Count rows based on a pair of distinct values - SQL

We use Hive to run queries on AB test data. The problem is that we have some duplicate data we are trying to ignore. Luckily, we have a means to ignore it: our conversion_meta column contains an indicator for this duplicate data.
I'd like to find distinct (conversion_meta, conversion_type) pairs, but I can't figure out the correct syntax. Here is what I have so far:
select conversion_type, day,
       sum(if(is_control='true', 1, 0)) as Control,
       sum(if(is_control='false', 1, 0)) as Test
from Actions
where day > "2013-12-20" and experiment_key='xyz'
group by conversion_type, day
The columns in the end result should look like:
Conversion Type, Day, Control (count), Test (count)

I think you can solve this problem with UNION ALL:
select conversion_type, day,
       sum(if(is_control='true', 1, 0)) as Control,
       sum(if(is_control='false', 1, 0)) as Test
from Actions
where day > "2013-12-20" and experiment_key='xyz' and conversion_meta = false
group by conversion_type, day
UNION ALL
select conversion_type, day,
       sum(if(is_control='true', 1, 0)) as Control,
       sum(if(is_control='false', 1, 0)) as Test
from Actions
where day > "2013-12-20" and experiment_key='xyz' and conversion_meta = true
group by conversion_type, day
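If the goal is simply to ignore the duplicate-flagged rows rather than report them separately, a single filtered query may be enough. This is a minimal sketch assuming conversion_meta is a boolean where true marks a duplicate (the exact flag semantics are an assumption):
-- Sketch: drop duplicate-flagged rows before aggregating.
-- Assumes conversion_meta = true marks a duplicate row.
select conversion_type, day,
       sum(if(is_control='true', 1, 0)) as Control,
       sum(if(is_control='false', 1, 0)) as Test
from Actions
where day > "2013-12-20"
  and experiment_key='xyz'
  and conversion_meta = false
group by conversion_type, day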

Related

SQL script to produce the output shown in the screenshot

I want to write a SQL script that produces the output shown in the screenshot image. Thank you.
I've tried the MAX() function to aggregate the ESSBASE_MONTH field to make it distinct and display a single month in the output instead of multiple months. I have yet to figure out how to put 0 in any month in which EMPID did not make any sale, like December under "Total GreaterThan 24 HE Account" and "Total_HE_Accounts".
The fields of the table are not very informative; however, based on the screenshot, this is the best answer I could come up with.
Assuming the table name is SALES:
select
    ADJ_EMPID,
    ESSBASE_MONTH,
    MAX(YTD_COUNT) AS YTD_COUNT,
    SUM(TOTAL_24) AS TOTAL_24,
    SUM(TOTAL_ACC) AS TOTAL_ACC
from SALES
group by
    ADJ_EMPID,
    ESSBASE_MONTH
The above will aggregate the monthly 'sales' data as expected.
To add the 'missing' rows, such as December, you can union the above query with a virtual table.
select
    MAX(MONTH_NUMBER) AS MONTH_NUMBER,
    ADJ_EMPID,
    ESSBASE_MONTH,
    MAX(YTD_COUNT) AS YTD_COUNT,
    SUM(TOTAL_24) AS TOTAL_24,
    SUM(TOTAL_ACC) AS TOTAL_ACC
from (
    select
        1 as MONTH_NUMBER, -- placeholder; MAX() picks up the real number from TEMP_TABLE
        *
    from SALES
    union all
    select * from (values
        (1, '300014366', 'January', 0, 0, 0),
        (2, '300014366', 'February', 0, 0, 0),
        -- add the other missing months as required
        (11, '300014366', 'November', 0, 0, 0),
        (12, '300014366', 'December', 0, 0, 0)
    ) TEMP_TABLE (MONTH_NUMBER, ADJ_EMPID, ESSBASE_MONTH, YTD_COUNT, TOTAL_24, TOTAL_ACC)
) as AGGREGATED_DATA
group by
    ADJ_EMPID,
    ESSBASE_MONTH
order by MONTH_NUMBER;
TEMP_TABLE is a virtual table which contains all the months with sales of zero. A special field, MONTH_NUMBER, is added to sort the months in the proper order.
It is not the easiest query to understand, and the requirement is not exactly straightforward either.
Link to fiddledb for a working solution with PostgreSQL 15.
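As an alternative design, the same gap-filling is often done by outer-joining a month list instead of unioning zero rows. A minimal sketch under the same table and column names (MONTHS is a hypothetical 12-row table of (MONTH_NUMBER, ESSBASE_MONTH)):
-- Sketch: fill missing months with a LEFT JOIN instead of UNION ALL.
select
    e.ADJ_EMPID,
    m.ESSBASE_MONTH,
    COALESCE(MAX(s.YTD_COUNT), 0) AS YTD_COUNT,
    COALESCE(SUM(s.TOTAL_24), 0) AS TOTAL_24,
    COALESCE(SUM(s.TOTAL_ACC), 0) AS TOTAL_ACC
from (select distinct ADJ_EMPID from SALES) e
cross join MONTHS m
left join SALES s
       on s.ADJ_EMPID = e.ADJ_EMPID
      and s.ESSBASE_MONTH = m.ESSBASE_MONTH
group by e.ADJ_EMPID, m.ESSBASE_MONTH, m.MONTH_NUMBER
order by m.MONTH_NUMBER;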

Find Gaps in a single date column SQL Server

Good Day everyone,
I need your help.
I am trying to detect gaps in a single column of the type Date or DateTime in SQL Server.
Say we have a list of schools and each school has many records and there is a field of uploadDate.
So the data looks something like this:
My desired outcome would be something like this:
Thank you all.
You can use lead():
select name,
       dateadd(day, 1, upload_date) as gap_start,
       dateadd(day, -1, next_upload_date) as gap_end
from (select t.*,
             lead(upload_date) over (partition by name order by upload_date) as next_upload_date
      from t
     ) t
where next_upload_date <> dateadd(day, 1, upload_date);
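As a usage sketch with hypothetical data (assuming t holds one row per school per upload date):
create table t (name varchar(50), upload_date date);

insert into t (name, upload_date) values
    ('School A', '2023-01-01'),
    ('School A', '2023-01-02'),
    ('School A', '2023-01-05'); -- 2023-01-03 and 2023-01-04 are missing

-- With this data the query above returns the single gap:
--   name      gap_start   gap_end
--   School A  2023-01-03  2023-01-04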

Find two local averages within one SQL Server data set

In the plant at our company there is a physical process that has a two-stage start and a two-stage finish. As a widget starts to enter the process a new record is created containing the widget ID and a timestamp (DateTimeCreated) and once the widget fully enters the process another timestamp is logged in a different field for the same record (DateTimeUpdated). The interval is a matter of minutes.
Similarly, as a widget starts to exit the process another record is created containing the widget ID and the DateTimeCreated, with the DateTimeUpdated being populated when the widget has fully exited the process. In the current table design an "exiting" record is indistinguishable from an "entering" record (although a given widget ID occurs only either once or twice so a View could utilise this fact to make the distinction, but let's ignore that for now).
The overall time a widget is in the process is several days but that's not really of importance to the discussion. What is important is that the interval when exiting the process is always longer than when entering. So a very simplified, imaginary set of sorted interval values might look like this:
1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 6, 7, 7, 7, 7, 8, 8, 8, 8, 10, 10, 10
You can see there is a peak in the occurrences of intervals around the 3-minute-mark (the "enters") and another peak around the 7/8-minute-mark (the "exits"). I've also excluded intervals of 5 minutes to demonstrate that enter-intervals and exit-intervals can be considered mutually exclusive.
We want to monitor the performance of each stage in the process daily by using a query to determine the local averages of the entry and exit data point clusters. So conceptually the two data sets could be split either side of an overall average (in this case 5.375) and then an average calculated for the values below the split (2.75) and another average above the split (8). Using the data above (in a random distribution) the averages are depicted as the dotted lines in the chart below.
My current approach is to use two Common Table Expressions followed by a final three-table-join query. It seems okay, but I can't help feeling it could be better. Would anybody like to offer an alternative approach or other observations?
WITH cte_Raw AS
(
    SELECT
        DATEDIFF(minute, DateTimeCreated, DateTimeUpdated) AS [Interval]
    FROM
        MyTable
    WHERE
        DateTimeCreated > CAST(CAST(GETDATE() AS date) AS datetime) -- Today
)
, cte_Midpoint AS
(
    SELECT
        AVG(Interval) AS Interval
    FROM
        cte_Raw
)
SELECT
    AVG([Entry].Interval) AS AverageEntryInterval
    , AVG([Exit].Interval) AS AverageExitInterval
FROM
    cte_Raw AS [Entry]
INNER JOIN
    cte_Midpoint
    ON [Entry].Interval < cte_Midpoint.Interval
INNER JOIN
    cte_Raw AS [Exit]
    ON [Exit].Interval > cte_Midpoint.Interval
I don't think your query produces accurate results. Your two JOINs are producing a proliferation of rows, which throws the averages off. They might look correct (because one is less than the other), but if you did counts, you would see that the counts in your query have little to do with the sample data.
If you are just looking for the average of values that are less than the overall average and the average of those greater than it, then you can use window functions:
WITH t AS (
    SELECT t.*, v.[Interval],
           AVG(v.[Interval]) OVER () as avg_interval
    FROM MyTable t CROSS APPLY -- CROSS APPLY (not CROSS JOIN), since the VALUES clause references columns of t
         (VALUES (DATEDIFF(minute, DateTimeCreated, DateTimeUpdated))
         ) v(Interval)
    WHERE DateTimeCreated > CAST(CAST(GETDATE() AS date) AS datetime)
)
SELECT AVG(CASE WHEN t.[Interval] < t.avg_interval THEN t.[Interval] END) AS AverageEntryInterval,
       AVG(CASE WHEN t.[Interval] > t.avg_interval THEN t.[Interval] END) AS AverageExitInterval
FROM t;
I decided to post my own answer, as at the time of writing neither of the two proposed answers would run as originally posted. I have, however, removed the JOIN statements and used the CASE expression approach proposed by Gordon.
I've also multiplied the DATEDIFF result by 1.0 to prevent rounding of results from the AVG function.
WITH cte_Raw AS
(
    SELECT
        1.0 * DATEDIFF(minute, DateTimeCreated, DateTimeUpdated) AS [Interval]
    FROM
        MyTable
    WHERE
        DateTimeCreated > CAST(CAST(GETDATE() AS date) AS datetime) -- Today
)
, cte_Midpoint AS
(
    SELECT
        AVG(Interval) AS Interval
    FROM
        cte_Raw
)
SELECT AVG(CASE WHEN cte_Raw.Interval < cte_Midpoint.Interval THEN cte_Raw.[Interval] END) AS AverageEntryInterval,
       AVG(CASE WHEN cte_Raw.Interval > cte_Midpoint.Interval THEN cte_Raw.[Interval] END) AS AverageExitInterval
FROM cte_Raw CROSS JOIN cte_Midpoint
This solution does not cater for the theoretical pitfall Vladimir pointed out, of uneven dispersions of Entry vs Exit intervals, as in practice we can be confident this does not occur.
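As a sanity check, the same CASE approach reproduces the worked numbers from the question (overall average 5.375, split averages 2.75 and 8) when run directly against the sample intervals; a minimal self-contained sketch:
-- Sketch: verify the overall and split averages for the sample interval values.
WITH v([Interval]) AS (
    SELECT i
    FROM (VALUES (1),(2),(2),(3),(3),(3),(3),(3),(3),(3),(3),(4),
                 (6),(7),(7),(7),(7),(8),(8),(8),(8),(10),(10),(10)) AS t(i)
),
m AS (
    SELECT AVG(1.0 * [Interval]) AS mid FROM v
)
SELECT m.mid AS OverallAverage,                                                                  -- 5.375
       AVG(CASE WHEN v.[Interval] < m.mid THEN 1.0 * v.[Interval] END) AS AverageEntryInterval,  -- 2.75
       AVG(CASE WHEN v.[Interval] > m.mid THEN 1.0 * v.[Interval] END) AS AverageExitInterval    -- 8.0
FROM v CROSS JOIN m
GROUP BY m.mid;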

SQL query involving count, group by and substring

I would like to group the rows of this table according to the dates which form the start of SessionID, and for each day count how many rows there are for each set of ReqPhone values. Each set of ReqPhone values is defined by the first four digits of ReqPhone. In other words, I would like to know how many rows there are for ReqPhone starting with 0925, 0927 and 0940, how many rows there are for ReqPhone starting with 0979, 0969 and 0955, and so on.
I have been trying all kinds of GROUP BY and COUNT but still haven't arrived at the right query.
Can anybody enlighten me?
Update:
In my country, the government assigns telecoms phone numbers starting with certain digits. Therefore, if you know the starting digits, you know which telecom someone is using. I am trying to count how many messages are sent each day using each telecoms.
SELECT SUBSTRING(ReqPhone, 1, 4),
       DATEADD(DAY, 0, DATEDIFF(DAY, 0, SessionID)) AS dayCreated,
       COUNT(*) AS tally
FROM yourTable
GROUP BY SUBSTRING(ReqPhone, 1, 4),
         DATEADD(DAY, 0, DATEDIFF(DAY, 0, SessionID))
Or, equivalently, with LEFT():
SELECT LEFT(ReqPhone, 4),
       DATEADD(DAY, 0, DATEDIFF(DAY, 0, SessionID)) AS dayCreated,
       COUNT(*) AS tally
FROM yourTable
GROUP BY LEFT(ReqPhone, 4),
         DATEADD(DAY, 0, DATEDIFF(DAY, 0, SessionID))
This will help you calculate the count of rows grouped by the ReqPhone prefix. This query works successfully in Oracle DB.
SELECT COUNT(SESSIONID), REQP
FROM (SELECT SESSIONID, SUBSTR(REQPHONE, 1, 4) AS REQP
      FROM SCHEMA_NAME.TABLE_NAME)
GROUP BY REQP
Note: use a column that is unique, such as SESSIONID, in the COUNT expression.
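To roll the prefixes up into per-telecom daily counts, as described in the update, a CASE expression can bucket the prefixes. A minimal sketch in SQL Server syntax (the prefix-to-telecom groupings and the telecom names are assumptions for illustration):
SELECT CASE
           WHEN LEFT(ReqPhone, 4) IN ('0925', '0927', '0940') THEN 'Telecom A' -- hypothetical grouping
           WHEN LEFT(ReqPhone, 4) IN ('0979', '0969', '0955') THEN 'Telecom B' -- hypothetical grouping
           ELSE 'Other'
       END AS telecom,
       DATEADD(DAY, 0, DATEDIFF(DAY, 0, SessionID)) AS dayCreated,
       COUNT(*) AS tally
FROM yourTable
GROUP BY CASE
             WHEN LEFT(ReqPhone, 4) IN ('0925', '0927', '0940') THEN 'Telecom A'
             WHEN LEFT(ReqPhone, 4) IN ('0979', '0969', '0955') THEN 'Telecom B'
             ELSE 'Other'
         END,
         DATEADD(DAY, 0, DATEDIFF(DAY, 0, SessionID))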

count number of occurrences in a year in a particular field

I have strings like FVS101209GO5 stored in an MS Access data table. I want to count the number of strings in a certain year; in the example that would be the year 2010.
I was doing
query = "SELECT SUM( IIF( Mid( KEYLastName, 4, 2) , 1,0)) AS occur FROM MyTable WHERE Year(mydate)=2010 ;"
The length of the string is 12 or 13; here are the examples #JW asked for:
qwe123456XXX - 2012
asd345678XXX - 2034
FVS101209GO5 - 2010
If you wish to find the count of occurrences of various years within a string, you might like to use:
SELECT Mid([KEYLastName],4,2) AS [Year],
       Count(KEYLastName) AS CountOfOccurrences
FROM MyTable
GROUP BY Mid([KEYLastName],4,2)
This will return all the two digit years at (4,2) and the number of times they each occur.
Edit re Comments
SELECT KEYLastName,
       Mid([KEYLastName],4,2) AS [Year],
       DCount("*", "MyTable", "Mid([KEYLastName],4,2)='"
              & Mid([KEYLastName],4,2) & "'") AS YearCount
FROM MyTable
Seems the 4th and 5th characters in KEYLastName represent the last 2 digits of a year, so "FVS101209GO5" is for 2010. If that is correct you can count the number of KEYLastName values which represent 2010 with either of these 2 queries:
SELECT Sum(IIf(Mid(KEYLastName, 4, 2) = "10", 1, 0)) AS occur
FROM MyTable;
SELECT Count(IIf(Mid(KEYLastName, 4, 2) = "10", 1, Null)) AS occur
FROM MyTable;
However, I'm unsure why you also have a WHERE clause to restrict the rows to those where mydate is from 2010. If you want that, too, create an index on mydate and include this WHERE clause in one of the above queries.
WHERE mydate >= #2010-1-1# AND mydate < #2011-1-1#
With an index on mydate that should be much faster than asking the db engine to apply the Year() function to the mydate value from every row in the table.
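Putting the two together, a minimal sketch of the combined query (same table and column names as above):
SELECT Count(IIf(Mid(KEYLastName, 4, 2) = "10", 1, Null)) AS occur
FROM MyTable
WHERE mydate >= #2010-1-1# AND mydate < #2011-1-1#;
With the index on mydate, the engine narrows the rows by date first and only then evaluates the Mid() expression on the surviving rows.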