report builder - queries returning incorrect results - sql

I have set up a reporting project where I would like to get stats for my tables and later integrate them into a web service. Some of the following queries return incorrect results; I note which ones below:
1 - Get the number of new entries for a given day
SELECT COUNT(*) AS RecordsCount,
CAST(FLOOR(CAST(dateadded AS float)) AS datetime) AS collectionDate
FROM TFeed
GROUP BY CAST(FLOOR(CAST(dateadded AS float)) AS datetime)
ORDER BY collectionDate
works fine and I am able to put this in a bar graph successfully.
2 - Get the top 10 searchterms with the highest records per searchterm requested by a given client in the last 10 days
SELECT TOP 10 searchterm, clientId, COUNT(*) AS TermResults FROM TFeed
where dateadded > getdate() - 10 GROUP BY
searchterm,clientId order by TermResults desc
does not work.
For one of those terms the report shows 98, but querying the database directly for the same term returns 984.
3 - I need to get the number of new records per client for a given day as well.
Also, I was wondering whether it is possible to put these queries into one report instead of an individual report for each query. It is not a big deal, but having to cut and paste everything into one document afterwards is tedious.
Any ideas appreciated

For #2,
WITH tmp as
(
SELECT clientId, searchTerm, COUNT(1) as TermResults,
DENSE_RANK() OVER (partition by clientId
ORDER BY clientId, COUNT(1) DESC) as rnk
FROM TFeed
WHERE dateadded > GETDATE() - 10
GROUP BY clientId, searchterm
)
SELECT *
FROM tmp
WHERE rnk < 11
Use RANK() instead if you want to skip a rank after a tie: if, say, term1 and term2 have the same count, they are both ranked 1 and the following term is ranked 3rd instead of 2nd.
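The difference between the two ranking functions is easy to see on a tiny table. The following is only a sketch: it uses Python's built-in sqlite3 module instead of SQL Server (an assumption made so the example is self-contained; window functions need SQLite 3.25 or later), and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TFeed (clientId INTEGER, searchterm TEXT)")

# Made-up data: 'alpha' and 'beta' tie with 3 rows each, 'gamma' has 1.
rows = [(1, "alpha")] * 3 + [(1, "beta")] * 3 + [(1, "gamma")]
conn.executemany("INSERT INTO TFeed VALUES (?, ?)", rows)

query = """
WITH counts AS (
    SELECT clientId, searchterm, COUNT(*) AS TermResults
    FROM TFeed
    GROUP BY clientId, searchterm
)
SELECT clientId, searchterm, TermResults,
       RANK()       OVER (PARTITION BY clientId ORDER BY TermResults DESC) AS rnk,
       DENSE_RANK() OVER (PARTITION BY clientId ORDER BY TermResults DESC) AS dense_rnk
FROM counts
ORDER BY clientId, TermResults DESC, searchterm
"""
ranked = conn.execute(query).fetchall()
for row in ranked:
    print(row)
```

After the tie, RANK() jumps to 3 while DENSE_RANK() continues with 2, so a filter such as rnk < 11 can return a different number of terms depending on which function is used.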
For #3,
you can define multiple datasets within one report. Then you would just create three charts or tables and associate each one with its respective dataset.

Related

Sort many ids in a table in SQL Server

I have been given a task: look at an items table, grab the first and the last item of 2019, and set their active flags to active. The query I wrote can only fetch them one store at a time, and doing it that way would take days. Here is my query in SQL Server:
SELECT *
FROM NODES
WHERE NODE ID = 5562
AND DATE BETWEEN '2019/01/01' AND '2019/12/30'
Basically I need the first and the last item for the year. The problem is that every node is a specific store with many records, and I would have to run the query over millions of records across many nodes. Is it possible to tell SQL Server, for a given set of nodes, to take the first and last item for 2019, display them, and then update their active flag to 'Y'?
Is this possible with a CTE? Do I need a CTE at all?
Thank you
If I understood correctly, you could try using a CTE with a windowed function to fetch only the first row from each store after ordering by date in ascending order and the first row from each store after ordering by date in descending order.
For instance:
CREATE TABLE NODES (NodeId int,NodeDate DATETIME2,status NVARCHAR(128))
INSERT INTO NODES(NodeId,NodeDate,Status) VALUES
(1,'2019/01/01','inactive'),
(1,'2019/03/01','inactive'),
(1,'2019/06/01','inactive'),
(1,'2019/09/01','inactive'),
(1,'2019/12/01','inactive'),
(2,'2019/01/01','inactive'),
(2,'2019/03/01','inactive'),
(2,'2019/06/01','inactive'),
(2,'2019/09/01','inactive'),
(2,'2019/12/01','inactive'),
(3,'2019/01/01','inactive'),
(3,'2019/03/01','inactive'),
(3,'2019/06/01','inactive'),
(3,'2019/09/01','inactive'),
(3,'2019/12/01','inactive')
;WITH cte AS
(
SELECT status,
ROW_NUMBER() OVER (PARTITION BY NodeId ORDER BY NodeDate ASC) AS FirstDate,
ROW_NUMBER() OVER (PARTITION BY NodeId ORDER BY NodeDate DESC) AS LastDate
FROM NODES
WHERE NodeDate >= '2019/01/01' AND NodeDate < '2020/01/01'
)
UPDATE CTE SET status = 'active'
WHERE FirstDate = 1 OR LastDate = 1
SELECT * FROM NODES
Please note, however, that this operation can be non-deterministic if multiple rows for the same node have the same date.
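One way to make the choice repeatable when dates tie is to add a unique column as a secondary sort key. The sketch below assumes a hypothetical `ItemId` primary key that is not in the original table, and it runs on SQLite through Python's sqlite3 module rather than SQL Server. SQLite cannot update through a CTE, so the same row numbering is used inside an `IN` subquery instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# ItemId is a hypothetical unique key, added only to break ties.
conn.execute(
    "CREATE TABLE NODES (ItemId INTEGER PRIMARY KEY, NodeId INTEGER, "
    "NodeDate TEXT, Status TEXT)"
)
conn.executemany(
    "INSERT INTO NODES (ItemId, NodeId, NodeDate, Status) "
    "VALUES (?, ?, ?, 'inactive')",
    [
        (1, 1, "2019-01-01"),
        (2, 1, "2019-01-01"),  # same date as ItemId 1: a tie
        (3, 1, "2019-06-01"),
        (4, 1, "2019-12-01"),
        (5, 2, "2019-03-01"),
        (6, 2, "2019-09-01"),
        (7, 2, "2020-01-01"),  # outside 2019, must stay inactive
    ],
)

conn.execute("""
UPDATE NODES SET Status = 'active'
WHERE ItemId IN (
    SELECT ItemId FROM (
        SELECT ItemId,
               ROW_NUMBER() OVER (PARTITION BY NodeId
                                  ORDER BY NodeDate ASC, ItemId ASC)   AS FirstDate,
               ROW_NUMBER() OVER (PARTITION BY NodeId
                                  ORDER BY NodeDate DESC, ItemId DESC) AS LastDate
        FROM NODES
        WHERE NodeDate >= '2019-01-01' AND NodeDate < '2020-01-01'
    )
    WHERE FirstDate = 1 OR LastDate = 1
)
""")

active_ids = [
    r[0] for r in conn.execute(
        "SELECT ItemId FROM NODES WHERE Status = 'active' ORDER BY ItemId"
    )
]
print(active_ids)
```

With the tie-breaker in place, the row with the lower `ItemId` is always picked as the first item for node 1, regardless of how the engine happens to order equal dates.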
See also:
Get top 1 row of each group

Adding grouping in framing clause window while creating partitions

Using the dataset hosted on Google (MLB data) as an example, here is what I am trying to accomplish: obtain the last 3 weeks' score run for a given venue.
My aggregated dataset looks like this without the strikes_3wk column -
Logic for strikes_3wk column is to partition the aggregated dataset by venueName, order by YearWeek column and then obtain the last 3 weeks aggregated strikes data.
Here is the query I have written so far. I see that the windowing function is where I need to modify the logic. So, is there a way to add grouping within the windowing function? Is there any alternative way of doing this?
In the image I added a new column 'expected', showing values for two weeks.
select inr.*
,sum(inr.strikes) over (Venue_Week rows between current row and 2 following) as strikes_3wk
from
(
select seasonType
,gameStatus
,homeTeamName
,awayTeamName
,venueName
,CAST(
CONCAT(
CAST(EXTRACT(YEAR FROM createdAt) as string)
,CAST(EXTRACT(WEEK(Monday) FROM createdAt) as string)
) as INT64)
as YearWeek
,sum(homeFinalRuns) as homeFinalRuns
,sum(strikes) as strikes
from `bigquery-public-data.baseball.games_wide`
where createdAt is not null
group by seasonType
,gameStatus
,homeTeamName
,awayTeamName
,venueName
,YearWeek
)inr
window Venue_Week as (
partition by inr.venueName
order by inr.YearWeek desc
)
So you are looking for strikes per venue, regardless of who made them, right?
Maybe something like:
SELECT INR.*, STATS.strikes_3wk
FROM `bigquery-public-data.baseball.games_wide` INR
LEFT JOIN (
SELECT venueName, SUM(strikes) as strikes_3wk
FROM `bigquery-public-data.baseball.games_wide` INR2
WHERE YearWeek IN (
SELECT TOP 3 YearWeek
FROM `bigquery-public-data.baseball.games_wide`
WHERE venueName = INR2.venueName
ORDER BY YearWeek DESC
)
GROUP BY venueName
) STATS
ON INR.venueName = STATS.venueName
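An alternative worth trying is to aggregate down to one row per venue and week first, and only then apply the window, so that the frame counts weeks rather than individual team pairings. The sketch below illustrates the idea on made-up data with Python's sqlite3 module (an assumption for the sake of a runnable example; it is not BigQuery, and the `games` table and its rows are hypothetical).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (venueName TEXT, YearWeek INTEGER, strikes INTEGER)")
conn.executemany("INSERT INTO games VALUES (?, ?, ?)", [
    ("Park A", 201614, 10),   # two games in the same week at Park A
    ("Park A", 201614, 5),
    ("Park A", 201615, 20),
    ("Park A", 201616, 30),
    ("Park A", 201617, 40),
    ("Park B", 201614, 7),
    ("Park B", 201615, 3),
])

query = """
WITH weekly AS (
    -- Step 1: one row per venue and week
    SELECT venueName, YearWeek, SUM(strikes) AS strikes
    FROM games
    GROUP BY venueName, YearWeek
)
-- Step 2: sum the current week and the two weeks before it
SELECT venueName, YearWeek, strikes,
       SUM(strikes) OVER (
           PARTITION BY venueName
           ORDER BY YearWeek
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS strikes_3wk
FROM weekly
ORDER BY venueName, YearWeek
"""
rolling = conn.execute(query).fetchall()
for row in rolling:
    print(row)
```

Note that a ROWS frame counts rows, not calendar weeks, so a venue with no games in some week would pull in an older week instead. Also, a YearWeek built by concatenating the year with an unpadded week number does not sort correctly across years (20171 sorts before 201610), so zero-padding the week may be needed.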

Executing an Aggregate function within a case without Group by

I am trying to assign a specific code to a client based on the number of gifts they have given in the past 6 months, using a CASE. I am unable to use WITH due to the limitations of the software I am creating the query in; it only allows SELECT statements. I am unsure how to get a distinct count from another table (transaction data) and use it as a condition in the CASE I have built so far (based on my client information table). Does anyone know of a workaround? I am unable to GROUP BY clientID at the end of my query because not all of my columns are aggregates, and I only need to group by clientID for this particular WHEN branch of the CASE. I have looked into the OVER() clause, but the date range I am evaluating needs to be dynamic (counting transactions over the last six months), and the number of rows included is variable, as the transaction count varies from month to month. Also, the software I am building this in does not recognize the PARTITION BY parameter of the OVER clause.
Any help would be great!
EDIT:
It is not letting me attach an image, so I have added the two sections of code that I am looking for assistance with:
WITH "6MonthGiftCount" (
"ConstituentID"
,"GiftCount"
)
AS (
SELECT "GiftView"."ConstituentID", COUNT(DISTINCT "GiftView"."GiftID") FROM "GiftView" WHERE MONTHS_BETWEEN("GiftView"."GiftDate", getdate()) <= 6 GROUP BY "GiftView"."ConstituentID"
)
SELECT...CASE
WHEN "6MonthGiftCount"."GiftCount" >= 4
THEN 'A010'
Perform your grouping/COUNT(1) in a subquery to obtain the total # of donations by ConstituentID, then JOIN this total into your main query that uses this new column to perform its CASE statement.
select
hist.*,
case when timesDonated > 5 then 'gracious donor'
when timesDonated > 3 then 'repeated donor'
when timesDonated >= 1 then 'donor'
else null end as donorCode
from gifthistory hist
left join ( /* your grouping subquery here, pretending to be a new table */
select
personID,
count(1) as timesDonated
from gifthistory i
WHERE abs(months_between(giftDate, sysdate)) <= 6
group by personid ) grp on hist.personid = grp.personID
order by 1;
*Naturally, syntax changes will vary by DB; you didn't specify which it was based on, but you should be able to use this template with whichever you utilize. This works in both Oracle and SQL Server after tweaking the month calculation appropriately.
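As a concrete illustration of the template above, here is a small sketch run on SQLite through Python's sqlite3 module. Everything in it is an assumption made for the example: the `clients` and `gifts` tables, their rows, the second code 'A020', and SQLite's `date('now', '-6 months')` standing in for the month calculation of whichever database is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clients (ConstituentID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE gifts (GiftID INTEGER PRIMARY KEY, ConstituentID INTEGER, GiftDate TEXT);
INSERT INTO clients VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO gifts (ConstituentID, GiftDate) VALUES
    (1, date('now', '-10 days')),
    (1, date('now', '-40 days')),
    (1, date('now', '-70 days')),
    (1, date('now', '-100 days')),
    (2, date('now', '-20 days')),
    (2, date('now', '-400 days'));   -- older than six months, not counted
""")

query = """
SELECT c.ConstituentID, c.Name,
       COALESCE(g.GiftCount, 0) AS GiftCount,
       CASE WHEN g.GiftCount >= 4 THEN 'A010'
            WHEN g.GiftCount >= 1 THEN 'A020'   -- hypothetical second code
            ELSE NULL END AS Code
FROM clients c
LEFT JOIN (
    -- the grouping happens only here, so the outer query needs no GROUP BY
    SELECT ConstituentID, COUNT(DISTINCT GiftID) AS GiftCount
    FROM gifts
    WHERE GiftDate >= date('now', '-6 months')
    GROUP BY ConstituentID
) g ON g.ConstituentID = c.ConstituentID
ORDER BY c.ConstituentID
"""
coded = conn.execute(query).fetchall()
for row in coded:
    print(row)
```

Because the aggregate lives entirely inside the joined subquery, this shape needs neither WITH nor OVER, which may fit tools that only accept plain SELECT statements.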

How Can I Retrieve The Earliest Date and Status Per Each Distinct ID

I have been trying to write a query for this case but can't seem to get it right, because I am still receiving duplicates. I am hoping to get help with fixing this issue.
SELECT DISTINCT
s.Client,
s.ID,
s.Thing,
s.Status,
MIN(s.StatusDate) as 'statdate'
FROM
SAMPLE s
WHERE
[]
GROUP BY
s.Client,
s.ID,
s.Thing,
s.Status
My output is as follows
Client    Id   Thing   Status    Statdate
CompanyA  123  Thing1  Approved  12/9/2019
CompanyA  123  Thing1  Denied    12/6/2019
So although the query is doing what I asked and showing the minimum status date per status, I want only the first status date per ID. I have about 30k rows to filter through, so I need something that does not overload the query and prevent it from running. Any help would be appreciated.
Use window functions:
SELECT s.*
FROM (SELECT s.*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY StatusDate) as seqnum
FROM SAMPLE s
WHERE []
) s
WHERE seqnum = 1;
This returns the first row for each id.
Use whichever of these you feel more comfortable with/understand:
SELECT
*
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY statusdate) as rn
FROM sample
WHERE ...
) x
WHERE rn = 1
The way that one works is to number all rows sequentially in order of StatusDate, restarting the numbering from 1 every time ID changes. If you then collect all the number 1's together, you have your set of "first records".
Or you can coordinate a MIN:
SELECT
*
FROM
sample s
INNER JOIN
(SELECT ID, MIN(statusDate) as minDate FROM sample WHERE ... GROUP BY ID) mins
ON s.ID = mins.ID and s.StatusDate = mins.MinDate
WHERE
...
This one prepares a list of every ID with its min date, then joins it back to the main table. You thus get back all the data that was lost during the grouping operation. You cannot simultaneously "keep data" and "throw away data" during a group: if you group by more than just ID, you get more groups (as you have found), and if you only group by ID, you lose the other columns. There isn't any way to say "GROUP BY id, AND take the MIN date, AND also take all the other data from the same row as the min date" without doing a "group by id, take min date, then join this data set back to the main dataset to get the other data for that min date". If you try to do it all in a single grouping you'll fail, because you either have to group by more columns or use aggregate functions for the other data in the SELECT, which mixes your data up; once the groups are formed, the concept of "other data from the same row" is gone.
Be aware that this can return duplicate rows if two records for the same ID have identical min dates. The ROW_NUMBER form doesn't return duplicate records, but if two records have the same minimum StatusDate then which one you get is arbitrary. To force a specific one, add more columns to the ORDER BY so you can be sure which row ends up with number 1.
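The tie behaviour of the two forms can be checked on a few rows. This is a sketch using Python's sqlite3 module; the 'Pending' row and CompanyB are made-up additions to the sample output from the question, and dates are stored in ISO format so that they sort correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample (Client TEXT, ID INTEGER, Thing TEXT, "
    "Status TEXT, StatusDate TEXT)"
)
conn.executemany("INSERT INTO sample VALUES (?, ?, ?, ?, ?)", [
    ("CompanyA", 123, "Thing1", "Denied",   "2019-12-06"),
    ("CompanyA", 123, "Thing1", "Pending",  "2019-12-06"),  # hypothetical tie on the min date
    ("CompanyA", 123, "Thing1", "Approved", "2019-12-09"),
    ("CompanyB", 456, "Thing2", "Approved", "2019-11-01"),
])

# MIN + join back: every row that shares the minimum date is returned.
min_join = conn.execute("""
SELECT s.ID, s.Status
FROM sample s
INNER JOIN (SELECT ID, MIN(StatusDate) AS minDate FROM sample GROUP BY ID) mins
        ON s.ID = mins.ID AND s.StatusDate = mins.minDate
ORDER BY s.ID, s.Status
""").fetchall()

# ROW_NUMBER with Status as a tie-breaker: exactly one row per ID.
row_number = conn.execute("""
SELECT ID, Status
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY StatusDate, Status) AS rn
    FROM sample
) x
WHERE rn = 1
ORDER BY ID
""").fetchall()

print(min_join)
print(row_number)
```

The join form returns both tied rows for ID 123, while the ROW_NUMBER form returns one row per ID, chosen predictably because Status was added to the ORDER BY.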

How can I make this query run efficiently?

In BigQuery, we're trying to run:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [Datastore.PerformanceDatum]
WHERE type = "MemoryPerf"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
which returns a relatively small amount of data. But we're getting the message:
Error: Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead. For more details, please see https://developers.google.com/bigquery/docs/query-reference#groupby
What is making this query fail, the size of the subquery? Is there some equivalent query we can do which avoids the problem?
Edit in response to comments: If I add GROUP EACH BY (and drop the outer ORDER BY), the query fails, claiming that GROUP EACH BY is not parallelizable here.
I wrote an equivalent query that works for me:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, UTC_USEC_TO_DAY(dtimestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
If I run only the inner query, I get 3,660,624 results. Is your dataset bigger than that?
The outer select gives me only 4 results when grouped by day. I'll try a different grouping to see if I can hit a limit there:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, dtimestamp / 1000 as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
Runs too, now with 57,862 different groups.
I tried different combinations to reproduce the error. I was able to get the same error as you by doubling the amount of initial data. An easy "hack" to double the amount of data is changing:
FROM [io_sensor_data.moscone_io13]
To:
FROM [io_sensor_data.moscone_io13], [io_sensor_data.moscone_io13]
Then I get the same error. How much data do you have? Can you apply an additional filter? As you are already partitioning the percentile_rank by day, can you add an additional condition so that only a fraction of the days is analyzed (for example, only the last month)?
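The idea of restricting the days before ranking can be sketched as follows. This is not BigQuery: it uses Python's sqlite3 module with SQLite's `PERCENT_RANK()` as a stand-in for `PERCENTILE_RANK()`, and the `perf` table, its rows, and the cut-off date are all made up. The point is only that the filter sits inside the inner query, so fewer rows reach the window function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE perf (day TEXT, type TEXT, value REAL)")

# Made-up data: 101 values per day for two days.
rows = [
    (day, "MemoryPerf", float(v))
    for day in ("2013-05-01", "2013-06-01")
    for v in range(1, 102)
]
conn.executemany("INSERT INTO perf VALUES (?, ?, ?)", rows)

query = """
SELECT day, AVG(value) AS avg_value FROM (
    SELECT day, value,
           PERCENT_RANK() OVER (PARTITION BY day ORDER BY value ASC) AS pct
    FROM perf
    WHERE type = 'MemoryPerf'
      AND day >= '2013-06-01'      -- extra filter: only the most recent days
) WHERE pct >= 0.9 AND pct <= 0.91
GROUP BY day
ORDER BY day DESC
"""
result = conn.execute(query).fetchall()
print(result)
```

Only the days that pass the inner filter are partitioned and ranked, which is what reduces the amount of data the window function has to process.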