SQL to select records for a specific date given created time and modified time

CONTEXT
I've been asked by my management to "analyze" our issue tracking database - they use it to catalog our internal bugs, etc. My SQL and DB skills are primitive so I need some help.
THE DATA
I received a single table of 3 million records covering about 250K bugs. Each revision of a bug is a row in the table; that's how 250K bugs end up as 3 million records.
The data looks like this:
BugID Created Modified AssignedTo Priority Status
27 mar-31-2003 mar-31-2003 mel 2 Open
27 mar-31-2003 apr-01-2003 mel 1 Open
27 mar-31-2003 apr-10-2003 steve 1 Fixed
Thus, I have the complete history of every bug and can see how they have evolved every day.
WHAT I WANT TO ACCOMPLISH
I have a lot of things I've been asked to provide as reports. But the most basic question I have been asked to do is enable someone to look at the bugs as they existed at a specific date.
For example, if someone asked for all the bugs on Mar 1, 2003, then bug 27 isn't in the result because it didn't exist yet on that day. If they asked for the bugs on April 7, they'd see bug 27, and it would still be marked as Open.
MY SPECIFIC QUESTION
Given the schema I outlined, what SQL query will provide a view of the records on a specific date?
TECHNICAL DETAILS
I am using Microsoft SQL Server 2008
WHAT I'VE TRIED SO FAR
As I said, my SQL skills are primitive. I was able to use WHERE clauses to filter out modifications made after the target date and bugs that didn't exist by the target date, but I wasn't able to narrow the result down to the single latest record for each bug as of that date.

WITH
sequenced_data AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY BugID ORDER BY Modified DESC) AS sequence_id,
*
FROM
yourTable
WHERE
Modified <= @datetime_stamp -- @datetime_stamp is the target "as of" date
)
SELECT
*
FROM
sequenced_data
WHERE
sequence_id = 1
This includes bugs that have already been fixed. If you want to filter out bugs that were fixed 'a long time ago' (say, more than 30 days before the target date), add this...
AND (Status <> 'Fixed' OR Modified >= DATEADD(DAY, -30, @datetime_stamp))
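For completeness, a minimal usage sketch (T-SQL on SQL Server 2008; yourTable and the example date are placeholders) that declares the target date and summarises the as-of snapshot by Status:
DECLARE @datetime_stamp DATETIME = '2003-04-07';

WITH sequenced_data AS
(
    SELECT
        ROW_NUMBER() OVER (PARTITION BY BugID ORDER BY Modified DESC) AS sequence_id,
        *
    FROM yourTable
    WHERE Modified <= @datetime_stamp   -- only revisions that existed by the target date
)
SELECT
    Status,
    COUNT(*) AS BugCount
FROM sequenced_data
WHERE sequence_id = 1                   -- latest revision per bug as of the target date
GROUP BY Status;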

Related

SQL Query Across Partitioned Database (by Day)

So in the past 3 months, we have gone from a Google Sheet with 5 tabs, up to a connected BigQuery DB referencing the Google Sheet as 5 tables and writing queries. Today, we just upgraded again to a fully day-partitioned database.
I am struggling to figure out how to write my queries across multiple days of data.
When I go to start the query it defaults to today.
SELECT order_number
FROM `project-123456.client_name.orders`
WHERE DATE(submitted_date) = "2022-02-10"
LIMIT 1000
I am trying to figure out the syntax for the month of January for example (and I know this isn't right)
WHERE DATE(submitted_date) = Jan 1 - Jan 31.
Any suggestions would be great. I am learning SQL at an alarming pace, but in this case I think I just don't know the right question to ask.
Ok I figured it out.
WHERE DATE(submitted_date) >= "2022-01-01" AND DATE(submitted_date) <= "2022-01-31"
Another option:
WHERE DATE_TRUNC(DATE(submitted_date), MONTH) = '2022-01-01'
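If submitted_date is the table's partition column, a half-open range on the raw column is another option and should let BigQuery prune partitions. A sketch, assuming submitted_date is a TIMESTAMP (if it is already a DATE, drop the TIMESTAMP() wrappers):
SELECT order_number
FROM `project-123456.client_name.orders`
WHERE submitted_date >= TIMESTAMP("2022-01-01")
  AND submitted_date < TIMESTAMP("2022-02-01")
LIMIT 1000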

Need column comprised of data from date two weeks ago for comparison

Let me start by saying that I am somewhat new to SQL/Snowflake and have been putting together queries for roughly 2 months. Some of my query language may not be ideal and I fully understand if there's a better, more efficient way to execute this query. Any and all input is appreciated. Also, this particular query is being developed in Snowflake.
My current query is pulling customer volumes by department and date based on a 45 day window with a 24 day lookback from current date and a 21 day look forward based on scheduled appointments. Each date is grouped based on where it falls within that 45 day window: current week (today through next 7 days), Week 1 (forward-looking days 8-14), and Week 2 (forward-looking days 15-21). I have been working to try and build out a comparison column that, for any date that lands within either the Week 1 or Week 2 group, will pull in prior period volumes from either 14 days prior (Week 1) or 21 days prior (Week 2) but am getting nowhere. Is there a best-practice for this type of column? Generic example of the current output is attached. Please note that the 'Prior Wk' column in the sample output was manually populated in an effort to illustrate the way this column should ideally work.
I have tried several different iterations of count(case...) similar to that listed below; however, the 'Prior Wk' column returns the count of encounters/scheduled encounters for the same day rather than those that occurred 14 or 21 days ago.
Count(Case When datediff(dd,SCHED_DTTM,getdate())
between -21 and -7 then 1 else null end
) as "Prior Wk"
I've tried to use an IFF statement as shown below, but no values return.
(IFF(ENCOUNTER_DATE > dateadd(dd,8,getdate()),
count(case when ENC_STATUS in ("Phone","InPerson") AND
datediff(dd,ENCOUNTER_Date,getdate()) between 7 and 14 then 1
else null end), '0')
) as "Prior Wk"
Also have attempted creating and using a temporary table (example included) but have not managed to successfully pull information from the temp table that didn't completely disrupt my encounter/scheduled counts. Please note for this approach I've only focused on the 14 day group and have not begun to look at the 21 day/Week 2 group. My attempt to use the temp table to resolve the problem centered around the following clause (temp table alias: "Date1"):
CASE when AHS.GL_Number = "DATEVISIT1"."GL_NUMBER" AND
datevisit1.lookback14 = dateadd(dd,14,PE.CONTACT_Date)
then "DATEVISIT1"."ENC_Count"
else null end
as "Prior Wk"*
I am extremely appreciative of any insight on the current best practices around pulling prior period data into a column alongside current period data. Any misuse of terminology on my part is not deliberate.
I'm struggling to understand your requirement, but it sounds like you need to use window functions (https://docs.snowflake.com/en/sql-reference/functions-analytic.html), in this case likely a SUM window function. The LAG window function (https://docs.snowflake.com/en/sql-reference/functions/lag.html) might also be of some help.
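For illustration, a minimal LAG sketch over a daily rollup (Snowflake SQL; daily_counts, DEPARTMENT, ENCOUNTER_DATE and ENC_COUNT are hypothetical names, not taken from the question):
SELECT
    DEPARTMENT,
    ENCOUNTER_DATE,
    ENC_COUNT,
    -- value from 14 days earlier, assuming exactly one row per department per day
    LAG(ENC_COUNT, 14) OVER (PARTITION BY DEPARTMENT ORDER BY ENCOUNTER_DATE) AS PRIOR_WK
FROM daily_counts;
If some dates can be missing, a self-join on DATEADD(day, -14, ENCOUNTER_DATE) is a safer way to line up the prior period.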

Unexpected result with ORDER BY

I have the following query:
SELECT
D.[Year] AS [Year]
, D.[Month] AS [Month]
, CASE
WHEN f.Dept IN ('XSD') THEN 'Marketing'
ELSE f.Dept
END AS DeptS
, COUNT(DISTINCT f.OrderNo) AS CountOrders
FROM Sales.LocalOrders AS l
INNER JOIN Sales.FiscalOrders AS f
ON l.ORDER_NUMBER = f.OrderNo
INNER JOIN Dimensions.Date_Dim AS D
ON CAST(D.[Date] AS DATE) = CAST(f.OrderDate AS DATE)
WHERE YEAR(f.OrderDate) = 2019
AND f.Dept IN ('XSD', 'PPM', 'XPP')
GROUP BY
D.[Year]
, D.[Month]
, f.Dept
ORDER BY
D.[Year] ASC
, D.[Month] ASC
I get the following result; the ORDER BY isn't giving the right result, because as we can see the Month column is not ordered numerically:
Year Month Depts CountOrders
2019 1 XSD 200
2019 10 PPM 290
2019 10 XPP 150
2019 2 XSD 200
2019 3 XPP 300
The expected output:
Year Month Depts CountOrders
2019 1 XSD 200
2019 2 XSD 200
2019 3 XPP 300
2019 10 PPM 290
2019 10 XPP 150
Your query
It is ordered by Month, but your D.[Month] is treated like a text string in the ORDER BY clause, so '10' sorts before '2'.
You could do one of two things to fix this:
Use a two-digit month number (e.g. 01... 12)
Use a data type for the ORDER BY clause that will be recognized as representing a month
A quick fix
You can correct this in your code by quickly changing the ORDER BY clause to analyze those columns as though they are numbers, which is done by converting ("casting") them to an integer data type like this:
ORDER BY
CAST(D.[Year] AS INT) ASC
,CAST(D.[Month] AS INT) ASC
This will correct your unexpected query results, but does not address the root cause, which is your underlying data (more on that below).
Your underlying data
The root cause of your issue is how your underlying data is stored and/or surfaced.
Your Month column appears to be stored as a character data type (varchar), rather than something more specifically suited to a month or date.
If you administer or have access to or control over the database, it is a good idea to consider correcting this.
In considering this, be mindful of potential context and change management issues, including:
Is this underlying data, or just a representation of upstream data that is elsewhere? (e.g. something that is refreshed periodically using a process that you do not control, or a view that is redefined periodically)
What other queries or processes rely on how this data is currently stored or surfaced (including data types), that may break if you mess with it?
Might there be validation issues if correcting it? (such as from the way zero, null, non-numeric or non-date data is stored, even if invalid)
What change management practices should be followed in your environment?
Is the data source under high transactional load?
Is it a production dataset?
Are other reporting processes dependent on it?
None of these issues are a good excuse to leave something set up incorrectly forever, which will likely compound the issue and introduce others. However, that is only part of the story.
The appropriate approach (correct it, or leave it) will depend on your situation. In a perfect textbook world, you'd correct it. In your world, you will have to decide.
A better way?
The above solution is a bit of a quick and nasty way to force your query to work.
The fact that the solution CASTs late in the query syntax, after the results have been selected and filtered, hints that this is not the most elegant way to achieve it.
Ideally you can convert data types as early as possible in the process:
If done in underlying data, not the query, this is the ultimate but may not suit the situation (see below)
If done in the query, try to do it earlier.
In your case, your GROUP BY and ORDER BY both use columns that look to be redundant copies of data already in the query results; that is, you are getting a DATE as well as a MONTH and a YEAR. Ideally you would just get a DATE and then derive the MONTH or YEAR from that date. Your issue is that your dates are not actually dates (see "underlying data" above), which:
In the case of DATE, is converted in your INNER JOIN line ON CAST(D.[Date] AS DATE) = CAST(f.OrderDate AS DATE) (likely to minimise issues with the join)
In the case of D.[year] and D.[month], are not converted (which is why we still need to convert them further down, in ORDER BY)
You could consider ignoring D.[month] and use the MONTH DATEPART computed from DATE, which would avoid the need to use CAST in the ORDER BY clause.
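For example, one way to express that without changing the rest of the query is to order by DATEPART over an aggregated date (a sketch; MIN is needed because the query uses GROUP BY, and it assumes D.[Date] casts cleanly to DATE, as it already does in the join):
ORDER BY
    DATEPART(YEAR, MIN(CAST(D.[Date] AS DATE))) ASC
    , DATEPART(MONTH, MIN(CAST(D.[Date] AS DATE))) ASC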
In your instance, this approach is a middle ground. The quick fix is included at the top of this answer, and the best fix is to correct the underlying data. This last section considers optimizing the quick fix, but does not correct the underlying issue. It is only mentioned for awareness and to avoid promoting the use of CAST in an ORDER BY clause as the most legitimate way of addressing your issue with good clean query syntax.
There are also potential performance tradeoffs between how many columns you select that you don't need (e.g. all of the ones in D?), whether to compute the month from the date or a separate month column, whether to cast to date before filtering, etc. These are beyond the scope of this solution.
So:
The immediate solution: use the quick fix
The optimal solution: after it's working, consider the underlying data (in your situation)
The real problem is your object Dimensions.Date_Dim here. As you are simply ordering on the values of D.[Year] and D.[Month] without manipulating them at all, this means the object is severely flawed: you are storing numerical data as a varchar. varchar and numerical data types sort completely differently. For example, 2 is less than 10, but '2' is greater than '10', because strings are compared character by character and '2' is greater than '1'.
The real solution, therefore, is fixing your object. Assuming that both Month and Year are incorrectly stored as a varchar, don't have any non-integer values (another and different flaw if so), and are not computed columns, then you could just do:
ALTER TABLE Dimensions.Date_Dim ALTER COLUMN [Year] int NOT NULL;
ALTER TABLE Dimensions.Date_Dim ALTER COLUMN [Month] int NOT NULL;
You could, however, also make the columns PERSISTED computed columns, which might well be easier, in my opinion, as DATEPART already returns a strongly typed int value.
ALTER TABLE Dimensions.Date_Dim DROP COLUMN [Month];
ALTER TABLE Dimensions.Date_Dim ADD [Month] AS DATEPART(MONTH,[Date]) PERSISTED;
Of course, for both solutions, you'll need to (first) DROP and (afterwards) reCREATE any indexes and constraints on the columns.
As long as your "Month" is always 1-12, you can use
SELECT ..., TRY_CAST(D.[Month] AS INT) AS [Month],...
ORDER BY TRY_CAST(D.[Month] AS INT)
The simplest solution is:
ORDER BY MIN(D.[Date])
or:
ORDER BY MIN(f.OrderDate)
Fiddling with the year and month columns is totally unnecessary when you have a date column that is available.
A very common issue when you store numerical data as a varchar/nvarchar.
Try to cast Year and Month to INT.
ORDER BY
CAST(D.[Year] AS INT) ASC
,CAST(D.[Month] AS INT) ASC
If you try using the <, > and BETWEEN operators, you will get some really "weird" results.
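For instance, a quick check (plain T-SQL) shows how string comparison orders these values:
-- Returns 'string order': '2' sorts after '10' because strings compare character by character
SELECT CASE WHEN '2' > '10' THEN 'string order' ELSE 'numeric order' END AS comparison_result;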

Data collection from GDELT using bigquery

I am trying to construct an economic indicator based on all events with specific cameo codes from gdelt database.
So the idea is to collect data from 1990 to the present and see how economic cooperation varied based on news appearances of certain words, specifically CAMEO codes 0211, 0311, 061, 1011 and 1211.
My question is how to extract the data for these specific CAMEO codes. If you can direct me to any source, it would be of great help.
One person suggested that I try using BigQuery. I honestly don't know how to navigate the Google BigQuery page (I tried my best; being from a non-tech background, it was a bit overwhelming for me). If any of you can help with an extraction example for one CAMEO code, then I can play around with the other events.
Edit: I am editing to show the progress I have made and the issues I am facing while running the query.
SELECT
*
FROM
[gdelt-bq:full.events]
WHERE
Year >= 1979
AND EventCode IN ('0211', '0311','061', '1011', '1211')
AND Actor1CountryCode != Actor2CountryCode
This query will process 228 GB when run and also excludes the cases where both country codes are null. It returns over 2 million rows, and I can't download that as a CSV file from the BigQuery platform.
The part where I need help is the following,
Is there any way that I can get the total number of events for each event code which satisfies the following conditions:
Actor1CountryCode and Actor2CountryCode should be different, except when they are both null
a count for each event code, for every month, which satisfies the above condition
PS: You can run the code given by Ben P in the answer below to see the number and type of columns in the database.
Edit2: There is another query that I am trying to write, where the AvgTone of an event with a specified code is greater than the average AvgTone of all events in that particular month. Any leads on how to write this would be really helpful. Suppose I add a WHERE clause where the AvgTone is greater than the average AvgTone of all events for that particular period (MonthYear in this case). My doubt is how to write this in query form.
SELECT
MonthYear,
COUNT(*)
FROM
[gdelt-bq:full.events]
WHERE
EventCode IN ('0211',
'0311',
'061')
AND Actor1CountryCode != Actor2CountryCode
AND AvgTone > (SELECT AVG(AvgTone) FROM [gdelt-bq:full.events] GROUP BY MonthYear ORDER BY MonthYear)
GROUP BY
MonthYear
ORDER BY
MonthYear
Error: ELEMENT can only be applied to result with 0 or 1 row.
Can someone help me with the above query? Thanks.
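One possible way to express that comparison is a window average per MonthYear computed before filtering, so the average covers all events in the month (an untested sketch in BigQuery Standard SQL, using the column names from the question):
SELECT
  MonthYear,
  COUNT(*) AS event_count
FROM (
  SELECT
    MonthYear,
    EventCode,
    Actor1CountryCode,
    Actor2CountryCode,
    AvgTone,
    -- average tone of ALL events in the same month, attached to every row
    AVG(AvgTone) OVER (PARTITION BY MonthYear) AS month_avg_tone
  FROM `gdelt-bq.full.events`
) AS e
WHERE EventCode IN ('0211', '0311', '061')
  AND Actor1CountryCode != Actor2CountryCode
  AND AvgTone > month_avg_tone
GROUP BY MonthYear
ORDER BY MonthYear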
The GDELT database is available in BigQuery.
Here is a link to their available datasets, your first step would to identify which contains the information you are interested in:
https://blog.gdeltproject.org/the-datasets-of-gdelt-as-of-february-2016/
Then this section of the site contains sample queries, which you can use as a starting point and try to tweak to your needs (note that these examples appear to be mostly in Legacy SQL; I would suggest you use them as a guide and rewrite them in Standard SQL):
https://blog.gdeltproject.org/a-compilation-of-gdelt-bigquery-demos/
If you have any specific SQL/BigQuery questions after you have done this I would recommend you come back with fresh questions and share examples of your working code, details what you have already tried and the results you expect to see.
Having had a quick look (and I must say I am not familiar with the dataset), this may be a simple query that can start you on your way:
-- first we select all columns from the event dataset, which seems
-- to be the one you want, containing cameo codes
SELECT * FROM `gdelt-bq.full.events`
-- then we add a filter to only look at events in or after 1990
WHERE Year >= 1990
-- and another filter to look at only the specific CAMEO
-- codes you provided (I think EventCode is the correct column here)
AND EventCode IN ('0211','0311','061','1011','1211')
-- finally, we add a limit to our query, so we don't bring back ALL
-- the results while testing, once we are happy with our query, we'd remove this!
LIMIT 100
Finally, the GDELT tag right here on StackOverflow contains some really great content.
Hope that helps, GDELT looks like a fascinating project!
I finally figured out a way to extract data from GDELT using BigQuery. Although the query is very simple, my lack of SQL knowledge made it difficult. Thanks to Ben, who provided the initial help. The following queries satisfy the conditions given in the question.
SELECT
MonthYear,
COUNT(*)
FROM
[gdelt-bq:full.events]
WHERE
EventCode IN ('0211', '0311','061')
AND Actor1CountryCode IS NULL
AND Actor2CountryCode IS NULL
GROUP BY
MonthYear
ORDER BY
MonthYear
SELECT
MonthYear,
COUNT(*)
FROM
[gdelt-bq:full.events]
WHERE
EventCode IN ('0211', '0311','061')
AND Actor1CountryCode != Actor2CountryCode
GROUP BY
MonthYear
ORDER BY
MonthYear
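If a breakdown per event code is also needed (the "count for each event code every month" part), EventCode can be added to the SELECT and GROUP BY. A sketch in the same legacy-SQL style as the queries above:
SELECT
  MonthYear,
  EventCode,
  COUNT(*) AS event_count
FROM
  [gdelt-bq:full.events]
WHERE
  EventCode IN ('0211', '0311', '061', '1011', '1211')
  AND Actor1CountryCode != Actor2CountryCode
GROUP BY
  MonthYear,
  EventCode
ORDER BY
  MonthYear,
  EventCode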

SQL for Next/Prior Business Day from Calendar table (in MS Access)

I have a Calendar table pulled from our mainframe DBs and saved as a local Access table. The table has history back to the 1930s (and I know we use back to the 50s in at least one place), resulting in 31k records. This Calendar table has 3 fields of interest:
Bus_Dt - every day, not just business days. Primary Key
Bus_Day_Ind - indicates if the day was a valid business day for the stock market.
Prir_Bus_Dt - the prior business day. Contains some errors (about 50), all old.
I have written a query to retrieve the first business day on or after the current calendar day, but it runs supremely slowly (5+ minutes). I have examined the showplan output and see it is being run via an x-join, which between 30k+ record tables gives a solution space (and date comparisons) on the order of nearly 10 million. However, the actual task is not hard, and could be performed comfortably in Excel in minimal time using a simple sort.
My question is thus, is there any way to fix the poor performance of the query, or is this an inherent failing of SQL? (DB2 run on the mainframe also is slow, though not crushingly so. Throwing cycles at the problem and all that.) Secondarily, if I were to trust prir_bus_dt, can I get there better? Or restrict the date range (aka, "cheat"), or any other tricks I didn't think of yet?
SQL:
SELECT TE2Clndr.BUS_DT AS Cal_Dt
, Min(TE2Clndr_1.BUS_DT) AS Next_Bus_Dt
FROM TE2Clndr
, TE2Clndr AS TE2Clndr_1
WHERE TE2Clndr_1.BUS_DAY_IND="Y" AND
TE2Clndr.BUS_DT<=[te2clndr_1].[bus_dt]
GROUP BY TE2Clndr.BUS_DT;
Showplan:
Inputs to Query
Table 'TE2Clndr'
Table 'TE2Clndr'
End inputs to Query
01) Restrict rows of table TE2Clndr
by scanning
testing expression "TE2Clndr_1.BUS_DAY_IND="Y""
store result in temporary table
02) Inner Join table 'TE2Clndr' to result of '01)'
using X-Prod join
then test expression "TE2Clndr.BUS_DT<=[te2clndr_1].[bus_dt]"
03) Group result of '02)'
Again, the question is, can this be made better (faster), or is this already as good as it gets?
I have a new query that is much faster for the same job, but it depends on the prir_bus_dt field (which has some errors). It also isn't great theory since prior business day is not necessarily available on everyone's calendar. So I don't consider this "the" answer, merely an answer.
New query:
SELECT TE2Clndr.BUS_DT as Cal_Dt
, Max(TE2Clndr_1.BUS_DT) AS Next_Bus_Dt
FROM TE2Clndr
INNER JOIN TE2Clndr AS TE2Clndr_1
ON TE2Clndr.PRIR_BUS_DT = TE2Clndr_1.PRIR_BUS_DT
GROUP BY TE2Clndr.BUS_DT;
What about this approach
select min(bus_dt)
from te2Clndr
where bus_dt >= date()
and bus_day_ind = 'Y'
This is my reference for date() representing the current date
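If the next business day is needed for every calendar day (not just today), a correlated subquery over the same table is another option. A sketch in Access SQL using the same table and columns; with BUS_DT indexed it may allow an indexed lookup per row instead of the x-join, though performance in Access isn't guaranteed:
SELECT c.BUS_DT AS Cal_Dt,
       (SELECT MIN(n.BUS_DT)
        FROM TE2Clndr AS n
        WHERE n.BUS_DAY_IND = 'Y'
          AND n.BUS_DT >= c.BUS_DT) AS Next_Bus_Dt
FROM TE2Clndr AS c;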