I'm currently working on indexing documents in Azure Cosmos DB SQL API. I read here:
https://learn.microsoft.com/en-us/azure/cosmos-db/working-with-dates
that dates can be stored either as a string or as a numeric Unix timestamp.
The string format yyyy-MM-ddTHH:mm:ss.fffffffZ has several advantages, e.g. it is human readable, but the page does not say directly whether it is also faster to query.
So we can use either a string range index or a numeric range index on epoch time.
Does anyone know which is faster for range filter queries?
e.g. startTime > x and startTime < y
Or for a similar filter in a UDF that takes only the hour into account,
e.g.
function somefunc(ts, firstHour, secondHour) {
var ts_date = new Date(ts);
var hour = ts_date.getHours();
return hour >= firstHour && hour < secondHour;
}
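For comparison, here is a minimal Python sketch of the same hour-window check. It assumes the timestamp is epoch seconds and evaluates the hour in UTC explicitly (JavaScript's getHours() returns the hour in the local time zone of wherever the code runs). The name hour_in_range is only illustrative.

```python
from datetime import datetime, timezone


def hour_in_range(ts_seconds, first_hour, second_hour):
    """Return True if the UTC hour of ts_seconds lies in [first_hour, second_hour)."""
    # Assumption: ts_seconds is a Unix timestamp in seconds, not milliseconds.
    hour = datetime.fromtimestamp(ts_seconds, tz=timezone.utc).hour
    return first_hour <= hour < second_hour


# 1609459200 is 2021-01-01T00:00:00Z, so adding 9 hours gives 09:00 UTC.
nine_am = 1609459200 + 9 * 3600
print(hour_in_range(nine_am, 9, 17))
```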
We have a PostgreSQL DB running on AWS (engine version 11.8).
For each item, we store dates as strings (varchar type) in the format '2020-12-16'.
Our app needs to run date range queries. Given that, what is the best and most efficient way to do string date comparisons in PostgreSQL?
I looked at several questions here on SO, but none of them discusses whether it makes a difference to store the dates as type "varchar" or as type "date".
Also, for each of the two storage types above, what would be the most efficient way to query? In particular we are looking at range queries (for example from '2020-12-10' to '2020-12-16').
Thanks a lot for your feedback
no one talks about the question if there is a difference to store the dates as type "varchar" or type "date".
Fix your data model! You should be using the proper datatype to store your data: dates should be stored as date. Using the wrong datatype is a bad idea for many reasons:
whenever you need to do date arithmetic, you need to convert your strings to a date (e.g. how do you add one month to the string '2020-12-16'?); this is highly inefficient
data integrity cannot be enforced at the time your data is stored; even check constraints are not enough. E.g. how can you tell whether '2021-02-29' is a valid date or not?
what would be the most efficient way to do queries? In particular we are looking at querying for ranges.
That said, the format that you are using makes it possible to do direct string comparison. So for a simple range comparison, I would suggest string operations:
where mystring >= '2020-12-10' and mystring < '2020-12-16'
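To illustrate both points outside the database, here is a small Python sketch (the function names are made up for the example). It shows that zero-padded yyyy-MM-dd strings compare in the same order as real dates, which is why the plain string range filter works, and that only an actual date parse can reject a value like '2021-02-29'.

```python
from datetime import date


def in_range(value, start, end):
    """Half-open string comparison, mirroring: value >= start AND value < end."""
    return start <= value < end


def is_valid_iso_date(value):
    """Return True only if value parses as a real calendar date (yyyy-MM-dd)."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


samples = ["2020-12-16", "2020-02-29", "2020-12-10", "2019-07-01"]

# Sorting the strings gives the same order as sorting the parsed dates.
print(sorted(samples) == sorted(samples, key=date.fromisoformat))

print(in_range("2020-12-12", "2020-12-10", "2020-12-16"))
print(is_valid_iso_date("2021-02-29"))  # 2021 is not a leap year
```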
Before even thinking about performance: use the correct data types.
Try, for instance, to find every Friday the 13th in 2020 with your string representation. With dates it is very easy:
CREATE table thisyear(
thedate DATE NOT NULL PRIMARY KEY
);
INSERT INTO thisyear(thedate)
SELECT generate_series('2020-01-01'::date, '2021-01-01'::date -1, '1 day'::interval)
;
SELECT * FROM thisyear
WHERE date_part('dow', thedate) = 5 -- friday
AND date_part('day', thedate) = 13 -- the thirteenth
;
Result:
CREATE TABLE
INSERT 0 366
thedate
------------
2020-03-13
2020-11-13
(2 rows)
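The same check can be reproduced outside PostgreSQL. This is just a Python sketch with an illustrative function name; note that Python's weekday() numbers Friday as 4, whereas date_part('dow', ...) in the query above uses 5.

```python
from datetime import date


def fridays_the_13th(year):
    """Return every 13th of a month in the given year that falls on a Friday."""
    thirteenths = [date(year, month, 13) for month in range(1, 13)]
    # date.weekday(): Monday is 0, so Friday is 4.
    return [d for d in thirteenths if d.weekday() == 4]


for d in fridays_the_13th(2020):
    print(d.isoformat())
```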
I have the following query. This query copies the data from Cosmos DB to Azure Data Lake.
select c.Tag
from c
where
c.data.timestamp >= '@{formatDateTime(addminutes(pipeline().TriggerTime, -15), 'yyyy-MM-ddTHH:mm:ssZ' )}'
However, I have to use _ts, the epoch time at which the document was created in the Cosmos DB collection, instead of c.data.timestamp. How do I convert the epoch time to a date time and compare it with '@{formatDateTime(addminutes(pipeline().TriggerTime, -15), 'yyyy-MM-ddTHH:mm:ssZ' )}'?
I have also tried using
dateadd( SECOND, c._ts, '1970-1-1' ) which clearly isn't supported.
As @Chris said, you could use a UDF in the Cosmos DB query.
udf:
function convertTime(unix_timestamp){
var date = new Date(unix_timestamp * 1000);
return date;
}
sql:
You could merge it into your transfer sql:
select c.Tag
from c
where
udf.convertTime(c._ts) >= '@{formatDateTime(addminutes(pipeline().TriggerTime, -15), 'yyyy-MM-ddTHH:mm:ssZ' )}'
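Another option, not from the answer above, is to go the other way round: convert the cutoff to epoch seconds and compare c._ts as a number, so no UDF is needed in the WHERE clause. Below is a minimal Python sketch of that arithmetic. The helper names are hypothetical, and how you compute the value inside the pipeline itself is left open.

```python
from datetime import datetime, timedelta, timezone


def epoch_cutoff(trigger_time, minutes_back=15):
    """Return (trigger_time - minutes_back) as whole epoch seconds, like _ts."""
    if trigger_time.tzinfo is None:
        # Assumption: naive trigger times are already in UTC.
        trigger_time = trigger_time.replace(tzinfo=timezone.utc)
    return int((trigger_time - timedelta(minutes=minutes_back)).timestamp())


def build_query(cutoff):
    """Build the query text with a numeric comparison on _ts."""
    return f"select c.Tag from c where c._ts >= {cutoff}"


trigger = datetime(2021, 1, 1, 0, 15, 0, tzinfo=timezone.utc)
print(build_query(epoch_cutoff(trigger)))
```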
I intend to process a dataset from EventHub stored in ADLA, in batches. It seems logical to me to process intervals, where my date is between my last execution datetime and the current execution datetime.
I thought about saving the execution timestamps in a table so I can keep track of it, and do the following:
DECLARE @my_file string = @"/data/raw/my-ns/my-eh/{date:yyyy}/{date:MM}/{date:dd}/{date:HH}/{date:mm}/{date:ss}/{*}.avro";
DECLARE @max_datetime DateTime = DateTime.Now;
@min_datetime =
SELECT (DateTime) MAX(execution_datetime) AS min_datetime
FROM my_adldb.dbo.watermark;
@my_json_bytes =
EXTRACT Body byte[],
date DateTime
FROM @my_file
USING new Microsoft.Analytics.Samples.Formats.ApacheAvro.AvroExtractor(@"{""type"":""record"",""name"":""EventData"",""namespace"":""Microsoft.ServiceBus.Messaging"",""fields"":[{""name"":""SequenceNumber"",""type"":""long""},{""name"":""Offset"",""type"":""string""},{""name"":""EnqueuedTimeUtc"",""type"":""string""},{""name"":""SystemProperties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes""]}},{""name"":""Properties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes"",""null""]}},{""name"":""Body"",""type"":[""null"",""bytes""]}]}");
How do I properly add this interval to my EXTRACT query? I tested it using an ordinary WHERE clause with the interval defined by hand and it worked, but when I try to use @min_datetime it doesn't work, because its result is a rowset.
I thought about applying the filter in a subsequent query, but I am afraid this means @my_json_bytes will extract my whole dataset and filter it afterwards, resulting in a suboptimal query.
Thanks in advance.
You should be able to apply the filter as part of a later SELECT. U-SQL can push up predicates in certain conditions but I haven't been able to test this yet. Try something like this:
@min_datetime =
SELECT (DateTime) MAX(execution_datetime) AS min_datetime
FROM my_adldb.dbo.watermark;
@my_json_bytes =
EXTRACT Body byte[],
date DateTime
FROM @my_file
USING new Microsoft.Analytics.Samples.Formats.ApacheAvro.AvroExtractor(@"{""type"":""record"",""name"":""EventData"",""namespace"":""Microsoft.ServiceBus.Messaging"",""fields"":[{""name"":""SequenceNumber"",""type"":""long""},{""name"":""Offset"",""type"":""string""},{""name"":""EnqueuedTimeUtc"",""type"":""string""},{""name"":""SystemProperties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes""]}},{""name"":""Properties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes"",""null""]}},{""name"":""Body"",""type"":[""null"",""bytes""]}]}");
@working =
SELECT *
FROM @my_json_bytes AS j
CROSS JOIN
@min_datetime AS t
WHERE j.date > t.min_datetime;
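The watermark logic itself is easy to check in isolation. Here is a rough Python sketch of the interval filter, assuming each event carries a date field and that the batch covers everything after the last run up to and including the current run (the upper bound comes from the question's description, not from the query above). All names are illustrative.

```python
from datetime import datetime


def select_batch(events, last_run, current_run):
    """Keep events with last_run < date <= current_run (watermark window)."""
    return [e for e in events if last_run < e["date"] <= current_run]


last_run = datetime(2018, 5, 1, 12, 0, 0)      # MAX(execution_datetime) from the watermark table
current_run = datetime(2018, 5, 1, 13, 0, 0)   # plays the role of @max_datetime

events = [
    {"id": 1, "date": datetime(2018, 5, 1, 11, 59, 59)},  # already processed
    {"id": 2, "date": datetime(2018, 5, 1, 12, 30, 0)},   # inside the window
    {"id": 3, "date": datetime(2018, 5, 1, 13, 0, 1)},    # belongs to the next run
]

print([e["id"] for e in select_batch(events, last_run, current_run)])
```

Making the lower bound exclusive and the upper bound inclusive means consecutive runs neither skip nor reprocess an event that lands exactly on a watermark.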
I've been tasked to take a calendar date range value from a form front-end and use it to, among other things, feed a query in a Teradata table that does not have a datetime column. Instead the date is aggregated from two varchar columns: one for year (CY = current year, LY = last year, LY-1, etc), and one for the date with format MonDD (like Jan13, Dec08, etc).
I'm using ColdFusion for the form and result page, so I can build the query dynamically, but I can't think of a good way to do it for all possible cases. Any ideas? Even setting year differences aside, I can't think of anything other than a direct comparison on each day in the range, with a potential ton of separate OR conditions in the query. I'm light on SQL knowledge. Maybe there's a better way to do it in the SQL itself, using some sort of conversion on the two varchar columns to form an actual date that range comparisons could then be made against?
Here is some SQL that will take the VARCHAR date value and perform some basic manipulations on it to get you started:
SELECT CAST(CAST('Jan18'||TRIM(EXTRACT(YEAR FROM CURRENT_DATE)) AS CHAR(9)) AS DATE FORMAT 'MMMDDYYYY') AS BaseDate_
, CASE WHEN Col1 = 'CY'
THEN BaseDate_
WHEN Col1 = 'LY'
THEN ADD_MONTHS(BaseDate_, -12)
WHEN Col1 = 'LY-1'
THEN ADD_MONTHS(BaseDate_, -24)
ELSE BaseDate_
END AS DateModified_
FROM {MyDB}.{MyTable};
The EXTRACT() function allows you to take apart a DATE, TIME, or TIMESTAMP value.
You have to use TRIM() around the EXTRACT to get rid of the whitespace that is added when converting the extracted date part to a CHAR data type. Teradata is funny with dates and often requires a double CAST() to get things sorted out.
The CASE statement simply takes the encoded values you suggested will be used and uses the ADD_MONTHS() function to manipulate the date. Dates are INTEGER in Teradata so you can also add INTEGER values to them to move the date by a whole day. Unlike Oracle, you can't add fractional values to manipulate the TIME portion of a TIMESTAMP. DATE != TIMESTAMP in Teradata.
Rob gave you an SQL approach. Alternatively you can use ColdFusion to generate values for the columns you have. Something like this might work.
sampleDate = CreateDate(2010,4,12); // this simulates user input
yearsBack = year(now()) - year(sampleDate); // difference in calendar years
if (yearsBack is 0)
col1Value = 'CY';
else if (yearsBack is 1)
col1Value = 'LY';
else
col1Value = 'LY-' & (yearsBack - 1); // two years back is 'LY-1', and so on
col2Value = DateFormat(sampleDate, 'mmmdd'); // e.g. 'Apr12'
Then you send col1Value and col2Value to your query as parameters.
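To make the encoding explicit, here is a Python sketch of both directions: turning a real date into the two column values and back again. It assumes the scheme described in the question (CY = current year, LY = last year, LY-1 = the year before that, and so on) and English three-letter month names. The function names are made up for the example.

```python
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def encode(d, today):
    """Map a date to (year code, MonDD), e.g. ('LY-1', 'Apr12')."""
    years_back = today.year - d.year
    if years_back < 0:
        raise ValueError("dates in a future year have no code in this scheme")
    if years_back == 0:
        code = "CY"
    elif years_back == 1:
        code = "LY"
    else:
        code = f"LY-{years_back - 1}"
    return code, f"{MONTHS[d.month - 1]}{d.day:02d}"


def decode(code, mon_dd, today):
    """Map (year code, MonDD) back to a date, relative to today's year."""
    if code == "CY":
        years_back = 0
    elif code == "LY":
        years_back = 1
    elif code.startswith("LY-"):
        years_back = 1 + int(code[3:])
    else:
        raise ValueError(f"unknown year code: {code!r}")
    month = MONTHS.index(mon_dd[:3]) + 1
    return date(today.year - years_back, month, int(mon_dd[3:]))


today = date(2012, 6, 1)  # fixed "today" so the example is reproducible
print(encode(date(2010, 4, 12), today))
print(decode("LY", "Dec08", today))
```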
The goal
I've got a really strange problem. My goal is to compare a dynamic date stored in the database using dbunit (the database is an Oracle one). This dynamic date is today's date (comparing a static date works...)
The experimentation
To compare the dates, I use this very simple code:
@Test
public void simpleDateTest() throws DatabaseUnitException, SQLException {
    // create a date corresponding to today
    Date today = new Date();
    // load a dataset
    IDataSet expectedDataSet = new FlatXmlDataSetBuilder().build(getClass()
            .getResource("/dataset.xml"));
    // replace [TODAY] by the today date
    ReplacementDataSet rDataSet = new ReplacementDataSet(expectedDataSet);
    rDataSet.addReplacementObject("[TODAY]", today);
    // insert the dataset
    DatabaseConnection connection = getConnection();
    DatabaseOperation.CLEAN_INSERT.execute(connection, rDataSet);
    // do a simple query to get some fields, em is the entity manager which uses the same connection as above
    Query q = em
            .createNativeQuery("SELECT ID, MY_DATE FROM MY_TABLE");
    List<Object[]> result = q.getResultList();
    // compare the date with today date
    assertEquals(today.getTime(), ((Date) result.get(0)[1]).getTime());
}
With the following dataset :
<?xml version='1.0' encoding='UTF-8'?>
<dataset>
<MY_TABLE ID="1" MY_DATE="[TODAY]" />
</dataset>
The problem
I don't know why, but the assert fails! When comparing the two dates, there is a difference of a few hundred milliseconds. The error looks something like this:
java.lang.AssertionError: expected:<1358262234801> but was:<1358262234000>
I don't understand how the dates can differ, because they should be the same!
Any clue as to what causes the problem and how to solve it?
The Oracle Date type doesn't have a resolution of milliseconds, it only has a resolution of seconds.
When saving a time to the database the milliseconds are simply stripped off. When you look at the actual value you see that the last three digits are all zero.
Assuming that today is a java.util.Date, the Java Date has millisecond precision. An Oracle date, on the other hand, does not have millisecond precision. So you would expect to lose some precision when you store a java.util.Date in Oracle and retrieve it back.
You could store the data in an Oracle timestamp column-- that will have millisecond precision. Or you could strip off the milliseconds from the Java Date before writing it to the database.
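The numbers in the assertion error show exactly this truncation. As a language-neutral illustration (Python here; the helper name is made up), dropping the millisecond part of the expected epoch value gives the value that came back from the database:

```python
def truncate_to_seconds(epoch_millis):
    """Drop the millisecond part of an epoch-milliseconds value."""
    return epoch_millis - epoch_millis % 1000


expected = 1358262234801  # today.getTime() from the failing assertion
actual = 1358262234000    # value read back from the Oracle DATE column

print(truncate_to_seconds(expected) == actual)
```

Applying the same truncation to the Java Date before inserting it, or before comparing, should make both sides of the assertion agree.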