I am trying to run a select on a table where the data ranges across multiple days, so it does not conform to the single-day data that the documentation alludes to.
Applying an xbar selection across multiple days obviously results in data that is not ordered, i.e. select last size, last price by 1 xbar time.second on data that includes 2 days would result in:
second  | size price
--------| ----------
00:00:01| 400  555.5
00:00:01| 600  606.0
00:00:02| 400  555.5
00:00:02| 600  606.0
How can one include the date in the selection so that, as is done in pandas, the result remains ordered across multiple days, e.g. 2019-09-26 16:34:40?
Furthermore, how does one achieve this while maintaining a date format that is compatible with pandas once stored in CSV?
NB: It is easiest for us to assist you if you provide code that can replicate a sample of the kind of table that you are working with. Otherwise we need to make assumptions about your data.
Assuming that your time column is of timestamp type (e.g. 2019.09.03D23:11:54.711811000), a simple solution is to xbar by one as a timespan, rather than using the time.second syntax:
select last size, last price by 0D00:00:01 xbar time from data
Using xbar keeps the time column as a timestamp rather than casting it to second type.
If your time column is of some other temporal type then you can still use this method if you have a date column in your table that you can use to cast time to a timestamp. This would look something like:
select last size, last price by 0D00:00:01 xbar date+time from data
I would suggest grouping by both date and second, and then summing them:
update time: date+time from
select last size, last price
by date: `date$time, time: 1 xbar `second$time from data
Alternatively, a shorter and more efficient option is to sum the date and second directly in the group clause:
select last size, last price by time: (`date$time) + 1 xbar `second$time from data
I can't seem to find a (perhaps easy) solution to what I'm trying to accomplish here using SQL and, more importantly, QuestDB. I also find it hard to put my exact question into words, so bear with me.
Input
My real input is different of course but a similar dataset or case is the gas_prices table on the demo page of QuestDB. On https://demo.questdb.io, you can directly write and run queries against some sample database, so it should be easy enough to follow.
The main task I want to accomplish is to find out which month was responsible for the year's highest galon price.
Output
Using the following query, I can get the average galon price per month just fine.
SELECT timestamp, avg(galon_price) as avg_per_month FROM 'gas_prices' SAMPLE BY 1M
timestamp                     avg_per_month
2000-06-05T00:00:00.000000Z   1.6724
2000-07-05T00:00:00.000000Z   1.69275
2000-08-05T00:00:00.000000Z   1.635
...
Then, I get all these monthly averages, group them by year and return the maximum galon price per year by wrapping the above query in a subquery, like so:
SELECT timestamp, max(avg_per_month) as max_per_year FROM (
SELECT timestamp, avg(galon_price) as avg_per_month FROM 'gas_prices' SAMPLE BY 1M
) SAMPLE BY 12M
timestamp                     max_per_year
2000-01-05T00:00:00.000000Z   1.69275
2001-01-05T00:00:00.000000Z   1.767399999999
2002-01-05T00:00:00.000000Z   1.52075
...
Wanted output
I want to know which month was responsible for the maximum price of a year.
Looking at the output of the above query, we see that the maximum galon price for the year 2000 was 1.69275. Which month of the year 2000 had this amount as average price? I'd like to display this month in an additional column.
For the first row, July 2000 is shown in the additional column for year 2000 because it is responsible for the highest average price in 2000. For the second row, it was May 2001 as that month had the highest average price of 2001.
timestamp                     max_per_year     which_month_is_responsible
2000-01-05T00:00:00.000000Z   1.69275          2000-07-05T00:00:00.000000Z
2001-01-05T00:00:00.000000Z   1.767399999999   2001-05-05T00:00:00.000000Z
...
What did I try?
I tried adding a subquery to the SELECT to get a "duplicate" of some sort of the timestamp column, but that is apparently never valid in QuestDB (?), so perhaps the solution is adding even more subqueries in the FROM? Or a UNION?
Who can help me out with this? The data is there in the database and it can be calculated. It's just a matter of getting it out.
I think 'wanted output' can be achieved with window functions.
Please have a look at:
CREATE TABLE electricity (ts TIMESTAMP, consumption DOUBLE) TIMESTAMP(ts);
INSERT INTO electricity
SELECT (x*1000000)::timestamp, rnd_double()
FROM long_sequence(10000000);
SELECT day, ts, max_per_day
FROM
(
SELECT timestamp_floor('d', ts) as day,
ts,
avg_in_15_min as max_per_day,
row_number() OVER (PARTITION BY timestamp_floor('d', ts) ORDER BY avg_in_15_min desc) as rn_per_day
FROM
(
SELECT ts, avg(consumption) as avg_in_15_min
FROM electricity
SAMPLE BY 15m
)
) WHERE rn_per_day = 1
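Applied to the gas_prices table from the question, the same pattern might look roughly like the sketch below. It assumes timestamp_floor accepts a 'y' (year) unit and reuses the column names from the question, so treat it as a starting point rather than a tested query:
-- inner query: monthly averages; middle query: rank the months within each year;
-- outer query: keep only the top-ranked month per year
SELECT year, ts AS which_month_is_responsible, max_per_year
FROM
(
    SELECT timestamp_floor('y', timestamp) AS year,
           timestamp AS ts,
           avg_per_month AS max_per_year,
           row_number() OVER (PARTITION BY timestamp_floor('y', timestamp)
                              ORDER BY avg_per_month DESC) AS rn_per_year
    FROM
    (
        SELECT timestamp, avg(galon_price) AS avg_per_month
        FROM 'gas_prices'
        SAMPLE BY 1M
    )
) WHERE rn_per_year = 1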
I have a BigQuery dataset that is updated at irregular times (it can be once or twice a week, or less often). The data is structured as follows.
id | Column1    | Column2     | data_date (timestamp)
0  | Datapoint0 | Datapoint00 | 2022-01-01
1  | Datapoint1 | Datapoint01 | 2022-01-01
2  | Datapoint2 | Datapoint02 | 2022-01-03
3  | Datapoint3 | Datapoint03 | 2022-01-03
4  | Datapoint4 | Datapoint04 | 2022-02-01
5  | Datapoint5 | Datapoint05 | 2022-02-01
6  | Datapoint6 | Datapoint06 | 2022-02-15
7  | Datapoint7 | Datapoint07 | 2022-02-15
Timestamp is a string in 'YYYY-MM-DD' format.
I want to make a chart and a pivot table in Google Data Studio that automatically filter by the latest datapoints ('2022-02-15' in the example). All the solutions I tried are either sub-optimal or just don't work:
Creating a support column doesn't work, because I would need to mix aggregated and non-aggregated fields (data_date and the latest data_date).
Adding a filter to the charts only lets me specify one specific day, so I would need to edit the chart every time the underlying data is updated.
Using a dropdown filter lets me dynamically filter to whatever date I need, but I consider it suboptimal because I can't have it automatically select the latest date. A date-range filter would be dynamic, but since the update schedule is not regular it may select a range with multiple timestamps or none at all, so it is also sub-optimal.
Honestly I'm out of ideas. I stupidly thought it was possible to add a column saying data_date = (select max(data_date) from dataset), but it seems that is not possible since max needs to work on aggregated data.
One possible solution could be creating a view that contains only the latest data point(s) and referencing that view from Data Studio.
CREATE OR REPLACE VIEW `project_id.dataset_id.table_name` AS
SELECT *
FROM `bigquery-public-data.covid19_ecdc_eu.covid_19_geographic_distribution_worldwide`
ORDER BY date DESC # or timestamp DESC
LIMIT 1
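Since several rows in the question's table share the same data_date, a variant of this idea might keep every row with the latest date rather than a single row. A rough sketch with placeholder project, dataset and table names:
CREATE OR REPLACE VIEW `project_id.dataset_id.latest_rows` AS
SELECT *
FROM `project_id.dataset_id.source_table`
-- data_date is a 'YYYY-MM-DD' string, so MAX() also picks the chronologically latest value
WHERE data_date = (SELECT MAX(data_date) FROM `project_id.dataset_id.source_table`)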
What would be the best way to check whether there is data within a 3-month period, up to a maximum of 600 records, and then repeat for the 3 months before that if 600 hasn't been reached? Also, it's a large table, so querying the whole thing can take a few minutes or completely hang Oracle SQL Developer.
ROWNUM seems to assign row numbers to the whole table before returning the result of the query, so that takes too long. The way we currently do it is to explicitly enter a time period that we guess will contain enough records and then limit the rows to 600. This only takes 5 seconds, but it needs to be changed constantly.
I was thinking of doing a FOR loop through each row, but I'm having trouble storing the number of results outside of the query itself to check whether 600 has been reached.
I was also thinking about creating a data index? But I don't know much about that. Is there a way to sort the data by date before grabbing the whole table that would be faster?
Thank you
check whether there is data within a 3-month period, up to a maximum of 600 records, then repeat for the 3 months before that if 600 hasn't been reached?
Find the latest date and filter to only allow the rows that are within 6 months of it and then fetch the first 600 rows:
SELECT *
FROM   (
         SELECT t.*,
                MAX(date_column) OVER () AS max_date_column
         FROM   table_name t
       )
WHERE  date_column > ADD_MONTHS( max_date_column, -6 )
ORDER BY date_column DESC
FETCH FIRST 600 ROWS ONLY;
If there are 600 or more rows within the latest 3 months then only they will be returned; otherwise the result set will extend back into the preceding 3-month period.
If you intend to repeat the extension over more than two 3-month periods then just use:
SELECT *
FROM table_name
ORDER BY date_column DESC
FETCH FIRST 600 ROWS ONLY;
I was also thinking about creating a data index? But I don't know much about that. Is there a way to sort the data by date before grabbing the whole table that would be faster?
Yes, creating an index on the date column would, typically, make filtering the table faster.
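For example (the index name here is just a placeholder), a basic index on the date column could be created with:
-- index on the column used for filtering and ordering above
CREATE INDEX table_name_date_idx ON table_name (date_column);
With such an index in place, the ORDER BY date_column DESC ... FETCH FIRST 600 ROWS ONLY query can often read just the newest entries instead of scanning and sorting the whole table.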
I'm collecting a date to store in the database but have to account for the fact that the user may not know the exact date. I want to offer the option to only enter the month and year. What would be the best way to store these values on the database?
If I use a Date object without the day and then store it as a date in the database then it defaults to the 1st of the month, which would be inaccurate. I had the idea of potentially storing the day, month and year separately as integers and then having a method on the model that returned a Date object and whether or not the day was accurate (i.e. had been entered by the user and not just defaulted by the system).
It seems a little messy though, is there a better way to do this?
Thanks.
There are multiple solutions available. Choose the one that serves your use case best:
1. Individual date-part fields
You've already experimented with this idea. It would look something like this:
create table t(
year int,
month int,
day int
)
A month of 2017-03 could be represented with (2017, 3, NULL).
Note that all fields are NULL-able. You can make year NOT NULL if you want to require at least some information.
With this model, you must use client logic to construct some kind of date-like object for further use.
The big disadvantage of this model is that it is hard to index. You could index f.ex. make_date(year, coalesce(month, 1), coalesce(day, 1)) but using it in queries is rather inconvenient. Also, to disallow some value compositions, which make no sense (f.ex. given a year and a day, but not a month), you should add a (really long) CHECK constraint too, f.ex.
CHECK (CASE
WHEN year IS NULL THEN month IS NULL AND day IS NULL
ELSE CASE WHEN month IS NULL THEN day IS NULL END
END)
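As a rough illustration of the expression index mentioned above (the index name is chosen just for the example):
-- expression index over the reconstructed date; rows with a NULL year simply index as NULL
create index t_assembled_date_idx
  on t (make_date(year, coalesce(month, 1), coalesce(day, 1)));

-- queries have to repeat the same expression to use the index, f.ex.:
select *
from t
where make_date(year, coalesce(month, 1), coalesce(day, 1))
      between date '2017-01-01' and date '2017-12-31';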
2. Sample date and precision
create type date_precision as enum ('year', 'month', 'day');

create table t(
sample_date date,
sample_precision date_precision -- f.ex. 'year', 'month' or 'day'
)
A month of 2017-03 could be represented with ('2017-03-28', 'month').
This doesn't require long CHECK constraints, but it is fairly hard to select by dates if sample_date is truly just a sample (f.ex. when the whole month of 2017-03 should be represented in a row, a sample date could even be 2017-03-28). When you use the first date as sample_date (from the values it can take, based on sample_precision) things will get slightly easier. But then the following CHECK constraint would be needed for integrity:
CHECK (date_trunc(sample_precision::text, sample_date)::date = sample_date)
(More on improving this further, later.)
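As a sketch of what selecting by date could look like under this convention (treating sample_date as the first date of the represented period):
-- rows whose represented period covers 2017-03-15
select *
from t
where sample_date = date_trunc(sample_precision::text, date '2017-03-15')::date;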
3. Possible range
You can store a possible range of dates. With either a possible_start and possible_end or with PostgreSQL's daterange type.
create table t(
possible_start date,
possible_end date,
-- or
possible_range daterange
)
A month of 2017-03 could be represented with ('2017-03-01', '2017-03-31').
In this model when possible_start = possible_end then the date value is exact. You could query two different things now:
Which rows certainly happen around the given date(s) (contains)
Which rows possibly happen around the given date(s) (overlaps)
Both of these types of queries can use indexes with daterange.
The beauty of this is that you are not limited to month ranges. You can use literally any length of ranges. Its only drawback is that the range must be contiguous.
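A minimal sketch of the index and the two query types, assuming the daterange variant of the table above (the index name is illustrative):
-- GiST index that both containment and overlap queries can use
create index t_possible_range_idx on t using gist (possible_range);

-- rows that certainly cover a given date (contains)
select * from t where possible_range @> date '2017-03-15';

-- rows that possibly fall within a given period (overlaps)
select * from t where possible_range && daterange('2017-03-01', '2017-04-01');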
2. + 3. ?
There is a variant, which has all of 3.'s advantages, but looks like 2. with the interval type:
create table t(
possible_start date,
possible_length interval day
)
A month of 2017-03 could be represented with ('2017-03-01', '1 month').
(The day qualifier restricts the minimum precision of the interval to be a day. It is not required for timestamp or timestamptz based solutions.)
The last possible date could be represented with (possible_start + possible_length - interval '1 day')::date. Or, the whole range as (for a daterange index): daterange(possible_start, (possible_start + possible_length)::date) (ranges are implicitly exclusive on their end).
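For example (a sketch; the date literal is arbitrary), a containment query against this variant could build the range on the fly:
-- rows whose possible period covers 2017-03-15
select *
from t
where daterange(possible_start, (possible_start + possible_length)::date) @> date '2017-03-15';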
http://rextester.com/AWIO2403
You can store the date in one column and the precision in another; to compare values, you can use date_trunc(precision_column, date_column), e.g.:
t=# create table so36(d date,p text);
CREATE TABLE
t=# insert into so36 select now(),'day';
INSERT 0 1
t=# insert into so36 select '2017-03-01','month';
INSERT 0 1
t=# select *,date_trunc(p,d),date_trunc(p,now()) from so36;
     d      |   p   |       date_trunc       |       date_trunc
------------+-------+------------------------+------------------------
 2017-03-28 | day   | 2017-03-28 00:00:00+00 | 2017-03-28 00:00:00+00
 2017-03-01 | month | 2017-03-01 00:00:00+00 | 2017-03-01 00:00:00+00
(2 rows)
This is what my data model looks like:
Id  Status   StartDate
1   StatusA  01/01/2015
1   StatusB  01/03/2015
1   StatusC  01/05/2015
2   StatusA  01/04/2015
2   StatusB  01/08/2015
I am trying to get the max date for StatusB per Id.
This is how my dimension looks like:
=If(Match(Status,'StatusB'),Timestamp(StartDate))
It works fine, but it also gives me an additional duplicate row with an empty max date.
My straight table chart contains only these 2 columns. If I remove the Max Date dimension, it shows one record per Id.
What am i missing here?
No need to add the filter in the dimension. QV allows calculated dimensions, but they can cause a lot of performance issues (basically, when calculated dimensions are used, QV creates a new "virtual" table in memory with the new dimension; with big datasets this can drain your RAM/CPU).
For these cases it's much better to "control" the dimension through the expressions. In your case just add Status as a dimension and type the following expression:
max( {< Status = {'StatusB'} >} StartDate)
And in Numbers tab change the format setting to Timestamp.
Stefan
You can also try something like this:
aggr(max(StartDate), Status, StartDate)