Select from with wildcard in a q/KDB where-clause

I use KDB with a tool that lets me set up queries. With this tool, only the parameters (filter values) can be changed through user interaction, not the structure of the query.
I need a query to be run on user action. By default, I want it to select every row, and after the user selects something, I want the query to be filtered by that selection.
For example:
By default:
select from quote where symbol = *
After the user chooses a symbol:
select from quote where symbol = `AAPL
However, the default example doesn't work, because there is no * wildcard in KDB, unlike SQL. How can I get all rows by default then?

Not sure exactly what you're trying to achieve here or what your exact constraints are.
If you're saying that the equals (=) is fixed, then you're restricted to selecting one symbol only... yet you want to return all rows? Technically this can be done in a select statement like so:
select from table where symbol=symbol
but this may not work if your "tool" is expecting an input of type symbol.
To use a wildcard you'd need like instead of equals, but it sounds like you can't control that.
Have you considered that this tool is designed to only allow one symbol filter for a reason? Perhaps returning all rows would be too much data, too slow, or too heavy on memory. It feels like you're trying to hack a shortcut.

You can cast the column in question to a string and use like.
A way to achieve your requirement is to either select all syms (if the user doesn't specify a filter) or select a subset which the user defines. You might implement this like:
q)show myTable:update stamps:dates+'times from ([]syms:`AAPL`GOOG`PRDC`LALA`LOLO;times:5?.z.T;dates:5?.z.D)
syms times dates stamps
----------------------------------------------------------
AAPL 00:48:15.333 2003.04.10 2003.04.10D00:48:15.333000000
GOOG 00:39:29.270 2015.06.14 2015.06.14D00:39:29.270000000
PRDC 00:15:49.627 2016.05.21 2016.05.21D00:15:49.627000000
LALA 01:30:32.099 2015.01.04 2015.01.04D01:30:32.099000000
LOLO 00:31:19.910 2013.12.01 2013.12.01D00:31:19.910000000
q)filter1:{[allSyms;usrSyms] symList:$[all null usrSyms;allSyms;usrSyms]; select from myTable where syms in symList}[exec distinct syms from myTable;]
/filter1 is a projection so you don't have to query the table for all syms each time
q)filter1`GOOG`APPL
syms times dates stamps
----------------------------------------------------------
GOOG 00:39:29.270 2015.06.14 2015.06.14D00:39:29.270000000
q)filter1`GOOG`AAPL
syms times dates stamps
----------------------------------------------------------
AAPL 00:48:15.333 2003.04.10 2003.04.10D00:48:15.333000000
GOOG 00:39:29.270 2015.06.14 2015.06.14D00:39:29.270000000
q)filter1`
syms times dates stamps
----------------------------------------------------------
AAPL 00:48:15.333 2003.04.10 2003.04.10D00:48:15.333000000
GOOG 00:39:29.270 2015.06.14 2015.06.14D00:39:29.270000000
PRDC 00:15:49.627 2016.05.21 2016.05.21D00:15:49.627000000
LALA 01:30:32.099 2015.01.04 2015.01.04D01:30:32.099000000
LOLO 00:31:19.910 2013.12.01 2013.12.01D00:31:19.910000000
q)
I'd prefer to implement something like:
q)filter2:{[usrSyms] ?[myTable;$[all null usrSyms;();enlist(in;`syms;enlist usrSyms)];0b;()]}
q)filter2`
syms times dates stamps
----------------------------------------------------------
AAPL 00:48:15.333 2003.04.10 2003.04.10D00:48:15.333000000
GOOG 00:39:29.270 2015.06.14 2015.06.14D00:39:29.270000000
PRDC 00:15:49.627 2016.05.21 2016.05.21D00:15:49.627000000
LALA 01:30:32.099 2015.01.04 2015.01.04D01:30:32.099000000
LOLO 00:31:19.910 2013.12.01 2013.12.01D00:31:19.910000000
q)filter2`GOOG
syms times dates stamps
----------------------------------------------------------
GOOG 00:39:29.270 2015.06.14 2015.06.14D00:39:29.270000000
q)filter2`GOOG`AAPL
syms times dates stamps
----------------------------------------------------------
AAPL 00:48:15.333 2003.04.10 2003.04.10D00:48:15.333000000
GOOG 00:39:29.270 2015.06.14 2015.06.14D00:39:29.270000000
q)

Related

Is it possible to calculate a cumulative sum or moving average with a TimescaleDB continuous aggregate?

Consider a table with 2 columns:
create table foo
(
ts timestamp,
precipitation numeric,
primary key (ts)
);
with the following data:
ts                   precipitation
2021-06-01 12:00:00  1
2021-06-01 13:00:00  0
2021-06-01 14:00:00  2
2021-06-01 15:00:00  3
I would like to use a TimescaleDB continuous aggregate to calculate a three hour cumulative sum of this data that is calculated once per hour. Using the example data above, my continuous aggregate would contain
ts                   cum_precipitation
2021-06-01 12:00:00  1
2021-06-01 13:00:00  1
2021-06-01 14:00:00  3
2021-06-01 15:00:00  5
I can't see a way to do this with the supported syntax for continuous aggregates. Am I missing something? Essentially, I would like the time bucket to be the preceding x hours, but the calculation to occur hourly.
Good question!
You can do this by calculating a normal continuous aggregate and then doing a window function over it. So calculating a sum() for each hour and then doing sum() as a window function over those hourly buckets would work.
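For the example table in the question, a minimal sketch of that approach might look like the following (assuming foo has been made a hypertable; the view name hourly_precip is a placeholder). Window functions aren't allowed inside a continuous aggregate itself, so the rolling sum happens in the ordinary query over the view:
CREATE MATERIALIZED VIEW hourly_precip
WITH (timescaledb.continuous)
AS SELECT time_bucket('1 hour'::interval, ts) AS bucket,
          sum(precipitation) AS hour_sum
   FROM foo
   GROUP BY 1;

-- rolling three-hour total: the current hourly bucket plus the two preceding ones
SELECT bucket,
       sum(hour_sum) OVER (ORDER BY bucket
                           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS cum_precipitation
FROM hourly_precip
ORDER BY bucket;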
When you get into more complex aggregates like average, standard deviation, percentile approximation or the like, I'd recommend switching over to some of the two-step aggregates we introduced in the TimescaleDB Toolkit. Specifically, I'd look into the statistical aggregates we recently introduced; they can also do this cumulative-sum type of thing. (They only work with DOUBLE PRECISION or things that can be cast to it, i.e. FLOAT, so I'd highly recommend you don't use NUMERIC and instead switch to doubles or floats; it doesn't seem like you really need infinite-precision calculations here.)
You can take a look at some queries I wrote up in this presentation, but it might look something like:
CREATE MATERIALIZED VIEW response_times_five_min
WITH (timescaledb.continuous)
AS SELECT api_id,
time_bucket('1 hour'::interval, ts) as bucket,
stats_agg(response_time)
FROM response_times
GROUP BY 1, 2;
SELECT bucket,
average(rolling(stats_agg) OVER last3),
sum(rolling(stats_agg) OVER last3)
FROM response_times_five_min
WHERE api_id = 32
WINDOW last3 as
(ORDER BY bucket RANGE '3 hours' PRECEDING);

Perdurance of a mean over a threshold

I hope I can make this understandable, sorry if my English isn't perfect.
I have a database composed of dated data (measured every 5 minutes since March 2017).
My boss wants me to work in C# and/or SQL, but I'm still a beginner in those (I've always worked in R).
The goal is to notice moments where the mean (over an hour or more) is above a threshold, and for how long.
I've tried doing this by first computing a moving average:
SELECT DATEPART(YEAR, [Date]), DATEPART(MONTH, [Date]),
       DATEPART(DAY, [Date]), DATEPART(HOUR, [Date]),
       DATEPART(MINUTE, [Date]) AS "minute",
       AVG(Mesure) OVER (ORDER BY DATEPART(YEAR, [Date])
                         ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS "moving_average"
FROM My_data_base
WHERE Code_Prm = 920
I do have to keep the WHERE clause because it's how I select only the values I need to work on.
From here, I don't know if I could find a way to get the "perdurance" of the mean, for example by grouping consecutive rows whose average is above X.
Or should I rely on C# with multiple if conditions to try and get what I want?
Hope this is understandable, thanks.
EDIT:
The data is stored in 3 fields (I don't know if there is a better way to show it):
Date                 Code_prm  Mesure
2017-03-10 11:18:00  920       X
2017-03-10 11:18:00  901       X
2017-03-10 11:18:00  903       X
2017-03-10 11:23:00  920       X
The expected result would be the average over an hour, for example from 11:18 to 12:18, only when the average is above X. (I think I kind of did that with the moving average.)
The next step, and what I'm looking for, is how to know whether the mean stays above X for more than an hour, and then how long it lasts.
"Hour" is any hour, I guess, so 12 rows, as there is a value every 5 minutes, and I'm sure there are no missing values!
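One rough way to get at that, sketched here under assumptions rather than as a tested solution (SQL Server 2012+, the question's My_data_base / Mesure / Code_Prm names, a hypothetical threshold @X, readings every 5 minutes with no gaps): compute the 12-row moving average, flag rows above the threshold, group consecutive flagged rows with the gaps-and-islands trick, and measure each group's span.
DECLARE @X numeric(18, 4) = 10;   -- hypothetical threshold

WITH avgs AS (
    SELECT [Date],
           AVG(Mesure) OVER (ORDER BY [Date]
                             ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS moving_average
    FROM My_data_base
    WHERE Code_Prm = 920
),
flagged AS (
    SELECT [Date], moving_average,
           CASE WHEN moving_average > @X THEN 1 ELSE 0 END AS above,
           -- consecutive rows in the same above/below state share a grp value
           ROW_NUMBER() OVER (ORDER BY [Date])
             - ROW_NUMBER() OVER (PARTITION BY CASE WHEN moving_average > @X THEN 1 ELSE 0 END
                                  ORDER BY [Date]) AS grp
    FROM avgs
)
SELECT MIN([Date]) AS run_start,
       MAX([Date]) AS run_end,
       DATEDIFF(MINUTE, MIN([Date]), MAX([Date])) + 5 AS run_minutes   -- +5: each row covers 5 minutes
FROM flagged
WHERE above = 1
GROUP BY grp
HAVING DATEDIFF(MINUTE, MIN([Date]), MAX([Date])) + 5 >= 60            -- keep runs of an hour or more
ORDER BY run_start;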

Python Pandas: Rounding Datetime down to nearest 15 minute and extracting time with NA in the Column

I have been struggling to convert a datetime to a rounded time, especially when dealing with NAs -- getting a lot of 'Cannot Convert to Int', 'Cannot Convert to Datetime', etc., errors.
Say I have data like so:
Start_Time
2020-01-01 19:43:32
2020-01-11 12:23:12
NA
2020-03-01 06:23:32
This column is queried from a db and automatically imported as datetime64[ns].
To round down, I know I want 15*(minutes//15). So I would think that, trivially, I could do something like (actual syntax might not be correct, but generally speaking, something along the lines of):
pd.to_datetime(hour=df['Start_Time'].dt.hour,minute=15*(df['Start_Time'].dt.minute//15)).dt.time
But no combination of those things seems to work, because of some of the issues below:
Because there are NAs in the original column, the .dt.hour and .dt.minute methods return float values.
I consistently get a float-to-integer conversion error, even when using 'Int32/64'.
A bunch of other methods I've tried have also failed. I did get it to work using [Edit: This actually produces an incorrect answer]:
pd.to_datetime(((df['Start_Time'].dt.hour.fillna(999).astype(int)).astype(str)+':'+ (15*(test['Start_Time'].dt.minute.fillna(999).astype(int)//15)).astype(str)), errors='coerce').dt.time
But performance using this method seems very slow, and the code just looks terrible.
Any help would be appreciated; this will eventually need to be applied to ~200 different datasets, each consisting of 100k+ records, so speed is valuable.
Let's try passing errors='coerce':
df['round_time'] = pd.to_datetime(df['Start_Time'], errors='coerce').dt.floor('15T')
Output:
Start_Time round_time
0 2020-01-01 19:43:32 2020-01-01 19:30:00
1 2020-01-11 12:23:12 2020-01-11 12:15:00
2 NaN NaT
3 2020-03-01 06:23:32 2020-03-01 06:15:00

Identifying trends and classifying using SQL

I have a table xyz with three columns: rcvr_id, mth_id and tpv. rcvr_id is an id given to a customer; mth_id is a column which stores the month number (mth_id is calculated as (year - 1900) * 12 + month number, so for example Dec 2011 will have an mth_id of 1344, Jan 2012 one of 1345, etc.); tpv is a variable which shows the customer's transaction amount.
Example table
rcvr_id  mth_id  tpv
1        1344    23
2        1344    27
3        1344    54
1        1345    98
3        1345    102
... and so on
P.S. If a customer does not have a transaction in a given month, his row for that month won't exist.
Now, the question: based on transactions for the months 1327 to 1350, I need to classify a customer as steady or sporadic.
Here is a description.
The above image is for one customer; I have millions of customers.
How do I go about it? I have no clue how to identify trends in SQL, or rather how to do it the best way possible.
Also, I am working on Teradata.
OK, I have found out how to get the standard deviation. Now the important question is: how do I set a standard deviation limit on my own? I can't just randomly say "if the standard deviation is above 40% he is sporadic, else steady". I thought of calculating the average of the standard deviations of all customers and classifying a customer as sporadic if his is above that, else steady. But I feel there could be better logic.
I would suggest the STDDEV_POP function - a higher value indicates a greater variation in values.
select
rcvr_id, STDDEV_POP(tpv)
from yourtable
group by rcvr_id
STDDEV_POP is the function for Standard Deviation
If this doesn't differentiate enough, you may need to look at regression functions and variance.
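As a rough sketch of the idea you describe (flag a customer as sporadic when his deviation is above the average deviation across all customers), using the table, columns and month range from the question; the labels are placeholders and the WITH/OVER () syntax is assumed to be available on your Teradata version:
WITH per_cust AS (
    SELECT rcvr_id,
           STDDEV_POP(tpv) AS sd
    FROM xyz
    WHERE mth_id BETWEEN 1327 AND 1350
    GROUP BY rcvr_id
)
SELECT rcvr_id,
       sd,
       -- AVG(sd) OVER () is the average spread across all customers
       CASE WHEN sd > AVG(sd) OVER () THEN 'sporadic' ELSE 'steady' END AS classification
FROM per_cust;
Since a big spender naturally has a larger absolute spread, dividing sd by the customer's average tpv (a coefficient of variation) may compare customers of different sizes more fairly.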

How do I reconstruct a historical view?

I am currently exploring Change Data Capture as an option to store temporal databases. It is great because it stores only the deltas and seems like it may solve my problem. When I enabled CDC, a bunch of tables appeared under System Tables.
When querying cdc.dbo_MyTable, I am able to see all the changes that took place on the table. Now, how would I construct a historical view? For instance, if I wanted to see the state of my table as of a particular date, how would I go about doing that? Is that even possible?
It looks like I need to take the log and start applying it over my original table, but I was wondering if there is a built-in way of doing this. Any suggestions?
Some of the use cases I am looking at:
Know the state of the graph at a particular point in time
Given two graphs at different times, know the set of links that are different (this can probably be obtained using an EXCEPT clause after constructing the tables)
It's possible, but not with a built-in way, I'm afraid. You would have to reconstruct the timeline by hand.
Given that the change-tracking tables offer tran_end_time, which is the time at which the new value of the property should be perceived as persisted, you would have to write a query that fetches all the distinct periods of table states, join on the tracked property changes and then pivot (to get a presentation in the same form as the table). Don't forget to union with the table state itself, to obtain the values that have not been changed/tracked, for completeness.
The final result, simplified, should look like
RN PK PropA PropB FromDate ToDate
1 1 'Ver1' 'Ver1' 2012-01-01 09:00 2012-01-02 08:00
2 1 'Ver1' 'Ver2' 2012-01-02 08:00 2012-01-03 07:00
3 1 'Ver2' 'Ver2' 2012-01-03 07:00 *getdate()*
4 2 'Ver1' 'Ver1' 2012-01-01 05:00 2012-01-02 06:00
5 2 'Ver1' 'Ver2' 2012-01-02 06:00 2012-01-03 01:00
6 2 'Ver2' 'Ver2' 2012-01-03 01:00 *getdate()*
Note that getdate() is valid as long as the row wasn't deleted; if it was, it should be substituted with the deletion date.
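A rough sketch of the interval-building step described above, under assumptions: SQL Server 2012+ (for LEAD), a capture instance over dbo.MyTable whose change table CDC names cdc.dbo_MyTable_CT, a key column PK and tracked columns PropA and PropB (all placeholder names), and cdc.lsn_time_mapping used to translate LSNs into tran_end_time:
WITH changes AS (
    SELECT ct.PK, ct.PropA, ct.PropB,
           m.tran_end_time AS FromDate
    FROM cdc.dbo_MyTable_CT AS ct
    JOIN cdc.lsn_time_mapping AS m
      ON m.start_lsn = ct.__$start_lsn
    WHERE ct.__$operation IN (2, 4)   -- 2 = insert, 4 = row image after an update
)
SELECT PK, PropA, PropB, FromDate,
       -- the next change for the same key closes the interval; the current version stays open
       LEAD(FromDate, 1, GETDATE()) OVER (PARTITION BY PK ORDER BY FromDate) AS ToDate
FROM changes;
Rows that were never modified since capture was enabled still have to be UNIONed in from the base table, as noted above, and deleted rows (__$operation = 1) would close their interval with the deletion time instead of GETDATE().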
EDIT, for the 2 use cases.
The first point is easily addressed: it's a matter of constructing the temporal object graph and then filtering:
declare @pointInTime datetime = '20120102 10:00';
select * from Reconstructed_TG where FromDate <= @pointInTime and @pointInTime < ToDate
The second point can be generated easily with the EXCEPT clause, as you point out.
Given the above query:
declare @pointInTimeA datetime = '20120102 10:00';
declare @pointInTimeB datetime = '20120103 01:00';
select * from Reconstructed_TG where FromDate <= @pointInTimeA and @pointInTimeA < ToDate
EXCEPT
select * from Reconstructed_TG where FromDate <= @pointInTimeB and @pointInTimeB < ToDate
Yet the EXCEPT clause only presents the rows that have at least one different column value; I don't know if that information is really meaningful to the human eye. Depending on your needs, a query that works directly on the CDC data may be more appropriate.
You may want to check out Snapshots, which have been built in to SQL Server since 2005.
These will be most useful to you if you only need a few timepoints, but they can help you track all of the tables in a complex database.
These are deltas, so compared to a full copy of a database, snapshots are highly space-efficient: a snapshot requires only enough storage for the pages that change during its lifetime. Generally, snapshots are kept for a limited time, so their size is not a major concern.
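For reference, creating and querying a snapshot looks roughly like this; the snapshot name, logical file name and file path are placeholders, and MyDB stands for the source database:
-- one sparse file is needed per data file of the source database
CREATE DATABASE MyDB_20120102 ON
    ( NAME = MyDB_Data,                              -- logical name of MyDB's data file
      FILENAME = 'C:\Snapshots\MyDB_20120102.ss' )
AS SNAPSHOT OF MyDB;

-- query the table as it looked when the snapshot was taken
SELECT * FROM MyDB_20120102.dbo.MyTable;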
I'm not sure about this, never having done anything like it, but maybe you can add a "changeset" column to the table to keep track of the changes: every time there's a transaction, get the max(changeset) and save the new changes with the next value... Or, if you have a timestamp and want to know the status of your table at a certain time, run queries that filter changes prior to the date you want to check...
(Not sure if I should write this as an answer or a comment... I'm new here)
Anyway, hope it helps...