I need to pull a time series from a table that looks roughly like this:
TimeStamp (timestamp), Datapoint (float), Data_source (integer)
So the following query would give me all the data recorded by source 1.
SELECT *
FROM table
WHERE data_source = 1
Now, how do I select rows so that data_source = 1 is prioritized over the other sources? I.e. I don't want duplicates: for each timestamp I want exactly one datapoint, preferably from source 1, but if that isn't available, pick one from another source.
I did this with a subquery that counted the number of source = 1 rows for every row, but that is incredibly slow. There must be a more efficient way to do this? Source 1 is only available for about 3% of the points. There may be multiple other sources for one point, but in general any other source will do.
I'm on MS SQL Server 2008, so T-SQL would be preferred, but I think this problem is quite general.
It sounds like you want to combine your data into a single series, preferring source 1.
How about this:
select timestamp,
       datapoint
from (select t.*,
             min(data_source) over (partition by timestamp) as minDataSource
      from t
     ) t
where data_source = minDataSource
This assumes that "1" is the smallest data source. It calculates the min data source for each time stamp, and then uses the data from that data source.
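If you can't rely on "1" being the smallest data source, a ROW_NUMBER() variant also works on SQL Server 2008 and guarantees exactly one row per timestamp even if several rows share the same source. This is only a sketch; "readings" is a placeholder for your actual table name, the columns are the ones from the question:
-- Prefer source 1, otherwise fall back to the lowest other source.
-- "readings" is a placeholder table name.
SELECT TimeStamp, Datapoint, Data_source
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (
               PARTITION BY TimeStamp
               ORDER BY CASE WHEN Data_source = 1 THEN 0 ELSE 1 END, Data_source
           ) AS rn
    FROM readings t
) t
WHERE rn = 1;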
Brief Summary:
I am currently trying to get a count of completed parts that fall within a specific time range, machine number, operation number, and matches the tool number.
For example:
SELECT Sequence, Serial, Operation, Machine, DateTime, value AS Tool
FROM tbPartProfile
CROSS APPLY STRING_SPLIT(Tool_Used, ',')
ORDER BY DateTime DESC
runs a query which pulls all the instances where a tool has been changed. I am splitting the CSV in the Tool_Used column because there can be multiple tool changes during one operation.
Objective:
This is where the production count comes into play. For example, record 1 has a tool change of 36 on 12/12/2022. I will need to go back into the table and get the number of parts completed that match that Operation/Machine/Tool and fall within the date range.
For example:
SELECT *
FROM tbPartProfile
WHERE Operation = 20 AND Machine = 1 AND Tool_Used LIKE '%36%'
ORDER BY DateTime desc
For example, this query will give me the datetimes at which tools LIKE '36' were changed. I will need to take each of these datetimes, compare it with the previous query, and get the sum of all parts that were run for that TimeRange/Operation/Machine/Tool Used.
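One way to do this in a single pass, if I've understood the goal, is to turn each tool-change row into a date range with LEAD() and then join that range against whatever table records completed parts. The sketch below assumes a hypothetical tbPartCompleted table with Operation, Machine, Tool, Serial and DateTime columns; substitute your real completions table and column names:
-- Sketch only: tbPartCompleted and its columns are assumptions.
WITH ToolChanges AS (
    SELECT Operation, Machine, DateTime, value AS Tool,
           LEAD(DateTime) OVER (
               PARTITION BY Machine, Operation, value
               ORDER BY DateTime
           ) AS NextChange
    FROM tbPartProfile
    CROSS APPLY STRING_SPLIT(Tool_Used, ',')
)
SELECT tc.Operation, tc.Machine, tc.Tool, tc.DateTime AS ChangeStart,
       COUNT(pc.Serial) AS PartsCompleted
FROM ToolChanges tc
LEFT JOIN tbPartCompleted pc
       ON pc.Operation = tc.Operation
      AND pc.Machine   = tc.Machine
      AND pc.Tool      = tc.Tool
      AND pc.DateTime >= tc.DateTime
      AND (pc.DateTime < tc.NextChange OR tc.NextChange IS NULL)
GROUP BY tc.Operation, tc.Machine, tc.Tool, tc.DateTime
ORDER BY tc.DateTime DESC;
Each row of the result is one tool change plus the count of parts completed between that change and the next change of the same tool on the same machine and operation.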
I'm looking to select only one data point from each date in my report. I want to ensure each day is accounted for and has at least one row of information, since we had to do a few different things to move a large data file into our data warehouse (import one large Google Sheet for some data, use Python for daily pulls of some of the other data, and I want to make sure no date was left out), and this data goes from now back through last summer. I could use a COUNT DISTINCT to check that the number of distinct days matches the number of days between the first data point and yesterday (the latest data point), but I want to verify that each individual day is accounted for. I should mention I am in BigQuery. Also, an example of the created_at style is: 2021-02-09 17:05:44.583 UTC
This is what I have so far:
SELECT FIRST(created_at)
FROM `large_table`
ORDER BY created_at
I know FIRST is probably not the best clause for this case, and it's currently acting to grab the very first data point in created_at, but it's just a jumping-off point.
You can use aggregation:
select any_value(lt).*
from large_table lt
group by created_at
order by min(created_at);
Note: This assumes that created_at is a date -- or at least only has one value per date. You might need to convert it to a date:
select any_value(lt).*
from large_table lt
group by date(created_at)
order by min(created_at);
BigQuery equivalent of the query in your question
SELECT created_at
FROM `large_table`
ORDER BY created_at
LIMIT 1
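Since the underlying goal is to verify that no day is missing, you can also generate a calendar with GENERATE_DATE_ARRAY and left-join it against the table; any day that comes back with a zero count is a gap. A sketch, assuming created_at is a TIMESTAMP, that the table really is called large_table, and with '2020-06-01' as a placeholder for "last summer":
-- Sketch: list any calendar day with no rows in large_table.
-- '2020-06-01' is a placeholder start date.
WITH days AS (
  SELECT day
  FROM UNNEST(GENERATE_DATE_ARRAY('2020-06-01', CURRENT_DATE())) AS day
)
SELECT d.day, COUNT(lt.created_at) AS rows_that_day
FROM days d
LEFT JOIN `large_table` lt
  ON DATE(lt.created_at) = d.day
GROUP BY d.day
HAVING COUNT(lt.created_at) = 0
ORDER BY d.day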
I'm working on migrating legacy system data to a new system. I'm trying to migrate the data with history based on the changed date. My current query produces the output below.
Since it's a legacy system, some of the data falls within the same period. I want to group the data based on id and name, and mark the value as an active or inactive record depending on whether the data falls within the same period.
My expected output:
For example, let's take 119 and explain it. One row is marked yellow since it doesn't overlap any of the other rows, but the other two rows overlap during the period 01-Nov-18 to 30-Sep-19.
I need to split the data for the overlapping period and add the value only for the overlapped period. So I need to look at combinations based on date, which means introducing two rows: one for the non-overlapped period, which results in the two rows below,
and another row for the overlapped period.
The same scenario applies to 148324: two rows are introduced, one for the overlapped period and another for the non-overlapped period.
Also, is it possible to get only the non-overlapped data based on some condition? I want to move only the overlapping data to a temp table, and move the non-overlapped data directly to the output table.
I don't think I have a 100% solution, since it's hard to decide which data are right and how to sort them.
This query is based on the lead/lag analytic functions. I had to replace NULL values with suitable sentinel values at the ends of each sequence (far future and far past).
Please try and modify this query; I hope it fits your case.
My table:
Query:
SELECT id, name, value, startdate, enddate,
       CASE WHEN nvl(next_startdate, 29993112) > nvl(prev_enddate, 19900101)
            THEN 'Y' ELSE 'N' END AS active
FROM
(
    SELECT datatable.*,
           lag(enddate) over (partition by id, name order by startdate, value desc) prev_enddate,
           lead(startdate) over (partition by id, name order by startdate, value desc) next_startdate
    FROM datatable
) dt
Results:
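To also cover the last part of the question (moving only the overlapping rows to a temp table and the rest straight to the output table), the same flag can drive both inserts. A rough sketch, assuming the query above is wrapped in a view or WITH clause called overlap_check, and assuming active = 'N' marks the rows involved in an overlap (swap the two conditions if it's the other way around in your data):
-- overlap_check stands for the query above; table names are placeholders.
INSERT INTO temp_overlapping
  SELECT id, name, value, startdate, enddate FROM overlap_check WHERE active = 'N';

INSERT INTO output_table
  SELECT id, name, value, startdate, enddate FROM overlap_check WHERE active = 'Y';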
I have an Athena table with the following fields:
date (str that I'm date_parse()ing to date format)
entity (str, categorical variable)
value (float, the target metric for my analysis)
Each entity has one value per date.
I'm analyzing variance -- specifically, identifying the entitys for which something unusual is happening in the value field. Previously, I was pulling out a single entity's data and doing some simple anomaly detection in Pandas using the ewm functions.
I'm working with a lot of data, though, and it updates daily. So I would prefer not to run the entire ewm time-series analysis on the thousands of entitys in this table every day. My workaround is to try to calculate a z-score using a window function in Athena, then run the more expensive analysis on the top z-scores for any given day. But I can't seem to figure out how to write the query such that the z-score is only calculated with respect to each entity and the relevant day.
Here's my stab at the initial query, which works for a single entity:
with subquery AS
(SELECT date_parse(date, '%Y-%m-%d') AS day,
value,
entity
FROM mytable
WHERE date_parse(date, '%Y-%m-%d') > date_parse('201-01-01', '%Y-%m-%d')
AND entity = 'sample_entity'),
data_with_stddev AS
(SELECT day,
value,
entity,
(value - avg(value)
OVER ()) / (stddev(value)
OVER ()) AS zscore
FROM subquery
ORDER BY 1)
SELECT *
FROM data_with_stddev
WHERE day > date_parse('2019-12-25', '%Y-%m-%d')
ORDER BY zscore desc
The way I've done this in the past is to run a bash script that iterates over all of the entity variables and executes a separate query for each. I'd like to avoid that. Thanks!
The answer is a partition by clause, like this:
...
OVER (PARTITION BY entity ORDER BY day asc)) / (stddev(value)
OVER (PARTITION BY entity ORDER BY day asc)) AS zscore
...
Docs: https://prestodb.io/docs/current/functions/window.html
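Spliced back into the original query (and with the single-entity filter dropped so every entity is scored in one pass), it would look roughly like this:
-- Sketch: same query as above, scored per entity via PARTITION BY.
WITH subquery AS
    (SELECT date_parse(date, '%Y-%m-%d') AS day,
            value,
            entity
     FROM mytable
     WHERE date_parse(date, '%Y-%m-%d') > date_parse('2019-01-01', '%Y-%m-%d')),
data_with_stddev AS
    (SELECT day,
            value,
            entity,
            (value - avg(value) OVER (PARTITION BY entity ORDER BY day asc))
              / (stddev(value) OVER (PARTITION BY entity ORDER BY day asc)) AS zscore
     FROM subquery)
SELECT *
FROM data_with_stddev
WHERE day > date_parse('2019-12-25', '%Y-%m-%d')
ORDER BY zscore desc
Note that the ORDER BY inside the OVER clause implies a running frame (unbounded preceding up to the current row), so each z-score is computed against that entity's history up to that day; drop the ORDER BY if you want the statistics over the entity's whole partition instead.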
In OrientDB I have set up a time series using this use case. However, instead of appending my Vertex as an embedded list to the respective hour, I have opted to just create an edge from the hour to the time-dependent Vertex.
For argument's sake, let's say that each hour has up to 60 time Vertices, each identified by a timestamp. This means I can perform the following query to obtain a specific desired Vertex:
SELECT FROM ( SELECT expand( month[5].day[12].hour[0].out() ) FROM Year WHERE year = 2015) WHERE timestamp = 1434146922
I can see from the use case that I can use UNION to get several specified time branches in one go.
SELECT expand( records ) FROM (
SELECT union( month[3].day[20].hour[10].out(), month[3].day[20].hour[11].out() ) AS records
FROM Year WHERE year = 2015
)
This works fine if you only have a small number of branches, but it doesn't work very well if you want to get all the records for a given time span. Say you wanted to get all the records between:
month[3].day[20].hour[11] -> month[3].day[29].hour[23]
I could iterate through the time span and build a huge union query, but at some point I guess the query would become too long, and my guess is that it wouldn't be very efficient. I could also completely bypass the time branches and query the Vertices directly based on the timestamp.
SELECT FROM Sector WHERE timestamp BETWEEN 1406588622 AND 1406588624
The problem being that you lose all the efficiency gained by the time branches.
By experimenting and reading a bit about data types in OrientDB, I found that the square brackets allow:
filtering by one index, for example out()[0]
filtering by multiple indexes, for example out()[0,2,4]
filtering by ranges, for example out()[0-9]
OPTION 1 (UPDATE):
Using a union to combine multiple time branches is the only option if you don't want to create all the indexes and if your range is small. Here is a query example using union in the documentation.
OPTION 2:
If you always have all the indexes created for your times and you filter on wide ranges, you should filter by ranges. This is more performant than option 1, at the cost of having to create all the indexes you want to filter on. Official documentation about the field part.
This is what the query would look like:
select *
from (
    select expand(hour[0-23].out())
    from (
        select expand(month[3].day[20-29])
        from Year
        where year = 2015
    )
)
where timestamp > 1406588622
I would highly recommend reading this.