We have
set A
set B
time dimension T
We would like to see
A minus B
B minus A
A intersect B
across Time T
So far we have only used Excel to plot the periods we want to compare.
E.g. let's say the grain of the time dimension is one month.
If we are comparing
2019Q1 vs 2019Q2
2019Q2 vs 2019Q3
2019Q3 vs 2019Q4
we would have 9 Excel files generated (3 comparisons × 3 set operations).
Is there a better way to do this?
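For illustration, a small pandas sketch of how this could be automated, assuming each set is kept as a long-format table with 'period' and 'member' columns (both names are assumptions, not from the question):

import pandas as pd

def compare_sets(a, b, period):
    # a, b: long-format membership tables with columns ['period', 'member'] (assumed layout)
    sa = set(a.loc[a['period'] == period, 'member'])
    sb = set(b.loc[b['period'] == period, 'member'])
    return {'A_minus_B': sa - sb, 'B_minus_A': sb - sa, 'A_intersect_B': sa & sb}

# e.g. compare_sets(df_a, df_b, '2019Q1') for each period, instead of one Excel file per result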
I have a dataset with parallel time series. The column 'A' depends on columns 'B' and 'C'. The order (and the number) of dependent columns can change. For example:
            A   B    C
2022-07-23  1  10  100
2022-07-24  2  20  200
2022-07-25  3  30  300
How should I transform this data, or how should I build the model, so that the order of columns 'B' and 'C' ('A', 'B', 'C' vs 'A', 'C', 'B') doesn't change the result? I know about GCN, but I don't know how to implement it. Maybe there are other ways to achieve this.
UPDATE:
I want to generalize my question and make one more example. Let's say we have a matrix as a single observation (no time series data):
   col1 col2  target
0     1    a      20
1     2    a      30
2     3    b      30
3     4    b      40
I would like to predict one 'target' value per row/instance. Each instance depends on the other instances. The order of rows is irrelevant, and the number of rows in each observation can change.
You are looking for a permutation invariant operation on the columns.
One way of achieving this would be to apply a column-wise operation, followed by a global pooling operation.
How that achieves your goal:
Column-wise operations are permutation equivariant; that is, applying the operation to the columns and then permuting the output is the same as permuting the columns and then applying the operation.
A global pooling operation (e.g., max-pool, avg-pool) across the columns is permutation invariant: the result of an average pool does not depend on the order of the columns.
Applying a permutation invariant operation on top of a permutation equivariant one results in an overall permutation invariant function.
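A toy numeric sketch of this composition: a shared per-column transform (equivariant) followed by average pooling (invariant) gives the same output under any column permutation.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))      # shared per-column transform: 4 features in -> 8 out
X = rng.normal(size=(4, 3))      # toy input: 3 columns, 4 features each

def encode(X):
    # same transform applied to every column (equivariant), then average pooled over columns (invariant)
    return (W.T @ X).mean(axis=1)

perm = [2, 0, 1]
print(np.allclose(encode(X), encode(X[:, perm])))    # True: column order does not change the output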
Additionally, you should look at self-attention layers, which are also permutation equivariant.
What I would try is:
Learn a representation (RNN/Transformer) for a single time series. Apply this representation to A, B and C.
Learn a Transformer-style attention between the representation of A and those of B and C: that is, use the representation of A as the "query" and those of B and C as the "keys" and "values".
This will give you a representation of A that is permutation invariant in B and C.
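A minimal sketch of that query/key/value step with PyTorch's MultiheadAttention (the dimension and tensor names are placeholders):

import torch
import torch.nn as nn

d = 32                                       # representation size from step 1 (placeholder)
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

rep_a  = torch.randn(1, 1, d)                # query: representation of A (batch, 1 token, d)
rep_bc = torch.randn(1, 2, d)                # keys/values: representations of B and C

out, _ = attn(rep_a, rep_bc, rep_bc)         # representation of A attending over {B, C}
# swapping the B and C rows of rep_bc leaves 'out' unchanged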
Update (Aug 3rd, 2022):
For the case of "observations" with varying number of rows, and fixed number of columns:
I think you can treat each row as a "token" (with a fixed dimension = number of columns), and apply a Transformer encoder to predict the target for each "token", from the encoded tokens.
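A possible sketch of that row-as-token setup, assuming purely numeric features per row (names and sizes are placeholders):

import torch
import torch.nn as nn

n_features, d = 2, 32                          # columns per row (fixed), embedding size (placeholder)
embed = nn.Linear(n_features, d)               # each row ("token") -> d-dimensional embedding
layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(d, 1)                         # one target prediction per row

obs = torch.randn(1, 4, n_features)            # one observation with 4 rows; the row count may vary
pred = head(encoder(embed(obs))).squeeze(-1)   # shape (1, 4): one value per row; no positional encoding, so row order does not matter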
My data looks like this:
  Creation Day  Time St1  Time St2
0   28.01.2022  14:18:00  15:12:00
1   28.01.2022  14:35:00  16:01:00
2   29.01.2022  00:07:00  03:04:00
3   30.01.2022  17:03:00  22:12:00
It represents parts being at a given station. What I now need is something that counts how many entries share the same day and hour, e.g. how many parts were at the same station during a given hour.
Here, 2 parts were at Station 1 on the 28th within the 14:00-15:00 timespan.
In the end I want a bar graph that shows production speed. Additionally, later in the project I want to highlight parts that haven't moved for >2 hrs.
Is it practical to create a datetime object for every Station (I have 5 in total)? Or is there a much simpler way to do this?
FYI, I import this data from an Excel sheet.
I found the solution. As they are just strings, I can simply concatenate them and parse the result with pd.to_datetime().
Example:
df["Time St1"] = pd.to_datetime(
df["Creation Day"] + ' ' + df["Time St1"],
infer_datetime_format=False, format='%d.%m.%Y %H:%M:%S'
)
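From there, one possible sketch of the hourly count asked about above (assuming "Time St1" has been parsed as shown):

# count how many parts were at Station 1 within each calendar hour
parts_per_hour = df.groupby(df["Time St1"].dt.floor("h")).size()
parts_per_hour.plot(kind="bar")    # bar graph of production speed per hour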
I imagine that I have several time series like the following, from different "sources":
   time   events
0  1000  1080000
1  2003  2122386
2  3007  3043985
3  4007  3872544
4  5007  4853763
Here, a monotonically increasing count events is sampled every 1000 ms. The sampling is not exact, so most of the timestamps vary from their ideal values by a few ms; e.g., the second point is at 2003 instead of 2000.
I want to sum several of these time series: they will all be sampled at ~1000 ms but may not agree to the exact millisecond. E.g. another time series could be:
   time   events
0  1000  1070000
1  2002  2122486
2  3006  3063985
3  4007  3872544
4  5009  4853763
I'd like something reasonable in terms of the final result. For example, the same number of rows as each of the input dataframes, with a timestamp column equal to the first input's, or to the average of the inputs' times. As long as the inputs are smooth, the output should be too.
I'd suggest DataFrame.reindex() with the 'nearest' method. Example:
import pandas as pd

def combine_datasources(reference_df, extra_dfs, tolerance_ms=100):
    # snap each extra dataframe onto the reference index (nearest timestamp within the tolerance)
    reindexed_df_list = [df.reindex(reference_df.index, method='nearest', tolerance=tolerance_ms) for df in extra_dfs]
    # stack everything and sum the rows that now share the same timestamp
    combined = pd.concat([reference_df, *reindexed_df_list])
    return combined.groupby(combined.index).sum()

combine_datasources(df_a, [df_b])
This code changes the index on the dataframes in the extra_dfs list to match the index of the reference dataframe. Then, it concatenates all of the dataframes together. It uses groupby to do the sum, which requires that the indexes match exactly. The timestamps will be the same as those of the reference dataframe.
Note that if you have data from a time period not covered by the reference dataframe, that data will be dropped.
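For reference, the df_a and df_b used in the call above can be built from the example data in the question:

import pandas as pd

df_a = pd.DataFrame({'events': [1080000, 2122386, 3043985, 3872544, 4853763]},
                    index=pd.Index([1000, 2003, 3007, 4007, 5007], name='time'))
df_b = pd.DataFrame({'events': [1070000, 2122486, 3063985, 3872544, 4853763]},
                    index=pd.Index([1000, 2002, 3006, 4007, 5009], name='time'))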
Here's the output for the dataset in your question:
       events
time
1000  2150000
2003  4244872
3007  6107970
4007  7745088
5007  9707526
I searched a lot, but didn't find an answer to the following question:
Financial data often come as daily data but with missing dates (weekends, bank holidays, ...). I would like to have these data on a truly daily basis, with missing values wherever dates were originally absent.
So far I have done this half-manually in LibreOffice Calc, which takes a lot of time. I haven't found a way to really automate this, as there is no fixed rule for which dates are missing.
Example:
I have:
21/12/18 1
27/12/18 2
28/12/18 3
02/01/19 4
I want:
21/12/18 1
22/12/18
23/12/18
24/12/18
25/12/18
26/12/18
27/12/18 2
28/12/18 3
29/12/18
30/12/18
31/12/18
01/01/19
02/01/19 4
I'm not familiar with LibreOffice Calc. In Excel or Google Sheets, I would use a lookup table.
In one tab of the spreadsheet, I would enter the dates I want in column A. In another tab, I would place the actual data I have. Then, in column B of the first tab, I would look up the value for that day from the data on the second tab.
Assuming 1 is in B2, put a start date in, say, D2 and in E2:
=IFERROR(INDEX(B:B,MATCH(D2,A:A,0)),"")
then copy both down to suit.
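If moving outside the spreadsheet is an option, a rough pandas sketch of the same fill (the 'value' column name is an assumption):

import pandas as pd

# the example data from the question; 'value' is an assumed column name
df = pd.DataFrame({'date': ['21/12/18', '27/12/18', '28/12/18', '02/01/19'],
                   'value': [1, 2, 3, 4]})
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%y')
# reindex onto a complete daily calendar; newly inserted dates get NaN values
full_days = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
daily = df.set_index('date').reindex(full_days)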
I have a dataset that I shaped according to my needs; the dataframe is as follows:
Index      A     B  C  D     ...  Z
Date/Time  1     0  0  0,35  ...  1
Date/Time  0,75  1  1  1     ...  1
The total number of rows is 8878.
What I am trying to do is create a time-series dendrogram (example: the whole A column would be compared to the whole B column across the whole time range).
I am expecting an output like this:
[expected output: a dendrogram figure; source: rsc.org]
I tried to construct the linkage matrix with Z = hierarchy.linkage(X, 'ward')
However, when I plot the dendrogram, it just shows an empty picture.
There is no problem if I compare every time point with every other and plot that, but then the dendrogram becomes far too complicated to read, even in truncated form.
Is there a way to handle each column as a whole time series and compare the columns with SciPy?
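One direction that might work (a sketch, assuming the dataframe described above is named df): transpose the data so each column becomes a single observation before building the linkage.

from scipy.cluster import hierarchy
import matplotlib.pyplot as plt

# df: rows are timestamps, columns are the series A..Z
X = df.T.values                           # one row per series, 8878 time points as features
Z = hierarchy.linkage(X, 'ward')          # clusters the columns, not the individual time points
hierarchy.dendrogram(Z, labels=list(df.columns))
plt.show()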