Lost activities in Aggregated Feed - stream-framework

First of all, a little about what I am trying to achieve: I have built a Django wrapper on top of the stream-framework library. There are 2 feed classes - FlatFeed(RedisFeed) and AggregatedFeed(RedisAggregatedFeed) - so both use Redis to store the feed data. I have also implemented my own aggregator class.
The problem: the generated aggregated feed doesn't contain all activities, while the flat feed has all of them. The use case is: there are 3 users A, B and C. B and C perform some activities, then user A follows B and C. Users B and C keep performing more activities. The flat feed of A contains all the activities of B and C, but the aggregated feed of A has lost some activities.
For example,
B likes products 1, 2, 3, 4
C likes products 5, 6, 7, 8, 9, 10
A follows B
A follows C
flat_feed(A) has all activities, but aggregated_feed(A) only has the likes for products 1, 5 and 8. I repeated this use case several times, and each time only these 3 activities come through.
I have tested my aggregator class implementation on django shell. The output of aggregate and merge function contains all the activities.
Please help!!
Please note that flat feed has correct entries, the missing entries are only in aggregated feeds.

I have found the solution. I rewrote the serialization_id method of AggregatedActivity as:
def serialization_id(self):
    milliseconds = str(int(datetime_to_epoch(self.updated_at) * 1000))
    return milliseconds

If you look at the AggregatedActivity class of the framework, you will notice that the serialization_id, i.e. the unique identifier of the aggregated activity, is calculated from the number of seconds since epoch. This means that activities performed within the same second get the same id and overwrite each other.
You can solve this by redefining serialization_id to be the number of milliseconds since epoch, as above. This should work fine.
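To see why the seconds-based ids collide, here is a minimal sketch; the datetime_to_epoch helper below is a simplified stand-in for stream-framework's function of the same name, not the library's actual implementation:

```python
from datetime import datetime, timedelta

def datetime_to_epoch(dt):
    # Simplified stand-in for stream-framework's helper of the same name
    return dt.timestamp()

def second_id(dt):
    # Seconds-since-epoch id, as in the default AggregatedActivity
    return str(int(datetime_to_epoch(dt)))

def millisecond_id(dt):
    # Milliseconds-since-epoch id, as in the fix above
    return str(int(datetime_to_epoch(dt) * 1000))

t1 = datetime(2021, 4, 1, 12, 0, 0, 100000)    # 12:00:00.100
t2 = t1 + timedelta(milliseconds=400)          # 12:00:00.500, same second

print(second_id(t1) == second_id(t2))            # True: the ids collide
print(millisecond_id(t1) == millisecond_id(t2))  # False: distinct ids
```

Because the key in Redis is the serialization_id, two aggregated activities updated in the same second map to the same key, and the later write silently replaces the earlier one - which matches the "only 3 activities survive" symptom.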

Related

Surrounding Events in KQL or Matching on Multiple Conditions

Coming from an ELK background, Kibana has some nice functionality where you can view the surrounding events of any record you wish (https://www.elastic.co/guide/en/kibana/current/discover-document-context.html), i.e. view the 5 preceding and 5 following events.
Does something like this exist in the Kusto Query Language?
Edit: I should also mention the requirement for this as I realise it might exist, but within a different form.
I'm looking to find several events that all need to have occurred within a specific time period, e.g. the previous 5 minutes.
Example: if EventIDs 1, 2 and 3 show up, I'm not interested. However, if 1, 2, 3 and 4 show up (within X minutes of each other), then I would like my query to pick this up.
Any hints or tips are appreciated.
It seems that a time window join is what I needed: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/join-timewindow

Pandas UDF Facebook Prophet / multiple parameters

I'm trying to scale multiple models with Facebook Prophet and Pandas UDF on spark.
Everything works fine but I'd like to refine the models by giving different parameters to the function.
The function is grouped on the ID column of my dataset which is a combination of country and product.
I would like the function to apply country-specific holidays to the model, in addition to a general seasonality dataframe which I use, for example, to remove the COVID-19 impact from the data. Eventually I would also like to change other parameters (e.g. a different type of growth) depending on the ID value.
Thank you for your kind help.
The way I solved it is by adding another column to the training dataset and then reading the first value of that column for each respective model ID. So, for example, since the data has daily data points for the different IDs, if an ID is related to the US, the new column carries that country value and is used for country-level seasonality:
day, id, value, country
4/1, US-Item1, 10, US
4/1, IT-Item1, 5, IT
4/1, US-Item2, 15, US
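The grouping idea can be sketched with plain pandas (the Spark groupBy().applyInPandas() version works the same way on each group). The COUNTRY_PARAMS mapping and the fit_one name are hypothetical; in the real job, fit_one would be the Pandas UDF that builds a Prophet model with country-specific holidays and growth:

```python
import pandas as pd

# Hypothetical per-country model settings; in practice these would feed
# Prophet(growth=..., holidays=...) inside the UDF
COUNTRY_PARAMS = {"US": {"growth": "linear"}, "IT": {"growth": "logistic"}}

df = pd.DataFrame({
    "day":     ["4/1", "4/1", "4/1"],
    "id":      ["US-Item1", "IT-Item1", "US-Item2"],
    "value":   [10, 5, 15],
    "country": ["US", "IT", "US"],
})

def fit_one(group: pd.DataFrame) -> pd.DataFrame:
    # Each group holds exactly one id; the country is constant within the
    # group, so reading the first value is enough to pick the parameters
    country = group["country"].iloc[0]
    params = COUNTRY_PARAMS[country]
    # ...here the real UDF would fit Prophet on the group and forecast...
    return pd.DataFrame({"id": [group["id"].iloc[0]],
                         "growth": [params["growth"]]})

result = df.groupby("id", group_keys=False).apply(fit_one)
print(result)
```

Since the extra column is constant within each group, it rides along "for free" and lets each model pick its own configuration without changing the grouping key.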

Extracting multiple values from the same column, X, given their shared value in column Y

Before everything, I would like to say thanks for your help! I am currently working on data provided by UK rail; the data I have is currently contained in a relational database on PostgreSQL. My ultimate goal is to extract origin-destination pairs that are served by the railway network.
Right now, entries are stored in rows that each contain one stop/location in a train's journey. For instance, if a train travels from D to G via E and F, then D will be contained in a row and marked as an origin_location, E and F as an intermediate_location and G as a terminating_location. Each of them will also be marked with additional details such as train ID, time of the departure/passage/arrival and date of travel.
Eventually, from these entries I hope to write a query that gives me all possible origin-destination pairs. So for instance, continuing the example above, I would like the query to produce a table that gives me rows as such: D-E; D-F; D-G; E-F; E-G; F-G. The linkages between these locations are based on them sharing a common train identifier that is in another column, "Column Y".
I am at the very start of my research, but I don't expect you to help me find a specific code as a solution. Rather, I just hope you could point me in the direction as to which functions I could read up into using, and we'll see where it goes from there.
Someone has tried to work with the same dataset, albeit building this function with a slightly different purpose - that is, to get the full timetable of all trains running on a particular day (with each train departure/intermediate/terminating location as a row). The function is detailed here: https://github.com/jhumphry/ukraildata_etl/blob/master/sql/mca_get_full_timetable.sql
demo:db<>fiddle
You need a self-join on an ordered stop id (hopefully you have such a column). You join every journey with itself, but pair each stop only with the stops that follow it:
SELECT
    s1.journey_id,
    s1.stop_name,
    s2.stop_name
FROM schedule s1
JOIN schedule s2
  ON s1.journey_id = s2.journey_id
 AND s1.stop_id < s2.stop_id
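Running the self-join against the D-E-F-G example with SQLite (table and column names as in the answer; the journey data is made up) produces exactly the six pairs listed in the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schedule (journey_id TEXT, stop_id INTEGER, stop_name TEXT)")
conn.executemany(
    "INSERT INTO schedule VALUES (?, ?, ?)",
    [("T1", 1, "D"), ("T1", 2, "E"), ("T1", 3, "F"), ("T1", 4, "G")],
)

pairs = conn.execute("""
    SELECT s1.stop_name, s2.stop_name
    FROM schedule s1
    JOIN schedule s2
      ON s1.journey_id = s2.journey_id
     AND s1.stop_id < s2.stop_id
    ORDER BY s1.stop_id, s2.stop_id
""").fetchall()

print(pairs)
# [('D', 'E'), ('D', 'F'), ('D', 'G'), ('E', 'F'), ('E', 'G'), ('F', 'G')]
```

The `s1.stop_id < s2.stop_id` condition is what keeps only forward pairs, so you never get G-D or a stop paired with itself.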

If feature A and feature B are data-driven, and feature A fails for one data row, how do I skip feature B's run for the same row?

In Karate API automation, Feature A runs data-driven scenarios. If a scenario fails for one data row (say DATA ONE), how do I stop a subsequent Feature B, which depends on what Feature A does, from running for that same row? For example, assume Feature A creates customers and Feature B books tickets: if Feature A fails for DATA ONE, I don't want the ticket booking to happen for DATA ONE.
Feature A - should run first and uses Data driven
Sample data used in Feature A(EX: CustomerCreation.feature):
DATA ONE - Scenario fails
DATA TWO - Pass
DATA THREE - Pass
Feature B - should run second and uses Data driven
Sample data used in Feature B (EX: TicketBooking.feature):
DATA ONE - should not run
DATA TWO - Should Pass
DATA THREE - Should Pass
Regardless of what I explained in the comments, since you edited your question, maybe I can give you an answer.
Say you create customers 1, 2 and 3 in Feature A, and customer 1 isn't created but 2 and 3 are.
In Feature B, before you try to book a ticket for each customer, simply check whether that customer exists, with a GET request for example.

Storing data from actual graphs in a database (SQL?)

I'd like to store data from actual graphs. In other words, we might have the following, for example:
paper: smith
finance type: outgoings
time | 0 10 20 30 ... etc
amount | 10 22 31 44 ... etc
I would like to store the variables paper and finance type, and for each, the graph data given by the time-amount pairs. There will be other variables as well (note the above example is fictional).
I'm not here to get complete solutions, although I hardly know anything about databases and would like to get started. When I type 'store data from graph in database' into Google, all I get is information about SQL graph types, nodes, etc. I just need some direction on which tools to use (MySQL or another database type? XML?). I will eventually want to extract the graph data of a person and use that information. Google is not being my friend at the moment, and I don't know who to ask personally.
The database wouldn't be that big, but it will eventually run into thousands of entries.
It is possible to model this in a database, but if you hardly know anything about them, you should start learning a bit about ER schemas, normalization (just up to third normal form) and the basic DDL and DML queries.
Anyway, a possible model with two tables:
CREATE TABLE graphs (
    id           INTEGER PRIMARY KEY,
    paper        TEXT,
    finance_type TEXT
);
CREATE TABLE graphdata (
    id        INTEGER PRIMARY KEY,
    graph_ref INTEGER REFERENCES graphs(id),
    time      INTEGER,
    amount    INTEGER
);
In your table 'graphs', you put one row for each graph you have. You might have a graph for 'smith, outgoings', one for 'smith, incomings', one for 'deloitte, reports'... that would be three rows. The ID is just a counter.
In the table 'graphdata', you put one row for each data point. Again, the ID is just a counter. GRAPH_REF is the ID of the graph in the 'graphs' table that this data point belongs to.
So for your example, you'd have the following graphdata rows:
ID | GRAPH_REF | TIME | AMOUNT
 1 |         1 |    0 |     10
 2 |         1 |   10 |     22
 3 |         1 |   20 |     31
 4 |         1 |   30 |     44
Are you following so far? Now you can make a webpage (or an application - anything you can program that works with SQL; even Excel or Access will do) that gives a user the choice to create a new graph or select an existing one.
Creating a new graph would insert a new row in the 'graphs' table. Then, for each data point, you put a new row in the 'graphdata' table.
When they select an existing graph, you fetch its data points and display them. Maybe they can add/delete points?
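The whole flow (create a graph row, insert its data points, fetch them back) can be sketched with Python's built-in sqlite3, using the two-table layout from the answer; table and column names follow the sketch above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE graphs (
        id INTEGER PRIMARY KEY,
        paper TEXT,
        finance_type TEXT
    );
    CREATE TABLE graphdata (
        id INTEGER PRIMARY KEY,
        graph_ref INTEGER REFERENCES graphs(id),
        time INTEGER,
        amount INTEGER
    );
""")

# Create a new graph: one row in 'graphs'...
cur = conn.execute("INSERT INTO graphs (paper, finance_type) VALUES (?, ?)",
                   ("smith", "outgoings"))
graph_id = cur.lastrowid

# ...then one row per data point in 'graphdata'
points = [(0, 10), (10, 22), (20, 31), (30, 44)]
conn.executemany("INSERT INTO graphdata (graph_ref, time, amount) VALUES (?, ?, ?)",
                 [(graph_id, t, a) for t, a in points])

# Select an existing graph: fetch its data points back
rows = conn.execute(
    "SELECT time, amount FROM graphdata WHERE graph_ref = ? ORDER BY time",
    (graph_id,)).fetchall()
print(rows)  # [(0, 10), (10, 22), (20, 31), (30, 44)]
```

SQLite is a fine starting point for the "thousands of entries" scale mentioned; the same schema carries over to MySQL or PostgreSQL unchanged if the project grows.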