Calculating events that lead up to another event using Clickhouse - sql

Hi everyone, I am new to both complex SQL and Clickhouse, so sorry if this seems like a simple question, but I still can't figure it out.
Challenge:
I want to build an event path that leads to a specific event.
You probably have seen these types of analysis in Google Analytics, for instance:
Basically, based on a target "end event", I am looking to build the path taken by the user to arrive at that event.
Suppose I have a table like this:
date
event_name (purchase, add_to_cart, search, browse)
user_id
Now, given the final event, let's say "purchase", I want to understand the common paths users take to arrive at that event, just like the picture above.
How would you go about accomplishing that?
I am using Clickhouse for this analysis.
Thanks
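To make the goal concrete, here is a minimal sketch in Python (not Clickhouse SQL) of the path-building logic being asked about. It assumes rows shaped like the table above, (date, event_name, user_id), and counts the sequences of events that precede each target event per user. The function name paths_to_event and the max_steps cap are hypothetical, introduced only for illustration.

```python
from collections import Counter, defaultdict


def paths_to_event(rows, target="purchase", max_steps=3):
    """Count the event sequences that lead up to each `target` event.

    rows: iterable of (date, event_name, user_id) tuples, as in the question.
    max_steps: hypothetical cap on how many preceding events to keep.
    """
    by_user = defaultdict(list)
    for date, event_name, user_id in rows:
        by_user[user_id].append((date, event_name))

    counts = Counter()
    for events in by_user.values():
        # Chronological order per user; equal dates keep their input order.
        events.sort(key=lambda e: e[0])
        names = [name for _, name in events]
        start = 0  # index just after the previous target event
        for i, name in enumerate(names):
            if name == target:
                # Keep at most max_steps events since the previous target.
                window = names[max(start, i - max_steps):i]
                counts[tuple(window) + (target,)] += 1
                start = i + 1
    return counts


rows = [
    ("2024-01-01 10:00", "search", 1),
    ("2024-01-01 10:05", "add_to_cart", 1),
    ("2024-01-01 10:10", "purchase", 1),
    ("2024-01-01 11:00", "browse", 2),
    ("2024-01-01 11:05", "purchase", 2),
    ("2024-01-02 09:00", "search", 3),
    ("2024-01-02 09:01", "add_to_cart", 3),
    ("2024-01-02 09:02", "purchase", 3),
]
for path, n in paths_to_event(rows).most_common():
    print(" > ".join(path), n)
```

The same idea (collect each user's events in time order, then cut the sequence at the target event) could probably be expressed in Clickhouse with an array aggregate such as groupArray, but that translation is not shown or tested here.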

Related

Eventbrite `expanded` doesn't seem to be working

The organization I work with puts all its events into a series.
I understand I can get all the series and expand them with the expansion series_dates like this:
https://www.eventbriteapi.com/v3/series/*id*/?token=*token*&expand=series_dates
This works nicely and gives me my series with a list of series_dates for all recurring events. Now what I want is a very similar approach, but at the organizational level:
https://www.eventbriteapi.com/v3/organizations/*id*/events/?token=*token*&series_filter=parents&expand=series_dates
This API call returns all of the events that are parent series (perfect), but it does not list the child series_dates, which should be expanded onto the end.
I am starting to believe the API can't handle this request, but any suggestions will be helpful.

Can I use Vue-analytics with a Measurement ID?

I've been looking for a way to easily integrate Google Analytics and Tag Manager into a VueJS project I'm working on. I came across a library called vue-analytics for this, but the tutorials I've seen always talk about setting it up with a Tracking ID, which looks like UA-XXXXXX-X. I don't have one; I have a Measurement ID, which looks like G-XXXXXXX.
So I was wondering whether it makes any difference which one is used with vue-analytics and, if it does, whether it is better to get a Tracking ID or whether there is a way to set it up with a Measurement ID.
Thanks so much for any help!

Database design for recurring events with exceptions

I'm building a system that needs to store/manage different types of events. For simplicity, I will focus on designing a calendar (I'm building something slightly different, but calendar is a good analogy and it's easy to reason about). I'd like to hear about possible database/schema design ideas.
Problem Description
I have a calendar with different types of events (for simplicity's sake, say there is only 1 type of event: Task). A user can add a new event for a particular date, edit it (change some details, like the title, or move it to another date), or delete it. There can be one-time events and recurring events (with different types of recurrence: every X days, every 15th day of the month, every week on Monday; kind of like simple cron). When the user moves a recurring event, all other instances of this event are moved in the same manner (e.g. +3 days). Important part: recurring events can have exceptions. So, for example, let's say I have a recurring event A which is repeated every 7 days. But I want to change its date for next week, so that instead of Tuesday it is assigned to Friday; after that it will still occur on Tuesday. This "exception" event shouldn't be affected when the "parent" event is moved.
Also, every recurring event can have additional info that is related to only 1 particular instance, e.g. I have the same recurring event A repeated every 7 days, I want to add a note for this week's instance that says "X", and I want to add another note for event A next month that says "Y". Those fields are visible only on those single instances.
Ideas
System with regular, one-time events is pretty straightforward so I won't discuss that and focus only on recurring events.
1. One possible solution is the one that resembles OOP: I can have an Event "class" with fields such as start_date, end_date (can be null), recurrence_type (something like enum with possible values of EVERY_X_DAYS, DAY_OF_WEEK, DAY_OF_MONTH) and recurrence_value (say 7). When user adds new recurring event, I just create such Event in the database. When user wants to change 1 occurrence of this event, I add new entry to the DB of the type/class MovedEvent that "inherits" from Event with different date and has additional field related_to that points to the ID (or UUID, if you will) of the Event that it's related to. But at the same time, I need to keep track of all the MovedEvents (otherwise I'd have 2 events displayed in the same week), so I need to have an array moved_events of IDs that point to all MovedEvents.
Disadvantage: every time I want to display the calendar I need to get the Event and select all events from moved_events, which is not optimal if I have a lot of moved events.
2. Another idea is to store every event as a separate record. IMO it's a terrible idea, but I'm mentioning it because it's a possibility. Disadvantages: every time I want to edit the main event (e.g. I want to change the event from occurring "every 7 days" to "every 9 days") I need to change every single occurrence of the event. "Exceptions" (changing a single instance) are easier, though.
SQL/NoSQL? Scale details
I'm using PostgreSQL in my project, but I have basic knowledge of NoSQL databases, and if they are better suited to this kind of problem, I can use one.
Scale: Let's say I have 5k users, and each will have on average 150 events/week, 40% of which can be "exceptions". Therefore I want to design this system to be efficient.
Similar Questions & Other Resources
I've just started reading Martin Fowler's "Recurring Events for Calendars" (http://martinfowler.com/apsupp/recurring.pdf) but I'm not sure if it applies to my problem and if so, how one would design database schema according to this document (suggestions are welcome).
There are similar questions, but I didn't see any mention of "exceptions" (changing 1 event instance without affecting the others). Maybe someone will find these links useful:
Design question: How would you design a recurring event system?
Optimal design for a Database with recurring event
Design option for 'recurring tasks'
Calendar Recurring/Repeating Events - Best Storage Method
What is the best way to represent "Recurring Events" in database?
Sorry for a long question, I wanted to describe the problem well. Yet, I feel that's pretty chaotic, so if you have additional questions, I will happily provide more details. Again, I'd like to hear about possible database/schema design ideas plus any other suggestions. Thank you!
Use iCalendar RRules and ExDates
If it's a recurring event, just store the start/end datetimes and RRules and ExDates for the event.
Use a Materialized View to pre-calculate upcoming actual events, say for the next 30 days or 365 days.
As you are using Postgres, you can use existing Python, Perl, or JavaScript RRule libraries (such as dateutil) inside a Postgres function to calculate future events based on the RRules and ExDates.
UPDATE: check out pg_rrule extension: https://github.com/petropavel13/pg_rrule
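To illustrate how a rule plus a list of exceptions yields concrete occurrences, here is a minimal standard-library sketch. It is a deliberately simplified stand-in, not a real RRULE parser: it only handles "every N days", treats exdates like iCalendar EXDATE (occurrences to skip) and extra_dates like RDATE (one-off occurrences to add). The function and parameter names are hypothetical; a real implementation would more likely lean on an RRule library such as dateutil, as suggested above.

```python
from datetime import date, timedelta


def expand_occurrences(start, interval_days, window_start, window_end,
                       exdates=(), extra_dates=()):
    """Expand an "every N days" rule into concrete dates inside a window.

    exdates: dates to skip (role of EXDATE).
    extra_dates: one-off dates to add (role of RDATE).
    Both window bounds are inclusive.
    """
    skipped = set(exdates)
    result = set()
    current = start
    if current < window_start:
        # Jump straight to the first occurrence on or after window_start.
        missed_days = (window_start - current).days
        steps = -(-missed_days // interval_days)  # ceiling division
        current = current + timedelta(days=steps * interval_days)
    while current <= window_end:
        if current not in skipped:
            result.add(current)
        current += timedelta(days=interval_days)
    for d in extra_dates:
        if window_start <= d <= window_end:
            result.add(d)
    return sorted(result)


# Event A repeats every 7 days from Tuesday 2024-01-02; the 2024-01-09
# occurrence is moved to Friday 2024-01-12 as a one-off exception.
january = expand_occurrences(
    start=date(2024, 1, 2),
    interval_days=7,
    window_start=date(2024, 1, 1),
    window_end=date(2024, 1, 31),
    exdates=[date(2024, 1, 9)],
    extra_dates=[date(2024, 1, 12)],
)
print(january)
```

Because the exception lives in its own lists, changing the parent rule (say, to every 9 days) does not touch the moved occurrence, which matches the "exception" requirement in the question. Per-instance notes could be keyed by occurrence date in a similar way.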

Google Analytics retrieve custom variables statistics

Edit: reworded the question, which was not clear.
New to GA, I'm looking for a way to automatically retrieve statistics on custom variables.
The query would have
a start date and an end date (possibly equal)
a variable name
For instance, a Page-level variable Brand takes only three possible values, that are set by the web server, and seen by the client.
The values are Apple, Google and Microsoft.
The query to Google-Analytics could be something like (pseudo-code), provided that I use an authentication token previously acquired
...getstatistics?myToken=123&variable=Brand&datefrom=20110121&dateto=20110121
And the result could be some XML-like data
<variable>Brand</variable><value>Apple</value><count>3214</count>
<variable>Brand</variable><value>Google</value><count>4321</count>
<variable>Brand</variable><value>Microsoft</value><count>1345</count>
Meaning for instance that the page-level custom variable Brand was set to the value Apple by the web server (and thus seen by the client / sent to GA) 3214 times.
What is the correct way/protocol to query values/statistics from GA, in order to get statistics related to custom variables?
So, this is my understanding of what you're doing:
You're setting page-level custom variables (important technical note: these need to be called before the _trackPageview or some other call, else they won't be tracked.)
Your code might look something like this:
_gaq.push(['_setCustomVar', 2, 'Brand', 'Apple', 3]); // slot 2, name 'Brand', value 'Apple', scope 3 (page-level)
Now, when querying the Google Analytics API, it's important to note that the slot number matters, since the slot you're accessing is explicitly named in the query.
So, to do this, you'd need to set your dimensions to ga:customVarName2 and ga:customVarValue2, and decide which metric you're interested in getting. You mention page views, so you'd use ga:pageviews. (You're by no means limited to pageviews. You can use any metric besides a couple of the AdWords-specific ones.)
This query would return all of the custom variables from this slot, and the number of pageviews associated with them.
You also mentioned you'd want to be able to filter by value.
You'd do that by setting the filter value to something like ga:customVarValue2==Apple.
You can see what a query like that would look like here in the query explorer.
Finally, all Google Analytics API queries require a date range by default, so you can supply whatever start and end dates you need.
All you need to do is decide which library you want to use as interface, and you're set to go.
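As a rough sketch of how those pieces fit together, the following Python builds the query string from the slot number, date range, and optional value filter described above. The parameter names (ids, start-date, end-date, dimensions, metrics, filters) follow the Google Analytics reporting API as I understand it, but the base URL is a placeholder and the profile ID is made up; authentication is left out entirely. The function name is hypothetical.

```python
from urllib.parse import urlencode


def build_custom_var_query(profile_id, slot, start_date, end_date, value=None,
                           base_url="https://example.invalid/analytics/data"):
    """Build a reporting query URL for one custom variable slot.

    base_url is a placeholder (assumption); substitute the endpoint of the
    API version or client library you actually use.
    Dates are given as YYYY-MM-DD strings.
    """
    params = {
        "ids": f"ga:{profile_id}",
        "start-date": start_date,
        "end-date": end_date,
        # The slot number is part of the dimension names themselves.
        "dimensions": f"ga:customVarName{slot},ga:customVarValue{slot}",
        "metrics": "ga:pageviews",
    }
    if value is not None:
        # Restrict the result to a single value, e.g. Apple.
        params["filters"] = f"ga:customVarValue{slot}=={value}"
    return f"{base_url}?{urlencode(params)}"


url = build_custom_var_query("12345", 2, "2011-01-21", "2011-01-21", value="Apple")
print(url)
```

Leaving value unset drops the filter, so the response would list every value seen in that slot (Apple, Google, Microsoft) with its pageview count.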
Google has a handy resource called the Google Analytics Data Explorer that can help answer a lot of your questions by letting you experiment through an interface, as long as you log in with your Google Analytics credentials.
As you add parameters using their tools, the system will automatically build your URL/Query.
If that's not enough, Google also has some Interactive Examples using JavaScript. As with the Data Explorer, you can log in with your Google Analytics credentials and run the examples to see what data would be returned.
These tools are awesome because they help take the guesswork out of figuring out how to target the exact data you're searching for.

Can you get the exact date a user started following another using the twitter API?

Let's say user A follows user B, and B follows A. I want to know the exact date A started following B and vice versa.
Is this information stored on twitter? Can I retrieve it using the API?
To clarify: the point of this question is finding a way to know who followed whom first.
(I'm assuming both A and B deleted the notification e-mails)
No Ignacio, you can't. You can only know who follows whom, not the date the follow started.
Looking at the API, there is no way. There are two calls to get the followers:
User Methods/statuses/followers
and
Social Graph Methods/followers/ids
Neither of them returns dates or even a serial number that would let you see who started following first. Really, there's no indication that Twitter stores this information internally, either in the API or in Twitter's web interface.
This is a very old question, but perhaps some might be interested to know that while you cannot get the date at which someone started following, you can at least infer an "earliest possible following date" from the fact that the list of followers is ordered according to date, and the fact that follower objects come with a created_at timestamp.
Here's a Python function for calculating an "earliest possible following date": https://github.com/BernhardClemm/twitter-follow-dates
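The inference itself can be sketched in a few lines. This is my own minimal reconstruction of the reasoning above, not the code from the linked repository. It assumes the follower list is ordered from most recent follower to oldest (the direction is an assumption) and that each follower comes with an account created_at timestamp. Since nobody can follow before their account exists, and each follower followed no earlier than everyone behind them in the list, the earliest possible follow date is a running maximum of created_at taken from the oldest follower forward. The function name is hypothetical.

```python
from datetime import datetime


def earliest_follow_dates(followers_newest_first):
    """Lower bound on each follower's follow date.

    followers_newest_first: list of (user_id, created_at) tuples, ordered
    from the most recent follower to the oldest (assumed ordering).
    Returns {user_id: earliest possible follow datetime}.
    """
    bounds = {}
    running_max = None
    # Walk from the oldest follower to the newest.
    for user_id, created_at in reversed(followers_newest_first):
        if running_max is None or created_at > running_max:
            running_max = created_at
        # This user followed no earlier than any older follower's account creation.
        bounds[user_id] = running_max
    return bounds


followers = [
    ("c", datetime(2015, 1, 1)),  # most recent follower
    ("b", datetime(2012, 6, 1)),
    ("a", datetime(2014, 3, 1)),  # oldest follower
]
bounds = earliest_follow_dates(followers)
for user_id, bound in bounds.items():
    print(user_id, bound.date())
```

Note that this only gives a lower bound: "b" joined in 2012 but cannot have followed before "a" existed in March 2014, so the bound for "b" is March 2014, and the real follow date may be much later.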
Of course Twitter stores it, because Twitter sorts followers and following lists by the date ;)
It is possible to do this, but impractical. When you call the followers API you can page the results. Each returned object contains next_cursor and prev_cursor items. These refer to the first and last records in the next and previous pages. These values are time based and can be used to calculate the time that the respective users followed you.
It follows that, if you set the page size to 1, you can walk through the list of follower IDs one at a time and the next_cursor value will allow you to derive the follow time for the next record.
This is reasonably simple to implement, however, in practice, you'll very quickly hit Twitter's API rate limit.