Creating a session ID and applying it to events - sql

I have event data from an app that helps tell me what people are doing inside my app.
userID | timestamp | name        | value
A      | 1         | Launch      | 23
A      | 3         | ClickButton | Header
B      | 2         | Launch      | 10
B      | 5         | ClickBanner | ad
etc.
I define a session like this: whenever someone has been out of the app for more than 5 minutes, the next entry starts a new session. So if you come back in after 4 minutes, it is still the same session.
I use LAG to select the previous Launch timestamp, add that launch's value (a time in seconds) to it, and then take the difference to the next Launch timestamp. That lets me select the first timestamp of each 'Session'.
Now I need to map each non Launch event back to the session it is a part of so I can easily analyze things such as 'What percent of sessions include an ad click?'
I'm pulling my data using Hive and have not found an efficient way to do this, as my dataset is fairly large.
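One common way to attach a session number to every event is a running sum over a "new session" flag. The sketch below is an illustration under assumptions, not the asker's Hive job: it runs the SQL in SQLite through Python's sqlite3 module, uses a hypothetical events table with made-up rows, treats timestamp as seconds, and starts a new session whenever the gap since the same user's previous event (of any kind) exceeds 300 seconds. Hive also provides LAG and SUM ... OVER, so the same shape of query should carry over, but that is untested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, ts INTEGER, name TEXT)")
# Made-up sample rows; ts is assumed to be in seconds.
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("A", 0, "Launch"), ("A", 60, "ClickButton"), ("A", 200, "ClickBanner"),
    ("A", 1000, "Launch"), ("A", 1030, "ClickButton"),
    ("B", 10, "Launch"), ("B", 100, "ClickBanner"),
])

SESSIONIZE = """
WITH flagged AS (
    SELECT user_id, ts, name,
           -- 1 when this event opens a session: no previous event, or gap > 300 s
           CASE WHEN ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) <= 300
                THEN 0 ELSE 1 END AS is_new
    FROM events
)
SELECT user_id, ts, name,
       -- running count of session starts = session number within the user
       SUM(is_new) OVER (PARTITION BY user_id ORDER BY ts
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_seq
FROM flagged
ORDER BY user_id, ts
"""
rows = conn.execute(SESSIONIZE).fetchall()

# Group event names by (user, session) to answer per-session questions.
sessions = {}
for user_id, ts, name, session_seq in rows:
    sessions.setdefault((user_id, session_seq), set()).add(name)

ad_click_share = sum("ClickBanner" in names for names in sessions.values()) / len(sessions)
```

Because every event row carries (user_id, session_seq), questions such as "what percent of sessions include an ad click?" become a plain GROUP BY on that pair.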

Related

Find current data set using two SQL tables storing separately historical insertions and deletions

Problem
I need to do daily syncs of our latest internal data to an external audit database that does not offer an update interface. In order to update some records, I need to first generate and send in a deletion file to remove those records, and then follow by an insertion file with the same but updated records in it.
An important detail is that all of the records in deletion files must match the external records verbatim, in order to be deleted.
Proposed approach
Currently I use two separate SQL tables to version control what I have inserted/deleted.
Let's say that right now the inserted_records table looks like this:
id | file_version | contract_id | customer_name | start_year
9 | 6 | 1 | Alice | 2015
10 | 6 | 2 | Bob | 2015
11 | 6 | 3 | Charlie | 2015
Accompanied by a separate and empty deleted_records table with identical columns.
Now, if I want to
change the customer_name from Alice to Dave on line id 9
change the start_year for Bob from 2015 to 2020 on line id 10
Two new lines in inserted_records would be generated, line 12 and 13, in turn creating a new insertion file 7.
id | file_version | contract_id | customer_name | start_year
9 | 6 | 1 | Alice | 2015
10 | 6 | 2 | Bob | 2015
11 | 6 | 3 | Charlie | 2015
12 | 7 | 1 | Dave | 2015
13 | 7 | 2 | Bob | 2020
The original column values of lines 9 and 10 are then copied into the previously empty deleted_records table, in turn creating a new deletion file 1.
id | file_version | contract_id | customer_name | start_year
1 | 1 | 1 | Alice | 2015
2 | 1 | 2 | Bob | 2015
Now, if I were to send in the deletion file 1 first followed by the insertion file 7, I would get the result that I wanted.
Question
How can I query the current set of records, considering all insertions and deletions that have occurred? Assume that every record in deleted_records has a match in inserted_records and that, if there are multiple matches, the records with smaller file version numbers are deleted first.
I started by writing a query that returns the latest record per contract_id from inserted_records.
select top 1 with ties *
from inserted_records
order by row_number() over (partition by contract_id order by file_version desc)
This would give me line 11, 12 and 13, which is what I wanted in this particular example. But if we also wanted to delete the record line 11 with Charlie, then my query wouldn't work anymore as it doesn't take deleted_records into account, and I have no idea how to do it in SQL.
Furthermore, my gut tells me that this approach isn't solid, as there are two separate moving parts. Perhaps there is a better approach to solve this?
"How can I query the current set of records"
I don't understand your question. Every SQL query is against the current set of records, if by that you mean the data currently in the database.
I do see a couple of problems.
Unless the table you're deleting from has a key defined, even an exact match on every column risks deleting more than one row.
You're performing an ad hoc update without UPDATE's transaction guarantee. I suppose the table you're updating is otherwise idle, and as a practical matter you don't have to worry about someone else (or you) re-inserting the deleted rows before your inserts arrive. But it's a problem waiting to happen.
If what you're trying to do is produce the set of rows that will be the result of a series of inserts and deletions, you haven't provided enough information to say how that could be done, or even if it's possible. There would have to be some way to uniquely identify rows, so that deletions and insertions can be associated. (They don't match on all columns, after all.) And you'd need some indication of order of operation, because it matters whether INSERT follows or precedes DELETE.
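If the question's own assumptions are taken at face value (a deletion matches an insertion on contract_id, customer_name and start_year, and the smallest file version is deleted first), one possible reading is a multiset difference: number the matching inserted rows oldest first and drop as many as there are matching deletions. The sketch below is only that reading, run in SQLite through Python's sqlite3 module rather than the asker's database; the matching columns and the tie-break on id are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inserted_records (id INTEGER, file_version INTEGER, contract_id INTEGER,
                               customer_name TEXT, start_year INTEGER);
CREATE TABLE deleted_records  (id INTEGER, file_version INTEGER, contract_id INTEGER,
                               customer_name TEXT, start_year INTEGER);
INSERT INTO inserted_records VALUES
    (9, 6, 1, 'Alice', 2015), (10, 6, 2, 'Bob', 2015), (11, 6, 3, 'Charlie', 2015),
    (12, 7, 1, 'Dave', 2015), (13, 7, 2, 'Bob', 2020);
INSERT INTO deleted_records VALUES
    (1, 1, 1, 'Alice', 2015), (2, 1, 2, 'Bob', 2015);
""")

CURRENT_SET = """
WITH ranked AS (
    -- Number identical inserted rows, oldest file version first.
    SELECT i.*,
           ROW_NUMBER() OVER (PARTITION BY contract_id, customer_name, start_year
                              ORDER BY file_version, id) AS rn
    FROM inserted_records i
),
gone AS (
    -- How many copies of each row have been deleted so far.
    SELECT contract_id, customer_name, start_year, COUNT(*) AS n
    FROM deleted_records
    GROUP BY contract_id, customer_name, start_year
)
SELECT r.id, r.file_version, r.contract_id, r.customer_name, r.start_year
FROM ranked r
LEFT JOIN gone g
       ON g.contract_id = r.contract_id
      AND g.customer_name = r.customer_name
      AND g.start_year = r.start_year
WHERE r.rn > COALESCE(g.n, 0)   -- keep only copies not yet consumed by a deletion
ORDER BY r.id
"""
current_rows = conn.execute(CURRENT_SET).fetchall()
```

Note that the equality join silently ignores rows with NULL in any matching column, so nullable columns would need extra handling.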

Use excel and vba pivot tables to summarize data before and after dates

I have an Excel document that is a dumped output of all service tickets (with statuses, assigned to, submitted by, etc.) from our ticket tracking software. There is one row per ticket.
I am trying to make a flexible report generator in VBA that will take in the ticket dump and output a report with a copy of the data in one sheet, a summary in another sheet, and a line graph in a third sheet.
I feel like a pivot table is the perfect approach for this, the only problem is in the summary.
The data from the ticket dump looks something like this:
| Submitted_On | Priority | Title    | Status    | Closed_On  |
| 10/10/2016   | 1        | Ticket 1 | New       |            |
| 10/11/2016   | 1        | Ticket 2 | Fixed     | 11/10/2016 |
| 10/12/2016   | 3        | Ticket 3 | Rejected  | 11/9/2016  |
| 10/15/2016   | 1        | Ticket 4 | In Review |            |
The problem is the way I want the summary to look. Basically, the summary should show how many tickets were open and how many were closed as of exactly midnight on the first of every month within the past three years. In other words, if this report were a time machine, at that exact time X tickets would be open and Y would be closed. Furthermore, the summary table should break that down by priority.
The hard part is that these simulated report dates (first of every month within the last 3 years) are extraneous values and are not within each data row.
So the report would be like this:
|                | Open         | Closed       |
| Reporting Date | P1 | P2 | P3 | P1 | P2 | P3 |
| 1-Oct-2016     | 6  | 10 | 0  | 3  | 2  | 0  |
| 1-Nov-2016     | 4  | 10 | 0  | 5  | 2  | 0  |
| 1-Dec-2016     | 6  | 3  | 0  | 5  | 9  | 0  |
Basically the formula for the Open section would be something like:
priority=1 AND Submitted_On<Reporting Date AND (Closed_On>Reporting Date OR Closed_On="")
and the formula for the closed section would be something like:
priority=1 AND Submitted_On<Reporting Date AND Closed_On<Reporting Date
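To pin down what the two formulas above compute, here is a small sketch in Python rather than VBA, purely as a logic check. It follows the formulas literally (strict comparisons, blank Closed_On treated as still open), uses the four sample tickets with dates read as month/day/year, and the function name summarize is hypothetical.

```python
from collections import Counter
from datetime import date

# (Submitted_On, Priority, Closed_On); None stands for a blank Closed_On cell.
tickets = [
    (date(2016, 10, 10), 1, None),
    (date(2016, 10, 11), 1, date(2016, 11, 10)),
    (date(2016, 10, 12), 3, date(2016, 11, 9)),
    (date(2016, 10, 15), 1, None),
]

def summarize(tickets, reporting_date):
    """Count open and closed tickets per priority as of reporting_date."""
    open_counts, closed_counts = Counter(), Counter()
    for submitted, priority, closed in tickets:
        if not submitted < reporting_date:
            continue  # not yet submitted at the reporting date
        if closed is None or closed > reporting_date:
            open_counts[priority] += 1
        elif closed < reporting_date:
            closed_counts[priority] += 1
    return open_counts, closed_counts

open_dec, closed_dec = summarize(tickets, date(2016, 12, 1))
```

With strict comparisons on both sides, a ticket closed exactly at the reporting instant lands in neither bucket; one of the two comparisons would need to become inclusive if that case matters. Any extra filter (assignee, status) can be applied to the ticket list before calling the function.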
I also need to be able to filter the data, for example so that it only includes tickets assigned to X or only tickets with certain statuses, which is why I don't think a regular sheet with formulas would work.
I thought pivot tables would work but Reporting Date isn't a field.
Do you have any advice as to what I can do to make this report work and be very flexible as far as filtering goes?
Thank you!
P.S. I am using Excel 2010, so I do not have access to queries.

Get the begin of a union of intervals

Disclaimer
While searching for an answer, I found this question, but I couldn't find a way to express the solution in SQL:
Union of intervals
Background
I'm trying to calculate how long the people in the company I work for have been employed. In the database I have (which has been in the company for years and, sadly, cannot be changed), each contract is stored as one line. Each line has a lot of information about the employee and the contract, including a contract creation date, a contract rescission date (or infinity, if still active) and the current contract situation ("active" or "deactivated"). There are, however, two problems that prevent me from simply doing what might seem obvious:
1. People can be "multicontratual", so the same person can have multiple active lines at the same time.
2. Sometimes a transfer deactivates one of a person's contracts and creates a new contract line. These transfers must not be counted as breaks (i.e., I should take both timelines into account). There is, however, no explicit flag for transfers in the database, so it was defined that "it is a transfer if there was any contract rescission in the 60 days before a new contract is created".
When trying to account for the multiple cases that could arise from this scenario (e.g., if the same person had many contracts over time, then no contract for more than 60 days, and then some other contracts, I'd want to start counting from after the "more-than-60-days" period), I found that two rules solve the problem. I need:
The last contract creation where there was no other contract already active at the time (this solves problem 1)
and where there was no other active contract in the 60 days before.
To the DB
To solve the problem, I decided to rearrange the rules. I wanted to take all contracts for which there was no other active contract in the 60 days before their creation, and then take the "MAX()" of them. So, for example, for the following person, I would say she has been active since 1973:
+----------+-----+-----------+-----------+---------------+-----------------+
| CONTRACT | ... | PERSON_ID | STATUS | CREATION_DATE | RESCISSION_DATE |
+----------+-----+-----------+-----------+---------------+-----------------+
| 1 | ... | 1 | deactived | 1973/10/01 | 1999/07/01 |
| 2 | ... | 1 | deactived | 1978/06/01 | 2000/07/01 |
| 3 | ... | 1 | deactived | 2000/08/01 | 2008/06/01 |
| 4 | ... | 1 | active | 2000/08/01 | infinity |
| 5 | ... | 1 | active | 2000/08/01 | infinity |
+----------+-----+-----------+-----------+---------------+-----------------+
I am treating the dates as if they were integers (in fact, they are in the real database). My question is: how could I create a query that returns "1973/10/01"? I.e., how could I get all the "creation_date"s that come at least 60 days after the other contracts end, and that do not fall inside the intervals described by the other lines?
[and, anyway, does this seem the best way to solve the problem? (I don't think so)]
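The rearranged rule can be written as a NOT EXISTS test: a contract starts a new employment period when no earlier-created contract of the same person was still active at any point in the 60 days before its creation. The sketch below is one possible formulation, not a confirmed answer. It assumes integer day numbers (here Python date ordinals), a made-up sentinel INFINITY for open contracts, SQLite as the engine, and a hypothetical second person added to show a gap of more than 60 days.

```python
import sqlite3
from datetime import date

INFINITY = 9_999_999  # assumed integer sentinel for "still active"

def day(y, m, d):
    """Convert a calendar date to an integer day number."""
    return date(y, m, d).toordinal()

contracts = [
    # (contract, person_id, creation_date, rescission_date)
    (1, 1, day(1973, 10, 1), day(1999, 7, 1)),
    (2, 1, day(1978, 6, 1), day(2000, 7, 1)),
    (3, 1, day(2000, 8, 1), day(2008, 6, 1)),
    (4, 1, day(2000, 8, 1), INFINITY),
    (5, 1, day(2000, 8, 1), INFINITY),
    # Hypothetical person with a break longer than 60 days between contracts.
    (6, 2, day(1990, 1, 1), day(1995, 1, 1)),
    (7, 2, day(1996, 1, 1), INFINITY),
]

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE contracts (contract INTEGER, person_id INTEGER,
                                        creation_date INTEGER, rescission_date INTEGER)""")
conn.executemany("INSERT INTO contracts VALUES (?, ?, ?, ?)", contracts)

ACTIVE_SINCE = """
SELECT c.person_id, MAX(c.creation_date)
FROM contracts c
WHERE NOT EXISTS (
    -- an earlier-created contract that was still active within the last 60 days
    SELECT 1 FROM contracts o
    WHERE o.person_id = c.person_id
      AND o.creation_date < c.creation_date
      AND o.rescission_date >= c.creation_date - 60
)
GROUP BY c.person_id
ORDER BY c.person_id
"""
active_since = {pid: date.fromordinal(d) for pid, d in conn.execute(ACTIVE_SINCE)}
```

The comparison subtracts 60 from the creation date instead of adding 60 to the rescission date, so the infinity sentinel never takes part in arithmetic.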

Creating a flattened table/view of a hierarchically-defined set of data

I have a table containing hierarchical data. There are currently ~8 levels in this hierarchy.
I really like the way the data is structured, but performance is dismal when I need to know if a record at level 8 is a child of a record at level 1.
I have PL/SQL stored functions which do these lookups for me, each having a select * from tbl start with ... connect by... statement. This works fine when I'm querying a handful of records, but I'm in a situation now where I need to query ~10k records at once and for each of them run this function. It's taking 2-3 minutes where I need it to run in just a few seconds.
Using some heuristics based on my knowledge of the current data, I can get rid of the lookup function and just do childrecord.key LIKE parentrecord.key || '%' but that's a really dirty hack and will not always work.
So now I'm thinking that for this hierarchically-defined table I need to have a separate parent-child table, which will contain every relationship. For a single chain going from level 1 to level 8 there would be 28 records (8 x 7 / 2), associating 1 with 2, 1 with 3,...,1 with 8 and 2 with 3, 2 with 4,...,2 with 8. And so forth.
My thought is that I would need to have an insert trigger where it will basically run the connect by query and for every match going up the hierarchy it will insert a record in the lookup table. And to deal with old data I'll just set up foreign keys to the main table with cascading deletes.
Are there better options than this? Am I missing another way that I could determine these distant ancestor/descendant relationships more quickly?
EDIT: This appears to be exactly what I'm thinking about: http://evolt.org/working_with_hierarchical_data_in_sql_using_ancestor_tables
So what you want is to materialize the transitive closures. That is, given this application table ...
ID | PARENT_ID
------+----------
1 |
2 | 1
3 | 2
4 | 2
5 | 4
... the graph table would look like this:
PARENT_ID | CHILD_ID
-----------+----------
1 | 2
1 | 3
1 | 4
1 | 5
2 | 3
2 | 4
2 | 5
4 | 5
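The graph table above can be derived from the application table with a recursive query. The sketch below uses a recursive common table expression in SQLite, run through Python's sqlite3 module; the table name nodes is hypothetical, and an Oracle version would use CONNECT BY or its own recursive WITH syntax instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, parent_id INTEGER)")
conn.executemany("INSERT INTO nodes VALUES (?, ?)",
                 [(1, None), (2, 1), (3, 2), (4, 2), (5, 4)])

CLOSURE = """
WITH RECURSIVE closure(parent_id, child_id) AS (
    -- direct parent/child edges
    SELECT parent_id, id FROM nodes WHERE parent_id IS NOT NULL
    UNION ALL
    -- climb one level: the parent's parent is also an ancestor of the child
    SELECT n.parent_id, c.child_id
    FROM closure c
    JOIN nodes n ON n.id = c.parent_id
    WHERE n.parent_id IS NOT NULL
)
SELECT parent_id, child_id FROM closure ORDER BY parent_id, child_id
"""
graph_rows = conn.execute(CLOSURE).fetchall()
```

Rebuilding the whole closure like this is the simple, always-correct baseline; the maintenance question is about avoiding a full rebuild on every change.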
It is possible to maintain a table like this in Oracle, although you will need to roll your own framework for it. The question is whether it is worth the overhead. If the source table is volatile then keeping the graph data fresh may cost more cycles than you will save on the queries. Only you know your data's profile.
I don't think you can maintain such a graph table with CONNECT BY queries and cascading foreign keys. Too much indirect activity, too hard to get right. Also a materialized view is out, because we cannot write a SQL query which will zap the 1->5 record when we delete the source record for ID=4.
So I suggest you read a paper called Maintaining Transitive Closure of Graphs in SQL by Dong, Libkin, Su and Wong. This contains a lot of theory and some gnarly (Oracle) SQL but it will give you the grounding to build the PL/SQL you need to maintain a graph table.
"can you expand on the part about it being too difficult to maintain with CONNECT BY/cascading FKs? If I control access to the table and all inserts/updates/deletes take place via stored procedures, what kinds of scenarios are there where this would break down?"
Consider the record 1->5 which is a short-circuit of 1->2->4->5. Now what happens if, as I said before, we delete the source record for ID=4? Cascading foreign keys could delete the entries for 2->4 and 4->5. But that leaves 1->5 (and indeed 2->5) in the graph table although they no longer represent a valid edge in the graph.
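The stale rows can be reproduced in a few lines. This sketch models the graph table as a set of (parent_id, child_id) pairs in plain Python and mimics a cascading delete on both columns; the helper names are hypothetical, and node 5 is assumed to be left without a parent once ID=4 is gone.

```python
# Closure rows (parent_id, child_id) from the graph table in the answer.
closure = {(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (4, 5)}

def cascade_delete(rows, node_id):
    """Mimic ON DELETE CASCADE on both columns: drop rows that mention node_id."""
    return {(p, c) for p, c in rows if node_id not in (p, c)}

def closure_of(parent_of):
    """Recompute the closure from scratch by walking each node up to its root."""
    rows = set()
    for node in parent_of:
        ancestor = parent_of[node]
        while ancestor is not None:
            rows.add((ancestor, node))
            ancestor = parent_of.get(ancestor)
    return rows

after_cascade = cascade_delete(closure, 4)

# Source table once ID=4 is deleted (assumption: node 5 becomes parentless).
remaining = {1: None, 2: 1, 3: 2, 5: None}
stale = after_cascade - closure_of(remaining)  # rows the cascade failed to remove
```

Whatever ends up in stale is exactly what a maintenance procedure has to clean up beyond what the foreign keys do on their own.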
What might work (I think, I haven't done it) would be to use an additional synthetic key in the source table, like this.
ID | PARENT_ID | NEW_KEY
------+-----------+---------
1 | | AAA
2 | 1 | BBB
3 | 2 | CCC
4 | 2 | DDD
5 | 4 | EEE
Now the graph table would look like this:
PARENT_ID | CHILD_ID | NEW_KEY
-----------+----------+---------
1 | 2 | BBB
1 | 3 | CCC
1 | 4 | DDD
1 | 5 | DDD
2 | 3 | CCC
2 | 4 | DDD
2 | 5 | DDD
4 | 5 | DDD
So the graph table has a foreign key referencing the relationship in the source table which generated it, rather than linking to the ID. Then deleting the record for ID=4 would cascade deletes of all records in the graph table where NEW_KEY=DDD.
This would work if any given ID can only have zero or one parent IDs. But it won't work if it is permissible for this to happen:
ID | PARENT_ID
------+----------
5 | 2
5 | 4
In other words the edge 1->5 represents both 1->2->4->5 and 1->2->5. So, what might work depends on the complexity of your data.

Excel: filtering a time series graph

I have data that looks like the following:
ID | Location | Attendees | StartDate | EndDate
---------------------------------------------
Event1 | Bldg 1 | 10 | June 1 | June 5
Event2 | Bldg 2 | 15 | June 3 | June 6
Event3 | Bldg 1 | 5 | June 3 | June 10
I'd like to create a time series graph showing, for every given date, how many events were active on that date (i.e. started but haven't ended yet). For example, on June 1, there was 1 active event, and on June 4, there were 3 active events.
This should be simple enough to do by creating a new range where my first column consists of consecutive dates, and the second column consists of formulas like the following (I hardcoded June 8 in this example):
=COUNTIFS(Events[StartDate],"<=6/8/2009", Events[EndDate],">6/8/2009")
However, the challenge is that I'd like to be able to dynamically filter the time series graph based on various criteria. For example, I'd like to be able to quickly switch between seeing the above time series only for events in Bldg 1; or for Events with more than 10 attendees. I have at least 10 different criteria I'd like to be able to filter on.
What is the best way to do this? Does Excel have a built-in way to do this, or should I write the filtering code in VBA?
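To make the counting rule and the filtering requirement concrete, here is a sketch in Python rather than Excel or VBA. It applies the same test as the COUNTIFS formula (StartDate <= day and EndDate > day) for each day in a range, with an optional filter passed in as a function. The year 2009 is taken from the hardcoded formula, and the names active_counts and keep are hypothetical.

```python
from datetime import date, timedelta

events = [
    {"id": "Event1", "location": "Bldg 1", "attendees": 10,
     "start": date(2009, 6, 1), "end": date(2009, 6, 5)},
    {"id": "Event2", "location": "Bldg 2", "attendees": 15,
     "start": date(2009, 6, 3), "end": date(2009, 6, 6)},
    {"id": "Event3", "location": "Bldg 1", "attendees": 5,
     "start": date(2009, 6, 3), "end": date(2009, 6, 10)},
]

def active_counts(events, first_day, last_day, keep=lambda event: True):
    """Return {day: number of events active on that day} for events passing keep."""
    counts = {}
    day = first_day
    while day <= last_day:
        counts[day] = sum(
            1 for e in events if keep(e) and e["start"] <= day < e["end"]
        )
        day += timedelta(days=1)
    return counts

all_events = active_counts(events, date(2009, 6, 1), date(2009, 6, 10))
bldg1_only = active_counts(events, date(2009, 6, 1), date(2009, 6, 10),
                           keep=lambda e: e["location"] == "Bldg 1")
```

Each additional criterion is just another keep function, which is the behaviour the chart filter would need to reproduce.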
Apart from the fact that my answer is not programming related: that's a prime example for using a pivot table. Use it to show the data consolidated for e.g. each day. Then you can play around with filtering as you like.
Your question is exactly what pivot tables are made for.