HiveQL Query to Find a Delta Between Rows if a Condition matches - hive

I have some data in a data lake:
Person | Date | Time | Number of Friends |
Bob | 02/01 | unix_ts1 | 5 |
Kate | 02/01 | unix_ts2 | 2 |
Jill | 02/01 | unix_ts3 | 3 |
Bob | 02/01 | unix_ts3 | 7 |
Kate | 02/02 | unix_ts4 | 10 |
Jill | 01/29 | unix_ts0 | 1 |
I would like to produce a table like so:
Person | Date | Time | Number of Friends DELTA | Found Diff Between
Bob | 02/01 | unix_ts1 | NaN | (5, NaN)
Kate | 02/01 | unix_ts2 | NaN | (2, NaN)
Jill | 02/01 | unix_ts3 | 2 | (3, 1)
Bob | 02/01 | unix_ts3 | 2 | (7, 5)
Kate | 02/02 | unix_ts4 | 8 | (10, 2)
So, I have a table where each row is identified by a person's name and the time at which the data was recorded. I would like a query that finds all rows for "Bob", computes the delta between consecutive timestamps, and returns that delta along with the two values it was computed from. I would like this to happen for each person.
I found a way to do this for a single series of values using the lag() function, but that did not match rows by Person. I also know how to do this in Pandas if I downloaded the data, but I am wondering if there is a way to do it in Hive.
Is there a way to do this? Thank you!

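Use the lag() window function, partitioned by person and ordered by time. Note that concat_ws expects string arguments, so the integer values are cast first:
select person
,`date`
,`time`
,num_friends - lag(num_friends) over (partition by person order by `time`) as delta
,concat_ws(',', cast(num_friends as string), cast(lag(num_friends) over (partition by person order by `time`) as string)) as found_diff_between
from tbl;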
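The per-person lag can be tried out locally with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption; window functions need SQLite 3.25 or newer), and the column names `day`, `ts`, and `prev_friends` as well as the integer timestamps are made up for illustration.

```python
import sqlite3

# SQLite stands in for Hive here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (person TEXT, day TEXT, ts INTEGER, num_friends INTEGER)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [
        ("Bob", "02/01", 1, 5),
        ("Kate", "02/01", 2, 2),
        ("Jill", "02/01", 3, 3),
        ("Bob", "02/01", 3, 7),
        ("Kate", "02/02", 4, 10),
        ("Jill", "01/29", 0, 1),
    ],
)

# PARTITION BY person restarts the lag for every person, so the first
# row of each person has no previous value and yields NULL (None).
query = """
SELECT person, ts, num_friends,
       LAG(num_friends) OVER (PARTITION BY person ORDER BY ts) AS prev_friends,
       num_friends - LAG(num_friends) OVER (PARTITION BY person ORDER BY ts) AS delta
FROM tbl
ORDER BY person, ts
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each person's earliest row comes back with a NULL delta, which corresponds to the NaN rows in the desired output.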

Related

Logic to read multiple rows in a table where flag = 'Y'

Consider the following scenario. I have a Customer table, which includes RowStart and EndDate logic, thus writing a new row every time a field value is updated.
Relevant fields in this table are:
RowStartDate
RowEndDate
CustomerNumber
EmployeeFlag
For this, I'd like to write a query, which will return an employee's period of tenure (EmploymentStartDate, and EmploymentEndDate). I.e. The RowStartDate when EmployeeFlag first became 'Y', and then the first RowStartDate where EmployeeFlag changed to 'N' (Ordered of course, by the RowStartDate asc). There is an additional complexity in that the Flag value may change between Y and N multiple times for a single person, as they may become staff, resign and then be employed again at a later date.
Example table structure is:
| CustomerNo | StaffFlag | RowStartDate | RowEndDate |
| ---------- | --------- | ------------ | ---------- |
| 12 | N | 2019-01-01 | 2019-01-14 |
| 12 | N | 2019-01-14 | 2019-03-02 |
| 12 | Y | 2019-03-02 | 2019-10-12 |
| 01 | Y | 2020-03-13 | NULL |
| 12 | N | 2019-10-12 | 2020-01-01 |
| 12 | Y | 2020-01-01 | NULL |
Output could be something like
| CustomerNo | StaffStartDate | StaffEndDate |
| ---------- | -------------- | ------------ |
| 12 | 2019-03-02 | 2019-10-12 |
| 01 | 2020-03-13 | NULL |
| 12 | 2020-01-01 | NULL |
Any ideas on how I might be able to solve this would be really appreciated.
Make sure you order the rows by ID and by date:
select *
from yourtable
order by CustomerNumber asc,
EmployeeFlag desc,
RowStartDate asc,
RowEndDate asc
This gives you a list of all changes over time per employee.
Next, you want to collapse each pair of rows into a single row with two columns (the two dates become the overall start and end date). Others have done this using the lead() function. For details, see: Merging every two rows of data in a column in SQL Server
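As an alternative to pairing rows with lead(), here is a hedged sketch of a "gaps and islands" approach: LAG() marks every row where the flag changes, a running sum turns those marks into group numbers, and each group of 'Y' rows becomes one tenure period. It runs on Python's built-in sqlite3 module (an assumption; the question does not name a database), and the table and column names are hypothetical.

```python
import sqlite3

# Hypothetical schema modelled on the example table in the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_no TEXT, staff_flag TEXT, "
    "row_start TEXT, row_end TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [
        ("12", "N", "2019-01-01", "2019-01-14"),
        ("12", "N", "2019-01-14", "2019-03-02"),
        ("12", "Y", "2019-03-02", "2019-10-12"),
        ("01", "Y", "2020-03-13", None),
        ("12", "N", "2019-10-12", "2020-01-01"),
        ("12", "Y", "2020-01-01", None),
    ],
)

query = """
WITH flagged AS (
    SELECT customer_no, staff_flag, row_start, row_end,
           LAG(staff_flag) OVER (
               PARTITION BY customer_no ORDER BY row_start
           ) AS prev_flag
    FROM customer
),
grouped AS (
    -- Running count of flag changes: rows of one unbroken run share a grp.
    SELECT customer_no, staff_flag, row_start, row_end,
           SUM(CASE WHEN prev_flag IS NULL OR prev_flag <> staff_flag
                    THEN 1 ELSE 0 END) OVER (
               PARTITION BY customer_no ORDER BY row_start
           ) AS grp
    FROM flagged
)
SELECT customer_no,
       MIN(row_start) AS staff_start,
       -- An open-ended row (NULL end date) keeps the whole period open.
       CASE WHEN COUNT(row_end) = COUNT(*) THEN MAX(row_end) END AS staff_end
FROM grouped
WHERE staff_flag = 'Y'
GROUP BY customer_no, grp
ORDER BY customer_no, staff_start
"""
periods = conn.execute(query).fetchall()
for period in periods:
    print(period)
```

Consecutive 'Y' rows for the same customer are merged into a single period, so a field update that does not touch the flag does not split the tenure.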

FIRST & LAST values in Oracle SQL

I am having trouble querying some data. The table I am trying to pull the data from is a LOG table, where I would like to see changes in the values next to each other (example below)
Table:
+-----------+----+-------------+----------+------------+
| UNIQUE_ID | ID | NAME | CITY | DATE |
+-----------+----+-------------+----------+------------+
| xa220 | 1 | John Smith | Berlin | 2020.05.01 |
| xa195 | 1 | John Smith | Berlin | 2020.03.01 |
| xa111 | 1 | John Smith | München | 2020.01.01 |
| xa106 | 2 | James Brown | Atlanta | 2018.04.04 |
| xa100 | 2 | James Brown | Boston | 2017.12.10 |
| xa76 | 3 | Emily Wolf | Shanghai | 2016.11.03 |
| xa20 | 3 | Emily Wolf | Shanghai | 2016.07.03 |
| xa15 | 3 | Emily Wolf | Tokyo | 2014.02.22 |
| xa12 | 3 | Emily Wolf | null | 2014.02.22 |
+-----------+----+-------------+----------+------------+
Desired outcome:
+----+-------------+----------+---------------+
| ID | NAME | CITY | PREVIOUS_CITY |
+----+-------------+----------+---------------+
| 1 | John Smith | Berlin | München |
| 2 | James Brown | Atlanta | Boston |
| 3 | Emily Wolf | Shanghai | Tokyo |
| 3 | Emily Wolf | Tokyo | null |
+----+-------------+----------+---------------+
I have been trying to use FIRST_VALUE and LAST_VALUE, but I cannot get the desired outcome.
select distinct id,
name,
city,
first_value(city) over (partition by id order by city) as previous_city
from test
Any help is appreciated!
Thank you!
Use the LAG function to get the city for the previous date, and display only the rows where the current city and the result of LAG are different:
WITH cte AS (
SELECT t.*, LAG(CITY, 1, CITY) OVER (PARTITION BY ID ORDER BY "DATE") LAG_CITY
FROM yourTable t
)
SELECT ID, NAME, CITY, LAG_CITY AS PREVIOUS_CITY
FROM cte
WHERE
CITY <> LAG_CITY OR
CITY IS NULL AND LAG_CITY IS NOT NULL OR
CITY IS NOT NULL AND LAG_CITY IS NULL
ORDER BY
ID, "DATE" DESC;
Some comments on how LAG is used and how its values are checked are warranted. We use the three-parameter version of LAG here. The second parameter is the number of records to look back, which in this case is 1 (the default). The third parameter is the default value to use when a given record is the first in its ID partition. In this case, we use the record's own CITY value as the default, so the first record is always filtered out by the WHERE clause and never appears in the result set.
For the WHERE clause above, a matching record is one for which the city and the lag city are different, or one for which one of the two is NULL and the other is not. This is the logic needed to treat a NULL city and a non-NULL city as being different.
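The query above can be checked against the sample data with a small sketch. It runs on Python's built-in sqlite3 module rather than Oracle (an assumption; window functions need SQLite 3.25 or newer), so the table name `log_table` and the column `log_date` are hypothetical. Because two of Emily Wolf's rows share the same date, the sketch also adds `unique_id` as a tie-breaker in the window ordering. That is an assumption about the intended order, not part of the original answer.

```python
import sqlite3

# SQLite stands in for Oracle; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log_table (unique_id TEXT, id INTEGER, name TEXT, "
    "city TEXT, log_date TEXT)"
)
conn.executemany(
    "INSERT INTO log_table VALUES (?, ?, ?, ?, ?)",
    [
        ("xa220", 1, "John Smith", "Berlin", "2020.05.01"),
        ("xa195", 1, "John Smith", "Berlin", "2020.03.01"),
        ("xa111", 1, "John Smith", "München", "2020.01.01"),
        ("xa106", 2, "James Brown", "Atlanta", "2018.04.04"),
        ("xa100", 2, "James Brown", "Boston", "2017.12.10"),
        ("xa76", 3, "Emily Wolf", "Shanghai", "2016.11.03"),
        ("xa20", 3, "Emily Wolf", "Shanghai", "2016.07.03"),
        ("xa15", 3, "Emily Wolf", "Tokyo", "2014.02.22"),
        ("xa12", 3, "Emily Wolf", None, "2014.02.22"),
    ],
)

query = """
WITH cte AS (
    SELECT t.*,
           -- Default to the row's own city so the first row per id is dropped.
           LAG(city, 1, city) OVER (
               PARTITION BY id ORDER BY log_date, unique_id
           ) AS lag_city
    FROM log_table t
)
SELECT id, name, city, lag_city AS previous_city
FROM cte
WHERE city <> lag_city
   OR (city IS NULL AND lag_city IS NOT NULL)
   OR (city IS NOT NULL AND lag_city IS NULL)
ORDER BY id, log_date DESC
"""
changes = conn.execute(query).fetchall()
for change in changes:
    print(change)
```

Without a tie-breaker, the relative order of the two 2014.02.22 rows is not guaranteed, and the Tokyo and NULL rows could swap roles in the result.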

sql including null values in transform statement ms-access

I need help including null values in my query. Let's say I have this table:
table1:
| in/out | year    | name |
| in     | 2011-12 | jim  |
| in     | 2011-12 | tim  |
| in     | 2012-13 | toby |
| out    | 2011-12 | ron  |
| out    | 2012-13 | jim  |
| out    | 2012-13 | joel |
I created this transform statement:
Transform Count(*)
SELECT [in/out] FROM table1 WHERE name = "jim" GROUP BY [in/out]
PIVOT year IN("2011-12", "2012-13");
To get this table:
| in/out | 2011-12 | 2012-13 |
| in     | 1       | 1       |
The thing is I want to include all in/out values even if they are null so for this example I'd want the table to look like this:
| in/out | 2011-12 | 2012-13 |
| in     | 1       | 1       |
| out    |         |         |
Any help would be greatly appreciated. Thanks!

SQL Order by group of specific values

I tried to find a solution, but I may have searched for the wrong things, which is why I need your help :(
There are 6 different categories, with different values. I want to select all of them, but ordered in 2 different groups: the first would contain all categories between 1 and 3, ordered by another value.
In the same query, I want to display the categories between 4 and 6, also ordered by another value.
The best way to explain is to show a before and after of what I would like:
BEFORE
|Category | name |
| 1 | Barney |
| 6 | Ted |
| 6 | Anita |
| 3 | Jessica |
| 2 | Marshall |
| 3 | Lily |
| 4 | Robin |
| 2 | Bryan |
| 5 | Oliver |
AFTER
|Category | name | ----- Alphabetic sort
| 1 | Barney |
| 2 | Bryan |
| 3 | Jessica |
| 3 | Lily |
| 2 | Marshall |
---------------------------Imaginary line which separates the 2 groups: categories 1 2 3 and 4 5 6
| 6 | Anita |
| 5 | Oliver |
| 4 | Robin |
| 6 | Ted |
I hope you understand what I mean!
Thank you for your help!
Try this:
ORDER BY CASE
WHEN category IN (1, 2, 3) THEN 1
WHEN category IN (4, 5, 6) THEN 2
ELSE 3
END,
name
The query uses a CASE expression to group category subsets together: the subset 1, 2, 3 is assigned a value of 1 and hence sorts first. The subset 4, 5, 6 is assigned a value of 2, whereas the rest of the categories get the lowest priority, i.e. the value of 3. Within each group, rows are then sorted by name.
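The CASE-based ordering can be checked against the "before" data with a minimal sketch. It uses Python's built-in sqlite3 module (an assumption; the question does not name a database), and the table name `people` is hypothetical.

```python
import sqlite3

# Hypothetical table holding the "before" data from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (category INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [
        (1, "Barney"), (6, "Ted"), (6, "Anita"),
        (3, "Jessica"), (2, "Marshall"), (3, "Lily"),
        (4, "Robin"), (2, "Bryan"), (5, "Oliver"),
    ],
)

# The CASE expression maps each category to a group number; rows are
# sorted by group first and alphabetically by name inside each group.
query = """
SELECT category, name
FROM people
ORDER BY CASE
             WHEN category IN (1, 2, 3) THEN 1
             WHEN category IN (4, 5, 6) THEN 2
             ELSE 3
         END,
         name
"""
ordered = conn.execute(query).fetchall()
for row in ordered:
    print(row)
```

The result reproduces the "after" listing: categories 1 to 3 sorted by name, followed by categories 4 to 6 sorted by name.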
If you are using MySQL or PostgreSQL you can easily get this by using ORDER BY category>3, name, assuming there are only these two possible groups.
In MySQL, you can also try this in your SELECT statement:
... ORDER BY IF(Category<4,0,1) ASC, name ASC

Create a pivot table from two tables based on dates

I have two MS Access tables sharing a one to many relationship. Their structures are like the following:
tbl_Persons
+----------+------------+-----------+
| PersonID | PersonName | OtherData |
+----------+------------+-----------+
| 1 | PersonA | etc. |
| 2 | PersonB | |
| 3 | PersonC | |
tbl_Visits
+----------+------------+------------+-----------------------
| VisitID | PersonID | VisitDate | dozens of other fields
+----------+------------+------------+-----------
| 1 | 1 | 09/01/13 |
| 2 | 1 | 09/02/13 |
| 3 | 2 | 09/03/13 |
| 4 | 2 | 09/04/13 | etc...
I wish to create a new table based on the VisitDate field, the column headings of which are Visit-n where n is 1 to the number of visits, Visit-n-Data1, Visit-n-Data2, Visit-n-Data3 etc.
MergedTable
+----------+----------+---------------+-----------------+----------+----------------+
| PersonID | Visit1 | Visit1Data1 | Visit1Data2... | Visit2 | Visit2Data1... |
+----------+----------+---------------+-----------
| 1 | 09/01/13 | | | 09/02/13 |
| 2 | 09/03/13 | | | 09/04/13 |
| 3 | etc. | |
I am really not sure how to do this, whether with a SQL query or by using DAO and looping through records and columns. It is essential that there is only one row per PersonID and that all of that person's data appears chronologically in columns.
Start off by ranking the visits with something like:
SELECT PersonID, VisitID,
(SELECT COUNT(VisitID) FROM tbl_Visits AS C
WHERE C.PersonID = tbl_Visits.PersonID
AND C.VisitDate < tbl_Visits.VisitDate) AS RankNumber
FROM tbl_Visits
Use this query as the base for the 'pivot'.
Since you seem to have some visits by the same person on the same day (visits 1 and 2), the WHERE clause needs to be a bit more sophisticated. But I hope you get the basic concept.
Pivoting can be done with multiple LEFT JOINs.
I have not tested this solution, so I am not sure how well it will perform. This is easier to accomplish in SQL Server than in MS Access.
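The rank-then-join idea can be sketched end to end as follows. This runs on Python's built-in sqlite3 module rather than MS Access (an assumption), uses ISO-formatted dates so that string comparison matches chronological order, and adds 1 to the rank so that the first visit is number 1. Only two visit columns are shown; each further visit would need another LEFT JOIN.

```python
import sqlite3

# SQLite stands in for MS Access; the dates are rewritten in ISO format.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_Persons (PersonID INTEGER, PersonName TEXT)")
conn.execute(
    "CREATE TABLE tbl_Visits (VisitID INTEGER, PersonID INTEGER, VisitDate TEXT)"
)
conn.executemany(
    "INSERT INTO tbl_Persons VALUES (?, ?)",
    [(1, "PersonA"), (2, "PersonB"), (3, "PersonC")],
)
conn.executemany(
    "INSERT INTO tbl_Visits VALUES (?, ?, ?)",
    [
        (1, 1, "2013-09-01"),
        (2, 1, "2013-09-02"),
        (3, 2, "2013-09-03"),
        (4, 2, "2013-09-04"),
    ],
)

query = """
WITH ranked AS (
    -- Rank = number of earlier visits by the same person, plus one.
    SELECT v.PersonID, v.VisitDate,
           (SELECT COUNT(*) FROM tbl_Visits AS c
            WHERE c.PersonID = v.PersonID
              AND c.VisitDate < v.VisitDate) + 1 AS RankNumber
    FROM tbl_Visits AS v
)
SELECT p.PersonID, v1.VisitDate AS Visit1, v2.VisitDate AS Visit2
FROM tbl_Persons AS p
LEFT JOIN ranked AS v1 ON v1.PersonID = p.PersonID AND v1.RankNumber = 1
LEFT JOIN ranked AS v2 ON v2.PersonID = p.PersonID AND v2.RankNumber = 2
ORDER BY p.PersonID
"""
merged = conn.execute(query).fetchall()
for row in merged:
    print(row)
```

Persons without visits still get a row, with NULL in every visit column, because the joins start from tbl_Persons.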