I am new to Hive and SQL, and I am having difficulty writing a Hive query. I need to compute rankings of visits to URLs on weekends, split between 08:00 and 17:59 (working hours) and 18:00 and 07:59 (free hours). I have a database that stores userId and a timestamp (when the user visited the url). I need to select the url together with these two visit rankings, for working hours and free hours on weekends, and I don't understand how to do that. I suppose I have to use WHERE to filter rows whose timestamps fall on a weekend, but I don't understand how to compute the two visit rankings.
Example of the row in db:
1234543 1419638963 site.com
where
userId = 1234543
unix timestamp = 1419638963
url = site.com
I would be very grateful for any help!
You can use the SQL below.
I am a little confused about the rank requirement, because a rank can be based on many things. I simply ordered the data by the visit counts; see if it suits your requirement.
I used from_unixtime(ux_time) to get the date and hour from your data. overall_rank gives an overall rank based on both counts.
select
url,
cnt_working,
cnt_free,
row_number() over (order by cnt_working desc, cnt_free desc) overall_rank
from
(
select
sum(case when hour(from_unixtime(ux_time)) >=8 and hour(from_unixtime(ux_time)) < 18 then 1 else 0 end ) cnt_working,
sum(case when hour(from_unixtime(ux_time)) >=8 and hour(from_unixtime(ux_time)) < 18 then 0 else 1 end ) cnt_free,
url
from mytable
where extract(dayofweek from from_unixtime(ux_time)) IN (1,7) -- to include only Sun, Sat
group by url
) rs
order by 2,3
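If you want to experiment with this logic outside Hive, here is a minimal sketch of the same idea in SQLite (run from Python), with made-up sample rows: strftime('%H', ..., 'unixepoch') stands in for Hive's hour(from_unixtime(...)), and strftime('%w') IN ('0','6') selects Sunday/Saturday.

```python
import sqlite3
from datetime import datetime, timezone

def ts(s):
    # helper: UTC ISO string -> unix timestamp
    return int(datetime.fromisoformat(s).replace(tzinfo=timezone.utc).timestamp())

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (user_id INT, ux_time INT, url TEXT)")
con.executemany("INSERT INTO mytable VALUES (?, ?, ?)", [
    (1, ts("2014-12-27 10:00:00"), "site.com"),   # Saturday, working hours
    (2, ts("2014-12-27 11:00:00"), "site.com"),   # Saturday, working hours
    (3, ts("2014-12-27 20:00:00"), "site.com"),   # Saturday, free hours
    (4, ts("2014-12-28 20:00:00"), "other.com"),  # Sunday, free hours
    (5, ts("2014-12-22 10:00:00"), "other.com"),  # Monday: excluded by the WHERE
])
rows = con.execute("""
SELECT url, cnt_working, cnt_free,
       ROW_NUMBER() OVER (ORDER BY cnt_working DESC, cnt_free DESC) AS overall_rank
FROM (
  SELECT url,
         SUM(CASE WHEN CAST(strftime('%H', ux_time, 'unixepoch') AS INT) BETWEEN 8 AND 17
                  THEN 1 ELSE 0 END) AS cnt_working,
         SUM(CASE WHEN CAST(strftime('%H', ux_time, 'unixepoch') AS INT) BETWEEN 8 AND 17
                  THEN 0 ELSE 1 END) AS cnt_free
  FROM mytable
  WHERE strftime('%w', ux_time, 'unixepoch') IN ('0', '6')  -- Sunday, Saturday
  GROUP BY url
)
ORDER BY overall_rank
""").fetchall()
# site.com: 2 working-hour visits, 1 free-hour visit; other.com: 0 and 1
```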
I have a dataset like this called data_per_day
instructional_day  points
2023-01-24         2
2023-01-23         2
2023-01-20         1
2023-01-19         0
and so on. The table shows instructional days (weekdays minus holidays and weekends) and the number of points someone has earned: 1 marks the start of a streak, 0 marks the end of a streak, and 2 is the maximum points once a streak has started.
I need to find how long the latest streak is; in this case the result should be 3.
I created a recursive CTE, but the query returns 2 as the streak count because the lag step walks back one calendar day at a time. Instead I need to adjust it so that it steps through instructional days rather than all dates.
WITH RECURSIVE cte AS (
SELECT
student_unique_id,
instructional_day,
points,
1 AS cnt
FROM
`data_per_day`
WHERE
instructional_day = DATE_ADD(CURRENT_DATE('America/Chicago'), INTERVAL -1 DAY)
UNION ALL
SELECT
a.student_unique_id,
a.instructional_day,
a.points,
c.cnt+1
FROM (
SELECT
*
FROM
`data_per_day`
WHERE
points > 0 ) a
INNER JOIN
cte c
ON
a.student_unique_id = c.student_unique_id
AND a.instructional_day = c.instructional_day - INTERVAL '1' day )
SELECT
student_unique_id,
MAX(cnt) AS streak
FROM
cte
WHERE
student_unique_id = "419"
GROUP BY
student_unique_id
How do I adjust the query?
This is not a trivial coding exercise, so I won't write out the complete code here.
What you have here is a gaps and islands question. You want to identify the largest "island" of days with points within a date range. Depending upon what dates are contained in your data, you may need to generate a list of sequential dates that meet your criteria.
One problem I see is that you are trying to combine the steps to generate the date range (the recursive CTE) with the points. You'll need to separate those steps.
1. Define the date range.
2. Generate the dates within the range.
3. Filter the dates with isweekday = 'no' and isholiday = 'no'. You will probably want to add a row number during this step.
4. [left] join the dates to your data, including coalesce(points, 0).
5. Filter the data to points > 0.
6. Identify the islands.
7. Identify the largest island per student.
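To illustrate just the islands steps (6 and 7) in isolation, here is a small SQLite sketch, assuming a simplified data_per_day for one student where missing calendar days are already filtered out. The row-number difference trick assigns a constant grp to each consecutive run of points > 0:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE data_per_day (instructional_day TEXT, points INTEGER);
INSERT INTO data_per_day VALUES
  ('2023-01-24', 2), ('2023-01-23', 2), ('2023-01-20', 1), ('2023-01-19', 0);
""")
streak = con.execute("""
WITH numbered AS (
  -- number the instructional days themselves, so absent calendar days
  -- (weekends, holidays) cannot break a streak
  SELECT instructional_day, points,
         ROW_NUMBER() OVER (ORDER BY instructional_day) AS rn
  FROM data_per_day
),
islands AS (
  -- rn minus a second row number over only the points > 0 rows is
  -- constant within each consecutive run (island)
  SELECT rn, rn - ROW_NUMBER() OVER (ORDER BY rn) AS grp
  FROM numbered
  WHERE points > 0
)
SELECT COUNT(*) AS streak
FROM islands
GROUP BY grp
ORDER BY MAX(rn) DESC   -- most recent island first
LIMIT 1
""").fetchone()[0]
# the latest streak covers 2023-01-20, 2023-01-23, 2023-01-24
```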
I'm looking through login logs (in Netezza) and trying to find users who have greater than a certain number of logins in any 1 hour time period (any consecutive 60 minute period, as opposed to strictly a clock hour) since December 1st. I've viewed the following posts, but most seem to address searching within a specific time range, not ANY given time period. Thanks.
https://dba.stackexchange.com/questions/137660/counting-number-of-occurences-in-a-time-period
https://dba.stackexchange.com/questions/67881/calculating-the-maximum-seen-so-far-for-each-point-in-time
Count records per hour within a time span
You could use the analytic function lag to look back in a sorted sequence of time stamps to see whether the record that came 19 entries earlier is within an hour difference:
with cte as (
select user_id,
login_time,
lag(login_time, 19) over (partition by user_id order by login_time) as lag_time
from userlog
order by user_id,
login_time
)
select user_id,
min(login_time) as login_time
from cte
where extract(epoch from (login_time - lag_time)) < 3600
group by user_id
The output will show the matching users with the first occurrence when they logged a twentieth time within an hour.
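A scaled-down, runnable version of the same lag trick in SQLite (table and column names assumed; the threshold is lowered to 3 logins per hour, so the lag offset is 2):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE userlog (user_id TEXT, login_time TEXT)")
con.executemany("INSERT INTO userlog VALUES (?, ?)", [
    ("alice", "2018-12-05 10:00:00"),
    ("alice", "2018-12-05 10:20:00"),
    ("alice", "2018-12-05 10:40:00"),   # third login within one hour
    ("bob",   "2018-12-05 09:00:00"),
    ("bob",   "2018-12-05 11:00:00"),
    ("bob",   "2018-12-05 13:00:00"),   # never 3 logins inside an hour
])
rows = con.execute("""
WITH cte AS (
  SELECT user_id, login_time,
         LAG(login_time, 2) OVER (PARTITION BY user_id ORDER BY login_time) AS lag_time
  FROM userlog
)
SELECT user_id, MIN(login_time) AS login_time
FROM cte
WHERE strftime('%s', login_time) - strftime('%s', lag_time) < 3600
GROUP BY user_id
""").fetchall()
# only alice reaches 3 logins within a rolling hour
```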
I think you might do something like this (I'll use a login table with user and datetime as the only columns, for the sake of simplicity):
with connections as (
select ua.user
, ua.datetime
from user_logons ua
where ua.datetime >= timestamp'2018-12-01 00:00:00'
)
select ua.user
, ua.datetime
, (select count(*)
from connections ut
where ut.user = ua.user
and ut.datetime between ua.datetime and (ua.datetime + 1 hour)
) as consecutive_logons
from connections ua
It is up to you to adapt this to your actual columns (user, datetime).
It is also up to you to find the date-arithmetic facilities (ua.datetime + 1 hour won't work as written); this is largely dependent on the DB implementation, for example it is DATE_ADD in MySQL (https://www.w3schools.com/SQl/func_mysql_date_add.asp).
Because of the subquery (select count(*) ...), the whole query will not be the fastest: it is a correlated subquery, so it needs to be re-evaluated for each row.
The WITH is simply there to compute a subset of user_logons and minimize that cost. It might not help much, but it lessens the complexity of the query.
You might get better performance using a stored function or application-side code (e.g. Java, PHP, ...).
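For instance, in SQLite the date arithmetic would be datetime(ua.dt, '+1 hour'). A runnable sketch with made-up table and column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_logons (user TEXT, dt TEXT)")
con.executemany("INSERT INTO user_logons VALUES (?, ?)", [
    ("alice", "2018-12-05 10:00:00"),
    ("alice", "2018-12-05 10:30:00"),
    ("alice", "2018-12-05 10:59:00"),
    ("bob",   "2018-12-05 09:00:00"),
])
rows = con.execute("""
SELECT ua.user, ua.dt,
       (SELECT COUNT(*)
        FROM user_logons ut
        WHERE ut.user = ua.user
          AND ut.dt BETWEEN ua.dt AND datetime(ua.dt, '+1 hour')  -- SQLite date arithmetic
       ) AS consecutive_logons
FROM user_logons ua
WHERE ua.dt >= '2018-12-01 00:00:00'
ORDER BY ua.user, ua.dt
""").fetchall()
# alice's 10:00 row sees 3 logons in the following hour; bob's sees only itself
```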
In order to identify human traffic (as opposed to crawlers, bots, etc.), I would like to design an SQL query that identifies all unique visitor IDs that have visited websites in 20 of the last 24 hours (as most humans would not be browsing for that long). I believe I understand how I want to structure it: "How many UNIQUE hours have any activity for each visitor in the past 24 hours, and WHERE 20 hours have at least some activity."
While the specifics of such a query depend on the tables involved, I'm having trouble understanding whether my structure is on the right track:
SELECT page_url, affinity, num
FROM (
SELECT AGG GROUP BY visitor_id, pages.page_url, max(v.max_affinity) as affinity, COUNT(*) as num, Row_Number()
OVER (Partition By v.visitor_id ORDER BY COUNT(visitor_id) DESC) AS RowNumber
FROM audience_lab_active_visitors v
SELECT pages ON pages.p_date >= '2017-09-14'
WHERE v.p_date='2017-09-14'
GROUP BY v.vispage_visitors, pages.page_url
) tbl WHERE RowNumber < 20
I don't believe your query is valid SQL, but I have an idea of what you're trying to accomplish. Rather than use a static date, I filtered on the past 24 hours and truncated the current timestamp to the hour; otherwise the query would be considering 25 distinct hours. I also removed page_url from the query, since it didn't seem relevant to the result you're trying to produce.
For each visitor_id, the query counts the number of distinct hours recorded, based on the column (timestamp_col in this example) that stores the timestamp of the page view. HAVING COUNT(DISTINCT DATE_TRUNC('hour', timestamp_col)) < 20 returns those you've identified as humans, meaning they visited the website in at most 19 of the past 24 hours.
SELECT
visitor_id,
COUNT(DISTINCT DATE_TRUNC('hour', timestamp_col)) AS num,
MAX(v.max_affinity) AS affinity
FROM audience_lab_active_visitors AS v
JOIN pages AS p ON v.page_url = p.page_url
WHERE
v.p_date >= DATE_TRUNC('hour', CURRENT_TIMESTAMP) - INTERVAL '24' hour
GROUP BY 1
HAVING COUNT(DISTINCT DATE_TRUNC('hour', timestamp_col)) < 20;
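A SQLite sketch of the distinct-hours idea (DATE_TRUNC('hour', ...) becomes strftime('%Y-%m-%d %H', ...); the table, column names, and sample data are invented, and the join and affinity parts are omitted):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE visits (visitor_id TEXT, timestamp_col TEXT)")
con.executemany("INSERT INTO visits VALUES (?, ?)",
    # a "bot" active in 21 of 24 hours, a "human" active in 2 distinct hours
    [("bot", f"2017-09-14 {h:02d}:15:00") for h in range(21)]
    + [("human", "2017-09-14 09:05:00"),
       ("human", "2017-09-14 09:45:00"),   # same hour: counted once
       ("human", "2017-09-14 20:10:00")])
rows = con.execute("""
SELECT visitor_id,
       COUNT(DISTINCT strftime('%Y-%m-%d %H', timestamp_col)) AS num
FROM visits
GROUP BY visitor_id
HAVING num < 20    -- fewer than 20 active hours => likely human
""").fetchall()
# only the "human" visitor passes the HAVING filter
```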
Hoping you can help with this issue.
I have energy-management software running on a system. The data logged is a running total, stored in the column Value; a row is logged every hour. Alongside it are some other columns, among them a boolean called Active and an integer called Day.
What I'm after is one query that gets me a sorted list of days, the total power usage of each day, and the peak power usage of each day.
The peak power usage is computed as Max minus Min of Value where Active is set. Some days, however, the Active bit is never set, and that part of the query alone would yield NULL.
This is my query:
SELECT
A.Day, A.Forbrug, B.Peak
FROM
(SELECT
Day, Max(Value) - Min(Value) AS Forbrug
FROM
EL_HT1_K
WHERE
MONTH = 8 AND YEAR = 2016
GROUP By Day) A,
(SELECT
Day, Max(Value) - Min(Value) AS Peak
FROM
EL_HT1_K
WHERE
Month = 8 AND Year = 2016 AND Active = 1
GROUP BY Day) B
WHERE
A.Day = B.Day
This only returns results for days where query B (peak usage) yields rows.
What I want is for the rest of the results from inner query A to still be shown, even when query B yields 0/NULL for that day.
Is this possible, and how?
FYI: the reason I need this in one query is that the SCADA system has difficulty handling multiple queries.
I think you just want conditional aggregation. Based on your description, this seems to be the query you want:
SELECT Day,
       MAX(Value) - MIN(Value) AS Forbrug,
       MAX(CASE WHEN Active = 1 THEN Value END) -
       MIN(CASE WHEN Active = 1 THEN Value END) AS Peak
FROM EL_HT1_K
WHERE Month = 8 AND Year = 2016
GROUP BY Day;
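A quick SQLite check of the conditional-aggregation idea, with made-up meter readings (Forbrug computed as max minus min of the running total). Day 2 has no Active rows, yet it still appears, with a NULL peak, which is exactly the behavior the two-subquery join was losing:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE EL_HT1_K (Day INT, Value REAL, Active INT)")
con.executemany("INSERT INTO EL_HT1_K VALUES (?, ?, ?)", [
    (1, 100.0, 0), (1, 104.0, 1), (1, 110.0, 1), (1, 112.0, 0),
    (2, 112.0, 0), (2, 118.0, 0),   # no Active rows on day 2
])
rows = con.execute("""
SELECT Day,
       MAX(Value) - MIN(Value) AS Forbrug,
       MAX(CASE WHEN Active = 1 THEN Value END) -
       MIN(CASE WHEN Active = 1 THEN Value END) AS Peak
FROM EL_HT1_K
GROUP BY Day
ORDER BY Day
""").fetchall()
# day 1: usage 12.0, peak 6.0; day 2: usage 6.0, peak NULL but still listed
```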
I have a table of log entries with an id, timestamp, source_ip (the IP address) and some other data. I want to group this into "visits", where a visit is all log entries from one IP address with < X seconds between consecutive entries, i.e. for every log entry in a visit, there must be at least one other entry in that visit whose timestamp is < X seconds before or after it.
If X = 10 minutes and IP A has the following requests: 12:00, 12:05, 12:11, 12:40, 12:42, 12:50, 12:52, 14:01, then there are 3 visit groups: [12:00, 12:05, 12:11], [12:40, 12:42, 12:50, 12:52], [14:01].
I would like to do this entirely in SQL, but I'm not sure how. I'm guessing it's a form of GROUP BY, perhaps with common table expressions (a WITH clause)? Can anyone tell me how to generate this? I'd know how to do it in Python (say), but I'd like to have it done in SQL.
I'm currently trying this with SQLite 3, but I'm willing to change to PostgreSQL (even PostgreSQL 9.5).
You can do this in Postgres. I wouldn't recommend SQLite unless it's a recent version, because versions before 3.25 do not support window/analytic functions.
You can find where a group begins using lag() and some date arithmetic. Then you can do a cumulative sum on this information to identify each group:
select l.*,
sum(case when prev_ts + interval '10 minute' > timestamp then 0
else 1
end) over (partition by ip order by timestamp) as groupid
from (select l.*,
lag(timestamp) over (partition by ip order by timestamp) as prev_ts
from logs l
) l;
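Here is the same lag-plus-cumulative-sum pattern run against the example from the question, in SQLite 3.25+ (which added window functions); strftime('%s', ...) stands in for the Postgres interval arithmetic, and the table/column names are assumed:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE logs (ip TEXT, ts TEXT)")
con.executemany("INSERT INTO logs VALUES ('A', ?)", [
    ("2020-01-01 12:00:00",), ("2020-01-01 12:05:00",), ("2020-01-01 12:11:00",),
    ("2020-01-01 12:40:00",), ("2020-01-01 12:42:00",), ("2020-01-01 12:50:00",),
    ("2020-01-01 12:52:00",), ("2020-01-01 14:01:00",),
])
rows = con.execute("""
SELECT ip, ts,
       -- a gap larger than 10 minutes (or the NULL first row) starts a new group
       SUM(CASE WHEN strftime('%s', ts) - strftime('%s', prev_ts) <= 600
                THEN 0 ELSE 1 END)
         OVER (PARTITION BY ip ORDER BY ts) AS groupid
FROM (SELECT ip, ts,
             LAG(ts) OVER (PARTITION BY ip ORDER BY ts) AS prev_ts
      FROM logs) l
ORDER BY ip, ts
""").fetchall()
# the 8 requests fall into 3 visit groups, matching the example above
```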