I'm a newbie to HiveQL.
I have a raw table with 6 million records like this:
I want to count the number of IP addresses that access each Modem_id every week.
The result table I want will be like this:
I did it with a left join, and it worked. But since using joins can be time-consuming, I want to do it with a case when statement instead - but I can't write a correct one. Do you have any ideas?
This is the join statement I used:
select a.modem_id,
a.Number_of_IP_in_Day_1,
b.Number_of_IP_in_Day_2
from
(select modem_id,
count(distinct ip_address) as Number_of_IP_in_Day_1
from F_ACS_DEVICE_INFORMATION_NEW
where day=1
group by modem_id) a
left join
(select modem_id,
count(distinct param_value) as Number_of_IP_in_Day_2
from F_ACS_DEVICE_INFORMATION_NEW
where day=2
group by modem_id) b
on a.modem_id= b.modem_id;
You can express your logic using just aggregation:
select a.modem_id,
count(distinct case when day = 1 then ip_address end) as day_1,
count(distinct case when day = 2 then ip_address end) as day_2
from F_ACS_DEVICE_INFORMATION_NEW a
group by a.modem_id;
You can obviously extend this for more days.
Note: As your question and code are written, this assumes that your base table has data for only one week. Otherwise, I would expect some date filtering. Presumably, that is what the _NEW suffix means on the table name.
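To see the conditional aggregation behave on actual rows, here is a minimal sketch using Python's sqlite3 (the shape of the query is standard SQL, so it mirrors the Hive version; the table name matches the question but the sample data is invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE F_ACS_DEVICE_INFORMATION_NEW (modem_id TEXT, day INT, ip_address TEXT);
INSERT INTO F_ACS_DEVICE_INFORMATION_NEW VALUES
  ('m1', 1, '10.0.0.1'), ('m1', 1, '10.0.0.2'), ('m1', 1, '10.0.0.1'),
  ('m1', 2, '10.0.0.3'),
  ('m2', 1, '10.0.0.9');
""")

# One pass over the table: the CASE picks out the rows for each day,
# and COUNT(DISTINCT ...) ignores the NULLs produced for other days.
rows = con.execute("""
SELECT modem_id,
       COUNT(DISTINCT CASE WHEN day = 1 THEN ip_address END) AS day_1,
       COUNT(DISTINCT CASE WHEN day = 2 THEN ip_address END) AS day_2
FROM F_ACS_DEVICE_INFORMATION_NEW
GROUP BY modem_id
ORDER BY modem_id
""").fetchall()
print(rows)  # [('m1', 2, 1), ('m2', 1, 0)]
```

Note the repeated IP for m1 on day 1 is counted once, and a modem with no rows for a day gets 0, not NULL.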
Based on your question and further comments, you would like
The number of different IP addresses accessed by each modem
In counts by week (as columns) for 4 weeks
e.g., result would be 5 columns
modem_id
IPs_accessed_week1
IPs_accessed_week2
IPs_accessed_week3
IPs_accessed_week4
My answer here is based on knowledge of SQL - I haven't used Hive but it appears to support the things I use (e.g., CTEs). You may need to tweak the answer a bit.
The first key step is to turn the day_number into a week_number. A straightforward way to do this is FLOOR((day_num-1)/7)+1, so days 1-7 become week 1, days 8-14 become week 2, etc.
Note - it is up to you to make sure the day_nums are correct. I would guess you'd actually want info for the last 4 weeks, not the first four weeks of data - and as such you'd probably calculate the day_num as something like SELECT DATEDIFF(day, IP_access_date, CAST(getdate() AS date)) - whatever the equivalent is in Hive.
There are a few ways to do this - I think the clearest is to use a CTE to convert your dataset to what you need e.g.,
convert day_nums to weeknums
get rid of duplicates within the week (your code has COUNT(DISTINCT ...) - I assume this is what you want) - I'm doing this with SELECT DISTINCT (rather than grouping by all fields)
From there, you could PIVOT the data to get it into your table, or just use SUM of CASE statements. I'll use SUM of CASE here as I think it's clearer to understand.
WITH IPs_per_week AS
(SELECT DISTINCT
modem_id,
ip_address,
FLOOR((day-1)/7)+1 AS week_num -- Note I've referred to it as day_num in text for clarity
FROM F_ACS_DEVICE_INFORMATION_NEW
)
SELECT modem_id,
SUM(CASE WHEN week_num = 1 THEN 1 ELSE 0 END) AS IPs_access_week1,
SUM(CASE WHEN week_num = 2 THEN 1 ELSE 0 END) AS IPs_access_week2,
SUM(CASE WHEN week_num = 3 THEN 1 ELSE 0 END) AS IPs_access_week3,
SUM(CASE WHEN week_num = 4 THEN 1 ELSE 0 END) AS IPs_access_week4
FROM IPs_per_week
GROUP BY modem_id;
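To sanity-check the week bucketing and the CTE shape, here is a sketch against a toy table using Python's sqlite3 (integer division on the INT day column stands in for FLOOR; data invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE F_ACS_DEVICE_INFORMATION_NEW (modem_id TEXT, day INT, ip_address TEXT);
INSERT INTO F_ACS_DEVICE_INFORMATION_NEW VALUES
  ('m1', 1,  '10.0.0.1'), ('m1', 3,  '10.0.0.1'),
  ('m1', 8,  '10.0.0.2'),
  ('m1', 15, '10.0.0.3'), ('m1', 16, '10.0.0.4');
""")

rows = con.execute("""
WITH IPs_per_week AS (
  SELECT DISTINCT            -- dedupe repeated IPs within a week
         modem_id, ip_address,
         ((day - 1) / 7) + 1 AS week_num   -- integer division buckets days 1-7 into week 1, etc.
  FROM F_ACS_DEVICE_INFORMATION_NEW
)
SELECT modem_id,
       SUM(CASE WHEN week_num = 1 THEN 1 ELSE 0 END) AS IPs_accessed_week1,
       SUM(CASE WHEN week_num = 2 THEN 1 ELSE 0 END) AS IPs_accessed_week2,
       SUM(CASE WHEN week_num = 3 THEN 1 ELSE 0 END) AS IPs_accessed_week3,
       SUM(CASE WHEN week_num = 4 THEN 1 ELSE 0 END) AS IPs_accessed_week4
FROM IPs_per_week
GROUP BY modem_id
""").fetchall()
print(rows)  # [('m1', 1, 1, 2, 0)]
```

The same IP hit on days 1 and 3 collapses to a single week-1 row before the SUMs run, which is exactly what the SELECT DISTINCT in the CTE is for.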
Related
I am looking for a way to write an SQL statement that selects data for each month of the year, separately.
In the SQL statement below, I am trying to count the number of instances in the TOTAL_PRECIP_IN and TOTAL_SNOWFALL_IN columns when either column is greater than 0. In my data table, I have information for those two columns ("TOTAL_PRECIP_IN" and "TOTAL_SNOWFALL_IN") for each day of the year (365 total entries).
I want to break up my data by each calendar month, but am not sure of the best way to do this. In the statement below, I am using a UNION statement to break up the months of January and February. If I keep using UNION statements for the remaining months of the year, I can get the answer I am looking for. However, using 11 different UNION statements cannot be the optimal solution.
Can anyone give me a suggestion how I can edit my SQL statement to measure from the first day of the month, to the last day of the month for every month of the year?
select monthname(OBSERVATION_DATE) as "Month",
       sum(case when TOTAL_PRECIP_IN > 0 or TOTAL_SNOWFALL_IN > 0 then 1 else 0 end) AS "Days of Rain"
from EMP_BASIC
where OBSERVATION_DATE between '2019-01-01' and '2019-01-31'
and CITY = 'Olympia'
group by "Month"
UNION
select monthname(OBSERVATION_DATE) as "Month",
       sum(case when TOTAL_PRECIP_IN > 0 or TOTAL_SNOWFALL_IN > 0 then 1 else 0 end)
from EMP_BASIC
where OBSERVATION_DATE between '2019-02-01' and '2019-02-28'
and CITY = 'Olympia'
group by "Month"
Your table structure is too unclear to tell you the exact query you will need. But a generally easy approach is to build the sum of your value and then group by monthname and/or by month. Since you wrote that you only want values greater than 0, you can just put this condition in the where clause. So your query will be something like this:
SELECT MONTHNAME(yourdate) AS month,
MONTH(yourdate) AS monthnr,
SUM(yourvalue) AS yoursum
FROM yourtable
WHERE yourvalue > 0
GROUP BY MONTHNAME(yourdate), MONTH(yourdate)
ORDER BY MONTH(yourdate);
I created an example here: db<>fiddle
You might need to modify this general construct for your concrete purpose (maybe take care of different years, of NULL values, etc.). And note this is an example for a MySQL DB, because you wrote about MONTHNAME(), which is mostly used in MySQL databases. If you are using another DB type, you may need to make some modifications. To make sure that answers match your DB type, please tag it in your question.
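The general construct above can be tried on any engine; here is a sketch with Python's sqlite3, where strftime('%m', ...) stands in for MySQL's MONTHNAME() (table and column names are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE weather (obs_date TEXT, precip REAL);
INSERT INTO weather VALUES
  ('2019-01-05', 0.2), ('2019-01-20', 0.0),
  ('2019-02-03', 0.5), ('2019-02-04', 0.1);
""")

# Filter in WHERE, then let GROUP BY split the months - no UNION needed.
rows = con.execute("""
SELECT strftime('%m', obs_date) AS month,   -- sqlite stand-in for MONTHNAME()
       COUNT(*) AS rainy_days
FROM weather
WHERE precip > 0
GROUP BY month
ORDER BY month
""").fetchall()
print(rows)  # [('01', 1), ('02', 2)]
```

The day with 0.0 precipitation drops out in the WHERE clause, so each month's count covers only the qualifying days.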
I have a dataset that has product name, order number and the time order was placed.
prod_name,order_no,order_time
a,101,2018-05-01
a,102,2018-06-04
a,103,2018-05-03
b,104,2018-01-21
b,105,2018-01-11
I am trying to build a report that shows time since first order (compared against current time) with an output as below:
prod_name,time_since_first_sale,aging
a,64,Less than 3 months back
b,177,Less than 6 months back
Given below is the SQL I am using:
select DISTINCT b.prod_name,
       case when (CURRENT_TIMESTAMP - min(a.order_time)) < '90' THEN 'Less than 3 months'
            when (CURRENT_TIMESTAMP - min(a.order_time)) < '180' THEN 'Less than 6 months'
            else 'Other' end as aging
from sales a, prod b
where a.id = b.prod_id;
The above SQL, when executed, returns duplicates; I believe it also considers each sale_id in the sales table. How could I modify the query to get just one record per prod_name? If I remove the case statement, however, the duplicates are not there. Could anyone advise what I am doing wrong that pulls in these duplicates?
I am using Amazon Redshift DB.
Thanks..
Never use commas in the FROM clause. Always use proper, explicit, standard JOIN syntax.
Don't use SELECT DISTINCT when you intend GROUP BY.
So your query should look like:
select p.prod_name,
(case when CURRENT_TIMESTAMP - min(s.order_time) < '90'
then 'Less than 3 months'
when CURRENT_TIMESTAMP - min(s.order_time) < '180' then 'Less than 6 months'
else 'Other'
end) as aging
from sales s join
prod p
on s.id = p.prod_id
group by p.prod_name;
Notice that I also added in reasonable table aliases (abbreviations for the table names) and qualified all column references.
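To illustrate the GROUP BY plus CASE-over-MIN shape (a sketch only, not Redshift itself), here is a runnable version with Python's sqlite3; julianday() differences stand in for Redshift's date arithmetic, and a fixed "today" of 2024-04-01 keeps the example deterministic (in Redshift you would use CURRENT_DATE or DATEDIFF):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (prod_id INT, order_time TEXT);
CREATE TABLE prod  (prod_id INT, prod_name TEXT);
INSERT INTO prod  VALUES (1, 'a'), (2, 'b');
INSERT INTO sales VALUES (1, '2024-02-01'), (1, '2024-03-01'), (2, '2023-01-01');
""")

# GROUP BY collapses each product to one row; the CASE then
# classifies the age of the earliest (MIN) order per product.
rows = con.execute("""
SELECT p.prod_name,
       CASE WHEN julianday('2024-04-01') - julianday(MIN(s.order_time)) < 90
              THEN 'Less than 3 months'
            WHEN julianday('2024-04-01') - julianday(MIN(s.order_time)) < 180
              THEN 'Less than 6 months'
            ELSE 'Other' END AS aging
FROM sales s JOIN prod p ON s.prod_id = p.prod_id
GROUP BY p.prod_name
ORDER BY p.prod_name
""").fetchall()
print(rows)  # [('a', 'Less than 3 months'), ('b', 'Other')]
```

Product 'a' has two sales rows, but grouping (rather than SELECT DISTINCT) guarantees exactly one output row per product.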
Let's say I have two columns: Date and Indicator
Usually the indicator goes from 0 to 1 (when the data is sorted by date) and I want to be able to identify if it goes from 1 to 0 instead. Is there an easy way to do this with SQL?
I am already aggregating other fields in the same table. If I can add this as another aggregation (e.g. without using a separate "where" statement or passing over the data a second time), that would be pretty awesome.
This is the phenomena I want to catch:
Date Indicator
1/5/01 0
1/4/01 0
1/3/01 1
1/2/01 1
1/1/01 0
This isn't a Teradata-specific answer, but it can be done in normal SQL.
Assuming that the sequence is already 'complete' and x(n+1) can be derived from x(n), such as when the dates are sequential and all present:
SELECT date -- the 1 on the day following the 0
FROM r curr
JOIN r prev
-- join each day with the previous day
ON curr.date = dateadd(d, 1, prev.date)
WHERE curr.indicator = 1
AND prev.indicator = 0
YMMV on the ability of such a query to use indexes efficiently.
If the sequence is not complete the same can be applied after making a delegate sequence which is well ordered and similarly 'complete'.
This can also be done using correlated subqueries, each selecting the indicator of the 'previous max', but... ugh.
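The previous-day self-join can be sketched like this with Python's sqlite3, where date(prev.d, '+1 day') plays the role of dateadd (sample data invented; swap the two indicator conditions to catch the 1-to-0 drop instead):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE r (d TEXT, indicator INT);
INSERT INTO r VALUES
  ('2001-01-01', 0), ('2001-01-02', 1), ('2001-01-03', 1),
  ('2001-01-04', 0), ('2001-01-05', 0);
""")

# Join each day to the previous day, then keep rows where the
# indicator flipped from 0 to 1 across that boundary.
rows = con.execute("""
SELECT curr.d
FROM r AS curr
JOIN r AS prev ON curr.d = date(prev.d, '+1 day')   -- sqlite's dateadd
WHERE curr.indicator = 1
  AND prev.indicator = 0
""").fetchall()
print(rows)  # [('2001-01-02',)]
```

Only the first 0-to-1 boundary qualifies; the 1-to-1 pair on the 2nd/3rd is filtered out by the WHERE clause.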
Joining the table against itself is quite generic, but most SQL dialects now support analytic functions. Ideally you could use LAG(), but Teradata seems to support the absolute minimum of these, so they point you to use SUM() combined with preceding rows.
In any regard, this method avoids a potentially costly join and effectively deals with gaps in the data, whilst making maximum use of indexes.
SELECT *
FROM yourTable t
QUALIFY t.indicator < SUM(t.indicator) OVER (PARTITION BY t.somecolumn /* optional */
                                             ORDER BY t.Date
                                             ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
QUALIFY is a bit TeraData specific, but slightly tidier than the alternative...
SELECT *
FROM
(
    SELECT t.*,
           SUM(t.indicator) OVER (PARTITION BY t.somecolumn /* optional */
                                  ORDER BY t.Date
                                  ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
             AS previous_indicator
    FROM yourTable t
) lagged
WHERE lagged.indicator < lagged.previous_indicator
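sqlite has no QUALIFY, but its window functions can run the derived-table form of the query above; here is a sketch with Python's sqlite3 (no PARTITION BY, since the toy data is a single series):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (d TEXT, indicator INT);
INSERT INTO t VALUES
  ('2001-01-01', 0), ('2001-01-02', 1), ('2001-01-03', 1),
  ('2001-01-04', 0), ('2001-01-05', 0);
""")

# SUM over a one-row frame (1 PRECEDING to 1 PRECEDING) is a poor
# man's LAG(): each row sees only the previous row's indicator.
rows = con.execute("""
SELECT d FROM (
  SELECT d, indicator,
         SUM(indicator) OVER (ORDER BY d
                              ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
           AS previous_indicator
  FROM t
) lagged
WHERE lagged.indicator < lagged.previous_indicator
""").fetchall()
print(rows)  # [('2001-01-04',)]
```

The comparison indicator < previous_indicator fires exactly where the series drops from 1 to 0, and gaps in the dates would not break it since the frame is positional, not date-based.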
Supposing you mean that you want to determine whether any row with indicator 1 has an earlier Date than a row in its group with indicator 0, you can identify such groups by including the appropriate extreme dates in your aggregate results:
SELECT
...
MAX(CASE indicator WHEN 0 THEN Date END) AS last_ind_0,
MIN(CASE indicator WHEN 1 THEN Date END) AS first_ind_1,
...
You then test whether first_ind_1 is less than last_ind_0, either in code or as another selection item.
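A sketch of that aggregate test with Python's sqlite3 (the grouping column and the dates are invented); the comparison is done in Python here, but it could equally be another select item or a HAVING clause:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (grp TEXT, d TEXT, indicator INT);
INSERT INTO t VALUES
  ('g1', '2001-01-01', 0), ('g1', '2001-01-02', 1),  -- 0 before 1: normal
  ('g2', '2001-01-01', 1), ('g2', '2001-01-02', 0);  -- 1 before 0: flagged
""")

# One pass: the CASE expressions pull the extreme dates per indicator value.
rows = con.execute("""
SELECT grp,
       MAX(CASE indicator WHEN 0 THEN d END) AS last_ind_0,
       MIN(CASE indicator WHEN 1 THEN d END) AS first_ind_1
FROM t
GROUP BY grp
ORDER BY grp
""").fetchall()

# A group is flagged when its first 1 precedes its last 0.
flagged = [g for g, last0, first1 in rows
           if first1 is not None and last0 is not None and first1 < last0]
print(flagged)  # ['g2']
```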
I have a database table called 'tweets'. The table includes (amongst others) the columns 'tweet_id', 'created_at' (dd/mm/yyyy hh/mm/ss), 'classified' and 'processed_text'. Within the 'processed_text' column there are certain strings such as '{TICKER|IBM}', to which I will refer as ticker-strings.
My target is to get the average value of 'classified' per ticker-string per day. The 'classified' column holds the numerical values -1, 0 and 1.
At this moment, I have a working SQL query for the average value of ‘classified’ for one ticker-string per day. See the script below.
SELECT Date( `created_at` ) , AVG( `classified` ) AS Classified
FROM `tweets`
WHERE `processed_text` LIKE '%{TICKER|IBM}%'
GROUP BY Date( `created_at` )
There are however two problems with this script:
It does not include days on which there were zero ‘processed_text’s like {TICKER|IBM}. I would however like it to spit out the value zero in this case.
I have 100+ different ticker-strings and would thus like to have a script which can process multiple strings at the same time. I can also do them manually, one by one, but this would cost me a terrible lot of time.
When I had a similar question for counting the ‘tweet_id’s per ticker-string, somebody else suggested using the following:
SELECT d.date, coalesce(IBM, 0) as IBM, coalesce(GOOG, 0) as GOOG,
coalesce(BAC, 0) AS BAC
FROM dates d LEFT JOIN
(SELECT DATE(created_at) AS date,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|IBM}%' then tweet_id
END) as IBM,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|GOOG}%' then tweet_id
END) as GOOG,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|BAC}%' then tweet_id
END) as BAC
FROM tweets
GROUP BY date
) t
ON d.date = t.date;
This script worked perfectly for counting the tweet_ids per ticker-string. As I stated, however, I am now looking to find the average classified scores per ticker-string. My question is therefore: could someone show me how to adjust this script so that I can calculate the average classified score per ticker-string per day?
SELECT d.date, t.ticker, COALESCE(COUNT(DISTINCT t.tweet_id), 0) AS tweets
FROM dates d
LEFT JOIN
     (SELECT DATE(created_at) AS date,
             tweet_id,
             SUBSTR(processed_text,
                    LOCATE('{TICKER|', processed_text) + 8,
                    LOCATE('}', processed_text, LOCATE('{TICKER|', processed_text))
                      - LOCATE('{TICKER|', processed_text) - 8) AS ticker
      FROM tweets) t
ON d.date = t.date
GROUP BY d.date, t.ticker
This will put each ticker on its own row, not a column. If you want them moved to columns, you have to pivot the result. How you do this depends on the DBMS. Some have built-in features for creating pivot tables. Others (e.g. MySQL) do not and you have to write tricky code to do it; if you know all the possible values ahead of time, it's not too hard, but if they can change you have to write dynamic SQL in a stored procedure.
See MySQL pivot table for how to do it in MySQL.
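As a sketch, here is the same extraction idea in Python's sqlite3 - instr() replaces MySQL's LOCATE(), and AVG(classified) replaces the count so the result matches the asker's actual goal (the sample tweets are invented, and each processed_text is assumed to contain at most one ticker-string):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tweets (tweet_id INT, created_at TEXT, classified INT, processed_text TEXT);
INSERT INTO tweets VALUES
  (1, '2014-01-01',  1, 'x {TICKER|IBM} y'),
  (2, '2014-01-01', -1, 'z {TICKER|IBM} w'),
  (3, '2014-01-01',  1, '{TICKER|GOOG} q');
""")

# Slice out the text between '{TICKER|' and the next '}',
# then average 'classified' per day and ticker.
rows = con.execute("""
SELECT DATE(created_at) AS d,
       SUBSTR(processed_text,
              INSTR(processed_text, '{TICKER|') + 8,
              INSTR(SUBSTR(processed_text,
                           INSTR(processed_text, '{TICKER|') + 8), '}') - 1)
         AS ticker,
       AVG(classified) AS avg_classified
FROM tweets
GROUP BY d, ticker
ORDER BY d, ticker
""").fetchall()
print(rows)  # [('2014-01-01', 'GOOG', 1.0), ('2014-01-01', 'IBM', 0.0)]
```

The two IBM tweets (+1 and -1) average to 0.0; getting a 0 row for days with no matching tweets still needs the LEFT JOIN against the dates table, as in the answer above.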
I have to create a report which has AccountSegment as rows and a 2-week date range as column header. The column values will be a count of the number of records in the table having the associated segment/date range.
So the desired output looks something like this:
AcctSeg 4/9/12-4/20/12 4/23/12-5/4/12 5/7/12-5/18/12
Segment1 100 200 300
Segment2 110 220 330
Segment3 120 230 340
The following query does what I want, but just looks so inefficient and ugly. I was wondering if there is a better way to accomplish the same thing:
SELECT
AccountSegment = S.Segment_Name,
'4/9/2012 - 4/20/2012' = SUM(CASE WHEN date_start BETWEEN '2012-04-09' AND '2012-04-20' THEN 1 END),
'4/23/2012 - 5/4/2012' = SUM(CASE WHEN date_start BETWEEN '2012-04-23' AND '2012-05-04' THEN 1 END),
'5/7/2012 - 5/18/2012' = SUM(CASE WHEN date_start BETWEEN '2012-05-07' AND '2012-05-18' THEN 1 END),
'5/21/2012 - 6/1/2012' = SUM(CASE WHEN date_start BETWEEN '2012-05-21' AND '2012-06-01' THEN 1 END),
'6/4/2012 - 6/15/2012' = SUM(CASE WHEN date_start BETWEEN '2012-06-04' AND '2012-06-15' THEN 1 END),
'6/18/2012 - 6/29/2012' = SUM(CASE WHEN date_start BETWEEN '2012-06-18' AND '2012-06-29' THEN 1 END),
'7/2/2012 - 7/13/2012' = SUM(CASE WHEN date_start BETWEEN '2012-07-02' AND '2012-07-13' THEN 1 END),
'7/16/2012 - 7/27/2012' = SUM(CASE WHEN date_start BETWEEN '2012-07-16' AND '2012-07-27' THEN 1 END),
'7/30/2012 - 8/10/2012' = SUM(CASE WHEN date_start BETWEEN '2012-07-30' AND '2012-08-10' THEN 1 END)
FROM
dbo.calls C
JOIN dbo.accounts a ON C.parent_id = a.id
JOIN dbo.accounts_cstm a2 ON a2.id_c = A.id
JOIN dbo.Segmentation S ON a2.[2012_segmentation_c] = S.Segment_Num
WHERE
c.deleted = 0
GROUP BY
S.Segment_Name
ORDER BY
MIN(S.Sort_Order)
It looks fine, but I would suggest one performance improvement:
where c.deleted = 0 and
date_start between '2012-04-09' AND '2012-08-10'
This will limit the aggregation only to rows you need . . . unless you want everything listed with empty data.
I would be inclined to add else 0 to the case statements, so 0s appear instead of NULLs.
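Putting both suggestions together - the date_start filter in the WHERE clause and else 0 in the CASE - a minimal sketch with Python's sqlite3 (two periods only, and made-up segments):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE calls (segment TEXT, date_start TEXT);
INSERT INTO calls VALUES
  ('Segment1', '2012-04-10'), ('Segment1', '2012-04-25'),
  ('Segment2', '2012-04-11');
""")

rows = con.execute("""
SELECT segment,
       SUM(CASE WHEN date_start BETWEEN '2012-04-09' AND '2012-04-20'
                THEN 1 ELSE 0 END) AS period_1,
       SUM(CASE WHEN date_start BETWEEN '2012-04-23' AND '2012-05-04'
                THEN 1 ELSE 0 END) AS period_2
FROM calls
WHERE date_start BETWEEN '2012-04-09' AND '2012-05-04'  -- prune before aggregating
GROUP BY segment
ORDER BY segment
""").fetchall()
print(rows)  # [('Segment1', 1, 1), ('Segment2', 1, 0)]
```

Segment2 has no calls in the second period and shows 0 rather than NULL, thanks to the else 0.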
@PaulStock, happy to do so.
This technique plays to the strengths of an RDBMS, which are data retrieval and set manipulation - leave iteration to other programming languages that are better optimised for it, like C#.
First of all you need an IndexTable. I keep mine in the master database, but if you do not have write access to that, by all means keep it in your own db.
It looks like this:
Id
0
1
2
...
n
Where n is a sufficiently large number for all your scenarios: 100,000 is good, 1,000,000 is better, 10,000,000 is better still. The Id column is clustered-indexed, of course.
I'm not going to relate it directly to your query because I don't really get it and I'm too lazy to work it out.
Instead I'll relate it to this table called Transactions, where we want to roll up all the transactions that happened on each day (or week or month etc):
Date Amount
2012-12-18 04:58:56.453 10
2012-12-18 06:34:21.456 100
etc
The following query will roll up the data by day
SELECT i.Id, SUM(t.Amount) AS DailyTotal
FROM IndexTable i
INNER JOIN
Transactions t ON i.Id=DATEDIFF(DAY, 0, t.Date)
GROUP BY i.Id
The DATEDIFF function returns the number of dateparts between 2 dates, in this case the number of days between 1900-01-01 0:00:00.000 (DateTime = 0 in SQL Server) and the Date of the transaction (btw there have been 41,261 days since then - see why we need a big table)
All the transactions on the same day will have the same number. Changing to week or month or second (a very big number) is as easy as changing the datepart.
You can of course put in a later start date, so long as it is earlier than the data you are interested in, but it makes little to no difference to performance.
I have used an INNER JOIN here, so if there are no transactions on a given day there is no row; a LEFT JOIN will give these empty dates with NULL as the total (use an ISNULL statement if you want to get 0).
With the normalised data you can then PIVOT as desired to get the output you are looking for.
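The day-number trick can be sketched with Python's sqlite3 - truncating julianday() with 'start of day' gives every calendar day a single integer, playing the role of DATEDIFF(DAY, 0, Date) in SQL Server, and a small IndexTable stands in for the big one (data and the index range are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE IndexTable (Id INT PRIMARY KEY);
CREATE TABLE Transactions (txn_time TEXT, Amount INT);
INSERT INTO Transactions VALUES
  ('2012-12-18 04:58:56', 10),
  ('2012-12-18 06:34:21', 100),
  ('2012-12-19 09:00:00', 5);
""")

# Populate a toy index table covering the day numbers we care about.
con.executemany("INSERT INTO IndexTable VALUES (?)",
                [(i,) for i in range(2456270, 2456290)])

# Every transaction on the same calendar day maps to the same Id,
# so GROUP BY Id rolls the amounts up per day.
rows = con.execute("""
SELECT i.Id AS day_num, SUM(t.Amount) AS daily_total
FROM IndexTable i
INNER JOIN Transactions t
  ON i.Id = CAST(julianday(t.txn_time, 'start of day') AS INT)
GROUP BY i.Id
ORDER BY i.Id
""").fetchall()
print(rows)  # two rows: the 18th totals 110, the 19th totals 5
```

Switching the INNER JOIN to a LEFT JOIN would also emit the empty days in the index range, with NULL totals.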