Data Summarization in Apache Pig/Apache Hive For Given Date Range


I need to summarize data over a date range that is provided as input. To be more specific, suppose my data looks like this:
Input:
Id|amount|date
1 |10 |2016-01-01
2 |20 |2016-01-02
3 |20 |2016-01-03
4 |20 |2016-09-25
5 |20 |2016-09-26
6 |20 |2016-09-28
If I want the summarization for the month of September, I need to calculate the record count over 4 ranges:
Current Date: each day in September.
Week Start Date (the Sunday of the current date's week) to Current Date. For example, if Current Date is 2016-09-28, the week start date is 2016-09-25, and the count covers records from 2016-09-25 to 2016-09-28.
Month Start Date to Current Date: from 2016-09-01 to Current Date.
Year Start Date to Current Date: the record count from 2016-01-01 to Current Date.
So my output should have one record for each day of the month (in this case, September), containing the date and the 4 counts, something like:
Output:
Current_Date|Current_date_count|Week_To_Date_Count|Month_to_date_Count|Year_to_date_count
2016-09-25 |1 |1 |1 |4
2016-09-26 |1 |2 |2 |5
2016-09-28 |1 |3 |3 |6
Important: I can pass only 2 variables, the range start date and the range end date. The rest of the calculation needs to be dynamic.
Thanks in advance

You can join on year, then test each condition separately (using sum(if())):
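select a.date,
sum(if(a.date = b.date, 1, 0)) as current_date_count,
-- week to date: weekofyear() uses Monday-based ISO weeks, so count the days back to the latest Sunday instead (1900-01-07 was a Sunday)
sum(if(datediff(a.date, b.date) <= pmod(datediff(a.date, '1900-01-07'), 7), 1, 0)) as week_to_date_count,
sum(if(month(a.date) = month(b.date), 1, 0)) as month_to_date_count,
count(*) as year_to_date_count
from
-- distinct keeps a day with several records from being counted more than once
-- the dates are quoted on the assumption that start and end are passed unquoted; the end date is exclusive
(select distinct date from input_table where date >= '${hiveconf:start}' and date < '${hiveconf:end}') a,
(select date from input_table where date < '${hiveconf:end}') b
-- because of the join on year, a week that straddles January 1 only counts the days in the current year
where year(a.date) = year(b.date) and b.date <= a.date
group by a.date;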
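
To make the four ranges concrete, here is a minimal Python sketch that recomputes them for the sample rows in the question. It is not Hive code and not the answer's method: the names `week_start`, `count_between` and `summarize` are made up for illustration, the end date is treated as exclusive like in the query, and only days that have at least one record produce an output row. Unlike the query, the week range here is allowed to cross a month or year boundary.

```python
from datetime import date, timedelta

# Sample rows from the question: (id, amount, date)
ROWS = [
    (1, 10, date(2016, 1, 1)),
    (2, 20, date(2016, 1, 2)),
    (3, 20, date(2016, 1, 3)),
    (4, 20, date(2016, 9, 25)),
    (5, 20, date(2016, 9, 26)),
    (6, 20, date(2016, 9, 28)),
]


def week_start(d):
    """Return the Sunday on or before d (date.weekday(): Monday=0 ... Sunday=6)."""
    return d - timedelta(days=(d.weekday() + 1) % 7)


def count_between(rows, lo, hi):
    """Count the records whose date falls in the inclusive range [lo, hi]."""
    return sum(1 for _, _, d in rows if lo <= d <= hi)


def summarize(rows, start, end):
    """Build one output row per date in [start, end) that has a record."""
    out = []
    for cur in sorted({d for _, _, d in rows if start <= d < end}):
        out.append((
            cur.isoformat(),
            count_between(rows, cur, cur),                           # current date
            count_between(rows, week_start(cur), cur),               # week to date
            count_between(rows, cur.replace(day=1), cur),            # month to date
            count_between(rows, cur.replace(month=1, day=1), cur),   # year to date
        ))
    return out


result = summarize(ROWS, date(2016, 9, 1), date(2016, 10, 1))
for row in result:
    print(row)
```

A small cross-check like this can help validate the Hive output on a sample before running the query on the full table.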

Related

If condition TRUE in a row (that is grouped)

Table:
|Months |ID|Commission|
|2020-01|1 |2312 |
|2020-02|2 |24412 |
|2020-02|1 |123 |
|... |..|... |
What I need:
COUNT(Months),
ID,
SUM(Commission),
Country
GROUP BY ID...
How it should look:
|Months |ID|Commission|
|4 |1 |5356 |
|6 |2 |5436 |
|... |..|... |
So I want to know for how many months each ID received a commission. However (and this is the part where I need your help), if the ID is still receiving commission in the current month, I want to exclude it from the list. If it stopped receiving commission last month or last year, I want to see it in the table.
In other words, I want a table of old clients (who don't receive commission anymore).

Use aggregation. Assuming there is one row per month:
select id, count(*) as month_count, sum(commission) as total_commission
from t
group by id
having max(months) < date_format(now(), '%Y-%m');
Note this uses MySQL syntax, which was one of the original tags.
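
The HAVING filter can be tried out with a small, hedged sketch using Python's built-in sqlite3 module instead of MySQL. The current month is passed in as a parameter so the result does not depend on the day the code is run, and the 2020-03 row for ID 2 is invented test data, not from the question. The helper name `former_clients` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (months TEXT, id INTEGER, commission INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("2020-01", 1, 2312),
        ("2020-02", 2, 24412),
        ("2020-02", 1, 123),
        ("2020-03", 2, 100),  # invented row: ID 2 is still paid in the "current" month
    ],
)


def former_clients(conn, current_month):
    """Return (id, month_count, total_commission) for IDs whose last month is before current_month."""
    return conn.execute(
        """
        SELECT id, COUNT(*) AS month_count, SUM(commission) AS total_commission
        FROM t
        GROUP BY id
        HAVING MAX(months) < ?   -- 'YYYY-MM' strings sort chronologically
        ORDER BY id
        """,
        (current_month,),
    ).fetchall()


former = former_clients(conn, "2020-03")
print(former)
```

The comparison works on plain strings only because the months are stored as zero-padded 'YYYY-MM' values.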

Select rows that are duplicates on two columns

I have data in a table. There are 3 columns (ID, Interval, ContactInfo). This table lists all phone contacts. I'm attempting to get a count of phone numbers that called twice on the same day and have no idea how to go about this. I can get duplicate entries for the same number but it does not match on date. The code I have so far is below.
SELECT ContactInfo, COUNT(Interval) AS NumCalls
FROM AllCalls
GROUP BY ContactInfo
HAVING COUNT(AllCalls.ContactInfo) > 1
I'd like to have it return the date, the number of calls on that date if more than 1, and the phone number.
Sample data:
|ID |Interval |ContactInfo|
|--------|------------|-----------|
|1 |3/1/2017 |8009999999 |
|2 |3/1/2017 |8009999999 |
|3 |3/2/2017 |8001234567 |
|4 |3/2/2017 |8009999999 |
|5 |3/3/2017 |8007771111 |
|6 |3/3/2017 |8007771111 |
|--------|------------|-----------|
Expected result:
|Interval |ContactInfo|NumCalls|
|------------|-----------|--------|
|3/1/2017 |8009999999 |2 |
|3/3/2017 |8007771111 |2 |
|------------|-----------|--------|

As juergen d suggested, add Interval to your GROUP BY, like so:
SELECT AC.ContactInfo
, AC.Interval
, COUNT(*) AS qnty
FROM AllCalls AS AC
GROUP BY AC.ContactInfo
, AC.Interval
HAVING COUNT(*) > 1

The code should look like this:
select Interval , ContactInfo, count(ID) AS NumCalls from AllCalls group by Interval, ContactInfo having count(ID)>1;
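
As a quick check, the same grouping can be run against the sample data with Python's built-in sqlite3 module. This is only a sketch: the question does not say which database is in use, the dates are stored as plain text exactly as shown in the sample, and the column name is quoted because INTERVAL is a reserved word in some databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE AllCalls (ID INTEGER, "Interval" TEXT, ContactInfo TEXT)')
conn.executemany(
    "INSERT INTO AllCalls VALUES (?, ?, ?)",
    [
        (1, "3/1/2017", "8009999999"),
        (2, "3/1/2017", "8009999999"),
        (3, "3/2/2017", "8001234567"),
        (4, "3/2/2017", "8009999999"),
        (5, "3/3/2017", "8007771111"),
        (6, "3/3/2017", "8007771111"),
    ],
)

# Group on both the date and the phone number, then keep groups with more than one call.
repeat_calls = conn.execute(
    """
    SELECT "Interval", ContactInfo, COUNT(*) AS NumCalls
    FROM AllCalls
    GROUP BY "Interval", ContactInfo
    HAVING COUNT(*) > 1
    ORDER BY "Interval"
    """
).fetchall()

for row in repeat_calls:
    print(row)
```

If Interval actually holds a full datetime rather than a date, it would need to be truncated to the day before grouping.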

vertica sql delta

I want to calculate the delta between consecutive records. My table has 2 columns, id and timestamp, and I want to calculate the time delta between each record and the previous one:
id |timestamp |delta
----------------------------------
1 |100 |0
2 |101 |1 (101-100)
3 |106 |5 (106-101)
4 |107 |1 (107-106)
I work with a Vertica database and I want to create a view/projection of this table in my DB.
Is it possible to do this calculation without using a UDF?

You can use lag() for this purpose:
select id, timestamp,
coalesce(timestamp - lag(timestamp) over (order by id), 0) as delta
from t;
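
Since the question asks for a view, here is a hedged sketch of the same LAG() expression wrapped in a view. It runs on Python's built-in sqlite3 module rather than Vertica, assumes SQLite 3.25 or later (the first version with window functions), and renames the column to `ts` and the view to `t_delta` purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, ts INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 100), (2, 101), (3, 106), (4, 107)])

# LAG(ts) returns the previous row's ts in id order; it is NULL for the
# first row, so COALESCE turns that first delta into 0.
conn.execute(
    """
    CREATE VIEW t_delta AS
    SELECT id, ts,
           COALESCE(ts - LAG(ts) OVER (ORDER BY id), 0) AS delta
    FROM t
    """
)

deltas = conn.execute("SELECT id, ts, delta FROM t_delta ORDER BY id").fetchall()
for row in deltas:
    print(row)
```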

Grouping 6-hour Data for Individual Days with SQL Server

Using SQL Server 2000,
I have data that contains a datetime and a value, say DateData and OtherData. Data is collected over time in five minute intervals, so for example, DateData is an ordered DateTime from 2/1/2016 to 2/24/2015, with a 5 minute gap between each new data point.
I am currently trying to average the data so that I can get the average OtherData value for every 6 hours, for each day individually. So far, I have come up with the following SQL, which groups all the data into 6-hour intervals, so I end up with averages over all days combined rather than per individual day.
SELECT
DATEPART(hour, DateData)/6,
AVG(OtherData) AS AvgData
FROM DATASITE
GROUP BY DATEPART(hour, DateData)/6
--Result:
|Column1 | AvgData|
|0 | 11|
|1 | 12|
|2 | 13|
|3 | 14|
How would I change the above query to give averages for individual days, rather than all days combined?

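Add the day to the GROUP BY so that the rows are grouped per day. SQL Server 2000 has no date type (it was introduced in SQL Server 2008), so strip the time portion with DATEADD/DATEDIFF instead of casting to date:
SELECT
DATEADD(day, DATEDIFF(day, 0, DateData), 0) AS DayData,
DATEPART(hour, DateData)/6 AS Bucket,
AVG(OtherData) AS AvgData
FROM DATASITE
GROUP BY DATEADD(day, DATEDIFF(day, 0, DateData), 0),
DATEPART(hour, DateData)/6
ORDER BY DayData, Bucket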
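
The per-day bucketing can be illustrated with a small sketch on Python's built-in sqlite3 module. This is not SQL Server syntax: SQLite's `date()` and `strftime('%H', ...)` stand in for the day and hour extraction, and the five sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DATASITE (DateData TEXT, OtherData REAL)")
conn.executemany(
    "INSERT INTO DATASITE VALUES (?, ?)",
    [
        ("2016-02-01 00:00:00", 10),  # day 1, bucket 0
        ("2016-02-01 05:55:00", 12),  # day 1, bucket 0
        ("2016-02-01 06:00:00", 20),  # day 1, bucket 1
        ("2016-02-02 00:05:00", 30),  # day 2, bucket 0
        ("2016-02-02 23:55:00", 40),  # day 2, bucket 3
    ],
)

# Integer division of the hour by 6 yields buckets 0..3; grouping on the
# day as well keeps each day's buckets separate.
averages = conn.execute(
    """
    SELECT date(DateData) AS Day,
           CAST(strftime('%H', DateData) AS INTEGER) / 6 AS Bucket,
           AVG(OtherData) AS AvgData
    FROM DATASITE
    GROUP BY date(DateData), CAST(strftime('%H', DateData) AS INTEGER) / 6
    ORDER BY Day, Bucket
    """
).fetchall()

for row in averages:
    print(row)
```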

SQL query between two times on every day

I'm currently trying to write a query that will allow me to find all records that occur between two times every day. As an example, say you had five records each with their own unique timestamps that represent when the record was created. They look something like this:
|--|------|-------------------|
|id|letter| created_at |
|--|------|-------------------|
|1 |a |2013-10-30 10:00:00|
|2 |b |2013-10-31 18:00:00|
|3 |c |2013-11-01 14:00:00|
|4 |d |2013-11-03 23:00:00|
|5 |e |2013-11-04 05:00:00|
|--|------|-------------------|
I'm trying to write a query that would return all records created between 08:00:00 and 15:00:00. The expected result would be:
|--|------|-------------------|
|id|letter| created_at |
|--|------|-------------------|
|1 |a |2013-10-30 10:00:00|
|3 |c |2013-11-01 14:00:00|
|--|------|-------------------|
What would a query look like to achieve this result? I'm familiar with how to use BETWEEN to get dates but not how to focus on times specifically. Thanks.

Alternatively, extract a native TIME value from your datetime field, and compare time values directly:
SELECT *
FROM yourtable
WHERE TIME(created_at) BETWEEN '08:00:00' AND '15:00:00'
MySQL has a very comprehensive set of date/time manipulation functions available here: http://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html
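
To see the filter in action on the sample rows, here is a minimal sketch using Python's built-in sqlite3 module. SQLite's `time()` function plays the role of MySQL's TIME() here, and the table name `records` is an assumption, since the question does not name the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, letter TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        (1, "a", "2013-10-30 10:00:00"),
        (2, "b", "2013-10-31 18:00:00"),
        (3, "c", "2013-11-01 14:00:00"),
        (4, "d", "2013-11-03 23:00:00"),
        (5, "e", "2013-11-04 05:00:00"),
    ],
)

# time() drops the date part, so the comparison ignores which day the row is from.
daytime = conn.execute(
    """
    SELECT id, letter, created_at
    FROM records
    WHERE time(created_at) BETWEEN '08:00:00' AND '15:00:00'
    ORDER BY id
    """
).fetchall()

for row in daytime:
    print(row)
```

Note that BETWEEN is inclusive on both ends, so a row created at exactly 15:00:00 would be returned.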

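EDIT: Forgot this was MySQL.
You can use the EXTRACT function to pull out the hour:
SELECT id, letter, created_at
FROM yourtable
-- hours 8 through 14 cover 08:00:00 to 14:59:59; a row at exactly 15:00:00 is not matched
WHERE EXTRACT(HOUR FROM created_at) >= 8 AND EXTRACT(HOUR FROM created_at) < 15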
