group by a substring in field

I have a table which looks like this:
column 1 = timestamp : string, column 2 = numOfentites : int
Please note I am using HiveQL.
Fri, 10 Aug 2001 274
Fri, 10 Dec 1999 39
Fri, 10 Mar 2000 107
Fri, 10 May 2002 26
Fri, 10 Nov 2000 351
Fri, 10 Sep 1999 22
Fri, 11 Aug 2000 189
Fri, 11 Dec 1998 1
Fri, 11 Feb 2000 84
Fri, 11 Jan 2002 580
Fri, 11 Jun 1999 12
Fri, 11 May 2001 571
Fri, 12 Apr 2002 41
I retrieved the frequency per year from this table and found that some year XXXX had the largest number of entities.
My aim now is to go one level deeper and extract the frequency per month for the year XXXX.
I tried using the GROUP BY clause on the substring indicating the month, but it doesn't work.
Can you please give me a direction on how to proceed?
I just need a hint, not the answer :P I'm trying to learn HiveQL here.
EDIT
Here is the query that I used to extract the frequency of entities on a yearly basis.
Note that timestamp is the first column of the input.
select dates, count(dates) as numEmails
from (
    select split(timestamp, " ")[3] as dates, count(timestamp)
    from dataset
    group by timestamp
) mailfreq
group by dates
order by numEmails desc;

I know that HiveQL has some strange limitations, but won't this work?
select split(timestamp," ")[3] as yr, split(timestamp," ")[2] as mon, count(timestamp)
from dataset
group by split(timestamp," ")[3], split(timestamp," ")[2];
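
To see what the grouped query should return, here is a minimal Python sketch of the same idea: split the timestamp on spaces, take element 3 as the year and element 2 as the month, and count rows per (year, month) pair. The rows are the sample from the question; the function name count_by_year_month is made up for illustration, and it counts rows the way count(timestamp) does rather than summing numOfentites.

```python
from collections import Counter

# Sample rows from the question: (timestamp, numOfentites)
rows = [
    ("Fri, 10 Aug 2001", 274),
    ("Fri, 10 Dec 1999", 39),
    ("Fri, 10 Mar 2000", 107),
    ("Fri, 10 May 2002", 26),
    ("Fri, 10 Nov 2000", 351),
    ("Fri, 10 Sep 1999", 22),
    ("Fri, 11 Aug 2000", 189),
    ("Fri, 11 Dec 1998", 1),
    ("Fri, 11 Feb 2000", 84),
    ("Fri, 11 Jan 2002", 580),
    ("Fri, 11 Jun 1999", 12),
    ("Fri, 11 May 2001", 571),
    ("Fri, 12 Apr 2002", 41),
]

def count_by_year_month(rows):
    """Count rows per (year, month), like GROUP BY on two split() elements."""
    counts = Counter()
    for timestamp, _ in rows:
        parts = timestamp.split(" ")  # ["Fri,", "10", "Aug", "2001"]
        counts[(parts[3], parts[2])] += 1
    return counts

counts = count_by_year_month(rows)
for (yr, mon), n in sorted(counts.items()):
    print(yr, mon, n)
```

Restricting the result to one year is then a filter on the first element of the key, which corresponds to a WHERE clause on the year expression in the query.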

Related

Identify if date is the last date for any given group?

I have a table that is structured like the below - this contains details about all customer subscriptions and when they start/end.
SubKey  CustomerID  Status  StartDate    EndDate
29333   102         7       01 jan 2013  1 Jan 2014
29334   102         6       7 Jun 2013   15 Jun 2022
29335   144         6       10 jun 2021  17 jun 2022
29336   144         2       8 oct 2023   10 oct 2025
I am trying to add an indicator flag to this table (either "Yes" or "No") which shows, for each row, whether the [EndDate] of the SubKey is the last one for that CustomerID. So for the above example:
SubKey  CustomerID  Status  StartDate    EndDate      IsLast
29333   102         7       01 jan 2013  1 Jan 2014   No
29334   102         6       7 Jun 2013   15 Jun 2022  Yes
29335   144         6       10 jun 2021  17 jun 2022  Yes
29336   144         2       8 oct 2023   10 oct 2025  Yes
The flag is set to No for the first row, because on 1 Jan 2014, customerID 102 had another SubKey (29334) still active at the time (which didn't end until 15 jun 2022)
The rest of the rows are set to "Yes" because these were the last active subscriptions per CustomerID.
I have been reading about the LAG function which may be able to help. I am just not sure how to make it fit in this scenario.

Probably the easiest method would be to use EXISTS with a correlated subquery. Can you try the following? It flags a row 'No' only when another subscription for the same customer overlaps it and ends later:
select *,
    case when exists (
        select * from t t2
        where t2.customerId = t.customerId
          and t2.enddate > t.enddate
          and t2.startDate < t.Enddate
    ) then 'No' else 'Yes' end as IsLast
from t;
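
As a quick check of that query against the sample rows, here is a small sketch that runs it in an in-memory SQLite database from Python. SQLite and the ISO date strings are assumptions made for the sake of a runnable example (the question does not say which database is used); ISO strings are chosen so that plain string comparison orders the dates correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t (SubKey int, CustomerID int, Status int, StartDate text, EndDate text)"
)
# Sample rows from the question, with dates rewritten as ISO strings (assumption).
conn.executemany(
    "insert into t values (?, ?, ?, ?, ?)",
    [
        (29333, 102, 7, "2013-01-01", "2014-01-01"),
        (29334, 102, 6, "2013-06-07", "2022-06-15"),
        (29335, 144, 6, "2021-06-10", "2022-06-17"),
        (29336, 144, 2, "2023-10-08", "2025-10-10"),
    ],
)

# Same logic as the answer: 'No' if another row of the same customer
# starts before this row ends and ends after it.
query = """
select t.SubKey,
       case when exists (
           select 1 from t t2
           where t2.CustomerID = t.CustomerID
             and t2.EndDate > t.EndDate
             and t2.StartDate < t.EndDate
       ) then 'No' else 'Yes' end as IsLast
from t
order by t.SubKey
"""
is_last = dict(conn.execute(query).fetchall())
print(is_last)
```

The condition on StartDate is what keeps SubKey 29335 at 'Yes': customer 144 has a later subscription, but it starts after 29335 has already ended, so there is no overlap.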

Calculation of values that rely on a date variable

I am trying to calculate the value of the last measurement taken (according to the date column) divided by the lowest value recorded (according to the measurement column) if two values in the “SUBJECT” column match and two values in the “PROCEDURE” column match. The calculation would be produced in a new column. I am having trouble with this and would appreciate a solution.
data Have;
input Subject Type :$12. Date &:anydtdte. Procedure :$12. Measurement;
format date yymmdd10.;
datalines;
500 Initial 15 AUG 2017 Invasive 20
500 Initial 15 AUG 2017 Surface 35
500 Followup 15 AUG 2018 Invasive 54
428 Followup 15 AUG 2018 Outer 29
765 Seventh 3 AUG 2018 Other 13
500 Followup 3 JUL 2018 Surface 98
428 Initial 3 JUL 2017 Outer 10
765 Initial 20 JUL 2019 Other 19
610 Third 20 AUG 2019 Invasive 66
610 Initial 17 Mar 2018 Invasive 17
;
*Intended output table
Subject Type Date Procedure Measurement Output
500 Initial 15 AUG 2017 Invasive 20 20/20
500 Initial 15 AUG 2017 Surface 35 35/35
500 Followup 15 AUG 2018 Invasive 54 54/20
428 Followup 15 AUG 2018 Outer 29 29/10
765 Seventh 3 AUG 2018 Other 13 13/19
500 Followup 3 JUL 2018 surface 98 98/35
428 Initial 3 JUL 2017 Outer 10 10/10
765 Initial 20 JUL 2019 Other 19 19/19
610 Third 20 AUG 2019 Invasive 66 66/17
610 Initial 17 Mar 2018 Invasive 17 17/17 ;
*Attempt;
PROC SQL;
create table want as
select a.*,
(select measurement as measurement_last_date
from have
where subject = a.subject and type = a.type
having date = max(date)) / min(a.measurement) as ratio
from have as a
group by subject, type
order by subject, type, date;
QUIT;

I think that you need to use the RETAIN statement in a DATA step.
RETAIN keeps a variable's value from the previous row, so you can compare the previous row with the row currently being processed.
Tutorials on the RETAIN statement and the SAS documentation cover how to use it.
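
To pin down what the intended output table seems to ask for, here is a minimal Python sketch (not SAS, and not the RETAIN approach). It rests on an assumption read off the output table rather than the prose: every row's measurement is divided by the measurement on the "Initial" row for the same Subject and Procedure (for example 13/19 for subject 765, where 19 is neither the lowest nor the earliest value). The names baseline and want are only illustrative, and dates are written as ISO strings.

```python
# Rows from the question: (subject, type, date, procedure, measurement)
have = [
    (500, "Initial", "2017-08-15", "Invasive", 20),
    (500, "Initial", "2017-08-15", "Surface", 35),
    (500, "Followup", "2018-08-15", "Invasive", 54),
    (428, "Followup", "2018-08-15", "Outer", 29),
    (765, "Seventh", "2018-08-03", "Other", 13),
    (500, "Followup", "2018-07-03", "Surface", 98),
    (428, "Initial", "2017-07-03", "Outer", 10),
    (765, "Initial", "2019-07-20", "Other", 19),
    (610, "Third", "2019-08-20", "Invasive", 66),
    (610, "Initial", "2018-03-17", "Invasive", 17),
]

# Assumed denominator: the "Initial" measurement per (subject, procedure).
baseline = {
    (subject, procedure): measurement
    for subject, rtype, _, procedure, measurement in have
    if rtype == "Initial"
}

# Append the ratio as a new last column on every row.
want = [row + (row[4] / baseline[(row[0], row[3])],) for row in have]
for row in want:
    print(row)
```

If the denominator should instead be the lowest measurement per group, as the description says, only the construction of baseline changes; the join back to each row stays the same.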

Showing data order from Monday-Sunday full week only and hide non-full week data

Sorry if I'm asking newbie questions here.
I want to create a weekly report, and it should only contain full weeks of data, from Monday to Sunday.
Conditions:
- Last 4 weeks only
- Show full weeks (Monday - Sunday)
- Hide the result if it's not a full week
If I use getdate() - 14 and access the data on a Wednesday, the count starts from the Wednesday two weeks ago instead of from a Monday. I want the report to show full weeks only.
Can anyone share how to do that in SQL?
Here I provide sample data:
Column name = DATE -- Column name: TOTAL_PERSON
- Fri, 1 Jun 2018 -- 10
- Sat, 2 Jun 2018 -- 4
- Sun, 3 Jun 2018 -- 12
- Mon, 4 Jun 2018 -- 15
- Tue, 5 Jun 2018 -- 10
- Wed, 6 Jun 2018 -- 3
- Thu, 7 Jun 2018 -- 1
- Fri, 8 Jun 2018 -- 13
- Sat, 9 Jun 2018 -- 9
- Sun, 10 Jun 2018 -- 23
- Mon, 11 Jun 2018 -- 5
- Tue, 12 Jun 2018 -- 3
- Wed, 13 Jun 2018 -- 1
- Thu, 14 Jun 2018 -- (TODAY)
In this case, if I am accessing the data on Thu, 14 Jun 2018, I want to get TOTAL_PERSON data from Mon, 4 Jun 2018 to Sun, 10 Jun 2018 only, and not show the rest of the data since those weeks are not full.
Can anyone help me with how to do that?
Thanks a lot!

I think you want:
where datediff(week, date, getdate()) <= 2
DATEDIFF(week, ...) counts the week boundaries crossed between the two dates rather than elapsed 7-day periods, so rows are selected by whole calendar weeks.

For MySQL, you can use a select like this:
SELECT * FROM `myDB` WHERE `Date`
BETWEEN DATE_SUB(NOW()-INTERVAL DATE_FORMAT(CURRENT_DATE, '%w') DAY, INTERVAL 28 DAY)
AND NOW()- INTERVAL DATE_FORMAT(CURRENT_DATE, '%w') DAY
This converts the current day of the week into a number and subtracts that many days from today to get the last Sunday. From there, we select an interval of 28 days.
(Only tested with 14 days and a very limited test dataset, but it should work.)
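
The date arithmetic behind "full weeks only" can be sketched in Python, independent of the SQL dialect. This is one possible reading of the requirement, not the method from either answer: take the Monday of the current week, look back four weeks from it, and keep only Monday-Sunday weeks that have a row for all seven days. The function name full_week_totals and its parameters are made up for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta

def full_week_totals(rows, today, weeks=4):
    """Return {monday: total} for complete Mon-Sun weeks before today's week."""
    this_monday = today - timedelta(days=today.weekday())  # Monday of the current week
    earliest = this_monday - timedelta(weeks=weeks)        # Monday `weeks` weeks earlier
    by_week = defaultdict(dict)
    for day, total in rows:
        if earliest <= day < this_monday:
            monday = day - timedelta(days=day.weekday())
            by_week[monday][day] = total
    # Keep only weeks that have a row for all seven days.
    return {
        monday: sum(days.values())
        for monday, days in sorted(by_week.items())
        if len(days) == 7
    }

# Sample data from the question: 1 Jun 2018 through 13 Jun 2018.
totals = [10, 4, 12, 15, 10, 3, 1, 13, 9, 23, 5, 3, 1]
rows = [(date(2018, 6, 1) + timedelta(days=i), t) for i, t in enumerate(totals)]

result = full_week_totals(rows, today=date(2018, 6, 14))
print(result)
```

With the sample data and Thu, 14 Jun 2018 as today, only the week starting Mon, 4 Jun 2018 survives: the week of 28 May has data for three days only, and the current week is excluded because it has not finished.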

How to change start date in a table to a pair of start date and end date using SQL

The title must be confusing, but the thing I am trying to do is very easy to understand with an example. I have a table like this:
Code Date_ Ratio
73245 Jan 1 1975 12:00AM 10
73245 Apr 18 2006 12:00AM 4
73245 Dec 26 2007 12:00AM 10
73245 Jan 30 2009 12:00AM 4
73245 Apr 21 2011 12:00AM 2
Basically for each security it gives some ratio for it with a date when the ratio starts to be effective. This table will be much easier to use if instead of just having a start date, it has a pair of start date and end date, like the following:
Code StartDate_ EndDate_ Ratio
73245 Jan 1 1975 12:00AM Apr 18 2006 12:00AM 10
73245 Apr 18 2006 12:00AM Dec 26 2007 12:00AM 4
73245 Dec 26 2007 12:00AM Jan 30 2009 12:00AM 10
73245 Jan 30 2009 12:00AM Apr 21 2011 12:00AM 4
73245 Apr 21 2011 12:00AM Dec 31 2049 12:00AM (or some random date in the far future) 2
How do I transform the original table to the table I want using SQL statements? I have little experience with SQL and I could not figure how.
Please help! Thanks!

In SQL Server 2012:
SELECT  code,
        date_ AS startDate,
        LEAD(date_) OVER (PARTITION BY code ORDER BY date_) AS endDate,
        ratio
FROM    mytable
In SQL Server 2005 and 2008:
WITH q AS
(
    SELECT  *,
            ROW_NUMBER() OVER (PARTITION BY code ORDER BY date_) AS rn
    FROM    mytable
)
SELECT  q1.code, q1.date_ AS startDate, q2.date_ AS endDate, q1.ratio
FROM    q q1
LEFT JOIN
        q q2
ON      q2.code = q1.code
        AND q2.rn = q1.rn + 1

Maybe it would also be possible to use OUTER APPLY, something like:
SELECT t1o.Code, t1o.Date_ AS StartDate_, ISNULL(t2.EndDate_, CAST('20491231' AS DATETIME)) AS EndDate_, t1o.Ratio
FROM t1 AS t1o
OUTER APPLY
(
    SELECT TOP 1 Date_ AS EndDate_
    FROM t1
    WHERE t1.Code = t1o.Code AND t1.Date_ > t1o.Date_
    ORDER BY t1.Date_ ASC
) AS t2
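
The "next start date becomes this row's end date" idea can be tried out on the sample rows with a short sketch that uses an in-memory SQLite database from Python. SQLite is an assumption made only so the example runs anywhere; it has no OUTER APPLY, so the sketch swaps in a correlated subquery with MIN, which picks the same next date as the TOP 1 ... ORDER BY in the answer. Dates are stored as ISO strings so they compare correctly, and '2049-12-31' is the far-future placeholder from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table mytable (code int, date_ text, ratio int)")
conn.executemany(
    "insert into mytable values (?, ?, ?)",
    [
        (73245, "1975-01-01", 10),
        (73245, "2006-04-18", 4),
        (73245, "2007-12-26", 10),
        (73245, "2009-01-30", 4),
        (73245, "2011-04-21", 2),
    ],
)

# For each row, the end date is the earliest later start date for the same
# code; the last row of each code falls back to the far-future placeholder.
query = """
select s.code,
       s.date_ as start_date,
       coalesce(
           (select min(e.date_) from mytable e
            where e.code = s.code and e.date_ > s.date_),
           '2049-12-31'
       ) as end_date,
       s.ratio
from mytable s
order by s.code, s.date_
"""
pairs = conn.execute(query).fetchall()
for row in pairs:
    print(row)
```

On large tables the window-function version with LEAD is usually the better fit, since the subquery here is evaluated once per row.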

MDX: Aggregates over a set

What I am trying to achieve looks very simple, yet I cannot make it work.
My facts are orders which have a date, and I have a typical time dimension with the 'Month' and 'Year' levels.
I would like to get an output which lists the number of orders for the last 6 months and the total, like this:
Oct 2009 20
Nov 2009 30
Dec 2009 25
Jan 2010 15
Feb 2010 45
Mar 2010 5
Total 140
I can create the set with the members Oct 2009 until Mar 2010 and I manage to get this part of my desired output:
Oct 2009 20
Nov 2009 30
Dec 2009 25
Jan 2010 15
Feb 2010 45
Mar 2010 5
I just fail to get the total line.

You can achieve this by adding the ALL member to the set and then wrapping it all in the VisualTotals() function:
SELECT
... on COLUMNS,
VISUALTOTALS (
{[Month].[Month].[Oct 2009]:[Month].[Month].[Mar 2010]
, [Month].[Month].[All] }
) ON ROWS
FROM <cube>

Here is one possible solution for the Adventure Works DW demo cube. The query selects the order counts for the last 6 months and adds a total on the date dimension:
WITH MEMBER [Date].[Calendar].[Last 6 Mth Order Count] AS
    Aggregate(
        ClosingPeriod([Date].[Calendar].[Month], [Date].[Calendar].[All Periods]).Lag(5)
        : ClosingPeriod([Date].[Calendar].[Month], [Date].[Calendar].[All Periods])
    )
SELECT {[Measures].[Order Count]} ON COLUMNS
, {ClosingPeriod([Date].[Calendar].[Month], [Date].[Calendar].[All Periods]).Lag(5)
    : ClosingPeriod([Date].[Calendar].[Month], [Date].[Calendar].[All Periods])
  , [Date].[Calendar].[Last 6 Mth Order Count]}
ON ROWS
FROM [Adventure Works]
