I have the following columns in a Hive table. All the columns are of datatype string. Each row is distinct as the other column( 7 or 8 more columns) values have at least one unique value in it. I want to write a Hive query to select the records where datetime >= 2017-05 and drop records where datetime < 2017-05. Here the output of this should be rows with orderid - 101, 102, 103. All records with orderid 100 should be dropped. Note: Orderid 100 has 1 record with datetime > 2017-05. Still it should be dropped as it has atleast 1 record with datetime < 2017-05. Orderid could be any 12-16 digit number. The table has billions of records.
Can someone help to write a hive query for this? Thanks in advance.
datetime orderid other columns
2017-04-30 17:10:05 100
2017-03-05 12:25:30 100
2017-05-09 08:18:44 100
2017-05-15 04:21:43 101
2017-06-20 11:20:10 101
2017-05-22 05:09:35 102
2017-07-01 06:25:30 102
2017-06-25 08:24:40 103
2017-05-11 11:50:49 103
Output Result:
datetime orderid other columns
2017-05-15 04:21:43 101
2017-06-20 11:20:10 101
2017-05-22 05:09:35 102
2017-07-01 06:25:30 102
2017-06-25 08:24:40 103
2017-05-11 11:50:49 103
select *
from (select *
,min(datetime) over (partition by orderid) as min_datetime
from mytable
) t
where min_datetime >= '2017-05'
;
select *
from (select min(datetime) as date_time, orderid from mytable group by orderid)t
where date_time>='2017-05%';
Related
I have a table with some records about fuel consumption. The important columns in the table are: CONSUME_DATE_FROM and CONSUM_DATE_TO.
I want to calculate average fuel consumption per cars on a monthly basis but some rows are not in the same month. For example some have a three month difference between them and the total of gas per litre is aggregated in a single row.
Now I should find records that have difference more than a month between CONSUME_DATE_FROM and CONSUM_DATE_TO, and duplicate them in current or second table per count of month and divide the total gas per litre between related rows.
I've this table with the following data:
ID VehicleId CONSUME_DATE_FROM CONSUM_DATE_TO GAS_PER_LITER
1 100 2018-10-25 2018-12-01 600
2 101 2018-07-19 2018-07-24 100
3 102 2018-12-31 2019-01-01 400
4 103 2018-03-29 2018-05-29 200
5 104 2018-02-05 2018-02-09 50
The expected output table should be as below
ID VehicleId CONSUME_DATE_FROM CONSUM_DATE_TO GAS_PER_LITER
1 100 2018-10-25 2018-12-01 200
1 100 2018-10-25 2018-12-01 200
1 100 2018-10-25 2018-12-01 200
2 101 2018-07-19 2018-07-24 100
3 102 2018-12-31 2019-01-01 200
3 102 2018-12-31 2019-01-01 200
4 103 2018-03-29 2018-05-29 66.66
4 103 2018-03-29 2018-05-29 66.66
4 103 2018-03-29 2018-05-29 66.66
5 104 2018-02-05 2018-02-09 50
Or as below
ID VehicleId CONSUME_DATE_FROM CONSUM_DATE_TO GAS_PER_LITER DATE_RELOAD_GAS
1 100 2018-10-25 2018-12-01 200 2018-10-01
1 100 2018-10-25 2018-12-01 200 2018-11-01
1 100 2018-10-25 2018-12-01 200 2018-12-01
2 101 2018-07-19 2018-07-24 100 2018-07-01
3 102 2018-12-31 2019-01-01 200 2018-12-01
3 102 2018-12-31 2019-01-01 200 2019-01-01
4 103 2018-03-29 2018-05-29 66.66 2018-03-01
4 103 2018-03-29 2018-05-29 66.66 2018-04-01
4 103 2018-03-29 2018-05-29 66.66 2018-05-01
5 104 2018-02-05 2018-02-09 50 2018-02-01
Can someone please help me out with this query?
I'm using oracle database
Your business rule treats the difference between CONSUME_DATE_FROM and CONSUM_DATE_TO as absolute months. So you expect the difference between 2018-10-25 and 2018-12-01 to be three months whereas the difference in days actually equates to about 1.1 months. So we can't use simple date arithmetic to get your desired output, we need to do some additional massaging of the dates.
The query below implements your desired logic by deriving the first day of the month for CONSUME_DATE_FROM and the last day of the month for CONSUME_DATE_TO, then using ceil() to round the difference up to the nearest whole number of months.
This is calculated in a subquery which is used in the main query with the old connect by level trick to multiply a record by level number of times:
with cte as (
select f.*
, ceil(months_between(last_day(CONSUM_DATE_TO)
, trunc(CONSUME_DATE_FROM,'mm'))) as diff
from fuel_consumption f
)
select cte.id
, cte.VehicleId
, cte.CONSUME_DATE_FROM
, cte.CONSUM_DATE_TO
, cte.GAS_PER_LITER/cte.diff as GAS_PER_LITER
, add_months(trunc(cte.CONSUME_DATE_FROM, 'mm'), level-1) as DATE_RELOAD_GAS
from cte
connect by level <= cte.diff
and prior cte.id = cte.id
and prior sys_guid() is not null
;
"what about if add a additional column "DATE_RELOAD_GAS" that display difference date for similar rows"
From your posted sample it seems like DATE_RELOAD_GAS is the first day of the month for each month bounded by CONSUME_DATE_FROM and CONSUM_DATE_TO. I have amended my solution to implement this rule.
By using connect by level structure with considering to_char(c.CONSUME_DATE_FROM + level - 1,'yyyymm') as month I was able to resolve as below :
select ID, VehicleId, myMonth, CONSUME_DATE_FROM, CONSUM_DATE_TO,
trunc(GAS_PER_LITER/max(rn) over (partition by ID order by ID),2) as GAS_PER_LITER,
'01.'||substr(myMonth,5,2)||'.'||substr(myMonth,1,4) as DATE_RELOAD_GAS
from
(
with consumption( ID, VehicleId, CONSUME_DATE_FROM, CONSUM_DATE_TO, GAS_PER_LITER ) as
(
select 1,100,date'2018-10-25',date'2018-12-01',600 from dual union all
select 2,101,date'2018-07-19',date'2018-07-24',100 from dual union all
select 3,102,date'2018-12-31',date'2019-01-01',400 from dual union all
select 4,103,date'2018-03-29',date'2018-05-29',200 from dual union all
select 5,104,date'2018-02-05',date'2018-02-09', 50 from dual
)
select ID, to_char(c.CONSUME_DATE_FROM + level - 1,'yyyymm') myMonth,
VehicleId, c.CONSUME_DATE_FROM, c.CONSUM_DATE_TO, GAS_PER_LITER,
row_number() over (partition by ID order by ID) as rn
from dual join consumption c
on c.ID >= 2
group by ID, to_char(c.CONSUME_DATE_FROM + level - 1,'yyyymm'), VehicleId,
c.CONSUME_DATE_FROM, c.CONSUM_DATE_TO, c.GAS_PER_LITER
connect by level <= c.CONSUM_DATE_TO - c.CONSUME_DATE_FROM + 1
union all
select ID, to_char(c.CONSUME_DATE_FROM + level - 1,'yyyymm') myMonth,
VehicleId, c.CONSUME_DATE_FROM, c.CONSUM_DATE_TO, GAS_PER_LITER,
row_number() over (partition by ID order by ID) as rn
from dual join consumption c
on c.ID = 1
group by ID, to_char(c.CONSUME_DATE_FROM + level - 1,'yyyymm'), VehicleId,
c.CONSUME_DATE_FROM, c.CONSUM_DATE_TO, c.GAS_PER_LITER
connect by level <= c.CONSUM_DATE_TO - c.CONSUME_DATE_FROM + 1
) q
group by ID, VehicleId, myMonth, CONSUME_DATE_FROM, CONSUM_DATE_TO, GAS_PER_LITER, rn
order by ID, myMonth;
I met an interesting issue that if I consider the join condition in the subquery as c.ID >= 1 query hangs on for huge period of time, so splitted into two parts by union all
as c.ID >= 2 and c.ID = 1
Rextester Demo
First, thanks for your time and your help!
I have two tables:
Table 1
PersId name lastName city
---------------------------------------
1 John Smith Tirana
2 Leri Nice Tirana
3 Adam fortsan Tirana
Table 2
Id PersId salesDate
--------------------------------------------
1 1 2017-01-22 08:00:40 000
2 2 2017-01-22 09:00:00 000
3 1 2017-01-22 10:00:00 000
4 1 2017-01-22 20:00:00 000
5 3 2017-01-15 09:00:00 000
6 1 2017-01-21 09:00:00 000
7 1 2017-01-21 10:00:00 000
8 1 2017-01-21 18:55:00 000
I would like to see the first recent sales between two dates according to each city for each day I want to bring it empty if I do not have a sale
SalesDate > '2017-01-17 09:00:00 000'
and SalesDate < '2017-01-23 09:00:00 000'
Table 2, id = 5 because the record is not in the specified date range
If I wanted my results to look like
Id PersId MinSalesDate MaxSalesDate City
-----------------------------------------------------------------------------
1 1 2017-01-22 08:00:40 000 2017-01-22 20:00:00 000 Tirana
2 2 2017-01-22 09:00:00 000 null Tirana
3 3 null null Tirana
4 1 2017-01-21 09:00:00 000 2017-01-21 18:55:00 000 Tirana
You dont identify how to get ID in the result. You appear to just want Row_Number(). I will leave that out, but this should get you started. You may have to work out conversion issues in the data range check, and I havent checked the query for syntax errors, I will leave that to you.
Select T1.PersId, City
, Min(T2.salesDate) MinSalesDate
, Max(T2.salesDate) MaxSalesDate
From Table1 T1
Left Join Table2 T2
On T1.PersId = T2.PersId
And T2.salesDate Between '2017-01-17 09:00:00 000' And < '2017-01-23 09:00:00 000'
Group BY T1.PersId, T2.City
Try the following using row_number to get min and max sale dates:
SELECT
T2.Id, T1.PersId, T2.MIN_salesDate, T2.MAX_salesDate, T1.City
FROM Table1 T1
LEFT JOIN
(
SELECT MIN(Id) as Id, PersId, MIN(salesDate) as MIN_salesDate, MAX(salesDate) as MAX_salesDate
FROM
(
SELECT
*
,ROW_NUMBER() OVER (PARTITION BY PersId ORDER BY salesDate ASC) as RNKMIN
,ROW_NUMBER() OVER (PARTITION BY PersId ORDER BY salesDate DESC) as RNKMAX
FROM Table2 T2
WHERE salesDate Between '2017-01-17 09:00:00 000' And '2017-01-23 09:00:00 000'
) temp
WHERE RNKMIN = 1 or RNKMAX = 1
GROUP BY PersId
) T2
on T1.PersId = T2.PersId
We have three ProductIDs and their prices are changing by Date
ProductID Price Date
1 100 2016-06-01
2 50 2016-06-05
3 10 2016-06-10
2 60 2016-06-15
1 110 2016-06-20
3 20 2016-06-25
How to select only last updated Price per each ProductID until 2016-06-20 inclusive to get the next output:
ProductID Price
3 10
2 60
1 110
I searched for answer but could't find the specific one.
Have a sub-query that returns each ProductID's last date (until 2016-06-20). Join with that result:
select t1.*
from tablename t1
join (select ProductID, max(Date) Date
from tablename
where Date <= '2016-06-20'
group by ProductID) t2
on t1.ProductID = t2.ProductID and t1.Date = t2.Date
ANSI SQL has date as reserved word, so you may need to delimit that column as "Date". Also ANSI SQL has date literals as date '2016-06-20'.
I have to modify a big pricelist table so that there is only one valid price for every article.
Sometimes the sales employees insert new prices and forgot to change the old infinite validTo dates.
So I have to write a sql-query to change all validTo dates to the next validFrom date minus one day, when the validTo date has infinite validity (9999-12-31).
But I have no idea how can i reach this with only SQL (Oracle 12).
anr price validFrom validTo
1 447.1 2015-06-01 9999-12-31 <
1 447.2 2015-06-16 2015-06-16
1 447.3 2015-06-17 2015-06-17
1 447.4 2015-06-22 2015-06-22
1 447.5 2015-07-06 9999-12-31 <
1 395.0 2015-07-20 2015-07-20
1 447.6 2015-08-03 9999-12-31 <
1 447.7 2015-08-17 9999-12-31 <
1 447.8 2015-08-24 9999-12-31 <
1 395.0 2015-09-07 2015-09-07
1 450.9 2015-11-15 9999-12-31 < no change because it is the last entry
after updating the the table, the result should look like
anr price validFrom validTo
1 447.1 2015-06-01 2015-06-15 <
1 447.2 2015-06-16 2015-06-16
1 447.3 2015-06-17 2015-06-17
1 447.4 2015-06-22 2015-06-22
1 447.5 2015-07-06 2015-07-19 <
1 395.0 2015-07-20 2015-07-20
1 447.6 2015-08-03 2015-08-16 <
1 447.7 2015-08-17 2015-08-23 <
1 447.8 2015-08-24 2015-09-06 <
1 395.0 2015-09-07 2015-09-07
1 450.9 2015-11-15 9999-12-31 <
In order to update an end date you can simply select the minimum of all higher start dates.
update mytable upd
set enddate = coalesce(
(
select min(startdate) - 1
from mytable later
where later.startdate > upd.startdate
and later.anr = upd.anr -- same product
), date'9999-12-31') -- coalesce for the case there is no later record
where enddate = date'9999-12-31';
I have taken anr to be the product id. If it isn't then change the statement accordingly.
Oracle provides an analytic function LEAD that references the current-plus-n-th record given a sort criterion. This function may serve the purpose of selecting the proper date value in an update statement as follows ( let test_prices be the table name, ppk its PK ):
update test_prices p
set p.validTo = (
select ps.vtn
from (
select lead ( p1.validFrom, 1 ) over ( order by p1.validFrom ) - 1 vtn
, ppk
from test_prices p1
) ps
where ps.ppk = p.ppk
)
where to_char(p.validTo, 'YYYY') = '9999'
and p.validFrom != ( select max(validFrom) from test_prices )
;
UPDATE VALID_DATES v
SET validTo = (
SELECT validTo
FROM (
SELECT anr,
validFrom,
COALESCE(
LEAD( validFrom - 1, 1 ) OVER ( PARTITION BY anr ORDER BY validFrom ),
validTo
) AS validTo
FROM valid_dates
) u
WHERE v.anr = u.anr
AND v.validFrom = u.validFrom
)
WHERE validTo = DATE '9999-12-31';
There are two possibilities:
1. Explicit time spans
price validFrom validTo
90.99 2016-01-01 9999-12-31
80.00 2016-01-16 2016-01-17
The first price would be valid both before January 16 and after January 17, whereas the second price was only valid on two days in January.
It would then be a very bad idea to change the first validTo.
2. Implicit time spans
price validFrom
90.99 2016-01-01
80.00 2016-01-16
90.99 2016-01-18
This data represents the same as in the explicit time spans example. The first price is valid before January 16, then the second price is valid until January 17, and afterwards the next price (which equals the first price again) is valid. Here you don't need an EndDate, because it's implicit. Of course the first price is only valid until January 15, because from January 16 there is another price valid (record #2).
So: Either remove the EndDate column completely or let it untouched. Don't simply update it, as you have intended. If you updated your records to next date minus one, you would actually hold data redundantly, which might lead to problems later.
I have this in my table
TempTable
Id Date
1 1-15-2010
2 2-14-2010
3 3-14-2010
4 4-15-2010
i would like to change every record so that they have all same day, that is the 15th
like this
TempTable
Id Date
1 1-15-2010
2 2-15-2010 <--change to 15
3 3-15-2010 <--change to 15
4 4-15-2010
what if i like on the 30th?
the records should be
TempTable
Id Date
1 1-30-2010
2 2-28-2010 <--change to 28 because feb has 28 days only
3 3-30-2010 <--change to 30
4 4-30-2010
thanks
You can play some fun tricks with DATEADD/DATEDIFF:
create table T (
ID int not null,
DT date not null
)
insert into T (ID,DT)
select 1,'20100115' union all
select 2,'20100214' union all
select 3,'20100314' union all
select 4,'20100415'
SELECT ID,DATEADD(month,DATEDIFF(month,'20100101',DT),'20100115')
from T
SELECT ID,DATEADD(month,DATEDIFF(month,'20100101',DT),'20100130')
from T
Results:
ID
----------- -----------------------
1 2010-01-15 00:00:00.000
2 2010-02-15 00:00:00.000
3 2010-03-15 00:00:00.000
4 2010-04-15 00:00:00.000
ID
----------- -----------------------
1 2010-01-30 00:00:00.000
2 2010-02-28 00:00:00.000
3 2010-03-30 00:00:00.000
4 2010-04-30 00:00:00.000
Basically, in the DATEADD/DATEDIFF, you specify the same component to both (i.e. month). Then, the second date constant (i.e. '20100130') specifies the "offset" you wish to apply from the first date (i.e. '20100101'), which will "overwrite" the portion of the date your not keeping. My usual example is when wishing to remove the time portion from a datetime value:
SELECT DATEADD(day,DATEDIFF(day,'20010101',<date column>),'20100101')
You can also try something like
UPDATE TempTable
SET [Date] = DATEADD(dd,15-day([Date]), DATEDIFF(dd,0,[Date]))
We have a function that calculates the first day of a month, so I just addepted it to calculate the 15 instead...