Hive add partition data change


If I have a regular table in Hive named employees that contains the following data:
/tab1/employeedata/file1
id, name, dept, year
1, gopal, TP, 2012
2, kiran, HR, 2012
3, kaleel, SC, 2013
4, Prasanth, SC, 2013
After I add a partition by year, Hive will create new files like these:
/tab1/employeedata/2012/file2
1, gopal, TP, 2012
2, kiran, HR, 2012
/tab1/employeedata/2013/file3
3, kaleel, SC, 2013
4, Prasanth, SC, 2013
My question is that after adding the partition, will my original file1 get deleted? If not, does that mean if I add several different partitions my data will be the original size times the number of partitions?
Thanks,

Related

SQL script to produce the output shown in the screenshot

I want to write a SQL script that produces the output shown in the screenshot. Thank you.
I've tried the MAX() function to aggregate the ESSBASE_MONTH field so that it is distinct and the output shows a single row per month instead of several. I have yet to figure out how to put 0 in any month in which an EMPID made no sales, such as December under "Total GreaterThan 24 HE Account" and "Total_HE_Accounts".

The table's field names are not very informative, but based on the screenshot, this is the best answer I could come up with.
Assuming the table name is SALES:
select
    ADJ_EMPID,
    ESSBASE_MONTH,
    MAX(YTD_COUNT) AS YTD_COUNT,
    SUM(TOTAL_24) AS TOTAL_24,
    SUM(TOTAL_ACC) AS TOTAL_ACC
from SALES
group by
    ADJ_EMPID,
    ESSBASE_MONTH
The above will aggregate the monthly 'sales' data as expected.
To add the 'missing' rows, such as December, you can union the SALES rows with a virtual table before aggregating:
select
    MAX(MONTH_NUMBER) AS MONTH_NUMBER,
    ADJ_EMPID,
    ESSBASE_MONTH,
    MAX(YTD_COUNT) AS YTD_COUNT,
    SUM(TOTAL_24) AS TOTAL_24,
    SUM(TOTAL_ACC) AS TOTAL_ACC
from (
    select
        1 as MONTH_NUMBER,
        *
    from SALES
    union all
    select * from (values
        (1, '300014366', 'January', 0, 0, 0),
        (2, '300014366', 'February', 0, 0, 0),
        -- add the other missing months as required
        (11, '300014366', 'November', 0, 0, 0),
        (12, '300014366', 'December', 0, 0, 0)
    ) TEMP_TABLE (MONTH_NUMBER, ADJ_EMPID, ESSBASE_MONTH, YTD_COUNT, TOTAL_24, TOTAL_ACC)
) as AGGREGATED_DATA
group by
    ADJ_EMPID,
    ESSBASE_MONTH
order by MONTH_NUMBER;
TEMP_TABLE is a virtual table that contains every month with all sales values set to zero. The extra MONTH_NUMBER field is there to sort the months in the proper order.
It is not the easiest query to understand, and the requirement is not exactly straightforward either.
Link to fiddledb for a working solution with PostgreSQL 15.
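As a rough illustration of the same zero-filling idea, here is a minimal sketch that runs against an in-memory SQLite database from Python rather than PostgreSQL. The sample rows, the MONTHS list, and the use of a temporary table in place of the inline VALUES list are assumptions made for the example; only the column names come from the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SALES (ADJ_EMPID TEXT, ESSBASE_MONTH TEXT, "
    "YTD_COUNT INTEGER, TOTAL_24 INTEGER, TOTAL_ACC INTEGER)"
)
# Hypothetical sample data: two January rows, one November row, nothing else.
conn.executemany(
    "INSERT INTO SALES VALUES (?, ?, ?, ?, ?)",
    [
        ("300014366", "January", 5, 2, 3),
        ("300014366", "January", 7, 1, 4),
        ("300014366", "November", 9, 0, 2),
    ],
)

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]
# One all-zero placeholder row per month, playing the role of TEMP_TABLE.
conn.execute(
    "CREATE TEMP TABLE TEMP_TABLE (MONTH_NUMBER INTEGER, ADJ_EMPID TEXT, "
    "ESSBASE_MONTH TEXT, YTD_COUNT INTEGER, TOTAL_24 INTEGER, TOTAL_ACC INTEGER)"
)
conn.executemany(
    "INSERT INTO TEMP_TABLE VALUES (?, ?, ?, 0, 0, 0)",
    [(number, "300014366", name) for number, name in enumerate(MONTHS, start=1)],
)

# Real rows get MONTH_NUMBER 0, so MAX() always picks the placeholder's number.
rows = conn.execute(
    """
    SELECT MAX(MONTH_NUMBER) AS MONTH_NUMBER,
           ADJ_EMPID,
           ESSBASE_MONTH,
           MAX(YTD_COUNT) AS YTD_COUNT,
           SUM(TOTAL_24) AS TOTAL_24,
           SUM(TOTAL_ACC) AS TOTAL_ACC
    FROM (
        SELECT 0 AS MONTH_NUMBER, * FROM SALES
        UNION ALL
        SELECT * FROM TEMP_TABLE
    )
    GROUP BY ADJ_EMPID, ESSBASE_MONTH
    ORDER BY MONTH_NUMBER
    """
).fetchall()

for row in rows:
    print(row)
```

Every month appears exactly once in the result, and months with no sales rows carry zeros because only the placeholder row contributes to them.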

SQL store results table with month name

I have several CSVs stored to query against. Each CSV represents a month of data. I would like to count all the records in each CSV and save each count as a row in a table. For instance, the table that represents May should return something that looks like this, with June following. The data starts in Feb 2018 and continues to Feb 2019, so a year value would be needed as well.
Month      Results
------------------
May 18     1170
June 18    1167
I want to run the same query against all the tables for efficiency. I also want the query to keep working with future updates, e.g. when a March 19 table gets added, the query should still work.
So far, I have this query.
SELECT COUNT(*)
FROM `months_data.*`
I am querying in Google BigQuery using Standard SQL.

It sounds like you just want an aggregation that counts rows for each month:
SELECT
    DATE_TRUNC(DATE(timestamp), MONTH) AS Month,
    COUNT(*) AS Results
FROM `dataset.*`
GROUP BY month
ORDER BY month
You can use the FORMAT_DATE function if you want to control the formatting.

You seem to need union all:
select 2018 as yyyy, 2 as mm, count(*) as num
from feb2018
union all
select 2018 as yyyy, 3 as mm, count(*)
from mar2018
union all
. . .
Note that you have a poor data model. You should be storing all the data in a single table with a date column.
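To make the single-table suggestion concrete, here is a minimal sketch using an in-memory SQLite database from Python instead of BigQuery. The table name records, the column record_date, and the sample dates are hypothetical; the point is only that one grouped query then covers every month, including months added later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical single table holding every month's rows, with a date column.
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, record_date TEXT)")
conn.executemany(
    "INSERT INTO records (record_date) VALUES (?)",
    [("2018-05-03",), ("2018-05-17",), ("2018-06-01",), ("2019-02-11",)],
)

# One query counts rows per year and month; no per-month UNION ALL is needed.
monthly_counts = conn.execute(
    """
    SELECT strftime('%Y', record_date) AS yyyy,
           strftime('%m', record_date) AS mm,
           COUNT(*) AS num
    FROM records
    GROUP BY yyyy, mm
    ORDER BY yyyy, mm
    """
).fetchall()

for yyyy, mm, num in monthly_counts:
    print(yyyy, mm, num)
```

Rows for a new month show up in the output as soon as they are inserted, with no change to the query.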

How do I average the last 6 months of sales within SQL based on period AND year?

How do I average the last 6 months of sales within SQL?
Here are my tables and fields:
IM_ItemWhseHistoryByPeriod.FISCALCALPERIOD,
IM_ItemWhseHistoryByPeriod.FISCALCALYEAR,
And I need to average these fields
IM_ItemWhseHistoryByPeriod.DOLLARSSOLD,
IM_ItemWhseHistoryByPeriod.QUANTITYSOLD,
The hard part for me is understanding how to average the last 6 whole months, i.e. FISCALCALPERIOD 2-6 (inside FISCALCALYEAR 2017).
I'm hoping for some help on what the SQL command text should look like since I'm very new to manipulating SQL outside of the UI.
Sample Data
My Existing SQL String:
SELECT IM_ItemWhseHistoryByPeriod.ITEMCODE,
IM_ItemWhseHistoryByPeriod.DOLLARSSOLD,
IM_ItemWhseHistoryByPeriod.QUANTITYSOLD,
IM_ItemWhseHistoryByPeriod.FISCALCALPERIOD,
IM_ItemWhseHistoryByPeriod.FISCALCALYEAR
FROM MAS_AME.dbo.IM_ItemWhseHistoryByPeriod
IM_ItemWhseHistoryByPeriod
ScaisEdge Attempt #1

If FISCALCALYEAR and FISCALCALPERIOD are numeric, you could use:
select avg(IM_ItemWhseHistoryByPeriod.DOLLARSSOLD),
    avg(IM_ItemWhseHistoryByPeriod.QUANTITYSOLD)
from IM_ItemWhseHistoryByPeriod
where IM_ItemWhseHistoryByPeriod.FISCALCALYEAR = 2017
    and IM_ItemWhseHistoryByPeriod.FISCALCALPERIOD between 2 and 6
Or, for each item code:
select itemcode, avg(IM_ItemWhseHistoryByPeriod.DOLLARSSOLD),
    avg(IM_ItemWhseHistoryByPeriod.QUANTITYSOLD)
from IM_ItemWhseHistoryByPeriod
where IM_ItemWhseHistoryByPeriod.FISCALCALYEAR = 2017
    and IM_ItemWhseHistoryByPeriod.FISCALCALPERIOD between 2 and 6
group by itemcode

Try the following solution and see if it works for you:
select avg(DOLLARSSOLD) as AvgDollarSold,
    avg(QUANTITYSOLD) as AvgQtySold
from IM_ItemWhseHistoryByPeriod
where FISCALCALYEAR = '2017'
    and FISCALCALPERIOD between 2 and 6
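The fixed ranges above only work while the six periods sit inside one fiscal year. One possible way to handle a window that crosses a year boundary is to turn year and period into a single running index. The sketch below does this in SQLite from Python; it assumes 12 periods per fiscal year and numeric columns, and the sample rows and the end_year/end_period values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IM_ItemWhseHistoryByPeriod (ITEMCODE TEXT, FISCALCALYEAR INTEGER, "
    "FISCALCALPERIOD INTEGER, DOLLARSSOLD REAL, QUANTITYSOLD REAL)"
)
# Hypothetical sample rows for a single item.
conn.executemany(
    "INSERT INTO IM_ItemWhseHistoryByPeriod VALUES (?, ?, ?, ?, ?)",
    [
        ("A", 2016, 11, 10.0, 1.0),  # falls outside the six-period window
        ("A", 2016, 12, 60.0, 6.0),
        ("A", 2017, 1, 50.0, 5.0),
        ("A", 2017, 2, 40.0, 4.0),
        ("A", 2017, 3, 30.0, 3.0),
        ("A", 2017, 4, 20.0, 2.0),
        ("A", 2017, 5, 10.0, 1.0),
    ],
)

# Assumption: 12 periods per fiscal year, so year * 12 + period increases by
# exactly one from each period to the next, even across a year boundary.
end_year, end_period = 2017, 5
end_index = end_year * 12 + end_period

averages = conn.execute(
    """
    SELECT ITEMCODE, AVG(DOLLARSSOLD), AVG(QUANTITYSOLD)
    FROM IM_ItemWhseHistoryByPeriod
    WHERE FISCALCALYEAR * 12 + FISCALCALPERIOD
          BETWEEN :end_index - 5 AND :end_index
    GROUP BY ITEMCODE
    """,
    {"end_index": end_index},
).fetchall()

print(averages)
```

Note that AVG only averages the rows that exist, so a period with no row is skipped rather than counted as zero.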

Summary data even when department is missing for a day

I have data submitted by several departments that I need to summarise to output on a report.
Most days, every department submits data. Some days, a department might miss submitting data.
I need to reflect a zero value entry for that department for the day, rather than skipping it.
I don't know why, but this is striking me as a difficult challenge.
If my data looks like this:
Date, Department, Employee
1 May 2016, First, Fred
1 May 2016, First, Wilma
1 May 2016, Second, Betty
1 May 2016, Second, Barney
2 May 2016, Second, Betty
3 May 2016, First, Wilma
3 May 2016, Second, Betty
3 May 2016, Second, Barney
If I do a count(*) on this data, the output I am hoping for is:
1 May 2016, First, 2
1 May 2016, Second, 2
2 May 2016, First, 0
2 May 2016, Second, 1
3 May 2016, First, 1
3 May 2016, Second, 2
It's the 3rd line, "2 May 2016, First, 0", that I can't get my output to include.
My underlying data is more complex than the above, but the above is a reasonable simplified representation of the problem.
I'm at the point where I'm messing around with cursors trying to 'build' this recordset, so I think that's a clue that I need to ask for help.

Assuming that your main table is:
create table mydata
    (ReportDate date,
     department varchar2(20),
     Employee varchar2(20));
We can use the below query:
with dates (reportDate) as
    (select to_date('01-05-2016','dd-mm-yyyy') + rownum - 1
     from all_objects
     where rownum <=
         to_date('03-05-2016','dd-mm-yyyy') - to_date('01-05-2016','dd-mm-yyyy') + 1),
departments (department) as
    (select 'First' from dual
     union all
     select 'Second' from dual),
AllReports (reportDate, Department) as
    (select dt.reportDate,
            dp.department
     from dates dt
     cross join
     departments dp)
select ar.reportDate, ar.department, count(md.employee)
from AllReports ar
left join myData md
    on ar.ReportDate = md.reportDate and
       ar.department = md.department
group by ar.reportDate, ar.department
order by 1, 2
First, we generate the dates we are interested in, here 01-05-2016 through 03-05-2016. This is the dates WITH clause.
Next, we generate the list of departments in the departments WITH clause.
We cross join the two to generate every possible date and department combination in the AllReports WITH clause.
Finally, we LEFT JOIN to your main table, so combinations with no matching rows still appear and count(md.employee) returns 0 for them.
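For readers without an Oracle instance, the same steps can be sketched in SQLite from Python. This is an adaptation, not the answer's query: a recursive CTE stands in for the rownum trick over all_objects, dates are stored as ISO text, and the rows are the sample data from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mydata (ReportDate TEXT, Department TEXT, Employee TEXT)")
conn.executemany(
    "INSERT INTO mydata VALUES (?, ?, ?)",
    [
        ("2016-05-01", "First", "Fred"),
        ("2016-05-01", "First", "Wilma"),
        ("2016-05-01", "Second", "Betty"),
        ("2016-05-01", "Second", "Barney"),
        ("2016-05-02", "Second", "Betty"),
        ("2016-05-03", "First", "Wilma"),
        ("2016-05-03", "Second", "Betty"),
        ("2016-05-03", "Second", "Barney"),
    ],
)

report = conn.execute(
    """
    WITH RECURSIVE dates(ReportDate) AS (
        -- Generate one row per day in the reporting range.
        SELECT '2016-05-01'
        UNION ALL
        SELECT date(ReportDate, '+1 day') FROM dates
        WHERE ReportDate < '2016-05-03'
    ),
    departments(Department) AS (
        SELECT 'First' UNION ALL SELECT 'Second'
    )
    -- COUNT(md.Employee) ignores the NULLs produced by unmatched LEFT JOIN rows.
    SELECT d.ReportDate, dp.Department, COUNT(md.Employee)
    FROM dates d
    CROSS JOIN departments dp
    LEFT JOIN mydata md
        ON md.ReportDate = d.ReportDate
       AND md.Department = dp.Department
    GROUP BY d.ReportDate, dp.Department
    ORDER BY d.ReportDate, dp.Department
    """
).fetchall()

for row in report:
    print(row)
```

The third row of the result is ('2016-05-02', 'First', 0), which is the line the question could not produce.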

Correct SQL Statement returns correct row that represents front month expiry contract

I have a SQL Server 2008 R2 database of trade records for several equity options, each at one-minute intervals, and each minute contains records for several expiries, e.g.:
Symbol, TradeDate, Expiry, Open, High, Low, Close
AMZN, 4/01/2009 9:31:00, 4/17/2009, 8, 10, 9, 8.5
AMZN, 4/01/2009 9:31:00, 5/17/2009, 10, 11, 10, 11
AMZN, 4/01/2009 9:31:00, 6/18/2009, 12,13,12,12
GOOG, 4/01/2009 9:31:00, 4/17/2009, 8, 9, 7, 7.5
AMZN, 4/01/2009 9:32:00, 4/17/2009, 8.2, 8.9, 8.3, 8.5
AMZN, 4/01/2009 9:32:00, 5/16/2009, 3, 4, 4, 4
...
AMZN, 4/20/2009 9:31:00, 5/16/2009, 8.5, 9, 8.75, 8.75
AMZN, 4/20/2009 9:31:00, 6/18/2009, 9, 10, 9, 9.2
In options there is always a notion of the front month contract. For this problem, define the front month contract as follows: if there are TradeDate entries earlier than the expiry date for a contract, that contract is the front month. Otherwise, the front month is the next month's contract. So, for example, in the data above, on 4/01/2009 the AMZN front month is the contract that expires on 4/17/2009. However, when we move to TradeDate 4/20/2009, the front month is the 5/16/2009 contract, since the 4/17/2009 contract expired over the weekend.
What is the SQL statement that would always return all the correct rows giving the "front month contract" based on what the TradeDate is?

From what you have described, the following query, a self join, should do it:
SELECT T1.Symbol, T1.TradeDate, T1.Expiry, MIN(T2.expiry) AS FrontMonthContract,
    T1.Open, T1.High, T1.Low, T1.Close
FROM <TABLE> T1, <TABLE> T2
WHERE T1.TradeDate <= T2.Expiry
GROUP BY T1.Symbol, T1.TradeDate, T1.Expiry
But it will not work if there is no trade entry for the front month contract.
I would feel better if you either entered a list of expiries by hand or computed them from some rule, such as the last Friday of the month (if there is a rule), so that you do not risk miscalculating the front month when there is no trade for the front month contract.
Better still, do it in your application instead of SQL, as SQL is not meant for such work. In your application it would be a simple comparison with a list of expiries.
I have reduced the execution time of a charting function that worked on data from a SQLite database by over 90% by doing such computations in the application itself instead of in SQL.
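The application-side comparison mentioned above could look something like the following Python sketch. The EXPIRIES list is a hypothetical, hand-entered list for one symbol (taken from the dates in the question), and front_month is an illustrative helper name, not part of any library.

```python
from bisect import bisect_left
from datetime import date

# Hypothetical, hand-entered expiry list for one symbol, sorted ascending.
EXPIRIES = [date(2009, 4, 17), date(2009, 5, 16), date(2009, 6, 18)]


def front_month(trade_date, expiries=EXPIRIES):
    """Return the earliest expiry on or after trade_date, or None if all have expired."""
    # bisect_left finds the first position whose expiry is >= trade_date.
    i = bisect_left(expiries, trade_date)
    return expiries[i] if i < len(expiries) else None


print(front_month(date(2009, 4, 1)))
print(front_month(date(2009, 4, 20)))
```

Because the lookup uses the expiry list rather than the trade rows, it still gives an answer on days when the front month contract did not trade.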
Update:
Try the following query. It assumes the table name to be TRADES.
SELECT
    T1.Symbol,
    T1.TradeDate,
    T1.Expiry,
    MIN(T2.expiry) AS 'FrontMonthContract',
    MIN(T1.[Open]) AS 'Open',
    MIN(T1.[High]) AS 'High',
    MIN(T1.[Low]) AS 'Low',
    MIN(T1.[Close]) AS 'Close'
FROM
    TRADES T1, TRADES T2
WHERE T1.TradeDate <= T2.Expiry AND T1.Symbol = T2.Symbol
GROUP BY T1.Symbol, T1.TradeDate, T1.Expiry
I built a sample table with the data you provided in the question, and this query works as expected on that data set. Note that I tested it on SQL Server 2005.
Update 2:
To optimize the execution of the query, try adding an index on the three GROUP BY columns Symbol, TradeDate, Expiry, in that order.
I looked at the query execution plan: over 60% of the time went to resolving the GROUP BY, and after I added this index to my sample database that cost was completely gone.
