Changing the partition spec of a Hive table and moving the data - hive

I have an external Hive table, employee, which is partitioned by extract_time (yyyy-MM-dd HH:mm:ss), as shown below.
empid empname extract_time
1 abc 2019-05-17 00:00:00
2 def 2019-05-18 14:21:00
I am trying to replace the extract_time partition with year, month, and day partitions. I am following the method below.
1. Create a new table employee_new partitioned by year, month, and day
create external table employee_new
(empid int,
empname string
)
partitioned by (year int,month int,day int)
location '/user/emp/data/employee_new.txt';
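2. Insert overwrite into employee_new by selecting data from the employee table
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- list columns explicitly; select * would also return the old partition column extract_time
insert overwrite table employee_new partition (year, month, day)
select empid, empname, year(extract_time), month(extract_time), day(extract_time)
from employee;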
3. Drop employee and employee_new and create employee table on top of /user/emp/data/employee_new.txt
Please let me know if this method is efficient and if there are any better ways to do the same.
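To illustrate what step 2 produces, here is a small Python sketch of how each extract_time value maps to the new partition values. The helper name partition_dir is hypothetical, and the year=/month=/day= directory layout is an assumption about how Hive names dynamic partition folders for int columns. Check it against your cluster before relying on it.

```python
from datetime import datetime


def partition_dir(extract_time: str) -> str:
    """Return the assumed relative partition directory for one extract_time value."""
    ts = datetime.strptime(extract_time, "%Y-%m-%d %H:%M:%S")
    # int partition values are not zero-padded, mirroring year()/month()/day()
    return f"year={ts.year}/month={ts.month}/day={ts.day}"


# Sample rows taken from the table shown in the question
rows = [
    (1, "abc", "2019-05-17 00:00:00"),
    (2, "def", "2019-05-18 14:21:00"),
]

for empid, empname, extract_time in rows:
    print(empid, empname, partition_dir(extract_time))
```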

If the upstream process can write the hourly files into daily folders, partition by date (yyyy-MM-dd) only. For such a small table, partitioning by year, month, and day separately seems like overkill, and it still produces too many folders.
If the table is partitioned by date (yyyy-MM-dd), partition pruning will still work for your usage scenario, because you are querying by day, month, or year.
To filter by year in this case, use the condition
where date >= '2019-01-01' and date < '2020-01-01'
to filter by month:
where date >= '2019-01-01' and date < '2019-02-01'
and to filter by day:
where date = '2019-01-01'
Filesystem listing will also be much faster with a single level of partition folders.
If it is not possible to redesign the upstream process to write to yyyy-MM-dd folders, then the design you described in the question (yyyy/MM/dd folders) is the only solution.
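The year, month, and day filters above all follow the same half-open range pattern (start inclusive, end exclusive). As a rough sketch, they could be generated with a small Python helper. The function names partition_range and where_clause are hypothetical, and the column name date is taken from the conditions above.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def partition_range(year: int,
                    month: Optional[int] = None,
                    day: Optional[int] = None) -> Tuple[str, str]:
    """Return (start, end) ISO dates; start is inclusive, end is exclusive."""
    if month is None:
        # whole year
        start, end = date(year, 1, 1), date(year + 1, 1, 1)
    elif day is None:
        # whole month; December rolls over into the next year
        start = date(year, month, 1)
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    else:
        # single day
        start = date(year, month, day)
        end = start + timedelta(days=1)
    return start.isoformat(), end.isoformat()


def where_clause(column: str, year: int,
                 month: Optional[int] = None,
                 day: Optional[int] = None) -> str:
    """Build a range predicate on the date partition column."""
    start, end = partition_range(year, month, day)
    return f"where {column} >= '{start}' and {column} < '{end}'"


print(where_clause("date", 2019))
print(where_clause("date", 2019, 1))
print(where_clause("date", 2019, 1, 1))
```

A single-day range is equivalent to the equality filter shown above. The range form is used here only so that all three cases share one code path.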

Related

How to partition Delta lake table by Year where I have date in my Delta table

I have a Delta lake table with columns like this.
I want to partition the table by year. Do I need to add a year column to my table for that, or can I partition by year directly without creating a year column? If so, how do I do that?

Athena Partition - partition by any month and day

I've read Partitioning data in Athena for query Amazon s3.
I want to create one table with yearly data and be able to partition it by any month and day,
so that in the end I can query between any desired dates within the year, as below:
Select ...
from ...
where date between '2015-03-01' and '2015-06-30'
Here, 'date' is the partition that maps to the S3 folders.
I didn't find a way to partition by month and day, meaning to partition the folders under the 'year' folder.
I only succeeded in adding partitions for specific dates, as the docs suggest:
ALTER TABLE orders ADD
PARTITION (dt = '2014-04-21') LOCATION 's3://.../2014/04/21/'
PARTITION (dt = '2016-05-15') LOCATION 's3://.../2016/05/15/'
I have data in the structure:
s3://.../2014/04/21/file.csv
I'm adding partitions with the ALTER TABLE method, as the docs suggest.
Could you tell me if there's a good way to do what I need?
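For illustration only, the per-date ALTER TABLE approach described above could be scripted rather than typed by hand. The Python sketch below generates one PARTITION clause per day in a range. The function name add_partition_statement is hypothetical, and the bucket prefix is passed in by the caller because the real path is elided in the question. This automates the same method; it is not necessarily the best way to do it.

```python
from datetime import date, timedelta


def add_partition_statement(table: str, start: date, end: date, prefix: str) -> str:
    """Build one ALTER TABLE ... ADD statement covering start..end inclusive."""
    clauses = []
    current = start
    while current <= end:
        # folder layout <prefix>yyyy/MM/dd/ as shown in the question
        location = f"{prefix}{current:%Y/%m/%d}/"
        clauses.append(
            f"PARTITION (dt = '{current.isoformat()}') LOCATION '{location}'"
        )
        current += timedelta(days=1)
    return f"ALTER TABLE {table} ADD\n" + "\n".join(clauses)


# 's3://.../' stands in for the elided bucket path from the question
print(add_partition_statement("orders", date(2014, 4, 21), date(2014, 4, 23), "s3://.../"))
```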

How to fetch the last year's data in SQL without date or timestamp columns?

I have a table containing years of data but no date or timestamp columns. Now I have to fetch the last year's data. How can I achieve that when the table does not have any timestamp or date columns?
"How can I achieve that when the table does not have any timestamp or date columns?"
In general, you cannot; if you do not have any data inside the table to tell you a date associated with the row then there is not any meta-data that will tell you.
If you have enabled flashback (with a large enough history) on the table then you could compare the state of the table now to the state of the table a year ago using something like:
SELECT * FROM table_name
MINUS
SELECT * FROM table_name AS OF TIMESTAMP ADD_MONTHS(SYSDATE, -12);

Snowflake SQL - Create Temporary Date Table for EoM dates

I'm only a self-taught data-querying guy and am wholly unfamiliar with creating tables and such. The database I'm working on does have a calendar table, but it is forward-looking only, extending three years out. I need to create a date table of end-of-month records between two dates, including dates before the system calendar table begins.
How best can one create this in Snowflake SQL?
Thank you much
This will create N end-of-month records. You can change the start date and set N (the rowcount, 100 here) to the number of months between your dates.
select
    row_number() over (order by null) as id,
    add_months('2020-01-01'::date, id) - 1 as end_of_month
from table(generator(rowcount => 100));
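To check which dates the query above should return, the same logic can be sketched in Python. The function name end_of_month_dates is hypothetical. It mirrors the SQL by taking a start date and a count N, and returns the last day of each of N consecutive months.

```python
import calendar
from datetime import date
from typing import List


def end_of_month_dates(start: date, n: int) -> List[date]:
    """Return the last day of n consecutive months, beginning with start's month."""
    result = []
    year, month = start.year, start.month
    for _ in range(n):
        # monthrange returns (weekday of first day, number of days in month)
        last_day = calendar.monthrange(year, month)[1]
        result.append(date(year, month, last_day))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return result


for d in end_of_month_dates(date(2020, 1, 1), 3):
    print(d.isoformat())
```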

SQL Table for Timesheet creation

I'm creating a timesheet using Infopath. The data will be stored in the database, so for that I have to create a table. This timesheet will be used for the whole year.
I need help in creating a SQL table. The table structure I want for this timesheet is:
Project_Category Mon Tue Wed Thu Fri Sat Sun Total
Project 1
Project 2
Project 3
Project 4
Project 5
Other
Total
The days should include dates (like Monday 01/01/2013); please suggest a better way to do this if you have one.
I would not store this data in a single table. Consider creating this using multiple tables instead of the single table.
For example, you could have a Projects table with ProjectId and ProjectName. Then you could easily link your ProjectId field to a ProjectSummary table which stores ProjectId, DateField and Total. I have no clue what your Total row is supposed to be, but if it's a calculation over a date range, use SQL to calculate those values and do not store them in the table.
Good luck -- there are lots of resources online to get started with SQL -- do a little searching.
As sgeddes has suggested, multiple tables will probably be a much better way to approach this.
Personally, I would avoid having more than one day per row and, to keep it flexible, allow more than one entry per day.
The structure I would create is as follows:
Entry_ID INT IDENTITY(1,1) PRIMARY KEY,
Timesheet_ID INT,
Project_ID INT,
DateTimeFrom DATETIME,
DateTimeTo DATETIME
This then allows date based calculations to be much simpler.
E.g., the number of hours logged between 20th June and 25th June would be a query like this (add a Project_ID condition to restrict it to project X):
SELECT SUM(DATEDIFF(MINUTE, DateTimeFrom, DateTimeTo)) / 60.0 AS [HOURS]
FROM MyTable
WHERE DateTimeFrom >= '2012-06-20' AND DateTimeTo < '2012-06-26'
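As a rough, runnable illustration of the one-entry-per-row structure above, the sketch below uses Python's built-in sqlite3 module instead of SQL Server. SQLite has no DATEDIFF, so the duration is computed from epoch seconds. The sample rows and the helper name hours_on_project are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE MyTable (
        Entry_ID INTEGER PRIMARY KEY,
        Timesheet_ID INTEGER,
        Project_ID INTEGER,
        DateTimeFrom TEXT,
        DateTimeTo TEXT
    )
""")
conn.executemany(
    "INSERT INTO MyTable (Timesheet_ID, Project_ID, DateTimeFrom, DateTimeTo) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2012-06-20 09:00:00", "2012-06-20 12:30:00"),  # 3.5 hours, project 1
        (1, 1, "2012-06-25 13:00:00", "2012-06-25 17:00:00"),  # 4 hours, project 1
        (1, 2, "2012-06-21 09:00:00", "2012-06-21 10:00:00"),  # different project
        (1, 1, "2012-06-26 09:00:00", "2012-06-26 10:00:00"),  # outside the range
    ],
)


def hours_on_project(conn, project_id, start, end_exclusive):
    """Sum hours for one project; start is inclusive, end_exclusive is not."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(
            CAST(strftime('%s', DateTimeTo) AS INTEGER)
            - CAST(strftime('%s', DateTimeFrom) AS INTEGER)), 0) / 3600.0
        FROM MyTable
        WHERE Project_ID = ? AND DateTimeFrom >= ? AND DateTimeTo < ?
        """,
        (project_id, start, end_exclusive),
    ).fetchone()
    return row[0]


total = hours_on_project(conn, 1, "2012-06-20", "2012-06-26")
print(total)
```

Summing the raw seconds first and dividing once avoids the per-row truncation that integer division would cause.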