Are there any tradeoffs in partitioning using date as a yyyymmdd string versus having multiple partitions for year, month and day as integers?
For every partition that is created in Hive, a new directory is created to store that partition's data. These details are added to the Hive metastore as well as to Hadoop's fsimage.
A partition created as a single yyyymmdd key produces one directory per day, whereas partitioning by year, month and day produces a nested three-level directory structure (year=.../month=.../day=...). That means more entries in the Hive metastore and more metadata to store in the fsimage. This is how Hive and Hadoop see the partitions from a metadata perspective.
Another consideration is querying: partitioning as yyyymmdd works well when querying on a per-day basis. Partitioning by year, month and day gives the flexibility to query the data effectively at the year and month level in addition to the day level.
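As a minimal sketch of the two layouts (table and column names are made up for the illustration):

    -- Single string key: one directory per day, e.g. .../dt=20151231
    CREATE TABLE events_by_dt (id BIGINT, payload STRING)
    PARTITIONED BY (dt STRING);

    -- Three integer keys: nested directories, e.g. .../year=2015/month=12/day=31,
    -- so more partition entries in the metastore and deeper paths in HDFS.
    CREATE TABLE events_by_ymd (id BIGINT, payload STRING)
    PARTITIONED BY (year INT, month INT, day INT);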
I want to store data in BigQuery using specific partitions. The partitions are ingestion-time based, and I want to use a range of partitions spanning two years. I use the partition alias as the destination: project-id:data-set.table-id$partition-date.
I get failures, since it does not recognise the destination as an alias but treats it as an actual table.
Is it supported?
When you ingest data into BigQuery, it will land automatically in the corresponding partition. If you choose a daily ingestion time as partition column, that means that every new day will be a new partition. To be able to "backfill" partitions, you need to choose some other column for the partition (e.g. a column in the table with the ingestion date). When you write data from Dataflow (from anywhere actually), the data will be stored in the partition corresponding to the value of that column for each record.
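For example, a minimal sketch of such a column-partitioned table in BigQuery standard SQL (the dataset, table and column names are assumptions for the illustration):

    -- Partition on a date/timestamp column carried in the data itself,
    -- so backfilled rows land in their matching (past) partitions.
    CREATE TABLE mydataset.events
    (
      event_ts TIMESTAMP,
      payload  STRING
    )
    PARTITION BY DATE(event_ts);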
Direct writes to partitions by ingestion time are not supported using the Write API.
Streaming with a partition decorator is also not supported once a window of 31 days has passed.
From the documentation:
When streaming using a partition decorator, you can stream to partitions within the last 31 days in the past and 16 days in the future relative to the current date, based on current UTC time.
The solution that works is to use BigQuery load jobs to insert the data; load jobs can handle this scenario.
Because this operation involves a lot of I/O (files being created on GCS), it can be lengthy, costly and resource-intensive depending on the data.
One approach is to create table shards, splitting the big table into small ones so that the Storage Read and Write APIs can be used. Load jobs from those sharded tables into the partitioned table then require fewer resources, and the problem is already divided.
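As a rough sketch of that last step (the table names are placeholders, and this uses a query job rather than a load job), each date-suffixed shard can be folded into the column-partitioned table with a plain INSERT ... SELECT, letting BigQuery route each row to its partition:

    -- Hypothetical names: one daily shard, one target table partitioned
    -- on its event_date column; each row lands in the partition
    -- matching its event_date value.
    INSERT INTO mydataset.events_partitioned (event_date, payload)
    SELECT event_date, payload
    FROM mydataset.events_shard_20210601;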
When I push data to BQ using pandas (to_gbq()), tables that are supposed to be hourly partitioned show up as separate tables sharing the same naming convention. The naming convention I used for each table is yyyyMMddHH, the same as described in the official documentation here:
A valid entry from the bound DATE, TIMESTAMP or DATETIME column.
Currently, date values prior to 1960-01-01 and later than 2159-12-31
are placed in a shared UNPARTITIONED partition. NULL values reside in
an explicit NULL partition.
Partitioning identifiers must follow the following formats:
yyyyMMddHH for hourly partitioning.
yyyyMMdd for daily partitioning.
yyyyMM for monthly partitioning.
yyyy for yearly partitioning.
However, they still show up as separate tables. I also made simple daily partitioned tables with the daily naming convention, and they show up fine. The problem is only with the hourly partitioned tables.
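For comparison, a true hourly-partitioned table in BigQuery is a single table partitioned on a timestamp column rather than many yyyyMMddHH-suffixed tables; a minimal sketch with made-up names:

    -- One table, hourly partitions derived from event_ts. Individual
    -- partitions are then addressed as my_table$yyyyMMddHH instead of
    -- being created as separate my_table_yyyyMMddHH tables.
    CREATE TABLE mydataset.my_table
    (
      event_ts TIMESTAMP,
      payload  STRING
    )
    PARTITION BY TIMESTAMP_TRUNC(event_ts, HOUR);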
I have extensive experience working with Hive Partitioned tables. I use Hive 2.X. I was interviewing for a Big Data Solution Architect role and I was asked the below question.
Question: How would you ingest streaming data into a Hive table partitioned on date? The streaming data is first stored in an S3 bucket and then loaded into Hive. Although the S3 bucket names have a date identifier such as S3_ingest_YYYYMMDD, the content could have data for more than one date.
My Answer: Since the content could have more than one date, creating an external table directly over the bucket might not be possible, because we need to read the files and distribute the rows based on their date. I suggested we first load the S3 data into an external staging table with no partitions and then load/insert into the final date-partitioned table using dynamic partition settings, which will dynamically distribute the data to the correct partition directories.
The interviewer said my answer was not correct and I was curious to know what the correct answer was, but ran out of time.
The only caveat in my answer is that, over time, the partitioned date directories will accumulate many small files, which can lead to the small-files problem; this can always be handled via a batch maintenance process.
What are the other/correct options to handle this scenario?
Thanks.
It depends on the requirements.
As per my understanding, if a file or folder of S3_ingest_YYYYMMDD files can contain more than one date, then some events are loaded the next day or even later. This is a rather common scenario.
Ingestion date and event date are two different dates. Put the ingested files into a table partitioned by ingestion date (the landing zone, LZ). That way you can track the data as it initially arrived, and if reprocessing is possible, you can use ingestion_date as a bookmark for reprocessing the LZ table.
Then schedule a process that takes the last two or more days of ingestion dates and loads them into a table partitioned by event_date. The last day will always be incomplete, and you may need to increase the look-back period to three or even more ingestion days (using a filter such as ingestion_date >= current_date - 2 days), depending on how far back ingestion may deliver event dates. In this process you use dynamic partitioning by event_date, apply whatever logic is needed (cleaning, etc.), and load into the ODS or DM.
This approach is very similar to what you proposed. The difference is in the first table: it should be partitioned so that you can process data incrementally and do an easy restatement if you need to change the logic, or if upstream data was restated and reloaded into the LZ.
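A minimal HiveQL sketch of that scheduled step, assuming made-up tables lz_events (partitioned by a DATE-typed ingestion_date) and ods_events (partitioned by event_date), might look like this:

    -- Enable dynamic partitioning for the event_date target.
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    -- Re-process the last few ingestion days from the landing zone;
    -- the final column (event_date) drives the dynamic partition.
    INSERT OVERWRITE TABLE ods_events PARTITION (event_date)
    SELECT id, payload, event_date
    FROM lz_events
    WHERE ingestion_date >= date_sub(current_date, 2);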
What is the optimal size for external table partition?
I am planning to partition the table by year/month/day, and we are getting about 2 GB of data daily.
Optimal table partitioning is whatever matches your table usage scenario.
Partitioning should be chosen based on:
how the data is being queried (if you need to work mostly with daily data then partition by date).
how the data is being loaded (parallel threads should load their own partitions, not overlapping ones; see the sketch below).
2 GB is not too much even for a single file, though again it depends on your usage scenario. Avoid unnecessarily complex and redundant partitioning like (year, month, date): in this case date alone is enough for partition pruning.
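As an illustration of the loading point above (table names are hypothetical), with a single date key each parallel loader can target exactly one partition:

    -- Each parallel loader writes only its own dt partition, so loads
    -- don't overlap, and the single dt key is enough for pruning.
    INSERT OVERWRITE TABLE events PARTITION (dt = '20240601')
    SELECT id, payload
    FROM staging_events_20240601;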
Hive partition definitions are stored in the metastore, so too many partitions take up a lot of space there.
Partitions are stored as directories in HDFS, so multiple partition keys produce hierarchical directories, which makes scanning them slower.
Your query will be executed as a MapReduce job, so it is pointless to make partitions too tiny.
It depends on the case; think about how your data will be queried. For your case I would prefer one key defined as 'yyyymmdd': we get 365 partitions per year, only one level in the table directory, and 2 GB of data per partition, which is a nice size for a MapReduce job.
For completeness of the answer, if you use Hive < 0.12, make your partition key string-typed, see here.
Useful blog here.
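To make the single-key layout concrete, a hedged sketch of how it is typically queried (table and column names are illustrative):

    -- Day-level query: exactly one partition is read.
    SELECT COUNT(*) FROM events WHERE dt = '20151231';

    -- Month-level query: a range (or LIKE '201512%') predicate on the
    -- string key still lets the partition pruner keep only December 2015.
    SELECT COUNT(*) FROM events WHERE dt >= '20151201' AND dt < '20160101';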
Hive partitioning is most effective in cases where the data is sparse. By sparse I mean that the data internally has visible partitions such as by year, month or day.
In your case, partitioning by date doesn't make much sense, as each day will have only 2 GB of data, which is not too big to handle. Partitioning by week or month makes more sense, as it will optimize query time without creating too many small partition files.
I'm working with a Hive table that is partitioned by year, month, and day. e.g.
year=2015 AND month=201512 AND day = 20151231.
From my limited knowledge of the way Hive works, these are probably set up in a folder structure where the '2015' folder contains 12 month folders, and each month folder has 28-31 day folders inside. In that case, using
WHERE year = 2015 AND month = 201512 AND day = 20151231
would just climb down the directory structure to the 20151231 folder. I would think that using just WHERE day = 20151231 would trigger the same traversal, and therefore be essentially the same query, but we were given sample code which used the year AND month AND day format (i.e. referencing all 3 partitions).
I ran some benchmarks using both options (last night and this morning, when server load is extremely light-to-non-existent), and the time taken is essentially the same. I suspect that the sample code is wrong, and I can just use the day partition, but I want to be sure.
Is there any performance advantage to using several partitions that are subsets of each other in a Hive query?
I know that Hive partitions are treated like columns, but would the same hold true for a non-partitioned column?
When you run a query like that on a partitioned table, Hive will first query the metastore to find which directories have to be included in the map/reduce input, and as you saw, it doesn't much matter how they are arranged (day=20151231 vs year=2015/month=12/day=31).
If you're using MySQL for the metastore, Hive will internally run a SQL query against that database to retrieve only the partitions to include.
Any difference in the performance of that SQL query is going to be negligible, especially compared to the duration of your map/reduce job.
It's quite different with non-partition columns, since those are not stored in the metastore: filtering on them requires a full scan of the data.
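For instance, with the layout from the question (hypothetical table name my_table), both of the first two queries resolve to the same single directory via the metastore, while the last one forces a full scan because extra_col is not a partition key:

    -- Both filters prune to the same partition via a metastore lookup.
    SELECT COUNT(*) FROM my_table WHERE day = 20151231;
    SELECT COUNT(*) FROM my_table
    WHERE year = 2015 AND month = 201512 AND day = 20151231;

    -- extra_col is not a partition column, so every file must be scanned.
    SELECT COUNT(*) FROM my_table WHERE extra_col = 'x';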