When I push data to BigQuery using pandas (to_gbq()), the tables that are supposed to be hourly partitioned show up as separate tables sharing the same naming convention, e.g.:
The naming convention I used for each table is yyyyMMddHH, as can be seen in the screenshot above, and it is the same as described in the official documentation here:
A valid entry from the bound DATE, TIMESTAMP or DATETIME column.
Currently, date values prior to 1960-01-01 and later than 2159-12-31
are placed in a shared UNPARTITIONED partition. NULL values reside in
an explicit NULL partition.
Partitioning identifiers must follow the following formats:
yyyyMMddHH for hourly partitioning.
yyyyMMdd for daily partitioning.
yyyyMM for monthly partitioning.
yyyy for yearly partitioning.
However, they still show up as separate tables. I also made simple daily partitioned tables with the daily naming convention and they show up fine. The problem is only with the hourly partitioned tables.
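For reference, here is a minimal sketch of how an hourly partitioned table and the yyyyMMddHH identifiers relate in BigQuery SQL; the dataset, table and column names below are only illustrative:

-- one hourly partitioned table (names are hypothetical)
CREATE TABLE mydataset.events
(
  event_ts TIMESTAMP,
  payload  STRING
)
PARTITION BY TIMESTAMP_TRUNC(event_ts, HOUR);

-- yyyyMMddHH addresses a partition of this single table, e.g. the partition
-- decorator mydataset.events$2021062114, rather than being a suffix of a
-- separate table's name
SELECT COUNT(*)
FROM mydataset.events
WHERE TIMESTAMP_TRUNC(event_ts, HOUR) = TIMESTAMP '2021-06-21 14:00:00';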
My team and I have been using CrateDB for one of our projects over the past few years. We have a table with hundreds of millions of records, and performance is key.
As we've developed more and more features on this project, we've run into an interesting problem. We have a column on this table labeled 'persist_date', which is when the record actually got persisted into the table. These dates may not always align: we could have a start_date of 2021-06-21 with a persist_date of 2021-10-14.
All of our queries up to this point have easily been able to filter on the start_date partition. Now we are encountering a problem which requires us to query against a non-partitioned column (persist_date).
As I understand it, CrateDB is really performant, but only when you query against one specific partition at a time. My question is: how would I go about creating a partition for this other date column without duplicating my data? Is there anything other than a partition that might help, like the way the table is clustered?
You could use both columns as partition values.
e.g.
CREATE TABLE two_parted (a TEXT, b TEXT, val DOUBLE) PARTITIONED BY (a,b);
If either a or b is used in a selection, queries are limited to the shards that hold that value. However, this can lead to more shards, so you might want to partition not on a daily basis but on a weekly or monthly one.
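For example, with the table above, a filter on either partition column restricts the query to the matching partitions (the values are just illustrative):

-- only partitions where a = '2021-06-21' are consulted
SELECT avg(val) FROM two_parted WHERE a = '2021-06-21';

-- likewise, a filter on b prunes to the partitions holding that value
SELECT avg(val) FROM two_parted WHERE b = '2021-10-14';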
I have a JSON input file which stores survey data (feedback from customers).
The columns in the JSON file can vary: for example, in the first quarter there can be 70 columns, and in the next quarter it can have 100 columns, and so on.
I want to store all of this quarterly data in the same table on HDFS.
Is there a way to maintain history, for example by dropping and re-creating the table with the changing schema?
How will it behave if the number of columns goes down, let's say in the 3rd quarter we get only 30 columns?
The first point is that in HDFS you don't store tables, just files. You create tables in Hive, Impala, etc. on top of those files.
Some of the formats support schema merging at read time, for example Parquet.
In general you will be able to re-create your table with a superset of columns. In Impala you have similar capabilities for schema evolution.
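A minimal sketch of how this can look in Hive, assuming Parquet files and a made-up survey table where new quarters only add columns:

-- hypothetical external table over the existing Parquet files
CREATE EXTERNAL TABLE survey (
  q1 STRING,
  q2 STRING
  -- ... the columns known in the first quarter
)
STORED AS PARQUET
LOCATION '/data/survey/';

-- when a new quarter arrives with extra columns, extend the schema in place;
-- older files simply return NULL for columns they do not contain, and columns
-- missing from newer files (e.g. only 30 columns in Q3) also come back as NULL
ALTER TABLE survey ADD COLUMNS (q71 STRING, q72 STRING);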
I have extensive experience working with Hive Partitioned tables. I use Hive 2.X. I was interviewing for a Big Data Solution Architect role and I was asked the below question.
Question: How would you ingest streaming data into a Hive table partitioned on date? The streaming data is first stored in an S3 bucket and then loaded into Hive. Although the S3 bucket names have a date identifier such as S3_ingest_YYYYMMDD, the content could have data for more than one date.
My Answer: Since the content could have more than one date, creating an external table might not be possible, because we want to read the file and distribute the rows based on the date. I suggested we first load the S3 data into an external staging table with no partitions and then load/insert into the final date-partitioned table using dynamic partition settings, which will dynamically distribute the data to the correct partition directories.
The interviewer said my answer was not correct and I was curious to know what the correct answer was, but ran out of time.
The only caveat in my answer is that, over time, the partitioned date directories will accumulate many small files, which can lead to the small-files issue; this can always be handled via a batch maintenance process.
What are the other/correct options to handle this scenario?
Thanks.
It depends on the requirements.
As I understand it, if one file or folder of S3_ingest_YYYYMMDD files can contain more than one date, then some events are loaded the next day or even later. This is a rather common scenario.
Ingestion date and event date are two different dates. Put the ingested files into a table partitioned by ingestion date (the landing zone, LZ). This lets you track the initial data. If reprocessing is possible, then use ingestion_date as a bookmark for reprocessing the LZ table.
Then schedule a process which takes the last two or more days of ingestion dates and loads them into a table partitioned by event_date. The last day will always be incomplete, and you may need to increase the look-back period to 3 or even more ingestion days (using an ingestion_date >= current_date - 2 days filter); it depends on how many days back ingestion may deliver event dates. In this process you use dynamic partitioning by event_date, apply whatever logic is needed (cleaning, etc.), and load into the ODS or DM.
This approach is very similar to what you proposed. The difference is in the first table: it should be partitioned so you can process data incrementally and do an easy restatement if you need to change the logic, or if upstream data was restated and reloaded into the LZ.
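A rough sketch of that second step in HiveQL; the table and column names are made up:

-- dynamic partitioning must be enabled for the insert below
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- reprocess the last few ingestion days and rewrite the affected event_date partitions
INSERT OVERWRITE TABLE ods_events PARTITION (event_date)
SELECT
  e.user_id,
  e.payload,
  e.event_date          -- the dynamic partition column goes last
FROM lz_events e
WHERE e.ingestion_date >= date_sub(current_date, 2);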
I have about 11 years of data in a bunch of Avro files. I wanted to partition by the date of each row, but from the documentation it appears I can't because there are too many distinct dates?
Does clustering help with this? Even the natural cluster key for my data would still have some values with data for more than 4,000 days.
Two solutions I see:
1)
Combine table sharding (per year) with time partitioning based on your column. I never tested that myself, but it should work, as every shard is seen as a separate table in BigQuery.
With that you can easily address the shard plus the partition with one wildcard/variable (see the sketch after this answer).
2)
A good workaround is to create an extra column derived from the date field you want to partition on.
For every entry older than 9 years (e.g. DATE_DIFF(current_date(), DATE('2009-01-01'), YEAR)), truncate the date to the 1st of its month.
With that you gain room for another 29 years of data.
Be aware that you cannot filter on that column with a date filter, e.g. in Data Studio, but for queries it works.
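Minimal sketches of both ideas in BigQuery SQL; the project, table and column names are invented:

-- 1) yearly shards, each time-partitioned: one wildcard query can address
--    a single shard plus a single partition
SELECT COUNT(*)
FROM `myproject.mydataset.events_*`
WHERE _TABLE_SUFFIX = '2014'
  AND event_date = '2014-07-01';

-- 2) derived partition column: dates older than roughly 9 years are collapsed
--    to the first of their month, newer dates are kept as-is
SELECT
  *,
  IF(event_date < DATE_SUB(CURRENT_DATE(), INTERVAL 9 YEAR),
     DATE_TRUNC(event_date, MONTH),
     event_date) AS partition_date
FROM `myproject.mydataset.events_raw`;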
Best Thomas
Currently, as per the docs, clustering is supported for partitioned tables only. In the future it might support non-partitioned tables.
You can put old data into a single partition per year.
You need to add an extra column to your table to partition on.
Say, all data for year 2011 will go to partition 20110101.
For newer data (2019) you can have a separate partition for each date.
This is not a clean solution to the problem, but it lets you optimize further by using clustering to keep table scans minimal.
4,000 daily partitions is just over 10 years of data. If you require a 'table' with more than 10 years of data, one workaround would be to use a view:
Split your table into decades ensuring all tables are partitioned on the same field and have the same schema
Union the tables together in a BigQuery view
This results in a view with 4,000+ partitions which business users can query without worrying about which version of a table they need to use or union-ing the tables themselves.
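A sketch of what such a view could look like; the project, dataset, table and column names are invented:

-- both underlying tables are partitioned on the same field (event_date)
-- and share the same schema
CREATE VIEW `myproject.mydataset.sales_all` AS
SELECT * FROM `myproject.mydataset.sales_2000s`
UNION ALL
SELECT * FROM `myproject.mydataset.sales_2010s`;

-- users filter on the partition column as usual and the view hides the split
SELECT SUM(amount)
FROM `myproject.mydataset.sales_all`
WHERE event_date BETWEEN '2012-01-01' AND '2012-12-31';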
It might make sense to partition by week/month/year instead of day - depending on how much data you have per day.
In that case, see:
Partition by week/year/month to get over the partition limit?
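For example, a month-partitioned table stays well under the limit even for decades of daily data (names are illustrative):

-- 11 years of data is roughly 132 monthly partitions instead of ~4,000 daily ones
CREATE TABLE `myproject.mydataset.measurements`
(
  event_date DATE,
  value      FLOAT64
)
PARTITION BY DATE_TRUNC(event_date, MONTH);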
Are there any tradeoffs in partitioning using date as a yyyymmdd string versus having multiple partitions for year, month and day as integers?
For every partition that is created in Hive, a new directory is created to store that partition's data. These details are added to the Hive metastore as well as to the fsimage of Hadoop.
When a partition is created as yyyymmdd, a single directory is created, whereas with year, month and day three nested directory levels are created. So there are more entries in the Hive metastore and more metadata to store in the fsimage. This is with respect to how Hive and Hadoop see the partitions from a metadata perspective.
Another view, with respect to querying: when partitioned as yyyymmdd, it works well when querying on a per-day basis. Partitioning by year, month and day gives the flexibility to query the data effectively at the year and month level in addition to day-level querying.
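A small HiveQL sketch of the two layouts being compared; the table and column names are hypothetical:

-- single string partition key, one directory per day, e.g. dt=20210621
CREATE TABLE events_flat (id BIGINT, payload STRING)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;

-- three integer partition keys, nested directories, e.g. year=2021/month=6/day=21
CREATE TABLE events_nested (id BIGINT, payload STRING)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET;

-- day-level queries work on either layout
SELECT count(*) FROM events_flat   WHERE dt = '20210621';
-- month-level pruning is natural with the nested layout
SELECT count(*) FROM events_nested WHERE year = 2021 AND month = 6;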