How to partition a datetime column when creating Athena table - amazon-s3

I have some log files in S3 with the following CSV format (sample data in parentheses):
userid (15678),
datetime (2017-09-14T00:21:10),
tag1 (some random text),
tag2 (some random text)
I want to load these into Athena tables and partition the data based on the datetime in a day/month/year format. Is there a way to split the datetime at table creation, or do I need to run some job beforehand to separate the columns and then import?

Athena supports only Hive external tables. To partition the data in an external table, the data must sit in different folders, one per partition.
There are two ways in which you can do that. Both are mentioned here.
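For illustration, a minimal sketch of what the folder-per-partition approach could look like in Athena DDL, assuming a hypothetical layout such as s3://my-log-bucket/logs/year=2017/month=09/day=14/ (the bucket name, column types, and folder scheme are assumptions, not from the question):

-- Partition columns are declared separately from the data columns;
-- their values come from the folder names, not from the CSV files.
CREATE EXTERNAL TABLE logs (
  userid     INT,
  `datetime` STRING,
  tag1       STRING,
  tag2       STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-log-bucket/logs/';

-- With key=value folder names the partitions can be discovered in bulk,
-- or each one can be registered explicitly:
MSCK REPAIR TABLE logs;
ALTER TABLE logs ADD PARTITION (year='2017', month='09', day='14')
  LOCATION 's3://my-log-bucket/logs/year=2017/month=09/day=14/';

Either way, Athena does not split the datetime for you; the files have to be written into the per-partition folders by some job outside of table creation.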

Related

How to ingest ORC in Druid from the base path of a Hive external table?

I have a Hive external table pointed at location = "hdfs://localhost:8020/sample/path/", where /sample/path contains various partitions like
/sample/path/cola=123/colb=456
/sample/path/cola=324/colb=432
/sample/path/cola=322/colb=234
I have tried to ingest the data into Apache Druid using index_parallel; while doing so I have to list the complete partition directories down to the leaf level:
"paths":"/sample/path/cola=123/colb=456,/sample/path/cola=324/colb=432,/sample/path/cola=322/colb=234"
The values of these partition columns are lost once the data is ingested into Druid.
Question: Is there some way I could specify the base path and retain the values of the partition columns after data ingestion?
I'm afraid not. You're ingesting the files, and they simply don't contain values for the partition columns. To ingest this data you'll have to have each such column in your table twice: once as a partition column and again as a regular column.
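As a rough sketch of the "column twice" idea in Hive SQL (every name below, including sample_table and the *_value columns, is hypothetical), you can rewrite the data so the partition values are also stored inside the ORC files that Druid reads:

-- Allow dynamic partitioning for the copy.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE EXTERNAL TABLE sample_for_druid (
  some_col   STRING,  -- stand-in for the real data columns
  cola_value INT,     -- regular copy of the partition key cola
  colb_value INT      -- regular copy of the partition key colb
)
PARTITIONED BY (cola INT, colb INT)
STORED AS ORC
LOCATION '/sample/path_for_druid/';

-- The last two selected columns feed the partition keys; the middle two
-- keep the same values as ordinary columns inside the ORC files.
INSERT OVERWRITE TABLE sample_for_druid PARTITION (cola, colb)
SELECT some_col, cola, colb, cola, colb
FROM sample_table;

Druid can then ingest from /sample/path_for_druid/ and pick the values up from cola_value and colb_value.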

How to maintain history data whose schema changes quarterly using Hadoop

I have a JSON input file which stores survey data (feedback from the customers).
The columns in the JSON file can vary: for example, in the first quarter there can be 70 columns and in the next quarter it can have 100 columns, and so on.
I want to store all this quarterly data in the same table on HDFS.
Is there a way to maintain history, for example by dropping and re-creating the table with the changing schema?
How will it behave if the number of columns goes down, say to only 30 columns in the third quarter?
The first point is that HDFS doesn't store tables, just files; you create tables in Hive, Impala, etc. on top of those files.
Some of the formats support schema merging at read time, for example Parquet.
In general you will be able to re-create your table with a superset of the columns. In Impala you have similar capabilities for schema evolution.
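For example, the superset approach can usually be done in place rather than by dropping the table; a hedged sketch in Hive SQL (table and column names are made up):

-- Existing rows return NULL for columns added later, and a quarter that
-- delivers fewer columns simply leaves the unused ones NULL.
ALTER TABLE survey_feedback ADD COLUMNS (
  q2_question_71 STRING,
  q2_question_72 STRING
);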

Partitioning BigQuery table, loaded from AVRO

I have a BigQuery table whose data is loaded from AVRO files on GCS. This is NOT an external table.
One of the fields in every AVRO object is created (a date stored as a long), and I'd like to use this field to partition the table.
What is the best way to do this?
Thanks
Two issues prevent using created as a partition column:
The AVRO file defines the schema at load time. The only partitioning option at that step is to select Partition By Ingestion Time, which effectively adds another (pseudo) field for this purpose rather than using created.
The field created is a long. Its value seems to contain a datetime. If it were a plain integer you would be able to use integer-range partitioned tables; but in this case you would need to convert the long value into a DATE/TIMESTAMP to use date/timestamp-partitioned tables.
So, in my opinion, you can try:
Importing the data as it is into a first table.
Creating a second, empty table partitioned by created with type TIMESTAMP.
Executing a query that reads from the first table and applies a timestamp function to created, such as TIMESTAMP_SECONDS (or TIMESTAMP_MILLIS), to transform the value into a TIMESTAMP, so each inserted value ends up in the right partition.
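A minimal sketch of those three steps in BigQuery Standard SQL (dataset and table names are placeholders, and TIMESTAMP_SECONDS assumes created holds epoch seconds; use TIMESTAMP_MILLIS if it holds milliseconds):

-- Step 2: the partitioned target table, with created held as TIMESTAMP.
CREATE TABLE mydataset.events_partitioned (
  created TIMESTAMP,
  payload STRING            -- stand-in for the other AVRO fields
)
PARTITION BY DATE(created);

-- Step 3: copy from the first table (loaded as-is from the AVRO files),
-- converting the long epoch value on the way in.
INSERT INTO mydataset.events_partitioned (created, payload)
SELECT TIMESTAMP_SECONDS(created), payload
FROM mydataset.events_raw;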

Partitioning based on column data?

When creating a partitioned table using bq mk --time_partitioning_type=DAY, are the partitions created based on the load time of the data, not on a date key within the table data itself?
To create partitions based on dates within the data, is the current approach to manually create sharded tables and load them based on date, as in this post from 2012?
Yes, partitions are created based on the data load time, not on the data itself.
You can use a partition decorator (mydataset.mytable1$20160810) if you want to load data into a specific partition.
Per my understanding, partitioning by column is something that we should expect to be supported at some point, but not now.
Good news: BigQuery now supports two types of data partitioning, including partitioning by column. Please check here.
I like the feature: An individual operation can commit data into up to 2,000 distinct partitions.
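As a quick illustration of the column-based option that was added later (names are placeholders), a date column can now drive the partitioning directly in BigQuery Standard SQL:

-- Rows are routed to partitions by the value of event_date,
-- not by when they were loaded.
CREATE TABLE mydataset.events (
  event_date DATE,
  user_id    INT64,
  detail     STRING
)
PARTITION BY event_date;

The decorator syntax from the earlier answer (mydataset.mytable1$20160810) still targets a single ingestion-time partition per load.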

Import CSV to partitioned table on BigQuery using specific timestamp column?

I want to import a large CSV into a BigQuery partitioned table that has a TIMESTAMP column which is actually the date of some transaction. The problem is that when I load the data, everything is imported into a single partition for today's date.
Is it possible to use my own timestamp value to partition the data? How can I do that?
In BigQuery, partitioning based on a specific column is currently not supported, even if that column is date-related (a timestamp).
You either rely on the time of insertion, so the BigQuery engine inserts into the respective partition, or you specify exactly which partition you want to insert your data into.
See more about Creating and Updating Date-Partitioned Tables.
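If the newer column-based partitioning mentioned in the previous question is available to you, one hedged workaround (all names below are placeholders) is to load the CSV into a staging table and then insert it into a table partitioned on the transaction timestamp, mirroring the AVRO sketch above:

-- Target table partitioned on the transaction's own timestamp.
CREATE TABLE mydataset.transactions (
  transaction_ts TIMESTAMP,
  amount         NUMERIC
)
PARTITION BY DATE(transaction_ts);

-- After loading the CSV into mydataset.transactions_staging,
-- each row lands in the partition matching its own timestamp.
INSERT INTO mydataset.transactions (transaction_ts, amount)
SELECT transaction_ts, amount
FROM mydataset.transactions_staging;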
The best way to do that today is by using Google Dataflow [1]. You can develop a streaming pipeline which reads the file from a Google Cloud Storage bucket and inserts the rows into a BigQuery table.
You will need to create the partitioned table manually [2] before running the pipeline, because Dataflow currently doesn't support creating partitioned tables.
There are multiple examples available at [3].
[1] https://cloud.google.com/dataflow/docs/
[2] https://cloud.google.com/bigquery/docs/creating-partitioned-tables
[3] https://cloud.google.com/dataflow/examples/all-examples