I have access to an S3 bucket structured like bucket_name/year/month/day/file.gz with hundreds of files per day. I know that to define a partitioned Athena table over this data it would need to be named as bucket_name/year=year/month=month/day=day. Short of writing a shell script spelling out every day (so, a series of
aws s3 cp --recursive s3://old_bucket/YYYY/MM/DD/ s3://new_bucket/year=YYYY/month=MM/day=DD/
for each value of YYYY/MM/DD in the dataset)
is there a simpler way to approach this? I know about ALTER TABLE ADD PARTITION but again it seems to require me to specify each partition individually.
You don't need to rename the files at all. While it's true that most examples use the Hive-style naming convention, Athena does not require it.
There are many ways to add partitions to an Athena table. In your case I would go with partition projection, which makes new data partitions available immediately. Alternatively, you can add partitions manually with ALTER TABLE … ADD PARTITION.
To create a table configured with partition projection you can use this as a starting point:
CREATE EXTERNAL TABLE my_table (
  …
)
PARTITIONED BY (
  `date` string
)
LOCATION 's3://bucket_name/'
TBLPROPERTIES (
  "projection.enabled" = "true",
  "projection.date.type" = "date",
  "projection.date.range" = "2020/01/01,NOW",
  "projection.date.format" = "yyyy/MM/dd",
  "storage.location.template" = "s3://bucket_name/${date}/"
)
You can then query your table with
SELECT *
FROM my_table
WHERE "date" = '2020/10/24'
Note that the date column/partition key is a string and not a DATE. Athena will take the string and interpolate it into the URI given by storage.location.template. Partition projection is pretty clever, and I encourage you to read the docs to find out what the ….range property does, for example.
Also note that date is a reserved word: to use it in DDL you must quote it in backticks, but in queries it instead needs to be quoted in double quotes. If you want to avoid having to quote it all the time you can name it something else, but if you do, you need to change the name both in the PARTITIONED BY part and in the TBLPROPERTIES part.
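For example, a hypothetical variant of the table above with the partition key named dt instead of date could look like this (the column list is a placeholder, not from the question):
-- partition key renamed from `date` to dt; note the change in both PARTITIONED BY and TBLPROPERTIES
CREATE EXTERNAL TABLE my_table_dt (
  col1 string
)
PARTITIONED BY (
  dt string
)
LOCATION 's3://bucket_name/'
TBLPROPERTIES (
  "projection.enabled" = "true",
  "projection.dt.type" = "date",
  "projection.dt.range" = "2020/01/01,NOW",
  "projection.dt.format" = "yyyy/MM/dd",
  "storage.location.template" = "s3://bucket_name/${dt}/"
)
Queries then need no quoting at all:
SELECT *
FROM my_table_dt
WHERE dt = '2020/10/24'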
The alternative to partition projection, which is a fairly new feature, is to add partitions manually. You can do this with the Glue Data Catalog API, which is preferable when writing code in my opinion, or you can do it with DDL, which is more compact and easier to fit into a Stack Overflow answer.
Assuming you have a table partitioned by date like above (but without the TBLPROPERTIES since those are partition projection-specific), you can add partitions like this:
ALTER TABLE my_table ADD IF NOT EXISTS
PARTITION (`date` = '2020-10-22') LOCATION 's3://bucket_name/2020/10/22/'
PARTITION (`date` = '2020-10-23') LOCATION 's3://bucket_name/2020/10/23/'
PARTITION (`date` = '2020-10-24') LOCATION 's3://bucket_name/2020/10/24/'
You can then query your table like this:
SELECT *
FROM my_table
WHERE "date" = '2020-10-24'
Note here that I add the partitions with values for the partition keys that don't correspond exactly to how the dates are represented in the S3 URIs (I format the dates with dashes in the standard ISO way, instead of slashes). When adding partitions manually there doesn't have to be any correspondence at all between the partition keys' values and the S3 URI.
Some people will tell you that you must use Hive-style partitioning with Athena, and that you should then add partitions with MSCK REPAIR TABLE. This is not the case, as I hope I've shown above, and using that command to add partitions is not a good idea anyway: it works for a few partitions, but eventually it will start timing out.
I have a BigQuery table whose data is loaded from Avro files on GCS. This is NOT an external table.
One of the fields in every Avro object is created (a date with a long type), and I'd like to use this field to partition the table.
What is the best way to do this?
Thanks
Two issues prevent you from using created as a partition column:
The Avro file defines the schema at load time. The only partitioning option available at this step is Partition by ingestion time; however, you most probably want to use another field (created) for this purpose.
The field created is a long. The value seems to contain a datetime; if it were a plain integer you would be able to use integer-range partitioned tables, but in this case you need to convert the long value into a DATE/TIMESTAMP to use date/timestamp-partitioned tables.
So, in my opinion, you can try the following:
Import the data as it is into a first table.
Create a second, empty table partitioned by created of type TIMESTAMP.
Execute a query that reads from the first table and applies a timestamp function such as TIMESTAMP_SECONDS (or TIMESTAMP_MILLIS) to created to transform the value into a TIMESTAMP, so each row you insert ends up in the right partition (see the sketch below).
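As a rough sketch, assuming created holds epoch milliseconds (use TIMESTAMP_SECONDS if it holds seconds) and hypothetical table names dataset.raw_table and dataset.partitioned_table, steps 2 and 3 can even be collapsed into a single CREATE TABLE … AS SELECT:
-- convert the long epoch value to a TIMESTAMP and partition the new table by its date
CREATE TABLE dataset.partitioned_table
PARTITION BY DATE(created)
AS
SELECT
  TIMESTAMP_MILLIS(created) AS created,
  * EXCEPT (created)
FROM dataset.raw_table;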
I have a very large parquet table containing nested complex types such as structs and arrays. I have partitioned it by date and would like to restrict certain users to, say, the latest week of data.
The usual way of doing this would be to create a time-limited view on top of the table, e.g.:
CREATE VIEW time_limited_view
AS SELECT * FROM my_table
WHERE partition_date >= '2020-01-01'
This will work fine when querying the view in Hive. However, if I try to query this view from Impala, I get an error:
AnalysisException: Expr 'my_table.struct_column' in select list returns a complex type
The reason for this is that Impala does not allow complex types in the select list. Any view I build which selects the complex columns will cause errors like this. If I flatten/unnest the complex types, this would of course get around this issue. However due to the layers of nesting involved I would like to keep the table structure as is.
I see another suggested workaround has been to use Ranger row-level filtering, but I do not have Ranger and will not be able to install it on the cluster. Any suggestions on Hive/Impala SQL workarounds would be appreciated.
While working on a different problem I came across a kind of solution that fits my needs (but is by no means a general solution). I figured I'd post it in case anyone has similar needs.
Rather than using a view, I can simply use an external table. So firstly I would create a table in database_1 using Hive, which has a corresponding location, location_1, in hdfs. This is my "production" database/table which I use for ETL and contains a very large amount of data. Only certain users have access to this database.
CREATE TABLE database_1.tablename
(`col_1` BIGINT,
`col_2` array<STRUCT<X:INT, Y:STRING>>)
PARTITIONED BY (`date_col` STRING)
STORED AS PARQUET
LOCATION 'location_1';
Next, I create a second, external table in the same location in hdfs. However this table is stored in a database with a much broader user group (database_2).
CREATE EXTERNAL TABLE database_2.tablename
(`col_1` BIGINT,
`col_2` array<STRUCT<X:INT, Y:STRING>>)
PARTITIONED BY (`date_col` STRING)
STORED AS PARQUET
LOCATION 'location_1';
Since this is an external table, I can add/drop date partitions at will without affecting the underlying data. I can add one week's worth of date partitions to the metastore, and as far as end users can tell, that's all that is available in the table. I can even make this part of my ETL job: each time new data is added, I add that partition to the external table and then drop the partition from a week ago, resulting in a rolling window of one week's data being made available to this user group without having to duplicate a load of data to a separate location.
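As a rough sketch, assuming the partition directories follow the default date_col=… layout under location_1 (the dates are hypothetical), the rolling-window step of the ETL job could look like this:
-- expose the newest day to the broader user group
ALTER TABLE database_2.tablename ADD IF NOT EXISTS PARTITION (date_col = '2020-01-08');

-- hide the oldest day; only the partition metadata is dropped, the data stays in place
ALTER TABLE database_2.tablename DROP IF EXISTS PARTITION (date_col = '2020-01-01');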
This is by no means a row-filtering solution, but is a handy way to use partitions to expose a subset of data to a broader user group without having to duplicate that data in a separate location.
I have an S3 bucket with ~70 million JSONs (~15 TB) and an Athena table to query them by timestamp and some other keys defined in the JSON.
It is guaranteed that the timestamp in the JSON is more or less equal to the S3 createdDate of the JSON (or at least equal enough for the purpose of my query).
Can I somehow improve querying performance (and cost) by adding the createdDate as something like a "partition", which I understand seems only to be possible for prefixes/folders?
Edit:
I currently simulate that by using the S3 Inventory CSV to pre-filter by createdDate and then download all JSONs and do the rest of the filtering, but I'd like to do that completely inside Athena, if possible.
There is no way to make Athena use things like S3 object metadata for query planning. The only way to make Athena skip reading objects is to organize the objects in a way that makes it possible to set up a partitioned table, and then query with filters on the partition keys.
It sounds like you have an idea of how partitioning in Athena works, and I assume there is a reason that you are not using it. However, for the benefit of others with similar problems coming across this question, I'll start by explaining what you can do if you can change the way the objects are organized. I'll give an alternative suggestion at the end; you may want to jump straight to that.
I would suggest you organize the JSON objects using prefixes that contain some part of the objects' timestamps. Exactly how much depends on the way you query the data: you don't want it too granular, and you don't want it too coarse. Making it too granular will make Athena spend more time listing files on S3; making it too coarse will make it read too many files. If the most common time period of queries is a month, a month is a good granularity; if the most common period is a couple of days, then a day is probably better.
For example, if day is the best granularity for your dataset you could organize the objects using keys like this:
s3://some-bucket/data/2019-03-07/object0.json
s3://some-bucket/data/2019-03-07/object1.json
s3://some-bucket/data/2019-03-08/object0.json
s3://some-bucket/data/2019-03-08/object1.json
s3://some-bucket/data/2019-03-08/object2.json
You can also use a Hive-style partitioning scheme, which is what other tools like Glue, Spark, and Hive expect, so unless you have reasons not to it can save you grief in the future:
s3://some-bucket/data/created_date=2019-03-07/object0.json
s3://some-bucket/data/created_date=2019-03-07/object1.json
s3://some-bucket/data/created_date=2019-03-08/object0.json
I chose the name created_date here; I don't know what would be a good name for your data. You can use just date, but remember to always quote it (and quote it in different ways in DML and DDL…) since it's a reserved word.
Then you create a partitioned table:
CREATE EXTERNAL TABLE my_data (
  column0 string,
  column1 int
)
PARTITIONED BY (created_date date)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-bucket/data/'
TBLPROPERTIES ('has_encrypted_data'='false')
Some guides will then tell you to run MSCK REPAIR TABLE to load the partitions for the table. If you use Hive-style partitioning (i.e. …/created_date=2019-03-08/…) you can do this, but it will take a long time and I wouldn't recommend it. You can do a much better job of it by manually adding the partitions, which you do like this:
ALTER TABLE my_data ADD
PARTITION (created_date = '2019-03-07') LOCATION 's3://some-bucket/data/created_date=2019-03-07/'
PARTITION (created_date = '2019-03-08') LOCATION 's3://some-bucket/data/created_date=2019-03-08/'
Finally, when you query the table make sure to include the created_date column to give Athena the information it needs to read only the objects that are relevant for the query:
SELECT COUNT(*)
FROM my_data
WHERE created_date >= DATE '2019-03-07'
You can verify that the query will be cheaper by observing the difference in the data scanned when you change from, for example, created_date >= DATE '2019-03-07' to created_date = DATE '2019-03-07'.
If you are not able to change the way the objects are organized on S3, there is a poorly documented feature that makes it possible to create a partitioned table even when you can't change the data objects. What you do is create the same prefixes as I suggest above, but instead of moving the JSON objects into this structure you put a file called symlink.txt in each partition's prefix:
s3://some-bucket/data/created_date=2019-03-07/symlink.txt
s3://some-bucket/data/created_date=2019-03-08/symlink.txt
In each symlink.txt you put the full S3 URI of the files that you want to include in that partition. For example, in the first file you could put:
s3://data-bucket/data/object0.json
s3://data-bucket/data/object1.json
and in the second file:
s3://data-bucket/data/object2.json
s3://data-bucket/data/object3.json
s3://data-bucket/data/object4.json
Then you create a table that looks very similar to the table above, but with one small difference:
CREATE EXTERNAL TABLE my_data (
  column0 string,
  column1 int
)
PARTITIONED BY (created_date date)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-bucket/data/'
TBLPROPERTIES ('has_encrypted_data'='false')
Notice the INPUTFORMAT value in the STORED AS clause.
You add partitions just like you do for any partitioned table:
ALTER TABLE my_data ADD
PARTITION (created_date = '2019-03-07') LOCATION 's3://some-bucket/data/created_date=2019-03-07/'
PARTITION (created_date = '2019-03-08') LOCATION 's3://some-bucket/data/created_date=2019-03-08/'
The only Athena-related documentation of this feature that I have come across is the S3 Inventory docs for integrating with Athena.
I started working with Theo's answer and it was very close (thank you, Theo, for the excellent and very detailed response), but according to the documentation, when adding multiple partitions you only need to specify "ADD" once, near the beginning of the query.
I tried specifying "ADD" on each line per Theo's example but received an error. It works when it is only specified once, though. Below is the format I used, which was successful:
ALTER TABLE db.table_name ADD IF NOT EXISTS
PARTITION (event_date = '2019-03-01') LOCATION 's3://bucket-name/2019-03-01/'
PARTITION (event_date = '2019-03-02') LOCATION 's3://bucket-name/2019-03-02/'
PARTITION (event_date = '2019-03-03') LOCATION 's3://bucket-name/2019-03-03/'
...
I have a date partitioned table with around 400 partitions.
Unfortunately, the datatype of one of the columns has changed and it needs to be changed from INT to STRING.
I can change the datatype as follows:
SELECT
CAST(change_var AS STRING) change_var
<rest of columns>
FROM dataset.table_name
and overwrite the table, but the date partitioning is then lost.
Is there any way to keep the partitioning and change a column's datatype?
Option 1.
Export table by partition. I created a simple library to achieve it. https://github.com/rdtr/bq-partition-porter
Then create a new table with the correct type and load the data into the new table again, partition by partition. Be careful about the quota (1,000 exports per day); 400 should be okay.
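As a rough, query-based sketch of the per-partition rewrite (an alternative to exporting and reloading files), assuming an ingestion-time partitioned table: the date below is hypothetical, and the destination partition, e.g. dataset.new_table$20190301, has to be set through the bq CLI or the API rather than in the SQL itself. Each per-partition query would look something like:
-- rewrite one day's partition, casting the changed column on the way
SELECT
  CAST(change_var AS STRING) AS change_var,
  * EXCEPT (change_var)
FROM dataset.table_name
WHERE _PARTITIONTIME = TIMESTAMP('2019-03-01')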
Option 2.
By using Cloud Dataflow, you can export the whole table and then use DynamicDestinations to import the data into BigQuery by partition. If the number of partitions is too large for the first option, this approach would satisfy the requirement.
I expect the bq load command to eventually have some way to specify a partition key field name (since it's already described in the bq load help). Until then, you need to follow one of these options.
I have defined a table as such:
create external table PageViews (Userid string, Page_View string)
partitioned by (ds string)
row format delimited fields terminated by ','
stored as textfile location '/user/data';
I do not want all the files in the /user/data directory to be used as part of the table. Is it possible for me to do the following?
location 'user/data/*.csv'
What kmosley said is true. As of now, you can't selectively choose certain files to be a part of your Hive table. However, there are 2 ways to get around it.
Option 1:
You can move all the csv files into another HDFS directory and create a Hive table on top of that. If it works better for you, you can create a subdirectory (say, csv) within your present directory that houses all CSV files. You can then create a Hive table on top of this subdirectory. Keep in mind that any Hive tables created on top of the parent directory will NOT contain the data from the subdirectory.
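As a sketch of Option 1, assuming the CSV files are moved into a hypothetical subdirectory /user/data/csv (the column names mirror the question):
-- table over only the CSV subdirectory, not the whole /user/data directory
CREATE EXTERNAL TABLE PageViews_csv (Userid STRING, Page_View STRING)
PARTITIONED BY (ds STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/data/csv';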
Option 2:
You can change your queries to make use of a virtual column called INPUT__FILE__NAME.
Your query would look something like:
SELECT
*
FROM
my_table
WHERE
INPUT__FILE__NAME LIKE '%csv';
The ill effect of this approach is that the Hive query will have to churn through all the data present in the directory even though you only care about specific files. The query won't skip files based on the INPUT__FILE__NAME predicate; it will just filter out the records whose INPUT__FILE__NAME doesn't match the predicate during the map phase (consequently filtering out all records from particular files), but the mappers will still run on the unnecessary files as well. It will give you the correct result, but it might have some, probably minor, performance overhead.
The benefit of this approach is that you can use the same Hive table if you have multiple files in your table and you want the ability to query all files from that table (or its partition) in some queries and only a subset of the files in other queries. You can make use of the INPUT__FILE__NAME virtual column to achieve that. As an example:
If a partition in your HDFS directory /user/hive/warehouse/web_logs/ looked like:
/user/hive/warehouse/web_logs/dt=2012-06-30/
/user/hive/warehouse/web_logs/dt=2012-06-30/00.log
/user/hive/warehouse/web_logs/dt=2012-06-30/01.log
.
.
.
/user/hive/warehouse/web_logs/dt=2012-06-30/23.log
Let's say your table definition looked like:
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs_table (col1 STRING)
PARTITIONED BY (dt STRING)
LOCATION '/user/hive/warehouse/web_logs';
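For illustration, adding the day shown in the listing above could be done like this (no explicit LOCATION is needed, since the data sits in the default dt=… directory under the table location):
ALTER TABLE web_logs_table ADD IF NOT EXISTS
PARTITION (dt = '2012-06-30');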
After adding the appropriate partitions, you could query all logs in the partition using a query like:
SELECT
*
FROM
web_logs_table w
WHERE
dt='2012-06-30';
However, if you only cared about the logs from the first hour of the day, you could query the logs for the first hour using a query like:
SELECT
*
FROM
web_logs_table w
WHERE
dt ='2012-06-30'
AND INPUT__FILE__NAME LIKE '%/00.log';
Another similar use case could be a directory that contains web logs from different domains and various queries need to analyze logs on different sets of domains. The queries can filter out domains using the INPUT__FILE__NAME virtual column.
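For example, a hypothetical query restricted to a single domain's log files (the domain and the file-naming convention are assumptions, not part of the original setup) could look like:
SELECT
    *
FROM
    web_logs_table w
WHERE
    dt = '2012-06-30'
    AND INPUT__FILE__NAME LIKE '%example.com%';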
In both the above use-cases, having a sub partition for hour or domain would solve the problem as well, without having to use the virtual column. However, there might exist some design trade-offs that require you to not create sub-partitions. In that case, arguably, using INPUT__FILE__NAME virtual column is your best bet.
Deciding between the 2 options:
It really depends on your use case. If you will never care about the files you are trying to exclude from the Hive table, using Option 2 is probably overkill, and you should fix up the directory structure and create a Hive table on top of the directory containing the files that you care about.
If the files you are presently excluding follow the same format as the other files (so they can all be part of the same Hive table) and you could see yourself writing a query that would analyze all the data in the directory, then go with Option 2.
I came across this thread when I had a similar problem to solve. I was able to resolve it by using a custom SerDe. I then added SerDe properties which guided what RegEx to apply to the file name patterns for any particular table.
A custom SerDe might seem like overkill if you are only dealing with standard CSV files; I had a more complex file format to deal with. Still, this is a very viable solution if you don't shy away from writing some Java. It is particularly useful when you are unable to restructure the data in your storage location and you are looking for a very specific file pattern among a disproportionately large file set.
CREATE EXTERNAL TABLE PageViews (Userid string, Page_View string)
ROW FORMAT SERDE 'com.something.MySimpleSerDe'
WITH SERDEPROPERTIES ("input.regex" = "*.csv")
LOCATION '/user/data';
No, you cannot currently do that. There is a JIRA ticket open to allow regex selection of included files for Hive tables (https://issues.apache.org/jira/browse/HIVE-951).
For now your best bet is to create a table over a different directory and just copy in the files you want to query.