SparkSQL has an excellent trick: it will read your parquet data, correctly reading the schema from parquet's metadata. What's more, if you have data partitioned using a key=value scheme, SparkSQL will automatically recurse through the directory structure, reading those values in as a column called key. Documentation on this -- along with a pretty clear example -- is here.
Unfortunately, my data is partitioned in a way that works nicely for Cascading but doesn't seem to jibe with SparkSQL's expectations:
2015/
└── 04
├── 29
│ └── part-00000-00002-r-00000.gz.parquet
└── 30
└── part-00000-00001-r-00000.gz.parquet
In Cascading, I can specify a PartitionTap, tell it the first three items will be year, month, and day, and I'm off to the races. But I cannot figure out how to achieve a similar effect in SparkSQL. Is it possible to do any of:
Just ignore the partitioning; recurse down to the parquet data and read everything found (a sketch of this approach follows below). (I am aware that I could roll my own code to this effect using Hadoop's FileSystem API, but I'd really rather not.)
Specify a partial schema -- e.g. "columns are year (int), month (int), day (int), plus infer the rest from parquet"
Specify the whole schema?
(My parquet data contains nested structures, which SparkSQL can read and interact with beautifully as long as I let it do so automagically. If I try to specify the schema manually, it cannot seem to handle the nested structures.)
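For what it's worth, a minimal Spark SQL sketch of the first option above (ignore the layout, read every parquet file found, then recover the date parts from the file path) might look like the following; the view names, root path, and regular expression are illustrative assumptions, not anything SparkSQL mandates:
-- Hypothetical root path; the glob matches <year>/<month>/<day>/part-*.parquet files.
CREATE TEMPORARY VIEW events_raw
USING parquet
OPTIONS (path "/data/*/*/*/*.parquet");

-- Recover the would-be partition columns from the full input file path.
CREATE TEMPORARY VIEW events AS
SELECT
  CAST(regexp_extract(input_file_name(), '(\\d{4})/(\\d{2})/(\\d{2})', 1) AS INT) AS year,
  CAST(regexp_extract(input_file_name(), '(\\d{4})/(\\d{2})/(\\d{2})', 2) AS INT) AS month,
  CAST(regexp_extract(input_file_name(), '(\\d{4})/(\\d{2})/(\\d{2})', 3) AS INT) AS day,
  *
FROM events_raw;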
Related
In Azure Mapping Data Flow we now have the option to save files in Delta format. But that is only available when we select an inline dataset (without a Databricks subscription). And when the sink dataset is an inline dataset, it does not allow setting the partition based on any column.
I can write PySpark code to rewrite the Delta table with the required partitioning, but that would incur additional cost.
What could be workarounds for getting good performance on Delta data?
There was a UI issue that was recently fixed by the engineering team. Until this is reflected at your end, you could do the following as a workaround:
Option 1:
You can change the type of sink to something else, like a delimited text sink, and you should then see the key columns in Key partitioning. Then, switch the Sink type back to Delta.
Reference : https://learn.microsoft.com/en-us/answers/questions/599075/index.html
Option 2:
You could enable the partitioning at the source end.
The partitioned data was flowing as a stream, and I was able to get the partitioned data as a result.
I have an S3 bucket with ~70 million JSONs (~15 TB) and an Athena table to query them by timestamp and some other keys defined in the JSON.
It is guaranteed that the timestamp in the JSON is more or less equal to the S3 createdDate of the JSON (or at least equal enough for the purpose of my query).
Can I somehow improve query performance (and cost) by adding the createdDate as something like a "partition", which I understand is only possible for prefixes/folders?
Edit:
I currently simulate that by using the S3 Inventory CSV to pre-filter by createdDate and then downloading all the JSONs and doing the rest of the filtering, but I'd like to do that completely inside Athena, if possible.
There is no way to make Athena use things like S3 object metadata for query planning. The only way to make Athena skip reading objects is to organize the objects in a way that makes it possible to set up a partitioned table, and then query with filters on the partition keys.
It sounds like you have an idea of how partitioning in Athena works, and I assume there is a reason that you are not using it. However, for the benefit of others with similar problems coming across this question, I'll start by explaining what you can do if you can change the way the objects are organized. I'll give an alternative suggestion at the end; you may want to jump straight to that.
I would suggest you organize the JSON objects using prefixes that contain some part of the timestamps of the objects. Exactly how much depends on the way you query the data. You don't want it too granular, nor too coarse. Making it too granular will make Athena spend more time listing files on S3; making it too coarse will make it read too many files. If the most common time period of queries is a month, month is a good granularity; if the most common period is a couple of days, then day is probably better.
For example, if day is the best granularity for your dataset you could organize the objects using keys like this:
s3://some-bucket/data/2019-03-07/object0.json
s3://some-bucket/data/2019-03-07/object1.json
s3://some-bucket/data/2019-03-08/object0.json
s3://some-bucket/data/2019-03-08/object1.json
s3://some-bucket/data/2019-03-08/object2.json
You can also use a Hive-style partitioning scheme, which is what other tools like Glue, Spark, and Hive expect, so unless you have reasons not to, it can save you grief in the future:
s3://some-bucket/data/created_date=2019-03-07/object0.json
s3://some-bucket/data/created_date=2019-03-07/object1.json
s3://some-bucket/data/created_date=2019-03-08/object0.json
I chose the name created_date here; I don't know what would be a good name for your data. You can use just date, but remember to always quote it (and quote it in different ways in DML and DDL…) since it's a reserved word.
Then you create a partitioned table:
CREATE EXTERNAL TABLE my_data (
column0 string,
column1 int
)
PARTITIONED BY (created_date date)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-bucket/data/'
TBLPROPERTIES ('has_encrypted_data'='false')
Some guides will then tell you to run MSCK REPAIR TABLE to load the partitions for the table. If you use Hive-style partitioning (i.e. …/created_date=2019-03-08/…) you can do this, but it will take a long time and I wouldn't recommend it. You can do a much better job of it by manually adding the partitions, which you do like this:
ALTER TABLE my_data ADD
PARTITION (created_date = '2019-03-07') LOCATION 's3://some-bucket/data/created_date=2019-03-07/'
PARTITION (created_date = '2019-03-08') LOCATION 's3://some-bucket/data/created_date=2019-03-08/'
Finally, when you query the table make sure to include the created_date column to give Athena the information it needs to read only the objects that are relevant for the query:
SELECT COUNT(*)
FROM my_data
WHERE created_date >= DATE '2019-03-07'
You can verify that the query will be cheaper by observing the difference in the data scanned when you change from, for example, created_date >= DATE '2019-03-07' to created_date = DATE '2019-03-07'.
If you are not able to change the way the objects are organized on S3, there is a poorly documented feature that makes it possible to create a partitioned table even when you can't change the data objects. What you do is you create the same prefixes as I suggest above, but instead of moving the JSON objects into this structure you put a file called symlink.txt in each partition's prefix:
s3://some-bucket/data/created_date=2019-03-07/symlink.txt
s3://some-bucket/data/created_date=2019-03-08/symlink.txt
In each symlink.txt you put the full S3 URI of the files that you want to include in that partition. For example, in the first file you could put:
s3://data-bucket/data/object0.json
s3://data-bucket/data/object1.json
and the second file:
s3://data-bucket/data/object2.json
s3://data-bucket/data/object3.json
s3://data-bucket/data/object4.json
Then you create a table that looks very similar to the table above, but with one small difference:
CREATE EXTERNAL TABLE my_data (
column0 string,
column1 int
)
PARTITIONED BY (created_date date)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-bucket/data/'
TBLPROPERTIES ('has_encrypted_data'='false')
Notice the value of the INPUTFORMAT property.
You add partitions just like you do for any partitioned table:
ALTER TABLE my_data ADD
PARTITION (created_date = '2019-03-07') LOCATION 's3://some-bucket/data/created_date=2019-03-07/'
PARTITION (created_date = '2019-03-08') LOCATION 's3://some-bucket/data/created_date=2019-03-08/'
The only Athena-related documentation of this feature that I have come across is in the S3 Inventory docs for integrating with Athena.
I started working with Theo's answer and it was very close (thank you, Theo, for the excellent and very detailed response), but according to the documentation, when adding multiple partitions you only need to specify "ADD" once, near the beginning of the query.
I tried specifying "ADD" on each line per Theo's example but received an error. It works when specified only once, though. Below is the format I used, which was successful:
ALTER TABLE db.table_name ADD IF NOT EXISTS
PARTITION (event_date = '2019-03-01') LOCATION 's3://bucket-name/2019-03-01/'
PARTITION (event_date = '2019-03-02') LOCATION 's3://bucket-name/2019-03-02/'
PARTITION (event_date = '2019-03-03') LOCATION 's3://bucket-name/2019-03-03/'
...
Recently our 5-year-old MySQL data warehouse (used mostly for business reporting) has gotten quite full, and we need to come up with a way to archive old, infrequently accessed data to clear up space.
I created a process which dumps old data from the DW into .parquet files in Amazon S3, which are then mapped onto an Athena table. This works quite well.
However, we sometimes add/rename/delete columns in existing tables. I'd like those changes to be reflected in the old, archived data as well, but I just can't come up with a good way to do it without reprocessing the entire dataset.
Is there a 'canonical' way to maintain structural compatibility between a live data warehouse and its file-based archived data? I've googled for relevant literature and come up with nothing.
Should I just accept that if I need to actively maintain schemas, then the data is not really archived?
There are tons of materials on the internet if you search for the term "schema evolution" in the big data space.
The Athena documentation has a chapter on schema updates, with case-by-case examples, here.
If you are re-processing the whole archived dataset to handle a schema change, you are probably doing a bit too much.
Since you have Parquet files, and by default Athena resolves Parquet columns by name rather than by index, you are safe in almost all cases (adding new columns, dropping columns, etc.), except for column renames. To handle renamed columns (and additions/drops as well), the fastest way is to use a view: in the view definition you can alias the renamed column. Also, if column renames make up most of your schema evolution and you are doing them a lot, you could consider Avro, which handles that gracefully.
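As a small, hedged illustration of the view approach (the table and column names here are made up): if the live warehouse renamed customer_id to client_id, a view over the archived data can expose the old Parquet column under the new name:
CREATE OR REPLACE VIEW archive_orders_v AS
SELECT
  order_id,
  customer_id AS client_id,  -- old archived column exposed under the live DW's new name
  order_total
FROM archive_orders;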
Plan A:
It's too late to do this, but PARTITIONing is an excellent tool for getting the data out of the table.
I say "too late" because adding partitioning would require enough space for making a copy of the already-big table. And you don't have that much disk space?
If the table were partitioned by Year or Quarter or Month, you could
Every period, "Export tablespace" to remove the oldest from the partition scheme.
That tablespace will then be a separate table; you could copy/dump/whatever, then drop it (see the sketch below).
At about the same time, you would build a new partition to receive new data.
(I would keep the two processes separate so that you could stretch beyond 5 years or shrink below 5 with minimal extra effort.)
A benefit of the method is that there is virtually zero impact on the big table during the processing.
An extra benefit of partitioning: You can actually return space to the OS (assuming you have innodb_file_per_table=ON).
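To make the rotation concrete, here is a rough sketch that uses EXCHANGE PARTITION rather than transportable tablespaces; the table, partition, and date names are made up, and the ranges must match however your table is actually partitioned:
-- Peel the oldest quarter off as a standalone table (near-instant, no row copying).
CREATE TABLE facts_2019q1 LIKE facts;
ALTER TABLE facts_2019q1 REMOVE PARTITIONING;
ALTER TABLE facts EXCHANGE PARTITION p2019q1 WITH TABLE facts_2019q1;
ALTER TABLE facts DROP PARTITION p2019q1;  -- now empty; removes it from the partition scheme

-- Open a new partition for incoming data by splitting the MAXVALUE catch-all.
ALTER TABLE facts REORGANIZE PARTITION p_future INTO (
  PARTITION p2024q3 VALUES LESS THAN (TO_DAYS('2024-10-01')),
  PARTITION p_future VALUES LESS THAN MAXVALUE
);
The standalone facts_2019q1 table can then be dumped or archived and dropped, returning its space to the OS.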
Plan B:
Look at what you do with the oooold data. Only a few things? Possibly involving summarization? So...
Don't archive the old data.
Summarize the data to-be-removed into new tables. Since they will be perhaps one-tenth the size, you can keep them online 'forever'.
I'm looking for a solution to a somewhat uncommon problem: generally, what I want to achieve is to have partitions of a Hive table mapped to specific folders that already exist and should not be renamed to fit the default Hive partition naming convention.
Folder structure I have is the following:
<some path>/Daily/database/<partition subfolders by day>
<some path>/Weekly/database/<partition subfolders by day>
<some path>/Lifetime/database/<partition subfolders by day>
What I'd like is for the period (Daily, etc.) folders to be treated as partitions as well.
Now there are two problems:
the database subfolder sits between the potential partition subdirectories
the period folders do not follow the partition-name=partition-value naming format
I guess the former could be solved by the terrible crutch of adding a dummy column with the value "database" to all rows and partitioning by it as well.
Regarding the latter, though, I'm not sure it's possible at all, and from what I was able to find it isn't, at least not in a sensible way. So I'm looking for advice on this, or at least expert confirmation that it's not possible :)
If that helps, my environment is the Databricks platform and the files are saved in Parquet format.
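For reference, one mechanism Hive (and Spark SQL on Databricks) does provide is pointing each partition at an arbitrary existing folder with an explicit LOCATION, so the directories themselves never need to follow the key=value convention; whether it fits all the constraints above is left open. A hypothetical sketch (table name, columns, and dates are invented, and <some path> is left as a placeholder):
CREATE EXTERNAL TABLE metrics (value DOUBLE)
PARTITIONED BY (period STRING, dt STRING)
STORED AS PARQUET
LOCATION '<some path>/';

ALTER TABLE metrics ADD
  PARTITION (period = 'Daily',  dt = '2021-01-01') LOCATION '<some path>/Daily/database/2021-01-01/'
  PARTITION (period = 'Weekly', dt = '2021-01-01') LOCATION '<some path>/Weekly/database/2021-01-01/';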
I have defined a table as such:
create external table PageViews (Userid string, Page_View string)
partitioned by (ds string)
row format delimited fields terminated by ','
stored as textfile location '/user/data';
I do not want all the files in the /user/data directory to be used as part of the table. Is it possible for me to do the following?
location 'user/data/*.csv'
What kmosley said is true. As of now, you can't selectively choose certain files to be a part of your Hive table. However, there are 2 ways to get around it.
Option 1:
You can move all the csv files into another HDFS directory and create a Hive table on top of that. If it works better for you, you can create a subdirectory (say, csv) within your present directory that houses all CSV files. You can then create a Hive table on top of this subdirectory. Keep in mind that any Hive tables created on top of the parent directory will NOT contain the data from the subdirectory.
Option 2:
You can change your queries to make use of a virtual column called INPUT__FILE__NAME.
Your query would look something like:
SELECT
*
FROM
my_table
WHERE
INPUT__FILE__NAME LIKE '%csv';
The ill effect of this approach is that the Hive query will have to churn through the entire data present in the directory even though you only care about specific files. The query wouldn't skip files based on the predicate on INPUT__FILE__NAME; it would just filter out the records whose files don't match the predicate during the map phase (consequently filtering out all records from particular files), but the mappers would still run on the unnecessary files. It will give you the correct result, but might have some, probably minor, performance overhead.
The benefit of this approach is that you can keep using the same Hive table when you have multiple files in your table and want the ability to query all files from that table (or its partition) in some queries and a subset of the files in other queries. You could make use of the INPUT__FILE__NAME virtual column to achieve that. As an example:
If a partition in your HDFS directory /user/hive/warehouse/web_logs/ looked like:
/user/hive/warehouse/web_logs/dt=2012-06-30/
/user/hive/warehouse/web_logs/dt=2012-06-30/00.log
/user/hive/warehouse/web_logs/dt=2012-06-30/01.log
.
.
.
/user/hive/warehouse/web_logs/dt=2012-06-30/23.log
Let's say your table definition looked like:
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs_table (col1 STRING)
PARTITIONED BY (dt STRING)
LOCATION '/user/hive/warehouse/web_logs';
After adding the appropriate partitions, you could query all logs in the partition using a query like:
SELECT
*
FROM
web_logs_table w
WHERE
dt='2012-06-30';
However, if you only cared about the logs from the first hour of the day, you could query the logs for the first hour using a query like:
SELECT
*
FROM
web_logs_table w
WHERE
dt = '2012-06-30'
AND INPUT__FILE__NAME LIKE '%00.log';
Another similar use case could be a directory that contains web logs from different domains and various queries need to analyze logs on different sets of domains. The queries can filter out domains using the INPUT__FILE__NAME virtual column.
In both the above use-cases, having a sub partition for hour or domain would solve the problem as well, without having to use the virtual column. However, there might exist some design trade-offs that require you to not create sub-partitions. In that case, arguably, using INPUT__FILE__NAME virtual column is your best bet.
Deciding between the 2 options:
It really depends on your use case. If you would never care about the files you are trying to exclude from the Hive table, using Option 2 is probably overkill, and you should fix up the directory structure and create a Hive table on top of the directory containing the files that you care about.
If the files you are presently excluding follow the same format as the other files (so they can all be part of the same Hive table) and you could see yourself writing a query that would analyze all the data in the directory, then go with Option 2.
I came across this thread when I had a similar problem to solve. I was able to resolve it by using a custom SerDe. I then added SerDe properties which guided what RegEx to apply to the file name patterns for any particular table.
A custom SerDe might seem like overkill if you are only dealing with standard CSV files; I had a more complex file format to deal with. Still, this is a very viable solution if you don't shy away from writing some Java. It is particularly useful when you are unable to restructure the data in your storage location and you are looking for a very specific file pattern among a disproportionately large file set.
CREATE EXTERNAL TABLE PageViews (Userid string, Page_View string)
ROW FORMAT SERDE 'com.something.MySimpleSerDe'
WITH SERDEPROPERTIES ("input.regex" = ".*\\.csv")
LOCATION '/user/data';
No, you cannot currently do that. There is a JIRA ticket open to allow regex selection of included files for Hive tables (https://issues.apache.org/jira/browse/HIVE-951).
For now your best bet is to create a table over a different directory and just copy in the files you want to query.