Indexing and partitioning Parquet in S3

Is it possible to both index and partition a Parquet file in S3, or is this functionality only available on file-storage types of volumes?
I'm looking for a way to give researchers access to the same data in S3 via EMR notebooks for (a) generic R and Python scripts and (b) Spark-enabled querying. The proprietary solution and query language we have right now provides indexing and partitioning on an NFS store, so I want to preserve this functionality. I see that Delta Lake provides this, but I'm wondering whether it can be achieved with simpler tools like Arrow.

You could use Delta Lake to partition a Parquet file. Delta tables also collect file-level statistics by default, which act as an index for data skipping.
You can do it like this:
%sql
CREATE TABLE UsableTable_unpartitioned
USING parquet
LOCATION 'Location of the Parquet file on S3';
CREATE TABLE UsableTable
USING DELTA
PARTITIONED BY (my_partitioned_column)
LOCATION 'MyS3Location'
AS SELECT * FROM UsableTable_unpartitioned;
DROP TABLE UsableTable_unpartitioned;
Verify that your partitions and all the required table details were created:
%sql
describe detail UsableTable
You could then expose this table over JDBC.
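If you prefer to do the same conversion programmatically from an EMR notebook rather than in %sql cells, here is a minimal PySpark sketch. It assumes the Delta Lake package is available on the cluster; the bucket paths and column name are hypothetical placeholders.

from pyspark.sql import SparkSession

# Assumes the Delta Lake package (io.delta:delta-core) is on the cluster classpath.
spark = (SparkSession.builder
         .appName("parquet-to-delta")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Read the existing unpartitioned Parquet data from S3 (hypothetical path).
df = spark.read.parquet("s3://my-bucket/raw/parquet/")

# Rewrite it as a partitioned Delta table. Delta records per-file min/max
# statistics in its transaction log, which is what enables data skipping.
(df.write
   .format("delta")
   .partitionBy("my_partitioned_column")
   .save("s3://my-bucket/delta/usable_table/"))

# Register the table in the metastore so it can be queried by name or over JDBC.
spark.sql("CREATE TABLE UsableTable USING DELTA "
          "LOCATION 's3://my-bucket/delta/usable_table/'")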

Related

Which file format should I use that supports appending?

We currently use the ORC file format to store the incoming traffic in S3 for fraud-detection analysis.
We chose the ORC file format for the following reasons:
compression
the ability to query the data using Athena
Problem:
ORC files are read-only, and we want to update the file contents every 20 minutes, which means we
need to download the ORC files from S3,
read the files,
append to the end of each file,
and finally upload them back to S3.
This was not a problem at first, but the data grows significantly, by roughly 2 GB every day, so downloading ~10 GB of files, reading, rewriting, and uploading them is a very costly process.
Question:
Is there another file format that supports appends/inserts and can also be queried by Athena?
This article says Avro is such a file format, but I'm not sure
whether Athena can be used to query it, or
whether there are any other issues.
Note: my skills in big-data technologies are at a beginner level.
If your table is not partitioned, you can simply copy (aws s3 cp) your new ORC files to the target S3 path for the table and they will be available instantly for querying via Athena.
If your table is partitioned, you can copy new files to the paths corresponding to your specific partitions. After copying new files into a partition, you need to add or update that partition in Athena's metastore.
For example, if your table is partitioned by date, you need to run this query to ensure your partition gets added/updated:
ALTER TABLE dataset.tablename ADD IF NOT EXISTS
PARTITION (date = 'YYYYMMDD')
LOCATION 's3://your-bucket/path_to_table/date=YYYYMMDD/'
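If you want to script this instead of running aws s3 cp and the DDL by hand, a minimal Python sketch using boto3 could look like the following; the bucket, table, database, and partition value are hypothetical placeholders.

import boto3

BUCKET = "your-bucket"          # hypothetical bucket
TABLE_PREFIX = "path_to_table"  # prefix the Athena table points at
DATABASE = "dataset"            # Athena database
PARTITION_DATE = "20200618"     # partition value for this batch

s3 = boto3.client("s3")
athena = boto3.client("athena")

# 1. Upload the new ORC file into the partition's S3 prefix.
s3.upload_file(
    "traffic_batch.orc",
    BUCKET,
    f"{TABLE_PREFIX}/date={PARTITION_DATE}/traffic_batch.orc",
)

# 2. Add (or refresh) the partition in Athena's metastore.
ddl = (
    f"ALTER TABLE tablename ADD IF NOT EXISTS "
    f"PARTITION (date = '{PARTITION_DATE}') "
    f"LOCATION 's3://{BUCKET}/{TABLE_PREFIX}/date={PARTITION_DATE}/'"
)
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": f"s3://{BUCKET}/athena-query-results/"},
)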

How to load data into BigQuery from buckets partitioned by Year/Month/Day

We have data stored in a GCP bucket in the following layout:
gs://gcptest/Year=2020/Month=06/day=18/test1.parquet, with many files under the day=18 folder.
I want to create a table in BigQuery with the columns present in the files, partitioned by the Year, Month, and Day values that are present in the file path.
That way, when I load the data into the table, I can just select the path in the GCP bucket and load data that is partitioned by the Year/Month/Day values in the path.
BigQuery supports loading externally partitioned data in Avro, Parquet, ORC, CSV and JSON formats that is stored on Cloud Storage using a default hive partitioning layout.
Support is currently limited to the BigQuery web UI, command-line tool, and REST API.
You can read more in the Loading externally partitioned data documentation.
See also how to Query externally partitioned data.
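Since the Python client library goes through the REST API, here is a rough sketch of such a load with the google-cloud-bigquery client's hive-partitioning options; the project, dataset, and table names are hypothetical, and the source_uri_prefix must point at the folder just above the Year=/Month=/day= keys.

from google.cloud import bigquery

client = bigquery.Client()

# Hive partitioning options: AUTO infers Year, Month, and day (and their types)
# from the key=value folder names under the prefix.
hive_opts = bigquery.HivePartitioningOptions()
hive_opts.mode = "AUTO"
hive_opts.source_uri_prefix = "gs://gcptest/"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    hive_partitioning=hive_opts,
)

# The single * wildcard picks up every Parquet file under the prefix.
load_job = client.load_table_from_uri(
    "gs://gcptest/*",
    "my_project.my_dataset.my_table",   # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for the load to complete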

Presto query engine with Azure Data Lake

I have a requirement to deploy a Presto server that can help me query data stored in ADLS in the Avro file format.
I have gone through this tutorial, and it seems that Hive is used as a catalog/connector in Presto to query ADLS. Can I bypass Hive and use some other connector to extract data from ADLS?
Can I bypass Hive and use some other connector to extract data from ADLS?
No.
Hive plays two roles here:
storage for metadata. It contains information like:
the schema and table name
columns
data format
data location
execution:
it is capable of reading data from distributed file systems (like HDFS, S3, ADLS)
it tells how execution can be distributed.

create BigQuery external tables partitioned by one/multiple columns

I am porting a Java application from Hadoop/Hive to Google Cloud/BigQuery. The application writes Avro files to HDFS and then creates Hive external tables with one or more partitions on top of those files.
I understand BigQuery only supports date/timestamp partitions for now, and no nested partitions.
The way we handle Hive now is that we generate the DDL and then execute it with a REST call.
I could not find support for CREATE EXTERNAL TABLE in the BigQuery DDL docs, so I've switched to using the Java library.
I managed to create an external table, but I cannot find any reference to partitions in the parameters passed to the call.
Here's a snippet of the code I use:
....
ExternalTableDefinition extTableDef =
ExternalTableDefinition.newBuilder(schemaName, null, FormatOptions.avro()).build();
TableId tableID = TableId.of(dbName, tableName);
TableInfo tableInfo = TableInfo.newBuilder(tableID, extTableDef).build();
Table table = bigQuery.create(tableInfo);
....
There is, however, support for partitions on non-external tables.
I have a few questions:
Is there support for creating external tables with partition(s)? If so, can you please point me in the right direction?
Is loading the data into BigQuery preferable to keeping it stored in GCS as Avro files?
If yes, how would we deal with schema evolution?
Thank you very much in advance.
You cannot create partitioned tables over files on GCS, although you can use the special _FILE_NAME pseudo-column to filter out the files that you don't want to read.
If you can, prefer just to load data into BigQuery rather than leaving it on GCS. Loading data is free, and queries will be way faster than if you run them over Avro files on GCS. BigQuery uses a columnar format called Capacitor internally, which is heavily optimized for BigQuery, whereas Avro is a row-based format and doesn't perform as well.
In terms of schema evolution, if you need to change a column type, drop a column, etc., you should recreate your table (CREATE OR REPLACE TABLE ...). If you are only ever adding columns, you can add the new columns using the API or UI.
See also a relevant blog post about lazy data loading.
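To illustrate the second point (loading into BigQuery rather than keeping Avro on GCS), here is a minimal sketch using the BigQuery Python client; the question uses the Java library, but the call shape is similar, and the bucket/dataset/table names below are hypothetical. The ALLOW_FIELD_ADDITION option covers the additive schema-evolution case mentioned above.

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # Let new fields appearing in the Avro files become new table columns.
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.avro",      # hypothetical source files
    "my_project.my_dataset.events",      # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish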

Where does Hive store its tables?

I am new to Hadoop and I just started working with Hive. In my understanding, it provides a query language to process data in HDFS, and with HiveQL we can create tables and load data into them from HDFS.
So my question is: where are those tables stored? Specifically, if we have a 100 GB file in our HDFS and we want to make a Hive table out of that data, what will the size of that table be and where will it be stored?
If my understanding of this concept is wrong, please correct me.
If the table is 100 GB you should consider a Hive external table (as opposed to a "managed table"; for the difference, see this).
With an external table the data itself will still be stored on HDFS in the file path that you specify (note that you may specify a directory of files as long as they all have the same structure), but Hive will create a map of it in the metastore, whereas a managed table stores the data "in Hive".
When you drop a managed table, it drops the underlying data, as opposed to dropping a Hive external table, which only drops the metadata in the metastore referencing that data.
Either way you are using only 100 GB as viewed by the user, and you are taking advantage of HDFS's robustness through replication of the data.
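As a small illustration of the external-table approach, here is a sketch of the DDL, run here through PySpark with Hive support for convenience (the same statements work in the Hive CLI or beeline); the path, table, and column names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Create an external table over data that already sits in HDFS; nothing is copied,
# Hive only records the schema and location in its metastore.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS big_events (
        event_id   STRING,
        event_time TIMESTAMP,
        payload    STRING
    )
    STORED AS PARQUET
    LOCATION 'hdfs:///data/big_events/'
""")

# Dropping the external table removes only the metastore entry; the files under
# hdfs:///data/big_events/ (your 100 GB) are left untouched.
spark.sql("DROP TABLE big_events")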
Hive will create a directory on HDFS. If you don't specify any location, it will create the table's directory under /user/hive/warehouse on HDFS. After a LOAD command, the files are moved into /user/hive/warehouse/tablename. You can also point the table at an HDFS directory that contains partitions (if the files are partitioned), or use the external-table concept.