Are the relations preserved when transferring data from a relational DB to S3? - amazon-s3

There are options for transferring a DB snapshot from a relational database to S3 in AWS.
But S3 is an object store, so it only stores files (e.g. parquet).
Are the relationships (like keys) between tables in the relational DB somehow carried over to S3? Can queries still be made against the files in S3 that would allow joins to be made between tables?

There are no "keys" like foreign key, primary key in the exported parquet files in S3, but you can still query the the exported data directly through tools like Amazon Athena or Amazon Redshift Spectrum. For more information on using Athena to read Parquet data, see Parquet SerDe in the Amazon Athena User Guide. For more information on using Redshift Spectrum to read Parquet data, see COPY from columnar data formats in the Amazon Redshift Database Developer Guide.
The time it takes for the export to complete depends on the data stored in the database. For example, tables with well-distributed numeric primary key or index columns export the fastest. Tables that don't contain a column suitable for partitioning, and tables with only one index on a string-based column, take longer because the export uses a slower single-threaded process. For example, if a table has a numeric primary key and 100,000 rows, the data is split into several portions during the export, and each portion becomes a directory in the S3 bucket. When you later query the data in Athena/Redshift Spectrum by that ID, AWS knows which directories to scan, which improves performance and speed.
In summary, once the data is exported to S3 in a columnar format like Parquet, you can query it in place with Athena, load it into Redshift or another data store for further analytics, and so on.
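As a rough illustration (the customers/orders tables and their columns are hypothetical, not part of any particular export), once both exported tables are registered in Athena you can join them just as you would in the source database; the join simply relies on matching column values rather than an enforced foreign key:
-- hypothetical tables registered over the exported Parquet data in S3
select c.customer_id, c.name, o.order_id, o.total
from customers c
join orders o
  on o.customer_id = c.customer_id;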

Related

Small single parquet file on Data Lake, or relational SQL DB?

I am designing a Data Lake in Azure Synapse and, in my model, there is a table that will store a small amount of data (like 5000 rows).
The single parquet file that stores this data will surely be smaller than the smallest recommended size for a parquet file (128 MB) and I know that Spark is not optimized to handle small files. This table will be linked to a delta table, and I will insert/update new data by using the MERGE command.
In this scenario, regarding performance, is it better to stick with a delta table, or should I create a SQL relational table in another DB and store this data there?
It depends on multiple factors, such as the types of queries you will be running and how often you want to run the MERGE command to upsert data into the delta table.
But even if you do perform analytical queries, given the size of the data I would go with a relational DB.
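For reference, if you do stay with the delta table, the MERGE-based upsert mentioned in the question would look roughly like this (a minimal sketch; dim_small, updates, and id are placeholder names, not from the original post):
-- placeholder names; upserts an incoming batch into the small delta table
MERGE INTO dim_small AS target
USING updates AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;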

Which file format should I use that supports appending?

Currently we use the ORC file format to store incoming traffic in S3 for fraud detection analysis.
We chose the ORC file format for the following reasons:
compression
and the ability to query the data using Athena
Problem:
ORC files are read-only, and we want to update the file contents every 20 minutes,
which implies we
need to download the ORC files from S3,
read the files,
write to the end of each file,
and finally upload them back to S3.
This was not a problem at first, but the data grows significantly, by about 2 GB every day, and it has become a very costly process to download ~10 GB of files, read them, append to them, and upload them again.
Question:
Is there any way to use another file format that also offers appends/inserts and can be queried by Athena?
From this article it says Avro is such a file format, but I am not sure
if Athena can be used for querying it,
or whether there are any other issues.
Note: my skills with big data technologies are at a beginner level.
If your table is not partitioned, you can simply copy (aws s3 cp) your new ORC files to the target S3 path for the table, and they will be available instantly for querying via Athena.
If your table is partitioned, you can copy new files to the paths corresponding to your specific partitions. After copying new files into a partition, you need to add or update that partition in Athena's metastore.
For example, if your table is partitioned by date, then you need to run this query to ensure your partition gets added/updated:
alter table dataset.tablename add if not exists
partition (date = 'YYYYMMDD')  -- drop the quotes if the partition column is numeric
location 's3://your-bucket/path_to_table/date=YYYYMMDD/';
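If you want to confirm that the partition is now registered, you can list the table's partitions afterwards (standard Athena/Hive syntax, using the same placeholder table name as above):
show partitions dataset.tablename;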

Athena (Hive/Presto) Parquet vs ORC In Count Query

I am testing a large data set (1.5 TB, 5.5 billion records) in Athena in both Parquet and ORC formats. My first test is a simple one, a count query:
SELECT COUNT(*) FROM events_orc
SELECT COUNT(*) FROM events_parquet
The Parquet table takes half the time of the ORC table to run this query. But one thing I noticed is that when running a count on the Parquet table, it returns 0 KB as the bytes scanned, whereas with ORC it returns 78 GB. This makes sense for Parquet because the count is in the metadata, so there is no need to scan bytes. ORC also has metadata with the count, but Athena doesn't seem to be using that metadata to determine the counts for these files.
Why doesn't Athena use the metadata in the ORC file to determine the count, when it clearly does so with Parquet files?
The answer is, as you say, that Athena reads the Parquet metadata but not the ORC metadata. There is no reason besides that feature not being in the version of Presto and/or the ORC serde that Athena uses.
I've also noticed that Athena reads too much data when using ORC; it doesn't skip columns it should, etc. I think the Athena ORC serde is just old and doesn't have all the optimisations you would expect. Athena is, after all, based on a very old Presto version.
This depends on how those ORC files were created. Could you explain a bit how you did the ETL and what the table definitions are?
ORC has a few levels of indexes:
ORC provides three levels of indexes within each file:
file level: statistics about the values in each column across the entire file
stripe level: statistics about the values in each column for each stripe
row level: statistics about the values in each column for each set of 10,000 rows within a stripe
The file and stripe level column statistics are in the file footer so that they are easy to access to determine if the rest of the file needs to be read at all. Row level indexes include both the column statistics for each row group and the position for seeking to the start of the row group.
Athena, like the PrestoDB engine it is built on, can use these indexes to speed up queries.
I would be extremely surprised if Athena were not using this information for its queries.
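As a rough illustration of where those statistics pay off (the event_time column is hypothetical, not from the question), a selective filter lets the reader compare the predicate against the file- and stripe-level min/max values and skip stripes that cannot contain matching rows:
-- event_time is a hypothetical column; stripe-level min/max statistics let the
-- ORC reader skip stripes whose range cannot match the predicate
select count(*)
from events_orc
where event_time >= timestamp '2020-01-01 00:00:00';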

Mapping AWS glue table columns to target RDS instance table columns

I have created a Glue job that takes data from an S3 bucket and inserts it into an **RDS Postgres instance**.
In the S3 bucket I have created different folders (partitions).
Can I map different columns in different partitions to the same target RDS instance table?
When you say partition in S3, is it indicated using the Hive style? e.g. bucket/folder1/folder2/partition1=xx/partition2=xx/partition3=yy/..
If so, you shouldn't be storing data with different structures of information in S3 partitions and then mapping them to a single table. However, if it's just data in different folders, like s3://bucket/folder/folder2/folder3/.., and these are genuinely different datasets, then yes, it is possible to map those datasets to a single table. However, you cannot do it via the UI. You will need to read these datasets as separate dynamic/data frames, join them using a key in Glue/Spark, and load them to RDS.
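A minimal sketch of the join step only, in Spark SQL: it assumes the two folders have already been read inside the Glue job and registered as temporary views (folder_a, folder_b, id, and the column names are illustrative, not from the question); writing the joined result to RDS would still be done by the job's JDBC output:
-- illustrative names; the two views are registered over the two S3 folders
-- inside the Glue/Spark job before this query runs
select a.id,
       a.column_from_first_folder,
       b.column_from_second_folder
from folder_a a
join folder_b b
  on a.id = b.id;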

Indexing and partitioning Parquet in S3

Is it possible to both index and partition a Parquet file in S3, or is this functionality only available on file-storage types of volumes?
I'm looking for a way to give researchers access to the same data in S3 via EMR notebooks for (a) generic R and Python scripts, and (b) Spark-enabled querying. But the proprietary solution and query language we have right now provides indexing and partitioning on an NFS store, so I want to preserve this functionality. I see that Delta Lake provides this, but I'm wondering if it can be achieved with simpler tools like Arrow.
You could use Delta Lake to partition a Parquet file. Delta tables are also indexed by default.
You can do it like this:
%sql
-- stage the existing Parquet data as an unpartitioned table
CREATE TABLE UsableTable_unpartitioned
USING PARQUET
LOCATION 'Location of the Parquet File on S3';

-- rewrite it as a partitioned Delta table
CREATE TABLE UsableTable
USING DELTA
PARTITIONED BY (my_partitioned_column)
LOCATION 'MyS3Location'
AS SELECT * FROM UsableTable_unpartitioned;

DROP TABLE UsableTable_unpartitioned;
Verify your partitions and the other table details that were created:
%sql
describe detail UsableTable
You could then expose this table using JDBC.