Using AWS Glue to create a table in Athena for Parquet data stored in S3

I want to preview, in Athena, data that resides in an S3 bucket. The data is in Parquet. This doc here describes how to use AWS Glue to create a preview. One mandatory step is to input the Column Details, which means entering each column name and its data type. I have two problems with this step:
1 - What if I have no idea what columns exist in the Parquet file beforehand (i.e. I have not seen the content of the Parquet file before)?
2 - What if there are hundreds, if not thousands, of columns in there?
Is there a way to make this work without entering the Column Details?

The link you provided answers your first question, I think:
What if I have no idea what columns exist in the Parquet file beforehand
Then you should use a Glue crawler to explore the files and have it create a Glue table for you. That table will show up in the AwsDataCatalog catalog as a queryable relation.
What if there are hundreds, if not thousands, of columns in there?
If you're worried about a column quota, I spent some time looking through the documentation for any mention of a service quota on the maximum number of columns per table and could not find one. That doesn't mean there isn't one, but I would be surprised if someone managed to generate a Parquet file with more columns than Glue supports.
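For illustration, once the crawler has run you can query the table it created straight from Athena without ever typing in column details; the database and table names below are hypothetical placeholders for whatever the crawler produces:
-- Preview a few rows of the crawler-created table (names are placeholders)
SELECT * FROM "AwsDataCatalog"."my_database"."my_parquet_table" LIMIT 10;
-- Inspect the schema the crawler inferred
DESCRIBE my_database.my_parquet_table;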

Related

Table without date and Primary Key

I have 9M records. We need to do the following operations:
Daily, we receive the entire file of 9M records, about 150 GB in size.
It is truncate-and-load in Snowflake: every day the entire 9M records are deleted and reloaded.
We want to send only an incremental load to Snowflake. Meaning that:
For example, out of the 9 million records, only about 0.5 million would have changed (0.1M inserts, 0.3M deletes, and 0.2M updates). How can we compare the files, extract only the delta, and load it to Snowflake? How do we do this in a cost-effective and fast way with AWS-native tools, landing the result in S3?
P.S. The data doesn't have any date column. It is a pretty old design, written in 2012, that we need to optimize. The file format is fixed width. Attaching sample raw data.
Sample Data:
https://paste.ubuntu.com/p/dPpDx7VZ5g/
In a nutshell, I want to extract only the inserts, updates, and deletes into a file. What is the best and most cost-efficient way to classify these?
Your tags and the question content do not match, but I am guessing that you are trying to load data from Oracle to Snowflake. You want to do an incremental load from Oracle, but you do not have an incremental key in the table to identify the incremental rows. You have two options:
Work with your data owners and put in the effort to identify the incremental key. There needs to be one; people are sometimes too lazy to put in this effort. This is the most optimal option.
If you cannot, then look for a CDC (change data capture) solution like GoldenGate.
A CDC stage comes by default in DataStage.
Using the CDC stage in combination with a Transformer stage is the best approach to identify new rows, changed rows, and rows for deletion.
You need to identify the column(s) that make a row unique; doing CDC with all columns is not recommended, since a DataStage job with a CDC stage consumes more resources the more change columns you add to the CDC stage.
Work with your BA to identify the column(s) that make a row unique in the data.
I had a similar problem to yours. In my case, there was no primary key and no date column to identify the difference. So what I did was use AWS Athena (managed Presto) to calculate the difference between source and destination. Below is the process:
Copy the source data to S3.
Create a source table in Athena pointing to the data copied from the source.
Create a destination table in Athena pointing to the destination data.
Now use SQL in Athena to find the difference. As I had neither a primary key nor a date column, I used the script below:
select * from table_destination
except
select * from table_source;
If you have a primary key, you can use that to find the difference as well, and create a result table with a column that says "update/insert/delete" (a sketch of this follows below).
This option is AWS native and cheap as well, since Athena costs $5 per TB scanned. Also, with this method, do not forget to write file rotation scripts to cut down your S3 costs.
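As a minimal sketch of the primary-key variant (the key column id and the compared columns col1 and col2 are hypothetical; adapt them to your file layout):
-- Classify each changed row as insert, update or delete using a hypothetical key "id"
CREATE TABLE table_delta AS
SELECT
  coalesce(s.id, d.id) AS id,
  CASE
    WHEN d.id IS NULL THEN 'insert'   -- only in the new source extract
    WHEN s.id IS NULL THEN 'delete'   -- only in the existing destination data
    ELSE 'update'                     -- in both, but with different values
  END AS change_type
FROM table_source s
FULL OUTER JOIN table_destination d ON s.id = d.id
WHERE d.id IS NULL
   OR s.id IS NULL
   OR s.col1 <> d.col1 OR s.col2 <> d.col2;  -- compare the non-key columns you care about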

How to relationalize, join and aggregate multiple files from S3

I have a bucket in S3 containing hundreds of folders, each of which contains files with the same structure: CSV representations of relational DB tables. The folders differ in the content of the data, but overlaps might occur.
In each folder, I want to join 3 tables and store the output in a dedicated table. The dedicated table should eventually hold the joined data from all the folders. Duplicates might occur between different folders, but the records have a unique key that can help with the aggregation.
The total data size of all the files in a single folder can reach 5 GB of disk space. Two of the files contain hundreds of thousands of records; the third can reach up to 20M records.
The result should be stored in AWS RDS, on a PostgreSQL instance. However, I am considering switching to Redshift. Would it be better for this scale of data?
The 3 tables are:
Providers
Consumers
Transactions
All of them are indexed by the key which is used in the join.
My approach is to iterate over the S3 bucket and, for each folder, load the 3 files into the DB. Then I create the joined table from the 3 tables using SQL, and finally add the joined data to the aggregated table that should contain the data from all folders.
I am currently trying to handle 1 folder, in order to better understand how to make the process optimal in both time and space.
After loading, I noticed the DB uses around 2X the disk space I expected. Why does joining cost so much disk space? Is there a way of loading and joining with minimal cost? The data loaded initially for each folder is used as a staging table until I drop the duplicates and load it into the aggregated table, so its lifespan will be relatively short. I tried to use CREATE UNLOGGED TABLE, but it didn't have much effect.
CREATE UNLOGGED TABLE agg_data AS SELECT * FROM
transactions t
INNER JOIN consumers c USING (consumer_id)
INNER JOIN providers p USING (provider_id);
This works OK for 1 folder, time-wise. It does take a lot more disk space than I assumed it would.
How will this work at a mass scale, for hundreds of folders? How will the aggregation behave over time, as I will need to search for duplicated records in a continuously growing table?
To summarize my questions:
How do I choose between RDS and Redshift? My concerns are tens of millions of records in the target table and the need to drop duplicates while adding new data to it.
Why does joining data take so much DB storage? Is there a way to minimize it for data that is temporary?
What is an efficient way of inserting new data into the destination table while dropping duplicates?
Would it be better to join and store the files in S3 using AWS Glue, and then load them into the target DB? Currently that does not seem like an option, as Glue takes forever to join the data.
I would recommend using Amazon Athena to join the files and produce the desired output.
First, each directory needs to be recognised as a table. This can be done by manually running a CREATE EXTERNAL TABLE command in Athena and pointing at the folder. All files within the folder will be treated as containing data for the table and they should all be of the same format.
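For example, a table for one of the folders might be declared along these lines (bucket, prefix, column names and delimiter are hypothetical placeholders for your actual layout):
-- Sketch of a manual table definition over one folder of CSV files
CREATE EXTERNAL TABLE providers (
  provider_id   string,
  provider_name string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/folder-001/providers/';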
If desired, an AWS Glue crawler can instead be used to create the table definition. Create a crawler and point it to the folder. Glue will create the table definition in the AWS Glue Data Catalog, which is accessible to Athena.
Once the three input tables have been defined, you can run a query in Amazon Athena that joins the three tables and produces an output table using CREATE TABLE AS.
See: Creating a Table from Query Results (CTAS) - Amazon Athena
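A sketch of such a CTAS query, assuming the three crawled tables are called transactions, consumers and providers, and writing Parquet output to a hypothetical S3 location:
-- Join the three input tables and write the result as Parquet
CREATE TABLE joined_output
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/joined-output/'  -- placeholder path
) AS
SELECT t.*, c.consumer_name, p.provider_name           -- hypothetical non-key columns
FROM transactions t
JOIN consumers c USING (consumer_id)
JOIN providers p USING (provider_id);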
Glue can also be used for ETL scripts (see Program AWS Glue ETL Scripts in Python - AWS Glue), but I haven't tried this, so I can't offer advice on it. However, I have used AWS Glue crawlers to create tables that I then query via Amazon Athena.
Once you have the output data, you can load it into the database of your choice. Which database you choose depends upon your use-case. I would suggest starting with Amazon RDS for PostgreSQL since it is a traditional database and you seem to be comfortable with it. If you later need improved performance (eg billions of rows instead of millions), you could move to Amazon Redshift.
General comment: It is rather strange that you wish to join those 3 tables since there will presumably be a lot of duplicated data (very denormalized). You could instead simply load those tables into your desired database and then do the joins in the database, possibly being selective as to which columns you wish to include.
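If you do go the PostgreSQL route, a minimal sketch of that approach, selecting only the columns you need and letting a unique constraint on a hypothetical transaction_id key drop duplicates across folders:
-- Assumes agg_data has a unique constraint on transaction_id (hypothetical key and columns)
INSERT INTO agg_data (transaction_id, consumer_id, provider_id, consumer_name, provider_name)
SELECT t.transaction_id, t.consumer_id, t.provider_id, c.consumer_name, p.provider_name
FROM transactions t
JOIN consumers c USING (consumer_id)
JOIN providers p USING (provider_id)
ON CONFLICT (transaction_id) DO NOTHING;  -- rows already loaded from earlier folders are skipped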

ETL on S3: Duplicate rows: how to update old entries?

During my ETL imports, some previously synchronized entries are supplied multiple times by my source (because they were updated by the service) and are therefore imported multiple times into AWS. I would like to implement a structure that overwrites an entry if it already exists (something close to a key-value store for the few rows that are updated more than once).
My requirements entail operating on one terabyte of data and running on Glue (or potentially Redshift).
I implemented the solution as follows:
I read the data from my source
I save each entry in a different file by choosing the unique identifier of the content as the file name.
I index my raw data with a Glue crawler scanning new files on S3.
I run a Glue job to transform the raw data into an OLAP-compliant format (Parquet).
Is this the right way to proceed?
This seems correct to me, though I have concerns about the large number of separate files in my raw data (1 file per entry).
Thank you,
Hugo

Table size in AWS-Athena

Is there a SQL-based way to retrieve the size of all tables within a database in AWS Athena?
I'm more familiar with MSSQL, where it is relatively easy to write such a query.
The quick way is via S3: ... > Show Properties > Location, and look up the size in the S3 console.
Explainer
You can run SELECT * FROM some_table for each table and look at the result metadata for the amount of data scanned, but that would be an expensive way to do it.
Athena doesn't really know about the data in your tables the way an RDBMS does; it's only when you query a table that Athena goes out to look at the data. It's really S3 that you should ask. You can list all objects in the location(s) of your tables and sum their sizes, but that might be a time-consuming way of doing it if there are many objects.
The least expensive and least time-consuming way, when there are many hundreds of thousands of objects, is to enable S3 Inventory on the bucket that contains the data for your tables, and then use the inventory to sum up the sizes for each table. You can get the inventory in CSV, ORC, or Parquet format, and they all work well with Athena, so if you have a lot of files in your bucket you can still query the inventory very efficiently.
You can read more about S3 Inventory here: https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html
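As a sketch, once the inventory itself is exposed as a table in Athena (the table name s3_inventory is hypothetical; key and size are standard inventory fields, provided you included the size field when enabling the inventory), you can sum sizes per top-level prefix, i.e. per table folder:
-- Sum object sizes per top-level prefix (one prefix per table's data folder)
SELECT
  split_part(key, '/', 1)            AS table_prefix,
  sum(size) / 1024.0 / 1024 / 1024   AS size_gb
FROM s3_inventory
GROUP BY split_part(key, '/', 1)
ORDER BY size_gb DESC;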

Hive external table (non-partitioned) with different file structures in the same location

My Hive external table (non-partitioned) has data in an S3 bucket. It is an incremental table.
Until today, all the files that came to this location through an Informatica process were of the same structure, and all of their fields are columns in my incremental table's structure.
Now I have a requirement where a new file comes to the same bucket in addition to the existing ones, but with additional columns.
I have re-created the table with the additional columns.
But when I query the table, I see that the newly added columns are still blank, even for the specific line items where they are not supposed to be blank.
Can I actually have a single external table pointing to an S3 location having different file structures?