ETL on S3 : Duplicate rows : how to update old entries? - amazon-s3

During my ETL imports, some previously synchronized entries are supplied again by my source (because they were updated by the service) and therefore end up imported multiple times into AWS. I would like to implement a structure that overwrites an entry if it already exists (something close to a key-value store, for the few rows that get updated more than once).
My requirements are to operate on about one terabyte of data and to use Glue (or potentially Redshift).
I implemented the solution as follows:
I read the data from my source
I save each entry in a separate file, using the unique identifier of the content as the file name (see the sketch after these steps).
I index my raw data with a Glue crawler that scans for new files on S3.
I run a Glue job to transform the raw data into an OLAP-compliant format (Parquet).
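To illustrate step 2, here is a minimal sketch of what I mean (bucket name, prefix, and the id field are placeholders, not my actual pipeline); since S3 overwrites objects that share a key, a re-supplied entry simply replaces the previous version:

import json
import boto3

s3 = boto3.client("s3")

def save_entry(record):
    # The object key is derived from the entry's unique identifier (placeholder
    # field name), so re-importing an updated entry overwrites the old object.
    key = f"raw/{record['id']}.json"
    s3.put_object(Bucket="my-etl-bucket", Key=key, Body=json.dumps(record))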
Is this the right way to proceed?
This approach seems correct to me, although I have concerns about the large number of separate files in my raw data (one file per entry).
Thank you,
Hugo

Related

Using AWS Glue to create table of parquet data stored in S3 in Athena

I want to preview in Athena data that resides in an S3 bucket. The data is in parquet. This doc here describes the process of how to use AWS Glue to create a preview. One mandatory step here is to input the Column Details, which includes entering each column name and its data type. I have two problems with this step:
1 - What if I have no idea what columns exist in the parquet file beforehand (i.e. I have not seen the content of the parquet file before)?
2 - What if there are hundreds, if not thousands, of columns in there?
Is there a way to make this work without entering the Column Details?
The link you provided answers your first question, I think:
What if I have no ideas of what columns exist in the parquet file before hand
Then you should use a Glue crawler to explore the files and have it create a Glue table for you. That table will show up in the AwsDataCatalog catalog as a queryable relation.
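If it helps, a rough sketch of setting up such a crawler with boto3 (the role, database name, and S3 path are placeholders I made up):

import boto3

glue = boto3.client("glue")

# The crawler infers the parquet schema and writes a table definition
# into the Glue Data Catalog, which Athena can then query.
glue.create_crawler(
    Name="parquet-preview-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="preview_db",                              # placeholder database
    Targets={"S3Targets": [{"Path": "s3://my-bucket/parquet-data/"}]},
)
glue.start_crawler(Name="parquet-preview-crawler")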
What if there are hundreds if not thousands of columns in there.
If you're worried about some column quota limitation: I spent some time looking around the documentation to see if there is any mention of a service quota for the maximum number of columns per table, and I could not find any. That doesn't mean there isn't one, but I would be surprised if someone managed to generate a parquet file with more columns than Glue supports.

How to relationalize, join and aggregate multiple files from S3

I have a bucket in S3 containing hundreds of folders, each of which contains files with the same structure: csv representations of relational db tables. The folders differ in the content of the data, but overlaps may occur.
In each folder, I want to join 3 tables, and store the output in a dedicated table. The dedicated table should eventually hold joined data from all different folders. Duplications might occur between different folders but the records have a unique key that can help with the aggregation.
The data size of all files in a specific folder can reach 5 GB of disk space. Two of the files contain hundreds of thousands of records; the third file can reach up to 20M records.
The result should be stored in AWS RDS, on a PostgreSQL instance. However, I am considering switching to Redshift. Will it be better for this scale of data?
The 3 tables are:
Providers
Consumers
Transactions
All of them are indexed by the key which is used in the join.
My approach is to iterate over the S3 bucket and, for each folder, load the 3 files into the db. Then, create the joined table for the 3 tables using SQL, and finally add the joined data to the aggregated table that should hold the data from all folders.
I am currently trying to handle 1 folder, in order to understand better how to make the process optimal, both in time and space.
After loading, I noticed the db uses around 2x the disk space I expected. Why does joining cost so much in disk space? Is there a way to load and join with minimal cost? The data loaded initially for each folder is used as a staging table until I drop the duplicates and load it into the aggregated table, so its lifespan will be relatively short. I tried using CREATE UNLOGGED TABLE but it didn't have much effect.
CREATE UNLOGGED TABLE agg_data AS
SELECT *
FROM transactions t
INNER JOIN consumers c USING (consumer_id)
INNER JOIN providers p USING (provider_id);
This works OK for 1 folder, time-wise, but it does take a lot more disk space than I assumed it would.
How will this work at a mass scale, for hundreds of folders? How will the aggregation behave over time, as I will need to search for duplicate records in a continuously growing table?
To summarize my questions:
How to choose between RDS and Redshift? My concerns are tens of millions of records in the target table, and the need to drop duplicates while adding new data to the target table.
Why does joining data take so much db storage? Is there a way to minimize it for data that is temporary?
What is an efficient way of inserting new data into the destination table while dropping duplicates?
Will it be better to join and store the files in S3 using AWS Glue, and then load them to the target db? Currently it does not seem like an option, as Glue takes forever to join the data.
I would recommend using Amazon Athena to join the files and produce the desired output.
First, each directory needs to be recognised as a table. This can be done by manually running a CREATE EXTERNAL TABLE command in Athena and pointing at the folder. All files within the folder will be treated as containing data for the table and they should all be of the same format.
If desired, an AWS Glue crawler can instead be used to create the table definition. Create a crawler and point it to the folder. Glue will create the table definition in the AWS Glue Data Catalog, which is accessible to Athena.
Once the three input tables have been defined, you can run a query in Amazon Athena that joins the three tables and produces an output table using CREATE TABLE AS.
See: Creating a Table from Query Results (CTAS) - Amazon Athena
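As a rough sketch of what the CTAS step could look like when submitted from Python with boto3 (the output table, column, database, and bucket names here are assumptions, not taken from your data):

import boto3

athena = boto3.client("athena")

# CTAS: Athena joins the three external tables, writes the result to S3
# as parquet, and registers it as a new table at the same time.
ctas = """
CREATE TABLE joined_output
WITH (format = 'PARQUET', external_location = 's3://my-bucket/joined-output/') AS
SELECT t.*, c.consumer_name, p.provider_name
FROM transactions t
JOIN consumers c USING (consumer_id)
JOIN providers p USING (provider_id)
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "my_database"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)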
Glue can also be used to write ETL scripts in Python (see Program AWS Glue ETL Scripts in Python - AWS Glue), but I haven't tried this, so I can't offer advice on it. However, I have used AWS Glue crawlers to create tables that I then query via Amazon Athena.
Once you have the output data, you can then load it into the database of your choice. Which database you choose depends upon your use-case. I would suggest starting with Amazon RDS for PostgreSQL, since it is a traditional database and you seem to be comfortable with it. If you later need improved performance (eg billions of rows instead of millions), you can move to Amazon Redshift.
General comment: It is rather strange that you wish to join those 3 tables since there will presumably be a lot of duplicated data (very denormalized). You could instead simply load those tables into your desired database and then do the joins in the database, possibly being selective as to which columns you wish to include.
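On the deduplication question specifically, a common pattern in PostgreSQL is to upsert from the staging table into the aggregated table. A sketch, assuming the unique key column is called record_id and the target table has a unique constraint on it (both assumptions):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # connection details are placeholders

with conn, conn.cursor() as cur:
    # Keep one row per key from the staging table; rows whose key already
    # exists in the target are skipped, so duplicates across folders collapse.
    cur.execute("""
        INSERT INTO aggregated_table
        SELECT DISTINCT ON (record_id) *
        FROM agg_data
        ON CONFLICT (record_id) DO NOTHING
    """)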

How to route/extract different columns from a single CSV file in Nifi?

I have a GetFile processor which fetches a large CSV file that has about 100 columns. My goal is to extract specific subsets of columns from this CSV file and send them to various tables in MySQL.
The current approach uses GetFile -> multiple ConvertRecord processors, each with a different CSVReader and CSVRecordSetWriter defined, which follow Avro schemas based on the corresponding SQL table's schema.
Is there a way to have only one GetFile and Route the subsets to different processors as opposed to replicating the large CSV file across multiple flows, which then get picked up by different ConvertRecord processors?
This is the flow I have right now,
As can be seen, the CSV file is replicated across multiple paths, which makes things very inefficient. For this example the size is 57 bytes, but I usually get ~6 GB files across 60-70 such ConvertRecord paths.
How can I efficiently route my data if I am aware of which subsets of columns need to be extracted from the CSV file and sent to different tables? Example:
Column A,B go to one table
Column A,C go to the second table
Column A,D,E go to the third table
Column A,D,F,G go to the fourth table
....
If you use PutDatabaseRecord then you can have multiple PutDatabaseRecord processors that each use a different read schema to select the appropriate columns, similar to what you are doing with the ConvertRecord processors, but you never actually need to write out the converted data.
Also, there is nothing really inefficient about forking the same flow file to 6 different locations. In your example above, if GetFile picks up a 6GB file, there is only 1 copy of that 6GB content in the content repository, and there would be 3 flow files pointing to that same content, so each ConvertRecord would read the same 6GB content. Each would then write out a new piece of content which is a subset of the data, and at some point the original 6GB would be deleted from the content repo when no flow files reference it. So it's not like every additional connection from GetFile makes a copy of the 6GB.

SparkSQL: intra-SparkSQL-application table registration

Context. I have tens of SQL queries stored in separate files. For benchmarking purposes, I created an application that iterates through each of those query files and passes it to a standalone Spark application. The latter first parses the query, extracts the tables used, registers them (using registerTempTable() in Spark < 2 and createOrReplaceTempView() in Spark 2), and finally executes the query (spark.sql()).
Challenge. Since registering the tables can be time consuming, I would like to register them lazily, i.e. only once when they are first used, and keep that in the form of metadata that can readily be used in subsequent queries without the need to re-register the tables with each query. It's a sort of intra-job caching, but not any of the caching options Spark offers (table caching), as far as I know.
Is that possible? If not, can anyone suggest another approach to accomplish the same goal (iterating through separate query files and running a querying Spark application without re-registering tables that have already been registered)?
In general, registering a table should not take time (except that if you have lots of files, it might take time to generate the list of file sources). It is basically just giving the dataframe a name. What would take time is reading the dataframe from disk.
So the basic question is how the dataframes (tables) are written to disk. If they are written as a large number of small files or in a slow file format (e.g. csv), this can take some time (having lots of files takes time to generate the file list, and having a "slow" file format means the actual reading is slow).
So the first thing you can try to do is read your data and resave it.
Let's say, for the sake of example, that you have a large number of csv files in some path. You can do something like:
df = spark.read.csv("path/*.csv")
Now that you have a dataframe, you can rewrite it with fewer files and a better format, such as:
df.coalesce(100).write.parquet("newPath")
If the above is not enough, and your cluster is large enough to cache everything, you might put everything in a single job: go over all tables in all queries, register all of them, and cache them. Then run your sql queries one after the other (and time each one separately).
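A sketch of that single-job approach (the paths, table names, and query file names are just placeholders):

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-benchmark").getOrCreate()

# Register and cache each table once, up front.
tables = {
    "table_a": "path/to/table_a",
    "table_b": "path/to/table_b",
}
for name, path in tables.items():
    df = spark.read.parquet(path)
    df.createOrReplaceTempView(name)   # registerTempTable() in Spark < 2
    spark.sql(f"CACHE TABLE {name}")   # materialize the table in memory

# Then run the stored queries one after the other against the cached views.
for query_file in ["q1.sql", "q2.sql"]:
    with open(query_file) as f:
        query = f.read()
    start = time.time()
    spark.sql(query).collect()
    print(query_file, time.time() - start)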
If all of this fails, you can try to use something like Alluxio (http://www.alluxio.org/) to create an in-memory file system and read from that.

BigQuery loading incomplete dataset from Cloud Storage?

I want to upload a dataset of Tweets from Cloud Storage. I have a schema based on https://github.com/twitterdev/twitter-for-bigquery, but simpler, because I don't need all the fields.
I uploaded several files to Cloud Storage and then manually imported them from BigQuery. I have tried loading each file both ignoring unknown fields and not ignoring them, but I always end up with a table with fewer rows than the original dataset. Just in case, I took care of eliminating redundant rows from each dataset.
Example job-ids:
cellular-dream-110102:job_ZqnMTr17Yx_KKGEuec3qfA0DWMo (loaded 1,457,794 rows, but the dataset contained 2,387,666)
cellular-dream-110102:job_2xfbTFSvvs-unpP6xZXAfDeDjic (loaded 1,151,122 rows, but the dataset contained 3,265,405).
I don't know why this happens. I have tried to simplify the schema further, as well as ensuring that the dataset is clean (no repeated rows, no invalid data, and so on). The curious thing is that if I take a small sample of tweets (say, 10,000) and then upload the file manually, it works - it loads the 10,000 rows.
How can I find what is causing the problem?
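One place to start digging is the load jobs themselves; a sketch using the google-cloud-bigquery Python client (the project and job IDs are the ones from the question), which prints whatever status, errors, and row counts the job recorded:

from google.cloud import bigquery

client = bigquery.Client(project="cellular-dream-110102")

# Look up one of the load jobs and inspect its state, error stream,
# and the number of rows it actually wrote.
job = client.get_job("job_ZqnMTr17Yx_KKGEuec3qfA0DWMo")
print(job.state, job.error_result)
for error in job.errors or []:
    print(error)
print("output rows:", job.output_rows)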