pyspark script to copy any parquet file data to any oracle table - amazon-s3

We have an S3 bucket called Customers/.
Inside it there are multiple folders, and sub-folders inside those.
At the lowest level we have Parquet data files.
Now I want to read any Parquet file (not one specific file) and load its data into Oracle.
For now my script works for one S3 path: it reads one Parquet file, e.g. customer_info.parquet, and loads the data into an Oracle table called customer.customer_info.
I need help writing a generic script that can read any Parquet file and load its data into the corresponding database table.
For example:
S3 location: s3://Customers/new_customers/new_customer_info.parquet
Oracle database: Customer
Oracle table: new_customers
S3 location: s3://Customers/old_customers/old_customer_info.parquet
Oracle database: Customer
Oracle table: old_customers
S3 location: s3://Customers/current_customers/current_customer_info.parquet
Oracle database: Customer
Oracle table: current_customers
Is there any way to make this copy process generic? The database stays the same; only the Oracle table changes according to the Parquet file.
My current script is a PySpark script that reads one S3 file into a Spark dataframe and writes that dataframe to one Oracle table.
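Something along these lines is what I am imagining (only a sketch, not working code; the connection details, bucket layout, and the derive_table_name rule are placeholders I made up): list the Parquet files under the bucket, derive the target table name from each file's folder, and reuse the same JDBC write for every file.

from pyspark.sql import SparkSession
import boto3

spark = SparkSession.builder.appName("s3_parquet_to_oracle").getOrCreate()

# Placeholder connection details, not real ones
ORACLE_URL = "jdbc:oracle:thin:@//dbhost:1521/CUSTOMER"
ORACLE_PROPS = {"user": "db_user", "password": "db_password",
                "driver": "oracle.jdbc.OracleDriver"}
BUCKET = "Customers"

def derive_table_name(key):
    # e.g. "new_customers/new_customer_info.parquet" -> "new_customers"
    return key.split("/")[-2]

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".parquet"):
            continue
        df = spark.read.parquet(f"s3a://{BUCKET}/{key}")
        # Same JDBC write for every file; only the table name changes
        df.write.jdbc(url=ORACLE_URL, table=derive_table_name(key),
                      mode="append", properties=ORACLE_PROPS)

The assumption here is that the folder name always matches the target table name; if it doesn't, a small mapping dictionary could replace derive_table_name.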

Related

Is there a way move data from a Delta Table in S3 to Redshift using Copy Command?

I have Delta tables created in an S3 bucket and need to load this data as-is into Redshift tables.
The Delta tables have a symlink format manifest generated, and some of them may be partitioned.
Is there a way to move this data into Redshift?
I've tried running the COPY command on the Delta table, giving the path up to the table name:
COPY <schema>.<table>
FROM 's3://<bucket-name>/<delta_table_name>/'
IAM_ROLE '<iam_role>'
FORMAT AS PARQUET
I also tried running the COPY command using the path to the symlink format manifest directory in S3; neither approach worked:
COPY <schema>.<table>
FROM 's3://<bucket-name>/<delta_table_name>/_symlink_format_manifest/PARTITION1=VALUE1/PARTITION2=VALUE2/manifest'
IAM_ROLE '<iam_role>'
FORMAT AS PARQUET
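One alternative sketch, outside of COPY (all connection details and the Delta/JDBC setup below are assumptions on my part): read the Delta table with Spark and write it to Redshift over JDBC, similar to the Parquet-to-Oracle flow above.

from pyspark.sql import SparkSession

# Assumes the delta-spark package and a Redshift (or PostgreSQL) JDBC driver
# are on the Spark classpath; endpoints and credentials are placeholders.
spark = (SparkSession.builder
         .appName("delta_to_redshift")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Read the Delta table directly from S3 (partitions are handled transparently)
df = spark.read.format("delta").load("s3a://<bucket-name>/<delta_table_name>/")

df.write.jdbc(
    url="jdbc:redshift://<cluster-endpoint>:5439/<database>",
    table="<schema>.<table>",
    mode="append",
    properties={"user": "<user>", "password": "<password>",
                "driver": "com.amazon.redshift.jdbc42.Driver"},
)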

Query S3 Bucket With Amazon Athena and modify values

I have an S3 bucket with 500 CSV files that are identical except for the numeric values in each file.
How do I write a query that takes dividendsPaid, makes it positive in each file, and sends the results back to S3?
Amazon Athena is a query engine that can perform queries on objects stored in Amazon S3. It cannot modify files in an S3 bucket. If you want to modify those input files in-place, then you'll need to find another way to do it.
However, it is possible for Amazon Athena to create a new table with the output files stored in a different location. You could use the existing files as input and then store new files as output.
The basic steps are:
Create a table definition (DDL) for the existing data (I would recommend using an AWS Glue crawler to do this for you)
Use CREATE TABLE AS to select data from the table and write it to a different location in S3. The command can include an SQL SELECT statement to modify the data (changing the negatives).
See: Creating a table from query results (CTAS) - Amazon Athena
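As a rough sketch of step 2 (the database, source table, columns other than dividendsPaid, and S3 locations are made-up placeholders), the CTAS statement could be submitted from Python with boto3:

import boto3

# Hypothetical names: adjust the database, source table, columns and paths
ctas_sql = """
CREATE TABLE financials_fixed
WITH (external_location = 's3://my-bucket/fixed/', format = 'PARQUET') AS
SELECT ticker,
       abs(dividendsPaid) AS dividendsPaid   -- turn the negative values positive
FROM financials
"""

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString=ctas_sql,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)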

Which file format should I use that supports appending?

Currently we use the ORC file format to store incoming traffic in S3 for fraud detection analysis.
We chose the ORC file format for the following reasons:
compression
and the ability to query the data using Athena
Problem:
ORC files are read-only, but we want to update the file contents every 20 minutes, which means we
need to download the ORC files from S3,
read each file,
write to the end of the file,
and finally upload it back to S3.
This was not a problem at first, but the data grows significantly, by roughly 2 GB every day, and it is a very costly process to download ~10 GB of files, read them, append to them, and upload them again.
Question:
Is there another file format that supports appends/inserts and can also be queried by Athena?
From this article it seems Avro is such a format, but I'm not sure
whether Athena can be used to query it,
or whether there are any other issues.
Note: my skill level with big data technologies is beginner.
If your table is not partitioned, you can simply copy (aws s3 cp) your new ORC files to the table's target S3 path and they will be available for querying via Athena immediately.
If your table is partitioned, you can copy the new files to the paths corresponding to your specific partitions. After copying new files into a partition, you need to add or update that partition in Athena's metastore.
For example, if your table is partitioned by date, then you need to run this query to ensure your partition gets added/updated:
alter table dataset.tablename add if not exists
partition (date = YYYYMMDD)
location 's3://your-bucket/path_to_table/date=YYYYMMDD/'
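A rough Python equivalent of that flow (the bucket, local file, table name, and partition value are placeholders), copying a new ORC file into the partition's path and then registering the partition through Athena:

import boto3

BUCKET = "your-bucket"
PARTITION_DATE = "20230101"   # placeholder partition value

# Equivalent of `aws s3 cp`: drop the new ORC file under the partition's path
s3 = boto3.client("s3")
s3.upload_file("new_traffic.orc", BUCKET,
               f"path_to_table/date={PARTITION_DATE}/new_traffic.orc")

# Add (or refresh) the partition so Athena sees the new file
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString=f"""
        ALTER TABLE dataset.tablename ADD IF NOT EXISTS
        PARTITION (date = '{PARTITION_DATE}')
        LOCATION 's3://{BUCKET}/path_to_table/date={PARTITION_DATE}/'
    """,
    ResultConfiguration={"OutputLocation": f"s3://{BUCKET}/athena-results/"},
)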

AWS - How to extract CSV reports from a set of JSON files in S3

I have an RDS database with the following structure: CustomerId | Date | FileKey.
FileKey points to a JSON file in S3.
Now I want to create CSV reports with a customer filter, date-range filters, and a column definition (ColumnName + JsonPath), like this:
Name => data.person.name
OtherColumn1 => data.exampleList[0]
OtherColumn2 => data.exampleList[2]
I often need to add and remove columns from the column definition.
I know I can run a SQL SELECT on RDS, fetch each S3 file (JSON), extract the data, and create my CSV file, but this is not a good solution because I would need to query my RDS instance and make millions of S3 requests for every report request or every change to the column definition.
Saving all the data in an RDS table instead of in S3 is also not a good solution, because the JSON files contain a lot of data and the columns are not the same across customers.
Any ideas?
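Purely to illustrate the column definition (the helper below and the sample paths are made up, not a solution to the S3 round-trip problem), resolving such definitions against JSON documents could look like this:

import csv

# Hypothetical column definition: output column name -> path inside the JSON
COLUMNS = {
    "Name": ["data", "person", "name"],
    "OtherColumn1": ["data", "exampleList", 0],
    "OtherColumn2": ["data", "exampleList", 2],
}

def resolve(doc, path):
    # Walk dict keys / list indexes; return "" if any step is missing
    for step in path:
        try:
            doc = doc[step]
        except (KeyError, IndexError, TypeError):
            return ""
    return doc

def write_report(json_docs, out_path):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS.keys())
        for doc in json_docs:
            writer.writerow(resolve(doc, path) for path in COLUMNS.values())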

How to efficiently append new data to table in AWS Athena?

I have a table in Athena that is created from a CSV file stored in S3, and I am using Lambda to query it. But I have incoming data being processed by the Lambda function and want to append new rows to the existing table in Athena. How can I do this, given that the documentation says Athena prohibits some SQL statements like INSERT INTO and CREATE TABLE AS SELECT?
If you are adding new data, you can save the new data file into the same folder (prefix/key) that the table is reading from. Athena will read all files in this folder; the format of the new file just needs to match the existing one.
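A minimal sketch of that from inside the Lambda (the bucket, key, and row contents are placeholders; the new object must use the same columns and CSV format as the existing file):

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Write the incoming rows as a new CSV object under the same prefix the
    # Athena table was created on; Athena picks it up on the next query.
    new_rows = "42,alice,100\n43,bob,250\n"   # same column order as the existing file
    s3.put_object(
        Bucket="my-bucket",
        Key="my_table/new_rows.csv",
        Body=new_rows.encode("utf-8"),
    )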