Write 2 months of historical data from S3 to AWS Timestream

I want to write 2 months of old data from an S3 bucket, where each day has 4 to 5 parquet files. I have converted the parquet files to a data frame, but a single day of data comes to around 3.5M rows.
I have created 100 batches to send the records, but the overall execution time is too long. Please help me.
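A minimal sketch of one way to speed this up, assuming boto3 and a pandas data frame named `df`; the database, table, dimension, and column names below are illustrative, not taken from the question. Timestream accepts at most 100 records per `WriteRecords` call, so the batches stay at 100 but are submitted concurrently instead of one after another:

```python
# Hypothetical sketch: parallel 100-record batches into Timestream via boto3.
# Database, table, dimension, and column names are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

import boto3

client = boto3.client("timestream-write")

def to_record(row):
    # One Timestream record per data-frame row; adjust to your actual schema.
    return {
        "Dimensions": [{"Name": "device_id", "Value": str(row["device_id"])}],
        "MeasureName": "temperature",
        "MeasureValue": str(row["temperature"]),
        "MeasureValueType": "DOUBLE",
        "Time": str(int(row["time"].timestamp() * 1000)),  # epoch milliseconds
        "TimeUnit": "MILLISECONDS",
    }

def write_batch(records):
    # Timestream allows at most 100 records per WriteRecords call.
    client.write_records(
        DatabaseName="my_database",
        TableName="my_table",
        Records=records,
    )

records = [to_record(row) for _, row in df.iterrows()]
batches = [records[i:i + 100] for i in range(0, len(records), 100)]

# Submit the batches concurrently instead of sequentially.
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(write_batch, batches))
```

One more thing worth checking for two-month-old data: Timestream rejects timestamps that fall outside the table's memory store retention window unless magnetic store writes are enabled on the table.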

Related

Batch is taking 7-15 minutes to process 50k records

I am trying to fetch 50k records using the SAP Query connector and send them to S3, but it is taking 7 minutes to process 50k records in a batch. The time is being spent in the load phase. I have put an aggregator before S3 but am still facing the issue; the next 50k-record batch takes 15 to 16 minutes, and the one after that 30 minutes. Can anyone help?

AppsFlyer Data Locker: override data in BigQuery

I have an issue with setting up the AppsFlyer Cost ETL with Google BigQuery. We get parquet files each day.
The issue is the following: each day you get a file covering 10 dates.
The problem is that each day 6 of those dates should overwrite yesterday's file. The task is how to set up a data transfer or scheduled query that overrides the data for each date present in the newer file, so that the data for a long period ends up in one table.
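One way to get that override behaviour is a scheduled multi-statement query that replaces, by date, whatever the newer file contains. A minimal sketch, assuming each day's parquet file is first loaded into a staging table; the project, dataset, table, and column names are illustrative:

```python
# Hypothetical sketch: replace every date present in the newer file.
# Project, dataset, table, and column names are assumptions, not the real schema.
from google.cloud import bigquery

client = bigquery.Client()

replace_sql = """
-- Drop the rows for every date that appears in the newer file...
DELETE FROM `my_project.appsflyer.cost_history`
WHERE date IN (SELECT DISTINCT date FROM `my_project.appsflyer.cost_daily_staging`);

-- ...then append the newer rows for those dates.
INSERT INTO `my_project.appsflyer.cost_history`
SELECT * FROM `my_project.appsflyer.cost_daily_staging`;
"""

# Run this after each daily load completes.
client.query(replace_sql).result()
```

The same two statements can also be pasted into a BigQuery scheduled query instead of being run from Python.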

AWS QuickSight: monthly summary in percentage from different CSV files

1. I have a Lambda function that runs monthly; it runs an Athena query and exports the results as a CSV file to my S3 bucket.
2. I have a QuickSight dashboard that uses this CSV file as a dataset and visualizes all the rows from the report.
Everything is good and working up to here.
3. Every month I get a new CSV file in my S3 bucket, and I want to add a "Visual Type" to my main dashboard that shows the percentage difference from the previous CSV file (previous month).
For example:
My dashboard focuses on the count of missing updates.
In May I see I have 50 missing updates.
In June I get a CSV file with 25 missing updates.
Now I want the dashboard to reflect, with a "Visual Type", that this month we have reduced the number of missing updates by 50%.
In July, I get a file with 20 missing updates, so I want to see that we have reduced them by 60% compared with May.
Any idea how I can do it?
I'm not sure I quite understand where you're standing, but I'll assume that you have an S3 manifest that points to an S3 directory, rather than a different manifest (and dataset) per file.
If that's your case, you could try to tackle that comparison by creating a calculated field that uses the periodOverPeriodPercentDifference function.
Hope this helps!

HIVE_INVALID_METADATA: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 43 elements while columns.types has 34 elements

I'm working on a client platform. It is a data lake linked to AWS S3 and AWS Athena.
I've uploaded a dataset to an S3 bucket using AWS Glue.
The job ran successfully and a table was created in Athena.
When I try to "Preview" the content of the table, I get the following error:
HIVE_INVALID_METADATA: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 43 elements while columns.types has 34 elements
P.S.: The file I'm uploading contains 34 columns.
Thanks to ThomasRones' comment:
This indicates that there are more columns in the header row than there are columns in the data rows (more precisely, than the types determined for those data rows).
The probable cause of this error is that one or more names in the header row contain one or more commas.
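A quick way to confirm that, sketched below with an illustrative file name: parse the header row and the first data row yourself and compare the field counts; an unquoted comma inside a header name shows up as extra header fields.

```python
# Hypothetical check: compare the header field count against a data row.
# The file name is an assumption; point it at the file that was uploaded.
import csv

with open("dataset.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)      # 43 fields in the failing case
    first_row = next(reader)   # 34 fields expected

print(f"header fields: {len(header)}")
print(f"data fields:   {len(first_row)}")

# Printing the header makes it easy to spot a name that was split on a comma.
print(header)
```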

Spark joins: save as data frames or partitioned Hive tables

I’m working on a project with test data of close to 1 million records, and 4 such files.
The task is to perform around 40 calculations joining the data from 4 different files, each close to 1 GB.
Currently, I save the data from each file into a Spark table using saveAsTable and perform the operations. For example, table1 joins with table2 and the result is saved to table3; table3 (the result of 1 and 2) joins with table4, and so on. Finally I save these calculations into a different table and generate the reports.
The entire process takes around 20 minutes, and my concern is whether there will be performance issues when this code gets to production with probably 5 times more data.
Or is it better to save the data from each file in a partitioned way and then perform the joins to arrive at the final result set?
P.S. The objective is to get instant results, and there might be cases where the user updates a few rows in a file and expects an instant result. The data arrives on a monthly basis, basically once every month, with categories and sub-categories within.
What you are doing is just fine, but make sure to cache + count after every resource-intensive operation instead of writing all the joins and only saving at the last step.
If you do not cache in between, Spark will run the entire DAG from top to bottom at the last step; that may cause the JVM to overflow and spill to disk during the operations, which may in turn affect the execution time.
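A minimal sketch of that cache + count pattern, with table and column names purely illustrative:

```python
# Hypothetical sketch: materialize intermediate joins with cache() + count()
# so later stages reuse the cached result instead of replaying the whole DAG.
# Table and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("joins").enableHiveSupport().getOrCreate()

t1 = spark.table("table1")
t2 = spark.table("table2")
t4 = spark.table("table4")

# First join: cache and force materialization before it is reused downstream.
t3 = t1.join(t2, on="key", how="inner").cache()
t3.count()  # action that triggers the cached materialization

# Second join builds on the cached intermediate result.
result = t3.join(t4, on="key", how="inner").cache()
result.count()

# Persist the final calculations once, at the end.
result.write.mode("overwrite").saveAsTable("final_results")
```

The count() calls only serve as cheap actions to trigger materialization of the cached data, so the downstream joins do not recompute the upstream ones.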