I'm new to AWS Lambda and I'd like to ask whether it is possible to extract data from BigQuery in CSV format using AWS Lambda and store it in an AWS S3 bucket.
Has anyone done this before? If so, could you please share the steps?
Thank you
Related
I have large flat files at a location in AWS S3 and need to load the data into Teradata. I am trying to implement a Python script and would like help with the approach. I am very new to big data and the cloud.
We are using Grafana's CloudWatch data source for AWS metrics. We would like to break down the folders in an S3 bucket by size and show them as graphs. We know that CloudWatch provides bucket-level metrics rather than object-level metrics. Is there any way to monitor the size of the folders in a bucket?
Any suggestion on the same is appreciated.
Thanks in advance.
Amazon CloudWatch provides daily storage metrics for Amazon S3 buckets but, as you mention, these metrics are for the whole bucket, rather than folder-level.
Amazon S3 Inventory can provide a daily CSV file listing all objects. You could load this information into a database or use Amazon Athena to query the contents.
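As a rough sketch of the inventory approach, the snippet below sums object sizes per top-level folder from an inventory-style CSV. The column order (bucket, key, size), the absence of a header row, and the function name `folder_sizes_from_inventory` are assumptions for illustration; the columns in a real inventory report depend on how the inventory was configured.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import unquote_plus


def folder_sizes_from_inventory(csv_text, key_col=1, size_col=2):
    """Sum object sizes (bytes) per top-level folder of an inventory-style CSV.

    Assumes no header row and that the key and size sit in the given columns.
    """
    totals = defaultdict(int)
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        # Assumption: keys in the CSV are URL-encoded and need decoding.
        key = unquote_plus(row[key_col])
        # Objects without a "/" are grouped under a hypothetical "(root)" label.
        folder = key.split("/", 1)[0] if "/" in key else "(root)"
        size = row[size_col]
        totals[folder] += int(size) if size else 0
    return dict(totals)
```

The resulting totals could be written to a database or published as custom metrics for graphing.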
If you require storage metrics at a higher resolution than daily, then you would need to track this information yourself. This could be done with:
An Amazon S3 Event that triggers an AWS Lambda function whenever an object is created or deleted
An AWS Lambda function that receives this information and updates a database
Your application could then retrieve the storage metrics from the database
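The event-driven approach above could look roughly like the sketch below. The two dictionaries stand in for the database and are only an assumption for illustration (in-memory state does not persist reliably between Lambda invocations), and the helper names are hypothetical. Per-object sizes are stored because delete notifications may not carry the object size, so it has to be looked up.

```python
from urllib.parse import unquote_plus

# Stand-ins for database tables (assumption: replace with a real data store).
OBJECT_SIZES = {}   # object key -> size in bytes
FOLDER_TOTALS = {}  # top-level folder -> total size in bytes


def top_level_folder(key):
    """Return the first path component of a key, or '' for root-level objects."""
    return key.split("/", 1)[0] if "/" in key else ""


def apply_s3_event(event, object_sizes, folder_totals):
    """Update per-object sizes and per-folder totals from an S3 event."""
    for record in event.get("Records", []):
        name = record["eventName"]
        # Keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        folder = top_level_folder(key)
        if name.startswith("ObjectCreated:"):
            new_size = record["s3"]["object"].get("size", 0)
            old_size = object_sizes.get(key, 0)  # non-zero on overwrite
            object_sizes[key] = new_size
            folder_totals[folder] = folder_totals.get(folder, 0) + new_size - old_size
        elif name.startswith("ObjectRemoved:"):
            # Look up the stored size rather than relying on the event.
            old_size = object_sizes.pop(key, 0)
            folder_totals[folder] = folder_totals.get(folder, 0) - old_size


def lambda_handler(event, context):
    """Lambda entry point: apply the event and return the current totals."""
    apply_s3_event(event, OBJECT_SIZES, FOLDER_TOTALS)
    return FOLDER_TOTALS
```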
Thanks for the reply John,
However, I found a solution using s3_exporter. It exposes metrics for the size of the folders and sub-folders inside an S3 bucket.
I am publishing custom metric data (a count of how many times operations are used by customers) to CloudWatch. I want to show this custom metric data on an Amazon QuickSight dashboard. Does anyone know how I can do that?
Use Athena to read the CloudWatch metrics, as described in the AWS docs: https://docs.aws.amazon.com/athena/latest/ug/athena-prebuilt-data-connectors-cwmetrics.html.
Then you can connect QuickSight to Athena as a data source (https://docs.aws.amazon.com/quicksight/latest/user/create-a-data-set-athena.html) or use S3, where the data resides now.
I am trying to source data from Athena or Redshift into SageMaker or AWS Forecast directly, without using flat files. In SageMaker I use Python code in a Jupyter notebook. Is there any way to do this without connecting to S3 at all?
So far I have been using flat data which is not what I wanted.
If you're only using a SageMaker notebook instance, your data doesn't have to be in S3. You can use the boto3 SDK or a SQL connection (depending on the backend) to download data, store it locally, and work on it in your notebook.
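For example, with Redshift as the backend, one option from a notebook is the Redshift Data API through boto3. This is only a sketch: the function names and the cluster, database, and user parameters are placeholders, and it reads just the first page of results.

```python
import time


def records_to_rows(result):
    """Convert a Data API result into a list of {column: value} dicts."""
    columns = [col["name"] for col in result["ColumnMetadata"]]
    rows = []
    for record in result["Records"]:
        values = []
        for field in record:
            if field.get("isNull"):
                values.append(None)
            else:
                # Each field is a single-entry dict such as {"stringValue": "x"}.
                values.append(next(iter(field.values())))
        rows.append(dict(zip(columns, values)))
    return rows


def query_redshift(sql, cluster_id, database, db_user, poll_seconds=1.0):
    """Run a query and return its first page of rows (no pagination handling)."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("redshift-data")
    statement_id = client.execute_statement(
        ClusterIdentifier=cluster_id, Database=database, DbUser=db_user, Sql=sql
    )["Id"]
    while True:
        description = client.describe_statement(Id=statement_id)
        status = description["Status"]
        if status == "FINISHED":
            break
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(description.get("Error", status))
        time.sleep(poll_seconds)
    return records_to_rows(client.get_statement_result(Id=statement_id))
```

The returned rows can be passed to `pandas.DataFrame(rows)` if you prefer to work with a DataFrame in the notebook.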
If you're using the SageMaker SDK to train, then yes, data must be in S3. You can either do this manually if you're experimenting, or use services like AWS Glue or AWS Batch to automate your data pipeline.
Indeed, Athena data is probably already in S3, although it may be in a format that your SageMaker training code doesn't support. Creating a new table with the right SerDe (say, CSV) may be enough. If not, you can certainly get the job done with AWS Glue or Amazon EMR.
When it comes to Redshift, dumping CSV data to S3 is as easy as:
unload ('select * from mytable')
to 's3://mybucket/mytable'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter ',';
Hope this helps.
If you are using SageMaker, you have to use S3 to read data. SageMaker does not read data from Redshift, but it can read data from Athena using PyAthena.
If your data source is in Redshift you need to load your data to S3 first to be able to use in SageMaker. If you are using Athena your data is already in S3.
Amazon Machine Learning used to support reading data from Redshift or RDS but unfortunately it's not available any more.
SageMaker Data Wrangler now allows you to read data directly from Amazon Redshift. However, I'm not sure whether this works across AWS accounts (e.g. if you had one account for dev and another account for prod).
I have some data stored in S3. I need to copy this data periodically from S3 to a Redshift cluster. For a bulk copy, I can use the COPY command to copy from S3 to Redshift.
Similarly, is there a simple way to copy data from S3 to Redshift periodically?
Thanks
Try using AWS Data Pipeline, which has various templates for moving data from one AWS service to another. The "Load data from S3 into Redshift" template copies data from an Amazon S3 folder into a Redshift table. You can load the data into an existing table or provide a SQL query to create the table. The Redshift table must have the same schema as the data in Amazon S3.
Data Pipeline supports running pipelines on a schedule, and it provides a cron-style editor for scheduling.
AWS Lambda Redshift Loader is a good solution that runs a COPY command on Redshift whenever a new file appears in a pre-configured location on Amazon S3.
Links:
https://aws.amazon.com/blogs/big-data/a-zero-administration-amazon-redshift-database-loader/
https://github.com/awslabs/aws-lambda-redshift-loader
I believe Kinesis Firehose is the simplest way to get this done. Simply create a Kinesis Firehose stream, point it at a specific table in your Redshift cluster, write data to the stream, done :)
Full setup procedure here:
https://docs.aws.amazon.com/ses/latest/DeveloperGuide/event-publishing-redshift-firehose-stream.html
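Writing to the stream could look roughly like this with boto3. It is a sketch under assumptions: the function names are made up, the pipe delimiter must match whatever COPY options the delivery stream is configured with, and values containing the delimiter are not escaped.

```python
def encode_row(values, delimiter="|"):
    """Encode one row as a delimited, newline-terminated line of UTF-8 bytes."""
    return (delimiter.join(str(v) for v in values) + "\n").encode("utf-8")


def send_row(stream_name, values):
    """Send one row to an existing Firehose delivery stream."""
    import boto3  # imported lazily so encode_row works without boto3

    firehose = boto3.client("firehose")
    return firehose.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": encode_row(values)},
    )
```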
The Kinesis option works only if the Redshift cluster is publicly accessible.
You can use the COPY command with Lambda by configuring two Lambda functions. One creates a manifest file for your incoming data, and the other reads that manifest and loads the data into Redshift using the Redshift Data API.
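A minimal sketch of the two pieces follows. The function names, the CSV format option, and the cluster, database, and user parameters are all assumptions for illustration; the table name is interpolated directly into the SQL, so it should come from trusted configuration only.

```python
import json


def build_manifest(bucket, keys):
    """Build a Redshift COPY manifest listing the given S3 objects."""
    return {
        "entries": [
            {"url": f"s3://{bucket}/{key}", "mandatory": True} for key in keys
        ]
    }


def build_copy_sql(table, manifest_url, iam_role_arn):
    """Build a COPY statement that loads the files named in a manifest."""
    return (
        f"COPY {table} "
        f"FROM '{manifest_url}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV MANIFEST;"
    )


def write_manifest(bucket, manifest_key, keys):
    """First Lambda: store the manifest in S3 and return its URL."""
    import boto3  # imported lazily so the builders work without boto3

    body = json.dumps(build_manifest(bucket, keys)).encode("utf-8")
    boto3.client("s3").put_object(Bucket=bucket, Key=manifest_key, Body=body)
    return f"s3://{bucket}/{manifest_key}"


def run_copy(sql, cluster_id, database, db_user):
    """Second Lambda: submit the COPY through the Redshift Data API."""
    import boto3

    client = boto3.client("redshift-data")
    response = client.execute_statement(
        ClusterIdentifier=cluster_id, Database=database, DbUser=db_user, Sql=sql
    )
    return response["Id"]  # statement id, usable for polling the status
```

Scheduling the second function (for example with a timed trigger) would give the periodic load the question asks about.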