Loading Larga data to amazon sagemaker notebook - pandas

I have 2 folder, on each folder I have 70 csv files each one with a size of 3mb to 5mb, so in general the data is like 20 millions rows with 5 columns each.
I used amazon wrangler s3.read_csv to load just one folder with all the 70 csv to a dataframe, not sure if this is a good approach due to the fact the data is really large.
I want to know how can I load the entire csv files from those 2 folders with aws wrangler s3.readcsv, or should I use pyspark?
Also another question is, is it possible to work locally using amazon sagemaker depenencies? I am not sure if using sagemaker notebook for the pipeline development might cost a lot for my client.

You can use PySpark to load data into your notebook as well, see this repo for instructions.
As for SageMaker, you can use the SageMaker Python SDK, or Boto3 to run jobs from your local machine. You can also create a notebook instance with a small instance size, experiment on a subset of your data, and trigger a Processing job to keep your notebook costs low. You only pay for the duration your processing job runs, and you can scale up for preparing the entire dataset.

Related

How to copy Big Data from GCS to S3?

How to copy a few terabytes of data from GCS to S3?
There's nice "Transfer" feature in GCS that allows to import data from S3 to GCS. But how to do the export, the other way (besides moving data generation jobs to AWS)?
Q: Why not gsutil?
Yes, gsutil supports s3://, but transfer is limited by that machine network throughput. How to easier do it in parallel?
I tried Dataflow (aka Apache Beam now), that would work fine, because it's easy to parallelize on like a hundred of nodes, but don't see there's simple 'just copy it from here to there' function.
UPDATE: Also, Beam seems to be computing a list of source files on the local machine in a single thread, before starting the pipeline. In my case that takes around 40 minutes. Would be nice to distribute it on the cloud.
UPDATE 2: So far I'm inclined to use two own scripts that would:
Script A: Lists all objects to transfer, and enqueue a transfer task for each one into a PubSub queue.
Script B: Executes these transfer tasks. Runs on cloud (e.g. Kubernetes), many instances in parallel
The drawback is that it's writing a code that may contain bugs etc, not using a built-in solution like GCS "Transfer".
You could use gsutil running on Compute Engine (or EC2) instances (which may have higher network bandwidth available than your local machine).
Using gsutil -m cp will parallelize copying across objects, but individual objects will still be copied sequentially.

Load a huge data from BigQuery to python/pandas/dask

I read other similar threads and searched Google to find a better way but couldn't find any workable solution.
I have a large large table in BigQuery (assume inserting 20 million rows per day). I want to have around 20 million rows of data with around 50 columns in python/pandas/dask to do some analysis. I have tried using bqclient, panda-gbq and bq storage API methods but it takes 30 min to have 5 millions rows in python. Is there any other way to do so? Even any Google service available to do similar job?
Instead of querying, you can always export stuff to cloud storage -> download locally -> load into your dask/pandas dataframe:
Export + Download:
bq --location=US extract --destination_format=CSV --print_header=false 'dataset.tablename' gs://mystoragebucket/data-*.csv && gsutil -m cp gs://mystoragebucket/data-*.csv /my/local/dir/
Load into Dask:
>>> import dask.dataframe as dd
>>> df = dd.read_csv("/my/local/dir/*.csv")
Hope it helps.
First, you should profile your code to find out what is taking the time. Is it just waiting for big-query to process your query? Is it the download of data> What is your bandwidth, what fraction do you use? Is it parsing of that data into memory?
Since you can make SQLAlchemy support big-query ( https://github.com/mxmzdlv/pybigquery ), you could try to use dask.dataframe.read_sql_table to split your query into partitions and load/process them in parallel. In case big-query is limiting the bandwidth on a single connection or to a single machine, you may get much better throughput by running this on a distributed cluster.
Experiment!
Some options:
Try to do aggregations etc. in BigQuery SQL before exporting (a smaller table) to
Pandas.
Run your Jupyter notebook on Google Cloud, using a Deep Learning VM on a high-memory machine in the same region as your BigQuery
dataset. That way, network overhead is minimized.
Probably you want to export the data to Google Cloud Storage first, and then download the data to your local machine and load it.
Here are the steps you need to take:
Create an intermediate table which will contain the data you want to
export. You can do select and store to the intermediate table.
Export the intermediate table to Google Cloud Storage, to JSON/Avro/Parquet format.
Download your exported data and load to your python app.
Besides downloading the data to your local machine, you can leverage the processing using PySpark and SparkSQL. After you export the data to Google Cloud Storage, you can spin up a Cloud Dataproc cluster and load the data from Google Cloud Storage to Spark, and do analysis there.
You can read the example here
https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
and you can also spin up Jupyter Notebook in the Dataproc cluster
https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook
Hope this helps.
A couple years late, but we're developing a new dask_bigquery library to help easily move back and forth between BQ and Dask dataframes. Check it out and let us know what you think!

Can I use aws-glue to load data into aerospike?

I am designing an application which should read a txt file from S3 every 15 min, parse the data separated by | and load this data into aerospike cluster in 3 different aws regions.
The file size can range from 0-32 GB and the no of records it may contain is between 5-130 million.
I am planning to deploy a custom Java process in every aws region which will download a file from S3 and loads into aerospike using multiple threads.
I just came across aws glue. Can anybody tell me if I can use aws glue to load this big chunk of data into aerospike? or any other recommendation to set up an efficient and performant application?
Thanks in advance!
AWS Glue does an extract, transform then loads into RedShift, EMR or Athena. You should take a look at AWS Data Pipeline instead, using the ShellCommandActivity to run your s3 data through extraction and transformation and writing the transformed data to Aerospike.

Saving a >>25T SchemaRDD in Parquet format on S3

I have encountered a number of problems when trying to save a very large SchemaRDD as in Parquet format on S3. I have already posted specific questions for those problems, but this is what I really need to do. The code should look something like this
import org.apache.spark._
val sqlContext = sql.SQLContext(sc)
val data = sqlContext.jsonFile("s3n://...", 10e-6)
data.saveAsParquetFile("s3n://...")
I run into problems if I have more than about 2000 partitions or if there is partition larger than 5G.
This puts an upper bound on the maximum size SchemaRDD I can process this way.
The prctical limit is closer to 1T since partitions sizes vary widely and you only need 1 5G partition to have the process fail.
Questions dealing with the specific problems I have encountered are
Multipart uploads to Amazon S3 from Apache Spark
Error when writing a repartitioned SchemaRDD to Parquet with Spark SQL
Spark SQL unable to complete writing Parquet data with a large number of shards
This questions is to see if there are any solutions to the main goal that do not necessarily involve solving one the above problems directly.
To distill things down there are 2 problems
Writing a single shard larger than 5G to S3 fails. AFAIK this a built in limit of s3n:// buckets. It should be possible for s3:// buckets but does not seem to work from Spark and hadoop distcp from local HDFS can not do it either.
Writing the summary file tends to fail once there are 1000s of shards. There seem to be multiple issues with this. Writing directly to S3 produces the error in the linked question above. Writing directly to local HDFS produces an OOM error even on an r3.8xlarge (244G ram) once when there about 5000 shards. This seems to be independent of the actual data volume. The summary file seems essential for efficient querying.
Taken together these problems limit Parquet tables on S3 to 25T. In practice it is actually significantly less since shard sizes can vary widely within an RDD and the 5G limit applies to the largest shard.
How can I write a >>25T RDD as Parquet to S3?
I am using Spark-1.1.0.
From AWS S3 documentation:
The total volume of data and number of objects you can store are unlimited. Individual Amazon S3 objects can range in size from 1 byte to 5 terabytes. The largest object that can be uploaded in a single PUT is 5 gigabytes. For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability.
One way to go around this:
Attache an EBS volume to your system, format it.
Copy the files to the "local" EBS volume.
Snapshot the volume, it goes to your S3 automatically.
It also gives a smaller load on your instance.
To access that data, you need to attache the snapshot as an EBS to an instance.

Best way to copy millions of files from S3 to GCS?

I am searching for a way to move a very large number of files (over 10 million) from an S3 bucket over to Google Cloud Storage but so far am having issues.
Currently I am using gsutil because it has native support for communicating between both S3 and GCS but I am getting less than great performance. Maybe I am just doing things wrong but I have been using the following gsutil command:
gsutil -m cp -R s3://bucket gs://bucket
I spun up a c3.2xlarge AWS instance (16GB 8CPU) so that I could have enough horse power but it doesn't appear that the box is getting any better throughput than a 2GB 2CPU box, I don't get it?
I have been messing around with the ~/.boto config file and currently have the following options set:
parallel_process_count = 8
parallel_thread_count = 100
I thought for sure increasing the thread count by a factor of 10x would help but from my testing so far hasn't made a difference. Is there anything else that can be done to boost performance?
Or is there maybe a better tool for moving S3 data to GCS? I am looking at the SDK's and am half way tempted to write something in Java.
Google Cloud Storage Online Cloud Import was built specifically to import large sizes and number of files to GCS from either a large list of URLs or from an S3 bucket. It was designed for data sizes that would take too long using "gsutil -m" (which was a good thing to try first). It is currently free to use.
(Disclaimer, I am the PM for the project)