I know there are a lot of dataset-related questions on Medium and Data Science that address the issue of storing data.
Some of the suggestions are: upload the data to GitHub; use Google Drive.
However, let's say I want to experiment with a 48 GB dataset. What are my options then?
As far as I understand, GitHub starts giving you a "remote end hung up" error, and Google Drive has a storage cap of 40 GB.
Any ideas? Should I try something like an Amazon S3 bucket?
Does a Colab Pro account help?
Disk space in Colab Pro is double the amount available in the free version.
Related
I was wondering whether Colab uses servers in random Google Cloud Platform regions. If so, would it be possible to select a preferred region? This would decrease data transfer costs from GCS to Colab.
I read other similar threads and searched Google to find a better way but couldn't find any workable solution.
I have a very large table in BigQuery (assume 20 million rows are inserted per day). I want to load around 20 million rows with around 50 columns into python/pandas/dask to do some analysis. I have tried the bqclient, pandas-gbq and BQ Storage API methods, but it takes 30 minutes to get 5 million rows into Python. Is there any other way to do this? Is there any Google service that does a similar job?
Instead of querying, you can always export stuff to cloud storage -> download locally -> load into your dask/pandas dataframe:
Export + Download:
bq --location=US extract --destination_format=CSV --print_header=false 'dataset.tablename' gs://mystoragebucket/data-*.csv && gsutil -m cp gs://mystoragebucket/data-*.csv /my/local/dir/
Load into Dask:
>>> import dask.dataframe as dd
>>> df = dd.read_csv("/my/local/dir/*.csv", header=None)  # files were exported with --print_header=false
Hope it helps.
First, you should profile your code to find out what is taking the time. Is it just waiting for BigQuery to process your query? Is it the download of the data? What is your bandwidth, and what fraction of it do you use? Is it parsing that data into memory?
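A minimal sketch of that kind of profiling, using only the standard library. The three functions `run_query`, `download` and `parse` are hypothetical stand-ins for your real steps (they are not BigQuery calls); the point is only to time each stage separately so you can see where the 30 minutes go.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Add the wall-clock seconds spent inside the with-block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

# Hypothetical stand-ins for the real steps; replace with your own code.
def run_query():
    time.sleep(0.01)  # pretend the query engine is working
    return b"a,b\n1,2\n3,4\n"

def download(result):
    return result  # pretend this pulls the bytes over the network

def parse(payload):
    return [line.split(",") for line in payload.decode().splitlines()]

with timed("query"):
    result = run_query()
with timed("download"):
    payload = download(result)
with timed("parse"):
    rows = parse(payload)

# Print the slowest stage first.
for stage, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage:10s} {seconds:.3f}s")
```

If the download stage dominates, compare bytes received per second with your link speed before changing anything else.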
Since you can make SQLAlchemy support BigQuery ( https://github.com/mxmzdlv/pybigquery ), you could try using dask.dataframe.read_sql_table to split your query into partitions and load/process them in parallel. If BigQuery is limiting the bandwidth on a single connection or to a single machine, you may get much better throughput by running this on a distributed cluster.
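To illustrate the partitioning idea without any external dependency, here is a rough standard-library sketch. It is not the dask API: `fetch_partition` is a hypothetical stand-in for one range query on an integer key (something like `WHERE id >= start AND id < end`), and the thread pool plays the role that dask workers would play.

```python
from concurrent.futures import ThreadPoolExecutor

def partition_bounds(lo, hi, n):
    """Split the half-open key range [lo, hi) into n contiguous chunks."""
    step, extra = divmod(hi - lo, n)
    bounds, start = [], lo
    for i in range(n):
        # The first `extra` chunks get one more key so the sizes differ by at most 1.
        end = start + step + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

def fetch_partition(bounds):
    """Hypothetical stand-in for one range query against the table."""
    start, end = bounds
    return list(range(start, end))  # pretend these are the rows

def load_parallel(lo, hi, n_partitions, max_workers=4):
    """Fetch all partitions concurrently and concatenate them in key order."""
    parts = partition_bounds(lo, hi, n_partitions)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        chunks = list(pool.map(fetch_partition, parts))  # map preserves order
    return [row for chunk in chunks for row in chunk]

print(partition_bounds(0, 10, 3))
print(len(load_parallel(0, 1000, 8)))
```

As far as I know, read_sql_table does the equivalent split for you when you give it an index_col and npartitions, so treat this only as a picture of what happens underneath.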
Experiment!
Some options:
Try to do aggregations etc. in BigQuery SQL before exporting (a smaller table) to Pandas.
Run your Jupyter notebook on Google Cloud, using a Deep Learning VM on a high-memory machine in the same region as your BigQuery dataset. That way, network overhead is minimized.
Probably you want to export the data to Google Cloud Storage first, and then download the data to your local machine and load it.
Here are the steps you need to take:
Create an intermediate table that will contain the data you want to export. You can run a SELECT and store the result in the intermediate table.
Export the intermediate table to Google Cloud Storage in JSON, Avro or Parquet format.
Download your exported data and load it into your Python app.
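The steps above could be scripted roughly like this. It is a sketch under assumptions: the table name `dataset.tmp_export`, the bucket prefix and the example SELECT are made up for illustration, and the code only builds the `bq` and `gsutil` command lines rather than running them, since running them needs the Cloud SDK installed and authenticated.

```python
import shlex

def export_commands(sql, tmp_table, bucket_prefix, local_dir):
    """Return the three shell commands for the steps above as argument lists."""
    # Step 1: run the SELECT and store the result in an intermediate table.
    create = ["bq", "query", "--use_legacy_sql=false", "--replace",
              f"--destination_table={tmp_table}", sql]
    # Step 2: export the intermediate table to Cloud Storage as Parquet shards.
    extract = ["bq", "extract", "--destination_format=PARQUET",
               tmp_table, f"{bucket_prefix}-*.parquet"]
    # Step 3: download the shards in parallel.
    download = ["gsutil", "-m", "cp", f"{bucket_prefix}-*.parquet", local_dir]
    return [create, extract, download]

cmds = export_commands(
    sql="SELECT col_a, col_b FROM dataset.tablename",   # hypothetical query
    tmp_table="dataset.tmp_export",                     # hypothetical table
    bucket_prefix="gs://mystoragebucket/tmp_export",    # hypothetical prefix
    local_dir="/my/local/dir/",
)
for cmd in cmds:
    print(shlex.join(cmd))
```

To actually execute them, pass each list to `subprocess.run(cmd, check=True)` in order. Parquet keeps the column types, which CSV does not, so loading afterwards tends to be less work.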
Instead of downloading the data to your local machine, you can also do the processing with PySpark and Spark SQL. After you export the data to Google Cloud Storage, you can spin up a Cloud Dataproc cluster, load the data from Google Cloud Storage into Spark, and do the analysis there.
You can read the example here
https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
and you can also spin up Jupyter Notebook in the Dataproc cluster
https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook
Hope this helps.
A couple years late, but we're developing a new dask_bigquery library to help easily move back and forth between BQ and Dask dataframes. Check it out and let us know what you think!
I am trying to copy 25 TB of data to Azure. Is there any option for moving the data?
I tried to copy it, but it took 1 hour for 1 GB of data. Is there a better solution so that I can do it more quickly?
The problem statement is very general. I would start with asking, how are you transferring the data?
The speed is dependent on so many factors, a few being:
1. Location of the data.
2. Location of the storage account you're writing to.
3. Network speed and bandwidth on the client side.
4. Network speed and bandwidth on the Azure Storage side. (expected to be good)
If you're writing the data to an Azure Storage account in a region closer to you, you can expect better speed.
As for the options to write the data:
1. Look at AzCopy.
https://azure.microsoft.com/en-us/documentation/articles/storage-use-azcopy/
2. Use the Import/Export service.
https://azure.microsoft.com/en-us/pricing/details/storage-import-export/
The best way to upload large datasets into the cloud is still the sneakernet.
Azure has a service called the Azure Import/Export Service. Basically, you buy a SATA hard drive, encrypt it with a numerical BitLocker key, copy the data to it, create an Azure import job, then ship the hard drive to them.
This ends up being considerably quicker than trying to upload.
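Some back-of-the-envelope arithmetic shows why. The sketch below uses decimal units (1 GB = 10^9 bytes) and the 1 GB per hour rate from the question; the 1 Gbit/s line is only an assumed comparison point, not something measured.

```python
SECONDS_PER_DAY = 86_400
GB = 10**9
TB = 10**12

def transfer_days(total_bytes, bytes_per_second):
    """Days needed to move total_bytes at a sustained rate."""
    return total_bytes / bytes_per_second / SECONDS_PER_DAY

observed_rate = GB / 3600   # 1 GB per hour, as reported in the question
gigabit_rate = 1e9 / 8      # assumed: a fully saturated 1 Gbit/s link

print(f"at 1 GB/hour: {transfer_days(25 * TB, observed_rate):.0f} days")
print(f"at 1 Gbit/s:  {transfer_days(25 * TB, gigabit_rate):.1f} days")
```

At the observed rate, 25 TB works out to roughly 1042 days, while even a saturated gigabit link needs a little over two days, so a drive in the post can easily win.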
An alternative you might want to look into is AWS Import/Export Snowball: they ship you an appliance to copy the data onto, and you ship it back to them when complete. It might be worth considering copying the data into AWS via Snowball and then copying it across their much faster internet pipes into Azure, instead of buying the hardware required to transfer that much data.
If you open the target Storage account in the Azure Portal, there's now a calculator that will accept basic details (how much data etc.) and then recommend the best options to you. It's under the heading "Data transfer".
I have big files ranging in size from 20 GB to 90 GB. I will download the files with Internet Download Manager (IDM) to my Windows server on an Azure Virtual Machine. I will then need to transfer these files to my Azure Storage account to use them later. The total size of the files is about 550 GB.
Will Azure Storage Explorer do the job, or is there a better solution?
My Azure account is a BizSpark one with a $150 limit. Should I remove the limit before transferring the files to the storage account?
Any other advice?
Thanks very much in advance.
You should look at the AzCopy tool (http://aka.ms/AzCopy) - it is designed for large transfers of data to and from Azure Storage.
You will save network egress cost if your storage account is in the same region as the VM where you are uploading from.
As for cost, this depends on what all you are using. You can use the Azure price calculator (http://azure.microsoft.com/en-us/pricing/calculator/) to help with estimating, or just use the pricing info directly from Azure website and calculate an estimated usage to see whether you will fit within your $150 limit.
I am searching for a way to move a very large number of files (over 10 million) from an S3 bucket over to Google Cloud Storage but so far am having issues.
Currently I am using gsutil because it has native support for communicating between both S3 and GCS but I am getting less than great performance. Maybe I am just doing things wrong but I have been using the following gsutil command:
gsutil -m cp -R s3://bucket gs://bucket
I spun up a c3.2xlarge AWS instance (16GB 8CPU) so that I would have enough horsepower, but it doesn't appear that the box is getting any better throughput than a 2GB 2CPU box. I don't get it.
I have been messing around with the ~/.boto config file and currently have the following options set:
parallel_process_count = 8
parallel_thread_count = 100
I thought for sure that increasing the thread count by a factor of 10 would help, but in my testing so far it hasn't made a difference. Is there anything else that can be done to boost performance?
Or is there maybe a better tool for moving S3 data to GCS? I am looking at the SDKs and am halfway tempted to write something in Java.
Google Cloud Storage Online Cloud Import was built specifically to import large volumes of data and large numbers of files to GCS, from either a large list of URLs or an S3 bucket. It was designed for data sizes that would take too long using "gsutil -m" (which was a good thing to try first). It is currently free to use.
(Disclaimer: I am the PM for the project.)