I was wondering if Colab uses servers in random Google Cloud Platform regions. If so, would it be possible to select a preferred region? This would reduce data transfer costs from GCS to Colab.
I know that there are a lot of dataset-related questions on Medium and data-science forums that address the issue of storing data. Some of the suggestions are: upload the data to GitHub; use Google Drive.
However, let's say that I want to experiment with a dataset of size 48 GB. What are my options then?
As per my understanding, GitHub starts giving you a "remote end hung up" message, and Google Drive has a storage cap of 40 GB.
Any ideas? Should I try something like an Amazon S3 bucket?
Does a Colab Pro account help?
Disk space in Colab Pro is double the amount available in the free version.
I read other similar threads and searched Google for a better way, but couldn't find any workable solution.
I have a large table in BigQuery (assume 20 million rows inserted per day). I want to get around 20 million rows of data with around 50 columns into Python/pandas/Dask to do some analysis. I have tried the bqclient, pandas-gbq, and BQ Storage API methods, but it takes 30 minutes to get 5 million rows into Python. Is there any other way to do this? Is there any Google service available to do a similar job?
Instead of querying, you can always export stuff to cloud storage -> download locally -> load into your dask/pandas dataframe:
Export + Download:
bq --location=US extract --destination_format=CSV --print_header=false \
    'dataset.tablename' gs://mystoragebucket/data-*.csv && \
gsutil -m cp gs://mystoragebucket/data-*.csv /my/local/dir/
Load into Dask:
>>> import dask.dataframe as dd
>>> df = dd.read_csv("/my/local/dir/*.csv")
Hope it helps.
First, you should profile your code to find out what is taking the time. Is it just waiting for BigQuery to process your query? Is it the download of the data? What is your bandwidth, and what fraction of it do you use? Is it the parsing of that data into memory?
Since there is a SQLAlchemy dialect for BigQuery ( https://github.com/mxmzdlv/pybigquery ), you could try dask.dataframe.read_sql_table to split your query into partitions and load/process them in parallel, as sketched below. In case BigQuery is limiting the bandwidth on a single connection or to a single machine, you may get much better throughput by running this on a distributed cluster.
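A minimal sketch of that approach, assuming the pybigquery dialect is installed; the project, dataset, table, and index column below are placeholders, not from the question:

import dask.dataframe as dd

# Partitioned read over SQLAlchemy; pybigquery provides the bigquery:// dialect.
df = dd.read_sql_table(
    "my_table",                          # table inside the dataset (placeholder)
    "bigquery://my-project/my_dataset",  # SQLAlchemy connection URI (placeholder)
    index_col="id",                      # orderable column used to split partitions
    npartitions=50,                      # issue ~50 smaller queries in parallel
)
df = df.persist()  # pull the partitions in parallel on a Dask cluster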
Experiment!
Some options:
Try to do aggregations etc. in BigQuery SQL before exporting a (much smaller) table to Pandas (see the sketch after this list).
Run your Jupyter notebook on Google Cloud, using a Deep Learning VM on a high-memory machine in the same region as your BigQuery dataset. That way, network overhead is minimized.
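For the first option, a rough sketch with the google-cloud-bigquery client; the project, dataset, and column names are placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Aggregate inside BigQuery so only the (much smaller) result crosses the network.
query = """
    SELECT user_id, COUNT(*) AS events, AVG(value) AS avg_value
    FROM `my-project.my_dataset.big_table`
    GROUP BY user_id
"""
df = client.query(query).to_dataframe()  # small pandas DataFrame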
You probably want to export the data to Google Cloud Storage first, then download it to your local machine and load it.
Here are the steps you need to take:
Create an intermediate table which will contain the data you want to export. You can run a SELECT and store the result in the intermediate table.
Export the intermediate table to Google Cloud Storage in JSON/Avro/Parquet format.
Download the exported data and load it into your Python app.
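A rough sketch of those steps with the google-cloud-bigquery client; the project, dataset, bucket, and filter below are placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Step 1: store the SELECT result in an intermediate table.
query_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.intermediate_table",
    write_disposition="WRITE_TRUNCATE",
)
client.query(
    "SELECT col1, col2 FROM `my-project.my_dataset.big_table` WHERE dt = CURRENT_DATE()",
    job_config=query_config,
).result()

# Step 2: export the intermediate table to Cloud Storage as Avro.
extract_config = bigquery.ExtractJobConfig(destination_format="AVRO")
client.extract_table(
    "my-project.my_dataset.intermediate_table",
    "gs://my-bucket/export/part-*.avro",
    job_config=extract_config,
).result()

# Step 3: download locally (e.g. gsutil -m cp gs://my-bucket/export/part-*.avro .)
# and load the files into pandas/Dask.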
Besides downloading the data to your local machine, you can do the processing with PySpark and Spark SQL. After you export the data to Google Cloud Storage, you can spin up a Cloud Dataproc cluster, load the data from Google Cloud Storage into Spark, and do the analysis there.
You can read the example here
https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
and you can also spin up a Jupyter notebook on the Dataproc cluster:
https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook
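As a rough illustration of what that analysis could look like on the cluster; the bucket, file, and column names are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-export-analysis").getOrCreate()

# Read the Parquet files that BigQuery exported to Cloud Storage.
df = spark.read.parquet("gs://my-bucket/export/part-*.parquet")

# Run Spark SQL over the data instead of pulling it to a local machine.
df.createOrReplaceTempView("events")
spark.sql("SELECT col1, COUNT(*) AS n FROM events GROUP BY col1").show()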
Hope this helps.
A couple of years late, but we're developing a new dask_bigquery library to help easily move back and forth between BigQuery and Dask dataframes. Check it out and let us know what you think!
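Usage looks roughly like this (check the project's README for the current API; the project, dataset, and table names are placeholders):

import dask_bigquery

ddf = dask_bigquery.read_gbq(
    project_id="my-project",
    dataset_id="my_dataset",
    table_id="my_table",
)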
I need to setup a scheduled daily job that pulls data using a REST API call and then inserts that data into BigQuery.
I traditionally have done these types of tasks using Node.js running on Heroku. My current boss wants me to achieve this using the Google Cloud Platform.
What are some ways to achieve this using Google Cloud Platform?
A few options on GCP:
Spin up a GCE instance and use cron (a little old school, but it will work).
Use Google App Engine and schedule your job(s) that way.
Unfortunately, Google Cloud Functions don't yet support schedulers. Otherwise, that would be perfect.
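Whichever scheduler you pick, the job itself can be a small Python script along these lines; the endpoint, project, and table names are placeholders:

import requests
from google.cloud import bigquery

def pull_and_load():
    # Pull the day's records from the REST API (assumed to return a JSON array).
    rows = requests.get("https://api.example.com/daily-data").json()

    # Stream the rows into BigQuery.
    client = bigquery.Client()
    errors = client.insert_rows_json("my-project.my_dataset.daily_table", rows)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")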
What is the best way for storing images and Microsoft Office documents:
Google Drive
Google Storage
You may want to check this page to help you choose which storage option suits you best and to learn more.
To differentiate the two:
Google Drive
A collaborative space for storing, sharing, and editing files, including Google Docs. It is good for the following:
End-user interaction with docs and files
Collaborative creation and editing
Syncing files between cloud and local devices
Google Cloud Storage
A scalable, fully managed, highly reliable, and cost-efficient object/blob store, good for:
Images, pictures, and videos
Objects and blobs
Unstructured data
In addition to that, see Google Cloud Platform - FAQ for more insights.
Different approaches can be considered. Google Docs is widely used for working with office documents online; it provides roughly the same layout as Microsoft Office, and the advantage is that you can share a document with other people and edit it online at any time.
Google Drive (useful way to store your files)
Every Google Account starts with 15 GB of free storage that's shared across Google Drive, Gmail, and Google Photos. When you upgrade to Google One, your total storage increases to 100 GB or more depending on what plan you choose.
MediaFire (another useful way to store your files)
MediaFire's basic plan gives you 10 GB of cloud space for free, and the files you store there can be protected with password encryption. It offers a number of other features as well; a suggestion worth exploring.
We have TBs of data that need to be uploaded to BigQuery. I remember one of the videos from Felipe Hoffa mentioning that we can send a hard drive overnight to Google and they can take care of it. Can the Google BigQuery team provide more info on whether this is possible?
This is the Offline Import mechanism for Google Cloud Storage. You can read about it here:
https://developers.google.com/storage/docs/early-access
Basically, you'd use this mechanism to import to Google Cloud Storage, then run BigQuery import jobs to import to BigQuery from there.
Depending on how many TB of data you are importing, you might just be better off uploading directly to Google Cloud Storage. Gsutil and other tools can do resumable uploads.
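A rough sketch of the direct-upload route with the Python clients; the bucket, file, and table names are placeholders:

from google.cloud import bigquery, storage

# Upload to Cloud Storage; the client library uses resumable uploads for large files.
bucket = storage.Client().bucket("my-bucket")
bucket.blob("imports/data.csv").upload_from_filename("/local/data.csv")

# Then load the uploaded file into BigQuery.
client = bigquery.Client()
load_job = client.load_table_from_uri(
    "gs://my-bucket/imports/data.csv",
    "my-project.my_dataset.my_table",
    job_config=bigquery.LoadJobConfig(source_format="CSV", autodetect=True),
)
load_job.result()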
If you are talking about 100s of TB or more, you might want to talk to a Google Cloud Support person about your scenarios in detail. They may be able to help you optimize your usage of BigQuery and Cloud Storage.