I'm trying to load a .csv file into BigQuery using the console. The file is 45 MB, but with "Upload" I can only load files up to 10 MB. I don't have access to Drive, and I can't run bq load from the command line on my local machine because permission is denied.
Is there any workaround for this? It would be a great help. Thanks.
You can upload the file to a Google Cloud Storage bucket and copy its "gs://" storage URI. Then, in the console, use Create Table, select "Google Cloud Storage" as the source, and paste your URI.
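If you prefer to script that GCS-based load instead of using the console, here is a minimal sketch with the BigQuery Python client's load_table_from_uri method; the project, dataset, table, and bucket names are placeholders, and skip_leading_rows/autodetect are assumptions about your CSV:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholders: replace with your own table ID and GCS URI
table_id = "your-project.your_dataset.your_table"
uri = "gs://your-bucket/your-file.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # assumes the CSV has a header row
    autodetect=True,       # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish
print(client.get_table(table_id).num_rows, "rows loaded")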
I was able to upload a file larger than the 10 MB limit by following this tutorial.
To run the Python script, you just need to install the BigQuery library in your virtualenv:
pip install google-cloud-bigquery
If you do not have a dataset yet, just run this command from the Cloud Console to create a new one:
$ bq mk pythoncsv
#Dataset 'healthy-pager-276023:pythoncsv' successfully created.
After creating your dataset successfully, just run the Python script to upload your CSV.
My final solution is this Python script:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# JUST FOLLOW THIS PATTERN: <projectid>.<datasetname>.<tablename>
table_id = "healthy-pager-276023.pythoncsv.table_name"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

path_to_file_name = "massdata.csv"  # <-- PATH TO CSV TO IMPORT

with open(path_to_file_name, "rb") as source_file:
    job = client.load_table_from_file(source_file, table_id, job_config=job_config)

job.result()  # Waits for the job to complete.

table = client.get_table(table_id)  # Make an API request.
print("Loaded {} rows and {} columns to {}".format(table.num_rows, len(table.schema), table_id))
And here are my configurations from the BigQuery page in the Cloud Console (screenshot not included).
I am trying to learn Spark SQL in Databricks and want to work with the Yelp dataset; however, the file is too large to upload to DBFS from the UI. Thanks, Filip
There are several approaches to that:
Use the Databricks CLI's dbfs command to upload local data to DBFS.
Download the dataset directly from a notebook, for example by using %sh wget URL, and unpack the archive to DBFS (either by using /dbfs/path/... as the destination, or by using the dbutils.fs.cp command to copy files from the driver node to DBFS) -- see the sketch after this list.
Upload the files to AWS S3, Azure Data Lake Storage, Google Cloud Storage or similar, and access the data from there.
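For illustration, a minimal sketch of the second approach run from inside a Databricks notebook; the download URL and file names are placeholders, and dbutils is only available in the notebook environment:

import urllib.request

# Download to the driver node's local disk (placeholder URL and path)
local_path = "/databricks/driver/yelp_dataset.tar"
urllib.request.urlretrieve("https://example.com/yelp_dataset.tar", local_path)

# Copy from the driver's local filesystem into DBFS
dbutils.fs.cp("file:" + local_path, "dbfs:/FileStore/tables/yelp_dataset.tar")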
Upload the file you want to load in Databricks to Google Drive, then download it from within the notebook:
from urllib.request import urlopen
from shutil import copyfileobj

my_url = 'paste your url here'
my_filename = 'give your filename'
file_path = '/FileStore/tables'  # location to which you want to move the downloaded file

# Download the file from Google Drive to the Databricks driver node
with urlopen(my_url) as in_stream, open(my_filename, 'wb') as out_file:
    copyfileobj(in_stream, out_file)

# Check where the file has been downloaded
# (in my case it is the driver's local filesystem)
display(dbutils.fs.ls('file:/databricks/driver'))

# Move the file to the desired location
# dbutils.fs.mv(downloaded_location, desired_location)
dbutils.fs.mv("file:/databricks/driver/" + my_filename, file_path)
I hope this helps.
I am working on AWS Glue and created an ETL job for upserts. I have an S3 bucket with my CSV file in a folder. I am reading the file from S3 and want to write it back to S3 as a Delta Lake table (Parquet files) using this code:
from delta import *
from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

inputDF = spark.read.format("csv").option("header", "true").load('s3://demonidhi/superstore/')
print(inputDF)

# Write data as a DELTA TABLE
inputDF.write.format("delta").mode("overwrite").save("s3a://demonidhi/current/")

# Generate MANIFEST file for Athena/Catalog
deltaTable = DeltaTable.forPath(spark, "s3a://demonidhi/current/")
I am using a Delta jar named 'delta-core_2.11-0.6.1.jar', which sits in an S3 bucket folder, and I gave its path in the Python library path and in the Dependent jars path when creating my job.
Up to the reading part the code works just fine, but the writing and manifest steps fail with an error that I am not able to see in the Glue terminal. I have tried several different approaches but cannot figure out how to resolve this. Any help would be appreciated.
Using the spark.config() notation will not work in Glue, because the abstraction that Glue uses (the GlueContext) will override those parameters.
What you can do instead is provide the config as a parameter to the job itself, with the key --conf and the value spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
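Assuming that --conf job parameter is in place, here is a minimal sketch of the write-and-manifest portion of the script; the generate("symlink_format_manifest") call is the standard Delta Lake API for producing an Athena-compatible manifest, and the S3 paths are the ones from the question:

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# The Delta extensions come from the --conf job parameters, not from .config() calls in the script
spark = SparkSession.builder.getOrCreate()

inputDF = spark.read.format("csv").option("header", "true").load("s3://demonidhi/superstore/")

# Write the data as a Delta table
inputDF.write.format("delta").mode("overwrite").save("s3a://demonidhi/current/")

# Generate the manifest file so Athena / the Glue catalog can query the table
deltaTable = DeltaTable.forPath(spark, "s3a://demonidhi/current/")
deltaTable.generate("symlink_format_manifest")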
My directory structure looks like this: bucket-name/training/file.hdf5
I tried reading this file in a SageMaker notebook instance with this code cell:
import h5py

bucket = 'bucket-name'
data_key = 'training/file.hdf5'
data_location = 's3://{}/{}'.format(bucket, data_key)

hf = h5py.File(data_location, 'r')
But it gives me error:
Unable to open file (unable to open file: name = 's3://bucket-name/training/file.hdf5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
I have also tried pd.read_hdf(data_location), but that was not successful either.
Reading a CSV file into a dataframe from the same key doesn't throw an error.
Any help is appreciated. Thanks.
Thanks for asking the question here!
Your file is on the remote storage service Amazon S3. The string data_location is not the name of a local file, hence your data reader cannot open it. In order to read the S3 file, you have two options:
Use a library that can read files from S3 directly. It seems that h5py can do that if you specify driver='ros3'.
Alternatively, bring the file from S3 to your machine and read it locally. For example, copy it down with the AWS CLI using aws s3 cp s3://<your bucket>/<your file on s3> /home/ec2-user/SageMaker/, and then h5py.File('/home/ec2-user/SageMaker/your-file-name.hdf5', 'r') should work (see the sketch after this list).
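A minimal sketch of the second option using boto3 instead of the AWS CLI; the bucket name, key, and local path are placeholders taken from the question:

import boto3
import h5py

# Download the file from S3 to the notebook instance's local disk
s3 = boto3.client("s3")
s3.download_file("bucket-name", "training/file.hdf5", "/home/ec2-user/SageMaker/file.hdf5")

# Now h5py can open it as a normal local file
hf = h5py.File("/home/ec2-user/SageMaker/file.hdf5", "r")
print(list(hf.keys()))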
I have multiple files in a folder on my local machine, and every file has the same schema. How can I upload those files to BQ with a single CLI command?
I tried this:
bq load --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values temp.test_load ./* ./schema.json
But I got this error:
Too many positional args, still have...(and continues with the name of all files in the folder)
But when I specify the file name, it uploads to BQ without any error:
bq load --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values temp.test_load ./file_1.ndjson.gz ./schema.json (this one is working)
How can I upload multiple local files at once?
When loading data from local files into BigQuery, the files can only be loaded individually, as wildcards and comma-separated lists are not supported for local files.
Wildcards are only supported when loading data from Cloud Storage into BigQuery, provided all the files share a common base name.
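As a workaround, you can loop over the local files and issue one load job per file. A minimal sketch with the BigQuery Python client (the table ID and file pattern are placeholders based on the question; a shell for-loop around bq load achieves the same thing):

import glob
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.temp.test_load"  # placeholder table ID

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    ignore_unknown_values=True,
)

# One load job per local file; the wildcard is expanded here, not by BigQuery
for path in glob.glob("./*.ndjson.gz"):
    with open(path, "rb") as f:
        client.load_table_from_file(f, table_id, job_config=job_config).result()
    print("Loaded", path)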
I am trying to get the bq CLI to work with multiple service accounts for different projects without having to re-authenticate using gcloud auth login or bq init.
An example of what I want to do, and am able to do using gsutil:
I have used gsutil with a .boto configuration file containing:
[Credentials]
gs_service_key_file = /path/to/key_file.json
[Boto]
https_validate_certificates = True
[GSUtil]
content_language = en
default_api_version = 2
default_project_id = my-project-id
[OAuth2]
on a GCE instance to run an arbitrary gsutil command as a service account. The service account does not need to be unique or globally defined on the GCE instance: as long as a service account is set up in my-project-id and a private key has been created, the private key file referenced in the .boto config will take care of authentication. For example, if I run
BOTO_CONFIG=/path/to/my/.boto_project_1
export BOTO_CONFIG
gsutil -m cp gs://mybucket/myobject .
I can copy from any project that I have a service account set up with, and for which I have the private key file defined in .boto_project_1. In this way, I can run a similar gsutil command for project_2 just by referencing the .boto_project_2 config file. No manual authentication needed.
The case with bq CLI
In the case of the BigQuery command line interpreter, I want to reference a config file or pass a config option such as a key file to run a bq load command, i.e. upload the same .csv file that is in GCS for various projects. I want to automate this without having to bq init each time.
I have read here that you can configure a .bigqueryrc file and pass in your credential and key files as options; however, the answer is from 2012, references outdated bq credential files, and throws errors due to the openssl and pyopenssl installs that it mentions.
My question
Provide two example bq load commands with any necessary options/.bigqueryrc files to correctly load a .csv file from GCS into BigQuery for two distinct projects without needing to bq init/authenticate manually between the two commands. Assume the .csv file is already correctly in each project's GCS bucket.
Simply use gcloud auth activate-service-account and use the global --project flag.
https://cloud.google.com/sdk/gcloud/reference/auth/activate-service-account
https://cloud.google.com/sdk/gcloud/reference/
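If you would rather not switch the gcloud/bq credentials at all, an alternative is to script the loads with the BigQuery Python client, creating one client per service-account key file. This is a minimal sketch, not the bq CLI itself; all key paths, project IDs, table names, and bucket URIs below are placeholders:

from google.cloud import bigquery
from google.oauth2 import service_account

def load_csv(key_file, project_id, table_id, gcs_uri):
    # Each call authenticates with its own service-account key, so no bq init is needed
    creds = service_account.Credentials.from_service_account_file(key_file)
    client = bigquery.Client(project=project_id, credentials=creds)
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV, skip_leading_rows=1, autodetect=True
    )
    client.load_table_from_uri(gcs_uri, table_id, job_config=job_config).result()

load_csv("/path/to/key_project_1.json", "project-1", "project-1.dataset_1.table_1", "gs://bucket-1/data.csv")
load_csv("/path/to/key_project_2.json", "project-2", "project-2.dataset_2.table_2", "gs://bucket-2/data.csv")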