I am new to BigQuery. I am trying to create a table by uploading a CSV file; its size is 290 KB. Even after I fill in all the required information, the three dots beside "Create table" keep moving (like it is loading), but even after waiting a long time the table does not get created.
You can upload the CSV to a bucket and then reference it from the BigQuery table creation panel.
Here is the official guide from Google, with screenshots; it should be rather simple: https://cloud.google.com/bigquery/docs/schema-detect
On step 4 of the image below, select the path to the file and CSV format.
On step 5 you can either keep everything as it is or select "External table" (which I recommend), so that in case of error you can delete the table without losing the CSV.
BigQuery should automatically handle the rest. Please share more detailed information in case of error.
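If you prefer to do the same thing from code rather than from the console, here is a minimal sketch with the BigQuery Python client; the project, dataset, table and bucket names are placeholders, and schema autodetection is assumed to be good enough for your file:

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # let BigQuery infer the schema
)

# Load the CSV that was uploaded to the bucket into a table.
load_job = client.load_table_from_uri(
    "gs://your-bucket/your_file.csv",
    "your-project-id.your_dataset.your_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish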
There are a couple of ways to load a CSV file into BigQuery, as given below:
Write an Apache Beam pipeline (Python/Java) that loads the data into BigQuery; you can combine Beam's sample code for reading and writing. A minimal Beam sketch follows after the Python script below.
Write a Python script that loads the data into BigQuery, for example:
import os
import pandas as pd

# Read the CSV into a DataFrame
dept_dt = pd.read_csv('dept_data')

# Replace with your project ID
project_id = 'xxxxx-aaaaa-de'

# Replace with the path to your service account key file
path_service_account = 'xxxxx-aaaa-jhhhhh9874.json'
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path_service_account

# Load the DataFrame into BigQuery (requires the pandas-gbq package)
dept_dt.to_gbq(destination_table='test1.Emp_data1',
               project_id=project_id,
               if_exists='fail')
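For the Apache Beam option mentioned above, a minimal sketch could look like the following; it assumes a two-column CSV and the apache-beam[gcp] package, and all names, paths and the schema are placeholders:

import csv

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_line(line):
    # Turn one CSV line into a dict that matches the BigQuery schema below.
    name, dept = next(csv.reader([line]))
    return {"name": name, "dept": dept}


options = PipelineOptions(project="your-project-id",
                          temp_location="gs://your-bucket/tmp")

with beam.Pipeline(options=options) as p:
    (p
     | "ReadCSV" >> beam.io.ReadFromText("gs://your-bucket/dept_data.csv",
                                         skip_header_lines=1)
     | "ParseLines" >> beam.Map(parse_line)
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
         "your-project-id:test1.Emp_data1",
         schema="name:STRING,dept:STRING",
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))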
I want to create a table for my dataset in BigQuery by uploading a CSV file. When I upload it and click "Create table", it says:
unexpected error. Tracking number c986854671035387
What is this error and how can I solve it? (I also upgraded my BigQuery to the 90-day free trial.)
You need to check the data inside the CSV: make sure it has column names and no faulty records.
You can download a sample CSV file from here and try:
http://www.mytrapture.com/sampledata/
I am going to work on a data set that contains information about 311 calls in the United States. This data set is publicly available in BigQuery, and I would like to copy it directly to my bucket. However, I am clueless about how to do this, as I am a novice.
Here is a screenshot of the public location of the dataset on Google Cloud:
I have already created a bucket named 311_nyc in my Google Cloud Storage. How can I transfer the data directly without having to download the 12 GB file and upload it again through my VM instance?
If you select the 311_service_requests table from the list on the left, an "Export" button will appear:
Then you can select Export to GCS, select your bucket, type a filename, choose the format (CSV or JSON) and check whether you want the export file to be compressed (GZIP).
However, there are some limitations on BigQuery exports. Copying the ones from the documentation that apply to your case:
You can export up to 1 GB of table data to a single file. If you are exporting more than 1 GB of data, use a wildcard to export the data into multiple files. When you export data to multiple files, the size of the files will vary.
When you export data in JSON format, INT64 (integer) data types are encoded as JSON strings to preserve 64-bit precision when the data is read by other systems.
You cannot choose a compression type other than GZIP when you export data using the Cloud Console or the classic BigQuery web UI.
EDIT:
A simple way to merge the output files together is to use the gsutil compose command. However, if you do this the header with the column names will appear multiple times in the resulting file because it appears in all the files that are extracted from BigQuery.
To avoid this, you should perform the BigQuery Export by setting the print_header parameter to False:
bq extract --destination_format CSV --print_header=False bigquery-public-data:new_york_311.311_service_requests gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv
and then create the composite:
gsutil compose gs://<YOUR_BUCKET_NAME>/nyc_311_* gs://<YOUR_BUCKET_NAME>/all_data.csv
Now the all_data.csv file has no headers at all. If you still need the column names to appear in the first row, you have to create another CSV file with the column names and create a composite of the two. This can be done either manually, by pasting the following (the column names of the 311_service_requests table) into a new file:
unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,street_name,cross_street_1,cross_street_2,intersection_street_1,intersection_street_2,address_type,city,landmark,facility_type,status,due_date,resolution_description,resolution_action_updated_date,community_board,borough,x_coordinate,y_coordinate,park_facility_name,park_borough,bbl,open_data_channel_type,vehicle_type,taxi_company_borough,taxi_pickup_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location
or with the following simple Python script (useful when the table has so many columns that doing it manually would be tedious), which queries the column names of the table and writes them into a CSV file:
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT column_name
FROM `bigquery-public-data`.new_york_311.INFORMATION_SCHEMA.COLUMNS
WHERE table_name='311_service_requests'
"""
query_job = client.query(query)

columns = []
for row in query_job:
    columns.append(row["column_name"])

with open("headers.csv", "w") as f:
    print(','.join(columns), file=f)
Note that for the above script to run you need to have the BigQuery Python Client library installed:
pip install --upgrade google-cloud-bigquery
Upload the headers.csv file to your bucket:
gsutil cp headers.csv gs://<YOUR_BUCKET_NAME>/headers.csv
And now you are ready to create the final composite:
gsutil compose gs://<YOUR_BUCKET_NAME>/headers.csv gs://<YOUR_BUCKET_NAME>/all_data.csv gs://<YOUR_BUCKET_NAME>/all_data_with_headers.csv
In case you want the headers, you can skip creating the first composite and just create the final one using all sources:
gsutil compose gs://<YOUR_BUCKET_NAME>/headers.csv gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv gs://<YOUR_BUCKET_NAME>/all_data_with_headers.csv
You can also use the Cloud SDK command-line tools (bq and gsutil):
Create a bucket:
gsutil mb gs://my-bigquery-temp
Extract the data set:
bq extract --destination_format CSV --compression GZIP 'bigquery-public-data:new_york_311.311_service_requests' gs://my-bigquery-temp/dataset*
Please note that you have to use gs://my-bigquery-temp/dataset* because the dataset is too large and cannot be exported to a single file.
Check the bucket:
gsutil ls gs://my-bigquery-temp
gs://my-bigquery-temp/dataset000000000
......................................
gs://my-bigquery-temp/dataset000000000045
You can find more information in Exporting table data.
Edit:
To compose an object from the exported dataset files you can use gsutil tool:
gsutil compose gs://my-bigquery-temp/dataset* gs://my-bigquery-temp/composite-object
Please keep in mind that you cannot use more than 32 blobs (files) per request to compose an object.
Related SO Question Google Cloud Storage Joining multiple csv files
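If your export produced more than 32 files, one way around that per-request limit is to compose the pieces in chunks, folding the result back in each time. A minimal sketch with the google-cloud-storage package, reusing the bucket and object names from above:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bigquery-temp")

# All exported pieces, in name order so the rows keep their export order.
parts = sorted(bucket.list_blobs(prefix="dataset"), key=lambda b: b.name)
final = bucket.blob("composite-object")

# Compose at most 32 sources per request, accumulating into `final`.
final.compose(parts[:32])
rest = parts[32:]
while rest:
    chunk, rest = rest[:31], rest[31:]
    final.compose([final] + chunk)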
I Googled for a solution to create a table using Databricks and Azure SQL Server, and to load data into this same table. I found some sample code online, which seems pretty straightforward, but apparently there is an issue somewhere. Here is my code.
CREATE TABLE MyTable
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:sqlserver://server_name_here.database.windows.net:1433;database = db_name_here",
user "u_name",
password "p_wd",
dbtable "MyTable"
);
Now, here is my error.
Error in SQL statement: SQLServerException: Invalid object name 'MyTable'.
My password, unfortunately, has spaces in it. That could be the problem, perhaps, but I don't think so.
Basically, I would like to get this to recursively loop through files in a folder and its sub-folders, and load data from files whose names match a pattern like 'ABC*' into a table. The blocker here is that I need the file name loaded into a field as well. So I want to load data from MANY files into 4 fields of actual data and 1 field that captures the file name. The only way I can distinguish the different data sets is by the file name. Is this possible? Or is this an exercise in futility?
My suggestion is to use the Azure SQL Spark library, as also mentioned in the documentation:
https://docs.databricks.com/spark/latest/data-sources/sql-databases-azure.html#connect-to-spark-using-this-library
'Bulk Copy' is what you want to use to get good performance. Just load your file into a DataFrame and bulk copy it to Azure SQL:
https://docs.databricks.com/data/data-sources/sql-databases-azure.html#bulk-copy-to-azure-sql-database-or-sql-server
To read files from subfolders, the answer is here:
How to import multiple csv files in a single load?
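The bulk-copy connector itself is a Scala library, but just to illustrate the "load the file into a DataFrame and write it to Azure SQL" idea, here is a minimal PySpark sketch that uses Spark's generic JDBC writer instead (not the bulk-copy API); the server, database, table, credentials and path are the placeholders from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the pipe-delimited files into a DataFrame.
df = (spark.read.format("csv")
      .option("sep", "|")
      .option("inferSchema", "true")
      .option("header", "false")
      .load("mnt/rawdata/2019/01/01/client/ABC*.gz"))

# Write the DataFrame to Azure SQL over plain JDBC.
(df.write.format("jdbc")
   .option("url", "jdbc:sqlserver://server_name_here.database.windows.net:1433;database=db_name_here")
   .option("dbtable", "MyTable")
   .option("user", "u_name")
   .option("password", "p_wd")
   .mode("append")
   .save())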
I finally, finally, finally got this working.
val myDFCsv = spark.read.format("csv")
.option("sep","|")
.option("inferSchema","true")
.option("header","false")
.load("mnt/rawdata/2019/01/01/client/ABC*.gz")
myDFCsv.show()
myDFCsv.count()
Thanks for a point in the right direction mauridb!!
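For the part of the question about capturing the file name in a column, a minimal PySpark sketch (the column name is a placeholder) could use the built-in input_file_name() function:

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("csv")
      .option("sep", "|")
      .option("inferSchema", "true")
      .option("header", "false")
      .load("mnt/rawdata/2019/01/01/client/ABC*.gz")
      # Add a column holding the full path of the file each row came from.
      .withColumn("source_file", input_file_name()))

df.show()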
This is a question about importing data files from Google Cloud Storage to BigQuery.
I have a number of JSON files that follow a strict naming convention to include some key data not included in the JSON data itself.
For example:
xxx_US_20170101.json.gz
xxx_GB_20170101.json.gz
xxx_DE_20170101.json.gz
That is, the naming is client_country_date.json.gz. At the moment, I have some convoluted processes in a Ruby app that read the files, append the additional data and then write it back to a file that is then imported into a single daily table for the client in BigQuery.
I am wondering if it is possible to grab and parse the filename as part of the import to BigQuery? I could then drop the convoluted Ruby processes, which occasionally fail on larger files.
You could define an external table pointing to your files:
Note that the table type is "external table", and that it points to multiple files with the * glob.
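If you prefer to define the external table from code instead of the UI, a minimal sketch with the BigQuery Python client would be the following; the project, dataset, table and bucket names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# External configuration pointing at the gzipped JSON files via a * glob.
external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
external_config.source_uris = ["gs://your-bucket/xxx_*.json.gz"]
external_config.autodetect = True

table = bigquery.Table("your-project.your_dataset.raw_files")
table.external_data_configuration = external_config
client.create_table(table)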
Now you can query for all data in these files, and query for the meta-column _FILE_NAME:
#standardSQL
SELECT *, _FILE_NAME filename
FROM `project.dataset.table`
You can now save these results to a new native table.
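A minimal sketch of that last step with the Python client (the destination table name is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()

# Write the query results, including the file name, into a native table.
destination = bigquery.TableReference.from_string("project.dataset.native_table")
job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

sql = """
SELECT *, _FILE_NAME AS filename
FROM `project.dataset.table`
"""
client.query(sql, job_config=job_config).result()  # wait for completion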
Hi, I have a problem while using IPython notebooks on Datalab.
I want to write the result of a query into a BigQuery table, but it does not work; everyone says to use the insert_data(dataframe) function, but it does not populate my table.
To simplify the problem, I tried to read a table and write it to a just-created table (with the same schema), but it does not work. Can anyone tell me where I am wrong?
import gcp
import gcp.bigquery as bq
#read the data
df = bq.Query('SELECT 1 as a, 2 as b FROM [publicdata:samples.wikipedia] LIMIT 3').to_dataframe()
#creation of a dataset and extraction of the schema
dataset = bq.DataSet('prova1')
dataset.create(friendly_name='aaa', description='bbb')
schema = bq.Schema.from_dataframe(df)
#creation of the table
temptable = bq.Table('prova1.prova2').create(schema=schema, overwrite=True)
#I try to put the same data into the temptable just created
temptable.insert_data(df)
Calling insert_data will do an HTTP POST and return once that is done. However, it can take some time for the data to show up in the BQ table (up to several minutes). Try waiting a while before using the table. We may be able to address this in a future update; see this.
The hacky way to block until ready right now should be something like:
import time

while True:
    info = temptable._api.tables_get(temptable._name_parts)
    if 'streamingBuffer' not in info:
        break
    if info['streamingBuffer']['estimatedRows'] > 0:
        break
    time.sleep(5)