I am trying to find the file path for data that is inserted from a CSV stored in GCS into a BigQuery table. In Databricks, a DESCRIBE DETAIL/EXTENDED would do the trick, but I can't find the equivalent command in BigQuery.
I have searched the BigQuery docs and tried the INFORMATION_SCHEMA.TABLES route, but it doesn't provide the information I am looking for.
Any insights are greatly appreciated.
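For reference, this is roughly what the INFORMATION_SCHEMA.TABLES attempt looks like via the Python client (a minimal sketch; the dataset name is a placeholder). It returns table metadata such as type and creation time, but not the source file path:
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical dataset name - this view exposes table metadata,
# but not the GCS path the data was loaded from
query = """
SELECT table_name, table_type, creation_time
FROM `my_dataset.INFORMATION_SCHEMA.TABLES`
"""
for row in client.query(query).result():
    print(row.table_name, row.table_type, row.creation_time)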
I am quite new to Databricks, and I am trying to do some basic data exploration with koalas.
When I log into Databricks, under DATA I see 2 main tabs, DATABASE TABLES and DBFS. I managed to read CSV files as Koalas dataframes (ksdf = ks.read_csv('/FileStore/tables/countryCodes.csv')), but I do not know how to read the tables I see under DATABASE TABLES as Koalas dataframes. None of those tables have filename extensions; I guess those are SQL tables?
Sorry if my question is too basic, and thanks very much for your help.
You just need to use the read_table function, as pointed out in the documentation:
ksdf = ks.read_table('my_database.my_table')
P.S. It's part of the so-called Metastore API.
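For example, a short usage sketch (the database and table names here are made up; spark is the SparkSession that Databricks provides in notebooks):
import databricks.koalas as ks

# List the tables registered in the metastore for a given database
spark.sql("SHOW TABLES IN default").show()

# Read one of them directly as a Koalas dataframe (hypothetical table name)
ksdf = ks.read_table("default.country_codes")
print(ksdf.head())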
I have a similar question to the one asked in this link: BigQuery - Export query results to local file/Google storage
I need to extract data from 2 BigQuery tables using joins and where conditions. The extracted data has to be placed in a file on Cloud Storage, mostly as a CSV file. I want to go with a simple solution. Can I use the BigQuery EXPORT DATA statement in standard SQL and schedule it? Does it have a limitation of 1 GB per export? If yes, what is the best possible way to implement this? Creating another temp table to save results from the query and using a Dataflow job to extract the data from the temp table? Please advise.
Basically, Google Cloud now supports the EXPORT DATA statement below.
Please see the code snippet in the Cloud documentation:
https://cloud.google.com/bigquery/docs/reference/standard-sql/other-statements#exporting_data_to_csv_format
I'm wondering if I can use the above statement to export data into a file, where the SELECT query will have a join across the 2 tables and other conditions.
This query will be a scheduled query in BigQuery.
Any inputs please?
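As a rough illustration of what that could look like when run through the Python client (a sketch only; the project, dataset, bucket, and column names are placeholders, and the destination URI needs a * wildcard so BigQuery can split the output across files):
from google.cloud import bigquery

client = bigquery.Client()

export_sql = """
EXPORT DATA OPTIONS(
  uri = 'gs://my-bucket/exports/result-*.csv',
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT a.id, a.booking_date, b.total_amount
FROM `my-project.my_dataset.table_a` AS a
JOIN `my-project.my_dataset.table_b` AS b
  ON a.id = b.id
WHERE a.booking_date >= '2021-01-01'
"""

# Runs the export as a regular query job; the same SQL can be saved as a scheduled query
client.query(export_sql).result()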
I am trying to query externally partitioned data with reference to this BigQuery doc.
Google Cloud Storage (CSV data contains string values only):
gs://project/myfolder/count=1000/file_1k.csv
gs://project/myfolder/count=10000/file_10k.csv
gs://project/myfolder/count=100000/file_100k.csv
Source URI prefix: gs://project/myfolder
But I am getting the following error while querying the table,
Error while reading table: project.dataset.partition_table,
error message: Cannot query hive partitioned data for table project.dataset.partition_table without any associated files.
Query:
SELECT * FROM `project.dataset.partition_table` where count=1000 order by rand() LIMIT 100;
Refer to the screenshot.
Any inputs here are really appreciated.
The problem is that the engine can't find the files related to the partitions.
In your case, it's because when you created the table you referenced a folder in GCS but not its files.
To solve your problem, you should use a wildcard, so your path would be gs://project/myfolder/* instead of gs://project/myfolder.
I hope it helps.
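For example, if the table is (re)created programmatically, the wildcard goes in the source URIs while the prefix stays on the folder (a sketch with the Python client; the table name and autodetect setting are assumptions):
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://project/myfolder/*"]  # wildcard, not the bare folder
external_config.autodetect = True

hive_opts = bigquery.HivePartitioningOptions()
hive_opts.mode = "AUTO"
hive_opts.source_uri_prefix = "gs://project/myfolder"
external_config.hive_partitioning = hive_opts

table = bigquery.Table("project.dataset.partition_table")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)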
What I am trying to achieve is this:
1. Access a REST API to download hotel reservation data - the output format is JSON
2. Convert the JSON data into the correct format to be uploaded into a SQL table
3. Upload this table of data into an existing Google BigQuery table as additional rows
Do let me know if any further information is required and if I have been clear enough
Thanks in advance
1) A pretty good REST API tutorial
2) You can use a local SQL DB or use Cloud SQL. The process would be the same (parse the JSON and insert it into the DB).
If you decide to use Cloud SQL, you can parse the JSON and save it as a CSV, then follow this tutorial
or
simply parse the JSON and insert it using one of the following APIs
3) You can easily load data into any BigQuery table by using the BigQuery API. You can also directly insert the JSON data into BigQuery
But as Tamir had mentioned, it would be best to ask questions if you encounter errors/issues. Since there are multiple ways to perform this type of scenario, we cannot provide an exact solution for you.
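As one possible shape for steps 1-3 (a sketch only; the endpoint, auth header, table ID, and field names are placeholders, and it assumes the API returns a flat list of JSON objects matching the table schema):
import requests
from google.cloud import bigquery

# 1) Pull the reservation data from the REST API (hypothetical endpoint)
resp = requests.get(
    "https://api.example.com/v1/reservations",
    headers={"Authorization": "Bearer <token>"},
)
resp.raise_for_status()
rows = resp.json()  # expected: a list of dicts, one per reservation

# 2) + 3) Append the rows to an existing BigQuery table via a streaming insert
client = bigquery.Client()
table_id = "my-project.my_dataset.reservations"

errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"BigQuery insert errors: {errors}")
If you prefer a load job over streaming inserts, client.load_table_from_json(rows, table_id) is an alternative.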
I need to transform a fairly big database table with AWS Glue to CSV. However, I only need the newest table rows from the past 24 hours. There is a column which specifies the creation date of the row. Is it possible to just transform these rows, without copying the whole table into the CSV file? I am using a Python script with Spark.
Thank you very much in advance!
There are some Built-in Transforms in AWS Glue which are used to process your data. These transforms can be called from ETL scripts.
Please refer to the link below:
https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
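For instance, the Filter transform from that page could keep only the rows created in the past 24 hours before writing the CSV (a sketch; the catalog database, table, timestamp column, and S3 path are assumptions):
import datetime
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Filter

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical Glue catalog database/table; assumes created_at is a timestamp column
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_db", table_name="my_table")

cutoff = datetime.datetime.utcnow() - datetime.timedelta(hours=24)
recent = Filter.apply(frame=dyf, f=lambda row: row["created_at"] >= cutoff)

# Only the filtered rows end up in the CSV output
glue_context.write_dynamic_frame.from_options(
    frame=recent,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/exports/"},
    format="csv")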
You haven't mentioned the type of database that you are trying to connect to. Anyway, for JDBC connections Spark has the query option, in which you can issue the usual SQL query to get the rows you need.
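A sketch of that query option (the JDBC URL, credentials, table, and column names are placeholders; spark is the SparkSession, e.g. glueContext.spark_session in a Glue job, and the matching JDBC driver must be available):
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://my-host:3306/my_db")
      .option("query",
              "SELECT * FROM my_table "
              "WHERE created_at >= NOW() - INTERVAL 24 HOUR")
      .option("user", "my_user")
      .option("password", "my_password")
      .load())

# Only the filtered rows are pulled over JDBC and written out as CSV
df.write.option("header", True).csv("s3://my-bucket/exports/")
With the query option the filter runs inside the source database, so only the last 24 hours of rows are transferred rather than the whole table.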