I tried to run the below code but I keep getting an error.
%%bigquery --project my_project_id df
SELECT
COUNT(*) as total_rows
FROM `dataset.table`
ERROR: 400 POST
https://bigquery.googleapis.com/bigquery/v2/projects/my_project_id/jobs:
ProjectId and DatasetId must be non-empty
Can anyone help out?
According to the documentation, you should put the destination variable name df before the --project my_project_id flag; the results of your query are then saved to a DataFrame named df. See the query below:
%%bigquery df --project my_project_id
SELECT
COUNT(*) as total_rows
FROM `dataset.table`
Sample output: a one-row DataFrame with a single total_rows column.
I'm using a databricks notebook and I'd like to retrieve a dataframe from an SQL execution in Spark. I have:
statement = f""" USER {db}; SELECT * FROM {table}
"""
df = spark.sql(statement)
display(df)
However, unlike when I fire off the same statement in an SQL cell in the notebook, I get the following error:
[PARSE_SYNTAX_ERROR] Syntax error at or near 'SELECT': extra input 'SELECT'(line 1...
Where am I going wrong?
I tried to reproduce the same in my environment and got the results below. Two things are going wrong in your cell: USER is not a Spark SQL keyword (you presumably meant USE), and spark.sql() parses only a single statement per call, so combining USE {db}; with the SELECT in one string is what triggers the PARSE_SYNTAX_ERROR. Against my sample demo table Persons, a single-statement call works:
df = spark.sql("select * from Persons")
display(df)
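Since spark.sql() runs one statement per call, a multi-statement string such as "USE {db}; SELECT ..." has to be split and issued separately. A minimal sketch of the split in plain Python (database and table names are hypothetical; no Spark session is needed to see the mechanics):

```python
# spark.sql() accepts one statement per call, so break a multi-statement
# string on ';' and drop empty pieces before running each one.
statement = "USE mydb; SELECT * FROM mytable"  # hypothetical names
stmts = [s.strip() for s in statement.split(";") if s.strip()]
print(stmts)  # → ['USE mydb', 'SELECT * FROM mytable']
```

On Databricks this becomes `for s in stmts: df = spark.sql(s)`, keeping the DataFrame returned by the final SELECT.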
I've just started with Hive. I'm working on Databricks Community Edition. I normally write Python, but I wanted to write something in SQL, and there is an error I cannot understand. I cannot see anything wrong in my code. Please help me.
spark.sql("create table happiness_perm as select * from happiness_tmp");
%sql
select Country, count(*) from happiness_perm group by Country
I tried using my data frame df_happiness instead of happiness_perm and I still receive this:
Error in SQL statement: AnalysisException: Table or view not found: happiness_perm; line 1 pos 30;
'Aggregate ['Country], ['Country, unresolvedalias(count(1), None)]
+- 'UnresolvedRelation [happiness_perm], [], false
I would really appreciate your help!
Try this:
df = spark.sql("select * from happiness_tmp")
df.createOrReplaceTempView("happiness_perm")
First you get your data into a dataframe, then you register the dataframe as a temporary view named happiness_perm.
The %sql cell can then query that view by name.
My code uses SQL to query a database hosted in BigQuery. Say I have a list of items stored in a variable:
list = ['a','b','c']
And I want to use that list as a parameter on a query like this:
%%bigquery --project xxx query
SELECT *
FROM `xxx.database.table`
WHERE items in list
As the magic command that calls the database is a full-cell command, how can I escape out of the SQL so that it picks up Python variables from the notebook environment?
You can try UNNEST; in BigQuery the query works like this:
SELECT * FROM `xx.mytable` WHERE items in UNNEST (['a','b','c'])
In your code it would look like this, though note that the %%bigquery magic will not substitute a bare Python name such as list, which is why the EDIT below shows the working approaches:
SELECT * FROM `xx.mytable` WHERE items in UNNEST (list)
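If you build the array yourself instead, you can render the Python list as a BigQuery array literal with string formatting. A minimal sketch (the table name is hypothetical; for untrusted input prefer real query parameters, since string interpolation is open to SQL injection):

```python
items = ['a', 'b', 'c']
# Render the Python list as a BigQuery array literal, escaping single quotes.
array_literal = "[" + ", ".join("'{}'".format(i.replace("'", "\\'")) for i in items) + "]"
query = "SELECT * FROM `xx.mytable` WHERE items IN UNNEST({})".format(array_literal)
print(query)
```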
EDIT
I found two different ways to pass variables from Python.
The first approach, below, is from the Google documentation [1].
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
query = """
SELECT * FROM `xx.mytable` WHERE items IN UNNEST(@list)
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ArrayQueryParameter("list", "STRING", ["a", "b", "c"]),
    ]
)
query_job = client.query(query, job_config=job_config)  # Make an API request.
for row in query_job:
    print(row)
The second approach is from the next document [2]. In your code it would look like this:
params = {'list': ['a', 'b', 'c']}
%%bigquery df --params $params --project xxx
select * from `xx.mytable`
where items in unnest(@list)
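For the --params flag, the magics documentation [3] says the value can be a reference to a dict (as above) or a dict/JSON-formatted string. A small pure-Python sketch of preparing either form:

```python
import json

params = {"list": ["a", "b", "c"]}
# Serialized form, usable inline in the cell as (hypothetical variable name):
#   %%bigquery df --params $params_json
params_json = json.dumps(params)
print(params_json)
```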
I also found some documentation [3] that lists the parameters accepted by the %%bigquery magic.
[1]https://cloud.google.com/bigquery/docs/parameterized-queries#using_arrays_in_parameterized_queries
[2]https://notebook.community/GoogleCloudPlatform/python-docs-samples/notebooks/tutorials/bigquery/BigQuery%20query%20magic
[3]https://googleapis.dev/python/bigquery/latest/magics.html
I want to store current_day - 1 in a variable in Hive. I know there are already previous threads on this topic, but the solutions provided there recommend first defining the variable outside Hive in a shell environment and then using that variable inside Hive.
Storing result of query in hive variable
I first got the current_Date - 1 using
select date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),1);
Then I tried two approaches:
1. set date1 = ( select date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),1);
and
2. set hivevar:date1 = ( select date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),1);
Both the approaches are throwing an error:
"ParseException line 1:82 cannot recognize input near 'select' 'date_sub' '(' in expression specification"
When I print the variable from approach (1), the literal select query text is stored in the variable in place of yesterday's date. Approach (2) throws "{hivevar:dt_chk} is undefined".
I am new to Hive, would appreciate any help. Thanks.
Hive doesn't support a straightforward way to store a query result in a variable. You have to use the shell, along with hiveconf:
date1=$(hive -e "set hive.cli.print.header=false; select date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'),1);")
hive -hiveconf "date1"="$date1" -f hive_script.hql
Then in your script you can reference the newly created variable date1:
select '${hiveconf:date1}'
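The capture-then-pass pattern can be sketched with stand-in commands, since the hive CLI is only available on the cluster (the date command here plays the role of the hive -e call):

```shell
# Stand-in for: date1=$(hive -e "... select date_sub(...);")
date1=$(date +%Y-%m-%d)
# Stand-in for: hive -hiveconf "date1"="$date1" -f hive_script.hql
echo "select '${date1}'"
```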
After lots of research, this is probably the best way to set a variable from the output of a SQL query:
INSERT OVERWRITE LOCAL DIRECTORY '<home path>/config/date1'
select CONCAT('set hivevar:date1=',date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),1)) from <some table> limit 1;
source <home path>/config/date1/000000_0;
You will then be able to use ${date1} in your subsequent SQL.
We had to add <some table> limit 1 because older Hive cannot run an INSERT OVERWRITE whose SELECT does not read from a table.
The parameterization example in the "SQL Parameters" IPython notebook in the datalab github repo (under datalab/tutorials/BigQuery/) shows how to change the value being tested for in a WHERE clause. Is it possible to use a parameter to change the name of a field being SELECT'd on?
eg:
SELECT COUNT(DISTINCT $a) AS n
FROM [...]
After I received the answer below, here is what I have done (with a dummy table name and field name, obviously):
%%sql --module test01
DEFINE QUERY get_counts
SELECT $a AS a, COUNT(*) AS n
FROM [project_id.dataset_id.table_id]
GROUP BY a
ORDER BY n DESC
table = bq.Table('project_id.dataset_id.table_id')
field = table.schema['field_name']
bq.Query(test01.get_counts, a=field).sql
bq.Query(test01.get_counts, a=field).results()
You can use a field from a Schema object (e.g. given a table, get a specific field via table.schema[fieldname]), or implement a custom object with a _repr_sql_ method. See: https://github.com/GoogleCloudPlatform/datalab/blob/master/sources/lib/api/gcp/bigquery/_schema.py#L49
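The _repr_sql_ hook can be sketched without Datalab installed. This hypothetical class mimics what a schema field provides: Datalab calls the object's _repr_sql_() method and splices the returned text into the query:

```python
class FieldRef:
    """Hypothetical stand-in for a Datalab schema field object.

    Any object exposing _repr_sql_() can be bound to a query parameter;
    the returned string is substituted into the SQL text."""

    def __init__(self, name):
        self.name = name

    def _repr_sql_(self):
        return self.name


field = FieldRef("field_name")
print(field._repr_sql_())  # → field_name
```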