Google Cloud Datalab error querying BigQuery tables - google-bigquery

I think I am missing something basic here, but I can't figure out what it is.
I am querying a date-partitioned BigQuery table from Google Cloud Datalab. Most other queries fetch data as expected, but on this particular table a select * does not work, although a count(1) query does.
%%sql
select * from Mydataset.sample_sales_yearly_part limit 10
I get the error below:
KeyError Traceback (most recent call last)
/usr/local/lib/python2.7/dist-packages/IPython/core/formatters.pyc in __call__(self, obj)
305 pass
306 else:
--> 307 return printer(obj)
308 # Finally look for special method names
309 method = get_real_method(obj, self.print_method)
/usr/local/lib/python2.7/dist-packages/datalab/bigquery/commands/_bigquery.pyc in _repr_html_query_results_table(results)
999
1000 def _repr_html_query_results_table(results):
-> 1001 return _table_viewer(results)
1002
1003
/usr/local/lib/python2.7/dist-packages/datalab/bigquery/commands/_bigquery.pyc in _table_viewer(table, rows_per_page, fields)
969 meta_time = ''
970
--> 971 data, total_count = datalab.utils.commands.get_data(table, fields, first_row=0, count=rows_per_page)
972
973 if total_count < 0:
/usr/local/lib/python2.7/dist-packages/datalab/utils/commands/_utils.pyc in get_data(source, fields, env, first_row, count, schema)
226 return _get_data_from_table(source.results(), fields, first_row, count, schema)
227 elif isinstance(source, datalab.bigquery.Table):
--> 228 return _get_data_from_table(source, fields, first_row, count, schema)
229 else:
230 raise Exception("Cannot chart %s; unsupported object type" % source)
/usr/local/lib/python2.7/dist-packages/datalab/utils/commands/_utils.pyc in _get_data_from_table(source, fields, first_row, count, schema)
174 gen = source.range(first_row, count) if count >= 0 else source
175 rows = [{'c': [{'v': row[c]} if c in row else {} for c in fields]} for row in gen]
--> 176 return {'cols': _get_cols(fields, schema), 'rows': rows}, source.length
177
178
/usr/local/lib/python2.7/dist-packages/datalab/utils/commands/_utils.pyc in _get_cols(fields, schema)
108 if schema:
109 f = schema[col]
--> 110 cols.append({'id': f.name, 'label': f.name, 'type': typemap[f.data_type]})
111 else:
112 # This will only happen if we had no rows to infer a schema from, so the type
KeyError: u'DATE'
QueryResultsTable job_Ckq91E5HuI8GAMPteXKeHYWMwMo

You may be hitting an issue that was just fixed in https://github.com/googledatalab/pydatalab/pull/68 (but not yet included in a Datalab release).
The background is that the new "Standard SQL" support in BigQuery added new datatypes that can show up in the results schema, and Datalab was not yet updated to handle those.
The next release of Datalab should fix this, but in the meantime you can work around it by wrapping your date fields in an explicit cast to TIMESTAMP as part of your query.
For example, if you see that error with the following code cell:
%%sql
SELECT COUNT(*) as count, d FROM <mytable> GROUP BY d
(where 'd' is a field of type 'DATE'), then you can work around the issue by casting that field to a TIMESTAMP like this:
%%sql
SELECT COUNT(*) as count, TIMESTAMP(d) as d_ts FROM <mytable> GROUP BY d_ts
For your particular query, you'll have to change '*' to the list of fields, so that you can cast the one with a date to a timestamp.
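If the table has many columns, writing the field list by hand gets tedious. Here is a rough sketch of a small Python helper that builds such a query from a list of (name, type) pairs. The column names and types below are made up for illustration, since the real schema of the table is not shown in the question.

```python
def build_select_with_casts(table, schema):
    """Build a SELECT that wraps every DATE column in TIMESTAMP().

    `schema` is a list of (column_name, data_type) pairs. The pairs used
    below are hypothetical; substitute the real schema of your table.
    """
    parts = []
    for name, data_type in schema:
        if data_type.upper() == "DATE":
            # Cast DATE to TIMESTAMP and keep the original column name as alias.
            parts.append("TIMESTAMP({0}) AS {0}".format(name))
        else:
            parts.append(name)
    return "SELECT {} FROM {} LIMIT 10".format(", ".join(parts), table)


# Hypothetical schema, for illustration only.
schema = [("sale_id", "INTEGER"), ("sale_date", "DATE"), ("amount", "FLOAT")]
query = build_select_with_casts("Mydataset.sample_sales_yearly_part", schema)
print(query)
```

The generated string can then be pasted into a %%sql cell in place of the select * query.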

Related

How to write a query to return only specified list of rows?

There is a table Shops with Shop_number and Shop Address columns.
Also a table called Properties with two columns:
Shop_number Property_ID
222222 113
222222 114
222222 115
222223 113
222224 113
222225 111
222226 112
A shop can have more than one property.
How do I write a query that returns all shop numbers that do not have Property_ID 113 at all? (222222 should be excluded: it does have other properties, but it also has 113.)
SELECT p.shop_number FROM Properties p
WHERE p.property_id != 113
My query also returns shop 222222, which has property_id 113.
In this case I would like to return only shop numbers 222225 and 222226.
Your description is a bit unclear.
Since you already got an answer for the case where only one of the two tables is used, let's look at your requirements again, because they can also be read as needing both tables.
You wrote that a shop can have more than one property, and asked for all shop numbers that do not have Property_ID 113 at all.
I don't know if this is your intention, but according to that description you also want to get all shops that do not occur in the properties table at all.
So we could use such a query:
SELECT s.shop_number
FROM shops s
WHERE NOT EXISTS
(SELECT 1 FROM properties WHERE property_id = 113
AND shop_number=s.shop_number);
This will show all shop numbers that don't appear in the properties table at all, and also all shop numbers that appear there only with properties other than 113.
Only those shop numbers that occur in the properties table with property id 113 will be excluded.
And this is exactly what you described as your requirement. The question is whether what you told us you want is really what you want ;)
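To see the difference this makes, here is a small sketch that runs the NOT EXISTS query against an in-memory SQLite database from Python. The property rows are the ones from the question; the shop addresses and the extra shop 222227, which has no properties at all, are assumptions added purely to illustrate the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shops (shop_number INTEGER, shop_address TEXT);
CREATE TABLE properties (shop_number INTEGER, property_id INTEGER);
-- Addresses are placeholders; 222227 is a hypothetical shop with no properties.
INSERT INTO shops VALUES
  (222222, 'addr 1'), (222223, 'addr 2'), (222224, 'addr 3'),
  (222225, 'addr 4'), (222226, 'addr 5'), (222227, 'addr 6');
INSERT INTO properties VALUES
  (222222, 113), (222222, 114), (222222, 115),
  (222223, 113), (222224, 113), (222225, 111), (222226, 112);
""")

rows = conn.execute("""
SELECT s.shop_number
FROM shops s
WHERE NOT EXISTS
  (SELECT 1 FROM properties p
   WHERE p.property_id = 113
     AND p.shop_number = s.shop_number)
ORDER BY s.shop_number
""").fetchall()

shops_without_113 = [r[0] for r in rows]
print(shops_without_113)
```

Shop 222227 shows up in the result even though it never appears in properties, which is the behaviour described above.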
Either use not exists as @Larnu suggests, or use group by / having:
select shop_number
from properties
group by shop_number
having count(case property_id when 113 then 1 end) = 0;
The case expression maps property_id = 113 to 1 and everything else to null, and count(x) does not count rows where x is null. So the count is 0 exactly for those shops that have no row with property_id 113.
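A quick way to check the group by / having version is to run it on the sample rows from the question. This is only a sketch using an in-memory SQLite database from Python; the table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE properties (shop_number INTEGER, property_id INTEGER);
INSERT INTO properties VALUES
  (222222, 113), (222222, 114), (222222, 115),
  (222223, 113), (222224, 113), (222225, 111), (222226, 112);
""")

rows = conn.execute("""
SELECT shop_number
FROM properties
GROUP BY shop_number
-- CASE yields 1 for property 113 and NULL otherwise; COUNT skips NULLs.
HAVING COUNT(CASE property_id WHEN 113 THEN 1 END) = 0
ORDER BY shop_number
""").fetchall()

shops_never_113 = [r[0] for r in rows]
print(shops_never_113)
```

Unlike the not exists version against the shops table, this one can only return shops that have at least one row in properties.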

Including variables after filtering selecting only minimum values in SQL?

I am working with a twin dataset and would like to create a table with the Subject ID (Subject) and twin pair ID (twpair) for the twins with the lower (or one of the twins if the values are equal) lifetime total of marijuana use (MJ1a).
A portion of my table looks like this:
Subject twpair MJ1a
156 345 10
157 345 7
158 346 20
159 346 3
160 347 4
161 347 4
I'm hoping to create a table with only the twins that have the lower amount of marijuana use which would look like this:
Subject twpair MJ1a
157 345 7
159 346 3
161 347 4
This is the SQL code I have so far:
proc sql;
create table one_twin as
select twpair,min(MJ1a) as minUse, Subject
from twins_deviation
group by twpair;
Unfortunately this causes all of the subjects to be remerged back into the dataset. If I leave out Subject, I get the correct values for twpair and MJ1a but no Subject IDs.
How do I filter the dataset to include only the rows with the minimum values while also keeping variables of interest like Subject ID? Note that if both twins in a pair have the SAME value I would like to select one of them, but it doesn't matter which. Any tips would be greatly appreciated!
This query should give you the desired result:
select a.subject, a.twpair, a.MJ1a
from twins_deviation a
join (select twpair, min(mj1a) as mj1a
      from twins_deviation
      group by twpair) b
  on a.twpair = b.twpair and a.mj1a = b.mj1a
If your DB supports analytic/window functions, the same can be accomplished with a ranking function, as in the solution below.
EDIT1: to handle equal values of mj1a
select subject, twpair, mj1a
from (select subject, twpair, mj1a,
             row_number() over (partition by twpair order by mj1a) as rnk
      from twins_deviation) out1
where rnk = 1;
EDIT2: Updated solution 1 to include only one twin.
select min(subject) as subject, twpair, mj1a
from (select a.subject as subject, a.twpair as twpair, a.MJ1a as MJ1a
      from twins_deviation a
      join (select twpair, min(mj1a) as mj1a
            from twins_deviation
            group by twpair) b
        on a.twpair = b.twpair and a.mj1a = b.mj1a) out1
group by twpair, MJ1a;

SQL - adding category to string value (mapping table)

I'm trying to reproduce a mapping that I previously got from an external excel file into a SQL query.
I have specific errors as strings (e.g. "aborted", "timeout"). A simplified example:
count last_error
452 user_aborted
889 timeout
212 request_denied
98 blacklisted_by_admin
789 login_unsuccessful
340 country_not_available
I would like to map these into categories I have defined, so that the result would be a new column with the error category:
count last_error error_category
452 user_aborted user
889 timeout tech
212 request_denied risk
98 blacklisted_by_admin risk
789 login_bad user
340 country_not_available tech
What is the best way of doing this? I have about 40 errors, and six categories.
You can use a case expression like this:
case
when last_error in ('user_aborted', 'login_bad') then 'user'
when last_error in ('request_denied', 'blacklisted_by_admin') then 'risk'
when last_error in ('timeout', 'country_not_available') then 'tech'
end as error_category
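With about 40 errors, a long case expression gets hard to maintain. An alternative that stays closer to the original Excel mapping is a small mapping table joined to the data. The sketch below tries this in an in-memory SQLite database from Python; the table names error_counts and error_category_map are invented for the example, and the error names are the ones used in the case expression above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table holding the counts per error string.
CREATE TABLE error_counts (cnt INTEGER, last_error TEXT);
INSERT INTO error_counts VALUES
  (452, 'user_aborted'), (889, 'timeout'), (212, 'request_denied'),
  (98, 'blacklisted_by_admin'), (789, 'login_bad'),
  (340, 'country_not_available');

-- Hypothetical mapping table: one row per error string.
CREATE TABLE error_category_map (last_error TEXT PRIMARY KEY, error_category TEXT);
INSERT INTO error_category_map VALUES
  ('user_aborted', 'user'), ('login_bad', 'user'),
  ('request_denied', 'risk'), ('blacklisted_by_admin', 'risk'),
  ('timeout', 'tech'), ('country_not_available', 'tech');
""")

rows = conn.execute("""
SELECT e.cnt, e.last_error,
       COALESCE(m.error_category, 'unmapped') AS error_category
FROM error_counts e
LEFT JOIN error_category_map m ON m.last_error = e.last_error
ORDER BY e.last_error
""").fetchall()

categories = {last_error: category for _, last_error, category in rows}
print(categories)
```

The left join plus coalesce means an error that is missing from the mapping table still appears in the output, labelled 'unmapped', instead of silently disappearing.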

Comparing data in same table SQL

Table: working_history
ID Field Event_dt Data
145 Reason 10/20/2003 DOM
145 Reason 9/20/2007 LVE
145 Reason 3/17/2008 RTN
145 Reason 4/5/2008 POP
145 Reason 3/7/2009 POP
145 Reason 6/13/2009 TRE
145 status 10/20/2003 A
145 status 6/5/2006 L
145 status 11/27/2006 A
145 status 9/20/2007 L
145 status 3/17/2008 A
145 status 6/12/2009 T
I want to find anyone who had a status of L, and then check that a Reason row exists with the same event_dt as that status row. In the above table,
145 status 6/5/2006 L
should come back, because for that event_dt (6/5/2006) with Field = status and Data = L there is no row where Field = Reason on the same date.
SELECT *
FROM working_history WHReason
RIGHT JOIN working_history WHStatus
on WHReason.ID = WHStatus.ID
and WHReason.field = 'Reason'
and WHReason.Event_DT = WHStatus.Event_DT
-- filters on the preserved (status) side belong in WHERE, not in ON
WHERE WHStatus.field = 'status'
and WHStatus.Data = 'L'
and WHReason.Event_DT is null
This assumes you're only looking for statuses that do NOT have a reason, and not the other way around.
It basically says: create two sets of information, one for rows where field is reason and one for rows where field is status. Then combine those two sets on ID and event date, keeping all records from the status set and only the matching ones from the reason set. Finally, limit the result to rows that have no reason event date.
It uses a concept called a self join to generate two sets of data, which makes it easy to find information that is in one set but not the other.
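The same idea can be tried on the sample rows from the question. The sketch below uses an in-memory SQLite database from Python and writes the join as a LEFT JOIN with the status side first, which is equivalent to the RIGHT JOIN form and also works on SQLite versions that lack RIGHT JOIN. Dates are stored as plain text here, which is enough for an equality comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE working_history (id INTEGER, field TEXT, event_dt TEXT, data TEXT);
INSERT INTO working_history VALUES
  (145, 'Reason', '10/20/2003', 'DOM'), (145, 'Reason', '9/20/2007', 'LVE'),
  (145, 'Reason', '3/17/2008', 'RTN'),  (145, 'Reason', '4/5/2008', 'POP'),
  (145, 'Reason', '3/7/2009', 'POP'),   (145, 'Reason', '6/13/2009', 'TRE'),
  (145, 'status', '10/20/2003', 'A'),   (145, 'status', '6/5/2006', 'L'),
  (145, 'status', '11/27/2006', 'A'),   (145, 'status', '9/20/2007', 'L'),
  (145, 'status', '3/17/2008', 'A'),    (145, 'status', '6/12/2009', 'T');
""")

unmatched = conn.execute("""
SELECT s.id, s.field, s.event_dt, s.data
FROM working_history s
LEFT JOIN working_history r
  ON r.id = s.id
 AND r.field = 'Reason'
 AND r.event_dt = s.event_dt
WHERE s.field = 'status'
  AND s.data = 'L'
  AND r.id IS NULL          -- no Reason row on the same date
""").fetchall()

print(unmatched)
```

The status L row on 9/20/2007 is not returned because a Reason row (LVE) exists on that date.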

prettify output of query

I have this query and would like to suppress the repeated person values in the output and get the total of the last column.
Now it gives
person  |year|dossiers
--------|----|--------
9210124 |1110| 166
9210124 |1111| 198
9210124 |1112| 162
9210161 |1110| 183
9210161 |1111| 210
9210161 |1112| 142
And I would like to have
person  |year|dossiers
--------|----|--------
9210124 |1110| 166
        |1111| 198
        |1112| 162
9210161 |1110| 183
        |1111| 210
        |1112| 142
total          1061
Here is the query:
select
pers_nr "person",
to_char(import_dt,'YYMM') "year and month",
count(pers_nr) "dossiers"
from
rdms_3codon
where
trunc(import_dt) >= trunc(trunc(sysdate, 'Q') -1, 'Q')
and trunc(import_dt) < trunc(sysdate, 'Q')-1/(24*60*60)
group by
pers_nr,
to_char(import_dt,'YYMM')
order by
pers_nr
Could someone help me, please?
As noted in the comments, this is a client function, not a database one. For example, if you are using SQL*Plus, you can use:
break on person on report
compute sum label total of dossiers on report
The "on person" clause suppresses the duplicate person values; the "on report" clause together with the compute command generates the total at the bottom. Both breaks go in a single command because each new BREAK command replaces the previous one. SQL*Plus output formatting is covered in the SQL*Plus documentation.
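If the client is your own program rather than SQL*Plus, the same formatting is easy to do after fetching the rows. Here is a minimal Python sketch, assuming the rows arrive already sorted by person as (person, year, dossiers) tuples; the function name format_report is made up for this example.

```python
def format_report(rows):
    """Blank out repeated person values and append a grand total line.

    Assumes `rows` is sorted by person, as the ORDER BY in the query ensures.
    """
    lines = ["person  |year|dossiers", "--------|----|--------"]
    previous_person = None
    total = 0
    for person, year, dossiers in rows:
        # Show the person only on the first row of each group.
        shown = person if person != previous_person else ""
        lines.append("{:<8}|{}|{:>8}".format(shown, year, dossiers))
        previous_person = person
        total += dossiers
    lines.append("{:<8} {:>13}".format("total", total))
    return lines


rows = [
    ("9210124", "1110", 166), ("9210124", "1111", 198), ("9210124", "1112", 162),
    ("9210161", "1110", 183), ("9210161", "1111", 210), ("9210161", "1112", 142),
]
report = format_report(rows)
print("\n".join(report))
```

The query itself stays unchanged; only the presentation moves to the client.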
Try this one. It will at least give you the totals. The repeated person values can either be replaced with NULLs, for example using RANK() over pers_nr, or be blanked out in the code of your application, if any.
select
pers_nr "person",
to_char(import_dt,'YYMM') "year and month",
SUM(count(pers_nr)) OVER (ORDER BY to_char(import_dt,'YYMM')) "total"
FROM ....
Hope it helps a bit.