I need to create a BigQuery table with binary data. One way is to create it with BOOL columns. However, I want to be able to do arithmetic operations directly on the columns, which is not possible with BOOL.
What is the best way to create this table with the minimum storage requirement?
For example, if I create the table below, all columns are implicitly defined as integers (INT64):
create or replace table temp1
as
select 1 as a, 0 as b, 1 as c;
Is there a way to reduce table size?
If you are storing images as binary, I'd recommend GCS. Assuming you are not storing images, you can store binary data in BigQuery as a string.
For your operations you can use predefined functions such as:
select bqutil.fn.from_binary('111000001')
Or you can use other functions such as SAFE_CONVERT_BYTES_TO_STRING(), etc.
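For illustration, a minimal sketch of that idea: pack the three flags from the question into a single STRING of bits, then convert it back to an integer for arithmetic. This assumes the bqutil.fn.from_binary community UDF returns the decimal value of a binary string, as used above; the table and column names are just placeholders.

-- pack a=1, b=0, c=1 into one string column
create or replace table temp1
as
select '101' as flags;

-- convert the packed bits back to an integer (5 for '101') for arithmetic
select bqutil.fn.from_binary(flags) as flags_as_int
from temp1;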
I want to be able to get the main information regarding the various columns of tables located in Snowflake, like df.describe() does in Pandas:
column names,
data types,
min/max/average for numeric types,
and ideally unique values for string types
maybe other things that I'm missing
Granted, you could simply pull all the data into a local DataFrame and then do the "describe" in Pandas, but this would be too costly for Snowflake tables containing millions of rows.
Is there a simple way to do this?
column names
data types
You could always query INFORMATION_SCHEMA:
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME ILIKE 'table_name';
Or
DESCRIBE TABLE table_name;
min/max/average for numeric types,
and ideally unique values for string types
maybe other things that I'm missing
Automatic Contextual Statistics
Select columns, cells, rows, or ranges in the results table to view relevant information about the selected data in the inspector pane (to the right of the results table). Contextual statistics are automatically generated for all column types. The statistics are intended to help you make sense of your data at a glance.
...
Filled/empty meters
Histograms
Frequency distributions
Email domain distributions
Key distributions
There is no equivalent to df.describe().
The simplest way might be a query that replicates it. For example, you could compose a UDF that took as input the result of get_ddl() for the table and returned as output a query that had the correct SQL (min/max/avg etc) for each column.
If approximate answers are sufficient, one alternative would be to do what you described in a local DataFrame but implement a TABLESAMPLE clause to avoid loading all the data.
If you pursue the query route, the good news is that it should be mostly metadata-only operations which are very fast.
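As a starting point, here is a hand-rolled sketch of such a query, assuming an illustrative table my_table with a numeric column amount and a string column status (none of these names come from the question). The SAMPLE (1) clause scans roughly 1% of the rows for approximate answers and can be dropped for exact numbers.

select
    count(*)               as row_count,
    min(amount)            as amount_min,
    max(amount)            as amount_max,
    avg(amount)            as amount_avg,
    count(distinct status) as status_distinct_values
from my_table sample (1);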
I have a flat file as an input that has multiple layouts:
Client# FileType Data
------- -------- --------------------------------------
Client#1FileType0Dataxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Client#1FileType1Datayyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
Client#1FileType2Datazzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
Client#2FileType0Dataxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
My planned workflow goes as follows: drop the temp table, load a SQL temp table with columns Client#, FileType, Data, and then from there map my 32 file types to the actual permanent SQL tables.
My question is, is that even doable and how would you proceed?
Can you, from such a working table, split the data out to the 32 destinations? With SQL substrings? I am not sure how I will map my columns for the differing file types from my temp table, or what 'box' to use in my workflow.
What you are describing is a very reasonable approach to loading data in a database. The idea is:
Create a staging table where all the columns are strings.
Load data into the final table, using SQL manipulations.
The advantage of this approach is that you can debug any data anomalies in the database and that generally makes things much simpler.
The answer to your question is that the following functions are generally very useful in doing this:
substring()
try_convert()
This can get more complicated if the "data" is not fixed width. In that case, you would have to use more complex string processing; recursive CTEs or JSON functionality might help there.
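For illustration, a minimal sketch of splitting one layout out of the staging table, based on the column widths shown in the sample above. The target table, column names, and field positions are made up for the example and would need to match the real layouts.

-- staging table: one raw line per row, everything as a string
CREATE TABLE #staging (raw_line varchar(8000));

-- after bulk-loading the flat file into #staging, split one layout out:
INSERT INTO dbo.FileType1_Target (ClientNo, SomeAmount)
SELECT
    SUBSTRING(raw_line, 1, 8)                                 AS ClientNo,
    TRY_CONVERT(decimal(10, 2), SUBSTRING(raw_line, 18, 10))  AS SomeAmount
FROM #staging
WHERE SUBSTRING(raw_line, 9, 9) = 'FileType1';   -- route rows by layout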
We currently have an audit table of the following form:
TABLE(timestamp, type, data, id) -- all the fields are of type varchar / clob
The issue is that for some types there are actually several well-suited ids. We currently only store one of those, but it would be interesting to be able to query by any of those ids while still keeping only one column for that.
With Oracle's recent support for JSON we were thinking about maybe storing all the ids as a JSON object:
{ "orderId": "xyz123", "salesId": "31232131" }
This would be interesting if we could continue to make queries by id with very good performance. Is this possible? Does Oracle allow for indexing in these kinds of situations, or would it always end up being an O(n) text search over the millions of rows this table has?
Thanks
Although you can start fiddling around with features such as JSON types, nested tables and the appropriate indexing methods, I wouldn't recommend that unless you specifically want to enhance your Oracle skills.
Instead, you can store the ids in a junction table:
table_id   synonym_id
1          1
1          10
1          100
2          2
With an index on (synonym_id, table_id), looking up the synonym should be really fast. This should be simple to maintain. You can guarantee that a given synonym only applies to one id. You can hide the join in a view, if you like.
Of course, you can go down another route, but this should be quite efficient.
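A minimal sketch of that layout (table and column names are illustrative, and audit_table.id is assumed to be the existing audit table's key):

create table audit_id_synonyms (
    table_id   varchar2(100) not null,  -- the id stored in the audit table
    synonym_id varchar2(100) not null   -- any alternative id (orderId, salesId, ...)
);

create index audit_id_synonyms_ix on audit_id_synonyms (synonym_id, table_id);

-- look up audit rows by any of the synonyms:
select a.*
from audit_table a
join audit_id_synonyms s on s.table_id = a.id
where s.synonym_id = 'xyz123';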
1) Oracle's JSON support allows functional indexing on the JSON data (a B-tree index on a JSON_VALUE virtual column).
2) For non-indexed fields, the JSON data needs to be parsed to get to the field values. Oracle uses a streaming parser with early termination, which means that fields at the beginning of the JSON data are found faster than those at the end.
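For illustration, a sketch of such a functional index, assuming the audit table is named audit_table and its data column holds the JSON (both names are placeholders):

-- function-based index on one JSON field
create index audit_order_id_ix
    on audit_table (json_value(data, '$.orderId' returning varchar2(100)));

-- a query using the same expression can use the index:
select *
from audit_table
where json_value(data, '$.orderId' returning varchar2(100)) = 'xyz123';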
I am attempting to fix the schema of a BigQuery table in which the type of a field is wrong (but contains no data). I would like to copy the data from the old schema to the new one using the UI ( select * except(bad_column) from ... ).
The problem is that:
if I select into a table, then BigQuery removes the REQUIRED mode from the columns and therefore rejects the insert.
Exporting via JSON loses information on dates.
Is there a better solution than creating a new table with all columns being nullable/repeated or manually transforming all of the data?
Update (2018-06-20): BigQuery now supports required fields on query output in standard SQL, and has done so since mid-2017.
Specifically, if you append your query results to a table with a schema that has required fields, that schema will be preserved, and BigQuery will check as results are written that it contains no null values. If you want to write your results to a brand-new table, you can create an empty table with the desired schema and append to that table.
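For illustration, one way to do that with standard SQL DDL/DML; the dataset, table, and column names below are placeholders, and the same append can be done by setting a destination table in the UI:

-- create the corrected table; NOT NULL produces REQUIRED fields
create table mydataset.new_table (
  id      int64 not null,
  created date  not null,
  payload string
);

-- append the query results; nulls in the NOT NULL columns are rejected
insert into mydataset.new_table
select * except(bad_column)
from mydataset.old_table;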
Outdated:
You have several options:
Change your field types to nullable. Standard SQL returns only nullable fields, and this is intended behavior, so going forward it may be less useful to mark fields as required.
You can use legacy SQL, which will preserve required fields. You can't use except, but you can explicitly select all other fields.
You can export and re-import with the desired schema.
You mention that export via JSON loses date information. Can you clarify? If you're referring to the partition date, then unfortunately I think any of the above solutions will collapse all data into today's partition, unless you explicitly insert into a named partition using the table$yyyymmdd syntax. (Which will work, but may require lots of operations if you have data spread across many dates.)
BigQuery now supports table clones. A table clone is a lightweight, writable copy of another table.
Copy tables from query in Bigquery
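For example (dataset and table names are placeholders):

create table mydataset.table_clone
clone mydataset.source_table;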
What is a good solution to get the best performance when SELECTing XML fields? Consider a table with 100,000 records. Should I select all the table content into a table variable, create an index on it, and then calculate values on it? Or should I create several indexes on my XML field and then calculate the values?
If anyone has a better solution, please say.
thanks
If you need to extract only certain single values from your XML, you could also investigate this approach:
define a stored function that takes your XML as an input parameter, extracts whatever you need from it, and returns an INT or a VARCHAR or something
define a computed, persisted column on your base table (where your XML is stored) that calls this function
Once you have a computed, persisted column, you can easily index that column.
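For illustration, a hedged sketch of that setup with made-up table, column, and element names; persisting the computed column relies on the function being schema-bound and deterministic:

-- scalar function that extracts one value from the XML
CREATE FUNCTION dbo.GetOrderNumber (@data XML)
RETURNS INT
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @result INT;
    SELECT @result = @data.value('(/Order/Number)[1]', 'int');
    RETURN @result;
END;
GO

-- computed, persisted column that calls the function
ALTER TABLE dbo.Orders
    ADD OrderNumber AS dbo.GetOrderNumber(OrderData) PERSISTED;

-- regular index on the persisted column
CREATE INDEX IX_Orders_OrderNumber ON dbo.Orders (OrderNumber);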
Unfortunately, this only works if you need to extract a single piece of information from your XML column - it doesn't work if you need to extract an entire list of values from the XML
If you need to extract lists of information, then your only real option would be XML indexes. In our experience, though, the overhead (in terms of disk space) is extremely high: a database we had went from about 2 GB to over 10 GB when we added a single XML index. It's an option and you could try it, but be aware of the potential downsides, too.