BigQuery: Too many total leaf fields 10852 - google-bigquery

I am importing some data from Google Cloud Datastore with about 980 columns. I exported it to a Cloud Storage bucket first and am attempting to import it into BigQuery (following the GCP guide). However, I get the error Too many total leaf fields: 10852.
I know for certain that none of the entities have more than 1000 fields. Is there a possibility that the import process is transforming my data and creating additional fields?

The schemas generated by the Managed Import/Export service will not contain more than 10k fields. So it looks like you are importing into a BigQuery table that already has data. BigQuery will take the union of the existing schema and the new schema. So even if any given entity has fewer than 1,000 fields, the union of all field names across all your entities of a kind, plus the existing fields in the BigQuery schema, can exceed the limit.
Some options you have include:
1) Use a new table for each import into BigQuery.
2) Try using projectionFields to limit the fields loaded into BigQuery.
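The union behavior described above can be sketched in plain Python. Entity and field names below are made up for illustration only; the point is that the total leaf-field count depends on the union across all entities plus the existing table schema, not on any single entity:

```python
# Sketch: BigQuery's load schema is the union of field names across all
# entities of a kind, plus the existing table schema, so the total can far
# exceed what any single entity carries.

def total_leaf_fields(existing_schema, entities):
    """Union of the existing table's fields and every field seen in any entity."""
    fields = set(existing_schema)
    for entity in entities:
        fields.update(entity)
    return len(fields)

# Three entities, each with only 3 fields, but mostly disjoint names:
entities = [
    {"id", "name", "price_2019"},
    {"id", "name", "price_2020"},
    {"id", "name", "price_2021"},
]
existing = {"id", "name", "legacy_field"}

print(total_leaf_fields(existing, entities))  # 6 distinct leaf fields
```

Scaled up, a few thousand entities with mostly disjoint field names can easily push the union past the 10k limit even though each entity is small.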

Jim Morrison's solution (using projectionFields) solved the issue for me.
I ended up passing a list of entity columns I was interested in and only exporting this subset to BigQuery. The following command line instruction achieves this.
bq --location=US load --source_format=DATASTORE_BACKUP --projection_fields="field1, field4, field2, field3" --replace mydataset.table gs://mybucket/2019-03-25T02:56:02_47688/default_namespace/kind_data/default_namespace_datakind.export_metadata

Airflow Pipeline CSV to BigQuery with Schema Changes

Background
I need to design an Airflow pipeline to load CSVs into BigQuery.
The CSVs frequently have a changing schema. After loading the first file the schema might be
id | ps_1 | ps_1_value
When the second file lands and I load it, it might look like
id | ps_1 | ps_1_value | ps_2 | ps_2_value.
Question
What's the best approach to handling this?
My first thought on approaching this would be
Load the second file
Compare the schema against the current table
Update the table, adding two columns (ps_2, ps_2_value)
Insert the new rows
I would do this in a PythonOperator.
If file 3 comes in and looks like id | ps_2 | ps_2_value I would fill in the missing columns and do the insert.
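The compare step above could be sketched as a small helper; the column names follow the example schemas in the question, and this is only the diff logic, not a full operator:

```python
def columns_to_add(table_columns, file_columns):
    """Return the columns present in the new file but missing from the table,
    preserving the file's column order."""
    existing = set(table_columns)
    return [c for c in file_columns if c not in existing]

table = ["id", "ps_1", "ps_1_value"]
new_file = ["id", "ps_1", "ps_1_value", "ps_2", "ps_2_value"]
print(columns_to_add(table, new_file))  # ['ps_2', 'ps_2_value']
```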
Thanks for the feedback.
After loading two prior files example_data_1.csv and example_data_2.csv I can see that the fields are being inserted into the correct columns, with new columns being added as needed.
Edit: The light bulb moment was realizing that the schema_update_options exist. See here: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.SchemaUpdateOption.html
csv_to_bigquery = GoogleCloudStorageToBigQueryOperator(
    task_id='csv_to_bigquery',
    google_cloud_storage_conn_id='google_cloud_default',
    bucket=airflow_bucket,
    source_objects=['data/example_data_3.csv'],
    skip_leading_rows=1,
    bigquery_conn_id='google_cloud_default',
    destination_project_dataset_table='{}.{}.{}'.format(project, schema, table),
    source_format='CSV',
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_APPEND',
    schema_update_options=['ALLOW_FIELD_RELAXATION', 'ALLOW_FIELD_ADDITION'],
    autodetect=True,
    dag=dag
)
Basically, the recommended pipeline for your case consists of creating a temporary table to treat your new data.
Since Airflow is an orchestration tool, it's not recommended to push big flows of data through it.
Given that, your DAG could be very similar to your current DAG:
Load the new file to a temporary table
Compare the actual table's schema and the temporary table's schema.
Run a query to move the data from the temporary table to the actual table. If the temporary table has new fields, add them to the actual table using the schema_update_options parameter. Besides that, if your actual table has fields in NULLABLE mode, it will easily handle the case where your new data is missing some field.
Delete your temporary table
If you're using GCS, move your file to another bucket or directory.
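The move step in the list above could be generated along these lines. The dataset, table, and column names here are placeholders, not from any actual pipeline; the sketch only shows how to select NULL for target columns the temporary table lacks:

```python
def build_move_query(target, temp, target_columns, temp_columns):
    """Build an INSERT ... SELECT that maps the temp table's columns onto the
    target schema, selecting NULL for any target column the temp table lacks."""
    temp_set = set(temp_columns)
    select_list = ", ".join(
        col if col in temp_set else f"NULL AS {col}" for col in target_columns
    )
    return (
        f"INSERT INTO `{target}` ({', '.join(target_columns)}) "
        f"SELECT {select_list} FROM `{temp}`"
    )

sql = build_move_query(
    target="mydataset.actual",
    temp="mydataset.temp_load",
    target_columns=["id", "ps_1", "ps_1_value", "ps_2", "ps_2_value"],
    temp_columns=["id", "ps_2", "ps_2_value"],
)
print(sql)
```

The generated statement can then be run as an ordinary BigQuery query job from the DAG.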
Finally, I would like to point some links that might be useful to you:
AirFlow Documentation (BigQuery's operators)
An article which shows a problem similar to yours and where you can find some of the mentioned information.
I hope it helps.

Load order of entries in BigQuery tables

I have some sample data that I've been loading into Google BigQuery. I have been importing the data in ndjson format. If I load the data all in one file, the rows show up in a different order in the table's preview tab than when I sequentially import them one ndjson line at a time.
When importing sequentially I wait till I see the following output:
Waiting on bqjob_XXXX ... (2s) Current status: RUNNING
Waiting on bqjob_XXXX ... (2s) Current status: DONE
The order the rows show up in seems to match the order I append them, as the job importing them seems to finish before I move on to the next. But when loading them all in one file, they show up in a different order than they exist in my data file.
So why do the data entries show up in a different order when loading in bulk? How are the data entries queued to be loaded and also how are they indexed into the table?
BigQuery has no notion of indexes. Data in BigQuery tables has no particular order that you can rely on. If you need to get ordered data out of BigQuery you will need an explicit ORDER BY in your query, which, by the way, is not recommended for large results as it increases resource cost and can end in a Resources Exceeded error.
BigQuery's internal storage can "shuffle" your data rows internally for the best / most optimal query performance. So again, there is no such thing as a physical order of data in BigQuery tables.
The official language in the docs is: line ordering is not guaranteed for compressed or uncompressed files.
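Since no physical order is guaranteed, a common workaround is to carry an explicit ordering column in the data and ORDER BY it at query time. A minimal sketch of stamping ndjson lines with a sequence number before loading (field and table names are hypothetical):

```python
import json

def add_sequence_numbers(ndjson_lines):
    """Attach an explicit load_order field to each ndjson record so queries
    can ORDER BY load_order instead of relying on physical row order."""
    out = []
    for i, line in enumerate(ndjson_lines):
        record = json.loads(line)
        record["load_order"] = i
        out.append(json.dumps(record))
    return out

lines = ['{"name": "a"}', '{"name": "b"}']
for line in add_sequence_numbers(lines):
    print(line)

# The table can then be queried in order with something like:
#   SELECT * FROM mydataset.mytable ORDER BY load_order
```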

BigQuery and Tableau

I attached Tableau to BigQuery and was working on the dashboards. The issue here is that BigQuery charges for the data a query scans every time.
My table is 200 GB of data. When someone opens the dashboard in Tableau, it queries the whole table. Using any filter on the dashboard, it runs against the whole table again.
On 200 GB of data, if someone applies 5 filters in different analyses, BigQuery bills 200 GB * 5 = 1 TB (nearly). For one day of testing the analysis we were charged for 30 TB of analysis, but the table behind it is only 200 GB. Is there any way I can restrict Tableau from running over the whole table in BigQuery every time there is any change?
The extract in Tableau is indeed one valid strategy, but only when you are using a custom query. If you directly access the table it won't work, as that will download 200 GB to your machine.
Other options to limit the amount of data are:
Don't select columns you don't need. Do this by hiding unused fields in Tableau; it will not include those fields in the query it sends to BigQuery. Otherwise it's a SELECT * and you pay for the full 200 GB even if you don't use those fields.
Another option that we use a lot is partitioning our tables. For instance, a partition per day of data if you have a date field. Using TABLE_DATE_RANGE and TABLE_QUERY functions you can then smartly limit the amount of partitions and hence rows that Tableau will query. I usually hide the complexity of these table wildcard functions away in a view. And then I use the view in Tableau. Another option is to use a parameter in Tableau to control the TABLE_DATE_RANGE.
1) Right now I am learning BQ + Tableau too, and I found that using "Extract" is a must for BQ in Tableau. With this option you can also save time building dashboards. So my current pipeline is: "Build query > Add it to Tableau > Make dashboard > Upload dashboard to Tableau Online > Schedule update for Extract".
2) You can send Custom Quota Request to Google and set up limits per project/per user.
3) If each of your queries touches 200 GB every time, consider optimizing these queries (don't use SELECT *, use only the dates you need, etc.).
The best approach I found was to partition the table in BQ based on a date (day) field which has no timestamp. BQ allows you to partition a table by a day level field. The important thing here is that even though the field is day/date with no timestamp it should be a TIMESTAMP datatype in the BQ table. i.e. you will end up with a column in BQ with data looking like this:
2018-01-01 00:00:00.000 UTC
The reason the field needs to be a TIMESTAMP datatype (even though there is no time in the data) is that when you create a viz in Tableau it generates SQL to run against BQ, and for the partitioned field to be utilised by the Tableau-generated SQL it needs to be a TIMESTAMP datatype.
In Tableau, you should always filter on your partitioned field and BQ will only scan the rows within the ranges of the filter.
I tried partitioning on a DATE datatype and looked up the logs in GCP and saw that the entire table was being scanned. Changing to TIMESTAMP fixed this.
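A sketch of the kind of partition-pruning filter described above; the field name and dates are hypothetical, and the half-open TIMESTAMP range is what lets BigQuery scan only the partitions inside it:

```python
from datetime import date, timedelta

def partition_filter(field, day, days=1):
    """Build a half-open TIMESTAMP range predicate on a partitioned field,
    so BigQuery scans only the partitions inside the range."""
    start = day
    end = day + timedelta(days=days)
    return (
        f"{field} >= TIMESTAMP('{start.isoformat()}') "
        f"AND {field} < TIMESTAMP('{end.isoformat()}')"
    )

print(partition_filter("event_day", date(2018, 1, 1)))
# event_day >= TIMESTAMP('2018-01-01') AND event_day < TIMESTAMP('2018-01-02')
```

In Tableau the same effect comes from filtering on the partitioned field directly, as described above.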
The thing about Tableau and BigQuery is that Tableau calculates the filter values using your query (live query). What I have seen in my project's logging is that it creates filters from your own query:
select `Custom SQL Query`.filtered_column from ( your_actual_datasource_query ) as `Custom SQL Query` group by `Custom SQL Query`.filtered_column
Instead, try to create the Tableau data source with incremental extracts, and also try to have your query date-partitioned (BigQuery only supports date partitioning) so that you can limit the data use.

Querying a Google Fusion Table

I have a Google Fusion Table with 3 row layouts.
We can query the fusion table as,
var query = new google.visualization.Query("https://www.google.com/fusiontables/gvizdata?tq=select * from *******************");
which selects the data from the first row layout, i.e. Rows 1, by default. Is there any way we can query the second or third row layout of a Fusion Table?
API queries apply to the actual table data. The row layout tabs are just different views onto that data. You can get the actual query being executed for a tab with Tools > Publish; the HTML/JavaScript contains the FusionTablesLayer request.
I would recommend using the regular Fusion Tables API rather than the gvizdata API because it's much more flexible and not limited to 500 response rows.
The documentation for querying a Fusion Tables source has not been updated yet to account for the new structure, so this is just a guess. Try appending #rows:id=2 to the end of your table id:
select * from <table id>#rows:id=2
A couple of things:
Querying Fusion Tables with SQL is deprecated. Please see the porting guide.
Check out the Working With Rows part of the documentation. I believe this has your answers.

How to map dynamic dynamoDB columns in EMR Hive

I have a table in Amazon dynamoDB with a record structure like
{"username" : "joe bloggs" , "products" : ["1","2"] , "expires1" : "01/01/2013" , "expires2" : "01/02/2013"}
where the products property is a list of products belonging to the user and the expiresN properties relate to the products in the list; the list of products is dynamic and there are many. I need to transfer this data to S3 in a format like
joe bloggs|1|01/01/2013
joe bloggs|2|01/02/2013
Using Hive external tables I can map the username and products columns in DynamoDB; however, I am unable to map the dynamic columns. Is there a way that I could extend or adapt the org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler in order to interpret and structure the data retrieved from Dynamo before Hive ingests it? Or is there an alternative solution to convert the Dynamo data to first normal form?
One of my key requirements is that I maintain the throttling provided by the dynamodb.throughput.read.percent setting, so that I do not compromise operational use of the table.
You could build a specific UDTF (user-defined table-generating function) for that case.
I'm not sure how Hive handles an asterisk (probably needed for your case) as an argument for the function.
Something like what explode (source) does.
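For illustration, the flattening such a UDTF would perform can be sketched in plain Python, using the record from the question. This is only the first-normal-form logic, not a Hive UDTF:

```python
def flatten_record(record):
    """Expand one DynamoDB-style record into first-normal-form rows by
    pairing each product with its matching expiresN attribute."""
    rows = []
    for product in record["products"]:
        expires = record.get("expires" + product, "")
        rows.append("|".join([record["username"], product, expires]))
    return rows

record = {
    "username": "joe bloggs",
    "products": ["1", "2"],
    "expires1": "01/01/2013",
    "expires2": "01/02/2013",
}
for row in flatten_record(record):
    print(row)
# joe bloggs|1|01/01/2013
# joe bloggs|2|01/02/2013
```

A Hive UDTF would emit the same pairs as separate output rows, one per product.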