BigQuery loading incomplete dataset from Cloud Storage? - google-bigquery

I want to upload a dataset of Tweets from Cloud Storage. I have a schema based on https://github.com/twitterdev/twitter-for-bigquery, but simpler, because I don't need all the fields.
I uploaded several files to Cloud Storage and then manually imported them into BigQuery. I have tried loading each file both ignoring unknown fields and not ignoring them, but I always end up with a table with fewer rows than the original dataset. Just in case, I made sure to remove duplicate rows from each dataset.
Example job-ids:
cellular-dream-110102:job_ZqnMTr17Yx_KKGEuec3qfA0DWMo (loaded 1,457,794 rows, but the dataset contained 2,387,666)
cellular-dream-110102:job_2xfbTFSvvs-unpP6xZXAfDeDjic (loaded 1,151,122 rows, but the dataset contained 3,265,405).
I don't know why this happens. I have tried simplifying the schema further, as well as ensuring that the dataset is clean (no repeated rows, no invalid data, and so on). The curious thing is that if I take a small sample of tweets (say, 10,000) and upload that file manually, it works: all 10,000 rows load.
How can I find what is causing the problem?
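One thing worth checking before blaming the loader: tweet text frequently contains embedded newlines, so a line-based count of the source file can disagree with the number of records a strict CSV parser sees, and a load job configured with maxBadRecords > 0 will skip malformed rows without failing (for CSV, quoted newlines are also rejected unless allowQuotedNewlines is enabled). It is also worth pulling the finished job with `bq show -j <job-id>` and inspecting status.errors. A minimal local sanity check, using a made-up sample:

```python
import csv
import io

# Made-up sample: the second record's text field contains an embedded
# newline, as tweet text often does.
raw = 'id,text\n1,"hello"\n2,"line one\nline two"\n'

physical_lines = raw.strip("\n").count("\n") + 1  # what `wc -l` sees
records = list(csv.reader(io.StringIO(raw)))[1:]  # skip the header row

print(physical_lines)  # 4 physical lines...
print(len(records))    # ...but only 2 real records
```

If the two counts diverge on your real files, the row deficit is in the export format, not in BigQuery.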

Related

Selecting a Subset of Columns for BigQuery Table Load

I am working with a 3rd-party dataset that has been loaded into a Google Cloud Storage bucket, and I am attempting to load it into BigQuery to run some pruning steps. The dataset has a set of columns that don't follow the BigQuery column naming convention, so the load fails when considering the full file (which is approximately 4,000 partitions and a little more than a TB).
I could build out a distributed Python process to drop/rename the offending columns, but this seems like an inefficient use of computational resources. Does BigQuery offer some direct method of loading a subset of columns from a Parquet file?
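As far as I know, a plain load job doesn't let you project or rename Parquet columns at load time. If you do end up renaming the offending columns upstream, the rename rule itself is cheap to express; a sketch following BigQuery's documented column rules (letters, digits, and underscores, not starting with a digit), where to_bq_column is my own name:

```python
import re

def to_bq_column(name: str) -> str:
    """Rewrite a column name to follow BigQuery's rules: only letters,
    digits, and underscores, and not starting with a digit."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned

print(to_bq_column("order.date"))  # order_date
print(to_bq_column("2021-sales"))  # _2021_sales
```

A mapping like this can be applied as part of whatever single-pass rewrite you run, rather than a full distributed job.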

ETL on S3 : Duplicate rows : how to update old entries?

During my ETL imports, some pre-synchronized entries are supplied multiple times by my source (because the service updates them) and are therefore imported multiple times into AWS. I would like to implement a structure that overwrites an entry if it already exists (something close to a key-value store, for the few rows that are updated more than once).
My requirements entail operating on one terabyte of data with Glue (or potentially Redshift).
I implemented the solution as follows:
I read the data from my source.
I save each entry in a different file, choosing the unique identifier of the content as the file name.
I index my raw data with a Glue crawler scanning new files on S3.
I run a Glue job to transform the raw data into an OLAP-compliant format (Parquet).
Is this the right way to proceed?
This seems correct to me, even if I have concerns about the large number of separate files in my raw data (one file per entry).
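The key-value overwrite semantics of the layout described above can be sketched locally (a stand-in for S3; the helper name is my own):

```python
import json
import tempfile
from pathlib import Path

# Local stand-in for the S3 bucket: one file per unique entry id, so a
# re-imported entry overwrites the old version instead of duplicating it.
root = Path(tempfile.mkdtemp())

def put_entry(entry: dict) -> None:
    # The content's unique identifier becomes the object key / file name.
    (root / f"{entry['id']}.json").write_text(json.dumps(entry))

put_entry({"id": "42", "value": "first version"})
put_entry({"id": "42", "value": "updated version"})  # same key: overwrite

files = list(root.glob("*.json"))
print(len(files))  # 1 file, not 2
print(json.loads(files[0].read_text())["value"])  # updated version
```

This is exactly why the one-file-per-entry scheme deduplicates for free: a PUT to an existing key replaces the object.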
Thank you,
Hugo

Is partitioning helpful in Amazon Athena if query doesn't filter based on partition?

I have a large amount of data, but there is no particular column I would like to filter based on (that is, my 'where clause' can be any column). In this scenario, does partitioning provide any benefit (maybe helps with read-parallelism?) when the queries end up scanning all the data?
If there is no column that all, or most, queries would filter on, then partitions will only hurt performance. Instead, aim for files of around 100 MB, as few as possible, in Parquet if possible, and put all files directly under the table's LOCATION.
The reason partitions can hurt performance is that when Athena starts executing your query it lists all files, and it does so as if S3 were a file system. It starts by listing the table's LOCATION, and if it finds anything that looks like a directory it lists that separately, and so on, recursively. If you have a deep directory structure, this can end up taking a lot of time. You want to help Athena by keeping all your files in a flat structure, but also fewer than 1,000 of them, because that's the page size for S3's list operation. With more than 1,000 files you want directories so that Athena can parallelize the listing (but still as few as possible, because there's a limit to how many listings it will do in parallel).
You want to keep file sizes to around 100 MB because that's a good size that trades off how long it takes to process a file against the overhead of getting it from S3. The exact recommendation is 128 MB.
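The listing overhead described above can be made concrete with a rough back-of-the-envelope model (illustrative only, not an exact model of Athena's planner):

```python
import math

# Rough model: S3 list pages hold 1,000 keys, and each "directory" level
# is listed separately, so deep trees multiply the number of list calls.
def list_calls_flat(n_files: int) -> int:
    return math.ceil(n_files / 1000)

def list_calls_nested(n_dirs: int, files_per_dir: int) -> int:
    # One pass to discover the directories, plus one listing per directory.
    return math.ceil(n_dirs / 1000) + n_dirs * math.ceil(files_per_dir / 1000)

print(list_calls_flat(900))       # 1 list call for 900 flat files
print(list_calls_nested(900, 1))  # 901 calls for the same 900 files
```

The same 900 files cost one list call when flat, but hundreds when scattered one-per-directory.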

Can BigQuery table extracted rows be randomized

I am currently extracting a BigQuery table into sharded .csv's in Google Cloud Storage -- is there any way to shuffle/randomize the rows for the extract? The GCS .csv's will be used as training data for a GCMLE model, and the current exports are in a non-random order as they are bunched up by similar "labels".
This causes issues when training a GCMLE model, as you must feed the model random examples of all labels within each batch. GCMLE/TF can randomize the order of rows WITHIN individual .csv's, but there is not (to my knowledge) any way to randomize which rows end up in which .csv across multiple files. So, I am looking for a way to ensure that the rows being output to the .csv's are indeed random.
Can BigQuery table extracted rows be randomized?
No. The Extract Job API (and thus any client built on top of it) offers nothing that would allow you to do so.
I am looking for a way to ensure that the rows being output to the .csv are indeed random.
You could first create tables corresponding to your csv files and then extract them one by one into separate csv files. In this case you control what goes into which csv.
If your concern is the cost of processing (you will need to scan the table as many times as there are csv files), you can check the partitioning approaches in Migrating from non-partitioned to Partitioned tables. This still involves cost, but a substantially reduced one.
Finally, a zero-cost option is to use the Tabledata.list API with paging while distributing the response rows across your csv files; you can do this easily in the client of your choice.
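The Tabledata.list approach can be sketched as dealing the paged rows round-robin across N shard files (shard count and layout are my own choices; in-memory buffers stand in for the GCS csv objects):

```python
import csv
import io

# Deal rows from successive Tabledata.list pages round-robin across N
# shard files, so each shard gets an interleaved mix of the
# label-clustered table order.
NUM_SHARDS = 3
shards = [io.StringIO() for _ in range(NUM_SHARDS)]
writers = [csv.writer(s) for s in shards]

# Made-up rows, clustered by label exactly like the problematic export.
rows = [["label_a", i] for i in range(5)] + [["label_b", i] for i in range(5)]

for i, row in enumerate(rows):
    writers[i % NUM_SHARDS].writerow(row)

for s in shards:
    print(s.getvalue().strip().splitlines())
```

Each shard ends up with rows from both labels. Round-robin only interleaves, it does not truly randomize; for something closer to a real shuffle you could pick the shard per row with random.randrange(NUM_SHARDS) and rely on within-file shuffling for the rest.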

BigQuery streamed data is not in table

I've got an ETL process which streams data from a mongo cluster to BigQuery. This runs via cron on a weekly basis, and manually when needed. I have a separate dataset for each of our customers, with the table structures being identical across them.
I just ran the process, only to find that while all of my data chunks returned a "success" response ({"kind": "bigquery#tableDataInsertAllResponse"}) from the insertAll api, the table is empty for one specific dataset.
I had seen this happen a few times before, but was never able to re-create. I've now run it twice more with the same results. I know my code is working, because the other datasets are properly populated.
There's no 'streaming buffer' in the table details, and running a count(*) query returns 0. I've even tried removing cached results from the query to force freshness - but nothing helps.
Edit - Ten minutes after my data stream (I keep timestamped logs), partial data now appears in the table; however, after another 40 minutes, it doesn't look like any new data is flowing in.
Is anyone else experiencing hiccups in streaming service?
Might be worth mentioning that part of my process is to copy the existing table to a backup table, remove the original table, and recreate it with the latest schema. Could this be affecting the inserts on some specific edge cases?
This is probably what is happening to you: BigQuery table truncation before streaming not working
If you delete or create a table, you must wait at least 2 minutes before starting to stream data into it.
Since you mentioned that all the other tables are working correctly and only the table that goes through the deletion process is not saving data, this probably explains what you are observing.
To fix this issue you can either wait a bit longer before streaming data after the delete and create operations, or change the strategy used to upload the data (for example, saving it into a CSV file and then using a load job to insert the data into the table).
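The "wait before streaming" advice can be wrapped in a small poll-with-backoff helper (a sketch; the readiness predicate here is a stand-in - against the real API you might probe tables.get, or simply sleep for a fixed 2+ minutes as the linked answer suggests):

```python
import time

# Poll-with-backoff sketch: don't stream into a freshly recreated table
# until a readiness check passes, or give up after a timeout.
def wait_until(predicate, timeout_s=180.0, initial_delay_s=1.0):
    delay = initial_delay_s
    waited = 0.0
    while waited < timeout_s:
        if predicate():
            return True
        time.sleep(delay)
        waited += delay
        delay = min(delay * 2, 30.0)
    return False

# Fake readiness check for illustration: succeeds on the third probe.
state = {"probes": 0}
def table_ready() -> bool:
    state["probes"] += 1
    return state["probes"] >= 3

print(wait_until(table_ready, timeout_s=10, initial_delay_s=0.01))  # True
```

Note that even a successful metadata probe does not guarantee streaming readiness, so treating the 2-minute figure as a hard minimum delay is the safer interpretation.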