Prevent output for query --destination_table command - google-bigquery

Is there a way to prevent screen output when running query --destination_table?
I want to move data sets through the workflow, but not necessarily see all the rows.
bug on job_73d3dffab7974d9db360f5c31a3a9fa7

This is a known issue; we'll fix it in the next version of bq. To work around it, you can add --max_rows=0. This only changes the number of rows sent back to the client, not the number of rows the query produces (use LIMIT N in the query for that).
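For example, a run that materializes results into a destination table without echoing rows might look like this (the dataset and table names here are placeholders):

```
bq query --destination_table=mydataset.mytable --max_rows=0 \
    'SELECT word, corpus FROM [publicdata:samples.shakespeare]'
```

--max_rows=0 only suppresses the echoed result rows; the destination table still receives the full query output.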

Related

How to know the jobId of the first run of a cached query in BigQuery?

When we run a query in BigQuery, the results are cached in a temporary table. For the next 24 hours (with some exceptions), subsequent runs of the same query fetch their results from that cache. My use case is: on a subsequent run, I want to know which jobId the cached results came from, i.e. the original first run of the query.
I have checked all the Java docs related to query and didn't find that info. There is a cacheHit variable, which tells you whether the query was served from the cache or not. I want to go one step further and know which jobId the results were fetched from. I expected that maybe this method would give me the info, but I always get a null value for it. I also want to know what parentJob means in the BigQuery context.
It's unclear why you'd care about this other than as a technical exercise; if you want to build your own application-level caching layer, that's a different concern. More details about query caching can be found at https://cloud.google.com/bigquery/docs/cached-results.
The easiest way to do this would probably be to traverse jobs.list until you find a job that has the same destination table (it will carry an anonymous-table prefix) and whose cacheHit statistic is false or not present.
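A minimal sketch of that traversal, assuming the google-cloud-bigquery Java client (the method name findOriginalJob and its arguments are mine, not part of the API):

```java
import com.google.api.gax.paging.Page;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobStatistics.QueryStatistics;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class FindCacheSourceJob {
    // Given the (anonymous) destination table of a cache-hit query job, walk
    // recent jobs and return the first query job that wrote the same table
    // without itself being a cache hit: that is the run that executed the query.
    static Job findOriginalJob(BigQuery bigquery, TableId cachedTable) {
        Page<Job> page = bigquery.listJobs(BigQuery.JobListOption.pageSize(100));
        for (Job job : page.iterateAll()) {
            if (!(job.getConfiguration() instanceof QueryJobConfiguration)) {
                continue; // skip load/copy/extract jobs
            }
            QueryJobConfiguration config = job.getConfiguration();
            QueryStatistics stats = job.getStatistics();
            boolean cacheHit = Boolean.TRUE.equals(stats.getCacheHit());
            if (!cacheHit && cachedTable.equals(config.getDestinationTable())) {
                return job;
            }
        }
        return null; // no matching non-cached run found in the listed window
    }
}
```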
Your inquiry about parentJob is unrelated to this exercise: it is for finding all the child jobs created as part of a script or multi-statement execution. More information about this can be found at https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting-concepts.

Pentaho "Return value id can't be found in the input row"

I have a Pentaho transformation that reads a text file and checks some conditions (which can produce errors, such as a number that should be positive). From these errors I create an Excel file, and in my job I need the number of lines in this error file, plus a log of which lines had problems.
The problem is that sometimes I get the error "The return value id can't be found in the input row".
This error does not happen every time. The job runs every night, and sometimes it works without any problems for a month, and then one day the error simply appears.
I don't think the file is the cause, because if I execute the job again with the same file it works. I can't understand why it fails: the message mentions the value "id", but I don't have such a value/column. Why is it searching for a value that doesn't exist?
Another strange thing is that the step that fails should normally not be executed at all (as far as I know), because no errors were found, so no rows reach that step.
Maybe the problem is connected with the "Prioritize Streams" step? That is where I collect all the errors (which use exactly the same columns). I tried putting a sort before the grouping steps, but it didn't help. Now I'm thinking of trying a "Blocking step".
The problem is that I don't know why this happens or how to fix it. Any suggestions?
Check that all your aggregates in the Group By step have a name.
However, sometimes the error comes from a previous step: the Group By (count...) requests data from the Prioritize Streams step, and if that step has an error, the error gets reported, misleadingly, as coming from the Group By rather than from the Prioritize Streams.
Also, you mention a step that should not be executed because there is no data: I do not see any Filter that would prevent rows with a missing id from flowing from the Prioritize Streams step to the count.
This is a bug. It happens randomly in one of my transformations, which often ends up with an empty stream (no rows). It mostly works, but once in a while it gives this error. It seems to fail only when the stream is empty, though.

BigQuery paging issues with tableData.list()

We're trying to load 120,000 rows from a BigQuery table using tableData.list. Our first call,
https://www.googleapis.com/bigquery/v2/projects/{myProject}/datasets/{myDataSet}/tables/{myTable}/data?maxResults=90000
returns a pageToken as expected and the first 22,482 rows (1 to 22,482). We assume this is due to the 10 MB serialized-JSON limit. Our second call, however,
https://www.googleapis.com/bigquery/v2/projects/{myProject}/datasets/{myDataSet}/tables/{myTable}/data?maxResults=90000&pageToken=CIDBB777777QOGQIBCIL6BIQSC7QK===
returns not the next rows but 22,137 rows starting at row 90,001 and running through row 112,137, without a pageToken.
Strangely, if we change maxResults to 100,000, we get rows starting from 100,000.
We are able to work around this by using startRowIndex to page. In that case, we start with startRowIndex=0 and never get a pageToken in the response; we keep making calls until all rows are retrieved. The concern, however, is that without pageTokens, if the row order changes while the calls are being made, the data could be invalid.
We are able to reproduce this behavior on multiple tables with different sizes and structures.
Is there any issue with paging, or should we be structuring our calls differently?
This is a known, high-priority bug, and it has been fixed.
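With the fix in place, token-based paging should behave as expected. For reference, a minimal sketch of both paging styles using the google-cloud-bigquery Java client (project, dataset, and table names are placeholders):

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQuery.TableDataListOption;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableResult;

public class ListTableData {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId tableId = TableId.of("myProject", "myDataSet", "myTable");

        // Token-based paging: keep following the page token until none remains.
        // The service may return fewer rows per page than maxResults when the
        // response size limit is hit; the token still advances correctly.
        TableResult page = bigquery.listTableData(tableId, TableDataListOption.pageSize(90000));
        long count = 0;
        while (page != null) {
            for (FieldValueList row : page.getValues()) {
                count++; // process each row here
            }
            page = page.hasNextPage() ? page.getNextPage() : null;
        }
        System.out.println("Fetched " + count + " rows");

        // Index-based paging (the startRowIndex workaround described above):
        // TableDataListOption.startIndex(rowOffset) requests rows from a fixed
        // offset, at the cost of no snapshot consistency between calls.
    }
}
```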

How to get the next 1000 records the fastest way

I'm using Azure Table Storage.
Let's say I have a partition in my table with 10,000 records, and I would like to get records number 1000 to 1999. Next time I would like to get records number 4000 to 4999, and so on.
What is the fastest way of doing that?
All I can find so far are two options, neither of which I like very much:
1. Run a query that returns all 10,000 records, and filter out what I want once I have them all.
2. Run a query that returns 1000 records at a time, and use a continuation token to get the next 1000.
Is it possible to get a continuation token without downloading all the corresponding records? It would be great if I could get continuation token 1, then continuation token 2, and with CT2 get records 2000 to 2999.
Theoretically you should be able to use continuation tokens without downloading the actual data for the first 1000 records by closing the connection after the first request. And I mean closing it at the TCP level, before you have read all the data. Then open a new connection and use the continuation token there. Two WebRequests will not do it, since the HTTP implementation will likely use keep-alive, which means all your data is going to be read in the background even though you don't read it in your code. Actually, you can configure your HTTP requests not to use keep-alive.
However, another way is if you know the RowKey and can search on that, but I assume you don't know which row keys will fall in each batch of 1000 entities.
Lastly, I would ask why you have this problem in the first place, and what your access pattern is. If inserts are common and reads of these records are rare, I wouldn't bother making it more efficient. If this is a paging problem, I would probably fetch all the data on the first request and cache it (in the cloud). If inserts are rare but you need to run this query often, I would consider structuring the inserts so there is one partition for every 1000 entities, rebalancing as needed (due to sorting) as entities are inserted.
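To make the trade-off concrete, here is a minimal sketch of the token-walking approach (option 2) using the classic com.microsoft.azure.storage Java SDK; the table name, partition key, and connection-string variable are placeholders. Note that each skipped segment is still downloaded unless you resort to the TCP-level trick described above:

```java
import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.ResultContinuation;
import com.microsoft.azure.storage.ResultSegment;
import com.microsoft.azure.storage.table.CloudTable;
import com.microsoft.azure.storage.table.CloudTableClient;
import com.microsoft.azure.storage.table.DynamicTableEntity;
import com.microsoft.azure.storage.table.TableQuery;

public class SkipToSegment {
    public static void main(String[] args) throws Exception {
        CloudStorageAccount account =
            CloudStorageAccount.parse(System.getenv("AZURE_STORAGE_CONNECTION_STRING"));
        CloudTableClient client = account.createCloudTableClient();
        CloudTable table = client.getTableReference("mytable");

        TableQuery<DynamicTableEntity> query = TableQuery
            .from(DynamicTableEntity.class)
            .where(TableQuery.generateFilterCondition(
                "PartitionKey", TableQuery.QueryComparisons.EQUAL, "myPartition"))
            .take(1000);

        // Walk segments of 1000 entities. To reach records 4000-4999 we must
        // execute the first four segments anyway, because the service only
        // hands out a continuation token together with a result segment.
        ResultContinuation token = null;
        int segment = 0;
        do {
            ResultSegment<DynamicTableEntity> result = table.executeSegmented(query, token);
            token = result.getContinuationToken();
            if (segment == 4) { // fifth segment = records 4000-4999
                for (DynamicTableEntity e : result.getResults()) {
                    // process entity here
                }
                break;
            }
            segment++;
        } while (token != null);
    }
}
```

If the entities are wide, projecting only the key columns on the skip-ahead segments (via the query's select) should reduce how much data those throwaway requests carry, though the round trips remain.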

SSIS 2005 - Ignore row insert failures

I would like to ignore errors that may occur when a batch is committed; in my case, duplicates in unique columns.
The OLE DB Destination Error Output is set to "Ignore failure", but it is still failing. The Data Flow "stop on failure" properties are set to false, and the MaximumErrorCount to 0.
I don't want to use row redirection, so that I can keep the fast-load mode.
Thank you
A few comments:
You can't use "Ignore failure" here, because ignoring row errors still passes the records to the destination. You have to use redirection to get rid of the bad rows.
If you don't want to keep a copy of the bad rows, you can send them to a Row Count transformation, since that has minimal performance impact. Alternatively, you can output the bad rows to a flat file or to another table so you can review the errors at a later date.
Fast-load options are properties of the destination, not the source. You can use fast load even if you redirect error rows somewhere else. I ran a performance test on a million-row data set with the fast-load ORDER option, and performance was essentially identical when I added error redirection and sent 500K rows to a Row Count transformation. I also verified that performance was slower when I removed the fast-load option, so I'm confident the redirection has zero impact.
I finally redirected the error stream into a test node (which checks the error code to confirm it is a row-insertion error), which then redirects those rows into an "OLE DB Query" node where I do nothing but SELECT 1, simply to ignore them.