How to translate internal BQ column userid (INTEGER) to e-mail - google-bigquery

I'm extracting data from Scheduled Queries using this command
bq ls --transfer_config --transfer_location=us --format=csv
One column in the result is called userid (data type INTEGER), and it refers to the user who created the scheduled query. Hence it is quite important information.
I'd like to transform this value into a more readable format, namely an e-mail address, but I haven't been able to find out how.
PS: I wonder why this internal value is used here. In other BigQuery system data, user names are always presented in readable e-mail format. (Maybe it's because Scheduled Queries is still in beta.)

userID has been deprecated according to the docs.
Please don't rely on that field; there is no replacement field in place right now. AFAIK you can only obtain this information from the audit logs.
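If you have Cloud Audit Logs exported to BigQuery, a lookup along these lines may work. This is a hedged sketch: the project, dataset and table names below depend on your log sink, and the method-name filter is an assumption, not verified.
SELECT
  protopayload_auditlog.authenticationInfo.principalEmail AS creator_email,
  protopayload_auditlog.resourceName AS transfer_config
FROM `myproject.audit_logs.cloudaudit_googleapis_com_activity_*` -- assumed export table
WHERE protopayload_auditlog.methodName LIKE '%CreateTransferConfig%' -- assumed method name
The principalEmail field should give you the e-mail address that the userid column hides.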

Related

Is the column of type JSON deprecated?

In the BigQuery console, when creating a table, there used to be a JSON option among the column types, but weirdly enough it was never present in the docs. We used this column type in our production tables and discovered later on that you can't select such a column in queries, otherwise BigQuery throws an error, and the JSON functions didn't work with it either. So we simply stopped using this column in queries, but it still exists in our tables.
However, in the past couple of days, all queries against this table have been failing with the error 400 Json is not enabled for current project. and the column type is no longer present in the BigQuery console. It seems it was removed or deprecated? I checked the release notes, but the latest release was well before the error occurred. This broke our production environment, and we couldn't even export the data, because exporting gave the same error. Instead we had to use a new table without this column, which meant we lost all our history.
Did anyone face the same problem with other column types before? Is it normal that a type is deprecated without users being notified beforehand? This is making me question the reliability of BigQuery.
Please reach out to Google Cloud support and we will help you fix your issue with that problematic table. You may also want to try fixing it yourself using the ALTER TABLE DROP COLUMN statement, which is currently in public preview [1]. This will drop the erroneous column (only the data in that column will be lost); the rest of the data will remain usable.
[1] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#alter_table_drop_column_statement
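For example, a minimal sketch (the dataset, table and column names are made up):
ALTER TABLE mydataset.mytable
DROP COLUMN json_col -- the problematic JSON-typed column; its data is lost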
I ran into the same error message a few days ago and was surprised to read about this policy change, which is not backed by a mitigation process. My attempt to follow Vlad Grachev's suggestion to drop the column did not succeed, as the console does not allow querying this table (same "Json is not enabled for current project." error).
My only remediation at this point (sketched below) is:
build a new table where the json column is switched to type string
create a pipeline that transforms the objects to strings
migrate the data through the pipeline to the new table
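A minimal sketch of the target table for this workaround, assuming the JSON column was called payload (all names here are made up); the pipeline then writes each object serialized as a string:
CREATE TABLE mydataset.jobcache_v2 (
  id INT64,
  payload STRING -- formerly the JSON-typed column, now a serialized string
)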
In BigQuery, JSON data can be stored in a column of type "RECORD". Are you referring to the same thing by the JSON column type?
BigQuery uses the RECORD (or STRUCT) type to represent nested structures. A column of RECORD type is in fact a large column containing multiple child columns. For more information, refer to the link below:
Json Data in BigQuery
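For illustration, a query that produces a RECORD/STRUCT value (the field names are made up):
SELECT STRUCT('Alice' AS name, 'alice@example.com' AS email) AS customer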
If you are not referring to the RECORD data type, the JSON column type might have been a test feature that was not subject to the usual deprecation scheme.

RediSql (for redis): Get column names as well as data type?

I am using the excellent RediSql, a module for Redis, to get a powerful caching solution.
When sending a command to Redis that interacts with the SQLite db in the background, like this:
REDISQL.EXEC db "SELECT * FROM jobcache"
I get a result that includes a type for the integer column, but not for the string column, and no column names are provided.
Is there a way to get column name and defined data type always? I would need this, as I need to convert the results back to a more standard sql result format.
Unfortunately, at the moment this is not possible with the EXEC command.
You can use the QUERY.INTO command instead (see the command reference).
QUERY.INTO adds the result of your query to a stream, including the column names and the values for each row. Then you can consume the stream in whichever way you prefer.
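A hedged usage sketch (the stream name is made up; check the RediSQL reference for the exact syntax):
REDISQL.QUERY.INTO result_stream db "SELECT * FROM jobcache"
XRANGE result_stream - +
The XRANGE call then reads the column names and row values back out of the stream.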
When doing queries (reads) against RediSQL, it is good practice to use the .QUERY family of commands; this avoids useless replication of data in case you are in a cluster setup.
Moreover, the .QUERY commands can also be used against replicas of the main Redis instance, while the .EXEC commands can be used only against the primary instance.

Azure Stream Analytics -> how much control over path prefix do I really have?

I'd like to set the prefix based on some of the data coming from event hub.
My data is something like:
{"id":"1234",...}
I'd like to write a blob prefix that is something like:
foo/{id}/guid....
Ultimately I'd like to have one blob for each id. This will help with how the data gets consumed downstream by a couple of things.
What I don't see is a way to create prefixes that aren't related to date and time. In theory I can write another job to pull from blobs and break it up after the stream analytics step. However, it feels like SA should allow me to break it up immediately.
Any ideas?
{date}, {time} and {partition} are the only tokens supported in the blob output prefix. {partition} is a number.
Using a column value in blob prefix is currently not supported.
If you have a limited number of such {id}s, you could work around this by writing multiple "SELECT ... INTO" statements with different filters, each writing to a different output with the prefix hardcoded in that output. Otherwise it is not possible with just ASA.
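A sketch of that workaround (the input and output aliases are made up; each output would be configured with its own hardcoded prefix, e.g. foo/1234/):
SELECT * INTO Output1234 FROM EventHubInput WHERE id = '1234'
SELECT * INTO Output5678 FROM EventHubInput WHERE id = '5678'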
It should be noted that you can now actually do this. I'm not sure when it was implemented, but you can now use a single property from your message as a custom partition key, and the syntax is exactly what the OP asked for: foo/{id}/something/else
More details are documented here: https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-custom-path-patterns-blob-storage-output
Key points:
Only one custom property allowed
Must be a direct reference to an existing message property (i.e. no concatenations like {prop1+prop2})
If the custom property results in too many partitions (more than 8,000), then an arbitrary number of blobs may be created for the same partition
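Putting it together, the path prefix pattern in the blob output configuration could look something like this ({id} is the property from the question; the datetime tokens are an illustrative addition):
foo/{id}/{datetime:yyyy}/{datetime:MM}/{datetime:dd}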

Error: Schema changed for Timestamp field (additional)

I am getting an error message when I query a specific table in my dataset that has a nullable timestamp field. In the BigQuery web tool, I run a simple query, e.g.:
SELECT * FROM [reztrack.201401] LIMIT 100
The result I get is: Error: Schema changed for Timestamp field date
Example Job ID: esiteisthebomb:job_6WKi7ZhSi8D_Ewr8b5rKV-a5Eac
This is the exact same issue that was noted here: Error: Schema changed for Timestamp field.
I also logged this under https://code.google.com/p/google-bigquery/issues/detail?id=307, but I was unsure, since it said we should be logging everything on Stack Overflow.
Any information on how to fix this for this or other tables would be greatly appreciated.
Note: The original answer said to contact Google support, but Google support for BigQuery was moved to Stack Overflow. Therefore I assume that means to open it as a new question in the hope that the engineers will respond.
BigQuery recently improved the representation of its internal timestamp format (there had previously been a lot of cases where timestamps broke in strange ways, and this change should fix that). Your table was still using the old timestamp format, and you tickled a bug in the old format when the schema changed (in this case, the field went from REQUIRED to OPTIONAL).
We have an automated process that coalesces tables to make their storage more efficient. I scheduled this to run over your table, and have verified that it has rewritten your table using the new timestamp format.
You should now be able to query this field of your table without further problems.

Removing privacy data from a database?

Say that I needed to share a database with a partner. Obviously I have customer information in that database. Short of going through and identifying every column that contains private information and writing a custom script to 'scrub' the data, is there any tool or script that can scrub the data but keep the format intact (for example, a 5-character string would stay 5 characters, only scrubbed)?
If not, how would you accomplish something like this, preferably in TSQL?
You may consider sharing only views: create VIEWs that hide the data you don't want to share.
Example:
CREATE VIEW v_customer
AS
SELECT
    NAME,
    LEFT(CreditCard, 5) + '****' AS CreditCard -- or don't show this column at all
    -- ...other non-sensitive columns
FROM customer
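You would then grant the partner access to the view only, for example (the user name is made up):
GRANT SELECT ON v_customer TO partner_user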
Firstly I need to state a professional interest: I work for IBM, which has tools that do exactly this.
Step 1. Ensure you identify all the PII (Personally Identifiable Information). When sharing database information, the obvious column names like "name" are typically found, but you also need to find the "hidden" data, where the data is either embedded in a standard format (e.g. string-name-string) under a column name like "reference code", or sits in free-format text fields. As you have seen, this is not going to be an easy job unless you automate it. The tool for this is InfoSphere Discovery.
Step 2. Consider what context the "scrubbed" data needs to be in. Changing name fields to random characters causes problems when testing, as users focus on text errors rather than functional failures; therefore change names to real but fictitious ones. Credit card information often needs to be "valid": by that I mean it needs to have a valid prefix, say 49XX, but the rest an invalid sequence. Finally, you need to ensure that every instance of the change is propagated through the database to maintain consistency. The tool for this is Optim Test Data Management with the Data Privacy option.
The two tools integrate to give a full data privacy solution.
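For illustration, a minimal T-SQL sketch of the "valid prefix" idea from Step 2 (table and column names are assumptions; real tooling would also propagate the change consistently across tables):
UPDATE customers
SET CreditCard = LEFT(CreditCard, 4) + REPLICATE('0', LEN(CreditCard) - 4)
-- keep a plausible prefix, replace the rest with an invalid sequence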
Based on the original question, it seems you need the fields to be the same length, but not in a "valid" format? How about:
UPDATE customers
SET email = REPLICATE('z', LEN(email))
-- additional fields as needed
Copy/paste and rename tables/fields as appropriate. I think you're going to have a hard time finding a tool that's less work, unless your schema is very complicated, or my formatting assumptions are incorrect.
I don't have an MSSQL database in front of me right now, but you can also find all of the string-like columns by something like:
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN ('...', '...') -- likely candidates: 'varchar', 'nvarchar', 'char', 'nchar'
I don't remember the exact values you need to compare against, but if you run the query and see what's there, they should be pretty self-explanatory.
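Combining the two ideas, you could even generate the scrub statements from the metadata. A hedged sketch: it assumes every string column in customers should be scrubbed, and the data-type list is my guess at the common SQL Server string types.
SELECT 'UPDATE customers SET ' + COLUMN_NAME +
       ' = REPLICATE(''z'', LEN(' + COLUMN_NAME + '));'
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'customers'
  AND DATA_TYPE IN ('varchar', 'nvarchar', 'char', 'nchar')
Run the generated statements against the copy of the database you intend to share.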