I am trying to populate and collect metadata for the business in GBQ. Basically, the business doesn't have access to the tables; we create authorised views for them, which they use in their reports.
The problem is that if I populate the column description fields in a table, the views based on that table don't inherit the metadata; the same goes for sharded tables.
There is going to be a degree of data entry to populate the metadata, but I would really like to be able to share it across related views.
Is it possible to automate BQ metadata in any way?
There are several options for getting information about a table or a view (https://cloud.google.com/bigquery/docs/tables#getting_information_about_tables) and for updating that information (https://cloud.google.com/bigquery/docs/updating-views#update-description).
Depending on your specific case, you can use the bq command line tool or a programming language SDK to automate the process of retrieving and updating BigQuery metadata.
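For example, a minimal sketch with the Python client library (google-cloud-bigquery) that copies column descriptions from a base table onto a view built from it might look like the following. The project, dataset, and table names are placeholders, and it assumes the view's field descriptions can be patched the same way a table's can:

    # Copy column descriptions from a base table to a view built on it.
    # All project/dataset/table names below are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    source = client.get_table("my-project.source_dataset.sales")        # base table
    view = client.get_table("my-project.reporting_dataset.sales_view")  # authorised view

    # Map column name -> description from the base table's schema.
    descriptions = {field.name: field.description for field in source.schema}

    # Rebuild the view's schema, carrying over matching descriptions.
    view.schema = [
        bigquery.SchemaField(
            f.name, f.field_type, mode=f.mode,
            description=descriptions.get(f.name, f.description),
            fields=f.fields,
        )
        for f in view.schema
    ]

    # Patch only the schema; other properties are left untouched.
    client.update_table(view, ["schema"])

The same loop could be pointed at each shard of a sharded table, so the data entry only has to happen once on the base table.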
I need to understand the below:
1.) How does one BigQuery table connect to another BigQuery table so that we can apply some logic and produce a new table? For example, if we have an ETL tool like DataStage and some data has been uploaded for us to consume in the form of a BigQuery table, how do I design the job (in DataStage or any other technology) so that the source is one BQ table and the target is another BQ table?
2.) I want my input to be a BigQuery view, run some logic on that view, and then load the result into another BigQuery view.
3.) What is the technology used to connect one BigQuery table to another? Is it HTTPS or some other technology?
Thanks
If you have a large amount of data to process (many GB), you should do the transformation of the data directly in the BigQuery database. It would be very slow to extract all the data, run it through something locally, and send it back. You don't need any outside technology to make one view depend on another view, beyond access to the relevant data.
The ideal job design is an SQL query that BigQuery can process. If you are trying to link tables/views across different projects, then the source BQ table must be listed in the fully-qualified form projectName.datasetName.tableName in the FROM clauses of the SQL query. Project names are globally unique in Google Cloud.
Permissions to access the data must be set up correctly. BigQuery provides fine-grained control over who can access what, as covered in the BQ documentation. You can also enable public access for all BQ users if that is appropriate.
Once you have that SQL query, you can create a new view by sending your SQL to Google BigQuery either through the command line (the bq tool), the web console, or an API.
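As a concrete illustration, here is a minimal sketch using the Python client library (one of the APIs mentioned above) to create a view whose SQL reads a fully-qualified table in another project. Every project, dataset, and table name is a placeholder:

    # Create a view in one project whose SQL reads a table in another project.
    from google.cloud import bigquery

    client = bigquery.Client(project="target-project")

    view = bigquery.Table("target-project.reporting.customer_summary")
    view.view_query = """
        SELECT customer_id, SUM(amount) AS total_amount
        FROM `source-project.sales_dataset.orders`
        GROUP BY customer_id
    """

    client.create_table(view)  # fails if the view already exists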
1) You can use the BigQuery Connector in DataStage to read from and write to BigQuery.
2) BigQuery uses namespaces in the format project.dataset.table to access tables across projects. This allows you to manipulate your data in GCP as if it were all in the same database.
To manipulate your data you can use DML or standard SQL.
To execute your queries you can use the GCP web console or client libraries such as Python or Java (see the sketch after this list).
3) BigQuery is a RESTful web service and uses HTTPS.
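To make point 2 concrete, here is a rough sketch with the Python client library of a job whose source is a table in one project and whose target is a table in another dataset. All project, dataset, and table names are invented for illustration:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    # Write the query result into a destination table ("target" BQ table).
    job_config = bigquery.QueryJobConfig(
        destination="my-project.target_dataset.daily_summary",
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    sql = """
        SELECT store_id, DATE(order_ts) AS order_date, SUM(amount) AS revenue
        FROM `source-project.sales_dataset.orders`
        GROUP BY store_id, order_date
    """

    # The transformation runs entirely inside BigQuery; the client only
    # submits the job over HTTPS and waits for it to finish.
    client.query(sql, job_config=job_config).result()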
Is it possible to disable the ability to use the "Save Results" (see attached image) feature in the Google BigQuery UI for specific users? I can see that we can disable exports through a role permission but can't seem to find a way to disable this functionality for users.
No, there is no way to disable this option in BigQuery. Every user that can run a query is able to save the results.
The most you can do to limit users from specific data is to create BigQuery Views of the tables. By creating views you can limit the access to specific columns and fields.
You have to create the views in a new dataset, grant viewing permission on that dataset to the users you want, and remove their access to the original tables. This way, they will be able to query only the views and save those query results without having access to the sensitive parts of the original tables.
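As a rough sketch of that pattern with the Python client library (all dataset, table, column, and user names here are invented), the view exposes only the non-sensitive columns and the restricted users get READER access on the views dataset only:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    # A view that exposes only the non-sensitive columns of the source table.
    view = bigquery.Table("my-project.reporting_views.customers_public")
    view.view_query = """
        SELECT customer_id, name, country   -- no SSN or other sensitive fields
        FROM `my-project.raw_data.customers`
    """
    client.create_table(view)

    # Grant the restricted users READER access on the views dataset only.
    dataset = client.get_dataset("my-project.reporting_views")
    entries = list(dataset.access_entries)
    entries.append(bigquery.AccessEntry("READER", "userByEmail", "analyst@example.com"))
    dataset.access_entries = entries
    client.update_dataset(dataset, ["access_entries"])

Note that for the users to query the view without any access to the raw dataset, the view also has to be authorized on that source dataset (the authorized views mechanism mentioned earlier).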
I am looking to implement row-level security in GCP. Although authorized views are one way to handle this, my use case has multiple tables created dynamically. How can we implement access control here? Do we need a separate view created for each table and saved in a different dataset?
Following the procedure for creating authorized views, you could:
Create your source dataset (where your dynamic tables would reside).
Create one dataset (to store views) for every group of users with the same permissions, e.g. one dataset for developers, another for analysts, another for testers, etc.
Assign a project-level Cloud IAM role to your users (for them to be able to create BigQuery query jobs).
Assign to each user access controls to the dataset which will contain the views and correspond to their group (from step 2).
Dynamically create a table.
Dynamically create a view in each dataset (from step 2) querying the information that group should have access to (note that you cannot create the views upfront, since the referenced tables must already exist).
Dynamically authorize the views (see the sketch after this list).
For steps 3 and 4, alternatively, you can assign your users to Google Groups (you'll have one dataset per group).
Also be mindful of these limits.
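For example, steps 6 and 7 could be scripted roughly like this with the Python client library; every project, dataset, table, and filter value below is a placeholder, and the filtering logic will depend on your own tables:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    # Step 6: create the per-group view over the dynamically created table.
    view = bigquery.Table("my-project.analysts_views.events_20240101_filtered")
    view.view_query = """
        SELECT event_id, event_ts, country
        FROM `my-project.source_dataset.events_20240101`
        WHERE country = 'DE'   -- only the rows this group may see
    """
    view = client.create_table(view)

    # Step 7: authorize the view on the source dataset so the group can query
    # it without having direct access to the underlying table.
    source_dataset = client.get_dataset("my-project.source_dataset")
    entries = list(source_dataset.access_entries)
    entries.append(
        bigquery.AccessEntry(None, "view", view.reference.to_api_repr())
    )
    source_dataset.access_entries = entries
    client.update_dataset(source_dataset, ["access_entries"])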
I am using report builder 3.0 (very similar to SQL server reporting services) to create reports for users on an application using SQL server 2012 database.
To set the scene, we have a database with over 1200 tables. We actually only need about 100 of these for reporting purposes. But it is very common that we need to combine fields from multiple tables together to get a common resource of data that my colleagues and I need for our reports.
E.g., if I want a view of a customer, I would want to bring in information about the customer from the customer_table, information about his phone details from the Phone table, information about his account(s) from the accounts table, and so on. Then I might need another view of the accounts: account type, various balance amounts, opening date, status, etc.
What I would love to do is create a "customer view" where we combine all these fields into a single combined virtual table. Then we have an "Accounts view". It would be easier to use, easier to manage etc. Then we use this for all our reports going forwards. And when we need to, we can combine the customer and accounts view to use on a report plus actual tables into one combo-dataset to use on a report.
I am unsure about the right way to do this.
I see I can create a data source. This doesn't seem right as this appears to be what one might do if working off 2 or more databases. We are using just 1 database.
Then there are report models. It seems these are being deprecated and phased out so this doesn't seem a good option.
Finally I see we can create shared datasets. However, this option (as far as I can tell) won't allow me to combine this with another dataset. So using the example above, I won't be able to combine the customer view and the account view with this approach to use for a report to display details about the customer and his/her accounts.
Would appreciate guidance on the best way to achieve what I am trying to do...
Thanks
I can only speak from personal experience, but using the data source approach has been good for our purposes. We have a single database with 50+ tables in it. This is linked to as a shared data source in the project, so it is available to all 50+ reports.
We then use Stored Procedures to make the information in the database available to the reports; each report has its own Stored Procedure that joins as many tables as required to provide the data for that report. Using Stored Procedures also lets you return only the rows you are interested in, rather than entire tables.
I'm not certain if this is the kind of answer that you were after, but describes how we solve a similar (smaller) issue.
I've got a nice SSAS tabular model with users processing away. Certain users need access to certain information, such as confidential info (e.g., SS numbers), that should not be visible to everyone. How should I handle this?
This indicates that there is no way to use roles to remove columns, only rows. Is my only option to make a copy of the model and maintain both? This can't be such an edge case...
I guess I can jury-rig something with an SCM fork and code generation, but I'd rather not go down that road.
Alternatively, is there any way to hide the columns (per user/role), so that at least they don't show up in client tools?
One approach that requires very little additional development is described in the following blog post: http://blog.westmonroepartners.com/a-workaround-for-column-security-in-the-sql-server-analysis-services-bism-tabular-model/
The blog contains a link to an SSIS package which will replicate an existing cube, with the exception of the sensitive data columns. The users who cannot view the sensitive data columns can be given access to the second cube that does not contain sensitive data columns.
One way to achieve this is to create Perspectives. You can create different perspectives for different groups of users, and end users can connect to their specific perspective.