Data Discovery & Classification is built into Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Analytics. It provides basic capabilities for discovering, classifying, labeling, and reporting the sensitive data in your databases.
However, the Data Discovery and Classification engine fails to identify a bunch of tables. At the moment, it is only able to identify our contacts table.
Can someone let me know if the table needs to be in a certain format for the Data Discovery to able to discover and classify tables?
If not, can someone explain why the Data Discovery and Classification engine is unable identify/discover other tables?
Azure SQL Data Discovery and Classification identify based on Information types of columns. It works using the column names.
To directly add classification, you can follow below steps:
Manage Information Types
To detect tables, it uses Information types of columns which are built-in as shown below it will match pattern with your column name and detect it. you can edit this Information Type by clicking on Configure and add the pattern of your column name and save it.
After creating table with column name matching the patterns mentioned in the Information Type It is successfully detecting tables.
Related
I asked one of our company partner to give us read/write ODBC access so that we can pull raw data, crate views from their Case management system. they mentioned that they can provide us with data dump of the tables with in their website where we would be able to pull data from.
i looked into what i can do by data dump of tables. and found that this is detailed record of tables in the database. it is implemented to take backup of a database or multiple databases available in the server so that their data contents can be renovated in the event of any data loss.
I am looking how can i use this to write my own sql query and get what i need and create views. where can i read more about how else i can use data dump of a table
thanks
I have N databases, for example 10 databases.
Every database has the same schema, but different data.
Now i would like to take every data of each database from the table "Table1" and insert them in a common table in a new database "DWHDatabase" in a table named Table1Common.
so it's an insert like n to 1.
How i can do that? i'm trying to solve my issues with the elastic queries but seems it's a 1 to 1 stuff
Use Azure Data Factory with Linked Services to each database. Use the Copy activity to load the data.
You can also paramaterize the solution.
Parameterize linked services
Parameters in Azure Data Factory by Catherine Wilhemsen
Elastic query is best suited for reporting scenarios in which the majority of the processing (filtering, aggregation) may be done on the external source side. It is unsuitable for ETL procedures involving significant amounts of data transfer from a distant database (s). Consider Azure Synapse Analytics for large reporting workloads or data warehousing applications with more sophisticated queries.
You may use the Copy activity to copy data across on-premises and
cloud-based data storage. After you've copied the data, you may use
other actions to alter and analyse it. The Copy activity may also be
used to publish transformation and analysis findings for use in
business intelligence (BI) and application consumption.
MSFT Copy Activity Overview: Here.
I'm very new to Snowflake, so forgive me if the answer is obvious.
I am loading the data from on-prem into Azure using Data Factory, and then ingesting into Snowflake using COPY INTO. However, I need to enable access for some of the transformed data to other platforms, meaning that if I perform transformation in Snowflake, I'll need to create an external table in Azure (essentially pushing this data back to Azure so other platforms can access it).
As we don't particularly want to introduce a new tool, I have two options for our fairly basic transformation:
do the transformation in ADF
do the transformation in Snowflake in SQL scripts and then create an external table so other teams can access the data using other tools (these platforms don't integrate with Snowflake)
Are there any major drawbacks to option 2 apart from increased storage costs?
I'm trying to weigh up the following: maintenance effort (our team's skills lie in SQL not ADF), cost, and performance.
Any advice would be appreciated.
As stated in the question, there are many possible answers for this scenario - with my favorite being the second one ("do the transformation in Snowflake in SQL scripts and then create an external table so other teams can access the data using other tools").
If you need to make the results of these transformations available on Azure storage, Azure Data Factory supports this natively:
Copy data from Snowflake that utilizes Snowflake's COPY into [location] command to achieve the best performance. https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake#supported-capabilities
Or you could manage this inside Snowflake using the same COPY INTO that ADF uses.
Let me add a couple screenshots from the Snowflake webinar "Data Warehouse or Data Lake? How You Can Have Both in a Single Platform":
https://resources.snowflake.com/webinars-thought-leadership/data-warehouse-or-data-lake-how-you-can-have-both-in-a-single-platform-3
I have 3 databases with the same structure, but different data, since they are from different clients.
Now, I have an existing SSAS project. Its Data Source Views, Cubes and Dimensions can only use or access one DB.
What I want is to be able to use multiple databases with the same structure, and create a cube using them.
Each client must also be able to use the cube, but they can only see their own data.
Are these possible? Can you please provide insights and some useful references?
Easy Solution
The easiest way to solve this would be to just have three Analysis Services databases. Setup would be easy, you would have just three structurally identical databases, and no need to manage security within the cubes, only access to the cube. It is easy to manage, and difficult to make errors allowing users to get access to data they should not see. And as nobody should be allowed to access data form other companies, there is no need for one common cube.
Just deploy your project three times using a different Analysis Services database name.
Then change the data source object of the deployed databases to point to the different relational databases.
For the first step, in Business Intelligence Development Studio, right click on the project node in Solution Explorer, select the bottom entry ("Properties"), and then select "Deployment". Here, you can enter the server to deploy the solution to, as well as the database name. After closing the dialog, right click on the project node again, and select Deploy. Repeat this step, using three different database names.
Then, connect to your Analysis Services server in SQL Server Management Studio, open each database, and edit the data source object of each database to point to its relational database.
After that, re-process the Analysis Services database.
Alternatively, you can also do everything in BIDS, i. e. between changing the target database for deployment and deploying, change the data source there, and after deployment, possibly, re-process the Analysis Services database.
If you assume you will need to change and deploy the cube definition several times, you probably could make use of configurations which you can edit in the project properties dialog using the "Configuration Manager" button. You would have three configurations, one for each target Analysis Services database. You could select one of the configurations in the dropdown list in the toolbar for each deployment without the need to edit properties again and again.
If you need to do this often, I think it would not be difficult to automate the steps to change the database and reprocess the cube, either via XMLA, or via AMO, or in PowerShell. But to implement this this would be another question.
More Complex Solution
If you really want to have everything in one cube, then you will have to have a union of the tables from the different sources in the data source view. If all three relational databases are on the same SQL Server instance, you can define this either as a named query in the data source view, or as a view in one of the databases, maybe even better as a view or table in a separate relational database. You can access a table or view from another database running in the same instance of SQL Server in the form NameOfDB.Schema.Tablename.
In case these databases are on different instances, you could use linked servers.
And of course, you will have to manage the keys in these different databases so that the same dimension entry has the same key, and different dimension entries have different keys. And you will have to set up security in the cube so that no user can see data that is not meant to be seen.
While you could use different data source objects in Analysis Services for different tables or named queries in Analysis Services, each of these only uses one, as actually, this is one SQL statement that is sent to this source. And dimensions need to be based on one data source view object like one named query, view, or table. For fact tables, you could get around this using partitions, but not for dimensions.
The company i am working for is implementing Share-point with reporting servers that runs on an SQL back end. The information that we need lives on two different servers. The first server being the Manufacturing server that collects data from PLCs and inputs that information into a SQL database, the other server is our erp server which has data for payroll and hours worked on specific projects. The i have is to create a view on a separate database and then from there i can pull the information from both servers. I am having a little bit of trouble with the syntax for connecting the two servers to run the View. We are running ms SQL. If you need any more information or clarification please let me know.
Please read this about Linked Servers.
Alternatively you can make a Data Warehouse - which would be a reporting data base. You can feed this by either making procs with linked servers or use SSIS packages if they're not linked.
It all depends on a project size and complexity, but in many cases it is difficult to aggregate data from multiple sources with Views. The reason is that the source data structure is modeled for the source application and not optimized for reporting.
In that case, I would suggest going with an ETL process, where you would create a set of Extract, Transform and Load jobs to get data from multiple sources (databases) into a target database where data will be stored in the format optimized for reporting.
Ralph Kimball has many great books on the subject, for example:
1) The Data Warehouse ETL Toolkit
2) The Data Warehouse Toolkit
They are truly worth the read if you are dealing with data