I have more data than one HANA system can hold, and I want to use Smart Data Access to create a virtual table pointing to another HANA box and build a calculation view that unions the local table with the virtual table.
Is this a viable solution? I am not sure about the mechanism of calculation views. They are not materialized, right?
When I do some grouping on my remote table via the calculation view, the raw data does not pass from remote to local, only the aggregated values do, is that correct?
SAP HANA Smart Data Access (SDA) provides you with the option to federate queries over multiple data sources and to access data stored in external databases (incl. external SAP HANA databases) by a query or a calculation model in SAP HANA.
In this scenario, your SAP HANA database acts as a database client to the external databases. This involves the transfer of queries and result sets between SAP HANA and the external databases and obviously also the materialisation of result sets - SAP HANA cannot reach into another database's memory and read its internal data representation.
The "smart" bit in SDA is that the SAP HANA query processor is aware of the technical capabilities of the external data sources and creates queries accordingly. This includes choosing join strategies, filter push-downs and e.g. group by push-downs.
Whether or not any of these operations are actually done for a specific query depends on the capabilities of the data sources, the expected data volumes and on the specific query.
In practical terms, this means that you have to check what the actual execution plan looks like for each of your queries.
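To make this concrete, here is a minimal sketch of the setup from the question, with placeholder host, schema and table names (the adapter configuration depends on your landscape). The EXPLAIN PLAN at the end is the kind of check meant above: the operator details show which part of the statement was actually pushed down to the remote source.

    -- Remote source pointing to the second HANA box (placeholder host/port/credentials)
    CREATE REMOTE SOURCE "OTHER_HANA" ADAPTER "hanaodbc"
      CONFIGURATION 'Driver=libodbcHDB.so;ServerNode=otherhost:30015'
      WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=SDA_USER;password=<password>';

    -- Virtual table on top of the remote table
    CREATE VIRTUAL TABLE "MYSCHEMA"."VT_SALES_REMOTE"
      AT "OTHER_HANA"."<NULL>"."MYSCHEMA"."SALES";

    -- The union that the calculation view would model
    SELECT region, SUM(amount) AS amount
    FROM (
          SELECT region, amount FROM "MYSCHEMA"."SALES_LOCAL"
          UNION ALL
          SELECT region, amount FROM "MYSCHEMA"."VT_SALES_REMOTE"
         ) AS u
    GROUP BY region;

    -- Check what is executed remotely vs. locally
    EXPLAIN PLAN SET STATEMENT_NAME = 'remote_agg' FOR
      SELECT region, SUM(amount) FROM "MYSCHEMA"."VT_SALES_REMOTE" GROUP BY region;

    SELECT operator_name, operator_details
    FROM explain_plan_table
    WHERE statement_name = 'remote_agg';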
Related
In our current system, we have a lot of ECC tables replicated to SAP HANA with SDI (Smart Data Integration). Replication tasks can be real-time or on demand, but sometimes a replication task comes too late and the data in the replicated table is very different from the source table.
What would be the best approach in SAP HANA to check these delta values?
ERP system uses DB2 database
DB2LogReaderAdapter is used to read DB2 database tables
Remote source is created in the Cloud (Virtual table)
There are about 260 replication tasks
Replication tasks contain only one object
Replication tasks are based on virtual tables
The biggest issue faced right now is latency in the remote source tables (delta values)
There is no easy/straightforward way to "check" delta values here.
The 260 replication tasks are processed independently of each other, regardless of how the changes were grouped into transactions in the source system.
That means, that if table A and B are updated in the same transaction, but replicated in separate tasks to HANA, the data will be written to HANA in separate transactions. The data in HANA will be lagging behind the source system.
Usually this difference should only last a relatively short time (maybe a few seconds), but of course, if you run aggregation queries and want to see currently valid sums etc., this leads to wrong data.
One way to deal with this is to implement the queries in a way that takes this into account, e.g. by filtering on data that was changed at least half an hour ago and excluding newer data.
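As a minimal sketch (table and column names are placeholders), such a query could look like this, excluding everything that was changed within the last 30 minutes:

    SELECT material, SUM(quantity) AS total_quantity
    FROM "REPL"."STOCK_MOVEMENTS"
    WHERE last_changed_at <= ADD_SECONDS(CURRENT_TIMESTAMP, -1800)  -- ignore the newest 30 minutes
    GROUP BY material;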
Note that as the replication via LogReader is decoupled from the source system's transaction processing, this problem of "lagging data" is built in conceptually and cannot be generally avoided.
All one can do is reduce the extent of the lag and cope with the differences in the upstream processing.
This very issue is one of the reasons for why remote data access is usually preferred over replication for cases like operational reporting.
And if you do need data loading (e.g. to avoid additional load on the source system), then an ETL/ELT approach into data stores (DWH/BW-like) gives the situation a lot more structure.
In fact, the current S/4 HANA & BW/4 HANA setups usually use a combination of scheduled data loads and ad-hoc fetching of new data via operational delta queues from the source system.
Lars,
If we need to replicate data from ECC on Oracle to a HANA instance, should we use SLT (because of cluster tables, for example), or does SDI already cover all the functionality SLT provides?
Regards, Chris
We are trying to create a cross-database query using Azure's preview Elastic Query. So we will be creating an External Table to make these queries happen.
Unfortunately, I have some apprehension about how the queries will be executed. I don't want a query or stored procedure to fail at run-time because the database connection fails. I just don't understand how the External Tables work.
Azure's External Table docs have good information on how to query and create the table. I just can't find information that specifically spells out how the data exists.
Oracle's version of external tables is just flat files that are referenced. SQL*Loader loads data from external files into tables of an Oracle database. I couldn't find any documentation about Azure doing the same. (Is it implied that they are the same? Is that a stupid question?)
If it is this way (external flat files), when the external table gets updated, does SQL Server update the flat files so our external table stays up to date? Or will I have to delete/create the link again every time I want to run the query for up to date information?
Per Microsoft Support:
Elastic queries basically work as remote queries, which means the data is not stored locally but is pulled from the source database every time you run a query. When you execute a query on an external table, it makes a connection to the source database and gets the data.
With that being said, you do not have to delete/create the links. Once you have performed these steps, you can access the horizontally partitioned table “mytable” as though it were a local table. Azure SQL Database automatically opens multiple parallel connections to the remote databases where the tables are physically stored, processes the requests on the remote databases, and returns the results.
There is no specific risk associated with using this feature but it is simply like opening connections to the source database so it can pull data. Besides this you can expect some slowness when executing a remote query but nothing that will cause any other issues with the database.
In case any of the databases becomes unavailable, queries that are using the affected DB as source or target will experience query cancellations or timeouts.
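For reference, this is roughly what the setup behind an external table looks like in T-SQL (vertical-partitioning/RDBMS flavour; all names, locations and credentials below are placeholders):

    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

    CREATE DATABASE SCOPED CREDENTIAL ElasticCred
        WITH IDENTITY = 'remote_user', SECRET = '<remote password>';

    CREATE EXTERNAL DATA SOURCE RemoteSrc WITH (
        TYPE = RDBMS,
        LOCATION = 'myserver.database.windows.net',
        DATABASE_NAME = 'RemoteDb',
        CREDENTIAL = ElasticCred
    );

    -- Column definitions must match the remote table; no data is copied at creation time
    CREATE EXTERNAL TABLE dbo.mytable (
        id     INT NOT NULL,
        amount DECIMAL(18, 2)
    ) WITH (DATA_SOURCE = RemoteSrc);

    -- Used like a local table; each execution opens a connection to RemoteDb
    SELECT COUNT(*) FROM dbo.mytable;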
As far as I know, OLAP is used in Power Pivot to speed up interacting with data.
But big data databases like Google BigQuery and Amazon Redshift have appeared in the last few years. Do SQL-targeted BI solutions like Looker and Chart.io use OLAP, or do they rely on the speed of the databases?
Looker relies on the speed of the database but does model the data to help with speed. Mode and Periscope are similar to this. Not sure about Chartio.
OLAP was used to organize data to help with query speeds. While used by many BI products like Power Pivot and Pentaho, several companies have built their own ways of organizing data to help with query speed. Sometimes this includes storing data in their own data structures to organize the data. Many cloud BI companies like Birst, Domo and Gooddata do this.
Looker created a modeling language called LookML to model data stored in a data store. As databases are now faster than they were when OLAP was created, Looker took the approach of connecting directly to the data store (Redshift, BigQuery, Snowflake, MySQL, etc) to query the data. The LookML model allows the user to interface with the data and then run the query to get results in a table or visualization.
That depends. I have some experience with BI solutions (for example, we worked with Tableau), and such a tool can operate in two main modes: it can execute the query against your server, or it can collect the relevant data and store it on the user's machine (or on the server where the app is installed). When working with large volumes, we used to make Tableau query the SQL Server itself, because our SQL Server machine was much more powerful than the other machines we had.
Either way, even if you store the data locally and want to "refresh" it, the refresh still needs to retrieve the data from the database, which can also be an expensive operation (depending on how your data is built and organized).
You should also note that you are comparing two different families of products: while Google BigQuery and Amazon Redshift are actually database engines that are used to store the data and also query it, most BI and reporting solutions are more concerned with querying the data and visualizing it, and therefore (generally speaking) are less focused on having smart internal databases (at least in my experience).
My application relies on data that is stored in a SAP Database. In addition to that, I need to persist data that references the SAP data but cannot be stored in the SAP database. So there is another (MySQL) database.
What would be the cleanest way to connect those 2 datasources?
I think it would be better/safer to regularly pull the data I need from SAP and also store it in the MySQL database, since then I can reference it with FKs. But I do not want to constantly check for changes in the SAP database and then copy all the data to MySQL, since queries against SAP are usually very expensive.
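To illustrate what I mean by referencing with FKs, something like this (hypothetical table and column names) is what I have in mind on the MySQL side:

    -- Keep only the SAP keys I actually reference, plus my own application data
    CREATE TABLE sap_material_ref (
        matnr       VARCHAR(18) NOT NULL PRIMARY KEY,  -- SAP material number, copied as-is
        last_synced TIMESTAMP   NOT NULL
    );

    CREATE TABLE material_note (
        id        BIGINT AUTO_INCREMENT PRIMARY KEY,
        matnr     VARCHAR(18) NOT NULL,
        note_text TEXT,
        FOREIGN KEY (matnr) REFERENCES sap_material_ref (matnr)
    );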
The company I am working for is implementing SharePoint with reporting servers that run on a SQL back end. The information that we need lives on two different servers: the first is the manufacturing server, which collects data from PLCs and writes that information into a SQL database; the other is our ERP server, which holds data for payroll and hours worked on specific projects. The idea I have is to create a view in a separate database and pull the information from both servers there. I am having a little bit of trouble with the syntax for connecting the two servers to run the view. We are running MS SQL Server. If you need any more information or clarification, please let me know.
Please read this about Linked Servers.
Alternatively you can make a Data Warehouse - which would be a reporting database. You can feed this either by writing procs that use linked servers, or by using SSIS packages if the servers are not linked.
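A minimal sketch of the linked-server route, assuming both servers are reachable over the network (server, database and table names are placeholders for your manufacturing and ERP systems):

    -- Register the manufacturing server as a linked server (repeat for the ERP server)
    EXEC sp_addlinkedserver
         @server     = N'MFGSERVER',
         @srvproduct = N'',
         @provider   = N'SQLNCLI',
         @datasrc    = N'mfg-sql-host';

    EXEC sp_addlinkedsrvlogin
         @rmtsrvname  = N'MFGSERVER',
         @useself     = N'FALSE',
         @rmtuser     = N'report_reader',
         @rmtpassword = N'<password>';
    GO

    -- Reference both linked servers with four-part names in the view
    CREATE VIEW dbo.v_project_hours AS
    SELECT m.ProjectId,
           m.UnitsProduced,
           e.HoursWorked
    FROM MFGSERVER.MfgDb.dbo.ProductionLog AS m
    JOIN ERPSERVER.ErpDb.dbo.ProjectHours  AS e
      ON e.ProjectId = m.ProjectId;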
It all depends on the project's size and complexity, but in many cases it is difficult to aggregate data from multiple sources with views. The reason is that the source data structure is modeled for the source application and not optimized for reporting.
In that case, I would suggest going with an ETL process, where you would create a set of Extract, Transform and Load jobs to get data from multiple sources (databases) into a target database where the data is stored in a format optimized for reporting.
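As a very small example of the load step (placeholder schema and table names): once the extracts from the manufacturing and ERP servers have landed in staging tables, a job joins them into a reporting table at the grain you actually report on.

    INSERT INTO dw.fact_project_hours (project_id, work_date, units_produced, hours_worked)
    SELECT s1.project_id,
           s1.work_date,
           s1.units_produced,
           s2.hours_worked
    FROM staging.mfg_production AS s1
    JOIN staging.erp_hours      AS s2
      ON  s2.project_id = s1.project_id
      AND s2.work_date  = s1.work_date;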
Ralph Kimball has many great books on the subject, for example:
1) The Data Warehouse ETL Toolkit
2) The Data Warehouse Toolkit
They are truly worth reading if you are dealing with data.