ETL choice: building an ETL that deals with the SQL query engine (Impala) or with the native database directly? [closed]

I am trying to build an ETL that maps the source tables to a dimensional, star-schema model.
Our data warehouse is basically Impala on top of a Kudu database.
My question is, should I:
A - build an ETL that deals with the Kudu tables directly using Python (link)
or
B - create UDFs in Impala (roughly the equivalent of stored procedures in SQL) that do the insertions/joins etc. to map the source tables to the star-schema model, and schedule them using NiFi or any scheduler such as Airflow
In my opinion, it would be better to deal with the native database rather than with the SQL engine on top of it, but that is just an assumption.

Why not approach C :) a bit of both?
Both have pros and cons.
A - use Python to build the ETL
Pros - better control; flexible enough to implement any logic you want.
Cons - you have to code in both Python and SQL. If something fails, root-cause analysis will be a nightmare, and maintenance may be harder in comparison.
Performance-wise, this approach will also be poorer for huge volumes of data.
B - use SQL to load the data directly
Pros - faster performance; less coding.
Cons - difficult to implement complex logic; maintaining the code and the schedule may be hard.
In addition to the above, please consider your team's comfort with Python/SQL, and future maintainability.
Currently we are using approach B in my Cloudera project. We create views and then use INSERT to load the final tables directly; we hardly need any UDFs.
My recommendation: use approach B, and fall back to approach A only when you really cannot express the complex logic in SQL.
EDIT:
Let's say we have to load the orders table. We execute the following blocks to load orders and the dependent org, cust, and prod tables.
Load customer    |
Load org         |  --> Load Orders final
Load product     |
Load order stage |
The "Load customer" block is a collection of scripts like:
insert overwrite cust_stg select * from cust_stg_vw; -- This loads into stage table
insert overwrite cust select * from cust_vw; -- This loads into cust table
The other blocks are written similarly. Organizing the scripts into blocks gives us the flexibility to run them in any order/anywhere we want to improve performance.
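For illustration, here is a minimal sketch of what one of those views could look like. All source table and column names below are made up; the real views would encode your own mapping from source tables to the star-schema dimensions.
-- Hypothetical view behind cust_stg_vw: it does the joins and renames needed
-- to shape source rows into the customer dimension, so the INSERT OVERWRITE
-- above stays a one-liner.
CREATE VIEW IF NOT EXISTS cust_stg_vw AS
SELECT c.customer_id  AS customer_key,
       c.customer_name,
       r.region_name,
       now()           AS load_ts
FROM src_customer c
LEFT JOIN src_region r ON r.region_id = c.region_id;
Keeping the transformation logic in views like this is what makes approach B easy to schedule: the orchestrator only has to run plain INSERT statements in the right order.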

Related

Should I name tables based on date & time of creation, and use EXEC() and a variable to dynamically refer to these tables? [closed]

TL;DR: My current company creates a new table for every time period, such as sales_yyyymmdd, and uses EXEC() with a variable to dynamically refer to table names, which makes the entire query render as a red string literal and hard to read. What changes can I suggest to improve both readability and performance?
Some background: I'm a data analyst (and not a DBA), so my SQL knowledge can be limited. I recently moved to a new company which uses MS SQL Server as their database management system.
The issues: The DAs here share a similar style of writing SQL scripts, which includes:
Naming tables based on their time of creation, e.g. each day's sales records are saved into a new table for that day, such as sales_yyyymmdd. This means there is a huge number of tables like this. Note that the DAs have their own database to tinker with, so they are allowed to create any number of tables there.
Writing queries enclosed in EXEC() that dynamically refer to table names based on some variable @date. As a result, their entire scripts render as red string literals, which is difficult for me to read.
They also claim that enclosing queries in EXEC() makes the scripts run to completion when stored as scheduled jobs, because when they write them the "normal way", the jobs sometimes stop mid-way.
My questions:
Regarding naming and creating new tables for every new time period: I suppose this is obviously a bad practice, at least in terms of management, due to the sheer number of tables. I suggested merging them and adding a created_date column, but the DAs here argued that both ways take up the same amount of disk space, so why bother with such a radical change. How do I explain this to them?
Regarding the EXEC() command: My issue with this way of writing queries is that it's hard to maintain and to share with other people. My quick fix for now (if issue 1 remains) is to use one single EXEC() command to copy the needed tables into temp tables, then select from those temp tables instead. If new data needs to be merged, I first insert it into temp tables, manipulate it there, and finally merge into the final, official table. Would this method affect performance at all (since there is an extra step involving temp tables)? And is there any better way that helps with both readability and performance?
I don't have experience scheduling jobs myself, as my previous company had a dedicated data engineering team that took my SQL scripts and automated the jobs on a server. My googling has not yielded any results yet. Is it true that using EXEC() keeps jobs from being interrupted? If not, what is the actual issue here?
I know that the post is long, and I'm also not a native speaker. I hope that I have explained my questions clearly enough, and I appreciate any help/answers.
Thanks everyone, and stay safe!
While I understand the reasons for creating a table for each day, I do not think this is the correct solution.
Modern databases do a very good job of partitioning data, and SQL Server has this feature too. In fact, use cases like this are exactly the reason partitioning was created in the first place. For me that would be the way to go, as:
it's not a WTF solution (your description is easily understandable, but it's still a WTF)
partitioning lets the optimizer handle partition-restricted queries, particularly time-restricted queries, efficiently (see the sketch below)
it is still possible to execute a query that is not restricted to one partition, while the solution you showed would require a union, or multiple unions
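As a minimal sketch of what that could look like in T-SQL (the function, scheme, and column names are just examples, and the boundary values would follow your own calendar):
-- One partition per month; RANGE RIGHT puts each boundary date into the partition to its right.
CREATE PARTITION FUNCTION pf_sales_date (date)
    AS RANGE RIGHT FOR VALUES ('20200201', '20200301', '20200401');

CREATE PARTITION SCHEME ps_sales_date
    AS PARTITION pf_sales_date ALL TO ([PRIMARY]);

-- A single Sales table, partitioned on the date column instead of split into sales_yyyymmdd tables.
CREATE TABLE dbo.Sales
(
    SaleId    int           NOT NULL,
    SaleDate  date          NOT NULL,
    Amount    decimal(18,2) NULL
) ON ps_sales_date (SaleDate);
Queries filtered on SaleDate then touch only the relevant partitions (partition elimination), while a query over the whole history still works without any UNIONs.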
As everybody mentioned in the comments, you can have a single Sales table with an extra column that holds the date the data was inserted.
Create a Sales table to hold all sales data:
CREATE TABLE Sales
(
col1 datatype,
col2 datatype,
.
.
InsertedDate date -- the date the sales data corresponds to
)
Insert all the existing tables' data into the new table:
INSERT INTO sales
SELECT *,'20200301' AS InsertedDate FROM Sales_20200301
UNION ALL
SELECT *,'20200302' AS InsertedDate FROM Sales_20200302
.
.
UNION ALL
SELECT *,'20200331' AS InsertedDate FROM Sales_20200331
Now you can turn the EXEC() queries that build table names from the variable @date into direct queries, and you can easily read the scripts without them being rendered in red.
DECLARE @date DATE = '20200301'
SELECT col1,col2...
FROM Sales
WHERE InsertedDate = @date
Note:
If the data is huge, you can think about partitioning the table on InsertedDate.
The purpose of a database is not to create tables; it is to use tables. To be honest, this is a nuance that is sometimes hard to explain to DBAs.
First, understand where they are coming from. They want to protect data integrity. They want to be sure that the database is available and that people can use the data they need. They may have been around when the database was designed, and the only usage envisioned was per-day access. This approach also keeps existing data safe when the schema changes (i.e. new columns are added).
Obviously, things have changed. If you were to design the database from scratch, you would probably have a single partitioned table; the partitioning would be by day.
What can you do? You have several options, depending on what you are able to do and what the DBAs need. The most important thing is to communicate the importance of this issue. You are trying to do analysis. You know SQL. Before you can even get started on a problem, you have to deal with the data model, thinking about EXEC() calls, date ranges, and a whole host of issues that have nothing to do with the problems you need to solve.
This affects your productivity, and it affects the utility of the database. Both of these are issues that someone should care about.
There are some potential solutions:
You can copy all the data into a single table each day, perhaps as a separate job. This is reasonable if the tables are small.
You can copy the latest data into a single table.
You can create a view that combines the daily tables into a single logical table (see the sketch below).
The DBAs could do any of the above.
I obviously don't know the structure of the existing code or how busy the DBAs are. However, (4) does not seem particularly cumbersome, regardless of which solution is chosen.
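If the view route is chosen, a minimal sketch could look like the following; the view name and the added date column are assumptions based on the naming scheme in the question, and in practice the view body would itself be generated:
CREATE VIEW dbo.sales_all AS
SELECT s.*, CAST('20200301' AS date) AS sales_date FROM dbo.sales_20200301 AS s
UNION ALL
SELECT s.*, CAST('20200302' AS date) FROM dbo.sales_20200302 AS s
-- ...one UNION ALL branch per daily table...
UNION ALL
SELECT s.*, CAST('20200331' AS date) FROM dbo.sales_20200331 AS s;
Readers then query dbo.sales_all with a WHERE sales_date filter, exactly as they would query the merged table.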
If you have nowhere to put a view or a copy of the data, I would write SQL-generation code that constructs a query like this:
select * from sales_20200101 union all
select * from sales_20200102 union all
. . .
This will be a long string. I would then just start my queries with:
with sales as (
<long string here>
)
<whatever code here>;
Of course, it would be better to have a view (at least) that has all the sales you want.

Can converting a SQL query to PL/SQL improve performance in Oracle 12c? [closed]

I have been given an 800-line SQL query which takes around 20 hours to fetch around 400 million records.
There are 13 tables which are partitioned by month.
The tables have records ranging from 10k to 400 million in each partition.
The tables are indexed on primary keys.
The query uses many inline views, outer joins, and a few GROUP BY aggregations.
DBAs say we cannot add more indexes as it would slow down the performance since it is an OLTP system.
I have been asked to convert the query logic to PL/SQL, populate a table in chunks, and then do a SELECT * from that table.
My end result should be a query which can be fed to my application.
So even after I use PL/SQL to populate a table in chunks, ultimately I need to fetch the data from that table with a query.
My question is: since PL/SQL would require both the SELECT and the INSERT, is there any chance that PL/SQL can be faster than SQL?
Are there any cases where PL/SQL is faster for a result that is also achievable with plain SQL?
I will be happy to provide more information if the given info doesn't suffice.
Implementing it as a stored procedure could be faster because the SQL will already be parsed and compiled when the procedure is created. However, given the volume of data you are describing, it's unclear whether this will make a significant difference. All you can do is try it and see.
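If the chunked-load route is tried, a minimal PL/SQL sketch is below; the source query, table names, and chunk size are placeholders, not the actual 800-line query:
DECLARE
  CURSOR c_src IS
    SELECT col1, col2        -- the existing 800-line query would go here
    FROM   source_data;
  TYPE t_rows IS TABLE OF c_src%ROWTYPE;
  l_rows t_rows;
BEGIN
  OPEN c_src;
  LOOP
    -- fetch and insert 10,000 rows at a time to keep memory and undo usage bounded
    FETCH c_src BULK COLLECT INTO l_rows LIMIT 10000;
    EXIT WHEN l_rows.COUNT = 0;
    FORALL i IN 1 .. l_rows.COUNT
      INSERT INTO report_stage VALUES l_rows(i);
    COMMIT;
  END LOOP;
  CLOSE c_src;
END;
/
Note that a single direct-path INSERT /*+ APPEND */ INTO report_stage SELECT ... is often at least as fast as this loop, so it is worth measuring both before committing to the PL/SQL rewrite.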
I think you really need to identify where the performance problem is, i.e. where the time is being spent. For example (and I have seen examples of this many times), the majority of the time might be spent fetching the 400M rows to whatever the "client" is. In that case, re-writing the query, whether in SQL or in PL/SQL, will make no difference.
Anyway, once you can enumerate the problem, you have a better chance of getting sound answers, rather than guesses...

Oracle: How to identify data and schema changes [closed]

I have a requirement to gather the database data changes or schema changes that occur during a nightly batch. For example, there is a table employee which has two records. Suppose that during the nightly batch one record is inserted and one record is updated. I want to capture which record was updated and which record was inserted. I am using an Oracle database. I am looking for a script to do this, as we have some issues getting licenses for new tools that do this task. Can anyone advise how this can be done programmatically or using Oracle 11g built-in features? Any sample code is greatly appreciated. As we have a large number of tables, I am looking for a generic way to do this.
Thanks
I would suggest using triggers on the tables whose changes you want to capture, inserting that information into another table that records those changes (see the sketch below).
There's some info right here on Stack Overflow: the best way to track data changes in oracle
If triggers are not a viable option, look into INSERTing into two tables at once, one being your target table and one being your logging/change-capture table.
Here is an example on Stack Overflow:
Oracle INSERT into two tables in one query
A third option would be table auditing. See the following on stackoverflow
Auditing in Oracle
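As a minimal sketch of the trigger approach, assuming a hypothetical employee table and audit table (you would generate one such trigger per table you want to watch):
CREATE TABLE employee_audit (
  emp_id      NUMBER,
  change_type VARCHAR2(1),   -- 'I' = inserted, 'U' = updated
  changed_at  DATE
);

CREATE OR REPLACE TRIGGER trg_employee_audit
AFTER INSERT OR UPDATE ON employee
FOR EACH ROW
BEGIN
  INSERT INTO employee_audit (emp_id, change_type, changed_at)
  VALUES (:NEW.emp_id,
          CASE WHEN INSERTING THEN 'I' ELSE 'U' END,
          SYSDATE);
END;
/
Since you have a large number of tables, the CREATE TRIGGER statements can themselves be generated from the data dictionary (ALL_TABLES / ALL_TAB_COLUMNS) rather than written by hand.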
In OLTP systems, you can add audit columns to the table: create_date and update_date, or last_modified_time and transaction_type.
With create_date/update_date - you can default create_date to SYSDATE, and then you need to modify the application logic to maintain update_date. A trigger will also work instead of changing the code, at a small performance cost.
With last_modified_time/transaction_type - you need to update those two fields on insert or update, either as part of your application logic or using a trigger (see the sketch below).
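A minimal sketch of the audit-column variant, again with assumed table and column names:
ALTER TABLE employee ADD (
  create_date DATE DEFAULT SYSDATE,
  update_date DATE
);

CREATE OR REPLACE TRIGGER trg_employee_upd_date
BEFORE UPDATE ON employee
FOR EACH ROW
BEGIN
  :NEW.update_date := SYSDATE;
END;
/
The nightly comparison then becomes a simple query for rows whose create_date or update_date falls after the batch start time.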

DB Schema: Why not create new table for each 'entity'? [closed]

Sorry about the vague title.
An example: I'm guessing SO has one large table that lists all answers, in a schema like:
[ Ques No, Ans No, Text , Points ]
[ 22, 0 , "Win", 3 ],
[ 22, 1 , "Tin", 4 ],
[ 23, 0 , "Pin", 2 ]
My question is: would it be better if there were separate tables, Table_Ques22 and Table_Ques23? Can someone please list the pros and cons?
What comes to my mind:
Cons of multiple tables: Overhead of meta storage.
Pros of multiple tables: Quickly answer queries like, find all answers to Ques 22. (I know there are indices, but they take time to build and space to maintain).
Databases are designed to handle large tables. Having multiple tables with the same structure introduces a lot of problems. These come to mind:
Queries that span multiple entities ("questions" in your example, and therefore multiple tables) become much more complicated, and performance suffers.
Maintaining similar entities is cumbersome. Adding an index or partitioning a single table is one thing. Doing it to hundreds of tables is much harder.
Maintaining triggers is cumbersome.
When a new row appears (new question), you have to incur the overhead of creating a table rather than just adding to an existing table.
Altering a table, say to add a new column or rename an existing one, is very cumbersome.
Although putting all questions in one table does use a small additional amount of storage, you have to balance that against the overhead of having many very small tables. A table with data has to occupy at least one data page, regardless of whether the data is 10 bytes or 10 Gbytes. If a data page is 16 kbytes, that is a lot of wasted space to support multiple tables for a single entity.
As for database limits: I'm not even sure a database could support a separate table for each question on Stack Overflow.
There is one case where having parallel table structures is useful. That is when security requirements require that the data be separated, perhaps for client confidentiality reasons. However, this is often an argument for separate databases, not just separate tables.
What about this: SQL servers are not made for people who ignore the basics of relational theory.
You will have a ton of problems with cross-question queries on your side, which will totally kill all the gains. Typical beginner mistake - I suggest a good book about SQL basics.

Good resources for learning database optimization [closed]

I am good at the database (SQL) programming part, but I want to move on to database optimization: where and when to use indexes, how to decide which query is better than another, and how to optimize a database. Can you point me to some good resources or books that can lead me there?
Inside Microsoft SQL Server 2005: Query Tuning and Optimization,
Inside Microsoft SQL Server 2005: T-SQL Querying
Inside Microsoft SQL Server 2005: The Storage Engine
have very deep and thorough explanations of optimizing SQL Server querying.
SQL Server Query Performance Tuning Distilled, Second Edition
I've recently been focusing on this for my company, and I've learned some interesting things about specifically query optimization.
I've run SQL Profiler for a half hour at a time and logged queries that required 1000 reads or more (then later ones that required 50 CPU or more).
I originally focused on individual queries with the highest reads and CPU. However, having written the logs to a database, I was able to query aggregate results to see which queries required the most aggregate reads and CPU. Targeting these actually helped a lot more than only targeting the most expensive queries.
The most expensive query might be run once a day, so it's good to optimize that. However, if the 10th most expensive query is run 100 times an hour, it's much more helpful to optimize that first.
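For example, once the trace rows have been saved to a table, the aggregation can be as simple as the query below; the column names mirror the standard trace columns, but the table name is an assumption about how the log was stored:
SELECT CAST(TextData AS nvarchar(max)) AS query_text,
       COUNT(*)                        AS executions,
       SUM(Reads)                      AS total_reads,
       SUM(CPU)                        AS total_cpu
FROM dbo.profiler_log                  -- table the trace was written to
GROUP BY CAST(TextData AS nvarchar(max))
ORDER BY total_reads DESC;
Ordering the same result by total_cpu instead surfaces the CPU-heavy offenders.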
Here's a summary of what I've learned so far, which can help you get started in identifying queries for optimization:
A Beginner's Guide to Database Query Optimization
Highly Inefficient Linq Queries that Break Database Indexing
An Obscure Performance Pitfall for Test Accounts and Improperly Indexed Database Tables
Here are some tips for database/query optimization.
Apply functions to parameters, not columns
One of the most common mistakes seen when looking at database queries is the improper use of functions against table columns. Whenever we need to apply a function to a column and compare the result against a value, it's worth checking whether there is a reverse function we can apply to the parameter instead. That way the database engine can use an index on that column, and there is no need to define a function-based index.
Against a 60-row table with no indexes whatsoever, the following query
SELECT ticker.SYMBOL,
ticker.TSTAMP,
ticker.PRICE
FROM ticker
WHERE TO_CHAR(ticker.TSTAMP, 'YYYY-MM-DD') = '2011-04-01'
executes in 0.006 s, whereas the "reverse" query
SELECT ticker.SYMBOL,
ticker.TSTAMP,
ticker.PRICE
FROM ticker
WHERE
ticker.TSTAMP = TO_DATE('2011-04-01', 'YYYY-MM-DD')
executes in 0.004 s, and, unlike the first one, it can use an index on TSTAMP when one exists.
Exists clause instead of IN (subquery)
Another pattern observed in database development is that people choose the easiest and most convenient solution; for this tip, we will look at finding an element in a list. The easiest and most convenient solution is using the IN operator.
SELECT symbol, tstamp, price
FROM ticker
WHERE price IN (3,4,5);
--or
SELECT symbol, tstamp, price
FROM ticker
WHERE price IN (SELECT price FROM threshold WHERE action = 'Buy');
This approach is fine when we have a small, manageable list. When the list becomes very large, or when it is dynamic (generated at runtime from parameters we only have then), this approach tends to become quite costly for the database. The alternative is the EXISTS operator, as shown in the code snippet below:
SELECT symbol, tstamp, price
FROM ticker t
WHERE EXISTS (SELECT 1 FROM threshold m WHERE t.price = m.price AND m.action = 'Buy');
This approach can be faster because once the engine finds a hit, it stops looking: the condition has been proven true. With IN, the full result of the subquery is typically collected before further processing.