I recently came across Apache Kylin and was curious what its use cases are. From what I can tell, it is a tool designed to solve very specific problems at the scale of 10+ billion rows: aggregating, caching and querying data from other sources (HBase, Hadoop, Hive). Am I correct in this assumption?
Apache Kylin's use case is interactive big data analysis on Hadoop. It lets you query big Hive tables at sub-second latency in 3 simple steps:
1. Identify a set of Hive tables in star schema.
2. Build a cube from the Hive tables in an offline batch process.
3. Query the Hive tables using SQL and get results in under a second, via REST API, ODBC, or JDBC.
The use case is fairly general: Kylin can query any Hive tables quickly, as long as you can define a star schema and model cubes from the tables. Check out the Kylin terminology if you are not sure what a star schema or a cube is.
Kylin provides an ANSI SQL interface, so you can query the Hive tables pretty much the same way you used to. One limitation, however, is that Kylin provides only aggregated results; in other words, the SQL should contain a "group by" clause to yield correct results. This is usually fine because big data analysis focuses more on aggregated results than on individual records.
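As a rough sketch of querying over the REST API: the snippet below builds (but does not send) a POST request for Kylin's query endpoint, carrying an aggregate query with a GROUP BY. The host, port, project, table, column names and credentials are all placeholders I picked for illustration, so adjust them to your own installation.

```python
import base64
import json
import urllib.request

# Assumption: placeholder host and port; point this at your own Kylin server.
KYLIN_URL = "http://localhost:7070/kylin/api/query"


def build_kylin_query(sql, project, user, password, limit=1000):
    """Build (but do not send) a POST request for Kylin's query endpoint."""
    payload = {"sql": sql, "project": project, "offset": 0, "limit": limit}
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        KYLIN_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# An aggregate query with GROUP BY, the kind of SQL a cube can answer.
# Table and column names are hypothetical.
SQL = (
    "SELECT part_dt, SUM(price) AS total_price "
    "FROM kylin_sales GROUP BY part_dt"
)

request = build_kylin_query(SQL, project="learn_kylin", user="ADMIN", password="KYLIN")

# To run it against a live server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The same SQL can go through ODBC or JDBC instead; only the transport changes.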
I've been trying to research how to optimize my Informatica mapping's performance. I am trying to figure out whether I should have my source qualifier do as much of the work as possible (doing calculations, joining tables, handling conditional logic), or whether my mappings will perform better using transformations: an expression transformation to handle calculations and other manipulations, a joiner transformation for joining tables, and so on. I am also a bit new to Informatica.
There is no right answer to your question. It depends on the power of your DBMS versus the power of your Informatica environment, and on how efficiently your DBMS performs each type of transformation compared to your Informatica environment.
The only way is to try it and see - on your specific environment.
I think #NickW gave the best answer, but here are my two cents.
If your mapping has simple transformations like -
filters,
aggregations,
case-when/if-else clauses,
joins of 5 to 10 tables, or lookups on a few tables in the same DB
you can use SQL. Normally the DB can handle these kinds of operations better than Informatica. It also reduces the data transferred between the DB and Informatica, so we can assume it will be faster.
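To illustrate the data-transfer point, here is a toy comparison using an in-memory SQLite database as a stand-in for the source DB (the sales table and its columns are made up). Both options produce the same totals, but pushing the aggregation into SQL returns far fewer rows to the client.

```python
import sqlite3

# Stand-in source database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("east", 20), ("west", 5), ("west", 15), ("west", 30)],
)

# Option 1: fetch every row and aggregate on the client side
# (roughly what an aggregator transformation after a plain source read does).
all_rows = conn.execute("SELECT region, amount FROM sales").fetchall()
client_totals = {}
for region, amount in all_rows:
    client_totals[region] = client_totals.get(region, 0) + amount

# Option 2: push the aggregation into the SQL, so only one row per group
# crosses the wire.
db_rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall()
db_totals = dict(db_rows)

print(len(all_rows), "rows transferred vs", len(db_rows))
```

On a real source with millions of rows the gap in transferred rows is what usually dominates, though as noted above the only way to be sure is to measure it in your environment.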
Factors to consider: if you have a large table with no index and the SQL itself takes hours to return data, you can think about using Informatica. In such a scenario, please run the query in the DB first and see if you can improve its performance.
Please note that Informatica fetches the physical data onto the Informatica server, so a large table can be a problem for Informatica as well.
Now, if you have
complex mapping logic,
joins with a different DB, or lookups from a different DB,
File/XML/COBOL/XL sources
then you have to use Informatica.
I'm working in SQL Workbench in Redshift. We have daily event tables for customer accounts, the same format each day just with updated info. There are currently 300+ tables. For a simple example, I would like to extract the top 10 rows from each table and place them in 1 table.
Table name format is Events_001, Events_002, etc. Typical values are Customer_ID and Balance.
Redshift does not appear to support declaring variables, so I'm a bit stuck.
You've effectively invented a kind of pseudo-partitioning, where you manually partition the data by day.
To manually recombine the tables, create a view to union everything together...
CREATE VIEW
events_combined
AS
SELECT 1 AS partition_id, * FROM events_001
UNION ALL
SELECT 2 AS partition_id, * FROM events_002
UNION ALL
SELECT 3 AS partition_id, * FROM events_003
-- etc, etc: one more UNION ALL / SELECT pair per remaining table
That's a hassle: you need to recreate the view every time you add a new table.
That's why most modern databases have partitioning schemes built into them, so all the boilerplate is taken care of for you.
But Redshift doesn't do that. So, why not?
In general, because Redshift has many alternative mechanisms for dividing and conquering data. It's columnar, so you can avoid reading columns you don't use. It's horizontally partitioned across multiple nodes (sharded), to share the load with large volumes of data. It's sorted and compressed in pages to avoid loading rows you don't want or need. It has dirty pages for newly arriving data, which can then be cleaned up with a VACUUM.
So, I would agree with others that it's not normal practice. Yet, Amazon themselves do have a help page (briefly) describing your use case.
https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-time-series-tables.html
So, I'd disagree with "never do this". Still, it is a strong indication that you've accidentally walked into an anti-pattern and should seriously reconsider your design.
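If you do keep the per-day tables, the union view can at least be regenerated by a small script instead of edited by hand. A minimal sketch, assuming the tables are numbered consecutively with three digits (events_001, events_002, ...); the function name is my own:

```python
def build_union_view(view_name, table_prefix, table_count):
    """Return a CREATE VIEW statement that unions table_prefix001..N."""
    selects = [
        f"SELECT {i} AS partition_id, * FROM {table_prefix}{i:03d}"
        for i in range(1, table_count + 1)
    ]
    body = "\nUNION ALL\n".join(selects)
    return f"CREATE OR REPLACE VIEW {view_name} AS\n{body};"


# Three tables here for brevity; pass 300+ for the real thing.
ddl = build_union_view("events_combined", "events_", 3)
print(ddl)
```

Run the generated statement whenever a new daily table appears.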
As others have pointed out, many small tables in Redshift are really inefficient, like terrible if taken to the extreme. But that is not your question.
You want to know how to perform the same query on multiple tables from SQL Workbench. I'm assuming you are referring to SQLWorkbench/J. If so, you can define variables in the bench and use these variables in queries. Then you just need to update the variable and rerun the query. Now, SQLWorkbench/J doesn't offer any looping or scripting capabilities. If you want to loop, you will need to wrap the bench in a script (like a BAT file or a bash script).
My preference is to write a jinja template with the SQL in it, along with any looping and variable substitution. Then apply a JSON file with the table names and presto, you have all the SQL for all the tables in one file. I just need to run this, usually with the psql CLI, but at times I import it into my bench.
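Here is a minimal sketch of that templating idea. To keep it dependency-free it uses Python's built-in string.Template as a stand-in for jinja. The target table events_top10, the lowercase column names, and the ORDER BY balance are assumptions on my part; the question only says "top 10 rows".

```python
import json
from string import Template

# Stand-in for a jinja template; $table is substituted once per table.
# events_top10 is a hypothetical target table that must already exist.
QUERY = Template(
    "INSERT INTO events_top10\n"
    "SELECT '$table' AS source_table, customer_id, balance\n"
    "FROM $table\n"
    "ORDER BY balance DESC\n"
    "LIMIT 10;"
)

# In practice this JSON would live in its own file next to the template.
CONFIG = json.loads('{"tables": ["events_001", "events_002", "events_003"]}')


def render_all(template, tables):
    """Render the template for each table and join into one SQL script."""
    return "\n\n".join(template.substitute(table=name) for name in tables)


script = render_all(QUERY, CONFIG["tables"])
print(script)
```

Save the output to a .sql file and run it with psql or paste it into the bench.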
My advice is to treat Redshift as a query execution engine and use an external environment (Lambda, EC2, etc) for the orchestration of what queries to run and when. Many other databases (try to) provide a full operating environment inside the database functionality. Applying this pattern to Redshift often leads to problems. Use Redshift for what it is great at and perform the other actions elsewhere. In the end you will find that the large AWS ecosystem provides extended capabilities as compared to other databases; it's just that these aren't all done inside of Redshift.
Hi, I recently joined a new job that uses Hive and PostgreSQL. The existing ETL scripts gather data from Hive partitioned by date and create tables for that data in PostgreSQL, and then the PostgreSQL scripts/queries perform left joins and create the final table for reporting purposes. I have heard in the past that Hive joins are not a good idea. However, I noticed that Hive does allow joins, so I'm not sure why it's a bad idea.
I wanted to use something like Talend or Mulesoft to create joins and do aggregations within Hive, create a temporary table, and transfer that temporary table as the final table to PostgreSQL for reporting.
Any suggestions, especially if this is not good practice with Hive? I'm new to Hive.
Thanks.
The major issue with joining in Hive has to do with data locality.
Hive queries are executed as MapReduce jobs, and several mappers will launch, as much as possible, on the nodes where the data lies.
However, when joining tables, the matching rows from the LHS and RHS tables will not, in general, be on the same node, which may cause a significant amount of network traffic between nodes.
Joining in Hive is not bad per se, but if the two tables being joined are large, it may result in slow jobs.
If one of the tables is significantly smaller than the other you may want to store it in HDFS cache, making its data available in every node, which allows the join algorithm to retrieve all data locally.
So, there's nothing wrong with running large joins in Hive, you just need to be aware they need their time to finish.
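The small-table case can be illustrated with a toy map-side join in plain Python: the small table is loaded into an in-memory lookup (which is what having it available on every node buys you), and the large table is streamed past it, so no rows need to be shuffled between nodes. This only illustrates the idea and is not Hive's actual implementation; the table contents are made up.

```python
from collections import defaultdict


def map_side_join(large_rows, small_rows, key):
    """Inner-join by holding the small table in memory and streaming the large one."""
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)
    for row in large_rows:
        # Each "mapper" only needs its own slice of large_rows plus the lookup.
        for match in lookup.get(row[key], []):
            yield {**match, **row}


# Hypothetical data: orders is the large table, customers the small one.
orders = [
    {"customer_id": 1, "amount": 10},
    {"customer_id": 2, "amount": 25},
    {"customer_id": 3, "amount": 5},
]
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
]

joined = list(map_side_join(orders, customers, "customer_id"))
print(joined)
```

When both tables are too large to hold in memory this trick no longer applies, and the join has to move rows across the network, which is where the slow jobs come from.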
Hive is growing in maturity
It is possible that the arguments against using joins no longer apply to recent versions of Hive.
The clearest example I found is in the manual's section on join optimization:
The MAPJOIN implementation prior to Hive 0.11 has these limitations:
The mapjoin operator can only handle one key at a time
Therefore I would recommend asking what the foundation of their reluctance is, and then checking carefully whether it still applies. Their arguments may well still be valid, or might have been resolved.
Sidenote:
Personally, I find Pig code much easier to reuse and maintain than Hive; consider using Pig rather than Hive to do map-reduce operations on your (Hive table) data.
It's perfectly fine to do joins in Hive. I am an ETL tester and have performed left joins on big tables in Hive. Most of the time the queries run smoothly, but sometimes the jobs do get stuck or are slow due to network traffic.
It also depends on the number of nodes in the cluster.
Thanks
I have a stored procedure in SQL Server that queries tables both in the same database and in a different Oracle database. This is for a data warehouse project that joins several large tables across databases and queries them.
Is it better to copy the table (with ~3 million records) to the same database and then query it, or is the slowdown from the table being in a different database not significant? The query is complicated and can take hours.
I'm not necessarily looking for a specific answer, informed opinion and/or specific further reading are also very appreciated. Thanks!
I always prefer a staging layer, or what some call an integration layer.
In your case (without knowing the details), perhaps the best solution is to:
1. Copy the table once
2. Create a sync step (insert/update) based on the primary key(s)
3. Schedule step 2
4. Run your query
If there is some logical data-integrity rule, you can implement step 2 with simple SQL based on timestamps.
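A minimal sketch of the copy-then-sync idea, using an in-memory SQLite database as a stand-in for the staging table in SQL Server (the table stage, its columns, and the sample rows are all made up). The sync updates rows whose primary key already exists and inserts the rest.

```python
import sqlite3


def sync_stage(conn, source_rows):
    """Insert new rows and update existing ones, matched on the primary key."""
    with conn:  # one transaction per sync run
        for row_id, value in source_rows:
            cur = conn.execute(
                "UPDATE stage SET value = ? WHERE id = ?", (value, row_id)
            )
            if cur.rowcount == 0:  # no existing row with this key
                conn.execute(
                    "INSERT INTO stage (id, value) VALUES (?, ?)", (row_id, value)
                )


# Hypothetical staging table standing in for the local copy of the Oracle table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stage (id INTEGER PRIMARY KEY, value TEXT)")

sync_stage(conn, [(1, "a"), (2, "b")])    # initial copy
sync_stage(conn, [(2, "b2"), (3, "c")])   # scheduled sync: one update, one insert

rows = conn.execute("SELECT id, value FROM stage ORDER BY id").fetchall()
print(rows)
```

In a real setup you would normally do this set-based inside the database rather than row by row, and use a timestamp column to pick up only the rows changed since the last run.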
I want to use Power Pivot for one of my Presentation Engine Applications for Transactional Data.
These are the questions I am looking to have answered:
1. What is PowerPivot?
2. Can I use PowerPivot if I have 100 M rows in one of my SQL Server tables?
3. For handling 100 M rows, can I store the data in a simple SQL Server database table, or do I need a columnar database?
4. How exactly does PowerPivot function?
PowerPivot is simply a BI tool. There are many good BI tools, especially if you want to get into the open-source areas. Look at Pentaho, Jaspersoft, and BIRT/Actuate. These tools also can connect to many different sources/databases.
For question 3, it's all about how you're using the data. If you always query based upon the same filtering criteria, then using indexes may work for you. Assuming 100 million rows is about 50 gigs of raw data, you're starting to see the "shift" in query response/scale between a row-oriented approach and a column-oriented approach. If the queries are ad-hoc or your database size will continue to grow, then you should consider a columnar database like Infobright.