Importing Oracle to Hadoop - SQL

I have an Oracle table that takes 3 hours to respond to a SELECT query. I was thinking about importing it into Hadoop for processing.
Would that be a good idea? If I use Hive to run the same query, would there be any performance gain?
If yes, then how should I import my table into Hadoop? Since the table has a composite primary key, Sqoop is not an option. One more thing: should I use HBase? Which approach would be better?

Performance in Hadoop depends on the size of the data. If the data is really huge, you can see a performance improvement. If the data is small, it's better to tweak your query.
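If you do go the Hadoop route, a minimal sketch of the Hive side might look like the following (the table, columns, and HDFS path are hypothetical, assuming the Oracle data has been exported as delimited files into HDFS):

-- Hypothetical external table over data exported from Oracle into HDFS.
CREATE EXTERNAL TABLE orders_from_oracle (
  order_id   BIGINT,
  line_no    INT,
  customer   STRING,
  amount     DECIMAL(18,2),
  order_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/oracle_export/orders';

-- The same kind of query you were running against Oracle, now executed by Hive.
SELECT customer, SUM(amount) AS total_amount
FROM orders_from_oracle
GROUP BY customer;

Whether this ends up faster than Oracle depends mostly on the data volume and the size of the cluster, as noted above.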

Related

Materialized view vs table - performance

I am quite new to Redshift but have considerable experience in the BI area. I need help from an expert Redshift developer. Here's my situation:
I have an external (S3) database added to Redshift. It will undergo very frequent changes, approximately every 15 minutes. I will run a lot of concurrent queries directly from Qlik Sense against this external DB.
As best practices say that Redshift + Spectrum works best when the smaller table resides in Redshift, I decided to move some calculated dimension tables locally and leave the outer tables in S3. The challenge I have is whether materialized views or plain tables are better suited for this.
I already tested both, with DISTSTYLE ALL and a proper SORTKEY, and the tests show that MVs are faster. I just don't understand why that is. Considering the dimension tables are fairly small (<3 million rows), I have the following questions:
Shall we use MVs and refresh them via a scheduled task, or use a table and refresh it through some form of ETL (maybe a stored procedure)? A minimal sketch of both options appears after this question.
If a table is used: I tried casting the varchar keys (heavily used in joins) to bigint to force AZ64 encoding, but queries perform worse than without casting (where the encoding is LZO). Is this because the external DB stores them as varchar?
If an MV is used: I also tried the above casting in the query behind the MV, but the encoding says NONE (figured out by checking the table created behind the scenes). Moreover, even without casting, most of the key columns used in joins have no encoding. Might this be the reason why MVs are faster than the table? And shouldn't I expect the opposite: no encoding = worse performance?
Some more info: in S3 we store the data as Parquet files, with proper partitioning. In Redshift, the tables are sorted on the same columns as the S3 partitioning, plus some more columns. All queries join on these columns, in the same order, and also filter on them in the WHERE clause. So the queries are well structured.
Let me know if you need any other details.
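For reference, a minimal sketch of the two options being compared (the schema, table, and column names are hypothetical):

-- Option 1: a local dimension table built from the external (Spectrum) table,
-- refreshed by an ETL job that reloads or recreates it.
CREATE TABLE dim_customer
DISTSTYLE ALL
SORTKEY (customer_key)
AS
SELECT customer_key, customer_name, region
FROM spectrum_schema.customer_raw;

-- Option 2: a materialized view over the same external table. An MV defined on
-- an external table is typically recomputed in full when refreshed.
CREATE MATERIALIZED VIEW mv_dim_customer
DISTSTYLE ALL
SORTKEY (customer_key)
AS
SELECT customer_key, customer_name, region
FROM spectrum_schema.customer_raw;

-- Refreshed on a schedule, e.g. every 15 minutes.
REFRESH MATERIALIZED VIEW mv_dim_customer;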

Indexing and cluster in Greenplum database

I am new to Greenplum database. I have a question.
Is running CLUSTER on a table mandatory after creating an index on a column in Greenplum, in the case of row-oriented storage?
The "massively parallel" (MPP) nature of Greenplum's software-level architecture, coupled with the throughput capabilities of modern servers, makes indexes unnecessary in most cases.
To put it differently, the speed of table scans in Greenplum is a feature rather than a bottleneck. Please refer to this great write-up on how MPP works under the hood: https://dwarehouse.wordpress.com/2012/12/28/introduction-to-massively-parallel-processing-mpp-database/
If your data is not updated frequently and you need results returned quickly, you can cluster the table on an index, but be aware that the CLUSTER operation itself takes a long time. You can also build an index on a column-oriented table.
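For reference, a minimal sketch of the two statements being discussed (table and column names are hypothetical, and the exact CLUSTER syntax depends on the Greenplum version):

-- Create a b-tree index on the lookup column; this alone does not reorder the table.
CREATE INDEX idx_sales_customer_id ON sales (customer_id);

-- Optionally reorder the table on disk to match the index. This is not mandatory,
-- and it rewrites the whole table, so it is an expensive operation.
CLUSTER sales USING idx_sales_customer_id;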

Is it bad to do joins in Hive?

Hi, I recently started a new job that uses Hive and PostgreSQL. The existing ETL scripts gather data from Hive, partitioned by date, and create tables for that data in PostgreSQL; the PostgreSQL scripts/queries then perform left joins and create the final table for reporting purposes. I have heard in the past that Hive joins are not a good idea. However, I noticed that Hive does allow joins, so I'm not sure why it would be a bad idea.
I wanted to use something like Talend or MuleSoft to create joins and do aggregations within Hive, create a temporary table, and transfer that temporary table to PostgreSQL as the final table for reporting.
Any suggestions, especially if this is not good practice with Hive? I'm new to Hive.
Thanks.
The major issue with joining in Hive has to do with data locality.
Hive queries are executed as MapReduce jobs, and the mappers will launch, as far as possible, on the nodes where the data lies.
However, when joining tables, the matching rows from the LHS and RHS tables will not in general be on the same node, which may cause a significant amount of network traffic between nodes.
Joining in Hive is not bad per se, but if the two tables being joined are large it may result in slow jobs.
If one of the tables is significantly smaller than the other, you may want to store it in the HDFS cache, making its data available on every node, which allows the join algorithm to retrieve all data locally.
So there's nothing wrong with running large joins in Hive; you just need to be aware that they take time to finish.
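As a rough sketch of that small-table case (the table names are hypothetical, and the exact settings depend on your Hive version), Hive can broadcast the smaller table to every mapper as a map-side join:

-- Let Hive convert joins to map joins automatically when one side is small enough.
SET hive.auto.convert.join = true;
-- Size threshold (in bytes) below which a table is considered small.
SET hive.mapjoin.smalltable.filesize = 25000000;

-- Older Hive versions also accept an explicit hint naming the small table.
SELECT /*+ MAPJOIN(d) */ f.order_id, d.customer_name
FROM fact_orders f
JOIN dim_customer d
  ON f.customer_id = d.customer_id;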
Hive is growing in maturity
It is possible that arguments against using joins no longer apply to recent versions of Hive.
The clearest example I found is in the manual section on join optimization:
The MAPJOIN implementation prior to Hive 0.11 has these limitations:
The mapjoin operator can only handle one key at a time
Therefore I would recommend asking what the foundation of their reluctance is, and then checking carefully whether it still applies. Their arguments may well still be valid, or they might have been resolved by now.
Sidenote:
Personally I find Pig code much easier to re-use and maintain than Hive; consider using Pig rather than Hive to do map-reduce operations on your (Hive table) data.
It's perfectly fine to do joins in Hive. I am an ETL tester and have performed left joins on big tables in Hive; most of the time the queries run smoothly, but sometimes the jobs do get stuck or run slowly due to network traffic.
It also depends on the number of nodes in the cluster.
Thanks

Under what conditions would SELECT by PRIMARY KEY be slow?

Chasing down some DB performance issues in a fairly typical EclipseLink/JPA application.
I am seeing frequent queries that take 25-100 ms. These are simple queries, just selecting all columns from a table where its primary key equals a value. They shouldn't be slow.
I'm looking at the query time in the Postgres log, using log_min_duration_statement, so this should eliminate any network or application overhead.
This query is not slow, but it is used very often.
Why would selecting * by primary key be slow?
Is this specific to postgres or is it a generic DB issue?
How can I speed this up? In general? For postgres?
Sample query from the pg log:
2010-07-28 08:19:08 PDT - LOG: duration: 61.405 ms statement: EXECUTE <unnamed> [PREPARE: SELECT coded_element_key, code_system, code_system_label, description, label, code, concept_key, alternate_code_key FROM coded_element WHERE (coded_element_key = $1)]
The table has around 3.5 million rows.
I have also run EXPLAIN and EXPLAIN ANALYZE on this query; it's only doing an index scan.
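For anyone reproducing this, the check can be made more informative with the BUFFERS option (available from PostgreSQL 9.0), which shows whether the index and heap pages came from the buffer cache or from disk; the literal key value below is just a placeholder:

-- Shows the plan, actual timings, and how many pages were read from cache vs. disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT coded_element_key, code_system, code_system_label, description,
       label, code, concept_key, alternate_code_key
FROM coded_element
WHERE coded_element_key = 12345;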
SELECT * makes your database work harder and, as a general rule, is bad practice. There are tons of questions/answers on Stack Overflow discussing that.
Have you tried replacing * with the field names?
Could you be getting some kind of locking contention? What kind of locks are you taking when performing these queries?
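One way to check for lock contention is to look for ungranted locks in the standard system views (a minimal sketch; pg_locks and pg_class are built-in PostgreSQL catalogs):

-- Sessions currently waiting on locks, and the relation involved, if any.
SELECT l.pid, l.mode, l.granted, c.relname
FROM pg_locks l
LEFT JOIN pg_class c ON c.oid = l.relation
WHERE NOT l.granted;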
Well, I don't know much about Postgres, so I'll give you a tip for MS SQL Server which might be applicable.
MS SQL Server has the concept of a "clustered index", which is the physical layout of the data on disk. It's good to use on a field where you'll be seeking a range between two values (date fields, mostly). It's not much use if you're looking for an exact value (like a primary key lookup). However, sometimes the primary key index is inadvertently set as a clustered index. This can turn an index lookup into a table scan.
Is the row unusually large, or does it contain BLOBs or large binary fields?
Is this directly through the console, or is the query being run through some data access API like JDBC or ADO.NET? You mention JPA, which is a data access API. For short queries, the data access API becomes a larger percentage of execution time: creating the command, creating objects to hold the rows and cells, and so on.
SELECT * is almost always a very, very bad idea.
If the order of the fields changes, it will break your code.
According to the comments, this isn't really important given the abstraction library you're using.
You're probably returning more data from the table than you actually want. Selecting only the specific fields you want can save transfer time.
25 ms is about the lower bound you're going to see on almost any kind of SQL query; that's only two disk accesses! You might want to look into ways to reduce the number of times the query is run rather than trying to optimize the query itself.

What is the best way to build an index to get the fastest read response?

I need to index up to 500,000 entries for the fastest possible reads. The index needs to be rebuilt periodically, on disk. I am trying to decide between a simple file, like a hash on disk, and a single table in an embedded database. I have no need for an RDBMS engine.
I'm assuming you're referring to indexing tables on a relational DBMS (like MySQL, Oracle, or Postgres).
Indexes are secondary data stores that keep a record of a subset of a table's fields in a specific order.
If you create an index, any query that includes the indexed subset of fields in its WHERE clause will perform faster.
However, adding indexes will reduce INSERT performance.
In general, indexes don't need to be rebuilt unless they become corrupted; they should be maintained on the fly by your DBMS.
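As a minimal illustration (table and column names are hypothetical), that looks like:

-- Secondary index on the column used for lookups.
CREATE INDEX idx_entries_lookup_key ON entries (lookup_key);

-- Queries that filter on the indexed column can now use the index.
SELECT * FROM entries WHERE lookup_key = 'some-value';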
Perhaps BDB? It is a high-performance database that doesn't require an RDBMS engine.
If you're storing state objects by key, how about Berkeley DB?
cdb, if the data does not change.
/Allan
PyTables Pro claims that "for situations that don't require fast updates or deletions, OPSI is probably one of the best indexing engines available". I've not personally used it, but the F/OSS version of PyTables already gives you good performance:
http://www.pytables.org/moin/PyTablesPro
This is what MapReduce was invented for. Hadoop is a cool Java implementation.
If the data doesn't need to be completely up to date, you might also like to think about using a data warehousing tool for OLAP purposes (such as MSOLAP). These can perform lightning-fast read-only queries based on pre-calculated data.