Spark JDBC read tuning where the table has no primary key - apache-spark-sql

I am reading 30M records from an Oracle table that has no primary key columns.
The Spark JDBC read hangs and does not fetch any data, while the same query returns results in Oracle SQL Developer within a few seconds.
oracleDf = hiveContext.read().format("jdbc")
    .option("url", url)
    .option("dbtable", queryToExecute)
    .option("numPartitions", "5")
    .option("fetchSize", "1000000")
    .option("user", use)
    .option("password", pwd)
    .option("driver", driver)
    .load()
    .repartition(5);
I cannot use a partition column because I do not have a primary key column.
Can anyone advise how to improve performance?
Thanks

There are many things you can do to optimize your DataFrame creation. You might want to drop the repartition and use predicates to parallelize the data retrieval for Spark actions.
If the filter is not based on a primary key or an indexed column, exploring ROWID is a possibility; see the sketch below.
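For example, here is a minimal sketch of the predicate approach in PySpark. The connection details, table name, and the MOD(ORA_HASH(ROWID), n) bucketing expression are illustrative assumptions, not taken from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-jdbc-read").getOrCreate()

# Hypothetical connection details - replace with your own.
url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
props = {
    "user": "my_user",
    "password": "my_password",
    "driver": "oracle.jdbc.OracleDriver",
    "fetchsize": "10000",
}

# One predicate per partition: each Spark task reads a disjoint bucket of ROWIDs.
num_buckets = 5
predicates = [f"MOD(ORA_HASH(ROWID), {num_buckets}) = {i}" for i in range(num_buckets)]

oracle_df = spark.read.jdbc(
    url=url,
    table="MY_SCHEMA.MY_TABLE",  # or a parenthesized subquery, as in the question
    predicates=predicates,
    properties=props,
)

Each predicate becomes its own JDBC connection and DataFrame partition, so the rows are pulled in parallel without needing a numeric partition column.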

Related

How to create index in BigQuery on a jsonPayload field of a GCP log?

I'm exporting my GCP logs to BigQuery to view them better. From here, it seems that fields in the jsonPayload of my logEntry are turned into jsonPayload.<field> columns in the BigQuery table. One of the fields in the jsonPayload is an internal log ID.
Since BigQuery now allows adding indexes to tables, I want to add an index on this jsonPayload.logId field for improved query performance. How do I do that? Or should I rely only on partitioning and/or clustering for better performance?

Cassandra data modeling blob

I am thinking of using Cassandra to store my data. I have a server_id, start_time, end_time, and messages_blob.
CREATE TABLE messages (
    server_id uuid,
    start bigint,
    end bigint,
    messages_blob blob,
    PRIMARY KEY ((server_id), start, end)
) WITH CLUSTERING ORDER BY (start ASC, end ASC);
I have two types of queries:
Get all server_ids and messages_blob values where start time > 100 and start time < 300.
Get all messages_blobs for a bunch of server_ids at a time.
Can the above schema help me do this? I need to put billions of records into this table very quickly and do the reads after all inserts have happened. The read queries are not too many compared to the writes, but I need the data back as quickly as possible.
With this table structure you can only execute the 2nd query - you'll just need to execute a query for every single server_id separately, preferably via the async API, as in the sketch below.
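A minimal sketch of that per-partition fan-out with the Python driver (the contact point, keyspace name, and the way you obtain the list of server_ids are assumptions, not part of the question):

from cassandra.cluster import Cluster

# Assumed contact point and keyspace.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

stmt = session.prepare(
    "SELECT server_id, start, end, messages_blob FROM messages WHERE server_id = ?")

server_ids = []  # fill with the UUIDs you want to fetch

# One async query per partition; collect the results as the futures complete.
futures = [session.execute_async(stmt, (sid,)) for sid in server_ids]
rows = [row for future in futures for row in future.result()]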
For the 1st query, this table structure won't work, as Cassandra needs to know the partition key (server_id) to perform the query - otherwise it requires a full scan that will time out once you have enough data in the table.
To execute this query you have several choices.
Add another table that has start as the partition key, and store there the primary keys of the records in the first table. Something like this:
CREATE TABLE lookup (
    start bigint,
    server_id uuid,
    end bigint,
    PRIMARY KEY (start, server_id, end)
);
This will require that you write data into 2 tables, or you may be able to use a materialized view for this task (although that could be problematic if you use OSS Cassandra, as it still has plenty of bugs there). But you'll need to be careful with the partition sizes of that lookup table.
Use Spark for scanning the table - because you have start as the first clustering column, Spark will be able to perform predicate pushdown and the filtering will happen inside Cassandra. But it will be much slower than using the lookup table.
Also, be very careful with blobs - Cassandra doesn't work well with big blobs, so if you have blobs larger than 1 MB you'll need to split them into multiple pieces, or (better) store them on a file system or some other storage such as S3, and keep only the metadata in Cassandra.
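If you do split large blobs, a minimal sketch of the chunking step could look like this (the ~1 MB threshold comes from the advice above; the chunked table layout and names are assumptions):

CHUNK_SIZE = 1_000_000  # stay under roughly 1 MB per piece

def split_blob(blob: bytes, chunk_size: int = CHUNK_SIZE):
    # Yield (chunk_index, chunk_bytes) pairs for a hypothetical chunked table,
    # e.g. PRIMARY KEY ((server_id), start, end, chunk_index).
    for offset in range(0, len(blob), chunk_size):
        yield offset // chunk_size, blob[offset:offset + chunk_size]

Each chunk is written as its own row, and the pieces are reassembled in clustering order on read.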

Make existing bigquery table clustered

I have quite a huge existing partitioned table in BigQuery. I want to make the table clustered, at least for new partitions.
From the documentation: https://cloud.google.com/bigquery/docs/creating-clustered-tables, it says that you can create a clustered table when you load data, and I have tried to load a new partition using clustering fields: job_config.clustering_fields = ["event_type"].
The load finished successfully; however, it seems that the new partition is not clustered (I am not really sure how to check whether it is clustered or not, but when I query that particular partition it always scans all rows).
Is there a good way to add a clustering field to an existing partitioned table?
Any comment, suggestion, or answer is well appreciated.
Thanks a lot,
Yosua
BigQuery supports changing an existing non-clustered table to a clustered table and vice versa. You can also update the set of clustered columns of a clustered table.
You can change the clustering specification in the following ways:
Call the tables.update or tables.patch API method.
Call the bq command-line tool's bq update command with the --clustering_fields flag.
Reference
https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec
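For example, the API route (tables.update / tables.patch) looks roughly like this with the Python client library; the table id is illustrative and event_type is the clustering column from the question:

from google.cloud import bigquery

client = bigquery.Client()

table = client.get_table("my_dataset.my_table")  # illustrative table id
table.clustering_fields = ["event_type"]

# Send only the clustering specification in the update request.
client.update_table(table, ["clustering_fields"])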
This answer is no longer valid / correct
https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec
You can only specify clustering columns when a table is created
So, obviously, you cannot expect an existing non-clustered table (and especially just its new partitions) to become clustered.
The "workaround" is to create a new table that is properly partitioned and clustered, and to load data into it from Google Cloud Storage (GCS). You can export the data from the original table into GCS first, so the whole process is free of charge.
What I missed from the above answers was a real example, so here it goes:
bq update --clustering_fields=tool,qualifier,user_id my_dataset.my_table
Where tool, qualifier and user_id are the three columns I want the table to be clustered by (in that order) and the table is my_dataset.my_table.

Records updating based on primary key is slow

I'm updating 60k records out of 200k records using Informatica, based on only one primary key. Still, it is running for a long time. Is there a way to reduce the time, given that we cannot create an index on the primary key again (and it should not be necessary)?
60k updated rows out of 200k total is a pretty high hit rate. An indexed read would be extremely inefficient compared to a Full Table Scan. So really you don't want to use the primary key index to do an update like this.
However it's difficult to provide more assistance unless you can post the exact query that's being executed, preferably with its explain plan.
Most likely your target table has multiple indexes defined on it (regardless of how many keys you are using to do the update), and there might also be multiple foreign keys which need to be resolved against their related tables. Ignore Informatica for a minute, try running the update directly on the database, and resolve the performance issue there first.
The best way for this is deleting and inserting; it is a faster alternative and it works.

Why is Select Count(*) slower than Select * in hive

When I run queries in the VirtualBox sandbox with Hive, SELECT COUNT(*) feels much slower than SELECT *.
Can anyone explain what is going on behind the scenes, and why this delay happens?
select * from table
can be a map-only job, but
Select Count(*) from table
will be a map and reduce job.
Hope this helps.
There are three types of operations that a hive query can perform.
In order from cheapest and fastest to more expensive and slower, here they are.
A hive query can be a metadata only request.
Show tables, describe table are examples. In these queries the hive process performs a lookup in the metadata server. The metadata server is a SQL database, probably MySQL, but the actual DB is configurable.
A hive query can be an hdfs get request.
Select * from table would be an example. In this case hive can return the results by performing an hdfs operation - hadoop fs -get, more or less.
A hive query can be a Map Reduce job.
Hive has to ship the jar to hdfs, the jobtracker queues the tasks, the tasktracker execute the tasks, the final data is put into hdfs or shipped to the client.
The Map Reduce job has different possibilities as well.
It can be a Map only job.
Select * from table where id > 100, for example; all of that logic can be applied in the mapper.
It can be a Map and Reduce job,
Select min(id) from table;
Select * from table order by id ;
It can also lead to multiple map-reduce passes, but I think the above summarizes some of the behaviors.
This is because the DB is using clustered primary keys, so the query searches each row for the key individually, row by agonizing row, not from an index.
Run optimize table. This will ensure that the data pages are physically stored in sorted order. This could conceivably speed up a range scan on a clustered primary key.
Create an additional non-primary index on just the change_event_id column. This will store a copy of that column in index pages, which will be much faster to scan. After creating it, check the explain plan to make sure it's using the new index.