How to efficiently sample a Hive table and run a query? - hive

I want to sample 100 rows from a big_big_table (millions and millions of rows) and run some queries on these 100 rows, mainly for testing purposes.
The way I wrote it runs for a really long time, as if it reads the whole big_big_table and only then applies the LIMIT 100:
WITH sample_table AS (
    SELECT *
    FROM big_big_table
    LIMIT 100
)
SELECT name
FROM sample_table
ORDER BY name;
Question: What's the correct/fast way of doing this?

Check hive.fetch.task.* configuration properties
set hive.fetch.task.conversion=more;
set hive.fetch.task.aggr=true;
set hive.fetch.task.conversion.threshold=1073741824; --1Gbyte
Set these properties before your query and, if you are lucky, it will run without MapReduce. Also consider limiting the query to a single partition.
Whether this works depends on the storage type/SerDe and the file sizes. If the files are small and splittable and the table is native, the query may run fast without starting a MapReduce job.
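For example, with the properties above set, a minimal sketch that limits the read to a single partition (the partition column ds and its value are assumptions - adjust to your table's actual partitioning):
-- With the fetch-task properties set, restrict the scan to one partition
SELECT name
FROM big_big_table
WHERE ds = '2023-01-01'   -- hypothetical partition column and value
LIMIT 100;
If you still need the ORDER BY over the sampled rows, wrap this SELECT in the original CTE and sort the 100-row result there; sorting 100 rows is cheap.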

Optimize joining two big tables ORACLE 19C

How can I optimize the query below:
SELECT A.CNACT, A.FACML, A.LCACT, H.CAECH, H.CMECH, H.MCCMP, H.DAHIS,
       RANK() OVER (PARTITION BY H.CNACT, H.CAECH, H.CMECH ORDER BY H.DAHIS DESC) RK
FROM NATACF A, HISTER H
WHERE A.CNACT = H.CNACT;
select count (*) FROM NATACF; -->74794
select count (*) FROM HISTER; -->2100720
You will find the execution plan in the attachment.
Thank you.
As you can see, the window sort and hash join are not optimised effectively. What is the best way to optimise this?
The screenshot below is from the prod database:
Long story short, you want ALL data from *both* tables - no filtering in place.
Oracle reads the whole smaller (driving) table into a hash map,
using the joining column CNACT as the key.
Then it reads the whole bigger table and performs a lookup in the hash map for each row read.
The complexity is O(N+M); each row is read only once.
There is no way to evaluate such a query any faster (aside from dirty tricks like putting both tables into a CLUSTER, pinning the tables in the buffer cache, ...).
PS: it is strange that the explain plan shows 2 seconds - neither table is actually that big -
while the prod DB says 5 hours.
Try to execute the query in SQL*Plus with:
set timing on
set echo off
set pagesize 0
set termout off
set feedback off
set pause off
set verify off
set heading off
This basically reads the whole result, discards it, and prints only the execution time. Then you will see.
Maybe it is the app (or the network) that has a problem transferring the whole big result set. In such a case you would see the "SQL*Net message to client" wait event in AWR - the database is waiting for the application to accept more data. You are effectively sending about 14 GB of data to the application.
For example, Java can have problems with GC, or each row may be turned into a costly Java object.
We resolved the problem using a WITH clause:
WITH Z AS (
  SELECT X.DATRA, X.COINI, X.COINT, X.NUCPT, N.COBCN, X.CNACT, N.LCACT, N.CNACR,
         DECODE(X.CSOPT, NULL, X.CAECH, O.CAECR) CAECH,
         DECODE(X.CSOPT, NULL, X.CMECH, O.CMECR) CMECH,
         X.CSOPT, X.MTSNA, N.COTSJ, X.CSENS, X.QTCCP, D.TXCHA, R.MAINT, X.CODEV,
         D.CDVRF, C.TYEDI, C.NUSES
  FROM CUMPOR X,
       MRXIDE C, .....
With the WITH clause, the materialized subquery data is persisted for the duration of the query.
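As a rough illustration of the same idea applied to the original NATACF/HISTER query (this is a sketch, not the production query - the RK = 1 filter, i.e. keeping only the most recent history row per key, is an assumption about the business logic):
WITH H_LAST AS (
  SELECT H.CNACT, H.CAECH, H.CMECH, H.MCCMP, H.DAHIS,
         RANK() OVER (PARTITION BY H.CNACT, H.CAECH, H.CMECH ORDER BY H.DAHIS DESC) RK
  FROM HISTER H
)
SELECT A.CNACT, A.FACML, A.LCACT, Z.CAECH, Z.CMECH, Z.MCCMP, Z.DAHIS
FROM NATACF A
JOIN H_LAST Z ON A.CNACT = Z.CNACT
WHERE Z.RK = 1;   -- assumed: only the latest history row per (CNACT, CAECH, CMECH) is needed
If only a fraction of the ranked rows is actually needed, filtering inside the WITH block keeps the joined result set - and the amount of data shipped to the application - much smaller.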

How can I optimize my varchar(max) column?

I'm running SQL Server and I have a table of user profiles which contains columns for the user's personal info and a profile picture.
When setting up the project, I was advised to store the profile image in the database. This seemed OK and worked fine, but now that I'm dealing with real data and querying more rows, the data is taking a lifetime to return.
To pull just the personal data, the query takes one second. To pull the images, I'm looking at upwards of 6 seconds for 5 records.
The column is of type varchar(max) and the size of the data varies. Here's an example of the data lengths:
28171
4925543
144881
140455
25955
630515
439299
1700483
1089659
1412159
6003
4295935
Is there a way to optimize my fetching of this data? My query looks like this:
SELECT *
FROM userProfile
ORDER BY id
Indexing is out of the question due to the data lengths. Should I be looking at compressing the images before storing?
It takes time to return data. Five seconds seems a little long for a few megabytes, but there is overhead.
I would recommend compressing the data if retrieval time is that important. You may be able to retrieve and uncompress the data faster than reading the uncompressed data.
That said, you should not be using select * unless you specifically want the image column. If you are using it in places where the image is not needed, removing it can improve performance. If you want to make this safe for other users, you can add a view without the image column and encourage them to use the view.
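For example, a minimal sketch of such a view (the non-image column names firstName, lastName, and email are assumptions - use the real personal-info columns):
-- Expose everything except the image column (column names are hypothetical)
CREATE VIEW userProfileInfo AS
SELECT id, firstName, lastName, email
FROM userProfile;
GO

-- Callers that do not need the picture can query the view instead
SELECT *
FROM userProfileInfo
ORDER BY id;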
If it is still possible, take one step back: drop the idea of storing images in the table. Instead, save the path in the DB and the image itself in a folder. This is the most efficient approach.
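A minimal sketch of that layout, assuming a hypothetical profileImagePath column and an application that reads and writes the image files itself:
-- Store only a path/filename; the image bytes live on disk (column name is an assumption)
ALTER TABLE userProfile ADD profileImagePath varchar(260) NULL;

-- Queries now return a short string instead of megabytes of image data
SELECT id, profileImagePath
FROM userProfile
ORDER BY id;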
SELECT *
FROM userProfile
ORDER BY id
Do not use *, and why are you using ORDER BY? You can sort in the UI code instead.

How to limit BigQuery query size for testing a query sample through the web user-interface?

I would like to know if it is possible to limit the BigQuery query size when running a query through the web user interface.
My idea is just to test the query, but instead of querying all my tables, I would like to query only a part of them - for instance, a limited number of rows.
LIMIT does not reduce my query cost, so the idea is to find a function similar to "row_number" or "fetch".
Sorry I'm a marketer and not a developer, so thank you in advance for your kind help.
How to limit BigQuery query size for testing ... ?
1 - Try to minimize the number of tables involved in your testing.
In your query there are 60+ tables involved, covering the dates from 2016-12-11 up to today:
SELECT <fields_list> FROM
TABLE_DATE_RANGE([XXX:85801771.ga_sessions_],
TIMESTAMP('20161211'),
TIMESTAMP('20170315'))
Instead you can use the same day as the start and end of the time range, thus drastically reducing the number of involved tables (down to just one) and the overall scan size. For example:
SELECT <fields_list> FROM
TABLE_DATE_RANGE([XXX:85801771.ga_sessions_],
TIMESTAMP('20161211'),
TIMESTAMP('20161211'))
2 - Minimize the number of rows. The ability to do so really depends on how your table is loaded with data. If the table is loaded incrementally, you can use so-called table decorators.
Note - this technique only works within the last 7 days.
For example, the query below will scan only the data that was in the table one hour ago (a so-called snapshot decorator):
SELECT <fields_list> FROM [XXX:85801771.ga_sessions_20170212#-3600000]
This works well with the most recent day's table, especially at the start of the day when the table is not big yet.
To limit the scan further, you can use the version below (a so-called range decorator), which gives you the data added between one hour and half an hour ago:
SELECT <fields_list> FROM [XXX:85801771.ga_sessions_20170212#-3600000--1800000]
Finally, #0 is a special case that references the oldest possible snapshot of the table: either 7 days in the past, or the table's creation time if the table is less than 7 days old. For example
SELECT <fields_list> FROM [XXX:85801771.ga_sessions_20170210#0]
3 - Test against a sampled table. If you expect to experiment with your query again and again, you can first prepare a downsized version of your table with just as many rows as you need, applying whatever sampling logic fits your business logic. To limit the number of rows you can use a LIMIT clause; to get random rows you can use the RAND function (see the sketch below).
Once the sampled table is prepared, run all your queries against it until you have the final version; after that, you can run the query against your original table(s).
And by the way, to create the sampled table you need to set a destination table under the options in the Web UI.
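A minimal sketch of preparing such a sampled table in legacy SQL, following the earlier examples (the 1% fraction and the 10,000-row cap are assumptions - pick whatever sample size fits your testing):
-- Run with a destination table set under Options in the Web UI,
-- e.g. something like XXX:85801771.ga_sessions_sample
SELECT <fields_list>
FROM [XXX:85801771.ga_sessions_20161211]
WHERE RAND() < 0.01   -- keep roughly 1% of the rows (assumed fraction)
LIMIT 10000           -- hard cap on the number of sampled rows (assumed)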

Slow BigQuery Response Time

I don't really consider this a duplicate of this
or this. It might be slightly related anyway. But I wrote a simple query on my data, SELECT * FROM [opendata.openQueryData] LIMIT 1000, against a table that is not even 100,000 rows in total, and it is taking forever. Meanwhile, a similar query on the sample data, SELECT * FROM [publicdata:samples.shakespeare] LIMIT 1000, took just 1.2 s. Is there anything I have to do to achieve this speed on my own data?
Edit
And I just noticed that the number of rows for my data is not showing, unlike the rest of the samples dataset and some datasets I have used before now. Could this be the reason for the slow query response?

Long-running queries: observing partial results?

As part of a data analysis project, I will be issuing some long-running queries on a MySQL database. My future course of action is contingent on the results I obtain along the way. It would be useful for me to be able to view partial results generated by a SELECT statement that is still running.
Is there a way to do this? Or am I stuck with waiting until the query completes to view results which were generated in the very first seconds it ran?
Thank you for any help : )
In the general case, a partial result cannot be produced. For example, if you have an aggregate function with a GROUP BY clause, then all of the data has to be analysed before the first row is returned. A LIMIT clause will not help you, because it is applied after the output is computed. Maybe you can give concrete data and the SQL query?
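To illustrate (the orders table and its columns are made up): even with a LIMIT, a query like the one below has to scan every row and build every group before it can return its first row:
-- MySQL must read all of `orders` and aggregate every group
-- before LIMIT 10 can pick the first rows to return.
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
LIMIT 10;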
One thing you may consider is sampling your tables down. This is good practice in data analysis in general to get your iteration speed up when you're writing code.
For example, suppose you have table-create privileges and some mega-huge table X with key unique_id and some data data_value.
If unique_id is numeric, then in nearly any database
create table sample_table as
select unique_id, data_value
from X
where mod(unique_id, <some_large_prime_number_like_1013>) = 1
will give you a random sample of data to work your queries out on, and you can inner join your sample_table against the other tables to improve the speed of testing / query results. Thanks to the sampling, your query results should be roughly representative of what you will get. Note: the number you're modding with should be prime, otherwise it won't give a representative sample. The example above will shrink your table down to about 0.1% of the original size (0.0987% to be exact).
Most databases also have better sampling and random-number methods than just using mod. Check the documentation to see what's available for your version.
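For instance, a minimal MySQL-flavoured sketch using the built-in RAND() function instead of mod (same hypothetical table X; the 0.1% fraction is an assumption):
-- Keep roughly 0.1% of the rows, chosen pseudo-randomly row by row
CREATE TABLE sample_table AS
SELECT unique_id, data_value
FROM X
WHERE RAND() < 0.001;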
Hope that helps,
McPeterson
It depends on what your query is doing. If it needs to have the whole result set before producing output - as may happen for queries with GROUP BY, ORDER BY, or HAVING clauses - then there is nothing to be done.
If, however, the reason for the delay is client-side buffering (which is the default mode), then it can be adjusted by using "mysql_use_result" as an attribute of the database handle rather than the default "mysql_store_result". This is true for the Perl and Java interfaces; I think in the C interface you have to use the unbuffered variant of the result-retrieval call (mysql_use_result() instead of mysql_store_result()).