I'm kind of new to Hive and Hadoop. I have a query which takes 10 minutes to complete.
The size of the data is 10 GB.
Statistics:Num rows: 4457541 Data size: 1854337449 Basic stats: COMPLETE Column stats: COMPLETE
The table is partitioned and bucketed.
How can I improve the query below?
select * from tbl1 where clmn='Abdul' and loc='IND' and TO_UNIX_TIMESTAMP(ts) > (UNIX_TIMESTAMP() - 5*60*60);
These are the settings I have tried:
set hive.vectorized.execution.reduce.enabled=true;
set hive.tez.container.size=8192;
set hive.fetch.task.conversion = none;
set mapred.compress.map.output=true;
set mapred.output.compress=true;
The explain plan:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| Explain |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| Plan not optimized by CBO. |
| |
| Stage-0 |
| Fetch Operator |
| limit:-1 |
| Stage-1 |
| Map 1 |
| File Output Operator [FS_2973] |
| compressed:false |
| Statistics:Num rows: 49528 Data size: 24516360 Basic stats: COMPLETE Column stats: COMPLETE |
| table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"} |
| Select Operator [SEL_2972] |
| outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"] |
| Statistics:Num rows: 49528 Data size: 24516360 Basic stats: COMPLETE Column stats: COMPLETE |
| Filter Operator [FIL_2971] |
| predicate:((section = 'xysaa') and (to_unix_timestamp(ts) > (unix_timestamp() - 18000))) (type: boolean) |
| Statistics:Num rows: 49528 Data size: 24516360 Basic stats: COMPLETE Column stats: COMPLETE |
| TableScan [TS_2970] |
| ACID table:true |
| alias:pp |
| Statistics:Num rows: 4457541 Data size: 1854337449 Basic stats: COMPLETE Column stats: COMPLETE |
| |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
None of these parameters helped us run the query in a shorter period of time.
According to the plan, the query runs on mappers only and vectorization is not enabled. Try this:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled=true;
Tune mapper parallelism:
set tez.grouping.max-size=67108864;
set tez.grouping.min-size=32000000;
Play with these settings to increase the number of mappers running. Ideally, the query should run without this setting:
set hive.tez.container.size=8192;
One more recommendation is to replace unix_timestamp() with UNIX_TIMESTAMP(current_timestamp). unix_timestamp() is not deterministic and its value is not fixed for the scope of a query execution, which prevents proper optimization of queries; it has been deprecated since Hive 2.0 in favor of the CURRENT_TIMESTAMP constant.
(UNIX_TIMESTAMP(current_timestamp) - 5*60*60)
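With that change, the query from the question might look like this (a sketch; table and column names are taken from the question):
select *
from tbl1
where clmn = 'Abdul'
  and loc = 'IND'
  and to_unix_timestamp(ts) > (unix_timestamp(current_timestamp) - 5*60*60); -- deterministic variant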
Also, your files are very small: each partition is 200-500 MB, split across 12 files of 20-50 MB each. Fortunately the table is stored as ORC, so you can merge the files using the ALTER TABLE ... CONCATENATE command.
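For a single partition it might look like this (a sketch; the partition spec is hypothetical, and CONCATENATE has restrictions, e.g. around ACID tables, so check your version first):
ALTER TABLE tbl1 PARTITION (loc = 'IND') CONCATENATE;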
12 files per partition is not a big deal, though, and you probably will not notice an improvement when querying a single partition.
See also this answer: https://stackoverflow.com/a/48487306/2700344
I have a table generated by chart that lists the results of a compliance scan.
These results are typically Pass, Fail, and Error, but sometimes there is "Unknown" as a response.
I want to show the percentage of each (Pass, Fail, Error, Unknown), so I do the following:
| fillnull value=0 Pass Fail Error Unknown
| eval _total=Pass+Fail+Error+Unknown
<calculate percentages for each field>
<append "%" to each value (Pass, Fail, Error, Unknown)>
What I want to do is eliminate a "totally" empty column, and only display it if it actually exists somewhere in the source data (not merely because of the fillnull command)
Is this possible?
I was thinking something like this, but cannot figure out the second step:
| eventstats max(Unknown) as _unk
| <if _unk is 0, drop the field>
edit
This could just as easily be reworded to:
if every entry for a given field is identical, remove it
Logically, this would look something like:
if(mvcount(values(fieldname))<2), fields - fieldname
Except, of course, that's not valid SPL
Could you try this logic after the chart:
``` fill with null values ```
| fillnull value=null()
``` rotate 90° twice, dropping empty/null columns ```
| transpose 0 include_empty=false | transpose 0 header_field=column | fields - column
[edit:] It works when I do the following, but I am not sure it is easy to make it work under all conditions:
| stats count | eval keep=split("1 2 3 4 5"," ") | mvexpand keep
| table keep nokeep
| fillnull value=null()
| transpose 0 include_empty=false | transpose 0 header_field=column | fields - column
[edit2:] And if you need more null() values, it can be done like this:
| stats count | eval keep=split("1 2 3 4 5"," "), nokeep=0 | mvexpand keep
| table keep nokeep
| foreach nokeep [ eval nokeep=if(nokeep==0,null(),nokeep) ]
| transpose 0 include_empty=false | transpose 0 header_field=column | fields - column
I have recently discovered Google BigQuery and its open datasets. Upon running the following query on the 311_service_requests table in the new_york dataset, the cloud console reports the bytes billed to be 130 MB.
SQL Query:
SELECT unique_key FROM `bigquery-public-data.new_york.311_service_requests` LIMIT 10
Query Returns:
+------+-------------+
| Rows | unique_key |
+------+-------------+
| 1 | 37911459 |
| 2 | 38162601 |
| 3 | 32560181 |
| 4 | 38259076 |
| 5 | 36034528 |
| 6 | 36975822 |
| 7 | 38028455 |
| 8 | 37993135 |
| 9 | 37988664 |
| 10 | 35382611 |
+------+-------------+
For a query returning such a small amount of data, why is the bytes billed valued at 130 MB?
Is there a way to optimize this? Should the results of a query be stored in another database for later retrieval?
why is the bytes billed valued at 130 MB?
Query pricing refers to the cost of running your SQL commands and user-defined functions. BigQuery charges for queries by using one metric: the number of bytes processed (also referred to as bytes read). You are charged for the number of bytes processed whether the data is stored in BigQuery or in an external data source such as Cloud Storage, Google Drive, or Cloud Bigtable.
When you run a query, you're charged according to the total data processed in the columns you select, even if you set an explicit LIMIT on the results. The total bytes per column is calculated based on the types of data in the column. For more information about how we calculate your data size, see Data size calculation.
Query pricing is based on your usage pattern: a monthly flat rate for queries or pricing based on interactive queries. Enterprise customers generally prefer flat-rate pricing for queries because that model offers consistent month-to-month costs. On-demand (or interactive) pricing offers flexibility and is based solely on usage.
You can see more at https://cloud.google.com/bigquery/pricing
So, in your case, 130 MB is the size of the whole unique_key column.
Should the results of a query be stored in another database for later retrieval?
Sure.
You can do so to manage the cost of repeated processing of that small result without touching the original data.
Keep in mind that this will incur storage charges; see the same link above for details.
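For example, a one-time materialization might look like this (a sketch; the destination project, dataset, and table names are hypothetical):
-- pay to scan the column once, then query the much smaller copy afterwards
CREATE TABLE `myproject.mydataset.nyc_311_keys` AS
SELECT unique_key
FROM `bigquery-public-data.new_york.311_service_requests`;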
I have a namespace and a set in aerospike with information inside.
After running SELECT * FROM mytecache.search_results, I get around 600 rows like this:
"bdee9a37e3217f28a057b5e0ebbef43f_Sabre_13" | "a:11:{s:5:"price";a:14:{s:5:"total";d:2947.5999999999999;s:11:"pricePerPax";d:1473.8;s:8:"totalflt";d:2947.5999999999999;s:9:"baseprice";d:1901.0799999999999;s:3:"tax";d:986.51999999999998;s:10:"servicefee";i:60;s:11:"markupprice";d:1961.0799999999999;s | "flight_bdee9a37e3217f28a057b5e0ebbef43f" | 147380 | LIST('["LH", "LH", "LH", "LH"]'....
As you know, the first column is the primary key, so I try to select the row (which just appeared in SELECT) with:
aql> SELECT * FROM mytecache.search_results WHERE PK = "bdee9a37e3217f28a057b5e0ebbef43f_Sabre_13"
and I get:
Error: (2) AEROSPIKE_ERR_RECORD_NOT_FOUND
Explain returns this:
aql> EXPLAIN SELECT * FROM mytecache.search_results WHERE PK="bdee9a37e3217f28a057b5e0ebbef43f_Sabre_13"
+------------------+--------------------------------------------+-------------+-----------+--------+---------+----------+----------------------------+-------------------+-------------------------+---------+
| SET | DIGEST | NAMESPACE | PARTITION | STATUS | UDF | KEY_TYPE | POLICY_REPLICA | NODE | POLICY_KEY | TIMEOUT |
+------------------+--------------------------------------------+-------------+-----------+--------+---------+----------+----------------------------+-------------------+-------------------------+---------+
| "search_results" | "6F62BE5323F0A51EF7DFDD5060A02E62CD2453F0" | "mytecache" | 623 | 2 | "FALSE" | "STRING" | "AS_POLICY_REPLICA_MASTER" | "BB9B25B63000056" | "AS_POLICY_KEY_DEFAULT" | 1000 |
+------------------+--------------------------------------------+-------------+-----------+--------+---------+----------+----------------------------+-------------------+-------------------------+---------+
1 row in set (0.002 secs)
asinfo outputs:
1 : node
BB9B25B63000056
2 : statistics
cluster_size=1;cluster_key=883272F81547;cluster_integrity=true;cluster_is_member=true;uptime=1146;system_free_mem_pct=90;system_swapping=false;heap_allocated_kbytes=6525778;heap_active_kbytes=6529792;heap_mapped_kbytes=6619136;heap_efficiency_pct=99;heap_site_count=14;objects=6191;tombstones=0;tsvc_queue=0;info_queue=0;delete_queue=0;rw_in_progress=0;proxy_in_progress=0;tree_gc_queue=0;client_connections=19;heartbeat_connections=0;fabric_connections=0;heartbeat_received_self=7637;heartbeat_received_foreign=0;reaped_fds=47;info_complete=12326;proxy_retry=0;demarshal_error=0;early_tsvc_client_error=0;early_tsvc_batch_sub_error=0;early_tsvc_udf_sub_error=0;batch_index_initiate=0;batch_index_queue=0:0,0:0,0:0,0:0;batch_index_complete=0;batch_index_error=0;batch_index_timeout=0;batch_index_unused_buffers=0;batch_index_huge_buffers=0;batch_index_created_buffers=0;batch_index_destroyed_buffers=0;batch_initiate=0;batch_queue=0;batch_error=0;batch_timeout=0;scans_active=0;query_short_running=0;query_long_running=0;sindex_ucgarbage_found=0;sindex_gc_locktimedout=0;sindex_gc_list_creation_time=39;sindex_gc_list_deletion_time=0;sindex_gc_objects_validated=11734;sindex_gc_garbage_found=0;sindex_gc_garbage_cleaned=0;paxos_principal=BB9B25B63000056;migrate_allowed=true;migrate_partitions_remaining=0;fabric_bulk_send_rate=0;fabric_bulk_recv_rate=0;fabric_ctrl_send_rate=0;fabric_ctrl_recv_rate=0;fabric_meta_send_rate=0;fabric_meta_recv_rate=0;fabric_rw_send_rate=0;fabric_rw_recv_rate=0
3 : features
peers;cdt-list;cdt-map;pipelining;geo;float;batch-index;replicas-all;replicas-master;replicas-prole;udf
4 : cluster-generation
1
5 : partition-generation
0
6 : build_time
Sat Nov 4 02:19:49 UTC 2017
7 : edition
Aerospike Community Edition
8 : version
Aerospike Community Edition build 3.15.0.2
9 : build
3.15.0.2
10 : services
11 : services-alumni
12 : build_os
debian8
This seems wrong. Why is the query not returning the record when it is clearly there?
"As you know, the first column is the primary key," -- no first column is your first bin. The primary key is whatever you used to write the record. Aerospike does not store your primary key.
You are quite likely using a wrong PK for the record whose first bin is "bdee9a37e3217f28a057b5e0ebbef43f_Sabre_13"
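A quick way to convince yourself is to write a record with a known key and read it back (a sketch in AQL; the key and bin name are hypothetical):
INSERT INTO mytecache.search_results (PK, testbin) VALUES ('my_known_key', 'some value')
SELECT * FROM mytecache.search_results WHERE PK = 'my_known_key'
The second statement finds the record because the PK used for reading matches the one used for writing.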
https://www.aerospike.com/docs/tools/aql/data_management.html
“STATUS” is the status of the operation. It will be the code as returned by the client for the record. E.g: 0 is AEROSPIKE_OK and 2 is AEROSPIKE_ERR_RECORD_NOT_FOUND.
So even your EXPLAIN execution in AQL is telling you, via STATUS = 2, that the record is not found.
I have a column table with a single column.
I would like to create a table type whose column names are the elements in the column of the above-mentioned table, with a fixed datatype and size, and use it in a function,
similar to this:
Dynamic creation of table in tsql
Any suggestions would be appreciated.
EDIT:
To finish a product, a machine has to perform different jobs on the material with different tools.
I have a list of jobs a machine can perform and a list of tools: a specific tool for a specific job.
Each job needs a specific tool and a number of hours (to change the tool once it reaches its change time). A job can be performed many times on a product (in this case, if a job is performed for 1 hour, the tool has been used for 1 hour).
For each product, a set of tools will be at work in a sequence, so I need a report showing, for each product, the number of hours each tool has worked.
EDIT 2:
Product table
+-----------+-------+
| ProductID | Jobs  |
+-----------+-------+
| 1         | job1  |
| 1         | job2  |
| 1         | job3  |
| 1         | ...   |
| 1         | 100th |
| 2         | job1  |
| 2         | ...   |
| 2         | 200th |
+-----------+-------+
Jobs table
+-------+---------+--------------+
| Jobs  | Tool    | Time (hours) |
+-------+---------+--------------+
| job1  | tool 10 | 2            |
| job1  | tool 09 | 1            |
| job2  | tool 11 | 4            |
| job3  | tool 17 | 0.5          |
+-------+---------+--------------+
required report (this table does not physically exist)
+-----------+------+------+------+------+------+-----+
| ProductID | job1 | job2 | job3 | job4 | job5 | ... |
+-----------+------+------+------+------+------+-----+
| 1         | 20   | 10   | 5    | .    | .    | .   |
| 2         | 10   | 13   | 5    | .    | .    | .   |
+-----------+------+------+------+------+------+-----+
Based on the added information, there are two main requirements here:
You want to sum up the time spent producing each product, grouped by the jobs involved
and
You want to have a cross-table report showing the times from step 1 against products and jobs.
For the first part, you could probably do this with a query like this:
SELECT
p.product_id,
j.jobs,
SUM(j.time) as SUM_TIME
FROM
products p
INNER JOIN jobs j
ON p.jobs = j.jobs
GROUP BY
p.product_id,
j.jobs;
For the second part: this is usually called a PIVOT report.
SAP HANA does not provide a dynamic SQL command for generating output in this form (other DBMSs do).
However, this dynamic transformation is usually relevant for the data presentation and not so much for the processing.
So, as you probably want to use some form of front end for this report (e.g. MS Excel, Crystal Reports, Business Objects X, Tableau, ...) I would recommend doing the transformation and formatting in the frontend report. Look for "PIVOT" or "CROSSTAB" options to do that.
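That said, if the set of jobs is small and stable, a static pivot with CASE expressions works in plain SQL (a sketch building on the query above; extend the column list with one CASE per job):
SELECT
  p.product_id,
  SUM(CASE WHEN j.jobs = 'job1' THEN j.time ELSE 0 END) AS job1,
  SUM(CASE WHEN j.jobs = 'job2' THEN j.time ELSE 0 END) AS job2,
  SUM(CASE WHEN j.jobs = 'job3' THEN j.time ELSE 0 END) AS job3
FROM products p
INNER JOIN jobs j
  ON p.jobs = j.jobs
GROUP BY p.product_id;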
In America’s Cup yachting, we generate large datasets where at every time-stamp (e.g. 100Hz) we need to store maybe 100-1000 channels of sensor data (e.g. speed, loads, pressures). We store this in MS SQL Server and need to be able to retrieve subsets of channels of the data for analysis, and perform queries such as the maximum pressure on a particular sensor in a test, or over an entire season.
The set of channels to be stored stays the same for several thousand time-stamps, but day-to-day will change as new sensors are added, renamed, etc... and depending on testing, racing or simulating, the number of channels can vary greatly.
The textbook way to structure the SQL tables would probably be:
OPTION 1
ChannelNames
+-----------+-------------+
| ChannelID | ChannelName |
+-----------+-------------+
| 50 | Pressure |
| 51 | Speed |
| ... | ... |
+-----------+-------------+
Sessions
+-----------+---------------+-------+----------+
| SessionID | Location | Boat | Helmsman |
+-----------+---------------+-------+----------+
| 789 | San Francisco | BoatA | SailorA |
| 790 | San Francisco | BoatB | SailorB |
| ... | ... | ... | |
+-----------+---------------+-------+----------+
SessionTimestamps
+-------------+-------------+------------------------+
| SessionID | TimestampID | DateTime |
+-------------+-------------+------------------------+
| 789 | 12345 | 2013/08/17 10:30:00:00 |
| 789 | 12346 | 2013/08/17 10:30:00:01 |
| ... | ... | ... |
+-------------+-------------+------------------------+
ChannelData
+-------------+-----------+-----------+
| TimestampID | ChannelID | DataValue |
+-------------+-----------+-----------+
| 12345 | 50 | 1015.23 |
| 12345 | 51 | 12.23 |
| ... | ... | ... |
+-------------+-----------+-----------+
This structure is neat but inefficient. Each DataValue requires three storage fields, and at each time-stamp we need to INSERT 100-1000 rows.
If we always had the same channels, it would be more sensible to use one row per time-stamp and structure like this:
OPTION 2
+-----------+------------------------+----------+-------+----------+--------+-----+
| SessionID | DateTime | Pressure | Speed | LoadPt | LoadSb | ... |
+-----------+------------------------+----------+-------+----------+--------+-----+
| 789 | 2013/08/17 10:30:00:00 | 1015.23 | 12.23 | 101.12 | 98.23 | ... |
| 789 | 2013/08/17 10:30:00:01 | 1012.51 | 12.44 | 100.33 | 96.82 | ... |
| ... | ... | ... | | | | |
+-----------+------------------------+----------+-------+----------+--------+-----+
However, the channels change every day, and over the months the number of columns would grow and grow, with most cells ending up empty. We could create a new table for every new Session, but it doesn’t feel right to be using a table name as a variable, and would ultimately result in tens of thousands of tables – also, it becomes very difficult to query over a season, with data stored in multiple tables.
Another option would be:
OPTION 3
+-----------+------------------------+----------+----------+----------+----------+-----+
| SessionID | DateTime | Channel1 | Channel2 | Channel3 | Channel4 | ... |
+-----------+------------------------+----------+----------+----------+----------+-----+
| 789 | 2013/08/17 10:30:00:00 | 1015.23 | 12.23 | 101.12 | 98.23 | ... |
| 789 | 2013/08/17 10:30:00:01 | 1012.51 | 12.44 | 100.33 | 96.82 | ... |
| ... | ... | ... | | | | |
+-----------+------------------------+----------+----------+----------+----------+-----+
with a look-up from Channel column IDs to channel names – but this requires an EXEC or eval to execute a pre-constructed query to obtain the channel we want – because SQL isn’t designed to have column names as variables. On the plus side, we can re-use columns when channels change, but there will still be many empty cells because the table has to be as wide as the largest number of channels we ever encounter. Using a SPARSE table may help here, but I am uncomfortable with the EXEC/eval issue above.
What is the right solution to this problem, that achieves efficiency of storage, inserts and queries?
I would go with Option 1.
Data integrity is first, optimization (if needed) - second.
Other options would eventually have a lot of NULL values and other problems stemming from not being normalized. Managing the data and making efficient queries would be difficult.
Besides, there is a limit on the number of columns that a table can have - 1024, so if you have 1000 sensors/channels you are already dangerously close to the limit. Even if you make your table a wide table, which allows 30,000 columns, still there is a limitation on the size of the row in a table - 8,060 bytes per row. And there are certain performance considerations.
I would not use wide tables in this case, even if I was sure that the data for each row would never exceed 8060 bytes and growing number of channels would never exceed 30,000.
I don't see a problem with inserting 100-1000 rows in Option 1 vs one row in the other options. To do such an INSERT efficiently, don't issue 1000 individual INSERT statements; do it in bulk. In various places in my system I use the following two approaches:
1) Build one long INSERT statement
INSERT INTO ChannelData (TimestampID, ChannelID, DataValue) VALUES
(12345, 50, 1015.23),
(12345, 51, 12.23),
...
(), (), (), (), ........... ();
that contains 1000 rows, and execute it as a normal INSERT in one transaction rather than 1000 transactions (check the syntax details).
2) Have a stored procedure that accepts a table-valued parameter. Call such procedure passing 1000 rows as a table.
CREATE TYPE [dbo].[ChannelDataTableType] AS TABLE(
[TimestampID] [int] NOT NULL,
[ChannelID] [int] NOT NULL,
[DataValue] [float] NOT NULL
)
GO
CREATE PROCEDURE [dbo].[InsertChannelData]
-- Add the parameters for the stored procedure here
@ParamRows dbo.ChannelDataTableType READONLY
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
BEGIN TRANSACTION;
BEGIN TRY
INSERT INTO [dbo].[ChannelData]
([TimestampID],
[ChannelID],
[DataValue])
SELECT
TT.[TimestampID]
,TT.[ChannelID]
,TT.[DataValue]
FROM
@ParamRows AS TT
;
COMMIT TRANSACTION;
END TRY
BEGIN CATCH
ROLLBACK TRANSACTION;
THROW; -- re-throw so the caller sees the failure instead of it being silently swallowed (SQL Server 2012+)
END CATCH;
END
GO
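Calling the procedure might then look like this (a usage sketch; the sample values are from the INSERT example above):
-- fill a table variable of the table type and pass it in one call
DECLARE @Rows dbo.ChannelDataTableType;

INSERT INTO @Rows (TimestampID, ChannelID, DataValue)
VALUES
    (12345, 50, 1015.23),
    (12345, 51, 12.23);

EXEC [dbo].[InsertChannelData] @ParamRows = @Rows;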
If possible, accumulate data from several timestamps before inserting to make the batches larger. Try it with your system and find the optimal size of the batch; I have batches of around 10K rows using the stored procedure.
If you have your data coming from sensors 100 times a second, then I would at first dump the incoming raw data into some very simple CSV file(s) and have a parallel background process that inserts it into the database in chunks. In other words, keep a buffer for incoming data, so that if the server can't cope with the incoming volume you don't lose your data.
Based on your comments that some channels are likely to be more interesting and queried several times while others are less interesting, here is one optimization I would consider. In addition to having one table ChannelData for all channels, have another table InterestingChannelData. ChannelData would have the whole set of data, just in case; InterestingChannelData would have a subset only for the most interesting channels. It should be much smaller and should take less time to query. In any case, this is an optimization (denormalization/data duplication) built on top of a properly normalized structure.
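A sketch of how that copy could be maintained (which channel IDs count as "interesting" is hypothetical here):
-- duplicate only the frequently queried channels into the smaller table
INSERT INTO InterestingChannelData (TimestampID, ChannelID, DataValue)
SELECT TimestampID, ChannelID, DataValue
FROM ChannelData
WHERE ChannelID IN (50, 51);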
Is your process like this:
Generate data during the day
Analyse data afterwards
If these are separate activities then you might want to consider using different 'insert' and 'select' schemas. You could create a schema that's fast for inserting on the boat, then afterwards batch upload this data into an analysis-optimised schema. This requires a transformation step (where, for example, you map generic column names into useful column names).
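The transformation step could be as simple as an INSERT ... SELECT that renames the generic columns (a sketch; the schema and column names are hypothetical):
-- map generic staging columns to meaningful analysis columns
INSERT INTO analysis.SessionData (SessionID, DateTime, Pressure, Speed)
SELECT SessionID, DateTime, Channel1, Channel2
FROM staging.RawChannelData;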
This is along the lines of data warehousing and data marts. In this kind of design, you batch load and optimise the schema for reporting. Does your current daily upload have much of a window?