I have two tables in Apache Hive. The first is called traffic_violations and the second is called cars.
In traffic_violations I have a column called fatal with values "Yes" or "No"; I set this column as a STRING. I join the two tables on an id.
So, I have this query:
select gender, fatal, substr(date_of_stop,7,10) as year, make
from traffic_violations t
join cars c on t.id = c.id
where fatal = "Yes"
group by date_of_stop, make, gender, fatal
If I remove the WHERE clause, the query works, but with this clause it returns nothing.
Hive prints this message:
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 1809950422 HDFS Write: 0 SUCCESS
Stage-Stage-2: HDFS Read: 797286840 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 31.228 seconds
but Hive doesn't print the result of this query.
How do I resolve this problem?
Thanks to all!
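As a diagnostic sketch (assuming the mismatch comes from stray whitespace or casing in the data, which is a common cause when a Yes/No STRING filter matches zero rows), you can inspect the raw values directly:

-- hedged diagnostic: list the distinct raw values of fatal with their lengths;
-- a trailing space or different casing would make fatal = "Yes" match nothing
select fatal, length(fatal), count(*)
from traffic_violations
group by fatal, length(fatal);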
I was thinking of adding a LIMIT 0 to BigQuery queries and using this together with dbt to check that the whole DAG, its dependencies, and so on are correct without incurring any costs.
I'm not finding any official documentation stating whether this is the case. Are those queries not billed?
Correct, this will not bill any data. You can run a dry run to verify:
dzagales@cloudshell:~ (elzagales)$ bq query --use_legacy_sql=false --dry_run 'SELECT * FROM `bigquery-public-data.austin_311.311_service_requests` LIMIT 0'
Query successfully validated. Assuming the tables are not modified, running this query will process 0 bytes of data.
dzagales@cloudshell:~ (elzagales)$ bq query --use_legacy_sql=false --dry_run 'SELECT * FROM `bigquery-public-data.austin_311.311_service_requests` LIMIT 1'
Query successfully validated. Assuming the tables are not modified, running this query will process 254787 bytes of data.
Above you can see that a LIMIT 0 bills 0 bytes, while a LIMIT 1 will scan the whole table.
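If you want to confirm after the fact what a job actually billed, BigQuery's INFORMATION_SCHEMA job views expose the billed bytes per job; a minimal sketch, assuming your data lives in the US region (adjust the `region-us` qualifier otherwise):

-- hedged sketch: list recent jobs with their processed and billed bytes;
-- `region-us` is an assumed region qualifier, change it to match your dataset
SELECT job_id, query, total_bytes_processed, total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC;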
Currently, I am using Hive with S3 storage.
I have about 1,000,000 partitions in total right now, and I am facing a problem.
If I do:
select sum(metric) from foo where pt_partition_number = 'bar1'
select sum(metric) from foo where pt_partition_number = 'bar2'
each query's execution time is less than 1 second.
But if I do:
select sum(metric) from foo where pt_partition_number IN ('bar1','bar2')
the query takes about 30 seconds. I think Hive is doing a directory scan in the case of the second query.
Is there a way to optimize the query? My request pattern always accesses the data of two partitions.
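One workaround worth trying (a hedged sketch, not a confirmed fix): keep each partition access as its own point lookup and combine the results with UNION ALL, so partition pruning stays an exact key match instead of a listing over the partition set:

-- hedged sketch: two single-partition scans combined, then summed;
-- the result matches sum(metric) over both partitions
select sum(m) from (
  select sum(metric) as m from foo where pt_partition_number = 'bar1'
  union all
  select sum(metric) as m from foo where pt_partition_number = 'bar2'
) t;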
I have a day-partitioned table with approximately 300k rows in the streaming buffer. When running an interactive, non-cached, standard SQL query using
SELECT .. FROM .. WHERE _PARTITIONTIME IS NULL
The query validator says:
Valid: This query will process 0 B when run.
And after executing, the job information tab says:
Bytes Processed 0 B
Bytes Billed 0 B
The query is certainly returning real-time results each time I run it. Is this actually a free operation?
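For reference, the full shape of such a query (a minimal sketch; myproject.mydataset.mytable is a placeholder name):

-- hedged sketch: rows still in the streaming buffer of an ingestion-time
-- partitioned table have a NULL _PARTITIONTIME
SELECT *
FROM `myproject.mydataset.mytable`
WHERE _PARTITIONTIME IS NULL;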
I just installed Presto, and when I use the presto-cli to query Hive data, I get the following error:
~$ presto --catalog hive --schema default
presto:default> select count(*) from test3;
Query 20171213_035723_00007_3ktan, FAILED, 1 node
Splits: 131 total, 14 done (10.69%)
0:18 [1.04M rows, 448MB] [59.5K rows/s, 25.5MB/s]
Query 20171213_035723_00007_3ktan failed: com.facebook.presto.hive.$internal.org.codehaus.jackson.JsonParseException: Invalid UTF-8 start byte 0xa5
at [Source: java.io.ByteArrayInputStream@6eb5bdfd; line: 1, column: 376]
The error only happens if I use an aggregate function such as count, sum, etc.
But when I use the same query in the Hive CLI, it works (though it takes a lot of time, since it converts the query into a MapReduce job).
$ hive
WARNING: Use "yarn jar" to launch YARN applications.
Logging initialized using configuration in file:/etc/hive/2.4.2.0-258/0/hive-log4j.properties
hive> select count(*) from test3;
...
MapReduce Total cumulative CPU time: 17 minutes 56 seconds 600 msec
Ended Job = job_1511341039258_0024
MapReduce Jobs Launched:
Stage-Stage-1: Map: 87 Reduce: 1 Cumulative CPU: 1076.6 sec HDFS Read: 23364693216 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 17 minutes 56 seconds 600 msec
OK
51751422
Time taken: 269.143 seconds, Fetched: 1 row(s)
The point is that the same query works on Hive but not on Presto, and I could not figure out why. I suspect it is because the JSON libraries used on Hive and on Presto are different, but I'm not really sure.
I created the external table on Hive with the query:
hive> create external table test2 (app string, contactRefId string, createdAt struct <`date`: string, timezone: string, timezone_type: bigint>, eventName string, eventTime bigint, shopId bigint) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE LOCATION '/data/data-new/2017/11/29/';
Can anyone help me with this?
Posting this here for easy reference, from where the OP documented a solution:
I successfully fixed the problem by using this serde: https://github.com/electrum/hive-serde (add it to Presto at /usr/lib/presto/plugin/hive-hadoop2/ and to the Hive cluster at /usr/lib/hive-hcatalog/share/hcatalog/).
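An alternative that may help in similar cases (an assumption on my part, not the fix the OP used): the OpenX JSON SerDe can be configured to skip malformed records instead of failing, which matters when a handful of rows contain invalid bytes:

-- hedged sketch: requires the OpenX SerDe jar to be installed on the cluster;
-- 'ignore.malformed.json' = 'true' turns unparseable rows into NULLs
-- instead of failing the whole query (the table name is hypothetical)
create external table test3_openx (app string, contactRefId string, createdAt struct <`date`: string, timezone: string, timezone_type: bigint>, eventName string, eventTime bigint, shopId bigint)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
STORED AS TEXTFILE LOCATION '/data/data-new/2017/11/29/';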
I am querying a Hive table which contains around 10M rows. Earlier, the query used to finish quickly on Hive 0.14. Then I moved to Hive 1.2.1 and now it is not starting the MR job:
hive> select count(1) from nodes;
Query ID = lagvankarh_20160608221653_5dd82f87-3527-4eb6-9a59-f11ccaf0a125
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
I get a SocketTimeout error after a long time.
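One way to narrow this down (a hedged diagnostic sketch, not a confirmed fix): run a query that Hive can answer as a simple fetch task, with no MapReduce job submission; if that returns while count(1) hangs, the problem is in job submission to the cluster rather than in the table itself:

-- hedged diagnostic: with fetch conversion enabled, a plain select/limit
-- runs locally as a fetch task, bypassing MapReduce entirely
set hive.fetch.task.conversion=more;
select * from nodes limit 5;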