How to compare BIGINT in Pig

In my table commit_time is a BIGINT and the stored values look like 20190508143744.
Comparing with commit_time > 1000 works without error, but when I try commit_time > 20190508143743 it fails with the error below:
2019-05-29 17:35:38,390 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: For input string: "20190508143743"
Steps:
pig -useHCatalog
custProf = LOAD 'alisy3p.cust_change' using org.apache.hive.hcatalog.pig.HCatLoader();
-- this step gives error
deviceChange= filter custProf by (commit_time > 20190508143743);
Also tried:
deviceChange= filter custProf by (commit_time > (bigint)20190508143743);
deviceChange= filter custProf by (commit_time > (long)20190508143743);

According to the Pig documentation, you should be able to specify a biginteger constant by adding BI to the end of the number. Try this:
deviceChange = filter custProf by (commit_time > 20190508143743BI);

Answer:
deviceChange= filter custProf by (commit_time > 20190508143743L);
The BI suffix (biginteger) is not supported here, and Pig's BIGINTEGER is a different type anyway: per the HCatalog documentation, Hive's BIGINT maps to Pig's long, so the constant takes the L suffix.
https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore#HCatalogLoadStore-DataTypeMappings
https://pig.apache.org/docs/r0.17.0/basic.html#constants
Thanks to Savagedata for his input!
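For reference, a minimal sketch of the corrected session (table and field names taken from the question; untested):
custProf = LOAD 'alisy3p.cust_change' USING org.apache.hive.hcatalog.pig.HCatLoader();
-- Hive BIGINT arrives in Pig as long, so the constant needs the L suffix, not BI:
deviceChange = FILTER custProf BY (commit_time > 20190508143743L);
DUMP deviceChange;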

Related

Cannot define a BigQuery column as ARRAY<STRUCT<INT64, INT64>>

I am trying to define a table that has a column that is an array of structs using standard SQL. The docs here suggest this should work:
CREATE OR REPLACE TABLE ta_producer_conformed.FundStaticData
(
id STRING,
something ARRAY<STRUCT<INT64,INT64>>
)
but I get an error:
$ bq query --use_legacy_sql=false --location=asia-east2 "$(cat xxxx.ddl.temp.sql | awk 'ORS=" "')"
Waiting on bqjob_r6735048b_00000173ed2d9645_1 ... (0s) Current status: DONE
Error in query string: Error processing job 'xxxxx-10843454-yyyyy-
dev:bqjob_r6735048b_00000173ed2d9645_1': Illegal field name:
Changing the field (edit: column!) name does not fix it. What am I doing wrong?
The fields within the struct need to be named, so this works:
CREATE OR REPLACE TABLE ta_producer_conformed.FundStaticData
(
id STRING,
something ARRAY<STRUCT<x INT64,y INT64>>
)
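Rows can then be inserted with named-field struct literals, along these lines (a hypothetical INSERT, assuming the corrected table above exists):
INSERT INTO ta_producer_conformed.FundStaticData (id, something)
VALUES ('fund-1', [STRUCT(1 AS x, 2 AS y), STRUCT(3 AS x, 4 AS y)]);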

BigQuery table not found error on location

I'm trying to run the script below, but I keep getting an error that the table isn't found. The problem is caused by $date in the SELECT query; how do I fix this?
My goal is to copy the table from another dataset with data matching based on the date.
#!/bin/bash
date="20160805"
until [[ $date > 20160807 ]];
do
bq query --use_legacy_sql=false --destination_table="google_analytics.ga_sessions_${date}" 'SELECT g.* FROM `10241241.ga_sessions_$date` g, UNNEST (hits) as hits where hits.page.hostname="www.googlemerchandisestore.com" '
date=$(date +'%Y%m%d' -d "$date + 1 day")
done
The errors:
BigQuery error in query operation: Error processing job 'test-247020:bqjob_r6a2d68fbc6d04a34_000001722edd8043_1': Not found: Table test-247020:10241241.ga_sessions_ was not found in location EU
BigQuery error in query operation: Error processing job 'test-247020:bqjob_r5c42006229434f72_000001722edd85ae_1': Not found: Table test-247020:10241241.ga_sessions_ was not found in location EU
BigQuery error in query operation: Error processing job 'test-247020:bqjob_r6114e0d3e72b6646_000001722edd8960_1': Not found: Table test-247020:10241241.ga_sessions_ was not found in location EU
The problem is that you are using single quotes around the query, so bash doesn't replace $date with the parameter's value. You need double quotes for the query string, escaping the back-ticks and the inner double quotes:
date="20160805"
until [[ $date > 20160807 ]];
do
bq query --use_legacy_sql=false --destination_table="google_analytics.ga_sessions_${date}" "SELECT g.* FROM \`10241241.ga_sessions_$date\` g, UNNEST (hits) as hits where hits.page.hostname=\"www.googlemerchandisestore.com\" "
date=$(date +'%Y%m%d' -d "$date + 1 day")
done
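The quoting rule is easy to verify in isolation (a minimal bash sketch, nothing BigQuery-specific):
date="20160805"
echo 'ga_sessions_$date'   # single quotes: prints ga_sessions_$date, no expansion
echo "ga_sessions_$date"   # double quotes: prints ga_sessions_20160805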

How to store the output of a query in a variable in HIVE

I want to store current_day - 1 in a variable in Hive. I know there are already previous threads on this topic, but the solutions provided there first recommend defining the variable outside Hive in a shell environment and then using that variable inside Hive.
Storing result of query in hive variable
I first got current_date - 1 using
select date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),1);
Then I tried two approaches:
1. set date1 = ( select date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),1);
and
2. set hivevar:date1 = ( select date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),1);
Both approaches throw an error:
"ParseException line 1:82 cannot recognize input near 'select' 'date_sub' '(' in expression specification"
With approach (1), printing the variable shows that the literal select query was saved instead of yesterday's date. Approach (2) throws "{hivevar:dt_chk} is undefined".
I am new to Hive and would appreciate any help. Thanks.
Hive doesn't offer a straightforward way to store a query result in a variable. You have to use the shell to capture the result and pass it back in with -hiveconf:
date1=$(hive -e "set hive.cli.print.header=false; select date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'),1);")
hive -hiveconf "date1"="$date1" -f hive_script.hql
Then in your script you can reference the newly created variable date1:
select '${hiveconf:date1}'
After lots of research, this is probably the best way to set a variable from the output of a query:
INSERT OVERWRITE LOCAL DIRECTORY '<home path>/config/date1'
select CONCAT('set hivevar:date1=',date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),1)) from <some table> limit 1;
source <home path>/config/date1/000000_0;
You will then be able to use ${date1} in your subsequent SQLs.
Here we had to select from <some table> limit 1 because Hive has a bug with INSERT OVERWRITE when no source table is specified.
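The file written by the INSERT OVERWRITE then holds a single set command, along the lines of (illustrative date value):
set hivevar:date1=2019-05-28
and the source statement executes it, making ${date1} available for the rest of the session.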

Simple query from Dataflow to BigQuerySource throws exception

I'm trying to write a simple Dataflow job that utilizes the query parameter within the BigQuerySource class.
In simplest terms, I can access a BigQuery table using the BigQuerySource class, and then filter against it. I cannot query / filter directly against the BigQuery table using the BigQuerySource.
Here's some code. Filtering in-line, within my Dataflow pipeline, works fine:
import argparse
import apache_beam as beam
parser = argparse.ArgumentParser()
parser.add_argument('--output', required=True)
known_args, pipeline_args = parser.parse_known_args(None)
p = beam.Pipeline(argv=pipeline_args)
source = 'bigquery-public-data:samples.shakespeare'
rows = p | 'read' >> beam.io.Read(beam.io.BigQuerySource(source))
f = rows | 'filter' >> beam.Map(lambda row: 1 if (row['word_count'] > 1) else 0)
f | 'write' >> beam.io.WriteToText(known_args.output)
p.run()
Replacing that middle stanza with a single-line query gives an error.
f = p | 'read' >> beam.io.Read(beam.io.BigQuerySource('SELECT 1 FROM ' \
+ 'bigquery-public-data:samples.shakespeare where word_count > 1'))
The error returned looks like a syntax error.
(a29eabc394a38f62): Workflow failed. Causes:
(a29eabc394a38cfa): S04:read+write/Write/WriteImpl/WriteBundles+write/Write/WriteImpl/Pair+write/Write/WriteImpl/WindowInto(WindowIntoFn)+write/Write/WriteImpl/GroupByKey/Reify+write/Write/WriteImpl/GroupByKey/Write failed.,
(fb6d0643d7f13886): BigQuery execution failed.,
(fb6d0643d7f13b03): Error: Message: Encountered " "-" "- "" at line 1, column 59. Was expecting: <EOF>
Do I need to escape the - characters in the BigQuery project name?
In BigQuery Legacy SQL, you should escape the whole table reference with [ and ].
For Standard SQL, you should use back-ticks for the same reason.
See also Escaping reserved keywords and invalid identifiers
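Applied to the pipeline above, the read would look like this (a sketch, untested; note the query= keyword argument so the string is treated as a query rather than a table name):
# Legacy SQL: square brackets around the table reference
f = p | 'read' >> beam.io.Read(beam.io.BigQuerySource(
    query='SELECT 1 FROM [bigquery-public-data:samples.shakespeare] '
          'WHERE word_count > 1'))
# Standard SQL alternative: back-ticks and a dot-separated reference
# f = p | 'read' >> beam.io.Read(beam.io.BigQuerySource(
#     query='SELECT 1 FROM `bigquery-public-data.samples.shakespeare` '
#           'WHERE word_count > 1',
#     use_standard_sql=True))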

Group by expression in pig

Consider I have a dataset with tuples (f1, f2). I want to get my data in two bags: one where f1 is null and the other where f1 is not null. I try:
raw = LOAD 'somedata' USING PigStorage() AS (f1:chararray, f2:chararray);
raw_group = GROUP raw BY f1 is null;
raw_count = FOREACH raw_group GENERATE group, COUNT_STAR(raw);
I expect to get two groups with keys true and false. When I run it in grunt I get the following:
2013-12-26 14:56:10,958 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1200: <line 1046, column 25> Syntax error, unexpected symbol at or near 'f1'
I can do a workaround:
raw_group = GROUP raw BY (f1 is null)?0:1;
but I would really like to understand what's going on here, as I just started learning Pig. According to the Pig documentation I can use expressions as a grouping key. Am I missing something, or are nulls treated differently in Pig?
The boolean datatype was introduced in Pig 0.10. The expression f1 is null is a boolean, so it can't appear as a field in a relation, which it would do if it were the value of group. Prior to Pig 0.10, booleans could only be used in FILTER statements or in the ternary operator, as you showed in your workaround.
While I haven't tried this out, presumably if you were to attempt the same thing in Pig 0.10 or later, your original attempt would succeed.
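For pre-0.10 versions, the workaround from the question can also be given readable keys by mapping the boolean through the ternary operator (a sketch, untested; field names from the question):
raw = LOAD 'somedata' USING PigStorage() AS (f1:chararray, f2:chararray);
-- The ternary turns the boolean into a plain chararray grouping key:
raw_group = GROUP raw BY ((f1 is null) ? 'null' : 'not_null');
raw_count = FOREACH raw_group GENERATE group, COUNT_STAR(raw);
DUMP raw_count;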