Pig - trouble with datafu's TransposeTupletoBag - apache-pig

I can't get the example code for LinkedIn DataFu's TransposeTupletoBag (at http://linkedin.github.io/datafu/docs/current/datafu/pig/util/TransposeTupleToBag.html) to work.
register datafu-1.1.0.jar
define Transpose datafu.pig.util.Transpose();
x = LOAD 'input.txt' AS (id:int,val1:int,val2:int,val3:int);
dump x
(1,10,11,12)
y = FOREACH x GENERATE id, Transpose(val1 .. val3);
2013-11-08 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070:
Could not resolve datafu.pig.util.Transpose using imports: [, java.lang.,
org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: pig_1383941559971.log
For some reason, Pig cannot find Transpose(). I'm able to use other DataFu functions, so it's not a problem with the JAR path. I'm using Pig 0.11.1 in local mode.

This turned out to be an easy fix. Either DataFu's documentation was mistyped or the API has been updated; Transpose() should be TransposeTupleToBag() (which makes much more sense, anyway). Just replace
define Transpose datafu.pig.util.Transpose();
with
define Transpose datafu.pig.util.TransposeTupleToBag();
and you're good to go.

Related

Pylint: same pylint and pandas version on 2 machines, 1 fails

I have 2 places running the same linting job:
Machine 1: Ubuntu over SSH
pandas==1.2.3
pylint==2.7.4
python 3.8.10
Machine 2: Gitlab CI Docker image, python:3.8.12-buster
pandas==1.2.3
pylint==2.7.4
Python 3.8.12
The Ubuntu machine is able to lint all the code fine, and it has for many months. Same for the CI job, except it had been running Python 3.7.8. Now that I upgraded the Docker image to Python 3.8.12, it throws several no-member linting errors on some Pandas objects. I've tried clearing CI caches etc.
I wish I could provide something more reproducible. But, to check my understanding of what a linter is doing, is it theoretically possible that a small version difference in python messes up pylint like this? For something like a no-member error on Pandas objects, I would think the dominant factor is the pandas version, but those are equal, so I'm confused!
Update:
I've looked at the Pandas code for pd.read_sql_query, which is what's causing the no-member error. It says:
def read_sql_query(
sql,
con,
index_col=None,
coerce_float=True,
params=None,
parse_dates=None,
chunksize: Optional[int] = None,
) -> Union[DataFrame, Iterator[DataFrame]]:
In Docker, I get E1101: Generator 'generator' has no 'query' member (no-member) (because I'm running .query on the returned dataframe). So it seems Pylint thinks that this function returns a generator. But it does not make this assumption in my other setup. (I've also verified the SHA sum of pandas/io/sql.py matches). This seems similar to this issue, but I am still baffled by the discrepancy in environments.
A fix that worked was to bump a limit like:
init-hook = "import astroid; astroid.context.InferenceContext.max_inferred = 500"
in my .pylintrc file, as explained here.
I'm unsure why/if this is connected to my change in Python version, but I'm happy to use this and move on for now. It's probably complex.
(Another hack was to write a function that returns the passed arg if the passed arg is a dataframe, and returns 1 dataframe if the passed arg is an iterable of dataframes. So the ambiguous-type object could be passed through this wrapper to clarify things for Pylint. While this was more intrusive on our codebase, we had dozens of calls to pd.read_csv and pd.real_sql_query, and only about 3 calls caused confusion for Pylint, so we almost used this solution)

Delta Table : org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'FROM'

I am trying to run the query on EMR/EMR Notebooks (Spark with Scala) -
SELECT max(version), max(timestamp) FROM (DESCRIBE HISTORY delta.`s3://a/b/c/d`)
But I am getting the following error -
The same query works fine on Databricks.
Another doubt that I have is - why does the colour of s3 location change post //.
So I tried to break the above query and only run the Describe HISTORY query. And for some reason it says -
Error Log -
An error was encountered:
org.apache.spark.sql.AnalysisException: Table or view not found: HISTORY;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:47)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:835)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:787)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:817)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:810)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:71)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:810)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:756)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1$$anonfun$2.apply(RuleExecutor.scala:92)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1$$anonfun$2.apply(RuleExecutor.scala:92)
at org.apache.spark.sql.execution.QueryExecutionMetrics$.withMetrics(QueryExecutionMetrics.scala:141)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:91)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:88)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:88)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:80)
at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:164)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$execute$1.apply(Analyzer.scala:156)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$execute$1.apply(Analyzer.scala:156)
at org.apache.spark.sql.catalyst.analysis.AnalysisContext$.withLocalMetrics(Analyzer.scala:104)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:155)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:126)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:125)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:125)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:76)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
at org.apache.spark.sql.SparkSession.table(SparkSession.scala:630)
at org.apache.spark.sql.execution.command.DescribeColumnCommand.run(tables.scala:714)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:196)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:196)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3391)
at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$executeQuery$1(SQLExecution.scala:83)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1$$anonfun$apply$1.apply(SQLExecution.scala:94)
at org.apache.spark.sql.execution.QueryExecutionMetrics$.withMetrics(QueryExecutionMetrics.scala:141)
at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$withMetrics(SQLExecution.scala:178)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:93)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:200)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:92)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3390)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:196)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:81)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:644)
... 50 elided
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'history' not found in database 'default';
at org.apache.spark.sql.hive.client.HiveClient$$anonfun$getTable$1.apply(HiveClient.scala:81)
at org.apache.spark.sql.hive.client.HiveClient$$anonfun$getTable$1.apply(HiveClient.scala:81)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:81)
at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:84)
at org.apache.spark.sql.hive.HiveExternalCatalog.getRawTable(HiveExternalCatalog.scala:141)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:723)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:723)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:98)
at org.apache.spark.sql.hive.HiveExternalCatalog.getTable(HiveExternalCatalog.scala:722)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getTable(ExternalCatalogWithListener.scala:138)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupRelation(SessionCatalog.scala:706)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:832)
UPDATED (18-Feb-2021) -> What I have tried till now.
Query Using Spark Sql -
spark.sql("SELECT max(version), max(timestamp) FROM (DESCRIBE HISTORY delta.s3://a/b/c/d)")
But this Didnt work. Same Error.
Create Spark Session with -
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
and spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog.
But its throwing the same error.
UPDATE 2 (18-Feb-2021) :- Trying the approach as mentioned by #alex.
Using PySpark.
It was working partly and but not completely.
Thanks in Advance.
Per documentation, to get support for DESCRIBE HISTORY you need to configure Spark SQL Extensions and Catalog by passing 2 properties (see docs):
spark.sql.extensions to value io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog to value org.apache.spark.sql.delta.catalog.DeltaCatalog
Update:
For Spark 2.4.x, the Delta 0.6.1 should be used, and its documentation has following code snippet to activate extensions:
spark.sparkContext._jvm.io.delta.sql.DeltaSparkSessionExtension() \
.apply(spark._jsparkSession.extensions())
spark = SparkSession(spark.sparkContext, spark._jsparkSession.cloneSession())

Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast

I wanted to do the sum of a column which contains long type numbers.
I tried many possible ways but still the cast error is not getting resolved.
My pig code:
raw_ds = LOAD '/tmp/bimallik/data/part-r-00098' using PigStorage(',') AS (
d1:chararray, d2:chararray, d3:chararray, d4:chararray, d5:chararray,
d6:chararray, d7:chararray, d8:chararray, d9:chararray );
parsed_ds = FOREACH raw_ds GENERATE d8 as inBytes:long, d9 as outBytes:long;
X = FOREACH parsed_ds GENERATE (long)SUM(parsed_ds.inBytes) AS inBytes;
dump X;
Error snapshot:
2015-11-20 02:16:26,631 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045:
Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
Details at logfile: /users/bimallik/pig_1448014584395.log
2015-11-20 02:17:03,629 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complet
#ManjunathBallur Thanks for the input.
I changed my code as below now
<..same as before ...>
A = GROUP parsed_ds by inBytes;
X = FOREACH A GENERATE SUM(parsed_ds.inBytes) as h;
DUMP X;
Now A is generating a bag of common inBytes and X is giving sum of each bag's inBytes's summation which is again consisting of multiple rows where as I need one single summation value.
In local mode I was getting the same issue (pig -x local).
I have tried all the solutions available on the internet but noting seems to be working for me.
I toggled my PIG from local to the mapreduce mode and tried the solution. It worked.
In mapreduce mode all the solutions seem to be working.

Accumulo-Pig error - Connector info for AccumuloInputFormat can only be set once per job

Versions:
Accumulo 1.5
Pig 0.10
Attempted:
Read/write data in/into Accumulo from Pig, using accumulo-pig.
Encountered an error - any insight into getting past this error is greatly appreciated.
Switching to Accumulo 1.4 is not an option as we are using the Accumulo Thrift Proxy in our C# codebase.
Impact:
This is currently a roadblock in our project.
Source reference:
Source code - https://git-wip-us.apache.org/repos/asf/accumulo-pig.git
Error:
In attemtping to read a dataset in Accumulo, from Pig, I am getting the following error-
org.apache.pig.backend.executionengine.ExecException: ERROR 2118:
Connector info for AccumuloInputFormat can only be set once per job
Code snippet:
DATA = LOAD 'accumulo://departments?instance=indra&user=root&password=xxxxxxx&zookeepers=cdh-dn01:2181' using org.apache.accumulo.pig.AccumuloStorage() AS (row, cf, cq, cv, ts, val);
dump DATA;
Try using the ACCUMULO-1783-1.5 branch from the same repository. The way that Pig sets up the InputFormat doesn't play nicely with how Accumulo sets up InputFormats (notably, Accumulo makes a funny assertion that you never call the same static method more than one for a Configuration).
I have been using pig 0.12 -- I doubt there's a difference in how 0.10 sets up the InputFormats as opposed to 0.12, but I'm not positive YMMV.
I just pushed a fix to the above branch that gets rid of the previously mentioned limitation on Hadoop version.

Unable to specify schema during storage with pig scripts

Ignore above query. Its incorrect.
I have following pig script
A = LOAD 'textinput' using PigStorage() as (a0:chararray, a1:chararray, a2:chararray, a3:chararray, a4:chararray, a5:chararray, a6:chararray, a7:chararray, a8:chararray,a9:chararray);
describe A;
store A into 'output2' using PigStorage();
This works fine.
However when i modify the store statement to
store A into 'output3' using PigStorage() as (a0:chararray, a1:chararray, a2:chararray, a3:chararray, a4:chararray, a5:chararray, a6:chararray, a7:chararray, a8:chararray,a9:chararray);
It fails with below error
2013-05-04 11:49:56,296 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input 'as' expecting SEMI_COLON
You don't specify a schema when storing an output with pig. The schema of the alias you're storing is whatever it was when you created it. If you wished to change the way it's stored you could do something like
B = FOREACH A GENERATE (insert transformation here);
STORE B INTO 'output3';
If you wished to change the way PigStorage writes your alias to disk you could create your own StoreFunc