Check if two collections have common part - hive

I have a Hive table in which two columns are arrays (created using the collect_set built-in function). I would like to get only those rows in which any element of col1 equals any element of col2, that is, rows where the two arrays overlap. I see that there is an array_contains(Array, value) function, but it takes a single value, not a collection. Is it possible to express such a condition?

You can explode the array elements and then use a simple WHERE condition to compare them.
See the Hive manual entry for the explode UDTF.
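The logic of the explode-then-compare approach can be sketched in plain Python. This is only an illustration of the idea, not Hive code: the `id` column and the sample rows are hypothetical, and `itertools.product` stands in for exploding both arrays into element pairs.

```python
from itertools import product

# Hypothetical sample rows; "id" is an assumed key column.
rows = [
    {"id": 1, "col1": ["a", "b"], "col2": ["b", "c"]},
    {"id": 2, "col1": ["x"], "col2": ["y", "z"]},
    {"id": 3, "col1": ["m", "n"], "col2": ["n", "m"]},
]


def explode_and_match(rows):
    """Keep the ids of rows where some element of col1 equals some element of col2."""
    matched = []
    for row in rows:
        # "Explode" both arrays into every (e1, e2) pair, then apply the WHERE e1 = e2 test.
        pairs = product(row["col1"], row["col2"])
        if any(e1 == e2 for e1, e2 in pairs):
            matched.append(row["id"])
    return matched


overlapping_ids = explode_and_match(rows)

# The same condition expressed as a set intersection.
via_sets = [r["id"] for r in rows if set(r["col1"]) & set(r["col2"])]
```

Because exploding both arrays yields one row per matching pair, a real query would likely need a DISTINCT or GROUP BY on the row key to avoid duplicates.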

pandas - running into problems setting multiple columns using results from pd.apply()

I have a function that returns tuples. When I apply this to my pandas dataframe using pd.apply() function, the results look this way.
The Date here is an index and I am not interested in it.
I want to create two new columns in a dataframe and set their values to the values you see in these tuples.
How do I do this?
I tried the following:
This errors out, citing a mismatch between the expected and available values. It treats each tuple as a single entity, so the two columns I specified on the left-hand side are a problem: it's expecting only one.
What I need is to break each tuple into two parts that can be used to set two different columns.
What's the correct way to achieve this?
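Make your function return a pd.Series; apply will then expand the result into a DataFrame with one column per element.
orders.apply(lambda x: pd.Series(myFunc(x)), axis=1)
Alternatively, use zip to unpack a column of tuples into separate columns:
orders['a'], orders['b'] = zip(*df['your_column'])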
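Both suggestions can be tried on a small frame. The following is a minimal sketch: `my_func`, the `qty` column, and the new column names are hypothetical stand-ins, since the original function and data are not shown.

```python
import pandas as pd


def my_func(row):
    # Hypothetical stand-in for the asker's function: returns a 2-tuple per row.
    return row["qty"] * 2, row["qty"] + 1


orders = pd.DataFrame({"qty": [1, 2, 3]})

# Option 1: wrap the tuple in a pd.Series so apply() expands it into columns 0 and 1.
expanded = orders.apply(lambda row: pd.Series(my_func(row)), axis=1)
orders["a"] = expanded[0]
orders["b"] = expanded[1]

# Option 2: keep the Series of tuples and transpose it with zip.
tuples = orders.apply(my_func, axis=1)
c_vals, d_vals = zip(*tuples)
orders["c"] = list(c_vals)
orders["d"] = list(d_vals)
```

The zip variant avoids building a Series per row, which tends to matter on larger frames.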

Accessing complex types in AWS Athena

I've used Glue to generate tables for Athena. I have some nested array/struct values (complex types) that I'm having trouble accessing via query.
I have two tables, the one in question is named "sample_parquet".
ids (array<struct<idType:string,idValue:string>>)
The cell has the value:
[{idtype=ttd_id, idvalue=cf275376-8116-4cad-a035-e241e14b1470}, {idtype=md5_email, idvalue=932babe184fb11c92b09b3e13e936124}]
And I've tried:
select ids.idtype from sample_parquet limit 1
Which yields:
SYNTAX_ERROR: line 1:8: Expression "ids" is not of type ROW
And:
select s.idtype from sample_parquet.ids s limit 1;
Which yields:
SYNTAX_ERROR: line 1:22: Schema sample_parquet does not exist
I've also tried:
select json_extract(ids, '$.idtype') as idtype from sample_parquet limit 1;
Which yields:
SYNTAX_ERROR: line 8:8: Unexpected parameters (array(row(idtype varchar,idvalue varchar)), varchar(8)) for function json_extract. Expected: json_extract(varchar(x), JsonPath) , json_extract(json, JsonPath)
Thanks for any help.
You are trying to access the elements of an array the way you would access a key in a dictionary.
Use UNNEST to flatten the array; you can then use the . operator on each element.
For more information, see the AWS documentation on working with JSON and arrays in Athena.
ids is a column of type array, not a relation (e.g. a table, view, or a subquery). Confusingly, when dealing with nested types in Athena/Presto you have to stop thinking in terms of SQL and instead think more as you would in a programming language.
There are dedicated functions that act on arrays, maps, as well as lambda functions (no relationship with the AWS service), that can be used to dig into nested types.
When you say SELECT ids.idtype … I assume that what you're after could be written like ids.map((id) => id.idtype) in JavaScript. In Athena/Presto this could be expressed as SELECT transform(ids, id -> id.idtype) ….
The result of transform will be a relation with a column of type array<string>. If you want each element of that array as a separate row, you need to use UNNEST, but if you instead want the first value you can use the element_at function. There are also other functions that you may be familiar with, such as filter, slice, and flatten, which produce new arrays, as well as reduce, which produces a scalar value.
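To make the "think like a programming language" point concrete, here is a rough Python model of what those array functions do to the sample cell. It is only an analogy, not Athena syntax; the Presto expressions in the comments are the ones named above, and the variable names are illustrative.

```python
# The sample cell from the question, modeled as a list of dicts.
ids = [
    {"idtype": "ttd_id", "idvalue": "cf275376-8116-4cad-a035-e241e14b1470"},
    {"idtype": "md5_email", "idvalue": "932babe184fb11c92b09b3e13e936124"},
]

# transform(ids, id -> id.idtype): map each struct to one of its fields.
id_types = [entry["idtype"] for entry in ids]

# element_at(transform(...), 1): Presto arrays are 1-indexed, Python lists are 0-indexed.
first_type = id_types[0]

# UNNEST: one output row per array element, with struct fields as columns.
unnested_rows = [(entry["idtype"], entry["idvalue"]) for entry in ids]

# filter(ids, id -> id.idtype = 'md5_email'): keep only matching structs.
md5_only = [entry for entry in ids if entry["idtype"] == "md5_email"]
```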

add column in spark dataframe referring another dataframe using udf

I've a dataframe "Forecast" with columns - Store, Item, FC_startdate, FC_enddate, FC_qty
Another dataframe "Actual" with columns - Store, Item, Saledate, Sales_qty.
I want to create a UDF that takes the parameters p_store, p_item, p_startdate, and p_enddate, computes the sum of Sales_qty between those dates, and adds it as a new column (Act_qty) to the "Forecast" dataframe.
However, Spark does not allow passing a dataframe to a UDF along with the fields of Forecast.
What would be a solution that does not use merge?
After defining and registering your UDF, you can use it in your transformation code like any other function of the spark-sql library.
As with the spark-sql library functions, you can only pass columns of your dataframe and return the processed value. Dataframes cannot be passed to UDFs.
So in your case you can transform your current dataframe into another dataframe by applying the UDF to its columns and then continue from there.
https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
A golden rule is that anything that can be done without UDFs should be done without UDFs. They are best applied when you require a very specific transformation on a single row, rather than for the large aggregation-type operation you describe.
In this case it seems like you could just use SparkSQL: select the rows of Actual where Saledate is between the dates you want (Spark understands dates natively, refer to the documentation), sum Sales_qty per Store or Item, or both (I am not sure what you intend to do), rename the sum column, and join this new dataframe to Forecast on Store or Item, or both again.
If, however, you insist on using UDFs, you will have to pass columns rather than dataframes as arguments, but I can't think of a straightforward way to achieve what you describe with UDFs without sacrificing a lot of performance.
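The join-and-aggregate approach described above can be sketched as a single SQL query. The example below runs it through Python's built-in sqlite3 module purely as a stand-in for Spark SQL, so that it is self-contained; the sample rows are made up, dates are stored as ISO strings, and joining on both Store and Item is an assumption about what the asker wants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Forecast (Store TEXT, Item TEXT, FC_startdate TEXT, FC_enddate TEXT, FC_qty INTEGER);
    CREATE TABLE Actual (Store TEXT, Item TEXT, Saledate TEXT, Sales_qty INTEGER);
    INSERT INTO Forecast VALUES
        ('S1', 'I1', '2020-01-01', '2020-01-07', 10),
        ('S1', 'I2', '2020-01-01', '2020-01-07', 5);
    INSERT INTO Actual VALUES
        ('S1', 'I1', '2020-01-02', 3),
        ('S1', 'I1', '2020-01-05', 4),
        ('S1', 'I1', '2020-01-09', 8),
        ('S1', 'I2', '2019-12-31', 2);
    """
)

# Left join keeps forecasts with no sales in range; the date filter lives in the ON clause.
query = """
    SELECT f.Store, f.Item, f.FC_startdate, f.FC_enddate, f.FC_qty,
           COALESCE(SUM(a.Sales_qty), 0) AS Act_qty
    FROM Forecast f
    LEFT JOIN Actual a
      ON a.Store = f.Store
     AND a.Item = f.Item
     AND a.Saledate BETWEEN f.FC_startdate AND f.FC_enddate
    GROUP BY f.Store, f.Item, f.FC_startdate, f.FC_enddate, f.FC_qty
    ORDER BY f.Store, f.Item
"""
result = conn.execute(query).fetchall()
```

In Spark the same shape would presumably be a join on the key columns plus a range condition on the dates, followed by a groupBy and sum; no UDF is involved.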

What is the order of data across multiple nested fields in BigQuery?

Given a BigQuery table with the schema: target:STRING,evName:STRING,evTime:TIMESTAMP, consider the following subselect:
SELECT target,
NEST(evName) AS evNames,
NEST(evTime) AS evTimes,
FROM [...]
GROUP BY target
This will group events by target into rows with two repeated fields evNames and evTimes. I understand that the values within each of the repeated fields are not ordered in any predictable way, but is the ordering guaranteed to be consistent between the two repeated fields?
In other words, if I pick N-th value from evNames and N-th value from evTimes within a given row, will they form a proper pair from the original table?
What I would really like to do is to create a nested repeated record, something like:
SELECT target, NEST(RECORD(evName, evTime)) AS events FROM [...] GROUP BY target
but I believe creating RECORDs on the fly like this is currently not supported.
By the way, this question is motivated by the desire to use recently introduced BigQuery user defined functions to implement state machines, as an alternative to window functions tricks.
Note: I realize that an alternative is to emulate a record by serializing multiple fields into a single string representation, e.g.:
SELECT target, NEST(CONCAT(evName, ',', STRING(evTime))) ...
and then deserialize the "record" in later stages, but I'd like to avoid that if I can.

Django: how to filter for rows whose fields are contained in passed value?

MyModel.objects.filter(field__icontains=value) returns all the rows whose field contains value. How can I do the opposite, namely construct a queryset that returns all the rows whose field is contained in value?
Preferably without using custom SQL (i.e. only using the ORM) or at least without using backend-dependent SQL.
field__icontains and similar lookups are coded right into the ORM. The reverse version simply doesn't exist.
You could use the where parameter of extra(), described in the QuerySet reference.
In this case, you would use something like:
MyModel.objects.extra(where=["%s LIKE CONCAT('%%',field,'%%')"], params=[value])
Of course, keep in mind that there is no standard method of concatenation across DBMSs. So as far as I know, there is no way to satisfy your requirement of avoiding backend-dependent SQL.
If you're okay with working with a list of dictionaries rather than a queryset, you could always do this instead:
qs = MyModel.objects.all().values()
matches = [r for r in qs if r['field'] in value]
although this is of course not ideal for huge data sets, since every row is loaded into Python before filtering.
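The reversed LIKE condition can be tried outside Django. The sketch below uses Python's built-in sqlite3 module rather than the ORM, with a hypothetical mymodel table; note that SQLite concatenates with || where the snippet above uses CONCAT, which illustrates the backend dependence mentioned earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mymodel (id INTEGER PRIMARY KEY, field TEXT)")
conn.executemany(
    "INSERT INTO mymodel (field) VALUES (?)",
    [("foo",), ("bar",), ("qux",)],
)

value = "xxFOObarxx"

# The passed value is the left operand; each row's field becomes the pattern.
rows = conn.execute(
    "SELECT field FROM mymodel WHERE ? LIKE '%' || field || '%' ORDER BY field",
    (value,),
).fetchall()
matches = [row[0] for row in rows]
```

SQLite's LIKE is case-insensitive for ASCII characters by default, which roughly mirrors icontains here. Any % or _ characters stored in field would be treated as wildcards, so real data may need escaping.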