Pig Latin: how to try-catch for errors regarding casting? - apache-pig

Let's say I have structured data in Pig named "my_great_data" with the following fields: "a_field", "b_field" and "c_field".
I want to write a filter statement that keeps all "rows" in "my_great_data" whose "a_field" can be cast to the long data type.
Something like:
my_great_data_output = filter my_great_data BY (a_field can be casted to long);

You could write a simple Python UDF to do the casting and return a null if it's not possible. You can then use IS NOT NULL in a filter to remove the records that failed to cast.
Something like:
from pig_util import outputSchema
@outputSchema("cast_value:long")
def castToLong(num):
    if num is None:
        return None
    try:
        # Going through float() handles strings with a decimal point.
        # Use long(num) if these should not be cast.
        r = long(float(num))
        return r
    except (ValueError, TypeError, OverflowError):
        # The value cannot be cast: return null so it can be filtered out.
        return None
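The casting logic can be checked outside Pig before registering the UDF. The following is only a sketch in plain Python 3, where int stands in for Jython's long; the name cast_to_long and the sample rows are made up for illustration.

```python
def cast_to_long(value):
    """Return value as an int, or None when it cannot be cast."""
    if value is None:
        return None
    try:
        # float() first, so "3.9" is accepted and truncated to 3.
        return int(float(value))
    except (ValueError, TypeError, OverflowError):
        return None


# Hypothetical (a_field, b_field) rows; keep only those whose a_field casts.
rows = [("12", "x"), ("3.9", "y"), ("abc", "z"), (None, "w")]
kept = [row for row in rows if cast_to_long(row[0]) is not None]
print(kept)
```

In Pig, a Jython UDF like the one above is typically registered with a statement such as REGISTER 'myudfs.py' USING jython AS myudfs; (the file and namespace names are assumptions), after which the filter becomes BY myudfs.castToLong(a_field) IS NOT NULL.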

Related

Django - Function with template returns `TypeError: not enough arguments for format string`

I am trying to use PostgreSQL's FORMAT function in Django to format phone number strings.
I can accomplish this with the following SQL query:
SELECT
    phone_number, FORMAT('(%s) %s-%s', SUBSTRING(phone_number,3,3), SUBSTRING(phone_number,6,3), SUBSTRING(phone_number,9,4))
FROM core_user
WHERE phone_number IS NOT NULL
which returns a result like:
Trying to implement this into Django to be used for an ORM query, I did the following:
class FormatPhoneNumber(Func):
    function = "FORMAT"
    template = "%(function)s('(%s) %s-%s', SUBSTRING(%(expressions)s,3,3), SUBSTRING(%(expressions)s,6,3), SUBSTRING(%(expressions)s,9,4))"
ORM query:
User.objects.annotate(phone_number2=FormatPhoneNumber(f"phone_number"))
Returns the following error:
File /venv/lib/python3.10/site-packages/django/db/models/expressions.py:802, in Func.as_sql(self, compiler, connection, function, template, arg_joiner, **extra_context)
800 arg_joiner = arg_joiner or data.get("arg_joiner", self.arg_joiner)
801 data["expressions"] = data["field"] = arg_joiner.join(sql_parts)
--> 802 return template % data, params
TypeError: not enough arguments for format string
I believe it is due to this line '(%s) %s-%s' that is supplied to the FORMAT function.
Does anyone have any ideas on how I can make this work?
Yes. Use two consecutive percent signs (%%) to produce a literal percent sign after formatting, so:
class FormatPhoneNumber(Func):
    function = "FORMAT"
    template = "%(function)s('(%%s) %%s-%%s', SUBSTRING(%(expressions)s,3,3), SUBSTRING(%(expressions)s,6,3), SUBSTRING(%(expressions)s,9,4))"
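The effect of the doubled percent signs can be reproduced with plain Python string formatting, without Django. This is a minimal sketch: the data dictionary only mimics the kind of mapping that Func.as_sql passes to the template, and the shortened templates are for illustration. It covers Python's own % formatting step only; whether the resulting literal %s then survives the database driver's parameter handling is a separate question that this sketch does not address.

```python
# Stand-in for the mapping Django applies to the template (assumed values).
data = {"function": "FORMAT", "expressions": '"phone_number"'}

unescaped = "%(function)s('(%s) %s-%s', SUBSTRING(%(expressions)s,3,3))"
escaped = "%(function)s('(%%s) %%s-%%s', SUBSTRING(%(expressions)s,3,3))"

try:
    unescaped % data
    error = None
except TypeError as exc:
    # The bare %s placeholders have no positional arguments to consume.
    error = str(exc)

# With %% each doubled sign collapses to one literal % in the output.
rendered = escaped % data

print(error)
print(rendered)
```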
But formatting is normally not done by the database; you would usually do it in the model, in a model field, or in the view or template.
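As a rough illustration of formatting on the Python side instead, the sketch below uses a standalone helper. The name format_phone is hypothetical, and it assumes every number is stored as '+1' followed by ten digits, as the SUBSTRING offsets in the question suggest.

```python
import re


def format_phone(phone_number):
    """Format a '+1XXXXXXXXXX' string as '(XXX) XXX-XXXX'."""
    # Skip the two-character '+1' prefix (assumption about the stored format).
    national = phone_number[2:12]
    return re.sub(r"(\d{3})(\d{3})(\d{4})", r"(\1) \2-\3", national)


print(format_phone("+16502553199"))  # prints (650) 255-3199
```

A helper like this could be called from a model property or a template filter, so the stored value stays unformatted.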
I was not able to get the solution using the FORMAT function to work, but I did get it to work with REGEXP_REPLACE, which gives the same output:
class FormatPhoneNumber2(Func):
    function = "REGEXP_REPLACE"
    # Backslashes are doubled throughout so Python passes \d and \1 through to PostgreSQL.
    template = "%(function)s(SUBSTRING(%(expressions)s,3,10), '(\\d{3})(\\d{3})(\\d{4})', '(\\1) \\2-\\3')"
In [27]: result = (
    ...:     User.objects.filter(phone_number__isnull=False)
    ...:     .annotate(phone_number2=FormatPhoneNumber2("phone_number"))
    ...:     .values_list("phone_number", "phone_number2")
    ...: )
In [28]: for user in result:
    ...:     print(user)
('+16502553199', '(650) 255-3199')
('+12147047099', '(214) 704-7099')
('+12147547099', '(214) 754-7099')

Passing Optional List argument from Django to filter with in Raw SQL

When using primitive types such as Integer, I can do a query like this without any problems:
with connection.cursor() as cursor:
    cursor.execute(sql='''SELECT count(*) FROM account
        WHERE %(pk)s ISNULL OR id = %(pk)s''', params={'pk': 1})
This either matches the row with id = 1, or matches all rows if the pk parameter is None.
However, when trying to use similar approach to pass a list/tuple of IDs, I always produce a SQL syntax error when passing empty/None tuple, e.g. trying:
with connection.cursor() as cursor:
    cursor.execute(sql='''SELECT count(*) FROM account
        WHERE %(ids)s ISNULL OR id IN %(ids)s''', params={'ids': (1,2,3)})
works, but passing () produces SQL syntax error:
psycopg2.ProgrammingError: syntax error at or near ")"
LINE 1: SELECT count(*) FROM account WHERE () ISNULL OR id IN ()
Or if I pass None I get:
django.db.utils.ProgrammingError: syntax error at or near "NULL"
LINE 1: ...LECT count(*) FROM account WHERE NULL ISNULL OR id IN NULL
I tried putting the argument in SQL in () - (%(ids)s) - but that always breaks one or the other condition. I also tried playing around with pg_typeof or casting the argument, but with no results.
Notes:
The actual SQL is much more complex; this one is a simplification for illustrative purposes.
As a last resort I could alter the SQL in Python based on the argument, but I really wanted to avoid that.
At first I had an idea of using just 1 argument, but replacing it with a dummy value [-1] and then using it like
cursor.execute(sql='''SELECT ... WHERE -1 = any(%(ids)s) OR id = ANY(%(ids)s)''', params={'ids': ids if ids else [-1]})
but this did a Full table scan for non empty lists, which was unfortunate, so a no go.
Then I thought I could do a little preprocessing in python and send 2 arguments instead of just the single list- the actual list and an empty list boolean indicator. That is
cursor.execute(sql='''SELECT ... WHERE %(empty_ids)s = TRUE OR id = ANY(%(ids)s)''', params={'empty_ids': not ids, 'ids': ids})
Not the most elegant solution, but it performs quite well (Index scan for non empty list, Full table scan for empty list - but that returns the whole table anyway, so it's ok)
Finally, I came up with the simplest and quite elegant solution:
cursor.execute(sql='''SELECT ... WHERE '{}' = %(ids)s OR id = ANY(%(ids)s)''', params={'ids': ids})
This one also performs Index scan for non empty lists, so it's quite fast.
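The preprocessing for the two-parameter variant above can be kept in one small helper. This is only a sketch: the name id_filter_params is hypothetical, and turning None into an empty list is an assumption made so that ANY always receives a list.

```python
def id_filter_params(ids):
    """Build the params dict for an optional list of ids."""
    # Treat None and empty sequences alike: no filtering on id.
    id_list = list(ids) if ids else []
    return {"empty_ids": not id_list, "ids": id_list}


# The query text stays the same whether or not ids were supplied.
SQL = """SELECT count(*) FROM account
WHERE %(empty_ids)s = TRUE OR id = ANY(%(ids)s)"""

print(id_filter_params((1, 2, 3)))
print(id_filter_params(None))
```

The result would then be passed as params to cursor.execute together with SQL.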
From the psycopg2 docs:
Note: You can use a Python list as the argument of the IN operator using the PostgreSQL ANY operator.
ids = [10, 20, 30]
cur.execute("SELECT * FROM data WHERE id = ANY(%s);", (ids,))
Furthermore ANY can also work with empty lists, whereas IN () is a SQL syntax error.

query using string in PyTables 3

I have a table:
h5file=open_file("ex.h5", "w")
class ex(IsDescription):
    A=StringCol(5, pos=0)
    B=StringCol(5, pos=1)
    C=StringCol(5, pos=2)
table=h5file.create_table('/', 'table', ex, "Passing string as column name")
table=h5file.root.table
rows=[
    ('abc', 'bcd', 'dse'),
    ('der', 'fre', 'swr'),
    ('xsd', 'weq', 'rty')
]
table.append(rows)
table.flush()
I am trying to query as per below:
find='swr'
creteria='B'
if creteria=='B':
    condition='B'
else:
    condition='C'
value=[x['A'] for x in table.where("""condition==find""")]
print(value)
It returns:
ValueError: there are no columns taking part in condition condition==find
Is there a way to use condition as a column name in above query?
Thanks in advance.
Yes, you can use the PyTables .where() method to search based on a condition. The problem is how you constructed the query string passed to table.where(condition). See the note about strings under Table.where() in the PyTables Users Guide:
A special care should be taken when the query condition includes string literals. ... Python 3 strings are unicode objects.
in Python 3, “condition” should be defined like this:
condition = 'col1 == b"AAAA"'
The reason is that in Python 3 “condition” implies a comparison between a string of bytes (“col1” contents) and an unicode literal (“AAAA”).
The simplest form of your query is shown below. It returns the subset of rows that match the condition. Note that the condition itself is written in single quotes and the string literal inside it in double quotes:
query_table = table.where('C=="swr"') # search in column C
I rewrote your example as best I could; see below. It shows several ways to enter the condition. I could not work out how to combine your creteria and find variables into a single condition variable with string and unicode characters.
from tables import *
class ex(IsDescription):
    A=StringCol(5, pos=0)
    B=StringCol(5, pos=1)
    C=StringCol(5, pos=2)
h5file=open_file("ex.h5", "w")
table=h5file.create_table('/', 'table', ex, "Passing string as column name")
## table=h5file.root.table
rows=[
    ('abc', 'bcd', 'dse'),
    ('der', 'fre', 'swr'),
    ('xsd', 'weq', 'rty')
]
table.append(rows)
table.flush()
find='swr'
query_table = table.where('C==find')
for row in query_table :
    print (row)
    print (row['A'], row['B'], row['C'])
value=[x['A'] for x in table.where('C == "swr"')]
print(value)
value=[x['A'] for x in table.where('C == find')]
print(value)
h5file.close()
Output shown below:
/table.row (Row), pointing to row #1
b'der' b'fre' b'swr'
[b'der']
[b'der']
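One possible way to choose the column at run time is to build the condition string in Python first, so that the column name is part of the string and only the search value remains a variable. The sketch below is plain string handling with no PyTables calls; build_condition and ALLOWED_COLUMNS are hypothetical names, and the allowed columns are taken from the ex table above.

```python
ALLOWED_COLUMNS = ("A", "B", "C")  # columns of the ex table description


def build_condition(column, var_name="find"):
    """Return a condition string such as 'B == find' for a known column."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError("unknown column: %r" % (column,))
    return "%s == %s" % (column, var_name)


criteria = "B"
column = "B" if criteria == "B" else "C"
condition = build_condition(column)
print(condition)  # prints B == find
```

The resulting string would then be passed as table.where(condition), with find defined in the calling scope as in the rewritten example. Table.where also accepts a condvars mapping, for instance condvars={"find": find}, which may be clearer than relying on local variables.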

Returning a tuple column type from slick plain SQL query

In slick 3 with postgres, I'm trying to use a plain sql query with a tuple column return type. My query is something like this:
sql"""
select (column1, column2) as tup from table group by tup;
""".as[((Int, String))]
But at compile time I get the following error:
could not find implicit value for parameter rconv: slick.jdbc.GetResult[((Int, String), String)]
How can I return a tuple column type with a plain sql query?
GetResult[T] is a wrapper for a function PositionedResult => T. The query expects an implicit GetResult value that uses PositionedResult methods such as nextInt and nextString to extract typed fields by position. The following implicit val should address your need:
implicit val getTableResult = GetResult(r => (r.nextInt, r.nextString))
More details can be found in this Slick doc.

Error 1045 on sum function in pig latin with an int

The following pig latin script:
data = load 'access_log_Jul95' using PigStorage(' ') as (ip:chararray, dash1:chararray, dash2:chararray, date:chararray, date1:chararray, getRequset:chararray, location:chararray, http:chararray, code:int, size:int);
splitDate = foreach data generate size as size:int , ip as ip, FLATTEN(STRSPLIT(date, ':')) as h;
groupedIp = group splitDate by h.$1;
a = foreach groupedIp {
    added = foreach splitDate generate SUM(size); --
    generate added;
};
describe a;
gives me the error:
ERROR 1045:
<file 3.pig, line 10, column 39> Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
This error makes me think I need to cast size as an int, but if I describe my groupedIp relation, I get the following schema:
groupedIp: {group: bytearray,splitDate: {(size: int,ip: chararray,h: bytearray)}} which indicates that size is an int, and should be able to be used by the sum function.
Am I calling the sum function incorrectly? Let me know if you would like to see anything else, such as the input file.
SUM operates on a bag as input, but you pass it the field 'size'.
Try to eliminate the nested foreach and use:
a = foreach groupedIp generate SUM(splitDate.size);
Do some dumps of your data. I'll bet some of the stuff in the size column is non-integer, and Pig runs into that and dies. You could also code up your own isInteger udf to check this before the rest of your processing, and throw out any that aren't integers.
SUM, AVG and COUNT are functions that always work on a bag, so group the data first and then apply the function to the grouped bag, like below:
A = load 'nyse_data.txt' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
G = group A by symbol;
C = foreach G generate group, SUM(A.open);
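The idea that SUM takes a whole bag per group, not a single field value, can be pictured with a rough Python analogy. This is not Pig code; the (hour, size) rows are made up, and each list plays the role of the bag that GROUP produces.

```python
from collections import defaultdict

# Made-up (hour, size) rows standing in for the splitDate relation.
rows = [("00", 100), ("00", 250), ("01", 40)]

# GROUP ... BY hour: collect the sizes of each group into a "bag".
bags = defaultdict(list)
for hour, size in rows:
    bags[hour].append(size)

# SUM(splitDate.size): the aggregate is applied to the whole bag.
totals = {hour: sum(sizes) for hour, sizes in bags.items()}
print(totals)  # prints {'00': 350, '01': 40}
```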