Try-catch in pig? - apache-pig

I have to load a log whose records should fit a pattern. Unfortunately, some records don't.
This shows up as an error when I try to store the data in HCatalog.
Is it possible to store the records that fit the pattern in HCatalog and keep the others in a file for further processing?
Or maybe it is possible to do something like try-catch in Pig?
I can't find any solution, but it must be simple. I just don't believe nobody has faced this problem before!
I will be grateful for any hints.

Edited Answer
People have faced this issue before, but the answer is usually "UDF". Unfortunately, I think that's probably the best answer for your question: a UDF that performs the data validation using Java or Python try/catch error handling.
Another answer is to use SPLIT to evaluate the data in the field and direct the data into the appropriate alias. This is a common method of handling unexpected data.
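The validation idea can be sketched in plain Python. This is only an illustration of the try/except routing logic, not a registered Pig UDF. The tab-separated, three-integer record layout is an assumption, and the names parse_record and split_records are hypothetical.

```python
def parse_record(line):
    """Parse a tab-separated line into three ints, or raise ValueError."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError("expected 3 fields, got %d" % len(fields))
    # int() raises ValueError for anything that is not an integer literal.
    return tuple(int(f) for f in fields)


def split_records(lines):
    """Route each line to 'good' (parsed tuples) or 'bad' (raw lines)."""
    good, bad = [], []
    for line in lines:
        try:
            good.append(parse_record(line))
        except ValueError:
            # Keep the raw line so it can be written to a file and reprocessed.
            bad.append(line)
    return good, bad
```

Calling split_records on the loaded lines gives one list that is safe to store and another that keeps the rejected lines untouched for further processing.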
Original Answer:
In version 0.12 of Pig, you have the ASSERT operator, which isn't as good as try/catch, but it's better than nothing.
From the docs:
Suppose we have relation A.
A = LOAD 'data' AS (a0:int,a1:int,a2:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
Now you can assert that the a0 column in your data is > 0, and fail otherwise:
ASSERT A by a0 > 0, 'a0 should be greater than 0';

The ASSERT method in JamCon's answer is often helpful, but as you say, it can't address your particular issue. If you are simply looking to test for the presence of extra columns, one possible workaround is to load your data as normal but add an extra column called error:chararray at the end of the AS clause. Typically you would expect this to be NULL, but if there are extra columns, it won't be. So use
SPLIT a INTO good IF error IS NULL, bad IF error IS NOT NULL;
to separate out the lines that have extra columns.
Ugly, but in this particular case it should work for you.

Related

SSIS Fuzzy Grouping always returns the same result with different similarity thresholds

Can anyone tell me why my similarity is always 1?
My goal is for AAB and AAC to be placed in the same group, for example.
Thanks

After trying different source data, I got the result I needed.
I think it is better to use real-world examples as sample data.
Instead of AAA and AAC, use something like a Name column with Sara vs. Saraa; then SSIS says they are in the same group. However, I found that Don vs. Done are not grouped. So it may not be a good idea to filter records that have a typo with a different letter?
*** Try creating more than one column to be your comparison column.

Strange behavior on GROUP BY and LIKE?

Below is a simplified example of my data. As you can see, there are just two rows here.
So I run the query below and suddenly get an unexpected result.
What I expected was something like:
Why am I getting the wrong result?
Moreover, when I run the query below, I get only one row. Why is the second row with id=1 not showing?
Is this a BigQuery bug or what?
Disclaimer: I was asked exactly this type of question a few times offline (outside of StackOverflow) and recently saw the very same question on SO (I can't understand this BigQuery magic. find string with LIKE), but unfortunately it was deleted, so I decided to post this on my own.

The reason GROUP BY does not group those two rows is that the str values in those rows are actually different. Unfortunately, the BigQuery Web UI collapses spaces in the result panel when it is in Table mode. To see the real, original values you can switch to JSON mode, as below.
The same reason explains the unexpected result with LIKE.
As for how to deal with this: it depends!
For example, you can normalize your strings by collapsing the extra spaces yourself, as shown below.
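As a rough sketch of that normalization outside BigQuery, the Python below groups two rows with and without collapsing whitespace. The sample values are assumptions (the original data is not shown here), and normalize_spaces and group_counts are hypothetical names. In BigQuery itself, something like REGEXP_REPLACE on the str column could presumably play the same role.

```python
import re
from collections import Counter


def normalize_spaces(value):
    """Collapse every run of whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", value).strip()


def group_counts(rows, normalize=False):
    """Count rows per str value, optionally after normalizing whitespace."""
    keys = (normalize_spaces(s) if normalize else s for _, s in rows)
    return dict(Counter(keys))


# Two rows that look identical once repeated spaces are collapsed for display.
rows = [(1, "abc  xyz"), (2, "abc xyz")]
```

Without normalization, group_counts(rows) keeps the two values apart, which mirrors the surprising GROUP BY result; with normalize=True they fall into a single group.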
P.S. In our internal tools, we fixed the issue with suppressed spaces and simply show all spaces:

'-999' used for all condition

I have a sample of a stored procedure like this (from a previous job):
Select * from table where (id=@id or id='-999')
As I understand this query, the '-999' is used to avoid an exception when the user passes no value. So far in my research, I have not found this usage on the internet or in other companies' implementations.
@id is passed in by the user.
Any help, such as links related to this, will be appreciated.

I'd like to add my two guesses on this, although please note that to my disadvantage, I'm one of the very youngest in the field, so this is not coming from that much of history or experience.
Also, please note that for any reason anybody provides you, you might not be able to confirm it 100%. Your oven might just not have any leftover evidence in and of itself.
Now, per another question I read before, extreme integers were used in some systems to denote missing values, since text and NULL weren't options at those systems. Say I'm looking for ID#84, and I cannot find it in the table:
Not Found Is Unlikely:
Perhaps in some systems it's far more likely that a record exists with a missing or incorrect ID than that it doesn't exist at all? Hence, when no match is found, the designers preferred all records without valid IDs to be returned?
This, however, has a few problems. First, depending on the design, the user might not recognize that the results are a set of records with missing IDs, especially if only one was returned. Second, the current query poses a problem, as it will always return the missing-ID records in addition to the normal matches. Perhaps they relied on ordering to ease readability?
Exception Above SQL:
AFAIK, SQL is fine with a zero-row result, but maybe whatever calls (or used to call) it wasn't as robust, and something went wrong (hard exception, soft UI bug, etc.) when zero rows were returned? Perhaps, then, this ID represented a dummy row (e.g. blanks and zeroes) to keep things running.
Then again, this suffers from the same arguments as above regarding "record is always output" and ordering, with the added possibility that the SQL caller might have dedicated logic for when the -999 record is the only record returned, which I doubt was the most practical approach even in whatever era this was done.
... the more I type, the more I think this is the oven, and only the great grandmother can explain this to us.

If you want to avoid an exception when no value is passed by the user, declare the stored procedure parameter with a null default, like @id int = null.
For instance:
CREATE PROCEDURE [dbo].[TableCheck]
    @id int = null
AS
BEGIN
    Select * from table where (id=@id)
END
Now you can execute it either way:
exec [dbo].[TableCheck] 2 or exec [dbo].[TableCheck]
Remember, it's a separate matter if you want to return the whole table when your input parameter is null.
As for your id = -999 condition, I tried it your way. It doesn't prevent any exception.
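To make the difference concrete, here is a small sketch using Python's sqlite3 module as a stand-in for SQL Server. The items table, its rows, and the function names are all hypothetical, and -999 is stored as an integer here, which is an assumption. It contrasts a "return everything when the parameter is null" filter with the -999 pattern from the question.

```python
import sqlite3


def fetch(conn, user_id=None):
    """Return rows matching user_id, or every row when user_id is None."""
    sql = "SELECT id, name FROM items WHERE (:id IS NULL OR id = :id) ORDER BY id"
    return conn.execute(sql, {"id": user_id}).fetchall()


def fetch_with_sentinel(conn, user_id=None):
    """The pattern from the question: the -999 row is always included."""
    sql = "SELECT id, name FROM items WHERE (id = :id OR id = -999) ORDER BY id"
    return conn.execute(sql, {"id": user_id}).fetchall()


# In-memory sample data, including one row that carries the sentinel id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(-999, "placeholder"), (1, "first"), (2, "second")],
)
```

In this sketch neither query raises an error when no value is supplied; the sentinel version only changes which rows come back, adding the -999 row to every result.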

PostgreSQL extract keys from jsonb, exception "cannot call jsonb_object_keys on a scalar"

I am trying to get my head around jsonb in Postgres. There are quite a few issues here. What I wanted to do was something like:
SELECT table.column->>'key_1' as a FROM "table"
I tried with -> and also some combinations of brackets, but I was always getting nil in a.
So I tried to get all keys first to see if it is even recognizing jsonb or not.
SELECT jsonb_object_keys(table.column) as a FROM "table"
This threw an error:
cannot call jsonb_object_keys on a scalar
So, to check the column type (which I created, so I know it IS jsonb, but anyway):
SELECT pg_typeof(column) as a FROM "table" ORDER BY "table"."id" ASC LIMIT 1
This correctly gave me "jsonb" in the result.
Values in the column are similar to {"key_1":"New York","key_2":"Value of key","key_3":"United States"}
So I am really confused about what is actually going on here and why it considers my JSON data to be a scalar. What does that actually mean, and how do I solve this problem?
Any help in this regard will be greatly appreciated.
PS: I am using Rails; I posted this as a general question about the problem. Any Rails-specific solution would also work.

So the issue turned out to involve more than just SQL.
As I mentioned, I am using Rails (5.1). I had used the default value '{}' for the jsonb column, and I was using a two-way serializer for the column, defined in my model for the table.
Removing this serializer and changing the default value to {} actually solved the problem.
I think my serializer was doing something to the values, but still, in the database, it had the correct value, like I mentioned in the question.
It is still not 100% clear to me what the problem was, but it is solved anyway. If anyone can shed some light on what exactly the problem was, that would be great.
Hope this helps someone.

In my case the ORM layer somehow managed to write a null string into the JSON column, and Postgres was happy with it. Trying to execute json_object_keys on such a value resulted in the OP's error.
I managed to track down the place that allowed such null strings, and after fixing the code I also fixed the data with the following query:
UPDATE tbl SET col = '{}'::jsonb WHERE jsonb_typeof(col) <> 'object';
If you intentionally mix the types stored in the column (e.g. sometimes it is an object, sometimes array etc), you might want to filter out all rows that don't contain objects with a simple WHERE:
SELECT jsonb_object_keys(tbl.col) as a FROM tbl WHERE jsonb_typeof(col) = 'object';
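To illustrate what "scalar" means here, the Python sketch below classifies JSON documents roughly the way jsonb_typeof does. json_type is a hypothetical helper, and the double-encoded value is only one plausible way a serializer could end up storing a string instead of an object; what actually happened in the question is not known.

```python
import json


def json_type(text):
    """Rough analogue of jsonb_typeof for a JSON document given as text."""
    value = json.loads(text)
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, str):
        return "string"
    return "number"


proper = json.dumps({"key_1": "New York"})
# Encoding twice yields a JSON string whose content merely looks like an object.
double_encoded = json.dumps(proper)
```

Here json_type(proper) is "object", while the double-encoded value is a top-level "string", i.e. a scalar, so asking for its keys cannot work. Filtering on the type, as the WHERE clause above does, skips such values.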

SELECT FROM (lv_tablename) error: the output table is too small

I have an ABAP class method, say, select_something. select_something has an exporting parameter, say, et_result. et_result is of type standard table because the type of et_result cannot be determined until runtime.
The method sometimes gives a short dump saying "With ABAP/4 Open SQL array select, the output table is too small" at the statement "select * into table et_result from (lv_tablename) where...".
Error analysis:
......in this particular case, the database table is 3806 bytes wide, but the internal table is only 70 bytes wide.
I tried "any table" too and the error is the same.

You could return a data reference. Your query will no longer fail, and you can assign the data to a correctly typed field symbol afterwards.
" Definition
class-methods select_all
  importing
    !tabname type string
  returning
    value(results) type ref to data.
...
...
" Implementation
method select_all.
  data dref type ref to data.
  create data dref type standard table of (tabname).
  field-symbols <tab> type any table.
  assign dref->* to <tab>.
  select * from (tabname) into table <tab>.
  get reference of <tab> into results.
endmethod.
Also, I agree with @vwegert that dynamic queries (and programming for that matter) should be avoided when possible.

What you're trying to do looks horribly wrong on many levels. NEVER use SELECT FROM (whatever) unless someone points a gun at your head AND the door is locked tight. You'll lose every kind of static error checking the system might be able to provide you with. For example, the compiler will no longer be able to tell you "Hey, that table you're reading from is 3806 bytes wide." It simply can't tell, even if you use constants. You'll find that out the hard way, producing short dumps, especially when switching between Unicode and non-Unicode (NUC) systems, quite likely some in production systems. No fun.
(Actually there are a few - very very VERY few - good uses for dynamic table names in the SELECT statement. I need them about once every two to three years, and I code quite a lot of weird stuff. Just avoid them wherever you can, even at the cost of writing more code. It's just not worth the trouble fixing broken stuff later.)
Then, changing the generic formal parameter type does not do anything to the type of the actual parameter. If you pass a STANDARD TABLE OF mandt WITH DEFAULT KEY to your method, that table will have lines of 3 characters. It will be a STANDARD TABLE, and as such, it will also be an ANY TABLE, and that's about it. You can twist the generic types any way you like; there's no way to enforce correctness using generic types the way you use them. It's up to the caller to make sure that all the right types are used. That's a bad way to fly.

First off, I agree with vwegert's response: try to avoid dynamic SQL selections if you can.
That said, check the short dump. If the error is an exception class, you can wrap the SELECT statement in a try/catch block and at least stop it from dumping.
You can also try "INTO CORRESPONDING FIELDS OF TABLE et_result". If ET_RESULT is dynamic, you might have to cast it into the proper structure using RTTS. This might give you some ideas...

I couldn't agree more with vwegert, but if there is absolutely no other way (and there usually is) of performing your task than using dynamic select statements and dynamically typed parameters, do some checks on the type of the table and the parameter at runtime.
Use CL_ABAP_TYPEDESCR and its subclasses to do so.
This way, you can handle errors at runtime without your program dumping.
But as vwegert said, this dynamic stuff is pure evil and will most certainly break at some point during runtime. Adding the necessary error handling will most likely be a lot more work and a lot harder than redesigning your code to use non-dynamic SQL and typed parameters.
