Apache Pig: Number Extraction After a Specific String

I have a file with 10,1900 lines in which every line contains 5 '|' delimiters (so 6 columns). The sixth column holds statements like "Dropped 12 (0.01%)". I want to extract the number inside the parentheses that follows "Dropped":
Actual -- Dropped 12 (0.01%)
Expected -- 0.01
I need a solution using Apache Pig.

You are looking for the REGEX_EXTRACT function.
Let's say you have a table A whose sixth column looks like:
+--------------------+
| col6               |
+--------------------+
| Dropped 12 (0.01%) |
| Dropped 24 (0.02%) |
+--------------------+
You can extract the number in parentheses with the following:
B = FOREACH A GENERATE REGEX_EXTRACT(col6, '.*\\((.*)%\\)', 1) AS percent;
+---------+
| percent |
+---------+
| 0.01 |
| 0.02 |
+---------+
The regex defines a capture group for whatever characters sit between ( and %). Notice that I'm using \\ as the escape sequence so that the pattern matches the literal opening and closing parentheses.
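To try the pattern outside Pig, here is a minimal Python sketch of the same capture-group idea. It assumes Python's built-in re module; the helper name extract_percent is hypothetical and not part of the original answer. In a Python raw string a single backslash is enough to escape the parentheses.

```python
import re

# Same pattern as the Pig example; in a raw string one backslash escapes "(" and ")".
PERCENT_PATTERN = re.compile(r'.*\((.*)%\)')


def extract_percent(text):
    """Return the text between "(" and "%)", or None when nothing matches."""
    match = PERCENT_PATTERN.match(text)
    return match.group(1) if match else None


print(extract_percent("Dropped 12 (0.01%)"))  # 0.01
```

The result is still a string; converting it to a number is a separate step.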

Related

Conditionally remove a field in Splunk

I have a table generated by chart that lists the results of a compliance scan
These results are typically Pass, Fail, and Error - but sometimes there is "Unknown" as a response
I want to show the percentage of each (Pass, Fail, Error, Unknown), so I do the following:
| fillnull value=0 Pass Fail Error Unknown
| eval _total=Pass+Fail+Error+Unknown
<calculate percentages for each field>
<append "%" to each value (Pass, Fail, Error, Unknown)>
What I want to do is eliminate a "totally" empty column, and only display it if it actually exists somewhere in the source data (not merely because of the fillnull command)
Is this possible?
I was thinking something like this, but cannot figure out the second step:
| eventstats max(Unknown) as _unk
| <if _unk is 0, drop the field>
Edit:
This could just as easily be reworded to:
if every entry for a given field is identical, remove it
Logically, this would look something like:
if(mvcount(values(fieldname))<2), fields - fieldname
Except, of course, that's not valid SPL
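Outside SPL, the reworded rule ("if every entry for a given field is identical, remove it") can be sketched in plain Python. This is only an illustration of the logic, not a Splunk solution; the function name drop_constant_fields and the sample rows are hypothetical, and every row is assumed to carry the same fields.

```python
def drop_constant_fields(rows):
    """Remove every field whose value is identical in all rows.

    rows is a list of dicts that are assumed to share the same keys.
    """
    if not rows:
        return []
    fields = list(rows[0].keys())
    # A field is kept only if at least two distinct values occur in it.
    keep = [f for f in fields if len({row.get(f) for row in rows}) > 1]
    return [{f: row.get(f) for f in keep} for row in rows]


rows = [
    {"scan": "a", "Pass": 3, "Fail": 1, "Unknown": 0},
    {"scan": "b", "Pass": 5, "Fail": 2, "Unknown": 0},
]
print(drop_constant_fields(rows))
```

Note that this rule also drops a column that is constant but non-zero, and with a single row it drops everything, so the "max is 0" variant from the question may be the safer test.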
Could you try this logic after the chart:
``` fill with null values ```
| fillnull value=null()
``` transpose twice (two 90° rotations), dropping empty/null columns ```
| transpose 0 include_empty=false | transpose 0 header_field=column | fields - column
[edit:] It works when I run the following test, but I'm not sure it is easy to make it work under all conditions:
| stats count | eval keep=split("1 2 3 4 5"," ") | mvexpand keep
| table keep nokeep
| fillnull value=null()
| transpose 0 include_empty=false | transpose 0 header_field=column | fields - column
[edit2:] And if you need to turn more values into null(), it could be done like this:
| stats count | eval keep=split("1 2 3 4 5"," "), nokeep=0 | mvexpand keep
| table keep nokeep
| foreach nokeep [ eval nokeep=if(nokeep==0,null(),nokeep) ]
| transpose 0 include_empty=false | transpose 0 header_field=column | fields - column

SQL - Extracting first 5 consecutive numbers from alphanumeric string

I am using AWS Athena, so the available functions are somewhat limited. Essentially, I want to extract the first 5 consecutive digits from an alphanumeric field.
From the first example, you can see it ignores the first 1 because there aren't 4 trailing numbers. I want to find and extract the first 5 numbers that are given together from this field. The output field is what I am hoping to achieve.
This finds a run of exactly 5 digits; a run of fewer or more than 5 digits is ignored.
(^|\D) = the start of the text OR a non-digit character
(\d{5}) = exactly 5 digits (this is capture group 2, the value that is returned)
(\D|$) = a non-digit character OR the end of the text
with t (Example) as (values ('Ex/l/10345/Pl'), ('Ex/23453PlWL'), ('ID09456//'))
select Example, regexp_extract(Example, '(^|\D)(\d{5})(\D|$)', 2) as Output
from t
+---------------+--------+
| Example | Output |
+---------------+--------+
| Ex/l/10345/Pl | 10345 |
| Ex/23453PlWL | 23453 |
| ID09456// | 09456 |
+---------------+--------+
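The same pattern can be checked locally with Python's re module. This is a sketch for experimenting with the regex, not Athena code; the helper name first_five_digit_run is hypothetical, and Athena's regex engine may differ from Python's in other details.

```python
import re

# Group 1: start of text or a non-digit; group 2: exactly five digits;
# group 3: a non-digit or end of text.
FIVE_DIGITS = re.compile(r'(^|\D)(\d{5})(\D|$)')


def first_five_digit_run(text):
    """Return the first run of exactly five digits, or None if there is none."""
    match = FIVE_DIGITS.search(text)
    return match.group(2) if match else None


for example in ("Ex/l/10345/Pl", "Ex/23453PlWL", "ID09456//", "A123456B"):
    print(example, first_five_digit_run(example))
```

The last input contains a six-digit run and therefore yields None, which matches the "fewer or more than 5 digits is ignored" rule.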

Confusing matching behaviour of pandas extract(all)

I have a strange problem. But first, I want to match a hierarchy-based string onto the value of a column in a pandas data frame and count the occurrence of the current node and all of its children.
| index | hierarchystr |
| ----- | --------------------- |
| 0 | level0level00level000|
| 1 | level0level01 |
| 2 | level0level02level021|
| 3 | level0level02level021|
| 4 | level0level02level020|
| 5 | level0level02level021|
| 6 | level1level02level021|
| 7 | level1level02level021|
| 8 | level1level02level021|
| 9 | level2level02level021|
Assume that there are 300k lines. Each node can have multiple children with again multiple children so on and so forth (here represented by level0-2 strings). Now I have a separate hierarchy where I extract the hierarchy strings from. Now to the problem:
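hstrs = ["level0", "level1", "level0level01", "level0level02", "level0level02level021"]
# Build one alternation pattern; extract() keeps only the first alternative that matches
pat = "|".join(hstrs)
s = df.hierarchystr.str.extract('(' + pat + ')', expand=True)[0]
df1 = df.groupby(s).size().reset_index(name='Count')
# Filter on the Count column rather than on the whole frame
df1 = df1[df1['Count'] > 200]
size = len(df1)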
The number of matched substrings with more than 200 occurrences differs on every run! "level0" should match every row whose hierarchy string contains level0, forming a group with all of its children, and the size of that group needs to be greater than 200.
Edit:// levelX is just an example; I have thousands of nodes with different names, and again thousands of different subchildren. The hstrs strings do not contain each other, except for parent nodes (e.g. "parent1" is contained in "parent1subchild1" and "parent1subchild2").
I traced it back to a different order of the hierarchy strings in the array hstrs. So I changed the code to compare each substring individually:
matched = []
for hstr in hstrs:
    s = df.hierarchystr.str.extract('(' + hstr + ')', expand=True)
    s2 = s.count()  # number of rows in which hstr matched
    s3 = s2.values[0]
    if s3 > 200:
        matched.append(hstr)
This is slow as hell, but the result stays the same no matter how hstrs is ordered. For efficiency, is it possible to do the same with a single regex matching group, all at once for all hstrs?
Edit://
expected output would be:
|index| 0 | Count |
|-----|---------------------|-------|
|0 |level0 | 6 |
|1 |level1 | 3 |
|2 |level0level01 | 1 |
|3 |level0level02 | 4 |
|4 |level0level02level021| 3 |
Edit2://
It has something to do with the ordering of hstrs. I think it comes from the extract method stopping after the first match: if the ordering differs, the hierarchy strings in pat are matched differently, which results in different sizes for each group. When a high hierarchy level (short string) comes first, it is matched first, and the lower hierarchy levels in the same pat are never matched for that row. But I don't know how to work around this behavior.
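The order dependence can be reproduced with the standard re module alone. The sketch below assumes that pandas' str.extract behaves like re.search here (one match per row, first alternative wins); the node names and the helper names first_match_counts and containment_counts are hypothetical.

```python
import re
from collections import Counter


def first_match_counts(patterns, rows):
    """Count rows by the single alternative the regex engine picks per row."""
    pat = re.compile("(" + "|".join(re.escape(p) for p in patterns) + ")")
    counts = Counter()
    for row in rows:
        match = pat.search(row)
        if match:
            counts[match.group(1)] += 1
    return counts


def containment_counts(patterns, rows):
    """Count, for every pattern, the rows that contain it as a substring."""
    return Counter(p for row in rows for p in patterns if p in row)


rows = ["rootAkidXleaf1", "rootAkidY", "rootAkidXleaf2", "rootBkidX"]
print(first_match_counts(["rootA", "rootAkidX"], rows))  # parent listed first
print(first_match_counts(["rootAkidX", "rootA"], rows))  # child listed first
print(containment_counts(["rootA", "rootAkidX"], rows))  # independent of order
```

With the parent listed first, the child pattern never gets a count, because the alternation stops at the first alternative that matches. Counting by containment gives every node its own total regardless of how hstrs is ordered.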
Edit3://
An alternative, which is also slow as hell, would be:
beforeset = []
for hstr in hstrs:
    s = df[df.hierarchystr.str.contains(hstr)]
    s2 = s.count()
    s3 = s2.values[0]
    if s3 > 200:
        beforeset.append(hstr)
Edit4://
I think what I am looking for is a way to do a "group_by" with "contains" or "is in" for the hstrs. I'd be glad for any idea. :)
Edit5://
Found a simple but not satisfying alternative (though faster than the previous attempts):
from collections import Counter
# For every row, collect each hstr that occurs in it as a substring
containing = [item for hierarchystr in df.hierarchystr for item in hstrs if item in hierarchystr]
containing = Counter(containing)
df1 = pd.DataFrame([containing]).T
nodeNamesWithOver200 = df1[df1 > 200].dropna().index.values

SQLAlchemy getting label names out from columns

I want to use the same labels from a SQLAlchemy table, to re-aggregate some data (e.g. I want to iterate through mytable.c to get the column names exactly).
I have some spending data that looks like the following:
| name | region | date | spending |
| John | A | .... | 123 |
| Jack | A | .... | 20 |
| Jill | B | .... | 240 |
I'm then passing it to an existing function we have, that aggregates spending over 2 periods (using a case statement) and groups by region:
grouped table:
| Region | Total (this period) | Total (last period) |
| A | 3048 | 1034 |
| B | 2058 | 900 |
The function returns a SQLAlchemy query object that I can then call subquery() on to re-query, e.g.:
subquery = get_aggregated_data(original_table).subquery()
region_A_results = session.query(subquery).filter(subquery.c.region == 'A')
I then want to re-aggregate this subquery (summing every column that can be summed, and replacing the region column with the string 'other').
The problem is, if I iterate through subquery.c, I get labels that look like:
anon_1.region
anon_1.sum_this_period
anon_1.sum_last_period
Is there a way to get the textual label from a set of column objects, without the anon_1. prefix? Especially since I feel that the prefix may change depending on how SQLAlchemy decides to generate the query.
Split the name string and take the second part. If you want to allow for the chance that the name is not prefixed by the table name, put the code in a try/except block:
for col in subquery.c:
    try:
        # take the part after the "anon_1." prefix
        print(col.name.split('.')[1])
    except IndexError:
        # no prefix present, print the name unchanged
        print(col.name)
Also, the result proxy (region_A_results) has a method keys which returns a list of column names. Again, if you don't need the table names, you can easily get rid of them.
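The prefix stripping itself can also be written without try/except. The sketch below works on plain label strings such as the ones shown in the question and does not touch SQLAlchemy; strip_prefix is a hypothetical helper name.

```python
def strip_prefix(label):
    """Return the part after the last ".", or the label itself if there is none."""
    return label.rsplit(".", 1)[-1]


labels = ["anon_1.region", "anon_1.sum_this_period", "sum_last_period"]
print([strip_prefix(label) for label in labels])
```

Because rsplit splits on the last dot only, the helper does not depend on what the generated prefix happens to be called.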

How to represent and insert into an ordered list in SQL?

I want to represent the list "hi", "hello", "goodbye", "good day", "howdy" (with that order), in a SQL table:
pk | i | val
------------
1 | 0 | hi
0 | 2 | hello
2 | 3 | goodbye
3 | 4 | good day
5 | 6 | howdy
'pk' is the primary key column. Disregard its values.
'i' is the "index" that defines that order of the values in the 'val' column. It is only used to establish the order and the values are otherwise unimportant.
The problem I'm having is with inserting values into the list while maintaining the order. For example, if I want to insert "hey" and I want it to appear between "hello" and "goodbye", then I have to shift the 'i' values of "goodbye" and "good day" (but preferably not "howdy") to make room for the new entry.
So, is there a standard SQL pattern to do the shift operation, but only shift the elements that are necessary? (Note that a simple "UPDATE table SET i=i+1 WHERE i>=3" doesn't work, because it violates the uniqueness constraint on 'i', and also it updates the "howdy" row unnecessarily.)
Or, is there a better way to represent the ordered list? I suppose you could make 'i' a floating point value and choose values between, but then you have to have a separate rebalancing operation when no such value exists.
Or, is there some standard algorithm for generating string values between arbitrary other strings, if I were to make 'i' a varchar?
Or should I just represent it as a linked list? I was avoiding that because I'd like to also be able to do a SELECT .. ORDER BY to get all the elements in order.
As I read your post, I kept thinking 'linked list', and at the end, I still think that's the way to go.
If you are using Oracle, and the linked list is a separate table (or even the same table with a self-referencing id, which I would avoid), then you can use a CONNECT BY query and the pseudo-column LEVEL to determine the sort order.
You can easily achieve this by using a cascading trigger that updates any 'index' entry equal to the new one on the insert/update operation to the index value +1. This will cascade through all rows until the first gap stops the cascade - see the second example in this blog entry for a PostgreSQL implementation.
This approach should work independent of the RDBMS used, provided it offers support for triggers to fire before an update/insert. It basically does what you'd do if you implemented your desired behavior in code (increase all following index values until you encounter a gap), but in a simpler and more effective way.
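To make the "shift until the first gap" idea concrete, here is a small in-memory sketch in Python. It only illustrates which rows move; it is not the trigger from the blog entry, and the dict standing in for the table as well as the name insert_with_cascade are hypothetical.

```python
def insert_with_cascade(positions, index, value):
    """Insert value at index, pushing occupants up by one until a gap is hit.

    positions maps an integer index to a value and stands in for the table.
    """
    carried = value
    i = index
    while i in positions:
        # Put the carried value here and carry the displaced one to the next index.
        positions[i], carried = carried, positions[i]
        i += 1
    positions[i] = carried  # first free index ends the cascade
    return positions


table = {0: "hi", 2: "hello", 3: "goodbye", 4: "good day", 6: "howdy"}
insert_with_cascade(table, 3, "hey")
print(sorted(table.items()))
```

In this run "goodbye" and "good day" move up by one, while "howdy" keeps index 6 because the gap at 5 stops the cascade.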
Alternatively, if you can live with a restriction to SQL Server, check the hierarchyid type. While mainly geared at defining nested hierarchies, you can use it for flat ordering as well. It somewhat resembles your approach using floats, as it allows insertion between two positions by assigning fractional values, thus avoiding the need to update other entries.
If you use strings instead of numbers, you could have a table like this:
pk | i | val
------------
1 | a0 | hi
0 | a2 | hello
2 | a3 | goodbye
3 | b | good day
5 | b1 | howdy
You can insert a4 between a3 and b, a21 between a2 and a3, a1 between a0 and a2, and so on. You would need a clever function to generate an i for a new value v between p and n. The index can grow longer and longer, so you may need a big rebalancing from time to time.
Another approach could be to implement a (doubly) linked list in the table, where you don't store indexes but links to the previous and next elements, which means that an insert normally has to update only 1-2 elements:
pk | prev | val
------------
1 | 0 | hi
0 | 1 | hello
2 | 0 | goodbye
3 | 2 | good day
5 | 3 | howdy
Inserting hey between hello and goodbye (hey gets pk 6):
pk | prev | val
------------
1 | 0 | hi
0 | 1 | hello
6 | 0 | hey <- ins
2 | 6 | goodbye <- upd
3 | 2 | good day
5 | 3 | howdy
The previous element of hey is hello (pk=0), and goodbye, which has linked to hello until now, has to link to hey from now on.
But I don't know whether an 'order by' mechanism for this exists across many database implementations.
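For databases that support recursive common table expressions, the links can be followed in a single query. The sketch below uses Python's built-in sqlite3 module as one example and makes no claim about other engines. The table name items, the helper names, and the convention that the first element has prev set to NULL are assumptions for this illustration (the table above uses 0 for the head instead).

```python
import sqlite3


def demo_connection():
    """Build an in-memory table like the example; the head row has prev NULL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (pk INTEGER PRIMARY KEY, prev INTEGER, val TEXT)")
    conn.executemany(
        "INSERT INTO items (pk, prev, val) VALUES (?, ?, ?)",
        [(1, None, "hi"), (0, 1, "hello"), (2, 0, "goodbye"),
         (3, 2, "good day"), (5, 3, "howdy")],
    )
    return conn


def insert_after(conn, new_pk, val, after_pk):
    """Insert a row behind after_pk, touching only the old successor."""
    # Re-point the old successor first so the UPDATE cannot hit the new row.
    conn.execute("UPDATE items SET prev = ? WHERE prev = ?", (new_pk, after_pk))
    conn.execute("INSERT INTO items (pk, prev, val) VALUES (?, ?, ?)",
                 (new_pk, after_pk, val))


def ordered_values(conn):
    """Walk the prev links from the head and return the values in list order."""
    rows = conn.execute("""
        WITH RECURSIVE chain(pk, val, pos) AS (
            SELECT pk, val, 0 FROM items WHERE prev IS NULL
            UNION ALL
            SELECT i.pk, i.val, chain.pos + 1
            FROM items AS i JOIN chain ON i.prev = chain.pk
        )
        SELECT val FROM chain ORDER BY pos
    """).fetchall()
    return [row[0] for row in rows]


conn = demo_connection()
insert_after(conn, 6, "hey", 0)
print(ordered_values(conn))
```

The insert changes exactly one existing row (goodbye), and the ordering is rebuilt at query time by following the links.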
Since I had a similar problem, here is a very simple solution:
Make your i column a float, but insert whole-number values for the initial data:
pk | i | val
------------
1 | 0.0 | hi
0 | 2.0 | hello
2 | 3.0 | goodbye
3 | 4.0 | good day
5 | 6.0 | howdy
Then, if you want to insert something in between, just compute a float value in the middle between the two surrounding values:
pk | i | val
------------
1 | 0.0 | hi
0 | 2.0 | hello
2 | 3.0 | goodbye
3 | 4.0 | good day
5 | 6.0 | howdy
6 | 2.5 | hey
This way, the number of inserts between the same two values is limited by the resolution of float values, but for almost all cases that should be more than sufficient.
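The resolution limit can be measured with a short Python sketch. It assumes the column behaves like a Python float (an IEEE 754 double); the function names midpoint and inserts_until_exhausted are hypothetical. The loop models the worst case, where every new row is inserted directly after the same neighbour.

```python
def midpoint(lo, hi):
    """Return a key halfway between two neighbouring keys."""
    return lo + (hi - lo) / 2


def inserts_until_exhausted(lo, hi):
    """Count how often a new key still fits strictly between lo and the last insert."""
    count = 0
    while True:
        mid = midpoint(lo, hi)
        if mid <= lo or mid >= hi:
            # No representable value is left between the two keys.
            return count
        hi = mid
        count += 1


print(midpoint(2.0, 3.0))  # 2.5, the key used for "hey" above
print(inserts_until_exhausted(2.0, 3.0))
```

Under these assumptions the worst case allows on the order of fifty inserts between 2.0 and 3.0 before a rebalancing pass is needed; inserts spread across different gaps do not use up this budget.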