Regex capturing inside a group - sql

I am working on a method to extract all the values from a SQL query and then escape them in PHP.
The idea is to catch the programmer who is careless about security when writing a SQL query.
So when I try to execute this:
INSERT INTO tabla (a, b,c,d) VALUES ('a','b','c',a,b)
The regex needs to capture 'a' 'b' 'c' a and b
I have been working on this for a couple of days.
This is as far as I could get using two regexes, but I want to know if there is a better way:
VALUES ?\((([\w'"]+).+?)\)
Based on the previous SQL this will match:
VALUES ('a','b','c',a,b)
The second regex
Will match
a b c a b
Previously removing VALUES, of course.
This approach matches many of the values I am going to insert.
But it doesn't work with JSON, for example:
{a:b, "asd":"ads" ....}
Any help with this?

First, you should know that SQL supports many forms of single- and double-quoted strings:
'Northwind\'s category name'
'Northwind''s category name'
"Northwind \"category\" name"
"Northwind ""category"" name"
"Northwind category's name"
'Northwind "category" name'
'Northwind \\ category name'
'Northwind \ncategory \nname'
to match them, try with these patterns:
combine patterns together:
PHP5.4.5 sample code:
$pat = '/\bVALUES\s*\((\s*(?:"[^\\"]*(?:(?:\\.|"")[^\\"]*)*"|\'[^\\\']*(?:(?:\\.|\'\')[^\\\']*)*\'|\w+)(?:\s*,\s*(?:"[^\\"]*(?:(?:\\.|"")[^\\"]*)*"|\'[^\\\']*(?:(?:\\.|\'\')[^\\\']*)*\'|\w+))*)\)/';
$sql_sample1 = "INSERT INTO tabla (a, b,c,d) VALUES ('a','b','c',a,b)";
if (preg_match($pat, $sql_sample1, $matches) > 0) {
    printf("%s\n", $matches[0]);   // the whole VALUES (...) clause
    printf("%s\n\n", $matches[1]); // only the comma-separated value list
}
$sql_sample2 = 'INSERT INTO tabla (a, b,c,d) VALUES (\'a\',\'{a:b, "asd":"ads"}\',\'c\',a,b)';
if (preg_match($pat, $sql_sample2, $matches) > 0) {
    printf("%s\n", $matches[0]);
    printf("%s\n", $matches[1]);
}
Output:
VALUES ('a','b','c',a,b)
'a','b','c',a,b
VALUES ('a','{a:b, "asd":"ads"}','c',a,b)
'a','{a:b, "asd":"ads"}','c',a,b
If you need each individual value from the result, split it on commas (like parsing CSV), taking care not to split on commas that sit inside a quoted string.
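The splitting step can be sketched in Python as follows. This is only an illustration, not the PHP pattern above: the name split_values is made up for the example, and the sketch assumes each value is a double-quoted string, a single-quoted string, or a bare word, with backslash escapes or doubled quotes allowed inside the quotes.

```python
import re

# One SQL value: a quoted string (backslash escapes or doubled quotes allowed)
# or a bare word such as a column name or a number.
VALUE = re.compile(r"""
    "(?:[^"\\]|\\.|"")*"      # double-quoted string
  | '(?:[^'\\]|\\.|'')*'      # single-quoted string
  | \w+                       # bare word or number
""", re.VERBOSE)


def split_values(values_body):
    """Return the individual values from the text between VALUES ( and )."""
    return VALUE.findall(values_body)


body = "'a','{a:b, \"asd\":\"ads\"}','c',a,b"
values = split_values(body)
for value in values:
    print(value)
```

Because the commas are never used as delimiters directly, a comma inside the JSON string stays part of that value.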
I hope this will help you :)


How to retrieve columns that end with one of characters in a list

I'm trying to learn SQL and to figure out a way to retrieve all columns whose name ends with one of the characters in a list (using JDBC queries):
public Map<Long, Set<Long>> groupCountriesBy(Set<Integer> countryIdLastDigits) {
String query = "SELECT FROM countries c"
+ " WHERE LIKE '%[dea]'"
+ " GROUP BY ";
var args = new MapSqlParameterSource("countryIdLastDigits", countryIdLastDigits);
WHERE LIKE '%[dea]' does return all columns that end with either d, e or a, but I have not found a way to pass countryIdLastDigits to this SQL query.
Could you please share some pointers or hints? I'm probably missing a few SQL concepts or commands.
Most SQL dialects have left and right string functions so perhaps something like
where right(col,1) in ('d', 'e', 'a')
will be all you need.
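To show how the list itself can be passed as parameters, here is a minimal sketch in Python using the standard sqlite3 module rather than JDBC. The table layout, the column name "name" and the helper rows_ending_with are assumptions made for the example. SQLite has no RIGHT() function, so substr(name, -1) stands in for right(col, 1).

```python
import sqlite3


def rows_ending_with(conn, last_chars):
    """Return names whose last character is one of last_chars."""
    # Build one "?" placeholder per character; the values themselves are
    # bound as parameters, never pasted into the SQL text.
    placeholders = ", ".join("?" for _ in last_chars)
    query = (
        "SELECT name FROM countries "
        f"WHERE substr(name, -1) IN ({placeholders}) "
        "ORDER BY name"
    )
    return [row[0] for row in conn.execute(query, sorted(last_chars))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (name TEXT)")
conn.executemany(
    "INSERT INTO countries VALUES (?)",
    [("Canada",), ("France",), ("Peru",), ("Chad",)],
)
result = rows_ending_with(conn, {"d", "e", "a"})
print(result)
```

The same idea should carry over to other drivers: generate the placeholder list from the size of the collection and bind each element separately.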

query using string in PyTables 3

I have a table:
h5file=open_file("ex.h5", "w")
class ex(IsDescription):
    A=StringCol(5, pos=0)
    B=StringCol(5, pos=1)
    C=StringCol(5, pos=2)
table=h5file.create_table('/', 'table', ex, "Passing string as column name")
table.append([
    ('abc', 'bcd', 'dse'),
    ('der', 'fre', 'swr'),
    ('xsd', 'weq', 'rty')
])
I am trying to query as per below:
if creteria=='B':
    value=[x['A'] for x in table.where("""condition==find""")]
It returns:
ValueError: there are no columns taking part in condition condition==find
Is there a way to use condition as a column name in above query?
Thanks in advance.
Yes, you can use Pytables .where() to search based on a condition. The problem is how you constructed your query for the table.where(condition). See Note about strings under Table.where() in the Pytables Users Guide:
A special care should be taken when the query condition includes string literals. ... Python 3 strings are unicode objects.
in Python 3, “condition” should be defined like this:
condition = 'col1 == b"AAAA"'
The reason is that in Python 3 “condition” implies a comparison between a string of bytes (“col1” contents) and an unicode literal (“AAAA”).
The simplest form of your query is shown below. It returns the subset of rows that match the condition. Note the bytes literal (b"...") inside the condition string, as described in the note quoted above:
query_table = table.where('C == b"swr"') # search in column C
I rewrote your example as best I could. See below. It shows several ways to enter the condition. I'm not smart enough to figure out how to combine your creteria and find variables into a single condition variable with string and unicode characters.
from tables import *
class ex(IsDescription):
    A=StringCol(5, pos=0)
    B=StringCol(5, pos=1)
    C=StringCol(5, pos=2)
h5file=open_file("ex.h5", "w")
table=h5file.create_table('/', 'table', ex, "Passing string as column name")
## table=h5file.root.table
table.append([
    ('abc', 'bcd', 'dse'),
    ('der', 'fre', 'swr'),
    ('xsd', 'weq', 'rty')
])
table.flush()
# Variables named in the condition are looked up in the calling namespace.
find = b'swr'  # bytes, because StringCol values are bytes in Python 3
query_table = table.where('C==find')
for row in query_table :
    print (row)
    print (row['A'], row['B'], row['C'])
value=[x['A'] for x in table.where('C == b"swr"')]
value=[x['A'] for x in table.where('C == find')]
h5file.close()
Output shown below:
/table.row (Row), pointing to row #1
b'der' b'fre' b'swr'
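One possible way to combine the column name and the search value into a single condition is to format only the column name into the condition string and pass the value separately through the condvars argument of Table.where(). The helper below is a sketch: build_condition and its allowed_columns parameter are hypothetical names, and the helper only prepares the arguments, so it runs without PyTables installed.

```python
def build_condition(column, value, allowed_columns=("A", "B", "C")):
    """Return (condition, condvars) suitable for Table.where()."""
    # Only accept known column names, since the name is formatted into
    # the condition string.
    if column not in allowed_columns:
        raise ValueError(f"unknown column: {column!r}")
    # StringCol data is bytes in Python 3, so compare against bytes.
    if isinstance(value, str):
        value = value.encode("ascii")
    return f"{column} == find", {"find": value}


condition, condvars = build_condition("B", "bcd")
print(condition, condvars)

# With an open table (not executed here), the query would then be:
# value = [x['A'] for x in table.where(condition, condvars=condvars)]
```

Keeping the value out of the condition string avoids having to nest quotes and the b prefix by hand.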

Perl: for (min .. max) uses random order, but I want it in order 0,1,2,

I am a total beginner at Perl, Oracle SQL and everything else. I have to write a script that parses an Excel file and writes the values into an Oracle SQL database.
Everything works so far, except that it writes the rows into the database in random order.
for ($row_min .. $row_max) {...insert into db code $sheetValues[$_][col0] etc...}
I don't understand why the rows are inserted in a random order.
And, obviously, how can I get them in order? excel_row 0 => db_row 0 and so on...
The values in the array are in order! The number of rows is dynamic.
Thanks for your help, I hope you got all the information you need.
sub parseWrite {
my @sheetValues;
my $worksheet = $workbook->worksheet(0);
my ($row_min, $row_max) = $worksheet->row_range();
print "| Zeile $row_min bis $row_max |";
my ($col_min, $col_max) = $worksheet->col_range();
print " Spalte $col_min bis $col_max |<br>";
for my $row ($row_min .. $row_max) {
for my $col ($col_min .. $col_max) {
my $cell = $worksheet->get_cell ($row,$col);
next unless $cell;
$sheetValues[$row][$col] = $cell->value();
print $sheetValues[$row][$col] .
"(".$row."," .$col.")"."<br>";
for ($row_min .. $row_max) {
my $sql="INSERT INTO t_excel (
'$sheetValues[$_][0 ]',
'$sheetValues[$_][1 ]',
'$sheetValues[$_][2 ]',
'$sheetValues[$_][3 ]',
'$sheetValues[$_][4 ]',
'$sheetValues[$_][5 ]'
By "in order" I mean the order that PL/SQL Developer 8.0.3 (provided by my company)
shows when I run SELECT * FROM t_excel;
In the array, however, shell = (2,0), maggie = (0,0) and 13 = (1,0).
The rows are being inserted in the order you expect. I believe the mistaken assumption here is that SELECT will return rows in the same order they're inserted. This is not true. While implementations may make it seem like it does, SELECT has no default order. You're thinking a table is basically like a big list, INSERT is adding to the end of it, and SELECT just iterates through it. That's not a bad approximation, but it can lead you to make bad assumptions. The reality is that you can say little for sure about how a table is stored.
SQL is a declarative language, which means you tell the computer what you want. This is different from most other kinds of languages, where you tell the computer what to do. SELECT * FROM sometable says "give me all the rows and all their columns in the table". Since you didn't give an order, the database can return them in whatever order it likes. Contrast this with the procedural meaning, which would be "iterate through all the rows in the table", as if the table were some sort of list.
Most languages encourage you to take advantage of how data is stored. Declarative languages prevent you from knowing how data is stored.
If you want your SELECT to be ordered, you have to give it an ORDER BY.
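As a rough illustration of that advice, the sketch below stores the spreadsheet row number alongside each row and sorts on it when reading. It uses Python with an in-memory SQLite database instead of Perl and Oracle, and the column names excel_row and name are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_excel (excel_row INTEGER, name TEXT)")

# Stand-in for the parsed worksheet: one inner list per spreadsheet row.
sheet_values = [["maggie"], ["13"], ["shell"]]

# Store the row index explicitly so there is something to order by later.
for row_index, row in enumerate(sheet_values):
    conn.execute(
        "INSERT INTO t_excel (excel_row, name) VALUES (?, ?)",
        (row_index, row[0]),
    )

# Without ORDER BY the database may return rows in any order;
# with it, the result order is guaranteed.
ordered = conn.execute(
    "SELECT excel_row, name FROM t_excel ORDER BY excel_row"
).fetchall()
print(ordered)
```

The key design choice is that the ordering information lives in the data itself, not in the insertion sequence.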

Pig Latin - Extracting fields meeting two different filter criteria from chararray line and grouping in a bag

I am new to Pig Latin.
I want to extract all lines from log files that match a filter criterion (they contain the word "line_token") and then, from these matching lines, extract two different fields that meet two separate field match criteria. Since the lines aren't well structured, I am loading them as a chararray.
When I try to run the following code - I get an error
"Invalid resource schema: bag schema must have tuple as its field"
I have tried to perform an explicit cast to a tuple but that does not work
input_lines = LOAD '/inputdir/' AS ( line:chararray);
filtered_lines = FILTER input_lines BY (line MATCHES '.*line_token1.*' );
tokenized_lines = FOREACH filtered_lines GENERATE FLATTEN(TOKENIZE(line)) AS tok_line;
my_wordbag = FOREACH tokenized_lines {
    word1 = FILTER tok_line BY ( $0 MATCHES '.*word_token1.*' ) ;
    word2 = FILTER tok_line BY ( $0 MATCHES '.*word_token1.*' ) ;
    GENERATE word1 , word2 as my_tuple ;
    -- I also tried --> GENERATE (word1 , word2) as my_tuple ;
};
dump my_wordbag;
I suppose I am taking a very wrong approach.
Please note: my logs aren't well structured, so I can't change the way I load them.
Post loading and initial filtering for lines of interest ( which is straightforward) - I guess I need to do something different rather than tokenize line and iterate through fields trying to find fields.
Or maybe I should use joins ?
Also if I know the structure of line beforehand well as all text fields, then will loading it differently ( not as a chararray) make it an easier problem ?
For now I made a compromise: I added an extra filter clause to my original line filter and settled for picking just one field from each line. When I get back to it I will try joins and post that code. Here is my working code, which gives me useful output, but not everything I want.
-- read input lines from poorly structured log
input_lines = LOAD '/log-in-dir-in-hdfs' AS ( line:chararray) ;
-- Filter for line filter criteria and date interested in passed as arg
filtered_lines = FILTER input_lines BY (
( line MATCHES '.*line_filter1.*' )
AND ( line MATCHES '.*line_filter2.*' )
AND ( line MATCHES '.*$forDate.*' )
) ;
-- Tokenize every line
tok_lines = FOREACH filtered_lines
GENERATE TOKENIZE(line) AS tok_line;
-- Pick up specific field from tokenized line based on column filter criteria
fnames = FOREACH tok_lines {
    fname = FILTER tok_line BY ( $0 MATCHES '.*field_selection.*' ) ;
    GENERATE FLATTEN(fname) as nnfname;
};
-- Count occurrences of that field and store it with field name
-- My original intent is to store another field name as well
-- I will do that once I figure how to put both of them in a tuple
flgroup = FOREACH fnames
GENERATE FLATTEN(TOKENIZE((chararray)$0)) as cfname;
grpfnames = group flgroup by cfname;
readcounts = FOREACH grpfnames GENERATE COUNT(flgroup), group ;
STORE readcounts INTO '/out-dir-in-hdfs';
As I understand it, after the FLATTEN operation you have a single line (tok_line) in each row and you want to extract two words from each line. REGEX_EXTRACT will help you achieve this. I'm not a regex expert, so I will leave writing the regex part up to you. The third argument is the index of the capture group to return.
data = FOREACH tokenized_lines GENERATE
    REGEX_EXTRACT(tok_line, '<first word regex goes here>', 1) as firstWord,
    REGEX_EXTRACT(tok_line, '<second word regex goes here>', 1) as secondWord;
I hope this helps.
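The same idea, pulling two fields out of one unstructured line with two separate regexes, can be sketched in Python. The sample log line and the two patterns (user=... and file=...) are invented for illustration; the real patterns depend on the log format.

```python
import re


def extract_fields(line, first_pattern, second_pattern):
    """Return the first capture group of each pattern, or None if absent."""
    first = re.search(first_pattern, line)
    second = re.search(second_pattern, line)
    return (
        first.group(1) if first else None,
        second.group(1) if second else None,
    )


line = "2013-06-12 line_token1 user=alice file=report.csv"
fields = extract_fields(line, r"user=(\S+)", r"file=(\S+)")
print(fields)
```

Working on the whole line, rather than on individual tokens, keeps both extracted fields together in one record.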
You must refer to the alias, not the column.
word1 = FILTER tokenized_lines BY ( $0 MATCHES '.*word_token1.*' ) ;
word1 and word2 are going to be aliases as well, not columns.
What do you need the output to look like?

SQL regex and field

I want to change the query so that it matches multiple values in extra_fields. How can I change the regex? Also, I don't understand what extra_fields is. Is it a field? If so, why isn't it referenced with the table prefix, like i.extra_fields?
CASE WHEN i.modified = 0 THEN i.created ELSE i.modified END AS lastChanged, AS categoryname, AS categoryid,
c.alias AS categoryalias,
c.params AS categoryparams
FROM #__k2_items AS i
LEFT JOIN #__k2_categories AS c ON = i.catid
WHERE i.published = 1
AND i.access IN(1,1)
AND i.trash = 0
AND c.published = 1
AND c.access IN(1,1)
AND c.trash = 0
AND (i.publish_up = '0000-00-00 00:00:00'
OR i.publish_up <= '2013-06-12 22:45:19')
AND (i.publish_down = '0000-00-00 00:00:00'
OR i.publish_down >= '2013-06-12 22:45:19')
AND extra_fields REGEXP BINARY '(.*{"id":"2","value":\["[^\"]*1[^\"]*","[^\"]*2[^\"]*","[^\"]*3[^\"]*"\]}.*)'
extra_fields is a column of the #__k2_items table. The table qualifier can be omitted because the name is not ambiguous in this query. The column is JSON encoded. JSON is a serialization format, and data stored this way is not searchable by design. Applying a regex may work one day but fail another day, since there is no guarantee that id precedes value (as it does in your example).
The right way
The right way to filter this is to drop the extra_fields condition from the SQL query and evaluate it on the result set instead. Example:
$rows = $db->loadObjectList('id');
foreach ($rows as $id => $row) {
    $extra_fields = json_decode($row->extra_fields);
    if ($extra_fields->id != 2) {
        // Not the field we are looking for: drop this row from the result.
        unset($rows[$id]);
    }
}
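A language-neutral sketch of this decode-then-filter approach, written in Python, is shown below. It assumes that extra_fields holds a JSON array of objects with "id" and "value" keys, which is what the regex in the question suggests; the function name keep_row and its parameters are hypothetical.

```python
import json


def keep_row(extra_fields_json, field_id="2", wanted=("1", "2", "3")):
    """True if the field with this id exists and every value contains a wanted digit."""
    for field in json.loads(extra_fields_json):
        if field.get("id") != field_id:
            continue
        values = field.get("value", [])
        if isinstance(values, str):
            values = [values]  # a single value may be stored as a plain string
        return bool(values) and all(
            any(w in v for w in wanted) for v in values
        )
    return False


rows = [
    '[{"id":"2","value":["1","3"]}]',
    '[{"value":["2"],"id":"2"}]',   # key order inside the object does not matter
    '[{"id":"5","value":["1"]}]',
]
kept = [keep_row(r) for r in rows]
print(kept)
```

Because the JSON is parsed rather than pattern-matched, the order of id and value inside each object has no effect on the result.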
The short way
If you can't change the database layout (which is true for extensions you want to keep updateable), you must split the condition into two, because there is no guarantee for a certain order of the subfields. For some reason, one day value may occur before id. So change your query to
AND extra_fields LIKE '%"id":"2"%'
AND extra_fields REGEXP BINARY '"value":\[("[^\"]*[123][^\"]*",?)+\]'
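The effect of splitting the condition can be checked outside the database. The sketch below uses Python's re module, whose syntax differs in places from MySQL's REGEXP, so treat it as an illustration of the logic only; the sample strings and function names are made up.

```python
import re

ID_PART = '"id":"2"'
VALUE_PART = re.compile(r'"value":\[("[^"]*[123][^"]*",?)+\]')
# Single pattern that requires id to come directly before value.
COMBINED = re.compile(r'\{"id":"2","value":\[("[^"]*[123][^"]*",?)+\]\}')


def matches_split(extra_fields):
    """Two independent checks, mirroring the LIKE plus REGEXP pair."""
    return ID_PART in extra_fields and VALUE_PART.search(extra_fields) is not None


def matches_combined(extra_fields):
    """One check that depends on the order of the subfields."""
    return COMBINED.search(extra_fields) is not None


id_first = '[{"id":"2","value":["1","3"]}]'
value_first = '[{"value":["1","3"],"id":"2"}]'

print(matches_split(id_first), matches_split(value_first))
print(matches_combined(id_first), matches_combined(value_first))
```

Note that the two split checks are independent, so when a row holds several fields they could in principle be satisfied by different fields; decoding the JSON avoids that ambiguity.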
Prepare an intermediate table to hold the contents of extra_fields. Each extra_fields field will be converted into a series of records. Then do a join.
Create a trigger and cronjob to keep the temp table in sync.
Another way is to write UDF in Perl that will decode the field, but AFAIK it is not indexable in mysql.
Using an external search engine is out of scope.
OK, I didn't want to change the DB structure. I got some help and changed the regex to:
AND extra_fields REGEXP BINARY '(.*{"id":"2","value":\[("[^\"]*[123][^\"]*",?)+\]}.*)'
and I got the right results.