Parsing a slash and apostrophe with Python regex - sql

I am attempting to parse a Wikipedia SQL dump with the Python regular expressions library. The ultimate goal is to import this dump into PostgreSQL, but I know the apostrophes in strings need to be doubled, beforehand.
Every apostrophe in a string in this dump is preceded by a backwards slash, though, and I'd rather not remove the backwards slashes.
(42,'Thirty_Years\'_War',33,5,0,0)
Using the command
re.match(".*?([\w]+?'[\w\s]+?).*?", line)
I cannot identify the apostrophe in the middle of 'Thirty_Years\'_War', when 'line' is parsed from a text file.
For comparison, these lines work fine when parsed (sans the last line).
The person's car
The person's car's gasoline
Hodges' Harbrace Handbook
'Hodges' Harbrace Handbook'
portspeople',1475,29,0,0),(42,'Thirty_Years\'_War',33,5,0,0)
Correct and expected output (sans the last line):
The person''s car
The person''s car''s gasoline
Hodges'' Harbrace Handbook
('Hodges'' Harbrace Handbook')
portspeople',1475,29,0,0),(42,'Thirty_Years\'_War',33,5,0,0)
Using the command
re.match(".*?([\w\\]+?'[\w\s]+?).*?", line)
breaks it.
The person''s car
The person''''s car''''s gasoline
Hodges'' Harbrace Handbook
(''''''''Hodges'''''''' Harbrace Handbook'''''''')
portspeople'''''''''''''''',1475,29,0,0),(42,''''''''''''''''Thirty_Years\''''''''''''''''_War'''''''''''''''',33,5,0,0)
Is it stuck in some sort sort of loop? What is the correct regex code to use?
I am not thinking about SQL injection attacks because this script is only going to be used for parsing dumps of Wikipedia articles (that don't contain examples of SQL injection attacks).

If the dump consists of things like the string you provided, you could try something like this:
re.findall(r"[^,\(\)]+")
Where the character class contains all known separators.
EDIT: Only use regex for parsing when there is no better way :)

Most Python database interfaces will take care of quoting SQL statements for you. For example, with the psycopg driver, you would write something like:
mystring="""This is 'a string' that contains single quotes."""
c.execute('INSERT INTO mytable (mycolumn) VALUES (%s)', mystring)
...and the database driver will take care of correctly quoting the values for you. Look at some of the examples in the documentation. In fact, their first example is remarkably like this one.

Related

How to add a small bit of context in a grammar?

I am tasked to parse (and transform) a code of a computer language, that has a slight quirk in its rules, at least I see it this way. To be exact, the compiler treats new lines (as well as semicolons) as statement separators, but other than that (e.g. inside the statement) it treats them as spacers (whitespace).
As an example, this code:
try
local x = 5 / 0
catch (i)
print(i + "\n")
is proved to be equivalent to this:
try local x = 5 / 0 catch (i) print(i + "\n")
I don't see how I can express such a rule in EBNF, or specifically in Lark EBNF dialect. I mean in a sensible way. I probably could define all possible newline positions inside all statements, but it would be cumbersome and error-prone.
I wish to find a way to treat newlines contextually. Is there a proven method for this, preferably within Python/Lark domain? If I have to modify the parser for that purpose, then where should I start?
Or if I misunderstood something in this language in particular or in machine language parsing in general, or my statement of the problem is wrong, I'd also be happy to get educated.
(As you may guess, the language in question has a well proven implementation, but no officially defined grammar. Also, it is Squirrel, for all that it matters.)
The relevant quote from the "specification" is this:
A squirrel program is a simple sequence of statements.:
stats := stat [';'|'\n'] stats
[...] Statements can be separated with a new line or ‘;’ (or with the keywords case or default if inside a switch/case statement), both symbols are not required if the statement is followed by ‘}’.
These are relatively complex rules and in their totality not context free if newlines can also be ignored everywhere else. Note however that in my understanding the text implies that ; or \n are required when no of the other cases apply. That would make your example illegal. That probably means that the BNF as written is correct, e.g. both ; and \n are optionally everywhere. In that case you can (for lark) just put an %ignore "\n" statement and it should work fine.
Also, lark should not complain if you both ignore the \n and use it in a rule: Where useful it will match it in a rule, otherwise it will just ignore it. Note however that this breaks if you use a Terminal that includes the \n (e.g. WS or /\s/). Just have \n as an extra case.
(For the future: You will probably get faster response for lark questions if you ask over on gitter or at least put a link to SO there.)

Multi-line text in a .env file

In vue, is there a way to have a value span multiple lines in an .env file. Ex:
Instead of:
someValue=[{"someValue":"Here is a really really long piece which should be split into multiple lines"}]
I want to do something like:
someValue=`[{"someValue":"Here is a really
really long piece which
should be split into multiple lines"}]`
Doing the latter gives me a JSON parsing error if I try to do JSON.parse(someValue) in my code
I don't know if this will work, but I can't format a comment appropriately enough to get the point across so see if this will work:
someValue=[{"someValue":"Here is a really\
really long piece which\
should be split into multiple lines"}]
Where "\" should escape the newline similar to how you can write long bash commands while escaping the newline. I'm not certain the .env interpreter will support it though.
EDIT
Looks like this won't work. This syntax was actually proposed, but I don't think it was incorporated. See motdotla/dotenv#333 (which is what Vue uses to parse .env).
Like #zero298 said, this isn't possible. Likely you could delimit the entry with a character that wouldn't show up normally in the text (^ is a good candidate), then parse it within the application using string.replace('^', '\n');

Inconsistent line endings in SSIS Flat File import

I have a large, pipe delineated text file with no text qualifiers, and it looks like whatever spit out this file accidentally spit out false "LF" markers in the last column every few hundred rows.
The last column is a descriptive column, and It is not text qualified in any way like it should be.
file looks similar to this:
id|data|data|data|data|Description[LF]
id|data|data|data|data|Description[LF]
id|data|data|data|data|Description[LF]
id|data|data|data|data|Descr[LF]
iption[LF]
id|data|data|data|data|Description[LF]
Id|data|data|data|data|Description[LF]
id|data|data|data|data|Descripti[LF]
on[LF]
id|data|data|data|data|Description[LF]
id|data|data|data|data|Description[LF]
id|data|data|data|data|Description[LF]
id|data|data|data|data|Description[LF]
id|data|data|data|data|D[LF]
escription[LF]
I'm pretty new to SSIS and SQL in general, Does anyone have any advice on how to fix this?
I did actually find a way to fix it in Notepad++, because I don't know C# and I don't know SSIS well enough..
The ID was 8 Digits long, and followed by 7 Blank spaces. That was absolutely unique to this file.
In notepad++ I used (Find Extended) to search and replace "\n"(LF) with nothing
then I used the this expression for find:
(\d\d\d\d\d\d\d\d[[:blank:]][[:blank:]][[:blank:]][[:blank:]][[:blank:]][[:blank:]][[:blank:]])
to find all 8 digit numbers with 7 trailing spaces, and for replace, used this:
\r\n\1
to put a [CR][LF] in front of those 8 digit numbers.
Lo and behold it worked!
But either way.. My boss contacted the client and is requesting a better file. Now I get kudos, and we get proper data. Thanks for the advice all!
If I had to take a guess, I would say that this is occurring because of how the file is created... you are probably having data that just happens to include certain special characters which are being incorrectly interpreted as a Line Feed.
Check this site to see if the data within your problem lines match any of these encodings. If this is the case then ultimately you have two options available:
1) Create some elaborate and complicated ETL process to detect and correct the file data before you process it. This is inadvisable as it will be a major pain to create and maintain.
2) Try changing the way this file is produced. Most text export wizards will allow you to place quotes (") around text items so that your import process can quickly detect something as a text block as opposed to a series of encoded characters to interpret.

Objective C parse string for middle chars

This is a bit of a puzzler for me. I have a string that looks like:
fanspd<fanspd>3</fanspd>
doorinprocess<doorinprocess>0</doorinprocess>
timeremaining<timeremaining>0</timeremaining>
macaddr<macaddr>60:CB:FB:99:99:C1</macaddr>
ipaddr<ipaddr>10.0.0.6</ipaddr>
model<model>4.4eWHF</model>
softver: <softver>2.14.2</softver>
interlock1: <interlock1>0</interlock1>
interlock2: <interlock2>0</interlock2>
cfm: <cfm>2200</cfm>
power: <power>120</power>
inside: <house_temp>-99</house_temp>
<DNS1>10.0.0.1</DNS1>
attic: <attic_temp>76</attic_temp>
OA: <oa_temp>-99</oa_temp>
server response: <server_response>Ó£àêEE²ç©þ]kõ «jsÐ</server_response>
DIP Switches: <DIPS>11100</DIPS>
Remote Switch: <switch2>1111</switch2>
Setpoint:<Setpoint>0</Setpoint>
The string includes the "/n" so I have split it into corrisponding lines that look like
fanspd<fanspd>0</fanspd>
All I really want is the char(s) in the middle of the line. In the above example it would be 0.
I can match everything with regular expressions but by doing the following:
(.*)(<[a-z]+>)(.*)(</[a-z]+>)
But what I'd like is something more that would exclude or strip away or remove all the junk and grab the middle chars.
(!(.*)(!<[a-z]+>))(.*)(!(</[a-z]+>))
I've tried this and it does not work. I've also thought of doing another [NSstring componentsSeparatedByString:#"(with either < or or >"] but that would leave be with more parsing yet to do and I think there should be a way to get just the chars inbetween the tags with either regular expressions or string compare or some such way to parse out the
Any suggestions or help would be greatly appreciated.
Thanks
Two things.
Your regular expression does not escape the forward slash.
Your regular expression seems overly complicated for what you are trying to do.
If all you want is that lone middle character with regular expressions,
Try this:
<[a-z]+>(.*)<\/[a-z]+>
Here's a great tool to play around with:
http://rubular.com
Heck you could probably even get away with:
<[a-z]+>(.*)<\/
EDIT:
I figured out your problem partially, some of the tags part way down contain characters other than a through z. So here you go:
<.+>(.*)<\/.+>

Reading blocks of text from a CSV file - vb.net

I need to parse a CSV file with blocks of text being processed in different ways according to certain rules, e.g.
userone,columnone,columntwo
userthirteen,columnone,columntwo
usertwenty,columnone,columntwo
customerone,columnone<br>
customertwo,columntwo<br>
singlevalueone
singlevaluetwo
singlevalueone_otherruleapplies
singlevaluethree_otherruleapplies
Each block of text will be grouped so the first three rows will be parsed using certain rules and so on. Notice that the last two groups have only one single column but each group must be handled in a different way.
I have the chance to propose the customer the format of the file so I'm thinking to propose the following.
[group 1]
userone,columnone,columntwo
userthirteen,columnone,columntwo
usertwenty,columnone,columntwo
[group N]
rowN
A kind of sections like the INI files from some years ago. However I'd like to hear your comments because I think there must be a better way to handle this.
I proposed to use XML but the customer prefers the text files.
Any suggestions are welcome.
m0dest0.
Ps. using VB.net and VS 2008
You can use regular expression groups set to either an enum line mode if each line has the same format, or to an enum multi-line if the format is not constrained to a single line. For each line in multiline you can include \n in your pattern to cross multiple lines to find you pattern. If its on a single line you don't need to include \n also know as Carriage return line feed in your regex matching pattern.
vb.net as well as many other modern programming language has extensive support for grouping operations. You can use index groups, or named groups.
Each name such as header1 or whatever you want to name it would be in this format: <myname>
See this link for more info: How do I access named capturing groups in a .NET Regex?.
Good luck.