Removing all white-spaces except for tabs and linebreaks

Is there a way to remove all whitespace characters except for tabs and linebreaks?
If I were to use .replaceAll("\s+", "") or .replaceAll(" ", "") I would also delete every tab or linebreak.

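For Python, you can use this function:
def replaceWhiteSpace(text):
    # text is a pandas Series of strings (one DataFrame column).
    # Split each value on runs of whitespace, then rejoin with single spaces.
    return text.str.split().str.join(' ')
testDataset = replaceWhiteSpace(text)
Pass a column of your DataFrame as text, and the result will be stored in testDataset. Note that this collapses every run of whitespace, including tabs and line breaks, into a single space.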
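The question asks for tabs and line breaks to survive, which splitting on all whitespace does not do. A minimal sketch of one alternative, assuming plain Python strings and the standard re module: a negated character class can match "whitespace that is not a tab, carriage return, or line feed". The function name below is hypothetical.

```python
import re

# [^\S\t\r\n] reads as: not (non-whitespace), not tab, not CR, not LF,
# i.e. any whitespace character other than tab and line-break characters.
_SPACES_ONLY = re.compile(r"[^\S\t\r\n]+")


def remove_spaces_keep_tabs_and_newlines(text: str) -> str:
    """Delete whitespace but leave tabs and line breaks in place."""
    return _SPACES_ONLY.sub("", text)


sample = "a b\tc\nd  e"
cleaned = remove_spaces_keep_tabs_and_newlines(sample)
print(repr(cleaned))
```

The same character class should also work with Java's replaceAll, provided the backslashes are escaped inside the string literal.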

Related

Substring Method in VB.Net

I have Textboxes Lines:
{ LstScan = 1,100, DrwR2 = 0000000043 }
{ LstScan = 2,200, DrwR2 = 0000000041 }
{ LstScan = 3,300, DrwR2 = 0000000037 }
I should display:
1,100
2,200
3,300
This is the code that I can't get to work:
Dim data As String = TextBox1.Lines(0)
' Part 1: get the index of a separator with IndexOf.
Dim separator As String = "{ LstScan ="
Dim separatorIndex = data.IndexOf(separator)
' Part 2: see if separator exists.
' ... Get the following part with Substring.
If separatorIndex >= 0 Then
Dim value As String = data.Substring(separatorIndex + separator.Length)
TextBox2.AppendText(vbCrLf & value)
End If
It displays the following instead:
1,100, DrwR2 = 0000000043 }
This should work:
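Function ParseLine(input As String) As String
    Const startKey As String = "LstScan = "
    Const stopKey As String = ", "
    ' Start right after the key, so the key itself is not part of the result.
    Dim startIndex As Integer = input.IndexOf(startKey) + startKey.Length
    ' Search for the stop key only from the start of the value onwards.
    Dim length As Integer = input.IndexOf(stopKey, startIndex) - startIndex
    Return input.Substring(startIndex, length)
End Function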
TextBox2.Text = String.Join(vbCrLf, TextBox1.Lines.Select(AddressOf ParseLine))
If I wanted, I could turn that entire thing into a single (messy) line... but this is more readable. If I'm not confident every line in the textbox will match that format, I can also insert a Where() just before the Select().
Your problem is you're using the version of substring that takes from the start index to the end of the string:
"hello world".Substring(3) 'take from 4th character to end of string
lo world
Use the version of substring that takes another number for the length to cut:
"hello world".Substring(3, 5) 'take 5 chars starting from 4th char
lo wo
If the part you need to extract varies in length, you'll have to run another search (for example, search for the first occurrence of "," after the start index, then subtract the start index from the newly found index).
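The two-search idea (find the start marker, then find the stop marker from that position onwards) can be sketched in Python as follows. This is only an illustration of the technique, not a translation of any particular answer: parse_line, start_key and stop_key are hypothetical names, and the sample lines are the ones from the question.

```python
def parse_line(line: str, start_key: str = "LstScan = ", stop_key: str = ", ") -> str:
    """Return the text between start_key and the next stop_key."""
    start = line.find(start_key)
    if start == -1:
        return ""  # marker not present on this line
    start += len(start_key)  # skip past the marker itself
    stop = line.find(stop_key, start)  # second search begins at the value
    if stop == -1:
        return line[start:]  # no stop marker: take the rest of the line
    return line[start:stop]


lines = [
    "{ LstScan = 1,100, DrwR2 = 0000000043 }",
    "{ LstScan = 2,200, DrwR2 = 0000000041 }",
    "{ LstScan = 3,300, DrwR2 = 0000000037 }",
]
values = [parse_line(line) for line in lines]
print("\n".join(values))
```

Using ", " (comma plus space) as the stop marker matters here, because the value itself contains a comma.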
Actually, I'd probably use Split for this, because it's clean and easy to read:
Dim data As String = TextBox1.Lines(0)
Dim arr = data.Split()
Dim thing = arr(3)
thing now contains "1,100," (with a trailing comma), and you can use TrimEnd(","c) to remove the final comma:
thing = thing.TrimEnd(","c)
You can reduce it to a one-liner:
TextBox1.Lines(0).Split()(3).TrimEnd(","c)

Inserting a word within a given string

I want the user to input a phrase such as "Python" and have the program put the word "test" in the middle, so it would print "Pyttesthon".
After I input the phrase however, I am not sure which function to use.
You can just concatenate strings like so:
stringToInsert = "test"
oldString = "Python"
middle = len(oldString) // 2  # integer division; a float is not a valid slice index in Python 3
newString = oldString[:middle] + stringToInsert + oldString[middle:]
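Wrapped up as a reusable function, the slicing approach might look like the sketch below. The function name is hypothetical, and for odd-length input the split point is assumed to round down, so the extra character ends up after the inserted word.

```python
def insert_in_middle(text: str, word: str) -> str:
    """Insert word at the midpoint of text (rounding down for odd lengths)."""
    middle = len(text) // 2
    return text[:middle] + word + text[middle:]


print(insert_in_middle("Python", "test"))
```

To use it with user input, pass the result of input() as the first argument.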

Replace before save to CSV

I'm using Scrapy's export to CSV, but sometimes the content I'm scraping contains quotes and commas, which I don't want.
How can I replace those characters with nothing ('') before outputting to CSV?
Here's my CSV containing the unwanted characters in the strTitle column:
strTitle,strLink,strPrice,strPicture
"TOYWATCH 'Metallic Stones' Bracelet Watch, 35mm",http://shop.nordstrom.com/s/toywatch-metallic-stones-bracelet-watch-35mm/3662824?origin=category,0,http://g.nordstromimage.com/imagegallery/store/product/Medium/11/_8412991.jpg
Here's my code, which errors on the replace line:
def parse(self, response):
    hxs = Selector(response)
    titles = hxs.xpath("//div[@class='fashion-item']")
    items = []
    for titles in titles[:1]:
        item = watch2Item()
        item["strTitle"] = titles.xpath(".//a[@class='title']/text()").extract()
        item["strTitle"] = item["strTitle"].replace("'", '').replace(",",'')
        item["strLink"] = urlparse.urljoin(response.url, titles.xpath("div[2]/a[1]/@href").extract()[0])
        item["strPrice"] = "0"
        item["strPicture"] = titles.xpath(".//img/@data-original").extract()
        items.append(item)
    return items
EDIT
Try adding this line before the replace.
item["strTitle"] = ''.join(item["strTitle"])
strTitle = "TOYWATCH 'Metallic Stones' Bracelet Watch, 35mm"
strTitle = strTitle.replace("'", '').replace(",",'')
strTitle == "TOYWATCH Metallic Stones Bracelet Watch 35mm"
In the end the solution was:
item["strTitle"] = [titles.xpath(".//a[@class='title']/text()").extract()[0].replace("'", '').replace(",",'')]
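The replace line fails because extract() returns a list of strings, and a list has no replace method. A minimal sketch of a helper that handles this, independent of Scrapy: clean_title is a hypothetical name, and the set of characters to strip (single quote, double quote, comma) is an assumption based on the question.

```python
# Translation table that deletes single quotes, double quotes and commas.
_UNWANTED = str.maketrans("", "", "'\",")


def clean_title(extracted) -> str:
    """Join the list returned by extract() and strip unwanted characters."""
    text = "".join(extracted)  # extract() yields a list, so join it first
    return text.translate(_UNWANTED)


raw = ["TOYWATCH 'Metallic Stones' Bracelet Watch, 35mm"]
print(clean_title(raw))
```

str.translate removes all the listed characters in one pass, which avoids chaining one replace call per character.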

How to split a string by using more than one delimiter

Below is a script I use in my SSIS package.
If (Row.AnswerType.Trim().ToUpper = "MULTIPLE SELECT" And _
Row.SurveyQuestionID = Row.SurveyDefinitionDetailQuestionNumber) Then
Dim Question1 As String = Row.SurveyDefinitionDetailAnswerChoices.ToUpper.Trim()
Dim ans1 As String = Row.SurveyAnswer.ToUpper.Trim()
For Each x As String In ans1.Split(New [Char]() {CChar(vbTab)})
If Question1.Contains(x) Then
Row.IsSkipped = False
Else
Row.IsSkipped = True
'Row.IsAllowed = True
Row.ErrorDesc = "Invalid Value in Answer Column For Multiple Select!"
End If
Next
End If
This script only works when a tab is the delimiter, but I need other, non-tab characters to act as delimiters as well.
Add all the needed characters to the character array:
ans1.Split(New [Char]() { CChar(vbTab), CChar(" "), CChar(";") })
Or
ans1.Split(New [Char]() { CChar(vbTab), " "C, ";"C })
by using the character literal suffix C.
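For comparison, the same idea of splitting on any one of several delimiter characters can be sketched in Python with a regular-expression character class. The delimiter set (tab, space, semicolon) mirrors the example above, and split_answers is a hypothetical name.

```python
import re


def split_answers(answer: str) -> list:
    """Split on tabs, spaces or semicolons, dropping empty pieces."""
    pieces = re.split(r"[\t ;]+", answer)
    # Leading or trailing delimiters produce empty strings; filter them out.
    return [piece for piece in pieces if piece]


print(split_answers("RED\tGREEN;BLUE YELLOW"))
```

The trailing + treats a run of consecutive delimiters as a single separator, so "A; B" yields two pieces rather than three.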

How to find which delimiter was used during string split (VB.NET)

Let's say I have a string that I want to split on several characters, like ".", "!", and "?". How do I figure out which of those characters split my string, so I can add that same character back onto the end of the corresponding split segment?
Dim linePunctuation as Integer = 0
Dim myString As String = "some text. with punctuation! in it?"
For i = 1 To Len(myString)
If Mid$(entireFile, i, 1) = "." Then linePunctuation += 1
Next
For i = 1 To Len(myString)
If Mid$(entireFile, i, 1) = "!" Then linePunctuation += 1
Next
For i = 1 To Len(myString)
If Mid$(entireFile, i, 1) = "?" Then linePunctuation += 1
Next
Dim delimiters(3) As Char
delimiters(0) = "."
delimiters(1) = "!"
delimiters(2) = "?"
currentLineSplit = myString.Split(delimiters)
Dim sentenceArray(linePunctuation) As String
Dim count As Integer = 0
While linePunctuation > 0
sentenceArray(count) = currentLineSplit(count)'Here I want to add what ever delimiter was used to make the split back onto the string before it is stored in the array.'
count += 1
linePunctuation -= 1
End While
If you add a capturing group to your regex like this:
SplitArray = Regex.Split(myString, "([.?!])")
Then the returned array contains both the text between the punctuation, and separate elements for each punctuation character. The Split() function in .NET includes text matched by capturing groups in the returned array. If your regex has several capturing groups, all their matches are included in the array.
This splits your sample into:
some text
.
with punctuation
!
in it
?
You can then iterate over the array to get your "sentences" and your punctuation.
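Python's re.split treats capturing groups the same way, so the "iterate and re-attach" step can be sketched as below. The function name is hypothetical, and stripping the leading space from each segment is an assumption about the desired output.

```python
import re


def split_keep_delimiters(text: str) -> list:
    """Split on . ! ? and glue each delimiter back onto its segment."""
    # With one capturing group the result alternates:
    # text, delimiter, text, delimiter, ..., trailing text
    parts = re.split(r"([.!?])", text)
    sentences = []
    for i in range(0, len(parts) - 1, 2):
        sentences.append(parts[i].strip() + parts[i + 1])
    tail = parts[-1].strip()  # text after the last delimiter, if any
    if tail:
        sentences.append(tail)
    return sentences


for sentence in split_keep_delimiters("some text. with punctuation! in it?"):
    print(sentence)
```

Note that the raw split result keeps the spaces after each punctuation mark and ends with an empty string when the input ends in a delimiter, which is why the sketch strips and filters.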
.Split() does not provide this information.
You will need to use a regular expression to accomplish what you are after, which I infer as the desire to split an English-ish paragraph into sentences by splitting on punctuation.
The simplest implementation would look like this.
var input = "some text. with punctuation! in it?";
string[] sentences = Regex.Split(input, @"\b(?<sentence>.*?[\.!?](?:\s|$))");
foreach (string sentence in sentences)
{
Console.WriteLine(sentence);
}
Results
some text.
with punctuation!
in it?
But you are going to find very quickly that language, as spoken/written by humans, does not follow simple rules most of the time.
Here it is in VB.NET for you:
Dim sentences As String() = Regex.Split(line, "\b(?<sentence>.*?[\.!?](?:\s|$))")
Once you've called Split with all 3 characters, you've tossed that information away. You could do what you're trying to do by splitting yourself or by splitting on one punctuation mark at a time.