How to replace {“bcz”,u,thr} with {“because”,you,there} in whole text? (Text-mining) - text-mining

Can someone please explain, with example R/Python code, how to replace {“bcz”,u,thr} with {“because”,you,there} in a whole text? (Text-mining)

The remove_words function takes a string and returns a new string with the required replacements made. Make sure the regex package is installed so that the code runs; for patterns this simple, the standard library's re module would work as well. With regex you can compile one pattern for each string you want to replace:
import regex as re

def remove_words(my_line):
    # One compiled pattern per abbreviation to be replaced
    compiler_thr = re.compile(r"thr")
    compiler_u = re.compile(r"u")
    compiler_bcz = re.compile(r"bcz")
    new_words = []
    for i in my_line.split():
        # fullmatch only accepts whole words, so "u" matches but "but" does not
        if compiler_thr.fullmatch(i):
            new_words.append('there')
        elif compiler_u.fullmatch(i):
            new_words.append('you')
        elif compiler_bcz.fullmatch(i):
            new_words.append('because')
        else:
            new_words.append(i)
    # Join with single spaces (avoids a leading space in the result)
    return ' '.join(new_words)
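As an alternative to looping over whitespace-separated words, a single substitution with word boundaries may be simpler. The following is only a sketch using the standard library's re module; the names REPLACEMENTS, PATTERN and expand_abbreviations are made up for illustration and are not part of the answer above.

```python
import re

# Hypothetical mapping for illustration; extend it with your own abbreviations.
REPLACEMENTS = {"bcz": "because", "u": "you", "thr": "there"}

# \b restricts matches to whole words, so the "u" inside "but" is left alone.
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, REPLACEMENTS)) + r")\b")

def expand_abbreviations(text):
    """Replace each whole-word abbreviation with its expansion."""
    return PATTERN.sub(lambda m: REPLACEMENTS[m.group(1)], text)

print(expand_abbreviations("bcz u said so, but thr is more"))
# because you said so, but there is more
```

Unlike splitting on whitespace, this also handles abbreviations that are directly followed by punctuation, and it keeps the original spacing of the text.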

Related

numpy/pandas - why is the element selected from the list by random.choice always the same

There is a list that contains integer values.
list=[1,2,3,.....]
Then I use the np.random.choice function to select a random element and append it to an existing dataframe column; please refer to the code below:
df.message = df.message.astype(str) + "rowNumber=" + '"' + str(np.random.choice(list)) + '"'
But the element selected by np.random.choice and appended to the message column is always the same for every row.
What is the issue here?
The expected result is that the selected elements are not all the same.
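Without a size argument, np.random.choice returns a single value, which is then appended to every row. Pass the size parameter to np.random.choice so that one value is drawn per row, and convert the values to strings: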
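import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'message' : ['aa','bb','cc']})
L = [1,2,3,4,5]
# size=len(df) draws one random element per row instead of a single scalar
df.message = (df.message.astype(str) + "rowNumber=" + '"' +
              np.random.choice(L, size=len(df)).astype(str) + '"')
print (df)  # sample output; the values vary between runs
           message
0  aarowNumber="4"
1  bbrowNumber="2"
2  ccrowNumber="5"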

Adding prime symbol (') to ggplot2 axis label using expression()

I have the following code snippet that should (once complete) plot u_prime and v_prime on the log2(x + 1) scale using expression:
p <- ggplot(df_uv_prime, aes(x = u_prime, y = v_prime)) + geom_point(colour = "blue") +
  labs(x = expression(log[2](u^{...} + 1)), y = expression(log[2](v^{...} + 1))) +
  xlim(0, 1) +
  ylim(0, 1)
However, adding ' in place of ... doesn't work, as R expects a closing '.
Is there a base R solution without having to resort to any packages?
Based on your description of the desired outcome, one option is to enclose the single quote or backtick (not sure which one you're using) in double quotes and add 'connectors' (*) or 'spaces' (~) to either side, e.g. backtick with space on x axis, single quote with connector on y axis:
library(ggplot2)
ggplot(mtcars, aes(x = hp, y = disp)) +
  geom_point(colour = "blue") +
  labs(x = expression(log[2](u*"`"~ + 1)),
       y = expression(log[2](v*"'"* + 1)))
Created on 2022-11-28 with reprex v2.0.2

token expression in flat file connection

I have loaded a variable whose value has a format like this, for example:
where produkt = 'Muj zivot2 R' and uraz = 'Uraz'
and I need the output in the file name to be:
Muj zivot2 R_Uraz
TOKEN has worked for me before, but it doesn't work in this case:
" + TOKEN(" @[User::where] ","''",2) + "_" + TOKEN(" @[User::where] ","''",4) + "
You can use the following expression:
TOKEN(@[User::where],"'",2) + "_" + TOKEN(@[User::where],"'",4)
Output
Muj zivot2 R_Uraz
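To see why tokens 2 and 4 are the right ones, it can help to split the sample value on the single quote by hand. The Python sketch below only mimics the idea; the token helper is a hypothetical stand-in, not the SSIS function itself, and it does not model how SSIS treats empty tokens or multiple delimiter characters.

```python
def token(text, delimiter, occurrence):
    """Hypothetical stand-in for TOKEN: return the 1-based piece of text."""
    pieces = text.split(delimiter)
    # Return an empty string when there are fewer pieces than requested
    return pieces[occurrence - 1] if occurrence <= len(pieces) else ""

where = "where produkt = 'Muj zivot2 R' and uraz = 'Uraz'"

# Pieces: 1 = "where produkt = ", 2 = "Muj zivot2 R", 3 = " and uraz = ", 4 = "Uraz"
file_name = token(where, "'", 2) + "_" + token(where, "'", 4)
print(file_name)  # Muj zivot2 R_Uraz
```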

Nested Cell to string

I have the following problem:
Objective (high-level):
I would like to convert ESRI Shapefiles into SQL spatial data. For that purpose, I need to adapt the syntax.
Current status / problem:
I constructed the following cell array:
'MULTIPOLYGON(' {1x2332 cell} ',' {1x916 cell} ',' {1x391 cell} ',' {1x265 cell} ')'
with in total 9 fields. This cell array contains the following 'nested' cell arrays: {1x2332 cell}, {1x916 cell}, {1x391 cell}, {1x265 cell}. As an example, 'nested' cell {1x2332 cell} has the following form:
'((' [12.714606000000000] [42.155628000000000] ',' [12.702529999999999] [42.152873999999997] ',' ... ',' [12.714606000000000] [42.155628000000000] '))'
However, I would like to have the entire cell array (including all 'nested cells') as one string without any spaces (except the space between the numbers (coordinates)). Would you have an idea how I could get to a solution?
Thank you in advance.
You probably need loops for this.
Consider a smaller example:
innerCell1 = {'((' [12.714606000000000] [42.155628000000000] ',' [12.702529999999999] [42.152873999999997] ',' [12.714606000000000] [42.155628000000000] '))'};
outerCell = {'MULTIPOLYGON(' innerCell1 ',' innerCell1 ')'};
You can go along these lines:
outer = outerCell; %// will be overwritten
ind_outer = find(cellfun(@iscell, outer)); %// positions of inner cell arrays in `outer`
for m = ind_outer
    inner = outer{m};
    ind_inner = cellfun(@isnumeric, inner); %// positions of numbers in `inner`
    ind_inner_space = find(ind_inner(1:end-1) & ind_inner(2:end)); %// need space
    ind_inner_nospace = setdiff(find(ind_inner), ind_inner_space); %// don't need
    for n = ind_inner_space
        inner{n} = [num2str(inner{n}) ' ']; %// convert to string with space
    end
    for n = ind_inner_nospace
        inner{n} = num2str(inner{n}); %// convert to string, without space
    end
    outer{m} = [inner{:}]; %// concatenate all parts of `inner`
end
str = [outer{:}]; %// concatenate all parts of `outer`
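This results in the string below. Note that num2str with its default settings keeps only four decimal places here; pass a format such as '%.15g' as the second argument to num2str if the full coordinate precision is needed.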
str =
MULTIPOLYGON(((12.7146 42.1556,12.7025 42.1529,12.7146 42.1556)),((12.7146 42.1556,12.7025 42.1529,12.7146 42.1556)))
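The joining rule used above (a space only between two adjacent numbers, nothing between any other parts) can also be sketched outside MATLAB. The Python version below is an illustration only: nested lists stand in for the nested cell arrays, the coordinates are shortened sample values, and the names ring_to_text and to_wkt are made up.

```python
def ring_to_text(ring):
    """Join one inner list; adjacent numbers are separated by a space."""
    parts = []
    # Pair every element with its successor (None after the last one)
    for current, following in zip(ring, ring[1:] + [None]):
        if isinstance(current, float):
            parts.append(repr(current))
            if isinstance(following, float):
                parts.append(" ")
        else:
            parts.append(current)
    return "".join(parts)

def to_wkt(outer):
    """Concatenate the outer list, converting inner lists on the way."""
    return "".join(ring_to_text(p) if isinstance(p, list) else p for p in outer)

# Shortened sample coordinates, for illustration only
inner = ["((", 12.714606, 42.155628, ",", 12.70253, 42.152874, ",",
         12.714606, 42.155628, "))"]
outer = ["MULTIPOLYGON(", inner, ",", inner, ")"]
print(to_wkt(outer))
```

Using repr for the numbers keeps all digits of each coordinate instead of rounding them.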

Regex: match SQL PRINT blocks with quoted text in it

I have the following text that I am trying to match using regular expressions:
PRINT CONVERT(NVARCHAR,
CURRENT_TIMESTAMP, 111) + ' ' +
CONVERT(NVARCHAR, CURRENT_TIMESTAMP,
108)
+ ' -Test Mode : ' + (CASE WHEN @turbo_mode_ind = 1 THEN
'some text ''test'' some more text.'
ELSE 'and even more text ''temp'' when
will it stop ?' END)
PRINT 'text don''t text'
PRINT 'text ''test2'' text'
What I want to match is:
PRINT CONVERT(NVARCHAR,
CURRENT_TIMESTAMP, 111) + ' ' +
CONVERT(NVARCHAR, CURRENT_TIMESTAMP,
108)
+ ' -Test Mode : ' + (CASE WHEN @turbo_mode_ind = 1 THEN
'some text ''test''
PRINT 'text ''test2''
So basically I want to match:
starting at PRINT
each char that comes after PRINT (.*)
including line breaks (don't stop at line breaks)
with \'{2}\w+\'{2} at the end of the match
non-greedy (.*?)
AND no empty line(s) between PRINT and \'{2}\w+\'{2}
I have already composed this, but it still matches across empty line(s):
PRINT.*?\'{2}\w+\'{2}(?!\n\s*\n)
Edit after comment:
Looking at the requirements again, I could not quickly come up with a single-regex solution. In your comments you mention that you are using C#.
A possible solution would therefore be to first split the string at blank lines and then extract the text from each part.
Something like this:
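string pattern = @"^$";
// Split the input at blank lines, then search each part separately
foreach (string result in Regex.Split(input, pattern, RegexOptions.Multiline))
{
    // Singleline lets . match line breaks inside one part
    Regex rxFindSql = new Regex(@"PRINT.*?\'{2}\w+?\'{2}", RegexOptions.Singleline);
    MatchCollection matches = rxFindSql.Matches(result);
}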
This should do the trick but I did not test the code.
I hope this helps.
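The split-then-match idea can be tried out quickly in Python as well. This is only a sketch of the same approach, not a translation of the C# above: the function name find_print_blocks and the sample text are made up, and blank lines are assumed to be lines that are empty or contain only whitespace.

```python
import re

# DOTALL lets . match line breaks, so a PRINT block may span several lines
PRINT_PATTERN = re.compile(r"PRINT.*?'{2}\w+'{2}", re.DOTALL)

def find_print_blocks(sql_text):
    """Return PRINT...''word'' matches that do not cross a blank line."""
    matches = []
    # Splitting at blank lines first means no match can span one
    for chunk in re.split(r"\n\s*\n", sql_text):
        matches.extend(PRINT_PATTERN.findall(chunk))
    return matches

sample = (
    "PRINT 'text don''t text'\n"
    "\n"
    "PRINT 'text ''test2'' text'\n"
)
print(find_print_blocks(sample))
```

On this sample only the second PRINT statement is returned. Applied to the whole text without the split, the same pattern would start at the first PRINT and run across the blank line.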