Remove the last punctuation in list of numbers in Python - pandas

I have a variable of numbers and letters and want code to remove the apostrophes between each number/letter, keeping only the first and last apostrophe for the variable. The desired output is shown below:
numbers = 'V7780T103', '494368103', '003654100', '26210C104'
output should be
numbers = 'V7780T103, 494368103, 003654100, 26210C104'
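One possible approach, assuming `numbers` is the tuple of strings that the assignment in the question creates in Python: `str.join` concatenates the items with a chosen separator, which leaves a single string with only the outer quotes. The name `joined` is just an illustrative choice.

```python
# The assignment from the question creates a tuple of four strings.
numbers = 'V7780T103', '494368103', '003654100', '26210C104'

# Join the items into one string, separated by a comma and a space.
joined = ', '.join(numbers)

print(joined)  # V7780T103, 494368103, 003654100, 26210C104
```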

Related

How to add a character to the last third place of a string?

I have a column with numbers of various lengths such as 50055, 1055, 155, etc. How can I add a decimal point before the last two digits of each, so that they become 500.55, 10.55, and 1.55?
I tried using replace to find the last two digits and replace them with '.' followed by those two digits. That doesn't always work, because the same two-digit sequence can occur more than once in the same string.
replace(round(v_num/2),substr(round(v_num/2),-2),'.'||substr(round(v_num/2),-2))
You would divide by 100:
select v_num / 100
You can convert this into a string, if you want.
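The same idea can be sketched in Python (not the SQL dialect of the question): dividing by 100 moves the decimal point two places, so no string search is needed. This is only an illustration; the sample values come from the question, and `Decimal` is used here as an assumption to avoid binary floating-point rounding.

```python
from decimal import Decimal

# Sample values taken from the question.
values = [50055, 1055, 155]

# Dividing by 100 shifts the decimal point two places to the left.
shifted = [Decimal(v) / 100 for v in values]

# Convert to strings if a textual result is wanted.
as_text = [str(d) for d in shifted]

print(as_text)  # ['500.55', '10.55', '1.55']
```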

Select rows which contain numeric substrings in Pandas

I need to delete rows from a dataframe in which a particular column contains strings with numeric substrings. See the shaded column of my dataframe.
Rows with values that have 0E as a prefix, or 21 (any two-digit number) as a suffix, or 24A (any two-digit number followed by a letter) as a suffix should be deleted.
Any suggestions?
Thanks in advance.
You can use boolean indexing with a str.contains() regex:
^0E - starts with 0E
\d{2}$ - ends with 2 digits
\d{2}[A-Z]$ - ends with 2 digits and 1 capital letter
col = ... # target column
mask = df[col].str.contains(r'^0E|\d{2}$|\d{2}[A-Z]$')
df = df.loc[~mask]
@tdy gave a good answer, but one place needs to be modified, if I understand the question correctly.
For values that end with two digits, or with two digits and a capital letter, the regex should be:
.*\d{2}[A-Z]?$
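To check how the patterns behave without a dataframe, here is a small sketch using Python's `re` module. `re.search` is used because `Series.str.contains` also searches anywhere in the string, which is why a leading `.*` is not needed. The sample values are made up for illustration, and `merged` is a hypothetical name for the pattern with the two suffix alternatives folded into `\d{2}[A-Z]?$`.

```python
import re

# Hypothetical sample values, not taken from the question's dataframe.
values = ['0E123X', 'ABC21', 'XYZ24A', 'HELLO', 'A1B']

# Pattern from the first answer: three explicit alternatives.
three_way = re.compile(r'^0E|\d{2}$|\d{2}[A-Z]$')
# Same rules with the optional capital letter folded into one alternative.
merged = re.compile(r'^0E|\d{2}[A-Z]?$')

# Keep only the values that match none of the alternatives.
kept_three_way = [v for v in values if not three_way.search(v)]
kept_merged = [v for v in values if not merged.search(v)]

print(kept_three_way)  # ['HELLO', 'A1B']
print(kept_merged)     # ['HELLO', 'A1B']
```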

Perl6 split function adding extra elements to array

my @r = split("", "hi");
say @r.elems;
--> output: 4
split is adding two extra elements to the array, one at the beginning and another at the end.
I have to do shift and pop after every split to correct for this.
Is there a better way to split a string?
If you're splitting on the empty string, you will get an empty element at the start and the end of the returned list as there is also an empty string before and after the string.
What you want is .comb without parameters, written out completely functionally:
"hi".comb.elems.say; # 2
See https://docs.raku.org/routine/comb#(Str)_routine_comb for more info.
The reason for this is that when you use an empty Str “” for the delimiter, it is the same as if you had used the regex /<|wb>/, which matches next to characters. So it also matches before the first character and after the last character. Perl 5 removes these “extra” strings for you in this case (and in this case only), which is likely where the confusion lies.
What Perl 6 does instead is allow you to explicitly :skip-empty values
'hi'.split('') :skip-empty
'hi'.split('', :skip-empty)
split("", "hi") :skip-empty
split("", "hi", :skip-empty)
Or to specify what you actually want
'hi'.comb( /./ )
'hi'.comb( 1 )
'hi'.comb
comb( /./, 'hi' )
comb( 1, 'hi' )
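As a cross-language aside, not Raku: Python's `re.split` (3.7 or later) shows the same effect when splitting on an empty pattern, because the empty pattern also matches before the first and after the last character. This is only an analogy to illustrate the explanation above; filtering the empty strings plays the role of `:skip-empty`, and `list()` plays the role of `.comb`.

```python
import re

# An empty pattern matches at every position, including both ends.
parts = re.split('', 'hi')
print(parts)  # ['', 'h', 'i', '']

# Dropping the empty strings, similar in spirit to :skip-empty.
non_empty = [p for p in parts if p]
print(non_empty)  # ['h', 'i']

# Asking directly for the characters, similar in spirit to .comb.
chars = list('hi')
print(chars)  # ['h', 'i']
```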

Extra blank space between words

Please help me with 2 questions on how to do the GREL expression for:
If there are double spaces between 2 words in a column, how can I eliminate 1 space? Example: Robert--Smith to Robert-Smith (the minus character stands for a blank, for illustration).
How can I look for an exact word in a text filter?
Thanks!
1°) try transform ---> value.replace("  "," ")
Or simply: Common transforms ---> Collapse consecutive white space
2°) Column ---> Text filter, and enter your word
Or do Column ---> Facet ---> Custom text facet and type: value.contains(" your_word ")
or value.contains(/(yourexactword)/)
This will return a True or False facet
H.
@hpiedcoq's is the right answer if you need to do this in GREL. If not, you can just use the point-and-click interface:
For the first question: select your column and choose Edit cells > Common transforms > Collapse consecutive white space.
For the second question: select your column > Text filter > enter the word you are looking for. You can select "case sensitive" if you want the search to distinguish upper and lower case.
1.1 transform --> value.replace("  "," ")
Replaces every double whitespace with a single one.
1.2 transform --> value.trim()
Deletes whitespace before and after the string (it does not change whitespace inside the string).
1.3 transform --> value.replace(/\b  \b/," ")
Replaces using a regular expression; collapses only a double whitespace between two words.
Text filter > turn on regular expression and use \b.
Text filter with regular expression: \bWord\b = exact word; the word may or may not have whitespace before and after it.
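For readers who want to try the two regular-expression ideas outside OpenRefine, here is a rough Python sketch (Python's `re` module, not GREL). The sample strings are invented for illustration; the pattern ` {2,}` is an assumption that generalises "double space" to "two or more spaces".

```python
import re

# Invented sample with a double space between the two words.
name = "Robert  Smith"

# Collapse any run of two or more spaces into a single space.
collapsed = re.sub(r" {2,}", " ", name)
print(collapsed)  # Robert Smith

# \b marks a word boundary, so only the exact word matches.
exact = re.compile(r"\bSmith\b")
found_exact = exact.search("Robert Smith") is not None
found_inside = exact.search("Robert Smithson") is not None
print(found_exact, found_inside)  # True False
```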

AWK: Ignore lines grouped by an unique value conditioned on occurrences of a specific field value

Please help revise the title and the post if needed, thanks.
In short, I would like to firstly group lines with a unique value in the first field and accumulate the occurrences of a specific value in the other field in the underlying group of lines. If the sum of occurrences doesn't meet the self-defined threshold, the lines in the group should be ignored.
Specifically, with input
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
222,1,M,1
222,1,M,0
333,1,P,0
333,1,P,1
444,1,M,1
444,1,M,1
444,0,M,0
555,1,P,1
666,1,P,0
the desired output should be
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
333,1,P,0
333,1,P,1
555,1,P,1
666,1,P,0
meaning that "because the unique values in the first field 222 and 444 don't have at least one (which can be any desired threshold) P in the third field, lines corresponding to 222 and 444 are ignored."
Furthermore, this should be done without editing the original file, and it has to be combined with the solved issue Split CSV to Multiple Files Containing a Set Number of Unique Field Values. As a result, a few lines will not be included in the resulting split files.
I believe this one-liner does what you want:
$ awk -F, '{a[$1,++c[$1]]=$0}$3=="P"{p[$1]}END{for(i in c)if(i in p)for(j=1;j<=c[i];++j)print a[i,j]}' file
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
333,1,P,0
333,1,P,1
555,1,P,1
666,1,P,0
Array a keeps track of all the lines in the file, grouping them by the first field and a count c, which we use later. If the third field contains a P, set a key in the p array.
After processing the entire file, loop through all the values of the first field. If a key has been set in p for the value, then print the lines from a.
You mention a threshold number of entries in your question. If by that, you mean that there must be N occurrences of "P" in order for the lines to be printed, you could change {p[$1]} to {++p[$1]}, then change if(i in p) to if(p[i]>=N) in the END block.
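The same grouping-and-threshold logic can be sketched in Python for comparison. This is not the awk answer itself but a rough equivalent: the function name `filter_groups` and its parameters `value` and `threshold` are hypothetical, and the input lines are the ones from the question. One difference worth noting: Python dicts keep insertion order, so groups come out in first-seen order, whereas awk does not guarantee an order for `for(i in c)`.

```python
from collections import defaultdict

def filter_groups(lines, value="P", threshold=1):
    """Keep groups (keyed by field 1) with at least `threshold`
    lines whose third field equals `value`."""
    groups = defaultdict(list)  # first field -> all lines of that group
    hits = defaultdict(int)     # first field -> count of matching lines
    for line in lines:
        fields = line.split(",")
        groups[fields[0]].append(line)
        if fields[2] == value:
            hits[fields[0]] += 1
    # Emit whole groups, in first-seen order, that reach the threshold.
    return [line
            for key, group in groups.items()
            if hits.get(key, 0) >= threshold
            for line in group]

# Input lines copied from the question.
data = """111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
222,1,M,1
222,1,M,0
333,1,P,0
333,1,P,1
444,1,M,1
444,1,M,1
444,0,M,0
555,1,P,1
666,1,P,0""".splitlines()

kept = filter_groups(data)
print("\n".join(kept))
```

With the default `threshold=1` this drops the 222 and 444 groups; passing `threshold=2` would additionally drop 555 and 666, which each have only one P line.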