make new column based on presence of a word in another - pandas

I have
pd.DataFrame({'text':['fewfwePDFerglergl','htrZIPg','gemlHTML']})
text
0 wePDFerglergl
1 htrZIPg
2 gemlHTML
The column is 10k rows long. Each entry contains one of ['PDF','ZIP','HTML']. Each entry in text is at most 14 characters long.
How do I get:
pd.DataFrame({'text':['wePDFerglergl','htrZIPg','gemlHTML'],'file_type':['pdf','zip','html']})
text file_type
0 wePDFerglergl pdf
1 htrZIPg zip
2 gemlHTML html
I tried df.text[0].find('ZIP') for a single entry, but I do not know how to stitch it all together to test each row in the column and return the correct value.
Any suggestions?

We can use str.extract here with the inline regex flag (?i), which makes the match case-insensitive:
words = ['pdf','zip','html']
df['file_type'] = df['text'].str.extract(f'(?i)({"|".join(words)})')
Or we use the flags=re.IGNORECASE argument:
import re
df['file_type'] = df['text'].str.extract(f'({"|".join(words)})', flags=re.IGNORECASE)
Output
text file_type
0 fewfwePDFerglergl PDF
1 htrZIPg ZIP
2 gemlHTML HTML
If you want file_type as lower case, chain str.lower():
df['file_type'] = df['text'].str.extract(f'(?i)({"|".join(words)})')[0].str.lower()
text file_type
0 fewfwePDFerglergl pdf
1 htrZIPg zip
2 gemlHTML html
Details:
The pipe (|) is the or operator in regular expressions. So with:
"|".join(words)
'pdf|zip|html'
We get the following in pseudocode:
extract "pdf" or "zip" or "html" from our string
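As a minimal standard-library sketch of what the joined pattern looks like and how it matches, here is the same alternation applied to a plain string instead of a DataFrame column. The `re.escape` call is an addition for illustration, not part of the answer above; it only matters if a keyword ever contains regex metacharacters.

```python
import re

words = ['pdf', 'zip', 'html']

# Escape each keyword, join with '|' (regex "or"), and wrap in one capture group.
pattern = '(' + '|'.join(re.escape(w) for w in words) + ')'
print(pattern)  # (pdf|zip|html)

# re.IGNORECASE plays the same role as the inline (?i) flag.
m = re.search(pattern, 'fewfwePDFerglergl', flags=re.IGNORECASE)
print(m.group(1))  # PDF
```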

You could use regex for this:
import re
regex = re.compile(r'(PDF|ZIP|HTML)')
This matches any of the desired substrings. To extract the matches in row order and convert them to lower case, here's a one-liner:
file_type = [re.search(regex, x).group().lower() for x in df['text']]
This returns the following list:
['pdf', 'zip', 'html']
Then to add the column:
df['file_type'] = file_type
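Note that `re.search` returns `None` when nothing matches, so the one-liner raises an `AttributeError` on a row without any of the three substrings. A hedged sketch of a more defensive variant follows; the helper name `file_type_of` and the extra `'README'` entry are hypothetical, and a plain list stands in for `df['text']`.

```python
import re

regex = re.compile(r'(PDF|ZIP|HTML)')


def file_type_of(text):
    """Return the matched type in lower case, or None if there is no match."""
    match = regex.search(text)
    return match.group().lower() if match else None


# Stand-in for df['text']; the last entry deliberately has no match.
texts = ['fewfwePDFerglergl', 'htrZIPg', 'gemlHTML', 'README']
file_type = [file_type_of(x) for x in texts]
print(file_type)  # ['pdf', 'zip', 'html', None]
```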

Related

Converting zip+4 to zip python

I am looking to convert zip+4 codes into zip codes in a pandas dataframe. I want it to identify that a zip 4 code exists and keep just the first 5 digits. I effectively want to do the below code (although this doesn't work in this format):
df.replace('^(\d{5}-?\d{4})', group(1), regex=True)
The following code does the same procedure for a list, I'm looking to do the same thing in the dataframe.
import re

my_input = ['01234-5678', '012345678', '01234', 'A1A 1A1', 'A1A1A1']
expression = re.compile(r'^(\d{5})-?(\d{4})?$')
my_output = []
for string in my_input:
    # Reuse the match object bound by := instead of matching a second time
    if m := re.match(expression, string):
        my_output.append(m.group(1))
    else:
        my_output.append(string)
You can use
df = df.replace(r'^(\d{5})-?\d{4}$', r'\1', regex=True)
Details:
^ - start of string
(\d{5}) - Group 1 (\1): five digits
-? - an optional -
\d{4} - any four digits
$ - end of string.
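To see the pattern and the `\1` backreference at work outside pandas, here is a small sketch that applies the same substitution to the list from the question with `re.sub`. The variable name `zip4` is only for illustration.

```python
import re

# Same pattern as in the answer: five digits, optional hyphen, four digits.
zip4 = re.compile(r'^(\d{5})-?\d{4}$')

my_input = ['01234-5678', '012345678', '01234', 'A1A 1A1', 'A1A1A1']

# Strings that do not match the pattern are returned unchanged by sub().
my_output = [zip4.sub(r'\1', s) for s in my_input]
print(my_output)  # ['01234', '01234', '01234', 'A1A 1A1', 'A1A1A1']
```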

Data cleaning: regex to replace numbers

I have this dataframe:
p=pd.DataFrame({'text':[2,'string']})
and trying to replace digit 2 by an 'a' using this code:
p['text']=p['text'].str.replace('\d+', 'a')
But instead of the letter a, I get NaN.
What am I doing wrong here?
In your dataframe, the first value of the text column is actually a number, not a string, and the .str methods return NaN for values that are not strings. Just convert the column to strings first:
p['text'] = p['text'].astype(str).str.replace(r'\d+', 'a', regex=True)
Output:
>>> p
text
0 a
1 string
(Note that since pandas 2.0 the default value of regex in .str.replace is False rather than True, so a regular expression only works if you pass regex=True explicitly, e.g. .str.replace(r'\d+', 'a', regex=True))
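The same type issue can be sketched with the standard library alone. Plain `re.sub` raises a `TypeError` on an integer, whereas pandas quietly yields NaN, but the remedy is the same: convert to `str` before substituting. The list `values` below is just a stand-in for the column.

```python
import re

values = [2, 'string']  # mixed types, like the 'text' column

error = None
try:
    re.sub(r'\d+', 'a', values[0])  # an int is not a valid subject string
except TypeError as exc:
    error = exc

# Converting every value to str first makes the substitution well defined.
cleaned = [re.sub(r'\d+', 'a', str(v)) for v in values]
print(cleaned)  # ['a', 'string']
```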

Find Each Occurrence of X and Insert a Carriage Return

A colleague has some data he is putting into a flat file (.txt) and needs to insert a carriage return before EACH occurrence of 'POL01', 'SUB01','VEH01','MCO01'.
I did use:
For Each line1 As String In System.IO.File.ReadAllLines(BodyFileLoc)
    If line1.Contains("POL01") Or line1.Contains("SUB01") Or line1.Contains("VEH01") Or line1.Contains("MCO01") Then
        Writer.WriteLine(Environment.NewLine & line1)
    Else
        Writer.WriteLine(line1)
    End If
Next
But unfortunately it turns out that the file is not formatted in 'lines' by SSIS but as one whole string.
How can I insert a carriage return before every occurrence of the above?
Test Text
POL01CALT302276F 332 NBPM 00101 20151113201511130001201611132359 2015111300010020151113000100SUB01CALT302276F 332 NBPMP01 Akl Abi-Khalil 19670131 M U33 Stoford Close SW19 6TJ 2015111300010020151113000100VEH01CALT302276F 332 NBPM001LV56 LEJ N 2006VAUXHALL CA 2015111300010020151113000100MCO01CALT302276F 332 NBPM0101 0 2015111300010020151113000100POL01CALT742569N
You can use regular expressions for this, specifically by using Regex.Replace to find and replace each occurrence of the strings you're looking for with a newline followed by the matching text:
Dim str as String = "xxxPOL01xxxSUB01xxxVEH01xxxMCO01xxx"
Dim output as String = Regex.Replace(str, "((?:POL|SUB|VEH|MCO)01)", Environment.NewLine + "$1")
'output contains:
'xxx
'POL01xxx
'SUB01xxx
'VEH01xxx
'MCO01xxx
There may be a better way to construct this regular expression, but this is a simple alternation on the different letters, followed by 01. This matched text is represented by the $1 in the replacement string.
If you're new to regular expressions, there are a number of tools that help you understand them. For example, regex101.com will show you an explanation of the one I have used here.
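For comparison, the same alternation and backreference can be sketched in Python with `re.sub`. This is only an illustration of the pattern, not a replacement for the VB code; it inserts `\n` where the VB version uses `Environment.NewLine`, and Python writes the group reference as `\1` instead of `$1`.

```python
import re

text = "xxxPOL01xxxSUB01xxxVEH01xxxMCO01xxx"

# (?:...) groups the alternatives without capturing; the outer parentheses
# capture the full marker so it can be re-emitted after the newline.
output = re.sub(r'((?:POL|SUB|VEH|MCO)01)', r'\n\1', text)
print(output)
```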

Substring function does not work in Vb?

I am trying to mask an SSN and show it as a label caption.
lblSPTINTo.Caption = rsMM("SPTIN")
lblCPTINTo.Caption = rsMM("CPTIN")
I am trying to use the substring function to get the last 4 characters, but I am not able to use it because it throws a compile error.
lblSPTINTo.Caption = rsMM("SPTIN").sutbstring(4,4)
First, sutbstring is a typo; replace it with Substring.
Even then, Substring(4, 4) will not give you the last four characters, because the first parameter is the start index and the second parameter is the length. If you want the last 4 characters:
Dim last4 As String = rsMM("SPTIN")
If last4.Length > 4 Then last4 = last4.Substring(last4.Length - 4)
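The start-index arithmetic above (length minus 4) can be sketched in Python, together with one possible way of masking the leading characters. The function name `mask_ssn`, the `*` mask character, and the sample value are hypothetical and not taken from the question.

```python
def mask_ssn(value, visible=4, mask_char='*'):
    """Keep the last `visible` characters and mask everything before them."""
    text = str(value)
    if len(text) <= visible:
        return text  # nothing to mask, same guard as the Length check above
    start = len(text) - visible  # same index as last4.Length - 4
    return mask_char * start + text[start:]


print(mask_ssn('123456789'))  # *****6789
```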

gnuplot store one number from data file into variable

OSX v10.6.8 and Gnuplot v4.4
I have a data file with 8 columns. I would like to take the first value from the 6th column and make it the title. Here's what I have so far:
#m1 m2 q taua taue K avgPeriodRatio time
#1 2 3 4 5 6 7 8
K = #read in data here
graph(n) = sprintf("K=%.2e",n)
set term aqua enhanced font "Times-Roman,18"
plot file using 1:3 title graph(K)
And here is what the first few rows of my data file look like:
1.00e-07 1.00e-07 1.00e+00 1.00e+05 1.00e+04 1.00e+01 1.310 12070.00
1.11e-06 1.00e-07 9.02e-02 1.00e+05 1.00e+04 1.00e+01 1.310 12070.00
2.12e-06 1.00e-07 4.72e-02 1.00e+05 1.00e+04 1.00e+01 1.310 12070.00
3.13e-06 1.00e-07 3.20e-02 1.00e+05 1.00e+04 1.00e+01 1.310 12090.00
I don't know how to correctly read in the data or if this is even the right way to go about this.
EDIT #1
Ok, thanks to mgilson I now have
#m1 m2 q taua taue K avgPeriodRatio time
#1 2 3 4 5 6 7 8
set term aqua enhanced font "Times-Roman,18"
K = "`head -1 datafile | awk '{print $6}'`"
print K+0
graph(n) = sprintf("K=%.2e",n)
plot file using 1:3 title graph(K)
but I get the error: Non-numeric string found where a numeric expression was expected
EDIT #2
file = "testPlot.txt"
K = "`head -1 file | awk '{print $6}'`"
K=K+0 #Cast K to a floating point number #this is line 9
graph(n) = sprintf("K=%.2e",n)
plot file using 1:3 title graph(K)
This gives the error--> head: file: No such file or directory
"testPlot.gnu", line 9: Non-numeric string found where a numeric expression was expected
You have a few options...
FIRST OPTION:
use columnheader
plot file using 1:3 title columnheader(6)
I haven't tested it, but this may prevent the first row from actually being plotted.
SECOND OPTION:
use an external utility to get the title:
TITLE="`head -1 datafile | awk '{print $6}'`"
plot 'datafile' using 1:3 title TITLE
If the variable is numeric and you want to reformat it, you can cast a string to a numeric type (integer/float) in gnuplot by adding 0 to it, e.g.:
print "36.5"+0
Then you can format it with sprintf or gprintf as you're already doing.
It's weird that there is no float function. (int will work if you want to cast to an integer).
EDIT
The script below worked for me (when I pasted your example data into a file called "datafile"):
K = "`head -1 datafile | awk '{print $6}'`"
K=K+0 #Cast K to a floating point number
graph(n) = sprintf("K=%.2e",n)
plot "datafile" using 1:3 title graph(K)
EDIT 2 (addresses comments below)
To expand a variable in backticks, you'll need macros:
set macro
file="mydatafile.txt"
#THE ORDER OF QUOTES (' and ") IS CRUCIAL HERE.
cmd='"`head -1 ' . file . ' | awk ''{print $6}''`"'
# . is string concatenation. (this string has 3 pieces)
# to get a single quote inside a single quoted string
# you need to double. e.g. 'a''b' yields the string a'b
data=@cmd
To address your question 2, it is a good idea to familiarize yourself with shell utilities -- sed and awk can both do it. I'll show a combination of head/tail:
cmd='"`head -2 ' . file . ' | tail -1 | awk ''{print $6}''`"'
should work.
EDIT 3
I recently learned that in gnuplot, system is a function as well as a command. To do the above without all the backtick gymnastics,
data=system("head -1 " . file . " | awk '{print $6}'")
Wow, much better.
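For readers who want to check what the shell pipeline extracts, here is a rough Python sketch of the same step: take the 6th whitespace-separated field of the first data line and format it like the `graph(n)` function. Unlike `head -1`, this sketch skips blank lines and lines starting with `#`, which is an assumption about how the file might be laid out; the function name `first_value` is hypothetical.

```python
import io


def first_value(lines, column=6):
    """Return the given 1-based column of the first data line as a float."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue  # assumption: skip blank and comment lines
        return float(stripped.split()[column - 1])
    raise ValueError('no data line found')


# In-memory stand-in for the data file, using rows from the question.
sample = io.StringIO(
    "1.00e-07 1.00e-07 1.00e+00 1.00e+05 1.00e+04 1.00e+01 1.310 12070.00\n"
    "1.11e-06 1.00e-07 9.02e-02 1.00e+05 1.00e+04 1.00e+01 1.310 12070.00\n"
)
K = first_value(sample)
title = 'K=%.2e' % K
print(title)  # K=1.00e+01
```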
This is a very old question, but here's a nice way to get access to a single value anywhere in your data file and save it as a gnuplot-accessible variable:
set term unknown #This terminal will not attempt to plot anything
plot 'myfile.dat' index 0 every 1:1:0:0:0:0 u (var=$1):1
The index number allows you to address a particular dataset (datasets are separated by two blank lines), while every allows you to specify a particular line.
The colon-separated numbers after every should be of the form 1:1:<line_number>:<block_number>:<line_number>:<block_number>, where the line number is the line within the block (starting from 0), and the block number is the number of the block (blocks are separated by a single blank line, again starting from 0). The first and second numbers say plot every 1 lines and every one data block, and the third and fourth say start from line <line_number> and block <block_number>. The fifth and sixth say where to stop. This allows you to select a single line anywhere in your data file.
The last part of the plot command assigns the value in a particular column (in this case, column 1) to your variable (var). There needs to be two values to a plot command, so I chose column 1 to plot against my variable assignment statement.
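To make the dataset/block/line addressing concrete, here is a rough Python sketch of that layout, assuming datasets are separated by two blank lines and blocks by one. It is not how gnuplot is implemented; `parse_gnuplot_data` and the sample text are hypothetical. Picking `data[0][1][0]` roughly corresponds to `index 0 every 1:1:0:1:0:1`.

```python
def parse_gnuplot_data(text):
    """Return a nested list: datasets -> blocks -> rows of string fields."""
    datasets = [[[]]]
    blank_run = 0
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith('#'):
            continue  # comment lines are ignored
        if not stripped:
            blank_run += 1
            continue
        if datasets[-1][-1]:  # only split once some data has been read
            if blank_run >= 2:
                datasets.append([[]])    # two blank lines: new dataset
            elif blank_run == 1:
                datasets[-1].append([])  # one blank line: new block
        blank_run = 0
        datasets[-1][-1].append(stripped.split())
    return datasets


sample = "1 2\n3 4\n\n5 6\n\n\n7 8\n"
data = parse_gnuplot_data(sample)

# Dataset 0, block 1, line 0, column 1 (all but the column counted from 0).
var = float(data[0][1][0][0])
print(var)  # 5.0
```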
Here is a less 'awk'-ward solution which assigns the value from the first row and 6th column of the file 'Data.txt' to the variable x16.
set table
# Syntax: u 0:($0==RowIndex?(VariableName=$ColumnIndex):$ColumnIndex)
# RowIndex starts with 0, ColumnIndex starts with 1
# 'u' is an abbreviation for the 'using' modifier
plot 'Data.txt' u 0:($0==0?(x16=$6):$6)
unset table
A more general example for storing several values is given below:
# Load data from file to variable
# Gnuplot can only access the data via the "plot" command
set table
# Syntax: u 0:($0==RowIndex?(VariableName=$ColumnIndex):$ColumnIndex)
# RowIndex starts with 0, ColumnIndex starts with 1
# 'u' is an abbreviation for the 'using' modifier
# Example: Assign all values according to: xij = Data33[i,j]; i,j = 1,2,3
plot 'Data33.txt' u 0:($0==0?(x11=$1):$1),\
'' u 0:($0==0?(x12=$2):$2),\
'' u 0:($0==0?(x13=$3):$3),\
'' u 0:($0==1?(x21=$1):$1),\
'' u 0:($0==1?(x22=$2):$2),\
'' u 0:($0==1?(x23=$3):$3),\
'' u 0:($0==2?(x31=$1):$1),\
'' u 0:($0==2?(x32=$2):$2),\
'' u 0:($0==2?(x33=$3):$3)
unset table
print x11, x12, x13 # Data from first row
print x21, x22, x23 # Data from second row
print x31, x32, x33 # Data from third row