Converting zip+4 to zip python - pandas

I am looking to convert ZIP+4 codes into five-digit ZIP codes in a pandas dataframe. I want it to detect when a ZIP+4 code exists and keep just the first five digits. I effectively want to do the code below (although it doesn't work in this form):
df.replace('^(\d{5}-?\d{4})', group(1), regex=True)
The following code does the same procedure for a list; I'm looking to do the same thing in the dataframe.
import re

my_input = ['01234-5678', '012345678', '01234', 'A1A 1A1', 'A1A1A1']
expression = re.compile(r'^(\d{5})-?(\d{4})?$')
my_output = []
for string in my_input:
    if m := re.match(expression, string):
        my_output.append(m.group(1))
    else:
        my_output.append(string)

You can use
df = df.replace(r'^(\d{5})-?\d{4}$', r'\1', regex=True)
Details:
^ - start of string
(\d{5}) - Group 1 (\1): five digits
-? - an optional hyphen
\d{4} - any four digits
$ - end of string.
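As a quick sanity check, here is the replacement applied to the question's own sample values (a minimal sketch with illustrative data):

```python
import pandas as pd

df = pd.DataFrame({'zip': ['01234-5678', '012345678', '01234', 'A1A 1A1']})
# ZIP+4 values collapse to their first five digits; everything else is untouched
df = df.replace(r'^(\d{5})-?\d{4}$', r'\1', regex=True)
print(df['zip'].tolist())  # ['01234', '01234', '01234', 'A1A 1A1']
```

Note that a plain five-digit ZIP does not match the pattern (the trailing \d{4} is required), so it passes through unchanged.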

numpy/pandas - find a substring by regex and replace it by selecting a random value from a list

There is a list like the one below.
list=[1,2,3,4,5.....]
Then there's a df like below.
message
"2022-12-18 23:56:32,939 vlp=type rev=2 td=robert CIP=x.x.x.x motherBoard=A motherName=""A"" ns=nsA. npd=npd1 messageID=sfsdfdsfsdsa nu=nuA diui=8"
...
...
I use the code below to find the messageID value first and then replace it with a random value from the list, but it doesn't work:
messageID = list(map(str, messageID))
df.messageID = df.messageID.str.replace(r'\s+messageID=(.*?)\s+', np.random.choice(messageID, size=len(df)) , regex=True)
Can any expert please take a look? Thanks.
Use a lookbehind with re.sub to do the replacement inside a list comprehension:
import re
zipped = zip(df.messageID, np.random.choice(messageID, size=len(df)))
df['messageID'] = [re.sub(r'(?<=messageID=)\w+', s, r) for r, s in zipped]
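A self-contained sketch of the same idea, using made-up log rows and a made-up candidate list in place of the question's data:

```python
import re
import numpy as np
import pandas as pd

df = pd.DataFrame({'messageID': [
    'td=robert messageID=sfsdfdsfsdsa nu=nuA',
    'td=alice messageID=abc123 nu=nuB',
]})
pool = list(map(str, [1, 2, 3, 4, 5]))            # replacement candidates as strings
choices = np.random.choice(pool, size=len(df))    # one random pick per row
# The lookbehind keeps 'messageID=' itself and swaps only the value after it
df['messageID'] = [re.sub(r'(?<=messageID=)\w+', s, r)
                   for r, s in zip(df.messageID, choices)]
```

Everything before and after the messageID= token survives intact; only the ID value changes.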

Data cleaning: regex to replace numbers

I have this dataframe:
p=pd.DataFrame({'text':[2,'string']})
and I am trying to replace the digit 2 with the letter 'a' using this code:
p['text']=p['text'].str.replace('\d+', 'a')
But instead of the letter 'a' I get NaN.
What am I doing wrong here?
In your dataframe, the first value of the text column is actually a number, not a string, thus the NaN error when you try to call .str. Just convert it to a string first:
p['text'] = p['text'].astype(str).str.replace(r'\d+', 'a')
Output:
>>> p
     text
0       a
1  string
(Note that .str.replace will soon change the default value of regex from True to False, so you won't be able to use regular expressions without passing regex=True, e.g. .str.replace(r'\d+', 'a', regex=True).)
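Putting the cast and the explicit regex=True flag together, a minimal runnable version of the fix:

```python
import pandas as pd

p = pd.DataFrame({'text': [2, 'string']})
# Cast to str first so the .str accessor works on the numeric row,
# and pass regex=True so r'\d+' is treated as a pattern
p['text'] = p['text'].astype(str).str.replace(r'\d+', 'a', regex=True)
print(p['text'].tolist())  # ['a', 'string']
```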

Pandas str split. Can I skip lines which give trouble?

I have a dataframe (all5) with one column of dates ('CREATIE_DATUM'). Sometimes the notation is 01/JAN/2015 and sometimes it's written as 01-JAN-15.
I only need the year, so I wrote the following line:
all5[['Day','Month','Year']]=all5['CREATIE_DATUM'].str.split('-/',expand=True)
but I get the following error:
columns must be same length as key
so I assume that somewhere in my dataframe (>100,000 rows) a value has more than two '/' signs.
How can I make my code skip this line?
You can try pd.to_datetime and then use the .dt accessor to get the day, month and year:
x = pd.to_datetime(all5["CREATIE_DATUM"])
all5["Day"] = x.dt.day
all5["Month"] = x.dt.month
all5["Year"] = x.dt.year

Make a new column based on the presence of a word in another

I have
pd.DataFrame({'text':['fewfwePDFerglergl','htrZIPg','gemlHTML']})
                text
0  fewfwePDFerglergl
1            htrZIPg
2           gemlHTML
a column 10k rows long. Each entry in text contains one of ['PDF','ZIP','HTML'] and is at most 14 characters long.
how do I get:
pd.DataFrame({'text':['fewfwePDFerglergl','htrZIPg','gemlHTML'],'file_type':['pdf','zip','html']})
                text file_type
0  fewfwePDFerglergl       pdf
1            htrZIPg       zip
2           gemlHTML      html
I tried df.text[0].find('ZIP') for a single entry, but I don't know how to stitch it all together to test each row and return the correct value.
Any suggestions?
We can use str.extract here with the inline flag (?i) for case-insensitive matching:
words = ['pdf','zip','html']
df['file_type'] = df['text'].str.extract(f'(?i)({"|".join(words)})')
Or we can use the flags=re.IGNORECASE argument:
import re
df['file_type'] = df['text'].str.extract(f'({"|".join(words)})', flags=re.IGNORECASE)
Output
                text file_type
0  fewfwePDFerglergl       PDF
1            htrZIPg       ZIP
2           gemlHTML      HTML
If you want file_type as lower case, chain str.lower():
df['file_type'] = df['text'].str.extract(f'(?i)({"|".join(words)})')[0].str.lower()
                text file_type
0  fewfwePDFerglergl       pdf
1            htrZIPg       zip
2           gemlHTML      html
Details:
The pipe (|) is the or operator in regular expressions. So with:
"|".join(words)
'pdf|zip|html'
We get the following in pseudocode:
extract "pdf" or "zip" or "html" from our string
You could use regex for this:
import re
regex = re.compile(r'(PDF|ZIP|HTML)')
This matches any of the desired substrings. To extract the matches in order and lowercase them, here's a one-liner:
file_type = [re.search(regex, x).group().lower() for x in df['text']]
This returns the following list:
['pdf', 'zip', 'html']
Then to add the column:
df['file_type'] = file_type
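One caveat with the list comprehension above: re.search returns None when a row contains none of the words, and calling .group() on None raises AttributeError. A hedged variant with a fallback (an extra non-matching row is added purely for illustration):

```python
import re
import pandas as pd

regex = re.compile(r'(PDF|ZIP|HTML)', flags=re.IGNORECASE)
df = pd.DataFrame({'text': ['fewfwePDFerglergl', 'htrZIPg', 'gemlHTML', 'plain']})
# Guard against rows with no match before calling .group()
df['file_type'] = [m.group().lower() if (m := regex.search(x)) else None
                   for x in df['text']]
print(df['file_type'].tolist())  # ['pdf', 'zip', 'html', None]
```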

TypeError: 'DataFrame' object is not callable in concatenating different dataframes of certain types

I keep getting the error below.
I read a file that contains time-series data with 3 columns: [meter ID] [daycode (explained later)] [meter reading in kWh]
consum = pd.read_csv("data/File1.txt", delim_whitespace=True, encoding = "utf-8", names =['meter', 'daycode', 'val'], engine='python')
consum.set_index('meter', inplace=True)
test = consum.loc[[1048]]
I will observe meter readings over the full length of the data in this file, but first I filter by meter ID.
test['day'] = test['daycode'].astype(str).str[:3]
test['hm'] = test['daycode'].astype(str).str[-2:]
For readability, I convert daycode based on its rule. The first 3 digits are in the range 1 to 365 x 2 = 730, and the last 2 digits are in the range 1 to 48. These are 30-minute interval readings spanning 2 years (though not all meters have the full span).
So I created one file containing the dates and another containing the times. I use the index to convert the digits of daycode into the corresponding date & time from these files.
#dcodebook index starts from 0. So minus 1 from the daycode before match
dcodebook = pd.read_csv("data/dcode.txt", encoding = "utf-8", sep = '\r', names =['match'])
#hcodebook starts from 1
hcodebook = pd.read_csv("data/hcode.txt", encoding = "utf-8", sep ='\t', lineterminator='\r', names =['code', 'print'])
hcodebook = hcodebook.drop(['code'], axis= 1)
For some weird reason, dcodebook had to be indexed with .iloc as I understood it, but hcodebook needed .loc.
#iloc: by int-position
#loc: by label value
#ix: by both
day_df = dcodebook.iloc[test['day'].astype(int) - 1].reset_index(drop=True)
#to avoid a duplicate-index ValueError, create separate dataframes..
hm_df = hcodebook.loc[test['hm'].astype(int) - 1]
#.to_frame error / do I need .reset_index(drop=True)?
The following line is where the code crashes.
datcode_df = day_df(['match']) + ' ' + hm_df(['print'])
print(datcode_df)
print(test)
What I don't understand:
I tested earlier that columns of different dataframes can be merged using simple addition, as seen above.
I initially assigned the result to the existing ['daycode'] column in the test dataframe, so that the previous values would be replaced, and the same error message was returned.
Please advise.
You need both DataFrames to be the same size, so it is necessary that day and hm are unique.
Then use reset_index with drop=True so the indices match, and finally remove the () in the join (use square brackets to select a column):
day_df = dcodebook.iloc[test['day'].astype(int) - 1].reset_index(drop=True)
hm_df = hcodebook.loc[test['hm'].astype(int) - 1].reset_index(drop=True)
datcode_df = day_df['match'] + ' ' + hm_df['print']
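With hypothetical lookup rows standing in for the selections from dcodebook and hcodebook, the corrected join looks like this:

```python
import pandas as pd

# Hypothetical stand-ins for the rows selected from dcodebook/hcodebook
day_df = pd.DataFrame({'match': ['2015-01-01', '2015-01-02']})
hm_df = pd.DataFrame({'print': ['00:00', '00:30']})
# Square brackets select a column; parentheses would try to *call* the
# DataFrame, which is exactly the TypeError from the question
datcode_df = day_df['match'] + ' ' + hm_df['print']
print(datcode_df.tolist())  # ['2015-01-01 00:00', '2015-01-02 00:30']
```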