I am trying to see how we can extract all characters in a column after the 4th character.
col_a
XYZ123
ABCD001
Expecting the below
col_a, new_col
XYZ123, 23
ABCD001, 001
Try with string slicing:
df['new_col'] = df['col_a'].str[4:]
Or via the re module (the pattern captures everything after the first 4 characters):
import re
df['new_col'] = df['col_a'].apply(lambda x: re.findall(r'^.{4}(.*)$', x)[0])
With your shown samples, please try the following, using Pandas' str.extract function. A simple explanation: the regex ^.{4}(.*)$ puts everything apart from the first 4 characters into a capturing group, which is saved to the new column.
df['new_col'] = df['col_a'].str.extract(r'^.{4}(.*)$', expand=False)
Output of df will be as follows:
col_a new_col
0 XYZ123 23
1 ABCD001 001
Another way: extract the word characters that follow the first 4 word characters, using a lookbehind:
df['new_col'] = df.col_a.str.extract(r'((?<=^\w{4})\w+)')
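With the sample data this gives the same result:
col_a new_col
0 XYZ123 23
1 ABCD001 001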
I'm reading a file which has a column with a ' in its name. Something like
df:
colA col’21 colC
abc 2001 Ab1
now I can't seem to read the column like:
df['col\'21']
It gives a KeyError.
Your character is not a straight quote (') but the Right Single Quotation Mark (’).
Replace this character with the standard quote:
df.columns = df.columns.str.replace('\u2019', "'")
print(df["col'21"])
To find the Unicode code point of the character, use:
>>> hex(ord("’"))
'0x2019'
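If you are not sure which column name contains such a lookalike character, one quick diagnostic is to print each name together with its code points:
# show each column name with the code point of every character
for col in df.columns:
    print(col, [hex(ord(ch)) for ch in col])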
You need to use the ’ instead:
instead of:
df["col'21"]
try:
df["col’21"]
With a plain straight quote, I don't get any error:
df = pd.DataFrame({"col'21":[1,2,3]})
or
df = pd.DataFrame({"""col'21""":[1,2,3]})
Output:
col'21
0 1
1 2
2 3
How do I drop pandas dataframe columns that contain special characters such as # / ] [ } { - _ etc.?
For example I have the following dataframe (called df):
I need to drop the columns Name and Matchkey because they contain some special characters.
Also, how can I specify a list of special characters based on which the columns will be dropped?
For example: I'd like to drop the columns that contain (in any record, in any cell) any of the following special characters:
listOfSpecialCharacters: ¬,`,!,",£,$,£,#,/,\
One option is to use a regex with str.contains and apply, then use boolean indexing to drop the columns:
import re
chars = '¬`!"£$£#/\\'
regex = f'[{"".join(map(re.escape, chars))}]'
# '[¬`!"£\\$£\\#/\\\\]'
df2 = df.loc[:, ~df.apply(lambda c: c.str.contains(regex).any())]
example:
# input
A B C
0 123 12! 123
1 abc abc a¬b
# output
A
0 123
1 abc
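To reproduce this end to end, a minimal sketch (the sample values are taken from the input table above):
import re
import pandas as pd

df = pd.DataFrame({'A': ['123', 'abc'],
                   'B': ['12!', 'abc'],
                   'C': ['123', 'a¬b']})

chars = '¬`!"£$£#/\\'
regex = f'[{"".join(map(re.escape, chars))}]'

# drop every column in which any cell matches one of the special characters
df2 = df.loc[:, ~df.apply(lambda c: c.str.contains(regex).any())]
print(df2)
#      A
# 0  123
# 1  abc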
I have a text file that looks like this:
Version:23
Developer: Ali
NAME AGE IN
- Carol 22 no
- Kyle 31 yes
...
I am reading it using a Dask dataframe (which should be similar to Pandas). The resulting dataframe should look like this:
NAME AGE IN
Carol 22 no
Kyle 31 yes
I am having trouble getting rid of the dash ('-') at the start of each row below the header. I tried
dd.read_csv(filepath, header=3, sep=r"\s+")
which results in a dataframe with inconsistent row sizes and brings more problems,
and I also tried using multiple delimiters, but that still gives errors:
dd.read_csv(filepath, header=3, sep=r"\s-\s+")
dask.dataframe assumes your data is already in a tabular format. If you insist on using dask, then you will get further with dask.bag, which will load the file line by line. You can then filter out the lines that do not start with a dash, and process the ones that do, encoding them as a json object/dict, which you later convert to dataframe with .to_dataframe().
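A minimal sketch of that approach (assuming the 4-column layout from the sample above, with filepath as in the question):
import dask.bag as db

def parse(line):
    # "- Carol 22 no" -> {"NAME": "Carol", "AGE": 22, "IN": "no"}
    _, name, age, flag = line.split()
    return {"NAME": name, "AGE": int(age), "IN": flag}

lines = db.read_text(filepath)
# keep only the data rows (those starting with a dash), then parse each into a dict
records = lines.filter(lambda l: l.lstrip().startswith("-")).map(parse)
df = records.to_dataframe()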
I have a column of data in a pandas dataframe in Bxxxx-xx-xx-xx.y format. Only the first part (Bxxxx) is all I require. How do I split the data? In addition, I also have data in BSxxxx-xx-xx-xx format in the same column, which I would like to remove using a regex='^BS' argument (for some reason, it's not working). Any help in this regard will be appreciated. BTW, I am using the df.filter command.
This should work; it keeps the rows whose first dash-separated segment does not start with "BS", then takes that first segment:
df[df.col1.apply(lambda x: x.split("-")[0][0:2] != "BS")].col1.apply(lambda x: x.split("-")[0])
Consider the below example:
df = pd.DataFrame({
'col':['B123-34-gd-op','BS01010-9090-00s00','B000003-3frdef4-gdi-ortp','B1263423-304-gdcd-op','Bfoo3-poo-plld-opo', 'BSfewf-sfdsd-cvc']
})
print(df)
Output:
col
0 B123-34-gd-op
1 BS01010-9090-00s00
2 B000003-3frdef4-gdi-ortp
3 B1263423-304-gdcd-op
4 Bfoo3-poo-plld-opo
5 BSfewf-sfdsd-cvc
Now let's do two tasks:
Extract the Bxxxx part from Bxxxx-xx-xx-xx.
Remove BSxxxx-formatted strings.
Consider the below code, which uses startswith():
df[~df.col.str.startswith('BS')].col.str.split('-').str[0]
Output:
0 B123
2 B000003
3 B1263423
4 Bfoo3
Name: col, dtype: object
Breakdown:
df[~df.col.str.startswith('BS')] gives us all the strings which do not start with BS. Next, we split those strings on - and take the first part with .col.str.split('-').str[0].
You can define a function wherein you treat Bxxxx-xx-xx-xx.y as a string and just take the first 5 characters (this assumes the Bxxxx prefix is always exactly 5 characters long).
>>> def edit_entry(x):
... return (str(x)[:5])
>>> df['Column_name'].apply(edit_entry)
A one-liner solution would be:
df["column_name"] = df["column_name"].apply(lambda x: x[:5])
I have a pandas dataframe containing data in the following format:
SAC1001.K
KAM10120.B01.W001
CLT004.09C
ASMA104
AJAY101.A.KAS.101
I wish to modify the column using string manipulation so that the result is
SAC1001.K
KAM10120.B01
CLT004.09C
ASMA104
AJAY101.A
How can this be done? Regex looks to be one way, but I am not sure of it. Is there any other elegant way to do it? Please guide.
In [109]: df
Out[109]:
col
0 SAC1001.K
1 KAM10120.B01.W001
2 CLT004.09C
3 ASMA104
4 AJAY101.A.KAS.101
In [110]: df['col'] = df['col'].str.replace(r'(\..*?)\..*', r'\1', regex=True)  # drop everything from the second dot onwards
In [111]: df
Out[111]:
col
0 SAC1001.K
1 KAM10120.B01
2 CLT004.09C
3 ASMA104
4 AJAY101.A
Here is another way without regex, though maybe with too many str accessors:
df['col'].str.split('.').str[:2].str.join('.')
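For the same df, this gives the identical result (a quick check):
In [112]: df['col'].str.split('.').str[:2].str.join('.')
Out[112]:
0 SAC1001.K
1 KAM10120.B01
2 CLT004.09C
3 ASMA104
4 AJAY101.A
Name: col, dtype: object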