I am reading a CSV file into a Pandas DataFrame, but the data needs to be cleaned up before it can be used. I need to do two things:
use regex to filter values
apply string functions such as trim, left, right, ...
For instance, the DataFrame may look like this:
0 city_some_string_45
1 city_Other_string_56
2 city_another_string_77
So I need to filter (using regex) for all rows whose value starts with "city" and get the last two characters.
The end result should look like this:
0 45
1 56
2 77
In other words, the logic I want to apply is: read the value of the cell and, if it starts with "city" (filtering with a regex, i.e. ^city), replace the value of the cell with its last two characters (e.g. using a "right" string function).
For a dataframe like this:
No city
0 0 city_some_string_45
1 1 city_Other_string_56
2 2 city_another_string_77
Filter the DataFrame to keep only the rows whose city column starts with "city":
df = df[df.city.str.startswith('city')]
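You can use str.extract to extract only the number (a raw string avoids the invalid-escape warning for \d, and expand=False returns a Series):
df['city'] = df.city.str.extract(r'(\d+)', expand=False).astype(int)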
The resulting df
No city
0 0 45
1 1 56
2 2 77
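The question also asks for a regex filter and a "right"-style string function. A minimal sketch of that variant follows, assuming the last two characters are always the ones wanted; the sample rows (including the "town_" row) and the names mask and filtered are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the "town_" row is added to show the filter working.
df = pd.DataFrame({
    "city": [
        "city_some_string_45",
        "city_Other_string_56",
        "town_another_string_77",
    ],
})

# Keep only rows whose value matches the regex ^city.
mask = df["city"].str.contains(r"^city", regex=True)
filtered = df[mask].copy()

# Equivalent of right(value, 2): slice the last two characters of each string.
filtered["city"] = filtered["city"].str[-2:]

print(filtered)
```

Unlike str.extract, slicing with .str[-2:] keeps the values as strings, so add .astype(int) if numbers are needed.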
I want to filter rows based on the string in data['Age'], keeping rows where that pattern followed by a semicolon ("o;", "i;", "twenty;", "a;") occurs at least twice in data['Name'].
data = pd.DataFrame({'Name':['To;om', 'ni;cki;', 'krish', 'jack'],'Age':['o', 'i', 'twenty', 'a']})
data
Name Age
0 To;om o
1 ni;cki; i
2 krish twenty
3 jack a
Output should look like this:
Name Age count
0 ni;cki; i 2
Use df.apply:
In [427]: data[data.apply(lambda x: x['Name'].count(f"{x['Age']};") >= 2, 1)]
Out[427]:
Name Age
1 ni;cki; i
Use a list comprehension to find the counts (counting the Age value followed by ";"), then filter with it:
df["count"] = [name.count(f"{age};") for name, age in zip(df.Name, df.Age)]
out = df[df["count"] >= 2].copy()
to get
>>> out
Name Age count
1 ni;cki; i 2
The copy at the end is to avoid the SettingWithCopyWarning in possible future manipulations.
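A self-contained sketch of the list-comprehension approach on the question's data. It counts the Age value together with the trailing semicolon, since the question asks for patterns such as "o;" and "i;". Counting the bare Age value would also select "To;om", which contains "o" twice.

```python
import pandas as pd

data = pd.DataFrame({
    "Name": ["To;om", "ni;cki;", "krish", "jack"],
    "Age": ["o", "i", "twenty", "a"],
})

# Count non-overlapping occurrences of "<Age>;" inside Name, row by row.
data["count"] = [
    name.count(f"{age};") for name, age in zip(data["Name"], data["Age"])
]

# Keep rows with at least two occurrences; copy to get an independent frame.
out = data[data["count"] >= 2].copy()

print(out)
```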
Having a dataset like this:
word
0 TBH46T
1 BBBB
2 5AAH
3 CAAH
4 AAB1
5 5556
Which would be the most efficient way to select the rows where column word is a combination of numbers and letters?
The output would be like this:
word
0 TBH46T
2 5AAH
4 AAB1
A possible solution would be to create a new column using apply and a regex that stores whether the word column has the desired structure. But I'm curious whether this could be achieved in a more straightforward way.
Use Series.str.contains to build one mask that matches a digit and another that matches a non-digit character, then combine them with & (bitwise AND):
df = df[df['word'].str.contains(r'\d') & df['word'].str.contains(r'\D')]
print (df)
word
0 TBH46T
2 5AAH
4 AAB1
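Note that \D matches any non-digit, including punctuation and spaces. If "letters" should mean ASCII letters only, one possible alternative is a single regex with two lookaheads. This is a sketch under that assumption; the names pattern and mixed are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"word": ["TBH46T", "BBBB", "5AAH", "CAAH", "AAB1", "5556"]})

# Two lookaheads anchored at the start of the string:
# the word must contain at least one digit and at least one ASCII letter.
pattern = r"^(?=.*\d)(?=.*[A-Za-z])"

mixed = df[df["word"].str.contains(pattern)]

print(mixed)
```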
I have a pandas data frame,
Currently the list column is a string. I want to split it on spaces and replicate the row so that each primary key is associated with each item in the list. Can you please advise me on how I can achieve this?
Edit:
I need to copy down the value column after splitting and stacking the list column
If your data frame is df you can do:
df.List.str.split(' ').apply(pd.Series).stack()
and you will get
Primary Key
0 0 a
1 b
2 c
1 0 d
1 e
2 f
dtype: object
You are splitting the variable List on spaces, turning the resulting list into a series, and then stacking it to turn it into long format, indexed on the primary key, along with a sequence for each item obtained from the split.
My version:
df['List'].str.split().explode()
produces
0 a
0 b
0 c
1 d
1 e
1 f
With regard to the Edit of the question, I think the following tweak will give you what you need:
df['List'] = df['List'].str.split()
df.explode('List')
Here is a solution.
df = df.assign(**{'list':df['list'].str.split()}).explode('list')
df['cc'] = df.groupby(level=0)['list'].cumcount()
df.set_index(['cc'],append=True)
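The question's table is not shown, so here is a minimal sketch on made-up data. The column names "Primary Key", "List" and "Value" and their contents are assumptions. It shows that explode repeats the other columns for every list item, which covers the "copy down the value column" requirement.

```python
import pandas as pd

# Hypothetical frame; the original table is not available.
df = pd.DataFrame({
    "Primary Key": [0, 1],
    "List": ["a b c", "d e f"],
    "Value": [10, 20],
})

long_df = (
    df.assign(List=df["List"].str.split())  # string -> list of tokens
      .explode("List")                      # one row per token, other columns repeated
      .reset_index(drop=True)
)

print(long_df)
```

DataFrame.explode requires pandas 0.25 or later.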
Hi, I have a string in one column:
s='123. 125. 200.'
I want to split it into 3 columns (or as many columns as there are numbers ending with "."),
with each column holding a number rather than a string.
From what I understand, you can use:
s='123. 125. 200.'
pd.Series(s).str.rstrip('.').str.split('.',expand=True).apply(pd.to_numeric,errors='coerce')
0 1 2
0 123 125 200
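For a whole column of such strings, one possible variant is to split on whitespace first and strip the trailing dot from each piece before converting. This is a sketch; the second sample row and the names col, parts and numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical column; the first value is the string from the question.
col = pd.Series(["123. 125. 200.", "7. 8. 9."])

# Split on whitespace into separate columns.
parts = col.str.split(expand=True)

# Drop the trailing "." in every column, then convert the text to numbers.
numbers = parts.apply(lambda c: pd.to_numeric(c.str.rstrip("."), errors="coerce"))

print(numbers)
```

With errors="coerce", any piece that cannot be parsed as a number becomes NaN instead of raising.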
I have the following:
C1 C2 C3
0 0 0 1
1 0 0 1
2 0 0 1
I would like to get the label of the column that contains the 1s, so the result
should be "C3".
I know how to do this by transposing the DataFrame and then getting the index values, but this is not ideal for the data I have, and I wonder whether there is a more efficient solution.
I return the result as a list because more than one column could contain values equal to 1. You can use DataFrame.loc.
If every value in the column must be 1, use:
df.loc[:,df.eq(1).all()].columns.tolist()
Output:
['C3']
If it is enough for a column to contain at least one 1, use:
df.loc[:,df.eq(1).any()].columns.tolist()
Or, as suggested by @piRSquared, you can select directly from df.columns:
[*df.columns[df.eq(1).all()]]
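To make the difference between all and any visible, here is a small sketch on a modified version of the question's data. The single 1 in C2 is an assumption added for illustration.

```python
import pandas as pd

# Question's frame, with one extra 1 placed in C2 for illustration.
df = pd.DataFrame({"C1": [0, 0, 0], "C2": [0, 1, 0], "C3": [1, 1, 1]})

is_one = df.eq(1)  # boolean frame: True where the cell equals 1

# Columns where every value is 1.
all_ones = df.columns[is_one.all()].tolist()

# Columns where at least one value is 1.
any_ones = df.columns[is_one.any()].tolist()

print(all_ones, any_ones)
```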