I have a pandas dataframe in the below format
name
BC_new-0
BC_new-1
BC_new-2
I would like to extract whatever is before the "_" and append it to a new column:
df['value'] = str(df['name']).split("_")[0]
But I get the below results
value
0 BC
0 BC
0 BC
Any suggestions on how to keep this "0" out of the output? Any leads would be appreciated.
I might use str.extract here:
df['value'] = df['name'].str.extract(r'^([^_]+)')
As the comment above suggests, if you want to use string splitting, then use str.split:
df['value'] = df['name'].str.split("_").str[0]
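Both one-liners can be checked on a tiny frame built from the question's data (a minimal sketch; `expand=False` is added so `str.extract` returns a Series rather than a one-column DataFrame):

```python
import pandas as pd

df = pd.DataFrame({"name": ["BC_new-0", "BC_new-1", "BC_new-2"]})

# Capture everything before the first underscore
df["value"] = df["name"].str.extract(r"^([^_]+)", expand=False)

# Equivalent with string splitting: take the first piece of each split
df["value2"] = df["name"].str.split("_").str[0]

print(df)
```

Both new columns contain "BC" for every row, with no stray index text, because the operations are vectorized over the Series instead of stringifying the whole column.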
Related
I am trying to remove rows that contain any of these characters (##%+*=) and also a full word:
col
ahoi*
word
be
df = df[~df[col].str.contains(r'[##%+*=](word))', regex=True)]
I managed to remove the special characters with .str.contains(r'[##%+*=]'); however, I cannot remove the row with the full word.
What am I missing?
This is the expected result.
col
be
IIUC, you need to use the or operator (|) instead of the parentheses:
df = df[~df["col"].str.contains(r'[##%+*=]|word', regex=True)]
Output:
print(df)
col
2 be
You can try:
>>> df[~df['col'].str.contains(r'(?:[##%+*=]|word)', regex=True)]
col
2 be
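Both filters can be verified end-to-end on a minimal frame assembled from the question's data:

```python
import pandas as pd

df = pd.DataFrame({"col": ["ahoi*", "word", "be"]})

# [##%+*=] matches any one special character; |word additionally matches the full word
mask = df["col"].str.contains(r"[##%+*=]|word", regex=True)
out = df[~mask]
print(out)
```

Only the "be" row survives: "ahoi*" is caught by the character class and "word" by the alternation.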
I'm reading a file which has a ' in one of its column names. Something like:
df:
colA col'21 colC
abc 2001 Ab1
now I can't seem to read the column like:
df['col\'21']
It gives the KeyError.
Your character is not a straight quote but the Right Single Quotation Mark (U+2019).
Replace this character by the standard quote:
df.columns = df.columns.str.replace('\u2019', "'")
print(df["col'21"])
To find the unicode character, use:
>>> hex(ord("’"))
'0x2019'
You need to use the ’ character instead. So instead of:
df["YTD'21"]
try:
df["YTD’21"]
I don't get any error:
df = pd.DataFrame({"col'21":[1,2,3]})
or
df = pd.DataFrame({"""col'21""":[1,2,3]})
Output:
col'21
0 1
1 2
2 3
I have dataframe like :
Name     Address
Anuj     Anuj,Sinha,BB
Sinha    Sinha,Anuj BB
In the Address column, I want to replace every comma (,) except the first one in each row with -.
Can anyone please suggest me the possible solution?
For reference:
df.dtypes
Customer ID Int64
First_name-Last_name string
Address string
Phone string
Secondary_station string
Customer_disconnected string
If there is a maximum of 2 commas, you can use this simple regex:
df['Address'] = df['Address'].str.replace('(,.*),', r'\1-', regex=True)
output:
Name Address
0 Anuj Anuj,Sinha-BB
1 Sinha Sinha,Anuj BB
If there are possibly more than 2 commas, you can do:
df['Address'] = df['Address'].str.split(',').apply(lambda x: x[0]+','+'-'.join(x[1:]))
or, more efficient:
splits = df['Address'].str.split(',', n=1)
df['Address'] = splits.str[0]+','+splits.str[1].str.replace(',', '-')
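The split-once approach can be sketched as a runnable whole (note the keyword `n=1`, required by recent pandas; an address with no comma at all would produce NaN here):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Anuj", "Sinha"],
                   "Address": ["Anuj,Sinha,BB", "Sinha,Anuj BB"]})

# Split only on the first comma, then dash out any commas left in the tail
splits = df["Address"].str.split(",", n=1)
df["Address"] = splits.str[0] + "," + splits.str[1].str.replace(",", "-", regex=False)

print(df)
```

The first comma is preserved in both rows; only later commas become dashes.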
You can use the replace function this way :
txt = "I like bananas"
x = txt.replace("bananas", "apples")
print(x)
It will display :
I like apples
For your dataframe, you just need to iterate through the Address values, keeping the part before the first comma intact:
import pandas as pd
df = pd.DataFrame(
    {
        'name': ['Anuj', 'Sinha'],
        'adresse': ['Anuj,Sinha,BB', 'Sinha,Anuj BB']
    }
)
for index, value in enumerate(df['adresse']):
    # split off everything before the first comma,
    # then replace the remaining commas in the tail
    head, sep, tail = value.partition(',')
    df.loc[index, 'adresse'] = head + sep + tail.replace(',', '-')
print(df)
It will display:
name adresse
0 Anuj Anuj,Sinha-BB
1 Sinha Sinha,Anuj BB
For some reason I need to output to a csv in this format, with quotation marks around each column name; my desired output looks like:
"date" "ret"
2018-09-24 0.00013123989025119056
I am trying with
import csv
import os
import pandas as pd
Y_pred.index.name = "\"date\""
Y_pred.name = "\'ret\'"
Y_pred = Y_pred.to_frame()
path = "prediction/Q1/"
try:
os.makedirs(path)
except:
pass
Y_pred.to_csv(path+instrument_tmp+"_ret.txt",sep=' ')
and got outputs like:
"""date""" 'ret'
2018-09-24 0.00013123989025119056
I can't seem to find a way to wrap the column names in quotation marks. Does anyone know how? Thanks.
My solution:
Using quoting=csv.QUOTE_NONE together with Y_pred.index.name = "\"date\"" and Y_pred.name = "\"ret\"":
Y_pred.index.name = "\"date\""
Y_pred.name = "\"ret\""
Y_pred = Y_pred.to_frame()
path = "prediction/Q1/"
try:
os.makedirs(path)
except:
pass
Y_pred.to_csv(path+instrument_tmp+"_ret.txt",sep=' ',quoting=csv.QUOTE_NONE)
and then I get
"date" "ret"
2018-09-24 0.00013123989025119056
This is called quoted output.
Instead of manually hacking in quotes into your column names (which will mess with other dataframe functionality), use the quoting option:
df = pd.DataFrame({"date": ["2018-09-24"], "ret": [0.00013123989025119056]})
df.to_csv("out_q_esc.txt", sep=' ', escapechar='\\', quoting=csv.QUOTE_ALL, index=None)
"date" "ret"
"2018-09-24" "0.00013123989025119056"
The 'correct' way is to use quoting=csv.QUOTE_ALL (and optionally escapechar='\\'). Note, however, that QUOTE_ALL forces all columns to be quoted, even obviously numeric ones like the index; if we hadn't specified index=None, we would get:
"" "date" "ret"
"0" "2018-09-24" "0.00013123989025119056"
csv.QUOTE_MINIMAL declines to quote these fields because they don't strictly need quotes (they're neither multiline nor do they contain an internal quote or separator character).
IIUC, you can use the quoting argument with csv.QUOTE_NONE
import csv
df.to_csv('test.csv',sep=' ',quoting=csv.QUOTE_NONE)
And your resulting csv will look like:
"date" "ret"
0 2018-09-24 0.00013123989025119056
Side note: to make adding the quotes to your column names easier, you can use add_prefix and add_suffix. If your starting dataframe looks like:
>>> df
date ret
0 2018-09-24 0.000131
Then do:
df = df.add_suffix('"').add_prefix('"')
df.to_csv('test.csv',sep=' ',quoting=csv.QUOTE_NONE)
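Putting the side note together, a minimal sketch (writing to an in-memory buffer instead of a file, and dropping the index to match the desired output from the question):

```python
import csv
import io

import pandas as pd

df = pd.DataFrame({"date": ["2018-09-24"], "ret": [0.00013123989025119056]})

# Embed literal quotes in the headers, then tell to_csv not to do any quoting itself
df = df.add_suffix('"').add_prefix('"')
buf = io.StringIO()
df.to_csv(buf, sep=" ", quoting=csv.QUOTE_NONE, index=False)

print(buf.getvalue())
```

The header row comes out as `"date" "ret"`, with the data row unquoted.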
I have a multiindex dataframe (pivottable) with 1703 rows that looks like this:
Local code Ex Code ... Value
159605 FR1xx ... 30
159973 FR1xx ... 50
...
ZZC923HDV906 XYxx ... 20
There are either numerical local codes (e.g. 159973) or local codes consisting of both letters and digits (e.g. ZZC923HDV906).
I'd like to select the data by the first index column (Local code). This works well for the string characters with the following code
pv_comb[(pv_comb.index.get_level_values("Local code") == "ZZC923HDV906")]
However, I can't select the numerical values this way:
pv_comb[(pv_comb.index.get_level_values("Local code") == 159973)]
This returns an empty dataframe.
Is it possible to convert the values in the first column of the multiindex into string characters and then select the data?
IIUC, you need quotes, because your numeric values are stored as strings - so change 159973 to '159973':
pv_comb[(pv_comb.index.get_level_values("Local code") == '159973')]
If you need to convert some level of a MultiIndex to string, you have to create a new index and then assign it:
#if python 3 add list
new_index = list(zip(df.index.get_level_values('Local code').astype(str),
df.index.get_level_values('Ex Code')))
df.index = pd.MultiIndex.from_tuples(new_index, names = df.index.names)
It's also possible there is some whitespace; you can remove it with strip:
#change multiindex
new_index = list(zip(df.index.get_level_values('Local code').astype(str).str.strip(),
                     df.index.get_level_values('Ex Code')))
df.index = pd.MultiIndex.from_tuples(new_index, names = df.index.names)
If there are many levels, a solution is to first reset the problematic level, do the operations, and set the index back; calling sort_index may then be necessary:
df = pd.DataFrame({'Local code':[159605,159973,'ZZC923HDV906'],
'Ex Code':['FR1xx','FR1xx','XYxx'],
'Value':[30,50,20]})
pv_comb = df.set_index(['Local code','Ex Code'])
print (pv_comb)
Value
Local code Ex Code
159605 FR1xx 30
159973 FR1xx 50
ZZC923HDV906 XYxx 20
pv_comb = pv_comb.reset_index('Local code')
pv_comb['Local code'] = pv_comb['Local code'].astype(str)
pv_comb = pv_comb.set_index('Local code', append=True).swaplevel(0,1)
print (pv_comb)
Value
Local code Ex Code
159605 FR1xx 30
159973 FR1xx 50
ZZC923HDV906 XYxx 20
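A shorter alternative sketch using MultiIndex.set_levels, which casts just the first level in place without resetting the index (assuming a reasonably recent pandas):

```python
import pandas as pd

df = pd.DataFrame({'Local code': [159605, 159973, 'ZZC923HDV906'],
                   'Ex Code': ['FR1xx', 'FR1xx', 'XYxx'],
                   'Value': [30, 50, 20]})
pv_comb = df.set_index(['Local code', 'Ex Code'])

# Cast only the 'Local code' level to string; the other level stays untouched
pv_comb.index = pv_comb.index.set_levels(
    pv_comb.index.levels[0].astype(str), level=0)

# Selecting by the string form of the numeric code now works
res = pv_comb[pv_comb.index.get_level_values('Local code') == '159973']
print(res)
```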