Replacing whole string with part of it using Regex in Python Pandas - pandas

I have a table in pdf found on this link: https://taxation-customs.ec.europa.eu/system/files/2022-11/tobacco_products_releases-consumption.pdf
I am trying to clean the data before analysing it, but I noticed that for 2014-2017 the cigarette data was merged by mistake. Instead of two cells per year in a column, one for Sweden and one for the UK, I got a single merged cell that looks something like this: 5393688\r28587000
I would like to update the data only for Sweden and keep the first value, the one before \r.
So far my code was as follows:
import numpy as np
import pandas as pd
import tabula
from pandas import Series, DataFrame
cig= pd.DataFrame(tabula.read_pdf(r"https://taxation-customs.ec.europa.eu/system/files/2022-11/tobacco_products_releases-consumption.pdf", pages ='all')[0])
cig.replace(to_replace='N/A', value=0, inplace=True, regex=True)
cig= cig.replace(',','', regex=True)
After this I tried
df.iloc[26,:].str.replace("('\r').*","")
cig.iloc[26,:] = cig.iloc[26,:].replace("('\r').*","", regex=True)
and
cig.iloc[26,:].replace(to_replace='(?:[0-9]+)([^0-9]{2})([0-9]+)', value='', regex=True)
But none of the above produces the desired result, and I still have values in the same format, e.g. 5393688\r28587000

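The quotes in "('\r').*" are matched literally, so that pattern only finds a carriage return wrapped in quote characters, which the data does not contain. Match the carriage return itself, set regex=True, and assign the changed subset back to the dataframe:
cig.iloc[26,:] = cig.iloc[26,:].replace(r"\r.*", "", regex=True)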
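To illustrate on a small scale, the sketch below rebuilds the merged cells in a hypothetical miniature of the table and keeps only the text before the carriage return. The row and column labels and all values except 5393688\r28587000 are made up for the example, and it assumes the separator is a real carriage-return character, as the \r in the question suggests.

```python
import pandas as pd

# Hypothetical miniature of the extracted table; only the first value
# (5393688\r28587000) is taken from the question.
cig = pd.DataFrame(
    {"2014": ["5393688\r28587000", "1000"], "2015": ["5200000\r27000000", "2000"]},
    index=["Sweden", "Other"],
)

# r"\r.*" matches the carriage return and everything after it,
# so replacing the match with "" leaves only the first number.
cig.loc["Sweden"] = cig.loc["Sweden"].replace(r"\r.*", "", regex=True)

print(cig)
```

Selecting the row by label instead of by position (iloc[26, :]) is only a choice made for the example; the replacement itself is the same.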

Related

Convert type object column to float

I have a table with a column named "price". This column is of type object. So, it contains numbers as strings and also NaN or ? characters. I want to find the mean of this column but first I have to remove the NaN and ? values and also convert it to float
I am using the following code:
import pandas as pd
import numpy as np
df = pd.read_csv('Automobile_data.csv', sep = ',')
df = df.dropna('price', inplace=True)
df['price'] = df['price'].astype('int')
df['price'].mean()
But, this doesn't work. The error says:
ValueError: No axis named price for object type DataFrame
How can I solve this problem?
Edit: in pandas version 1.3 and earlier, you need subset=[col], wrapped in a list/array. In version 1.4 and later you can pass a single column as a string.
You've got a few problems:
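df.dropna() takes the axis first and then the subset. The axis is rows/columns, and the subset is which of those to look at. Because 'price' was passed as the first positional argument, pandas treated it as the axis, which is what the error message says. So you want this to be (I think) df.dropna(axis='rows',subset='price')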
Using inplace=True makes the whole thing return None, and so you have set df = None. You don't want to do that. If you are using inplace=True, then you don't assign something to that, the whole line would just be df.dropna(...,inplace=True).
Don't use inplace=True, just do the assignment. That is, you should use df=df.dropna(axis='rows',subset='price')
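The points above do not cover the ? placeholders or the conversion to float. A minimal sketch of the whole clean-up follows, assuming the column holds numeric strings, ? markers and missing values; the small dataframe is a hypothetical stand-in for Automobile_data.csv.

```python
import pandas as pd

# Hypothetical stand-in for the 'price' column of Automobile_data.csv.
df = pd.DataFrame({"price": ["13495", "?", None, "16500"]})

# Turn anything that is not a number (here "?" and None) into NaN
# and convert the remaining values to a numeric dtype.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Drop rows without a price; subset as a list works on every pandas version.
df = df.dropna(subset=["price"])

mean_price = df["price"].mean()
print(mean_price)
```

Since mean() skips NaN by default, the dropna step is only needed if the rows without a price should leave the dataframe altogether.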

Reading BED files into pandas dataframe (windows)

For a bioinformatics project, I would like to read a .BED file into a pandas dataframe and have no clue how I can do it and what tools/programs are required. Nothing I found on the internet was really applicable to me, as I am working on windows10 with Python 3.7 (Anaconda distribution).
Any help would be appreciated.
According to https://software.broadinstitute.org/software/igv/BED:
A BED file (.bed) is a tab-delimited text file that defines a feature track.
According to http://genome.ucsc.edu/FAQ/FAQformat#format1, it contains up to 12 fields (columns) and possibly comment lines starting with the word 'track'. The following is a minimal program to read such a BED file into a pandas dataframe.
import pandas as pd
df = pd.read_csv('so58178958.bed', sep='\t', comment='t', header=None)
header = ['chrom', 'chromStart', 'chromEnd', 'name', 'score', 'strand', 'thickStart', 'thickEnd', 'itemRgb', 'blockCount', 'blockSizes', 'blockStarts']
df.columns = header[:len(df.columns)]
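This is just a very simple code snippet that treats all lines starting with a 't' as comments. Note that comment='t' also makes pandas discard the rest of any line from the first 't' onward, so it only works if no field contains a lowercase 't'. Skipping lines by their first character is safe because all 'chrom' field entries should start with either a 'c', an 's' or a digit.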
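If field values may contain a 't', one alternative is to filter out the header lines before handing the text to pandas. The sketch below is one way to do that; the read_bed helper and the sample data are hypothetical, and it assumes header lines start with 'track', 'browser' or '#'.

```python
import io
import pandas as pd

BED_COLUMNS = ['chrom', 'chromStart', 'chromEnd', 'name', 'score', 'strand',
               'thickStart', 'thickEnd', 'itemRgb', 'blockCount',
               'blockSizes', 'blockStarts']

def read_bed(lines):
    """Build a dataframe from an iterable of BED lines (e.g. an open file)."""
    # Keep only data lines: skip blanks and track/browser/comment headers.
    data = [ln for ln in lines
            if ln.strip() and not ln.startswith(('track', 'browser', '#'))]
    df = pd.read_csv(io.StringIO(''.join(data)), sep='\t', header=None)
    df.columns = BED_COLUMNS[:len(df.columns)]
    return df

# Made-up sample; note the 't' inside the name field of the first feature.
sample = io.StringIO(
    "track name=demo\n"
    "chr1\t100\t200\tfeat_t1\n"
    "chr2\t300\t400\tfeat2\n"
)
bed = read_bed(sample)
print(bed)
```

For a file on disk, the same helper can be called inside a with open(path) as handle: block, passing handle as the argument.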
If you use pyranges, the columns are given names and appropriate data types.
import pyranges as pr
df = pr.read_bed("your.bed", as_df=True)
It also has readers for untidy bioinformatics formats such as GTF and GFF3.

Holoviz panel will not print pandas dataframe row in Jupyter notebook

I'm trying to recreate the first panel.interact example in the Holoviz tutorial using a Pandas dataframe instead of a Dask dataframe. I get the slider, but the pandas dataframe row does not show.
See the original example at: http://holoviz.org/tutorial/Building_Panels.html
I've tried using Dask as in the Holoviz example. Dask rows print out just fine, which suggests that panel treats Dask dataframe rows differently from Pandas dataframe rows when printing. Here's my minimal code:
import pandas as pd
import panel
l1 = ['a','b','c','d','a','b']
l2 = [1,2,3,4,5,6]
df = pd.DataFrame({'cat':l1,'val':l2})
def select_row(rowno=0):
    row = df.loc[rowno]
    return row
panel.extension()
panel.extension('katex')
panel.interact(select_row, rowno=(0, 5))
I've included a line with the katex extension, because without it, I get a warning that it is needed. Without it, I don't even get the slider.
I can call the select_row(rowno=0) function separately in a Jupyter cell and get a nice printout of the row, so it appears the function is working as it should.
Any help in getting this to work would be most appreciated. Thanks.
Got a solution. With Pandas, loc[rowno:rowno] returns a pandas.core.frame.DataFrame object of length 1 which works fine with panel while loc[rowno] returns a pandas.core.series.Series object which does not work so well. Thus modifying the select_row() function like this makes it all work:
def select_row(rowno=0):
    row = df.loc[rowno:rowno]
    return row
Still not sure, however, why panel will print out the DataFrame object and not the Series object.
Note: loc slices include the end label, while iloc slices exclude the end position, so with iloc you have to add 1, i.e., df.iloc[rowno:rowno+1].
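The difference between the three selections can be checked without panel. The short sketch below reuses the dataframe from the question and only inspects the types and shapes pandas returns; the variable names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'cat': ['a', 'b', 'c', 'd', 'a', 'b'],
                   'val': [1, 2, 3, 4, 5, 6]})
rowno = 2

as_series = df.loc[rowno]                   # scalar label -> Series
frame_loc = df.loc[rowno:rowno]             # label slice, end included -> DataFrame
frame_iloc = df.iloc[rowno:rowno + 1]       # position slice, end excluded -> DataFrame

print(type(as_series).__name__, type(frame_loc).__name__, frame_loc.shape)
```

Both slices give the same one-row dataframe here because the index is the default 0..5 range, so labels and positions coincide.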

Pandas - format of values

Here is my file, which I read in to pandas dataframe using read_csv.
import pandas as pd
df_from =pd.read_csv(r'some path')
If you look at the file, the values in the 'NetworthIndicator_Rollup' column look good, but when I display the dataframe in Jupyter or plot a heatmap, some of the index values are not formatted correctly. Here is a preview of the incorrect formatting:
In particular, 'NetworthIndicator_Rollup' is missing spaces. However, when I execute df_from['NetworthIndicator_Rollup'], everything looks good:
What is wrong?

How to Render Math Table Properly in IPython Notebook

The math problem that I'm solving gives different analytical solutions in different scenarios, and I would like to summarize the result in a nice table. IPython Notebook renders the list nicely:
for example:
import sympy
from pandas import DataFrame
from sympy import *
init_printing()
a, b, c, d = symbols('a b c d')
t = [[a/b, b/a], [c/d, d/c]]
t
However, when I summarize the answers into a table using DataFrame, the math cannot be rendered any more:
df = DataFrame(t, index=['Situation 1', 'Situation 2'], columns=['Answer1','Answer2'])
df
"print df.to_latex()" also gives the same result. I also tried "print(latex(t))" but it gives this after compiling in LaTex, which is alright, but I still need to manually convert it to a table:
How should I use DataFrame properly in order to render the math properly? Or is there any other way to export the math result into a table in Latex? Thanks!
Update: 01/25/14
Thanks again to @Jakob for solving the problem. It works perfectly for simple matrices, though there are still some minor problems for more complicated math expressions. But I guess, like @asmeurer said, perfection requires an update in IPython and Pandas.
Update: 01/26/14
If I render the result directly, i.e. just print the list, it works fine:
MathJax is currently not able to render tables, hence the most obvious approach (pure latex) does not work.
However, following the advice of @asmeurer, you should use an HTML table and render the cell content as LaTeX. In your case this could be easily achieved by the following intermediate step:
from sympy import latex
tl = list(map(lambda tc: '$'+latex(tc)+'$', t))  # list() is needed on Python 3, where map returns an iterator
df = DataFrame(tl, index=['Situation 1', 'Situation 2'], columns=['Answer'])
df
which gives:
Update:
In case of two dimensional data, the simple map function will not work directly. To cope with this situation the numpy shape, reshape and ravel functions could be used like:
import numpy as np
t = [[a/b, b/a],[a*a,b*b]]
tl = np.reshape(list(map(lambda tc: '$'+latex(tc)+'$', np.ravel(t))), np.shape(t))  # list() is needed on Python 3
df = DataFrame(tl, index=['Situation 1', 'Situation 2'], columns=['Answer 1','Answer 2'])
df
This gives:
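As an alternative to the ravel/reshape round trip, the same two-dimensional table can be built with a nested list comprehension. This is only a sketch under the assumption that every cell is a sympy expression; it uses the same symbols and labels as the example above.

```python
from pandas import DataFrame
from sympy import symbols, latex

a, b = symbols('a b')
t = [[a/b, b/a], [a*a, b*b]]

# Wrap the LaTeX form of every cell in $...$, keeping the row structure.
tl = [['$' + latex(cell) + '$' for cell in row] for row in t]

df = DataFrame(tl, index=['Situation 1', 'Situation 2'],
               columns=['Answer 1', 'Answer 2'])
print(df)
```

The cells are plain strings at this point; whether they render as math still depends on the notebook front end processing the $...$ markers.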
Update 2:
Pandas crops cell content if the string length exceeds a certain number. E.g., a more complicated expression like
t1 = [a/2+b/2+c/2+d/2]
tl = np.reshape(list(map(lambda tc: '$'+latex(tc)+'$', np.ravel(t1))), np.shape(t1))  # list() is needed on Python 3
df = DataFrame(tl, index=['Situation 1'], columns=['Answer 1'])
df
gives:
To cope with this issue a pandas package option has to be altered, for details see here. For the present case the max_colwidth has to be changed. The default value is 50, hence let's change it to 100:
import pandas as pd
pd.options.display.max_colwidth=100
df
gives:
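If the wider setting is only wanted for a single display, the option can also be changed temporarily. The sketch below uses pd.option_context with a made-up 80-character cell in place of a long LaTeX string, and compares the text representation with and without the wider limit.

```python
import pandas as pd

# Hypothetical 80-character cell standing in for a long LaTeX expression.
long_cell = '$' + 'x' * 78 + '$'
df = pd.DataFrame([long_cell], index=['Situation 1'], columns=['Answer 1'])

# With the default max_colwidth the cell is cut off in the representation.
default_repr = repr(df)

# Inside the with-block the limit is 100; it is restored afterwards.
with pd.option_context('display.max_colwidth', 100):
    wide_repr = repr(df)

print(wide_repr)
```

Outside the with-block the option keeps its previous value, so other dataframes in the notebook are displayed as before.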