Pandas - format of values

Here is my file, which I read into a pandas dataframe using read_csv.
import pandas as pd
df_from = pd.read_csv(r'some path')
If you look at the file, the column ('NetworthIndicator_Rollup') values look good, but when I display the dataframe in Jupyter or plot a heatmap, some of the index values are not correctly formatted. Here is a preview of the incorrect formatting:
In particular, 'NetworthIndicator_Rollup' is missing spaces. However, when I execute df_from['NetworthIndicator_Rollup'], everything looks good:
What is wrong?

Related

Replacing whole string with part of it using Regex in Python Pandas

I have a table in a PDF found at this link: https://taxation-customs.ec.europa.eu/system/files/2022-11/tobacco_products_releases-consumption.pdf
I am trying to clean the data before doing analysis, but I noticed that between 2014 and 2017 the cigarette data was merged due to an error. Instead of two cells per year in a column for Sweden and the UK, I got one merged cell that looks something like this: 5393688\r28587000
I would like to update the data only for Sweden and keep the first value, before the \r.
So far my code is as follows:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import tabula  # tabula-py is needed for tabula.read_pdf below
cig = pd.DataFrame(tabula.read_pdf(r"https://taxation-customs.ec.europa.eu/system/files/2022-11/tobacco_products_releases-consumption.pdf", pages='all')[0])
cig.replace(to_replace='N/A', value=0, inplace=True, regex=True)
cig = cig.replace(',', '', regex=True)
After this I tried
df.iloc[26,:].str.replace("('\r').*","")
cig.iloc[26,:] = cig.iloc[26,:].replace("('\r').*","", regex=True)
and
cig.iloc[26,:].replace(to_replace='(?:[0-9]+)([^0-9]{2})([0-9]+)', value='', regex=True)
But none of the above seems to produce the desired result, and I still have values in the same format, i.e. 5393688\r28587000
Use a pattern that actually matches the carriage return, set regex=True, and assign the changed subset back to the dataframe:
cig.iloc[26,:] = cig.iloc[26,:].replace(r"\r.*", "", regex=True)
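As a quick check, here is a sketch on a made-up Series, assuming the merged cells contain a real carriage-return character (which is how tabula usually extracts them):
import pandas as pd
# hypothetical sample mimicking the merged Sweden/UK cells
row = pd.Series(["5393688\r28587000", "4120000\r26000000"])
# drop everything from the carriage return onward, keeping the first value
cleaned = row.replace(r"\r.*", "", regex=True)
print(cleaned.tolist())  # ['5393688', '4120000']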

display output pandas dataframe in rstudio

One question please.
I like to use RStudio to code in Python and R, but when I print a pandas dataframe, the output doesn't use all of the available width. It is not very readable, and it gets worse when I have more variables.
As shown in the attached image.
Is there a way to display the columns to the right, as we do with a tibble in R?
Thanks!
I have tried using these options, but they don't work.
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

Reading BED files into pandas dataframe (windows)

For a bioinformatics project, I would like to read a .BED file into a pandas dataframe, but I have no clue how to do it or what tools/programs are required. Nothing I found on the internet was really applicable to me, as I am working on Windows 10 with Python 3.7 (Anaconda distribution).
Any help would be appreciated.
According to https://software.broadinstitute.org/software/igv/BED:
A BED file (.bed) is a tab-delimited text file that defines a feature track.
According to http://genome.ucsc.edu/FAQ/FAQformat#format1, it contains up to 12 fields (columns) and possibly comment lines starting with the word 'track'. The following is a minimal program to read such a BED file into a pandas dataframe.
import pandas as pd
# lines beginning with 'track' are dropped via comment='t' (see the note below)
df = pd.read_csv('so58178958.bed', sep='\t', comment='t', header=None)
# the 12 standard BED field names; keep only as many as the file actually has
header = ['chrom', 'chromStart', 'chromEnd', 'name', 'score', 'strand', 'thickStart', 'thickEnd', 'itemRgb', 'blockCount', 'blockSizes', 'blockStarts']
df.columns = header[:len(df.columns)]
This is just a very simple snippet: comment='t' tells pandas to ignore everything from a 't' onward, which drops the 'track' lines. It works as long as the data fields themselves contain no 't'; the 'chrom' entries start with either a 'c', an 's' or a digit.
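To illustrate, here is a self-contained sketch using a small made-up BED snippet (the feature names deliberately contain no 't' so the comment trick applies cleanly):
import io
import pandas as pd
# hypothetical in-memory BED content with a 'track' header line
bed_text = (
    "track name=example\n"
    "chr1\t100\t200\tgene1\t0\t+\n"
    "chr2\t300\t400\tgene2\t0\t-\n"
)
df = pd.read_csv(io.StringIO(bed_text), sep='\t', comment='t', header=None)
header = ['chrom', 'chromStart', 'chromEnd', 'name', 'score', 'strand', 'thickStart', 'thickEnd', 'itemRgb', 'blockCount', 'blockSizes', 'blockStarts']
df.columns = header[:len(df.columns)]
print(df)  # two rows, six columns; the 'track' line is gone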
If you use pyranges, the dataframe will be given column names and appropriate data types.
import pyranges as pr
df = pr.read_bed("your.bed", as_df=True)
It also has readers for untidy bioinformatics formats such as GTFs and GFF3s.

Holoviz panel will not print pandas dataframe row in Jupyter notebook

I'm trying to recreate the first panel.interact example in the Holoviz tutorial using a Pandas dataframe instead of a Dask dataframe. I get the slider, but the pandas dataframe row does not show.
See the original example at: http://holoviz.org/tutorial/Building_Panels.html
I've tried using Dask as in the Holoviz example. Dask rows print out just fine, but that demonstrates that panel seems to treat Dask dataframe rows differently from Pandas dataframe rows when printing. Here's my minimal code:
import pandas as pd
import panel
l1 = ['a','b','c','d','a','b']
l2 = [1,2,3,4,5,6]
df = pd.DataFrame({'cat':l1,'val':l2})
def select_row(rowno=0):
    row = df.loc[rowno]
    return row
panel.extension()
panel.extension('katex')
panel.interact(select_row, rowno=(0, 5))
I've included a line with the katex extension, because without it, I get a warning that it is needed. Without it, I don't even get the slider.
I can call the select_row(rowno=0) function separately in a Jupyter cell and get a nice printout of the row, so it appears the function is working as it should.
Any help in getting this to work would be most appreciated. Thanks.
Got a solution. With Pandas, loc[rowno:rowno] returns a pandas.core.frame.DataFrame object of length 1, which works fine with panel, while loc[rowno] returns a pandas.core.series.Series object, which does not work so well. Thus, modifying the select_row() function like this makes it all work:
def select_row(rowno=0):
    row = df.loc[rowno:rowno]
    return row
Still not sure, however, why panel will print out the DataFrame object and not the Series object.
Note: if you use iloc, you need to add 1 to the end of the slice, i.e., df.iloc[rowno:rowno+1].
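If you would rather keep the single-label lookup, another option that should also work (a common pandas idiom, not from the original answer) is to turn the returned Series back into a one-row DataFrame:
def select_row(rowno=0):
    # df.loc[rowno] is a Series; to_frame().T turns it into a one-row DataFrame,
    # which panel renders as a table
    return df.loc[rowno].to_frame().T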

Pandas- Groupby Plot is not working for object

I am new to Pandas and doing some analysis on a CSV file. I have successfully read the CSV and shown all the details. I have two columns of object type which I need to plot. I have done a groupby on those two columns and can get the first and all the data; however, I am not sure how to do plotting for these object types in Pandas. Below is my sample groupby for event_type and event_description, which I need to plot. If I can plot Application and Network for event_type, that will be a great help.
import pandas as pd
data = pd.read_csv('/Users/temp/Downloads/sample.csv')
data.head()
grouped_df = data.groupby([ "event_type", "event_description"])
grouped_df.first()
As commented - need more info, but IIUC, try:
data['event_type'].value_counts(sort=True).plot(kind='barh')
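If the goal is to break event_type down by event_description as well, a sketch along the same lines (column names taken from the question, and assuming data is the dataframe read above):
# count rows per (event_type, event_description) pair and plot the result
counts = data.groupby(['event_type', 'event_description']).size().unstack(fill_value=0)
counts.plot(kind='barh', stacked=True)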