Parse list and create DataFrame - pandas

I have been given a list called data which has the following content
data=[b'Name,Age,Occupation,Salary\r\nRam,37,Plumber,1769\r\nMohan,49,Elecrician,3974\r\nRahim,39,Teacher,4559\r\n']
I wanted to have a pandas dataframe which looks like the link
Expected Dataframe
How can I achieve this.

You can try this:
data=[b'Name,Age,Occupation,Salary\r\nRam,37,Plumber,1769\r\nMohan,49,Elecrician,3974\r\nRahim,39,Teacher,4559\r\n']
processed_data = [x.split(',') for x in data[0].decode().replace('\r', '').strip().split('\n')]
df = pd.DataFrame(columns=processed_data[0], data=processed_data[1:])
Hope it helps.

I would recommend you to convert this list to string as there is only one index in this list
str1 = ''.join(data)
Then use solution provided here
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
import pandas as pd
TESTDATA = StringIO(str1)
df = pd.read_csv(TESTDATA, sep=",")

Related

Display dataframe index name with Streamlit

The following code does not display the name of the index:
import pandas as pd
import streamlit as st
df = pd.DataFrame(['row1', 'row2'], index=pd.Index([1, 2], name='my_index'))
st.write(df)
Is there a way to have my_index displayed like you would do in a jupyter notebook?
According to the streamlit doc it will write dataframe as a table. So the index name is not shown.
To show the my_index name, reset the index to default and as a result the my_index will become a normal column. Add the following before st.write().
df.reset_index(inplace=True)
Output
I found a solution using pandas dataframe to_html() method:
import pandas as pd
import streamlit as st
df = pd.DataFrame(['row1', 'row2'], index=pd.Index([1, 2], name='my_index'))
st.write(df.to_html(), unsafe_allow_html=True)
This results with the following output:
If you want the index and columns names to be in the same header row you can use the following code:
import pandas as pd
import streamlit as st
df = pd.DataFrame(['row1', 'row2'], index=pd.Index([1, 2], name='my_index'))
df.columns.name = df.index.name
df.index.name = None
st.write(df.to_html(), unsafe_allow_html=True)
This results with the following output:
Note - if you have a large dataset and want to limit the number of rows use df.to_html(max_rows=N) instead where N is the number of rows you want to dispplay.

convert csv file to nc(netcdf) file using xarray

I want to convert the CSV file storing date, temperature values, latitude, and longitude information to NetCDF file format having three-dimension.
My dataframe is like:
When I use this script below it only contains 1 dimension.
import pandas as pd
import xarray as xr
df = pd.read_csv(csv_file)
xr = df.to_xarray()
nc=xr.to_netcdf('my_netcdf.nc')
Can you help me with that?
Thank you.
You will need to set time etc. as indices in pandas first. So modify your code to something like this:
import pandas as pd
import xarray as xr
df = pd.read_csv(csv_file)
df = df.set_index(["time", "lon", "lat"]
xr = df.to_xarray()
nc=xr.to_netcdf('my_netcdf.nc')

Lambdas function on multiple columns

I am trying to extract only number from multiple columns in my pandas data.frame.
I am able to do so one-by-one columns however I would like to perform this operation simultaneously to multiple columns
My reproduced example:
import pandas as pd
import re
import numpy as np
import seaborn as sns
df = sns.load_dataset('diamonds')
# Create columns one again
df['clarity2'] = df['clarity']
df.head()
df[['clarity', 'clarity2']].apply(lambda x: x.str.extract(r'(\d+)'))
If you want a tuple
cols = ['clarity', 'clarity2']
tuple(df[col].str.extract(r'(\d+)') for col in cols)
If you want a list
cols = ['clarity', 'clarity2']
[df[col].str.extract(r'(\d+)') for col in cols]
adding them to the original data
df['digit1'], df['digit2'] = [df[col].str.extract(r'(\d+)') for col in cols]

Transfer a df to a new one and change the context of a column

I have one dataframe df_test and I want to parse all the columns into a new df.
Also I want with if else statement to modify one column's context.
Tried this:
import pyspark
import pandas as pd
from pyspark.sql import SparkSession
df_cast= df_test.withColumn('account_id', when(col("account_id") == 8, "teo").when(col("account_id") == 9, "liza").otherwise(' '))
But it gives me this error:
NameError: name 'when' is not defined
Thanks in advance
At the start of your code, you should import the pyspark sql functions. The following, for example, would work:
import pyspark.sql.functions as F
import pyspark
import pandas as pd
from pyspark.sql import SparkSession
df_cast= df_test.withColumn('account_id', F.when(col("account_id") == 8, "teo").F.when(col("account_id") == 9, "liza").otherwise(' '))

How can I get an interpolated value from a Pandas data frame?

I have a simple Pandas data frame with two columns, 'Angle' and 'rff'. I want to get an interpolated 'rff' value based on entering an Angle that falls between two Angle values (i.e. between two index values) in the data frame. For example, I'd like to enter 3.4 for the Angle and then get an interpolated 'rff'. What would be the best way to accomplish that?
import pandas as pd
data = [[1.0,45.0], [2,56], [3,58], [4,62],[5,70]] #Sample data
s= pd.DataFrame(data, columns = ['Angle', 'rff'])
print(s)
s = s.set_index('Angle') #Set 'Angle' as index
print(s)
result = s.at[3.0, "rff"]
print(result)
You may use numpy:
import numpy as np
np.interp(3.4, s.index, s.rff)
#59.6
You could use numpy for this:
import numpy as np
import pandas as pd
data = [[1.0,45.0], [2,56], [3,58], [4,62],[5,70]] #Sample data
s= pd.DataFrame(data, columns = ['Angle', 'rff'])
print(s)
print(np.interp(3.4, s.Angle, s.rff))
>>> 59.6