Dask .loc only the first result (iloc[0]) - pandas

Sample dask dataframe:
import pandas as pd
import dask
import dask.dataframe as dd
df = pd.DataFrame({'col_1': [1,2,3,4,5,6,7], 'col_2': list('abcdefg')},
                  index=pd.Index([0,0,1,2,3,4,5]))
df = dd.from_pandas(df, npartitions=2)
Now I would like to get back only the first result (based on the index), like this in pandas:
df.loc[df.col_1 >3].iloc[0]
   col_1 col_2
2      4     d
I know there is no positional row indexing in dask using iloc, but I wonder if it would be possible to limit the query to 1 result like in SQL?

Got it, though I am not sure how efficient this is (note that it returns every row that shares the smallest index value):
tmp = df.loc[df.col_1 >3]
tmp.loc[tmp.index == tmp.index.min().compute()].compute()

Related

Display dataframe index name with Streamlit

The following code does not display the name of the index:
import pandas as pd
import streamlit as st
df = pd.DataFrame(['row1', 'row2'], index=pd.Index([1, 2], name='my_index'))
st.write(df)
Is there a way to have my_index displayed like you would do in a jupyter notebook?
According to the Streamlit docs, st.write renders a dataframe as a table, so the index name is not shown.
To show my_index, reset the index so that my_index becomes a regular column. Add the following before st.write():
df.reset_index(inplace=True)
I found a solution using pandas dataframe to_html() method:
import pandas as pd
import streamlit as st
df = pd.DataFrame(['row1', 'row2'], index=pd.Index([1, 2], name='my_index'))
st.write(df.to_html(), unsafe_allow_html=True)
This renders the table as HTML, with my_index shown in its own header row below the column labels.
If you want the index and columns names to be in the same header row you can use the following code:
import pandas as pd
import streamlit as st
df = pd.DataFrame(['row1', 'row2'], index=pd.Index([1, 2], name='my_index'))
df.columns.name = df.index.name
df.index.name = None
st.write(df.to_html(), unsafe_allow_html=True)
This renders the table with my_index in the same header row as the column labels.
Note: if you have a large dataset and want to limit the number of rows, use df.to_html(max_rows=N) instead, where N is the number of rows you want to display.
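As a small sketch of what max_rows does, the snippet below builds the HTML string with pandas alone so it can be inspected without running Streamlit. The ten-row frame and the column name value are made up for illustration; in a Streamlit app the resulting string would be passed to st.write(html, unsafe_allow_html=True) as in the answer above.

```python
import pandas as pd

df = pd.DataFrame(
    {'value': [f'row{i}' for i in range(10)]},
    index=pd.Index(range(1, 11), name='my_index'),
)

# Keep at most 4 data rows in the rendered table; pandas shows the
# first and last rows and replaces the middle with an ellipsis row.
html = df.to_html(max_rows=4)
print(html)
```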

How to replicate the between_time function of Pandas in PySpark

I want to replicate the between_time function of Pandas in PySpark.
Is it possible since in Spark the dataframe is distributed and there is no indexing based on datetime?
i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
ts.between_time('0:45', '0:15')
Is something similar possible in PySpark?
pandas.between_time - API
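If you have a timestamp column, say ts, in a Spark dataframe, you can filter on its hour and minute. Note that the filter below keeps times from 0:15 to 0:45, which corresponds to between_time('0:15', '0:45'). In your call the start (0:45) is later than the end (0:15), so pandas wraps around midnight and returns roughly the complement; for that case, negate the condition with ~ (the boundary times themselves are then excluded):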
import pyspark.sql.functions as F
df2 = df.filter(F.hour(F.col('ts')).between(0,0) & F.minute(F.col('ts')).between(15,45))
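To make the wrap-around behaviour concrete, here is a pandas-only sketch that rebuilds between_time('0:45', '0:15') from a minutes-since-midnight comparison. The same comparison could presumably be written in Spark with F.hour and F.minute; that translation is an assumption and is not shown here. Seconds are ignored, and the timestamps are written out explicitly instead of using date_range.

```python
import pandas as pd

i = pd.to_datetime([
    '2018-04-09 00:00', '2018-04-10 00:20',
    '2018-04-11 00:40', '2018-04-12 01:00',
])
ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)

# Minutes since midnight for every row.
minutes = ts.index.hour * 60 + ts.index.minute

# The start (0:45) is later than the end (0:15), so the window wraps
# past midnight: keep rows at or after 0:45, or at or before 0:15.
wrapped = ts[(minutes >= 45) | (minutes <= 15)]

# Reference result from pandas itself.
expected = ts.between_time('0:45', '0:15')
print(wrapped)
```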

How can I get an interpolated value from a Pandas data frame?

I have a simple Pandas data frame with two columns, 'Angle' and 'rff'. I want to get an interpolated 'rff' value based on entering an Angle that falls between two Angle values (i.e. between two index values) in the data frame. For example, I'd like to enter 3.4 for the Angle and then get an interpolated 'rff'. What would be the best way to accomplish that?
import pandas as pd
data = [[1.0,45.0], [2,56], [3,58], [4,62],[5,70]] #Sample data
s= pd.DataFrame(data, columns = ['Angle', 'rff'])
print(s)
s = s.set_index('Angle') #Set 'Angle' as index
print(s)
result = s.at[3.0, "rff"]
print(result)
You may use numpy:
import numpy as np
np.interp(3.4, s.index, s.rff)
#59.6
You could use numpy for this:
import numpy as np
import pandas as pd
data = [[1.0,45.0], [2,56], [3,58], [4,62],[5,70]] #Sample data
s= pd.DataFrame(data, columns = ['Angle', 'rff'])
print(s)
print(np.interp(3.4, s.Angle, s.rff))
>>> 59.6
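One caveat worth sketching: np.interp expects the x values to be increasing, and by default it returns the first or last y value for inputs outside the table. The helper below (rff_at is a hypothetical name) sorts by Angle first and asks for NaN outside the tabulated range.

```python
import numpy as np
import pandas as pd

# Same sample data, deliberately shuffled to show why sorting matters.
data = [[4, 62], [1.0, 45.0], [5, 70], [2, 56], [3, 58]]
s = pd.DataFrame(data, columns=['Angle', 'rff']).sort_values('Angle')

def rff_at(angle):
    # left/right=np.nan flags angles outside the table instead of
    # silently returning the first or last rff value.
    return np.interp(angle, s['Angle'], s['rff'], left=np.nan, right=np.nan)

print(rff_at(3.4))  # linear interpolation between Angle 3 and 4
print(rff_at(7.0))  # outside the table
```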

How to access dask dataframe index value in map_partitions?

I am trying to use dask dataframe map_partitions to apply a function that accesses the value in the dataframe index row-wise and creates a new column.
Below is the code I tried.
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(index = ["row0" , "row1","row2","row3","row4"])
df
ddf = dd.from_pandas(df, npartitions=2)
res = ddf.map_partitions(lambda df: df.assign(index_copy= str(df.index)),meta={'index_copy': 'U' })
res.compute()
I am expecting df.index to be the value in the row index, not the entire partition index, which is what it seems to refer to. From the doc here, this works well for columns but not for the index.
What you want to do is this:
df.index = ['row'+str(x) for x in df.index]
First create your pandas dataframe, then run this line; after that you should have your expected result.
Let me know if this works for you.
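For the new-column case described in the question, a minimal sketch: each partition that map_partitions hands to the function is a plain pandas DataFrame, so the function can be tried out on pandas directly. The name add_index_copy is hypothetical, and the dask call in the final comment is an untested assumption about how it would be wired in.

```python
import pandas as pd

def add_index_copy(partition):
    # partition.index holds one label per row; astype(str) converts each
    # label individually, unlike str(partition.index), which formats the
    # whole Index object as a single string.
    return partition.assign(index_copy=partition.index.astype(str))

df = pd.DataFrame(index=["row0", "row1", "row2", "row3", "row4"])
result = add_index_copy(df)
print(result)

# With dask, the same function would be applied per partition, e.g.
# ddf.map_partitions(add_index_copy)
```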

how to extract the unique values and its count of a column and store in data frame with index key

I am new to pandas. I have a simple question: how do I extract the unique values of a column and their counts, and store them in a data frame with an index key?
I have tried:
df = df1['Genre'].value_counts()
and I am getting a Series, but I don't know how to convert it to a DataFrame object.
Pandas series has a .to_frame() function. Try it:
df = df1['Genre'].value_counts().to_frame()
And if you wanna "switch" the rows to columns:
df = df1['Genre'].value_counts().to_frame().T
Update: Full example if you want them as columns:
import pandas as pd
import numpy as np
np.random.seed(400) # To reproduce random variables
df1 = pd.DataFrame({
    'Genre': np.random.choice(['Comedy','Drama','Thriller'], size=10)
})
df = df1['Genre'].value_counts().to_frame().T
print(df)
Returns:
       Thriller  Comedy  Drama
Genre         5       3      2
Try:
df = pd.DataFrame(df1['Genre'].value_counts())
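If the goal is a frame with the unique values and their counts as two ordinary columns, one possible sketch is to name the index and reset it. The sample data and the column name count are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Genre': ['Comedy', 'Drama', 'Thriller', 'Drama', 'Thriller', 'Thriller']
})

# value_counts() returns a Series indexed by the unique values, sorted by
# count in descending order. rename_axis names that index, and
# reset_index(name=...) turns index and values into two columns.
counts = (
    df1['Genre']
    .value_counts()
    .rename_axis('Genre')
    .reset_index(name='count')
)
print(counts)
```

The result keeps a default integer index, which can serve as the index key.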