How to show truncated form of large pandas dataframe after style.apply? - pandas

Normally, a relatively long dataframe like
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 10, (100, 2)))
df
is displayed in truncated form in a Jupyter notebook, with the head, the tail, an ellipsis in between, and the row and column count at the end.
However, after style.apply
def highlight_max(x):
    return ['background-color: yellow' if v == x.max() else '' for v in x]
df.style.apply(highlight_max)
all rows are displayed.
Is it possible to still display the truncated form of the dataframe after style.apply?

Something simple like this?
def display_df(dataframe, function):
    display(dataframe.head().style.apply(function))
    display(dataframe.tail().style.apply(function))
    print(f'{dataframe.shape[0]} rows x {dataframe.shape[1]} columns')
display_df(df, highlight_max)
**** EDIT ****
def display_df(dataframe, function):
    display(pd.concat([dataframe.iloc[:5, :],
                       pd.DataFrame(index=['...'], columns=dataframe.columns),
                       dataframe.iloc[-5:, :]]).style.apply(function))
    print(f'{dataframe.shape[0]} rows x {dataframe.shape[1]} columns')
display_df(df, highlight_max)
The Jupyter preview is basically something like this:
def display_df(dataframe):
    display(pd.concat([dataframe.iloc[:5, :],
                       pd.DataFrame(index=['...'], columns=dataframe.columns, data={0: '...', 1: '...'}),
                       dataframe.iloc[-5:, :]]))
But if you try to apply the style to this frame, you get an error (TypeError: '>=' not supported between instances of 'int' and 'str'), because computing the column maximum compares the placeholder strings '...' with the integers.
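One possible way around that error is to build the CSS separately from the numeric rows and only then add the placeholder row, so '...' is never compared with a number. The sketch below is an illustration, not part of the original answer: the helper names preview_with_css and styled_preview are hypothetical, it assumes the frame has more than 2 * n rows, and it takes the column maxima from the full frame rather than from the visible rows.

```python
import numpy as np
import pandas as pd

YELLOW = "background-color: yellow"


def _max_css(part, col_max):
    """CSS frame shaped like `part`: yellow where a cell equals its column max."""
    is_max = part.eq(col_max, axis="columns").to_numpy()
    return pd.DataFrame(np.where(is_max, YELLOW, ""),
                        index=part.index, columns=part.columns)


def preview_with_css(df, n=5):
    """Return (preview, css); assumes len(df) > 2 * n (hypothetical helper)."""
    col_max = df.max()  # maxima of the full frame, not of the preview
    head, tail = df.iloc[:n], df.iloc[-n:]
    # Placeholder row for the display, with an empty style to match it.
    gap = pd.DataFrame("...", index=["..."], columns=df.columns)
    gap_css = pd.DataFrame("", index=["..."], columns=df.columns)
    preview = pd.concat([head, gap, tail])
    css = pd.concat([_max_css(head, col_max), gap_css, _max_css(tail, col_max)])
    return preview, css


def styled_preview(df, n=5):
    """Styler for the truncated preview; display the result in a notebook."""
    preview, css = preview_with_css(df, n)
    # axis=None hands over the whole frame; we return a same-shaped CSS frame.
    return preview.style.apply(lambda _: css, axis=None)


# Small deterministic frame: column "a" peaks in the last row, "b" in the first.
demo = pd.DataFrame({"a": range(20), "b": range(19, -1, -1)})
preview, css = preview_with_css(demo, n=3)
```

In a notebook, styled_preview(df) followed by a print of df.shape would then stand in for the default truncated view.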

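You can slice the dataframe with head or tail before styling it. This gives you more control over what you display each time. A Styler object itself has no head, tail, or sample method, so the slicing has to happen on the dataframe, and the maximum is then computed over the displayed rows only.
output = df.head(10).style.apply(highlight_max)  # 10 -> number of rows to display
output
If you want to see more varied data you can also use sample, which picks random rows:
df.sample(10).style.apply(highlight_max)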

Related

create n dataframes in a for loop with an extra column with a specific number in it

Hi all, I have a dataframe like the one shown in the picture:
I am trying to create 2 different dataframes, each with the "hour" and "minute" columns plus one value column ("value" and "value.1" respectively), and with an added column holding the number 0 and 1 respectively. I would like to do it in a for loop because I want to create n dataframes (not just the 2 shown here).
I tried something like this but it's not working (error: KeyError: "['value.i'] not in index"):
for i in range(1):
    series[i] = df_new[['hour', 'minute', 'value.i']]
    series[i].insert(0, 'number', 'i')
Can you help me?
Thanks
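From what I understand, you want 'value.i' to become 'value.1', 'value.2', and so on. Inside plain quotes, i is just a letter; an f-string substitutes the loop variable instead:
for i in range(1):  # note: range(1) only yields i = 0; use range(n) for n frames
    # the f prefix makes Python substitute the value of i into the column name
    series[i] = df_new[['hour', 'minute', f'value.{i}']]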
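Putting the pieces together, a minimal sketch might look like the following. It assumes the value columns are named "value", "value.1", "value.2", ... (the pattern pandas produces for duplicated headers), uses made-up sample data, and stores the results in a dict keyed by the number; the function name split_by_value_column is hypothetical.

```python
import pandas as pd

# Made-up stand-in for the frame in the question.
df_new = pd.DataFrame({
    "hour": [0, 0, 1],
    "minute": [0, 30, 0],
    "value": [1.0, 2.0, 3.0],
    "value.1": [10.0, 20.0, 30.0],
})


def split_by_value_column(frame, n):
    """Return {i: frame with hour, minute, the i-th value column, and number=i}."""
    frames = {}
    for i in range(n):
        # Assumption: the first column has no suffix, later ones are value.1, value.2, ...
        col = "value" if i == 0 else f"value.{i}"
        part = frame[["hour", "minute", col]].copy()  # copy so insert does not touch df_new
        part.insert(0, "number", i)  # the integer i, not the string 'i'
        frames[i] = part
    return frames


frames = split_by_value_column(df_new, 2)
```

A dict avoids having to create numbered variable names, and frames[1] gives the second dataframe directly.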

Get count vectorizer vocabulary in new dataframe column by applying vectorizer on existing dataframe column using pandas

I have a dataframe column 'review' with content like 'Food was Awesome', and I want a new column that counts the number of repetitions of each word.
name      The First Years Massaging Action Teether
review    A favorite in our house!
rating    5
Name: 269, dtype: object
I expect output like {'Food': 1, 'was': 1, 'Awesome': 1}
I tried a for loop, but it's taking too long to execute:
for row in range(products.shape[0]):
    try:
        count_vect.fit_transform([products['review_without_punctuation'][row]])
        products['word_count'][row]=count_vect.vocabulary_
    except:
        print(row)
I would like to do it without for loop.
I found a solution for this.
I defined a function like this:
def Vectorize(text):
    try:
        count_vect.fit_transform([text])
        return count_vect.vocabulary_
    except Exception:
        return -1
and applied it to the column:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
products['word_count'] = products['review_without_punctuation'].apply(Vectorize)
This solution worked and I got the vocabulary in a new column.
You can get the count vector for all docs like this:
cv = CountVectorizer()
count_vectors = cv.fit_transform(products['review_without_punctuation'])
To get the count vector in array format for a particular document by index, say, the 1st doc,
count_vectors[0].toarray()
The vocabulary is in
cv.vocabulary_
To get the words that make up a count vector, say, for the 1st doc, use
cv.inverse_transform(count_vectors[0])
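One caveat: vocabulary_ maps each term to its column index in the count matrix, not to how often the term occurs. If per-row word frequencies are the goal, a sketch using collections.Counter (a swapped-in technique, not CountVectorizer itself) could look like this. The regex and the lowercasing mirror CountVectorizer's defaults; the function name word_counts and the sample reviews are made up for illustration.

```python
import re
from collections import Counter

import pandas as pd

# Same pattern as CountVectorizer's default token_pattern: words of 2+ characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")


def word_counts(text):
    """Return {word: count} for one review; empty dict for missing values."""
    if not isinstance(text, str):
        return {}
    return dict(Counter(TOKEN.findall(text.lower())))


# Made-up sample data using the column name from the question.
products = pd.DataFrame({
    "review_without_punctuation": ["Food was Awesome", "good food good price", None],
})
products["word_count"] = products["review_without_punctuation"].apply(word_counts)
```

This still calls a Python function once per row, but it avoids refitting a vectorizer for every review.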

delete rows from pandas data frame that contains one of its columns as list , when one of its values match value in another compared list

I want to delete rows from a pandas data frame in which one column holds lists, whenever any value in a row's list matches a value in a list column of another data frame.
The first data frame's column and the other data frame's column were shown as images in the original post.
I have tried several approaches:
Revdf=Revdf.drop(lambda x: [i for i in Revdf.AffiliationHistory if i in Authdf.Affiliations.values], axis=1)
or
Revdf=Revdf[~(Revdf.AffiliationHistory.isin(Authdf.Affiliations.values))]
but neither of these works.
There has to be an easier way, but I wrote a function for it and it works:
def remove_row(df1, x1, y1, df2, x2, y2):
    l1 = df1.loc[x1, y1]
    l2 = df2.loc[x2, y2]
    assert isinstance(l1, list), "type has to be list"
    assert isinstance(l2, list), "type has to be list"
    # drop row x1 from df1 if the two lists share at least one value
    if any(i in l2 for i in l1):
        return df1.drop(x1)
    return df1
Here x is the row index and y is the column name. I tried it on synthetic data and it works:
df1 = pd.DataFrame({'col1': [0, 0, 0, 0, 1],
                    'col2': [[1, 2, 3, 4], 0, 0, 0, 0]})
df2 = pd.DataFrame({'col1': [0, 0, 0, 0],
                    'col2': [[0, 0, 0, 4], 0, 0, 0]})
remove_row(df1, 0, 'col2', df2, 0, 'col2')
Also, I think a mistake you're making is this:
[1,2,3,4] in [0,1,2,3,4]
returns False, because it asks whether the second list contains the first list as a single element, not whether the two lists share any values.
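For whole columns rather than single cells, one possible sketch is to collect every value from the other frame's lists into a set and keep only the rows whose list shares nothing with it. The column names AffiliationHistory and Affiliations come from the question; the Reviewer column and all sample values are invented for illustration.

```python
import pandas as pd

# Made-up sample data; only the list-column names are taken from the question.
Revdf = pd.DataFrame({
    "Reviewer": ["r1", "r2", "r3"],
    "AffiliationHistory": [["MIT", "ETH"], ["Oxford"], []],
})
Authdf = pd.DataFrame({
    "Affiliations": [["ETH", "CMU"], ["Stanford"]],
})

# Flatten all author affiliation lists into one set of values.
blocked = set(Authdf["Affiliations"].explode().dropna())

# Keep a row only if its list has no value in common with the blocked set.
keep = Revdf["AffiliationHistory"].apply(lambda items: blocked.isdisjoint(items))
filtered = Revdf[keep]
```

isin does not help here because it tests each cell as a whole, and a list cell is never equal to one of its own elements.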

Python3.4 Pandas DataFrame from function

I wrote a function that outputs selected data from a parsing function. I am trying to put this information into a DataFrame using pandas.DataFrame but I am having trouble.
The headers are listed below as well as the function.head() data output
QUESTION
How can I place the function output in a pandas DataFrame so that the headers are linked to the output?
HEADERS
--TICK---------NI----------CAPEXP----------GW---------------OE---------------RE-------
OUTPUT
['MMM', ['4,956,000'], ['(1,493,000)'], ['7,050,000'], ['13,109,000'], ['34,317,000']]
['ABT', ['2,284,000'], ['(1,077,000)'], ['10,067,000'], ['21,526,000'], ['22,874,000']]
['ABBV', ['1,774,000'], ['(612,000)'], ['5,862,000'], ['1,742,000'], ['535,000']]
- Loop through each item (I'm assuming data is a list whose elements are the lists shown above).
- Take the first element as the ticker and convert the rest into numbers, using translate to undo the string formatting.
- Make a DataFrame per row, concat them all at the end, then transpose.
- Set the columns by parsing the header string (I've called it headers).
dflist = list()
for x in data:
    h = x[0]
    # '(' becomes '-', while ')' and ',' are deleted: '(1,493,000)' -> '-1493000'
    rest = [float(z[0].translate(str.maketrans('(', '-', '),'))) for z in x[1:]]
    dflist.append(pd.DataFrame([h] + rest))
df = pd.concat(dflist, axis=1).T
df.columns = [x for x in headers.split('-') if len(x) > 0]
But this might be a bit slow; it would be easier if you could get your input into a more consistent format.
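As a possible alternative, the rows can be collected as plain lists and passed to a single DataFrame constructor, which avoids one frame per row and keeps the numeric columns as floats. This is only a sketch under the same assumption as above (data is a list of the lists shown in the question); the helper name to_number is hypothetical.

```python
import pandas as pd

headers = "--TICK---------NI----------CAPEXP----------GW---------------OE---------------RE-------"
data = [
    ['MMM', ['4,956,000'], ['(1,493,000)'], ['7,050,000'], ['13,109,000'], ['34,317,000']],
    ['ABT', ['2,284,000'], ['(1,077,000)'], ['10,067,000'], ['21,526,000'], ['22,874,000']],
    ['ABBV', ['1,774,000'], ['(612,000)'], ['5,862,000'], ['1,742,000'], ['535,000']],
]


def to_number(text):
    """Convert '4,956,000' or accounting-style '(1,493,000)' to a float."""
    cleaned = text.replace(",", "")
    if cleaned.startswith("(") and cleaned.endswith(")"):
        cleaned = "-" + cleaned[1:-1]  # parentheses mark a negative amount
    return float(cleaned)


# Column names are the non-empty pieces between the dashes.
columns = [name for name in headers.split("-") if name]

# One flat row per ticker: the ticker string followed by the parsed numbers.
rows = [[ticker] + [to_number(cell[0]) for cell in cells] for ticker, *cells in data]
df = pd.DataFrame(rows, columns=columns)
```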

Apply to each element in a Pandas dataframe

Since each cell in the data frame is a tuple, I need to convert each one into a single number. Basically I have something like this:
price_table['Col1'].apply(lambda x: x[0])
But I actually need to do this for every column. Each x is a tuple holding a single number, so I need to return x[0] to get its value as a float instead of a tuple.
In R, I would put axis = c(1,2), but here passing two numbers to axis doesn't seem to work:
price_table.apply(lambda x: x[0],axis = 1)
TypeError: <lambda>() got an unexpected keyword argument 'axis'
Is there any way to apply this simple function to each element in the data frame?
Thanks in advance.
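For me the following works well:
price_table['Col1'].apply(lambda x: x[0],1)
I do not use the axis keyword. A likely reason for the error: the message suggests apply was called on a Series, and Series.apply has no axis parameter. It forwards unknown keyword arguments to the function itself, so the lambda receives axis and fails.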
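To cover every column at once, one option is to apply a per-column function that maps over the cells. The sketch below uses made-up data shaped like the question describes (one-element tuples in every cell). pandas also has a built-in element-wise method, DataFrame.applymap, which was renamed to DataFrame.map in pandas 2.1; the nested apply/map form is used here only because it behaves the same across versions.

```python
import pandas as pd

# Made-up frame: every cell is a one-element tuple, as in the question.
price_table = pd.DataFrame({
    "Col1": [(1.5,), (2.0,)],
    "Col2": [(3.25,), (4.0,)],
})

# apply runs once per column; Series.map then unwraps each cell in that column.
unwrapped = price_table.apply(lambda col: col.map(lambda cell: cell[0]))
```

The result has the same index and columns as price_table, with plain floats in place of the tuples.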