This question already has answers here:
How do I count the values from a pandas column which is a list of strings?
(5 answers)
Closed 11 months ago.
I am trying to count the number of characters in an uneven 2-D pandas series.
df = pd.DataFrame({ 'A' : [['a','b'],['a','c','f'],['a'], ['b','f']]})
I want to count how many times each character occurs.
Any ideas?
You can use explode() and value_counts().
import pandas as pd
df = pd.DataFrame({ 'A' : [['a','b'],['a','c','f'],['a'], ['b','f']]})
df = df.explode("A")
print(df.value_counts())
Expected output:
A
a 3
b 2
f 2
c 1
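As a variation on the answer above, here is a minimal sketch that explodes only column A and counts on the resulting Series, so the result is indexed by the characters themselves. The variable name counts is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [['a', 'b'], ['a', 'c', 'f'], ['a'], ['b', 'f']]})

# Flatten the lists into one value per row, then count each distinct value.
counts = df['A'].explode().value_counts()
print(counts)
```

A single count can then be looked up by label, for example counts['a'].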
This question already has answers here:
Take the positive value for a primary key in case of duplicates
(1 answer)
Pandas filter maximum groupby
(2 answers)
Closed 7 months ago.
I would like to keep the latest entry per group in a dataframe:
from datetime import date
import pandas as pd
data = [
['A', date(2018,2,1), "I want this"],
['A', date(2018,1,1), "Don't want"],
['B', date(2019,4,1), "Don't want"],
['B', date(2019,5,1), "I want this"]]
df = pd.DataFrame(data, columns=['name', 'date', 'result'])
The following does what I want (found and credits here):
df.sort_values('date').groupby('name').tail(1)
name date result
0 A 2018-02-01 I want this
3 B 2019-05-01 I want this
But how do I know that the order is always preserved when you do a groupby on a sorted DataFrame like df? Is this documented somewhere?
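Not for the groups themselves: by default groupby sorts the group keys, so aggregations such as last() return the groups in key order rather than in order of appearance. Try replacing A with Z to see it. The order of rows within each group is preserved, though, and the groupby documentation says so in its description of the sort parameter.
Use sort=False to keep the groups in order of first appearance: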
df.sort_values('date').groupby('name', sort=False).tail(1)
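If you would rather not depend on groupby ordering at all, one possible sketch is to sort by date and keep the last row per name with drop_duplicates. This is an alternative technique, not the approach from the question; the name latest is just illustrative.

```python
from datetime import date
import pandas as pd

data = [
    ['A', date(2018, 2, 1), "I want this"],
    ['A', date(2018, 1, 1), "Don't want"],
    ['B', date(2019, 4, 1), "Don't want"],
    ['B', date(2019, 5, 1), "I want this"],
]
df = pd.DataFrame(data, columns=['name', 'date', 'result'])

# Sort ascending by date, then keep only the last (latest) row for each name.
latest = df.sort_values('date').drop_duplicates('name', keep='last')
print(latest)
```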
This question already has answers here:
finding values in pandas series - Python3
(2 answers)
Closed 1 year ago.
I'd like to check if a pandas.DataFrame column contains a specific value. For instance, this toy DataFrame has an "h" in column "two":
import numpy as np
import pandas as pd
df = pd.DataFrame(
np.array(list("abcdefghi")).reshape((3, 3)),
columns=["one", "two", "three"]
)
df
one two three
0 a b c
1 d e f
2 g h i
But surprisingly,
"h" in df["two"]
evaluates to False.
My question is: What's the clearest way to find out if a DataFrame column (or pandas.Series in general) contains a specific value?
df["two"] is a pandas.Series which looks like this:
0 b
1 e
2 h
It turns out that the in operator checks the index, not the values. For example,
2 in df["two"]
evaluates to True.
So one has to explicitly check for the values like this:
"h" in df["two"].values
This evaluates to True.
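For completeness, here is a small sketch of two other ways to test the values, using an element-wise comparison or isin followed by any(). The names has_h_eq and has_h_isin are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array(list("abcdefghi")).reshape((3, 3)),
    columns=["one", "two", "three"],
)

# `in` on a Series looks at the index labels (0, 1, 2), not the values.
in_index = "h" in df["two"]

# Compare element-wise, then ask whether any element matched.
has_h_eq = bool((df["two"] == "h").any())

# isin accepts several candidate values at once.
has_h_isin = bool(df["two"].isin(["h", "x"]).any())

print(in_index, has_h_eq, has_h_isin)
```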
This question already has answers here:
Convert list of dictionaries to a pandas DataFrame
(7 answers)
Closed 1 year ago.
Is there a way to deal with missing values when converting a list of dictionaries into a pandas DataFrame? Sometimes the dictionary entries are in a different order, so I have to deal with each column separately.
Here is an example:
p = [
{'c': 53.13,'n': 1,'t': 1575050400000},
{'t': 1575048600000,'c': 53.11}
]
And here is what I have been trying:
import pandas as pd
df = pd.DataFrame([{
"c": t["c"],
"n": t["n"],
"t": t['t']}
for t in p])
I get a KeyError: 'n' because the entry for 'n' is missing in the second dictionary. Is there a way of handling that to just put a NaN when the entry is missing?
You can pass p directly to the DataFrame constructor; keys missing from a dictionary are filled with NaN:
df = pd.DataFrame(p)
df
Output:
c n t
0 53.13 1.0 1575050400000
1 53.11 NaN 1575048600000
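If a fixed column order matters, a sketch along these lines passes the optional columns argument to the constructor, so the order of keys inside each dictionary does not affect the layout. The chosen order c, n, t simply mirrors the question.

```python
import pandas as pd

p = [
    {'c': 53.13, 'n': 1, 't': 1575050400000},
    {'t': 1575048600000, 'c': 53.11},
]

# Missing keys become NaN; `columns` fixes which columns appear and in what order.
df = pd.DataFrame(p, columns=['c', 'n', 't'])
print(df)
```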
This question already has answers here:
Appending to an empty DataFrame in Pandas?
(5 answers)
Creating an empty Pandas DataFrame, and then filling it
(8 answers)
Closed 3 years ago.
I am trying to append a new row to an empty dataset, and I found that the code below works fine:
import pandas as pd
df = pd.DataFrame(columns=['A'])
for i in range(5):
    df = df.append({'A': i}, ignore_index=True)
So, it gives me:
A
0 0
1 1
2 2
3 3
4 4
But when I try the code below, my dataset is still empty:
df = pd.DataFrame(columns=['A'])
df.append({'A': 2}, ignore_index=True)
df
Can someone explain how to add only one row?
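For what it's worth, the second snippet stays empty because append returns a new DataFrame instead of modifying df in place, so the result has to be assigned back, as the loop does. Note also that DataFrame.append was removed in pandas 2.0. The sketch below uses pd.concat instead; new_row is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(columns=['A'])

# Build the single row as its own one-row DataFrame.
new_row = pd.DataFrame([{'A': 2}])

# concat returns a new object too, so assign the result back to df.
df = pd.concat([df, new_row], ignore_index=True)
print(df)
```

When many rows are added, collecting them in a list and building the DataFrame once is usually cheaper than concatenating inside a loop.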
This question already has answers here:
Python pandas slice dataframe by multiple index ranges
(3 answers)
Slice multiple column ranges with Pandas
(1 answer)
Closed 5 years ago.
Given a pandas DataFrame with an index from 0 to 30, I would like to select the rows within several index ranges: [0:5], [10:15] and [20:25].
How can I do that?
Say you have a random pandas DataFrame with 30 rows and 4 columns as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,30,size=(30, 4)), columns=list('ABCD'))
You can then use np.r_ to index into ranges of rows [0:5], [10:15] and [20:25] as follows:
df.loc[np.r_[0:5, 10:15, 20:25], :]
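Note that loc selects by label, so the line above works because the DataFrame has the default integer index 0..29. Here is a sketch of the position-based variant with iloc, which should also work when the index labels are not consecutive integers starting at 0. The shifted index is made up purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 30, size=(30, 4)), columns=list('ABCD'))
df.index = df.index + 100  # hypothetical non-default labels: 100..129

# np.r_ concatenates the slices into one array of integer positions.
positions = np.r_[0:5, 10:15, 20:25]

# iloc selects by position, regardless of what the index labels are.
subset = df.iloc[positions]
print(subset)
```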