I have a pandas DataFrame that I need to group by a text column to obtain the sum of the duplicated values along that column. But when I run the groupby method it mysteriously drops many columns. Can anyone help me with this?
Check your column dtypes; sum only works on numeric columns.
For example, say you have a df as below:
df=pd.DataFrame({'V1':[1,2,3],'V2':['A','B','C'],'KEY':[1,2,2]})
df.dtypes
Out[159]:
KEY int64
V1 int64
V2 object
dtype: object
Then when you groupby KEY and sum the whole DataFrame, it will only return results for the numeric columns:
df.groupby('KEY').sum()
Out[160]:
     V1
KEY
1     1
2     5
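Note that on newer pandas versions the non-numeric columns are no longer dropped silently, so you may need to ask for numeric-only aggregation explicitly. A minimal sketch using the df defined above (the exact behaviour depends on your pandas version):
# Sum only the numeric columns; the object column V2 is excluded explicitly
df.groupby('KEY').sum(numeric_only=True)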
If you need the string columns to be joined together as well, you can:
df.groupby('KEY',as_index=False).apply(lambda x : x.sum())
Out[164]:
   KEY  V1  V2
0    1   1   A
1    4   5  BC
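If you want the strings joined without also summing the KEY column (note how KEY became 4 above, because the group's key values were summed too), a sketch using per-column aggregation on the same df:
# Sum the numeric column and concatenate the strings within each group
df.groupby('KEY', as_index=False).agg({'V1': 'sum', 'V2': ''.join})
#    KEY  V1  V2
# 0    1   1   A
# 1    2   5  BC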
I have an input dataframe.
I also have a list with the same length as the number of rows in the dataframe.
Every element of the list is a dictionary: the key is the name of the new column, and the value is the value to be inserted in the cell.
I have to insert the columns from that list into the dataframe.
What is the best way to do so?
So far, given the input dataframe indf and the list l, I came up with something along the lines of:
from copy import deepcopy

outdf = deepcopy(indf)
for index, row in indf.iterrows():
    e = l[index]
    for key, value in e.items():
        outdf.loc[index, key] = value
But it doesn't seem Pythonic or pandas-idiomatic, and I get performance warnings like:
<ipython-input-5-9dde586a9c14>:8: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
If the ordering of the list and the data frame is the same, you can convert your list of dictionaries to a data frame:
mylist = [
    {'a': 1, 'b': 2, 'c': 3},
    {'e': 11, 'f': 22, 'c': 33},
    {'a': 111, 'b': 222, 'c': 333}
]
mylist_df = pd.DataFrame(mylist)
       a      b    c     e     f
0    1.0    2.0    3   NaN   NaN
1    NaN    NaN   33  11.0  22.0
2  111.0  222.0  333   NaN   NaN
Then you can use pd.concat to merge the list to your input data frame:
result = pd.concat([input_df, mylist_df], axis=1)
This way, a column is created for every unique key across your dictionaries, regardless of whether it exists in one dictionary and not the others.
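Since pd.concat aligns on the index, this assumes indf has a default RangeIndex in the same order as the list. A minimal end-to-end sketch (indf and its column here are made up for illustration):
import pandas as pd

# Hypothetical input frame and the per-row dictionaries
indf = pd.DataFrame({'x': [10, 20, 30]})
l = [
    {'a': 1, 'b': 2, 'c': 3},
    {'e': 11, 'f': 22, 'c': 33},
    {'a': 111, 'b': 222, 'c': 333},
]

# Build all new columns at once, then join them side by side;
# reset_index guards against a non-default index on indf
outdf = pd.concat([indf.reset_index(drop=True), pd.DataFrame(l)], axis=1)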
I have a pandas dataframe where some values are integers and other values are an array. I simply want to drop all of the rows that contain the array (object datatype I believe) in my "ORIGIN_AIRPORT_ID" column, but I have not been able to figure out how to do so after trying many methods.
Here is what the first 20 rows of my dataframe look like. The values that show up as a list are the ones I want to remove. The dataset is a couple million rows, so I just need code that removes all of the array-like values in that specific column, if that makes sense.
df = df[df.origin_airport_ID.str.contains(',') == False]
You should consider next time giving us a data sample in text, instead of a figure. It's easier for us to test your example.
Original data:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
3 20194149 [10397, 10398, 10399, 10400]
4 20194150 10397
In your case, you can use the pd.to_numeric function:
df['ORIGIN_AIRPORT_ID'] = pd.to_numeric(df['ORIGIN_AIRPORT_ID'], errors='coerce')
It replaces every cell that cannot be converted into a number with NaN (Not a Number), so we get:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397.0
1 20194147 10397.0
2 20194148 10397.0
3 20194149 NaN
4 20194150 10397.0
To remove these rows, just use .dropna:
df = df.dropna().astype('int')
This results in your desired DataFrame:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
4 20194150 10397
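If you would rather keep the column values untouched and only filter out the offending rows, an equivalent mask-based sketch on the same df (this assumes the problem cells really are Python lists; adjust the isinstance check if they are numpy arrays):
# Keep only rows whose ORIGIN_AIRPORT_ID is not a list
mask = df['ORIGIN_AIRPORT_ID'].apply(lambda v: not isinstance(v, list))
df = df[mask]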
I have a Pandas DataFrame of key-value pairs for a collection of IDs. The columns in the DataFrame are (ID, Key, Value).
data = {
"ID":{0:1,1:1,2:1,3:2,4:2,5:2,6:3,7:3,8:3,9:4,10:4,11:4},
"Key":{0:"A",1:"B",2:"B",3:"A",4:"B",5:"B",6:"A",7:"B",8:"B",9:"A",10:"B",11:"C"},
"Value":{0:28,1:94,2:107,3:67,4:70,5:70,6:24,7:77,8:87,9:24,10:83,11:83}
}
data = pd.DataFrame(data)
I am trying to create a new table where the columns are the unique Keys, and their associated value is the maximum value for each ID.
So far I am able to create a DataFrame that contains the desired maximum values:
max_data = data.loc[ data.groupby(["ID", "Key"])["Value"].idxmax() ]
However, I am not sure of the best way to get a DataFrame where the columns are the unique Keys with their associated values. This is what I have so far, but I am trying to avoid a loop:
result = pd.DataFrame(max_data["ID"].unique(), columns=["ID"])
for key in max_data["Key"].unique():
    result = result.merge(
        max_data.loc[max_data["Key"] == key][["ID", "Value"]],
        how="left",
        on="ID"
    )
You can use something like pivot_table:
data.pivot_table(index='ID',columns='Key',values='Value',aggfunc='max')
Out[22]:
Key      A      B     C
ID
1     28.0  107.0   NaN
2     67.0   70.0   NaN
3     24.0   87.0   NaN
4     24.0   83.0  83.0
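An equivalent route, if you prefer groupby, is to take the max per (ID, Key) pair and then unstack the Key level into columns (a sketch using the data frame defined above):
# Max Value per (ID, Key), then pivot the Key level out into columns
data.groupby(['ID', 'Key'])['Value'].max().unstack()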
An elegant function like
df[~pandas.isnull(df.loc[:,0])]
can check a pandas DataFrame column and return the entire DataFrame but with all NaN value rows from the selected column removed.
I am wondering if there is a similar function which can check and return a df column conditional on its dtype without using any loops.
I've looked at
.select_dtypes(include=[np.float])
but this only returns columns that have entirely float64 values, not every row in a column that is a float.
First let's set up a DataFrame with two columns. Only column b has a float. We'll try to find this row:
import pandas

df = pandas.DataFrame({
    'a': ['qw', 'er'],
    'b': ['ty', 1.98]
})
When printed this looks like:
a b
0 qw ty
1 er 1.98
Then create a map to select the rows using apply()
def check_if_float(row):
    return isinstance(row['b'], float)

map = df.apply(check_if_float, axis=1)
This will give a boolean map of all the rows that have a float in column b:
0 False
1 True
You can then use this map to select the rows you want
filtered_rows = df[map]
Which leaves you only the rows that contain a float in column b:
a b
1 er 1.98
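If you only care about a single column, you can skip the row-wise apply and build the mask directly from that column; a slightly leaner sketch of the same idea:
# Boolean mask: True where the cell in column b is a float
mask = df['b'].map(lambda v: isinstance(v, float))
filtered_rows = df[mask]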
I want to run a frequency table on each variable in my df.
def frequency_table(x):
    return pd.crosstab(index=x, columns="count")

for column in df:
    return frequency_table(column)
I got the error 'ValueError: If using all scalar values, you must pass an index'.
How can I fix this?
Thank you!
You aren't passing any data. You are just passing a column name.
for column in df:
    print(column)  # will print column names as strings
Try:
ctabs = {}
for column in df:
    ctabs[column] = frequency_table(df[column])
Then you can look at each crosstab by using the column name as a key in the ctabs dictionary.
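For example, to print every table labelled by its column name (a small usage sketch):
for name, table in ctabs.items():
    print(name)
    print(table)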
for column in df:
    print(df[column].value_counts())
For example:
import pandas as pd
my_series = pd.DataFrame(pd.Series([1,2,2,3,3,3, "fred", 1.8, 1.8]))
my_series[0].value_counts()
will generate output like the below:
3 3
1.8 2
2 2
fred 1
1 1
Name: 0, dtype: int64