Say I have a dataframe where there are different values in a column, e.g.,
import numpy as np
import pandas as pd

raw_data = {'first_name': ['Jason', 'Molly', np.nan, np.nan, np.nan],
            'nationality': ['USA', 'USA', 'France', 'UK', 'UK'],
            'age': [42, 52, 36, 24, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'nationality', 'age'])
df
How do I create new dataframes, where one dataframe contains only the rows for USA, one only the rows for UK, and one only the rows for France? But here is the thing: I don't want to specify a condition like
Don't want this:
# Create variable with TRUE if nationality is USA
american = df['nationality'] == "USA"
I want the data grouped by nationality, whatever the nationalities happen to be, without having to spell out each nationality condition. I just want all rows with the same nationality together in their own dataframe, keeping every column of those rows.
So for example, the function
SplitDFIntoSeveralDFWhereColumnValueAllTheSame(column):
    code
will return a collection of dataframes in which, within each dataframe, all values of the given column are equal.
So if I had more data and more nationalities, the aggregation into new dataframes would work without changing the code.
This will give you a dictionary of dataframes where the keys are the unique values of the 'nationality' column and the values are the dataframes you are looking for.
{name: group for name, group in df.groupby('nationality')}
Demo:
dodf = {name: group for name, group in df.groupby('nationality')}
for k in dodf:
print(k, '\n'*2, dodf[k], '\n'*2)
France
first_name nationality age
2 NaN France 36
USA
first_name nationality age
0 Jason USA 42
1 Molly USA 52
UK
first_name nationality age
3 NaN UK 24
4 NaN UK 70
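If you only need one nationality at a time, GroupBy.get_group avoids building the whole dictionary up front; a minimal sketch on the same sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'first_name': ['Jason', 'Molly', np.nan, np.nan, np.nan],
                   'nationality': ['USA', 'USA', 'France', 'UK', 'UK'],
                   'age': [42, 52, 36, 24, 70]})

grouped = df.groupby('nationality')
usa = grouped.get_group('USA')   # just the USA rows, all columns kept
print(usa)
# the available keys are grouped.groups.keys()
```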
I am trying to create a new column of students' grade levels based on their DOB. The cutoff dates for 1st grade would be 2014/09/02 - 2015/09/01. Is there a simple solution for this besides writing a long if/elif chain? Thanks.
Name   DOB
Sally  2011/06/20
Mike   2009/02/19
Kevin  2012/12/22
You can use pd.cut(), which also supports custom bins.
import pandas as pd

dob = {
    'Sally': '2011/06/20',
    'Mike': '2009/02/19',
    'Kevin': '2012/12/22',
    'Ron': '2009/09/01',
}
dob = pd.to_datetime(pd.Series(dob)).rename('DOB').to_frame()

# Bin edges: one September-1 cutoff per school year
grades = pd.to_datetime(pd.Series([
    '2008/9/1',
    '2009/9/1',
    '2010/9/1',
    '2011/9/1',
    '2012/9/1',
    '2013/9/1',
]))

dob['grade'] = pd.cut(dob['DOB'], grades, labels=[5, 4, 3, 2, 1])
print(dob.sort_values('DOB'))
DOB grade
Mike 2009-02-19 5
Ron 2009-09-01 5
Sally 2011-06-20 3
Kevin 2012-12-22 1
I sorted the dataframe by date of birth to show that the oldest students are in the highest grades.
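The cutoff dates could also be generated instead of typed out by hand; a sketch of the same binning, assuming one September-1 cutoff per school year:

```python
import pandas as pd

# Six consecutive September-1 cutoffs, one school year apart
bins = pd.date_range('2008-09-01', periods=6, freq=pd.DateOffset(years=1))

dob = pd.to_datetime(pd.Series({'Sally': '2011/06/20',
                                'Mike': '2009/02/19',
                                'Kevin': '2012/12/22'}))
grade = pd.cut(dob, bins, labels=[5, 4, 3, 2, 1])
print(grade)
```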
I have a dataframe df with a numeric index, a datetime column, and ozone concentrations, among several other columns. Here's a list of the important columns for my question.
index, date, ozone
0, 4-29-2018, 55.4375
1, 4-29-2018, 52.6375
2, 5-2-2018, 50.4128
3, 5-2-2018, 50.3
4, 5-3-2018, 50.3
5, 5-4-2018, 51.8845
I need to call the index value of a row based on the column value. However, multiple rows have a column value of 50.3. First, how do I find the index value based on a specific column value? I've tried:
np.isclose(df['ozone'], 50.3).argmax() from Getting the index of a float in a column using pandas
but this only gives me the first index at which the number appears. Is there a way to look up the index based on two parameters (like asking for the index where datetime = 5-2-2018 and ozone = 50.3)?
I've also tried df.loc but it doesn't work for floating points.
here's some sample code:
df = pd.read_csv('blah.csv')
df.set_index('date', inplace = True)
df.index = pd.to_datetime(df.index)
date = pd.to_datetime(df.index)
dv = df.groupby([date.month,date.day]).mean()
dv.drop(columns=['dir_5408'], inplace=True)
df['ozone'] = df.oz_3186.rolling('8H', min_periods=2).mean().shift(-4)
ozone = df.groupby([date.month,date.day])['ozone'].max()
df['T_mda8_3186'] = df.Temp_C_3186.rolling('8H', min_periods=2).mean().shift(-4)
T_mda8_3186 = df.groupby([date.month,date.day])['T_mda8_3186'].max()
df['T_mda8_5408'] = df.Temp_C_5408.rolling('8H', min_periods=2).mean().shift(-4)
T_mda8_5408 = df.groupby([date.month,date.day])['T_mda8_5408'].max()
df['ws_mda8_5408'] = df.ws_5408.rolling('8H', min_periods=2).mean().shift(-4)
ws_mda8_5408 = df.groupby([date.month,date.day])['ws_mda8_5408'].max()
dv_MDA8 = df.drop(columns=['Temp_C_3186', 'Temp_C_5408','dir_5408','ws_5408','u_5408','v_5408','rain(mm)_5724',
'rain(mm)_5408','rh_3186','rh_5408','pm10_5408','pm10_3186','pm25_5408','oz_3186'])
dv_MDA8.reset_index(inplace=True)
I need the date as a datetime index for the beginning of my code.
Thanks in advance for your help.
This is what you might be looking for:
import datetime

import pandas as pd

data = pd.DataFrame({
    'index': [0, 1, 2, 3, 4, 5],
    'date': ['4-29-2018', '4-29-2018', '5-2-2018', '5-2-2018', '5-3-2018', '5-4-2018'],
    'ozone': [55.4375, 52.6375, 50.4128, 50.3, 50.3, 51.8845],
})
data.set_index(['index'], inplace=True)
data['date'] = data['date'].apply(lambda x: datetime.datetime.strptime(x, '%m-%d-%Y'))
data['ozone'] = data['ozone'].astype('float')
data.loc[(data['date'] == datetime.datetime.strptime('5-3-2018', '%m-%d-%Y'))
         & (data['ozone'] == 50.3)]
The index identifies each row, so you can look up the indexes and store/use them later, provided, of course, that the index of df has not changed in the meantime.
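Since ozone holds floats, an exact == comparison can silently miss values that differ only by rounding; a tolerance-based variant of the same lookup, sketched on the sample data with np.isclose:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'date': pd.to_datetime(['4-29-2018', '4-29-2018', '5-2-2018',
                            '5-2-2018', '5-3-2018', '5-4-2018'],
                           format='%m-%d-%Y'),
    'ozone': [55.4375, 52.6375, 50.4128, 50.3, 50.3, 51.8845],
})

# Float-tolerant match on ozone combined with an exact match on the date
mask = np.isclose(data['ozone'], 50.3) & (data['date'] == pd.Timestamp('2018-05-02'))
idx = data.index[mask]      # every matching index label, not just the first
print(idx.tolist())         # -> [3]
```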
Code:
import pandas as pd
import numpy as np
students = [('jack', 34, 'Sydeny', 'Engineering'),
('Sachin', 30, 'Delhi', 'Medical'),
('Aadi', 16, 'New York', 'Computer Science'),
('Riti', 30, 'Delhi', 'Data Science'),
('Riti', 30, 'Delhi', 'Data Science'),
('Riti', 30, 'Mumbai', 'Information Security'),
('Aadi', 40, 'London', 'Arts'),
('Sachin', 30, 'Delhi', 'Medical')
]
df = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Subject'])
print(df)
ind_name_sub = df.where((df.Name == 'Riti') & (df.Subject == 'Data Science')).dropna().index
# similarly you can have your ind_date_ozone may have one or more values
print(ind_name_sub)
print(df.loc[ind_name_sub])
Output:
Name Age City Subject
0 jack 34 Sydeny Engineering
1 Sachin 30 Delhi Medical
2 Aadi 16 New York Computer Science
3 Riti 30 Delhi Data Science
4 Riti 30 Delhi Data Science
5 Riti 30 Mumbai Information Security
6 Aadi 40 London Arts
7 Sachin 30 Delhi Medical
Int64Index([3, 4], dtype='int64')
Name Age City Subject
3 Riti 30 Delhi Data Science
4 Riti 30 Delhi Data Science
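The same index labels can also be pulled with a plain boolean mask on df.index, which skips the where/dropna round trip through NaN rows:

```python
import pandas as pd

students = [('jack', 34, 'Sydeny', 'Engineering'),
            ('Sachin', 30, 'Delhi', 'Medical'),
            ('Aadi', 16, 'New York', 'Computer Science'),
            ('Riti', 30, 'Delhi', 'Data Science'),
            ('Riti', 30, 'Delhi', 'Data Science'),
            ('Riti', 30, 'Mumbai', 'Information Security'),
            ('Aadi', 40, 'London', 'Arts'),
            ('Sachin', 30, 'Delhi', 'Medical')]
df = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Subject'])

# Index labels where both conditions hold
ind = df.index[(df.Name == 'Riti') & (df.Subject == 'Data Science')]
print(ind.tolist())  # -> [3, 4]
```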
This question is related to many previously asked questions about adding columns to a dataframe, but I could not find one that addresses my problem.
I have 2 lists and I want to create a dataframe for them where each list is a column, and the index is taken from a previous dataframe.
When I try:
STNAME = Filter3['STNAME'].tolist() #first list to be converted to column
CTYNAME = Filter3['CTYNAME'].tolist() #second list to be converted to column
ORIG_INDEX = Filter3.index #index pulled from previous dataframe
FINAL = pd.Series(STNAME, CTYNAME, index=ORIG_INDEX)
return FINAL
I get an error that an index already exists:
TypeError: __init__() got multiple values for argument 'index'
So I tried it with just two columns, and no declaration of index, and it turns out that
FINAL = pd.Series(STNAME, CTYNAME) makes the CTYNAME into the index:
STNAME = Filter3['STNAME'].tolist()
CTYNAME = Filter3['CTYNAME'].tolist()
ORIG_INDEX = Filter3.index
FINAL = pd.Series(STNAME, CTYNAME)
return FINAL
Washington County Iowa
Washington County Minnesota
Washington County Pennsylvania
Washington County Rhode Island
Washington County Wisconsin
dtype: object
How would I create a dataframe that accepts 2 lists as columns and a third index (with matching length) as the index?
Thank you very much
I believe you need a DataFrame, not a Series, if you want to work with lists:
FINAL = pd.DataFrame({'STNAME': STNAME, 'CTYNAME': CTYNAME},
                     index=ORIG_INDEX,
                     columns=['STNAME', 'CTYNAME'])
Better yet, create the subset directly by selecting the list of columns, and to avoid a possible SettingWithCopyWarning add DataFrame.copy:
FINAL = Filter3[['STNAME', 'CTYNAME']].copy()
Sample:
d = {'COL': ['a', 'b', 's', 'b', 'b'],
'STNAME': ['Iowa', 'Minnesota', 'Pennsylvania', 'Rhode Island', 'Wisconsin'],
'CTYNAME': ['Washington County', 'Washington County', 'Washington County',
'Washington County', 'Washington County'],}
Filter3 = pd.DataFrame(d,index=[10,20,3,50,40])
print (Filter3)
COL CTYNAME STNAME
10 a Washington County Iowa
20 b Washington County Minnesota
3 s Washington County Pennsylvania
50 b Washington County Rhode Island
40 b Washington County Wisconsin
FINAL = Filter3[['STNAME', 'CTYNAME']].copy()
print (FINAL)
STNAME CTYNAME
10 Iowa Washington County
20 Minnesota Washington County
3 Pennsylvania Washington County
50 Rhode Island Washington County
40 Wisconsin Washington County
I want to replicate what the WHERE clause does in SQL, using Python. Conditions in a WHERE clause can often be complex, with multiple conditions. I am able to do it in the following way, but I think there should be a smarter way to achieve this. I have the following data and code.
My requirement is: I want to select all columns, but only the rows where the first letter of the address is 'N'. This is the initial data frame.
d = {'name': ['john', 'tom', 'bob', 'rock', 'dick'], 'Age': [23, 32, 45, 42, 28], 'YrsOfEducation': [10, 15, 8, 12, 10], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
import pandas as pd
df = pd.DataFrame(data = d)
df['col1'] = df['Address'].str[0:1] #creating a new column which will have only the first letter from address column
n = df['col1'] == 'N' #creating a filtering criteria where the letter will be equal to N
newdata = df[n] # filtering the dataframe
newdata1 = newdata.drop('col1', axis = 1) # finally dropping the extra column 'col1'
So after 7 lines of code I am getting this output:
My question is: how can I do this more efficiently, or is there a smarter way to do it?
A new column is not necessary:
newdata = df[df['Address'].str[0] == 'N'] # filtering the dataframe
print (newdata)
Address Age YrsOfEducation name
0 NY 23 10 john
1 NJ 32 15 tom
3 NY 42 12 rock
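If you want something that reads even closer to a SQL WHERE clause, DataFrame.query accepts the condition as a string; a sketch (using the .str accessor inside query requires the python engine):

```python
import pandas as pd

d = {'name': ['john', 'tom', 'bob', 'rock', 'dick'],
     'Age': [23, 32, 45, 42, 28],
     'YrsOfEducation': [10, 15, 8, 12, 10],
     'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df = pd.DataFrame(data=d)

# SQL: SELECT * FROM df WHERE Address LIKE 'N%'
newdata = df.query("Address.str.startswith('N')", engine='python')
print(newdata)
```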
I'm new to pandas and this is my first question on Stack Overflow; I'm trying to do some analytics with pandas.
I have some text files with data records that I want to process. Each line of a file corresponds to a record whose fields sit at fixed positions and have fixed character lengths. There are different kinds of records in the same file; all records share the first field, two characters identifying the type of record. As an example:
Some file:
01Jhon Smith 555-1234
03Cow Bos primigenius taurus 00401
01Jannette Jhonson 00100000000
...
field    start  length
type     1      2       (common to all records, e.g. 01 = person, 03 = animal)
name     3      10
surname  13     10
phone    23     8
credit   31     11
(fields are padded with spaces)
I'm writing some code to convert one record to a dictionary:
person1 = {'type': 01, 'name': 'Jhon', 'surname': 'Smith', 'phone': '555-1234'}
person2 = {'type': 01, 'name': 'Jannette', 'surname': 'Jhonson', 'credit': 1000000.00}
animal1 = {'type': 03, 'cname': 'cow', 'sciname': 'Bos....', 'legs': 4, 'tails': 1 }
If a field is empty (filled with spaces), it will not appear in the dictionary.
With all the records of one kind I want to create a pandas DataFrame with the dict keys as column names; I've tried pandas.DataFrame.from_dict() without success.
And here comes my question: is there any way to do this with pandas so that dict keys become column names? Is there any other standard method for dealing with this kind of file?
To make a DataFrame from a dictionary, you can pass a list of dictionaries:
>>> person1 = {'type': 01, 'name': 'Jhon', 'surname': 'Smith', 'phone': '555-1234'}
>>> person2 = {'type': 01, 'name': 'Jannette', 'surname': 'Jhonson', 'credit': 1000000.00}
>>> animal1 = {'type': 03, 'cname': 'cow', 'sciname': 'Bos....', 'legs': 4, 'tails': 1 }
>>> pd.DataFrame([person1])
name phone surname type
0 Jhon 555-1234 Smith 1
>>> pd.DataFrame([person1, person2])
credit name phone surname type
0 NaN Jhon 555-1234 Smith 1
1 1000000 Jannette NaN Jhonson 1
>>> pd.DataFrame.from_dict([person1, person2])
credit name phone surname type
0 NaN Jhon 555-1234 Smith 1
1 1000000 Jannette NaN Jhonson 1
For the more fundamental issue of two differently formatted record types being intermixed in one file, and assuming the file isn't so big that we can't read and store it in memory, I'd use StringIO to make an object which behaves like a file but only contains the lines we want, and then use read_fwf (fixed-width file). For example:
from StringIO import StringIO
def get_filelike_object(filename, line_prefix):
s = StringIO()
with open(filename, "r") as fp:
for line in fp:
if line.startswith(line_prefix):
s.write(line)
s.seek(0)
return s
and then
>>> type01 = get_filelike_object("animal.dat", "01")
>>> df = pd.read_fwf(type01, names="type name surname phone credit".split(),
widths=[2, 10, 10, 8, 11], header=None)
>>> df
type name surname phone credit
0 1 Jhon Smith 555-1234 NaN
1 1 Jannette Jhonson NaN 100000000
should work. Of course you could also separate the files into different types before pandas ever sees them, which might be easiest of all.
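On Python 3 the import moves to io.StringIO; an end-to-end sketch of the same filter-then-read_fwf approach, using made-up fixed-width person records built with the widths assumed above:

```python
from io import StringIO

import pandas as pd


def record(*fields, widths=(2, 10, 10, 8, 11)):
    # Pad each field to its fixed width so the line matches the layout
    return "".join(f.ljust(w) for f, w in zip(fields, widths)) + "\n"


raw = (record("01", "Jhon", "Smith", "555-1234", "") +
       record("03", "Cow", "Bos", "", "00000000401") +
       record("01", "Jannette", "Jhonson", "", "00100000000"))

# Keep only the '01' (person) records before pandas sees the data
persons = StringIO("".join(line for line in raw.splitlines(keepends=True)
                           if line.startswith("01")))

df = pd.read_fwf(persons, names="type name surname phone credit".split(),
                 widths=[2, 10, 10, 8, 11], header=None)
print(df)
```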