I want to replicate what a WHERE clause does in SQL, using Python. The conditions in a WHERE clause are often complex and combine several tests. I can do it in the following way, but I think there should be a smarter way to achieve this. I have the following data and code.
My requirement is: I want to select all columns, but only for the rows where the first letter of the address is 'N'. This is the initial data frame.
d = {'name': ['john', 'tom', 'bob', 'rock', 'dick'], 'Age': [23, 32, 45, 42, 28], 'YrsOfEducation': [10, 15, 8, 12, 10], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
import pandas as pd
df = pd.DataFrame(data = d)
df['col1'] = df['Address'].str[0:1] #creating a new column which will have only the first letter from address column
n = df['col1'] == 'N' #creating a filtering criteria where the letter will be equal to N
newdata = df[n] # filtering the dataframe
newdata1 = newdata.drop('col1', axis = 1) # finally dropping the extra column 'col1'
So after 7 lines of code I am getting this output:
My question is: how can I do this more efficiently, or is there a smarter way to do it?
A new column is not necessary:
newdata = df[df['Address'].str[0] == 'N'] # filtering the dataframe
print (newdata)
Address Age YrsOfEducation name
0 NY 23 10 john
1 NJ 32 15 tom
3 NY 42 12 rock
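Since the question mentions WHERE clauses with several conditions, here is a minimal sketch of how boolean masks can be combined. The extra conditions (Age > 30, the state list) are made up for illustration and are not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["john", "tom", "bob", "rock", "dick"],
    "Age": [23, 32, 45, 42, 28],
    "YrsOfEducation": [10, 15, 8, 12, 10],
    "Address": ["NY", "NJ", "PA", "NY", "CA"],
})

# Roughly: WHERE Address LIKE 'N%' AND Age > 30
# Each comparison needs its own parentheses because & binds tighter than >.
starts_with_n = df["Address"].str.startswith("N")
result_and = df[starts_with_n & (df["Age"] > 30)]

# Roughly: WHERE Address IN ('PA', 'CA') OR YrsOfEducation >= 15
result_or = df[df["Address"].isin(["PA", "CA"]) | (df["YrsOfEducation"] >= 15)]

print(result_and)
print(result_or)
```

Use & for AND, | for OR and ~ for NOT; the Python keywords and/or do not work element-wise on a Series.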
I would like to merge n data frames based on certain variables (external to the data frame).
Let me clarify the problem referring to an example.
We have two dataframes detailing the height and age of certain members of a population.
In addition, we are given one array per data frame, containing one value per property (so the array length equals the number of numerical columns in the data frame).
Consider the following two data frames
df1 = pd.DataFrame({'Name': ['A', 'B', 'C', 'D', 'E'],
'Age': [3, 8, 4, 2, 5], 'Height': [7, 2, 1, 4, 9]})
df2 = pd.DataFrame({'Name': ['A', 'B', 'D'],
'Age': [4, 6, 4], 'Height': [3,9, 2]})
looking as
( Name Age Height
0 A 3 7
1 B 8 2
2 C 4 1
3 D 2 4
4 E 5 9,
Name Age Height
0 A 4 3
1 B 6 9
2 D 4 2)
As mentioned, we also have two arrays, say
array1 = np.array([ 1, 5])
array2 = np.array([2, 3])
To make the example concrete, let us say each array contains the year in which the property was measured.
The output should be constructed as follows:
If an individual appears in only one dataframe, its properties are taken from that dataframe.
If an individual appears in more than one data frame, take each property from the data frame whose associated array holds the higher value for that property. So, for property i, compare array1[[i]] and array2[[i]], and take the property values from dataframe df1 if array1[[i]] > array2[[i]], and vice versa.
In the context of the example, the rule translates to: when a property has been measured more than once, take the most recent measurement.
The output given the example data frames should look like
Name Age Height
0 A 4 7
1 B 6 2
2 C 4 1
3 D 4 4
4 E 5 9
Indeed, for the first property "Age", as array1[[0]] < array2[[0]], values are taken from the second dataframe, for the available individuals (A, B, D). Remaining values come from the first dataframe.
For the second property "Height", as array1[[1]] > array2[[1]], values come from the first dataframe, which already describes all the individuals.
At the moment I have some sort of solution based on looping over properties, but it is needlessly convoluted. I am wondering if any Pandas expert out there could help me towards an elegant solution.
Thanks for your support.
Your question is a bit confusing: array indexes start from 0 so I think in your example it should be [[0]] and [[1]] instead of [[1]] and [[2]].
You can first concatenate your dataframes to have all names listed, then loop over your columns and update the values where the corresponding array is greater (I added a Z row to df2 to show new rows are being added):
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'Name': ['A', 'B', 'C', 'D', 'E'],
'Age': [3, 8, 4, 2, 5], 'Height': [7, 2, 1, 4, 9]})
df2 = pd.DataFrame({'Name': ['A', 'B', 'D', 'Z'],
'Age': [4, 6, 4, 8], 'Height': [3,9, 2, 7]})
array1 = np.array([ 1, 5])
array2 = np.array([2, 3])
df1.set_index('Name', inplace=True)
df2.set_index('Name', inplace=True)
df3 = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
for i, col in enumerate(df1.columns):
    # take the column from df2 only when its measurement is more recent
    if array2[i] > array1[i]:
        df3[col].update(df2[col])
print(df3)
Note: You have to set Name as index in order to update the right rows
Output:
Age Height
Name
A 4 7
B 6 2
C 4 1
D 4 4
E 5 9
Z 8 7
If you have more than two dataframes in a list, you'll have to store your arrays in a list as well and iterate over the dataframe list while keeping track of the highest array values in a new array.
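For the case of several dataframes, here is one possible sketch. It is a variation on the idea above rather than the exact bookkeeping described: for each column the frames are written in order of increasing year, so the most recent measurement is written last and wins. The names frames, years and result are made up for this example, and it assumes every individual has a value for every property in at least one frame.

```python
import numpy as np
import pandas as pd

frames = [
    pd.DataFrame({"Name": ["A", "B", "C", "D", "E"],
                  "Age": [3, 8, 4, 2, 5], "Height": [7, 2, 1, 4, 9]}),
    pd.DataFrame({"Name": ["A", "B", "D", "Z"],
                  "Age": [4, 6, 4, 8], "Height": [3, 9, 2, 7]}),
]
years = [np.array([1, 5]), np.array([2, 3])]  # one array per frame

frames = [f.set_index("Name") for f in frames]
columns = frames[0].columns

# Union of all names, used as the index of the result
all_names = frames[0].index
for f in frames[1:]:
    all_names = all_names.union(f.index, sort=False)

result = pd.DataFrame(index=all_names, columns=columns, dtype="float")
for i, col in enumerate(columns):
    # Oldest measurement first, so newer ones overwrite it
    order = np.argsort([y[i] for y in years], kind="stable")
    for k in order:
        result.loc[frames[k].index, col] = frames[k][col]

# Safe only because no cell is left empty (see the assumption above)
result = result.astype(int)
print(result)
```

Assigning through result.loc avoids calling update on a column selected from the frame, which may not write back to the frame in every pandas configuration.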
Thank you. I am only 3 weeks into learning Pandas and I am getting unexpected results; any guidance would be appreciated.
I would like to merge two DataFrames together and retain my set_index.
I have a simple DataFrame
import pandas as pd
data = {
'part_number': [123,123,123],
'part_name': ['some name in 11', 'some name in 12', 'some name in 13'],
'part_size': [11,12,13]
}
df = pd.DataFrame(data=data)
df.set_index('part_name', inplace=True)
I groupby the part_sizes, and merge.
This is where my knowledge breaks down, I lose my index which is the part_name.
I see there are joins and concats, am I using the wrong syntax?
part_size_merge = df.groupby(['part_number'], dropna=False)['part_size'].agg(tuple).to_frame()
merged = df.merge(part_size_merge, on=['part_number'])
display(merged.head())
I tried concat, however, it looks like it stacks the two df's together, which isn't how I'd like it.
x = pd.concat([df, part_size_merge], axis=0, join='inner')
x.head()
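Yes, that is normal for merge: when you merge on columns, the existing index is discarded. Reset the index before merging and set it again afterwards: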
out = df.reset_index().merge(part_size_merge, on=['part_number']).set_index('part_name')
Out[334]:
part_number part_size_x part_size_y
part_name
some name in 11 123 11 (11, 12, 13)
some name in 12 123 12 (11, 12, 13)
some name in 13 123 13 (11, 12, 13)
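If the goal is only to attach the tuple of sizes to every row, a merge may not be needed at all. The sketch below is an alternative to the merge shown above, not the same method: it uses Series.map, which never touches the index. The column name all_part_sizes is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "part_number": [123, 123, 123],
    "part_name": ["some name in 11", "some name in 12", "some name in 13"],
    "part_size": [11, 12, 13],
}).set_index("part_name")

# One tuple of sizes per part_number, indexed by part_number
sizes_per_part = df.groupby("part_number", dropna=False)["part_size"].agg(tuple)

# Look up each row's part_number in that Series; the part_name index is kept
df["all_part_sizes"] = df["part_number"].map(sizes_per_part)
print(df)
```

This also avoids the part_size_x / part_size_y suffixes that the merge produces for the overlapping column name.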
Hi so I have a dataframe df with a numeric index, a datetime column, and ozone concentrations, among several other columns. But here's a list of the important columns regarding my question.
index, date, ozone
0, 4-29-2018, 55.4375
1, 4-29-2018, 52.6375
2, 5-2-2018, 50.4128
3, 5-2-2018, 50.3
4, 5-3-2018, 50.3
5, 5-4-2018, 51.8845
I need to call the index value of a row based on the column value. However, multiple rows have a column value of 50.3. First, how do I find the index value based on a specific column value? I've tried:
np.isclose(df['ozone'], 50.3).argmax() from Getting the index of a float in a column using pandas
but this only gives me the first index value at which the number appears. Is there a way to look up the index based on two parameters (for example, ask for the index value where datetime = 5-2-2018 and ozone = 50.3)?
I've also tried df.loc but it doesn't work for floating points.
here's some sample code:
df = pd.read_csv('blah.csv')
df.set_index('date', inplace = True)
df.index = pd.to_datetime(df.index)
date = pd.to_datetime(df.index)
dv = df.groupby([date.month,date.day]).mean()
dv.drop(columns=['dir_5408'], inplace=True)
df['ozone'] = df.oz_3186.rolling('8H', min_periods=2).mean().shift(-4)
ozone = df.groupby([date.month,date.day])['ozone'].max()
df['T_mda8_3186'] = df.Temp_C_3186.rolling('8H', min_periods=2).mean().shift(-4)
T_mda8_3186 = df.groupby([date.month,date.day])['T_mda8_3186'].max()
df['T_mda8_5408'] = df.Temp_C_5408.rolling('8H', min_periods=2).mean().shift(-4)
T_mda8_5408 = df.groupby([date.month,date.day])['T_mda8_5408'].max()
df['ws_mda8_5408'] = df.ws_5408.rolling('8H', min_periods=2).mean().shift(-4)
ws_mda8_5408 = df.groupby([date.month,date.day])['ws_mda8_5408'].max()
dv_MDA8 = df.drop(columns=['Temp_C_3186', 'Temp_C_5408','dir_5408','ws_5408','u_5408','v_5408','rain(mm)_5724',
'rain(mm)_5408','rh_3186','rh_5408','pm10_5408','pm10_3186','pm25_5408','oz_3186'])
dv_MDA8.reset_index(inplace=True)
I need the date as a datetime index for the beginning of my code.
Thanks in advance for your help.
This is what you might be looking for,
import pandas as pd
import datetime
data = pd.DataFrame({
'index':[0,1,2,3,4,5],
'date':['4-29-2018','4-29-2018','5-2-2018','5-2-2018','5-3-2018','5-4-2018'],
'ozone':[55.4375,52.6375,50.4128,50.3,50.3,51.8845]
}
)
data.set_index(['index'],inplace=True)
data['date'] = data['date'].apply(lambda x: datetime.datetime.strptime(x, '%m-%d-%Y'))
data['ozone'] = data['ozone'].astype('float')
data.loc[(data['date'] == datetime.datetime.strptime('5-3-2018','%m-%d-%Y'))
& (data['ozone'] == 50.3)]
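The question asks for the index value rather than the rows, and mentions trouble with floating-point comparison. A minimal sketch of one way to get it combines the date condition with np.isclose and applies the mask to the index. The names mask and matching_index are made up for this example, and the default tolerance of np.isclose is assumed to be acceptable for the data.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "date": ["4-29-2018", "4-29-2018", "5-2-2018", "5-2-2018", "5-3-2018", "5-4-2018"],
    "ozone": [55.4375, 52.6375, 50.4128, 50.3, 50.3, 51.8845],
})
data["date"] = pd.to_datetime(data["date"], format="%m-%d-%Y")

# Both conditions must hold; np.isclose tolerates float rounding error
mask = (data["date"] == pd.Timestamp("2018-05-02")) & np.isclose(data["ozone"], 50.3)

# Applying the boolean mask to the index returns the matching index labels
matching_index = data.index[mask]
print(matching_index.tolist())
```

The result is an Index object, so it can hold zero, one or several labels depending on how many rows match.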
The index identifies each row, so you can look up the matching index values and store or reuse them later, provided, of course, that the index of df has not changed in the meantime.
Code:
import pandas as pd
import numpy as np
students = [('jack', 34, 'Sydeny', 'Engineering'),
('Sachin', 30, 'Delhi', 'Medical'),
('Aadi', 16, 'New York', 'Computer Science'),
('Riti', 30, 'Delhi', 'Data Science'),
('Riti', 30, 'Delhi', 'Data Science'),
('Riti', 30, 'Mumbai', 'Information Security'),
('Aadi', 40, 'London', 'Arts'),
('Sachin', 30, 'Delhi', 'Medical')
]
df = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Subject'])
print(df)
# a boolean mask selects the matching rows; .index returns their labels
ind_name_sub = df[(df.Name == 'Riti') & (df.Subject == 'Data Science')].index
# similarly, your ind_date_ozone may hold one or more index values
print(ind_name_sub)
print(df.loc[ind_name_sub])
Output:
Name Age City Subject
0 jack 34 Sydeny Engineering
1 Sachin 30 Delhi Medical
2 Aadi 16 New York Computer Science
3 Riti 30 Delhi Data Science
4 Riti 30 Delhi Data Science
5 Riti 30 Mumbai Information Security
6 Aadi 40 London Arts
7 Sachin 30 Delhi Medical
Int64Index([3, 4], dtype='int64')
Name Age City Subject
3 Riti 30 Delhi Data Science
4 Riti 30 Delhi Data Science
I have a dataframe with multiple columns and rows. One column, say 'name', holds names, and the same name appears in several rows. Other columns, say 'x', 'y', 'z', 'zz', hold values. I want to group by name, get the mean of each column (x, y, z, zz) for each name, and then plot the result on a bar chart.
pandas.DataFrame.groupby is a core data-wrangling tool. Let's first make a dummy pandas data frame.
df = pd.DataFrame({"name": ["John", "Sansa", "Bran", "John", "Sansa", "Bran"],
"x": [2, 3, 4, 5, 6, 7],
"y": [5, -3, 10, 34, 1, 54],
"z": [10.6, 99.9, 546.23, 34.12, 65.04, -74.29]})
>>>
name x y z
0 John 2 5 10.60
1 Sansa 3 -3 99.90
2 Bran 4 10 546.23
3 John 5 34 34.12
4 Sansa 6 1 65.04
5 Bran 7 54 -74.29
We can use the label of the column to group the data (here the label is "name"). The by keyword itself can be omitted, as in df.groupby("name").
df.groupby(by = "name").mean().plot(kind = "bar")
which gives us a nice bar graph.
Transposing the group by results using T (as also suggested by anky) yields a different visualization. We can also pass a dictionary as the by parameter to determine the groups. The by parameter can also be a function, Pandas series, or ndarray.
# numeric_only=True skips the string column "name", which cannot be averaged
df.groupby(by = {1: "Sansa", 2: "Bran"}).mean(numeric_only = True).T.plot(kind = "bar")
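To see what the two plots are actually drawing, it can help to look at the grouped table and its transpose before plotting. This is only an illustrative sketch; the names means and by_column are made up here.

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Sansa", "Bran", "John", "Sansa", "Bran"],
                   "x": [2, 3, 4, 5, 6, 7],
                   "y": [5, -3, 10, 34, 1, 54],
                   "z": [10.6, 99.9, 546.23, 34.12, 65.04, -74.29]})

# One row per name, one column per variable: bars are grouped by name
means = df.groupby("name").mean()

# Transposed: one row per variable, one column per name,
# so the bars are grouped by variable instead
by_column = means.T

print(means)
print(by_column)
```

Calling .plot(kind="bar") on either table gives the corresponding chart: the row labels end up on the x-axis and each column becomes one bar colour.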
Say I have a dataframe where there are different values in a column, e.g.,
raw_data = {'first_name': ['Jason', 'Molly', np.nan, np.nan, np.nan],
'nationality': ['USA', 'USA', 'France', 'UK', 'UK'],
'age': [42, 52, 36, 24, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'nationality', 'age'])
df
How do I create new dataframes, where each dataframe contains only the rows for USA, only the rows for UK, and only the rows for France? But here is the thing: say I don't want to specify a condition like
Don't want this
# Create variable with TRUE if nationality is USA
american = df['nationality'] == "USA"
I want all the data aggregated for each nationality whatever the nationality is, without having to specify the nationality condition. I just want all the same nationalities together in their own dataframe. Also, I want all the columns that pertain to that row.
So for example, the function
SplitDFIntoSeveralDFWhereColumnValueAllTheSame(column):
code
It will return an array of dataframes, where within each dataframe all the values of that column are equal.
So if I had more data and more nationalities, the aggregation into new dataframes will work without changing the code.
This will give you a dictionary of dataframes where the keys are the unique values of the 'nationality' column and the values are the dataframes you are looking for.
{name: group for name, group in df.groupby('nationality')}
demo
dodf = {name: group for name, group in df.groupby('nationality')}
for k in dodf:
    print(k, '\n'*2, dodf[k], '\n'*2)
France
first_name nationality age
2 NaN France 36
USA
first_name nationality age
0 Jason USA 42
1 Molly USA 52
UK
first_name nationality age
3 NaN UK 24
4 NaN UK 70
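If a list of dataframes is preferred over a dictionary, as in the function the question describes, the same groupby can be wrapped in a small helper. The function name split_by_column is hypothetical. Note that groupby drops rows whose key is NaN by default, which does not matter here because every row has a nationality.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Jason", "Molly", np.nan, np.nan, np.nan],
    "nationality": ["USA", "USA", "France", "UK", "UK"],
    "age": [42, 52, 36, 24, 70],
})

def split_by_column(frame, column):
    """Hypothetical helper: one sub-frame per distinct value of `column`."""
    # groupby yields (key, sub-frame) pairs, sorted by key by default
    return [group for _, group in frame.groupby(column)]

parts = split_by_column(df, "nationality")
for part in parts:
    print(part, "\n")
```

Each sub-frame keeps all columns and the original row labels, so the pieces can be matched back to the source dataframe later.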