I am trying to create a new column of students' grade levels based on their DOB. The cutoff dates for 1st grade would be 2014/09/02 - 2015/09/01. Is there a simple solution for this besides writing a long if/elif chain? Thanks.
Name    DOB
Sally   2011/06/20
Mike    2009/02/19
Kevin   2012/12/22
You can use pd.cut(), which also supports custom bins.
import pandas as pd

dob = {
    'Sally': '2011/06/20',
    'Mike': '2009/02/19',
    'Kevin': '2012/12/22',
    'Ron': '2009/09/01',
}
dob = pd.to_datetime(pd.Series(dob)).rename("DOB").to_frame()

grades = [
    '2008/9/1',
    '2009/9/1',
    '2010/9/1',
    '2011/9/1',
    '2012/9/1',
    '2013/9/1',
]
grades = pd.to_datetime(pd.Series(grades))
dob['grade'] = pd.cut(dob['DOB'], grades, labels=[5, 4, 3, 2, 1])
print(dob.sort_values('DOB'))
             DOB grade
Mike  2009-02-19     5
Ron   2009-09-01     5
Sally 2011-06-20     3
Kevin 2012-12-22     1
I sorted the data frame by date of birth to show that the oldest students end up in the highest grades.
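If you later need more grade levels, the bin edges and labels don't have to be typed out by hand. Here is a small sketch of the same idea with the September 1 edges generated by pd.date_range; it reuses the dob frame above, and the frequency alias is an assumption that may differ across pandas versions:
import pandas as pd
# annual edges anchored on September 1 ('AS-SEP'; newer pandas may prefer 'YS-SEP')
edges = pd.date_range('2008-09-01', '2013-09-01', freq='AS-SEP')
labels = list(range(len(edges) - 1, 0, -1))  # oldest bin gets the highest grade
dob['grade'] = pd.cut(dob['DOB'], edges, labels=labels)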
I am writing a function in Scala to fetch data for training an ML model.
I have a dataframe DF1 which has one column consisting of names.
Another dataframe DF2 consists of the columns [description, released, ... a few more].
I want to create a dataframe DF3 which is a join of DF1 and DF2 on the condition that the name from DF1 appears in the description of DF2.
Example:
DF1
name
0 John
1 Mike
2 Kara
DF2
released total description
0 2006 5 This involved John and Kara who played role of......
1 2010 120 It is the latest release of Mike and Kara after almost decade...
DF3 [Expected output DF]
name released total description
0 John 2006 5 This involved John and Kara who played role of......
1 Kara 2006 5 This involved John and Kara who played role of......
2 Kara 2010 120 It is the latest release of Mike and Kara after almost decade...
3 Mike 2010 120 It is the latest release of Mike and Kara after almost decade...
I am trying to do a cross join to make all combinations, and then filter based on conditions on the name and description columns:
val DF3 = DF1.crossjoin(DF2).filter(col("name") in col("description"))
It seems there is no contains method available in Snowpark to do this.
Does anyone have an idea how to do it?
There are at least 2 solutions, but you should ask yourself some questions first:
Do you want to find a substring, or a whole word? E.g., should Karan be matched by the name Kara?
Do you want to store or use the result dataframe in this exact shape? Maybe you want to store/use it in a more compact way, e.g., for each name store the indexes/positions of the matching rows of DF2.
You can test (on a big, real dataset) which one is faster and more suitable for you.
1st solution (via DataFrame):
import org.apache.spark.SparkContext
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

object Main extends App {
  case class Name(name: String)
  case class TextInfo(year: Int, month: Int, text: String)

  val spark: SparkSession = SparkSession.builder.config("spark.master", "local").getOrCreate()
  val sc: SparkContext = spark.sparkContext

  val namesDf: DataFrame = spark.createDataFrame(sc.parallelize(Seq("John", "Mike", "Kara").map(Name)))
  val textToSearchDf: DataFrame = spark.createDataFrame(sc.parallelize(Seq(
    TextInfo(2006, 5, "This involved John and Kara who played role of"),
    TextInfo(2010, 120, "It is the latest release of Mike and Kara after almost decade")
  )))

  // keep only the (text, name) pairs where the name occurs in the text
  val resultDf: DataFrame = textToSearchDf.crossJoin(namesDf)
    .where(new Column($"text" contains $"name"))

  resultDf.foreach(println(_))
}
2nd solution, via RDD:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder.config("spark.master", "local").getOrCreate()
val sc: SparkContext = spark.sparkContext

val namesAsRdd: RDD[String] = sc.parallelize(Seq("John", "Mike", "Kara"))
val rddWithTextToSearch: RDD[(Int, Int, String)] = sc.parallelize(Seq(
  (2006, 5, "This involved John and Kara who played role of"),
  (2010, 120, "It is the latest release of Mike and Kara after almost decade")
))

// the list of names is small, so collect it to the driver and use it as a plain Set
val names: Set[String] = namesAsRdd.collect().toSet

val resultRdd: RDD[(String, Int, Int, String)] = rddWithTextToSearch.flatMap {
  case (year, month, str) => names.filter(name => str.contains(name)).map(name => (name, year, month, str))
}
resultRdd.foreach(println(_))
Hi, so I have a dataframe df with a numeric index, a datetime column, and ozone concentrations, among several other columns. Here is a list of the important columns for my question.
index, date, ozone
0, 4-29-2018, 55.4375
1, 4-29-2018, 52.6375
2, 5-2-2018, 50.4128
3, 5-2-2018, 50.3
4, 5-3-2018, 50.3
5, 5-4-2018, 51.8845
I need to get the index value of a row based on a column value. However, multiple rows have a column value of 50.3. First, how do I find the index value based on a specific column value? I've tried:
np.isclose(df['ozone'], 50.3).argmax() from Getting the index of a float in a column using pandas
but this only gives me the first index at which the number appears. Is there a way to look up the index based on two parameters (for example, the index value where datetime = 5-2-2018 and ozone = 50.3)?
I've also tried df.loc but it doesn't work for floating points.
Here's some sample code:
df = pd.read_csv('blah.csv')
df.set_index('date', inplace = True)
df.index = pd.to_datetime(df.index)
date = pd.to_datetime(df.index)
dv = df.groupby([date.month,date.day]).mean()
dv.drop(columns=['dir_5408'], inplace=True)
df['ozone'] = df.oz_3186.rolling('8H', min_periods=2).mean().shift(-4)
ozone = df.groupby([date.month,date.day])['ozone'].max()
df['T_mda8_3186'] = df.Temp_C_3186.rolling('8H', min_periods=2).mean().shift(-4)
T_mda8_3186 = df.groupby([date.month,date.day])['T_mda8_3186'].max()
df['T_mda8_5408'] = df.Temp_C_5408.rolling('8H', min_periods=2).mean().shift(-4)
T_mda8_5408 = df.groupby([date.month,date.day])['T_mda8_5408'].max()
df['ws_mda8_5408'] = df.ws_5408.rolling('8H', min_periods=2).mean().shift(-4)
ws_mda8_5408 = df.groupby([date.month,date.day])['ws_mda8_5408'].max()
dv_MDA8 = df.drop(columns=['Temp_C_3186', 'Temp_C_5408','dir_5408','ws_5408','u_5408','v_5408','rain(mm)_5724',
'rain(mm)_5408','rh_3186','rh_5408','pm10_5408','pm10_3186','pm25_5408','oz_3186'])
dv_MDA8.reset_index(inplace=True)
I need the date as a datetime index for the beginning of my code.
Thanks in advance for your help.
This is what you might be looking for:
import pandas as pd
import datetime

data = pd.DataFrame({
    'index': [0, 1, 2, 3, 4, 5],
    'date': ['4-29-2018', '4-29-2018', '5-2-2018', '5-2-2018', '5-3-2018', '5-4-2018'],
    'ozone': [55.4375, 52.6375, 50.4128, 50.3, 50.3, 51.8845]
})
data.set_index(['index'], inplace=True)
data['date'] = data['date'].apply(lambda x: datetime.datetime.strptime(x, '%m-%d-%Y'))
data['ozone'] = data['ozone'].astype('float')
data.loc[(data['date'] == datetime.datetime.strptime('5-3-2018', '%m-%d-%Y'))
         & (data['ozone'] == 50.3)]
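If exact float comparison turns out to be fragile (as the question notes for df.loc), the same two-condition lookup can combine the date filter with np.isclose. This is just a sketch against the data frame built above:
import numpy as np
# combine the date condition with an approximate float comparison
mask = (data['date'] == datetime.datetime.strptime('5-2-2018', '%m-%d-%Y')) \
       & np.isclose(data['ozone'], 50.3)
print(data[mask].index)  # every index value matching both conditions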
The index identifies each row, so you can find the relevant indexes and then store/use them later, as long as the index of df has not changed, of course.
Code:
import pandas as pd
import numpy as np
students = [('jack', 34, 'Sydeny', 'Engineering'),
            ('Sachin', 30, 'Delhi', 'Medical'),
            ('Aadi', 16, 'New York', 'Computer Science'),
            ('Riti', 30, 'Delhi', 'Data Science'),
            ('Riti', 30, 'Delhi', 'Data Science'),
            ('Riti', 30, 'Mumbai', 'Information Security'),
            ('Aadi', 40, 'London', 'Arts'),
            ('Sachin', 30, 'Delhi', 'Medical')]
df = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Subject'])
print(df)
ind_name_sub = df.where((df.Name == 'Riti') & (df.Subject == 'Data Science')).dropna().index
# similarly you can have your ind_date_ozone may have one or more values
print(ind_name_sub)
print(df.loc[ind_name_sub])
Output:
Name Age City Subject
0 jack 34 Sydeny Engineering
1 Sachin 30 Delhi Medical
2 Aadi 16 New York Computer Science
3 Riti 30 Delhi Data Science
4 Riti 30 Delhi Data Science
5 Riti 30 Mumbai Information Security
6 Aadi 40 London Arts
7 Sachin 30 Delhi Medical
Int64Index([3, 4], dtype='int64')
Name Age City Subject
3 Riti 30 Delhi Data Science
4 Riti 30 Delhi Data Science
I have a Pandas dataframe. I want to add a column to the dataframe, where the value in the new column is dependent on other values in the row.
What is an efficient way to go about this?
Example
Begin
Start with this dataframe (let's call it df), and a dictionary of people's roles.
first last
--------------------------
0 Jon McSmith
1 Jennifer Foobar
2 Dan Raizman
3 Alden Lowe
role_dict = {
"Raizman": "sales",
"McSmith": "analyst",
"Foobar": "analyst",
"Lowe": "designer"
}
End
We end up with a dataframe where we've 'iterated' over each row, used the last name to look up a value from our role_dict, and added that value to each row as role.
first last role
--------------------------------------
0 Jon McSmith analyst
1 Jennifer Foobar analyst
2 Dan Raizman sales
3 Alden Lowe designer
One solution is using the Series map function, since the roles are in a dictionary:
df['role'] = df.loc[:, 'last'].map(role_dict)
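Note that map leaves NaN for any last name that is missing from role_dict; if that matters, a default can be filled in (the 'unknown' label here is just an example):
df['role'] = df['last'].map(role_dict).fillna('unknown')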
Try this using merge:
import pandas as pd

df = pd.DataFrame([["Jon", "McSmith"],
                   ["Jennifer", "Foobar"],
                   ["Dan", "Raizman"],
                   ["Alden", "Lowe"]], columns=["first", "last"])
role_dict = {
    "Raizman": "sales",
    "McSmith": "analyst",
    "Foobar": "analyst",
    "Lowe": "designer"
}
df_2 = pd.DataFrame(role_dict.items(), columns=["last", "role"])
result = pd.merge(df, df_2, on=["last"], how="left")
Output:
first last role
0 Jon McSmith analyst
1 Jennifer Foobar analyst
2 Dan Raizman sales
3 Alden Lowe designer
Given a dataframe (df) with the following columns:
id,
created_date,
name
I need to ensure that all rows with the same name have the same id. I can create a mapping from old id to new id (selected at 'random' using max).
from pyspark.sql import functions as func

df.groupBy('name')\
  .agg(
      func.max('id').alias('new_id'),
      func.collect_set('id').alias('grouped_ids'))\
  .filter(func.size('grouped_ids') > 1)\
  .select(func.explode("grouped_ids").alias('old_id'), "new_id")\
  .filter("new_id != old_id")
I can then left outer join this to the original df (on id = old_id) and swap the ids where a new_id is available.
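For reference, a minimal sketch of that swap step (assuming mapping is the old_id/new_id dataframe produced by the aggregation above):
# left outer join on id = old_id, then keep new_id where one exists
updated = (df.join(mapping, df.id == mapping.old_id, 'left_outer')
             .withColumn('id', func.coalesce('new_id', 'id'))
             .drop('old_id', 'new_id'))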
However, I need to ensure that the new_id selected is the one with the oldest created_date in the dataframe (rather than just selecting the max).
How best to go about this?
e.g. Given the data
id, created_date, name
---
17a, 2019-01-05, Jeff
17a, 2019-01-03, Jeremy
d21, 2019-01-04, Jeremy
u45, 2019-01-04, Jeremy
d21, 2019-01-02, Scott
x22, 2019-01-01, Julian
Rows 2, 3 and 4 group on Jeremy, so they should all have the same id. The oldest id in the dataframe among the grouped ids is d21, since on row 5 its created_date is 2019-01-02, so that id should be selected and applied to all rows in the dataframe carrying any of the other grouped ids, and we end up with:
id, created_date, name
---
d21, 2019-01-05, Jeff
d21, 2019-01-03, Jeremy
d21, 2019-01-04, Jeremy
d21, 2019-01-04, Jeremy
d21, 2019-01-02, Scott
x22, 2019-01-01, Julian
UPDATE:
@Charles Du - Cheers, I tried your code but it didn't work out: the oldest id was selected from within the grouped names, not from the df as a whole, and the new_id was not applied throughout the df.
Result:
0 = {Row} Row(name='Scott', created_date='2019-01-02', new_ID='d21', id='d21', created_date='2019-01-02')
1 = {Row} Row(name='Julian', created_date='2019-01-01', new_ID='x22', id='x22', created_date='2019-01-01')
2 = {Row} Row(name='Jeremy', created_date='2019-01-03', new_ID='17a', id='17a', created_date='2019-01-03')
3 = {Row} Row(name='Jeremy', created_date='2019-01-03', new_ID='17a', id='d21', created_date='2019-01-04')
4 = {Row} Row(name='Jeremy', created_date='2019-01-03', new_ID='17a', id='u45', created_date='2019-01-04')
5 = {Row} Row(name='Jeff', created_date='2019-01-05', new_ID='17a', id='17a', created_date='2019-01-05')
My spitball here:
from pyspark.sql import functions as F

new_df = df.groupBy('name').agg(F.min('created_date').alias('created_date'))
new_df = new_df.join(df, on=['name', 'created_date'], how='inner')
# This should give you a df with a single record for each name, carrying the oldest ID.
new_df = new_df.withColumnRenamed('id', 'new_ID')
# You'll need to decide on a naming convention for your date column, since you'll have two if you don't rename it.
res = new_df.join(df, on='name', how='inner')
That should match up your ids with the oldest date.
Say I have a dataframe where there are different values in a column, e.g.,
import numpy as np
import pandas as pd

raw_data = {'first_name': ['Jason', 'Molly', np.nan, np.nan, np.nan],
            'nationality': ['USA', 'USA', 'France', 'UK', 'UK'],
            'age': [42, 52, 36, 24, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'nationality', 'age'])
df
How do I create new dataframes, where each dataframe contains only the rows for USA, only the rows for UK, and only the rows for France? But here is the thing: I don't want to specify a condition like
Don't want this
# Create variable with TRUE if nationality is USA
american = df['nationality'] == "USA"
I want the data separated out for each nationality, whatever the nationality is, without having to specify the nationality condition. I just want all the same nationalities together in their own dataframe, and I want all the columns that pertain to those rows.
So, for example, the function
SplitDFIntoSeveralDFWhereColumnValueAllTheSame(column):
    code
will return an array of dataframes where, within each dataframe, all the values of the given column are equal.
So if I had more data and more nationalities, the aggregation into new dataframes would work without changing the code.
This will give you a dictionary of dataframes where the keys are the unique values of the 'nationality' column and the values are the dataframes you are looking for.
{name: group for name, group in df.groupby('nationality')}
Demo:
dodf = {name: group for name, group in df.groupby('nationality')}
for k in dodf:
    print(k, '\n'*2, dodf[k], '\n'*2)
France
first_name nationality age
2 NaN France 36
USA
first_name nationality age
0 Jason USA 42
1 Molly USA 52
UK
first_name nationality age
3 NaN UK 24
4 NaN UK 70
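If you want this wrapped up as the generic helper sketched in the question, a small function could look like the following (the function name is just illustrative):
def split_df_by_column(df, column):
    # one sub-dataframe per unique value in `column`, keyed by that value
    return {value: group for value, group in df.groupby(column)}

frames = split_df_by_column(df, 'nationality')
usa_df = frames['USA']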