I have a dataframe df1:
index state fin_salary new_title
5 CA 1 Data Scientist
8 CA 1 Data Scientist
35 CA 150000 Deep Learning Engineer
36 CA 1 Data Analyst
39 CA 1 Data Engineer
43 CA 1 Data Scientist
56 CA 1 Data Scientist
And another dataframe df2:
state new_title fin_salary
CA Artificial Intelligence Engineer 207500.0
CA Data Analyst 64729.0
CA Data Engineer 146000.0
CA Data Scientist 129092.75
CA Deep Learning Engineer 162500.0
CA Machine Learning Engineer 133120.0
CA Python Developer 96797.0
I want to update df1 with fin_salary from df2, matching on state and new_title, but only for rows where fin_salary = 1. So my desired output should be:
index state fin_salary new_title
5 CA 129092.75 Data Scientist
8 CA 129092.75 Data Scientist
35 CA 150000 Deep Learning Engineer
36 CA 64729.0 Data Analyst
39 CA 146000.0 Data Engineer
43 CA 129092.75 Data Scientist
56 CA 129092.75 Data Scientist
You can do this: merge on the two key columns, then take the reference salary wherever the placeholder 1 appears:
import numpy as np
DF = df1.merge(df2, on=['state', 'new_title'], how='left', suffixes=('', '_ref'))
DF['fin_salary'] = np.where(DF['fin_salary'] == 1, DF['fin_salary_ref'], DF['fin_salary'])
df_final = DF.drop(columns='fin_salary_ref')
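If you prefer to avoid the merge, here is a minimal lookup sketch (assuming every (state, new_title) pair with a placeholder salary actually exists in df2):
# build a {(state, new_title): salary} mapping from the reference frame
lookup = df2.set_index(['state', 'new_title'])['fin_salary'].to_dict()
# replace only the placeholder salaries, leaving real values untouched
mask = df1['fin_salary'] == 1
df1.loc[mask, 'fin_salary'] = [lookup[k] for k in zip(df1.loc[mask, 'state'], df1.loc[mask, 'new_title'])]
This also preserves df1's original index, which the merge approach resets.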
I am trying to slice items from a CSV file. Here is an example:
import pandas
df1 = pandas.read_csv("supermarkets.csv")
df1
ID Address City State Country Name Employees
0 1 3666 21st St San Francisco CA 94114 USA Madeira 8
1 2 735 Dolores St San Francisco CA 94119 USA Bready Shop 15
2 3 332 Hill St San Francisco California 94114 USA Super River 25
3 4 3995 23rd St San Francisco CA 94114 USA Ben's Shop 10
4 5 1056 Sanchez St San Francisco California USA Sanchez 12
5 6 551 Alvarado St San Francisco CA 94114 USA Richvalley 20
df2 = df1.loc["735 Dolores St":"332 Hill St","City":"Country"]
df2
In the output I am only getting this:
City State Country
How do I correct this?
As you can read in the pandas documentation, .loc[] can access a group of rows and columns by label(s) or a boolean array.
You cannot directly select using the values stored in a column.
In your example df1.loc["735 Dolores St":"332 Hill St","City":"Country"] you are getting an empty selection because only "City":"Country" is a valid accessor.
"735 Dolores St":"332 Hill St" returns an empty row selection because those strings are values in the Address column, not labels on the index.
If you want specific rows by position, select them with .iloc instead, e.g. df1.iloc[[1, 2], 2:5].
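As a small sketch, a positional equivalent of the selection you intended (column positions taken from the frame shown above):
# rows at positions 1 and 2, columns City through Country (positions 2 to 4)
df2 = df1.iloc[1:3, 2:5]
print(df2)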
df.loc is primarily label based and slices rows by their index labels. In this case, you can use the numeric index or set Address as the index:
print(df)
ID Address City State Country Name Employees
0 1 3666 21st San Francisco CA 94114 USA Madeira 8
1 2 735 Dolores San Francisco CA 94114 USA Bready Shop 15
2 3 332 Hill San Francisco CA 94114 USA Super River 25
df2=df.loc[1:2,'City':'Country']
print(df2)
City State Country
1 San Francisco CA 94114 USA
2 San Francisco CA 94114 USA
Or
df2=df.set_index('Address').loc['735 Dolores':'332 Hill','City':'Country']
print(df2)
City State Country
Address
735 Dolores San Francisco CA 94114 USA
332 Hill San Francisco CA 94114 USA
I have two dataframes as follows:
df1 (reference data)
Tempe, AZ, USA
San Jose, CA, USA
Mountain View, CA, USA
New York, NY, USA
df2 (User entered data)
Tempe, AZ
Tempe, Arizona
San Jose, USA
San Jose, CA
Mountain View, CA
I would like to get a dataframe (df3) as following:
-------------------------------------------
|Tempe, AZ, USA | Tempe, Arizona |
|Tempe, AZ, USA | Tempe, AZ |
|San Jose, CA, USA | San Jose, CA |
|San Jose, CA, USA | San Jose, USA |
|Mountain View, CA, USA| Mountain View, CA|
-------------------------------------------
I already have a user-defined function:
def isSameAs(str1: String, str2: String): Boolean = {
......
}
that takes two strings (the user-entered data and the reference data) and tells me whether they are a match or not.
I just need to find out the right way to implement the mapping in Scala Spark SQL so that I get a dataframe like df3.
Option 1: You can use a UDF as the join expression:
import org.apache.spark.sql.functions._
val isSameAsUdf = udf(isSameAs(_,_))
val result = df1.join(df2, isSameAsUdf(df1.col("address"), df2.col("address")))
The downside of this approach is that Spark performs a cartesian product over both dataframes df1 and df2 and then filters out the rows that do not match the join condition afterwards. Running result.explain prints
== Physical Plan ==
CartesianProduct UDF(address#4, address#10)
:- LocalTableScan [address#4]
+- LocalTableScan [address#10]
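Note that because the join condition is an opaque UDF, Spark cannot plan a hash or sort-merge join here; every pair of rows has to be evaluated. Depending on your Spark version, you may also need to enable cross joins explicitly (spark.sql.crossJoin.enabled=true) before such a plan is executed.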
Option 2: To avoid the cartesian product, it might be faster to broadcast the reference data as a standard Scala sequence and then do the mapping of the addresses in another UDF:
val normalizedAddress: Seq[String] = // content of df1 (the reference data) as a Scala sequence
val broadcastSeq = spark.sparkContext.broadcast(normalizedAddress)
def toNormalizedAddress(str: String ): String =
broadcastSeq.value.find(isSameAs(_, str)).getOrElse("")
val toNormalizedAddressUdf = udf(toNormalizedAddress(_))
val result2 = df2.withColumn("NormalizedAddress", toNormalizedAddressUdf('address))
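Note that normalizedAddress has to be collected to the driver before it can be broadcast, e.g. via df1.select("address").as[String].collect().toSeq (assuming the column is named address), and the 'address symbol syntax requires import spark.implicits._ to be in scope.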
The result is the same as for option 1, but result2.explain prints
== Physical Plan ==
LocalTableScan [address#10, NormalizedAddress#40]
This second option works if the amount of reference data is small enough to be broadcast. Depending on the cluster's hardware, some tens of thousands of lines of reference data would still be considered small.
Assuming the schema below (address: string), try this:
Load the data
val data1 =
"""Tempe, AZ, USA
|San Jose, CA, USA
|Mountain View, CA, USA""".stripMargin
val df1 = data1.split(System.lineSeparator()).toSeq.toDF("address")
df1.show(false)
/**
* +----------------------+
* |address |
* +----------------------+
* |Tempe, AZ, USA |
* |San Jose, CA, USA |
* |Mountain View, CA, USA|
* +----------------------+
*/
val data2 =
"""Tempe, AZ
|Tempe, Arizona
|San Jose, USA
|San Jose, CA
|Mountain View, CA""".stripMargin
val df2 = data2.split(System.lineSeparator()).toSeq.toDF("address")
df2.show(false)
/**
* +-----------------+
* |address |
* +-----------------+
* |Tempe, AZ |
* |Tempe, Arizona |
* |San Jose, USA |
* |San Jose, CA |
* |Mountain View, CA|
* +-----------------+
*/
Extract the joining key and join based on that
df1.withColumn("joiningKey", substring_index($"address", ",", 1))
.join(
df2.withColumn("joiningKey", substring_index($"address", ",", 1)),
"joiningKey"
)
.select(df1("address"), df2("address"))
.show(false)
/**
* +----------------------+-----------------+
* |address |address |
* +----------------------+-----------------+
* |Tempe, AZ, USA |Tempe, AZ |
* |Tempe, AZ, USA |Tempe, Arizona |
* |San Jose, CA, USA |San Jose, USA |
* |San Jose, CA, USA |San Jose, CA |
* |Mountain View, CA, USA|Mountain View, CA|
* +----------------------+-----------------+
*/
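Note that this joins on an exact match of the first comma-separated token (substring_index($"address", ",", 1) returns everything before the first comma), so it only pairs rows whose city part matches exactly; for fuzzier matches you would still need something like the isSameAs UDF from the question.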
Let's say I have a table of customers and each customer has an address. My task is to design an object model that allows to group the customers by similar address. Example:
John 123 Main St, #A; Los Angeles, CA 90032
Jane 92 N. Portland Ave, #1; Pasadena, CA 91107
Peter 92 N. Portland Avenue, #2; Pasadena, CA 91107
Lester 92 N Portland Av #4; Pasadena, CA 91107
Mark 123 Main Street, #C; Los Angeles, CA 90032
The query should somehow return:
1 Similar_Address_Key1
5 Similar_Address_Key1
2 Similar_Address_key2
3 Similar_Address_key2
4 Similar_Address_key2
What is the best way to accomplish this? Notice the addresses are NOT consistent (some addresses have "Avenue", others have "Av", and the apartment numbers differ). The existing name/address data cannot be corrected, so doing a GROUP BY (Address) on the table itself is out of the question.
I was thinking of adding a SIMILAR_ADDRESSES table that takes an address, evaluates it, and gives it a key, something like:
cust_key address similar_addr_key
1 123 Main St, #A; Los Angeles, CA 90032 1
2 92 N. Portland Ave, #1; Pasadena, CA 91107 2
3 92 N. Portland Avenue, #2; Pasadena, CA 91107 2
4 92 N. Portland Av #4; Pasadena, CA 91107 2
5 123 Main Street, #C; Los Angeles, CA 90032 1
Then group by the similar-address key. But the question is how to best accomplish the "evaluation" part. One way would be to normalize the addresses in the SIMILAR_ADDRESSES table so that they are consistent, ignoring things like apt, #, or suite, and assign a key to each exact match; a rough sketch of this idea follows below. Another approach I thought about was to feed the address to a geolocation service, save the latitude/longitude values to a table, and use those values to generate a similar-address key.
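For instance, something along these lines (a rough Python sketch; the abbreviation map and the unit-stripping rules here are assumptions, not a complete address normalizer):
import re

# hypothetical abbreviation map; a real one would cover many more street types
ABBREVIATIONS = {"st": "street", "ave": "avenue", "av": "avenue", "blvd": "boulevard"}

def normalize(address):
    addr = address.lower()
    addr = re.sub(r"#\s*\w+", " ", addr)                        # drop unit numbers like "#A" or "#12"
    addr = re.sub(r"\b(apt|suite|unit)\b\.?\s*\w*", " ", addr)  # drop spelled-out unit designators
    addr = re.sub(r"[;,.]", " ", addr)                          # drop remaining punctuation
    return " ".join(ABBREVIATIONS.get(t, t) for t in addr.split())

keys = {}
def similar_address_key(address):
    # assign the next integer key the first time a normalized form is seen
    return keys.setdefault(normalize(address), len(keys) + 1)
With the sample data above, this maps both "Main St"/"Main Street" entries to key 1 and all three "Portland" entries to key 2.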
Any ideas?
I have a PDB file; in short it looks a bit like this:
ATOM 1189 CA ILE A 172 4.067 0.764 -48.818 1.00 19.53 C
ATOM 1197 CA ATHR A 173 7.121 3.051 -48.711 0.50 17.77 C
ATOM 1198 CA BTHR A 173 7.198 2.978 -48.704 0.50 16.94 C
ATOM 1208 CA ALA A 174 7.797 2.124 -52.350 1.00 16.85 C
ATOM 1213 CA LEU A 175 4.431 3.707 -53.288 1.00 16.47 C
ATOM 1221 CA VAL A 176 4.498 6.885 -51.185 1.00 13.92 C
ATOM 1228 CA ARG A 177 6.418 10.059 -51.947 1.00 20.28 C
ATOM 1241 CA GLN B 23 -15.516 -2.515 13.305 1.00 32.36 C
ATOM 1250 CA ASP B 24 -12.740 -2.653 10.715 1.00 22.25 C
ATOM 1258 CA PHE B 25 -12.476 -2.459 6.886 1.00 19.17 C
ATOM 1269 CA TYR B 26 -12.886 -6.243 6.470 1.00 14.87 C
ATOM 1281 CA ASP B 27 -16.276 -6.196 8.222 1.00 18.01 C
ATOM 1289 CA PHE B 28 -17.998 -4.432 5.309 1.00 15.39 C
ATOM 1300 CA LYS B 29 -19.636 -5.878 2.191 1.00 14.46 C
ATOM 1309 CA ALA B 30 -19.587 -4.640 -1.378 1.00 15.26 C
ATOM 1314 CA VAL B 31 -21.000 -5.566 -4.753 1.00 16.26 C
What I want to do is get rid of the B's and keep the A's, and then get rid of everything but the 6th column.
grep ^ATOM 2p31protein.pdb | grep ' CA ' | grep ' A ' | cut -c23-27
This is what I have tried; it gets everything with ATOM and CA, which is what I want, and extracts the column I want, but it does not get rid of the B's.
This is more suited to awk:
$ awk '$1=="ATOM"&&$3=="CA"&&$5=="A"{print $6}' file
172
173
173
174
175
176
177
With awk you can do it more easily:
awk '$1=="ATOM" && $3=="CA" && $5=="A"{print $6}' your.pdb
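Note that residue 173 appears twice in the output because both of its alternate-location records (ATHR and BTHR) are on chain A. If you only want each residue number once, you can pipe the output through uniq, since the numbers are already in order.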