Join data frames with substring conditions on columns - dataframe

I am writing a function in Scala to fetch data for training an ML model.
I have a dataframe DF1 which has one column consisting of names.
Another dataframe DF2 consists of the columns [description, released, ... few more].
I want to create a dataframe DF3 which is a join of DF1 and DF2 on the condition that the name from DF1 appears in the description of DF2.
Example:
DF1
   name
0  John
1  Mike
2  Kara
DF2
   released  total  description
0      2006      5  This involved John and Kara who played role of......
1      2010    120  It is the latest release of Mike and Kara after almost decade...
DF3 [Expected output DF]
   name  released  total  description
0  John      2006      5  This involved John and Kara who played role of......
1  Kara      2006      5  This involved John and Kara who played role of......
2  Kara      2010    120  It is the latest release of Mike and Kara after almost decade...
3  Mike      2010    120  It is the latest release of Mike and Kara after almost decade...
I am trying to do a cross join to make all combinations, and then filter based on conditions on the name and description columns.
val DF3 = DF1.crossJoin(DF2).filter(col("name") in col("description"))
It seems there is no contains method available in Snowpark to do this.
Does anyone have an idea how to do it?

There are at least two solutions, but you should ask yourself some questions first:
Do you want to find a substring, or a whole word? E.g., should Karan match the name Kara?
Do you want to store or use the result dataframe in this shape? Maybe you want to store/use it in a more compact way, e.g., for each name store the indexes/positions of the matching rows of DF2.
You can test (on a big, real dataset) which one is faster and more suitable for you.
1st solution (via DataFrame)
import org.apache.spark.SparkContext
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

object Main extends App {
  case class Name(name: String)
  case class TextInfo(year: Int, month: Int, text: String)

  val spark: SparkSession = SparkSession.builder.config("spark.master", "local").getOrCreate()
  val sc: SparkContext = spark.sparkContext

  val namesDf: DataFrame = spark.createDataFrame(sc.parallelize(Seq("John", "Mike", "Kara").map(Name)))
  val textToSearchDf: DataFrame = spark.createDataFrame(sc.parallelize(Seq(
    TextInfo(2006, 5, "This involved John and Kara who played role of"),
    TextInfo(2010, 120, "It is the latest release of Mike and Kara after almost decade")
  )))

  // Cross join every text row with every name, then keep only the rows whose text contains the name.
  val resultDf: DataFrame = textToSearchDf.crossJoin(namesDf)
    .where(new Column($"text" contains $"name"))

  resultDf.foreach(println(_))
}
2nd solution, via RDD:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder.config("spark.master", "local").getOrCreate()
val sc: SparkContext = spark.sparkContext

val namesAsRdd: RDD[String] = sc.parallelize(Seq("John", "Mike", "Kara"))
val rddWithTextToSearch: RDD[(Int, Int, String)] = sc.parallelize(Seq(
  (2006, 5, "This involved John and Kara who played role of"),
  (2010, 120, "It is the latest release of Mike and Kara after almost decade")
))

// Collect the (small) set of names to the driver so it can be used inside flatMap.
val names: Set[String] = namesAsRdd.collect().toSet
val resultRdd: RDD[(String, Int, Int, String)] = rddWithTextToSearch.flatMap {
  case (year, month, str) => names.filter(name => str.contains(name)).map(name => (name, year, month, str))
}

resultRdd.foreach(println(_))

Related

plotly - remove or ignore "Non-leaves rows" for sunburst diagram

I have a DataFrame with some "Non-leaves rows" in it. Is there any way to get plotly to ignore them, or a way to automatically remove them?
Here's a sample DataFrame:
       0    1      2      3
0  Alice  Bob    NaN    NaN
1  Alice  Bob  Carol  David
2  Alice  Bob  Chuck  Delia
3  Alice  Bob  Chuck   Ella
4  Alice  Bob  Frank    NaN
In this case, I get the error Non-leaves rows are not permitted in the dataframe because the 0th row is not a distinct leaf.
I've tried using df = df.replace(np.NaN, pd.NA).where(df.notnull(), None) to put None in the empty cells, but the error persists.
Is there any way to have the non-leaves ignored? If not, is there a simple way to prune them? My real dataset is several thousand rows.
One approach is to reshape your dataframe into unique parent-child relations:
import pandas as pd
import plotly.express as px

cols = df.columns
# Stack each adjacent pair of columns into a two-column parent/child frame.
data = (
    pd.concat(
        [df[[i, j]].rename(columns={i: 'parents', j: 'childs'})
         for i, j in zip(cols[:-1], cols[1:])])
    .drop_duplicates()
)
fig = px.sunburst(data, names='childs', parents='parents')
fig.show()
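If the ragged rows leave NaN values in the reshaped frame (with the sample above, the pairs built from row 0 include empty cells), dropping those rows before plotting is a small extra step. A hedged sketch, not part of the original answer:
# Drop pairs with a missing child before plotting (assumes the 'childs' column built above).
data = data.dropna(subset=['childs'])
fig = px.sunburst(data, names='childs', parents='parents')
fig.show()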

Create a new column in Pandas dataframe by arbitrary function over rows

I have a Pandas dataframe. I want to add a column to the dataframe, where the value in the new column is dependent on other values in the row.
What is an efficient way to go about this?
Example
Begin
Start with this dataframe (let's call it df), and a dictionary of people's roles.
      first     last
--------------------
0       Jon  McSmith
1  Jennifer   Foobar
2       Dan  Raizman
3     Alden     Lowe
role_dict = {
    "Raizman": "sales",
    "McSmith": "analyst",
    "Foobar": "analyst",
    "Lowe": "designer"
}
End
We end with a dataframe where we've 'iterated' over each row, used the last name to look up values from our role_dict, and added that value to each row as role.
      first     last      role
------------------------------
0       Jon  McSmith   analyst
1  Jennifer   Foobar   analyst
2       Dan  Raizman     sales
3     Alden     Lowe  designer
One solution is to use the Series map function, since the roles are in a dictionary:
df['role'] = df.loc[:, 'last'].map(role_dict)
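If the new column depends on an arbitrary function of several row values (the general case the title asks about), apply with axis=1 works as well, though it is slower than map for a plain dictionary lookup. A minimal sketch reproducing the same lookup:
# Any per-row function can go here; this one just repeats the dictionary lookup.
df['role'] = df.apply(lambda row: role_dict.get(row['last']), axis=1)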
Try this using merge:
import pandas as pd

df = pd.DataFrame([["Jon", "McSmith"],
                   ["Jennifer", "Foobar"],
                   ["Dan", "Raizman"],
                   ["Alden", "Lowe"]], columns=["first", "last"])
role_dict = {
    "Raizman": "sales",
    "McSmith": "analyst",
    "Foobar": "analyst",
    "Lowe": "designer"
}
# Turn the dictionary into a two-column frame, then left-join on 'last'.
df_2 = pd.DataFrame(role_dict.items(), columns=["last", "role"])
result = pd.merge(df, df_2, on=["last"], how="left")
Output:
      first     last      role
0       Jon  McSmith   analyst
1  Jennifer   Foobar   analyst
2       Dan  Raizman     sales
3     Alden     Lowe  designer

Pandas split ages by group

I'm quite new to pandas and need a bit of help. I have a column with ages and need to make groups of these:
Young people: age ≤ 30
Middle-aged people: 30 < age ≤ 60
Old people: 60 < age
Here is the code, but it gives me an error:
def get_num_people_by_age_category(dataframe):
    young, middle_aged, old = (0, 0, 0)
    dataframe["age"] = pd.cut(x=dataframe['age'], bins=[30,31,60,61], labels=["young","middle_aged","old"])
    return young, middle_aged, old

ages = get_num_people_by_age_category(dataframe)
print(dataframe)
The code below gets the age groups using pd.cut(). With the default right=True, bins=[0, 30, 60, 100] produces the intervals (0, 30], (30, 60] and (60, 100], which match the requested boundaries (ages above 100 would need a larger upper bin edge).
# Import libraries
import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'age': [1,20,30,31,50,60,61,80,90]  # np.random.randint(1,100,50)
})

# Function: copy-pasted from the question and modified
def get_num_people_by_age_category(df):
    df["age_group"] = pd.cut(x=df['age'], bins=[0,30,60,100], labels=["young","middle_aged","old"])
    return df

# Call function
df = get_num_people_by_age_category(df)
Output
print(df)

   age    age_group
0    1        young
1   20        young
2   30        young
3   31  middle_aged
4   50  middle_aged
5   60  middle_aged
6   61          old
7   80          old
8   90          old
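If you also need the three counts the original function was supposed to return, value_counts() on the new column gives them. A minimal sketch building on the df above:
counts = df['age_group'].value_counts()
young, middle_aged, old = counts['young'], counts['middle_aged'], counts['old']
print(young, middle_aged, old)  # 3 3 3 for the sample data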

Pandas Merge function only giving column headers - Update

What I want to achieve.
I have two data frames. DF1 and DF2. Both are being read from different excel file.
DF1 has 9 columns and 3000 rows, of which one column is named "Code Group".
DF2 has 2 columns and 20 rows, of which one column is also named "Code Group". In the same dataframe, another column "Code Management Method" gives the explanation of the code group, e.g. H001 is Treated as recyclable, H002 is Treated as landfill.
What happens
When I use the command data = pd.merge(DF1,DF2, on='Code Group') I only get 10 column names but no rows underneath.
What I expect
I want DF1 and DF2 to be merged so that, wherever the Code Group number is the same, the Code Management Method is attached as the explanation.
Additional information
Following are the datatypes for DF1:
Entity object
Address object
State object
Site object
Disposal Facility object
Pounds float64
Waste Description object
Shipment Date datetime64[ns]
Code Group object
Following are the datatypes for DF2:
Code Management Method object
Code Group object
What I tried
I tried to follow the suggestions from similar posts on SO that the datatypes on both sides should be the same; Code Group is an object in both, so I don't know what the issue is. I also tried the concat function.
Code
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile

CH = "C:\Python\Waste\Shipment.xls"
Code = "C:\Python\Waste\Code.xlsx"
Data = pd.read_excel(Code)
data1 = pd.read_excel(CH)
data1.rename(columns={'generator_name': 'Entity', 'generator_address': 'Address',
                      'Site_City': 'Site', 'final_disposal_facility_name': 'Disposal Facility',
                      'wst_dscrpn': 'Waste Description', 'drum_wgt': 'Pounds',
                      'genrtr_sgntr_dt': 'Shipment Date', 'generator_state': 'State',
                      'expected_disposal_management_methodcode': 'Code Group'},
            inplace=True)
data2 = data1[['Entity','Address','State','Site','Disposal Facility','Pounds','Waste Description','Shipment Date','Code Group']]
data2
merged = data2.merge(Data, on='Code Group')
Getting a Warning
C:\Anaconda\lib\site-packages\pandas\core\generic.py:5890: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._update_inplace(new_data)
import pandas as pd

df1 = pd.DataFrame({'Region': [1,2,3],
                    'zipcode': [12345,23456,34567]})
df2 = pd.DataFrame({'ZipCodeLowerBound': [10000,20000,30000],
                    'ZipCodeUpperBound': [19999,29999,39999],
                    'Region': [1,2,3]})
df1.merge(df2, on='Region')
This is how the example is set up; the two input frames are:
   Region  zipcode
0       1    12345
1       2    23456
2       3    34567

   Region  ZipCodeLowerBound  ZipCodeUpperBound
0       1              10000              19999
1       2              20000              29999
2       3              30000              39999
and the merge will result in:
   Region  zipcode  ZipCodeLowerBound  ZipCodeUpperBound
0       1    12345              10000              19999
1       2    23456              20000              29999
2       3    34567              30000              39999
I hope this is what you want to do.
After multiple tries I found that the column had some garbage (stray whitespace), so I used the code below and it worked perfectly. The funny thing is I never encountered the problem on two other datasets that I imported from Excel files.
data2['Code Group'] = data2['Code Group'].str.strip()
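If the lookup table's key column carries the same stray whitespace, stripping both sides before merging avoids a silently empty result. A hedged sketch reusing the frame names from the question's code:
# Strip the join key in both frames (Data comes from Code.xlsx in the question's code).
data2['Code Group'] = data2['Code Group'].str.strip()
Data['Code Group'] = Data['Code Group'].str.strip()
merged = data2.merge(Data, on='Code Group', how='left')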

Create a pandas DataFrame from multiple dicts [duplicate]

This question already has answers here:
Convert list of dictionaries to a pandas DataFrame
(7 answers)
Closed 4 years ago.
I'm new to pandas and this is my first question on Stack Overflow; I'm trying to do some analytics with pandas.
I have some text files with data records that I want to process. Each line of a file corresponds to a record whose fields are at fixed positions with fixed lengths. There are different kinds of records in the same file; all records share the first field, two characters identifying the type of record. As an example:
Some file:
01Jhon Smith 555-1234
03Cow Bos primigenius taurus 00401
01Jannette Jhonson 00100000000
...
field    start  length
type     1      2       * common to all records, e.g. 01 = person, 03 = animal
name     3      10
surname  13     10
phone    23     8
credit   31     11
(the rest of the line is filled with spaces)
I'm writing some code to convert one record to a dictionary:
person1 = {'type': 1, 'name': 'Jhon', 'surname': 'Smith', 'phone': '555-1234'}
person2 = {'type': 1, 'name': 'Jannette', 'surname': 'Jhonson', 'credit': 1000000.00}
animal1 = {'type': 3, 'cname': 'cow', 'sciname': 'Bos....', 'legs': 4, 'tails': 1}
If a field is empty (filled with spaces), it will not be in the dictionary.
With all the records of one kind I want to create a pandas DataFrame with the dict keys as column names; I've tried pandas.DataFrame.from_dict() without success.
And here comes my question: is there any way to do this with pandas so dict keys become column names? Is there any other standard method to deal with this kind of file?
To make a DataFrame from a dictionary, you can pass a list of dictionaries:
>>> import pandas as pd
>>> person1 = {'type': 1, 'name': 'Jhon', 'surname': 'Smith', 'phone': '555-1234'}
>>> person2 = {'type': 1, 'name': 'Jannette', 'surname': 'Jhonson', 'credit': 1000000.00}
>>> animal1 = {'type': 3, 'cname': 'cow', 'sciname': 'Bos....', 'legs': 4, 'tails': 1}
>>> pd.DataFrame([person1])
name phone surname type
0 Jhon 555-1234 Smith 1
>>> pd.DataFrame([person1, person2])
credit name phone surname type
0 NaN Jhon 555-1234 Smith 1
1 1000000 Jannette NaN Jhonson 1
>>> pd.DataFrame.from_dict([person1, person2])
credit name phone surname type
0 NaN Jhon 555-1234 Smith 1
1 1000000 Jannette NaN Jhonson 1
For the more fundamental issue of differently-formatted record types intermixed in one file, and assuming the files aren't so big that we can't read and store them in memory, I'd use StringIO to make an object which is sort of like a file but which contains only the lines we want, and then use read_fwf (fixed-width file). For example:
from io import StringIO  # Python 3; on Python 2 this was: from StringIO import StringIO

def get_filelike_object(filename, line_prefix):
    # Collect only the lines that start with the given record-type prefix.
    s = StringIO()
    with open(filename, "r") as fp:
        for line in fp:
            if line.startswith(line_prefix):
                s.write(line)
    s.seek(0)
    return s
and then
>>> type01 = get_filelike_object("animal.dat", "01")
>>> df = pd.read_fwf(type01, names="type name surname phone credit".split(),
...                  widths=[2, 10, 10, 8, 11], header=None)
>>> df
type name surname phone credit
0 1 Jhon Smith 555-1234 NaN
1 1 Jannette Jhonson NaN 100000000
should work. Of course you could also separate the files into different types before pandas ever sees them, which might be easiest of all.
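A minimal sketch of that pre-splitting idea, with hypothetical output file names built from the 01/03 type codes in the question:
# Route each fixed-width record to a per-type file based on its two-character prefix.
with open("animal.dat") as src, \
     open("persons.dat", "w") as persons, \
     open("animals.dat", "w") as animals:
    for line in src:
        if line.startswith("01"):
            persons.write(line)
        elif line.startswith("03"):
            animals.write(line)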