difflib.get_close_matches is not giving any output when comparing 2 columns in a pandas dataframe

I have 2 pandas dataframes, one with clean city names (df2) and another with unclean city names (df1).
Sample values:
df2.city_name: bangalore
df1.city_name: bongolor
I tried the code below to get the closest match for each city name:
import difflib

city_names = df2.city_name.to_list()
for i in df1.city_name:
    difflib.get_close_matches(i, city_names)
This is running for a very long time (more than an hour, so I stopped it).
I tried fuzzywuzzy's process as well; please find it below:
from fuzzywuzzy import process

list1 = df1.city_name.to_list()
list2 = df2.city_name.to_list()
mat1 = []
for i in list1:
    mat1.append(process.extract(i, list2, limit=10))
df1['match'] = mat1
This was also taking a very long time so I killed it.
Is there an optimized way to compare the column values and get the closest city name?
Note: df1.city_name has 3.3M values and df2.city_name has 2.7k.
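One approach that should cut the runtime dramatically (a sketch, not from the original thread): with 3.3M rows there are usually far fewer unique unclean names, so match each unique name once and map the results back onto the column. The names df1, df2, and city_name are taken from the question.

import difflib

clean_names = df2.city_name.to_list()

# Match each *unique* unclean name once instead of once per row
lookup = {
    name: (difflib.get_close_matches(name, clean_names, n=1) or [None])[0]
    for name in df1.city_name.unique()
}
df1['match'] = df1.city_name.map(lookup)

If the number of unique names is still large, the rapidfuzz library is a much faster drop-in replacement for fuzzywuzzy.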

Related

Query returns values that don't exist in PySpark Dataframe

Is there a way to create a subset dataframe from a dataframe and be sure that its values will be used afterward?
I have a huge PySpark Dataframe like this (simplified example):
id | timestamp  | value
 1 | 1658919602 |     5
 1 | 1658919604 |     9
 2 | 1658919632 |     2
Now I want to take a sample from it to test something, before running on the entire Dataframe. I get a sample by:
# Big dataframe
df = ...
# Create sample
df_sample = df.limit(10)
df_sample.show() shows some values.
Then I run this command; sometimes it returns values that are present in df_sample, and sometimes it returns values that are not present in df_sample but are present in df.
df_temp = df_sample.sort(F.desc('timestamp')).groupBy('id').agg(F.collect_list('value').alias('newcol'))
As if it's not using df_sample but is instead picking 10 rows from df in a non-deterministic way.
Interestingly, if I run df_sample.show() afterwards, it shows the same values as when it was first called.
Why is this happening?
Here's full code:
from pyspark.sql import functions as F

# Big dataframe
df = ...
# Create sample
df_sample = df.limit(10)
# shows some values
df_sample.show()
# run query
df_temp = df_sample.sort(F.desc('timestamp')).groupBy('id').agg(F.collect_list('value').alias('newcol'))
# df_temp sometimes shows values that are present in df_sample, but sometimes shows values that aren't present in df_sample but in df
df_temp.show()
# Shows the exact same values as when it was first called
df_sample.show()
Edit1: I understand that Spark is lazy, but is there any way to force it to not be lazy in this scenario?
We can use the sample function provided by Spark to achieve this. Every time you run sample() it returns a different set of records. To regenerate the same sample on every run, for example to compare against the results from a previous run, pass the same seed value each time.
df = spark.range(100)
# Execute first time
print(df.sample(0.1, 123).collect())
# Execute a second time with the same seed (123)
print(df.sample(0.1, 123).collect())
# Execute with a different seed (456)
print(df.sample(0.1, 456).collect())
Refer to the Spark docs on sampling and stratified sampling in Spark.
What worked was using df_sample = df.limit(10).cache() or df_sample = df.limit(10).persist(). Samkart's comment pointed me in this direction.
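A minimal sketch of that fix, reusing the names from the question (df and F as above):

from pyspark.sql import functions as F

# cache() pins the 10 sampled rows, so later queries reuse the same sample
df_sample = df.limit(10).cache()
df_sample.show()  # first action; populates the cache

df_temp = (df_sample
           .sort(F.desc('timestamp'))
           .groupBy('id')
           .agg(F.collect_list('value').alias('newcol')))
df_temp.show()  # now consistent with df_sample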

Matching an element in a column to others in the same column

I have columns taken from Excel as a dataframe; the columns are as follows:
HolidayTourProvider|Packages|Meals|Accommodation|LocalTravelVehicle|Cancellationfee
HolidayTourProvider has a couple of company names.
In Packages, the features provided by each package are mostly the same, like Meals, Accommodation, etc., even though one company may call a package "Saver" while another calls it "Budget". Most columns hold Yes/No values, except LocalTravelVehicle, which holds car names like Ford Taurus or Jeep Cherokee, and Cancellationfee, which holds integers.
I need to write a function like
match(HolidayTP, Package)
where the user can give input like
match(AdventureLife, Luxury)
and I need to return all the packages by other holiday tour providers that have features similar to Luxury, no matter what those providers call the package: 'Semi Lux', 'Comfort', etc.
I want to keep a counter of matching features for every package and display all the packages whose counter exceeds 3 or 4.
This is my first python code. I am stuck here.
fb is the full dataframe I exported the Excel data into.
def mapHol(HTP, PACKAGE):
    mfb = (fb['HTP'] == HTP) & (fb['package'] == package)
    B = fb[mfb]
    for i in fb[i]:
        for j in B[j]:
            if fb[i] == B[j]:
                count += 1
I don't know how to proceed. Please help me; this is my first major project, and I started it on my own.
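The thread shows no answer here, but as a sketch of one way to proceed, assuming the column names from the header above; the feature list below is an assumption to adjust to the actual sheet:

def map_hol(fb, htp, package, min_matches=3):
    # Feature columns to compare (assumed names; adjust to your sheet)
    features = ['Meals', 'Accommodation', 'LocalTravelVehicle']
    # The reference package offered by the given provider
    ref = fb[(fb['HolidayTourProvider'] == htp) & (fb['Packages'] == package)]
    if ref.empty:
        return ref  # no such provider/package combination
    ref_row = ref.iloc[0]
    # Count matching features for every package from other providers
    others = fb[fb['HolidayTourProvider'] != htp].copy()
    others['match_count'] = (others[features] == ref_row[features]).sum(axis=1)
    # Keep only packages that match on at least min_matches features
    return others[others['match_count'] >= min_matches]

Calling map_hol(fb, 'AdventureLife', 'Luxury') would then return every competing package sharing at least three features with Luxury, regardless of its name.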

How to stop Jupyter outputting truncated results when using pd.Series.value_counts()?

I have a DataFrame and I want to display the frequencies for certain values in a certain Series using pd.Series.value_counts().
The problem is that I only see truncated results in the output. I'm coding in Jupyter Notebook.
I have unsuccessfully tried a couple of methods:
df = pd.DataFrame(...) # assume df is a DataFrame with many columns and rows
# 1st method
df.col1.value_counts()
# 2nd method
print(df.col1.value_counts())
# 3rd method
vals = df.col1.value_counts()
vals  # print(vals) doesn't work either
# All output something like this
value1 100000
value2 10000
...
value1000 1
Currently this is what I'm using, but it's quite cumbersome:
print(df.col1.value_counts()[:50])
print(df.col1.value_counts()[50:100])
print(df.col1.value_counts()[100:150])
# etc.
Also, I have read this related Stack Overflow question, but haven't found it helpful.
So how to stop outputting truncated results?
If you want to print all rows:
pd.options.display.max_rows = 1000
print(vals)
If you want to print all rows just this once (the option resets afterwards):
with pd.option_context("display.max_rows", 1000):
    print(vals)
Relevant documentation here.
I think you need option_context, set to some large number, e.g. 999. The advantage of this solution is:
option_context context manager has been exposed through the top-level API, allowing you to execute code with given option values. Option values are restored automatically when you exit the with block.
# temporarily display up to 999 rows
with pd.option_context('display.max_rows', 999):
    print(df.col1.value_counts())
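As an aside not in the original answers: display.max_rows also accepts None, which removes the cap entirely instead of guessing a large enough number:

# None lifts the row limit for the duration of the block
with pd.option_context('display.max_rows', None):
    print(df.col1.value_counts())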

Is there a way to speed up this webscraping iteration? Pandas

So I'm collecting data on a list of stocks and putting all that info into a dataframe. The list has about 700 stocks.
import pandas as pd

stock = ['adma', 'aapl', 'fb']  # the full list has about 700 stocks, extracted from a pickled dataframe
# The site I'm visiting is below, with the name of the stock appended to the end of the link:
# http://finviz.com/quote.ashx?t=adma
# http://finviz.com/quote.ashx?t=aapl
I'm just extracting one portion of that site, as evident from the [-2] in the code below:
df2 = pd.DataFrame()
for i in stock:
    df = pd.read_html('http://finviz.com/quote.ashx?t={}'.format(i), header=0)[-2].set_index('SEC Form 4')
    df['Stock'] = i.upper()  # column with the name of the stock, to differentiate between stocks
    df2 = df2.append(df)
Each iteration feels like it takes a few seconds, and I have around 700 to go through at the moment. It's not terribly slow, but I was just curious if there is a more efficient method. Thanks.
Your current code is blocking: you don't proceed with retrieving information from the next URL until you are done with the current one. Instead, you can switch to, for example, Scrapy, which is based on Twisted and works asynchronously, processing multiple pages at the same time.
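If adopting a whole framework feels heavy, a lighter-weight alternative (a sketch, not from the original answer) is to parallelize the existing pd.read_html calls with a thread pool, since most of the time is spent waiting on the network; max_workers=20 is a guess to tune:

import pandas as pd
from concurrent.futures import ThreadPoolExecutor

stock = ['adma', 'aapl', 'fb']  # ~700 tickers in practice

def fetch(ticker):
    # Same extraction as the original loop, one ticker per call
    df = pd.read_html('http://finviz.com/quote.ashx?t={}'.format(ticker),
                      header=0)[-2].set_index('SEC Form 4')
    df['Stock'] = ticker.upper()
    return df

# Threads overlap the network waits instead of fetching one page at a time
with ThreadPoolExecutor(max_workers=20) as pool:
    frames = list(pool.map(fetch, stock))

df2 = pd.concat(frames)

Collecting the frames and calling pd.concat once also avoids the quadratic cost of appending to df2 inside the loop.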

Run-time error for stock maximize ... HackerRank

I am getting a correct answer on my compiler, but I am getting a run-time error on HackerRank for my solution to the Stock Maximize problem. I am new to Python, so I am having difficulty removing the error. Inputs are of this form:
1 //no of test cases
3 //no of stocks
5 3 2 //cost of stocks
I think the error is in taking the input as 5 3 2 (all on one line). If I take it as
5
3
2
(one value on each line), then it works fine. How can I fix this problem?
t = int(input())
list = []
while t > 0:
    n = int(input())
    list.clear()
    for i in range(0, n):
        list.append(int(input()))
    sum = 0
    print('hello')
    max = list[n-1]
    for i in range(n-2, -1, -1):
        if list[i] < max:
            sum = sum + (max - list[i])
        else:
            max = list[i]
    print(sum)
    t = t - 1
You can read the cost-of-stocks line like so in Python 3:
stockcost=[int(k) for k in input().split()]
This will create a list of stock prices.
Instead of trying to read in three lines of stock costs when there is actually only one line of three space-separated costs, you need to read in that one line and split it into a list of integers, for example like this (since it looks like you're using Python 3):
stocks = list(map(int, input().split(" ")))
With your example, this would give you the list [5, 3, 2].
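Putting that fix back into the original program, a sketch of the full loop (same algorithm as the question's code, with the shadowed built-in names list, sum, and max renamed):

t = int(input())
while t > 0:
    n = int(input())
    # All n costs arrive on one space-separated line
    prices = list(map(int, input().split()))
    total = 0
    best = prices[n - 1]
    # Walk backwards, selling each cheaper day at the best later price
    for i in range(n - 2, -1, -1):
        if prices[i] < best:
            total += best - prices[i]
        else:
            best = prices[i]
    print(total)
    t -= 1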