How to extract link's href from a search result? - beautifulsoup

I tried the code below, but I'm unable to get the result links...
import requests
from requests import get
from bs4 import BeautifulSoup
url = "https://www.immoweb.be/en/search/house-and-apartment/for-sale?countries=BE&page=1&orderBy=relevance"
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")
links = []
for el in soup.find_all('li', {'class': 'card__title-link'}):
    links.append(el.find('a').get('href'))
links
[]

The links are constructed dynamically from the JSON data embedded in the page. To print them, you can do, for example:
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.immoweb.be/en/search/house-and-apartment/for-sale?countries=BE&page=1&orderBy=relevance'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
data = json.loads(soup.find('iw-search')[':results'])
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for d in data:
    print(d['property']['title'])
    print('https://www.immoweb.be/en/classified/{}'.format(d['id']))
    print('-' * 80)
Prints:
EXCEPTIONAL APARTMENT IN PRIVATE DOMAIN
https://www.immoweb.be/en/classified/8917294
--------------------------------------------------------------------------------
Proche du parvis Saint-Pierre
https://www.immoweb.be/en/classified/8892312
--------------------------------------------------------------------------------
niché au coeur d'un parc privatif proche du parvis Saint-Pie
https://www.immoweb.be/en/classified/8892319
--------------------------------------------------------------------------------
niché au coeur d'un parc privatif proche du parvis Saint-Pie
https://www.immoweb.be/en/classified/8892317
--------------------------------------------------------------------------------
Entre la Maison Communale et "La Rasante" - 2 Logements
https://www.immoweb.be/en/classified/8856902
--------------------------------------------------------------------------------
Very nice apartment of 87 m² (terrace 13 m²)
https://www.immoweb.be/en/classified/8899851
--------------------------------------------------------------------------------
au coeur d'un parc privatif proche du parvis Saint-Pierre, b
https://www.immoweb.be/en/classified/8892306
--------------------------------------------------------------------------------
New Loft/Apartment - one bedroom + terrace
https://www.immoweb.be/en/classified/8904631
--------------------------------------------------------------------------------
Uccle | Studio, apartments 1-2-3 rooms. & villas
https://www.immoweb.be/en/classified/8917208
--------------------------------------------------------------------------------
New 3 bedrooms apartment + parking
https://www.immoweb.be/en/classified/8914953
--------------------------------------------------------------------------------
LAST OPPORTUNITY and new conditions ! In this penthouse you
https://www.immoweb.be/en/classified/8909856
--------------------------------------------------------------------------------
PENTHOUSE 150m² 3 bedrooms + office room with large TERRACE
https://www.immoweb.be/en/classified/8897103
--------------------------------------------------------------------------------
Wonen in Sint-Niklaas, leven in de stad en tevens genieten v
https://www.immoweb.be/en/classified/8904372
--------------------------------------------------------------------------------
WONEN AAN DE VREDESBRUG
https://www.immoweb.be/en/classified/8910704
--------------------------------------------------------------------------------
Nouvelle résidence centre-ville de Huy
https://www.immoweb.be/en/classified/8876897
--------------------------------------------------------------------------------
Magnifique immeuble neuf de 17 entités !
https://www.immoweb.be/en/classified/6909169
--------------------------------------------------------------------------------
PROJET IMMOBILIER DE STANDING - INTRA MUROS - PRIX DE LANCEM
https://www.immoweb.be/en/classified/7149936
--------------------------------------------------------------------------------
4 nouvelles constructions à découvrir proche du centre-ville
https://www.immoweb.be/en/classified/7171388
--------------------------------------------------------------------------------
Apartments For Sale
https://www.immoweb.be/en/classified/8792346
--------------------------------------------------------------------------------
24 assistentiewoningen en 2 commerciële ruimten
https://www.immoweb.be/en/classified/7089514
--------------------------------------------------------------------------------
Gerenoveerd 3 slaapkamer appartement
https://www.immoweb.be/en/classified/8912931
--------------------------------------------------------------------------------
Nieuwbouwproject Dockside Gardens - Gent
https://www.immoweb.be/en/classified/8903968
--------------------------------------------------------------------------------
Projet neuf de 12 appartements de standing
https://www.immoweb.be/en/classified/8717659
--------------------------------------------------------------------------------
Architecturaal hoogstaand nieuwbouwproject
https://www.immoweb.be/en/classified/7126544
--------------------------------------------------------------------------------
2nd Phase of the magnificent project at the best value for m
https://www.immoweb.be/en/classified/8098994
--------------------------------------------------------------------------------
Modern en rustig wonen in Tildonk
https://www.immoweb.be/en/classified/7128367
--------------------------------------------------------------------------------
Next to Woluwé's shopping, beautiful project offering views
https://www.immoweb.be/en/classified/8577833
--------------------------------------------------------------------------------
Modern apartment, design furnished and completely equiped
https://www.immoweb.be/en/classified/8866349
--------------------------------------------------------------------------------
WOLUWE - APPARTMENT 3 BEDROOMS + PARKING POSSIBLE
https://www.immoweb.be/en/classified/8871919
--------------------------------------------------------------------------------
Nieuwbouwproject Dunant Gardens - Gent
https://www.immoweb.be/en/classified/8837221
--------------------------------------------------------------------------------
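The same idea can be tried offline. This is a minimal sketch using only the standard library, so it runs without network access; the HTML snippet and the two records are made-up stand-ins for the real page, and classified_links is a hypothetical helper name. It assumes, as in the answer above, that the JSON sits in the ":results" attribute of an iw-search tag.

```python
import json
from html import escape
from html.parser import HTMLParser

# Made-up records that mimic the shape used above (id + property.title).
records = [
    {"id": 101, "property": {"title": "Sample apartment"}},
    {"id": 102, "property": {"title": "Sample house"}},
]

# Simplified stand-in for the page: the JSON sits in the ":results"
# attribute of an <iw-search> tag, and there are no <a> tags to find.
SAMPLE_HTML = '<iw-search :results="{}"></iw-search>'.format(
    escape(json.dumps(records), quote=True)
)


class ResultsAttrParser(HTMLParser):
    """Collects the decoded JSON from <iw-search :results="...">."""

    def __init__(self):
        super().__init__()
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "iw-search":
            raw = dict(attrs).get(":results")
            if raw:
                self.results = json.loads(raw)


def classified_links(html):
    """Build one listing URL per record found in the embedded JSON."""
    parser = ResultsAttrParser()
    parser.feed(html)
    return [
        "https://www.immoweb.be/en/classified/{}".format(d["id"])
        for d in parser.results
    ]


for link in classified_links(SAMPLE_HTML):
    print(link)
```

Searching the same snippet for li or a elements finds nothing, which is why the loop in the question ends with an empty list.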


extracting year from string using regexp_extract pyspark

This is a portion of my data:
Grumpier Old Men (1995)
Death Note: Desu nôto (2006–2007)
Irwin & Fran 2013
9500 Liberty (2009)
Captive Women (1000 Years from Now) (3000 A.D.) (1952)
The Garden of Afflictions 2017
The Naked Truth (1957) (Your Past Is Showing)
Conquest 1453 (Fetih 1453) (2012)
Commune, La (Paris, 1871) (2000)
1013 Briar Lane
It returns:
1995
2006
2013
2009
1952
2017
1957
1453<--
1871<--
<-- this entry (for the last title) is empty, and it is supposed to be empty
As you can see above, the two titles marked with <-- (1453 and 1871) are given the wrong result.
This is my code:
import pyspark.sql.functions as F
from pyspark.sql.functions import regexp_extract,col
bracket_regexp = "((?<=\()\d{4}(?=[^\(]*$))"
movies_DF=movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))
movies_DF.display(10000)
I am trying to get the year portion of the title string.
You can try using the following regex: r'(?<=\()(\d+)(?=\))', which is inspired by this excellent answer.
For example:
movies_DF = movies_DF.withColumn('uu', regexp_extract(col("title"), r'(?<=\()(\d+)(?=\))',1))
+------------------------------------------------------------+----+
|title |uu |
+------------------------------------------------------------+----+
|Grumpier Old Men (1995) |1995|
|Happy Anniversary (1959) |1959|
|Paths (2017) |2017|
|The Three Amigos - Outrageous! (2003) |2003|
|L'obsession de l'or (1906) |1906|
|Babe Ruth Story, The (1948) |1948|
|11'0901 - September 11 (2002) |2002|
|Blood Trails (2006) |2006|
|Return to the 36th Chamber (Shao Lin da peng da shi) (1980) |1980|
|Off and Running (2009) |2009|
+------------------------------------------------------------+----+
Empirically, the following regex pattern seems to be working:
(?<=[( ])\d{4}(?=\S*\)|$)
Here is a working regex demo.
Updated PySpark code:
bracket_regexp = "((?<=[( ])\d{4}(?=\S*\)|$))"
movies_DF = movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))
movies_DF.display(10000)
The regex pattern works by matching:
(?<=[( ]) assert that what precedes is ( or a space
\d{4} match a 4 digit year
(?=\S*\)|$) assert that ), possibly prefaced by non whitespace, follows
OR the end of the string follows
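The breakdown above can be checked locally with Python's re module before running it in Spark. This is only a sketch: Spark evaluates the pattern with Java's regex engine, and the assumption here is that both engines treat these lookaround constructs the same way. first_year is a hypothetical helper that returns an empty string when nothing matches, mirroring regexp_extract.

```python
import re

# Pattern from the answer above, written as a Python raw string.
YEAR_PATTERN = re.compile(r"(?<=[( ])\d{4}(?=\S*\)|$)")


def first_year(title):
    """Return the first match of the pattern, or '' when nothing matches."""
    match = YEAR_PATTERN.search(title)
    return match.group(0) if match else ""


titles = [
    "Grumpier Old Men (1995)",
    "Death Note: Desu nôto (2006–2007)",
    "Irwin & Fran 2013",
    "1013 Briar Lane",
    "Conquest 1453 (Fetih 1453) (2012)",
]
for title in titles:
    print(repr(first_year(title)), "<-", title)
```

On these titles the first four give 1995, 2006, 2013 and an empty string. The last one gives 1453: the space before the second 1453 satisfies the lookbehind and the ) right after it satisfies the lookahead, so titles with a space-preceded year inside brackets are worth testing separately.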
Your regex can only work for the first line. \(\d{4}\) tries to match a (, 4 digits and a ). For the first line you have (1995) which is alright. The other lines do not contain that pattern.
In your situation, we can use lookbehind and lookahead patterns to detect years within brackets. (?<=\() means an opening bracket comes before. (?=–|(–)|\)) means a closing bracket comes after, or an en dash (–); the dash is listed twice because one of the two alternatives stood for the misencoded form of that character in the original data. Once the years between brackets are covered, you can cover years at the end of the string without brackets: \d{4}$.
import pyspark.sql.functions as F
# a bracketed year first, then a bare year at the end of the title
bracket_regexp = "((?<=\()\d{4}(?=–|(–)|\)))"
movies_DF\
    .withColumn('yearOfRelease', F.regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))\
    .show(truncate=False)
+------------------------------------------------------+-------------+
|title |yearOfRelease|
+------------------------------------------------------+-------------+
|Grumpier Old Men (1995) |1995 |
|Death Note: Desu nôto (2006–2007) |2006 |
|Irwin & Fran 2013 |2013 |
|9500 Liberty (2009) |2009 |
|test 1234 test 4567 |4567 |
|Captive Women (1000 Years from Now) (3000 A.D.) (1952)|1952 |
|The Garden of Afflictions 2017 |2017 |
|The Naked Truth (1957) (Your Past Is Showing) |1957 |
|Conquest 1453 (Fetih 1453) (2012) |2012 |
|Commune, La (Paris, 1871) (2000) |2000 |
|1013 Briar Lane | |
+------------------------------------------------------+-------------+
Also you do not need to prefix the string with r when you pass a regex to a spark function.
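The table above can be reproduced without a Spark session. The sketch below applies the same two alternatives with Python's re module; it assumes Python's engine and Spark's Java regex engine agree on these constructs, and extract_year is a hypothetical helper that returns an empty string when nothing matches.

```python
import re

# Same two alternatives as above: a bracketed year, or a year at the very end.
BRACKET_REGEXP = r"((?<=\()\d{4}(?=–|(–)|\)))"
YEAR_PATTERN = re.compile(BRACKET_REGEXP + r"|(\d{4}$)")


def extract_year(title):
    """Return the whole first match (group 0), or '' when nothing matches."""
    match = YEAR_PATTERN.search(title)
    return match.group(0) if match else ""


# Titles and expected years copied from the table above.
expected = {
    "Grumpier Old Men (1995)": "1995",
    "Death Note: Desu nôto (2006–2007)": "2006",
    "Irwin & Fran 2013": "2013",
    "9500 Liberty (2009)": "2009",
    "test 1234 test 4567": "4567",
    "Captive Women (1000 Years from Now) (3000 A.D.) (1952)": "1952",
    "The Garden of Afflictions 2017": "2017",
    "The Naked Truth (1957) (Your Past Is Showing)": "1957",
    "Conquest 1453 (Fetih 1453) (2012)": "2012",
    "Commune, La (Paris, 1871) (2000)": "2000",
    "1013 Briar Lane": "",
}
for title, year in expected.items():
    status = "ok" if extract_year(title) == year else "MISMATCH"
    print(status, repr(extract_year(title)), "<-", title)
```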
Here is a regexp that would work:
df = df.withColumn("year", F.regexp_extract("title", "(?:[\s\(])(\d{4})(?:[–\)])?", 1))
Definitely overkill for the examples you provide, but I want to avoid capturing, e.g., other numbers in the titles. Also, your regexp does not work because not all years are surrounded by brackets in your examples, and sometimes you have non-numeric characters inside the brackets.

Counting Distinct words AND average time in Pandas

I'm working on analysing some text from a Twitter API using pandas. This will eventually be visualized.
For reference, df.head() of my dataset is:
Count User Time Tweet
0 0 x 2022 ✔️Nécessité de maintien d’une filière 🇪🇺 dynam...
1 1 x 2022 Échanges approfondis à #Dakar avec le Premier ...
2 2 x 2022 ✔️Approvisionnement en #céréales & #engrai...
3 3 x 2022 Aujourd’hui à Tambacounda, à l’Est du Sénégal,...
4 4 x 2022 Working hard since 2019 to reinforce EU #auton...
I'm looking to return the distinct word count along with the average time of the tweets in which each word was used.
Right now, I've been getting the distinct word count of my dataset using df.Tweet.str.split(expand=True).stack().value_counts().
This is useful, returning:
the 1505
de 1500
to 1168
RT 931
of 906
...
africain, 1
langue 1
Félicitations! 1
Length: 18071, dtype: int64
However, I want to also analyse text usage over time.
I'm not super experienced so I'm wondering if there is a way to use a function such as df.groupby() to sort this result by time? Or, is there a way to modify my original function to add a column to my results that includes average time?
I would use str.extractall to get the words, join the Time, then perform a groupby.value_counts to get the count per year:
out = (df['Tweet']
       .str.extractall(r'(\S+)')
       .droplevel('match')
       .join(df['Time'])
       .groupby('Time')[0].value_counts()
       )
NB. if you want to exclude non-letters/digits from the words, use (\w+) in place of (\S+).
Output:
Time 0
2022 à 3
#Dakar 1
#auton... 1
#céréales 1
#engrai... 1
& 1
... 1
...
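For the "average time" part of the question, the same grouping idea can be sketched in plain Python. This is not the pandas approach from the answer, only an illustration of the computation; it assumes Time is a numeric year, the rows below are made up, and word_stats is a hypothetical helper name.

```python
from collections import defaultdict

# Made-up rows standing in for the Time and Tweet columns.
rows = [
    (2021, "RT the summit in Dakar"),
    (2022, "the summit continues"),
    (2022, "RT new report"),
]


def word_stats(rows):
    """Map each word to (count, average year of the tweets using it)."""
    years_by_word = defaultdict(list)
    for year, tweet in rows:
        # Whitespace split, like the (\S+) pattern used above.
        for word in tweet.split():
            years_by_word[word].append(year)
    return {
        word: (len(years), sum(years) / len(years))
        for word, years in years_by_word.items()
    }


stats = word_stats(rows)
# Most frequent words first.
for word, (count, avg_year) in sorted(stats.items(), key=lambda kv: -kv[1][0]):
    print(word, count, avg_year)
```

Each occurrence is counted, so a word that appears twice in one tweet contributes that tweet's year twice to the average.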

I want to fetch all the details of wrestlers from the tables

I have this link: www.cagematch.net/?id=8&nr=1&page=15
On that page you can see a table of wrestlers. If you click on a wrestler's name, you can see that wrestler's details. I want to fetch all the wrestlers with their details in an easy, quick way. I am thinking of something like this:
urls = [
    link1, link2, link3, link4
]
for u in urls:
    ..... do the scraping
But there are 275 wrestlers, and I don't want to enter all the links by hand like this. Is there an easy way to do it?
To get all links into a list and then info about each wrestler you can use this example:
import requests
from bs4 import BeautifulSoup
url = "http://www.cagematch.net/?id=8&nr=1&page=15"
headers = {"Accept-Encoding": "deflate"}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
links = [
    "https://www.cagematch.net/" + a["href"] for a in soup.select(".TCol a")
]
for u in links:
    soup = BeautifulSoup(
        requests.get(u, headers=headers).content, "html.parser"
    )
    print(soup.h1.text)
    for info in soup.select(".InformationBoxRow"):
        print(
            info.select_one(".InformationBoxTitle").text.strip(),
            info.select_one(".InformationBoxContents").text.strip(),
        )
    # get other info here
    # ...
    print("-" * 80)
Prints:
Adam Pearce
Current gimmick: Adam Pearce
Age: 44 years
Promotion: World Wrestling Entertainment
Active Roles: Road Agent, Trainer, On-Air Official, Backstage Helper
Birthplace: Lake Forest, Illinois, USA
Gender: male
Height: 6' 2" (188 cm)
Weight: 238 lbs (108 kg)
WWW: http://twitter.com/ScrapDaddyAP https://www.facebook.com/OfficialAdamPearce https://www.youtube.com/watch?v=us91bK1ScL4
Alter egos: Adam O'BrienAdam Pearce    a.k.a.  US Marshall Adam J. PearceMasked Spymaster #2Tommy Lee Ridgeway
Roles: Singles Wrestler (1996 - 2014)Road Agent (2015 - today)Booker (2008 - 2010)Trainer (2013 - today)On-Air Official (2020 - today)Backstage Helper (2015 - today)
Beginning of in-ring career: 16.05.1996
End of in-ring career: 21.12.2014
In-ring experience: 18 years
Wrestling style: Allrounder
Trainer: Randy Ricci & Sonny Rogers
Nicknames: "Scrap Iron"
Signature moves: PiledriverFlying Body SplashRackbomb II
--------------------------------------------------------------------------------
AJ Styles
Current gimmick: AJ Styles
Age: 45 years
Promotion: World Wrestling Entertainment
Brand: RAW
Active Roles: Singles Wrestler
Birthplace: Jacksonville, North Carolina, USA
Gender: male
Height: 5' 11" (180 cm)
Weight: 218 lbs (99 kg)
Background in sports: Ringen, Football, Basketball, Baseball
WWW: http://AJStyles.org https://www.facebook.com/AJStylesOrg-110336188978264/ https://twitter.com/AJStylesOrg https://www.instagram.com/ajstylesp1/ https://www.twitch.tv/Stylesclash
Alter egos: AJ Styles    a.k.a.  Air StylesMr. Olympia
Roles: Singles Wrestler (1999 - today)Tag Team Wrestler (2001 - 2021)
Beginning of in-ring career: 15.02.1999
In-ring experience: 23 years
Wrestling style: Techniker, High Flyer
Trainer: Rick Michaels
Nicknames: "The Phenomenal""The Prince Of Phenomenal"
Signature moves: Styles ClashPelé KickCalf Killer/Calf CrusherStylin' DDTCliffhangerSpiral TapPhenomenal Forearm450 Splash
--------------------------------------------------------------------------------
...and so on.
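If the selector returns the same link more than once, or a mix of relative and absolute hrefs (an assumption, not something checked against the site), the link list can be built a bit more defensively. This is a small sketch with urllib.parse.urljoin; the href values are hypothetical and absolute_links is an illustrative helper name.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.cagematch.net/"

# Hypothetical href values; the real ones come from soup.select(".TCol a").
hrefs = ["?id=2&nr=801", "?id=2&nr=801", "?id=2&nr=14"]


def absolute_links(base, hrefs):
    """Resolve each href against base and drop duplicates, keeping order."""
    seen = set()
    links = []
    for href in hrefs:
        url = urljoin(base, href)
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


for link in absolute_links(BASE_URL, hrefs):
    print(link)
```

urljoin leaves an href that is already absolute unchanged, which plain string concatenation would not.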

Modsecurity finds no geo data for IP

I want to block every country except mine, so I downloaded the GeoLite2 database and added it in the crs-setup.conf file. Under -=[ Block Countries ]=- I also added every country code for testing.
This did not work and after trying multiple alternative "country blocking" rules I looked into the debug log and saw that the rule itself was working, but it wasn't finding any geo data for the IP:
Recipe: Invoking rule 72bef6b0; [file "/etc/modsecurity/rules/REQUEST-910-IP-REPUTATION.conf"] [line "75"] [id "910100"].
Rule 72bef6b0: SecRule "TX:HIGH_RISK_COUNTRY_CODES" "!@rx ^$" "phase:2,log,auditlog,id:910100,drop,t:none,msg:'Client IP is from a HIGH Risk Country Location',logdata:%{MATCHED_VAR},tag:application-multi,tag:language-multi,tag:platform-multi,tag:attack-reputation-ip,tag:paranoia-level/1,tag:OWASP_CRS,ver:OWASP_CRS/3.3.2,severity:CRITICAL,chain"
Transformation completed in 8 usec.
Executing operator "!rx" with param "^$" against TX:high_risk_country_codes.
Target value: "AD AE AF AG AI AL AM AO AQ AR AS AT AU AW AX AZ BA BB BD BE BF BG BH BI BJ BL BM BN BO BQ BR BS BT BV BW BY BZ CA CC CD CF CG CH CI CK CL CM CN CO CR CU CV CW CX CY CZ DE DJ DK DM DO DZ EC EE EG EH ER ES ET FI FJ FK FM FO FR GA GB GD GE GF GG GH GI GL GM GN GP GQ GR GS GT GU GW GY HK HM HN HR HT HU ID IE IL IM IN IO IQ IR IS IT JE JM JO JP KE KG KH KI KM KN KP KR KW KY KZ LA LB LC LI LK LR LS LT LU LV LY MA MC MD ME MF MG MH MK ML MM MN MO MP MQ MR MS MT MU MV MW MX MY MZ NA NC NE NF NG NI NL NO NP NR NU NZ OM PA PE PF PG PH PK PL PM PN PR PS PT PW PY QA RE RO RS RU RW SA SB SC SD SE SG SH SI SJ SK SL SM SN SO SR SS ST SV SX SY SZ TC TD TF TG TH TJ TK TL TM TN TO TR TT TV TW TZ UA UG UM US UY UZ VA VC VE VG VI VN VU WF WS YE YT ZA ZM ZW"
Operator completed in 20 usec.
Rule returned 1.
Match -> mode NEXT_RULE.
Recipe: Invoking rule 72eb4298; [file "/etc/modsecurity/rules/REQUEST-910-IP-REPUTATION.conf"] [line "77"].
Rule 72eb4298: SecRule "TX:REAL_IP" "@geoLookup " "chain"
Transformation completed in 2 usec.
Executing operator "geoLookup" with param "" against TX:real_ip.
Target value: "###.##.#.###"
GEO: Looking up "###.##.#.###".
GEO: Using address "###.##.#.###" (0x########). ##########
No geo data for "###.##.#.###" (country -4431872).
Operator completed in 10205 usec.
Rule returned 0.
However, the IP is in the database: I checked it with geoip2 in Python, and it returned the correct country for that IP.
Is there anything obvious I missed?
ModSecurity does NOT support the new GeoIP2 database format, so the old legacy format needs to be used.
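To tell which format a database file is in, one option is to look for the marker that starts the metadata section of a MaxMind DB (GeoIP2/GeoLite2 .mmdb) file. The sketch below is a heuristic check, not part of ModSecurity or the geoip2 package; looks_like_mmdb and the example path are hypothetical.

```python
import os

# Byte sequence that starts the metadata section of a MaxMind DB file.
MMDB_MARKER = b"\xab\xcd\xefMaxMind.com"


def looks_like_mmdb(path, tail_bytes=128 * 1024):
    """Return True if the end of the file contains the MaxMind DB marker."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        # The metadata sits at the end of the file, so only the tail is read.
        handle.seek(max(0, size - tail_bytes))
        return MMDB_MARKER in handle.read()


# Example call with a hypothetical path:
# print(looks_like_mmdb("/usr/share/GeoIP/GeoLite2-Country.mmdb"))
```

A file for which this returns True is in the GeoIP2 format, which is the case the answer above describes.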

how to change month name to a different language in pyspark - dataframe

I am trying to create a table for "Date" on Databricks using the below configurations:
# Get date range
dateFrom = dbutils.widgets.get("date_from")
dateTo = dbutils.widgets.get("date_to")
dateDF_TESTE = spark.sql("SELECT sequence(to_date('{0}'), to_date('{1}'), interval 1 day) AS date".format(dateFrom, dateTo))\
    .select(F.explode("date").alias('DSC_DATE'))
But when I add columns derived from this data, such as the month name or the day of the week, I only get the values in English.
I want to get this information in another language (Portuguese), but without any success so far. I've tried to use locale, but it is not working.
import locale
# use user's default settings
locale.setlocale(locale.LC_ALL, 'pt_PT.utf8')
Since Spark 3.0 it is possible to use to_csv() on a single column. to_csv accepts the same options as the standard CSV writer, so the locale can be set there:
from pyspark.sql import functions as F
dateDF_TESTE.withColumn("formatted_date",
        F.to_csv(F.struct(F.col("DSC_DATE")),
                 {"dateFormat": "EEEE, d 'de' MMMM 'de' yyyy", "locale": "pt", "quote": ""}))\
    .show(truncate=False, n=5)
Prints:
+----------+------------------------------------+
|DSC_DATE |formatted_date |
+----------+------------------------------------+
|2020-01-01|Quarta-feira, 1 de Janeiro de 2020|
|2020-01-02|Quinta-feira, 2 de Janeiro de 2020|
|2020-01-03|Sexta-feira, 3 de Janeiro de 2020 |
|2020-01-04|Sábado, 4 de Janeiro de 2020 |
|2020-01-05|Domingo, 5 de Janeiro de 2020 |
+----------+------------------------------------+
only showing top 5 rows
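To see what each part of the dateFormat pattern contributes, here is a plain-Python sketch that builds the same strings from explicit lookup tables. It is not how Spark resolves the locale; the name tables are hard-coded assumptions, capitalized to mirror the output shown above, and format_pt is a hypothetical helper.

```python
from datetime import date

# Hard-coded Portuguese names; index 0 is Monday, as in date.weekday().
WEEKDAYS_PT = [
    "Segunda-feira", "Terça-feira", "Quarta-feira", "Quinta-feira",
    "Sexta-feira", "Sábado", "Domingo",
]
MONTHS_PT = [
    "Janeiro", "Fevereiro", "Março", "Abril", "Maio", "Junho",
    "Julho", "Agosto", "Setembro", "Outubro", "Novembro", "Dezembro",
]


def format_pt(d):
    """Mimic "EEEE, d 'de' MMMM 'de' yyyy": weekday, day, month, year."""
    weekday = WEEKDAYS_PT[d.weekday()]   # EEEE: full weekday name
    month = MONTHS_PT[d.month - 1]       # MMMM: full month name
    return "{}, {} de {} de {}".format(weekday, d.day, month, d.year)


print(format_pt(date(2020, 1, 1)))
print(format_pt(date(2020, 1, 5)))
```

The quoted 'de' parts of the pattern are literal text, which is why they appear unchanged in every row.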