how to change month name to a different language in pyspark dataframe

I am trying to create a "Date" table on Databricks using the configuration below:
# Get date range
dateFrom = dbutils.widgets.get("date_from")
dateTo = dbutils.widgets.get("date_to")
dateDF_TESTE = spark.sql("SELECT sequence(to_date('{0}'), to_date('{1}'), interval 1 day) AS date".format(dateFrom, dateTo))\
    .select(F.explode("date").alias('DSC_DATE'))
But when I add columns derived from those dates, the information, for example the month name or the day of the week, only comes out in English.
I intend to get this information in another language (Portuguese), but without any success. I've tried to use locale, but it is not working.
import locale
# try to set a Portuguese locale
locale.setlocale(locale.LC_ALL, 'pt_PT.utf8')

locale.setlocale only changes settings in the Python process, while Spark formats dates on the JVM, so it has no effect here. Since Spark 3.0 it is possible to use to_csv() on a single column. to_csv() accepts the same options as the standard CSV writer, so it is possible to set the locale there:
from pyspark.sql import functions as F

dateDF_TESTE.withColumn("formatted_date",
                        F.to_csv(F.struct(F.col("DSC_DATE")),
                                 {"dateFormat": "EEEE, d 'de' MMMM 'de' yyyy", "locale": "pt", "quote": ""}))\
    .show(truncate=False, n=5)
prints
+----------+----------------------------------+
|DSC_DATE  |formatted_date                    |
+----------+----------------------------------+
|2020-01-01|Quarta-feira, 1 de Janeiro de 2020|
|2020-01-02|Quinta-feira, 2 de Janeiro de 2020|
|2020-01-03|Sexta-feira, 3 de Janeiro de 2020 |
|2020-01-04|Sábado, 4 de Janeiro de 2020      |
|2020-01-05|Domingo, 5 de Janeiro de 2020     |
+----------+----------------------------------+
only showing top 5 rows
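Another option is to format the dates in Python with a UDF. This is only a minimal sketch (not from the original answer): it assumes the Babel package is available on the cluster (e.g. %pip install babel on Databricks) and that DSC_DATE is a date column.

from babel.dates import format_date
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Hypothetical helper: format each date with Portuguese day/month names via Babel
@F.udf(StringType())
def format_pt(d):
    return format_date(d, "EEEE, d 'de' MMMM 'de' yyyy", locale="pt_PT") if d else None

dateDF_TESTE.withColumn("formatted_date", format_pt(F.col("DSC_DATE"))).show(5, truncate=False)

The built-in to_csv approach above stays on the JVM and avoids the Python round trip, so it is generally preferable; the UDF is mainly useful when you need formatting that the CSV writer's locale option does not cover.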

Pandas bringing in data from another dataframe

I am trying to bring data from a dataframe which is a mapping table into another dataframe using the code below, however I get an error that 'x' is not defined. What am I doing wrong?
Note: for values not in the mapping table (China/CN) I would just like the value to be blank or NaN. If there are values in the mapping table that are not in my data, I don't want to include them.
import pandas as pd

languages = {'Language': ["English", "German", "French", "Spanish"],
             'countryCode': ["EN", "DE", "FR", "ES"]}
countries = {'Country': ["Australia", "Argentina", "Mexico", "Algeria", "China"],
             'countryCode': ["EN", "ES", "ES", "FR", "CN"]}

language_map = pd.DataFrame(languages)
data = pd.DataFrame(countries)

def language_converter(x):
    return language_map.query(f"countryCode=='{x}'")['Language'].values[0]

data['Language'] = data['countryCode'].apply(language_converter(x))
Use pandas.DataFrame.merge:
data.merge(language_map, how='left')
Output:
     Country countryCode Language
0  Australia          EN  English
1  Argentina          ES  Spanish
2     Mexico          ES  Spanish
3    Algeria          FR   French
4      China          CN      NaN
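As a small aside (not part of the original answer), the join key can also be spelled out explicitly; merge picks the shared countryCode column automatically, but naming it makes the intent clearer:

data.merge(language_map, how='left', on='countryCode')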
.apply accepts a callable object, but you've passed language_converter(x), which is already a function call, and x is undefined at that point because apply has not been invoked yet.
A valid usage is: .apply(language_converter).
But next you'll hit another error, IndexError: index 0 is out of bounds for axis 0 with size 0, because some country codes may not be found in the mapping table (which breaks the indexing .values[0]).
If proceeding with your original approach, a valid version would look as below:

import numpy as np

def language_converter(x):
    # return the mapped language, or NaN when the code is missing from the mapping table
    lang = language_map[language_map["countryCode"] == x]['Language'].values
    return lang[0] if lang.size > 0 else np.nan

data['Language'] = data['countryCode'].apply(language_converter)
print(data)
     Country countryCode Language
0  Australia          EN  English
1  Argentina          ES  Spanish
2     Mexico          ES  Spanish
3    Algeria          FR   French
4      China          CN      NaN
But instead of defining and applying language_converter, it's much simpler and more straightforward to map country codes directly with just:
data['Language'] = data['countryCode'].map(language_map.set_index("countryCode")['Language'])
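If you prefer blanks rather than NaN for unmapped codes such as China/CN (as mentioned in the question), one option, shown here as an extra step rather than something from the original answer, is to chain fillna:

data['Language'] = data['countryCode'].map(
    language_map.set_index("countryCode")['Language']
).fillna('')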

extracting year from string using regexp_extract pyspark

This is a portion of my data (movie titles):
Grumpier Old Men (1995)
Death Note: Desu nôto (2006–2007)
Irwin & Fran 2013
9500 Liberty (2009)
Captive Women (1000 Years from Now) (3000 A.D.) (1952)
The Garden of Afflictions 2017
The Naked Truth (1957) (Your Past Is Showing)
Conquest 1453 (Fetih 1453) (2012)
Commune, La (Paris, 1871) (2000)
1013 Briar Lane
return:
1995
2006
2013
2009
1952
2017
1957
1453 <--
1871 <--
(the result for the last title is empty, and it is supposed to be empty)
As you can see above, the last two marked titles give the wrong result.
This is my code:
import pyspark.sql.functions as F
from pyspark.sql.functions import regexp_extract,col
bracket_regexp = "((?<=\()\d{4}(?=[^\(]*$))"
movies_DF=movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))
movies_DF.display(10000)
I am trying to get the year portion of the title string.
You can try using the following regex: r'(?<=\()(\d+)(?=\))', which is inspired by this excellent answer.
For example:
movies_DF = movies_DF.withColumn('uu', regexp_extract(col("title"), r'(?<=\()(\d+)(?=\))',1))
+------------------------------------------------------------+----+
|title |uu |
+------------------------------------------------------------+----+
|Grumpier Old Men (1995) |1995|
|Happy Anniversary (1959) |1959|
|Paths (2017) |2017|
|The Three Amigos - Outrageous! (2003) |2003|
|L'obsession de l'or (1906) |1906|
|Babe Ruth Story, The (1948) |1948|
|11'0901 - September 11 (2002) |2002|
|Blood Trails (2006) |2006|
|Return to the 36th Chamber (Shao Lin da peng da shi) (1980) |1980|
|Off and Running (2009) |2009|
+------------------------------------------------------------+----+
Empirically, the following regex pattern seems to be working:
(?<=[( ])\d{4}(?=\S*\)|$)
Here is a working regex demo.
Updated PySpark code:
bracket_regexp = "((?<=[( ])\d{4}(?=\S*\)|$))"
movies_DF = movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))
movies_DF.display(10000)
The regex pattern works by matching:
- (?<=[( ]) assert that what precedes is ( or a space
- \d{4} match a 4-digit year
- (?=\S*\)|$) assert that ), possibly preceded by non-whitespace characters, follows, or that the end of the string follows
Your regex can only work for the first line. \(\d{4}\) tries to match a (, 4 digits and a ). For the first line you have (1995), which is fine. The other lines do not contain that pattern.
In your situation, we can use lookbehind and lookahead patterns to detect dates within brackets. (?<=\() means an opening bracket before. (?=–|(–)|\)) means a closing bracket after, or an en dash –, or – which is how the en dash appears mis-encoded in your data. Once you have covered the dates between brackets, you can cover dates that are at the end of the string without brackets: \d{4}$.
import pyspark.sql.functions as F

bracket_regexp = "((?<=\()\d{4}(?=–|(–)|\)))"
movies_DF\
    .withColumn('yearOfRelease', F.regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))\
    .show(truncate=False)
+------------------------------------------------------+-------------+
|title |yearOfRelease|
+------------------------------------------------------+-------------+
|Grumpier Old Men (1995) |1995 |
|Death Note: Desu nôto (2006–2007) |2006 |
|Irwin & Fran 2013 |2013 |
|9500 Liberty (2009) |2009 |
|test 1234 test 4567 |4567 |
|Captive Women (1000 Years from Now) (3000 A.D.) (1952)|1952 |
|The Garden of Afflictions 2017 |2017 |
|The Naked Truth (1957) (Your Past Is Showing) |1957 |
|Conquest 1453 (Fetih 1453) (2012) |2012 |
|Commune, La (Paris, 1871) (2000) |2000 |
|1013 Briar Lane | |
+------------------------------------------------------+-------------+
Also you do not need to prefix the string with r when you pass a regex to a spark function.
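To iterate on the pattern quickly, here is a local sanity check with Python's re module (my own illustration, not from the original answer; raw strings are used since this runs outside Spark). regexp_extract with group 0 corresponds to group(0) of the first match, or an empty string when there is no match:

import re

bracket_regexp = r"((?<=\()\d{4}(?=–|(–)|\)))"
pattern = re.compile(bracket_regexp + r"|(\d{4}$)")

titles = ["Grumpier Old Men (1995)",
          "Conquest 1453 (Fetih 1453) (2012)",
          "Commune, La (Paris, 1871) (2000)",
          "1013 Briar Lane"]
for title in titles:
    m = pattern.search(title)
    print(title, "->", m.group(0) if m else "")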
Here is a regexp that would work:
df = df.withColumn("year", F.regexp_extract("title", "(?:[\s\(])(\d{4})(?:[–\)])?", 1))
Definitely overkill for the examples you provide, but I want to avoid capturing e.g. other numbers in the titles. Also, your regexp does not work because not all years are surrounded by brackets in your examples, and sometimes you have non-numeric characters inside the brackets.

Counting Distinct words AND average time in Pandas

I'm working on analysing some text from a Twitter API using pandas. This will eventually be visualized.
For reference, df.head() of my dataset is:
Count User Time Tweet
0 0 x 2022 ✔️Nécessité de maintien d’une filière 🇪🇺 dynam...
1 1 x 2022 Échanges approfondis à #Dakar avec le Premier ...
2 2 x 2022 ✔️Approvisionnement en #céréales & #engrai...
3 3 x 2022 Aujourd’hui à Tambacounda, à l’Est du Sénégal,...
4 4 x 2022 Working hard since 2019 to reinforce EU #auton...
I'm looking to return the distinct word counts together with the average time of the tweets each word was used in.
Right now, I get the distinct word counts of my dataset using df.Tweet.str.split(expand=True).stack().value_counts().
This is useful, returning:
the 1505
de 1500
to 1168
RT 931
of 906
...
africain, 1
langue 1
Félicitations! 1
Length: 18071, dtype: int64
However, I want to also analyse text usage over time.
I'm not super experienced so I'm wondering if there is a way to use a function such as df.groupby() to sort this result by time? Or, is there a way to modify my original function to add a column to my results that includes average time?
I would use str.extractall to get the words, join the Time, then perform a groupby.value_counts to get the count per Year:
out = (df['Tweet']
       .str.extractall('(\S+)')
       .droplevel('match')
       .join(df['Time'])
       .groupby('Time')[0].value_counts()
       )
NB. if you want to exclude non-letters/digits from the words, use (\w+) in place of (\S+).
Output:
Time  0
2022  à            3
      #Dakar       1
      #auton...    1
      #céréales    1
      #engrai...   1
      &            1
      ...          1
...
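The question also asks for the average time per word. Assuming Time is numeric (for example a year, as in the sample), a hedged sketch building on the same extractall idea could be:

words = (df['Tweet']
         .str.extractall(r'(\S+)')
         .droplevel('match')
         .join(df['Time'])
         .rename(columns={0: 'word'}))

# one row per word occurrence: count occurrences and average the Time values per word
summary = words.groupby('word')['Time'].agg(count='size', avg_time='mean')

If Time is a datetime instead, convert it first (e.g. with dt.year or Unix seconds) so that the mean is meaningful.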

End user column sort - PowerBi

I wanted to know if there is a way for the end user to custom sort a column.
For example, we can add a conditional sort column like the one below and use the 'Sort by column' option to custom sort a column. Is there a way to add a parameter in M so that the end user can switch between different sort orders for the column?
Zone    | Sort
North   | 1
South   | 2
Central | 3
East    | 4
West    | 5
There is a new Dynamic M query parameters feature in preview in the Power BI March update, but I couldn't make it work.
If it cannot be achieved via parameters, what would be the best approach?

Loop over pandas dataframe to create multiple networks

I have data on countries' trade with one another. I have split the main file by month and got 12 CSV files for the year 2019. A sample of the January CSV data is provided below:
reporter partner year month trade
0 Albania Argentina 2019 01 515256
1 Albania Australia 2019 01 398336
2 Albania Austria 2019 01 7664503
3 Albania Bahrain 2019 01 400
4 Albania Bangladesh 2019 01 653907
5 Zimbabwe Zambia 2019 01 79569855
I want to make a complex network for every month and print the number of nodes of each network. Right now I can do it the hard (stupid) way, like so:

import pandas as pd
import networkx as nx

df01 = pd.read_csv('012019.csv')
df02 = pd.read_csv('022019.csv')
df03 = pd.read_csv('032019.csv')

df1 = df01[['reporter', 'partner', 'trade']]
df2 = df02[['reporter', 'partner', 'trade']]
df3 = df03[['reporter', 'partner', 'trade']]

G1 = nx.Graph()
G1 = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='trade')
G1.number_of_nodes()
and so on for the next networks.
My question is: how can I use a for loop to read the files, convert them from dataframes to networks, and report the number of nodes of each network?
I tried this, but nothing is reported:
for f in glob.glob('.csv'):
    df = pd.read_csv(f)
    df1 = df[['reporter', 'partner', 'trade']]
    G = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='trade')
    G.number_of_nodes()
Thanks.
Edit:
OK, so I managed to do the above using code similar to the below:

for files in glob.glob('/home/user/VMShared/network/2nd/*.csv'):
    df = pd.read_csv(files)
    df1 = df[['reporter', 'partner', 'import']]
    G = nx.Graph()
    G = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='import')
    nx.write_graphml_lxml(G, "/home/user/VMShared/network/2nd/*.graphml")

The problem that I now face is how to write separate files. All I get from this is a single file titled *.graphml. How can I get a graphml file for every input file? Also, giving the graphml output the same name as the input file would be a plus.
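Since no answer is recorded above, here is a hedged sketch of one way to do it: the glob pattern needs the * wildcard ('*.csv', not '.csv'), the node count has to be printed inside the loop to show up, and the output name can be derived from each input CSV's base filename (e.g. 012019.csv -> 012019.graphml):

import glob
import os

import networkx as nx
import pandas as pd

for path in glob.glob('/home/user/VMShared/network/2nd/*.csv'):
    df = pd.read_csv(path)[['reporter', 'partner', 'import']]
    G = nx.from_pandas_edgelist(df, 'reporter', 'partner', edge_attr='import')
    print(os.path.basename(path), G.number_of_nodes())  # report nodes per network
    base = os.path.splitext(os.path.basename(path))[0]  # e.g. '012019'
    out_path = os.path.join(os.path.dirname(path), base + '.graphml')
    nx.write_graphml_lxml(G, out_path)  # same writer as in the question; requires lxml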