I have a dataframe:
d = [{'text': 'They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.', 'begin_end': [111, 120]},
{'text': 'Mom called dad, and when he came home, he took moms car and drove to the store', 'begin_end': [20,31]}]
s = spark.createDataFrame(d)
+----------+----------------------------------------------------------------------------------------------------------------------------+
|begin_end |text |
+----------+----------------------------------------------------------------------------------------------------------------------------+
|[111, 120]|They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.|
|[20, 31] |Mom called dad, and when he came home, he took moms car and drove to the store |
+----------+----------------------------------------------------------------------------------------------------------------------------+
I needed to extract the words from the text column using the begin_end column array, like text[111:120+1]. In pandas, this could be done via zip:
df['new_col'] = [s[a:b+1] for s, (a,b) in zip(df['text'], df['begin_end'])]
result:
begin_end new_col
0 [111, 120] jumps bad
1 [20, 31] when he came
How can I rewrite the zip approach in PySpark to get new_col? Do I need to write a UDF for this?
You can do so by using substring in an expression. It expects the string you want to substring, a starting position, and the length of the substring. An expression is needed because the substring function from pyspark.sql.functions doesn't accept a column as the starting position or length.
from pyspark.sql import functions as F

s.withColumn('new_col', F.expr("substr(text, begin_end[0] + 1, begin_end[1] - begin_end[0] + 1)")).show()
+----------+--------------------+------------+
| begin_end| text| new_col|
+----------+--------------------+------------+
|[111, 120]|They say that all...| jumps bad|
| [20, 31]|Mom called dad, a...|when he came|
+----------+--------------------+------------+
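If you prefer to stay in the DataFrame API, Column.substr also accepts Column arguments for both the start position and the length, so a minimal sketch without expr (assuming the same s DataFrame as above) could look like this:

from pyspark.sql import functions as F

s.withColumn(
    'new_col',
    F.col('text').substr(F.col('begin_end')[0] + 1, F.col('begin_end')[1] - F.col('begin_end')[0] + 1)
).show()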
Related
I am trying to use a dictionary like this:
mydictionary = {'AL':'Alabama', '(AL)': 'Alabama', 'WI':'Wisconsin','GA': 'Georgia','(GA)': 'Georgia'}
To go through a spark dataframe:
data = [{"ID": 1, "TheString": "On WI ! On WI !"},
{"ID": 2, "TheString": "The state of AL is next to GA"},
{"ID": 3, "TheString": "The state of (AL) is also next to (GA)"},
{"ID": 4, "TheString": "Alabama is in the South"},
{"ID": 5, "TheString": 'Wisconsin is up north way'}
]
sdf = spark.createDataFrame(data)
display(sdf)
And replace any substring that matches a dictionary key with the corresponding value.
So, something like this:
for k, v in mydictionary.items():
    replacement_expr = regexp_replace(col("TheString"), '(\s+)' + k, v)
    print(replacement_expr)
sdf.withColumn("TheString_New", replacement_expr).show(truncate=False)
(this of course does not work; the regular expression being compiled is wrong)
A few things to note:
The abbreviation has either a space before and after, or left and right parentheses.
I think the big problem here is that I can't get the re to "compile" correctly across the dictionary elements. (And then also throw in the "space or parentheses" restriction noted.)
I realize I could get rid of the (GA) with parentheses keys (and just use GA with spaces or parentheses as boundaries), but it seemed simpler to have those cases in the dictionary.
Expected result:
On Wisconsin ! On Wisconsin !
The state of Alabama is next to Georgia
The state of (Alabama) is also next to (Georgia)
Alabama is in the South
Wisconsin is up north way
Your help is much appreciated.
Some close solutions I've looked at:
Replace string based on dictionary pyspark
Use \b in the regex to specify a word boundary. Also, you can use functools.reduce to generate the replace expression from the dict items like this:
from functools import reduce
from pyspark.sql import functions as F

replace_expr = reduce(
    lambda a, b: F.regexp_replace(a, rf"\b{b[0]}\b", b[1]),
    mydictionary.items(),
    F.col("TheString")
)
sdf.withColumn("TheString", replace_expr).show(truncate=False)
# +---+------------------------------------------------+
# |ID |TheString |
# +---+------------------------------------------------+
# |1 |On Wisconsin ! On Wisconsin ! |
# |2 |The state of Alabama is next to Georgia |
# |3 |The state of (Alabama) is also next to (Georgia)|
# |4 |Alabama is in the South |
# |5 |Wisconsin is up north way |
# +---+------------------------------------------------+
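If some dictionary keys contain regex metacharacters, like the parenthesised '(AL)' and '(GA)' entries, escaping them keeps the generated pattern literal. Note that \b does not match between a space and a parenthesis (both are non-word characters), so those parenthesised keys rarely match anyway and the plain AL/GA keys do the real work. A sketch of the escaped variant, assuming the same mydictionary and sdf:

import re
from functools import reduce
from pyspark.sql import functions as F

replace_expr = reduce(
    lambda a, b: F.regexp_replace(a, rf"\b{re.escape(b[0])}\b", b[1]),
    mydictionary.items(),
    F.col("TheString")
)
sdf.withColumn("TheString", replace_expr).show(truncate=False)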
Is there any solution in BigQuery to break a column of strings around 1500 characters long into multiple columns of at most 264 characters each, without breaking/splitting the words?
Regular expressions are a good way to accomplish this task. However, BigQuery is still quite limited in its support for regular expressions. Therefore, I would suggest solving this with a JavaScript UDF. A JavaScript solution can be found here:
https://www.tutorialspoint.com/how-to-split-sentence-into-blocks-of-fixed-length-without-breaking-words-in-javascript
Adapting this solution to BigQuery:
The function string_split expects the chunk size in characters and the text to be split. It returns an array with the chunks. A chunk can be up to two characters longer than the given size value because the pattern matches one extra non-space character on each side.
CREATE TEMP FUNCTION string_split(size int64,str string)
RETURNS ARRAY<STRING>
LANGUAGE js AS r"""
const regraw='\\S.{3,' + size + '}\\S(?= |$)';
const regex = new RegExp(new RegExp(regraw, 'g'), 'g');
return str.match(regex);
""";
SELECT text, split_text,
#length(split_text)
FROM
(
SELECT
text,string_split(20,text) as split_text
FROM (
SELECT "Is there any solution in bigquery to break a column of string length 1500 characters should be split into 264 characters in each columns without breaking/splitting the words" AS text
UNION ALL SELECT "This is a short text. And can be splitted as well."
)
)
#, unnest(split_text) as split_text #
Please uncomment the two lines to split the text from the array into single rows.
For larger datasets it also works and took less than two minutes:
CREATE TEMP FUNCTION string_split(size int64,str string)
RETURNS ARRAY<STRING>
LANGUAGE js AS r"""
const regraw='\\S.{3,' + size + '}\\S(?= |$)';
const regex = new RegExp(new RegExp(regraw, 'g'), 'g');
return str.match(regex);
""";
SELECT text, split_text,
length(split_text)
FROM
(
SELECT
text,string_split(40,text) as split_text
FROM (
SELECT abstract as text from `bigquery-public-data.breathe.jama`
)
)
, unnest(split_text) as split_text #
order by 3 desc
Consider below approach
create temp function split_parts(parts array<string>, max_len int64) returns array<string>
language js as """
var arr = [];
var part = '';
for (i = 0; i < parts.length; i++) {
if (part.length + parts[i].length < max_len){part += parts[i]}
else {arr.push(part); part = parts[i];}
}
arr.push(part);
return arr;
""";
select * from (
select id, offset, part
from your_table, unnest(split_parts(regexp_extract_all(col, r'[^ ]+ ?'), 50)) part with offset
)
pivot (any_value(trim(part)) as part for offset in (0, 1, 2, 3))
If applied to dummy data with a split size of 50, the output has each row's text spread across the columns part_0 to part_3.
Non-regexp Approach
DECLARE LONG_SENTENCE DEFAULT "It was my fourth day walking the Island Walk, a new 700km route that circles Canada's smallest province. Starting on PEI's rural west end, I had walked past vinyl-clad farmhouses with ocean vistas, along a boardwalk beneath whirling wind turbines, and above red clay cliffs that plunged sharply into the sea. I had stopped for a midday country music hour at the Stompin' Tom Centre, honouring Canadian singer-songwriter Tom Connors. I'd tromped through the rain along a secluded, wooded trail where swarms of canny mosquitos tried to shelter under my umbrella. And after learning about PEI's major crop at the Canadian Potato Museum, I had fuelled my day's walk with an extra-large cheese-topped baked potato served with freshly made potato chips. You know that a place is serious about its spuds when your potato comes with a side of potatoes.";
CREATE TEMP FUNCTION cumsumbin(a ARRAY<INT64>) RETURNS INT64
LANGUAGE js AS """
bin = 0;
a.reduce((c, v) => {
if (c + Number(v) > 264) { bin += 1; return Number(v); }
else return c += Number(v);
}, 0);
return bin;
""";
WITH splits AS (
SELECT w, cumsumbin(ARRAY_AGG(LENGTH(w) + 1) OVER (ORDER BY o)) AS bin
FROM UNNEST(SPLIT(LONG_SENTENCE, ' ')) w WITH OFFSET o
)
SELECT * FROM (
SELECT bin, STRING_AGG(w, ' ') AS segment
FROM splits
GROUP BY 1
) PIVOT (ANY_VALUE(segment) AS segment FOR bin IN (0, 1, 2, 3))
;
Query results:
segment_0: It was my fourth day walking the Island Walk, a new 700km route that circles Canada's smallest province. Starting on PEI's rural west end, I had walked past vinyl-clad farmhouses with ocean vistas, along a boardwalk beneath whirling wind turbines, and above red
segment_1: clay cliffs that plunged sharply into the sea. I had stopped for a midday country music hour at the Stompin' Tom Centre, honouring Canadian singer-songwriter Tom Connors. I'd tromped through the rain along a secluded, wooded trail where swarms of canny mosquitos
segment_2: tried to shelter under my umbrella. And after learning about PEI's major crop at the Canadian Potato Museum, I had fuelled my day's walk with an extra-large cheese-topped baked potato served with freshly made potato chips. You know that a place is serious about
segment_3: its spuds when your potato comes with a side of potatoes.
Length of each segment: segment_0 = 261, segment_1 = 262, segment_2 = 261, segment_3 = 57
Regexp Approach
[note] The expression below, (.{1,264}\b), is simple, but a word boundary does not include a period (.), so the result can have some errors; you can see the last period in segment_3 is missing. Under certain circumstances this might still be useful, I think.
SELECT * FROM (
SELECT *
FROM UNNEST(REGEXP_EXTRACT_ALL(LONG_SENTENCE, r'(.{1,264}\b)')) segment WITH OFFSET o
) PIVOT (ANY_VALUE(segment) segment FOR o IN (0, 1, 2, 3));
Query results:
segment_0: It was my fourth day walking the Island Walk, a new 700km route that circles Canada's smallest province. Starting on PEI's rural west end, I had walked past vinyl-clad farmhouses with ocean vistas, along a boardwalk beneath whirling wind turbines, and above red
segment_1: clay cliffs that plunged sharply into the sea. I had stopped for a midday country music hour at the Stompin' Tom Centre, honouring Canadian singer-songwriter Tom Connors. I'd tromped through the rain along a secluded, wooded trail where swarms of canny mosquitos
segment_2: tried to shelter under my umbrella. And after learning about PEI's major crop at the Canadian Potato Museum, I had fuelled my day's walk with an extra-large cheese-topped baked potato served with freshly made potato chips. You know that a place is serious about
segment_3: its spuds when your potato comes with a side of potatoes
Length of each segment: segment_0 = 261, segment_1 = 262, segment_2 = 261, segment_3 = 56
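As a rough illustration, the same boundary behaviour can be reproduced with Python's re module (not RE2, so only an approximation of BigQuery's engine):

import re

# chunk a sentence into pieces of at most 40 characters ending on a word boundary
sentence = "You know that a place is serious about its spuds when your potato comes with a side of potatoes."
chunks = re.findall(r'.{1,40}\b', sentence)
print(chunks)
# the final period is not part of the last chunk, because \b never matches after a trailing '.'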
I have retail data from which I created a retail dataframe:
from pyspark import SparkFiles
from pyspark.sql.functions import col, collect_list, struct, transform

spark.sparkContext.addFile('https://raw.githubusercontent.com/databricks/Spark-The-Definitive-Guide/master/data/retail-data/all/online-retail-dataset.csv')
retail_df = spark.read.csv(SparkFiles.get('online-retail-dataset.csv'), header=True, inferSchema=True)\
    .withColumn('OverallItems', struct('StockCode', 'Description', 'UnitPrice', 'Quantity', 'InvoiceDate', 'CustomerID', 'Country'))
Then I created retail_array, which has two columns, InvoiceNo and Items:
retail_array = retail_df.groupBy('InvoiceNo')\
.agg(collect_list(col('OverallItems')).alias('Items'))
I want to compute the total price of the invoice items and add it into the Items column in retail_array.
So far I have written this code:
transformer = lambda x: struct(x['UnitPrice'], x['Quantity'], x['UnitPrice'] * x['Quantity']).cast("struct<UnitPrice:double,Quantity:double,TotalPrice:double>")
TotalPrice_df = retail_array\
.withColumn('TotalPrice', transform("items", transformer))
TotalPrice_df.show(truncate=False)
But with this code I'm adding a new column to retail_array, whereas I want this new column to be part of the Items column in retail_array.
For one invoice, the output looks like this:
+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+
|InvoiceNo|Items                                                                                                                                                                 |TotalPrice                                                         |
+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+
|536366   |[{22633, HAND WARMER UNION JACK, 1.85, 6, 12/1/2010 8:28, 17850, United Kingdom}, {22632, HAND WARMER RED POLKA DOT, 1.85, 6, 12/1/2010 8:28, 17850, United Kingdom}]|[{1.85, 6.0, 11.100000000000001}, {1.85, 6.0, 11.100000000000001}]|
+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+
I want it to sum 11.100000000000001 + 11.100000000000001 and add the result into the Items column, with no extra column. For other invoices there can be more than two item totals that I want to add together.
Use the aggregate function instead of transform to calculate the total price, like this:
from pyspark.sql import functions as F
retail_array = retail_df.groupBy("InvoiceNo").agg(
    F.collect_list(F.col("OverallItems")).alias("Items")
).withColumn(
    "TotalPrice",
    F.aggregate("Items", F.lit(0.0), lambda acc, x: acc + (x["Quantity"] * x["UnitPrice"]))
)
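F.aggregate was added in Spark 3.1; if you are on an older version, the same higher-order aggregate function can be reached through a SQL expression. A sketch under that assumption:

retail_array = retail_df.groupBy("InvoiceNo").agg(
    F.collect_list(F.col("OverallItems")).alias("Items")
).withColumn(
    "TotalPrice",
    F.expr("aggregate(Items, cast(0.0 as double), (acc, x) -> acc + x.Quantity * x.UnitPrice)")
)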
Note however that you can actually calculate this TotalPrice in the same aggregation where you collect the list of structs, and thus avoid additional computation iterating over the array elements:
retail_array = retail_df.groupBy("InvoiceNo").agg(
    F.collect_list(F.col("OverallItems")).alias("Items"),
    F.sum(F.col("Quantity") * F.col("UnitPrice")).alias("TotalPrice")
)
retail_array.show(1)
#+---------+--------------------+------------------+
#|InvoiceNo| Items| TotalPrice|
#+---------+--------------------+------------------+
#| 536366|[{22633, HAND WAR...|22.200000000000003|
#+---------+--------------------+------------------+
But with this code I'm adding a new column to retail_array, but I want this new column to be part of the Items column in retail_array
Not sure I understood this part correctly. The Items column is an array of structs; it does not make much sense to replicate the total price of an InvoiceNo in each of its items.
That said, if you really want to do this, you can use transform after calculating the total price (step above):
result = retail_array.withColumn(
    "Items",
    F.transform("Items", lambda x: x.withField("TotalPrice", F.col("TotalPrice")))
).drop("TotalPrice")
result.show(1, False)
#+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|InvoiceNo|Items |
#+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|536366 |[{22633, HAND WARMER UNION JACK, 1.85, 6, 12/1/2010 8:28, 17850, United Kingdom, 22.200000000000003}, {22632, HAND WARMER RED POLKA DOT, 1.85, 6, 12/1/2010 8:28, 17850, United Kingdom, 22.200000000000003}]|
#+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I have a simple data set, where we have a Dates column from which I want to extract the year.
I am using negative indexing to get the year:
d0['Year'] = d0['Dates'].apply(lambda x: x[-1:-5])
This normally works, but not here; a blank column is created.
I sampled the column for some of the data and saw no odd characters present.
I have tried the following variations
d0['Year'] = d0['Dates'].apply(lambda x: str(x)[-1:-5]) # column is created and it is blank.
d0['Year'] = d0.Dates.str.extract('\d{4}') # gives an error "ValueError: pattern contains no capture groups"
d0['Year'] = d0['Dates'].apply(lambda x: str(x).replace('[^a-zA-Z0-9_-]','a')[-1:-5]) # same - gives a blank column
Really not sure what other options I have or where the issue is.
What could possibly be the issue?
Below is a sample dump of the data I have
Outbreak,Dates,Region,Tornadoes,Fatalities,Notes
2000 Southwest Georgia tornado outbreak,"February 13–14, 2000",Georgia,17,18,"Produced a series of strong and deadly tornadoes that struck areas in and around Camilla, Meigs, and Omega, Georgia. Weaker tornadoes impacted other states."
2000 Fort Worth tornado,"March 28, 2000",U.S. South,10,2,"Small outbreak produced an F3 that hit downtown Fort Worth, Texas, severely damaging skyscrapers and killing two. Another F3 caused major damage in Arlington and Grand Prairie."
2000 Easter Sunday tornado outbreak,"April 23, 2000","Oklahoma, Texas, Louisiana, Arkansas",33,0,
"2000 Brady, Nebraska tornado","May 17, 2000",Nebraska,1,0,"Highly photographed F3 passed near Brady, Nebraska."
2000 Granite Falls tornado,"July 25, 2000","Granite Falls, Minnesota",1,1,"F4 struck Granite Falls, causing major damage and killing one person."
To extract the year from the "Dates" column as an object (string) type, take the last four characters of each value. The original x[-1:-5] produces empty strings because a forward slice needs its start to come before its stop; x[-4:] grabs the last four characters instead:
da['Year'] = da['Dates'].apply(lambda x: x[-4:])
If you want to use it as an int, you could do the following operation after the step above:
da['Year']=pd.to_numeric(da['Year'])
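As a side note, the str.extract attempt in the question fails only because the pattern has no capture group; wrapping the year in parentheses fixes that (a small sketch using the question's d0 dataframe):

# expand=False returns a Series instead of a one-column DataFrame
d0['Year'] = d0['Dates'].str.extract(r'(\d{4})', expand=False)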
I have a dataset that indicates date & time in 5-digit format: ddd + hm
The ddd part starts from 1 Jan 2009. Since the data was collected over a two-year period from then, its [min, max] would be [1, 365 x 2 = 730].
Data is observed at 30-minute intervals, so a 24-hour day holds at most 48 slots, giving a [min, max] for hm of [1, 48].
The following is an excerpt of the daycode.csv file, which maps the ddd part of the daycode to its date and the hm part of the daycode to its time.
I think I agreed not to show the dataset, which is from ISSDA, so I will just describe it: the daycode in the File1.txt file reads like '63317'.
This link gave me a glimpse of how to approach this problem, and I was in the middle of putting this code together, which of course won't work at this point:
import pandas as pd
import matplotlib.pyplot as plt

consume = pd.read_csv("data/File1.txt", sep=' ', encoding="utf-8", names=['meter', 'daycode', 'val'])
df1 = pd.read_csv("data/daycode.csv", encoding="cp1252", names=['code', 'print'])
test = consume[consume['meter']==1048]
test['daycode'] = test['daycode'].map(df1.set_index('code')['print'])
plt.plot(test['daycode'], test['val'], '.')
plt.title('test of meter 1048')
plt.xlabel('daycode')
plt.ylabel('energy consumption [kWh]')
plt.show()
Not all units (there are thousands of them) have been observed for the full length, but 730 x 48 is too large a combination to lay out in Excel by hand. To be honest that would not be an elegant solution anyway, and when I tried it by dragging it didn't quite work.
If I could read the first 3 digits of the column values and match them with a column from another file, and the last 2 digits with another column, then combine them... is there a way?
For those last two steps you can just do something like this:
df['first_3_digits'] = df['col1'].map(lambda x: str(x)[:3])
df['last_2_digits'] = df['col1'].map(lambda x: str(x)[-2:])
For joining the two dataframes:
df3 = df.merge(df2,left_on=['first_3_digits','last_2_digits'],right_on=['col1_df2','col2_df2'],how='left')
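Applied to data shaped like the question describes, the whole flow might look like this (the daycode mapping file isn't shown, so its column names and values here are hypothetical):

import pandas as pd

# hypothetical sample data; the real File1.txt and daycode.csv layouts are only described in the question
consume = pd.DataFrame({'meter': [1048, 1048], 'daycode': [63317, 63348], 'val': [0.12, 0.34]})
daycodes = pd.DataFrame({'code': ['633', '633'], 'hm': ['17', '48'],
                         'date': ['2010-09-25', '2010-09-25'], 'time': ['08:30', '23:30']})

consume['first_3_digits'] = consume['daycode'].map(lambda x: str(x)[:3])
consume['last_2_digits'] = consume['daycode'].map(lambda x: str(x)[-2:])

# both key columns are strings here; mismatched dtypes would make the merge raise an error
merged = consume.merge(daycodes, left_on=['first_3_digits', 'last_2_digits'],
                       right_on=['code', 'hm'], how='left')
print(merged)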