Find the frequency (number of occurrences) of a list of substrings (which are elements in a dictionary of lists) in another string - sql

I would like to find the frequency (number of occurrences) of substrings (which are elements in a dictionary of lists used to determine their categories) in another string.
See the sample input and output below.
Find the number of repetitions of each element of st in the strings named stgs.
The code:
def freqcounter(st, stgs):
    """
    st: A mapping of st name to st keyword list.
    :type st: dict of str -> list
    stgs: A list of strings.
    :type stgs: list of str
    :return: A mapping of st name to keyword occurrence count in the list of stgs.
    :rtype: dict of str -> int
    """
    # note: both split calls are identical, so the `or` branch is never used
    stgs = str(stgs).split(" ") or str(stgs).split(' ')
    dic = {}
    for k, v in st.items():
        count = 0  # reset the count for each category
        for i in range(len(v)):
            for j in range(len(stgs)):
                if v[i] == stgs[j]:
                    count += 1
        dic[k] = count
    return dic
if __name__ == '__main__':
    stgs = ['John Smith sells trees, he said the height of his tree is high. I expected more trees with lower price, but it is higher than my expectation.', 'I like my new tree, John!', "100 dollars per each tree is very high. Tree is source of oxygen. Next time I do my shopping from a Cheap Trees shoppers."]
    st = {'Height': ['low', 'high', 'height'], 'Topic Work': ['tree', 'trees'], 'John Smith': ['John Smith']}
    outtts = freqcounter(st, stgs)
    outtts = sorted(list(outtts.items()))
    for outtt in outtts:
        print(outtt[0])
        print(outtt[1])
sample input:
# For instance, if `stgs` is the input, I would like to count the frequency of each `st` category in this text.
stgs=['John Smith sells trees, he said the height of his tree is high. I expected more trees with lower price, but it is higher than my expectation.', 'I like my new tree, John!', "100 dollars per each tree is very high. Tree is source of oxygen. Next time I do my shopping from a Cheap Trees shoppers."]
st = {'Height': ['low', 'high', 'height'], 'Topic Work': ['tree', 'trees'], 'John Smith': ['John Smith']}
I would like to calculate 2 cases:
1. do not consider word+appendix the same as the word. For instance: do not count 'lower', 'higher' as 'low', 'high'
sample output:
'Height': 3 , 'Topic Work': 7, 'John Smith': 1
because the elements of 'Height' ('low', 'high', 'height') are found 3 times, the elements of 'Topic Work' ('tree', 'trees') are found 7 times, and the element of 'John Smith' ('John Smith') is found 1 time
2. consider word+appendix the same as the word. For instance: count 'lower', 'higher' as 'low', 'high'
sample output:
'Height': 5 , 'Topic Work': 7, 'John Smith': 1
My expectation is to show how many of each of them are found.
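For reference, here is a minimal sketch of one way to produce both counts. It is only an assumption about the intended matching rules (case-insensitive, whole words for case 1, word prefixes for case 2); the name freq_counter and the match_prefixes flag are introduced here and are not part of the original code:
import re

def freq_counter(st, stgs, match_prefixes=False):
    """st: dict of category -> keyword list; stgs: list of strings.
    Returns a dict of category -> number of keyword occurrences."""
    text = " ".join(stgs).lower()
    counts = {}
    for category, keywords in st.items():
        # one alternation per category so each word is counted at most once
        alternation = "|".join(re.escape(k.lower()) for k in keywords)
        suffix = r"\w*" if match_prefixes else ""  # allow 'lower'/'higher' for case 2
        pattern = r"\b(?:" + alternation + r")" + suffix + r"\b"
        counts[category] = len(re.findall(pattern, text))
    return counts

# With the st and stgs sample data above:
# freq_counter(st, stgs)                       -> {'Height': 3, 'Topic Work': 7, 'John Smith': 1}
# freq_counter(st, stgs, match_prefixes=True)  -> {'Height': 5, 'Topic Work': 7, 'John Smith': 1}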

Related

Using Google big query sql split the string in a column to multiple columns without breaking words

Is there any solution in BigQuery to split a string column of up to 1500 characters into multiple columns of at most 264 characters each, without breaking/splitting words?
Regular expressions are a good way to accomplish this task. However, BigQuery is still quite limited in its support for regular expressions. Therefore, I would suggest solving this with a UDF and JavaScript. A JavaScript solution can be found here:
https://www.tutorialspoint.com/how-to-split-sentence-into-blocks-of-fixed-length-without-breaking-words-in-javascript
Adapting this solution to BigQuery:
The function string_split expects the chunk size in characters and the text to be split. It returns an array with the chunks. A chunk can be up to two characters longer than the given size because of the surrounding spaces.
CREATE TEMP FUNCTION string_split(size int64,str string)
RETURNS ARRAY<STRING>
LANGUAGE js AS r"""
const regraw='\\S.{3,' + size + '}\\S(?= |$)';
const regex = new RegExp(new RegExp(regraw, 'g'), 'g');
return str.match(regex);
""";
SELECT text, split_text,
  #length(split_text)
FROM
(
  SELECT
    text, string_split(20, text) as split_text
  FROM (
    SELECT "Is there any solution in bigquery to break a column of string length 1500 characters should be split into 264 characters in each columns without breaking/splitting the words" AS text
    UNION ALL SELECT "This is a short text. And can be splitted as well."
  )
)
#, unnest(split_text) as split_text #
Please uncomment the two lines to split the text from the array into single rows.
It also works for larger datasets and takes less than two minutes:
CREATE TEMP FUNCTION string_split(size int64,str string)
RETURNS ARRAY<STRING>
LANGUAGE js AS r"""
const regraw='\\S.{3,' + size + '}\\S(?= |$)';
const regex = new RegExp(new RegExp(regraw, 'g'), 'g');
return str.match(regex);
""";
SELECT text, split_text,
  length(split_text)
FROM
(
  SELECT
    text, string_split(40, text) as split_text
  FROM (
    SELECT abstract as text from `bigquery-public-data.breathe.jama`
  )
)
, unnest(split_text) as split_text
order by 3 desc
Consider the approach below:
create temp function split_parts(parts array<string>, max_len int64) returns array<string>
language js as """
  var arr = [];
  var part = '';
  for (i = 0; i < parts.length; i++) {
    if (part.length + parts[i].length < max_len) {part += parts[i]}
    else {arr.push(part); part = parts[i];}
  }
  arr.push(part);
  return arr;
""";
select * from (
  select id, offset, part
  from your_table, unnest(split_parts(regexp_extract_all(col, r'[^ ]+ ?'), 50)) part with offset
)
pivot (any_value(trim(part)) as part for offset in (0, 1, 2, 3))
If applied to dummy data with a split size of 50, the output has one part column per chunk (the dummy data and output screenshots from the original answer are not reproduced here).
Non-regexp Approach
DECLARE LONG_SENTENCE DEFAULT "It was my fourth day walking the Island Walk, a new 700km route that circles Canada's smallest province. Starting on PEI's rural west end, I had walked past vinyl-clad farmhouses with ocean vistas, along a boardwalk beneath whirling wind turbines, and above red clay cliffs that plunged sharply into the sea. I had stopped for a midday country music hour at the Stompin' Tom Centre, honouring Canadian singer-songwriter Tom Connors. I'd tromped through the rain along a secluded, wooded trail where swarms of canny mosquitos tried to shelter under my umbrella. And after learning about PEI's major crop at the Canadian Potato Museum, I had fuelled my day's walk with an extra-large cheese-topped baked potato served with freshly made potato chips. You know that a place is serious about its spuds when your potato comes with a side of potatoes.";
CREATE TEMP FUNCTION cumsumbin(a ARRAY<INT64>) RETURNS INT64
LANGUAGE js AS """
  bin = 0;
  a.reduce((c, v) => {
    if (c + Number(v) > 264) { bin += 1; return Number(v); }
    else return c += Number(v);
  }, 0);
  return bin;
""";
WITH splits AS (
  SELECT w, cumsumbin(ARRAY_AGG(LENGTH(w) + 1) OVER (ORDER BY o)) AS bin
  FROM UNNEST(SPLIT(LONG_SENTENCE, ' ')) w WITH OFFSET o
)
SELECT * FROM (
  SELECT bin, STRING_AGG(w, ' ') AS segment
  FROM splits
  GROUP BY 1
) PIVOT (ANY_VALUE(segment) AS segment FOR bin IN (0, 1, 2, 3))
;
Query results:
segment_0: It was my fourth day walking the Island Walk, a new 700km route that circles Canada's smallest province. Starting on PEI's rural west end, I had walked past vinyl-clad farmhouses with ocean vistas, along a boardwalk beneath whirling wind turbines, and above red
segment_1: clay cliffs that plunged sharply into the sea. I had stopped for a midday country music hour at the Stompin' Tom Centre, honouring Canadian singer-songwriter Tom Connors. I'd tromped through the rain along a secluded, wooded trail where swarms of canny mosquitos
segment_2: tried to shelter under my umbrella. And after learning about PEI's major crop at the Canadian Potato Museum, I had fuelled my day's walk with an extra-large cheese-topped baked potato served with freshly made potato chips. You know that a place is serious about
segment_3: its spuds when your potato comes with a side of potatoes.
Length of each segment:
segment_0: 261
segment_1: 262
segment_2: 261
segment_3: 57
Regexp Approach
[note] The expression below, (.{1,264}\b), is simple, but the word boundary \b does not include a period (.), so the result can have some errors. You can see that the last period (.) in segment_3 is missing. But under certain circumstances this might still be useful, I think.
SELECT * FROM (
  SELECT *
  FROM UNNEST(REGEXP_EXTRACT_ALL(LONG_SENTENCE, r'(.{1,264}\b)')) segment WITH OFFSET o
) PIVOT (ANY_VALUE(segment) segment FOR o IN (0, 1, 2, 3));
Query results:
segment_0: It was my fourth day walking the Island Walk, a new 700km route that circles Canada's smallest province. Starting on PEI's rural west end, I had walked past vinyl-clad farmhouses with ocean vistas, along a boardwalk beneath whirling wind turbines, and above red
segment_1: clay cliffs that plunged sharply into the sea. I had stopped for a midday country music hour at the Stompin' Tom Centre, honouring Canadian singer-songwriter Tom Connors. I'd tromped through the rain along a secluded, wooded trail where swarms of canny mosquitos
segment_2: tried to shelter under my umbrella. And after learning about PEI's major crop at the Canadian Potato Museum, I had fuelled my day's walk with an extra-large cheese-topped baked potato served with freshly made potato chips. You know that a place is serious about
segment_3: its spuds when your potato comes with a side of potatoes
Length of each segment:
segment_0: 261
segment_1: 262
segment_2: 261
segment_3: 56

Extract words from the text in Pyspark Dataframe

I have a dataframe:
d = [{'text': 'They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.', 'begin_end': [111, 120]},
     {'text': 'Mom called dad, and when he came home, he took moms car and drove to the store', 'begin_end': [20, 31]}]
s = spark.createDataFrame(d)
+----------+-----------------------------------------------------------------------------------------------------------------------------+
|begin_end |text                                                                                                                         |
+----------+-----------------------------------------------------------------------------------------------------------------------------+
|[111, 120]|They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.|
|[20, 31]  |Mom called dad, and when he came home, he took moms car and drove to the store                                               |
+----------+-----------------------------------------------------------------------------------------------------------------------------+
I needed to extract the words from the text column using the begin_end column array, like text[111:120+1]. In pandas, this could be done via zip:
df['new_col'] = [s[a:b+1] for s, (a,b) in zip(df['text'], df['begin_end'])]
result:
begin_end new_col
0 [111, 120] jumps bad
1 [20, 31] when he came
How can I rewrite the zip logic in PySpark to get new_col? Do I need to write a UDF for this?
You can do so by using substr in an expression. It expects the string you want to slice, a starting position (1-based), and the length of the substring. An expression is needed because the substring function from pyspark.sql.functions doesn't take a column as the starting position or length.
from pyspark.sql import functions as F

s.withColumn('new_col', F.expr("substr(text, begin_end[0] + 1, begin_end[1] - begin_end[0] + 1)")).show()
+----------+--------------------+------------+
| begin_end| text| new_col|
+----------+--------------------+------------+
|[111, 120]|They say that all...| jumps bad|
| [20, 31]|Mom called dad, a...|when he came|
+----------+--------------------+------------+

How to count Total Price in dataframe

I have retail data from which I created a retail dataframe:
spark.sparkContext.addFile('https://raw.githubusercontent.com/databricks/Spark-The-Definitive-Guide/master/data/retail-data/all/online-retail-dataset.csv')
retail_df = spark.read.csv(SparkFiles.get('online-retail-dataset.csv'), header=True, inferSchema=True)\
    .withColumn('OverallItems', struct('StockCode', 'Description', 'UnitPrice', 'Quantity', 'InvoiceDate', 'CustomerID', 'Country'))
Then I created retail_array, which has two columns, InvoiceNo and Items:
retail_array = retail_df.groupBy('InvoiceNo')\
    .agg(collect_list(col('OverallItems')).alias('Items'))
I want to compute the total price of the invoice items and add it into the Items column in retail_array.
So far I have written this code:
transformer = lambda x: struct(x['UnitPrice'], x['Quantity'], x['UnitPrice'] * x['Quantity']).cast("struct<UnitPrice:double,Quantity:double,TotalPrice:double>")
TotalPrice_df = retail_array\
    .withColumn('TotalPrice', transform("items", transformer))
TotalPrice_df.show(truncate=False)
But with this code I'm adding a new column to retail_array, whereas I want this new column to be part of the Items column in retail_array.
For one invoice, the output looks like:
+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+
|InvoiceNo|Items                                                                                                                                                                 |TotalPrice                                                         |
+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+
|536366   |[{22633, HAND WARMER UNION JACK, 1.85, 6, 12/1/2010 8:28, 17850, United Kingdom}, {22632, HAND WARMER RED POLKA DOT, 1.85, 6, 12/1/2010 8:28, 17850, United Kingdom}]|[{1.85, 6.0, 11.100000000000001}, {1.85, 6.0, 11.100000000000001}]|
+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+
I want it to compute 11.100000000000001 + 11.100000000000001 and add the result into the Items column, with no extra column. Also, for other invoices there are sometimes more than two total prices that I want to add to each other.
Use the aggregate function instead of transform to calculate the total price, like this:
from pyspark.sql import functions as F

retail_array = retail_df.groupBy("InvoiceNo").agg(
    F.collect_list(F.col("OverallItems")).alias("Items")
).withColumn(
    "TotalPrice",
    F.aggregate("Items", F.lit(.0), lambda acc, x: acc + (x["Quantity"] * x["UnitPrice"]))
)
Note however that you can actually calculate this TotalPrice in the same aggregation when you collect the list of structs, and thus avoid additional calculations that iterate over the array elements:
retail_array = retail_df.groupBy("InvoiceNo").agg(
    F.collect_list(F.col("OverallItems")).alias("Items"),
    F.sum(F.col("Quantity") * F.col("UnitPrice")).alias("TotalPrice")
)
retail_array.show(1)
#+---------+--------------------+------------------+
#|InvoiceNo| Items| TotalPrice|
#+---------+--------------------+------------------+
#| 536366|[{22633, HAND WAR...|22.200000000000003|
#+---------+--------------------+------------------+
But with this code I'm adding a new column to retail_array, whereas I want this new column to be part of the Items column in retail_array
Not sure I correctly understood this part. The Items column is an array of structs; it does not make much sense to replicate the total price of an InvoiceNo in each of its items.
That said, if you really want to do this, you can use transform after calculating the total price (step above):
result = retail_array.withColumn(
    "Items",
    F.transform("Items", lambda x: x.withField("TotalPrice", F.col("TotalPrice")))
).drop("TotalPrice")
result.show(1, False)
#+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|InvoiceNo|Items |
#+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|536366 |[{22633, HAND WARMER UNION JACK, 1.85, 6, 12/1/2010 8:28, 17850, United Kingdom, 22.200000000000003}, {22632, HAND WARMER RED POLKA DOT, 1.85, 6, 12/1/2010 8:28, 17850, United Kingdom, 22.200000000000003}]|
#+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Taking mean of N largest values of group by absolute value

I have some DataFrame:
import numpy as np
import pandas as pd

d = {'fruit': ['apple', 'pear', 'peach'] * 6, 'values': np.random.uniform(-5, 5, 18), 'values2': np.random.uniform(-5, 5, 18)}
df = pd.DataFrame(data=d)
I can take the mean of each fruit group as such:
df.groupby('fruit').mean()
However, for each group of fruit, I'd like to take the mean of the N largest values as ranked by absolute value.
So for example, if my values were as follows and N=3:
[ 0.7578507 , 3.81178045, -4.04810913, 3.08887538, 2.87999752, 4.65670954]
The desired outcome would be (4.65670954 + -4.04810913 + 3.81178045) / 3 = ~1.47
Edit - to clarify that sign is preserved in outcome:
(4.65670954 + -20.04810913 + 3.81178045) / 3 = -3.859
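As a quick check of the first example's arithmetic (this snippet is not from the question, just a NumPy verification):
import numpy as np

vals = np.array([0.7578507, 3.81178045, -4.04810913, 3.08887538, 2.87999752, 4.65670954])
top3 = vals[np.argsort(np.abs(vals))[-3:]]  # the 3 values with the largest absolute values
print(top3.mean())  # ~1.4735; the signs of the original values are preserved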
Updating with a new approach that I think is simpler. I was avoiding apply like the plague, but maybe this is one of the more acceptable uses. Plus it handles the fact that you want the mean of the original values as ranked by their absolute values:
def foo(d):
    return d[d.abs().nlargest(3).index].mean()

out = df.groupby('fruit')['values'].apply(foo)
So you index each group by the 3 values with the largest absolute values, then take the mean.
And for the record my original, incorrect, and slower code was:
df['values'].abs().groupby(df['fruit']).nlargest(3).groupby("fruit").mean()

how to do this operation in pandas

I have a data frame that contains a country column. Unfortunately the country names are all in capital letters and I need them as ISO3166_1_Alpha_3 codes.
As an example, United States of America is going to be U.S.A, United Kingdom is going to be U.K, and so on.
Fortunately I found a data frame on the internet that contains 2 important columns: the first is the country name and the second is the ISO3166_1_Alpha_3 code.
You can find the data frame on this website:
https://datahub.io/JohnSnowLabs/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes
So I wrote this code:
data_geo = pd.read_excel("tab0.xlsx")  # the data frame that contains all the capitalized country names
country_iso = pd.read_csv(r"https://datahub.io/JohnSnowLabs/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes/r/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes-csv.csv",
                          usecols=['Official_Name_English', 'ISO3166_1_Alpha_3'])
s = pd.Series(data_geo.countery_name_e).str.lower().str.title()  # lower-case the names, keeping only the first letter of each word capitalized
y = pd.Series([])
Now I want to make a loop: when a value in s equals a value of Official_Name_English, I want to append the corresponding ISO3166_1_Alpha_3 value from country_iso to the y series. If the country name isn't in this list, append NaN.
These are 20 rows of s:
['Diffrent Countries', 'Germany', 'Other Countries', 'Syria',
'Jordan', 'Yemen', 'Sudan', 'Somalia', 'Australia',
'Other Countries', 'Syria', 'Lebanon', 'Jordan', 'Yemen', 'Qatar',
'Sudan', 'Ethiopia', 'Djibouti', 'Somalia', 'Botswana Land']
Do you know how I can do this?
You could try map:
data_geo = pd.read_excel("tab0.xlsx")
country_iso = pd.read_csv(r"https://datahub.io/JohnSnowLabs/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes/r/iso-3166-country-codes-itu-dialing-codes-iso-4217-currency-codes-csv.csv",
                          usecols=['Official_Name_English', 'ISO3166_1_Alpha_3'])
s = pd.Series(data_geo.countery_name_e).str.lower().str.title()

mapper = (country_iso.drop_duplicates('Official_Name_English')
          .dropna(subset=['Official_Name_English'])
          .set_index('Official_Name_English')['ISO3166_1_Alpha_3'])

y = s.map(mapper)  # map the normalized names; names not found in the lookup become NaN
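For illustration, here is the same map/NaN behaviour on a tiny hand-built lookup (the two entries below are hard-coded for the example rather than read from the CSV):
import pandas as pd

mini_mapper = pd.Series({'Germany': 'DEU', 'Jordan': 'JOR'})  # tiny stand-in for the real lookup
names = pd.Series(['Germany', 'Other Countries', 'Jordan'])
print(names.map(mini_mapper))  # names missing from the lookup become NaN automatically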