I have these sample values
prm_2020 P02 United Kingdom London 2 for 2
prm_2020 P2 United Kingdom London 2 for 2
prm_2020 P10 United Kingdom London 2 for 2
prm_2020 P11 United Kingdom London 2 for 2
I need to extract codes such as P2, P02, P11, p06 and p05 from these strings. I'm trying to use the regexp_extract function in Databricks, but I'm struggling to find the correct expression. Once I have extracted a code such as P10 or p6 from the string, I need to put its number in a new column called ID.
select distinct
promo_name
,regexp_extract(promo_name, '(?<=p\d+\s+)P\d+') as regexp_id
from stock
where promo_name is not null
select distinct
promo_name
,regexp_extract(promo_name, 'P[0-9]+') as regexp_id
from stock
where promo_name is not null
Both queries generate errors.
The expression would be:
select regexp_extract(col, 'P[0-9]+')
One regex could be (?<=prm_\d+\s+)P\d+
Besides searching for strings of the form P* where * is one or more digits, it also checks that such strings are preceded by a string of the form prm_* (again with * being one or more digits) followed by whitespace.
Keep case sensitivity in mind. The solution above is case sensitive: if your input arrives as PRM, the match is discarded. I am not familiar with Apache Spark, but I assume it supports a flag such as /i, as other platforms do, to make the regex case insensitive.
Regexr.com demo
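To make the prefix check and the case-insensitivity point concrete, here is a minimal Python sketch. Note that it swaps in a different technique: Python's `re` module rejects variable-width lookbehinds such as `(?<=prm_\d+\s+)`, so the prefix is matched normally and the code is pulled out with a capture group instead. The inline `(?i)` flag is the usual way to request case-insensitive matching in Python, and Java-style engines (which Spark is understood to use) accept the same inline flag. The function name `promo_code` is hypothetical.

```python
import re

# (?i) makes the whole pattern case insensitive.
# The prefix prm_<digits><whitespace> is matched but not captured;
# group 1 holds only the P<digits> code.
PATTERN = re.compile(r"(?i)prm_\d+\s+(p\d+)")

def promo_code(promo_name: str) -> str:
    """Return the P<digits> code that follows a prm_<digits> prefix, or ""."""
    match = PATTERN.search(promo_name)
    return match.group(1) if match else ""

print(promo_code("prm_2020 P02 United Kingdom London 2 for 2"))
print(promo_code("PRM_2020 p6 United Kingdom London 2 for 2"))
```

Returning an empty string when nothing matches is a choice made here for illustration.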
Just select group 0:
regexp_extract(promo_name, 'P[0-9]+',0)
The regexp_extract function takes 3 parameters:
Column value
Regex Pattern
Group Index
def regexp_extract(e: org.apache.spark.sql.Column,exp: String,groupIdx: Int): org.apache.spark.sql.Column
You are missing the last parameter in your regexp_extract call.
Check the code below.
scala> df.show(truncate = false)
+------------------------------------------+
|data                                      |
+------------------------------------------+
|prm_2020 P02 United Kingdom London 2 for 2|
|prm_2020 P2 United Kingdom London 2 for 2 |
|prm_2020 P10 United Kingdom London 2 for 2|
|prm_2020 P11 United Kingdom London 2 for 2|
+------------------------------------------+
import org.apache.spark.sql.functions.{col, regexp_extract}
df
.withColumn("parsed_data",regexp_extract(col("data"),"(P[0-9]*)",0))
.show(truncate = false)
+------------------------------------------+-----------+
|data                                      |parsed_data|
+------------------------------------------+-----------+
|prm_2020 P02 United Kingdom London 2 for 2|P02        |
|prm_2020 P2 United Kingdom London 2 for 2 |P2         |
|prm_2020 P10 United Kingdom London 2 for 2|P10        |
|prm_2020 P11 United Kingdom London 2 for 2|P11        |
+------------------------------------------+-----------+
df.createTempView("tbl")
spark
.sql("select data,regexp_extract(data,'(P[0-9]*)',0) as parsed_data from tbl")
.show(truncate = false)
+------------------------------------------+-----------+
|data                                      |parsed_data|
+------------------------------------------+-----------+
|prm_2020 P02 United Kingdom London 2 for 2|P02        |
|prm_2020 P2 United Kingdom London 2 for 2 |P2         |
|prm_2020 P10 United Kingdom London 2 for 2|P10        |
|prm_2020 P11 United Kingdom London 2 for 2|P11        |
+------------------------------------------+-----------+
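The group index is also what gets you from the code to the bare number for the ID column. The sketch below uses Python's `re` module only to illustrate the general semantics of group indexes: group 0 is the whole match, group 1 is the first parenthesised part, and asking for a group the pattern does not define is an error. Presumably the Spark equivalent would be something like `regexp_extract(promo_name, 'P([0-9]+)', 1)`, but that is an assumption and has not been run here.

```python
import re

text = "prm_2020 P10 United Kingdom London 2 for 2"

# One capture group around the digits only.
match = re.search(r"P([0-9]+)", text)
whole = match.group(0)   # the entire match, including the leading P
digits = match.group(1)  # only what the parentheses captured

# A pattern without parentheses defines no group 1, so asking for it fails.
try:
    re.search(r"P[0-9]+", text).group(1)
    missing_group_error = False
except IndexError:
    missing_group_error = True

print(whole, digits, missing_group_error)
```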
Related
I would like to get a cartesian product of several tables in SQL (which are actually only one column, so no common key). For example:
TABLE A
Robert
Pierre
Samuel
TABLE B
Montreal
Chicago
TABLE C
KLM
AIR FRANCE
FINAL TABLE (CROSS PRODUCT)
Robert | Montreal | KLM
Pierre | Montreal | KLM
Samuel | Montreal | KLM
Robert | Chicago | KLM
Pierre | Chicago | KLM
Samuel | Chicago | KLM
Robert | Montreal | AIR FRANCE
Pierre | Montreal | AIR FRANCE
Samuel | Montreal | AIR FRANCE
Robert | Chicago | AIR FRANCE
Pierre | Chicago | AIR FRANCE
Samuel | Chicago | AIR FRANCE
I tried CROSS JOIN, but I couldn't find an example with multiple tables. Is nesting the only way to do it? With 15 tables to join that way, the code would get very long.
Thank you!
You would simply use:
select *
from a cross join b cross join c;
Do note that if any of the tables are empty (i.e. no rows), you will get no results.
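As a quick check of both points, here is a minimal sketch that runs the chained CROSS JOIN against an in-memory SQLite database from Python. SQLite is only a stand-in for whatever database you use, and the column names (`person`, `city`, `airline`) and the empty table `d` are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (person TEXT);
CREATE TABLE b (city TEXT);
CREATE TABLE c (airline TEXT);
CREATE TABLE d (unused TEXT);  -- deliberately left empty
INSERT INTO a VALUES ('Robert'), ('Pierre'), ('Samuel');
INSERT INTO b VALUES ('Montreal'), ('Chicago');
INSERT INTO c VALUES ('KLM'), ('AIR FRANCE');
""")

# Chained cross joins: no nesting needed, one extra CROSS JOIN per table.
rows = conn.execute("SELECT * FROM a CROSS JOIN b CROSS JOIN c").fetchall()
print(len(rows))  # 3 * 2 * 2 combinations

# Adding an empty table to the chain wipes out the whole result.
empty_rows = conn.execute(
    "SELECT * FROM a CROSS JOIN b CROSS JOIN c CROSS JOIN d"
).fetchall()
print(len(empty_rows))
```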
I'm learning nested selects and I've run into a problem with the AS keyword inside the second (i.e. nested) select.
Please have a look at the following table (truncated):
+-------------+-----------+---------+------------+--------------+
| name        | continent | area    | population | gdp          |
+-------------+-----------+---------+------------+--------------+
| Afghanistan | Asia      |  652230 |   25500100 |  20343000000 |
| Albania     | Europe    |   28748 |    2831741 |  12960000000 |
| Algeria     | Africa    | 2381741 |   37100000 | 188681000000 |
| Andorra     | Europe    |     468 |      78115 |   3712000000 |
| Angola      | Africa    | 1246700 |   20609294 | 100990000000 |
+-------------+-----------+---------+------------+--------------+
The aim is to show the countries in Europe with a per capita GDP greater than that of the United Kingdom. (Per capita GDP is gdp/population.)
The following query is syntactically correct, but it will not give the correct result because the nested select returns gdp instead of gdp/population:
SELECT name
FROM world
WHERE gdp/population >
(SELECT gdp
FROM world
WHERE name = 'United Kingdom')
AND continent = 'Europe';
One way to correct this would be to use gdp/population instead of gdp in the nested select, but the query I came up with is syntactically incorrect. Why? I use MariaDB, but I'd like the query not to depend on the DBMS vendor.
SELECT name
FROM world
WHERE gdp/population >
(SELECT gdp AS gdp/population
FROM world
WHERE name = 'United Kingdom')
AND continent = 'Europe';
The AS syntax is:
SELECT expression AS ALIAS
So you got it the wrong way round, and the alias you are defining contains an illegal character (/). An alias is not required in this case, so you can simply do:
SELECT name
FROM world
WHERE gdp/population >
(SELECT gdp/population
FROM world
WHERE name = 'United Kingdom')
AND continent = 'Europe';
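To see the corrected query return something, here is a minimal sketch that runs it in an in-memory SQLite database from Python (SQLite standing in for MariaDB). The Albania, Algeria and Andorra rows come from the truncated table in the question, with the area column left out. The United Kingdom row is not in that excerpt, so its figures are made-up placeholders chosen purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE world (name TEXT, continent TEXT, population REAL, gdp REAL);
INSERT INTO world VALUES
  ('Albania', 'Europe', 2831741, 12960000000),
  ('Algeria', 'Africa', 37100000, 188681000000),
  ('Andorra', 'Europe', 78115, 3712000000),
  -- Placeholder figures, NOT real data: 40000 per capita.
  ('United Kingdom', 'Europe', 60000000, 2400000000000);
""")

# The nested select returns a single value: the UK's gdp/population.
rows = conn.execute("""
SELECT name
FROM world
WHERE gdp/population >
      (SELECT gdp/population
       FROM world
       WHERE name = 'United Kingdom')
  AND continent = 'Europe'
""").fetchall()
print(rows)
```

The columns are declared REAL here so that the division is not truncated to an integer in SQLite; other databases may handle integer division differently.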
How can I get all the countries from the DB, from this table:
city | country | info
Jerusalem | Israel | Capital
Tel Aviv | Israel |
New York | USA | Biggest
Washington DC | USA | Capital
Berlin | Germany | Capital
How can I get, using SQL, the countries only: Israel, USA, Germany?
Which database server are you using?
Assuming the top row holds the column names and you are using MySQL, you should be able to just run:
SELECT DISTINCT country FROM <table-name>;
This is probably in the documentation for the database software that you are using.
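For a quick way to try it, here is a minimal sketch using an in-memory SQLite database from Python. SQLite stands in for your actual server, and the table name `places` is made up since the question does not give one. An ORDER BY is added only so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE places (city TEXT, country TEXT, info TEXT);
INSERT INTO places VALUES
  ('Jerusalem', 'Israel', 'Capital'),
  ('Tel Aviv', 'Israel', NULL),
  ('New York', 'USA', 'Biggest'),
  ('Washington DC', 'USA', 'Capital'),
  ('Berlin', 'Germany', 'Capital');
""")

# DISTINCT collapses the repeated country values into one row each.
countries = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT country FROM places ORDER BY country"
    )
]
print(countries)
```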
I've got the following problem to solve with an Excel table:
Column A contains country names and column B contains values, as the example below shows.
Some names in column A come with additional words appended, and I need those variants to be treated as one unique country.
An example follows:
A | B
------------------------------------
Country Name | Number
------------------------------------
ITALY (MOBILE) | 100
PORTUGAL (MOBILE) | 180
UNITED KINGDOM (MOBILE) | 160
ARGENTINA BUA | 120
FRANCE MOBILE ORANGE | 100
CHINA (MOBILE) | 100
ITALY | 93
SPAIN (MOBILE) | 90
PORTUGAL | 85
GERMANY (MOBILE) | 75
UNITED KINGDOM | 10
GERMANY | 70
ECUADOR (MOBILE) | 55
The output could be a table like the following, in the same worksheet, in columns D and E, for example.
It would sum each country's values and show them under the correct unique country name (the correct name is the part that comes first, before the "(", without the text between the parentheses).
A | B
------------------------------------
ITALY | 193
PORTUGAL | 265
UNITED KINGDOM | 170
GERMANY | 145
ARGENTINA | 120
FRANCE MOBILE ORANGE | 100
CHINA | 100
SPAIN | 90
ECUADOR | 55
Is it easier using VBA?
Thanks, guys!
Here's what I would do:
1. Label column C Country (or whatever you need) and write this formula in column C:
=SUBSTITUTE(A2," (MOBILE)","")
2. Create a pivot table on columns A through C, with column C as your row labels and the sum of column B as your values.
@pnuts makes a good point below. I'd use this formula instead in part 1:
=IFERROR(TRIM(REPLACE(A2,FIND("(",A2),FIND(")",A2)-FIND("(",A2)+1,"")),A2)
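To check what the second formula plus the pivot table would produce, here is a rough Python sketch of the same logic: drop the first parenthesised part, trim the result, then sum per cleaned name. It is an approximation of the Excel behaviour written for illustration, not a replacement for the worksheet approach, and the function name `clean_country` is made up.

```python
from collections import defaultdict

def clean_country(name: str) -> str:
    """Mimic IFERROR(TRIM(REPLACE(...)), A2): remove '(' through ')'."""
    start = name.find("(")
    end = name.find(")")
    if start == -1 or end < start:
        # No usable pair of parentheses: keep the name unchanged.
        return name
    stripped = name[:start] + name[end + 1:]
    # Collapse and trim whitespace, roughly what TRIM does.
    return " ".join(stripped.split())

rows = [
    ("ITALY (MOBILE)", 100), ("PORTUGAL (MOBILE)", 180),
    ("UNITED KINGDOM (MOBILE)", 160), ("ARGENTINA BUA", 120),
    ("FRANCE MOBILE ORANGE", 100), ("CHINA (MOBILE)", 100),
    ("ITALY", 93), ("SPAIN (MOBILE)", 90), ("PORTUGAL", 85),
    ("GERMANY (MOBILE)", 75), ("UNITED KINGDOM", 10),
    ("GERMANY", 70), ("ECUADOR (MOBILE)", 55),
]

# The pivot-table step: sum column B per cleaned country name.
totals = defaultdict(int)
for name, value in rows:
    totals[clean_country(name)] += value

for country, total in totals.items():
    print(country, total)
```

Names without parentheses, such as ARGENTINA BUA, pass through unchanged, so they would still need their own rule if they should be merged with a shorter name.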
I'm using the MySQL WORLD database.
For each Continent, I want to return the Name of the country with the largest population.
I was able to come up with a query that works, but I'm trying to find another one that uses only a join and avoids the subquery.
Is there a way to write this query using JOIN?
SELECT Continent, Name
FROM Country c1
WHERE Population >= ALL (SELECT Population FROM Country c2 WHERE c1.continent = c2.continent);
+---------------+----------------------------------------------+
| Continent     | Name                                         |
+---------------+----------------------------------------------+
| Oceania       | Australia                                    |
| South America | Brazil                                       |
| Asia          | China                                        |
| Africa        | Nigeria                                      |
| Europe        | Russian Federation                           |
| North America | United States                                |
| Antarctica    | Antarctica                                   |
| Antarctica    | Bouvet Island                                |
| Antarctica    | South Georgia and the South Sandwich Islands |
| Antarctica    | Heard Island and McDonald Islands            |
| Antarctica    | French Southern territories                  |
+---------------+----------------------------------------------+
11 rows in set (0.14 sec)
This is the "greatest-n-per-group" problem that comes up frequently on StackOverflow.
SELECT c1.Continent, c1.Name
FROM Country c1
LEFT OUTER JOIN Country c2
ON (c1.continent = c2.continent AND c1.Population < c2.Population)
WHERE c2.continent IS NULL;
Explanation: do a join looking for a country c2 that has the same continent and a greater population. If you can't find one (which is indicated by the outer join returning NULL for all columns of c2) then c1 must be the country with the highest population on that continent.
Note that this can find more than one country per continent, if there's a tie for the #1 position. In other words, there could be two countries for which no third country exists with a greater population.
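The tie behaviour is easy to reproduce with a tiny example. The sketch below runs the same LEFT OUTER JOIN in an in-memory SQLite database from Python. The table contents are entirely fictional (made-up names, continents and populations), chosen so that one continent has a clear winner and the other has a tie, which is presumably also what produces the several Antarctica rows in the question's output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Country (Name TEXT, Continent TEXT, Population INTEGER);
-- Fictional rows for illustration only.
INSERT INTO Country VALUES
  ('Alpha', 'East', 10),
  ('Beta',  'East', 20),
  ('Gamma', 'West', 5),
  ('Delta', 'West', 5);
""")

# Keep c1 only when no c2 on the same continent has a larger population.
rows = sorted(conn.execute("""
SELECT c1.Continent, c1.Name
FROM Country c1
LEFT OUTER JOIN Country c2
  ON (c1.Continent = c2.Continent AND c1.Population < c2.Population)
WHERE c2.Continent IS NULL
""").fetchall())
print(rows)
```

Both Gamma and Delta are returned for West because neither has a strictly larger neighbour, while East yields only Beta.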