PostgreSQL - Get all subgroups from text array - sql
I have a table with Job Titles in it
e.g.
Need a Barista on the Weekend
Need a Barista, 24$ an hour
Needed on the weekend, baby sitter, 24$ an hour
I am trying to get a count of unique phrases
e.g.
- 2 x Need a
- 2 x Need a Barista
- 2 x on the Weekend
- 2 x on the
- 2 x 24$ an hour
I have created a table to turn my text into an array of words
CREATE TABLE IF NOT EXISTS job_words (
source VARCHAR,
title VARCHAR,
words VARCHAR[]
)
I have split my titles and inserted as words into this table
insert into job_words
select 'job-title', raw_title, string_to_array(raw_title, ' ') from jobs
The longest sentences have 49 words in them
I would like to find any phrase that is between 2 and 10 words long
Happy to use another table to write into or just a direct query if it is possible
Sample Query to Get some Sample Data
select cardinality(words) no_of_words, words, title from job_words
where cardinality(words) > 4 and cardinality(words) < 10 and title ilike 'need a%'
order by title limit 100
Sample Data
8;"{Need,a,baby,sitter,for,an,amazing,girl}"
7;"{Need,a,baby,sitter,for,casual,sitting}"
8;"{Need,a,babysitter,for,our,19,months,old}"
9;"{Need,a,babysitter,for,our,4,year,old,son}"
9;"{Need,a,babysitter,for,our,little,reyon,19,months}"
7;"{Need,a,babysitter,-,look,no,further}"
5;"{Need,a,babysitter,or,tutor?}"
9;"{Need,a,baby,sitter,tonight,kids,are,already,sleeping}"
6;"{Need,a,Baker,now?,I'm,available!}"
6;"{Need,a,barista,all,rounder,ASAP}"
8;"{NEED,A,BARISTA???,full,time,or,part,time.}"
5;"{Need,a,brick,labourer,urgently}"
9;"{Need,a,care,giver,for,a,Month,old,baby}"
7;"{Need,a,Carer,-,After,School,hours}"
7;"{Need,a,Carpenter,-,build,a,cubby}"
5;"{Need,a,Carwash,staff,asap}"
5;"{Need,a,catering,assistant,job}"
9;"{Need,a,change,from,customer,service?,Look,no,further!}"
5;"{Need,a,change,of,scenery?}"
6;"{NEED,A,CLEANER,-,asap,start}"
6;"{Need,a,cleaner,for,daily,work}"
6;"{Need,a,cleaner,for,daily,work}"
9;"{Need,a,Cleaner,for,hotel,in,Belmont,near,Geelong}"
9;"{Need,a,Cleaner,for,hotel,in,Fyansford,near,Geelong}"
9;"{Need,a,Cleaner,for,hotel,in,Queenscliff,near,Geelong}"
5;"{Need,a,cleaner,for,mcdownal}"
7;"{Need,a,cleaner,for,tomorrow,pay,cash}"
7;"{Need,a,cleaner,for,tomorrow,pay,cash}"
5;"{Need,a,cleaner,in,Brisbane}"
6;"{Need,a,cleaner,in,Roxburghpark,Area}"
7;"{Need,a,cleaner,on,a,weekly,basis}"
7;"{Need,a,cleaner,on,Sunday,18th,June}"
9;"{Need,a,cleaning,team,for,your,building,or,office?}"
8;"{Need,a,concreter,to,start,full,time,/paving}"
6;"{Need,a,contract,climber,on,Tuesday}"
7;"{Need,a,cook,for,Road,Trip,Film}"
6;"{Need,a,delivery,driver,in,kew}"
8;"{Need,a,dishwasher,-,Wetherill,Park,6,days}"
7;"{Need,admin,done,for,hair,salon,asap}"
7;"{Need,admin,done,for,hair,salon,asap}"
7;"{Need,admin,done,for,hair,salon,asap}"
6;"{Need,a,driver,at,8:00,tonight}"
8;"{Need,a,driver,for,my,4.5,tomne,truck}"
7;"{Need,a,driver,in,a,Korean,restaurant}"
6;"{NEED,A,EXPERIENCED,CAR,WASH,STUFF}"
8;"{Need,a,flexible,babysitter,to,suit,shift,work}"
7;"{Need,a,fridge,picked,up,tommorow,Saturday}"
7;"{Need,after,school,care,with,pick,up}"
5;"{need,a,fulltime,female,cleaner}"
5;"{Need,a,full,time,job}"
8;"{Need,a,full,time,nanny,at,Baulkham,Hills}"
6;"{Need,a,"fun,","reliable,",interactive,babysitter}"
9;"{Need,a,gardener,/,labourer,tomorrow,for,5,hours}"
6;"{Need,a,gardener,or,labourer,tomorrow}"
6;"{Need,a,girl,for,sharing,room}"
8;"{Need,a,girl,or,boy,for,cleaning,job}"
6;"{Need,a,good,Barista,in,putney}"
8;"{Need,a,good,painter,for,the,next,month}"
7;"{Need,a,gyprock,setter,for,monday,23/01/17}"
9;"{Need,a,handy,person,in,our,new,work,shop}"
8;"{Need,a,helper,for,a,house,removals,truck}"
5;"{Need,a,house,Cleaner,?}"
6;"{Need,a,house,cleaner?,Call,now}"
6;"{Need,a,house,cleaner?,CALL,NOW!}"
8;"{Need,a,house,cleaner,for,this,afternoon,$30p/h}"
5;"{Need,a,housekeeper,tomorrow,morning}"
9;"{Need,a,HR,driver,for,one,day,a,week}"
8;"{need,a,invester,for,the,new,restaurant,Urgently}"
7;"{Need,a,job,asap,will,start,tomorrow}"
9;"{Need,a,job,?,Backpackers,wanted,+,FREE,ACCOMODATION}"
5;"{Need,a,job,for,weekend.}"
8;"{need,a,job,in,day,time,and,weekends}"
7;"{Need,a,job,of,cleaning,or,handkitchen}"
5;"{NEED,A,JOB?!,Start,immediately!!}"
5;"{need,a,job,(student,here)}"
6;"{Need,a,job,to,start,asap}"
9;"{Need,a,kitchen,hand,for,Indian,take,away,shop}"
8;"{Need,a,labourer.,Easy,work.,Monday,or,Tues}"
6;"{need,a,labourer,for,7,weeks}"
5;"{Need,a,labourer,for,today}"
5;"{Need,a,labourer,next,week}"
9;"{Need,a,last,minute,barista,or,chef?!,Staff,cancelled?!}"
9;"{Need,a,live,in,nanny,for,our,2,sons}"
8;"{Need,a,local,Electrician?,Look,no,further,:)}"
8;"{Need,a,male,cleaner,for,a,busy,restaurant}"
6;"{Need,a,man,and,a,ute!!!}"
9;"{NEED,A,MAN,AND,UTE,MONDAY,26th,AFTER,5PM}"
9;"{NEED,A,MANSPOWER,TO,HELP,US,IN,OUR,MOVING}"
8;"{Need,an,after-school,nanny,for,month,of,October}"
5;"{Need,a,Nanny,/,Babysitter}"
6;"{Need,a,nanny,for,2,kids}"
9;"{Need,a,nanny,for,3,days,after,school,care}"
7;"{Need,a,Nanny,for,7,year,old}"
9;"{Need,a,nanny,for,a,few,days,a,week}"
9;"{Need,a,nanny,for,after,school,pick,and,care}"
8;"{Need,a,nanny,for,immediate,start,on,Thursdays}"
9;"{Need,a,nanny,for,my,3,year,old,daughter}"
9;"{Need,a,nanny,for,my,3,year,old,daughter}"
6;"{Need,a,nanny,for,one,day.}"
7;"{Need,a,nanny,for,"Rydges,",Campbelltown,5.45-10.30pm}"
I got this. Is this the result you expected?
phrase_part count
------------------------------------------------------------------------- -----
{"Need","a"} 32
{"Need","a","cleaner"} 9
{"a","cleaner"} 9
{"cleaner","for"} 5
{"Need","a","babysitter"} 5
{"a","babysitter"} 5
{"a","cleaner","for"} 5
{"Need","a","cleaner","for"} 5
{"for","hotel","in"} 3
{"near","Geelong"} 3
{"hotel","in"} 3
{"Cleaner","for","hotel"} 3
{"for","hotel"} 3
{"babysitter","for","our"} 3
{"for","our"} 3
{"babysitter","for"} 3
{"a","Cleaner","for"} 3
{"Need","a","babysitter","for"} 3
...
{"NEED","A"} 2
{"a","cleaner","for","daily"} 2
{"Need","a","change"} 2
{"daily","work"} 2
{"pay","cash"} 2
{"a","cleaner","in"} 2
{"Need","a","cleaner","for","tomorrow","pay","cash"} 2
{"for","casual","sitting"} 1
{"concreter","to","start","full"} 1
{"a","change","from","customer","service?","Look"} 1
{"a","contract","climber"} 1
{"Roxburghpark","Area"} 1
{"Need","a","Carwash"} 1
...
If this is your expected result, here's the query for it. But I am not sure you should run this on a huge data set: the self-join produces n(n+1)/2 rows for a phrase of n words.
I began with the plain phrases instead of your sample data with the arrays. Additionally I added an id column for each phrase:
WITH phrases AS (
    SELECT
        id,
        word,
        -- ord keeps the original word order; without an ORDER BY the numbering is not guaranteed
        row_number() OVER (PARTITION BY id ORDER BY ord) AS nth_word -- B
    FROM (
        SELECT
            p.id,
            w.word,
            w.ord
        FROM testdata.phrases p
        CROSS JOIN LATERAL unnest(string_to_array(p.phrase, ' '))
            WITH ORDINALITY AS w(word, ord) -- A
    ) s
)
SELECT
    phrase_part,
    count(phrase_part) FILTER (WHERE cardinality(phrase_part) >= 2) -- E
FROM (
    SELECT
        *,
        array_agg(b.word) OVER (PARTITION BY a.id, a.nth_word ORDER BY a.id, a.nth_word, b.nth_word) -- D
            AS phrase_part
    FROM phrases a
    JOIN phrases b -- C
        ON (a.id = b.id AND a.nth_word <= b.nth_word)
) s
GROUP BY phrase_part
ORDER BY count DESC
A: split each plain phrase into an array of single words and expand the table to one word per row
B: add a word counter with a window function to identify the nth word of each phrase
C: self-join the phrases; more precisely, join each word with itself and with every following word of the same phrase
D: this window function aggregates the phrase words. It creates a result like
id word nth_word id word nth_word phrase_part
-- ------------ -------- -- ------------ -------- -------------------------------------------------------------------------
1 Need 1 1 Need 1 {"Need"}
1 Need 1 1 a 2 {"Need","a"}
1 Need 1 1 baby 3 {"Need","a","baby"}
1 Need 1 1 sitter 4 {"Need","a","baby","sitter"}
1 Need 1 1 for 5 {"Need","a","baby","sitter","for"}
1 Need 1 1 casual 6 {"Need","a","baby","sitter","for","casual"}
1 Need 1 1 sitting 7 {"Need","a","baby","sitter","for","casual","sitting"}
1 a 2 1 a 2 {"a"}
1 a 2 1 baby 3 {"a","baby"}
1 a 2 1 sitter 4 {"a","baby","sitter"}
E: group by phrase part and count the occurrences. The FILTER clause restricts the count to phrase parts with at least two words; change its condition to count other cardinalities.
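For checking the SQL result on a small sample, the same counting can be sketched outside the database. This is only an illustration in Python, not part of the answer's query: the function name count_phrases and its parameters are made up here, and it splits on single spaces just like string_to_array(raw_title, ' '), so punctuation stays attached to the words.

```python
from collections import Counter


def count_phrases(titles, min_words=2, max_words=10):
    """Count every run of consecutive words between min_words and max_words long."""
    counts = Counter()
    for title in titles:
        words = title.split(" ")  # same splitting rule as string_to_array(title, ' ')
        for start in range(len(words)):
            # A phrase cannot run past the end of the title or past max_words.
            longest = min(max_words, len(words) - start)
            for length in range(min_words, longest + 1):
                counts[tuple(words[start:start + length])] += 1
    return counts


# The three example titles from the question.
titles = [
    "Need a Barista on the Weekend",
    "Need a Barista, 24$ an hour",
    "Needed on the weekend, baby sitter, 24$ an hour",
]
counts = count_phrases(titles)
for phrase, n in counts.most_common(5):
    print(n, " ".join(phrase))
```

Because "Barista" and "Barista," are different words under this splitting rule, "Need a Barista" is counted only once here; stripping punctuation and lower-casing before splitting would be needed to get the counts listed in the question.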
Related
JOIN on aggregate function
I have a table showing production steps (PosID) for a production order (OrderID) and which machine (MachID) they will be run on. I'm trying to reduce the table to show one record for each order: the lowest position (field "PosID") that is still open (field "Open" = Y), i.e. the next production step for the order.
Example data I have:
OrderID PosID MachID Open
1 1 A N
1 2 B Y
1 3 C Y
2 4 C Y
2 5 D Y
2 6 E Y
Example result I want:
OrderID PosID MachID
1 2 B
2 4 C
I've tried two approaches, but I can't seem to get either to work:
I don't want to put "MachID" in the GROUP BY because that gives me all the records that are open, but I also don't think there is an appropriate aggregate function for the "MachID" field to make this work.
SELECT "OrderID", MIN("PosID"), "MachID"
FROM Table T0
WHERE "Open" = 'Y'
GROUP BY "OrderID"
With this approach, I keep getting error messages that T1."PosID" (in the JOIN clause) is an invalid column. I've also tried T1.MIN("PosID") and MIN(T1."PosID").
SELECT T0."OrderID", T0."PosID", T0."MachID"
FROM Table T0
JOIN (SELECT "OrderID", MIN("PosID") FROM Table WHERE "Open" = 'Y' GROUP BY "OrderID") T1
ON T0."OrderID" = T1."OrderID" AND T0."PosID" = T1."PosID"
Try this:
SELECT "OrderID", "PosID", "MachID"
FROM (
    SELECT T0."OrderID", T0."PosID", T0."MachID",
        ROW_NUMBER() OVER (PARTITION BY "OrderID" ORDER BY "PosID") RNK
    FROM Table T0
    WHERE "Open" = 'Y'
) AS A
WHERE RNK = 1
I've kept the double quotes around the column names as you wrote them in the question, but in general they are not needed.
The inner query first filters to the open rows and then numbers the rows within each OrderID from 1 to X, ordered by PosID:
OrderID PosID MachID Open RNK
1 2 B Y 1
1 3 C Y 2
2 4 C Y 1
2 5 D Y 2
2 6 E Y 3
The outer query then filters on RNK = 1, which marks the lowest PosID per OrderID.
ROW_NUMBER() in the select clause is called a window function, and there are many more which are quite useful.
P.S. The above solution should work for MSSQL.
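The join-on-aggregate approach from the question can also be made to work. The "invalid column" error most likely comes from MIN("PosID") having no alias in the derived table, so T1 has no column named "PosID". The sketch below runs that variant against an in-memory SQLite database from Python; the table name steps and the alias "MinPosID" are assumptions made for the example, and SQLite stands in for whatever database the question uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE steps ("OrderID" INTEGER, "PosID" INTEGER, "MachID" TEXT, "Open" TEXT)'
)
conn.executemany(
    "INSERT INTO steps VALUES (?, ?, ?, ?)",
    [(1, 1, "A", "N"), (1, 2, "B", "Y"), (1, 3, "C", "Y"),
     (2, 4, "C", "Y"), (2, 5, "D", "Y"), (2, 6, "E", "Y")],
)

# The aggregate gets an alias so the outer join condition can refer to it.
query = """
SELECT T0."OrderID", T0."PosID", T0."MachID"
FROM steps T0
JOIN (
    SELECT "OrderID", MIN("PosID") AS "MinPosID"
    FROM steps
    WHERE "Open" = 'Y'
    GROUP BY "OrderID"
) T1
  ON T0."OrderID" = T1."OrderID" AND T0."PosID" = T1."MinPosID"
ORDER BY T0."OrderID"
"""
rows = conn.execute(query).fetchall()
print(rows)
```

This reads the table twice, whereas the ROW_NUMBER() version reads it once, so the window function is usually the tidier choice where it is available.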
Keyset pagination with composite key
I am using an Oracle 12c database and I have a table with the following structure:
Id NUMBER
SeqNo NUMBER
Val NUMBER
Valid VARCHAR2
A composite primary key is created on the fields Id and SeqNo.
I would like to fetch the data with Valid = 'Y' and apply keyset pagination with a page size of 3.
Assume I have the following data:
Id SeqNo Val Valid
1 1 10 Y
1 2 20 N
1 3 30 Y
1 4 40 Y
1 5 50 Y
2 1 100 Y
2 2 200 Y
Expected result:
Page 1
Id SeqNo Val Valid
1 1 10 Y
1 3 30 Y
1 4 40 Y
Page 2
Id SeqNo Val Valid
1 5 50 Y
2 1 100 Y
2 2 200 Y
Offset pagination can be done like this:
SELECT * FROM table ORDER BY Id, SeqNo OFFSET 3 ROWS FETCH NEXT 3 ROWS ONLY;
However, the actual database has more than 5 million records, and using OFFSET is going to slow down the query a lot. Therefore, I am looking for a keyset pagination approach (skip records using some unique fields instead of OFFSET).
Since a composite primary key is used, I need to offset the page with information from more than one field.
This is a sample SQL that should work in PostgreSQL (fetch 2nd page):
SELECT * FROM table WHERE (Id, SeqNo) > (1, 4) AND Valid = 'Y' ORDER BY Id, SeqNo LIMIT 3;
How do I achieve the same in Oracle?
Use the row_number() analytic function with the ceil arithmetic function. Arithmetic functions don't have a negative impact on performance, and the row_number() over (order by ...) expression numbers the rows in key order regardless of insertion order. Add an order by to the main query if the output order matters. So, consider:
select Id, SeqNo, ceil(row_number() over (order by Id, SeqNo)/3) as page
from tab
where Valid = 'Y';
P.S. It also works for Oracle 11g, while OFFSET 3 ROWS FETCH NEXT 3 ROWS ONLY requires Oracle 12c or later.
You can use ORDER BY and then fetch rows using FETCH and OFFSET like the following:
Select ID, SEQNO, VAL, VALID
FROM TABLE
WHERE VALID = 'Y'
ORDER BY ID, SEQNO
--FETCH FIRST 3 ROWS ONLY -- first page
--OFFSET 3 ROWS FETCH NEXT 3 ROWS ONLY -- second page
--OFFSET 6 ROWS FETCH NEXT 3 ROWS ONLY -- third page
--Update--
You can use the row_number analytic function as follows:
Select id, seqNo, Val, valid
from (Select t.*, Row_number() over (order by id, seqNo) as rn
    from table t
    Where valid = 'Y')
Where ceil(rn/3) = 2 -- for page no. 2
Cheers!!
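Both answers above still number or skip rows from the start of the result, so neither is keyset pagination in the sense the question asks for. The row-value comparison (Id, SeqNo) > (1, 4) can be written out as Id > 1 OR (Id = 1 AND SeqNo > 4), which is logically equivalent and uses only plain comparisons. The sketch below demonstrates that predicate with Python and an in-memory SQLite database; the table name tab, the helper fetch_page and the starting key (0, 0) (which assumes positive ids) are made up for the example. In Oracle 12c, FETCH FIRST 3 ROWS ONLY would take the place of LIMIT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tab (Id INTEGER, SeqNo INTEGER, Val INTEGER, Valid TEXT, "
    "PRIMARY KEY (Id, SeqNo))"
)
conn.executemany(
    "INSERT INTO tab VALUES (?, ?, ?, ?)",
    [(1, 1, 10, "Y"), (1, 2, 20, "N"), (1, 3, 30, "Y"), (1, 4, 40, "Y"),
     (1, 5, 50, "Y"), (2, 1, 100, "Y"), (2, 2, 200, "Y")],
)


def fetch_page(conn, last_id, last_seq, page_size=3):
    """Return the next page of valid rows strictly after the key (last_id, last_seq)."""
    sql = """
        SELECT Id, SeqNo, Val, Valid
        FROM tab
        WHERE Valid = 'Y'
          AND (Id > ? OR (Id = ? AND SeqNo > ?))
        ORDER BY Id, SeqNo
        LIMIT ?
    """
    return conn.execute(sql, (last_id, last_id, last_seq, page_size)).fetchall()


page1 = fetch_page(conn, 0, 0)            # (0, 0) sorts before every real key
last_id, last_seq = page1[-1][:2]         # remember the last key of the page
page2 = fetch_page(conn, last_id, last_seq)
print(page1)
print(page2)
```

The client only has to carry the last (Id, SeqNo) pair from one request to the next, and the primary key index can be used to seek straight to the start of each page.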
SQL random number that doesn't repeat within a group
Suppose I have a table:
HH SLOT RN
--------------
1 1 null
1 2 null
1 3 null
--------------
2 1 null
2 2 null
2 3 null
I want to set RN to be a random number between 1 and 10. It's OK for the number to repeat across the entire table, but it's bad to repeat the number within any given HH. E.g.:
HH SLOT RN_GOOD RN_BAD
--------------------------
1 1 9 3
1 2 4 8
1 3 7 3 <--!!!
--------------------------
2 1 2 1
2 2 4 6
2 3 9 4
This is on Netezza, if it makes any difference. This one's being a real headscratcher for me. Thanks in advance!
To get a random number between 1 and the number of rows in the hh, you can use:
select hh, slot, row_number() over (partition by hh order by random()) as rn
from t;
The larger range of values is a bit more challenging. The following calculates a table (called randoms) with the numbers 1 to 10 and a random position in the same range. It then uses slot to index into the position and pull the random number from the randoms table:
with nums as (
    select 1 as n union all select 2 union all select 3 union all
    select 4 union all select 5 union all select 6 union all
    select 7 union all select 8 union all select 9 union all select 10
),
randoms as (
    select n, row_number() over (order by random()) as pos
    from nums
)
select t.hh, t.slot, hnum.n
from (select hh, randoms.n, randoms.pos
    from (select distinct hh from t) t
    cross join randoms
) hnum
join t on t.hh = hnum.hh and t.slot = hnum.pos;
Here is a SQLFiddle that demonstrates this in Postgres, which I assume is close enough to Netezza to have matching syntax.
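The underlying idea, drawing without replacement inside each household, can be sketched in a few lines of Python. This is only an illustration of the logic, not Netezza code; the function name assign_random_numbers and its parameters are invented for the example.

```python
import random


def assign_random_numbers(rows, low=1, high=10, seed=None):
    """Give each (hh, slot) a random number in [low, high], unique within its hh."""
    rng = random.Random(seed)
    slots_by_hh = {}
    for hh, slot in rows:
        slots_by_hh.setdefault(hh, []).append(slot)

    assigned = {}
    for hh, slots in slots_by_hh.items():
        # sample() draws without replacement, so no number repeats within one hh.
        # It raises ValueError if a hh has more slots than numbers in the range.
        picks = rng.sample(range(low, high + 1), len(slots))
        for slot, rn in zip(slots, picks):
            assigned[(hh, slot)] = rn
    return assigned


rows = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)]
assigned = assign_random_numbers(rows, seed=42)
for (hh, slot), rn in sorted(assigned.items()):
    print(hh, slot, rn)
```

Each household gets its own independent draw, so numbers may repeat across households but never inside one.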
I am not an expert on SQL, but probably do something like this:
1. Initialize a counter CNT=1
2. Create a table such that you sample 1 row randomly from each group and a count of null RN, say C_NULL_RN.
3. With probability C_NULL_RN/(10-CNT+1) for each row, assign CNT as RN
4. Increment CNT and go to step 2
Well, I couldn't get a slick solution, so I did a hack:
1. Created a new integer field called rand_inst.
2. Assign a random number to each empty slot.
3. Update rand_inst to be the instance number of that random number within this household. E.g., if I get two 3's, then the second 3 will have rand_inst set to 2.
4. Update the table to assign a different random number anywhere that rand_inst>1.
5. Repeat assignment and update until we converge on a solution.
Here's what it looks like. Too lazy to anonymise it, so the names are a little different from my original post:
/* Iterative hack to fill 6 slots with a random number between 1 and 13.
   A random number *must not* repeat within a household_id. */
update c3_lalfinal a
set a.rand_inst = b.rnum
from (
    select household_id
        ,slot_nbr
        ,row_number() over (partition by household_id,rnd order by null) as rnum
    from c3_lalfinal
) b
where a.household_id = b.household_id
and a.slot_nbr = b.slot_nbr
;
update c3_lalfinal
set rnd = CAST(0.5 + random() * (13-1+1) as INT)
where rand_inst>1
;
/* Repeat until this query returns 0: */
select count(*) from (
    select household_id
    from c3_lalfinal
    group by 1
    having count(distinct(rnd)) <> 6
) x
;
Access SQL query to mailmerge
How can I transform this table from this
id name
1 sam
2 nick
3 ali
4 farah
5 josef
6 fadi
to
id1 name1 id2 name2 id3 name3 id4 name4
1 sam 2 nick 3 ali 4 farah
5 josef 6 fadi
The reason I need this is that I have a database and I need to do a mail merge using Word, and I want to print every 4 rows on one page. MS Word can only print one row per page, so using an SQL query I want one row to represent 4 rows.
Thanks in advance
Ali
You don't need to create a query for this in Access. Word has a merge field called <<Next Record>> which forces moving to the next record. If you look at how label documents are created using the Mail Merge Wizard, you'll see that's how it's done.
Updated - Doing this in SQL
The columns in simple SELECT statements are derived from the columns of the underlying table/query (or from expressions). If you want to define columns based on the data, you need to use a crosstab query.
First create a query with a running count for each person (say your table is called People), and calculate the row and column position from the running count:
SELECT People.id, Count(*)-1 AS RunningCount, int(RunningCount/4) AS RowNumber, RunningCount Mod 4 AS ColumnNumber
FROM People
LEFT JOIN People AS People_1 ON People.id >= People_1.id
GROUP BY People.id;
(You won't be able to view this in the Query Designer, because the JOIN isn't comparing with = but with >=.)
This query returns the following results:
id RunningCount RowNumber ColumnNumber
1 0 0 0
2 1 0 1
3 2 0 2
4 3 0 3
5 4 1 0
6 5 1 1
Assuming this query is saved as Positions, the following query will return the results:
TRANSFORM First(Item) AS FirstOfItem
SELECT RowNumber
FROM (
    SELECT ID AS Item, RowNumber, "id" & (ColumnNumber + 1) AS ColumnHeading
    FROM Positions
    UNION ALL
    SELECT Name, RowNumber, "name" & (ColumnNumber + 1)
    FROM Positions
    INNER JOIN People ON Positions.id = People.id
) AS AllValues
GROUP BY AllValues.RowNumber
PIVOT AllValues.ColumnHeading In ("id1","name1","id2","name2","id3","name3","id4","name4");
The UNION is there so each record in the People table will have two columns - one with the id, and one with the name.
The PIVOT clause forces the columns to the specified order, and not in alphabetical order (e.g. id1, id2 ... name1, name2...)
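The row and column arithmetic behind the Positions query (integer division for the row, remainder for the column) is easy to check on the sample data outside Access. The following Python sketch is only an illustration of that arithmetic; the function name to_merge_rows is invented here.

```python
def to_merge_rows(people, per_row=4):
    """Fold a list of (id, name) pairs into wide rows of per_row people each."""
    rows = {}
    for running_count, (person_id, name) in enumerate(people):
        row_number = running_count // per_row    # int(RunningCount/4) in the query
        column_number = running_count % per_row  # RunningCount Mod 4 in the query
        row = rows.setdefault(row_number, {})
        row[f"id{column_number + 1}"] = person_id
        row[f"name{column_number + 1}"] = name
    return [rows[k] for k in sorted(rows)]


people = [(1, "sam"), (2, "nick"), (3, "ali"), (4, "farah"), (5, "josef"), (6, "fadi")]
merged = to_merge_rows(people)
for row in merged:
    print(row)
```

The last row simply has fewer columns filled, which corresponds to the empty cells a crosstab query leaves when the number of people is not a multiple of four.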
Returning several rows from a single query, based on a value of a column
Let's say I have this table:
|Fld | Number|
1 5
2 2
And I want to make a select that retrieves as many Fld as the Number field has:
|Fld |
1
1
1
1
1
2
2
How can I achieve this? I was thinking about making a temporary table and inserting data based on the Number, but I was wondering if this could be done with a single SELECT statement.
PS: I'm new to SQL
You can join with a numbers table:
SELECT Fld
FROM yourtable
JOIN Numbers ON Numbers.Number <= yourtable.Number
A numbers table is just a table with a list of numbers:
Number
1
2
3
etc...
Not a great solution (since you still query your table twice), but maybe you can work from it:
SELECT t1.fld, t1.number
FROM table t1, (
    SELECT ROWNUM number FROM dual
    CONNECT BY LEVEL <= (SELECT MAX(number) FROM table)) t2
WHERE t2.number <= t1.number
It generates the maximum number of rows needed and then filters them for each row.
I don't know if your RDBMS version supports it (although I rather suspect it does), but here is a recursive version:
WITH remaining (fld, times) as (
    SELECT fld, 1
    FROM <table>
    UNION ALL
    SELECT a.fld, a.times + 1
    FROM remaining as a
    JOIN <table> as b
    ON b.fld = a.fld AND b.number > a.times)
SELECT fld
FROM remaining
ORDER BY fld
Given your source data table, it outputs this (count included for verification):
fld times
=============
1 1
1 2
1 3
1 4
1 5
2 1
2 2
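The numbers-table join and the recursive CTE can be combined, so that the list of numbers is generated on the fly instead of being stored. The sketch below tries this from Python against an in-memory SQLite database; the table name t and the CTE name numbers are assumptions for the example, and other databases may spell the recursive CTE slightly differently (for instance without the RECURSIVE keyword).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (fld INTEGER, number INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 5), (2, 2)])

# numbers generates 1..MAX(number); each source row then joins to the
# numbers that do not exceed its own count, giving `number` copies of fld.
query = """
WITH RECURSIVE numbers(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM numbers WHERE n < (SELECT MAX(number) FROM t)
)
SELECT t.fld
FROM t
JOIN numbers ON numbers.n <= t.number
ORDER BY t.fld
"""
flds = [row[0] for row in conn.execute(query)]
print(flds)
```

Note the direction of the join condition: the generated number has to be less than or equal to the row's count, otherwise the row is repeated once for every larger number instead.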