Postgres 13 regexp_match() returns null values

I'm reading a book about Postgres and there is an exercise with regex. So, I created a table and loaded a CSV into it.
CREATE TABLE crime_reports (
crime_id bigserial PRIMARY KEY,
date_1 timestamp with time zone,
date_2 timestamp with time zone,
street varchar(250),
city varchar(100),
crime_type varchar(100),
description text,
case_number varchar(50),
original_text text NOT NULL
);
COPY crime_reports (original_text)
FROM 'C:\YourDirectory\crime_reports.csv'
WITH (FORMAT CSV, HEADER OFF, QUOTE '"');
Here is the CSV:
"4/16/17-4/17/17
2100-0900 hrs.
46000 Block Ashmere Sq.
Sterling
Larceny: The victim reported that a
bicycle was stolen from their opened
garage door during the overnight hours.
C0170006614"
"4/8/17
1600 hrs.
46000 Block Potomac Run Plz.
Sterling
Destruction of Property: The victim
reported that their vehicle was spray
painted and the trim was ripped off while
it was parked at this location.
C0170006162"
"4/4/17
1400-1500 hrs.
24000 Block Hawthorn Thicket Ter.
Sterling
Larceny: The complainant reported that
multiple windows were stolen from this home
under construction. C0170006079"
"04/10/17
1605 hrs.
21800 block Newlin Mill Rd.
Middleburg
Larceny: A license plate was reported
stolen from a vehicle.
SO170006250"
"04/09/17
1200 hrs.
470000 block Fairway Dr.
Sterling
Destruction of Property: Unknown
subject(s) wrote graffiti on a sign in the
area.
SO170006211"
And when I try to execute this query, I get a lot of NULL values returned.
SELECT
regexp_match(original_text, '(?:C0|SO)[0-9]+') AS case_number,
regexp_match(original_text, '\d{1,2}\/\d{1,2}\/\d{2}') AS date_1,
regexp_match(original_text, '\n(?:\w+ \w+|\w+)\n(.*):') AS crime_type,
regexp_match(original_text, '(?:Sq.|Plz.|Dr.|Ter.|Rd.)\n(\w+ \w+|\w+)\n')
AS city
FROM crime_reports;
If I use an explicit string, the query works well:
SELECT regexp_match(
'4/16/17-4/17/17
2100-0900 hrs.
46000 Block Ashmere Sq.
Sterling
Larceny: The victim reported that a
bicycle was stolen from their opened
garage door during the overnight hours.
C0170006614', '\n(?:\w+ \w+|\w+)\n(.*):');
But in my case there are NULL values:
SELECT crime_id,
regexp_match(original_text, '\d{1,2}\/\d{1,2}\/\d{2}') AS date_1,
CASE WHEN EXISTS (SELECT regexp_matches(original_text, '-(\d{1,2}\/\d{1,2}\/\d{1,2})'))
THEN regexp_match(original_text, '-(\d{1,2}\/\d{1,2}\/\d{1,2})')
ELSE NULL
END AS date_2,
regexp_match(original_text, '\/\d{2}\n(\d{4})') AS hour_1,
CASE WHEN EXISTS (SELECT regexp_matches(original_text, '\/\d{2}\n\d{4}-(\d{4})'))
THEN regexp_match(original_text, '\/\d{2}\n\d{4}-(\d{4})')
ELSE NULL
END AS hour_2,
regexp_match(original_text, 'hrs.\n(\d+ .+(?:Sq.|Plz.|Dr.|Ter.|Rd.))') AS street,
regexp_match(original_text, '(?:Sq.|Plz.|Dr.|Ter.|Rd.)\n(\w+ \w+|\w+)\n') AS city,
regexp_match(original_text, '\n(?:\w+ \w+|\w+)\n(.*):') AS crime_type,
regexp_match(original_text, ':\s(.+)(?:C0|SO)') AS description,
regexp_match(original_text, '(?:C0|SO)[0-9]+') AS case_number
FROM crime_reports;
This is what I have:
What should be:
So what am I doing wrong?
I don't know what to do.
By the way I remembered that when I execute:
SELECT original_text FROM crime_reports;
I receive this:
Instead of this like in a book:
And it doesn't display popup msg with a full text.
Does that matter?

How to get the differences between two rows **and** the name of the field where the difference is, in BigQuery?

I have a table in BigQuery like this:
Name | Phone Number | Address
John | 123456778564 | 1 Penny Lane
John | 873452987424 | 1 Penny Lane
Mary | 845704562848 | 87 5th Avenue
Mary | 845704562848 | 54 Lincoln Rd.
Amy | 342847327234 | 4 Ocean Drive Avenue
Amy | 347907387469 | 98 Truman Rd.
I want to get a table with the differences between two consecutive rows and the name of the field where the difference occurs.
I mean this:
Name | Field | Before | After
John | Phone Number | 123456778564 | 873452987424
Mary | Address | 87 5th Avenue | 54 Lincoln Rd.
Amy | Phone Number | 342847327234 | 347907387469
Amy | Address | 4 Ocean Drive Avenue | 98 Truman Rd.
How can I do this? I've looked at other posts but couldn't find anything that corresponds to my need.
Thank you
Consider below BigQuery'ish solution
select Name, ['Phone Number', 'Address'][offset(offset)] Field,
prev_field as Before, field as After
from (
select timestamp, Name, offset, field,
lag(field) over (partition by Name, offset order by timestamp) as prev_field
from yourtable,
unnest([`Phone Number`, Address]) field with offset
)
where prev_field != field
if applied to sample data in your question - output is
As you can see here - no matter how many columns in your table that you need to compare - it is still just one query - no unions and such.
You just need to enumerate your columns in two places
['Phone Number', 'Address'][offset(offset)] Field
and
unnest([`Phone Number`, Address]) field with offset
Note: you can further refactor above using scripting's execute immediate to compose such lists within the query on the fly (check my other answers - I frequently use such technique in them)
One method is just to use lag() and union all
select name, 'phone', prev_phone as before, phone as after
from (select name, phone,
lag(phone) over (partition by name order by timestamp) as prev_phone
from t
) t
where prev_phone <> phone
union all
select name, 'address', prev_address as before, address as after
from (select name, address,
lag(address) over (partition by name order by timestamp) as prev_address
from t
) t
where prev_address <> address
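Both answers above rely on the same idea: compare each row with the previous row for the same Name, one field at a time. The following is a minimal Python sketch of that logic, not BigQuery code. The `ts` ordering key is an assumption (the answers order by a `timestamp` column that the question's table does not show), and `field_changes` is a hypothetical helper name.

```python
from itertools import groupby
from operator import itemgetter

# Sample data from the question. "ts" is an assumed ordering column:
# the table as posted has no timestamp, but both answers order by one.
rows = [
    {"ts": 1, "Name": "John", "Phone Number": "123456778564", "Address": "1 Penny Lane"},
    {"ts": 2, "Name": "John", "Phone Number": "873452987424", "Address": "1 Penny Lane"},
    {"ts": 1, "Name": "Mary", "Phone Number": "845704562848", "Address": "87 5th Avenue"},
    {"ts": 2, "Name": "Mary", "Phone Number": "845704562848", "Address": "54 Lincoln Rd."},
    {"ts": 1, "Name": "Amy", "Phone Number": "342847327234", "Address": "4 Ocean Drive Avenue"},
    {"ts": 2, "Name": "Amy", "Phone Number": "347907387469", "Address": "98 Truman Rd."},
]


def field_changes(rows, fields, key="Name", order="ts"):
    """Return (name, field, before, after) for every field that differs
    between consecutive rows of the same name."""
    changes = []
    ordered = sorted(rows, key=itemgetter(key, order))
    for name, group in groupby(ordered, key=itemgetter(key)):
        group = list(group)
        # Pairing each row with its predecessor plays the role of lag().
        for prev, cur in zip(group, group[1:]):
            for field in fields:
                if prev[field] != cur[field]:
                    changes.append((name, field, prev[field], cur[field]))
    return changes


changes = field_changes(rows, ["Phone Number", "Address"])
for change in changes:
    print(change)
```

As in the first answer, adding another column to compare only means adding its name to the `fields` list.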

SQL - count function not working correctly

I'm trying to count the blood types for each blood bank. I'm using Oracle DB.
The blood bank table is created like this:
CREATE TABLE BloodBank (
BB_ID number(15),
BB_name varchar2(255) not NULL,
B_type varchar2(255),CONSTRAINT
blood_ty_pk FOREIGN KEY
(B_type) references BloodType(B_type),
salary number(15) not Null,
PRIMARY KEY (BB_ID)
);
INSERT INTO BloodBank (BB_ID,BB_name,b_type, salary)
VALUES (370,'new york Blood Bank','A+,A-,B+',12000);
INSERT INTO BloodBank (BB_ID,BB_name,b_type, salary)
VALUES (791,'chicago Blood Bank','B+,AB-,O-',90000);
INSERT INTO BloodBank (BB_ID,BB_name,b_type, salary)
VALUES (246,'los angeles Blood Bank','O+,A-,AB+',4500);
INSERT INTO BloodBank (BB_ID,BB_name,b_type, salary)
VALUES (360,'boston Blood Bank','A+,AB+',13000);
INSERT INTO BloodBank (BB_ID,BB_name,b_type, salary)
VALUES (510,'seattle Blood Bank','AB+,AB-,B+',2300);
select * from BloodBank;
when I use the count function
select count(B_type)
from bloodbank
group by BB_ID;
the result would be like this
So why is the count function not working correctly?
I'm trying to display the blood type count for each blood bank, which is more than one in this case.
I hope I don't get downvoted for solving the specific problem you're asking about, but this query would work:
select bb_id,
bb_name,
REGEXP_COUNT(b_type, ',')+1
from bloodbank;
However, this solution ignores a MAJOR issue with your data, which is that you do not normalize it as @Tim Biegeleisen correctly instructs you to do. The solution I've provided is EXTREMELY hacky in that it counts the commas in your string to determine the number of blood types. This is not at all reliable, and you should 100% do what Tim B recommends. But for the circumstances you find yourself in, this will tell you how many different blood types are kept at a specific blood bank.
http://sqlfiddle.com/#!4/8ed1c2/2
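To make the comma-counting trick concrete, here is a small Python sketch of the same arithmetic (commas + 1) over the rows from the question. It is only an illustration of the idea, not Oracle code, and `type_count` is a hypothetical helper name.

```python
# Rows copied from the INSERT statements in the question: (BB_ID, BB_name, b_type).
blood_banks = [
    (370, "new york Blood Bank", "A+,A-,B+"),
    (791, "chicago Blood Bank", "B+,AB-,O-"),
    (246, "los angeles Blood Bank", "O+,A-,AB+"),
    (360, "boston Blood Bank", "A+,AB+"),
    (510, "seattle Blood Bank", "AB+,AB-,B+"),
]


def type_count(b_type):
    """Count entries in a comma-separated string as (number of commas) + 1."""
    return b_type.count(",") + 1


counts = {bb_id: type_count(b_type) for bb_id, _name, b_type in blood_banks}
print(counts)
```

In this sketch an empty string still yields 1, and a stray trailing comma inflates the count, which is the kind of unreliability the answer warns about.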
You should normalize your data and get each blood type value onto a separate record. That is, your starting data should look like this:
BB_ID | BB_name | b_type | salary
370 | new york Blood Bank | A+ | 12000
370 | new york Blood Bank | A- | 12000
370 | new york Blood Bank | B+ | 12000
... and so on
With this data model, the query you want is something along these lines:
SELECT BB_ID, BB_name, b_type, COUNT(*) AS cnt
FROM bloodbank
GROUP BY BB_ID, BB_name, b_type;
Or, if you want just counts of types across all bloodbanks, then use:
SELECT b_type, COUNT(*) AS cnt
FROM bloodbank
GROUP BY b_type;
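As a rough illustration of what normalizing buys you, the Python sketch below splits each comma-separated b_type into one (BB_ID, b_type) pair per blood type and then counts by bank and by type. It only mirrors the shape of the data; the variable names are illustrative and not part of any schema.

```python
from collections import Counter

# Rows copied from the INSERT statements in the question: (BB_ID, BB_name, b_type).
blood_banks = [
    (370, "new york Blood Bank", "A+,A-,B+"),
    (791, "chicago Blood Bank", "B+,AB-,O-"),
    (246, "los angeles Blood Bank", "O+,A-,AB+"),
    (360, "boston Blood Bank", "A+,AB+"),
    (510, "seattle Blood Bank", "AB+,AB-,B+"),
]

# Normalized shape: one (BB_ID, b_type) pair per blood type.
normalized = [
    (bb_id, b_type)
    for bb_id, _name, types in blood_banks
    for b_type in types.split(",")
]

# Equivalent in spirit to grouping by bank, or grouping by blood type.
per_bank = Counter(bb_id for bb_id, _b_type in normalized)
per_type = Counter(b_type for _bb_id, b_type in normalized)

print(per_bank)
print(per_type)
```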

how to find char in column value

I have two tables:
a table with all country codes, like KZ, US, RU
a table tranzactions with the terminal location, like
(Starbucks 1500 Broadway *Near Times Square US)
(CoffeBoom KZ Mendekulova district *Near Dostyk plaza)
and I want to select
country code number, code str, location terminal name
like
398 | KZ | CoffeBoom KZ Mendekulova district *Near Dostyk plaza
840 | US | Starbucks 1500 Broadway *Near Times Square US
I also want to avoid matching when the code characters only appear inside another word of the terminal location name. For example, 'Gucci Moscow Redkzsuzin district RU' contains the characters of the country codes 'KZ' and 'UZ', but I want to select only 'RU'.
You can try building a regular expression that incorporates the column code_str. The following attempts this. It builds an expression that looks for the beginning of the string or a space, followed by the country code, followed by a space or end-of-string, and extracts the matching rows. However, expect both false positives and false negatives, as you're searching free-form text. Any occurrence matching that pattern will be returned even if it is NOT actually a valid code, and valid ones can be missed as well. For example, it will not find the row:
982,'US', 'Starbucks 618 Miracle Mile, Chicago, IL, USA'
You may need to work out a better definition of what you are searching for.
with tranzactions (country_code_number , code_str , location_terminal_name) as
(select 398,'KZ', 'CoffeBoom KZ Mendekulova district *Near Dostyk plaza' from dual union all
select 840,'US', 'Starbucks 1500 Broadway *Near Times Square US' from dual union all
select 982,'US', 'Starbucks 618 Miracle Mile, Chicago, IL, USA' from dual
)
select * from tranzactions
where regexp_like(location_terminal_name, '(^| )' || code_str || '( |$)' );
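The same pattern can be tried outside the database. This Python sketch builds the `(^| )code( |$)` expression per row with the standard `re` module and shows which of the sample rows it keeps. It is an illustration only: `mentions_code` is a hypothetical helper, and Python's regex engine is not identical to Oracle's, although this simple pattern behaves the same way on these inputs.

```python
import re

# Sample rows from the answer: (country_code_number, code_str, location_terminal_name).
transactions = [
    (398, "KZ", "CoffeBoom KZ Mendekulova district *Near Dostyk plaza"),
    (840, "US", "Starbucks 1500 Broadway *Near Times Square US"),
    (982, "US", "Starbucks 618 Miracle Mile, Chicago, IL, USA"),
]


def mentions_code(code, location):
    """True if code appears as a standalone token delimited by spaces
    or by the start/end of the string."""
    pattern = r"(^| )" + re.escape(code) + r"( |$)"
    return re.search(pattern, location) is not None


matched = [row for row in transactions if mentions_code(row[1], row[2])]
for row in matched:
    print(row)
```

The third row is dropped because "USA" is not followed by a space or the end of the string after "US". For the example in the question, `mentions_code("RU", "Gucci Moscow Redkzsuzin district RU")` is true while the same call with "KZ" is false.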

Summarize time in area

I have a huge amount of data about technicians, who can be OnSite or OnTheWay.
I want to summarize which sites they have been at and for how long.
Example:
id UpdateTime UserName SiteID
488565 2019-02-18 19:07:24.000 stephen null
488388 2019-02-18 17:34:52.000 stephen 297
488558 2019-02-18 18:06:48.000 stephen 297
488565 2019-02-18 18:07:24.000 stephen 297
488565 2019-02-18 14:07:24.000 stephen null
483170 2019-02-18 13:53:14.000 stephen 299
488565 2019-02-18 11:07:24.000 stephen null
483170 2019-02-18 10:53:14.000 stephen 297
The technician was at site 297 twice this day. I want to get this result per technician (the end time is when I get a NULL or a different SiteID):
UserName InComeTime TimeInSite(min) SiteID
stephen 2019-02-18 10:53:14.000 14 297
stephen 2019-02-18 13:53:14.000 14 299
stephen 2019-02-18 17:34:52.000 153 297
thanks,
eyal
Can't comment because no reputation :( ?!? so I'll post as an answer although some questions remain. In principle you can work along the lines of joining null-value site records onto not-null site records. If you can't guarantee that null-value siteIds mean 'exit' and not-null siteIds mean 'entry', then there is no 'starting point' and you'd need to do a table scan. If you can guarantee it (or deal with exceptions separately), then the query could take the following form:
select t1.UserName,
t1.UpdateTime as EntryTime,
t2.UpdateTime as ExitTime,
datediff(MI, t1.UpdateTime, t2.UpdateTime) as TimeInSite,
t1.SiteId
from TimeTable t1
join TimeTable t2 on t2.id in
(select id from TimeTable
where
-- want the same user
UserName = t1.UserName
-- site id null/different means 'exited site'
and (siteId is null)
-- now get the entry with the minimum update time that is greater than the entry time
and UpdateTime = (select min(UpdateTime) from TimeTable where UpdateTime > t1.UpdateTime
)
)
where t1.SiteId is not null
order by EntryTime
This does not take into account that you can have multiple 'not-null' siteIds for the same visit (i.e. the three 297s). Ideally this should be avoided. If you can't, then you could first collate those entries into a temp table to pick only the first entry time.
The above query outputs the following (SQL server, note that I have added entry and exit time for clarity). It is not 100% what you wanted because of the multiple 297s, but maybe it gets you started. Out of time now, maybe someone else can provide a 100% solution. Good luck!
UserName EntryTime ExitTime TimeInSite SiteId
------------ ----------------------- ----------------------- ----------- -----------
stephen 2019-02-18 10:53:14.000 2019-02-18 11:07:24.000 14 297
stephen 2019-02-18 13:53:14.000 2019-02-18 14:07:24.000 14 299
stephen 2019-02-18 18:07:24.000 2019-02-18 19:07:24.000 60 297
You can do this with window functions. You want to assign groups to the rows and then aggregate. How is the grouping defined?
In this case, you want to include the next NULL value in the group. So, a definition that works for you is the number of NULL values accumulated in reverse order. That is:
select t.*,
sum(case when siteId is null then 1 else 0 end) over (partition by userName order by updatetime desc) as grp
from t;
Then you can aggregate to get what you want:
select username, min(siteid) as siteid,
min(updatetime) as incometime,
datediff(minute, min(updatetime), max(updatetime)) as minutes
from (select t.*,
sum(case when siteId is null then 1 else 0 end) over (partition by userName order by updatetime desc) as grp
from t
) t
group by username, grp;
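The grouping rule described above (number each row by how many NULL SiteIDs have been seen so far when walking backwards in time, then aggregate per group) can be sketched in Python as follows. This is an illustration under stated assumptions, not SQL Server code: `summarize_visits` is a hypothetical helper, milliseconds are dropped from the sample timestamps, and minutes are computed as elapsed whole minutes, which can differ slightly from DATEDIFF, since DATEDIFF counts minute boundaries crossed.

```python
from datetime import datetime
from itertools import groupby

# Sample rows from the question as (UpdateTime, UserName, SiteID).
# None stands for NULL; the ".000" milliseconds are omitted.
rows = [
    ("2019-02-18 19:07:24", "stephen", None),
    ("2019-02-18 17:34:52", "stephen", 297),
    ("2019-02-18 18:06:48", "stephen", 297),
    ("2019-02-18 18:07:24", "stephen", 297),
    ("2019-02-18 14:07:24", "stephen", None),
    ("2019-02-18 13:53:14", "stephen", 299),
    ("2019-02-18 11:07:24", "stephen", None),
    ("2019-02-18 10:53:14", "stephen", 297),
]


def summarize_visits(rows):
    """Return (user, income_time, minutes, site_id) per visit."""
    parsed = [
        (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), user, site)
        for ts, user, site in rows
    ]
    # Newest first within each user, like
    # PARTITION BY userName ORDER BY updatetime DESC.
    parsed.sort(key=lambda r: r[0], reverse=True)
    parsed.sort(key=lambda r: r[1])  # stable sort keeps the time order

    visits = []
    for user, user_rows in groupby(parsed, key=lambda r: r[1]):
        grp = 0
        groups = {}
        for ts, _user, site in user_rows:
            if site is None:
                grp += 1  # running count of NULLs is the group number
            groups.setdefault(grp, []).append((ts, site))
        for members in groups.values():
            sites = [site for _ts, site in members if site is not None]
            if not sites:
                continue  # only NULL rows: no visit to report
            times = [ts for ts, _site in members]
            minutes = int((max(times) - min(times)).total_seconds() // 60)
            visits.append((user, min(times), minutes, min(sites)))
    return sorted(visits, key=lambda v: (v[0], v[1]))


visits = summarize_visits(rows)
for user, income_time, minutes, site in visits:
    print(user, income_time, minutes, site)
```

Because each NULL row is the first row of its group in reverse order, it ends up as the latest timestamp of the visit it closes, so the three 297 rows collapse into a single visit.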

How to load grouped data with SSIS

I have a tricky flat file data source. The data is grouped, like this:
Country City
U.S. New York
Washington
Baltimore
Canada Toronto
Vancouver
But I want it to be this format when it's loaded in to the database:
Country City
U.S. New York
U.S. Washington
U.S. Baltimore
Canada Toronto
Canada Vancouver
Has anyone met such a problem before? Any ideas on how to deal with it?
The only idea I have now is to use a cursor, but it is just too slow.
Thank you!
The answer by cha will work, but here is another in case you need to do it in SSIS without temporary/staging tables:
You can run your dataflow through a Script Transformation that uses a DataFlow-level variable. As each row comes in the script checks the value of the Country column.
If it has a non-blank value, then populate the variable with that value, and pass it along in the dataflow.
If Country has a blank value, then overwrite it with the value of the variable, which will be last non-blank Country value you got.
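The row-by-row logic described above amounts to a "fill down" of the last non-blank value. Here is a language-neutral sketch in Python rather than an actual SSIS Script Component; `fill_down` is a hypothetical name, and a plain local variable stands in for the variable the script would keep between rows.

```python
def fill_down(rows):
    """Replace blank Country values with the last non-blank Country seen."""
    last_country = None  # remembered between rows
    filled = []
    for country, city in rows:
        if country.strip():
            # Non-blank: remember it for the rows that follow.
            last_country = country.strip()
        # Blank rows inherit the remembered value.
        filled.append((last_country, city))
    return filled


# Sample data from the question; blank strings stand for the empty Country cells.
rows = [
    ("U.S.", "New York"),
    ("", "Washington"),
    ("", "Baltimore"),
    ("Canada", "Toronto"),
    ("", "Vancouver"),
]

filled = fill_down(rows)
for country, city in filled:
    print(country, city)
```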
EDIT: I looked up your error message and learned something new about Script Components (the Data Flow tool, as opposed to Script Tasks, the Control Flow tool):
The collection of ReadWriteVariables is only available in the
PostExecute method to maximize performance and minimize the risk of
locking conflicts. Therefore you cannot directly increment the value
of a package variable as you process each row of data. Increment the
value of a local variable instead, and set the value of the package
variable to the value of the local variable in the PostExecute method
after all data has been processed. You can also use the
VariableDispenser property to work around this limitation, as
described later in this topic. However, writing directly to a package
variable as each row is processed will negatively impact performance
and increase the risk of locking conflicts.
That comes from this MSDN article, which also has more information about the Variable Dispenser work-around, if you want to go that route, but apparently I misled you above when I said you can set the value of the package variable in the script. You have to use a variable that is local to the script, and then change the package variable in the PostExecute method. I can't tell from the article whether that means that you will not be able to read the variable in the script, and if that's the case, then the Variable Dispenser would be the only option. Or I suppose you could create another variable that the script will have read-only access to, and set its value to an expression so that it always has the value of the read-write variable. That might work.
Yes, it is possible. First you need to load the data to a table with an IDENTITY column:
-- drop table #t
CREATE TABLE #t (id INTEGER IDENTITY PRIMARY KEY,
Country VARCHAR(20),
City VARCHAR(20))
INSERT INTO #t(Country, City)
SELECT a.Country, a.City
FROM OPENROWSET( BULK 'c:\import.txt',
FORMATFILE = 'c:\format.fmt',
FIRSTROW = 2) AS a;
select * from #t
The result will be:
id Country City
----------- -------------------- --------------------
1 U.S. New York
2 Washington
3 Baltimore
4 Canada Toronto
5 Vancouver
And now with a bit of recursive CTE magic you can populate the missing details:
;WITH a as(
SELECT Country
,City
,ID
FROM #t WHERE ID = 1
UNION ALL
SELECT COALESCE(NULLIF(LTrim(#t.Country), ''),a.Country)
,#t.City
,#t.ID
FROM a INNER JOIN #t ON a.ID+1 = #t.ID
)
SELECT * FROM a
OPTION (MAXRECURSION 0)
Result:
Country City ID
-------------------- -------------------- -----------
U.S. New York 1
U.S. Washington 2
U.S. Baltimore 3
Canada Toronto 4
Canada Vancouver 5
Update:
As Tab Alleman suggested below the same result can be achieved without the recursive query:
SELECT ID
, COALESCE(NULLIF(LTrim(a.Country), ''), (SELECT TOP 1 Country FROM #t t WHERE t.ID < a.ID AND LTrim(t.Country) <> '' ORDER BY t.ID DESC))
, City
FROM #t a
BTW, the format file for your input data is this (if you want to try the scripts save the input data as c:\import.txt and the format file below as c:\format.fmt):
9.0
2
1 SQLCHAR 0 11 "" 1 Country SQL_Latin1_General_CP1_CI_AS
2 SQLCHAR 0 100 "\r\n" 2 City SQL_Latin1_General_CP1_CI_AS