Extracting a word from a string in each row and appending it as a new column in SQL Server - sql

I have a data set with 3 columns and 15,565 observations. One of the columns contains several words in the same row.
What I want to do is extract a particular word from each row and append it as a new column (so I will have 4 columns in total).
The problem is that the words I am looking for are not the same in every row, and they are not always in the same position.
Here is an extract of my data set:
x  y  z
-----------------------------------------------------------------------
1  T  3C00652722 (T558799A)
2  T  NA >> MSP: T0578836A & 3C03024632
3  T  T0579010A, 3C03051500, EAET03051496
4  U  T0023231A > MSP: T0577506A & 3C02808556
8  U  (T561041A C72/59460)>POPMigr.T576447A,C72/221816*3C00721502
I want to extract all the words that start with 3C and are 10 characters long, and then append them to a new column so it looks like this:
x  y  z                                        Ref
----------------------------------------------------------------
1  T  3C00652722 (T558799A)                    3C00652722
2  T  NA >> MSP: T0578836A & 3C03024632        3C03024632
3  T  T0579010A, 3C03051500, EAET03051496      3C03051500
4  U  T0023231A > MSP: T0577506A & 3C02808556  3C02808556
8  U  >POPMigr.T576447A,C72/221816*3C00721502  3C00721502
I have tried using CONTAINS, LIKE and SUBSTRING, but they do not give me the result I am looking for: they find the rows that contain a 3C number but do not extract it, so the whole cell is copied into the Ref column.

SQL Server's built-in string functions are limited, but this should suffice if you only want to extract one value per row:
select t.*,
       left(stuff(col,
                  1,
                  -- number of characters before the match; without the "- 1" the leading "3" is removed too
                  patindex('%3C[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', col) - 1,
                  ''
                 ), 10) as Ref
from t;
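
To sanity-check the pattern outside the database, the same rule (the literal 3C followed by eight digits) can be sketched in Python. This is only an illustration on sample rows from the question, not part of the SQL answer; the names `REF_PATTERN` and `extract_ref` are made up here, and like the query it returns only the first match per row.

```python
import re

# "3C" followed by eight digits, mirroring the PATINDEX pattern.
REF_PATTERN = re.compile(r"3C[0-9]{8}")


def extract_ref(text):
    """Return the first 3C reference found in text, or None if there is none."""
    match = REF_PATTERN.search(text)
    return match.group(0) if match else None


rows = [
    "3C00652722 (T558799A)",
    "NA >> MSP: T0578836A & 3C03024632",
    "T0579010A, 3C03051500, EAET03051496",
    "(T561041A C72/59460)>POPMigr.T576447A,C72/221816*3C00721502",
]
for row in rows:
    print(extract_ref(row), "<-", row)
```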

Related

Select rows where column value is a combination of numbers and letters

Having a dataset like this:
word
0 TBH46T
1 BBBB
2 5AAH
3 CAAH
4 AAB1
5 5556
What would be the most efficient way to select the rows where the column word is a combination of numbers and letters?
The output would be like this:
word
0 TBH46T
2 5AAH
4 AAB1
A possible solution would be to create a new column, using apply and a regex, that stores whether the column word has the desired structure. But I'm curious whether this could be achieved in a more straightforward way.
Use Series.str.contains twice, once to match a digit and once to match a non-digit, and chain the two masks with & (bitwise AND):
df = df[df['word'].str.contains(r'\d') & df['word'].str.contains(r'\D')]
print (df)
word
0 TBH46T
2 5AAH
4 AAB1
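
Note that `\D` matches any non-digit character, so punctuation or spaces would also pass the second mask. If that matters, a stricter per-value check can be sketched in plain Python as below. The helper name `is_mixed` is an assumption made for this illustration; it could presumably be applied to the column with something like `df['word'].map(is_mixed)`.

```python
def is_mixed(word):
    """True if word contains at least one digit and at least one letter."""
    has_digit = any(ch.isdigit() for ch in word)
    has_letter = any(ch.isalpha() for ch in word)
    return has_digit and has_letter


words = ["TBH46T", "BBBB", "5AAH", "CAAH", "AAB1", "5556"]
print([w for w in words if is_mixed(w)])
# ['TBH46T', '5AAH', 'AAB1']
```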

Merge certain rows in a DataFrame based on startswith

I have a DataFrame, in which I want to merge certain rows to a single one. It has the following structure (values repeat)
Index Value
1 date:xxxx
2 user:xxxx
3 time:xxxx
4 description:xxx1
5 xxx2
6 xxx3
7 billed:xxxx
...
Now the problem is that rows 5 and 6 still belong to the description and were just separated wrongly (the whole string was split on ","). I want to merge the "description" row (4) with the values that follow it (5, 6). In my DF, there can be 1-5 additional entries that have to be merged with the description row, but the structure allows me to work with startswith, because no matter how many rows have to be merged, the end point is always the row that starts with "billed". As I am very new to Python, I haven't written any code for this problem yet.
My thought is the following (if it is even possible):
Look for a row which starts with "description" → merge all the rows after it until reaching the row which starts with "billed", then stop (obviously we keep the "billed" row) → do the same for each row starting with "description"
New DF should look like:
Index Value
1 date:xxxx
2 user:xxxx
3 time:xxxx
4 description:xxx1, xxx2, xxx3
5 billed:xxxx
...
import pandas as pd

df = pd.DataFrame.from_dict({'Value': ('date:xxxx', 'user:xxxx', 'time:xxxx', 'description:xxx', 'xxx2', 'xxx3', 'billed:xxxx')})
records = []
description = description_val = None
for rec in df.to_dict('records'):  # type: dict
    # if there is a previous description and this record starts with its value
    if description and rec['Value'].startswith(description_val):
        description['Value'] += ', ' + rec['Value']  # add record Value into previous description
        continue
    # record with new description...
    if rec['Value'].startswith('description:'):
        description = rec
        # split only on the first ':' so a description containing ':' does not break unpacking
        _, description_val = rec['Value'].split(':', 1)
    elif rec['Value'].startswith('billed:'):
        # billed record - remove description value
        description = description_val = None
    records.append(rec)
print(pd.DataFrame(records))
#                       Value
# 0                 date:xxxx
# 1                 user:xxxx
# 2                 time:xxxx
# 3  description:xxx, xxx2, xxx3
# 4               billed:xxxx
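
The loop above relies on the continuation rows starting with the same text as the description value (xxx2 starts with xxx). If that does not hold in the real data, a variant that follows the rule stated in the question (merge everything after a "description" row until the next "billed" row) could look like the sketch below. It works on a plain list of strings, for example the result of `df['Value'].tolist()`; the function name `merge_descriptions` is made up for this illustration.

```python
def merge_descriptions(values):
    """Join rows that follow a 'description:' row onto it, up to 'billed:'."""
    merged = []
    in_description = False
    for value in values:
        if value.startswith("description:"):
            in_description = True
            merged.append(value)
        elif value.startswith("billed:"):
            # The billed row closes the description and is kept as its own row.
            in_description = False
            merged.append(value)
        elif in_description:
            # Continuation of the current description: glue it on.
            merged[-1] += ", " + value
        else:
            merged.append(value)
    return merged


values = ["date:xxxx", "user:xxxx", "time:xxxx",
          "description:xxx1", "xxx2", "xxx3", "billed:xxxx"]
print(merge_descriptions(values))
```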

Pulling previous cell value using conditional lag function

I am trying to condense a data table that has separate rows for a particular ID: one row has an intent string and the following rows have one or more log strings. There can be more than one set of intents/logs for each ID. I want to pull the intent string down into a separate column so that it is listed on the same row(s) as the associated log strings.
I've "tried" LAG(tobi_intent, 1,0) OVER (ORDER BY datevalue) as AssociatedIntent
but firstly, this isn't valid code, and secondly, it wouldn't ensure that the associated intent and logs are for the same ID.
Can anyone advise on the correct SQL code to get the output below?
Expected table output:
ID  log  intent  associated_intent
1        x
1   b            x
1   a            x
1        u
1   f            u
2        x
2   f            x
5        e
5   a            e
5   s            e
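
The rule behind the expected output is "carry the most recent intent forward, but never across IDs". The sketch below only illustrates that rule in Python on rows that are already in their original order; it is not the SQL the question asks for. The tuple layout `(id, log, intent)` and the function name `attach_intents` are assumptions made for this example.

```python
def attach_intents(rows):
    """rows: (id, log, intent) tuples in their original order.

    Returns (id, log, intent, associated_intent) tuples, where each log row
    carries the most recent intent seen for the same id.
    """
    result = []
    last_id = None
    last_intent = None
    for row_id, log, intent in rows:
        if row_id != last_id:
            # New id: never carry an intent over from the previous id.
            last_id, last_intent = row_id, None
        if intent is not None:
            # Intent row: remember it, nothing to associate on this row.
            last_intent = intent
            result.append((row_id, log, intent, None))
        else:
            # Log row: attach the intent remembered for this id.
            result.append((row_id, log, None, last_intent))
    return result


rows = [
    (1, None, "x"), (1, "b", None), (1, "a", None),
    (1, None, "u"), (1, "f", None),
    (2, None, "x"), (2, "f", None),
]
for row in attach_intents(rows):
    print(row)
```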

Split a string to its subwords

Every letter has a value:
a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
TableA
String         Length  Value  Subwords
exampledomain  13      132    #example-domain#example-do-main#
creditcard     10      85     #credit-card#credit-car-d#
TableB
Words    Length  Value
example  7       76
do       2       19
main     4       37
domain   6       56
credit   6       59
card     4       26
car      3       22
d        1       4
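
The Length and Value columns above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the strings contain only the letters a-z; the function name `word_value` is made up for this illustration.

```python
def word_value(word):
    """Sum of the letters' positions in the alphabet (a=1, ..., z=26)."""
    return sum(ord(ch) - ord("a") + 1 for ch in word.lower())


for word in ["example", "domain", "creditcard", "exampledomain"]:
    print(word, len(word), word_value(word))

# The same letters in a different order give the same length and value,
# so matching sums alone cannot confirm that a split spells the string.
print(word_value("domain") == word_value("modain"))
```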
Explanation
TableA holds over a million strings, and about 100k new rows are added to it every day.
The "String" column contains no whitespace.
TableB holds over a million words; it contains every letter and the words of 1-2 languages.
What I want to do
I want to split the strings in TableA into their subwords, as shown in the example. For "creditcard", I search all the words in TableB and try to find which words, put together, match the string.
What I did, and why it didn't solve my problem
I took the string and joined it against TableB with INNER JOINs. I chained 2-3 INNER JOINs because there can be 3-word and 4-word strings too, and that worked, but it takes too much time even for 100-200 strings. Doing it for 100k strings every day is not feasible.
What I am trying now
I gave a value to every letter, as shown above.
I took the strings one by one and computed each string's value as the sum of its letters' values.
I did the same for the words in TableB.
Now every string in TableA and every word in TableB has a value.
1- I take a string with its length and value (example: creditcard - 10 - 85).
2- I search TableB for combinations of words whose SUM(length) and SUM(value) match the string's length and value, and write these possibilities to a new column.
Finally, even when the sums of lengths and values match, some combinations will not spell the whole string, so I will eliminate those (example: "doma-in" could also be "moda-in"; the lengths and values are the same, but the words are not).
I don't know, but I guess this value method could solve the time problem. If there are other ways to do this, I would be grateful for your advice.
Thanks
You could try to find the solutions recursively by always looking at the next letter. For example, for the word DOMAIN:
D - no
DO - is a word!
    M - no
    MA - no
    MAI - no
    MAIN - is a word!
    No more letters --> DO + MAIN
DOM - is a word!
    A - no
    AI - no
    AIN - no
    Finished without result
DOMA - no
DOMAI - no
DOMAIN - is a word!
    No more letters --> DOMAIN
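
The trace above can be sketched as a short recursive function. This is a minimal illustration in Python rather than SQL, and it assumes the words from TableB fit in an in-memory set; the name `segmentations` is made up here. It tries every prefix, shortest first, and recurses on the remainder whenever the prefix is a known word.

```python
def segmentations(text, words):
    """Return every way to split text into a sequence of known words."""
    if not text:
        return [[]]  # one way to split the empty string: no words at all
    results = []
    # Try each prefix, shortest first, as in the trace above.
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in words:
            # The prefix is a word: split the remainder recursively.
            for rest in segmentations(text[end:], words):
                results.append([prefix] + rest)
    return results


# Hypothetical stand-in for TableB.
words = {"example", "do", "main", "domain", "credit", "card", "car", "d"}
print(segmentations("domain", words))
print(segmentations("creditcard", words))
```

Because every candidate is built only from prefixes of the actual string, each result spells the string exactly, so no length or value comparison is needed afterwards. For long strings with many overlapping words the plain recursion can repeat work; caching results per remaining suffix is one possible refinement.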

Read only n-th column of a text file which has no header with R and sqldf

I have a problem similar to this question:
selecting every Nth column in using SQLDF or read.csv.sql
I want to read some columns of large files (a table of 150 rows and >500,000 columns, space separated, filled with numeric data, and only a 32-bit system is available). The file has no header, so the code in the thread above didn't work, and I decided to write a new post.
Do you have an idea how to solve this problem?
I thought about something like the following, but any solution with fread or read.table is also OK:
MyConnection <- file("path/file.txt")
df<-sqldf("select column 1 100 1000 235612 from MyConnection",file.format = list(header=F,sep=" "))
You can use substr to specify the start position and length of the columns you want to read in, if they are fixed width (in SQLite, which sqldf uses by default, the third argument of substr is a length, not an end position):
library(sqldf)
x <- tempfile()
cat("12345", "67890", "09876", "54321", sep = "\n", file = x)
myfile <- file(x)
sqldf("select substr(V1, 1, 1) var1, substr(V1, 3, 5) var2 from myfile")
#   var1 var2
# 1    1  345
# 2    6  890
# 3    9   76
# 4    5  321
See this blog post for some more examples. The "select" statement can easily be constructed with paste if you know the details about the column starting positions and widths.
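
If the file is delimiter-separated rather than fixed width, the same idea (pick fields by position while streaming the file) can be illustrated outside R. The sketch below is in Python, not sqldf, and is only meant to show the logic; the function name `read_columns` and the tiny demo file are assumptions made for this example. It keeps a single line in memory at a time and returns the fields as strings.

```python
import os
import tempfile


def read_columns(path, wanted, sep=" "):
    """Yield, for each line, the fields at the 1-based positions in wanted."""
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split(sep)
            yield [fields[i - 1] for i in wanted]


# Small stand-in file: 2 rows, 5 space-separated columns, no header.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("10 11 12 13 14\n20 21 22 23 24\n")
    demo_path = tmp.name

selected = list(read_columns(demo_path, [1, 4]))
print(selected)
os.remove(demo_path)
```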