T-SQL on XML (using XQuery) - sql

I have the below XML
<myroot>
<scene>
<sceneId>983247</sceneId>
<item>
<coordinates>
<coordinate>0</coordinate>
<coordinate>1</coordinate>
<coordinate>2</coordinate>
<coordinate>3</coordinate>
</coordinates>
<Values>
<Value>34</Value>
<Value>541</Value>
<Value>255</Value>
<Value>332</Value>
</Values>
</item>
</scene>
</myroot>
How can I get the following result using T-SQL:
Col1 Col2
0 34
1 541
2 255
3 332
Thanks,
M

This XPath 2.0 expression:
/myroot/scene/item/
  string-join(for $pos in (0 to max(*/count(*)))
              return string-join(for $col in (1 to max(count(*)))
                                 return if ($pos = 0)
                                        then concat('Col', $col)
                                        else *[$col]/*[$pos],
                                 ' '),
              '
')
Output:
Col1 Col2
0 34
1 541
2 255
3 332

Here's my XML noob approach.
If you only trust the element sequencing, and not the coordinate values themselves being a sequence:
select
    coordinate = max(case when element = 'coordinate' then elemval end)
  , value      = max(case when element = 'Value' then elemval end)
from (
    select
        element  = row.value('local-name(.)', 'varchar(32)')
      , elemval  = row.value('.', 'int')
      , position = row.value('for $s in . return count(../*[. << $s]) + 1', 'int')
    from #xml.nodes('/myroot/scene/item/*/*') a (row)
) a
group by position
Alternatively written as two .nodes() and a JOIN (you get the idea).
If you do trust the coordinate numbering to be a sequence starting at zero:
select
    coordinate = row.value('for $s in . return count(../*[. << $s]) + 1', 'int') - 1
  , value      = row.value('.', 'int')
from #xml.nodes('/myroot/scene/item/Values/*') a (row)
If you only trust the coordinate numbering to be a sequence, but from an arbitrary seed:
select
    coordinate = row.value('for $s in . return count(../*[. << $s]) + 1', 'int')
                 + row.value('(/myroot/scene/item/coordinates/coordinate)[1]', 'int')
                 - 1
  , value      = row.value('.', 'int')
from #xml.nodes('/myroot/scene/item/Values/*') a (row)
Paths can be abbreviated:
/myroot/scene/item/*/* -> //item/*/*
/myroot/scene/item/Values/* -> //Values/*
/myroot/scene/item/coordinates/coordinate -> //coordinate
But I don't know the wisdom of this either way.
//item/*/* can probably be made more specific, so that it only includes coordinate and Value edge nodes, but I don't know the syntax.

Related

How do I measure the length of the lists per userId using pandas?

I am trying to measure the length of the list under Original Query and subsequently find the mean and std dev but I cannot seem to measure the length. How do I do it?
This is what I tried:
filepath = "yandex_users_paired_queries.csv"  # path to the csv with the query dataset
queries = pd.read_csv(filepath)
totalNum = queries.groupby('Original Query').size().reset_index(name='counts')
sessions = queries.groupby(['UserID','Original Query'])
print(sessions.size())
print("----------------------------------------------------------------")
print("~~~Mean & Average~~~")
sessionsDF = sessions.size().to_frame('counts')
sessionsDFbyBool = sessionsDF.groupby(['Original Query'])
print(sessionsDFbyBool["counts"].agg([np.mean,np.std]))
And this is my output:
UserID Original Query
154 [1228124, 388107, 1244921, 3507784] 1
[1237207, 1974238, 1493311, 1222688, 733390, 868851, 428547, 110871, 868851, 235307] 1
[1237207, 1974238, 1493311, 1222688, 733390, 868851, 428547] 1
[1237207, 1974238, 1493311, 1222688, 733390] 1
[1237207] 1
..
343 [919873, 551537, 1841361, 1377305, 610887, 1196372, 3724298] 1
[919873, 551537, 1841361, 1377305, 610887, 1196372] 1
345 [3078369, 3613096, 4249887, 2383044, 2366003, 4043437] 1
[3531370, 3078369, 284354, 4300636] 1
347 [1617419] 1
Length: 612, dtype: int64
You want to apply the len function on the 'Original Query' column.
queries['oq_len'] = queries['Original Query'].apply(len)
sessionsDF = queries.groupby('UserID').oq_len.agg([np.mean,np.std])
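A minimal, self-contained sketch of that approach, with a toy DataFrame standing in for the CSV (the user IDs and query lists here are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the real dataset: each row holds one query list per user
queries = pd.DataFrame({
    'UserID': [154, 154, 343],
    'Original Query': [[1228124, 388107],
                       [1237207],
                       [919873, 551537, 1841361]],
})

# Length of each list, then per-user mean and std of those lengths
queries['oq_len'] = queries['Original Query'].apply(len)
stats = queries.groupby('UserID')['oq_len'].agg(['mean', 'std'])
print(stats)
```

One caveat: if the lists come straight out of a CSV they may be read as strings, in which case `len` gives the string length; they would first need to be parsed (e.g. with `ast.literal_eval`) to get the actual list length.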

Perl: Combine duplicated keys in Hash of Array

I'm having issues with this and am wondering if someone could provide some help. I'm parsing a .txt file and want to combine duplicated keys and their values. Essentially, for each identifier I want to store its height value. Each "sample" has 2 entries (A & B). I read the file like this:
while (...) {
    @data = split("\t", $line);
    $curr_identifier = $data[0];
    $markername      = $data[1];
    $position1       = $data[2];
    $height          = $data[4];
    if ($line > 0) {
        $result[0] = $markername;
        $result[1] = $position1;
        $result[2] = $height;
        $result[3] = $curr_identifier;
        $data{$curr_identifier} = [@result];
    }
}
This seems to work fine, but my issue is that when I send this data to the function below, it prints the $curr_identifier twice. I only want to populate unique identifiers and check for the presence of its $height variable.
if (!defined $data{$curr_identifier}[2]) {
    $output1 = "no height for both markers- failed";
}
else {
    if ($data{$curr_identifier}[2] eq " ") {
        $output1 = $markername;
    }
}
print $curr_identifier, $output1 . "\t" . $output1 . "\n";
Basically, if sample height is present for both markers (A&B), then output is both markers.
'1', 'A', 'B'
If height is not present, then output is empty for reported marker.
'2', 'A', ' '
'3', ' ', 'B'
My current output is printing out like this:
1, A
1, B
2, A
2, ' '
3, ' '
3, B
__DATA__
Name Marker Position1 Height Time
1 A A 6246 0.9706
1 B B 3237 0.9706
2 A 0
2 B B 5495 0.9775
3 A A 11254 0.9694
3 B 0
Your desired output can essentially be boiled down to these few lines of Perl code:
while (<DATA>) {
    ($name, $mark, $pos, $heig, $time) = split /\t/;
    print "'$name','$mark','$pos'\n";
}
__DATA__
... your tab-separated data here ...

How to find word frequency per country list in pandas?

Let's say I have a .CSV which has three columns: tidytext, location, vader_senti
I was already able to get the number of *positive, neutral and negative texts (instead of words)* per country using the following code:
data_vis = pd.read_csv(r"csviamcrpreprocessed.csv", usecols=fields)

def print_sentiment_scores(text):
    vadersenti = analyser.polarity_scores(str(text))
    return pd.Series([vadersenti['pos'], vadersenti['neg'],
                      vadersenti['neu'], vadersenti['compound']])

data_vis[['vadersenti_pos', 'vadersenti_neg', 'vadersenti_neu', 'vadersenti_compound']] = data_vis['tidytext'].apply(print_sentiment_scores)

data_vis['vader_senti'] = 'neutral'
data_vis.loc[data_vis['vadersenti_compound'] > 0.3, 'vader_senti'] = 'positive'
data_vis.loc[data_vis['vadersenti_compound'] < 0.23, 'vader_senti'] = 'negative'

data_vis['vader_possentiment'] = 0
data_vis.loc[data_vis['vadersenti_compound'] > 0.3, 'vader_possentiment'] = 1
data_vis['vader_negsentiment'] = 0
data_vis.loc[data_vis['vadersenti_compound'] < 0.23, 'vader_negsentiment'] = 1
data_vis['vader_neusentiment'] = 0
data_vis.loc[(data_vis['vadersenti_compound'] <= 0.3) & (data_vis['vadersenti_compound'] >= 0.23), 'vader_neusentiment'] = 1

sentimentbylocation = data_vis.groupby(["Location"])['vader_senti'].value_counts()
sentimentbylocation
sentimentbylocation gives me the following results:
Location vader_senti
Afghanistan negative 151
positive 25
neutral 2
Albania negative 6
positive 1
Algeria negative 116
positive 13
neutral 4
TO GET THE MOST COMMON POSITIVE WORDS, I USED THIS CODE:
def process_text(text):
    tokens = []
    for line in text:
        toks = tokenizer.tokenize(line)
        toks = [t.lower() for t in toks if t.lower() not in stopwords_list]
        tokens.extend(toks)
    return tokens

tokenizer = TweetTokenizer()
punct = list(string.punctuation)
stopwords_list = stopwords.words('english') + punct + ['rt', 'via', '...', '…', '’', '—', '—:', "‚", "â"]

pos_lines = list(data_vis[data_vis.vader_senti == 'positive'].tidytext)
pos_tokens = process_text(pos_lines)
pos_freq = nltk.FreqDist(pos_tokens)
pos_freq.most_common()
Running this will give me the most common words and the number of times they appeared, such as
[('good', 1212),
 ('amazing', 123)]
However, what I want to see is how many of these positive words appeared per country.
For example:
I have a sample CSV here: https://drive.google.com/file/d/112k-6VLB3UyljFFUbeo7KhulcrMedR-l/view?usp=sharing
Create a column for each most_common word, then do a groupby location and use agg to apply a sum for each count:
words = [i[0] for i in pos_freq.most_common()]

# lowering all cases in tidytext
data_vis.tidytext = data_vis.tidytext.str.lower()

for i in words:
    data_vis[i] = data_vis.tidytext.str.count(i)

funs = {i: 'sum' for i in words}
grouped = data_vis.groupby('Location').agg(funs)
Based on the example from the CSV and using most_common as ['good', 'amazing'] the result would be:
grouped
# good amazing
# Location
# Australia 0 1
# Belgium 6 4
# Japan 2 1
# Thailand 2 0
# United States 1 0
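Here is a minimal, self-contained sketch of the same idea with toy data in place of the linked CSV (the locations and texts are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the real dataset
data_vis = pd.DataFrame({
    'Location': ['Australia', 'Belgium', 'Belgium'],
    'tidytext': ['Amazing trip', 'good good food', 'Good amazing'],
})
words = ['good', 'amazing']  # would normally come from pos_freq.most_common()

# One count column per word, then sum the counts per country
data_vis['tidytext'] = data_vis['tidytext'].str.lower()
for w in words:
    data_vis[w] = data_vis['tidytext'].str.count(w)
grouped = data_vis.groupby('Location').agg({w: 'sum' for w in words})
print(grouped)
```

Note that `str.count` treats its argument as a regex and counts substring matches, so e.g. "goodness" would also count toward "good"; word-boundary patterns (`r'\bgood\b'`) would avoid that.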

Performance improvement for linq query with distinct

Considering the sample table
Col 1, Col2, Col3
1 , x , G
1 , y , H
2 , z , J
2 , a , K
2 , a , K
3 , b , E
I want the result below, i.e. distinct rows:
1 , x , G
1 , y , H
2 , z , J
2 , a , K
3 , b , E
I tried
var Result = Context.Table.Select(C => new
{
    Col1 = C.Col1,
    Col2 = C.Col2,
    Col3 = C.Col3
}).Distinct();
and
Context.Table.GroupBy(x=>new {x.Col1,x.Col2,x.Col3}).Select(x=>x.First()).ToList();
The results are as expected; however, my table has 35 columns and 1 million records, and its size will keep growing. The current query time is 22-30 seconds. How can I improve the performance and get it down to 2-3 seconds?
Using Distinct is the way to go... I'd say that the first approach you tried is the correct one - but do you really need all 1 million rows? See what Where conditions you can add, or maybe take just the first x records?
var Result = Context.Table.Select(c => new
    {
        Col1 = c.Col1,
        Col2 = c.Col2,
        Col3 = c.Col3
    })
    .Where(c => /*some condition to narrow results*/)
    .Take(1000) // some number for the wanted amount of records
    .Distinct();
What you might be able to do is use the row number to select in batches. Something like:
public <return type> RetrieveBulk(int fromRow, int toRow)
{
    return Context.Table.Where(record => record.Rownum >= fromRow && record.Rownum < toRow)
        .Select(c => new
        {
            Col1 = c.Col1,
            Col2 = c.Col2,
            Col3 = c.Col3
        }).Distinct();
}
You can then use this method along these lines:
List<Task<return type>> selectTasks = new List<Task<return type>>();
for (int i = 0; i < 1000000; i += 1000)
{
    int from = i; // capture a copy, so each task sees its own range
    selectTasks.Add(Task.Run(() => RetrieveBulk(from, from + 1000)));
}
Task.WaitAll(selectTasks.ToArray());
// Then intersect the data using an efficient structure such as a HashSet,
// so the intersection is O(n) rather than O(n^2).

Confused about behavior of setResultsName in Pyparsing

I am trying to parse a few SQL statements. Here is a sample:
select
ms.member_sk a,
dd.date_sk b,
st.subscription_type,
(SELECT foo FROM zoo) e
from dim_member_subscription_all p,
dim_subs_type
where a in (select moo from t10)
I am interested in getting tables only at this time. So I would like to see
[zoo, dim_member_subscription_all, dim_subs_type] & [t10]
I have put together a small script looking at Paul McGuire's example
#!/usr/bin/env python
import sys
import pprint
from pyparsing import *

pp = pprint.PrettyPrinter(indent=4)

semicolon = Combine(Literal(';') + lineEnd)
comma = Literal(',')
lparen = Literal('(')
rparen = Literal(')')

update_kw, volatile_kw, create_kw, table_kw, as_kw, from_kw, \
where_kw, join_kw, left_kw, right_kw, cross_kw, outer_kw, \
on_kw, insert_kw, into_kw = \
    map(lambda x: Keyword(x, caseless=True),
        ['UPDATE', 'VOLATILE', 'CREATE', 'TABLE', 'AS', 'FROM',
         'WHERE', 'JOIN', 'LEFT', 'RIGHT',
         'CROSS', 'OUTER', 'ON', 'INSERT', 'INTO'])

select_kw = Keyword('SELECT', caseless=True) | Keyword('SEL', caseless=True)
reserved_words = (update_kw | volatile_kw | create_kw | table_kw | as_kw |
                  select_kw | from_kw | where_kw | join_kw |
                  left_kw | right_kw | cross_kw | on_kw | insert_kw |
                  into_kw)

ident = ~reserved_words + Word(alphas, alphanums + '_')
table = Combine(Optional(ident + Literal('.')) + ident)
column = Combine(Optional(ident + Literal('.')) + (ident | Literal('*')))
column_alias = Optional(Optional(as_kw).suppress() + ident)
table_alias = Optional(Optional(as_kw).suppress() + ident).suppress()

select_stmt = Forward()
nested_table = lparen.suppress() + select_stmt + rparen.suppress() + table_alias
table_list = delimitedList((nested_table | table) + table_alias)
column_list = delimitedList((nested_table | column) + column_alias)

txt = """
select
    ms.member_sk a,
    dd.date_sk b,
    st.subscription_type,
    (SELECT foo FROM zoo) e
from dim_member_subscription_all p,
     dim_subs_type
where a in (select moo from t10)
"""

select_stmt << select_kw.suppress() + column_list + from_kw.suppress() + \
    table_list.setResultsName('tables', listAllMatches=True)

print txt
for token in select_stmt.searchString(txt):
    pp.pprint(token.asDict())
I am getting the following nested output. Can anybody please help me understand what I am doing wrong?
{ 'tables': ([(['zoo'], {}), (['dim_member_subscription_all', 'dim_subs_type'], {})], {})}
{ 'tables': ([(['t10'], {})], {})}
searchString will return a list of all matching ParseResults - you can see the tables value of each using:
for token in select_stmt.searchString(txt):
    print token.tables
Giving:
[['zoo'], ['dim_member_subscription_all', 'dim_subs_type']]
[['t10']]
So searchString found two SELECT statements.
Recent versions of pyparsing support summing this list into a single consolidated ParseResults using the Python builtin sum. Accessing the tables value of this consolidated result looks like this:
print sum(select_stmt.searchString(txt)).tables
[['zoo'], ['dim_member_subscription_all', 'dim_subs_type'], ['t10']]
I think the parser is doing all you want, you just need to figure out how to process the returned results.
For further debugging, you should start using the dump method on ParseResults to see what you are getting, which will print the nested list of returned tokens, and then a hierarchical tree of all named results. For your example:
for token in select_stmt.searchString(txt):
    print token.dump()
    print
prints:
['ms.member_sk', 'a', 'dd.date_sk', 'b', 'st.subscription_type', 'foo', 'zoo', 'dim_member_subscription_all', 'dim_subs_type']
- tables: [['zoo'], ['dim_member_subscription_all', 'dim_subs_type']]
['moo', 't10']
- tables: [['t10']]