How to explode quantiles in hive

How to explode quantiles in hive - sql

I am trying to get quantiles of a field and I want to explode them so that each value is a separate row rather than all of them forming a single array. First, I calculate 20 quantiles as below:
select percentile_approx(probability,
array(0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35,
0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75,
0.8, 0.85, 0.9, 0.95, 1.0)) as quantiles
from my_table
The above code gives the array below:
[0.17808226409449213, 0.18250386256254247, 0.18525207046272224, 0.18800918537059694, 0.1907743631982954, 0.200154288105411, 0.30419108474685375, 0.3299437131426226, 0.352433633041806, 0.3589875791100745, 0.37581775428218006, 0.3825168120904496, 0.3966342376441502, 0.4173753044627164, 0.43268994899295316, 0.44015098935735575, 0.461413042176578, 0.4720422104416653, 0.487852850513824, 0.5050010622123932]
But since I wanted to explode it, I tried using lateral view posexplode like below (actually, I passed the output from the above code):
select i, x
lateral view posexplode([0.17808226409449213, 0.18250386256254247,0.18525207046272224, 0.18800918537059694,
0.1907743631982954, 0.200154288105411, 0.30419108474685375, 0.3299437131426226,
0.352433633041806, 0.3589875791100745, 0.37581775428218006, 0.3825168120904496,
0.3966342376441502, 0.4173753044627164, 0.43268994899295316, 0.44015098935735575,
0.461413042176578, 0.4720422104416653, 0.487852850513824, 0.5050010622123932]) q as i, x
but it gives the error message below:
ParseException: "\nextraneous input '[' expecting {'(', ')', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', 'POSITION', 'EXTRACT', '+', '-', '*', 'DIV', '~', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', STRING, BIGINT_LITERAL, SMALLINT_LITERAL, TINYINT_LITERAL, INTEGER_VALUE, DECIMAL_VALUE, DOUBLE_LITERAL, BIGDECIMAL_LITERAL, IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 3, pos 24)\n\n== SQL ==\n\nselect i, x\nlateral view posexplode([0.17808226409449213, 0.18250386256254247,0.18525207046272224, 0.18800918537059694, \n------------------------^^^\n 0.1907743631982954, 0.200154288105411, 0.30419108474685375, 0.3299437131426226, \n 0.352433633041806, 0.3589875791100745, 0.37581775428218006, 0.3825168120904496, \n 0.3966342376441502, 0.4173753044627164, 0.43268994899295316, 0.44015098935735575, \n 0.461413042176578, 0.4720422104416653, 0.487852850513824, 0.5050010622123932]) q as i, x\n"
On the other hand, if I create the array inside posexplode like below, it works fine:
select i, x
lateral view posexplode(array('a', 'b', 'c')) q as i, x
| i| x|
+---+---+
| 0| a|
| 1| b|
| 2| c|
+---+---+

select s.*, e.i, e.x
from
(
select percentile_approx(probability,
array(0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35,
0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75,
0.8, 0.85, 0.9, 0.95, 0.99)) as quantiles
from my_table
) s lateral view outer posexplode (s.quantiles) e as i, x
;

Related

Extract words from a column and count frequency

Does anyone know if there's an efficient way to extract all the words from a single column and count the frequency of each word in SQL Server? I only have read-only access to my database so I can't create a self-defined function to do this.
Here's a reproducible example:
CREATE TABLE words
(
id INT PRIMARY KEY,
text_column VARCHAR(1000)
);
INSERT INTO words (id, text_column)
VALUES
(1, 'SQL Server is a popular database management system'),
(2, 'It is widely used for data storage and retrieval'),
(3, 'SQL Server is a powerful tool for data analysis');
I have found this code but it's not working correctly, and I think it's too complicated to understand:
WITH E1(N) AS
(
SELECT 1
FROM (VALUES
(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
) t(N)
),
E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),
E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b)
SELECT
LOWER(x.Item) AS [Word],
COUNT(*) AS [Counts]
FROM
(SELECT * FROM words) a
CROSS APPLY
(SELECT
ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = LTRIM(RTRIM(SUBSTRING(a.text_column, l.N1, l.L1)))
FROM
(SELECT
s.N1,
L1 = ISNULL(NULLIF(CHARINDEX(' ',a.text_column,s.N1),0)-s.N1,4000)
FROM
(SELECT 1
UNION ALL
SELECT t.N+1
FROM
(SELECT TOP (ISNULL(DATALENGTH(a.text_column)/2,0))
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM E4) t(N)
WHERE SUBSTRING(a.text_column ,t.N,1) = ' '
) s(N1)
) l(N1, L1)
) x
WHERE
x.item <> ''
AND x.Item NOT IN ('0o', '0s', '3a', '3b', '3d', '6b', '6o', 'a', 'a1', 'a2', 'a3', 'a4', 'ab', 'able', 'about', 'above', 'abst', 'ac', 'accordance', 'according', 'accordingly', 'across', 'act', 'actually', 'ad', 'added', 'adj', 'ae', 'af', 'affected', 'affecting', 'affects', 'after', 'afterwards', 'ag', 'again', 'against', 'ah', 'ain', 'ain''t', 'aj', 'al', 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'announce', 'another', 'any', 'anybody', 'anyhow', 'anymore', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'ao', 'ap', 'apart', 'apparently', 'appear', 'appreciate', 'appropriate', 'approximately', 'ar', 'are', 'aren', 'arent', 'aren''t', 'arise', 'around', 'as', 'a''s', 'aside', 'ask', 'asking', 'associated', 'at', 'au', 'auth', 'av', 'available', 'aw', 'away', 'awfully', 'ax', 'ay', 'az', 'b', 'b1', 'b2', 'b3', 'ba', 'back', 'bc', 'bd', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'begin', 'beginning', 'beginnings', 'begins', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'best', 'better', 'between', 'beyond', 'bi', 'bill', 'biol', 'bj', 'bk', 'bl', 'bn', 'both', 'bottom', 'bp', 'br', 'brief', 'briefly', 'bs', 'bt', 'bu', 'but', 'bx', 'by', 'c', 'c1', 'c2', 'c3', 'ca', 'call', 'came', 'can', 'cannot', 'cant', 'can''t', 'cause', 'causes', 'cc', 'cd', 'ce', 'certain', 'certainly', 'cf', 'cg', 'ch', 'changes', 'ci', 'cit', 'cj', 'cl', 'clearly', 'cm', 'c''mon', 'cn', 'co', 'com', 'come', 'comes', 'con', 'concerning', 'consequently', 'consider', 'considering', 'contain', 'containing', 'contains', 'corresponding', 'could', 'couldn', 'couldnt', 'couldn''t', 'course', 'cp', 'cq', 'cr', 'cry', 'cs', 'c''s', 'ct', 'cu', 'currently', 'cv', 'cx', 'cy', 'cz', 'd', 'd2', 'da', 'date', 'dc', 'dd', 'de', 'definitely', 'describe', 'described', 'despite', 'detail', 'df', 'di', 'did', 'didn', 'didn''t', 'different', 'dj', 'dk', 'dl', 'do', 'does', 'doesn', 'doesn''t', 'doing', 'don', 'done', 'don''t', 'down', 'downwards', 'dp', 'dr', 'ds', 'dt', 'du', 'due', 'during', 'dx', 'dy', 'e', 'e2', 'e3', 'ea', 'each', 'ec', 'ed', 'edu', 'ee', 'ef', 'effect', 'eg', 'ei', 'eight', 'eighty', 'either', 'ej', 'el', 'eleven', 'else', 'elsewhere', 'em', 'empty', 'en', 'end', 'ending', 'enough', 'entirely', 'eo', 'ep', 'eq', 'er', 'es', 'especially', 'est', 'et', 'et-al', 'etc', 'eu', 'ev', 'even', 'ever', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'ex', 'exactly', 'example', 'except', 'ey', 'f', 'f2', 'fa', 'far', 'fc', 'few', 'ff', 'fi', 'fifteen', 'fifth', 'fify', 'fill', 'find', 'fire', 'first', 'five', 'fix', 'fj', 'fl', 'fn', 'fo', 'followed', 'following', 'follows', 'for', 'former', 'formerly', 'forth', 'forty', 'found', 'four', 'fr', 'from', 'front', 'fs', 'ft', 'fu', 'full', 'further', 'furthermore', 'fy', 'g', 'ga', 'gave', 'ge', 'get', 'gets', 'getting', 'gi', 'give', 'given', 'gives', 'giving', 'gj', 'gl', 'go', 'goes', 'going', 'gone', 'got', 'gotten', 'gr', 'greetings', 'gs', 'gy', 'h', 'h2', 'h3', 'had', 'hadn', 'hadn''t', 'happens', 'hardly', 'has', 'hasn', 'hasnt', 'hasn''t', 'have', 'haven', 'haven''t', 'having', 'he', 'hed', 'he''d', 'he''ll', 'hello', 'help', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'heres', 'here''s', 'hereupon', 'hers', 'herself', 'hes', 'he''s', 'hh', 'hi', 'hid', 'him', 'himself', 'his', 'hither', 'hj', 'ho', 'home', 'hopefully', 'how', 'howbeit', 'however', 'how''s', 'hr', 'hs', 'http', 'hu', 'hundred', 'hy', 'i', 'i2', 'i3', 'i4', 'i6', 'i7', 'i8', 'ia', 'ib', 'ibid', 'ic', 'id', 'i''d', 'ie', 'if', 'ig', 'ignored', 'ih', 'ii', 'ij', 'il', 'i''ll', 'im', 'i''m', 'immediate', 'immediately', 'importance', 'important', 'in', 'inasmuch', 'inc', 'indeed', 'index', 'indicate', 'indicated', 'indicates', 'information', 'inner', 'insofar', 'instead', 'interest', 'into', 'invention', 'inward', 'io', 'ip', 'iq', 'ir', 'is', 'isn', 'isn''t', 'it', 'itd', 'it''d', 'it''ll', 'its', 'it''s', 'itself', 'iv', 'i''ve', 'ix', 'iy', 'iz', 'j', 'jj', 'jr', 'js', 'jt', 'ju', 'just', 'k', 'ke', 'keep', 'keeps', 'kept', 'kg', 'kj', 'km', 'know', 'known', 'knows', 'ko', 'l', 'l2', 'la', 'largely', 'last', 'lately', 'later', 'latter', 'latterly', 'lb', 'lc', 'le', 'least', 'les', 'less', 'lest', 'let', 'lets', 'let''s', 'lf', 'like', 'liked', 'likely', 'line', 'little', 'lj', 'll', 'll', 'ln', 'lo', 'look', 'looking', 'looks', 'los', 'lr', 'ls', 'lt', 'ltd', 'm', 'm2', 'ma', 'made', 'mainly', 'make', 'makes', 'many', 'may', 'maybe', 'me', 'mean', 'means', 'meantime', 'meanwhile', 'merely', 'mg', 'might', 'mightn', 'mightn''t', 'mill', 'million', 'mine', 'miss', 'ml', 'mn', 'mo', 'more', 'moreover', 'most', 'mostly', 'move', 'mr', 'mrs', 'ms', 'mt', 'mu', 'much', 'mug', 'must', 'mustn', 'mustn''t', 'my', 'myself', 'n', 'n2', 'na', 'name', 'namely', 'nay', 'nc', 'nd', 'ne', 'near', 'nearly', 'necessarily', 'necessary', 'need', 'needn', 'needn''t', 'needs', 'neither', 'never', 'nevertheless', 'new', 'next', 'ng', 'ni', 'nine', 'ninety', 'nj', 'nl', 'nn', 'no', 'nobody', 'non', 'none', 'nonetheless', 'noone', 'nor', 'normally', 'nos', 'not', 'noted', 'nothing', 'novel', 'now', 'nowhere', 'nr', 'ns', 'nt', 'ny', 'o', 'oa', 'ob', 'obtain', 'obtained', 'obviously', 'oc', 'od', 'of', 'off', 'often', 'og', 'oh', 'oi', 'oj', 'ok', 'okay', 'ol', 'old', 'om', 'omitted', 'on', 'once', 'one', 'ones', 'only', 'onto', 'oo', 'op', 'oq', 'or', 'ord', 'os', 'ot', 'other', 'others', 'otherwise', 'ou', 'ought', 'our', 'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 'ow', 'owing', 'own', 'ox', 'oz', 'p', 'p1', 'p2', 'p3', 'page', 'pagecount', 'pages', 'par', 'part', 'particular', 'particularly', 'pas', 'past', 'pc', 'pd', 'pe', 'per', 'perhaps', 'pf', 'ph', 'pi', 'pj', 'pk', 'pl', 'placed', 'please', 'plus', 'pm', 'pn', 'po', 'poorly', 'possible', 'possibly', 'potentially', 'pp', 'pq', 'pr', 'predominantly', 'present', 'presumably', 'previously', 'primarily', 'probably', 'promptly', 'proud', 'provides', 'ps', 'pt', 'pu', 'put', 'py', 'q', 'qj', 'qu', 'que', 'quickly', 'quite', 'qv', 'r', 'r2', 'ra', 'ran', 'rather', 'rc', 'rd', 're', 'readily', 'really', 'reasonably', 'recent', 'recently', 'ref', 'refs', 'regarding', 'regardless', 'regards', 'related', 'relatively', 'research', 'research-articl', 'respectively', 'resulted', 'resulting', 'results', 'rf', 'rh', 'ri', 'right', 'rj', 'rl', 'rm', 'rn', 'ro', 'rq', 'rr', 'rs', 'rt', 'ru', 'run', 'rv', 'ry', 's', 's2', 'sa', 'said', 'same', 'saw', 'say', 'saying', 'says', 'sc', 'sd', 'se', 'sec', 'second', 'secondly', 'section', 'see', 'seeing', 'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 'selves', 'sensible', 'sent', 'serious', 'seriously', 'seven', 'several', 'sf', 'shall', 'shan', 'shan''t', 'she', 'shed', 'she''d', 'she''ll', 'shes', 'she''s', 'should', 'shouldn', 'shouldn''t', 'should''ve', 'show', 'showed', 'shown', 'showns', 'shows', 'si', 'side', 'significant', 'significantly', 'similar', 'similarly', 'since', 'sincere', 'six', 'sixty', 'sj', 'sl', 'slightly', 'sm', 'sn', 'so', 'some', 'somebody', 'somehow', 'someone', 'somethan', 'something', 'sometime', 'sometimes', 'somewhat', 'somewhere', 'soon', 'sorry', 'sp', 'specifically', 'specified', 'specify', 'specifying', 'sq', 'sr', 'ss', 'st', 'still', 'stop', 'strongly', 'sub', 'substantially', 'successfully', 'such', 'sufficiently', 'suggest', 'sup', 'sure', 'sy', 'system', 'sz', 't', 't1', 't2', 't3', 'take', 'taken', 'taking', 'tb', 'tc', 'td', 'te', 'tell', 'ten', 'tends', 'tf', 'th', 'than', 'thank', 'thanks', 'thanx', 'that', 'that''ll', 'thats', 'that''s', 'that''ve', 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'thered', 'therefore', 'therein', 'there''ll', 'thereof', 'therere', 'theres', 'there''s', 'thereto', 'thereupon', 'there''ve', 'these', 'they', 'theyd', 'they''d', 'they''ll', 'theyre', 'they''re', 'they''ve', 'thickv', 'thin', 'think', 'third', 'this', 'thorough', 'thoroughly', 'those', 'thou', 'though', 'thoughh', 'thousand', 'three', 'throug', 'through', 'throughout', 'thru', 'thus', 'ti', 'til', 'tip', 'tj', 'tl', 'tm', 'tn', 'to', 'together', 'too', 'took', 'top', 'toward', 'towards', 'tp', 'tq', 'tr', 'tried', 'tries', 'truly', 'try', 'trying', 'ts', 't''s', 'tt', 'tv', 'twelve', 'twenty', 'twice', 'two', 'tx', 'u', 'u201d', 'ue', 'ui', 'uj', 'uk', 'um', 'un', 'under', 'unfortunately', 'unless', 'unlike', 'unlikely', 'until', 'unto', 'uo', 'up', 'upon', 'ups', 'ur', 'us', 'use', 'used', 'useful', 'usefully', 'usefulness', 'uses', 'using', 'usually', 'ut', 'v', 'va', 'value', 'various', 'vd', 've', 've', 'very', 'via', 'viz', 'vj', 'vo', 'vol', 'vols', 'volumtype', 'vq', 'vs', 'vt', 'vu', 'w', 'wa', 'want', 'wants', 'was', 'wasn', 'wasnt', 'wasn''t', 'way', 'we', 'wed', 'we''d', 'welcome', 'well', 'we''ll', 'well-b', 'went', 'were', 'we''re', 'weren', 'werent', 'weren''t', 'we''ve', 'what', 'whatever', 'what''ll', 'whats', 'what''s', 'when', 'whence', 'whenever', 'when''s', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'wheres', 'where''s', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whim', 'whither', 'who', 'whod', 'whoever', 'whole', 'who''ll', 'whom', 'whomever', 'whos', 'who''s', 'whose', 'why', 'why''s', 'wi', 'widely', 'will', 'willing', 'wish', 'with', 'within', 'without', 'wo', 'won', 'wonder', 'wont', 'won''t', 'words', 'world', 'would', 'wouldn', 'wouldnt', 'wouldn''t', 'www', 'x', 'x1', 'x2', 'x3', 'xf', 'xi', 'xj', 'xk', 'xl', 'xn', 'xo', 'xs', 'xt', 'xv', 'xx', 'y', 'y2', 'yes', 'yet', 'yj', 'yl', 'you', 'youd', 'you''d', 'you''ll', 'your', 'youre', 'you''re', 'yours', 'yourself', 'yourselves', 'you''ve', 'yr', 'ys', 'yt', 'z', 'zero', 'zi', 'zz')
GROUP BY x.Item
ORDER BY COUNT(*) DESC
Here's the result of the above code, as you can see it's not counting correctly:
Word Counts
server 2
sql 2
data 1
database 1
popular 1
powerful 1
Can anyone help on this? Would be really appreciated!

You can make use of String_split here, such as
select value Word, Count(*) Counts
from words
cross apply String_Split(text_column, ' ')
where value not in(exclude list)
group by value
order by counts desc;

You should should the string_split function -- like this
SELECT id, value as aword
FROM words
CROSS APPLY STRING_SPLIT(text_column, ',');
This will create a table with all the words by id -- to get the count do this:
SELECT aword, count(*) as counts
FROM (
SELECT id, value as aword
FROM words
CROSS APPLY STRING_SPLIT(text_column, ',');
) x
GROUP BY aword
You may need to lower case the LOWER(text_column) if you want it to not matter

If you don't have access to STRING_SPLIT function, you can use weird xml trick to convert space to a word node and then shred it with nodes function:
select word, COUNT(*)
from (
select n.value('.', 'nvarchar(50)') AS word
from (
VALUES
(1, 'SQL Server is a popular database management system'),
(2, 'It is widely used for data storage and retrieval'),
(3, 'SQL Server is a powerful tool for data analysis')
) AS t (id, txt)
CROSS APPLY (
SELECT CAST('<x>' + REPLACE(txt, ' ', '</x><x>') + '</x>' AS XML) x
) x
CROSS APPLY x.nodes('x') z(n)
) w
GROUP BY word
Of course, this will fail on "bad" words and invalid xml-characters but it can be worked on. Text processing has never been SQL Server's strong-point though, so probably better to use some NLP library to do this kind of stuff

`sync_partition_metadata` in presto as a result returning query

Is there a way to make this query work?
SELECT "result" AS "result"
FROM (CALL system.sync_partition_metadata('schema', 'table', 'ADD')) AS "virtual_table"
LIMIT 1000;
this is the error I get:
presto error: line 3:18: mismatched input '.'. Expecting: '(', ')',
',', 'AS', 'CROSS', 'EXCEPT', 'FETCH', 'FULL', 'GROUP', 'HAVING',
'INNER', 'INTERSECT', 'JOIN', 'LEFT', 'LIMIT', 'MATCH_RECOGNIZE',
'NATURAL', 'OFFSET', 'ORDER', 'RIGHT', 'TABLESAMPLE', 'UNION',
'WHERE', 'WINDOW', ,

Slicing PySpark DataFrame by converting to Pandas DataFrame, Error when converting back to PySpark DataFrame

I want to slice a PySpark DataFrame by selecting a specific column and several rows as below:
import pandas as pd
# Data filled in our DataFrame
rows = [['Lee Chong Wei', 69, 'Malaysia'],
['Lin Dan', 66, 'China'],
['Srikanth Kidambi', 9, 'India'],
['Kento Momota', 15, 'Japan']]
# Columns of our DataFrame
columns = ['Player', 'Titles', 'Country']
# DataFrame is created
df = spark.createDataFrame(rows, columns)
# Converting DataFrame to pandas
pandas_df = df.toPandas()
# First DataFrame formed by slicing
df1 = pandas_df.iloc[[2], :2]
# Second DataFrame formed by slicing
df2 = pandas_df.iloc[[2], 2:]
# Converting the slices to PySpark DataFrames
df1 = spark.createDataFrame(df1, schema = "Country")
df2 = spark.createDataFrame(df2, schema = "Country")
I am running a notebook on Databricks and no need to import Spark Session.
There is an error message ParseException: when running following lines:
df1 = spark.createDataFrame(df1, schema = "Country")
df2 = spark.createDataFrame(df2, schema = "Country")
Please let me know any idea to solve this issue. Full error message is as below:
---------------------------------------------------------------------------
ParseException Traceback (most recent call last)
<command-4065192899858765> in <module>
23
24 # Converting the slices to PySpark DataFrames
---> 25 df1 = spark.createDataFrame(df1, schema = "Country")
26 df2 = spark.createDataFrame(df2, schema = "Country")
/databricks/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
706
707 if isinstance(schema, str):
--> 708 schema = _parse_datatype_string(schema)
709 elif isinstance(schema, (list, tuple)):
710 # Must re-encode any unicode strings to be consistent with StructField names
/databricks/spark/python/pyspark/sql/types.py in _parse_datatype_string(s)
841 return from_ddl_datatype("struct<%s>" % s.strip())
842 except:
--> 843 raise e
844
845
/databricks/spark/python/pyspark/sql/types.py in _parse_datatype_string(s)
831 try:
832 # DDL format, "fieldname datatype, fieldname datatype".
--> 833 return from_ddl_schema(s)
834 except Exception as e:
835 try:
/databricks/spark/python/pyspark/sql/types.py in from_ddl_schema(type_str)
823 def from_ddl_schema(type_str):
824 return _parse_datatype_json_string(
--> 825 sc._jvm.org.apache.spark.sql.types.StructType.fromDDL(type_str).json())
826
827 def from_ddl_datatype(type_str):
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
121 # Hide where the exception came from that shows a non-Pythonic
122 # JVM exception message.
--> 123 raise converted from None
124 else:
125 raise
ParseException:
mismatched input '<EOF>' expecting {'APPLY', 'CALLED', 'CHANGES', 'CLONE', 'COLLECT', 'CONTAINS', 'CONVERT', 'COPY', 'COPY_OPTIONS', 'CREDENTIAL', 'CREDENTIALS', 'DEEP', 'DEFINER', 'DELTA', 'DETERMINISTIC', 'ENCRYPTION', 'EXPECT', 'FAIL', 'FILES', 'FORMAT_OPTIONS', 'HISTORY', 'INCREMENTAL', 'INPUT', 'INVOKER', 'LANGUAGE', 'LIVE', 'MATERIALIZED', 'MODIFIES', 'OPTIMIZE', 'PATTERN', 'READS', 'RESTORE', 'RETURN', 'RETURNS', 'SAMPLE', 'SCD TYPE 1', 'SCD TYPE 2', 'SECURITY', 'SEQUENCE', 'SHALLOW', 'SNAPSHOT', 'SPECIFIC', 'SQL', 'STORAGE', 'STREAMING', 'UPDATES', 'UP_TO_DATE', 'VIOLATION', 'ZORDER', 'ADD', 'AFTER', 'ALL', 'ALTER', 'ALWAYS', 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 'CASCADE', 'CASE', 'CAST', 'CATALOG', 'CATALOGS', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 'CLUSTERED', 'CODE', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DAY', 'DATA', 'DATABASE', 'DATABASES', 'DATEADD', 'DATEDIFF', 'DBPROPERTIES', 'DEFAULT', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DIV', 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FN', 'FOLLOWING', 'FOR', 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GENERATED', 'GLOBAL', 'GRANT', 'GRANTS', 'GROUP', 'GROUPING', 'HAVING', 'HOUR', 'IDENTITY', 'IF', 'IGNORE', 'IMPORT', 'IN', 'INCREMENT', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEY', 'KEYS', 'LAST', 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'ILIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MATCHED', 'MERGE', 'MINUTE', 'MONTH', 'MSCK', 'NAMESPACE', 'NAMESPACES', 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENTILE_CONT', 'PERCENT', 'PIVOT', 'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PROPERTIES', 'PROVIDER', 'PROVIDERS', 'PURGE', 'QUALIFY', 'QUERY', 'RANGE', 'RECIPIENT', 'RECIPIENTS', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'REMOVE', 'RENAME', 'REPAIR', 'REPEATABLE', 'REPLACE', 'REPLICAS', 'RESET', 'RESPECT', 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', 'SECOND', 'SCHEMA', 'SCHEMAS', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHARE', 'SHARES', 'SHOW', 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY', 'STRUCT', 'SUBSTR', 'SUBSTRING', 'SYNC', 'SYSTEM_TIME', 'SYSTEM_VERSION', 'TABLE', 'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TIME', 'TIMESTAMP', 'TIMESTAMPADD', 'TIMESTAMPDIFF', 'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 'TRUE', 'TRUNCATE', 'TRY_CAST', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 'VALUES', 'VERSION', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', 'WINDOW', 'WITH', 'WITHIN', 'YEAR', 'ZONE', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 7)
== SQL ==
Country
-------^^^

Athena query error : extraneous input 'select' expecting

Got an error while running query :
extraneous input 'select' expecting {'(', 'add', 'all', 'some', 'any', 'at', 'no', 'substring', 'position', 'tinyint', 'smallint', 'integer', 'date', 'time', 'timestamp', 'interval', 'year', 'month', 'day', 'hour', 'minute', 'second', 'zone', 'filter', 'over', 'partition', 'range', 'rows', 'preceding', 'following', 'current', 'row', 'schema', 'comment', 'view', 'replace', 'grant', 'revoke', 'privileges', 'public', 'option', 'explain', 'analyze', 'format', 'type', 'text', 'graphviz', 'logical', 'distributed', 'validate', 'show', 'tables', 'views', 'schemas', 'catalogs', 'columns', 'column', 'use', 'partitions', 'functions', 'to', 'system', 'bernoulli', 'poissonized', 'tablesample', 'unnest', 'array', 'map', 'set', 'reset', 'session', 'data', 'start', 'transaction', 'commit', 'rollback', 'work', 'isolation', 'level', 'serializable', 'repeatable', 'committed', 'uncommitted', 'read', 'write', 'only', 'call', 'input', 'output', 'cascade', 'restrict', 'including', 'excluding', 'properties', 'function', 'lambda_invoke', 'returns', 'sagemaker_invoke_endpoint', 'nfd', 'nfc', 'nfkd', 'nfkc', 'if', 'nullif', 'coalesce', identifier, digit_identifier, quoted_identifier, backquoted_identifier}
query is right join table ss with dd ,like:
select * from
(
select platform, id, nextMonth
FROM "logs"."user_records" as ss
right join
select id as idRight, platform, month
FROM "logs"."user_records" as dd
on ss.platform = dd.platform and ss.userid = dd.useridRight and ss.nextMonth=dd.month )

You probably need to surround the subquery after your right join in parentheses. Untested, but I'd guess:
select * from
(
select platform, id, nextMonth
FROM "logs"."user_records" as ss
right join
( select id as idRight, platform, month
FROM "logs"."user_records" ) as dd
on ss.platform = dd.platform and ss.userid = dd.useridRight and ss.nextMonth=dd.month )

Pandas - Extracting value based of common key

I have a Dataframe in the below format:
id, key1, key2
101, {'key': 'key_1001', 'fields': {'type': {'subtask': False}, 'summary': 'Title_1' , 'id': '71150'}}, NaN
101, NaN,{'key': 'key_1002', 'fields': {'type': {'subtask': False}, 'summary': 'Title_2' , 'id': '71151'}}
102, {'key': 'key_2001', 'fields': {'type': {'subtask': False}, 'summary': 'Title_11' , 'id': '71160'}}, NaN
102, NaN,{'key': 'key_2002', 'fields': {'type': {'subtask': False}, 'summary': 'Title_12' , 'id': '71161'}}
I am trying to achieve the below output from the above Dataframe.
id, key_value_1, key_value_2
101, key_1001, key_1002
102, key_2001, key_2002
Output of df.dict()
{'id': {103: '101', 676: '101'}, 'key1' : {103: {'fields': {'type': {'subtask': False}, 'summary': 'Title_1' , 'id': '71150'},
676: nan}

You can use:
s=df.set_index('id').stack().str.get('key').unstack()
key1 key2
id
101 key_1001 key_1002
102 key_2001 key_2002

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How to explode quantiles in hive - sql

select s.*, e.i, e.x from ( select percentile_approx(probability, array(0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99)) as quantiles from my_table ) s lateral view outer posexplode (s.quantiles) e as i, x ;

Related

Extract words from a column and count frequency

`sync_partition_metadata` in presto as a result returning query

Slicing PySpark DataFrame by converting to Pandas DataFrame, Error when converting back to PySpark DataFrame

Athena query error : extraneous input 'select' expecting

Pandas - Extracting value based of common key

Categories

Resources