Snowflake: Error loading a LOCAL file that has some ARRAY fields - sql

I am getting an Error when loading a LOCAL file that has some ARRAY fields.
I do not get this for source files where all fields are STRINGS.
CREATED TABLE:
CREATE TABLE "TESTDB"."LAYER2"."CREW" ("TCONST" STRING, "DIRECTORS" ARRAY, "WRITERS" ARRAY);
DATA FILE:
tconst directors writers
tt0000001 nm0005690 \N
tt0000002 nm0721526 \N
tt0000003 nm0721526 \N
tt0000004 nm0721526 \N
tt0000005 nm0005690 \N
tt0000006 nm0617588 nm0617588
tt0000007 nm0374658,nm0005690 \N
tt0000008 nm0719756 nm0331003,nm0759866,nm0173952,nm0719756,nm0816458
SQL
PUT file://<file_path>/title.crew_1.tsv @TEST_2/ui1770650179898
COPY INTO "TESTDB"."LAYER2"."TEST_2" FROM @/ui1770650179898
FILE_FORMAT = '"TESTDB"."LAYER1"."TSV"' ON_ERROR = 'ABORT_STATEMENT' PURGE = TRUE;
ERROR MESSAGE:
Unable to copy files into table. Error parsing JSON: nm0005690 File
'@TEST_2/ui1770650179898/sample.samplefile.tsv', line 2, character 11
Row 1, column "TEST_2"["DIRECTORS":2] If you would like to continue
loading when an error is encountered, use other values such as
'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more
information on loading options, please run 'info loading_data' in a
SQL client.
I feel it should be an easy answer -- I have tried various troubleshooting approaches without success and could not find anything on the web to point me in the right direction.
Any assistance would be greatly appreciated.

You need to convert strings to arrays. I assume that the delimiter is the tab character:
create file format csvtab type=csv field_delimiter = '\t' skip_header=1;
COPY INTO "CREW" FROM
(select $1, SPLIT($2,','), SPLIT($3,',') from @mystage)
FILE_FORMAT = csvtab ON_ERROR = 'ABORT_STATEMENT' ;
select * from "CREW";
+-----------+---------------------------+---------------------------------------------------------------+
| TCONST    | DIRECTORS                 | WRITERS                                                       |
+-----------+---------------------------+---------------------------------------------------------------+
| tt0000001 | ["nm0005690"]             |                                                               |
| tt0000002 | ["nm0721526"]             |                                                               |
| tt0000003 | ["nm0721526"]             |                                                               |
| tt0000004 | ["nm0721526"]             |                                                               |
| tt0000005 | ["nm0005690"]             |                                                               |
| tt0000006 | ["nm0617588"]             | ["nm0617588"]                                                 |
| tt0000007 | ["nm0374658","nm0005690"] |                                                               |
| tt0000008 | ["nm0719756"]             | ["nm0331003","nm0759866","nm0173952","nm0719756","nm0816458"] |
+-----------+---------------------------+---------------------------------------------------------------+
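One extra note, in case the literal \N markers in the file come through as one-element arrays like ["\N"] instead of blanks: NULL_IF is a standard CSV file-format option, so the same file format can map them to SQL NULL before the SPLIT runs (a sketch reusing the names above):
create or replace file format csvtab type=csv field_delimiter = '\t' skip_header=1 null_if = ('\\N');
SPLIT(NULL, ',') returns NULL, so those columns load as NULL.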

Related

Conditionally remove a field in Splunk

I have a table generated by chart that lists the results of a compliance scan.
These results are typically Pass, Fail, and Error, but sometimes there is "Unknown" as a response.
I want to show the percentage of each (Pass, Fail, Error, Unknown), so I do the following:
| fillnull value=0 Pass Fail Error Unknown
| eval _total=Pass+Fail+Error+Unknown
<calculate percentages for each field>
<append "%" to each value (Pass, Fail, Error, Unknown)>
What I want to do is eliminate a "totally" empty column, and only display it if it actually exists somewhere in the source data (not merely because of the fillnull command)
Is this possible?
I was thinking something like this, but cannot figure out the second step:
| eventstats max(Unknown) as _unk
| <if _unk is 0, drop the field>
edit
This could just as easily be reworded to:
if every entry for a given field is identical, remove it
Logically, this would look something like:
if(mvcount(values(fieldname))<2), fields - fieldname
Except, of course, that's not valid SPL
Could you try this logic after the chart:
``` fill with null values ```
| fillnull value=null()
``` rotate 90° twice, dropping empty/null columns ```
| transpose 0 include_empty=false | transpose 0 header_field=column | fields - column
[edit:] it works when I do the following, but I am not sure it is easy to make it work under all conditions
| stats count | eval keep=split("1 2 3 4 5"," ") | mvexpand keep
| table keep nokeep
| fillnull value=null()
| transpose 0 include_empty=false | transpose 0 header_field=column | fields - column
[edit2:] and if you need to add more null() values, it can be done like this
| stats count | eval keep=split("1 2 3 4 5"," "), nokeep=0 | mvexpand keep
| table keep nokeep
| foreach nokeep [ eval nokeep=if(nokeep==0,null(),nokeep) ]
| transpose 0 include_empty=false | transpose 0 header_field=column | fields - column

statistics chart in splunk using value from log

I am new to Splunk dashboards. I need some help with this kind of data.
2020-09-22 11:14:33.328+0100 org{abc} INFO 3492 --- [hTaskExecutor-1] c.j.a.i.p.v.b.l.ReadFileStepListener : [] read-feed-file-step ended with status exitCode=COMPLETED;exitDescription= with compositeReadCount 1 and other count status as: BatchStatus(readCount=198, multiEntityAccountCount=0, readMultiAccountEntityAdjustment=0, accountFilterSkipCount=7, broadRidgeFilterSkipCount=189, writeCount=2, taskCreationCount=4)
I wanted to have statistics in a dashboard showing all the integer values in the above log.
Edit 1:
I tried this, but it is not working.
index=abc xyz| rex field=string .*readCount=(?P<readCount>\d+) | table readCount
See if this run-anywhere example helps.
| makeresults
| eval _raw="2020-09-22 11:14:33.328+0100 org{abc} INFO 3492 --- [hTaskExecutor-1] c.j.a.i.p.v.b.l.ReadFileStepListener : [] read-feed-file-step ended with status exitCode=COMPLETED;exitDescription= with compositeReadCount 1 and other count status as: BatchStatus(readCount=198, multiEntityAccountCount=0, readMultiAccountEntityAdjustment=0, accountFilterSkipCount=7, broadRidgeFilterSkipCount=189, writeCount=2, taskCreationCount=4)"
`comment("Everything above just sets up test data")`
| extract pairdelim="," kvdelim="="
| timechart span=1h max(*Count) as *Count
I solved this using
index=xyz
| regex ".*fileName=(\s*([\S\s]+))"
| rex field=string .*compositeReadCount=(?P<compositeReadCount>\d+)
| regex ".*readCount=(?P<readCount>\d+)"
| regex ".*multiEntityAccountCount=(?P<multiEntityAccountCount>\d+)"
| regex ".*readMultiAccountEntityAdjustment=(?P<readMultiAccountEntityAdjustment>\d+)"
| regex ".*accountFilterSkipCount=(?P<accountFilterSkipCount>\d+)"
| regex ".*broadRidgeFilterSkipCount=(?P<broadRidgeFilterSkipCount>\d+)"
| regex ".*writeCount=(?P<writeCount>\d+)"
| regex ".*taskCreationCount=(?P<taskCreationCount>\d+)"
| regex ".*status=(\s*([\S\s]+))"
| table _time fileName compositeReadCount readCount multiEntityAccountCount readMultiAccountEntityAdjustment accountFilterSkipCount broadRidgeFilterSkipCount writeCount taskCreationCount status

How to eval a string containing column names?

As I cannot attach conditional formatting to a Table, I need a generic function to check whether a set of records (or all records) contains errors, and to show those errors in forms and/or reports.
To achieve this in the 'standard' way, I would have to define the rule for a field of a table every time I use that field in a control or report, which means repeating the same things an annoying number of times, not to mention introducing errors and ending up with a maintenance nightmare.
So my idea is to define all the checks for all the tables and their rows in a CheckError table, like the following fragment related to the table 'Persone':
TableName | FieldName       | TestNumber | TestCode                                                                     | TestMessage                                                                | ErrorType
Persone   | CAP             | 4          | len([CAP]) = 0 or isnull([CAP])                                              | CAP mancante                                                               | warning
Persone   | Codice Fiscale  | 1          | len([Codice Fiscale]) < 16                                                   | Codice fiscale nullo o mancante                                            | error
Persone   | Data di nascita | 2          | (now() - [Data di nascita]) < 18 * 365                                       | Minorenne                                                                  | info
Persone   | mail            | 5          | len([mail]) = 0 or isnull([mail])                                            | email mancante                                                             | warning
Persone   | mail            | 6          | (len([mail]) = 0 or isnull([mail])) and [modalità ritiro referti] = "e-mail" | richiesto l'invio dei referti via e-mail, ma l'indirizzo e-mail è mancante | error
Persone   | Via             | 3          | len([Via]) = 0 or isnull([Via])                                              | Indirizzo mancante                                                         | warning
Now, in each form or report which uses the table Persone, I want to set an 'onload' property to a function:
' to validate all fields in all rows and set the appropriate bg and fg color
Private Sub Form_Open(Cancel As Integer)
Call validazione.validazione(Form, "Persone", 0)
End Sub
' to validate all fields in the row identified by ID and set the appropriate bg and fg color
Private Sub Codice_Fiscale_LostFocus()
Call validazione.validazione(Form, "Persone", ID)
End Sub
So the function validazione, at a certain point, has exactly one row of the table Persone and the set of expressions described in the [TestCode] column above.
Now, I need to logically evaluate the TestString against the table row, to obtain a true or a false.
If true, I'll set the fg and bg color of the field as normal;
if false, I'll set the fg and bg color as per error, info, or warning, as defined by the [ErrorType] column above.
All the above is easy, ready, and running, except for the evaluation step itself:
How can I evaluate the teststring against the table row, to obtain a result?
Thank you
Paolo
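Not an answer from the thread, but one way to sketch that last step in Access VBA, assuming DAO and the built-in Eval() function are available: substitute each [FieldName] reference in TestCode with the row's literal value, then hand the resulting string to Eval(). The function and variable names below are illustrative only.
' Hypothetical helper (not the poster's code): evaluate one TestCode
' expression against the current row of a DAO recordset using Eval().
Public Function EvalTestCode(ByVal testCode As String, rs As DAO.Recordset) As Boolean
    Dim expr As String
    Dim fld As DAO.Field
    expr = testCode
    For Each fld In rs.Fields
        If IsNull(fld.Value) Then
            ' Keep nulls as the literal Null keyword so isnull() still works
            expr = Replace(expr, "[" & fld.Name & "]", "Null", , , vbTextCompare)
        Else
            ' Quote values as string literals; date/number fields may need
            ' their own formatting depending on how TestCode uses them
            expr = Replace(expr, "[" & fld.Name & "]", """" & fld.Value & """", , , vbTextCompare)
        End If
    Next fld
    ' Eval() runs the expression text; treat a Null result as a failed test
    EvalTestCode = (Nz(Eval(expr), 0) <> 0)
End Function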

Apache-pig Number Extraction After a specific String

I have a file with 10,1900 lines, delimited by five '|' characters (so six columns), and the sixth column contains statements like "Dropped 12 (0.01%)". I want to extract the number in parentheses after "Dropped":
Actual -- Dropped 12 (0.01%)
Expected -- 0.01
I need a solution using Apache Pig.
You are looking for the REGEX_EXTRACT function.
Let's say you have a table A that looks like:
+--------------------+
| col1 |
+--------------------+
| Dropped 12 (0.01%) |
| Dropped 24 (0.02%) |
+--------------------+
You can extract the number in parentheses with the following:
B = FOREACH A GENERATE REGEX_EXTRACT(col1, '.*\\((.*)%\\)', 1) AS percent;
+---------+
| percent |
+---------+
| 0.01 |
| 0.02 |
+---------+
I'm specifying a regex capture group for whatever characters are between ( and %). Notice that I'm using \\ as the escape character so that I match the literal opening and closing parentheses.
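For completeness, a minimal end-to-end sketch against the pipe-delimited file described in the question (the file name and column names are assumptions):
-- hypothetical file name and schema; only the sixth column matters here
A = LOAD 'title_stats.txt' USING PigStorage('|')
      AS (c1:chararray, c2:chararray, c3:chararray,
          c4:chararray, c5:chararray, c6:chararray);
-- capture whatever sits between '(' and '%)' in the sixth column
B = FOREACH A GENERATE REGEX_EXTRACT(c6, '.*\\((.*)%\\)', 1) AS percent;
DUMP B;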

How to write a select query or server-side function that will generate a neat time-flow graph from many data points?

NOTE: I am using a graph database (OrientDB to be specific). This gives me the freedom to write a server-side function in javascript or groovy rather than limit myself to SQL for this issue.
NOTE 2: Since this is a graph database, the arrows below are simply describing the flow of data. I do not literally need the arrows to be returned in the query. The arrows represent relationships.
I have data that is represented in a time-flow manner; i.e. EventC occurs after EventB which occurs after EventA, etc. This data is coming from multiple sources, so it is not completely linear. It needs to be congregated together, which is where I'm having the issue.
Currently the data looks something like this:
#    | event  | next
--------------------------
12:0 | EventA | 12:1
12:1 | EventB | 12:2
12:2 | EventC |
12:3 | EventA | 12:4
12:4 | EventD |
Where "next" is the out() edge to the event that comes next in the time-flow. On a graph this comes out to look like:
EventA-->EventB-->EventC
EventA-->EventD
Since this data needs to be congregated together, I need to merge duplicate events but preserve their edges. In other words, I need a select query that will result in:
        -->EventB-->EventC
EventA--|
        -->EventD
In this example, since EventB and EventD both occurred after EventA (just at different times), the select query will show two branches off EventA as opposed to two separate time-flows.
EDIT #2
If an additional set of data were to be added to the data above, with EventB->EventE, the resulting data/graph would look like:
#    | event  | next
--------------------------
12:0 | EventA | 12:1
12:1 | EventB | 12:2
12:2 | EventC |
12:3 | EventA | 12:4
12:4 | EventD |
12:5 | EventB | 12:6
12:6 | EventE |
EventA-->EventB-->EventC
EventA-->EventD
EventB-->EventE
I need a query to produce a tree like:
                   -->EventC
        -->EventB--|
        |          -->EventE
EventA--|
        -->EventD
EDIT #3 and #4
Here is the data with edges shown as opposed to the "next" column above. I also added a couple additional columns here to hopefully clear up any confusion about the data:
# | event | ip_address | timestamp | in | out |
----------------------------------------------------------------------------
12:0 | EventA | 123.156.189.18 | 2015-04-17 12:48:01 | | 13:0 |
12:1 | EventB | 123.156.189.18 | 2015-04-17 12:48:32 | 13:0 | 13:1 |
12:2 | EventC | 123.156.189.18 | 2015-04-17 12:48:49 | 13:1 | |
12:3 | EventA | 103.145.187.22 | 2015-04-17 14:03:08 | | 13:2 |
12:4 | EventD | 103.145.187.22 | 2015-04-17 14:05:23 | 13:2 | |
12:5 | EventB | 96.109.199.184 | 2015-04-17 21:53:00 | | 13:3 |
12:6 | EventE | 96.109.199.184 | 2015-04-17 21:53:07 | 13:3 | |
The data is saved like this to preserve each individual event and the flow of a session (labeled by the ip address).
TL;DR
Got lots of events, some duplicates, and need them all organized into one neat time-flow graph.
Holy cow.
After wrestling with this for over a week I think I FINALLY have a working function. This isn't optimized for performance (oh the loops!), but gets the job done for the time being while I can work on performance. The resulting OrientDB server-side function (written in javascript):
The function:
// Clear previous runs
db.command("truncate class tmp_Then");
db.command("truncate class tmp_Events");
// Get all distinct events
var distinctEvents = db.query("select from Events group by event");
// Send 404 if null, otherwise proceed
if (distinctEvents == null) {
response.send(404, "Events not found", "text/plain", "Error: events not found" );
} else {
var edges = [];
// Loop through all distinct events
distinctEvents.forEach(function(distinctEvent) {
var newEvent = [];
var rid = distinctEvent.field("@rid");
var eventType = distinctEvent.field("event");
// The main query that finds all *direct* descendents of the distinct event
var result = db.query("select from (traverse * from (select from Events where event = ?) where $depth <= 2) where @class = 'Events' and $depth > 1 and @rid in (select from Events group by event)", [eventType]);
// Save the distinct event in a temp table to create temp edges
db.command("create vertex tmp_Events set rid = ?, event = ?", [rid, event]);
edges.push(result);
});
// The edges array defines which edges should exist for a given event
edges.forEach(function(edge, index) {
edge.forEach(function(e) {
// Create the temp edge that corresponds to its distinct event
db.command("create edge tmp_Then from (select from tmp_Events where rid = " + distinctEvents[index].field("#rid") + ") to (select from tmp_Events where rid = " + e.field("#rid") + ")");
});
});
var result = db.query("select from tmp_Events");
return result;
}
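One setup note, based on the truncate calls at the top of the function (an assumption, since the schema isn't shown in the thread): the temp classes have to exist before the first run, e.g.:
CREATE CLASS tmp_Events EXTENDS V;
CREATE CLASS tmp_Then EXTENDS E;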
Takeaways:
Temp tables appeared to be necessary. I tried to do this without temp tables (classes), but I'm not sure it could be done. I needed to mock edges that didn't exist in the raw data.
Traverse was very helpful in writing the main query. Traversing through an event to find its direct, unique descendents was fairly simple.
Having the ability to write stored procs in Javascript is freaking awesome. This would have been a nightmare in SQL.
omfg loops. I plan to optimize this and continue to make it better so hopefully other people can find some use for it.