R: getting XML out of a database into a list - sql

I'm trying to get an XML fragment out of my database into RStudio using R. The database is Microsoft SQL Server. The query looks like this:
SELECT [url]
,[attributeID]
,[propertyQuality]
,[type]
,[valueOld]
,[valueNew]
,[utcTime]
,[meaning]
,[auditSecurityID]
,[auditUserActionID]
,[auditAlarmID]
,[description]
FROM [EventLog].[dbo].[Rel_AuditChange_UserLogEntry]
INNER JOIN AuditChange ON AuditChange.id=Rel_AuditChange_UserLogEntry.auditChangeID
INNER JOIN UserLogEntry ON UserLogEntry.id=Rel_AuditChange_UserLogEntry.userLogEntryID
The value I want to extract is "valueNew", which looks something like this and differs from row to row:
<ResourceReference name="value"><ResourceId name="resourceIdVal">BatchReset</ResourceId><RevisionRef name="revision"><RevisionType name="revisionTypeVal">Specific</RevisionType><Revision name="revisionVal">1</Revision></RevisionRef></ResourceReference>
<FormatExternal name="value">BLANK</FormatExternal>
So in R I'm loading it in with:
auditChangeUserLog <- sqlQuery(con, "
SELECT [url]
,[valueOld]
,[valueNew]
,[utcTime]
FROM [EventLog].[dbo].[Rel_AuditChange_UserLogEntry]
INNER JOIN AuditChange ON AuditChange.id=Rel_AuditChange_UserLogEntry.auditChangeID
INNER JOIN UserLogEntry ON UserLogEntry.id=Rel_AuditChange_UserLogEntry.userLogEntryID
", stringsAsFactors=FALSE);
rm(con)
Then I try to get the XML into a list so I can later join it back next to utcTime:
result <- list()  # initialise the result list before filling it
for(xml in 1:nrow(auditChangeUserLog))
{
  result[xml] <- list(xmlToDataFrame(auditChangeUserLog$valueNew[xml]))
}
With the second entry from the auditChangeUserLog I get the following error:
Error in matrix(vals, length(nfields), byrow = TRUE) :
'data' must be of a vector type, was 'NULL'
If you need more information, please let me know.
Update 11.09.17:
With dput, as suggested, the data looks like this and parsing it produces the following errors:
"<ResourceReference name=\"value\"><ResourceId name=\"resourceIdVal\">BatchReset</ResourceId><RevisionRef name=\"revision\"><RevisionType name=\"revisionTypeVal\">Specific</RevisionType><Revision name=\"revisionVal\">1</Revision></RevisionRef></ResourceReference>"
"<FormatExternal name=\"value\">BLANK</FormatExternal>"
"<LotId name=\"value\">-</LotId>"
"<DateTimeUTCDelimited name=\"value\"><unsigned_short name=\"year\">0</unsigned_short><octet name=\"month\">0</octet><octet name=\"day\">0</octet><octet name=\"hour\">0</octet><octet name=\"minute\">0</octet><octet name=\"sec\">0</octet><unsigned_long name=\"microsec\">0<"
StartTag: invalid element name
Premature end of data in tag unsigned_long line 1
Premature end of data in tag DateTimeUTCDelimited line 1
Error: 1: StartTag: invalid element name
2: Premature end of data in tag unsigned_long line 1
3: Premature end of data in tag DateTimeUTCDelimited line 1
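The last fragment in that dput output is cut off mid-tag, which is exactly what produces the "Premature end of data" errors. One quick way to spot such rows before parsing is a sketch like this (assuming the auditChangeUserLog data frame from above; the trailing-'>' test is only a heuristic):
suspect <- which(!grepl(">\\s*$", auditChangeUserLog$valueNew))
auditChangeUserLog$valueNew[suspect]  # rows that are probably truncated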
Update 11.09.17 (2)
I was playing around a little bit and now I get another error.
The input is:
> print(auditChangeUserLog$valueNew[c(1:5)])
[1] "<AdjustmentValue name=\"value\"><AdjustmentState name=\"state\">Active</AdjustmentState><string name=\"value\">3,4</string></AdjustmentValue>"
[2] "<AdjustmentValue name=\"value\"><AdjustmentState name=\"state\">Active</AdjustmentState><string name=\"value\">5,0</string></AdjustmentValue>"
[3] "<AdjustmentValue name=\"value\"><AdjustmentState name=\"state\">Irrelevant</AdjustmentState><string name=\"value\"></string></AdjustmentValue>"
[4] "<AdjustmentValue name=\"value\"><AdjustmentState name=\"state\">Irrelevant</AdjustmentState><string name=\"value\"></string></AdjustmentValue>"
[5] "<AdjustmentValue name=\"value\"><AdjustmentState name=\"state\">Active</AdjustmentState><string name=\"value\">10</string></AdjustmentValue>"
I run the same loop as before:
result <- list()
for(xml in 1:nrow(auditChangeUserLog))
{
  result[xml] <- list(xmlToDataFrame(auditChangeUserLog$valueNew[xml]))
}
And I get this error message:
Input is not proper UTF-8, indicate encoding !
Bytes: 0xB5 0x6D 0x3C 0x2F
Error: 1: Input is not proper UTF-8, indicate encoding !
Bytes: 0xB5 0x6D 0x3C 0x2F
I think the biggest problem is that the XML structure differs a little from row to row.
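For what it's worth, the reported bytes 0xB5 0x6D 0x3C 0x2F are "µm</" in Latin-1, so the column is probably not UTF-8. A minimal sketch of a more defensive loop, assuming the text is Latin-1 and that unparseable rows should become NULL instead of aborting the whole run:
library(XML)

result <- vector("list", nrow(auditChangeUserLog))
for (i in seq_len(nrow(auditChangeUserLog))) {
  parsed <- tryCatch({
    # parse with an explicit encoding, then flatten to a data frame
    doc <- xmlParse(auditChangeUserLog$valueNew[i], asText = TRUE,
                    encoding = "ISO-8859-1")
    xmlToDataFrame(doc, stringsAsFactors = FALSE)
  }, error = function(e) NULL)   # truncated / malformed XML -> NULL
  result[i] <- list(parsed)      # the list() wrapper keeps NULLs in place
}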

Related

Coalesce array of integers in Hive

foo_ids is an array of type bigint, but the entire array could be null. If the array is null, I want an empty array instead.
If I do this: COALESCE(foo_ids, ARRAY())
I get:
FAILED: SemanticException [Error 10016]: Line 13:45 Argument type mismatch 'ARRAY': The expressions after COALESCE should all have the same type: "array<bigint>" is expected but "array<string>" is found
If I do this: COALESCE(foo_ids, ARRAY<BIGINT>())
I get a syntax error: FAILED: ParseException line 13:59 cannot recognize input near ')' ')' 'AS' in expression specification
What's the proper syntax here?
Use this one:
coalesce(foo_ids, array(cast(null as bigint)))
Previously, Hive treated an empty array [] as []. But on Hadoop 2, Hive now returns an empty array [] as NULL (see the reference below). Use array(cast(null as bigint)) for an empty array of type bigint. Strangely, the size of an empty array is -1 (instead of 0). Hope this helps. Thanks.
Sample data:
foo_ids
[112345677899098765,1123456778990987633]
[null,null]
NULL
select foo_ids, size(foo_ids) as sz from tbl;
Result:
foo_ids sz
[112345677899098765,1123456778990987633] 2
[null,null] 2
NULL -1
select foo_ids, coalesce(foo_ids, array(cast(null as bigint))) as newfoo from tbl;
Result:
foo_ids newfoo
[112345677899098765,1123456778990987633] [112345677899098765,1123456778990987633]
[null,null] [null,null]
NULL NULL
Reference: https://docs.treasuredata.com/articles/hive-change-201602

saved data frame is not shown correctly in sql server

I have a data frame named distTest whose columns contain UTF-8 text. I want to save distTest as a table in my SQL Server database. My code is as follows:
library(RODBC)
load("distTest.RData")
Sys.setlocale("LC_CTYPE", "persian")
dbhandle <- odbcDriverConnect('driver={SQL Server};server=****;database=TestDB;
trusted_connection=true',DBMSencoding="UTF-8" )
Encoding(distTest$regsub)<-"UTF-8"
Encoding(distTest$subgroup)<-"UTF-8"
sqlSave(dbhandle,distTest,
tablename = "DistBars", verbose = T, rownames = FALSE, append = TRUE)
I set DBMSencoding for the connection and
Encoding(distTest$regsub) <- "UTF-8"
Encoding(distTest$subgroup) <- "UTF-8"
for the columns, as shown above. However, when I save the data to SQL Server, the columns are not stored in the correct format; the text comes out garbled.
When I set fast = FALSE in sqlSave, I get this error:
Error in sqlSave(dbhandle, Distbars, tablename = "DistBars", verbose = T, :
  22001 8152 [Microsoft][ODBC SQL Server Driver][SQL Server]String or binary data would be truncated.
  01000 3621 [Microsoft][ODBC SQL Server Driver][SQL Server]The statement has been terminated.
  [RODBC] ERROR: Could not SQLExecDirect 'INSERT INTO "DistBars" ( "regsub", "week", "S", "A", "F", "labeled_cluster", "subgroup", "windows" ) VALUES ( 'ظâ€', 5, 4, 2, 3, 'cl1', 'ط­ظ…ظ„ ط²ط¨ط§ظ„ظ‡', 1 )'
I also tried NVARCHAR(MAX) for the UTF-8 columns in the table design; with fast = FALSE the truncation error goes away, but the encoding problem remains.
By the way, part of the data is exported as RData here.
I want to know why the data format is not shown correctly in sql server 2016?
UPDATE
I am now fairly sure that something is wrong with the RODBC package.
I tried inserting to table by
sqlQuery(channel = dbhandle,"insert into DistBars
values(N'7من',NULL,NULL,NULL,NULL,NULL,NULL,NULL)")
as a test, and the format is still wrong. Unfortunately, adding CharSet=utf8; to the connection string does not work either.
I had the same issue in my code and managed to fix it by removing rows_at_time = 1 from my connection configuration.
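For reference, rows_at_time is an argument to RODBC's odbcDriverConnect; if your setup passes it, the trimmed-down call might look like this (a sketch; the server name is a placeholder and whether this helps depends on your driver version):
library(RODBC)

# same connection as before, just without rows_at_time = 1
dbhandle <- odbcDriverConnect(
  'driver={SQL Server};server=myserver;database=TestDB;trusted_connection=true',
  DBMSencoding = "UTF-8")

sqlSave(dbhandle, distTest, tablename = "DistBars",
        rownames = FALSE, append = TRUE)
close(dbhandle)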

How to return all column values as xml attributes from DB2?

Usually you would use * to indicate that you want all columns, or M.* for all columns from a table aliased M, but this doesn't seem to work inside DB2's XMLATTRIBUTES function. However, listing the required columns by name works (I'm working with the RODBC driver in R):
qry <- "
SELECT XML2CLOB(
XMLELEMENT(NAME \"my_object\",
XMLATTRIBUTES(M.COLUMN1 AS \"column_1\", M.COLUMN2)
)) as xml
FROM MYTABLE M
fetch first 100 rows only
"
Result:
XML
1: <my_object column_1="1000002" COLUMN2="1"/>
2: <my_object column_1="1000003" COLUMN2="2"/>
3: <my_object column_1="1000004" COLUMN2="1"/>
4: <my_object column_1="1000005" COLUMN2="2"/>
5: <my_object column_1="1000006" COLUMN2="2"/>
...
I am having trouble generalizing to all columns as in the following query:
qry <- "
SELECT XML2CLOB(
XMLELEMENT(NAME \"my_object\",
XMLATTRIBUTES(M.*)
)) as xml
FROM MYTABLE M
fetch first 100 rows only
"
Result:
V1
1: 42601 -104 [IBM][CLI Driver][DB2] SQL0104N An unexpected token "*" was found following "*". Expected tokens may include: "NEXTVAL CURRVAL". SQLSTATE=42601\r\n
2: [RODBC] ERROR: Could not SQLExecDirect '\nSELECT XML2CLOB(\n XMLELEMENT(NAME "claim",\n XMLATTRIBUTES(F.*)\n )) as xml\n FROM LRD.FEA F\n where F.CPU_STMP_DT_CEN = 20\n and F.CPU_STMP_DT_YR = 13\nfetch first 100 rows only\n'
I am not sure whether the * shortcut is simply unsupported inside XMLATTRIBUTES, or whether I should build the column list myself by pasting the column names into XMLATTRIBUTES, but I am not sure how to do that.
Alternatively, I would also accept each column value as its own XMLELEMENT nested inside my_object.
Consider having R handle the construction of the XML document directly instead of using a DB2-specific function. SQL is a special-purpose language, and hence not the best option for handling flat files, dynamically rendering content, or fluidly connecting with other APIs.
Below, you import a simple SELECT * query into a data frame, then iterate through the columns of the data frame, adding each one as a new attribute:
library(XML)
library(RODBC)
# ODBC DB CONNECTION
conn <-odbcDriverConnect('driver={DB2 Driver};host=hostname;
database=databasename; UID=username;PWD=password')
df <- sqlQuery(conn, "select * from tablename;")
close(conn)
# CREATE XML FILE
doc = newXMLDoc()
root = newXMLNode("Data", doc = doc)
# ADD NEW NODE WITH AN ATTRIBUTE
for (col in names(df)) {
my_object = newXMLNode("my_object", attrs = c(column = col), parent=root)
}
print(doc)
<?xml version="1.0"?>
<Data>
<my_object column="first column"/>
<my_object column="second column"/>
<my_object column="third column"/>
</Data>
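The loop above only records the column names themselves. To attach each row's actual values as attributes, which is closer to what the question asks for, it might be extended like this (a sketch, assuming df was loaded as above and contains only atomic columns):
# one my_object node per row, with column-name/value pairs as attributes
for (i in seq_len(nrow(df))) {
  vals <- vapply(df[i, ], as.character, character(1))
  newXMLNode("my_object", attrs = setNames(vals, names(df)), parent = root)
}
saveXML(doc, file = "data.xml")  # write the finished document to disk
And if the XML must be produced inside DB2 after all, one workaround is to build the XMLATTRIBUTES list yourself from the catalog, since SYSCAT.COLUMNS holds the column names (a sketch; conn is assumed to be an open RODBC handle and MYTABLE to sit in your default schema):
cols <- sqlQuery(conn, "SELECT colname FROM syscat.columns
                        WHERE tabname = 'MYTABLE'",
                 stringsAsFactors = FALSE)$COLNAME
attrList <- paste(sprintf('M.%s AS "%s"', cols, tolower(cols)), collapse = ", ")
qry <- sprintf('SELECT XML2CLOB(XMLELEMENT(NAME "my_object",
                XMLATTRIBUTES(%s))) AS xml FROM MYTABLE M
                FETCH FIRST 100 ROWS ONLY', attrList)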

Pig: Cast error while grouping data

This is the code that I am trying to run. Steps:
Take an input (there is a .pig_schema file in the input folder)
Take only two fields (chararray) from it and remove duplicates
Group on one of those fields
The code is as follows:
x = LOAD '$input' USING PigStorage('\t'); --The input is tab separated
x = LIMIT x 25;
DESCRIBE x;
-- Output of DESCRIBE x:
-- x: {id: chararray,keywords: chararray,score: chararray,time: long}
distinctCounts = FOREACH x GENERATE keywords, id; -- generate two fields
distinctCounts = DISTINCT distinctCounts; -- remove duplicates
DESCRIBE distinctCounts;
-- Output of DESCRIBE distinctCounts;
-- distinctCounts: {keywords: chararray,id: chararray}
grouped = GROUP distinctCounts BY keywords; --group by keywords
DESCRIBE grouped; --THIS IS WHERE IT GIVES AN ERROR
DUMP grouped;
When I run the GROUP step, it gives the following error:
ERROR org.apache.pig.tools.pigstats.SimplePigStats -
ERROR: org.apache.pig.data.DataByteArray cannot be cast to java.lang.String
keywords is a chararray and Pig should be able to group on a chararray. Any ideas?
EDIT:
Input file:
0000010000014743 call for midwife 23 1425761139
0000010000062069 naruto 1 56 1425780386
0000010000079919 the following 98 1425788874
0000010000081650 planes 2 76 1425721945
0000010000118785 law and order 21 1425763899
0000010000136965 family guy 12 1425766338
0000010000136100 american dad 19 1425766702
.pig_schema file
{"fields":[{"name":"id","type":55},{"name":"keywords","type":55},{"name":"score","type":55},{"name":"time","type":15}]}
Pig is not able to identify the value of keywords as a chararray. It's better to name the fields during the initial load; that way we explicitly state the field types.
x = LOAD '$input' USING PigStorage('\t') AS (id:chararray,keywords:chararray,score: chararray,time: long);
UPDATE:
I tried the snippet below with an updated .pig_schema that introduces score, used '\t' as the separator, and ran the following steps on the input shared.
x = LOAD 'a.csv' USING PigStorage('\t');
distinctCounts = FOREACH x GENERATE keywords, id;
distinctCounts = DISTINCT distinctCounts;
grouped = GROUP distinctCounts BY keywords;
DUMP grouped;
I would suggest using unique alias names for better readability and maintainability.
Output :
(naruto 1,{(naruto 1,0000010000062069)})
(planes 2,{(planes 2,0000010000081650)})
(family guy,{(family guy,0000010000136965)})
(american dad,{(american dad,0000010000136100)})
(law and order,{(law and order,0000010000118785)})
(the following,{(the following,0000010000079919)})
(call for midwife,{(call for midwife,0000010000014743)})

How can I use the COUNT value obtained from a call to mksqlite()?

I'm using mksqlite to create and access an SQL database from MATLAB, and I want to get the number of rows in a table. I've tried this:
num = mksqlite('SELECT COUNT(*) FROM myTable');
but the returned value isn't very helpful. If I put a breakpoint in my script and examine the variable, I find that it's a struct with a single field called 'COUNT(_)', which seems to be an invalid name for a field, so I can't access it:
K>> class(num)
ans =
struct
K>> num
num =
COUNT(_): 0
K>> num.COUNT(_)
??? num.COUNT(_)
|
Error: The input character is not valid in MATLAB statements or expressions.
K>> num.COUNT()
??? Reference to non-existent field 'COUNT'.
K>> num.COUNT
??? Reference to non-existent field 'COUNT'.
Even the MATLAB IDE can't access it. If I try to double click the field in the variable editor, this gets spat out:
??? openvar('num.COUNT(_)', num.COUNT(_));
|
Error: The input character is not valid in MATLAB statements or expressions.
So how can I access this field?
You are correct that the problem is that mksqlite somehow manages to create an invalid field name that can't be read. The simplest solution is to add an AS clause to your SQL so that the field has a sensible name:
>> num = mksqlite('SELECT COUNT(*) AS cnt FROM myTable')
num =
cnt: 0
Then to remove the extra layer of indirection you can do:
>> num = num.cnt;
>> num
num =
0