Is there any way to convert string patterns as shown in the map below?

Trying to convert the 'Currently' patterns in the table above to the 'Turn It Into' patterns in the data set; examples are shown in the next two columns for each one.
I tried different ways but could not get the required output.
Here is the code to create the input data:
data Current;
Input Currently :$40.;
Datalines;
HiKumar"^TM1()^",test
HiKumar"^TM2()^"
HiKumar^TM3()^HiKumar
HiKumar^f(‘VARNAME’).any(‘#’)^
HiKumar^f(‘VARNAME’)^
HiKumar^f(‘VARNAME’).get()==’#’^
HiKumar^f(‘VARNAME’)==’#’^
HiKumar^f(‘VARNAME’).toNumber()^
HiKumar^f(‘VARNAME’).toString()^
HiKumar^f(‘VARNAME’).toString().toLowerCase()^
HiKumar^f(‘VARNAME’).toString().toUpperCase()^
HiKumar^f(‘IFCONDITION’)?’THENTEXT’:’ELSETEXT’^
HiKumar<br>
HiKumar<br/>
HiKumar<br />
HiKumar^MobileHeader()^
HiKumar^MobileFooter()^
HiKumar<u>
HiKumar</u>
HiKumar&nbsp
;
run;

I'd go with regular expressions. You'll need several; here's one to start. (The quotation characters will probably be mangled by the web browser and/or SAS, so I'd recommend replacing them manually rather than trusting copy/paste if they don't work at first.) This only identifies the fourth row, but similar regexes can be constructed for the other rows (and some will work for multiple rows).
data want;
set current;
/* Compile the Perl regex; the capture group grabs the variable name */
/* between ^f(‘ and ’)^ (note the curly quotes, matching the data).  */
rx_1 = prxparse("/\^f\(‘(\w*?)’\)\^/");
rc_1 = prxmatch(rx_1, currently);
/* On a match, pull out capture group 1. */
if rc_1 ne 0 then have = prxposn(rx_1, 1, currently);
run;
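Once a pattern is identified, PRXCHANGE can do the rewrite in one step. A minimal sketch, assuming (hypothetically, since the 'Turn It Into' column isn't reproduced here) that ^f(‘VARNAME’)^ should become [VARNAME]:
data want2;
set current;
/* s/old/new/ rewrites occurrences; -1 means replace all matches.       */
/* $1 is the captured variable name. The target [$1] is a placeholder   */
/* -- substitute whatever the 'Turn It Into' column actually calls for. */
have = prxchange("s/\^f\(‘(\w*?)’\)\^/[$1]/", -1, currently);
run;
One PRXCHANGE per mapping-table row (or a loop over an array of patterns in a single data step) should cover the whole table.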

Open Refine: Exporting nested XML with templating

I have a question regarding the templating option for XML in OpenRefine. Is it possible to export data from two columns in a nested XML structure if both columns contain multiple values that need to be split first?
Here's an example to illustrate what I mean. My columns look like this:
Column1: https://d-nb.info/gnd/119119110;https://d-nb.info/gnd/118529889
Column2: Grützner, Eduard von;Elisabeth II., Großbritannien, Königin

Column1: https://d-nb.info/gnd/1037554086;https://d-nb.info/gnd/1245873660
Column2: Müller, Jakob;Meier, Anina
Each value separated by a semicolon in Column1 has a corresponding value in Column2, in the same order. My desired output would look like this:
<rootElement>
<recordRootElement>
...
<edm:Agent rdf:about="https://d-nb.info/gnd/119119110">
<skos:prefLabel xml:lang="zxx">Grützner, Eduard von</skos:prefLabel>
</edm:Agent>
<edm:Agent rdf:about="https://d-nb.info/gnd/118529889">
<skos:prefLabel xml:lang="zxx">Elisabeth II., Großbritannien, Königin</skos:prefLabel>
</edm:Agent>
...
</recordRootElement>
<recordRootElement>
...
<edm:Agent rdf:about="https://d-nb.info/gnd/1037554086">
<skos:prefLabel xml:lang="zxx">Müller, Jakob</skos:prefLabel>
</edm:Agent>
<edm:Agent rdf:about="https://d-nb.info/gnd/1245873660">
<skos:prefLabel xml:lang="zxx">Meier, Anina</skos:prefLabel>
</edm:Agent>
...
</recordRootElement>
</rootElement>
(note: in my initial posting, the position of the root element was not indicated and it looked like this:
<edm:Agent rdf:about="https://d-nb.info/gnd/119119110">
<skos:prefLabel xml:lang="zxx">Grützner, Eduard von</skos:prefLabel>
</edm:Agent>
<edm:Agent rdf:about="https://d-nb.info/gnd/118529889">
<skos:prefLabel xml:lang="zxx">Elisabeth II., Großbritannien, Königin</skos:prefLabel>
</edm:Agent>
)
I managed to split the values separated by ";" for both columns like this:
{{forEach(cells["Column1"].value.split(";"),v,"<edm:Agent rdf:about=\""+v+"\">"+"\n"+"</edm:Agent>")}}
{{forEach(cells["Column2"].value.split(";"),v,"<skos:prefLabel xml:lang=\"zxx\">"+v+"</skos:prefLabel>")}}
but I can't find out how to nest the split skos:prefLabel elements into the edm:Agent element. Is that even possible? If not, I'll work with separate columns or another workaround, but I wanted to make sure there isn't a more direct way first.
Thank you!
Kristina
I am going to expand the answer from RolfBly using the Templating Exporter from OpenRefine.
I make the following assumptions:
There is some other column to the left of Column1 acting as a record-identifying column (see the first screenshot).
The columns have proper names; I'll call them URI and Name below.
URI and Name are the only columns with multiple values. Otherwise the following recipe might produce empty XML elements.
We will use the record information available via GREL to decide whether to write a <recordRootElement> or not.
Recipe:
Split first Name and then URI on the separator ";" via "Edit cells" => "Split multi-valued cells".
Go to "Export" => "Templating..."
In the prefix field use the value
<?xml version="1.0" encoding="utf-8"?>
<rootElement>
Please note that I skipped the namespace imports for edm, skos, rdf and xml.
In the row template field use the value:
{{if(row.index - row.record.fromRowIndex == 0, '<recordRootElement>', '')}}
<edm:Agent rdf:about="{{escape(cells['URI'].value, 'xml')}}">
<skos:prefLabel xml:lang="zxx">{{escape(cells['Name'].value, 'xml')}}</skos:prefLabel>
</edm:Agent>
{{if(row.index - row.record.fromRowIndex == row.record.rowCount - 1, '</recordRootElement>', '')}}
The row separator field should just contain a linebreak. (The two if() guards compare the current row index against the record's first and last rows, so <recordRootElement> is opened only on a record's first row and closed only on its last.)
In the suffix field use the value:
</rootElement>
Disclaimer: If you're keen on using only OpenRefine, this won't be the answer you were hoping for. There may be ways in OR that I don't know of. That said, here's how I would do it.
Edit: The trick is to keep URL and literal side by side on one line. b2m's answer does just that: split from right to left instead of from left to right. You can then skip steps 2 and 3 below to get the result in the image.
1. Split each column into two columns by separator ";". You'll get four columns; 1 and 3 belong together, and 2 and 4 belong together. I'm assuming this will be the case consistently in your data.
2. Export columns 1 and 3 to one file and columns 2 and 4 to another, in any convenient format, using the custom tabular exporter.
3. Concatenate those two files into one single file using an editor (I use Notepad++), or any other method you may prefer. Several roads lead to Rome here. The result in OR would be something like this.
You then have all sorts of options to put text strings in front of, between, and after your two columns.
In OR, you could then use a transform on column URL to build your XML, using the code below (note the \n for newline; that's just a line feed, so you may want \r\n for carriage return + line feed on Windows):
'<edm:Agent rdf:about="' + value + '">\n<skos:prefLabel xml:lang="zxx">' + cells.Name.value + '</skos:prefLabel>\n</edm:Agent>'
to get your XML in one column, like so
which you can then export using the custom tabular exporter again. Alternatively, use Add column based on this column in a similar manner if you want to retain your URL column.
You could even do this in the editor without re-importing the file back into OR, but that's beyond the scope of this answer.

Is there a more efficient way to parse a fixed-width txt file in Access than using queries?

I have a few large fixed-width text files that contain multiple specification formats. I need to parse the files based on a record identifier at a set position in each line; that position differs between specifications.
I have written queries for each of the different specifications (95 of them), with the start position and length of each field hard-coded into the query using the Mid() function, and a WHERE clause to filter on the [Record Identifier] from the specification. As you can see below, the two specifications place the identifier differently in the file.
SELECT Mid([AllData],1,5) AS PlanNumber, Mid([AllData],6,4) AS Spaces1, Mid([AllData],10,3) AS Filler1, Mid([AllData],13,11) AS SSN, Mid([AllData],24,1) AS AccountIdentifier, Mid([AllData],25,5) AS Filler2, Mid([AllData],30,2) AS RecordIdentifier, Mid([AllData],32,1) AS FieldType, Mid([AllData],33,4) AS Filler3, Mid([AllData],37,8) AS HireDate, Mid([AllData],45,8) AS ParticipationDate, Mid([AllData],53,8) AS VestinDate, Mid([AllData],61,8) AS DateOfBirth, Mid([AllData],77,1) AS Spaces2, Mid([AllData],78,1) AS Reserved1, Mid([AllData],79,1) AS Reserved2, Mid([AllData],80,1) AS Spaces3
FROM TBL_Company1
WHERE (((Mid([AllData],30,2))="02") AND ((Mid([AllData],32,1))="D"));
Or
SELECT Mid([AllData],1,5) AS PlanNumber, Mid([AllData],6,4) AS Spaces1, Mid([AllData],10,3) AS Filler1, Mid([AllData],13,11) AS SSN, Mid([AllData],24,1) AS AccountIdentifier, Mid([AllData],25,7) AS RecordIdentifier, Mid([AllData],32,22) AS StreetAddressForBank, Mid([AllData],54,20) AS CityForBank, Mid([AllData],74,2) AS StateForBank, Mid([AllData],76,5) AS ZipCodeForBank
FROM TBL_Company1
WHERE (((Mid([AllData],25,7))="49EFTAD"));
Is there a way to parse this out without having to hard-code every position and length?
I was thinking of having a table with all of the specifications in it and an import function that reads the specification table and parses the data accordingly into a new table, or maybe something else.
What I have done is not very scalable, and if the format changes a little I have to go back and change each query.
Any help is greatly appreciated.
I think in your situation, I'd want to be able to generate the SQL statement dynamically, as you suggest.
I'd have a table something like:
Format#,Position,OutColName,FromPos,Length,WhereValue
1,1,"PlanNumber",1,5,
1,2,"Spaces1",6,4,
...
1,n,,30,2,"02"
1,n+1,,32,1,"D"
and then some VBA to process it and build and execute the SQL string(s). The SELECT clause entries would be recognized by having a value in the OutColName field, and WHERE clause entries by values in the WhereValue column.
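A minimal sketch of that VBA, assuming a specification table named tblFormatSpec with the columns above (the table and function names here are illustrative):
Public Function BuildSpecSQL(ByVal FormatNum As Long) As String
    ' Build a SELECT ... WHERE ... statement from the specification
    ' rows for one format, in Position order.
    Dim rs As DAO.Recordset
    Dim selClause As String, whereClause As String

    Set rs = CurrentDb.OpenRecordset( _
        "SELECT * FROM tblFormatSpec WHERE [Format#]=" & FormatNum & _
        " ORDER BY [Position];")

    Do Until rs.EOF
        If Not IsNull(rs!OutColName) Then
            ' SELECT entry: Mid([AllData], FromPos, Length) AS OutColName
            If Len(selClause) > 0 Then selClause = selClause & ", "
            selClause = selClause & "Mid([AllData]," & rs!FromPos & "," & _
                rs!Length & ") AS [" & rs!OutColName & "]"
        End If
        If Not IsNull(rs!WhereValue) Then
            ' WHERE entry: Mid([AllData], FromPos, Length) = "value"
            If Len(whereClause) > 0 Then whereClause = whereClause & " AND "
            whereClause = whereClause & "Mid([AllData]," & rs!FromPos & "," & _
                rs!Length & ")=""" & rs!WhereValue & """"
        End If
        rs.MoveNext
    Loop
    rs.Close

    BuildSpecSQL = "SELECT " & selClause & " FROM TBL_Company1" & _
        IIf(Len(whereClause) > 0, " WHERE " & whereClause, "") & ";"
End Function
You could then open the result directly with CurrentDb.OpenRecordset(BuildSpecSQL(1)), or store it in a saved QueryDef and append the parsed rows into a staging table per format.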
Of course this is only more "efficient" in the sense that it's a bit easier to code up new formats or fix/modify existing ones.

Is there a more efficient way to return all records in a "traversed" linked list?

I have a generic (non-V) class with a LINKMAP field that links a chain of time-series data. Each record links to the next via a key naming the next month/day/hour/etc. and a value holding the #rid of the next record. I could use edges (non-lightweight, with a single property matching the LINKMAP key), but I don't want to: I only need a one-direction link and want to use the least amount of storage space (my assumption being that while edges might make querying slightly easier, they wouldn't help performance or storage).
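For reference, a minimal sketch of that structure, using OrientDB 2.x's UPDATE ... PUT syntax for map entries (the class and key names mirror the batch below; #13:5 is a placeholder @rid):
CREATE CLASS data;
CREATE PROPERTY data.links LINKMAP data;
INSERT INTO data SET key = 'AAA';
UPDATE data PUT links = '2017', #13:5 WHERE key = 'AAA';
Each PUT adds one map entry keyed by the next time unit, pointing at the next record in the chain.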
I currently have the following batch query:
begin;
let r = select from data where key='AAA';
let y = select expand(links['2017']) from $r;
let m = select expand(links['07']) from $y;
let d = select expand(links['15']) from $m;
let h = select expand(links['10']) from $d;
return (select $r[0], $y[0], $m[0], $d[0], $h[0])
This works, in the sense that I get a result set with the #rids of each record in the linked list.
I have two questions:
Is there a more efficient way to do this query? I've tried various TRAVERSE options with $path and such, but nothing else I attempted worked at all. I did not try MATCH because I'm not using edges.
The current result set is flat, with each value prop ($r, $y, etc.) containing the corresponding #rid. That's OK for my purposes, but I'd like to know if there's a way to return the actual records instead of just the #rids. Nothing I've tried works.
PS - I'm using orientjs, but the batch runs identically from the console as well.

Talend - Dynamic Column Name (Enterprise version)

Can anyone help me solve this case?
I have many files to process, and two of them are shown in the screenshot below along with my expected output.
I use this transformation in Talend: tFileList---tInputExcel---tUnpivotRow---tMap---tPostgresqlOutput
The output is different from my expected output; this is the screenshot of the output.
Can anyone help me reach my expected output, as in the first picture above?
This will be pretty hard. You'd have to handle the input as a text file, and whenever you find the "store" value in the first column, update your type variables with that row's values.
Here's how I'd start:
Basically, the tJavaFlex begin part would declare the type holders:
String col1Type;
String col2Type;
String colNType;
The main part:
if (input_row.col0.equalsIgnoreCase("store")) {
    col1Type = input_row.col1;
    col2Type = input_row.col2;
    colNType = input_row.colN;
    continue; /* so this record will be ignored by the rest of the components! */
}
output_row.col1Type = col1Type;
output_row.col1Value = Integer.valueOf(input_row.col1); /* because we have text and need numbers */
I think using "propagate results" will save you from writing out all the other fields.
From there it would be very simple, as you have key-type-value-type-value-type-value results.

REGEX_EXTRACT error in PIG

I have a CSV file with 3 columns: tweetid, tweet, and Userid. However, within the tweet column there are comma-separated values.
For example, one row of data:
`396124437168537600`,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
I want to extract all 3 fields individually, but REGEX_EXTRACT gives me an error with this code:
a = LOAD 'tweets' USING PigStorage(',') AS (f1,f2,f3);
b = FILTER a BY REGEX_EXTRACT(f1,'(.*)\\"(.*)',1);
The error is:
error: Filter's condition must evaluate to boolean.
In the use case shared, reading the data using PigStorage(',') will result in the last field value (savava143) going missing:
A = LOAD '/Users/muralirao/learning/pig/a.csv' USING PigStorage(',') AS (f1,f2,f3);
DUMP A;
Output : A (observe that the last field value is missing):
(396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.")
For the use case shared, to extract all the values from a CSV file whose field values contain ',', we can use either CSVExcelStorage or CSVLoader.
Approach 1 : Using CSVExcelStorage
Ref : http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
Input : a.csv
396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
Pig Script :
REGISTER piggybank.jar;
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (f1,f2,f3);
DUMP A;
Output : A
(396124437168537600,I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.,savava143)
Approach 2 : Using CSVLoader
Ref : http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html
The script below makes use of CSVLoader(); DUMP A will result in the same output seen earlier.
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (f1,f2,f3);
The error is that you do not want to FILTER based on a regex but to GENERATE new fields based on a regex. To filter, Pig must know whether a line has to be kept, hence the boolean requirement.
Therefore, you have to use:
b = FOREACH a GENERATE REGEX_EXTRACT(FIELD, REGEX, HOW_MANY_GROUPS_TO_RETURN);
However, as @Murali Rao said, your values are not just comma separated but CSV (think about how you would handle a comma in the tweet: it is not a field separator, just content).
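Putting that together with the question's fields, a minimal sketch (the regex is the one from the question, and the third argument picks which capture group to return):
a = LOAD 'tweets' USING PigStorage(',') AS (f1:chararray, f2:chararray, f3:chararray);
-- GENERATE a new field from capture group 1 instead of filtering on it
b = FOREACH a GENERATE REGEX_EXTRACT(f1, '(.*)\\"(.*)', 1) AS extracted;
-- if you really do want a FILTER, the condition itself must be boolean, e.g.:
c = FILTER a BY (f1 MATCHES '.*\\".*');
That fixes the boolean error, but for this data you are still better off loading with CSVExcelStorage or CSVLoader as shown above, so the commas inside the tweet don't split the field in the first place.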