XSLT to compare same nodeset A in 2 files and report specified nodes X,Y, Z where A differs - xslt-1.0

I want to compare 2 XML files made up of records which each contain an ID field.
File 1 is the raw data input, file 2 is the cleaned up version. I want to do a basic QA on the cleaning process.
Where there's any difference between the files in terms of:
- specific ID values that are in file 1 but not in file 2 (or vice versa)
- multiple instances of ID values in either file (1 or 2)
...I want to report a couple of other nodes X, Y, Z from the offending "records" that contain the additional/missing/multiple IDs, to see why (and if) those records were cleaned going from file 1 to file 2.
I get this conceptually; my main question is how to write:
- the references to the separate input files;
- the if statement(s).
XSLT 1.0 preferred.
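A minimal XSLT 1.0 sketch of the approach, assuming both files look like /records/record with id, x, y and z child elements (adjust the names to the real schema). The second file is pulled in with document(), and the stylesheet is run against file 1:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- second input file, resolved relative to the stylesheet -->
  <xsl:variable name="file2" select="document('file2.xml')"/>

  <xsl:template match="/">
    <xsl:variable name="file1" select="/"/>
    <xsl:for-each select="$file1/records/record">
      <xsl:variable name="id" select="id"/>
      <!-- ID present in file 1 but missing from file 2 -->
      <xsl:if test="not($file2/records/record[id = $id])">
        <xsl:value-of select="concat('Missing in file 2: ', $id, ' ', x, ' ', y, ' ', z, '&#10;')"/>
      </xsl:if>
      <!-- ID appears more than once in file 1 -->
      <xsl:if test="count($file1/records/record[id = $id]) &gt; 1">
        <xsl:value-of select="concat('Duplicate in file 1: ', $id, ' ', x, ' ', y, ' ', z, '&#10;')"/>
      </xsl:if>
    </xsl:for-each>
    <!-- reverse direction: in file 2 but not in file 1 (duplicates in file 2 work the same way) -->
    <xsl:for-each select="$file2/records/record">
      <xsl:if test="not($file1/records/record[id = current()/id])">
        <xsl:value-of select="concat('Missing in file 1: ', id, ' ', x, ' ', y, ' ', z, '&#10;')"/>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

For large files, the predicate lookups can be replaced with xsl:key/key() inside an xsl:for-each that switches context to the relevant document.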


SQL Server big tables or store data in a xml field

I have a .NET solution with a large, multi-step form that the customer needs to fill in to give us all the data we need.
So I was wondering which is better (from a performance and design standpoint): a traditional big table with many fields, or storing the data in a single column of XML type.
Example of one "TraditionalTable":
RecordId | CustomerId | Data1      | Data2 | ... | DataN
1        | 120        | 01/01/1980 | abcd  | ... | 123
2        | 20         | 04/02/2004 | fgh   | ... | 230
3        | 10         | 05/01/1995 | xyz   | ... | 135
Example of one "DataWithXMLField":
RecordId | CustomerId | FormData
1        | 120        | <data><customerdetails><borndate>01/01/1980</borndate></customerdetails><financialinfo>...
I've done many systems like this and prefer to keep the data as XML (often it's a serialized object). I find this to be efficient at runtime and at design time. (See item below about binary attachments).
The following are some suggestions based on what I've done in the past. Obviously it's not a one-size-fits-all solution...
Often data is "collected" by a user and "approved" by an administrator. While collecting the data, it's stored as XML. When approved, the XML is shred and placed into "normal" relational tables/fields.
Often this data has been collected through multiple pages. Storing as XML allows collecting data in a way that is logical to the user but doesn't fit the final data structure very well.
If a form is abandoned (not completed or canceled) it's easy to delete a single row.
Things to keep in mind:
Some data is related to workflow and is separate from the data being collected. For example, a field for "Form Status" may go from "In Progress" to "Submitted" to "Approved". This type of data should be kept as regular columns.
Store Binary Data separately. If your form includes submitting binary data (like uploading a PDF) I like to generate a GUID on the front end. Store that GUID in the XML and then save the binary data separately using the GUID. Possibly on disk or in a separate "attachments" table.
Define a column for a "version number" of the XML. This way you can programmatically identify what is in the XML. This will help in the future when you need to make changes to the XML.
Define a column for a "Summary" that is a short, human-friendly version of the XML. For example, if your XML contains information for registering for summer camps, your "XML Summary" might contain the text: "SMITH, JOHN, Camp White Pine 2021". This text is calculated on the front end. It can then be used for displaying rows of data without having to poke into the XML. For example, an administrative page may list the applications that require approval.
Define a column to indicate if the XML meets all your requirements. You don't want to validate the XML in the database (it's often hard, and likely duplicates validation the UI already does). Your business layer can apply business rules (validation) to the XML (or classes) and store an indicator in the database that all business rules are met.
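A minimal sketch of a table along the lines described above, assuming SQL Server; the names and sizes are illustrative assumptions, not prescriptive:

CREATE TABLE FormSubmission (
    RecordId    INT IDENTITY(1,1) PRIMARY KEY,
    CustomerId  INT            NOT NULL,
    FormStatus  VARCHAR(20)    NOT NULL,            -- workflow state kept as a regular column
    FormData    XML            NOT NULL,            -- the collected form data
    XmlVersion  INT            NOT NULL,            -- version number of the XML layout
    XmlSummary  NVARCHAR(200)  NULL,                -- short human-friendly summary, calculated on the front end
    IsValid     BIT            NOT NULL DEFAULT 0   -- set by the business layer once all rules pass
);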

Import data from csv into database when not all columns are guaranteed

I am trying to build an automatic feature for a database that takes NOAA weather data and imports it into our own database tables.
Currently we have 3 steps:
1. Import the data literally into its own table to preserve the original data
2. Copy its data into a table that better represents our own data in structure
3. Then convert that table into our own data
The problem I am having stems from the data that NOAA gives us. It comes in the following format:
Station Station_Name Elevation Latitude Longitude Date MXPN Measurement_Flag Quality_Flag Source_Flag Time_Of_Observation ...
Starting with MXPN (maximum temperature for water in a pan), which comprises its own column plus the 4 other columns after it, that same 5-column group repeats for each form of weather observation. The problem, though, is that if a particular type of weather was not observed at any of the stations reported, that set of 5 columns is completely omitted.
For example if you look at Central Florida stations, you will find no SNOW (Snowfall measured in mm). However, if you look at stations in New Jersey, you will find this column as they report snowfall. This means a 1:1 mapping of columns is not possible between different reports, and the order of columns may not be guaranteed.
Even worse, some of the weather types include wildcards in their definition, e.g. SN*#, where * is a number from 0-8 representing the type of ground and # is a number from 1-7 representing the depth at which the minimum soil temperature was taken, and we'd like to collect these together.
All of these are column headers, and my instinct is to build a small Java program to map these properly to our data set as we'd like it. However, my superior believes it may be possible to have the database do this on a mass import, but he does not know how to do it.
Is there a way to do this as a mass import, or is it best for me to just write the Java program to convert the data to our format?
Systems in use:
MariaDB for the database.
CentOS 7 for the operating system (if it really becomes an issue).
Java is being used with JPA and Spring Boot, with Hibernate where necessary.
You are creating a new table for each file.
I presume that the first 6 fields are always present and that you have 0 or more occurrences of the next 5 fields. If you are using SQL Server, I would approach it as follows:
Query the information_schema catalog to get a count of the fields in the table. If the count is 6 then no observations are present; if 11 columns, you have 1 observation; if 16, you have 2 observations; and so on.
Now that you know the number of observations, you can write some SQL that will loop over the observations and insert them into a child table with a link back to a parent table which holds the first 6 fields.
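A rough sketch of the column-count check; the schema and table names here are assumptions, and the same information_schema query works in MariaDB as well as SQL Server:

SELECT COUNT(*) AS column_count
FROM information_schema.columns
WHERE table_schema = 'weather_staging'   -- assumed schema/database name
  AND table_name = 'noaa_raw_import';    -- assumed staging table name

-- (column_count - 6) / 5 = number of 5-column observation groups in this file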
Apologies if my assumptions are way off.
-HTH

Merge two CSV and collate data

I have two CSV files, the first like so:
Book1:
ID,TITLE,SUBJECT
0001,BLAH,OIL
0002,BLAH,HAMSTER
0003,BLAH,HAMSTER
0004,BLAH,PLANETS
0005,BLAH,JELLO
0006,BLAH,OIL
0007,BLAH,HAMSTER
0008,BLAH,JELLO
0009,BLAH,JELLO
0010,BLAH,HAMSTER
0011,BLAH,OIL
0012,BLAH,OIL
0013,BLAH,OIL
0014,BLAH,JELLO
0015,BLAH,JELLO
0016,BLAH,HAMSTER
0017,BLAH,PLANETS
0018,BLAH,PLANETS
0019,BLAH,HAMSTER
0020,BLAH,HAMSTER
And then a second CSV with items associated with the first list, with ID being the common attribute between the two.
Book2:
ID,ITEM
0001,PURSE
0001,STEAM
0001,SEASHELL
0002,TRUMPET
0002,TRAMPOLINE
0003,PURSE
0003,DOLPHIN
0003,ENVELOPE
0004,SEASHELL
0004,SERPENT
0004,TRUMPET
0005,CAR
0005,NOODLE
0006,CANNONBALL
0006,NOODLE
0006,ORANGE
0006,SEASHELL
0007,CREAM
0007,CANNONBALL
0007,GUM
0008,SERPENT
0008,NOODLE
0008,CAR
0009,CANNONBALL
0009,SERPENT
0009,GRAPE
0010,SERPENT
0010,CAR
0010,TAPE
0011,CANNONBALL
0011,GRAPE
0012,ORANGE
0012,GUM
0012,SEASHELL
0013,NOODLE
0013,CAR
0014,STICK
0014,ORANGE
0015,GUN
0015,GRAPE
0015,STICK
0016,BASEBALL
0016,SEASHELL
0017,CANNONBALL
0017,ORANGE
0017,TRUMPET
0018,GUM
0018,STICK
0018,GRAPE
0018,CAR
0019,CANNONBALL
0019,TRUMPET
0019,ORANGE
0020,TRUMPET
0020,CHERRY
0020,ORANGE
0020,GUM
The real datasets are millions of records, so I'm sorry in advance for my simple example.
The problem I need to solve is getting the data merged and collated in a way where I can see which item groupings most commonly appear together on the same ID. (e.g. GRAPE,GUM,SEASHELL appear together 340 times, ORANGE and STICK 89 times, etc...)
Then I need to see if there is any change/deviation to the general results in common appearance when grouped by SUBJECT.
Tools I'm familiar with are Excel and SQL, but I also have PowerBI and Alteryx at my disposal.
Full disclosure: Not homework, or work, but a volunteer project, thus my unfamiliarity with this kind of data manipulation.
Thanks in advance.
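Since SQL is one of the tools available, here is a minimal pair-counting sketch against the sample data, assuming the two files are loaded as tables Book1 and Book2:

SELECT a.ITEM AS item1, b.ITEM AS item2, COUNT(*) AS times_together
FROM Book2 a
JOIN Book2 b
  ON a.ID = b.ID
 AND a.ITEM < b.ITEM          -- count each unordered pair once per ID
GROUP BY a.ITEM, b.ITEM
ORDER BY times_together DESC;

-- For the per-SUBJECT variant, join to Book1 on ID and add Book1.SUBJECT
-- to the SELECT and GROUP BY lists.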
An Alteryx solution:
1. Drag the two .csv files onto your canvas (seen as book1.csv and book2.csv in my picture); Alteryx will create "Input" tools for you.
2. Drag a "Join" tool on and connect the two .csv files to its inputs; select "ID" as the join field; unselect "Right_ID" as an output since it's merely a duplicate of "ID".
3. Drag a "Summary" tool on and connect the Join tool's output to the Summary tool's input; select all three of the outputs and add them as a "group by"... then add the ID column with a "count".
4. Drag a "Browse" tool on and connect the Summary tool's output to the Browse tool's input.
5. Run the workflow.
After all that, click on the Browse tool and you should see the output shown in my screenshot (just the first ten rows):
+1 for taking on a volunteer project - I think anyone who knows data can have a big impact in support of their favourite group or cause.
I would just pull the 2 files into Power BI as 2 separate tables (Get Data / From File). Create a relationship between the 2 tables based on ID (it might get auto-generated). It should be one to many.
Then I would add a Calculated Column to the Book1 table to concatenate the related ITEM values, e.g.:
Items =
CALCULATE (
    CONCATENATEX (
        DISTINCT ( 'Book2'[ITEM] ),
        'Book2'[ITEM],
        ", ",
        'Book2'[ITEM], ASC
    )
)
Now you can use that Items field in visuals (e.g. a Table), along with Count of ID to get the frequency.
Adding Subject to a copy of the table (e.g. to the Columns well of a Matrix) will produce your grouped scenario, or you could add a Subject Slicer.
As you will be comparing subsets of varying size, I would change Count of ID to Show value as - % of grand total.
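If you prefer an explicit measure over the "Show value as" option, a sketch of an equivalent percentage measure might look like this (assuming the count is simply a row count of Book1):

Pct of Grand Total =
DIVIDE (
    COUNTROWS ( 'Book1' ),
    CALCULATE ( COUNTROWS ( 'Book1' ), ALLSELECTED ( 'Book1' ) )
)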
A slightly different solution using Alteryx.
With this dataset, there are very few repeating 3 or 4 item groups. You can do the two item affinity analysis and get a probability of 3 or 4 item groups, or you can count the 3 and 4 item groups individually. I believe what you want is the latter as your probability of getting grapes with oranges may be altered by whether you have bananas in the cart or not.
Anyway, I did not join in the subject until after finding all of my combinations. I found all the combinations by taking the Cartesian join of two, then three, then four of the original set. I then removed all duplicates by ensuring items were always in alphabetical order in each row. I then counted occurrences of each combination. More joins can be added in the same pattern to count groups of 5,6,7...
Once you have the counts of occurrences, then I would join back with the subjects and perform this analysis on each group and compare to the overall results.
I'm supposed to disclose that I work for Alteryx.
First of all, if you are using Windows, just navigate to the directory which contains the CSV files and run the following command:
copy pattern newfileName.csv
Example:
copy *.csv merged.csv
Now you have created one CSV file. The file is too large to process at once, so depending on your programming language you can pick an appropriate approach: in Python you can use generators to process it line by line, or with pandas you can read it chunk by chunk, as in the sketch below.
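A minimal sketch of the chunked pandas approach; the file name and chunk size are assumptions, and it assumes rows for the same ID never straddle a chunk boundary (the sample data is sorted by ID):

import pandas as pd
from collections import Counter
from itertools import combinations

pair_counts = Counter()
# Read the ID,ITEM file in manageable chunks instead of loading it all at once
for chunk in pd.read_csv('book2.csv', dtype=str, chunksize=100000):
    # Within each chunk, count how often each pair of items shares the same ID
    for _, items in chunk.groupby('ID')['ITEM']:
        for pair in combinations(sorted(items), 2):
            pair_counts[pair] += 1

# Ten most frequent item pairs
print(pair_counts.most_common(10))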
I hope this helps you.

Merge multiple rows in fixed width file source into one row

I'm working with the craziest file format I've seen. It is fixed width and contains multiple record types (in the sense that each row may have different columns and widths). There's a file header, a trailer, and then a fixed number of rows that, when put together, make up one record. The problem I'm having is that there is nothing in the rows that tells you they belong to the same record other than their sort order and a row number attribute.
Example:
001 David Wellingsworth Mr.
002 312-555-5555 3060 W Maple St. Chicago
001 Jimothy Bogendath Dr.
002 563-555-5432 123 Main St. Davenport
My question is therefore: is it possible, without using a Script Component, to process a file like this? I understand the basic concept of how to handle disparate record types in a fixed width file (making use of conditional splits and substrings), but I can't get past how to join up all this data after the splits if the rows don't have identifiers.
If it helps, my question is basically this previous question but in reverse.
Possible, but with some work. I've worked with data like this, and this is the approach we used to solve it.
You will need to build a table that gives each record its own unique RecordID.
Create another table for your files that logs the filename and a unique FileID.
Link the FileID to the RecordID so you know which file each record came from.
Build all your sub-tables linking back to each unique RecordID.
Building your tables this way will give you:
A unique RecordID for each row (though there may be duplicates in the file, in your tables they are unique).
Knowing which file each record comes from.
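A minimal sketch of that table layout, assuming SQL Server and the sample rows above; all names and widths are illustrative assumptions:

CREATE TABLE ImportFile (
    FileId    INT IDENTITY(1,1) PRIMARY KEY,
    FileName  NVARCHAR(260) NOT NULL
);

CREATE TABLE ImportRecord (
    RecordId  INT IDENTITY(1,1) PRIMARY KEY,
    FileId    INT NOT NULL REFERENCES ImportFile(FileId),  -- which file the record came from
    FullName  NVARCHAR(100) NULL,   -- parsed from the 001 row
    Title     NVARCHAR(10)  NULL
);

CREATE TABLE ImportRecordContact (  -- sub-table for the 002 row
    RecordId  INT NOT NULL REFERENCES ImportRecord(RecordId),
    Phone     VARCHAR(20)   NULL,
    Street    NVARCHAR(100) NULL,
    City      NVARCHAR(50)  NULL
);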

Pentaho run contains list to file

I have this situation with 2 files.
Input file, 2 fields, 6 rows:
1|BANANA ON CAGES
2|APPLE CHIPS
3|SPORT CARS
4|PLANES
5|HOUSE
6|BOTTLES
List file, 2 fields, 4 rows:
BANANA|FRUIT
APPLE|FRUIT
CAR|TRANSPORT
PLANE|TRANSPORT
And I want this result:
Output file, 3 fields, 6 rows:
1|BANANA ON CAGES|FRUIT
2|APPLE CHIPS|FRUIT
3|SPORT CARS|TRANSPORT
4|PLANES|TRANSPORT
5|HOUSE
6|BOTTLES
It is mandatory for me to use PDI.
Join files (Cartesian product) is too slow.
The input file is around 1,000,000 rows and the list file around 300,000 rows.
Does your list file need to be dynamic, or is the content reasonably static?
If static, you can try a String Replace step with RegEx.
After setting the category, you would just need to filter where the category != the item description.
I don't know how it will perform with so many records though; I've only used this step with a few records so far.
EDIT: I've just seen that Join (Cartesian) has a REGEXP option. Maybe it's faster than CONTAINS (which I think you've been using?). That would be by far the better option to set up.
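As an illustration of the regex idea only (the patterns and the mapping are assumptions, not a tested PDI configuration), the replacement rules against the sample data could look something like:

.*BANANA.* -> FRUIT       (matches "BANANA ON CAGES")
.*APPLE.*  -> FRUIT       (matches "APPLE CHIPS")
.*CAR.*    -> TRANSPORT   (matches "SPORT CARS")
.*PLANE.*  -> TRANSPORT   (matches "PLANES")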
Good luck!