Issue when importing dataset: Rows that contain more elements/columns than the previous row are divided between two rows - dataframe

For a project I receive datasets in the form of text files. These text files are generated by the measuring software from a machine. The data in the files is seperated by spaces and has no header, example of a row:
Mo 27.06.2022 12:01:11 MP2 mv:(mean. 5s): 4,824 mg/mü org.C
When loading this data using
my_data <- read.table("File.txt", header = FALSE, sep = "", dec = ",", fill=TRUE, na.strings=c("","NA"))
I obtain 9 columns in the following format (example), as intended.
|Mo|27.06.2022|12:01:11|MP2|mv:(mean.| 5s):| 4,824| mg/mü| org.C|
However, sometimes the data set starts with a notification from the machine (example):
Mo 27.06.2022 11:42:04 {SE14} service requestend
When this happens, the 'regular' 9 column rows are seperated between two rows (example):
Row 1: Mo|27.06.2022|11:58:26|MP1|mv:(mean.| 5s):|
Row 2: 7,858| mg/mü |org.C
How do I tell R to not perform this seperation between two rows? As I understand, it does this because earlier in the text file, an input of only 6 columns is recognized.
This is a script that we will use for years to come, so help is greatly appreciated!
I've tried removing the fill function from the read.table function, I have tried removing the na.strings, and ofcourse looking for answers on stack overflow, but was not able to encounter this specific problem.

Related

Data Factory Copy Activity: Error found when processing 'Csv/Tsv Format Text' source 'xxx.csv' with row number 6696: found more columns than expected

I am trying to perform a simply copy activity in Azure Data Factory from CSV to SQL Table, but I'm getting the following error:
{
"errorCode": "2200",
"message": "ErrorCode=DelimitedTextMoreColumnsThanDefined,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error found when processing 'Csv/Tsv Format Text' source 'organizations.csv' with row number 6696: found more columns than expected column count 41.,Source=Microsoft.DataTransfer.Common,'",
"failureType": "UserError",
"target": "Copy data1",
"details": []
}
The copy activity is as follows
Source
My Sink is as follows:
As preview of the data in source is as follows:
This seems like a very straight forward copy activity. Any thoughts on what might be causing the error?
My row 6696 looks like the following:
3b1a2e5f-d08b-166b-4b91-eb53009b2377 Compassites Software Solutions organization compassites-software https://www.crunchbase.com/organization/compassites-software 318375 17/07/2008 10:46 05/12/2022 12:17 company compassitesinc.com http://www.compassitesinc.com IND Karnataka Bangalore "Pradeep Court", #163/B, 6th Main 3rd Cross, JP Nagar 3rd phase 560078 operating Custom software solution experts Big Data,Cloud Computing,Information Technology,Mobile,Software Data and Analytics,Information Technology,Internet Services,Mobile,Software 01/11/2005 51-100 info#compassitesinc.com 080-42032572 http://www.facebook.com/compassites http://www.linkedin.com/company/compassites-software-solutions http://twitter.com/compassites https://res.cloudinary.com/crunchbase-production/image/upload/v1397190270/c3e5acbde40f36eaf4f8c6f6eda3f803.png company
No commas
As the error message indicates, there is a record at row number 6696 where there is a value containing , as a character in it.
Look at the following demonstration where I have taken a similar case. I have 3 columns in my source. The data looks as shown below:
When I run use similar dataset settings and read these values, the same error would be thrown.
So, the value T1,OG is being considered as if they belong to 2 different columns since they have dataset delimiter within the value.
Such values would throw an error as it is ambiguous to read. One way to avoid this is to enclose such values with quote character (double quote in this case).
Now when I run the copy activity, it would give the desired output.
The table data would look like this:

Open Refine: Exporting nested XML with templating

I have a question regarding the templating option for XML in Open Refine. Is it possible to export data from two columns in a nested XML-structure, if both columns contain multiple values, that need to be split first?
Here's an example to illustrate better what I mean. My columns look like this:
Column1
Column2
https://d-nb.info/gnd/119119110;https://d-nb.info/gnd/118529889
Grützner, Eduard von;Elisabeth II., Großbritannien, Königin
https://d-nb.info/gnd/1037554086;https://d-nb.info/gnd/1245873660
Müller, Jakob;Meier, Anina
Each value separated by semicolon in Column1 has a corresponding value in Column2 in the right order and my desired output would look like this:
<rootElement>
<recordRootElement>
...
<edm:Agent rdf:about="https://d-nb.info/gnd/119119110">
<skos:prefLabel xml:lang="zxx">Grützner, Eduard von</skos:prefLabel>
</edm:Agent>
<edm:Agent rdf:about="https://d-nb.info/gnd/118529889">
<skos:prefLabel xml:lang="zxx">Elisabeth II., Großbritannien, Königin</skos:prefLabel>
</edm:Agent>
...
</recordRootElement>
<recordRootElement>
...
<edm:Agent rdf:about="https://d-nb.info/gnd/1037554086">
<skos:prefLabel xml:lang="zxx">Müller, Jakob</skos:prefLabel>
</edm:Agent>
<edm:Agent rdf:about="https://d-nb.info/gnd/1245873660">
<skos:prefLabel xml:lang="zxx">Meier, Anina</skos:prefLabel>
</edm:Agent>
...
</recordRootElement>
<rootElement>
(note: in my initial posting, the position of the root element was not indicated and it looked like this:
<edm:Agent rdf:about="https://d-nb.info/gnd/119119110">
<skos:prefLabel xml:lang="zxx">Grützner, Eduard von</skos:prefLabel>
</edm:Agent>
<edm:Agent rdf:about="https://d-nb.info/gnd/118529889">
<skos:prefLabel xml:lang="zxx">Elisabeth II., Großbritannien, Königin</skos:prefLabel>
</edm:Agent>
)
I managed to split the values separated by ";" for both columns like this
{{forEach(cells["Column1"].value.split(";"),v,"<edm:Agent rdf:about=\""+v+"\">"+"\n"+"</edm:Agent>")}}
{{forEach(cells["Column2"].value.split(";"),v,"<skos:prefLabel xml:lang=\"zxx\">"+v+"</skos:prefLabel>")}}
but I can't find out how to nest the splitted skos:prefLabel into the edm:Agent element. Is that even possible? If not, I would work with seperate columns or another workaround, but I wanted to make sure, if there's a more direct way before.
Thank you!
Kristina
I am going to expand the answer from RolfBly using the Templating Exporter from OpenRefine.
I do have the following assumptions:
There is some other column left of Column1 acting as record identifying column (see first screenshot).
The columns actually have some proper names
The columns URI and Name are the only columns with multiple values. Otherwise we might produce empty XML elements with the following recipe.
We will use the information about records available via GREL to determine whether to write a <recordRootElement> or not.
Recipe:
Split first Name and then URI on the separator ";" via "Edit cells" => "Split multi-valued cells".
Go to "Export" => "Templating..."
In the prefix field use the value
<?xml version="1.0" encoding="utf-8"?>
<rootElement>
Please note that I skipped the namespace imports for edm, skos, rdf and xml.
In the row template field use the value:
{{if(row.index - row.record.fromRowIndex == 0, '<recordRootElement>', '')}}
<edm:Agent rdf:about="{{escape(cells['URI'].value, 'xml')}}">
<skos:prefLabel xml:lang="zxx">{{escape(cells['Name'].value, 'xml')}}</skos:prefLabel>
</edm:Agent>
{{if(row.index - row.record.fromRowIndex == row.record.rowCount - 1, '</recordRootElement>', '')}}
The row separator field should just contain a linebreak.
In the suffix field use the value:
</rootElement>
Disclaimer: If you're keen on using only OpenRefine, this won't be the answer you were hoping for. There may be ways in OR that I don't know of. That said, here's how I would do it.
Edit The trick is to keep URL and literal side by side on one line. b2m's answer below does just that: go from right to left splitting, not from left to right. You can then skip steps 2 and 3, to get the result in the image.
split each column into 2 columns by separator ;. You'll get 4 columns, 1 and 3 belong together, and 2 and 4 belong together. I'm assuming this will be the case consistently in your data.
export 1 and 3 to a file, and export 2 and 4 to another file, of any convenient format, using the custom tabular exporter.
concatenate those two files into one single file using an editor (I use Notepad++), or any other method you may prefer. Several ways to Rome here. Result in OR would be something like this.
You then have all sorts of options to put text strings in front, between and after your two columns.
In OR, you could use transform on column URL to build your XML using the below code
(note the \n for newline, that's probably just a line feed, you may want to use \r\n for carriage return + line feed if you're using Windows).
'<edm:Agent rdf:about="' + value + '">\n<skos:prefLabel xml:lang="zxx">' + cells.Name.value + '</skos:prefLabel>\n</edm:Agent>'
to get your XML in one column, like so
which you can then export using the custom tabular exporter again. Or instead you could use Add column based on this column in a similar manner, if you want to retain your URL column.
You could even do this in the editor without re-importing the file back into OR, but that's beyond the scope of this answer.

Group Pandas: How to concat or merge/join/append two csv files with same index but different extensions in grouped data?

I'd like to concat or merge or append/join two csv files with the same indix ID but different extensions on the same ID. The data are grouped by ID also. The 1st file looks like this:
ID,year,age
810006862,2000,49
810006862,2001,
810006862,2002,
810006862,2003,52
810023112,2003,27
810023112,2004,28
810023112,2005,29
810023112,2006,30
810033622,2000,24
810033622,2001,25
and the 2nd file looks like this:
ID,year,from1,to1
810006862,2002,15341,15705
810006862,2003,15706,16070
810006862,2004,16071,16436
810006862,2005,,
810023112,2000,14610,14975
810023112,2001,14976,15340
810023112,2003,15825,16523
810033622,2000,13211,14876
810033622,2001,14761,14987
I have set index of ID for both files after reading it to dataframe, and then concat them together, but it gets error message of "ValueError: Shape of passed values is (25, 2914), indices imply (25, 251)"
I've tried the following codes:
sp = pd.read_csv('sp1.csv')
sp = sp.set_index('ID')
op = pd.read_csv('op1.csv')
op = op.set_index('ID')
ff = pd.concat([sp, op], join = 'outer', sort = False, axis = 1)
I've also tried concat the two files together without setting up index, and the result seemed having correct rows, but the horizontal values were incorrect related.
I've also tried merge as well, but it came with many unnecessary duplicated rows within each group. Since each group has different year and age, I found quite difficult to delete those newly generated rows using this method.
full = pd.merge(sp, op, on = 'ID', how = 'outer', sort = False)
Maybe somebody can suggest ways to easily delete these duplicates, and this will also work for me, because the merged file became so huge! Thanks in advance!
Expected results would be including all different values from both csv files. It is somewhat like this:
ID,year,age,from1,to1
810006862,2000,49,,
810006862,2001,,,
810006862,2002,,15341,15705
810006862,2003,52,15706,16070
810006862,2004,,16071,16436
810006862,2005,,,
810023112,2000,,14610,14975
810023112,2001,,14976,15340
810023112,2003,27,15825,16523
810023112,2004,28,,
810023112,2005,29,,
810023112,2006,30,,
810033622,2000,24,13211,14876
810033622,2001,25,14761,14987
I've searched online for similar posts along for quite some time, but unable to solve my problem. Can anybody offer any clue how to do this? Thanks a lot!

How do I use values within a list to specify changing selection conditions and export paths?

I'm trying to split a large csv data using a condition. To automate this process, I'm pulling a list of unique conditions from a column in the data set and wanting to use this list within a loop to specify condition and also rename the export file.
I've converted the array of values into a list and have tried fitting my function into a loop, however, I believe syntax is the main error.
# df1718 is my df
# znlist is my list of values (e.g. 0 1 2 3 4)
# serial is specified at the top e.g. '4'
for x in znlist:
dftemps = df1718[(df1718.varname == 'RoomTemperature') & (df1718.zone == x)]
dftemps.to_csv('E:\\path\\test%d_zone(x).csv', serial)
So in theory, I would like each iteration to export the data relevant to the next zone in the list and the export file to be named test33_zone0.csv (for example). Thanks for any help!
EDIT:
The error I am getting is: "delimiter" must be string, not int
So if the error is in saving the file try this
dftemps.to_csv('E:\\path\\test{}_zone{}.csv'.format(str(serial),str(x)))

In SSRS Report Builder, the results that are being delivered are in a very long string, I need to be able to extract certain substrings

One result, under one field of data contains the below, it's extremely long. I need to be able to pull out certain substrings into seperate columns.
Desired Result:
1) email addresses that it's being sent to, identified by "TO": gregory.dettorre#cardinalhealth.com; scott.ballard#cardinalhealth.com
2) email addresses that it's being CC'd to, identified by "CC":
GMB-OptiFreight-CCBABR#cardinalhealth.com
3) email addresses that it's being CC'd to, identified by "ReplyTo":
OptiFreightcustomercare#cardinalhealth.com
4) Include report: True
5) Render Format: Excel
6) Subject: 13 Week Volume File - LifePoint Health - Brentwood, TN
Result:
"<ParameterValues><ParameterValue><Name>TO</Name>
<Value>gregory.dettorre#cardinalhealth.com;
scott.ballard#cardinalhealth.com</Value></ParameterValue><ParameterValue>
<Name>CC</Name><Value>GMB-OptiFreight-CCBABR#cardinalhealth.com</Value>
</ParameterValue><ParameterValue><Name>ReplyTo</Name>
<Value>OptiFreightcustomercare#cardinalhealth.com</Value></ParameterValue>
<ParameterValue><Name>IncludeReport</Name><Value>True</Value>
</ParameterValue><ParameterValue><Name>RenderFormat</Name>
<Value>EXCEL</Value></ParameterValue><ParameterValue><Name>Subject</Name>
<Value>13 Week Volume File - LifePoint Health - Brentwood, TN</Value>
</ParameterValue><ParameterValue><Name>Comment</Name><Value>Please see the
attached 13 week volume file and let us know if you have any questions.
OptiFreightcustomercare#cardinalhealth.com</Value></ParameterValue><ParameterValue><Name>IncludeLink</Name><Value>False</Value></ParameterValue><ParameterValue><Name>Priority</Name><Value>NORMAL</Value></ParameterValue></ParameterValues>"
Here there is an answered question on splitting strings, using SUBSTRING and CHARINDEX in SSRS. You Get the indexes of 2 delimiters (e.g. "TO" and "CC"), and by applying SUBSTRING between these 2 delimiters you get the value that you wanted.
Also, the best practice would probably be splitting the data in the dataset (e.g. SQL query) itself, instead of doing so in the report itself.