Underscore and dash in column names after JSON import - OpenRefine

I've been using OpenRefine very successfully for a couple of years, working solely with CSV (and TSV) source files. Recently I had some tables from an SQL database that I wanted to bring into OpenRefine, so I exported them as JSON and then used OpenRefine's JSON import feature. It works beautifully, except that the column names all begin with "_ - ". For example, my JSON records start with
{"ID":"97247",
and OpenRefine made the first column name "_ - ID" instead of just "ID" (which I'd prefer - I know I can edit them later, but I have hundreds of fields). I can't see any settings on the parsing page that might help with this. Does anyone know if there is a way to import without the extra characters (or if there's an explanation for the underscore-dash)? I'm considering submitting a feature request, but I thought I'd check to see what other users may know.

This is a known issue.
There has also been a proposal to switch to a standard representation for JSON paths.
Feel free to comment on either ticket to indicate which solution you would prefer.

Related

Clean unstructured place name to a structured format

I have around 300k rows of unstructured data, as in the screenshot below. I'm trying to use Google Refine or OpenRefine to correct it, but I'm unable to find a proper way to do this; I'm new to this tool. Any help would be greatly appreciated. Also, this tool is quite slow at processing 300k records: whenever I try something out, it takes a lot of time to process and give an output.
Or please suggest any other open-source tools and techniques to do this.
As Owen said in the comments, your question is probably too broad and cannot receive an acceptable answer. We can only provide you with a general procedure to follow.
In OpenRefine, you'll need to create a column based on the messy column and apply transformations to delete unwanted characters. You'll have to use regular expressions, but for that it's necessary to be able to identify patterns. It's not clear to me why the "ST" of "Nat.secu ST." is important, but not the "US" in "Massy Intertech US", or even the "36" in "Plowk 36" (Google doesn't know this word, so I'm not sure it's an organisation name).
On the basis of your fifteen lines, however, some clear patterns seem to stand out. For example, it looks like you'll have to remove the tokens (character sequences without spaces) at the end of the string that contain a #. For that, the GREL formula in OpenRefine could look like this:
value.trim().replace(/\b\w+#\w+\b$/,'')
Here is a screencast if it's not clear to you.
But sometimes a company name may contain a #, in which case you will need to create more complex rules. For example, remove the token only if the string contains more than two words.
if(value.split(' ').length() > 2, value.replace(/\b\w+#\w+\b$/, ''), value)
And so on for the other patterns that you'll find (for example, any number sequence at the end that contains more than four digits and one hyphen between them).
Feel free to check out the OpenRefine documentation in case of doubt.

.Net Parsing Fixed Width Data... From a Concatenated, Single, Fixed-Width Column

I was bored and looking at old code that runs like molasses on a cold day. I found a group of tables in our accounting system - each with 500,000 records of ~20 data points - that use a single column of concatenated, fixed-width values instead of separate columns. (Fixing the tables isn't an option.) An old .NET ETL project grabs all the records, does a bunch of substrings on each record to set an object's corresponding attributes, then sends the object to merge with production data via a stored proc.
The way it is working is fine. It works. And, to be perfectly honest, I doubt I'll be given the go-ahead to fix it even if I come up with a better solution, but I was curious to see if anyone knew of a better way of doing this, because it's not entirely unlikely that I'll face a situation like this in the future.
I was thinking that if there were a way to use the TextFieldParser to parse a static string instead of a file/stream, that might be a valid idea. Or, instead, I could write the entire table to a text file and then use the TextFieldParser to send data to the SProc. http://www.dotnetperls.com/textfieldparser does show that TextFieldParser is quite a bit faster than Split, which I would assume is comparable to the string manipulation our project is currently doing with Substring. So there may be something to that idea.
Or perhaps the whole old project should be dumped for a shiny new SSIS project. Would it also have to write the records to a flat file before importing into SQL? Or can it import directly from the table?
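One alternative worth sketching is doing the substring work set-based in T-SQL rather than in the .NET loop; the table name, column offsets, and target columns below are all hypothetical:
INSERT INTO dbo.ProductionStaging (AccountNo, TxnDate, Amount)
SELECT
    RTRIM(SUBSTRING(r.RawData,  1, 10)),                  -- positions  1-10: account number
    CAST(SUBSTRING(r.RawData, 11,  8) AS date),           -- positions 11-18: yyyymmdd date
    CAST(SUBSTRING(r.RawData, 19, 12) AS decimal(12, 2))  -- positions 19-30: amount (assumes an explicit decimal point)
FROM dbo.RawLegacy AS r;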
Thank you in advance!

Is it feasible to split data from differently formatted csv files in MS-SQL into several tables with one row per field of a file?

I have only found answers about how to import CSV files into the database, for example as a blob or as a 1:1 representation of the table you are importing it into.
What I need is a little different: my team and I are tracking everything we do in a database. A lot of these tasks produce logfiles, benchmark results, etc., which are stored in CSV format. The number of columns is far from consistent, and the data can be completely different from file to file, e.g. it could be a log from Fraps with frame times in it, a log of CPU temperatures over a period of time, or something completely different.
Long story short, I came up with an idea, but - being far from an SQL pro - I am not sure if it makes sense or if there is a more elegant solution.
Does this make sense to you? Roughly, the idea is one row per field value of a file, along the lines of the sketch below.
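Here is a simplified sketch of what I mean (table and column names are hypothetical; the core is the ColumnId -> Data shape):
CREATE TABLE ImportFile (
    FileId     INT IDENTITY PRIMARY KEY,
    FileName   NVARCHAR(260) NOT NULL
);

CREATE TABLE ImportColumn (
    ColumnId   INT IDENTITY PRIMARY KEY,
    FileId     INT NOT NULL REFERENCES ImportFile (FileId),
    ColumnName NVARCHAR(128) NOT NULL    -- header text from the CSV
);

CREATE TABLE ImportValue (
    ColumnId INT NOT NULL REFERENCES ImportColumn (ColumnId),
    RowNo    INT NOT NULL,               -- line number within the file
    Data     NVARCHAR(MAX) NULL,         -- raw field value as text
    PRIMARY KEY (ColumnId, RowNo)
);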
We also need to deal with a lot of data, so please also give me your opinion on whether this is feasible with something like 200 files per day, each of which can easily have a couple of thousand rows.
The purpose of all this is to let us generate reports from the stored data and perform analysis on it, e.g. view it on a webpage in a graph or do calculations with it.
I'm limited to MS-SQL in this case, because that's what the current (quite complex) database is and I'm just adding a new schema with that functionality to it.
Currently we just archive the files on a RAID and store a link to them in the database, so everyone who wants to do magic with the data needs to download every file they need and then use R or Excel to create a visualization.
Have you considered a column of XML data type for the file data, as an alternative to the ColumnId -> Data structure? SQL Server provides a special dedicated XML index (over the entire XML structure), so your data can be fully indexed no matter what CSV columns you have. You will have far fewer records in the database to handle (an entire CSV file becomes a single XML field value), and there are good XML query options to search by values and attributes of the XML type.
For that you will need to translate the CSV to XML, but you would have to parse it either way ...
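A minimal sketch of the idea (hypothetical table and element names, assuming each CSV row is translated into a <row> element with one attribute per column):
CREATE TABLE LogFile (
    LogFileId INT IDENTITY PRIMARY KEY,
    FileName  NVARCHAR(260) NOT NULL,
    Content   XML NOT NULL
);

-- one dedicated index over the entire XML structure
CREATE PRIMARY XML INDEX IX_LogFile_Content ON LogFile (Content);

-- pull a single "column" back out of every stored file
SELECT f.FileName,
       r.value('@col1', 'NVARCHAR(100)') AS Col1
FROM LogFile AS f
CROSS APPLY f.Content.nodes('/rows/row') AS t(r);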
Not that your plan won't work, I am just giving an idea :)
=========================================================
Update with some online info:
An article from Simple Talk: The XML Methods in SQL Server
Microsoft documentation for nodes(), with various use-case samples: nodes() Method (xml Data Type)
Microsoft documentation for value(), with various use-case samples: value() Method (xml Data Type)

Solr 5.3 implementation processes docs but doesn't return results

I have recently set up a local instance of Solr 5.3 in an effort to get it going for my company. As an initial test case I've set up a Data Import Handler (DIH) that indexes PDFs stored within a file directory. When I execute the full import in the admin tool, the DIH processes all the files within the directory, and I'm able to run a general query (*:*) which returns all indexed fields for every record in the index.
When I switch to a specific query using a word definitely contained within the files, however, Solr returns no results. What connection am I not making here?
I can provide excerpts from the schema, solrconfig, and custom data config if needed, but I don't want to oversaturate this post.
The answer I came up with involved a simple newbie mistake combined with something I wasn't anticipating.
1) First, I didn't have my field set to indexed="true". I set that. Yeesh, it stinks being new to this!
2) I needed to make a change to solrconfig.xml for the core in question. Thanks to this article, I was able to determine that I needed to add a default field to the /select requestHandler. Uncommenting the relevant line in solrconfig and changing the field name did the trick; I no longer need to supply the field name in df to return results.
My carryover question for anyone coming across this question in the future is whether this latter point is the proper way to go about using default fields. I see in schema.xml that defaultSearchField is deprecated (or heading that direction) in 5.3.0. So is it all right to define df in solrconfig instead?

Removing privacy data from a database?

Say that I needed to share a database with a partner. Obviously I have customer information in that database. Short of going through and identifying every column that contains private information and writing a custom script to 'scrub' the data, is there any tool or script which can scrub the data but keep the format intact (for example, if a string is 5 characters, it would stay 5 characters, only scrubbed)?
If not, how would you accomplish something like this, preferably in TSQL?
You may consider sharing only views: create views that hide the data you don't want to share.
Example:
CREATE VIEW v_customer
AS
SELECT
    NAME,
    LEFT(CreditCard, 5) + '****' AS CreditCard -- or don't show this column at all
    ....
FROM customer
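Then you can grant the partner access to the view only, never to the underlying table - for example (hypothetical principal name):
GRANT SELECT ON v_customer TO partner_login;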
Firstly, I need to declare a professional interest: I work for IBM, which has tools that do exactly this.
Step 1. Ensure you identify all the PII (Personally Identifiable Information). When sharing database information, the obvious column names like "name" are typically found, but you also need to find the "hidden" data, where either the data is embedded in a standard format (e.g. string-name-string under a column name like "reference code") or it sits in free-format text fields. As you have seen, this is not going to be an easy job unless you automate it. The tool for this is InfoSphere Discovery.
Step 2. Decide what form the "scrubbed" data needs to take. Changing name fields to random characters causes problems during testing, as users focus on text errors rather than functional failures, so change names to real-looking but fictitious ones. Credit card information often needs to be "valid": by that I mean it needs a valid prefix, say 49XX, while the rest is an invalid sequence. Finally, you need to ensure that every instance of the change is propagated through the database to maintain consistency. The tool for this is Optim Test Data Management with the Data Privacy option.
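As a rough illustration of those last two points (a sketch only, not how Optim works; the table and column names are hypothetical, and it assumes 16-digit card numbers stored as text):
UPDATE customers
SET CreditCard = LEFT(CreditCard, 4)
              + RIGHT(REPLICATE('0', 12) + CAST(ABS(CHECKSUM(CreditCard)) AS VARCHAR(12)), LEN(CreditCard) - 4)
-- the derived part is deterministic: the same input always produces the
-- same output, so every occurrence of a given card number across the
-- database is scrubbed to the same (invalid) digit sequence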
The two tools integrate to give a full data privacy solution.
Based on the original question, it seems you need the fields to be the same length, but not in a "valid" format? How about:
UPDATE customers
SET email = REPLICATE('z', LEN(email))
-- additional fields as needed
Copy/paste and rename tables/fields as appropriate. I think you're going to have a hard time finding a tool that's less work, unless your schema is very complicated, or my formatting assumptions are incorrect.
I don't have an MSSQL database in front of me right now, but you can also find all of the string-like columns by something like:
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN ('...', '...')
I don't remember the exact values you need to compare against, but if you run the query and look at what's there, they should be pretty self-explanatory.
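For reference, the standard SQL Server character types are char, nchar, varchar, nvarchar, text, and ntext, so the filled-in filter would presumably look like:
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar', 'text', 'ntext')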