Use OpenRefine on a CSV without headers and with a varying number of columns per record - openrefine

I'm attempting to import into OpenRefine a CSV extracted from a NoSQL database (Cassandra), without headers and with a different number of columns per record.
For instance, fields are comma-separated and could look like the lines below:
1 - userid:100456, type:specific, status:read, feedback:valid
2 - userid:100456, status:notread, message:"some random stuff here but with quotation marks", language:french
There's a maximum number of columns, and no cleansing is required on their names.
How do I build a big Excel file I could mine using pivot tables?

If you can get JSON instead, Refine will ingest it directly.
If that's not a possibility, I'd probably do something along the lines of:
import as lines of text
split into two columns containing row ID and fields
split multi-valued cells in the fields column using comma as a separator
split the fields column into two columns using colon as a separator
use a key/value columnize on these two columns to unfold the data into columns (a rough pandas equivalent is sketched below)
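For illustration only, here is a minimal sketch of the same key/value unfolding done outside OpenRefine, using Python and pandas; the sample lines are taken from the question and the output file name is hypothetical:

import pandas as pd

# Example lines in the shape described in the question: a row id,
# then a comma-separated list of key:value fields.
lines = [
    '1 - userid:100456, type:specific, status:read, feedback:valid',
    '2 - userid:100456, status:notread, message:"some random stuff here but with quotation marks", language:french',
]
records = []
for line in lines:
    # Split off the row id from the fields part.
    row_id, fields = line.split(' - ', 1)
    record = {'row_id': row_id}
    # Split each "key:value" pair on the first colon only, so values containing
    # colons survive. NOTE: a value containing a quoted comma would need a real
    # CSV parser (the csv module) rather than this plain split.
    for pair in fields.split(', '):
        key, _, value = pair.partition(':')
        record[key.strip()] = value.strip().strip('"')
    records.append(record)
# Keys missing from a record become NaN, so the frame ends up with
# one column per distinct key across all rows.
df = pd.DataFrame(records)
print(df)
# df.to_excel('output.xlsx', index=False)  # needs openpyxl installed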

BigQuery - JSON_QUERY - Need to find a common path for multiple rows

I am new to BigQuery and Stack Overflow as well. I am dealing with multiple rows of JSON in a given column; they are all similar, but they have different IDs. I am trying to extract DAYS from them.
Currently I am using the following query to extract DAYS:
SELECT JSON_QUERY(Column,'$.data.extract.b2ed07ab-8f70-47e1-9550-270a23ec5e37.sections[0].TIME[0].DAYS') FROM DEFAULT_TABLE_LOCATION
This gives me what I want, but there are multiple rows of data where the only difference is the ID mentioned above, b2ed07ab-8f70-47e1-9550-270a23ec5e37.
Is there a way for me to use something like
SELECT JSON_QUERY(Column,'$.data.extract.????????-????-????-????-????????????.sections[0].TIME[0].DAYS') FROM DEFAULT_TABLE_LOCATION
to get the same data stored in different rows?
In summary, is it possible for me to have a common JSON path that extracts the values stored in multiple rows, given that the IDs all have the same character length, so that I can find the DAYS?
Here's a sample. I have omitted most of the irrelevant parts, as it was too big to paste here.
{"data":{"CONFDETAILS":[...Some XYZ code...},"extract":{"b2ed07ab-8f70-47e1-9550-270a23ec5e37":{.......},"entities":null,"sections":[{.......,"TIME":[{"DAYS":[false,false,false,false,false,false,true],"end":"23:59","start":"00:00"},{"DAYS":[true,false,false,false,false,false,false],"end":"23:59","start":"00:00"},{"DAYS":[false,true,false,false,false,false,false],"end":"23:59","start":"00:00"},{"DAYS":[false,false,true,false,false,false,false],"end":"23:59","start":"00:00"},{"DAYS":[false,false,false,true,false,false,false],"end":"23:59","start":"00:00"},{"DAYS":[false,false,false,false,true,false,false],"end":"23:59","start":"00:00"},{"DAYS":[false,false,false,false,false,true,false],"end":"23:59","start":"00:00"}],........}
To give some more perspective, the data in the following rows looks like this, with just the ID being different:
{"data":{"CONFDETAILS":[...Some XYZ code...},"extract":{"e520ab02-6ec1-4fdf-b810-0d1b74fc719c":{.......},"entities":null,"sections":[{.......,"TIME":[{"DAYS":[false,false,false,false,false,false,true],"end":"23:59","start":"00:00"},{"DAYS":[true,false,false,false,false,false,false],"end":"23:59","start":"00:00"},{"DAYS":[false,true,false,false,false,false,false],"end":"23:59","start":"00:00"},{"DAYS":[false,false,true,false,false,false,false],"end":"23:59","start":"00:00"},{"DAYS":[false,false,false,true,false,false,false],"end":"23:59","start":"00:00"},{"DAYS":[false,false,false,false,true,false,false],"end":"23:59","start":"00:00"},{"DAYS":[false,false,false,false,false,true,false],"end":"23:59","start":"00:00"}],........}

How do I load an entire file's content as text into a single column of an Azure SQL DW table?

I have some files in an Azure Data Lake Gen2 and I want to load each of them as a single nvarchar(max) column value in Azure SQL DW. The table in Azure SQL DW is a heap. I couldn't find any way to do it. All I see is delimited loading, which splits the content into multiple rows instead of one row in a single column. How do I achieve this?
I don't guarantee this will work, but try using COPY INTO and define non-present values for row and column delimiters. Make your target a single column table.
I would create a Source Dataset with a single column. You do this by setting the column delimiter to "No delimiter".
Next, go to the "Schema" tab and import the schema, which should create a single column called "Prop_0".
Now the data should come through as a single string instead of delimited columns.
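If neither COPY INTO nor the Data Factory dataset route works out, a hedged client-side alternative is to read the file yourself and insert it as a single nvarchar(max) value. The sketch below uses Python with pyodbc; the connection string, target table dbo.RawFiles, column FileContent, and the file name are all hypothetical:

import pyodbc

# Hypothetical connection details - replace with your own.
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=yourserver.database.windows.net;"
    "DATABASE=yourdw;UID=youruser;PWD=yourpassword"
)
# Read the whole file as one string - no row or column splitting at all.
with open("somefile.txt", "r", encoding="utf-8") as f:
    content = f.read()
with pyodbc.connect(conn_str) as conn:
    cursor = conn.cursor()
    # Target is assumed to be a heap table with a single nvarchar(max) column.
    cursor.execute("INSERT INTO dbo.RawFiles (FileContent) VALUES (?)", content)
    conn.commit()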

Is there any way to exclude columns from a source file/table in Pentaho using "like" or any other function?

I have a CSV file with more than 700 columns. I just want 175 of those columns to be inserted into an RDBMS table or a flat file using Pentaho (PDI). Now, the source CSV file has variable columns, i.e. columns can keep getting added or deleted, but they contain some specific keywords that remain constant throughout. I have the list of keywords present in the column names that have to be excluded, e.g. starts_with("avgbal_"), starts_with("emi_"), starts_with("delinq_prin_"), starts_with("total_utilization_"), starts_with("min_overdue_"), starts_with("payment_received_")
Any column which has the above keywords has to be excluded and should not pass on to my RDBMS table or flat file. Is there any way to remove the above columns by writing some SQL query in PDI? Selecting the specific 175 columns is not possible, as they are variable in nature.
I think your case is a good fit for metadata injection; you can refer to the example shared below:
https://help.pentaho.com/Documentation/7.1/0L0/0Y0/0K0/ETL_Metadata_Injection
Two things you need to be careful about:
maintain the list of columns you need to push in;
since the column names keep changing, you may also face issues with the valid columns that you do want to import or work with. To handle this, make sure you generate the metadata file every time, so that you are sure about the column names you want to push out from the flat file (a small sketch of this is given below).
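As a hedged illustration of how that metadata (the list of columns to keep) could be regenerated on each run, here is a small Python sketch. The prefixes come from the question; the input and output file names and the one-column metadata format are assumptions:

import csv

# Prefixes from the question: any column starting with one of these is excluded.
EXCLUDED_PREFIXES = (
    "avgbal_", "emi_", "delinq_prin_",
    "total_utilization_", "min_overdue_", "payment_received_",
)

def columns_to_keep(csv_path):
    # Read only the header row and drop columns matching the excluded prefixes.
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f))
    return [c for c in header if not c.startswith(EXCLUDED_PREFIXES)]

if __name__ == "__main__":
    keep = columns_to_keep("source.csv")  # hypothetical input file
    # Write the surviving column names as a one-column metadata file that a
    # metadata injection template could consume.
    with open("columns_to_keep.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["column_name"])
        writer.writerows([c] for c in keep)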

Split multiple fields to rows in Pentaho with the 'Split field to rows' step

I have an array of phone numbers in my Pentaho workflow. I am using the 'Split field to rows' step and I am able to split that array into rows. Now I want to split multiple fields to rows by using 'Split field to rows' only once. I tried giving a comma as a delimiter in the 'Field to split' input but I am getting an error.
How can I split multiple fields to rows by using the 'Split field to rows' step only once?
The Split field to rows step accepts only one input field to split.
You may chain a series of Split field to rows steps, one for each field.
Alternatively, you may concatenate your multiple fields into a single field before splitting it (illustrated below).
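As a hedged illustration of that concatenate-then-split idea outside PDI, here is what it looks like in Python; the field names and sample values are hypothetical:

# One input row with two multi-valued fields, comma-separated inside each field.
row = {
    "phone_numbers": "111-1111,222-2222",
    "emails": "a@example.com,b@example.com",
}
# Step 1: concatenate the multi-valued fields into a single field
# (the equivalent of a concat step placed before 'Split field to rows').
combined = ",".join([row["phone_numbers"], row["emails"]])
# Step 2: split that single field once, producing one output row per value.
output_rows = [{"value": v} for v in combined.split(",")]
for out in output_rows:
    print(out)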

Solr Transform to split data into multiple fields

I am trying to create a new core in Solr and import data into it. I am displaying this data on a Kendo grid and I need to display data from one DB column in two different fields (columns).
Here is the data in the column that I need to split into two different columns:
10680756-1000-RAT
I need to put the "10680756" in one column called A and the "1000" in another column called B.
This is my data config and schema:
<entity name="myTable" pk="testingId"
query="select * from myTable"
>
<field column="codeStatus" name="codeStatus"/> (this column has the data "10680756-1000-RAT")
I need to show it like:
Code Status          A          B
10680756-1000-RAT    10680756   1000
You can use an Update Request Processor to split the field into two separate fields (and still keep the original field if needed), for example by using a StatelessScriptUpdateProcessor and writing a small piece of JavaScript to split the field and add two new fields.
Another option is to use the PatternReplaceFilter or PatternReplaceCharFilter together with two copyField instructions. Use the PatternReplaceFilter to remove anything except for the part of the token you want to keep.
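A third, hedged option is to split the value before it ever reaches Solr, i.e. in whatever client feeds the index. The sketch below shows that split logic in Python; the field names A and B come from the question, everything else is assumed:

def split_code_status(code_status):
    # Split a value like '10680756-1000-RAT' into the two extra fields
    # requested in the question (A and B), keeping the original value too.
    parts = code_status.split("-")
    return {
        "codeStatus": code_status,
        "A": parts[0] if len(parts) > 0 else None,
        "B": parts[1] if len(parts) > 1 else None,
    }

print(split_code_status("10680756-1000-RAT"))
# {'codeStatus': '10680756-1000-RAT', 'A': '10680756', 'B': '1000'}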