Indexing data from multiple data sources (CSV, RDBMS) - indexing

We are working on indexing data from multiple data sources into a single collection. For that we specified the data source information in the data-config file and also updated managed-schema.xml, adding the fields from all the data sources and specifying a common unique key across all the sources.
Here is a sample config file.
<document>
<entity >
</entity>
</document>
Error Details:
Full Import failed:java.lang.RuntimeException:java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException: Invalid type for data source: Jdbc-2
Processing Document #1
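
The "Invalid type for data source" message from the DataImportHandler usually indicates that the type attribute of a dataSource element (here apparently "Jdbc-2") is not a DataSource class name such as JdbcDataSource or FileDataSource; the name attribute is what the entities should reference. For comparison, a minimal sketch of a data-config.xml with one JDBC source and one CSV source follows; every driver, URL, path, query and field name in it is a placeholder, not the actual configuration.

<dataConfig>
  <!-- type must be a DataSource class; name is what the entities refer to -->
  <dataSource name="jdbc-1" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/sourcedb"
              user="solr" password="secret"/>
  <dataSource name="csv-1" type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- RDBMS entity: the query must return the shared unique key (id) -->
    <entity name="db_rows" dataSource="jdbc-1"
            query="SELECT id, name, city FROM customers">
      <field column="id"   name="id"/>
      <field column="name" name="name"/>
      <field column="city" name="city"/>
    </entity>
    <!-- CSV entity: LineEntityProcessor reads one line at a time and a
         RegexTransformer splits it into the same schema fields -->
    <entity name="csv_rows" dataSource="csv-1"
            processor="LineEntityProcessor"
            url="/data/customers.csv"
            rootEntity="true"
            transformer="RegexTransformer">
      <field column="rawLine" regex="^(.*),(.*),(.*)$"
             groupNames="id,name,city"/>
    </entity>
  </document>
</dataConfig>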

Related

How to get an array from JSON in Azure Data Factory?

My actual (not properly working) setup has two pipelines:
Get API data to lake: for each row in a metadata table in SQL, call the REST API and copy the reply (JSON files) to the Blob data lake.
Copy data from the lake to SQL: for each file, auto-create a table in SQL.
The result is the correct number of tables in SQL, only the content of the tables is not what I hoped for: they all contain 1 column named odata.metadata and 1 entry, the link to the metadata.
If I manually remove the metadata from the JSON in the data lake and then run the second pipeline, the SQL table is what I want to have.
Have:
{ "odata.metadata":"https://test.com",
"value":[
{
"Key":"12345",
"Title":"Name",
"Status":"Test"
}]}
Want:
[
  {
    "Key": "12345",
    "Title": "Name",
    "Status": "Test"
  }
]
I tried to add $.['value'] in the API call. The result then was no odata.metadata line, but the array started with {value:, which resulted in an error when copying to SQL.
I also tried to use mapping (in the sink) to SQL. That gives the wanted result for the dataset I manually specified the mapping for, but it only goes well for datasets with the same number of columns in the array. I don't want to do the mapping manually for 170 calls...
Does anyone know how to handle this in ADF? For now I feel like the only solution is to add a Python step in the pipeline, but I hope for a somewhat standard ADF way to do this!
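(For reference, the Python step I am trying to avoid would only need to drop the wrapper and keep the array under value; a rough sketch, with placeholder file names rather than the real lake locations:)

import json

# Placeholder paths; in the real pipeline these would be the blob files
# written by the "Get API data to lake" pipeline.
with open("api_reply.json", "r", encoding="utf-8") as src:
    reply = json.load(src)

# Keep only the array under "value", dropping the odata.metadata wrapper.
with open("api_reply_flattened.json", "w", encoding="utf-8") as dst:
    json.dump(reply["value"], dst, indent=2)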
You can add another pipeline with a data flow to remove that content from the JSON file before copying the data to SQL, using the flatten formatter.
Before flattening the JSON file:
This is what I see when the JSON data is copied to the SQL database without flattening:
After flattening the JSON file:
I added a pipeline with a data flow to flatten the JSON file and remove the 'odata.metadata' content from the array.
Source preview:
Flatten formatter:
Select the required object from the input array.
After selecting the value object from the input array, you can see only the values under value in the Flatten formatter preview.
Sink preview:
File generated after flattening.
Copy the generated file as input to SQL.
Note: if your input file schema is not constant, you can enable Allow schema drift to allow schema changes.
Reference: Schema drift in mapping data flow

How can I load data into snowflake from S3 whilst specifying data types

I'm aware that it's possible to load data from files in S3 (e.g. CSV, Parquet or JSON) into Snowflake by creating an external stage with file format type CSV and then loading it into a table with 1 column of type VARIANT. But this needs a manual step to cast the data into the correct types in order to create a view which can be used for analysis.
Is there a way to automate this loading process from S3 so the table column data types are either inferred from the CSV file or specified elsewhere by some other means? (Similar to how a table can be created in Google BigQuery from CSV files in GCS with an inferred table schema.)
As of today, the single VARIANT column solution you are adopting is the closest you can get with Snowflake's out-of-the-box tools to achieve your goal, which, as I understand from your question, is to let the loading process infer the source file structure.
In fact, the COPY command needs to know the structure of the file it is going to load data from, through its FILE_FORMAT.
More details: https://docs.snowflake.com/en/user-guide/data-load-s3-copy.html#loading-your-data
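To make the explicit route concrete, here is a rough sketch of a file format, stage and typed target table; every object name, the bucket URL and the credentials are placeholders. The column types live on the target table, and COPY simply loads the CSV columns into them.

-- Hypothetical names throughout; the types are declared on the target table.
CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1;

CREATE OR REPLACE STAGE my_s3_stage
  URL = 's3://my-bucket/exports/'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>')
  FILE_FORMAT = my_csv_format;

CREATE OR REPLACE TABLE raw_orders (
  order_id   NUMBER,
  customer   STRING,
  order_date DATE,
  amount     NUMBER(10,2)
);

COPY INTO raw_orders
  FROM @my_s3_stage
  PATTERN = '.*orders.*[.]csv'
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');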

How to load multiple CSV files into multiple tables

I have multiple CSV files in a folder.
Example:
Member.csv
Leader.csv
I need to load them into database tables.
I have worked on it using a ForEach Loop Container, Data Flow Task, Excel Source and OLE DB Destination.
We can do it by using Expressions and Precedence Constraints, but how can I do it using a Script Task if I have more than 10 files? I got stuck with this one.
We have a similar issue, our solution is a mixture of the suggestions above.
We have a number of files types sent from our client on a daily basis.
These have a specific filename pattern (e.g. SalesTransaction20160218.csv, Product20160218.csv)
Each of these file types has a staging "landing" table of the structure you expect.
We then have a .net script task that takes the filename pattern and loads that data into a landing table.
There are also various checks done within the CSV parser - matching the number of columns, some basic data validation - before loading into the landing table.
We are not good enough .NET programmers to be able to dynamically parse an unknown file structure, create the SQL table and then load the data in. I expect it is feasible; after all, that is what the SSIS Import/Export Wizard does (with some manual intervention).
As an alternative to this (the process is quite delicate), we are experimenting with an HDFS data landing area, which allows us to use analytic tools like R to parse the data within HDFS. After that we utilise PIG to load the data into SQL.
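To illustrate the routing idea behind that script task (this is not our actual code, and the folder, prefixes and table names below are made up; it is sketched in Python for brevity, the real thing lives in .NET), the logic amounts to matching each filename prefix to its landing table before handing the file to the loader:

import glob
import os

# Placeholder routing table: filename prefix -> staging ("landing") table.
ROUTES = {
    "SalesTransaction": "stg.SalesTransaction",
    "Product": "stg.Product",
}

def route_files(folder):
    """Yield (file path, landing table) pairs for files matching a known prefix."""
    for path in glob.glob(os.path.join(folder, "*.csv")):
        name = os.path.basename(path)
        for prefix, table in ROUTES.items():
            # e.g. SalesTransaction20160218.csv starts with "SalesTransaction"
            if name.startswith(prefix):
                yield path, table
                break

if __name__ == "__main__":
    for path, table in route_files(r"C:\Incoming"):
        # In the real .NET script task this is where the CSV is parsed, the
        # column count checked, and the rows bulk-loaded into the landing table.
        print(f"{path} -> {table}")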

MS SQL Server loading XML Source with multiple elements in database table

I'm relatively new to SQL Server, so please bear with me!
I have an XML file with multiple elements that I am trying to load into a database table; see below:
<ReportStatus>
  <Change>
    <Signatory>XXX</Signatory>
    <Status>XXX</Status>
    <StatusTime>XXX</StatusTime>
  </Change>
  <Message>
    <CreationTime>XXX</CreationTime>
    <ID>XXX</ID>
    <Version>0.0.1</Version>
  </Message>
  <Request>
    <References>
      <Reference>
        <RefId>Reference</RefId>
        <RefValue>XXX</RefValue>
      </Reference>
    </References>
    <RequestNr>XXX</RequestNr>
    <Service>
      <Name>XXX</Name>
      <Type>XXX</Type>
    </Service>
  </Request>
</ReportStatus>
The flow will only let me load one element at a time. I have tried loading the data into the table as such:
However, this loads the data into the table in 2 separate records:
Is there a way to consolidate these 2 elements before placing them into the table so the entire record is loaded as a single record?
Thanks in advance!
EDIT
Here's the table with the matching fields:
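If consolidating the two outputs inside the data flow proves awkward, one alternative (a different technique from the SSIS XML Source) is to shred the XML once in T-SQL so that the Change and Message values arrive as a single row. A rough sketch against the sample above; the column types and the destination step are guesses:

DECLARE @x XML = N'
<ReportStatus>
  <Change>
    <Signatory>XXX</Signatory>
    <Status>XXX</Status>
    <StatusTime>XXX</StatusTime>
  </Change>
  <Message>
    <CreationTime>XXX</CreationTime>
    <ID>XXX</ID>
    <Version>0.0.1</Version>
  </Message>
</ReportStatus>';

-- One row per ReportStatus, combining fields from both child elements.
SELECT
    r.value('(Change/Signatory)[1]',     'nvarchar(100)') AS Signatory,
    r.value('(Change/Status)[1]',        'nvarchar(100)') AS [Status],
    r.value('(Change/StatusTime)[1]',    'nvarchar(100)') AS StatusTime,
    r.value('(Message/CreationTime)[1]', 'nvarchar(100)') AS CreationTime,
    r.value('(Message/ID)[1]',           'nvarchar(100)') AS MessageID,
    r.value('(Message/Version)[1]',      'nvarchar(20)')  AS [Version]
FROM @x.nodes('/ReportStatus') AS t(r);
-- The single resulting row can then be inserted into the destination table.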

Making XML according to XML Schema from data taken from SQL VIEW

Soon I will be building an XML file from my database according to a given XML schema in an .xsd file. I have done it before, but only on simple tables with elements that have no children. Now the schema is not that simple: an element has children, and those also have their own children. I don't know how I can map data from my database into XML using that XML schema. For simple data it was very easy: I loaded the schema, used it while creating the XML file, and gave the fields from the database the names of the tags in the .xsd file. What happens when we have complex elements in the schema? How can I map a field from the database into 3rd-level elements?
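If the database is SQL Server, nested elements can be produced straight from a view with nested FOR XML PATH subqueries, and the result can then be validated against the .xsd. A minimal sketch with made-up view and column names; the element names are the part that has to match the schema:

SELECT
    o.OrderId                    AS 'OrderId',
    o.OrderDate                  AS 'OrderDate',
    -- Nested subquery produces the 3rd-level <Item> children under <Items>.
    (SELECT i.ProductName        AS 'Name',
            i.Quantity           AS 'Quantity'
     FROM   dbo.vOrderItems AS i
     WHERE  i.OrderId = o.OrderId
     FOR XML PATH('Item'), TYPE) AS 'Items'
FROM dbo.vOrders AS o
FOR XML PATH('Order'), ROOT('Orders');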