Can Solr handle data import from tables like table1,table2,....tableN?


I have a MySQL database whose data tables are named according to a sharding rule: every table shares the same prefix, e.g. the_table_, followed by a numeric suffix, so the table names are the_table_1, the_table_2, the_table_3, and so on. When we want to select a row, we first compute the table suffix with select mod(primary_key_id, 64)+1. That is the sharding rule; 64 is the base number, which means we have 64 tables.
Now we want to use Solr to index the data in these tables, but Solr's data-config.xml does not support SQL queries with a dynamic table name. Is there any suggestion for solving this problem?
Something like this?
<entity name="audit"
query="select id,monitor_type,city,STATUS,is_history,timeout,author,author_uid,author_ip,operator_uid,operator,created_at,description from `main_table`">
<field column="ID" name="id" />
<field column="MONITOR_TYPE" name="monitor_type" />
<field column="CITY" name="city" />
<field column="STATUS" name="status" />
<field column="IS_HISTORY" name="is_history" />
<field column="AUTHOR" name="author" />
<field column="AUTHOR_UID" name="author_uid" />
<field column="AUTHOR_IP" name="author_ip" />
<field column="OPERATOR_UID" name="operator_uid" />
<field column="OPERATOR" name="operator" />
<entity name="distribution" query="select mod(${audit.ID}, 256)+1 as table_id">
<entity name="details" query="select content from detail${distribution.table_id} where id=${audit.ID}">
<field column="CONTENT" name="content" />
</entity>
</entity>
</entity>

I would create a stored procedure to do what you want and then call it from Solr. Something like this:
CREATE PROCEDURE `GetAuditDetails`(IN auditID INT)
BEGIN
DECLARE tableID INT DEFAULT 0;
DECLARE sqlString VARCHAR(100) DEFAULT '';
SELECT MOD(auditID, 256)+1 INTO tableID;
SET @sqlString = CONCAT('select content from detail', tableID, ' where id=', auditID);
PREPARE stmt FROM @sqlString;
EXECUTE stmt;
END;
<entity name="audit"
query="select id,monitor_type,city,STATUS,is_history,timeout,author,author_uid,author_ip,operator_uid,operator,created_at,description from `main_table`">
<field column="ID" name="id" />
<field column="MONITOR_TYPE" name="monitor_type" />
<field column="CITY" name="city" />
<field column="STATUS" name="status" />
<field column="IS_HISTORY" name="is_history" />
<field column="AUTHOR" name="author" />
<field column="AUTHOR_UID" name="author_uid" />
<field column="AUTHOR_IP" name="author_ip" />
<field column="OPERATOR_UID" name="operator_uid" />
<field column="OPERATOR" name="operator" />
<entity name="details" query="CALL GetAuditDetails(${audit.ID})">
<field column="CONTENT" name="content" />
</entity>
</entity>

First, the nested entity query should reference ${audit.id}, not ${audit.ID}; otherwise an exception is thrown.
Second, for the MySQL procedure:
CREATE PROCEDURE `GetAuditDetails`(IN auditID INT)
BEGIN
DECLARE tableID INT DEFAULT 0;
DECLARE sqlString VARCHAR(100) DEFAULT '';
SELECT MOD(auditID, 256)+1 INTO tableID;
SET @sqlString = CONCAT('select content from detail', tableID, ' where id=', auditID);
PREPARE stmt FROM @sqlString;
EXECUTE stmt;
END;
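
For checking which shard table a given id maps to, the rule used by the procedure (MOD(id, 256) + 1, then concatenating the table name) can be reproduced outside the database. The following is only a minimal sketch: the function names are hypothetical, and the defaults of 256 and the `detail` prefix are taken from the procedure above and may not match your actual setup.

```python
def detail_table_for(audit_id: int, shard_count: int = 256, prefix: str = "detail") -> str:
    """Return the shard table name for an id, mirroring MOD(id, shard_count) + 1."""
    # MySQL MOD and Python % disagree for negative numbers, so reject them.
    if audit_id < 0:
        raise ValueError("audit_id must be non-negative")
    return f"{prefix}{audit_id % shard_count + 1}"


def detail_query_for(audit_id: int) -> str:
    """Build the same statement text that the procedure prepares."""
    # int() ensures only a number is ever concatenated into the SQL text.
    audit_id = int(audit_id)
    return f"select content from {detail_table_for(audit_id)} where id={audit_id}"


if __name__ == "__main__":
    print(detail_query_for(1))
```

This is meant for debugging the sharding rule, not as a replacement for the stored procedure called from data-config.xml.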

Related

SOLR order/display response fields list in specific way when JSON call?

I have a question about the fields in a Solr response when the JSON response format is used.
I have a web service that returns more than 20 fields. By default they are ordered so that the fields containing data come first, followed by all the other fields.
Is there a way to specify the order of the field list, so that the fields are always returned in the same order?
For example, given the fields FIELD 1, FIELD 2, etc., I want to preserve exactly this order rather than getting FIELD 2, FIELD 1.
Thanks

Yes, you can specify the order by editing data-config.xml. That file declares all the fields, and above them are the corresponding query, delta-import, and full-import definitions. Try changing the order of the fields both in the query and in the declarations below it.
Example:
<dataConfig>
<dataSource type="HttpDataSource" />
<document>
<entity name="slashdot"
pk="link"
url="http://rss.slashdot.org/Slashdot/slashdot"
processor="XPathEntityProcessor"
forEach="/RDF/channel | /RDF/item"
transformer="DateFormatTransformer">
<field column="source" xpath="/RDF/channel/title" commonField="true" />
<field column="source-link" xpath="/RDF/channel/link" commonField="true" />
<field column="subject" xpath="/RDF/channel/subject" commonField="true" />
<field column="title" xpath="/RDF/item/title" />
<field column="link" xpath="/RDF/item/link" />
<field column="description" xpath="/RDF/item/description" />
<field column="creator" xpath="/RDF/item/creator" />
<field column="item-subject" xpath="/RDF/item/subject" />
<field column="slash-department" xpath="/RDF/item/department" />
<field column="slash-section" xpath="/RDF/item/section" />
<field column="slash-comments" xpath="/RDF/item/comments" />
<field column="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
</entity>
</document>
</dataConfig>

index all files inside a folder in solr

I am having trouble indexing a folder in Solr.
example-data-config.xml:
<dataConfig>
<dataSource type="BinFileDataSource" />
<document>
<entity name="files"
dataSource="null"
rootEntity="false"
processor="FileListEntityProcessor"
baseDir="C:\Temp\" fileName=".*"
recursive="true"
onError="skip">
<field column="fileAbsolutePath" name="id" />
<field column="fileSize" name="size" />
<field column="fileLastModified" name="lastModified" />
<entity
name="documentImport"
processor="TikaEntityProcessor"
url="${files.fileAbsolutePath}"
format="text">
<field column="file" name="fileName"/>
<field column="Author" name="author" meta="true"/>
<field column="text" name="text"/>
</entity>
</entity>
</document>
</dataConfig>
Then I created the schema.xml:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="fileName" type="string" indexed="true" stored="true" />
<field name="author" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="size" type="plong" indexed="true" stored="true" />
<field name="lastModified" type="pdate" indexed="true" stored="true" />
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
Finally, I modified solrconfig.xml, adding the request handler and the dataImportHandler and dataImportHandler-extra jars:
<requestHandler name="/dataimport" class="solr.DataImportHandler">
<lst name="defaults">
<str name="config">example-data-config.xml</str>
</lst>
</requestHandler>
I run it and the result is:
The folder contains about 20,000 files in different formats (.py, .java, .wsdl, etc.).
Any suggestions would be appreciated. Thanks :)

Check your Solr logs. The root cause will almost certainly be there. I once faced the same situation and found through the Solr logs that my DataImportHandler was throwing exceptions because of encrypted documents in the folder. Your cause may be different, but first analyze your Solr logs: run your entity again from the DataImport section, then check the most recent errors in the Logging section of the admin page. If you get errors other than the one I mentioned, post them here so they can be understood and deciphered.

BULK INSERT SQL, Need to separate the data

I'm loading a Notepad txt file into a SQL Table. I'm trying to use the BULK INSERT command but I keep getting this error:
Msg 4863, Level 16, State 1, Line 8
Bulk load data conversion error (truncation) for row 1, column 3 (column3).
The txt file has each column separated by a | symbol. I just need each piece of text between two | symbols to be in its own column.
For example:
|100|AA|BCD|200|
I need each of those to be separated into a column in a table. My code may be too simple right now, but any help would be appreciated.
CREATE TABLE BMData2 (
column1 varchar(30),
column2 varchar(30),
column3 character(3),
column4 varchar(10),
column5 varchar(10),
column6 varchar(10),
column7 varchar(10),
column8 varchar(10),
column9 varchar(10),
column10 varchar(10),
column11 varchar(10),
column12 varchar(10),
column13 varchar(10),
column14 varchar(10),
column15 varchar(10),
column16 varchar(10),
column17 varchar(10),
column18 varchar(10),
column19 varchar(10),
column20 varchar(10),
column21 varchar(10),
column22 varchar(10),
column23 varchar(10),
column24 varchar(10),
column25 varchar(10),
column26 varchar(10),
)
BULK INSERT BMData
FROM '\\DBV\march_june\All march june Data.txt'
WITH
(Fieldterminator = '|',
ROWTERMINATOR = '\n');
Line of data looks like this:
AB|1410|MTH|ART|20150401|3|1600|1600|1556|2048|2048|2101|0|0|168|185|-4|13|17|1630|2054|ARTPROJECT|34|7|144||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
I really only need 5 of these "data points" (i.e. |data point|), but this is how my data arrives, and there are so many rows that cleaning it up in Excel or Notepad is impossible.

Remove the column headers in your text file.
BULK INSERT (Transact-SQL)
The FIRSTROW attribute is not intended to skip column headers.
Skipping headers is not supported by the BULK INSERT statement. When
skipping rows, the SQL Server Database Engine looks only at the field
terminators, and does not validate the data in the fields of skipped
rows.
Edit: Your sample data does not match the table you present. Your table has 26 columns, but the sample data has about twice as many.
Create a test file with the following test data (no column headers).
Note that the last column has a NULL value: it sits after the last | and before the newline character.
AB|1410|MTH|ART|20150401|3|1600|1600|1556|2048|2048|2101|0|0|168|185|-4|13|17|1630|2054|ARTPROJECT|34|7|144|
AB|1410|MTH|ART|20150401|3|1600|1600|1556|2048|2048|2101|0|0|168|185|-4|13|17|1630|2054|ARTPROJECT|34|7|144|
AB|1410|MTH|ART|20150401|3|1600|1600|1556|2048|2048|2101|0|0|168|185|-4|13|17|1630|2054|ARTPROJECT|34|7|144|

Use an XML format file instead; it works much better. Here is a sample based on @Ricardo C's sample data.
1. Prepare the XML file and save it (c:\temp\bulk.xml in this example).
<?xml version="1.0" encoding="utf-8" ?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<!--list all fields in your .txt here-->
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="2" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="3" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="4" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="5" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="6" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="7" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="8" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="9" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="10" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="11" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="12" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="13" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="14" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="15" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="16" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="17" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="18" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="19" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="20" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="21" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="22" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="23" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="24" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="25" xsi:type="CharTerm" TERMINATOR='|' MAX_LENGTH="510" />
<FIELD ID="26" xsi:type="CharTerm" TERMINATOR='\r\n' MAX_LENGTH="510" />
<!-- the last is end of row -->
</RECORD>
<ROW>
<!-- List only what you need -->
<COLUMN SOURCE="1" NAME="Id" xsi:type="SQLVARYCHAR"/>
<COLUMN SOURCE="2" NAME="Id1" xsi:type="SQLINT"/>
<COLUMN SOURCE="3" NAME="Name" xsi:type="SQLVARYCHAR"/>
</ROW>
</BCPFORMAT>
Usage:
insert dbo.myTable
SELECT *
FROM OPENROWSET(BULK 'C:\temp\bulk.txt',
FORMATFILE='C:\temp\bulk.xml'
) AS t1;

I would use PowerShell to pre-process the original file, extract only the columns I need, and write the extracted data to a new file.
Let's say the original source file is located at c:\temp\src.txt and looks as follows (put your column names in the first line for easy processing):
col1|col2|col3|col4|col5|col6|col7|col8|col9|col10
1|2|3|4|5|6|7|8|9|10
11|12|13|14|15|16|17|18|19|20
21|22|23|24|25|26|27|28|29|30
31|32|33|34|35|36|37|38|39|40
41|42|43|44|45|46|47|48|49|50
51|52|53|54|55|56|57|58|59|60
61|62|3|64|65|66|67|68|69|70
Now let's say I only need the col2, col4, col5 and col9 data. Here is the PowerShell code:
[string[]]$col_wanted = 'col2', 'col4', 'col5', 'col9'; #only need four columns out of 10 columns
$csv = import-csv -path c:\temp\src.txt -Delimiter '|';
$t=($col_wanted -join "|") + "`r`n";
foreach ($c in $csv)
{
$col_wanted | % -begin {[string]$s='';} -process {$s+=$c.$_+'|';} -end {$s = $s.Substring(0, $s.Length-1) + "`r`n"}
$t += $s
}
$t | Out-File -FilePath c:\temp\target.txt -Force;
If we open c:\temp\target.txt, we will see a result like this:
col2|col4|col5|col9
2|4|5|9
12|14|15|19
22|24|25|29
32|34|35|39
42|44|45|49
52|54|55|59
62|64|65|69
Now you can use BULK INSERT to import the data. Since the first row is the column header, set firstrow = 2 in the BULK INSERT, as follows:
bulk insert MyTable
from 'c:\temp\target.txt'
with (FIELDTERMINATOR ='|', firstrow=2);
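
If PowerShell is not available, the same pre-processing step can be sketched in Python with the standard csv module. This is only an illustration under assumptions: the function name `extract_columns` is hypothetical, the file is assumed to have a header row as in the sample above, and values are assumed not to contain quote characters.

```python
import csv
import io


def extract_columns(src, dst, wanted, delimiter="|"):
    """Copy only the wanted columns from src to dst, keeping a header row."""
    reader = csv.DictReader(src, delimiter=delimiter)
    writer = csv.DictWriter(
        dst,
        fieldnames=wanted,
        delimiter=delimiter,
        extrasaction="ignore",   # silently drop the columns we do not want
        lineterminator="\r\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


if __name__ == "__main__":
    # Small in-memory sample; real files should be opened with newline="".
    sample = io.StringIO("col1|col2|col3|col4\n1|2|3|4\n11|12|13|14\n")
    out = io.StringIO()
    extract_columns(sample, out, ["col2", "col4"])
    print(out.getvalue())
```

Because rows are streamed one at a time, this avoids building the whole output in memory, which may matter for very large source files.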

Solr: how to query particular entity when multiple

I am starting to learn Solr (using version 5.5.0). I am using the managed-schema and data-config.xml files to index two SQL Server tables: Company & Contact.
I am able to run the data import from the UI, selecting one entity at a time.
This is the message I get for Company:
Indexing completed. Added/Updated: 8,293 documents. Deleted 0 documents. (Duration: 01s)
Requests: 1 (1/s), Fetched: 8,293 (8,293/s), Skipped: 0, Processed: 8,293 (8,293/s) Started: less than a minute ago
This is the message I get for Contact:
Indexing completed. Added/Updated: 81 documents. Deleted 0 documents.
Requests: 1, Fetched: 81, Skipped: 0, Processed: 81
Started: less than a minute ago
When I open the Query section, I want to run a query to see all the Contact and/or Company records, not necessarily combined; I just want to be able to query them.
I am not sure how to do this. Could someone help me understand how to specify which entity the query should run against?
Here are the 2 files I modified:
data-config.xml:
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
url="jdbc:sqlserver://sql.server.com\test;databaseName=test"
user="testusr"
password="testpwd"/>
<document>
<entity name="Company" pk="CompanyID" query="SELECT * FROM tblCompany">
<field column="CompanyID" name="company_companyid"/>
<field column="Name" name="company_name"/>
<field column="Website" name="company_website"/>
<field column="Description" name="company_description"/>
<field column="NumberOfEmployees" name="company_numberofemployees"/>
<field column="AnnualRevenue" name="company_annualrevenue"/>
<field column="YearFounded" name="company_yearfounded"/>
</entity>
<entity name="Contact" pk="ContactID" query="SELECT * FROM tblContact">
<field column="ContactID" name="contact_contactid"/>
<field column="FirstName" name="contact_firstname"/>
<field column="MiddleInitial" name="contact_middleinitial"/>
<field column="LastName" name="contact_lastname"/>
<field column="Email" name="contact_email"/>
<field column="Description" name="contact_description"/>
</entity>
</document>
</dataConfig>
managed-schema:
<!-- Company Begin -->
<field name="company_companyid" type="string" indexed="true"/>
<field name="company_name" type="string" indexed="true"/>
<field name="company_website" type="string" indexed="true"/>
<field name="company_description" type="string" indexed="true"/>
<field name="company_numberofemployees" type="string" indexed="true"/>
<field name="company_annualrevenue" type="string" indexed="true"/>
<field name="company_yearfounded" type="string" indexed="true"/>
<!-- Company End -->
<!-- Contact Begin -->
<field name="contact_contactid" type="string" indexed="true" />
<field name="contact_firstname" type="string" indexed="true"/>
<field name="contact_middleinitial" type="string" indexed="true"/>
<field name="contact_lastname" type="string" indexed="true"/>
<field name="contact_email" type="string" indexed="true"/>
<!-- Contact End -->
UPDATE
I tried using the fl field to select company_companyid, but I did not get any results.
I am including a screen shot:

To get fields as needed from a document, use fl. For example, if you were using SolrJ, you would have something like query.set("fl", "fieldA, fieldB").
In a URL, it looks like this: http://host:port/solr/coreName/select?q=*%3A*&fl=fieldA,fieldB&wt=json&indent=true
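
A URL of this shape can also be assembled programmatically so that the query and field list are escaped correctly. The sketch below is hedged: `build_select_url` is a hypothetical helper, and the host, port, and core name are placeholders rather than values from the question.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, fields, query="*:*"):
    """Build a Solr select URL that restricts the returned fields via fl."""
    params = {
        "q": query,
        "fl": ",".join(fields),  # comma-separated list of fields to return
        "wt": "json",
        "indent": "true",
    }
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder host and core name; substitute your own.
    print(build_select_url("http://localhost:8983", "coreName", ["fieldA", "fieldB"]))
```

The `query` argument accepts any Solr query string, so the same helper can be used with a field-specific query instead of the match-all default.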

DeltaImport not happening by default

I'm having issues with deltaQuery: it doesn't run automatically. Below is the data-config I have:
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
url="jdbc:sqlserver://WTL-sql-1.com;databaseName=eng_metrics"
user="metrics"
password="metrics"/>
<document name="content">
<entity name="id"
query="select defect_id,headline,description,modify_date,issue_type,category,product,state FROM defects WHERE state not like 'Duplicate'"
deltaImportQuery="select defect_id,headline,description,modify_date,issue_type,category,product,state FROM defects WHERE defect_id = '${dataimporter.delta.defect_id}' and state not like 'Duplicate'"
deltaQuery="select defect_id FROM defects WHERE modify_date > '${dataimporter.last_index_time}'">
<field column="defect_id" name="defect_id" />
<field column="headline" name="headline" />
<field column="description" name="description" />
<field column="modify_date" name="modify_date" />
<field column="issue_type" name="issue_type" />
<field column="category" name="category" />
<field column="product" name="product" />
<field column="state" name="state" />
</entity>
</document>
</dataConfig>
However, no matter how modify_date changes in the DB, I don't see any update happen unless I explicitly run a delta import.
Can someone suggest whether I need to change some config or query to make that happen automatically?

Actually, DataImportHandler will not do it automatically. You have to trigger it by calling the delta-import URL.
You may want something like this:
http://wiki.apache.org/solr/DataImportHandler#Scheduling
or you can implement something similar yourself.
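
One way to implement such a trigger yourself is a small script, run from cron or a scheduled task, that requests the delta-import URL at a fixed interval. The following is a rough sketch, not a production scheduler: the function names, host, port, and core name are assumptions, and the handler path /dataimport is assumed to match the one registered in your solrconfig.xml.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen


def delta_import_url(base_url, core, handler="/dataimport"):
    """Build the URL that asks DataImportHandler to run a delta import."""
    params = {"command": "delta-import", "clean": "false", "commit": "true"}
    return f"{base_url}/solr/{core}{handler}?{urlencode(params)}"


def run_periodically(url, interval_seconds, iterations):
    """Request the URL a fixed number of times, sleeping between calls."""
    for _ in range(iterations):
        with urlopen(url, timeout=30) as response:
            response.read()  # drain the response; Solr runs the import asynchronously
        time.sleep(interval_seconds)


if __name__ == "__main__":
    # Placeholder host and core name; this only prints the URL.
    print(delta_import_url("http://localhost:8983", "coreName"))
```

Calling `run_periodically(delta_import_url(...), 300, 12)` would, under these assumptions, request a delta import every five minutes for an hour.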

But I have this data-config, which works fine in some cases:
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
url="jdbc:sqlserver://127.0.0.1\SQLEXPRESS;databaseName=sustaining_trends"
user="sa"
password="metrics"/>
<document name="content">
<entity name="id"
query="select id,createtime,lastmodified,modifiedby,title,keywords,general,symptom,diagnosis,resolution FROM trends"
deltaImportQuery="select id,createtime,lastmodified,modifiedby,title,keywords,general,symptom,diagnosis,resolution FROM trends WHERE id = ${dataimporter.delta.id}"
deltaQuery="select id FROM trends WHERE lastmodified > '${dataimporter.last_index_time}' or createtime > '${dataimporter.last_index_time}'">
<field column="id" name="trendid" />
<field column="lastmodified" name="lastmodified" />
<field column="modifiedby" name="modifiedby" />
<field column="title" name="title" />
<field column="keywords" name="keywords" />
<field column="general" name="general" />
<field column="symptom" name="symptom" />
<field column="diagnosis" name="diagnosis" />
<field column="resolution" name="resolution" />
</entity>
</document>
</dataConfig>
Here, if an item is modified, it gets updated immediately without any intervention, but newly created data is not picked up until I either run a manual delta import or some entry gets modified.
Why does this work automatically for modifications but not for creations?
