Lucene limit file size while saving into data base - lucene

I'm new in luene and i want to save the index file into data base but i have this exeption and i would not change the max_allowed_packet but i want to limit the size of file.
Exception in thread "Lucene Merge Thread #0" org.apache.lucene.index.MergePolicy$MergeException: org.apache.lucene.store.jdbc.JdbcStoreException: Failed to execute sql [insert into search_lucene (name_, value_, size_, lf_, deleted_) values ( ?, ?, ?, current_timestamp, ? )]; nested exception is com.mysql.jdbc.PacketTooBigException: Packet for query is too large (1286944 > 1048576). You can change this value on the server by setting the max_allowed_packet' variable.
at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:309)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:286)
Caused by: org.apache.lucene.store.jdbc.JdbcStoreException: Failed to execute sql [insert into search_lucene (name_, value_, size_, lf_, deleted_) values ( ?, ?, ?, current_timestamp, ? )]; nested exception is com.mysql.jdbc.PacketTooBigException: Packet for query is too large (1286944 > 1048576). You can change this value on the server by setting the max_allowed_packet' variable.
at org.apache.lucene.store.jdbc.support.JdbcTemplate.executeUpdate(JdbcTemplate.java:185)
at org.apache.lucene.store.jdbc.index.AbstractJdbcIndexOutput.close(AbstractJdbcIndexOutput.java:47)
at org.apache.lucene.store.jdbc.index.RAMAndFileJdbcIndexOutput.close(RAMAndFileJdbcIndexOutput.java:81)
at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java:203)
at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java:204)
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4263)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3884)
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:205)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:260)
Caused by: com.mysql.jdbc.PacketTooBigException: Packet for query is too large (1286944 > 1048576). You can change this value on the server by setting the max_allowed_packet' variable.
at com.mysql.jdbc.MysqlIO.send(MysqlIO.java:3915)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2598)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2778)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2825)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2156)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2459)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2376)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2360)
at org.apache.lucene.store.jdbc.support.JdbcTemplate.executeUpdate(JdbcTemplate.java:175)

read from http://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_store_the_Lucene_index_in_a_relational_database.3F
Can I store the Lucene index in a relational database?
Lucene does not support that functionality out of the box, but several people have implemented JdbcDirectory's. The reports we have seen so far indicate that performance with such implementations is not great, but it is doable.

To limit the file size you'll need to implement a Directory yourself. The trick is to split every file into parts. Perhaps you can borrow some code from lucene-appengine, which splits the file into multiple SegmentHunks.
I hope you know what you're doing because keeping lucene indexes in a database will be much slower than using the usual memory-mapped files.

essaies :
indexWriter.setUseCompoundFile(false);
LogByteSizeMergePolicy aLogByteSizeMergePolicy = new LogByteSizeMergePolicy();
aLogByteSizeMergePolicy.setMaxMergeMB(2);
aLogByteSizeMergePolicy.setMaxMergeMBForForcedMerge(4);
aLogByteSizeMergePolicy.setUseCompoundFile(false);
//voir aussi setMaxCFSSegmentSizeMB, setMaxMergeDocs, setMergeFactor
indexWriter.setMergePolicy(aLogByteSizeMergePolicy);
//Deprecated. use IndexWriterConfig.setMergePolicy(MergePolicy) instead.

Related

Ignore large or corrupt records when loading files with pig using PigStorage

I am seeing the following error when loading in a large file using pig.
java.io.IOException: Too many bytes before newline: 2147483971
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:251)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:176)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:94)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:123)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initialize(PigRecordReader.java:181)
at org.apache.tez.mapreduce.lib.MRReaderMapReduce.setupNewRecordReader(MRReaderMapReduce.java:157)
at org.apache.tez.mapreduce.lib.MRReaderMapReduce.setSplit(MRReaderMapReduce.java:88)
at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:703)
at org.apache.tez.mapreduce.input.MRInput.processSplitEvent(MRInput.java:631)
at org.apache.tez.mapreduce.input.MRInput.handleEvents(MRInput.java:590)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.handleEvent(LogicalIOProcessorRuntimeTask.java:732)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.access$600(LogicalIOProcessorRuntimeTask.java:106)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$1.runInternal(LogicalIOProcessorRuntimeTask.java:809)
at org.apache.tez.common.RunnableWithNdc.run(RunnableWithNdc.java:35)
at java.lang.Thread.run(Thread.java:748)
The command I am using is as follows:
LOAD 'file1.data' using PigStorage('\u0001') as (
id:long,
source:chararray,
)
Is there any option that can be passed here to drop the record that is causing the issue and continue?
You can skip a certain number of records by using the following setting at the top of your pig script
set mapred.skip.map.max.skip.records = 1000;
Link: The number of acceptable skip records surrounding the bad record PER bad record in mapper. The number includes the bad record as well. To turn the feature of detection/skipping of bad records off, set the value to 0. The framework tries to narrow down the skipped range by retrying until this threshold is met OR all attempts get exhausted for this task. Set the value to Long.MAX_VALUE to indicate that framework need not try to narrow down. Whatever records(depends on application) get skipped are acceptable.

java/jdbc timeout in clojure

I am trying to add timeout to jdbc/query and jdbc/execute!. Somewhere in the web I found that both functions take :timeout as an option. Documention also says the options are passed to prepare-statment which takes in :timeout as an option.
My function calls look like,
(jdbc/query db-read-spec query {:timeout 2})
(jdbc/execute! db-write-spec query {:timeout 2})
Is this how it is done? If yes, How do I test this?
If there is different way of doing this which is testable, that works too.
The :timeout option causes .setQueryTimeout to be called on the PreparedStatement used under the hood of clojure.java.jdbc. It is in seconds, not milliseconds, so your query would have to be extremely slow for a timeout of 2,000 seconds (just over half an hour) to take effect.
JDBC supports several different timeouts across several of its classes. For example, javax.sql.DataSource supports .setLoginTimeout (also in seconds), as does java.sql.DriverManager.
There are also database-specific options you can add to the connection string (which you can add as additional key/value pairs in your "db-spec") to control lower-level timeouts. For example, MySQL supports connectionTimeout and socketTimeout in the connection string -- and both of those are in milliseconds. clojure.java.jdbc allows for those to be provided in your "db-spec" hash map as :connectTimeout and :socketTimeout keys respectively.
Note that clojure.java.jdbc is considered "Stable" at this point and all current and future development effort is focused on next.jdbc at this point. next.jdbc makes it easier to use the loginTimeout since it operates on JDBC objects directly, so the whole (Java) API is available as well. It also has built-in support for connection pooling and is, overall, simpler and faster than clojure.java.jdbc.
You can leverage query-hint on mysql-select-queries (time in ms)
SELECT /*+ MAX_EXECUTION_TIME(1000) */ * FROM t1 INNER JOIN t2 WHERE....
then you can just wrap your queries:
(defn timed-query [db query t]
(j/query db [(str (subs query 0 6)
(format " /*+ MAX_EXECUTION_TIME(%s) */ " t)
(subs query 7))]))
and test:
(deftest test-query-timeout
(is (thrown? Exception (timed-query db "select * from Employees where id>5" 1))))
you should use much-complex queries for this to work with 1ms;
I figure out a work around to test this out. Since I use postgres I could leverage select pg_sleep(time-in-seconds)
And my test looks like
(is (thrown-with-msg? PSQLException #"ERROR: canceling statement due to user request"
(fetch-or-save "select pg_sleep(3)")))

Snowflake COPY INTO from JSON - ON_ERROR = CONTINUE - Weird Issue

I am trying to load JSON file from Staging area (S3) into Stage table using COPY INTO command.
Table:
create or replace TABLE stage_tableA (
RAW_JSON VARIANT NOT NULL
);
Copy Command:
copy into stage_tableA from #stgS3/filename_45.gz file_format = (format_name = 'file_json')
Got the below error when executing the above (sample provided)
SQL Error [100069] [22P02]: Error parsing JSON: document is too large, max size 16777216 bytes If you would like to continue loading
when an error is encountered, use other values such as 'SKIP_FILE' or
'CONTINUE' for the ON_ERROR option. For more information on loading
options, please run 'info loading_data' in a SQL client.
When I had put "ON_ERROR=CONTINUE" , records got partially loaded, i.e until the record with more than max size. But no records after the Error record was loaded.
Was "ON_ERROR=CONTINUE" supposed to skip only the record that has max size and load records before and after it ?
Yes, the ON_ERROR=CONTINUE skips the offending line and continues to load the rest of the file.
To help us provide more insight, can you answer the following:
How many records are in your file?
How many got loaded?
At what line was the error first encountered?
You can find this information using the COPY_HISTORY() table function
Try setting the option strip_outer_array = true for file format and attempt the loading again.
The considerations for loading large size semi-structured data are documented in the below article:
https://docs.snowflake.com/en/user-guide/semistructured-considerations.html
I partially agree with Chris. The ON_ERROR=CONTINUE option only helps if the there are in fact more than 1 JSON objects in the file. If it's 1 massive object then you would simply not get an error or the record loaded when using ON_ERROR=CONTINUE.
If you know your JSON payload is smaller than 16mb then definitely try the strip_outer_array = true. Also, if your JSON has a lot of nulls ("NULL") as values use the STRIP_NULL_VALUES = TRUE as this will slim your payload as well. Hope that helps.

Exceeded max configured index size while indexing document while running LS Agent

In our project we have a LS agent which is supposed to delete and create a new FT index. We haven't figured out yet why updating the index cannot be accomplished automatically(although in our database option this is set to update the index daily). So we decided to make roughly the same thing, but using our very simple agent. The agent looks like this and runs daily in the night:
Option Public
Option Declare
Dim s As NotesSession
Dim ndb As NotesDatabase
Sub Initialize
Set s = New NotesSession
Set ndb = s.CurrentDatabase
Print("BEFORE REMOVING INDEXES")
Call ndb.Removeftindex()
Print("INDEXES HAVE BEEN REMOVED SUCCESSFULLY")
Call ndb.createftindex(FTINDEX_ALL_BREAKS, true)
Print("INDEXES HAVE BEEN CREATED SUCCESSFULLY")
End Sub
In most cases it works very well, but sometimes when somebody creates a document which exceeds 12MB (we really don't know how this is possible) the agent fails to create the index. (but the latest index is already deleted).
Error message is:
31.05.2018 03:01:25 Full Text Error (FTG): Exceeded max configured index
size while indexing document NT000BD992 in database index *path to FT file*.ft
My question is how to avoid this problem? We've already expanded the limit of 6MB by the following command SET CONFIG FTG_INDEX_LIMIT=12582912. Can we expand it even more? And in general, how to solve the problem? Thanks in advance.
Using FTG_INDEX_LIMIT is an option to avoid this error, yes. But it will impact server performance in two ways - FTI-update processes will take more time and more memory.
There's no max size of this limit (in theory), but! As update-processes eats memory from common heap it can lead to 'out-of-memory-overheaping' and server crash.
You can try to exclude attachments from index - i don't think someone can put more than 1mb of text in one doc, but users can attach some big text files - and this will produce the error you writing about.
p.s. and yeah, i agree with Scott - why do you need such agent anyway? common ft-indexing working fine usualy.

Importing data from multi-value D3 database into SQL issues

Trying to use the mv.NET by bluefinity tools. Made some integration packages with it for importing data from a d3 multi-value database into MS SQL 2012 but seem to be having some trouble with the mapping.
For the VOYAGES table have some commentX fields in the D3 application that are acting quite unwieldy and the INSERT fails after a certain number of rows with the following message
>Error: 0xC0047062 at INSERT, mvNET Source[354]: System.Exception: Error #8: dataReader[0] = LTPAC002 ci.BufferColumnIndex = 52, ci.ColumnName = COMMGROUP(Error #8: dataReader[0] = LTPAC002 ci.BufferColumnIndex = 52, ci.ColumnName = COMMGROUP(The value is too large to fit in the column data area of the buffer.))
at mvNETDataSource.mvNETSource.PrimeOutput(Int32 outputs, Int32[] outputIDs, PipelineBuffer[] buffers)
at Microsoft.SqlServer.Dts.Pipeline.ManagedComponentHost.HostPrimeOutput(IDTSManagedComponentWrapper100 wrapper, Int32 outputs, Int32[] outputIDs, IDTSBuffer100[] buffers, IntPtr ppBufferWirePacket)
Error: 0xC0047038 at INSERT, SSIS.Pipeline: SSIS Error Code DTS_E_PRIMEOUTPUTFAILED.The PrimeOutput method on mvNET Source returned error code 0x80131500.The component returned a failure code when the pipeline engine called PrimeOutput().The meaning of the failure code is defined by the component, but the error is fatal and the pipeline stopped executing.There may be error messages posted before this with more information about the failure.
The value is too large to fit in the column data area of the buffer. -> tried changing the input / outputs types but can't seem to get it right.
In the SQL table the columns are of type ntext.
In the .dtsx job the data type for the columns are of type Unicode String [DT_WSTR] with length 4000 , I guess these are auto-detected.
The import worked for other D3 files like this not sure why it fails for these comment fields.
Running the query on the mv.NET Data Manager ( on the d3 server) times out after 240 seconds so maybe this is the underlying issue?
Any ideas how to proceed? Thank you ~
Most like reason is column COMMGROUP does not have correct data type or some record in source do not fit in output type
To find error record (causing) you have to use on redirect row (property of component failing component ) and get the result set in some txt.csv /or tsv file .
then check data
The exception is being thrown from mv.NET so I suggest you call (or ask your reseller) to call Bluefinity support and ask them about this. You're paying for support, might as well use it. Those programs shouldn't be allowed to throw exceptions like that.
D3 doesn't export Unicode, that might be one issue. But if the Data Manager times-out then I suspect something is wrong in the connectivity into D3. Open a Connection Monitor from the Session Monitor and watch the connection when you make the request. I'm guessing it's either hanging or more probably it's falling into BASIC Debug.
Make sure all D3-side programs related to this are either all Flash-compiled, or all Not Flashed. Your app code will fall into Debug if it's not Flashed but MVNET.BP is.
If it's your program that's in Debug, fix it. If you're not sure which program it is, LIST-RUNTIME-ERRORS in DM.
If it's a MVNET.BP program, again work with Bluefinity. If you are using MVSP for connectivity then the Connection Monitor may be useless, you'll need to change that to an IP (Telnet) connection to see the raw data exchange.