Finding duplicate documents in BLOB field of Oracle table (Oracle Text) - sql

I need to somehow detect duplicate documents (.doc, .pdf, etc.) which are stored in a BLOB column of my table.
I've been looking into the Oracle Text functionality, but failed to find anything that helps me achieve my goal.
In fact, I need something like the UTL_MATCH functionality, but able to compare entire documents.
Can someone give me any tips on how to do that?
EDIT:
I'm not looking for exact duplicates, which could be found by straight file comparison; I need to analyse the text in the documents, which is why I'm trying to use Oracle Text.
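For illustration, this is the kind of whole-document, fuzzy comparison I have in mind, sketched in Python with the standard difflib module (assuming the plain text has already been extracted from the BLOBs, e.g. via Oracle Text's document filtering; the sample texts are made up):

```python
from difflib import SequenceMatcher

def similarity(text_a: str, text_b: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two extracted texts."""
    return SequenceMatcher(None, text_a, text_b).ratio()

# Hypothetical texts extracted from three stored documents.
doc1 = "Quarterly report: revenue grew 5 percent in Q3."
doc2 = "Quarterly report: revenue grew five percent in Q3."
doc3 = "Meeting minutes for the annual shareholder call."

# Near-duplicates score close to 1.0; unrelated documents score much lower.
print(similarity(doc1, doc2))
print(similarity(doc1, doc3))
```

A threshold on the ratio (say 0.9) would then flag candidate duplicate pairs.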

Related

Fuzzy match in Excel/VBA

I just wanted to know if the following is possible using fuzzy matching in Excel:
I have a database in Excel that I am building a search engine for. The database is in table format. My data consists of 200 hyperlinks to Excel files, so there are 200 rows of data. Each row holds specific details about one of these files, such as the topic it covers. I want to build a search engine so someone can search for a specific topic.
I want the search engine to use fuzzy matching, so that even if something is typed wrong a result can still be found in the dynamic table/database. It's dynamic because more hyperlinks might be added to the database in Excel. I just want to know if this kind of search engine is possible, because I have not been able to find any answer on this.
Excel supports fuzzy lookup via an add-in:
https://www.microsoft.com/en-us/download/details.aspx?id=15011
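To illustrate what fuzzy matching buys you, here is a minimal sketch using Python's standard difflib (not the add-in itself; the topic list is hypothetical): a misspelled query still finds the intended entry.

```python
from difflib import get_close_matches

# Hypothetical topic column from the Excel table.
topics = ["quarterly revenue", "employee onboarding", "inventory forecast"]

# The query contains two typos, yet the closest topic is still found.
query = "quartely revenu"
matches = get_close_matches(query, topics, n=1, cutoff=0.6)
print(matches)  # ['quarterly revenue']
```

The `cutoff` parameter controls how tolerant the match is; the add-in exposes a similar similarity threshold.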

Generate Redshift DDLs for multiple S3 text objects

I have many S3 objects that are all text files with the same delimiter (|), each with a unique header on its first line only.
I want to create a Redshift table DDL for each file, with all fields defined as varchar or simply some text-only data type. I've found a lot of material related to this use case, but nothing that readily accomplishes this seemingly trivial task.
The only source I've found that somewhat does this is here. Before I just make my own tool, I wanted to check in and ask here to ensure I haven't missed anything that would make this labor-intensive, monotonous task quick and easy.
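For reference, the tool I have in mind would be small. A Python sketch (table name and header line are hypothetical) that turns a file's header line into an all-VARCHAR Redshift DDL:

```python
def make_ddl(table_name: str, header_line: str, delimiter: str = "|") -> str:
    """Build a Redshift CREATE TABLE statement with every column as VARCHAR."""
    columns = [c.strip().lower().replace(" ", "_")
               for c in header_line.split(delimiter)]
    # VARCHAR(65535) is Redshift's maximum character column length.
    col_defs = ",\n".join(f'    "{c}" VARCHAR(65535)' for c in columns)
    return f'CREATE TABLE "{table_name}" (\n{col_defs}\n);'

# Example header from one of the pipe-delimited files (made up).
ddl = make_ddl("orders_2020", "Order ID|Customer Name|Total")
print(ddl)
```

Looping over the S3 objects (e.g. with boto3) and reading just each object's first line would then produce one DDL per file.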

Read Excel Files from External Tables

I am tasked with creating a template that will be filled in by business users with employee information; our program will then load it into the database using external tables.
However, our business users constantly change the template by adding, removing or reordering fields.
I am convinced I should use XLSX instead of CSV, so that I can lock the column headers and prevent users from removing, adding or reordering columns.
However, when I query the external table, it shows non-ASCII characters when reading the XLSX, because it is a binary format.
How can I do either of the following?
Effectively read Excel files from external tables
Lock the headers of CSV files?
What you have here is a political problem, but you are looking for a technical fix. Not a good fit.
The problem comes in two halves:
Somebody decided it was a good idea to collect user input in a spreadsheet, which it is generally not.
Users are fiddling with the input format, which they should not.
Fixes are:
Strictly enforce the data structure. Reject any CSV which doesn't match and make the users edit them. They will quickly tire of tweaking the spreadsheets when they realise they're just creating more work for themselves. But they will also get resentful, so consider ...
Building a data input screen. It's pretty simple to knock up a spreadsheet-like grid UI. You don't need anything complicated in Java: Oracle's Apex is intended for exactly this sort of thing. Find out more.
However, if you are stuck with Excel as a UI, I suggest you have a look at Anton Scheffer's excellent PL/SQL as_read_xlsx package on the Amis site. Check it out. You'll probably need to replace your external table with a view over a (perhaps pipelined) table function.
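As a side note on why the external table shows non-ASCII garbage: an .xlsx file is not text at all, but a ZIP archive of XML parts. A short Python sketch (using an in-memory stand-in archive, not a real workbook) illustrates the point:

```python
import io
import zipfile

# Build a stand-in for an .xlsx file: a ZIP archive containing XML parts.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/worksheets/sheet1.xml",
                "<worksheet><row><c><v>Alice</v></c></row></worksheet>")

raw = buf.getvalue()
# The file starts with the ZIP magic number "PK", not readable row data --
# this is what an external table sees when it reads the file as flat text.
print(raw[:2])  # b'PK'

# A proper reader (as_read_xlsx does the equivalent in PL/SQL) unzips the
# archive and parses the XML instead of treating the bytes as text.
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    xml = zf.read("xl/worksheets/sheet1.xml").decode()
print("Alice" in xml)
```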

Using data from a text file for database in SQL Developer

I suppose this is somewhat of an extension of the question asked here.
However, I believe the linked OP's reason for reading a file with SQL Developer and my own are different. I am learning SQL and databases and am attempting to create a model database (as in, I won't be editing the data after insertion, just setting up search queries and the like). I want it to be large (over 100,000 entries), so I've created a C++ program that writes randomly generated entries for the database to a .txt file (one entry per line) instead of hard-coding the insertion of each entry. Now what I want to do is read the .txt file in SQL Developer and insert its contents into a table.
My problem lies in the fact that I am not able to create directories. I am using a university Oracle connection and do not have the privileges to create a directory, so I cannot use UTL_FILE on my .txt file as was answered in the linked question. Assuming there is no way for me to gain this permission, is there an alternative way to read a .txt file of data into my table? Is there a better way to go about creating "dummy data" for my database?
What I ended up doing to insert my mock data was to change the way the .txt file was formatted. Instead of having my C++ code write the data one entry per row, I made it write SQL code to the .txt file, as I think @toddlermenot was suggesting, more or less. After the C++ code had written as many INSERTs with mock entries as I needed to the text file, I just copied and pasted it into SQL Developer and achieved the desired results.
My problem is a classic case of making the process more complicated than it needed to be.
Also, even though I did not use the method, @Multisync provided an interesting way to achieve my goal. I had no idea SQL had the tools for me to generate mock data. Thanks for introducing me to that.
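The generator described above (emit INSERT statements rather than raw rows) looks roughly like this in Python; the table and column names are made up for illustration:

```python
import random

def mock_insert(table: str, row_id: int) -> str:
    """Emit one INSERT statement with randomly generated values."""
    name = random.choice(["alice", "bob", "carol"])
    score = random.randint(0, 100)
    return (f"INSERT INTO {table} (id, name, score) "
            f"VALUES ({row_id}, '{name}', {score});")

# Write as many statements as needed, then paste the file's contents
# into SQL Developer and run them as a script.
statements = [mock_insert("model_data", i) for i in range(1, 6)]
script = "\n".join(statements)
print(script)
```

For 100,000+ rows, wrapping the statements in a single transaction (or batching the commits) keeps the script run manageable.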

FastVectorHighlighter with an external database

I am using Lucene.NET 2.9 in one of my projects. I use Lucene to create indexes for documents and to search those documents. One field in my documents is text-heavy, and I have stored it in my MS SQL database. So basically I search via Lucene on its indexes and then fetch the complete documents from the MS SQL database.
The problem I am facing is that I want to highlight my search query terms in the results. For that I am using FastVectorHighlighter. This highlighter requires a Lucene DocId and field in order to highlight fields. The problem is that since this text-heavy field is not stored in the Lucene index, it is not highlighted in my search results.
Any suggestions on how to accomplish this? I could add the same field to my Lucene index; that would solve the problem, but it would make my index very large. Alternatively, if there is some other method to highlight the text, it would give me much more flexibility.
Thank you for reading my question,
Naveen
If you don't want to store the text in the Lucene index, you should use the Highlighter contrib.
Latest sources for it can be grabbed at https://svn.apache.org/repos/asf/incubator/lucene.net/trunk/src/contrib/Highlighter/