I am working on a phone app and I would like to use ZXing for my project. However, I know that a lot of people are crazy about RedLaser, so I decided to try it out. When I went to the mall, I noticed that the scanner does not even read the UPC at stores like Forever 21, H&M, or Tilly's! This is a huge problem for me because these are very popular stores in Southern California! I tried it at Hot Topic, and at least there it would read the barcode and return zero results, but at these other stores it was nothing.
If RedLaser cannot even read the barcode at those stores, then I will assume that ZXing definitely will not read it either. Is there any way to fix this? I know one issue is that those stores are not in the Google Shopping API database, but if I added them to my data feeds database via the API, would they still be unreadable? I'm really hoping for a solution.
You're mixing up two things here: scanning and providing additional information. Both RedLaser and ZXing should be able to scan all UPC and EAN barcodes and come up with the scanned number. When it comes to providing additional information, neither the RedLaser SDK (as opposed to the RedLaser app) nor the ZXing library provide any additional information. That's up to you to implement.
If you weren't even able to scan the product's barcode in a store, it could also mean that the company uses a non-standard barcode format with company-private barcode numbers. Even if you could scan these barcodes, it's very unlikely that there is any service to get additional information for these private numbers. It also indicates that these products are probably sold by a single company only. But most products today have an EAN/UPC/GS1 barcode with a unique barcode number.
Update:
If the product has a UPC/EAN barcode, you can scan it and get an (almost) unique product number. This is the kind of barcode all checkout systems support. And the UPC/EAN/GS1 number is the product number supported by almost all providers of product information.
If it's a Code 39, Code 128, or ITF barcode (or a few additional formats, depending on the barcode scanner library), you can scan it as well and get a number or string. However, its interpretation might differ from shop to shop.
If it's yet another barcode symbology, you cannot even scan it with the barcode library.
Furthermore, many products have several barcodes with different purposes: one might indeed be a sort of product number, but the other ones might be something that is of no use to you even if you could decode it (such as the serial number of an electronic device).
I am guessing that you are not looking at a UPC/EAN product code, but most likely a Code 39 barcode that encodes some store-specific identifier.
ZXing definitely reads Code 39. Try it with Barcode Scanner. RedLaser might not since it is focused on UPC/EAN, though it's based on the same library.
But, even though you can read the contents, I doubt you will be able to do much with it. It is likely a number that doesn't mean anything outside the store's systems.
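If you want to sanity-check the scanning part on the desktop, here is a minimal sketch using the ZXing Java library (the core and javase modules); the image file name and the list of hinted formats are assumptions for illustration only:

import com.google.zxing.*;
import com.google.zxing.client.j2se.BufferedImageLuminanceSource;
import com.google.zxing.common.HybridBinarizer;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

public class Code39Check {
    public static void main(String[] args) throws Exception {
        // Hypothetical photo of the store's shelf label or tag.
        BufferedImage image = ImageIO.read(new File("label.png"));
        LuminanceSource source = new BufferedImageLuminanceSource(image);
        BinaryBitmap bitmap = new BinaryBitmap(new HybridBinarizer(source));

        // Restrict the reader to the formats we care about, including Code 39.
        Map<DecodeHintType, Object> hints = new EnumMap<>(DecodeHintType.class);
        hints.put(DecodeHintType.POSSIBLE_FORMATS,
                EnumSet.of(BarcodeFormat.CODE_39, BarcodeFormat.UPC_A, BarcodeFormat.EAN_13));

        Result result = new MultiFormatReader().decode(bitmap, hints);
        System.out.println(result.getBarcodeFormat() + ": " + result.getText());
    }
}

If this prints a CODE_39 result for one of those store tags, the scanning side works and the remaining problem is purely the lookup.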
I create a PDF file with 20,000 pages, send it to a printer, and individual pages are printed and mailed. These are tax bills to homeowners.
I would like to place the PDF file on my web server.
When a customer inputs a unique bill number on a search page, a search for that specific page is started.
When the page within the PDF file is located, only that page is displayed to the requester.
There are other issues, such as security and the uniqueness of the bill number to search on, that can be worked out.
The main questions are: 1) Can this be done? 2) Is a third-party program required?
I am a novice programmer and would like to try and do this myself.
Thank you
It is possible, but I would strongly recommend a different route. Instead of one 20,000-page document, which might be great for printing, can you instead make 20,000 individual documents and just name them with something unique (bill number or whatever)? PDFs are document presentations and aren't suited for searching or even text information storage. There are no "words" or "paragraphs", and there's not even a guarantee that text is written letter after letter. "Hello World" could be written "Wo", "He", "llo", "rld". Your customer's number might be "H1234567" but be written "1234567", "H". Text might be "in-page", but it also might be in form fields, which adds to the complexity. There are many PDF libraries out there that try to solve these problems, but if you can avoid them in the first place, your life will be much easier.
If you can't re-make the main document, then I would suggest a compromise. Take some time now and use a library like iText (Java) or iTextSharp (.NET) to split the giant document into smaller documents, arbitrarily named. Then write your text extraction logic using the same libraries to find your unique identifiers in the documents and rename each document accordingly. This is really the only way that you can prove that your logic works in every possible scenario.
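As a rough illustration of that approach, here is a minimal sketch using iText 5's Java API; the source file name and the bill-number pattern are hypothetical, and the caveat above about fragmented text extraction still applies:

import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

import java.io.FileOutputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitBills {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("tax-bills.pdf");   // hypothetical 20,000-page source
        Pattern billNumber = Pattern.compile("H\\d{4,7}");   // hypothetical bill-number format

        for (int page = 1; page <= reader.getNumberOfPages(); page++) {
            // Try to find the bill number on the page; fall back to the page index.
            String text = PdfTextExtractor.getTextFromPage(reader, page);
            Matcher m = billNumber.matcher(text);
            String name = m.find() ? m.group() : "page-" + page;

            // Copy the single page into its own document named after the bill number.
            Document doc = new Document();
            PdfCopy copy = new PdfCopy(doc, new FileOutputStream(name + ".pdf"));
            doc.open();
            copy.addPage(copy.getImportedPage(reader, page));
            doc.close();
        }
        reader.close();
    }
}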
Also, be careful with your unique identifiers. If you have accounts like "H1234" and "H12345", you need to make sure that your search algorithm is aware that one is a subset (and therefore a match) of the other.
Finally, and this depends on how sensitive your client's data is, but if you're transporting very sensitive material, I'd really suggest you spot-check every single document. Sucks, I know; I've had to do it. I'd get a copy of Ghostscript and convert all of the PDFs to images, and then just run them through a program that can show me the document and the file name all at once. Google Picasa works nicely for this. You could also write a Photoshop action that crops the document to a specific region and then just use Windows Explorer.
I am trying to map information of Linux packages (name + version) to their corresponding CPE strings (see http://nvd.nist.gov/cpe.cfm) in order to be able to automatically find possible vulnerabilities of a system.
There is an XML document provided by NIST which contains all relevant CPEs. I thought about parsing this information into an SQL database so I can quickly search by name and version number. That would be some 70,000 rows.
The problem now is, of course, that there are variations in the spellings of the CPEs and the package names. For example, the CPE for Tomcat 6.0.36 is cpe:/a:apache:tomcat:6.0.36, so you have the name tomcat and the version 6.0.36. Now, the package manager could give you something like tomcat6 for the name and 6.0.36-3 for the version. It's likely that both programs are the same or at least have the same vulnerabilities. So I need to be able to automatically identify the above-mentioned CPE as the correct one for my tomcat package.
The first thing to do would be some kind of normalization, maybe converting everything to lowercase. But as you can see from the example, that's not enough. I need some kind of fuzzy search. From what I have already found out, there are some solutions for identifying matches in the case of misspellings. That is not exactly what I need, though. The package names are not misspelled, but they may contain additional characters (or be missing some).
The fuzzy search must also be relatively fast, since I need to execute it for multiple hosts, each of which could have several hundred packages installed, and as I said, the database would have around 70,000 rows. I can introduce a primary lookup which tries to find an exact match first, but since I suspect many packages will not have any corresponding CPE string, that will not reduce the workload dramatically.
Another constraint is that the solution should work on a non-proprietary database, since I don't have the financial means for anything else.
So, is there anything that matches these requirements? Or can you think of any solution to my problem except some kind of fuzzy searching?
Thanks in advance!
A general comment, first. The CPE nomenclature seems to have evolved organically, often depending on the vendors' (inconsistent) nomenclature. For example, Sun Java has major.minor.point_version. Adobe uses major.minor.point.subpoint. Microsoft operating systems use Service Packs_Language Packs. Some other vendors would use point releases with mostly numbers but occasional letters sprinkled in (e.g., .8, .9, .9R2, .10).
When I worked on the stated problem, I started from their XML files and manipulated them in Excel, splitting on the periods. Then I would sort either numerically (if they were all numeric) or as text strings. (Note that letters sprinkled into mostly-numeric versions cause havoc, and that .10 comes lexically before .8.)
This inconsistency is why third-party software vendors have sprouted like mushrooms after a spring rain. Companies would rather pay the software vendors than untangle this Gordian knot.
If you want a truly fuzzy search, please take a look at this question about using Soundex. Expect to get a lot of false positives.
If your goal is accurately mapping the CPE strings, you should probably think about implementing a lookup table that translates from CPE to a library name.
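As a rough sketch of that lookup-table idea (the alias entries and normalization rules here are assumptions you would have to maintain for your distribution, not an established mapping):

import java.util.HashMap;
import java.util.Map;

public class CpeMatcher {
    // Hand-maintained translation table from distro package names to CPE vendor:product pairs.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("tomcat6", "apache:tomcat");
        ALIASES.put("openssl", "openssl:openssl");
    }

    /** Normalize a package name: lower-case and strip a trailing version-like suffix. */
    static String normalize(String name) {
        return name.toLowerCase().replaceAll("[-_.]?\\d+$", "");
    }

    /** Strip the distro-specific revision, e.g. "6.0.36-3" -> "6.0.36". */
    static String normalizeVersion(String version) {
        int dash = version.indexOf('-');
        return dash >= 0 ? version.substring(0, dash) : version;
    }

    /** Build a candidate CPE string; returns null if the package is unknown. */
    static String toCpe(String pkg, String version) {
        String product = ALIASES.get(pkg);
        if (product == null) product = ALIASES.get(normalize(pkg));
        if (product == null) return null;
        return "cpe:/a:" + product + ":" + normalizeVersion(version);
    }

    public static void main(String[] args) {
        System.out.println(toCpe("tomcat6", "6.0.36-3")); // cpe:/a:apache:tomcat:6.0.36
    }
}

Exact-match lookups against a table like this are cheap even for 70,000 rows, and you only need to fall back to fuzzy matching for the packages the table does not cover.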
I have Lucene files indexed according to pageIds (the unique key), and one document can have multiple pages. Now, once a user performs a search, it gives us the pages that match the search criteria.
I am using Lucene.Net 2.9.2
We have 2 problems...
1- The file size is around 800 GB and it has 130 million rows (pages), so search was really slow (all queries took more than a minute, even though we only have to return a limited number of rows at a time).
To overcome the performance issue I shifted to Solr, which resolved it (which is quite strange, as I am not using any extra functionality provided by Solr such as sharding - so could it be that Lucene.Net 2.9.2 does not really match the performance of the same version in Java?), but now I am having another issue...
2- Each individual Lucene document is one page, but I want to show results grouped by 'real documents'. How many results are returned should be configurable based on 'real documents', not pages (because that's how I want to present them to the user).
So let's say I want 20 'real documents' and ALL the pages in them that match the search criteria (it doesn't matter if one document has 100 pages and another just 1).
From what I could gather from the Solr forums, this can be achieved with the SOLR-236 patch (field collapsing), but I have not been able to apply the patch correctly against trunk (it gives lots of errors).
This is really important for me and I don't have much time, so can someone please either send me the Solr 1.4.1 binary with this patch applied, or guide me if there is any other way?
I would really appreciate it. Thanks!!
If you have issues with the collapse patch, then the Solr issue tracker is the channel to report them. I can see that other people are currently having some issues with it, so I suggest getting involved in its development.
That said: I recommend that if your application needs to search for 'real documents', then build your index around these 'real documents', not their individual pages.
If your only requirement is to show page numbers, I would suggest playing with the highlighter or doing some custom development. You can store the word offsets of the start and end of each page in a custom structure, and knowing the matched word position in the whole document, you can determine which page it appears on. If the documents are very large, you will get a good performance improvement.
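As a minimal illustration of that page-mapping idea, assuming you have stored the word offset at which each page starts (the offsets below are made up):

import java.util.Arrays;

public class PageLocator {
    /**
     * pageStarts[i] is the word offset at which page i begins within the whole document,
     * a structure you would build at index time. Given the word position of a hit,
     * a binary search tells you which page it falls on.
     */
    static int pageForPosition(int[] pageStarts, int wordPosition) {
        int idx = Arrays.binarySearch(pageStarts, wordPosition);
        // An exact hit is the page start itself; otherwise take the preceding page.
        return idx >= 0 ? idx : -idx - 2;
    }

    public static void main(String[] args) {
        int[] pageStarts = {0, 350, 720, 1200};            // hypothetical 4-page document
        System.out.println(pageForPosition(pageStarts, 800)); // prints 2 (the third page)
    }
}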
You could also have a look at SOLR-1682: Implement CollapseComponent. I haven't tested it yet, but as far as I know it solves the collapsing too.
For contract work, I need to digitize a lot of old, scanned, graphics-only plenary debate protocol PDFs from the Federal Parliament of Germany.
The problem is that most of these files have a two-column format:
Sample Protocol http://sert.homedns.org/img/btp12001.png
I would love to read your answers to the following questions:
How can I split the two columns before feeding them into OCR?
Which commercial or open-source OCR software or framework do you recommend, and why?
Please note that any tool, programming language, framework, etc. is fine. Don't hesitate to recommend esoteric products or libraries if you think they are cut out for the job ^__^!!
UPDATE: These documents have already been scanned by the parliament o_O: sample (same as the image above), and there are lots of them. I want to deliver on the contract ASAP, so I can't go fetch print copies of the same documents and cut and scan them myself. There are just too many of them.
Best Regards,
Cetin Sert
Cut the pages down the middle before you scan.
It depends on what OCR software you are using. A few years ago I did some work with an OCR API; I can't quite remember the name, but I think there are lots of alternatives. Anyway, this API allowed me to define regions on the page to OCR. If you always know roughly where the columns are, you could use an SDK to map out parts of the page.
I use OmniPage 17 for such things. It has a batch mode too, where you put the documents into one folder, from which they are grabbed, and the results are put into another.
It auto-recognizes the layout, including columns, or you can set the default layout to columns.
You can set many options for how the output should look.
But try a demo first to see if it works correctly. At the moment I have problems with ligatures in some of my documents, so words like "fliegen" come out as "fl iegen" and you have to correct them by hand.
Take a look at http://www.wisetrend.com/wisetrend_ocr_cloud.shtml (an online, REST API for OCR). It is based on the powerful ABBYY OCR engine. You can get a free account and try it with a few of your images to see if it handles the 2-column format (it should be able to do it). Also, there are a bunch of settings you can play with (see API documentation) - you may have to tweak some of them before it will work with 2 columns. Finally, as a solution of last resort, if the 2-column split is always in the same place, you can first create a program that splits the input image into two images (shouldn't be very difficult to write this using some standard image processing library), and then feed the resulting images to the OCR process.
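For that last-resort option, splitting a scan down the middle takes only a few lines with the standard Java imaging classes; the file names and the fixed 50/50 split point are assumptions you would adjust to your scans:

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;

public class SplitColumns {
    public static void main(String[] args) throws Exception {
        BufferedImage page = ImageIO.read(new File("btp12001.png")); // hypothetical scanned page
        int w = page.getWidth(), h = page.getHeight();

        // Cut the page vertically down the middle; move the split point if the
        // gutter between the columns is not exactly centered.
        BufferedImage left  = page.getSubimage(0, 0, w / 2, h);
        BufferedImage right = page.getSubimage(w / 2, 0, w - w / 2, h);

        ImageIO.write(left,  "png", new File("btp12001-left.png"));
        ImageIO.write(right, "png", new File("btp12001-right.png"));
    }
}

You would then feed the left and right halves to the OCR engine as two separate single-column pages.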
I'm building the world's simplest library application. All I want to be able to do is scan in a book's UPC (barcode) using a typical scanner (which just types the numbers of the barcode into a field) and then use it to look up data about the book... at a minimum, title, author, year published, and either the Dewey Decimal or Library of Congress catalog number.
The goal is to print out a tiny sticker ("spine label") with the card catalog number that I can stick on the spine of the book, and then I can sort the books by card catalog number on the shelves in our company library. That way, books on similar subjects will tend to be near each other. For example, if you know you're looking for a book about accounting, all you have to do is find SOME book about accounting and you'll see the other half dozen that we have right next to it, which makes browsing the library convenient.
There seem to be lots of web APIs to do this, including Amazon and the Library of Congress. But those are all extremely confusing to me. What I really just want is a single higher level function that takes a UPC barcode number and returns some basic data about the book.
There's a very straightforward web based solution over at ISBNDB.com that you may want to look at.
Edit: Updated the API documentation link; version 2 is now available as well.
Link to prices and tiers here
You can be up and running in just a few minutes (these examples are from API v1):
register on the site and get a key to use the API
try a URL like:
http://isbndb.com/api/books.xml?access_key={yourkey}&index1=isbn&results=details&value1=9780143038092
The results=details gets additional details including the card catalog number.
As an aside, the barcode is generally the ISBN, in either ISBN-10 or ISBN-13 form. You just have to delete the last 5 digits if you are using a scanner and you pick up 18 digits.
Here's a sample response:
<ISBNdb server_time="2008-09-21T00:08:57Z">
<BookList total_results="1" page_size="10" page_number="1" shown_results="1">
<BookData book_id="the_joy_luck_club_a12" isbn="0143038095">
<Title>The Joy Luck Club</Title>
<TitleLong/>
<AuthorsText>Amy Tan, </AuthorsText>
<PublisherText publisher_id="penguin_non_classics">Penguin (Non-Classics)</PublisherText>
<Details dewey_decimal="813.54" physical_description_text="288 pages" language="" edition_info="Paperback; 2006-09-21" dewey_decimal_normalized="813.54" lcc_number="" change_time="2006-12-11T06:26:55Z" price_time="2008-09-20T23:51:33Z"/>
</BookData>
</BookList>
</ISBNdb>
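If you want to consume that response programmatically, here is a minimal sketch using only the standard Java XML classes; the access key is a placeholder and the element and attribute names simply mirror the sample response above:

import org.w3c.dom.Document;
import org.w3c.dom.Element;

import javax.xml.parsers.DocumentBuilderFactory;
import java.net.URL;

public class IsbndbLookup {
    public static void main(String[] args) throws Exception {
        String key = "yourkey";              // placeholder: your ISBNdb access key
        String isbn = "9780143038092";
        URL url = new URL("http://isbndb.com/api/books.xml?access_key=" + key
                + "&index1=isbn&results=details&value1=" + isbn);

        // Parse the XML response and pull out the fields we care about.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(url.openStream());

        String title = doc.getElementsByTagName("Title").item(0).getTextContent();
        Element details = (Element) doc.getElementsByTagName("Details").item(0);
        String dewey = details.getAttribute("dewey_decimal");
        String lcc = details.getAttribute("lcc_number");

        System.out.println(title + " | Dewey: " + dewey + " | LCC: " + lcc);
    }
}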
Note: I'm the LibraryThing guy, so this is partial self-promotion.
Take a look at this StackOverflow answer, which covers some good ways to get data for a given ISBN.
To your issues, Amazon includes a simple DDC (Dewey); Google does not. The WorldCat API does, but you need to be an OCLC library to use it.
The ISBN/UPC issue is complex. Prefer the ISBN if you can find it. Mass-market paperbacks sometimes sport UPCs on the outside and an ISBN on the inside.
LibraryThing members have developed a few pages on the issue and on efforts to map the two:
http://www.librarything.com/wiki/index.php/UPC
http://www.librarything.com/wiki/index.php/CueCat:_ISBNs_and_Barcodes
If you buy from Borders, your books' barcodes will all be stickered over with their own internal barcodes (called a "BINC"). Most annoyingly, whatever glue they use gets harder and harder to remove cleanly over time. I know of no API that converts them. LibraryThing does it by screen scraping.
For an API, I'd go with Amazon. LibraryThing is a good non-API option, resolving BINCs and adding DDC and LCC for books that don't have them by looking at other editions of the "work."
What's missing is the label part. Someone needs to create a good PDF template for that.
Edit: It would be pretty easy if you had the ISBN, but converting from UPC to ISBN is not as easy as you'd like.
Here's some JavaScript code for it from http://isbn.nu, where the conversion is done in script:
if (indexisbn.indexOf("978") == 0) {
isbn = isbn.substr(3,9);
var xsum = 0;
var add = 0;
var i = 0;
for (i = 0; i < 9; i++) {
add = isbn.substr(i,1);
xsum += (10 - i) * add;
}
xsum %= 11;
xsum = 11 - xsum;
if (xsum == 10) { xsum = "X"; }
if (xsum == 11) { xsum = "0"; }
isbn += xsum;
}
However, that only converts from UPC to ISBN some of the time.
You may want to look at the barcode scanning project page, too - one person's journey to scan books.
So you know about Amazon Web Services. But that assumes Amazon has the book and has scanned in the UPC.
You can also try the UPCdatabase at http://www.upcdatabase.com/item/{UPC}, but this is also incomplete - at least it's growing.
The library of congress database is also incomplete with UPCs so far (although it's pretty comprehensive), and is harder to get automated.
Currently, it seems like you'd have to write this yourself in order to have a high-level lookup that returns simple information (and tries each service).
Sounds like the sort of job one might get a small software company to do for you...
More seriously, there are services that provide an interface to the ISBN catalog, such as www.literarymarketplace.com.
On worldcat.com, you can create a URL using the ISBN that will take you straight to a book detail page. That page isn't all that useful by itself, because you'd still be HTML scraping to get the data, but they have links to download the book data in a couple of "standard" formats.
For example, their demo book: http://www.worldcat.org/isbn/9780060817084
Has a "EndNote" format download link http://www.worldcat.org/oclc/123348009?page=endnote&client=worldcat.org-detailed_record, and you can harvest the data from that file very easily. That's linked from their own OCLC number, not the ISBN, but the scrape to convert that isn't hard, and they may yet have a good interface to do it.
My librarian wife uses http://www.worldcat.org/, but they key off ISBN. If you can scan that, you're golden. Looking at a few books, it looks like the UPC is the same or related to the ISBN.
Oh, these guys have a function for doing the conversion from UPC to ISBN.
Using the web site LibraryThing, you can scan in your barcodes (the entire barcode, not just the ISBN - if you have a scanning "wedge" you're in luck) and build your library. (It is an excellent social network - think Stack Overflow for book enthusiasts.)
Then, using the TOOLS section, you can export your library. Now you have a text file to import/parse and can create your labels, a card catalog, etc.
I'm afraid the problem is database access. Companies pay to have a UPC assigned, so the database isn't freely accessible. The UPCdatabase site mentioned by Philip is a start, as is UPCData.info, but they're user-entered, which means incomplete and possibly inaccurate.
You can always enter the UPC into Google and get a hit, but that's not very automated. It does get it right most of the time, though.
I thought I remembered Jon Udell doing something like this (e.g., see this), but it was purely ISBN based.
Looks like you've found a new project for someone to work on!
If you want to use Amazon, you can implement it easily with LINQ to Amazon.
Working in the library world, we simply connect to the LMS, pass in the barcode, and hey presto, back comes the data. I believe there are a number of free LMS providers - Google for "open source lms".
Note: This probably works off ISBN...
You can find a PHP implemented ISBN lookup tool at Dawson Interactive.
I frequently recommend using Amazon's Product Affiliate API (check it out here: https://affiliate-program.amazon.com); however, there are a few other options available as well.
If you want to guarantee the accuracy of the data, you can go with a paid solution. GS1 is the organization that issues UPC codes, so their information should always be accurate (https://www.gs1us.org/tools/gs1-company-database-gepir).
There are also a number of third party databases with relevant information such as https://www.upccodesearch.com/ or https://www.upcdatabase.com/ .