How can I look up data about a book from its barcode number? [closed]

I'm building the world's simplest library application. All I want to be able to do is scan in a book's UPC (barcode) using a typical scanner (which just types the numbers of the barcode into a field) and then use it to look up data about the book... at a minimum, title, author, year published, and either the Dewey Decimal or Library of Congress catalog number.
The goal is to print out a tiny sticker ("spine label") with the card catalog number that I can stick on the spine of the book, and then I can sort the books by catalog number on the shelves in our company library. That way books on similar subjects will tend to be near each other; for example, if you know you're looking for a book about accounting, all you have to do is find SOME book about accounting and you'll see the other half dozen we have right next to it, which makes the library convenient to browse.
There seem to be lots of web APIs to do this, including Amazon's and the Library of Congress's. But those are all extremely confusing to me. What I really want is a single higher-level function that takes a UPC barcode number and returns some basic data about the book.

There's a very straightforward web-based solution over at ISBNDB.com that you may want to look at.
Edit: Updated the API documentation link; there's now a version 2 available as well.
Link to prices and tiers here
You can be up and running in just a few minutes (these examples are from API v1):
register on the site and get a key to use the API
try a URL like:
http://isbndb.com/api/books.xml?access_key={yourkey}&index1=isbn&results=details&value1=9780143038092
The results=details parameter fetches additional details, including the card catalog number.
As an aside, the barcode is generally the ISBN, in either ISBN-10 or ISBN-13 form. If your scanner picks up 18 digits, just delete the last five.
Here's a sample response:
<ISBNdb server_time="2008-09-21T00:08:57Z">
  <BookList total_results="1" page_size="10" page_number="1" shown_results="1">
    <BookData book_id="the_joy_luck_club_a12" isbn="0143038095">
      <Title>The Joy Luck Club</Title>
      <TitleLong/>
      <AuthorsText>Amy Tan, </AuthorsText>
      <PublisherText publisher_id="penguin_non_classics">Penguin (Non-Classics)</PublisherText>
      <Details dewey_decimal="813.54" physical_description_text="288 pages" language="" edition_info="Paperback; 2006-09-21" dewey_decimal_normalized="813.54" lcc_number="" change_time="2006-12-11T06:26:55Z" price_time="2008-09-20T23:51:33Z"/>
    </BookData>
  </BookList>
</ISBNdb>
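Putting those pieces together, here's a minimal sketch of the whole lookup in modern browser JavaScript. It is illustrative only: the access key is a placeholder, the field names are taken from the sample response above, and error handling is omitted.

// Look up a scanned barcode against ISBNdb API v1 (sketch only).
async function lookupBook(scanned) {
    // Some scanners emit 18 digits; the ISBN-13 is the first 13.
    const isbn = scanned.length === 18 ? scanned.slice(0, 13) : scanned;
    const url = "http://isbndb.com/api/books.xml?access_key=YOUR_KEY" +
                "&index1=isbn&results=details&value1=" + isbn;
    const xml = new DOMParser().parseFromString(
        await (await fetch(url)).text(), "application/xml");
    const details = xml.querySelector("Details");
    return {
        title: xml.querySelector("Title").textContent,
        authors: xml.querySelector("AuthorsText").textContent,
        dewey: details.getAttribute("dewey_decimal"),
        lcc: details.getAttribute("lcc_number"),
    };
}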

Note: I'm the LibraryThing guy, so this is partial self-promotion.
Take a look at this StackOverflow answer, which covers some good ways to get data for a given ISBN.
As for your specific issues: Amazon includes a simple DDC (Dewey); Google does not. The WorldCat API does, but you need to be an OCLC library to use it.
The ISBN/UPC issue is complex. Prefer the ISBN if you can find it. Mass-market paperbacks sometimes sport a UPC on the outside and an ISBN on the inside.
LibraryThing members have developed a few pages on the issue and on efforts to map the two:
http://www.librarything.com/wiki/index.php/UPC
http://www.librarything.com/wiki/index.php/CueCat:_ISBNs_and_Barcodes
If you buy from Borders, your books' barcodes will all be stickered over with their own internal barcodes (called "BINCs"). Most annoyingly, whatever glue they use gets harder and harder to remove cleanly over time. I know of no API that converts them. LibraryThing does it by screen-scraping.
For an API, I'd go with Amazon. LibraryThing is a good non-API option, resolving BINCs and adding DDC and LCC for books that don't have them by looking at other editions of the "work."
What's missing is the label part. Someone needs to create a good PDF template for that.

Edit: It would be pretty easy if you had the ISBN, but converting from UPC to ISBN is not as easy as you'd like.
Here's some JavaScript code for it from http://isbn.nu, where the conversion is done in script (cleaned up into a function):
// Convert a "Bookland" EAN-13 (prefix 978) to an ISBN-10 by dropping
// the prefix and recomputing the ISBN-10 check digit.
function upcToIsbn10(ean) {
    if (ean.indexOf("978") !== 0) { return ean; } // not a 978-prefixed EAN
    var isbn = ean.substr(3, 9); // the nine ISBN data digits
    var xsum = 0;
    for (var i = 0; i < 9; i++) {
        // Weighted sum: weights run from 10 down to 2.
        xsum += (10 - i) * Number(isbn.substr(i, 1));
    }
    xsum %= 11;
    xsum = 11 - xsum; // check digit, modulo 11
    if (xsum === 10) { xsum = "X"; }
    if (xsum === 11) { xsum = "0"; }
    return isbn + xsum;
}
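For example, upcToIsbn10("9780143038092") returns "0143038095".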
However, that only converts from UPC to ISBN some of the time: it handles only the 978 prefix, and 979-prefixed ISBN-13s have no ISBN-10 equivalent at all.
You may want to look at the barcode scanning project page, too - one person's journey to scan books.
So you know about Amazon Web Services. But that assumes Amazon has the book and has recorded its UPC.
You can also try the UPC Database at http://www.upcdatabase.com/item/{UPC}, but this is also incomplete; at least it's growing.
The Library of Congress database is also incomplete on UPCs so far (although it's pretty comprehensive otherwise), and it is harder to automate.
Currently, it seems you'd have to write this yourself in order to have a high-level lookup that returns simple information (and tries each service in turn).

Sounds like the sort of job one might get a small software company to do for you...
More seriously, there are services that provide an interface to the ISBN catalog, such as www.literarymarketplace.com.
On worldcat.com, you can create a URL using the ISBN that will take you straight to a book detail page. That page isn't very useful by itself, since you'd still be scraping HTML to get the data, but it links to downloads of the book data in a couple of "standard" formats.
For example, their demo book http://www.worldcat.org/isbn/9780060817084 has an "EndNote" format download link (http://www.worldcat.org/oclc/123348009?page=endnote&client=worldcat.org-detailed_record), and you can harvest the data from that file very easily. That link is keyed off their own OCLC number rather than the ISBN, but the scrape to find it isn't hard, and they may yet offer a proper interface for it.

My librarian wife uses http://www.worldcat.org/, but they key off ISBN. If you can scan that, you're golden. Looking at a few books, it looks like the UPC is the same or related to the ISBN.
Oh, these guys have a function for doing the conversion from UPC to ISBN.

Using the website LibraryThing, you can scan in your barcodes (the entire barcode, not just the ISBN; if you have a scanning "wedge" you're in luck) and build your library. (It is an excellent social network; think StackOverflow for book enthusiasts.)
Then, using the TOOLS section, you can export your library. Now you have a text file to import/parse and can create your labels, a card catalog, etc.
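As a sketch of that parsing step, assuming a tab-delimited export with a header row (the column names below are guesses; check the real export):

// Turn a tab-delimited library export into spine-label strings.
// The column names are assumptions - adjust them to the real header row.
function labelsFromExport(tsv) {
    const [header, ...rows] = tsv.trim().split(/\r?\n/);
    const cols = header.split("\t");
    const deweyCol = cols.indexOf("Dewey Decimal");
    const authorCol = cols.indexOf("Author");
    return rows.map((row) => {
        const fields = row.split("\t");
        const dewey = fields[deweyCol] || "???";
        const author = (fields[authorCol] || "").slice(0, 3).toUpperCase();
        return dewey + " " + author; // e.g. "813.54 TAN"
    });
}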

I'm afraid the problem is database access. Companies pay to have a UPC assigned, so the database isn't freely accessible. The UPCdatabase site mentioned by Philip is a start, as is UPCData.info, but they're user-entered, which means incomplete and possibly inaccurate.
You can always type the UPC into Google and get a hit, but that's not very automated, even though it does get it right most of the time.
I thought I remembered Jon Udell doing something like this (e.g., see this), but it was purely ISBN based.
Looks like you've found a new project for someone to work on!

If you want to use Amazon, you can implement it easily with LINQ to Amazon.

Working in the library world, we simply connect to the LMS (library management system), pass in the barcode, and hey presto, back comes the data. I believe there are a number of free LMS providers; Google for "open source LMS".
Note: This probably works off ISBN...

You can find a PHP implemented ISBN lookup tool at Dawson Interactive.

I frequently recommend Amazon's Product Affiliate API (check it out at https://affiliate-program.amazon.com); however, there are a few other options available as well.
If you want to guarantee the accuracy of the data, you can go with a paid solution. GS1 is the organization that issues UPC codes, so their information should always be accurate (https://www.gs1us.org/tools/gs1-company-database-gepir).
There are also a number of third-party databases with relevant information, such as https://www.upccodesearch.com/ or https://www.upcdatabase.com/.

Related

College/University list for populating an Auto-complete field? [closed]

I'm trying to create an auto-complete text field with a list of known Universities and Colleges. Do you know where I can get this sort of list? Or is there a public API that contains this data?
I've found that IPEDS (Integrated Postsecondary Education Data System) is probably the most authoritative source for this data (and tons more related to it), and it's easy to export.
That page has a bunch of different tools for exporting different slices of the data in various common ways, but they all wrap around the "Download Custom Data Files" tool, which is the most basic.
For a list (rather than data on a single institution), go to that custom data file page and click "By Group" to select the filters that limit which institutions you want (and which year's datasets). Then click "Search", and it will show a sample list of your results. From there, click "Continue" to select which variables you want in the report for each of the institutions you've filtered down to.
There are tons of variables, but in this case you'll most likely find everything you need under "Institutional Characteristics". Once you've selected all the columns you want, click the big "Continue" button up top. You will then be presented with a bunch of download links for your data in a few formats, including CSV.
For quick results, try http://ope.ed.gov/accreditation/GetDownloadFile.aspx - it has complete data sets ready to download with all the key information.
The US Federal Aid Application site (http://www.fafsa.ed.gov/) has a very complete list, although I'm not sure how many non-US universities are listed (some are).
You could consider starting an application and scraping the list.
After a few moments of review, I'm pretty sure that FAFSA doesn't want that list used by the public, and doesn't make it easily accessible.
Their school search form (https://fafsa.ed.gov/FAFSA/app/schoolSearch?locale=en_EN) uses autocomplete to serve the filtered data. I didn't look at the script long enough to figure out whether some URI might return the entire listing, but I'm pretty sure they don't want people using their bandwidth. I will continue looking elsewhere and will try to post back if I find something more like what we're looking for.
...I'm back. I found dbpedia.org as a possible source (for this & MUCH more). I also tweaked an example query to list all its universities in alphabetical order and saved the HTML output for my own use.
Another find was http://infochimps.com/search?query=universities, though that site apparently only "deals" in datasets (some free, some not).
I'm still hoping to find a "straight up" web resource that I can ping with queries for JSON.
ahhhh what'ya gonna do!?!? 8)
You might try the Department of Education: http://www.ed.gov/developer. Data is available by API call or as CSV. I'm about to build the same experience using the API endpoint, so I'll update with notes.

optical character recognition of PDFs of parliamentary debates

For contract work, I need to digitize a lot of old, scanned-graphics-only plenary debate protocol PDFs from the Federal Parliament of Germany.
The problem is that most of these files have a two-column format:
Sample Protocol http://sert.homedns.org/img/btp12001.png
I would love to read your answers to the following questions:
How can I split the two columns before feeding them into OCR?
Which commercial or open-source OCR software or framework do you recommend, and why?
Please note that any tool, programming language, framework, etc. is fine. Don't hesitate to recommend esoteric products or libraries if you think they are cut out for the job ^__^!!
UPDATE: These documents have already been scanned by the parliament o_O: sample (same as the image above). There are lots of them, and I want to deliver on the contract ASAP, so I can't go fetch print copies of the same documents and cut and scan them myself. There are just too many.
Best Regards,
Cetin Sert
Cut the pages down the middle before you scan.
It depends what OCR software you are using. A few years ago I did some work with an OCR API; I can't quite remember the name, but I think there are lots of alternatives. Anyway, this API allowed me to define regions on the page to OCR. If you always know roughly where the columns are, you could use an SDK to map out parts of the page.
I use OmniPage 17 for such things. It has a batch mode too, where you can put the documents in one folder, have them picked up from there, and have the results put into another.
It auto-recognizes the layout, including columns, or you can set the default layout to columns.
You can set many options for how the output should look.
But try a demo to see whether it works correctly. At the moment I have problems with ligatures in some of my documents, so words like "fliegen" come out as "fl iegen" and you have to spell-check them.
Take a look at http://www.wisetrend.com/wisetrend_ocr_cloud.shtml (an online, REST API for OCR). It is based on the powerful ABBYY OCR engine. You can get a free account and try it with a few of your images to see if it handles the 2-column format (it should be able to do it). Also, there are a bunch of settings you can play with (see API documentation) - you may have to tweak some of them before it will work with 2 columns. Finally, as a solution of last resort, if the 2-column split is always in the same place, you can first create a program that splits the input image into two images (shouldn't be very difficult to write this using some standard image processing library), and then feed the resulting images to the OCR process.
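For that last-resort splitting step, a minimal Node.js sketch using the sharp image library might look like the following. The fixed 0.5 split ratio, output file names, and choice of library are my assumptions, not something from this thread:

// Split a scanned page image into left and right column images,
// assuming the column boundary sits at a fixed fraction of the width.
const sharp = require("sharp");

async function splitColumns(inputFile, splitRatio = 0.5) {
    const { width, height } = await sharp(inputFile).metadata();
    const mid = Math.round(width * splitRatio);
    // Left column: from the left edge to the split point.
    await sharp(inputFile)
        .extract({ left: 0, top: 0, width: mid, height })
        .toFile("left-column.png");
    // Right column: from the split point to the right edge.
    await sharp(inputFile)
        .extract({ left: mid, top: 0, width: width - mid, height })
        .toFile("right-column.png");
}

splitColumns("page.png").catch(console.error);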

Is this how a modern news site would handle its SQL/business logic?

Basically, the image below represents the components on the homepage of a site I'm working on, which will have news components all over the place. The SQL snippets show how I think they should work; I would really appreciate some business-logic advice from people who've worked with news sites before, though. Here's how I envision it:
Question #1: Does the sql logic make sense? Are there any caveats/flaws that you can see?
My schema would be something like:
CREATE TABLE articles (
    article_id int unsigned not null primary key,
    article_permalink varchar(100),
    article_name varchar(100),
    article_snippet text,
    article_full text,
    article_type tinyint(3) default 1
);
I would store all articles (main featured, sub-featured, the rest) in one table and categorize them by a type column, which would correspond to an entry in my news_types table (for the example I used literal text, as it's easier to understand).
Question #1.1: Is it alright to rely on one table for different types of articles?
A news article can have 3 image types:
1x original image size, which would show up only on the article's permalink page
1x main featured image, which would show up on the homepage section #1
1x sub featured image, which would show up in the homepage section #2
For now I want each article to correspond to one image and not multiple. A user can post images for the article in the article_full TEXT column though.
Question #1.2: I'm not sure how I should incorporate the article images into my schema; is it common for a schema to rely on two tables like this?
article_image_links:
    article_id | article_image_id
    1          | 1
article_images:
    article_image_id | article_image_url
    1                | media.site.com/articles/blah.jpg
Requirements for the data:
From the way I have my SQL logic set up, some data has to exist in order for anything to show up:
there has to be at least one main-type article
there have to be at least four featured-type articles, which appear below the main one
Question #1.3: Should I bother creating special cases for when data is missing? For example, if there's no main featured article, should I select the latest featured one, or should I make it a requirement that someone always specify a main article?
Question #1.4: In the admin, when a user goes to post, by default there will be a dropdown that specifies the article type; normal will be pre-selected, and it will have options for main and featured. So if a user later decides to change the article type, he/she can do that.
Question #1.5: My featured and main articles are selected only by latest date. If the user wants, for whatever reason, to specify an older article as the main article, should I create custom logic, or just tell them to update the article date so it's later than the latest?
In regard to the question in the title, there is definitely more than one way to skin a cat. What's right for one site may not be right for another. Some things that could factor into your decision are how large the site needs to scale (e.g., are there going to be dozens of articles or millions?) and who will be entering the data (e.g., how much idiot-proofing do you need to build in). I'll try to answer the questions as best I can based on the information you gave.
Question # 1: Yes, looks fine to me. Be sure to set your indexes (I'd put indexes on [type,date] and [category,type,date]).
Question #1.1: Yes, I would say that is alright, in fact, I would say it is preferred. If I understand the question correctly (that this is as opposed to a table for each "type") then this sets you up better for adding new types in the future if you want to.
Question #1.2: If you only want one image for each story and one story for each image I'm not seeing the advantage of splitting that up into an extra table. It seems like it's just more overhead. But I could be missing something here.
Question #1.3: That's a design decision up to you, there's no "right" answer here. It all depends on your intended uses of the system.

Tools for Generating Mock Data? [closed]

I'm looking for recommendations of a good, free tool for generating sample data for the purpose of loading into test databases. By analogy, something that produces "lorem ipsum" text for any RDBMS. Features I'm looking for include:
Flexibility to generate data for an existing table definition.
Ability to generate small and large data sets (> 1 million rows or more).
Generate in SQL script format (INSERT statements) or else in a flat file format suitable for bulk import (which is usually faster).
A command-line interface for easy scripting.
Extensible, open source, written in a dynamic language (these are nice-to-haves, not strong requirements).
PS: I did search for a duplicate question on StackOverflow, but I didn't find one. If there is one, I'll be grateful to get a pointer to it.
Thanks for the great responses everyone! I should amend my requirements that I use Mac OS X as my primary development environment, not Windows (though I did say command-line interface is desirable, and that practically rules out Windows). The Windows-specific suggestions will no doubt be useful to other readers of this question, though, so thanks.
Here is my conclusion:
GenerateData:
PHP web app interface, not command line
limited to generating 200 records (or pay $20 for a license to generate 5,000 records)
RedGate SQL Data Generator
not free, price $295
requires Windows, .NET, SQL Server
Visual Studio 2008 Database Edition
requires Windows
requires costly MSDN or ISV subscription
Banner Datatect
not free, price $595
requires Windows (?)
no support for MySQL (?)
GUI, not command line or scriptable
Ruby Faker gem
way too slow to use ActiveRecord for bulk data load
Super Smack
chiefly a load-testing tool, with a random data generator built in
pretty simple to use nevertheless
overall a good runner-up tool
Databene Benerator
best solution for my needs
XML scripts, compatible with DbUnit
open source (GPL) Java code
command-line usage
access many databases directly via JDBC
Take a look at databene benerator, a test data generator that looks close to your requirements.
it can generate data for an existing table definition (or even anonymize production data)
it can generate large data sets (unlimited size)
it supports various input formats (CSV, flat files, DbUnit) and output formats (CSV, flat files, DbUnit, XML, Excel, scripts)
it can be used on the command line or through a maven plugin
it's open source and customizable
I would give it a try.
BTW, a list of similar products is available on databene benerator's web site.
This looks quite promising: generatedata.com. Open-source, has lots of built-in data types.
There are several others listed here: Test (Sample) Data Generators. I don't have experience with any of them, but a few on that list look like they could be pretty decent.
Try http://www.mockaroo.com
This is a tool my company made to help test our own applications. We've made it free for anyone to use. It's basically the Forgery ruby gem with a web app wrapped around it. You can generate data in CSV, txt, or SQL formats. Hope this helps.
I know you said you were looking for a free tool, but this is one case where I would suggest that spending $295 will pay you back quickly in time saved. I've been using the RedGate tool SQL Data Generator for the last year and it is, to be short, an awesome tool. It allows for setting dependencies between columns, generates realistic data for business objects such as phone numbers, urls, names, etc. I can honestly state that this tool has paid for itself time and time again.
If you are looking or willing to use something MySQL-specific, you could take a look at Super Smack. It is currently maintained by Tony Bourke.
Super Smack allows you to generate random data to insert into your database tables. It is customizable, allowing you to use the packaged words.dat file, or any test data of your choice.
One of the nice things about it is that it is command-line driven and highly customizable. There are some fairly decent examples of usage in the book High Performance MySQL, which is also excerpted here.
Not sure if that is along the lines of what you are looking for, but just a thought.
A Ruby script with one of the available fake data generators should do you just fine.
http://faker.rubyforge.org/ is one such gem. Unfortunately, this doesn't fulfill all your requirements.
Here is another: http://random-data.rubyforge.org/
And a tutorial for using Faker: http://www.rubyandhow.com/how-to-generate-fake-names-addresses-in-ruby/
RE: Flexibility to generate data for an existing table definition. Combine the Faker gem with one of the available ORMs. ActiveRecord would probably be easiest.
Normally very costly, but if you are a small ISV you can get Visual Studio 2008 Database Edition very cheaply; see the Empower and BizSpark promotions. It provides a lot more functionality than just generating test data (integration with SCC, unit testing, DB refactoring, etc.).
As I like the fact that Red Gate tools are so easy to learn, I would still look at SQL Data Generator.
A tool that really should not be missing from the list is the Data Generator from Datanamic, which populates databases directly or generates insert scripts, has a large collection of pre-installed generators, and supports multiple databases.
http://www.datanamic.com/datagenerator/index.html
I know you're not looking for actual lorem ipsum text; but in case anyone else searches for an actual lorem ipsum generator and finds this thread: lipsum.com does a great job of it.
Not free, but Visual Studio 2008 Database Edition is a good alternative and it provides a lot more functionality (Integration with SCC, Unit Testing, DB Refactoring, etc...)
I use a tool called Datatect:
Generates data to flat files or any ODBC compliant database.
Extensible via VBScript.
Referentially aware; will populate foreign keys with values from parent table.
Data is context aware; city, state and phone numbers for given zip codes, first names and titles with gender.
Can create custom, complex data types.
Generates over 2 billion proper names, business names, street addresses, cities, states, and zip codes.
I've used this tool to generate as many as 40,000,000 rows of data to a SQL Server database, and 8,000,000 rows of data to an Oracle database.
I am in no way affiliated with Banner Systems, just a satisfied customer.
Here is the list of such tools (both free and commercial):
http://c2.com/cgi/wiki?TestDataGenerator
For OS X there is Data Creator (US $7). The download is free for test purposes, so you can evaluate the software and its features.
It requires OS X Lion or later. It can generate a lot of different field types and has a custom export mode plus some presets (TSV, CSV, HTML table, web page with a table inside).
http://www.tensionsoftware.com/osx/datacreator/
here at the App Store:
https://itunes.apple.com/us/app/data-creator/id491686136?mt=12
You can use DbSchema (www.dbschema.com); it's a database management tool and it has a random data generator to populate your database.
Not a direct answer to your question, but this can be helpful for certain kinds of data:
Fake Name Generator can be useful - http://www.fakenamegenerator.com/ - not for everything, but for user accounts and the like. AFAIK they provide support for bulk orders.
+1 for Benerator: I tried 3 or 4 of the other tools on offer (including dbmonster) but found Benerator to be very quick, to deliver realistic data and to be flexible. I also got very quick & helpful feedback from the tool's creator when I posted on the forum.

Tools or methods for automatically creating contextual links within a large corpus of content?

Here's the basic scenario - I have a corpus of say 100,000 newspaper-like articles. Minimally they will all have a well-defined title, and some amount of body content.
What I want to do is find runs of text in articles that ought to link to other articles.
So, if article Foo has a run of text like "Students in 8th grade are being encouraged to read works by Jean-Paul Sartre" and article Bar is titled (and about) "The important works of Jean-Paul Sartre", I'd like to automagically create that HTML link from Foo to Bar within the text of Foo.
You should ask yourself something before adding the links: what benefit do you want users to get from this? You probably want to increase the navigability of your site. Maybe it is better to create an easier way to add links to older articles in the form used to submit new ones. Maybe it is possible to add a "one-click search for selected text" feature. Maybe you can add wiki-like functionality that lets users propose links for selected text. You probably want to add links to related articles (generated through a tagging system or text mining) below the articles.
Some potential problems with a fully automated link adder:
You may need to implement a good word-sense disambiguation algorithm to avoid confusing or even irritating the user with bad automatic links produced by regex (or simple substring) matching.
As the number of articles is large, you do not want to generate the HTML for the extra links on every request; cache it instead.
You need to make a decision on duplicate titles, or titles that contain another title as a substring (either take the longest title, link to the most recent article, or prefer an article from the same category).
TL;DR version: find alternative solutions that provide the desired functionality to the users.
What you are looking for are text mining tools. You can find more info and links at http://en.wikipedia.org/wiki/Text_mining. You might also want to check out Lucene and its ports at http://lucene.apache.org. Using these tools, the basic idea would be to find a set of similar articles based on the article (or title) in question. You could search various properties of the article including titles and content or both. A tagging system a la Delicious (or Stackoverflow) might also be helpful. Rather than pre-creating the links between articles, you'd present the relevant articles in an interface much like the Related questions interface on the right-hand side of this page.
If you wanted to find and link specific text in each article, I think you'd need to do some preprocessing to select pertinent phrases to key on. Even then I think it would be very hard not to miss things due to punctuation/misspellings or to not include irrelevant links for the same reasons.
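For what it's worth, the naive phrase-matching pass that the caveats above warn about is only a few lines. Here is an illustrative sketch; the article list and URLs are made up, and it deliberately ignores the disambiguation, nested-markup, and caching concerns raised earlier:

// Naively wrap the first occurrence of each known phrase in a link.
// Longest phrases first, so a longer title beats a shorter overlap.
function escapeRegex(s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function addLinks(body, articles) {
    const sorted = [...articles].sort((a, b) => b.phrase.length - a.phrase.length);
    for (const { phrase, url } of sorted) {
        const pattern = new RegExp(escapeRegex(phrase), "i");
        body = body.replace(pattern, (match) => `<a href="${url}">${match}</a>`);
    }
    return body;
}

// Example: links the Sartre mention in Foo's text to article Bar.
addLinks("Students are encouraged to read works by Jean-Paul Sartre.",
         [{ phrase: "Jean-Paul Sartre", url: "/articles/sartre-works" }]);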