Intelligent web crawler using machine learning / text mining

I am building an e-commerce site.
Problem Statement:
I want to crawl web pages to get the product name, images, and product specifications/features, and store them in my database.
Input to the machine learning algorithm:
A web page with HTML content
Output expected from the machine learning algorithm:
It should automatically detect whether the page is a product details page or not.
If it is a product details page, it should recognize the product category.
Then it should parse the product name and specifications.
Question
Which algorithm would be suitable for this problem?
Can anyone suggest a proper approach to follow?

I'm not an expert in Machine Learning/Natural Language Processing, but my gut feeling says it is very difficult to fully implement this as an ML product.
So first check whether your targeted e-commerce sites provide some kind of API to extract data. If such APIs are available, use them; that will be far easier than using ML.
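If you do go down the ML route described in the question, the page-type detection step is essentially ordinary text classification. Below is a minimal sketch of that one step, assuming you hand-label some crawled pages yourself; the training data and names here are hypothetical and this is only one of many possible approaches.

```python
# Minimal sketch: "is this a product details page?" as text classification.
# The tiny training set below is hypothetical; in practice you would
# hand-label a few hundred crawled pages.
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def visible_text(html):
    # Strip tags and keep only the visible text of the page.
    return BeautifulSoup(html, "html.parser").get_text(separator=" ")

# Hypothetical labelled examples (1 = product details page, 0 = other).
train_html = [
    "<h1>Acme Phone X</h1><p>Price: $199</p><ul><li>6 inch display</li><li>64 GB storage</li></ul>",
    "<h1>Blue Widget</h1><p>Add to cart</p><table><tr><td>Weight</td><td>2 kg</td></tr></table>",
    "<h1>About us</h1><p>We have been selling widgets since 1990.</p>",
    "<h1>Shipping policy</h1><p>Orders ship within 3 business days.</p>",
]
train_labels = [1, 1, 0, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit([visible_text(h) for h in train_html], train_labels)

new_page = "<h1>Acme Phone Y</h1><p>Price: $249</p><ul><li>6.5 inch display</li></ul>"
print(model.predict([visible_text(new_page)])[0])  # 1 means "treat as a product page"
```

Recognizing the category would be a similar multi-class classifier, while pulling out the name and the specification table is closer to structured extraction (wrapper induction or sequence labelling) and is considerably harder, which is why checking for an official API first is good advice.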

Related

How do you do a load test for a Shopify app?

I have a technical question regarding Shopify app load testing. My app adds an alternate product page to different stores, and there is an API endpoint used by the end customers.
How do you do load testing to see how many customers can be served?
How do you simulate the loads generated by different stores?
I was trying to use Apache Benchmark, but I can only test one store with it. Also, I don't have many test stores. Let's say there are several hundred stores using my app now; I just can't create that many stores.
I am asking a similar question to this myself. I used this in-depth guide to help me select the right tool, and I ended up going with Locust.io, as my Shopify app is implemented in Python and I felt keeping all my tools in the same code base has some value (and it also seems like a really good tool for the purpose).
To answer your question about multiple stores, you will have to generate dummy test stores in your app's database (stores that don't exist in Shopify but do exist in your database). Then you script your load test to access those stores.
For example, if your app has three endpoints (GET /alternate_product_page, POST /alternate_product_page and GET /some_resource), then you would set up an HttpUser class in Locust that exercises those endpoints as per normal usage, and then start Locust with that test load for each store id in your database.
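A minimal sketch of such a Locust file, assuming those three example endpoints and a hypothetical list of dummy store ids passed as a query parameter; all paths and parameter names are placeholders to adapt to your app:

```python
# locustfile.py - sketch of a per-store load test for the example endpoints.
# Store ids, paths and parameters are placeholders.
import random
from locust import HttpUser, task, between

# Dummy stores that exist only in the app's own database.
STORE_IDS = ["test-store-1.myshopify.com", "test-store-2.myshopify.com", "test-store-3.myshopify.com"]

class AlternateProductPageUser(HttpUser):
    wait_time = between(1, 5)  # think time between requests

    @task(3)
    def view_alternate_product_page(self):
        self.client.get("/alternate_product_page", params={"shop": random.choice(STORE_IDS)})

    @task(1)
    def submit_alternate_product_page(self):
        self.client.post("/alternate_product_page",
                         json={"shop": random.choice(STORE_IDS), "variant_id": 123})

    @task(1)
    def get_some_resource(self):
        self.client.get("/some_resource", params={"shop": random.choice(STORE_IDS)})
```

You would then run locust -f locustfile.py --host https://your-app.example.com (the host is a placeholder) and ramp the number of simulated users up from the web UI until response times start to degrade.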
The thing I have not figured out yet is how to spoof authentication towards Shopify during testing. I guess we will have to simply disable authentication altogether while testing.

Trying to migrate online store, completely lost and being taken advantage of

We have an online store that used to be managed by a third party (let's call them BS Company) that did the hosting, web design and everything... Now we've decided to move to Shopify because this company has screwed us over at every step along the way.
We already have the store ready on Shopify. Our current website is hosted on DigitalOcean, to which BS Company has all the access.
They are being impossible, saying that they can't transfer out of DigitalOcean because they have several clients' sites on the same hosting. This sounds like royal BS. We just want to move everything to Shopify.
I'm going to be on the phone with them shortly, and I just want to be informed. What should I ask them to do specifically? What am I trying to get them to do?
I'm totally lost here, guys, please help.
This goes outside the scope of Stack Overflow's purpose, which is to help developers with their code issues.
That said, you are providing too little information here.
If you want to transfer your store to Shopify and the theme is already done, then you probably only need the content from the old store.
Since you didn't mention what the previous platform is (WordPress/Drupal/Magento/etc.), it is hard to recommend an app that will import the content from that platform into Shopify.
So pretty much you need to know what platform the current store is using. For example, if you are using WordPress with WooCommerce, you will need the following app in Shopify -> https://help.shopify.com/en/manual/migrating-to-shopify/import-store-app/woocommerce-migration and you will need to export the content from WordPress.
Please keep in mind that this focuses mainly on the products; if you would like to import the pages and custom post types (we are still talking about WordPress here), you will need another app or custom logic.
So, long story short, it is not an easy job if there is no app for it.
If you want to keep the SEO from your previous site, keep in mind that Shopify has a predefined URL structure that you can't overwrite.
This means that pages that were using a specific URL structure will have different URLs now, and you will need to create redirect rules in Shopify manually for each page, which will be a tedious process if you have a lot of content.
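If there are many URLs, the redirects can also be created in bulk through the Admin API instead of one by one in the admin UI. A rough sketch, assuming an Admin API access token and a hand-made mapping from old paths to new ones (the shop domain, token, API version and mapping below are placeholders):

```python
# Rough sketch: bulk-create URL redirects via Shopify's Admin REST API
# redirect resource. All credentials and paths below are placeholders.
import requests

SHOP = "your-store.myshopify.com"
TOKEN = "shpat_xxx"          # Admin API access token
API_VERSION = "2023-04"      # use whatever API version your app targets

# Old path on the previous site -> new path on Shopify.
redirects = {
    "/product-category/widgets/blue-widget": "/products/blue-widget",
    "/about-us": "/pages/about",
}

for old_path, new_path in redirects.items():
    resp = requests.post(
        f"https://{SHOP}/admin/api/{API_VERSION}/redirects.json",
        headers={"X-Shopify-Access-Token": TOKEN},
        json={"redirect": {"path": old_path, "target": new_path}},
    )
    resp.raise_for_status()
```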
So pretty much you need to know:
What platform the site was written in
An export of the database (and which database was used: MySQL/MongoDB/etc.)
The site files (they need to provide them to you)
With this information you will be able to create a local copy of the site for reference.

Can someone explain to me what an API is?

I've googled it, yet I couldn't understand it properly. I'm not sure if it's a library or an intra-server communicator.
Can someone explain to me, at a high level or low level, what is meant by an API?
http://en.wikipedia.org/wiki/Application_programming_interface
Read it from there; it will hopefully clear up most of your doubts.
API stands for Application Programming Interface, which means taking an existing program or piece of code and accessing it from your own code.
===
Example, a search engine:
Search engine 1: offers search and an API (say this is Google)
Search engine 2: uses Google's API to get results (this is yours)
To get results, you basically query the other search engine and pull its results into yours.
====
An API can be used in many ways, for example to access someone else's data or code.
An in-depth explanation can be found here: http://en.wikipedia.org/wiki/Application_programming_interface
An application-programming interface (API) is a set of programming instructions and standards for accessing a Web-based software application or Web tool. A software company releases its API to the public so that other software developers can design products that are powered by its service.
For example, Amazon.com released its API so that Web site developers could more easily access Amazon's product information. Using the Amazon API, a third party Web site can post direct links to Amazon products with updated prices and an option to "buy now."
An API is a software-to-software interface, not a user interface. With APIs, applications talk to each other without any user knowledge or intervention. When you buy movie tickets online and enter your credit card information, the movie ticket Web site uses an API to send your credit card information to a remote application that verifies whether your information is correct. Once payment is confirmed, the remote application sends a response back to the movie ticket Web site saying it's OK to issue the tickets.
As a user, you only see one interface -- the movie ticket Web site -- but behind the scenes, many applications are working together using APIs. This type of integration is called seamless, since the user never notices when software functions are handed from one application to another.
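To make that concrete, here is a tiny sketch of software talking to software over an API: the program sends an HTTP request and gets structured data back, with no user interface involved. The GitHub API is used only as an arbitrary public example.

```python
# Call a public web API and read the structured (JSON) response.
import requests

resp = requests.get("https://api.github.com/repos/python/cpython")
resp.raise_for_status()
repo = resp.json()

print(repo["full_name"])          # "python/cpython"
print(repo["stargazers_count"])   # current star count (changes over time)
```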
This article shows an example
http://www.codeproject.com/Tips/127316/Integrate-FB-javascript-API-to-your-asp-net-app-to

Can you run your own reCaptcha service in your own web app?

Is the backend used by reCaptcha open source? Is it a simple web app that can be deployed in a given container?
Thanks,
LES
It's a web service. It is supplied by a third party.
You can integrate it into your application, but as far as the source code goes, no. Its value is not in the source code but in the images that are supplied. They're not randomly generated but come from scanned books, specifically the parts an OCR system failed to process. So by solving reCaptcha, people are actually helping digitize books. Somebody takes care of the scanning process and supplies a constant flow of new challenges. Hard to beat.
Running reCaptcha on your own server would be very cumbersome, as it requires a constant supply of image data (scanned books) to work. It would also partly defeat the purpose, which is digitizing books for the common good. Besides, I don't think the backend is even available.
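In practice your app only embeds the widget and then verifies the user's answer server-side against the verification endpoint. A minimal sketch of that verification step, shown here against the current Google siteverify endpoint (the secret key is a placeholder):

```python
# Server-side verification of a reCAPTCHA response token.
# The widget posts a "g-recaptcha-response" field with the form; the backend
# forwards it to the siteverify endpoint and checks the "success" flag.
import requests

RECAPTCHA_SECRET = "your-secret-key"  # placeholder

def is_human(recaptcha_response_token, remote_ip=None):
    payload = {"secret": RECAPTCHA_SECRET, "response": recaptcha_response_token}
    if remote_ip:
        payload["remoteip"] = remote_ip
    resp = requests.post("https://www.google.com/recaptcha/api/siteverify", data=payload)
    return resp.json().get("success", False)
```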
This should be able to answer all of your questions for you: recaptcha

Are there any Tools to Automate Migration from a Website to Amazon Webstore?

We have a current website which shows the products, shopping cart and the whole shebang, and we're trying to migrate to Amazon Webstore; not the cloud computing architecture type, but the basic HTML web store offering.
Based on the tools that I see are available for Amazon Webstore, there seems to be a lot of HTML editing, Excel file editing, etc.
I can see that some of the process might get automated via Perl scripting and the like, but that still leaves a lot of work to be done.
So I was wondering if anybody else has done this before and knows of a better way to automate or set up a process, so that any future work can be reduced to as much automation as possible.
Thanks!!
Edit
In case somebody is as lost as I am: I was able to find some links on best practices, and found out that there is actually an Amazon Seller Desktop and an AMTU tool which can use XML or SOAP to handle some automation on the inventory management side.
Amazon offers APIs for automated integration (although not all Webstore features are supported through these APIs).
The Marketplace Web Services API (mws.amazon.com) is well documented, and example client libraries are provided for Java, C# and PHP. For feed and report formats, you should refer to the Seller Central Help pages for all formats, or to http://g-ecx.images-amazon.com/images/G/01/rainier/help/tutorials/SOA-GuideToXML.pdf for more detailed documentation about the XML formats.
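As a rough illustration of the feed-based automation, here is a sketch of a minimal XML product feed built as a Python string; the SKU, merchant identifier and other values are placeholders, and the XML guide linked above is the authoritative reference for the schema. The resulting document would be submitted through the MWS Feeds API (SubmitFeed operation), for example via one of the client libraries mentioned above.

```python
# Sketch of a minimal product feed for the MWS SubmitFeed operation.
# All identifiers and values are placeholders; see the XML guide above
# for the full schema.
product_feed = """<?xml version="1.0" encoding="UTF-8"?>
<AmazonEnvelope>
  <Header>
    <DocumentVersion>1.01</DocumentVersion>
    <MerchantIdentifier>YOUR_MERCHANT_ID</MerchantIdentifier>
  </Header>
  <MessageType>Product</MessageType>
  <Message>
    <MessageID>1</MessageID>
    <OperationType>Update</OperationType>
    <Product>
      <SKU>EXAMPLE-SKU-001</SKU>
      <DescriptionData>
        <Title>Example Product</Title>
        <Brand>Example Brand</Brand>
      </DescriptionData>
    </Product>
  </Message>
</AmazonEnvelope>"""

# The feed would then be passed to SubmitFeed (feed type _POST_PRODUCT_DATA_
# for product data) using the Java, C# or PHP client library.
```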
For a higher level picture, www.amazonservices.com is a good start.