After creating a custom CVRPTW XML file I found that optaplanner is not considering the ready and due time for the given solution. After trying with many variations of ready time, due time and service duration, opta planner always calculate the same output.
You can see an example:
At the XML file, we are filling like this:
<VrpTimeWindowedCustomer id="848"><id>2</id><location class="VrpAirLocation" reference="4"/><demand>20620</demand><readyTime>0</readyTime><dueTime>432000</dueTime><serviceDuration>13650</serviceDuration></VrpTimeWindowedCustomer>
You can see the sample XML data here: https://codeshare.io/xZspP
Maybe we are wrong on how to understand the number format?
Related
I'm new to Apache Mahout and working on a classsification problem.
The Problem states:
There exists a set of data in a text file and I need to fetch some or all of the data from the file depending upon the given span of time.
Span of time : Each record would have a Date of transaction.
So, time span would be calculated using the logic (Sys_Date - Transaction_Date).
Thus, output would vary depending upon whether data is required for last month / week / specific number of days.
How can this filtering be achieved using Apache Mahout.
This by itself does not sound like a machine learning problem at all. You want to put your data in a database of some kind and query for records in a date range. Then, you want to do something with that data. This is not something ML tools do.
I haven't been working properly with hadoop yet. But it seems to me that this video should help:
http://www.youtube.com/watch?v=KwW7bQRykHI&feature=player_embedded
After the filtering, you can use result in mahout (for solving the classification problem)
I've built an app with a tiny amount of test data (clues & answers) that works fine. Now I need to think about bringing in a full set of clues & answers, which roughly 180K records (it's a word game). I am worried about speed and memory usage of course. Looking around the intertubes and my library, I have concluded that this is probably a job for core data. Within that approach however, I guess I can bring it in as a csv or as an xml (I can create either one from the raw data using a scripting language). I found some resources about how to handle each case. What I don't know is anything about overall speed and other issues that one might expect in using csv vs xml. The csv file is about 3.6 Mb and the data type is strings.
I know this is dangerously close to a non-question, but I need some advice as either approach requires a large coding commitment. So here are the questions:
For a file of this size and characteristics, would one expect csv or
xml to be a better approach? Is there some other
format/protocol/strategy that would make more sense?
Am I right to focus on core data?
Maybe I should throw some fake code here so the system doesn't keep warning me about asking a subjective question. But I have to try! Thanks for any guidance. Links to discussions appreciated.
As for file size CSV will always be smaller compared to an xml file as it contains only the raw data in ascii format. Consider the following 3 rows and 3 columns.
Column1, Column2, Column3
1, 2, 3
4, 5, 6
7, 8, 9
Compared to it's XML counter part which is not even including schema information in it. It is also in ascii format but the rowX and the ColumnX have to be repeated mutliple times throughout the file. Compression of course could help fix this but I'm guessing even with compression the CSV will still be smaller.
<root>
<row1>
<Column1>1</Column1>
<Column2>2</Column2>
<Column3>3</Column3>
</row1>
<row2>
<Column1>4</Column1>
<Column2>5</Column2>
<Column3>6</Column3>
</row2>
<row3>
<Column1>7</Column1>
<Column2>8</Column2>
<Column3>9</Column3>
</row3>
</root>
As for your other questions sorry I can not help there.
This is large enough that the i/o time difference will be noticeable, and where the CSV is - what? 10x smaller? the processing time difference (whichever is faster) will be negligible compared to the difference in reading it in. And CSV should be faster, outside of I/O too.
Whether to use core data depends on what features of core data you hope to exploit. I'm guessing the only one is query, and it might be worth it for that, although if it's just a simple mapping from clue to answer, you might just want to read the whole thing in from the CSV file into an NSMutableDictionary. Access will be faster.
Does anyone have any experience adding their own data to GapMinder, the really cool software that Hans Rosling uses in his TED talks? I have an array od objects in JSON that would be easy to show in moving bubbles. This would be really cool.
I can see that my Ubuntu box has what looks like data in /opt/Gapminder Desktop/share/assets/graphs/world, but I would need to figure out:
How to add a measure to a graph
How to add a data series
How to set the time range of the data
Identify the measures to follow at each time step
and so on.
Just for the record: if you want to use Gapminder with your own dataset, you have to convert your data in a format suitable to Gapminder. More specifically, looking in the assets/graphs/world, you will have to:
Edit the file overview.xml, which contains the tree structure of all the indicators (just copy/paste an entry and specify your own data);
Convert your data copying the structure of the xml files in that directory (this is the tricky part): you can specify some metadata in the preamble, and then specify your own data series, with something like:
<t1 m="i20,50.0,99.0,1992" d="90.0, ... ,50.0, ..."/> where i20 is the country id, which is followed by the minima and maxima of the series, and the year it refers to.
In my humble opinion, Gapminder is a great app but it definitely needs more work on integration with other datasets. Way better to use Google Motion Chart as you did, or MooGraph (site and doc), which is unfortunately not as great as Gapminder.
#Stefano
the information you provided is very valuable. Is somewhere available a detailed specification of the XML files containing the data?
Anyway, just to enrich your response, I also found that:
overview.xml file
The link between Nations and their IDs is in this file
The structure of the menus for the selection of the indicators is also in the same file (at the bottom) under the section <indicatorCategorization>
The structure of the datafile XML
For each line the year represents the first year of the serie, and then the values follow one per year, comma separated.
Grazie,
Max
I ended up using the google motion chart API. I ended up with this.
I'm working on a Pig script (my first) that loads a large text file. For each record in that text file, the content of one field needs to be sent off to a RESTful service for processing. Nothing needs to be evaluated or filtered. Capture data, send it off and the script doesn't need anything back.
I'm assuming that a UDF is required for this kind of functionality, but I'm new enough to Pig that I don't have a clear picture of what type of function I should build. My best guess would be a Store Function since the data is ultimately getting stored somewhere, but I feel like the amount of guesswork involved in coming to that conclusion is higher than I'd like.
Any insight or guidance would be much appreciated.
Have you had a look to DBStorage which does something similar?
everything = LOAD 'categories.txt' USING PigStorage() AS (category:chararray);
...
STORE ordered INTO RestStorage('https://...');
Having never found even a hint of an answer to this, I decided to move in a different direction. I'm using Pig to load and parse the large file, but then streaming each record that I care about to PHP for additional processing that Pig doesn't seem to have the capability to handle cleanly.
It's still not complete (read: there's a great big, very unhappy bug in the mix), but I think the concept is solid--just need to work out the implementation details.
everything = LOAD 'categories.txt' USING PigStorage() AS (category:chararray);
-- apply filter
-- apply filter
-- ...
-- apply last filter
ordered = ORDER filtered_categories BY category;
streamed = STREAM limited THROUGH `php -nF process_categories.php`;
DUMP streamed;
Looking at the combination of MapReduce and HBase from a data-flow perspective, my problem seems to fit. I have a large set of documents which I want to Map, Combine and Reduce. My previous SQL implementation was to split the task into batch operations, cumulatively storing what would be the result of the Map into table and then performing the equivalent of a reduce. This had the benefit that at any point during execution (or between executions), I had the results of the Map at that point in time.
As I understand it, running this job as a MapReduce would require all of the Map functions to run each time.
My Map functions (and indeed any function) always gives the same output for a given input. There is simply no point in re-calculating output if I don't have to. My input (a set of documents) will be continually growing and I will run my MapReduce operation periodically over the data. Between executions I should only really have to calculate the Map functions for newly added documents.
My data will probably be HBase -> MapReduce -> HBase. Given that Hadoop is a whole ecosystem, it may be able to know that a given function has been applied to a row with a given identity. I'm assuming immutable entries in the HBase table. Does / can Hadoop take account of this?
I'm made aware from the documentation (especially the Cloudera videos) that re-calculation (of potentially redundant data) can be quicker than persisting and retrieving for the class of problem that Hadoop is being used for.
Any comments / answers?
If you're looking to avoid running the Map step each time, break it out as its own step (either by using the IdentityReducer or setting the number of reducers for the job to 0) and run later steps using the output of your map step.
Whether this is actually faster than recomputing from the raw data each time depends on the volume and shape of the input data vs. the output data, how complicated your map step is, etc.
Note that running your mapper on new data sets won't append to previous runs - but you can get around this by using a dated output folder. This is to say that you could store the output of mapping your first batch of files in my_mapper_output/20091101, and the next week's batch in my_mapper_output/20091108, etc. If you want to reduce over the whole set, you should be able to pass in my_mapper_output as the input folder, and catch all of the output sets.
Why not apply your SQL workflow in a different environment? Meaning, add a "processed" column to your input table. When time comes to run a summary, run a pipeline that goes something like:
map (map_function) on (input table filtered by !processed); store into map_outputs either in hbase or simply hdfs.
map (reduce function) on (map_outputs); store into hbase.
You can make life a little easier, assuming you are storing your data in Hbase sorted by insertion date, if you record somewhere timestamps of successful summary runs, and open the filter on inputs that are dated later than last successful summary -- you'll save some significant scanning time.
Here's an interesting presentation that shows how one company architected their workflow (although they do not use Hbase):
http://www.scribd.com/doc/20971412/Hadoop-World-Production-Deep-Dive-with-High-Availability