HSQLDB: How to create a table with size larger than heap space? - hsqldb

I want to create a simple table (id INT PRIMARY KEY, value INT) and insert 16M rows.
If I used MySQL, that would take 128 MB+ of data (MyISAM fixed-row format).
Now I'm using HSQLDB and give it only 128 MB of heap space (i.e. I run with -Xmx128m).
I use the file mode jdbc:hsqldb:file:... and here is my code:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.Collections;
import org.hsqldb.jdbc.JDBCDriver;
public class HSQLDBTest {
private static final int BULK_SIZE = 16384;
private static final int BATCHES = 1024;
public static void main(String[] args) throws Exception {
String sql = "INSERT INTO test (value) VALUES " + String.join(", ", Collections.nCopies(BULK_SIZE, "(?)"));
try (Connection conn = new JDBCDriver().connect("jdbc:hsqldb:file:test", null)) {
conn.createStatement().execute("CREATE TABLE test (id INT NOT NULL PRIMARY KEY IDENTITY, value INT DEFAULT 0 NOT NULL)");
PreparedStatement ps = conn.prepareStatement(sql);
for (int i = 0; i < BATCHES; i ++) {
for (int j = 0; j < BULK_SIZE; j ++) {
ps.setInt(j + 1, j * j);
}
ps.executeUpdate();
System.out.println(i);
}
}
}
}
The insertion stops at about 0.7M rows (i = 40) and throws java.sql.SQLException: java.lang.OutOfMemoryError: GC overhead limit exceeded
Is there any other mode or property that lets HSQLDB create a table larger than the heap space?

Use a cached table. If you don't specify the table type, MEMORY is assumed.
create cached table test (...);
Quote from the manual
CACHED tables are created with the CREATE CACHED TABLE command. Only part of their data or indexes is held in memory, allowing large tables that would otherwise take up to several hundred megabytes of memory.
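For example, a minimal sketch of the change applied to the code in the question (everything else unchanged) would be:
conn.createStatement().execute("CREATE CACHED TABLE test (id INT NOT NULL PRIMARY KEY IDENTITY, value INT DEFAULT 0 NOT NULL)");
Alternatively, if I remember the property name correctly, the connection property hsqldb.default_table_type=cached (e.g. jdbc:hsqldb:file:test;hsqldb.default_table_type=cached) makes CACHED the default table type, so plain CREATE TABLE statements also produce disk-backed tables.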

Related

Android custom keyboard suggestions

I am building a custom keyboard for Android, one that at least supports autocomplete suggestions. To achieve this, I am storing every word that the user types (not in password fields) in a Room database table which has a simple model: the word and its frequency. For showing the suggestions, I am using a Trie which is populated by words from this database table. My query basically orders the table by the frequency of the word and limits the results to 5K (I do not want to overpopulate the Trie; these 5K words can be considered the user's favourite words that he uses often and needs suggestions for). My actual problem is the ORDER BY clause: this is a rapidly growing data set, and sorting, let's say, 0.1M words to get 5K words seems like overkill. How can I rework this approach to improve the efficiency of this entire suggestions logic?
If not already implemented, add an index on the frequency column (@ColumnInfo(index = true)).
Another option could be to add a table that maintains the highest 5k, supported by yet another table (the support table) that has one row, with columns for: the highest frequency (not really required), the lowest frequency currently in the 5k, and a third column for the number of rows currently held. After adding or updating a word you could then work out whether the new/updated word should be added to the 5k table (perhaps a fourth column for the primary key of the lowest row, to facilitate efficient deletion).
So:
if the number currently held is less than 5k, insert or update the 5k table and increment the number currently held in the support table;
otherwise, if the word's frequency is lower than the lowest, skip it;
otherwise, update the 5k table if the word already exists in it;
otherwise, delete the lowest, insert the replacement and then update the lowest accordingly in the support table.
Note that the 5K table would probably only need to store the rowid as a pointer/reference/map to the core table.
The rowid is a column that virtually all tables will have in Room (virtual tables are an exception, as are tables that have the WITHOUT ROWID attribute, but Room does not, as far as I am aware, facilitate WITHOUT ROWID tables).
The rowid can be up to twice as fast to look up as other indexes. I would suggest using @PrimaryKey Long id=null; (Java) or @PrimaryKey var id: Long?=null (Kotlin) and NOT using @PrimaryKey(autoGenerate = true).
autoGenerate = true equates to SQLite's AUTOINCREMENT, about which the SQLite documentation says "The AUTOINCREMENT keyword imposes extra CPU, memory, disk space, and disk I/O overhead and should be avoided if not strictly needed. It is usually not needed."
see https://www.sqlite.org/rowidtable.html, and also https://sqlite.org/autoinc.html
Curiously/funnily, the support table mentioned above isn't that far away from what AUTOINCREMENT does under the hood:
a table (sqlite_sequence) with a row per AUTOINCREMENT table is used; it stores the table name and the highest rowid ever allocated.
Without AUTOINCREMENT, but with <column_name> INTEGER PRIMARY KEY and no value (or null) for the primary key column's value, SQLite generates a value that is 1 greater than max(rowid).
With AUTOINCREMENT/autoGenerate = true, the generated value is 1 greater than the higher of max(rowid) and the value stored for that table in the sqlite_sequence table (hence the overheads).
Of course those overheads will very likely be insignificant in comparison to sorting 0.1M rows.
Demonstration
The following is a demonstration albeit just using a basic Word table as the source.
First the 2 tables (@Entity annotated classes)
Word
@Entity (
indices = {@Index(value = {"word"},unique = true)}
)
class Word {
@PrimaryKey Long wordId=null;
@NonNull
String word;
@ColumnInfo(index = true)
long frequency;
Word(){}
@Ignore
Word(String word, long frequency) {
this.word = word;
this.frequency = frequency;
}
}
WordSubset, aka the table with the 5000 highest-frequency words; it simply has a reference/map/link to the underlying/actual word:-
@Entity(
foreignKeys = {
@ForeignKey(
entity = Word.class,
parentColumns = {"wordId"},
childColumns = {"wordIdMap"},
onDelete = ForeignKey.CASCADE,
onUpdate = ForeignKey.CASCADE
)
}
)
class WordSubset {
public static final long SUBSET_MAX_SIZE = 5000;
@PrimaryKey
long wordIdMap;
WordSubset(){};
@Ignore
WordSubset(long wordIdMap) {
this.wordIdMap = wordIdMap;
}
}
Note the constant SUBSET_MAX_SIZE, hard coded just the once so a single change adjusts it (lowering it after rows have been added may cause issues).
WordSubsetSupport: this will be a single-row table that contains the highest and lowest frequencies (the highest is not really needed), the number of rows in the WordSubset table, and a reference/map to the word with the lowest frequency.
@Entity(
foreignKeys = {
@ForeignKey(
entity = Word.class,
parentColumns = {"wordId"},
childColumns = {"lowestWordIdMap"}
)
}
)
class WordSubsetSupport {
@PrimaryKey
Long wordSubsetSupportId=null;
long highestFrequency;
long lowestFrequency;
long countOfRowsInSubsetTable;
@ColumnInfo(index = true)
long lowestWordIdMap;
WordSubsetSupport(){}
@Ignore
WordSubsetSupport(long highestFrequency, long lowestFrequency, long countOfRowsInSubsetTable, long lowestWordIdMap) {
this.highestFrequency = highestFrequency;
this.lowestFrequency = lowestFrequency;
this.countOfRowsInSubsetTable = countOfRowsInSubsetTable;
this.lowestWordIdMap = lowestWordIdMap;
this.wordSubsetSupportId = 1L;
}
}
For access, an abstract class CombinedDao is used (rather than an interface, as an abstract class in Java allows methods/functions with a body; a Kotlin interface allows these):-
@Dao
abstract class CombinedDao {
@Insert(onConflict = OnConflictStrategy.IGNORE)
abstract long insert(Word word);
@Insert(onConflict = OnConflictStrategy.IGNORE)
abstract long insert(WordSubset wordSubset);
@Insert(onConflict = OnConflictStrategy.IGNORE)
abstract long insert(WordSubsetSupport wordSubsetSupport);
@Query("SELECT * FROM wordsubsetsupport LIMIT 1")
abstract WordSubsetSupport getWordSubsetSupport();
@Query("SELECT count() FROM wordsubsetsupport")
abstract long getWordSubsetSupportCount();
@Query("SELECT countOfRowsInSubsetTable FROM wordsubsetsupport")
abstract long getCountOfRowsInSubsetTable();
@Query("UPDATE wordsubsetsupport SET countOfRowsInSubsetTable=:updatedCount")
abstract void updateCountOfRowsInSubsetTable(long updatedCount);
@Query("UPDATE wordsubsetsupport " +
"SET countOfRowsInSubsetTable = (SELECT count(*) FROM wordsubset), " +
"lowestWordIdMap = (SELECT word.wordId FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId ORDER BY frequency ASC LIMIT 1)," +
"lowestFrequency = (SELECT coalesce(min(frequency),0) FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId)," +
"highestFrequency = (SELECT coalesce(max(frequency),0) FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId)")
abstract void autoUpdateWordSupportTable();
@Query("DELETE FROM wordsubset WHERE wordIdMap= (SELECT wordsubset.wordIdMap FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId ORDER BY frequency ASC LIMIT 1)")
abstract void deleteLowestFrequency();
@Transaction
@Query("")
int addWord(Word word) {
/* try to add the add word, setting the wordId value according to the result.
The result will be the wordId generated (1 or greater) or if the word already exists -1
*/
word.wordId = insert(word);
/* If the word was added and not rejected as a duplicate, then it may need to be added to the WordSubset table */
if (word.wordId > 0) {
/* Are there any rows in the support table? if not then add the very first entry/row */
if (getWordSubsetSupportCount() < 1) {
/* Need to add the word to the subset */
insert(new WordSubset(word.wordId));
/* Can now add the first (and only) row to the support table */
insert(new WordSubsetSupport(word.frequency,word.frequency,1,word.wordId));
autoUpdateWordSupportTable();
return 1;
}
/* If there are less than the maximum number of rows in the subset table then
1) insert the new subset row, and
2) update the support table accordingly
*/
if (getCountOfRowsInSubsetTable() < WordSubset.SUBSET_MAX_SIZE) {
insert(new WordSubset(word.wordId));
autoUpdateWordSupportTable();
return 2;
}
/*
Last case is that the subset table is at the maximum number of rows and
the frequency of the added word is greater than the lowest frequency in the
subset, so
1) the row with the lowest frequency is removed from the subset table and
2) the added word is added to the subset
3) the support table is updated accordingly
*/
if (getCountOfRowsInSubsetTable() >= WordSubset.SUBSET_MAX_SIZE) {
WordSubsetSupport currentWordSubsetSupport = getWordSubsetSupport();
if (word.frequency > currentWordSubsetSupport.lowestFrequency) {
deleteLowestFrequency();
insert(new WordSubset(word.wordId));
autoUpdateWordSupportTable();
return 3;
}
}
return 4; /* indicates word added but does not qualify for addition to the subset */
}
return -1;
}
}
The addWord method/function is the only method that is used as this automatically maintains the WordSubset and the WordSubsetSupport tables.
TheDatabase is a pretty standard @Database annotated class, other than that it allows use of the main thread for the sake of convenience and brevity of the demo:-
@Database( entities = {Word.class,WordSubset.class,WordSubsetSupport.class}, version = TheDatabase.DATABASE_VERSION, exportSchema = false)
abstract class TheDatabase extends RoomDatabase {
abstract CombinedDao getCombinedDao();
private static volatile TheDatabase instance = null;
public static TheDatabase getInstance(Context context) {
if (instance == null) {
instance = Room.databaseBuilder(context,TheDatabase.class,DATABASE_NAME)
.addCallback(cb)
.allowMainThreadQueries()
.build();
}
return instance;
}
private static Callback cb = new Callback() {
@Override
public void onCreate(@NonNull SupportSQLiteDatabase db) {
super.onCreate(db);
}
@Override
public void onOpen(@NonNull SupportSQLiteDatabase db) {
super.onOpen(db);
}
};
public static final String DATABASE_NAME = "the_database.db";
public static final int DATABASE_VERSION = 1;
}
Finally, activity code that randomly generates and adds 10,000 words (or thereabouts, as some could be duplicates), each word having a randomly generated frequency (between 0 and 9999):-
public class MainActivity extends AppCompatActivity {
TheDatabase db;
CombinedDao dao;
@Override
protected void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
setContentView(R.layout.activity_main);
db = TheDatabase.getInstance(this);
dao = db.getCombinedDao();
for (int i=0; i < 10000; i++) {
Word currentWord = generateRandomWord();
Log.d("ADDINGWORD","Adding word " + currentWord.word + " frequency is " + currentWord.frequency);
dao.addWord(currentWord); // add the word that was just logged
}
}
public static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";
private Word generateRandomWord() {
Random r = new Random();
int wordLength = (abs(r.nextInt()) % 24) + 1;
int frequency = abs(r.nextInt()) % 10000;
StringBuilder sb = new StringBuilder();
for (int i=0; i < wordLength; i++) {
int letter = abs(r.nextInt()) % (ALPHABET.length());
sb.append(ALPHABET.substring(letter,letter+1));
}
return new Word(sb.toString(),frequency);
}
}
Obviously the results will differ per run; also, the demo is only really designed to be run once (although it could be run more than once).
After running, inspecting the database with App Inspection showed the following.
The support table (in this instance):
As countOfRowsInSubsetTable is 5000, the subset table has been filled to its capacity/limit.
The highest frequency encountered was 9999 (as could well be expected).
The lowest frequency in the subset is 4690, and that belongs to the word whose wordId is 7412.
The subset table on its own means little as it just contains a map to the actual word, so it's more informative to use a query to look at what it contains.
Such a query shows that the word whose wordId is 7412 is the one with the lowest frequency of 4690 (as expected according to the support table).

Beam Java SDK with TFRecord and Compression GZIP

We're using Beam Java SDK (and Google Cloud Dataflow to run batch jobs) a lot, and we noticed something weird (possibly a bug?) when we tried to use TFRecordIO with Compression.GZIP. We were able to come up with some sample code that can reproduce the errors we face.
To be clear, we are using Beam Java SDK 2.4.
Suppose we have PCollection<byte[]> which can be a PC of proto messages, for instance, in byte[] format.
We usually write this to GCS (Google Cloud Storage) using Base64 encoding (newline delimited Strings) or using TFRecordIO (without compression). We have had no issue reading the data from GCS in this manner for a very long time (2.5+ years for the former and ~1.5 years for the latter).
Recently, we tried TFRecordIO with Compression.GZIP option, and sometimes we get an exception as the data is seen as invalid (while being read). The data itself (the gzip files) is not corrupted, and we've tested various things, and reached the following conclusion.
When a byte[] that is being compressed under TFRecordIO is above a certain threshold (I'd say at or above 8192 bytes), then TFRecordIO.read().withCompression(Compression.GZIP) does not work.
Specifically, it will throw the following exception:
Exception in thread "main" java.lang.IllegalStateException: Invalid data
at org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:444)
at org.apache.beam.sdk.io.TFRecordIO$TFRecordCodec.read(TFRecordIO.java:642)
at org.apache.beam.sdk.io.TFRecordIO$TFRecordSource$TFRecordReader.readNextRecord(TFRecordIO.java:526)
at org.apache.beam.sdk.io.CompressedSource$CompressedReader.readNextRecord(CompressedSource.java:426)
at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.advanceImpl(FileBasedSource.java:473)
at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:468)
at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:261)
at org.apache.beam.runners.direct.BoundedReadEvaluatorFactory$BoundedReadEvaluator.processElement(BoundedReadEvaluatorFactory.java:141)
at org.apache.beam.runners.direct.DirectTransformExecutor.processElements(DirectTransformExecutor.java:161)
at org.apache.beam.runners.direct.DirectTransformExecutor.run(DirectTransformExecutor.java:125)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This can be reproduced easily, so you can refer to the code at the end. You will also see comments about the byte array length (as I tested with various sizes, I concluded that 8192 is the magic number).
So I'm wondering if this is a bug or known issue -- I couldn't find anything close to this on Apache Beam's Issue Tracker here but if there is another forum/site I need to check, please let me know!
If this is indeed a bug, what would be the right channel to report this?
The following code can reproduce the error we have.
A successful run (with parameters 1, 39, 100) would show the following message at the end:
------------ counter metrics from CountDoFn
[counter] plain_base64_proto_array_len: 8126
[counter] plain_base64_proto_in: 1
[counter] plain_base64_proto_val_cnt: 39
[counter] tfrecord_gz_proto_array_len: 8126
[counter] tfrecord_gz_proto_in: 1
[counter] tfrecord_gz_proto_val_cnt: 39
[counter] tfrecord_uncomp_proto_array_len: 8126
[counter] tfrecord_uncomp_proto_in: 1
[counter] tfrecord_uncomp_proto_val_cnt: 39
With parameters (1, 40, 100) which would push the byte array length over 8192, it will throw the said exception.
You can tweak the parameters (inside CreateRandomProtoData DoFn) to see why the length of byte[] being gzipped matters.
It may also help to use the protoc-generated Java class for TestProto used in the main code below. Here it is: gist link
References:
Main Code:
package exp.moloco.dataflow2.compression; // NOTE: Change appropriately.
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Random;
import java.util.TreeMap;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.TFRecordIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.MetricResult;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.metrics.MetricsFilter;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.commons.codec.binary.Base64;
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.protobuf.InvalidProtocolBufferException;
import com.moloco.dataflow.test.StackOverflow.TestProto;
import com.moloco.dataflow2.Main;
// @formatter:off
// This code uses TestProto (java class) that is generated by protoc.
// The message definition is as follows (in proto3, but it shouldn't matter):
// message TestProto {
// int64 count = 1;
// string name = 2;
// repeated string values = 3;
// }
// Note that this code does not depend on whether this proto is used,
// or any other byte[] is used (see the CreateRandomProtoData DoFn later, which generates the data being used in the code).
// We tested both, but are presenting this as a concrete example of how (our) code in production can be affected.
// @formatter:on
public class CompressionTester {
private static final Logger LOG = LoggerFactory.getLogger(CompressionTester.class);
static final List<String> lines = Arrays.asList("some dummy string that will not used in this job.");
// Some GCS buckets where data will be written to.
// %s will be replaced by some timestamped String for easy debugging.
static final String PATH_TO_GCS_PLAIN_BASE64 = Main.SOME_BUCKET + "/comp-test/%s/output-plain-base64";
static final String PATH_TO_GCS_TFRECORD_UNCOMP = Main.SOME_BUCKET + "/comp-test/%s/output-tfrecord-uncompressed";
static final String PATH_TO_GCS_TFRECORD_GZ = Main.SOME_BUCKET + "/comp-test/%s/output-tfrecord-gzip";
// This DoFn reads byte[] which represents a proto message (TestProto).
// It simply counts the number of proto objects it processes
// as well as the number of Strings each proto object contains.
// When the pipeline terminates, the values of the Counters will be printed out.
static class CountDoFn extends DoFn<byte[], TestProto> {
private final Counter protoIn;
private final Counter protoValuesCnt;
private final Counter protoByteArrayLength;
public CountDoFn(String name) {
protoIn = Metrics.counter(this.getClass(), name + "_proto_in");
protoValuesCnt = Metrics.counter(this.getClass(), name + "_proto_val_cnt");
protoByteArrayLength = Metrics.counter(this.getClass(), name + "_proto_array_len");
}
@ProcessElement
public void processElement(ProcessContext c) throws InvalidProtocolBufferException {
protoIn.inc();
TestProto tp = TestProto.parseFrom(c.element());
protoValuesCnt.inc(tp.getValuesCount());
protoByteArrayLength.inc(c.element().length);
}
}
// This DoFn emits a number of TestProto objects as byte[].
// Input to this DoFn is ignored (not used).
// Each TestProto object contains three fields: count (int64), name (string), and values (repeated string).
// The three parameters in the DoFn determine
// (1) the number of proto objects to be generated,
// (2) the number of (repeated) strings to be added to each proto object, and
// (3) the length of (each) string.
// TFRecord with Compression (when reading) fails when the parameters are 1, 40, 100, for instance.
// TFRecord with Compression (when reading) succeeds when the parameters are 1, 39, 100, for instance.
static class CreateRandomProtoData extends DoFn<String, byte[]> {
static final int NUM_PROTOS = 1; // Total number of TestProto objects to be emitted by this DoFn.
static final int NUM_STRINGS = 40; // Total number of strings in each TestProto object ('repeated string').
static final int STRING_LEN = 100; // Length of each string object.
// Returns a random string of length len.
// For debugging purposes, the string only contains upper-case English alphabets.
static String getRandomString(Random rd, int len) {
StringBuffer sb = new StringBuffer();
for (int i = 0; i < len; i++) {
sb.append('A' + (rd.nextInt(26)));
}
return sb.toString();
}
// Returns a randomly generated TestProto object.
// Each string is generated randomly using getRandomString().
static TestProto getRandomProto(Random rd) {
TestProto.Builder tpBuilder = TestProto.newBuilder();
tpBuilder.setCount(rd.nextInt());
tpBuilder.setName(getRandomString(rd, STRING_LEN));
for (int i = 0; i < NUM_STRINGS; i++) {
tpBuilder.addValues(getRandomString(rd, STRING_LEN));
}
return tpBuilder.build();
}
// Emits TestProto objects as byte[].
@ProcessElement
public void processElement(ProcessContext c) {
// For debugging purposes, we set the seed here.
Random rd = new Random();
rd.setSeed(132475);
for (int n = 0; n < NUM_PROTOS; n++) {
byte[] data = getRandomProto(rd).toByteArray();
c.output(data);
// With parameters (1, 39, 100), the array length is 8126. It works fine.
// With parameters (1, 40, 100), the array length is 8329. It breaks TFRecord with GZIP.
System.out.println("\n--------------------------\n");
System.out.println("byte array length = " + data.length);
System.out.println("\n--------------------------\n");
}
}
}
public static void execute() {
PipelineOptions options = PipelineOptionsFactory.create();
options.setJobName("compression-tester");
options.setRunner(DirectRunner.class);
// For debugging purposes, write files under 'gcsSubDir' so we can easily distinguish.
final String gcsSubDir =
String.format("%s-%d", DateTime.now(DateTimeZone.UTC), DateTime.now(DateTimeZone.UTC).getMillis());
// Write PCollection<TestProto> in 3 different ways to GCS.
{
Pipeline pipeline = Pipeline.create(options);
// Create dummy data which is a PCollection of byte arrays (each array representing a proto message).
PCollection<byte[]> data = pipeline.apply(Create.of(lines)).apply(ParDo.of(new CreateRandomProtoData()));
// 1. Write as plain-text with base64 encoding.
data.apply(ParDo.of(new DoFn<byte[], String>() {
@ProcessElement
public void processElement(ProcessContext c) {
c.output(new String(Base64.encodeBase64(c.element())));
}
})).apply(TextIO.write().to(String.format(PATH_TO_GCS_PLAIN_BASE64, gcsSubDir)).withNumShards(1));
// 2. Write as TFRecord.
data.apply(TFRecordIO.write().to(String.format(PATH_TO_GCS_TFRECORD_UNCOMP, gcsSubDir)).withNumShards(1));
// 3. Write as TFRecord-gzip.
data.apply(TFRecordIO.write().withCompression(Compression.GZIP)
.to(String.format(PATH_TO_GCS_TFRECORD_GZ, gcsSubDir)).withNumShards(1));
pipeline.run().waitUntilFinish();
}
LOG.info("-------------------------------------------");
LOG.info(" READ TEST BEGINS ");
LOG.info("-------------------------------------------");
// Read PCollection<TestProto> in 3 different ways from GCS.
{
Pipeline pipeline = Pipeline.create(options);
// 1. Read as plain-text.
pipeline.apply(TextIO.read().from(String.format(PATH_TO_GCS_PLAIN_BASE64, gcsSubDir) + "*"))
.apply(ParDo.of(new DoFn<String, byte[]>() {
@ProcessElement
public void processElement(ProcessContext c) {
c.output(Base64.decodeBase64(c.element()));
}
})).apply("plain-base64", ParDo.of(new CountDoFn("plain_base64")));
// 2. Read as TFRecord -> byte array.
pipeline.apply(TFRecordIO.read().from(String.format(PATH_TO_GCS_TFRECORD_UNCOMP, gcsSubDir) + "*"))
.apply("tfrecord-uncomp", ParDo.of(new CountDoFn("tfrecord_uncomp")));
// 3. Read as TFRecord-gz -> byte array.
// This seems to fail when 'data size' becomes large.
pipeline
.apply(TFRecordIO.read().withCompression(Compression.GZIP)
.from(String.format(PATH_TO_GCS_TFRECORD_GZ, gcsSubDir) + "*"))
.apply("tfrecord_gz", ParDo.of(new CountDoFn("tfrecord_gz")));
// 4. Run pipeline.
PipelineResult res = pipeline.run();
res.waitUntilFinish();
// Check CountDoFn's metrics.
// The numbers should match.
Map<String, Long> counterValues = new TreeMap<String, Long>();
for (MetricResult<Long> counter : res.metrics().queryMetrics(MetricsFilter.builder().build()).counters()) {
counterValues.put(counter.name().name(), counter.committed());
}
StringBuffer sb = new StringBuffer();
sb.append("\n------------ counter metrics from CountDoFn\n");
for (Entry<String, Long> entry : counterValues.entrySet()) {
sb.append(String.format("[counter] %40s: %5d\n", entry.getKey(), entry.getValue()));
}
LOG.info(sb.toString());
}
}
}
This clearly looks like a bug in TFRecordIO. Channel.read() can read fewer bytes than the capacity of the input buffer, and 8192 seems to be the buffer size in GzipCompressorInputStream. I filed https://issues.apache.org/jira/browse/BEAM-5412.
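For context, here is a minimal sketch (not Beam's actual code; ChannelUtil and readFully are illustrative names) of the pattern such a read path needs, i.e. looping until the buffer is filled rather than assuming a single Channel.read() call fills it:
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

final class ChannelUtil {
    // Reads until buf is full or the channel is exhausted; a single read() on a
    // decompressing channel may legitimately return fewer bytes than requested.
    static void readFully(ReadableByteChannel channel, ByteBuffer buf) throws IOException {
        while (buf.hasRemaining()) {
            if (channel.read(buf) == -1) {
                throw new EOFException("channel exhausted before buffer was filled");
            }
        }
    }
}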
It is a bug; please see https://issues.apache.org/jira/browse/BEAM-7695. I have solved it.

What's the term for saving values of calculations instead of recalculating multiple times?

When you have code like this (written in java, but applicable to any similar language):
public static void main(String[] args) {
int total = 0;
for (int i = 0; i < 50; i++)
total += i * doStuff(i % 2); // multiplies i times doStuff(remainder of i / 2)
}
public static int doStuff(int i) {
// Lots of complicated calculations
}
You can see that there's room for improvement. doStuff(i % 2) only returns two different values - one for doStuff(0) on even numbers and one for doStuff(1) on odd numbers. Therefore you're wasting a lot of computation time/power on recalculating those values each time by saying doStuff(i % 2). You can improve like this:
public static void main(String[] args) {
int total = 0;
boolean[] alreadyCalculated = new boolean[2];
int[] results = new int[2];
for (int i = 0; i < 50; i++) {
if (!alreadyCalculated[i % 2]) {
results[i % 2] = doStuff(i % 2);
alreadyCalculated[i % 2] = true;
}
total += i * results[i % 2];
}
}
Now it accesses a stored value instead of recalculating each time. It might seem silly to keep arrays like that, but for cases like looping from, say, i = 0, i < 500 and you're checking i % 32 each time, or something, an array is an elegant approach.
Is there a term for this kind of code optimization? I'd like to read up more on the different forms and the conventions of it but I'm lacking a concise description.
Is there a term for this kind of code optimization?
Yes, there is:
In computing, memoization is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls and returning the cached result when the same inputs occur again.
https://en.wikipedia.org/wiki/Memoization
Common-subexpression-elimination (CSE) is related to this. This case is a combination of that and hoisting a loop-invariant calculation out of a loop.
I'd agree with CBroe that you could call this specific form of caching memoization, especially the way you're implementing it with the clunky alreadyCalculated array. You can optimize that away since you know which calls will be new values and which will be repeats. Normally you'd implement memoization with a static array inside the called function, for the benefit of all callers. Ideally there's a sentinel value you can use to mark entries which don't have a result computed yet, instead of maintaining a separate array for that. Or, for a sparse set of input values, just use a hash (instead of e.g. an array with 2^32 entries).
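For illustration, a minimal sketch of that "memoize inside the called function with a sentinel" idea (MemoDemo, the Integer.MIN_VALUE sentinel and the stand-in doStuff body are my assumptions, matching the stand-in used in the example further down, not code from the question):
public class MemoDemo {
    private static final int SENTINEL = Integer.MIN_VALUE;      // assumed never to be a real result
    private static final int[] cache = {SENTINEL, SENTINEL};    // one slot per possible argument (0 or 1)

    public static int doStuff(int i) {
        if (cache[i] != SENTINEL) return cache[i];  // cached: skip the expensive work
        int result = (i + 5) << 1;                  // stand-in for the expensive calculation
        cache[i] = result;
        return result;
    }

    public static void main(String[] args) {
        int total = 0;
        for (int i = 0; i < 50; i++)
            total += i * doStuff(i % 2);            // callers need no extra bookkeeping
        System.out.println(total);                  // prints 13500 with this stand-in doStuff
    }
}
This keeps the caller identical to the original code while the caching lives entirely inside doStuff.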
You can also avoid the if in the main loop.
public class Optim
{
public static int doStuff(int i) { return (i+5) << 1; }
public static void main(String[] args)
{
int total = 0;
int results[] = new int[2];
// more interesting if we pretend the loop count isn't known to be > 1, so avoiding calling doStuff(1) for n=1 is useful.
// otherwise you'd just do int[] results = { doStuff(0), doStuff(1) };
int n = 50;
for (int i = 0 ; i < Math.min(n, 2) ; i++) {
results[i] = doStuff(i);
total += i * results[i];
}
for (int i = 2; i < n; i++) { // runs zero times if n < 2
total += i * results[i % 2];
}
System.out.print(total);
}
}
Of course, in this case we can optimize a lot further. sum(0..n) = n * (n+1) / 2, so we can use that to get a closed-form (non-looping) solution in terms of doStuff(0) (sum of the even terms) and doStuff(1) (sum of the odd terms). So we only need the two doStuff() results once each, avoiding any need to memoize.
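To make that concrete, here is a quick sketch of the closed-form version (same stand-in doStuff body as above; the even/odd sums are standard arithmetic-series results):
public class ClosedForm {
    public static int doStuff(int i) { return (i+5) << 1; }

    public static void main(String[] args) {
        int n = 50;
        long evenCount = (n + 1) / 2;                 // how many even i in [0, n)
        long oddCount = n / 2;                        // how many odd i in [0, n)
        long sumEven = evenCount * (evenCount - 1);   // 0 + 2 + ... = 2 * (0 + 1 + ... + (evenCount - 1))
        long sumOdd = oddCount * oddCount;            // 1 + 3 + ... = oddCount squared
        long total = sumEven * doStuff(0) + sumOdd * doStuff(1);
        System.out.println(total);                    // 13500, matching the loop versions above
    }
}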

Using memcpy and malloc resulting in corrupted data stream

The code below attempts to save a data stream to a file using fwrite. The example using malloc works, but with the other example the data stream is 70% corrupted. Can someone explain to me why that example is corrupted and how I can remedy it?
short int fwBuffer[1000000];
// short int *fwBuffer[1000000];
unsigned long fwSize[1000000];
// Not Working *********
if (dataFlow) {
size = sizeof(short int)*length*inchannels;
short int tmpbuffer[length*inchannels];
int count = 0;
for (count = 0; count < length*inchannels; count++)
{
tmpbuffer[count] = (short int) (inbuffer[count]);
}
memcpy(&fwBuffer[saveBufferCount], tmpbuffer, sizeof(tmpbuffer));
fwSize[saveBufferCount] = size;
saveBufferCount++;
totalSize += size;
}
// Working ***********
if (dataFlow) {
size = sizeof(short int)*length*inchannels;
short int *tmpbuffer = (short int*)malloc(size);
int count = 0;
for (count = 0; count < length*inchannels; count++)
{
tmpbuffer[count] = (short int) (inbuffer[count]);
}
fwBuffer[saveBufferCount] = tmpbuffer;
fwSize[saveBufferCount] = size;
saveBufferCount++;
totalSize += size;
}
// Write to file ***********
for (int i = 0; i < saveBufferCount; i++) {
if (isRecording && outFile != NULL) {
// fwrite(fwBuffer[i], 1, fwSize[i],outFile);
fwrite(&fwBuffer[i], 1, fwSize[i],outFile);
if (fwBuffer[i] != NULL) {
// free(fwBuffer[i]);
}
fwBuffer[i] = NULL;
}
}
You initialize your size as
size = sizeof(short int) * length * inchannels;
then you declare an array of size
short int tmpbuffer[size];
This is already highly suspect. Why did you include sizeof(short int) into the size and then declare an array of short int elements with that size? The byte size of your array in this case is
sizeof(short int) * sizeof(short int) * length * inchannels
i.e. the sizeof(short int) is factored in twice.
Later you initialize only length * inchannels elements of the array, which is not the entire array, for the reasons described above. But the memcpy that follows still copies the entire array
memcpy(&fwBuffer[saveBufferCount], &tmpbuffer, sizeof (tmpbuffer));
(The tail portion of the copied data is garbage.) I suspect that you are copying sizeof(short int) times more data than was intended. The recipient memory overflows and gets corrupted.
The version based on malloc does not suffer from this problem since malloc-ed memory size is specified in bytes, not in short int-s.
If you want to simulate the malloc behavior in the upper version of the code, you need to declare your tmpbuffer as an array of char elements, not of short int elements.
This has very good chances to crash
short int tmpbuffer[(short int)(size)];
First, size could be too big; but then truncating it and using whatever size results from that is probably not what you want.
Edit: Try to write the whole code without a single cast. Only then does the compiler have a chance to tell you if something is wrong.

generating 9 digit ids without database sequence

I'd like to create 9-digit numeric ids that are unique across machines. I'm currently using a database sequence for this, but am wondering if it could be done without one. The sequences will be used for X12 EDI transactions, so they don't have to be unique forever. Maybe even only unique for 24 hours.
My only idea:
Each server has a 2 digit server identifier.
Each server maintains a file that essentially keeps track of a local sequence.
id = <2 digit server identifier> + <7 digit sequence which wraps>
My biggest problem with this is what to do if the hard-drive fails. I wouldn't know where it left off.
All of my other ideas essentially end up re-creating a centralized database sequence.
Any thoughts?
The following:
{XX}{dd}{HHmm}{N}
Where {XX} is the machine number, {dd} is the day of the month, {HHmm} is the current time (24hr), and {N} is a sequential number.
An HD crash will take more than a minute to recover from, so starting {N} at 0 again is not a problem.
You can also replace {dd} with {ss} for seconds, depending on requirements (uniqueness period vs. requests per minute).
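A rough sketch of how that could look (EdiIdGenerator, SERVER_ID and the AtomicInteger for {N} are my assumptions about how the pieces would be supplied):
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.atomic.AtomicInteger;

final class EdiIdGenerator {
    private static final String SERVER_ID = "01";                  // {XX}: the 2 digit machine number
    private static final AtomicInteger SEQ = new AtomicInteger();  // {N}: local sequential number
    private static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern("ddHHmm");

    static String nextId() {
        int n = SEQ.getAndIncrement() % 10;                        // only one digit is left for {N} in a 9 digit id
        return SERVER_ID + LocalDateTime.now().format(STAMP) + n;  // {XX}{dd}{HHmm}{N} = 2 + 2 + 4 + 1 digits
    }
}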
If the HD fails you can just set a new, unused 2-digit server identifier and be sure that the numbers are unique (for 24 hours at least).
How about generating GUIDs (ensures uniqueness) and then using some sort of hash function to turn the GUID into a 9-digit number?
Just off the top of my head...
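A quick sketch of that idea (GuidToNineDigits and nineDigitId are made-up names; note that hashing down to 9 digits necessarily gives up the GUID's uniqueness guarantee, so collisions become possible):
import java.util.UUID;

final class GuidToNineDigits {
    static long nineDigitId() {
        UUID guid = UUID.randomUUID();                               // effectively unique
        long hash = guid.getMostSignificantBits() ^ guid.getLeastSignificantBits();
        return Math.floorMod(hash, 1_000_000_000L);                  // squeeze into 0..999_999_999
    }
}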
Use a variation on:
md5(uniqid(rand(), true));
Just a thought.
In my recent project I also came across this requirement: to generate an N-digit-long sequence number without any database.
This is actually a good interview question, because there are considerations around performance and software crash recovery. Further reading if interested.
The following code has these features:
Prefix each sequence with a prefix.
Sequence cache like Oracle Sequence.
Most importantly, there is recovery logic to resume sequence from software crash.
Complete implementation attached:
import java.util.concurrent.atomic.AtomicLong;
import org.apache.commons.lang.StringUtils;
/**
* This is a customized Sequence Generator which simulates Oracle DB Sequence Generator. However the master sequence
* is stored locally in the file as there is no access to Oracle database. The output format is "prefix" + number.
* <p>
* <u><b>Sample output:</u></b><br>
* 1. FixLengthIDSequence(null,null,15,0,99,0) will generate 15, 16, ... 99, 00<br>
* 2. FixLengthIDSequence(null,"K",1,1,99,0) will generate K01, K02, ... K99, K01<br>
* 3. FixLengthIDSequence(null,"SG",100,2,9999,100) will generate SG0100, SG0101, ... SG8057, (in case server crashes, the new init value will start from last cache value+1) SG8101, ... SG9999, SG0002<br>
*/
public final class FixLengthIDSequence {
private static String FNAME;
private static String PREFIX;
private static AtomicLong SEQ_ID;
private static long MINVALUE;
private static long MAXVALUE;
private static long CACHEVALUE;
// some internal working values.
private int iMaxLength; // max numeric length excluding prefix, for left padding zeros.
private long lNextSnapshot; // to keep track of when to update sequence value to file.
private static boolean bInit = false; // to enable ShutdownHook routine after program has properly initialized
static {
// Inspiration from http://stackoverflow.com/questions/22416826/sequence-generator-in-java-for-unique-id#35697336.
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
if (bInit) { // Without this, saveToLocal may hit NullPointerException.
saveToLocal(SEQ_ID.longValue());
}
}));
}
/**
* This POJO style constructor should be initialized via Spring Singleton. Otherwise, rewrite this constructor into Singleton design pattern.
*
* @param sFilename This is the absolute file path to store the sequence number. To reset the sequence, this file needs to be removed manually.
* @param prefix The hard-coded identifier.
* @param initvalue
* @param minvalue
* @param maxvalue
* @param cache
* @throws Exception
*/
public FixLengthIDSequence(String sFilename, String prefix, long initvalue, long minvalue, long maxvalue, int cache) throws Exception {
bInit = false;
FNAME = (sFilename==null)?"C:\\Temp\\sequence.txt":sFilename;
PREFIX = (prefix==null)?"":prefix;
SEQ_ID = new AtomicLong(initvalue);
MINVALUE = minvalue;
MAXVALUE = maxvalue; iMaxLength = Long.toString(MAXVALUE).length();
CACHEVALUE = (cache <= 0)?1:cache; lNextSnapshot = roundUpNumberByMultipleValue(initvalue, cache); // Internal cache is always 1, equals no cache.
// If sequence file exists and valid, restore the saved sequence.
java.io.File f = new java.io.File(FNAME);
if (f.exists()) {
String[] saSavedSequence = loadToString().split(",");
if (saSavedSequence.length != 6) {
throw new Exception("Local Sequence file is not valid");
}
PREFIX = saSavedSequence[0];
//SEQ_ID = new AtomicLong(Long.parseLong(saSavedSequence[1])); // savedInitValue
MINVALUE = Long.parseLong(saSavedSequence[2]);
MAXVALUE = Long.parseLong(saSavedSequence[3]); iMaxLength = Long.toString(MAXVALUE).length();
CACHEVALUE = Long.parseLong(saSavedSequence[4]);
lNextSnapshot = Long.parseLong(saSavedSequence[5]);
// For sequence number recovery
// The rule to determine to continue using SEQ_ID or lNextSnapshot as subsequent sequence number:
// If savedInitValue = savedSnapshot, it was saved by ShutdownHook -> use SEQ_ID.
// Else if saveInitValue < savedSnapshot, it was saved by periodic Snapshot -> use lNextSnapshot+1.
if (saSavedSequence[1].equals(saSavedSequence[5])) {
long previousSEQ = Long.parseLong(saSavedSequence[1]);
SEQ_ID = new AtomicLong(previousSEQ);
lNextSnapshot = roundUpNumberByMultipleValue(previousSEQ,CACHEVALUE);
} else {
SEQ_ID = new AtomicLong(lNextSnapshot+1); // SEQ_ID starts fresh from lNextSnapshot+1.
lNextSnapshot = roundUpNumberByMultipleValue(SEQ_ID.longValue(),CACHEVALUE);
}
}
// Catch invalid values.
if (minvalue < 0) {
throw new Exception("MINVALUE cannot be less than 0");
}
if (maxvalue < 0) {
throw new Exception("MAXVALUE cannot be less than 0");
}
if (minvalue >= maxvalue) {
throw new Exception("MINVALUE cannot be greater than MAXVALUE");
}
if (cache >= maxvalue) {
throw new Exception("CACHE value cannot be greater than MAXVALUE");
}
// Save the next Snapshot.
saveToLocal(lNextSnapshot);
bInit = true;
}
/**
* Equivalent to Oracle Sequence nextval.
* @return String because Next Value is usually left padded with zeros, e.g. "00001".
*/
public String nextVal() {
if (SEQ_ID.longValue() > MAXVALUE) {
SEQ_ID.set(MINVALUE);
lNextSnapshot = roundUpNumberByMultipleValue(MINVALUE,CACHEVALUE);
}
if (SEQ_ID.longValue() > lNextSnapshot) {
lNextSnapshot = roundUpNumberByMultipleValue(lNextSnapshot,CACHEVALUE);
saveToLocal(lNextSnapshot);
}
return PREFIX.concat(StringUtils.leftPad(Long.toString(SEQ_ID.getAndIncrement()),iMaxLength,"0"));
}
/**
* Store sequence value into the local file. This routine is called either by Snapshot or ShutdownHook routines.<br>
* If called by Snapshot, currentCount == Snapshot.<br>
* If called by ShutdownHook, currentCount == current SEQ_ID.
* @param currentCount - This value is inserted by either Snapshot or ShutdownHook routines.
*/
private static void saveToLocal (long currentCount) {
try (java.io.Writer w = new java.io.BufferedWriter(new java.io.OutputStreamWriter(new java.io.FileOutputStream(FNAME), "utf-8"))) {
w.write(PREFIX + "," + SEQ_ID.longValue() + "," + MINVALUE + "," + MAXVALUE + "," + CACHEVALUE + "," + currentCount);
w.flush();
} catch (Exception e) {
e.printStackTrace();
}
}
/**
* Load the sequence file content into String.
* @return
*/
private String loadToString() {
try {
return new String(java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(FNAME)));
} catch (Exception e) {
e.printStackTrace();
}
return "";
}
/**
* Utility method to round up num to next multiple value. This method is used to calculate the next cache value.
* <p>
* (Reference: http://stackoverflow.com/questions/18407634/rounding-up-to-the-nearest-hundred)
* <p>
* <u><b>Sample output:</b></u>
* <pre>
* System.out.println(roundUpNumberByMultipleValue(9,10)); = 10
* System.out.println(roundUpNumberByMultipleValue(10,10)); = 20
* System.out.println(roundUpNumberByMultipleValue(19,10)); = 20
* System.out.println(roundUpNumberByMultipleValue(100,10)); = 110
* System.out.println(roundUpNumberByMultipleValue(109,10)); = 110
* System.out.println(roundUpNumberByMultipleValue(110,10)); = 120
* System.out.println(roundUpNumberByMultipleValue(119,10)); = 120
* </pre>
*
* @param num Value must be greater than or equal to 1.
* @param multiple Value must be greater than or equal to 1.
* @return
*/
private long roundUpNumberByMultipleValue(long num, long multiple) {
if (num<=0) num=1;
if (multiple<=0) multiple=1;
if (num % multiple != 0) {
long division = (long) ((num / multiple) + 1);
return division * multiple;
} else {
return num + multiple;
}
}
/**
* Main method for testing purpose.
* @param args
*/
public static void main(String[] args) throws Exception {
//FixLengthIDSequence(Filename, prefix, initvalue, minvalue, maxvalue, cache)
FixLengthIDSequence seq = new FixLengthIDSequence(null,"H",50,1,999,10);
for (int i=0; i<12; i++) {
System.out.println(seq.nextVal());
Thread.sleep(1000);
//if (i==8) { System.exit(0); }
}
}
}
To test the code, let the sequence run normally. You can press Ctrl+C to simulate the server crash. The next sequence number will continue from NextSnapshot+1.
Could you use the first 9 digits of some other source of unique data, like:
a random number
System Time
Uptime
Having thought about it for two seconds, none of those are unique on their own, but you could use them as seed values for hash functions as was suggested in another answer.