I have loaded a 1.5GB CSV file and after it loaded successfully my table size is 250MB. Why is this so? - google-bigquery

In Google BigQuery, I have loaded a 1.5GB CSV file from Google Storage. After it loaded successfully, my table size is 250MB. Why is this so?

Likely because the binary encoding of numbers is more efficient than encoding them as strings. For example, the string "1234567890" takes 10 bytes (at least; 20 bytes if it is UTF-16 encoded), but the same value can be represented by a 4-byte integer, which only takes 4 bytes.
Furthermore, the table in BigQuery can also leave out the separators, because it knows how many bytes wide each field is. That's another byte saved for every comma.
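As a rough Python sketch of that size difference (the 4-byte little-endian packing here is purely illustrative, not BigQuery's actual storage format):

```python
import struct

value = 1234567890

# As CSV text: one byte per ASCII digit, plus a separator comma.
as_text = str(value) + ","
print(len(as_text))    # 11 bytes on the wire

# As a fixed-width binary field: a 4-byte integer, with no
# separator needed because the field width is known.
as_binary = struct.pack("<i", value)
print(len(as_binary))  # 4 bytes
```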

Related

What are packed binary data and unpacked binary data in an ISO 8583 message?

I am new to this field and working on a payment gateway. Please tell me: what is the difference between packed and unpacked binary data used in ISO 8583 messages?
The schema definition files for ISO8583 are available at http://dfdlschemas.github.io/ISO8583. In ISO8583_1993.xsd it says:
* This DFDL schema provides a DFDL model for ISO8583 1993 binary data
* where each bitmap in the message is encoded as 8 bytes of binary data
* (8 bits per byte). The bitmaps are said to be 'packed'.
So, the term "packed" refers to the bitmaps, which can be either packed or unpacked.
In en.wikipedia.org/wiki/ISO_8583#Bitmaps, it says
The bitmap may be transmitted as 8 bytes of binary data, or as 16 hexadecimal characters 0-9, A-F in the ASCII or EBCDIC character sets.
In data structures, packed binary data usually means that more (if not all available) bit combinations are used to encode some values, while unpacked means that some bit combinations remain unused, either to improve readability or to make certain calculations easier (but unpacked data takes more space).
For example, one unsigned byte (8 bits) can encode numbers from 0 to 255. If the numbers are BCD encoded, only numbers from 0 to 99 can be represented, and some bit combinations remain unused. However, it is in some cases easier to base calculations on a BCD encoded number than on a binary encoded number.
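A minimal Python sketch of the difference, using the common BCD convention of one digit per byte (unpacked) versus two digits per byte (packed); the helper names are made up:

```python
def to_unpacked_bcd(number):
    # One decimal digit per byte: the low nibble holds the digit,
    # the high nibble remains unused (zero here).
    return bytes(int(d) for d in number)

def to_packed_bcd(number):
    # Two decimal digits per byte: one in the high nibble, one in the low.
    if len(number) % 2:
        number = "0" + number  # pad to an even digit count
    return bytes((int(a) << 4) | int(b)
                 for a, b in zip(number[::2], number[1::2]))

print(to_unpacked_bcd("42").hex())  # '0402' -> two bytes used
print(to_packed_bcd("42").hex())    # '42'   -> one byte used
```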
In summary, ISO 8583 defines two different encodings:
packed, which is 8 bytes of binary data
unpacked, which is 16 bytes of hexadecimal characters (in two different encodings, but that is another aspect).
One obvious difference is that when you dump this data to a console, you can immediately read the unpacked data as hexadecimal numbers, while the binary encoding will only print some garbage characters, depending on your console, your locale and the font you have installed.
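As a sketch of the two bitmap forms in Python (the set of present fields is arbitrary):

```python
# A bitmap with fields 2, 3, 4, 7 and 11 present (bit 1, the
# secondary-bitmap indicator, is left clear here).
packed = bytes([0b01110010, 0b00100000, 0, 0, 0, 0, 0, 0])
print(len(packed))               # 8 bytes of binary data

unpacked = packed.hex().upper()  # the same bitmap as hex characters
print(unpacked, len(unpacked))   # '7220000000000000', 16 characters
```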

Best Firebird blob size page size relation

I have a small Firebird 2.5 database with a blob field called "note" declared as this:
BLOB SUB_TYPE 1 SEGMENT SIZE 80 CHARACTER SET UTF8
The database page size is:
16384 (which I suspect is too high)
I have run this select in order to discover the average size of the blob fields available:
select avg(octet_length(items.note)) from items
and got this information:
2671
As a beginner, I would like to know your opinion on the better segment size for this blob field and the best database page size (I know that this depends on other information, but I still don't know how to figure it out).
Blobs in Firebird are stored in separate pages of your database. The exact storage format depends on the size of your blob. As described in Blob Internal Storage:
Blobs are created as part of a data row, but because a blob could be
of unlimited length, what is actually stored with the data row is a
blobid, the data for the blob is stored separately on special blob
pages elsewhere in the database.
[..]
A blob page stores data for a blob. For large blobs, the blob page
could actually be a blob pointer page, i.e. be used to store pointers
to other blob pages. For each blob that is created a blob record is
defined, the blob record contains the location of the blob data, and
some information about the blobs contents that will be useful to the
engine when it is trying to retrieve the blob. The blob data could be
stored in three slightly different ways. The storage mechanism is
determined by the size of the blob, and is identified by its level
number (0, 1 or 2). All blobs are initially created as level 0, but
will be transformed to level 1 or 2 as their size increases.
A level 0 blob is a blob that can fit on the same page as the blob
header record, for a data page of 4096 bytes, this would be a blob of
approximately 4052 bytes (Page overhead - slot - blob record header).
In other words, if your average size of blobs is 2671 bytes (and most larger ones are still smaller than +/- 4000 bytes), then likely a page size of 4096 is optimal as it will reduce wasted space from on average 16340 - 2671 = 13669 bytes to 4052 - 2671 = 1381 bytes.
However, for performance itself this is hardly going to matter, and smaller page sizes have other effects that you will need to take into account. For example, a smaller page size will also reduce the maximum size of a CHAR/VARCHAR index key, indexes might become deeper (more levels), and fewer records fit in a single page (or wider records become split over multiple pages).
Without measuring and testing it is hard to say if using 4096 for the page size is the right size for your database.
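The wasted-space arithmetic above can be sketched as follows, assuming roughly 44 bytes of page and record overhead (the figure implied by 4096 - 4052):

```python
# Rough level-0 blob waste per page size, for the average blob size
# measured in the question. The 44-byte overhead is an assumption.
OVERHEAD = 44
AVG_BLOB = 2671

for page_size in (4096, 8192, 16384):
    usable = page_size - OVERHEAD
    wasted = usable - AVG_BLOB
    print(page_size, usable, wasted)
# 4096 pages waste ~1381 bytes per blob; 16384 pages waste ~13669
```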
As to segment sizes: it is a historic artifact that is best ignored (and left off). Sometimes applications or drivers incorrectly assume that blobs need to be written or read in the specified segment size. In those rare cases specifying a larger segment size might improve performance. If you leave it off, Firebird will default to a value of 80.
From Binary Data Types:
Segment Size: Specifying the BLOB segment is throwback to times past,
when applications for working with BLOB data were written in C
(Embedded SQL) with the help of the gpre pre-compiler. Nowadays, it is
effectively irrelevant. The segment size for BLOB data is determined
by the client side and is usually larger than the data page size, in
any case.

How to execute query longer than 32767 characters on Firebird?

I'm developing a Java web application that deals with large amounts of text (HTML code strings encoded using base64), which I need to save in my database. I'm using Firebird 2.0, and every time I try to insert a new record with strings longer than 32767 characters, I receive the following error:
GDS Exception. 335544726. Error reading data from the connection.
I have done some research about it, and apparently this is the character limit for Firebird, both for query strings and records in the database. I have tried a couple of things, like splitting the string in the query and then concatenating the parts, but it didn't work. Does anyone know any workarounds for this issue?
If you need to save large amounts of text data in the database, just use BLOB fields; the size of a VARCHAR field is limited to 32KB.
For better performance you can use binary BLOBs and store zipped data in them.
Firebird query strings are limited to 64 kilobytes in Firebird 2.5 and earlier. The maximum length of a VARCHAR field is 32766 bytes (which means it can only store 8191 characters when using UTF-8!). The maximum size of a row (with blobs counting for 8 bytes) is 64 kilobytes as well.
If you want to store values longer than 32 kilobytes, you need to use a BLOB SUB_TYPE TEXT, and you need to use a prepared statement to set the value.
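A quick Python check of why base64-encoded HTML overruns the limit: base64 inflates data by a factor of 4/3, so even text that fits in a VARCHAR may no longer fit once encoded (the sizes here are illustrative):

```python
import base64

# A ~30,000-character HTML body fits in a VARCHAR, but its base64
# encoding grows by 4/3 and overruns the 32KB limit.
html = "<p>" + "x" * 30000 + "</p>"
encoded = base64.b64encode(html.encode("ascii"))

print(len(html))     # 30007 characters
print(len(encoded))  # 40012 bytes: too long for a VARCHAR, use a BLOB
```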

Most Compact File Storage of Time Stamp and String Pairs

I would like to write a time stamp and string pair to a file in the most compact way possible. I started out writing the string representation of Ticks, then ASCII 31 as a separator, then the string, then a CR.
Then I realised that since Ticks is a long it can be stored in only 8 bytes, so I should convert the ticks to bytes and write those bytes to the file. That's fine, except those timestamp bytes might contain a byte whose value is 31, so my ASCII 31 delimiter is no longer unique.
What is the most compact way to store a timestamp and string pair to file?
Thanks.
Since Ticks has a fixed length of 8 bytes, you can avoid the separator entirely: write the 8 bytes of tick data first, followed by the string, and on reading take the first 8 bytes as the timestamp and the remaining bytes as the string.
:)
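A minimal Python sketch of that fixed-width layout (the function names are made up; .NET ticks fit in a signed 64-bit integer):

```python
import struct

def write_pair(ticks, text):
    # 8 fixed bytes for the timestamp, then the string; no separator
    # is needed because the timestamp width is known.
    return struct.pack("<q", ticks) + text.encode("utf-8")

def read_pair(record):
    (ticks,) = struct.unpack("<q", record[:8])
    return ticks, record[8:].decode("utf-8")

record = write_pair(637_000_000_000_000_000, "hello")
print(len(record))        # 13 bytes: 8 for the ticks + 5 for the text
print(read_pair(record))  # (637000000000000000, 'hello')
```

For multiple pairs in one file, the string would additionally need a length prefix (or a record length), since only the timestamp has a known width.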

What is the VInt in Lucene?

I want to know: what is VInt in Lucene?
I read this article, but I don't understand what it is or where Lucene uses it.
Why doesn't Lucene use a simple integer or big integer?
Thanks.
VInt is extremely space efficient. It can theoretically save up to 75% of the space.
In Lucene, many of the structures are lists of integers: for example, the list of documents for a given term, and the positions (and offsets) of the terms in documents, among others. These lists form the bulk of the Lucene data.
Think of Lucene indices for millions of documents that need tens of GBs of space. Shrinking that by more than half reduces the disk space requirement, and while saving disk space may not be a big win on its own (disk space is cheap), the real gain comes from reduced disk IO: reading VInt data requires less IO than reading fixed-width integers, which automatically translates to better performance.
VInt refers to Lucene's variable-width integer encoding scheme. It encodes integers in one or more bytes, using only the low seven bits of each byte. The high bit is set on every byte except the last, which is how the length is encoded.
For your first question:
A variable-length format for positive integers is defined where the high-order bit of each byte indicates whether more bytes remain to be read. The low-order seven bits are appended as increasingly more significant bits in the resulting integer value. Thus values from zero to 127 may be stored in a single byte, values from 128 to 16,383 may be stored in two bytes, and so on. https://lucene.apache.org/core/3_0_3/fileformats.html.
So, to save a list of n integers you would normally need, e.g., 4*n bytes. With VInt, all numbers under 128 are stored using only 1 byte (and so on), saving a lot of memory.
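A small Python sketch of the variable-length scheme quoted above (this mirrors the description, not Lucene's actual Java implementation):

```python
def write_vint(value):
    # Seven payload bits per byte; the high bit is set on every
    # byte except the last to signal that more bytes follow.
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)

def read_vint(data):
    value, shift = 0, 0
    for b in data:
        value |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            break
    return value

print(write_vint(127).hex())         # '7f'   -> one byte
print(write_vint(128).hex())         # '8001' -> two bytes
print(read_vint(write_vint(16383))) # 16383
```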
VInt provides a compressed representation of integers, and Shashikant's answer already explains the requirements and benefits of compression in Lucene.