Column size of Google BigQuery - google-bigquery

I am populating data from a server into Google BigQuery. One of the attributes in the table is a string with more than 150 characters in it.
For example, "Had reseller test devices in a vehicle with known working device
Set to power cycle, never got green light Checked with cell provider and all SIMs were active all cases the modem appears to be dead,light in all but not green light".
The table in BigQuery gets populated until it hits this specific attribute. When this attribute loads, it does not end up in a single cell; it gets split across different cells and corrupts the table.
Is there any restriction on field size in BigQuery? Any information regarding this would be appreciated.

My guess is that quote and comma characters in the CSV data are confusing the CSV parser. For example, if one of your fields is hello, world, this will look like two separate fields. The way around this is to quote the field, so you'd need "hello, world". This, of course, has problems if you have embedded quotes in the field. For instance, if you wanted a field that said She said, "Hello, world", you would need to either escape the quotes by doubling the internal quotes, as in "She said, ""Hello, world""", or use a different field separator (for instance, |) and drop the quote character (using \0).
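As an aside, here is a minimal illustration in plain Python (nothing BigQuery-specific is assumed): the standard csv module applies exactly this quoting and quote-doubling when writing, and undoes it when reading.

import csv
import io

# Fields containing the separator, embedded quotes, and a newline.
rows = [["id-1", 'She said, "Hello, world"', "Hello\nworld"]]

buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
print(buf.getvalue())
# id-1,"She said, ""Hello, world""","Hello
# world"

# Reading it back recovers the original fields intact.
buf.seek(0)
print(list(csv.reader(buf)) == rows)  # True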
One final complication is if you have embedded newlines in your field. If you have Hello\nworld, you need to set the allow_quoted_newlines option on the load job configuration. The downside is that large files will be slower to import with this option, since they can't be loaded in parallel.
These configuration options are all described here, and can be used via either the web UI or the bq command line shell.
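For reference, here is a minimal sketch of how those options look with the Python client library (google-cloud-bigquery); the table ID and file name are placeholders, and the option names are worth double-checking against the current documentation:

from google.cloud import bigquery

client = bigquery.Client()

# CSV load options matching the discussion above: quoted fields with
# doubled internal quotes, and quoted newlines allowed.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    allow_quoted_newlines=True,    # needed if fields contain newlines
    # field_delimiter="|",         # alternative: a separator not present in the data
    # quote_character="",          # ...and no quote character at all
)

with open("data.csv", "rb") as f:  # placeholder file name
    load_job = client.load_table_from_file(
        f, "my_project.my_dataset.my_table", job_config=job_config
    )
load_job.result()  # wait for completion and surface any errors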

I'm not sure there is a limit imposed, and certainly I have seen string fields with over 8,000 characters.
Can you please clarify, 'When this attribute loads, it does not end up in a single cell; it gets split across different cells and corrupts the table'? Does this happen every time? Could it be associated with certain punctuation?

Related

Why am I getting results when querying a column with LIKE '%text%' even though the matching text is actually in another column

With Firebird 2.5.8 and a table with a dozen blob fields, I get this weird behavior when querying this way:
SELECT *
FROM TABLE
WHERE BLOBFIELD4 LIKE '%SOMETEXT%'
and I get results even though SOMETEXT is actually in a different column, not in BLOBFIELD4 (this happens with every blob column).
What am I missing?
Thanks for the data. I made a few quick tests using the latest IB Expert with Firebird 2.5.5 (what I had on hand).
It seems that you actually have much more data than you might think you have.
First of all, it is a bad, dangerous practice to keep text data in columns marked as CHARSET NONE! Make sure that your columns are marked with some reasonable charset, like Windows 1250 or UTF8. Also make sure that the very CONNECTION of all your applications (including development tools) to the database server has an explicitly defined character set that suits your textual data.
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
Or, if you want those BLOBs to be seen as binary, then explicitly create them as SUB_TYPE BINARY, not SUB_TYPE TEXT.
However, here is a simple script to run on your database.
alter table comm
add NF_VC VARCHAR(4000) CHARACTER SET UTF8,
add NF_BL BLOB SUB_TYPE 1 SEGMENT SIZE 4096 CHARACTER SET UTF8
then
update comm
set nf_vc = '**' || com1 || '**'
then
update comm
set nf_bl = '##' || nf_vc || '##'
Notice that I intentionally force Firebird to do the conversion BLOB -> VARCHAR -> BLOB, just to be on the safe side.
Now check some data.
select id_comm, nf_vc
from comm where
nf_vc containing 'f4le dans 2 ans'
and
select id_comm, nf_bl
from comm where
nf_bl containing 'f4le dans 2 ans'
What do you see now?
In the first picture we see the very mystery: the row is selected, but we cannot see your search pattern, "f4le dans 2 ans", in it.
BUT !!!
Can you see the marks, the double asterisks, the ** ?
Yes, you can, at the beginning! But you cannot see them at the end!!!
That means you DO NOT see the whole text, only some first part of it!
In the second picture you see the very same row, ID=854392, but re-converted back to BLOB and additionally marked with ## at both ends.
Can you see the marks on both start and end?
Can you see your search pattern?
Yes and yes - if you look at the grid row (white).
No and no - if you look at the tooltip (yellow).
So, again, the data you search for - it DOES exist. But you just fail to see it for some reason.
Now, what might be a typical reason for the string not being displayed completely?
It can be a zero-value byte (or several bytes, a UNICODE codepoint), the way the C language marks the end of a string, a convention widely used in Windows and in many libraries and programs. Or maybe some other unusual value (EOF, EOT, -1, etc.) that makes the programs you use falsely detect the end of the text where it has not actually ended yet.
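Here is a tiny sketch in Python, just to illustrate the effect (the blob content is made up): a zero byte splits what a search sees from what a C-style viewer shows.

# Hypothetical blob content: some RTF text, a stray zero byte, then more data
# that naive viewers treat as "past the end of the string".
blob = b"{\\rtf1 visible part \\par}\x00pard hidden: f4le dans 2 ans"

print(b"f4le dans 2 ans" in blob)                  # True -- like Firebird's CONTAINING
print(blob.split(b"\x00", 1)[0])                   # what a C-style viewer displays
print(len(blob), len(blob.split(b"\x00", 1)[0]))   # the stored data is longer than shown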
Look at the two screenshots again: where do the lines start to differ? It is after \viewkind4 ... \par} and before pard. Notice the weird anomaly: that pard should start with a backslash - \ - to be a valid RTF command. But it is instead preceded by something invisible, something blank. What can it be?...
Let us go back to your original query in your comments.
Also, it is bad practice to put important details into comments! They are hard to find there for anyone who was not tracking the story from the very start, and the more comments are added, the harder it gets. The proper avenue would have been to EDIT the question, adding the new data into the question body, and then to add a comment (for notification's sake) saying the question was edited. Please add new data that way in the future.
select id_comm, COM1
from comm where
COM1 containing 'f4le dans 2 ans'
At first glance our fishing ended with nothing: we see text that does not contain your pattern, ending at that very \par}.
But is it so? Switch to binary view, and....
Voila! What is there before the found-lost-found-again pard? There is that very ZERO BYTE I talked about earlier.
So, to wrap it up, here is what happened.
Firebird is correct: the data is found because the data really is there, in the BLOBs.
The applications reading the data are not correct. Confused by the zero byte, they show you only part of the data, not all of it.
The application writing the data might not be correct either. Or the data itself might not be.
How did that zero byte end up there? Why was the RTF structure corrupt, lacking the backslash before pard? Was the data size you passed to the server when inserting that data larger than it should have been, so that some garbage was passed after the meaningful data? Or was the data size correct, but the data contents corrupt before inserting?
Something is fishy there. I do not think the RTF specification explicitly prohibits a zero byte, but having one is very atypical, because it triggers bugs like this in far too many applications and libraries.
P.S. The design of the table, with MANY BLOB columns, seems poor.
"Wide" tables often lead to problems in future development and maintenance.
While it is not the essence of your question, please do think about remaking this table into a narrow one and saving your data as a number of one-BLOB rows.
It will cost you some fixed extra work now, but will probably save you from snowballing problems in the future.

Display 500+ character field from SAP transparent table

As is commonly known, SAP does not recommend using fields of more than 255 characters in transparent tables. One should instead use several 255-character fields, wrap the text in LCHR, LRAW or STRING, or use SO10 texts, etc.
However, while maintaining legacy (and ugly) developments, this problem often arises: how do you view what is stored in a char500 or char1000 field in the database?
The real life scenario:
we have a development where some structure is written to and read from a char1000 field in a transparent table
we know the field structure, and parsing the field through CL_ABAP_CONTAINER_UTILITIES=>FILL_CONTAINER_C or SO_STRUCT_TO_CHAR works fine; all fields are mapped correctly
displaying the field via SE11/SE16/SE16n gives nothing, as the field is truncated to 255 characters (and to 132 in the debugger, AFAIR)
Is there any standard tool, transaction or FM we can use to display such long field?
In the DBA cockpit (ST04), there is a SQL command line where you can enter "native" SQL commands directly and display the result as an ALV view. With a substring function, you can split a field into several sections (e.g. select substr(sql_text,1,100) s1, substr(sql_text,101,100) s2, substr(sql_text,201,100) s3, substr(sql_text,301,100) s4 from dba_hist_sqltext where sql_id = '0cuyjatkcmjf0'). PS: every ALV cell is 128 characters maximum.
I am not sure whether this tool is available for all supported database systems.
There is also an equivalent program named RSDU_EXEC_SQL (in all ABAP-based systems?)
Unfortunately, they won't work for SAP's ersatz tables (clustered tables and so on), as those can be queried only with ABAP "Open SQL".
If you have an ERP system at hand, check out transaction PP01 with infotype 1002. Basically, they store text in tables HRP1002 and HRT1002 and create a special view with a text editor. It looks like this: http://www.sapfunctional.com/HCM/Positions/Page1.13.jpg
In the debugger you can switch the view to e.g. HTML and you should see the whole string, but editing is, as far as I know, limited to a certain number of characters.

Importing File With Field Terminators In Data

I've been given some csv files that I want to turn into tables in a SQL database. However, the genius who created the files used comma delimiters, even though several data fields contain commas. So when I try to BCP the data into the database, I get a whole bunch of errors.
Is there a way that I can escape the commas that aren't field separators? At the moment I'm tempted to write a script to manually replace every comma in each file with a pipe, and then go through and manually change the affected rows back.
The only way to fix this is to write a script or program that fixes the data.
If the bad data is limited to a single field the process should be trivial:
You consume the row from either side up to the count of good delimiters, replacing them with a new, unique delimiter; what remains in the middle is the column with the extra old delimiters, which you just leave as is.
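Here is a minimal sketch of that trivial case in Python, assuming the stray commas all fall in one known column (the column counts and file names are made up for illustration):

# Fix rows where exactly one field (here, the 3rd of 5) may contain commas.
LEFT, RIGHT = 2, 2   # good columns before and after the bad one
NEW_DELIM = "|"

def fix_line(line: str) -> str:
    parts = line.rstrip("\n").split(",")
    left = parts[:LEFT]                                 # consume from the left...
    right = parts[len(parts) - RIGHT:]                  # ...and from the right
    middle = ",".join(parts[LEFT:len(parts) - RIGHT])   # keep the extra commas intact
    return NEW_DELIM.join(left + [middle] + right)

with open("input.csv") as src, open("fixed.csv", "w") as dst:
    for line in src:
        dst.write(fix_line(line) + "\n")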
If instead you have two bad fields straddling good fields, you need more advanced logic. For instance, I once had XML data containing delimiters; I had to parse the XML until I found a terminating tag and then process the other delimiters as needed.

Writing to HDFS messed up the data

I was trying to save the output of a Hive query on HDFS but the data got changed. Any idea?
See below the data and the changed one.
[Correct]: i.stack.imgur.com/DLNTT.png
[Messed up]: i.stack.imgur.com/7WIO3.png
Any feedback would be appreciated.
Thanks in advance.
It looks like you are importing an array into Hive, which is one of the available complex types. Internally, Hive separates the elements in an array with the ASCII character 002. If you consult an ASCII table, you can see that this is the non-printable character "start of text". It looks like your terminal does actually print the non-printable character, and by comparing the two images you can see that 002 does indeed separate every item of your array.
Similarly, Hive separates every column in a row with ASCII 001, and it separates map keys/values and structure fields/values with ASCII 003. These values were chosen because they are unlikely to show up in your data. If you want to change this, you can manually specify delimiters using ROW FORMAT in your CREATE TABLE statement. Be careful, though: if you switch the collection items terminator to something like a comma, then any commas in your input will look like collection terminators to Hive.
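As a small sketch, in plain Python rather than Hive, here is how a line from a default text-format table splits into columns and then into array items with those control characters (the sample row and schema are made up):

# One row of a hypothetical table (name STRING, scores ARRAY<INT>) as Hive's
# default text format stores it: \x01 between columns, \x02 between array items.
raw_line = "alice\x0110\x0220\x0230"

columns = raw_line.split("\x01")
name = columns[0]
scores = [int(x) for x in columns[1].split("\x02")]
print(name, scores)  # alice [10, 20, 30]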
Unless you need to store the data in human-readable form and are sure there is a printable character that will not collide with your terminators, I would leave them as is. If you need to read the HDFS files, you can always run hadoop fs -cat /exampleWarehouseDir/exampleTable/* | tr '\002' '\t' to display the array items separated by tabs. If you write a MapReduce or Pig job against the Hive tables, just be aware of what your delimiters are. Learning how to write and read Hive tables from MapReduce was how I learned about these terminators in the first place. And if you are doing all of your processing in Hive, you shouldn't ever have to worry about what the terminators are (unless they show up in your input data).
Now, this would explain why you would see ASCII 002 popping up if you were reading the file contents off of HDFS, but it looks like you are seeing it from the Hive command-line interface, which should be aware of the collection terminators (and therefore use them to separate the elements of the array instead of printing them). My best guess is that you have specified the schema wrong and the column in question is a string where you meant to make it an array. This would explain why Hive went ahead and printed the ASCII 002s instead of using them as collection terminators.

Testing a CSV - how far should I go?

I'm generating a CSV which contains several rows and columns.
However, when I'm testing said CSV I feel like I am simply repeating, in the test, the code that builds the file, since I'm checking that each and every field is correct.
The question is: is this more sensible than it seems to me, or is there a better way?
A far simpler test is to just import the CSV into a spreadsheet or database and verify that the data is aligned to the proper fields: no extra columns or extra rows, the data selected from the imported recordset is a perfect INTERSECT with the recordset from which the CSV was generated, and so on.
More importantly, I recommend making sure your test data includes common CSV fail scenarios such as the following (a round-trip sketch appears after the list):
Field contains a comma (or whatever your separator character)
Field contains multiple commas (You might think it's the same thing, but I've seen one fail where the other succeeded)
Field contains the new-row character(s)
Field contains characters not in the code page of the CSV file
...to make sure your code is handling them properly.
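Here is a minimal round-trip check along these lines, assuming Python's csv module and unittest (generate_csv is a stand-in for your actual generator):

import csv
import io
import unittest

def generate_csv(rows):
    # Stand-in for the code that actually builds your CSV.
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()

class CsvRoundTripTest(unittest.TestCase):
    def test_awkward_fields_survive_round_trip(self):
        rows = [
            ["plain", "field, with comma", "a,b,c"],        # one and many separators
            ["newline", "line1\nline2", 'quote "inside"'],  # new-row character, quotes
            ["non-ascii", "café ☃", ""],                    # characters outside a narrow code page
        ]
        output = generate_csv(rows)
        parsed = list(csv.reader(io.StringIO(output)))
        self.assertEqual(parsed, rows)  # no extra rows, columns, or mangled fields

if __name__ == "__main__":
    unittest.main()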