Reading Hadoop SequenceFiles with Hive

Reading Hadoop SequenceFiles with Hive - hive

I have some mapred data from the Common Crawl that I have stored in a SequenceFile format. I have tried repeatedly to use this data "as is" with Hive so I can query and sample it at various stages. But I always get the following error in my job output:
LazySimpleSerDe: expects either BytesWritable or Text object!
I have even constructed a simpler (and smaller) dataset of [Text, LongWritable] records, but that fails as well. If I output the data to text format and then create a table on that, it works fine:
hive> create external table page_urls_1346823845675
> (pageurl string, xcount bigint)
> location 's3://mybucket/text-parse/1346823845675/';
OK
Time taken: 0.434 seconds
hive> select * from page_urls_1346823845675 limit 10;
OK
http://0-italy.com/tag/package-deals 643 NULL
http://011.hebiichigo.com/d63e83abff92df5f5913827798251276/d1ca3aaf52b41acd68ebb3bf69079bd1.html 9 NULL
http://01fishing.com/fly-fishing-knots/ 3437 NULL
http://01fishing.com/flyin-slab-creek/ 1005 NULL
...
I tried using a custom inputformat:
// My custom input class--very simple
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
public class UrlXCountDataInputFormat extends
SequenceFileInputFormat<Text, LongWritable> { }
I create the table then with:
create external table page_urls_1346823845675_seq
(pageurl string, xcount bigint)
stored as inputformat 'my.package.io.UrlXCountDataInputFormat'
outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
location 's3://mybucket/seq-parse/1346823845675/';
But I still get the same SerDer error.
I'm sure there's something really basic I'm missing here, but I can't seem to get it right. Additionally, I have to be able to parse the SequenceFiles in place (i.e. I can't convert my data to text). So I need to figure out the SequenceFile approach for future portions of my project.
Solution:
As #mark-grover pointed out below, the issue is that Hive ignores the key by default. With only one column (i.e. just the value), the serder was unable to map my second column.
The solution was to use a custom InputFormat that was great deal more complex than what I had used originally. I tracked down one answer at link to a Git about using the keys instead of the values, and then I modified it to suit my needs: take the key and value from an internal SequenceFile.Reader and then combining them into the final BytesWritable. I.e. something like this (from the custom Reader, as that's where all the hard work happens):
// I used generics so I can use this all with
// other output files with just a small amount
// of additional code ...
public abstract class HiveKeyValueSequenceFileReader<K,V> implements RecordReader<K, BytesWritable> {
public synchronized boolean next(K key, BytesWritable value) throws IOException {
if (!more) return false;
long pos = in.getPosition();
V trueValue = (V) ReflectionUtils.newInstance(in.getValueClass(), conf);
boolean remaining = in.next((Writable)key, (Writable)trueValue);
if (remaining) combineKeyValue(key, trueValue, value);
if (pos >= end && in.syncSeen()) {
more = false;
} else {
more = remaining;
}
return more;
}
protected abstract void combineKeyValue(K key, V trueValue, BytesWritable newValue);
}
// from my final implementation
public class UrlXCountDataReader extends HiveKeyValueSequenceFileReader<Text,LongWritable>
#Override
protected void combineKeyValue(Text key, LongWritable trueValue, BytesWritable newValue) {
// TODO I think we need to use straight bytes--I'm not sure this works?
StringBuilder builder = new StringBuilder();
builder.append(key);
builder.append('\001');
builder.append(trueValue.get());
newValue.set(new BytesWritable(builder.toString().getBytes()) );
}
}
With that, I get all my columns!
http://0-italy.com/tag/package-deals 643
http://011.hebiichigo.com/d63e83abff92df5f5913827798251276/d1ca3aaf52b41acd68ebb3bf69079bd1.html 9
http://01fishing.com/fly-fishing-knots/ 3437
http://01fishing.com/flyin-slab-creek/ 1005
http://01fishing.com/pflueger-1195x-automatic-fly-reels/ 1999

Not sure if this is impacting you but Hive ignores keys when reading SequenceFiles. You may need to create a custom InputFormat (unless you can find one online:-))
Reference: http://mail-archives.apache.org/mod_mbox/hive-user/200910.mbox/%3C5573211B-634D-4BB0-9123-E389D90A786C#metaweb.com%3E

Related

Spark JavaPairRDD iteration

How can iterate on JavaPairRDD. I have done a group by and got back a RDD as below JavaPairRDD (Tuple 7 set of Strings and List of Objects)
Now I have to iterate over this RDD and do some calculations like FOR EACH in Pig.
Basically I would like to iterate the key and the list of values and do some operations and then return back a JavaPairRDD?
JavaPairRDD<Tuple7<String, String,String,String,String,String,String>, List<Records>> sizes =
piTagRecordData.groupBy( new Function<Records, Tuple7<String, String,String,String,String,String,String>>() {
private static final long serialVersionUID = 2885738359644652208L;
#Override
public Tuple7<String, String,String,String,String,String,String> call(Records row) throws Exception {
Tuple7<String, String,String,String,String,String,String> compositeKey = new Tuple7<String, String, String, String, String, String, String>(row.getAsset_attribute_id(),row.getDate_time_value(),row.getOperation(),row.getPi_tag_count(),row.getAsset_id(),row.getAttr_name(),row.getCalculation_type());
return compositeKey;
}
});
After this I want to perform FOR EACH member of sizes (JavaPairRDD), operation -- something like
rejected_records = FOREACH sizes GENERATE FLATTEN(Java function on the List of Records based on the group key
I am using Spark 0.9.0

Even though you are talking about "FOR EACH", it really sounds like you want the flatMap operation, since you want to produce new values and flatten them. This is available for Java RDDs, including a JavaPairRDD.

You can use void foreach(VoidFunction<T> f) method. More info and methods: https://spark.apache.org/docs/1.1.0/api/java/org/apache/spark/api/java/JavaRDDLike.html#foreach(org.apache.spark.api.java.function.VoidFunction)

if you want to view some value of JavaPairRDD, I would do like this
for (Tuple2<String, String> test : pairRdd.take(10)) //or pairRdd.collect()
{
System.out.println(test._1);
System.out.println(test._2);
}
Note:Tuple2 (assuming you have strings inside the JavaPairRDD), change the datatype according to the data type stored in the JavaPairRDD.

In VTD-XML how to add new attribute into tag with existing attributes?

I'm using VTD-XML to update XML files. In this I am trying to get a flexible way of maintaining attributes on an element. So if my original element is:
<MyElement name="myName" existattr="orig" />
I'd like to be able to update it to this:
<MyElement name="myName" existattr="new" newattr="newValue" />
I'm using a Map to manage the attribute/value pairs in my code and when I update the XML I'm doing something like the following:
private XMLModifier xm = new XMLModifier();
xm.bind(vn);
for (String key : attr.keySet()) {
int i = vn.getAttrVal(key);
if (i!=-1) {
xm.updateToken(i, attr.get(key));
} else {
xm.insertAttribute(key+"='"+attr.get(key)+"'");
}
}
vn = xm.outputAndReparse();
This works for updating existing attributes, however when the attribute doesn't already exist, it hits the insert (insertAttribute) and I get "ModifyException"
com.ximpleware.ModifyException: There can be only one insert per offset
at com.ximpleware.XMLModifier.insertBytesAt(XMLModifier.java:341)
at com.ximpleware.XMLModifier.insertAttribute(XMLModifier.java:1833)
My guess is that as I'm not manipulating the offset directly this might be expected. However I can see no function to insert an an attribute at a position in the element (at end).
My suspicion is that I will need to do it at the "offset" level using something like xm.insertBytesAt(int offset, byte[] content) - as this is an area I have needed to get into yet is there a way to calculate the offset at which I can insert (just before the end of the tag)?
Of course I may be mis-using VTD in some way here - if there is a better way of achieving this then happy to be directed.
Thanks

That's an interesting limitation of the API I hadn't encountered yet. It would be great if vtd-xml-author could elaborate on technical details and why this limitation exists.
As a solution to your problem, a simple approach would be to accumulate your key-value pairs to be inserted as a String, and then to insert them in a single call after your for loop has terminated.
I've tested that this works as per your code:
private XMLModifier xm_ = new XMLModifier();
xm.bind(vn);
String insertedAttributes = "";
for (String key : attr.keySet()) {
int i = vn.getAttrVal(key);
if (i!=-1) {
xm.updateToken(i, attr.get(key));
} else {
// Store the key-values to be inserted as attributes
insertedAttributes += " " + key + "='" + attr.get(key) + "'";
}
}
if (!insertedAttributes.equals("")) {
// Insert attributes only once
xm.insertAttribute(insertedAttributes);
}
This will also work if you need to update the attributes of multiple elements, simply nest the above code in while(autoPilot.evalXPath() != -1) and be sure to set insertedAttributes = ""; at the end of each while loop.
Hope this helps.

HibernateException: Errors in named query

When running a particular unit-test, I am getting the exception:
Caused by: org.hibernate.HibernateException: Errors in named queries: UPDATE_NEXT_FIRE_TIME
at org.hibernate.impl.SessionFactoryImpl.<init>(SessionFactoryImpl.java:437)
at org.hibernate.cfg.Configuration.buildSessionFactory(Configuration.java:1385)
at org.hibernate.cfg.AnnotationConfiguration.buildSessionFactory(AnnotationConfiguration.java:954)
at org.hibernate.ejb.Ejb3Configuration.buildEntityManagerFactory(Ejb3Configuration.java:891)
... 44 more
for the named query defined here:
#Entity(name="fireTime")
#Table(name="qrtz_triggers")
#NamedQueries({
#NamedQuery(
name="UPDATE_NEXT_FIRE_TIME",
query= "update fireTime t set t.next_fire_time = :epochTime where t.trigger_name = 'CalculationTrigger'")
})
public class JpaFireTimeUpdaterImpl implements FireTimeUpdater {
#Id
#Column(name="next_fire_time", insertable=true, updatable=true)
private long epochTime;
public JpaFireTimeUpdaterImpl() {}
public JpaFireTimeUpdaterImpl(final long epochTime) {
this.epochTime = epochTime;
}
#Override
public long getEpochTime() {
return this.epochTime;
}
public void setEpochTime(final long epochTime) {
this.epochTime = epochTime;
}
}
After debugging as deep as I could, I've found that the exception occurs in w.statement(hqlAst) in QueryTranslatorImpl:
private HqlSqlWalker analyze(HqlParser parser, String collectionRole) throws QueryException, RecognitionException {
HqlSqlWalker w = new HqlSqlWalker( this, factory, parser, tokenReplacements, collectionRole );
AST hqlAst = parser.getAST();
// Transform the tree.
w.statement( hqlAst );
if ( AST_LOG.isDebugEnabled() ) {
ASTPrinter printer = new ASTPrinter( SqlTokenTypes.class );
AST_LOG.debug( printer.showAsString( w.getAST(), "--- SQL AST ---" ) );
}
w.getParseErrorHandler().throwQueryException();
return w;
}
Is there something wrong with my query or annotations?

NamedQuery should be written with JPQL, but query seems to mix both names of persistent attributes and names of database columns. Names of database columns cannot be used in JPQL.
In this case instead of next_fire_time name of the persistent attribute epochTime should be used. Also trigger_name looks more like name of the database column than name of the persistent attribute, but it seems not to be mapped in your current class at all. After it is mapped, query is as follows:
update fireTime t set t.epochTime = :epochTime
where t.triggerName = 'CalculationTrigger'
If SQL query is preferred, then #NamedNativeQuery should be used instead.
As a side note, JPA 2.0 specification doesn't encourage changing primary key:
The application must not change the value of the primary key[10]. The
behavior is undefined if this occurs.[11]
In general entities are not aware of changed made via JPQL queries. That gets especially interesting when trying to refresh entity that does not exist anymore (because primary key was changed).
Additionally naming is little bit confusing:
Name of the class looks more like name of the service class
than name of the entity.
Starting name of the entity with lower
case letter is rather rare style.
Name of the entity, name of the
table and name of the class do not match too well.

Can I pass parameters to UDFs in Pig script?

I am relatively new to PigScript. I would like to know if there is a way of passing parameters to Java UDFs in Pig?
Here is the scenario:
I have a log file which have different columns (each representing a Primary Key in another table). My task is to get the count of distinct primary key values in the selected column.
I have written a Pig script which does the job of getting the distinct primary keys and counting them.
However, I am now supposed to write a new UDF for each column. Is there a better way to do this? Like if I can pass a row number as parameter to UDF, it avoids the need for me writing multiple UDFs.

The way to do it is by using DEFINE and the constructor of the UDF. So here is an example of a customer "splitter":
REGISTER com.sample.MyUDFs.jar;
DEFINE CommaSplitter com.sample.MySplitter(',');
B = FOREACH A GENERATE f1, CommaSplitter(f2);
Hopefully that conveys the idea.

To pass parameters you do the following in your pigscript:
UDF(document, '$param1', '$param2', '$param3')
edit: Not sure if those params need to be wrappedin ' ' or not
while in your UDF you do:
public class UDF extends EvalFunc<Boolean> {
public Boolean exec(Tuple input) throws IOException {
if (input == null || input.size() == 0)
return false;
FileSystem fs = FileSystem.get(UDFContext.getUDFContext().getJobConf());
String var1 = input.get(1).toString();
InputStream var1In = fs.open(new Path(var1));
String var2 = input.get(2).toString();
InputStream var2In = fs.open(new Path(var2));
String var3 = input.get(3).toString();
InputStream var3In = fs.open(new Path(var3));
return doyourthing(input.get(0).toString());
}
}
for example

Yes, you can pass any parameter in the Tuple parameter input of your UDF:
exec(Tuple input)
and access it using
input.get(index)

error, string or binary data would be truncated when trying to insert

I am running data.bat file with the following lines:
Rem Tis batch file will populate tables
cd\program files\Microsoft SQL Server\MSSQL
osql -U sa -P Password -d MyBusiness -i c:\data.sql
The contents of the data.sql file is:
insert Customers
(CustomerID, CompanyName, Phone)
Values('101','Southwinds','19126602729')
There are 8 more similar lines for adding records.
When I run this with start > run > cmd > c:\data.bat, I get this error message:
1>2>3>4>5>....<1 row affected>
Msg 8152, Level 16, State 4, Server SP1001, Line 1
string or binary data would be truncated.
<1 row affected>
<1 row affected>
<1 row affected>
<1 row affected>
<1 row affected>
<1 row affected>
Also, I am a newbie obviously, but what do Level #, and state # mean, and how do I look up error messages such as the one above: 8152?

From #gmmastros's answer
Whenever you see the message....
string or binary data would be truncated
Think to yourself... The field is NOT big enough to hold my data.
Check the table structure for the customers table. I think you'll find that the length of one or more fields is NOT big enough to hold the data you are trying to insert. For example, if the Phone field is a varchar(8) field, and you try to put 11 characters in to it, you will get this error.

I had this issue although data length was shorter than the field length.
It turned out that the problem was having another log table (for audit trail), filled by a trigger on the main table, where the column size also had to be changed.

In one of the INSERT statements you are attempting to insert a too long string into a string (varchar or nvarchar) column.
If it's not obvious which INSERT is the offender by a mere look at the script, you could count the <1 row affected> lines that occur before the error message. The obtained number plus one gives you the statement number. In your case it seems to be the second INSERT that produces the error.

Just want to contribute with additional information: I had the same issue and it was because of the field wasn't big enough for the incoming data and this thread helped me to solve it (the top answer clarifies it all).
BUT it is very important to know what are the possible reasons that may cause it.
In my case i was creating the table with a field like this:
Select '' as Period, * From Transactions Into #NewTable
Therefore the field "Period" had a length of Zero and causing the Insert operations to fail. I changed it to "XXXXXX" that is the length of the incoming data and it now worked properly (because field now had a lentgh of 6).
I hope this help anyone with same issue :)

Some of your data cannot fit into your database column (small). It is not easy to find what is wrong. If you use C# and Linq2Sql, you can list the field which would be truncated:
First create helper class:
public class SqlTruncationExceptionWithDetails : ArgumentOutOfRangeException
{
public SqlTruncationExceptionWithDetails(System.Data.SqlClient.SqlException inner, DataContext context)
: base(inner.Message + " " + GetSqlTruncationExceptionWithDetailsString(context))
{
}
/// <summary>
/// PArt of code from following link
/// http://stackoverflow.com/questions/3666954/string-or-binary-data-would-be-truncated-linq-exception-cant-find-which-fiel
/// </summary>
/// <param name="context"></param>
/// <returns></returns>
static string GetSqlTruncationExceptionWithDetailsString(DataContext context)
{
StringBuilder sb = new StringBuilder();
foreach (object update in context.GetChangeSet().Updates)
{
FindLongStrings(update, sb);
}
foreach (object insert in context.GetChangeSet().Inserts)
{
FindLongStrings(insert, sb);
}
return sb.ToString();
}
public static void FindLongStrings(object testObject, StringBuilder sb)
{
foreach (var propInfo in testObject.GetType().GetProperties())
{
foreach (System.Data.Linq.Mapping.ColumnAttribute attribute in propInfo.GetCustomAttributes(typeof(System.Data.Linq.Mapping.ColumnAttribute), true))
{
if (attribute.DbType.ToLower().Contains("varchar"))
{
string dbType = attribute.DbType.ToLower();
int numberStartIndex = dbType.IndexOf("varchar(") + 8;
int numberEndIndex = dbType.IndexOf(")", numberStartIndex);
string lengthString = dbType.Substring(numberStartIndex, (numberEndIndex - numberStartIndex));
int maxLength = 0;
int.TryParse(lengthString, out maxLength);
string currentValue = (string)propInfo.GetValue(testObject, null);
if (!string.IsNullOrEmpty(currentValue) && maxLength != 0 && currentValue.Length > maxLength)
{
//string is too long
sb.AppendLine(testObject.GetType().Name + "." + propInfo.Name + " " + currentValue + " Max: " + maxLength);
}
}
}
}
}
}
Then prepare the wrapper for SubmitChanges:
public static class DataContextExtensions
{
public static void SubmitChangesWithDetailException(this DataContext dataContext)
{
//http://stackoverflow.com/questions/3666954/string-or-binary-data-would-be-truncated-linq-exception-cant-find-which-fiel
try
{
//this can failed on data truncation
dataContext.SubmitChanges();
}
catch (SqlException sqlException) //when (sqlException.Message == "String or binary data would be truncated.")
{
if (sqlException.Message == "String or binary data would be truncated.") //only for EN windows - if you are running different window language, invoke the sqlException.getMessage on thread with EN culture
throw new SqlTruncationExceptionWithDetails(sqlException, dataContext);
else
throw;
}
}
}
Prepare global exception handler and log truncation details:
protected void Application_Error(object sender, EventArgs e)
{
Exception ex = Server.GetLastError();
string message = ex.Message;
//TODO - log to file
}
Finally use the code:
Datamodel.SubmitChangesWithDetailException();

Another situation in which you can get this error is the following:
I had the same error and the reason was that in an INSERT statement that received data from an UNION, the order of the columns was different from the original table. If you change the order in #table3 to a, b, c, you will fix the error.
select a, b, c into #table1
from #table0
insert into #table1
select a, b, c from #table2
union
select a, c, b from #table3

on sql server you can use SET ANSI_WARNINGS OFF like this:
using (SqlConnection conn = new SqlConnection("Data Source=XRAYGOAT\\SQLEXPRESS;Initial Catalog='Healthy Care';Integrated Security=True"))
{
conn.Open();
using (var trans = conn.BeginTransaction())
{
try
{
using cmd = new SqlCommand("", conn, trans))
{
cmd.CommandText = "SET ANSI_WARNINGS OFF";
cmd.ExecuteNonQuery();
cmd.CommandText = "YOUR INSERT HERE";
cmd.ExecuteNonQuery();
cmd.Parameters.Clear();
cmd.CommandText = "SET ANSI_WARNINGS ON";
cmd.ExecuteNonQuery();
trans.Commit();
}
}
catch (Exception)
{
trans.Rollback();
}
}
conn.Close();
}

I had the same issue. The length of my column was too short.
What you can do is either increase the length or shorten the text you want to put in the database.

Also had this problem occurring on the web application surface.
Eventually found out that the same error message comes from the SQL update statement in the specific table.
Finally then figured out that the column definition in the relating history table(s) did not map the original table column length of nvarchar types in some specific cases.

I had the same problem, even after increasing the size of the problematic columns in the table.
tl;dr: The length of the matching columns in corresponding Table Types may also need to be increased.
In my case, the error was coming from the Data Export service in Microsoft Dynamics CRM, which allows CRM data to be synced to an SQL Server DB or Azure SQL DB.
After a lengthy investigation, I concluded that the Data Export service must be using Table-Valued Parameters:
You can use table-valued parameters to send multiple rows of data to a Transact-SQL statement or a routine, such as a stored procedure or function, without creating a temporary table or many parameters.
As you can see in the documentation above, Table Types are used to create the data ingestion procedure:
CREATE TYPE LocationTableType AS TABLE (...);
CREATE PROCEDURE dbo.usp_InsertProductionLocation
#TVP LocationTableType READONLY
Unfortunately, there is no way to alter a Table Type, so it has to be dropped & recreated entirely. Since my table has over 300 fields (😱), I created a query to facilitate the creation of the corresponding Table Type based on the table's columns definition (just replace [table_name] with your table's name):
SELECT 'CREATE TYPE [table_name]Type AS TABLE (' + STRING_AGG(CAST(field AS VARCHAR(max)), ',' + CHAR(10)) + ');' AS create_type
FROM (
SELECT TOP 5000 COLUMN_NAME + ' ' + DATA_TYPE
+ IIF(CHARACTER_MAXIMUM_LENGTH IS NULL, '', CONCAT('(', IIF(CHARACTER_MAXIMUM_LENGTH = -1, 'max', CONCAT(CHARACTER_MAXIMUM_LENGTH,'')), ')'))
+ IIF(DATA_TYPE = 'decimal', CONCAT('(', NUMERIC_PRECISION, ',', NUMERIC_SCALE, ')'), '')
AS field
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = '[table_name]'
ORDER BY ORDINAL_POSITION) AS T;
After updating the Table Type, the Data Export service started functioning properly once again! :)

When I tried to execute my stored procedure I had the same problem because the size of the column that I need to add some data is shorter than the data I want to add.
You can increase the size of the column data type or reduce the length of your data.

A 2016/2017 update will show you the bad value and column.
A new trace flag will swap the old error for a new 2628 error and will print out the column and offending value. Traceflag 460 is available in the latest cumulative update for 2016 and 2017:
https://support.microsoft.com/en-sg/help/4468101/optional-replacement-for-string-or-binary-data-would-be-truncated
Just make sure that after you've installed the CU that you enable the trace flag, either globally/permanently on the server:
...or with DBCC TRACEON:
https://learn.microsoft.com/en-us/sql/t-sql/database-console-commands/dbcc-traceon-trace-flags-transact-sql?view=sql-server-ver15

Another situation, in which this error may occur is in
SQL Server Management Studio. If you have "text" or "ntext" fields in your table,
no matter what kind of field you are updating (for example bit or integer).
Seems that the Studio does not load entire "ntext" fields and also updates ALL fields instead of the modified one.
To solve the problem, exclude "text" or "ntext" fields from the query in Management Studio

This Error Comes only When any of your field length is greater than the field length specified in sql server database table structure.
To overcome this issue you have to reduce the length of the field Value .
Or to increase the length of database table field .

If someone is encountering this error in a C# application, I have created a simple way of finding offending fields by:
Getting the column width of all the columns of a table where we're trying to make this insert/ update. (I'm getting this info directly from the database.)
Comparing the column widths to the width of the values we're trying to insert/ update.
Assumptions/ Limitations:
The column names of the table in the database match with the C# entity fields. For eg: If you have a column like this in database:
You need to have your Entity with the same column name:
public class SomeTable
{
// Other fields
public string SourceData { get; set; }
}
You're inserting/ updating 1 entity at a time. It'll be clearer in the demo code below. (If you're doing bulk inserts/ updates, you might want to either modify it or use some other solution.)
Step 1:
Get the column width of all the columns directly from the database:
// For this, I took help from Microsoft docs website:
// https://learn.microsoft.com/en-us/dotnet/api/system.data.sqlclient.sqlconnection.getschema?view=netframework-4.7.2#System_Data_SqlClient_SqlConnection_GetSchema_System_String_System_String___
private static Dictionary<string, int> GetColumnSizesOfTableFromDatabase(string tableName, string connectionString)
{
var columnSizes = new Dictionary<string, int>();
using (var connection = new SqlConnection(connectionString))
{
// Connect to the database then retrieve the schema information.
connection.Open();
// You can specify the Catalog, Schema, Table Name, Column Name to get the specified column(s).
// You can use four restrictions for Column, so you should create a 4 members array.
String[] columnRestrictions = new String[4];
// For the array, 0-member represents Catalog; 1-member represents Schema;
// 2-member represents Table Name; 3-member represents Column Name.
// Now we specify the Table_Name and Column_Name of the columns what we want to get schema information.
columnRestrictions[2] = tableName;
DataTable allColumnsSchemaTable = connection.GetSchema("Columns", columnRestrictions);
foreach (DataRow row in allColumnsSchemaTable.Rows)
{
var columnName = row.Field<string>("COLUMN_NAME");
//var dataType = row.Field<string>("DATA_TYPE");
var characterMaxLength = row.Field<int?>("CHARACTER_MAXIMUM_LENGTH");
// I'm only capturing columns whose Datatype is "varchar" or "char", i.e. their CHARACTER_MAXIMUM_LENGTH won't be null.
if(characterMaxLength != null)
{
columnSizes.Add(columnName, characterMaxLength.Value);
}
}
connection.Close();
}
return columnSizes;
}
Step 2:
Compare the column widths with the width of the values we're trying to insert/ update:
public static Dictionary<string, string> FindLongBinaryOrStringFields<T>(T entity, string connectionString)
{
var tableName = typeof(T).Name;
Dictionary<string, string> longFields = new Dictionary<string, string>();
var objectProperties = GetProperties(entity);
//var fieldNames = objectProperties.Select(p => p.Name).ToList();
var actualDatabaseColumnSizes = GetColumnSizesOfTableFromDatabase(tableName, connectionString);
foreach (var dbColumn in actualDatabaseColumnSizes)
{
var maxLengthOfThisColumn = dbColumn.Value;
var currentValueOfThisField = objectProperties.Where(f => f.Name == dbColumn.Key).First()?.GetValue(entity, null)?.ToString();
if (!string.IsNullOrEmpty(currentValueOfThisField) && currentValueOfThisField.Length > maxLengthOfThisColumn)
{
longFields.Add(dbColumn.Key, $"'{dbColumn.Key}' column cannot take the value of '{currentValueOfThisField}' because the max length it can take is {maxLengthOfThisColumn}.");
}
}
return longFields;
}
public static List<PropertyInfo> GetProperties<T>(T entity)
{
//The DeclaredOnly flag makes sure you only get properties of the object, not from the classes it derives from.
var properties = entity.GetType()
.GetProperties(System.Reflection.BindingFlags.Public
| System.Reflection.BindingFlags.Instance
| System.Reflection.BindingFlags.DeclaredOnly)
.ToList();
return properties;
}
Demo:
Let's say we're trying to insert someTableEntity of SomeTable class that is modeled in our app like so:
public class SomeTable
{
[Key]
public long TicketID { get; set; }
public string SourceData { get; set; }
}
And it's inside our SomeDbContext like so:
public class SomeDbContext : DbContext
{
public DbSet<SomeTable> SomeTables { get; set; }
}
This table in Db has SourceData field as varchar(16) like so:
Now we'll try to insert value that is longer than 16 characters into this field and capture this information:
public void SaveSomeTableEntity()
{
var connectionString = "server=SERVER_NAME;database=DB_NAME;User ID=SOME_ID;Password=SOME_PASSWORD;Connection Timeout=200";
using (var context = new SomeDbContext(connectionString))
{
var someTableEntity = new SomeTable()
{
SourceData = "Blah-Blah-Blah-Blah-Blah-Blah"
};
context.SomeTables.Add(someTableEntity);
try
{
context.SaveChanges();
}
catch (Exception ex)
{
if (ex.GetBaseException().Message == "String or binary data would be truncated.\r\nThe statement has been terminated.")
{
var badFieldsReport = "";
List<string> badFields = new List<string>();
// YOU GOT YOUR FIELDS RIGHT HERE:
var longFields = FindLongBinaryOrStringFields(someTableEntity, connectionString);
foreach (var longField in longFields)
{
badFields.Add(longField.Key);
badFieldsReport += longField.Value + "\n";
}
}
else
throw;
}
}
}
The badFieldsReport will have this value:
'SourceData' column cannot take the value of
'Blah-Blah-Blah-Blah-Blah-Blah' because the max length it can take is
16.

Kevin Pope's comment under the accepted answer was what I needed.
The problem, in my case, was that I had triggers defined on my table that would insert update/insert transactions into an audit table, but the audit table had a data type mismatch where a column with VARCHAR(MAX) in the original table was stored as VARCHAR(1) in the audit table, so my triggers were failing when I would insert anything greater than VARCHAR(1) in the original table column and I would get this error message.

I used a different tactic, fields that are allocated 8K in some places. Here only about 50/100 are used.
declare #NVPN_list as table
nvpn varchar(50)
,nvpn_revision varchar(5)
,nvpn_iteration INT
,mpn_lifecycle varchar(30)
,mfr varchar(100)
,mpn varchar(50)
,mpn_revision varchar(5)
,mpn_iteration INT
-- ...
) INSERT INTO #NVPN_LIST
SELECT left(nvpn ,50) as nvpn
,left(nvpn_revision ,10) as nvpn_revision
,nvpn_iteration
,left(mpn_lifecycle ,30)
,left(mfr ,100)
,left(mpn ,50)
,left(mpn_revision ,5)
,mpn_iteration
,left(mfr_order_num ,50)
FROM [DASHBOARD].[dbo].[mpnAttributes] (NOLOCK) mpna
I wanted speed, since I have 1M total records, and load 28K of them.

This error may be due to less field size than your entered data.
For e.g. if you have data type nvarchar(7) and if your value is 'aaaaddddf' then error is shown as:
string or binary data would be truncated

You simply can't beat SQL Server on this.
You can insert into a new table like this:
select foo, bar
into tmp_new_table_to_dispose_later
from my_table
and compare the table definition with the real table you want to insert the data into.
Sometime it's helpful sometimes it's not.
If you try inserting in the final/real table from that temporary table it may just work (due to data conversion working differently than SSMS for example).
Another alternative is to insert the data in chunks, instead of inserting everything immediately you insert with top 1000 and you repeat the process, till you find a chunk with an error. At least you have better visibility on what's not fitting into the table.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Reading Hadoop SequenceFiles with Hive - hive

Not sure if this is impacting you but Hive ignores keys when reading SequenceFiles. You may need to create a custom InputFormat (unless you can find one online:-)) Reference: http://mail-archives.apache.org/mod_mbox/hive-user/200910.mbox/%3C5573211B-634D-4BB0-9123-E389D90A786C#metaweb.com%3E

Related

Spark JavaPairRDD iteration

In VTD-XML how to add new attribute into tag with existing attributes?

HibernateException: Errors in named query

Can I pass parameters to UDFs in Pig script?

error, string or binary data would be truncated when trying to insert

Categories

Resources