How would you format/indent this piece of code?
int ID = Blahs.Add( new Blah( -1, -2, -3) );
or
int ID = Blahs.Add( new Blah(
        1, 2, 3, 55
    )
);
Edit:
My class has lots of parameters actually, so that might affect your response.
I agree with Patrick McElhaney; there is no need to nest it....
Blah aBlah = new Blah( 1, 2, 3, 55 );
int ID = Blahs.Add( aBlah );
There are a couple of small advantages here:
You can set a break point on the second line and inspect 'aBlah'.
Your diffs will be cleaner (changes more obvious) without nesting the statements, e.g. creating the new Blah is in an independent statement from adding it to the list.
I'd go with the one-liner. If the real arguments make one line too long, I would break it up with a variable.
Blah blah = new Blah(1,2,3,55);
int ID = Blahs.Add( blah );
If all the numbers are simply being added to a result, there is no need to comment each number separately; a single comment such as "these numbers are added together" will do. In that case I would write it like this:
int result = Blahs.Add( new Blah(1, 2, 3, 55) );
But if those numbers carry some meaning on their own, where each number could stand for something entirely different (for example, if Blah denotes the type for an inventory item), I would go with
int ID = Blahs.Add( new Blah(
    1,  /* wtf is this */
    2,  /* wtf is this */
    3,  /* wtf is this */
    55  /* and huh */
));
int ID = Blahs.Add
(
    new Blah
    (
        1,  /* When the answer is within this percentage, accept it. */
        2,  /* Initial seed for algorithm */
        3,  /* Maximum threads for calculation */
        55  /* Limit on the number of hours a thread may iterate */
    )
);
or
int ID = Blahs.Add(
    new Blah( 1, 2, 3, 55 )
);
I must confess, though, that 76 times out of 77 I do what you did the first time.
The first way, since you are inlining it anyway.
I would use formatting similar to your first example, but without the redundant spaces around the parentheses:
int id = Blahs.Add(new Blah(-1, -2, -3));
Note that I also wouldn't use an all upper-case variable name in this situation, which often implies something special, like a constant.
Either split it into two lines:
Blah new_Blah = new Blah(-1, -2, -3);
int ID = Blahs.Add(new_Blah);
Or indent the new Blah() call:
int ID = Blahs.Add(
    new Blah(-1, -2, -3)
);
Unless the arguments were long, in which case I'd probably do something like:
int ID = Blahs.Add(new Blah(
    (-1 * 24) + 9,
    -2,
    -3
));
As a slightly more practical example, in Python I quite commonly do either of the following:
myArray.append(
    someFunction(-1, -2, -3)
)

myArray.append(someFunction(
    otherFunction("An Arg"),
    (x**2) + 4,
    something = True
))
One line, unless there's a lot of data. I'd draw the line at about ten items or sixty to seventy columns in total, whichever comes first.
Whatever Eclipse's auto-formatter gives me, so when the next dev works on that code and formats before committing, there aren't weird issues with the diff.
int ID = Blahs.Add(new Blah(1,2,3,55)); // Numbers n such that the set of base 4 digits of n equals the set of base 6 digits of n.
The problem with
Blah aBlah = new Blah( 1, 2, 3, 55 );
int ID = Blahs.Add( aBlah );
is that it messes with your namespace. If you don't need a reference to the Blah you shouldn't create it.
I'd either do it as a one-liner or assign the new Blah to a variable, depending on whether I'll need to reference that Blah directly again.
As for the readability issue, which a couple of answers have addressed by putting each argument on a separate line with comments, I would address it by using named parameters. (Not all languages support named parameters, unfortunately.)
int ID = Blahs.Add(new Blah( foo => -1, bar => -2, baz => -3 ));
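In C#, for instance, named arguments have been supported since version 4.0 (a sketch; foo, bar and baz are invented parameter names, as above):

int ID = Blahs.Add(new Blah(foo: -1, bar: -2, baz: -3));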
I have a table with three columns: Name, Address, City. This table is around a million records long. The name and address fields can probably have duplicates.
An example of duplicate names:
XYZ foundation Coorporation
XYZ foundation Corp
XYZ foundation Co-orporation
Or another example
XYZ Center
XYZ Ctr
An example of duplication in addresses would be
60909 East 34TH STREET BAY #1
60909 East 34TH ST. BAY #1
60909 East 34TH ST. BAY 1
As you can see, the name and address fields are duplicates, but only to the human eye, because we understand abbreviations and short forms. How do I build this into a select statement in SQL Server? If not SQL Server, is there another way to scan and remove such duplicates?
The approach that I used is better suited for surnames, but I used it for company names as well. Most likely it will not work well for addresses.
Stage 1
Add a column to the table that stores a "normalized" company name. In my case I've written a function that populates the column via a trigger. The function has a set of rules, like this:
adds one space at the front and one at the back
replaces single-char symbols ~`!@#$%^&*()=_+[]{}|;':",.<>? with a space (all except / and -)
replaces multi-char tokens with a space: T/A C/- P/L
replaces the single-char symbols - and / with a space
replaces multi-char tokens with a space: PTY PTE INC INCORPORATED LTD LIMITED CO COMPANY MR DR THE AND 'TRADING AS' 'TRADE AS' 'OPERATING AS'
replaces CORPORATION with CORP
trims all leading and trailing spaces
replaces multiple consecutive spaces with a single space
Note: when dealing with multi-char tokens, surround them with spaces.
I looked through my data and made these rules up. Adjust them for your case.
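As a rough illustration only, such a function might look like the T-SQL sketch below (fn_NormalizeName and its abbreviated token list are invented for this example; extend the REPLACE chain with your own rules):

-- Sketch of a normalization UDF; the token list here is abbreviated.
CREATE FUNCTION dbo.fn_NormalizeName (@name NVARCHAR(256))
RETURNS NVARCHAR(256)
AS
BEGIN
    -- pad with spaces so multi-char tokens can be matched as ' TOKEN '
    DECLARE @s NVARCHAR(300) = ' ' + UPPER(@name) + ' ';

    -- punctuation to spaces
    SET @s = REPLACE(REPLACE(REPLACE(@s, ',', ' '), '.', ' '), ';', ' ');

    -- multi-char tokens to a single space
    SET @s = REPLACE(@s, ' PTY ', ' ');
    SET @s = REPLACE(@s, ' LTD ', ' ');
    SET @s = REPLACE(@s, ' LIMITED ', ' ');
    SET @s = REPLACE(@s, ' CORPORATION ', ' CORP ');

    -- collapse runs of spaces, then trim
    WHILE CHARINDEX('  ', @s) > 0
        SET @s = REPLACE(@s, '  ', ' ');

    RETURN LTRIM(RTRIM(@s));
END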
Stage 2
I used the so-called Jaro-Winkler metric to calculate the distance between two normalized company names. I implemented the function that calculates this metric in CLR.
In my case the goal was to check for duplicates as a new entry is added to the system. The user enters a company name, the program normalizes it and calculates the Jaro-Winkler distance between the given name and all existing names. The closer the distance is to 1, the closer the match. The user sees the existing records ordered by relevance and can decide whether the company name just entered already exists in the database or whether a new record should be created.
There are other metrics for fuzzy matching, such as Levenshtein distance. Most likely you'll have to use different metrics for names and addresses, because the types of mistakes are significantly different for them.
SQL Server has built-in functions for fuzzy search, e.g. CONTAINSTABLE, but I didn't use them and I'm not sure whether they are available in Standard edition or only Enterprise:
Returns a table of zero, one, or more rows for those columns containing precise or fuzzy (less precise) matches to single words and phrases, the proximity of words within a certain distance of one another, or weighted matches.
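For illustration only, a CONTAINSTABLE query might look roughly like this (it requires a full-text index on the column; the table dbo.Companies and its columns are assumed here):

-- Sketch: prefix search via full-text; K.[KEY] is the indexed key column.
SELECT T.Name, K.RANK
FROM CONTAINSTABLE(dbo.Companies, Name, '"corp*"') K
JOIN dbo.Companies T ON T.Id = K.[KEY]
ORDER BY K.RANK DESC;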
Note
When I was looking into this topic I came to the conclusion that all these metrics (Jaro-Winkler, Levenshtein, etc.) look for simple mistypes, like a missing/extra letter or two letters swapped. In my case and yours this approach as-is would perform poorly, because you effectively have a dictionary of contractions first, and then on top of that there can be simple mistypes. That's why I ended up doing it in two stages: normalization, then applying the fuzzy-search metric.
To make a list of rules that I mentioned above I made a dictionary of all words that appear in my data. Essentially, take each Name and split it into multiple rows by space. Then group by found tokens and count how many times they appear. Manually look through the list of tokens. This list should not be too long when you remove rare tokens from it. Hopefully common words and contractions would be easy to spot. I would imagine that the word Corporation and "Corp" would appear many times, as opposed to the actual company name XYZ. Those odd mistypes like "Coorporation" should be picked up by the fuzzy metric later.
In a similar way make a separate dictionary for Addresses, where you would see that Street and St. appear many times. For addresses you can "cheat" and get a list of common words from the index of some city map (street/st, road/rd, highway/hwy, grove/gv, etc.)
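A sketch of building such a token dictionary (dbo.MyTable and its Name column are placeholders; the split is done with a recursive CTE):

-- Sketch: split Name on spaces and count how often each token appears.
;WITH Tokens AS
(
    SELECT CAST(LEFT(Name + ' ', CHARINDEX(' ', Name + ' ') - 1) AS NVARCHAR(256)) AS Token,
           CAST(STUFF(Name + ' ', 1, CHARINDEX(' ', Name + ' '), '') AS NVARCHAR(256)) AS Rest
    FROM dbo.MyTable
    UNION ALL
    SELECT CAST(LEFT(Rest, CHARINDEX(' ', Rest) - 1) AS NVARCHAR(256)),
           CAST(STUFF(Rest, 1, CHARINDEX(' ', Rest), '') AS NVARCHAR(256))
    FROM Tokens
    WHERE Rest > ''
)
SELECT Token, COUNT(*) AS Appearances
FROM Tokens
WHERE Token > ''
GROUP BY Token
ORDER BY Appearances DESC
OPTION (MAXRECURSION 0);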
This is my implementation of the Jaro-Winkler metric:
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public partial class UserDefinedFunctions
{
    /*
        The Winkler modification will not be applied unless the percent match
        was at or above the WeightThreshold percent without the modification.
        Winkler's paper used a default value of 0.7
    */
    private static readonly double m_dWeightThreshold = 0.7;

    /*
        Size of the prefix to be considered by the Winkler modification.
        Winkler's paper used a default value of 4
    */
    private static readonly int m_iNumChars = 4;

    [Microsoft.SqlServer.Server.SqlFunction(DataAccess = DataAccessKind.None, SystemDataAccess = SystemDataAccessKind.None, IsDeterministic = true, IsPrecise = true)]
    public static SqlDouble StringSimilarityJaroWinkler(SqlString string1, SqlString string2)
    {
        if (string1.IsNull || string2.IsNull)
        {
            return 0.0;
        }
        return GetStringSimilarityJaroWinkler(string1.Value, string2.Value);
    }

    private static double GetStringSimilarityJaroWinkler(string string1, string string2)
    {
        int iLen1 = string1.Length;
        int iLen2 = string2.Length;
        if (iLen1 == 0)
        {
            return iLen2 == 0 ? 1.0 : 0.0;
        }

        int iSearchRange = Math.Max(0, Math.Max(iLen1, iLen2) / 2 - 1);

        bool[] Matched1 = new bool[iLen1];
        bool[] Matched2 = new bool[iLen2];

        // count the characters common to both strings within the search range
        int iNumCommon = 0;
        for (int i = 0; i < iLen1; ++i)
        {
            int iStart = Math.Max(0, i - iSearchRange);
            int iEnd = Math.Min(i + iSearchRange + 1, iLen2);
            for (int j = iStart; j < iEnd; ++j)
            {
                if (Matched2[j]) continue;
                if (string1[i] != string2[j]) continue;
                Matched1[i] = true;
                Matched2[j] = true;
                ++iNumCommon;
                break;
            }
        }
        if (iNumCommon == 0) return 0.0;

        // count transpositions among the matched characters
        int iNumHalfTransposed = 0;
        int k = 0;
        for (int i = 0; i < iLen1; ++i)
        {
            if (!Matched1[i]) continue;
            while (!Matched2[k])
            {
                ++k;
            }
            if (string1[i] != string2[k])
            {
                ++iNumHalfTransposed;
            }
            ++k;
            // even though the lengths of Matched1 and Matched2 can differ, the
            // number of elements with a true flag is the same in both arrays,
            // so k will never go outside the array boundary
        }
        int iNumTransposed = iNumHalfTransposed / 2;

        // the Jaro distance
        double dWeight =
            (
                (double)iNumCommon / (double)iLen1 +
                (double)iNumCommon / (double)iLen2 +
                (double)(iNumCommon - iNumTransposed) / (double)iNumCommon
            ) / 3.0;

        // the Winkler modification: boost the score for a common prefix
        if (dWeight > m_dWeightThreshold)
        {
            int iComparisonLength = Math.Min(m_iNumChars, Math.Min(iLen1, iLen2));
            int iCommonChars = 0;
            while (iCommonChars < iComparisonLength && string1[iCommonChars] == string2[iCommonChars])
            {
                ++iCommonChars;
            }
            dWeight = dWeight + 0.1 * iCommonChars * (1.0 - dWeight);
        }
        return dWeight;
    }
}
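For reference, this is roughly how the compiled function would be registered and called from T-SQL (the assembly name CompanyNameMatching is assumed):

CREATE FUNCTION dbo.StringSimilarityJaroWinkler
    (@string1 NVARCHAR(4000), @string2 NVARCHAR(4000))
RETURNS FLOAT
AS EXTERNAL NAME CompanyNameMatching.UserDefinedFunctions.StringSimilarityJaroWinkler;
GO

SELECT dbo.StringSimilarityJaroWinkler(N'XYZ FOUNDATION CORP', N'XYZ FOUNDATION COORPORATION');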
You could also look for a more customized solution, for instance together with the DIFFERENCE function (see: DIFFERENCE function, SQL Server).
Is it possible for Name and City to be logically similar yet different, as well?
Since there's a lot of room for variation here and only you have access to the real data, only you can check what works and what kinds of exceptions you have.
But hopefully this will get you started.
-- Creating the test set
DECLARE @TESTTABLE TABLE (Name VARCHAR(256), City VARCHAR(256), Address VARCHAR(256))

INSERT INTO @TESTTABLE VALUES ('Billy bob',  'New York',      'Baker street 125')
INSERT INTO @TESTTABLE VALUES ('Billy bob',  'New York',      'Baker street 120')
INSERT INTO @TESTTABLE VALUES ('Billy bob',  'New York',      'Baker st 125')
INSERT INTO @TESTTABLE VALUES ('Billy bob',  'New York',      'Mallroad 1')
INSERT INTO @TESTTABLE VALUES ('James Dean', 'Washington DC', 'Primadonna road 15 c 100')
INSERT INTO @TESTTABLE VALUES ('James Dean', 'Washington DC', 'Primadonna r 15')
INSERT INTO @TESTTABLE VALUES ('Got Nuttin', 'Philly',        'Mystreet 1500') -- Doesn't show, since no real duplicates
And then, after the test data, the actual query.
-- The query
;WITH CTE AS
(
    SELECT DISTINCT SRC.RN, T1.*, DIFFERENCE(T1.Address, T2.Address) DIFF_FACTOR
    FROM @TESTTABLE T1
    JOIN @TESTTABLE T2 ON T2.Name = T1.Name AND T2.City = T1.City AND T1.Address <> T2.Address
    JOIN (SELECT DENSE_RANK() OVER (ORDER BY Name, City) RN, Name, City
          FROM @TESTTABLE T3
          GROUP BY Name, City
          HAVING COUNT(*) > 1) SRC
      ON SRC.City = T1.City AND SRC.Name = T1.Name
)
SELECT DISTINCT RN, Name, City, COUNT(DISTINCT C.Address) Address_CT
    , STUFF((SELECT ',' + B.Address
             FROM CTE B
             WHERE B.RN = C.RN AND B.DIFF_FACTOR = C.DIFF_FACTOR
             ORDER BY B.Address ASC
             FOR XML PATH('')), 1, 1, '') AllAddresses
    , DIFF_FACTOR
FROM CTE C
WHERE DIFF_FACTOR > 1 -- Comment this row out to see that 'Mallroad 1' is considered too different from the rest; this filter keeps it out of the result set
GROUP BY RN, Name, City, DIFF_FACTOR
ORDER BY RN ASC, DIFF_FACTOR DESC
That is probably not the most effective - or accurate - way to go about this, but it's a good place to start and shows what can be done. If there's a chance for Name and City to also be different yet duplicates to human eyes, you could modify the query to match any two identical column values, comparing the third. But it gets really difficult to automate comparisons in cases where you have one identifying column and both of the others can differ from one another to varying degrees.
I suspect you will need several queries to first sort out the biggest mess, and eventually find the last, most evasive "duplicates" by hand, a few at a time.
I have a table made up of one column that holds a 100-character string in each row. A second column was added to hold the result. I needed to amend certain fixed-position elements and planned to do the following:
UPDATE myData
SET newData = REPLACE(oldData, SUBSTRING(oldData, 16, 2), 'OC')
The element at position 16 (length 2) is '17'. But other parts of the string (not at position 16) that happen to be '17' are getting changed to 'OC' as well.
I'm baffled to understand how this can happen as I'm specifying the exact position of where to make the replacement. What am I doing wrong?
Try STUFF
UPDATE myData
SET newData = STUFF(oldData, 16, 2, 'OC')
Here are a couple of other ways (please test, as the offsets may be off by one):
SET newdata = SUBSTRING(oldData, 1, 15) + 'OC' + SUBSTRING(oldData, 18, LEN(oldData) - 17)
or
SET newdata = LEFT(oldData, 15) + 'OC' + RIGHT(oldData, LEN(oldData) - 17)
I'm looking for the Berkeley DB equivalent of
SELECT COUNT All, SELECT COUNT WHERE LIKE "%...%"
I have got 100 records with keys: 1, 2, 3, ... 100.
I have got the following code:
// Key = 1
i = 1;
strcpy_s(buf, to_string(i).size() + 1, to_string(i).c_str());
key.data = buf;
key.size = to_string(i).size() + 1;
key.flags = 0;
data.data = rbuf;
data.size = sizeof(rbuf) + 1;
data.flags = 0;

// Cursor
if ((ret = dbp->cursor(dbp, NULL, &dbcp, 0)) != 0) {
    dbp->err(dbp, ret, "DB->cursor");
    goto err1;
}

// Get
dbcp->get(dbcp, &key, &data_read, DB_SET_RANGE);

db_recno_t cnt;
dbcp->count(dbcp, &cnt, 0);
cout << "count: " << cnt << endl;
cnt is always 1, but I expected it to count all the partial key matches for key 1: 1, 10, 11, 21, ... 91.
What is wrong in my code or in my understanding of DB_SET_RANGE?
Is it possible to get SELECT COUNT WHERE LIKE "%...%" in BDB?
Also, is it possible to get SELECT COUNT of all records in the file?
Thanks
You're expecting Berkeley DB to be way more high-level than it actually is. It doesn't contain anything like what you're asking for. If you want the equivalent of WHERE field LIKE '%1%' you have to make a cursor, read through all the values in the DB, and do the string comparison yourself to pick out the ones that match. That's what an SQL engine actually does to implement your query, and if you're using libdb instead of an SQL engine, it's up to you. If you want it done faster, you can use a secondary index (much like you can create additional indexes for a table in SQL), but you have to provide some code that links the secondary index to the main DB.
DB_SET_RANGE is useful to optimize a very specific case: you're looking for items whose key starts with a specific substring. You can use DB_SET_RANGE to find the first matching key, then DB_NEXT your way through the matches, and stop when you get a key that doesn't match. This works only on DB_BTREE databases because it depends on the keys being returned in lexical order.
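A minimal sketch of that prefix scan (count_prefix is an invented helper; it assumes string keys and an already-open DB_BTREE handle):

/* Count the keys that begin with the given prefix. */
#include <string.h>
#include <db.h>

int count_prefix(DB *dbp, const char *prefix, db_recno_t *count)
{
    DBC *dbcp;
    DBT key, data;
    size_t plen = strlen(prefix);
    int ret;

    *count = 0;
    if ((ret = dbp->cursor(dbp, NULL, &dbcp, 0)) != 0)
        return ret;

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    key.data = (void *)prefix;
    key.size = (u_int32_t)plen;

    /* DB_SET_RANGE positions on the smallest key >= prefix; DB_NEXT then
       walks forward in lexical order until the prefix no longer matches. */
    for (ret = dbcp->get(dbcp, &key, &data, DB_SET_RANGE);
         ret == 0;
         ret = dbcp->get(dbcp, &key, &data, DB_NEXT)) {
        if (key.size < plen || memcmp(key.data, prefix, plen) != 0)
            break;
        ++*count;
    }

    dbcp->close(dbcp);
    return (ret == 0 || ret == DB_NOTFOUND) ? 0 : ret;
}

Note that in your code the stored keys include the trailing '\0' (key.size is strlen + 1); the comparison above looks only at the first strlen(prefix) bytes, so it still matches those keys.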
The count method tells you how many exact duplicate keys there are for the item at the current cursor position.
You can use the DB->stat() method. For example, to get the number of unique keys in a DB_BTREE database:
bool row_amount(DB *db, size_t &amount) {
    amount = 0;
    if (db == NULL) return false;

    DB_BTREE_STAT *sp;
    int ret = db->stat(db, NULL, &sp, 0);
    if (ret != 0) return false;

    amount = (size_t)sp->bt_nkeys;
    free(sp);  // DB->stat allocates the stat structure; the caller must free it
    return true;
}
I am trying to do multiple inserts into a table. For that I have created a table with an int column and am inserting in a for loop, but I cannot work out how to write the code properly. I need something like this:
for (i = 0; i < 1800; i++)
{
    retcode = SQLPrepare(hstmt, (SQLCHAR *)"insert into dbo.vivtest values(i)", SQL_NTS);
    if (retcode != SQL_SUCCESS)
    {
        printf("Error in SQLPrepare - insert\n");
        odbc_Error(henv, hdbc, hstmt);
        getch();
    }
    else
        printf("Successful execution of %d th Prepare\n", i);
}
It gives me an error every time.
The following is a useful link about SQLPrepare() and how to bind parameters. Binding parameters is a safe way of inserting variable contents into your SQL strings, and a statement prepared this way is also an efficient way of executing the same statement multiple times.
http://msdn.microsoft.com/en-us/library/windows/desktop/ms716365(v=vs.85).aspx
It gives the following example of binding parameters to an SQL query:
SQLPrepare(hstmt, "UPDATE Parts SET Price = ? WHERE PartID = ?", SQL_NTS);
In the SQL string you can see a few ? (question marks). These are placeholders in the SQL string which you can then bind parameters to (i.e., substitute a variable's contents in their place).
To continue the MSDN example...
SQLBindParameter(hstmt, 1, SQL_PARAM_INPUT, SQL_C_FLOAT, SQL_REAL, 7, 0,
                 &Price, 0, &PriceInd);
SQLBindParameter(hstmt, 2, SQL_PARAM_INPUT, SQL_C_ULONG, SQL_INTEGER, 10, 0,
                 &PartID, 0, &PartIDInd);
The first statement replaces the first question mark with a floating point value from the variable Price and the second bind replaces the second question mark with an integer from the PartID variable.
Your prepare statement should probably look something like:
SQLINTEGER iInd;
SQLUINTEGER i;
...
retcode = SQLPrepare(hstmt, (SQLCHAR *)"insert into dbo.vivtest values(?)", SQL_NTS);
...
SQLBindParameter(hstmt, 1, SQL_PARAM_INPUT, SQL_C_ULONG, SQL_INTEGER, 10, 0,
                 &i, 0, &iInd);
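Since the parameter is bound by address, you prepare and bind once and then call SQLExecute inside the loop; each execution picks up the current value of i. A sketch (odbc_Error is the helper from your code):

/* Sketch: prepare once, bind once, execute once per iteration. */
iInd = 0;  /* the bound value is never NULL */
for (i = 0; i < 1800; i++)
{
    retcode = SQLExecute(hstmt);
    if (retcode != SQL_SUCCESS && retcode != SQL_SUCCESS_WITH_INFO)
    {
        printf("Error in SQLExecute\n");
        odbc_Error(henv, hdbc, hstmt);
        break;
    }
}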
You should use either SQL parameters or sprintf(), preferably the former.
I am having some trouble creating some SQL (for SQL Server 2008).
I have a table of priority-ordered, comma-delimited tasks:
Id = 1, LongTaskName = "a,b,c"
Id = 2, LongTaskName = "a,c"
Id = 3, LongTaskName = "b,c"
Id = 4, LongTaskName = "a"
etc...
I am trying to build a new table that groups them by the first task, along with the id:
GroupName: "a", TaskId: 1
GroupName: "a", TaskId: 2
GroupName: "a", TaskId: 4
GroupName: "b", TaskId: 3
Here is the naive, slow LINQ code:
foreach (var t in Tasks)
{
    var gt = new GroupedTasks();
    gt.TaskId = t.Id;
    var firstWord = t.LongTaskName.Split(',');
    if (firstWord.Count() > 0)
    {
        gt.GroupName = firstWord.First();
    }
    else
    {
        gt.GroupName = t.LongTaskName;
    }
    GroupedTasks.InsertOnSubmit(gt);
}
I wrote a SQL function to do the string split:
create function fn_Split(
    @String nvarchar(4000),
    @Delimiter nvarchar(10)
)
returns nvarchar(4000)
begin
    declare @FirstComma int
    set @FirstComma = charindex(@Delimiter, @String)
    if (@FirstComma = 0)
        return @String
    return substring(@String, 0, @FirstComma)
end
go
However, I am getting stuck on the real SQL to do the work.
I can get the group by alone:
SELECT dbo.fn_Split(LongTaskName, ',')
FROM [dbo].[Tasks]
GROUP BY dbo.fn_Split(LongTaskName, ',')
And I know I need to head down something like this:
DECLARE @RowSet TABLE (GroupName nvarchar(1024), Id nvarchar(5))

insert into @RowSet
select ???
FROM [dbo].Tasks as T
INNER JOIN
(
    SELECT dbo.fn_Split(LongTaskName, ',')
    FROM [dbo].[Tasks]
    GROUP BY dbo.fn_Split(LongTaskName, ',')
) G ON T.??? = G.???
ORDER BY ???

INSERT INTO dbo.GroupedTasks(GroupName, Id)
select * from @RowSet
But I am not quite grokking how to reference the grouped relationships, and I am confused about having to call split multiple times.
Any thoughts?
If you only care about the first item in the list, there's really no need for a function. I would recommend this approach. You also don't need the @RowSet table variable for any temporary holding.
INSERT dbo.GroupedTasks(GroupName, Id)
SELECT
    LEFT(LongTaskName, COALESCE(NULLIF(CHARINDEX(',', LongTaskName) - 1, -1), 1024)),
    Id
FROM dbo.Tasks;
It is even easier if the tasks are one character long: you could use LEFT(LongTaskName, 1) instead of the uglier LEFT/CHARINDEX expression. But I'm guessing your task names are not one character long (and if they are, you should include sample data that varies a bit so that others don't make assumptions about length).
Now, keep in mind that you'll have to do something like this to keep dbo.GroupedTasks up to date every time a dbo.Tasks row is inserted, updated or deleted. How are you going to keep these two tables in sync?
More to the point, you should consider storing the top-priority task separately in the first place, either by using a computed column or by separating it out before the insert. Munging data together is something you do with hash tables and arrays in application code, but it rarely has any positive attributes inside a database. You almost always spend more time and effort pulling the data apart than you ever saved by keeping it together in the first place. This would negate the need for a second table at all.
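For instance, the first task could be stored as a persisted computed column (a sketch; the column name FirstTask is invented):

-- Sketch: persist the first task alongside the list, avoiding the second table.
ALTER TABLE dbo.Tasks ADD FirstTask AS
    LEFT(LongTaskName, COALESCE(NULLIF(CHARINDEX(',', LongTaskName) - 1, -1), 1024))
    PERSISTED;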
SELECT Id, dbo.fn_Split(LongTaskName, ',') AS GroupName
INTO TasksWithGroupInfo
FROM dbo.Tasks
Does this answer your question?