--diff doesn’t appear to compute deltas properly when comparing directories - cloc

We are using cloc.pl for code analysis, and it has proved very useful so far when simply counting lines of code. Now we are trying to get a diff between two branches.
Using the documentation linked above, I am trying to get the diff:
perl cloc.pl --diff branch-1.0/ExampleClass.java branch-2.0/ExampleClass.java
This produces a correct result for a single file and reports modified lines properly; the same is true for the other counts (added, removed, and so on).
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Java
 same                            0              0            209            294
 modified                        1              0            170             12
 added                           0              0            647              1
 removed                         0              5             64             46
-------------------------------------------------------------------------------
SUM:
 same                            0              0            209            294
 modified                        1              0            170             12
 added                           0              0            647              1
 removed                         0              5             64             46
-------------------------------------------------------------------------------
But when I try to accomplish the same thing for the complete branch, i.e., all the files under a folder, by issuing something like this:
perl cloc.pl --diff branch-1.0\ branch-2.0\
Now comes the problem.
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Java
 same                            0              0              0              0
 modified                        0              0              0              0
 added                           0            110           2408            789
 removed                         1             32            443            352
-------------------------------------------------------------------------------
SUM:
 same                            0              0              0              0
 modified                        0              0              0              0
 added                           0            110           2408            789
 removed                         1             32            443            352
-------------------------------------------------------------------------------
As you can see, when I issue the command at the folder level, the same and modified rows are all 0; everything is reported as either added or removed lines of code or files.
I am not sure whether I am missing something silly or whether this is an issue with the cloc tool. I am using version 1.56.

This issue was resolved in version 1.60, which let me move ahead; apparently it is a bug in version 1.56. I also switched to the prebuilt "cloc-1.60.exe".
Another thing I found along the way: there is more help and discussion on the cloc bug tracker at http://sourceforge.net/p/cloc/bugs/, which was helpful in my case.

Related

Excel xlsx file converted into a dataframe is not recognized by an R package that uses dataframes

I imported an Excel xlsx file and then created a dataframe, converting the numeric variables into categories. When I run an R package that uses the dataframe, I get the following error:
> library(DiallelAnalysisR)
> Griffing(Yield, Rep, Cross1, Cross2, GriffingData41, 4, 1)
Error in `$<-.data.frame`(`*tmp*`, "Trt", value = character(0)) :
replacement has 0 rows, data has 20
When I call the str() function, it shows the numeric columns converted into categories, as below.
> str(GriffingData41)
'data.frame': 20 obs. of 4 variables:
$ CROSS1: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 2 2 2 3 3 4 ...
$ CROSS2: Factor w/ 4 levels "2","3","4","5": 1 2 3 4 2 3 4 3 4 4 ...
$ REP : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 2 ...
$ YIELD : num 11.9 14.5 9 13.5 20.5 9.8 16.5 22.1 18.4 19.4 ...
Is this a problem with how I created the dataframe?
I would appreciate any help with this error. By the way, I am running this in RStudio.
Thank you.
Note: This is not really a solution to my problem, but I managed to move forward by saving my Excel data in CSV format, changing the data type of the specific columns to character, and importing that into RStudio. From there, creating the dataframe and running the R package went smoothly. Still, I am curious why it did not work with the "xlsx" file.

To One-Hot encode or not to One-Hot encode

My data set has the day of the week as a number (Mon = 1, Tue = 2, Wed = 3, ...).
My data look like this:
WeekDay  Col1  Col2  Target
      1   2.2     8     126
      6   3.5     4     354
      1   8.0     2     322
      3   7.2     4     465
      7   3.2     5     404
      6   3.8     3     134
      1   3.6     5     455
      1   5.5     8     345
      6   7.0     6     442
Should I one-hot encode WeekDay so that it looks like this?
WeekDay  Col1  Col2  Target  Mo  Tu  We  Th  Fr  Sa  Su
      1   2.2     8     126   1   0   0   0   0   0   0
      6   3.5     4     354   0   0   0   0   0   1   0
      1   8.0     2     322   1   0   0   0   0   0   0
      3   7.2     4     465   0   0   1   0   0   0   0
      7   3.2     5     404   0   0   0   0   0   0   1
      6   3.8     3     134   0   0   0   0   0   1   0
      1   3.6     5     455   1   0   0   0   0   0   0
      1   5.5     8     345   1   0   0   0   0   0   0
      6   7.0     6     442   0   0   0   0   0   1   0
I am going to use a random forest.
You should not use one-hot encoding, since you are using a random forest model. An RF model can find the patterns from label encoding just as well, and RF models generally perform worse with one-hot encoding because individual trees may effectively drop some of the day columns when splitting. One-hot encoding also pushes your data toward the curse of dimensionality, which is never good.
One-hot encoding is better for methods like linear regression or logistic regression, where the raw codes would be misleading (Saturday = 6 would count six times as much as Monday = 1), because these models multiply each feature by a coefficient.
Generally, it is preferable to one-hot encode before using a random forest. If this is the only categorical variable in your dataset, go for one-hot encoding. If you use R's random forest, then as far as I know the library deals with categorical variables itself; for scikit-learn that is not the case, and you have to one-hot encode yourself. There is a trade-off: one-hot encoding introduces sparsity, which is undesirable for tree-based models if the cardinality of the categorical variable is high, in other words, if it has many unique values. Python's CatBoost, however, handles categorical variables natively.
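To make the two options concrete, here is a minimal sketch in Python with pandas and scikit-learn. The column names come from the sample data above; the model settings and the side-by-side comparison are my own illustration, not part of the original question.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# The sample rows from the question.
df = pd.DataFrame({
    "WeekDay": [1, 6, 1, 3, 7, 6, 1, 1, 6],
    "Col1":    [2.2, 3.5, 8.0, 7.2, 3.2, 3.8, 3.6, 5.5, 7.0],
    "Col2":    [8, 4, 2, 4, 5, 3, 5, 8, 6],
    "Target":  [126, 354, 322, 465, 404, 134, 455, 345, 442],
})

# Option 1: leave WeekDay as a single label-encoded column.
X_label = df[["WeekDay", "Col1", "Col2"]]

# Option 2: one-hot encode WeekDay (one 0/1 column per day value that
# occurs in the data; make WeekDay a Categorical with categories 1..7
# first if you always want all seven columns).
X_onehot = pd.get_dummies(df, columns=["WeekDay"]).drop(columns="Target")

y = df["Target"]
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_label, y)  # swap in X_onehot to compare the two encodings

On a toy set this size any difference is noise; the choice matters more as the number of categories grows.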

How to Create a CDF out of a PDF in SQL

So I have a data table that looks something like the following. id represents an object, bin represents how I am segmenting the data, and percent is how much of the data falls into that bin.
id  bin  percent
 2    8  0.20030698388
 2   16  0.14504988488
 2   24  0.12356101304
 2   32  0.09976976208
 2   40  0.09056024558
 2   48  0.07137375287
 2   56  0.04067536454
 2   64  0.03914044512
 2   72  0.02916346891
 2   80  0.16039907904
 3    8  0.36316695352
 3   16  0.03958691910
 3   24  0.11876075731
 3   32  0.13253012048
 3   40  0.03098106712
 3   48  0.07228915662
 3   56  0.07745266781
 3   64  0.02581755593
 3   72  0.02065404475
 3   80  0.11876075731
I am looking for a function to turn this dataset into a CDF, partitioned by id. I have tried CUME_DIST and PERCENT_RANK, but they do not appear to work.
I was facing a similar problem and found this great tutorial for doing exactly that:
https://dwaincsql.com/2015/05/14/excel-in-t-sql-part-2-the-normal-distribution-norm-dist-density-functions/
It rebuilds the Excel function NORM.DIST, which gives you the PDF if you set the cumulative flag to FALSE and the CDF if you set it to TRUE. I assumed that CUME_DIST would do the exact same thing in SQL; however, it turns out that CUME_DIST works by counting rows, whereas Excel works with the relative differences between the values.
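If your database supports window functions, the usual way to turn a per-bin PDF like this into a CDF is a running total: SUM(percent) OVER (PARTITION BY id ORDER BY bin). As a cross-check of the arithmetic, here is the same computation sketched in Python with pandas, using a subset of the sample rows above:

import pandas as pd

# The first three bins of each id from the sample data.
df = pd.DataFrame({
    "id":      [2, 2, 2, 3, 3, 3],
    "bin":     [8, 16, 24, 8, 16, 24],
    "percent": [0.20030698388, 0.14504988488, 0.12356101304,
                0.36316695352, 0.03958691910, 0.11876075731],
})

# CDF = running total of the PDF within each id, ordered by bin.
df = df.sort_values(["id", "bin"])
df["cdf"] = df.groupby("id")["percent"].cumsum()
print(df)

Unlike CUME_DIST, which ranks by counting rows, this accumulates the actual percent values into a proper CDF per id.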

TetGen generates tets for empty chambers/holes in a model

I am using tetgen to generate meshes for my research.
My models have empty internal chambers inside them; for example, an empty box of size (5, 5, 5) inside a box of size (10, 10, 10).
The problem is that TetGen generates tetrahedra inside the empty chamber. Why? Is there a way to avoid it?
I tried the -YY, -q, -CC, and -c switches and their combinations, but all had the same problem and gave no insight into the error (http://wias-berlin.de/software/tetgen/1.5/doc/manual/manual005.html).
The way I solved it was to create a .poly file (http://wias-berlin.de/software/tetgen/fformats.poly.html). I created the .poly file from a .off file (https://en.wikipedia.org/wiki/OFF_(file_format)), which I could export from OpenSCAD.
A .poly file has four parts, of which the third specifies holes in the object. You need to tell TetGen where the holes are,
and the way to do that is to specify one point inside each hole/chamber.
A possible .poly file would look like this:
part1 - vertices:
40 3 0 0
0 0.2 0 1
1 0.161803 0.117557 0
...
part2 - faces:
72 0
1
3 0 1 2
1
3 1 0 3
...
part3 - holes <============== the one I needed
1
1 0 0 0.5 <=== this is a point, which I know is inside my hole/chamber
So here is the file, without any breaks, just in case:
40 3 0 0
0 0.2 0 1
1 0.161803 0.117557 0
...
72 0
1
3 0 1 2
1
3 1 0 3
...
1
1 0 0 0.5
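If you generate these files regularly, it is easy to script. Below is a minimal sketch of a .poly writer in Python; write_poly, its arguments, and the file name are hypothetical names of mine, and it only covers the simple case shown above (polygonal facets, no attributes, no boundary markers, no regions):

def write_poly(path, vertices, faces, hole_points):
    # Write a minimal TetGen .poly file with its four parts.
    with open(path, "w") as f:
        # Part 1: node list -- <count> <dimension> <#attributes> <#boundary markers>.
        f.write(f"{len(vertices)} 3 0 0\n")
        for i, (x, y, z) in enumerate(vertices):
            f.write(f"{i} {x} {y} {z}\n")
        # Part 2: facet list -- <count> <#boundary markers>.
        f.write(f"{len(faces)} 0\n")
        for face in faces:
            f.write("1\n")  # each facet here is a single polygon
            f.write(f"{len(face)} {' '.join(map(str, face))}\n")
        # Part 3: hole list -- one point strictly inside each hole/chamber.
        f.write(f"{len(hole_points)}\n")
        for i, (x, y, z) in enumerate(hole_points, start=1):
            f.write(f"{i} {x} {y} {z}\n")
        # Part 4: region list (empty here).
        f.write("0\n")

# For the model above, mark the chamber with the known interior point:
# write_poly("model.poly", vertices, faces, [(0.0, 0.0, 0.5)])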

How to return a group of rows when one row meets "where" criteria in SQL Anywhere

I am somewhat overwhelmed by what I am trying to do, since I have only been using SQL for three days now, but I already love the increased functionality over MS Query. The need for the IN function is what drove me to learn about this, and I thank the community for the information here that got me through learning it.
I tried looking through other questions, but I couldn't find one in which the intent was to group more than two rows, or to group a varying number of rows; that rules out the count- and duplicate-based approaches.
What I am doing is analyzing a table of part-number information that spans multiple store locations. The table gives a row to each instance of a part number, so if all 15 stores have some sort of history for a given part number, that part number will have 15 rows in the table.
I want to look at other stores' history for parts that have zero sales history at my location. The purpose is to see whether they can be transferred to another store instead of being returned to the vendor and incurring a restocking fee.
Here is a simplified version of the table, organized the way I would want the output to be structured. I got here by taking suspected part numbers and using the list of them as a text string in IN(), but I want to go about this the other way and build the list of part numbers from the sales data in this table.
Branch | Part_No  | Description | Bin Qty | current 12 mo sales | previous 12 mo sales
-------|----------|-------------|---------|---------------------|---------------------
20     | CA38385  | SUPPORT     |       2 |                   1 |                    1
23     | CA38385  | SUPPORT     |       1 |                   0 |                    0
25     | CA38385  | SUPPORT     |       0 |                   0 |                    1
20     | DFC10513 | Hdw Kit     |       0 |                   1 |                    0
23     | DFC10513 | Hdw Kit     |       1 |                   0 |                    0
07     | DFC10513 | Hdw Kit     |       0 |                   1 |                    0
3      | D59096   | VALVE       |       0 |                   0 |                   12
5      | D59096   | VALVE       |       0 |                   0 |                    4
6      | D59096   | VALVE       |       4 |                   6 |                   12
8      | D59096   | VALVE       |       0 |                   0 |                    0
33     | D59096   | VALVE       |      11 |                  14 |                   18
21     | D59096   | VALVE       |       4 |                   4 |                    4
22     | D59096   | VALVE       |       0 |                   0 |                    0
23     | D59096   | VALVE       |      10 |                   0 |                    0
24     | D59096   | VALVE       |       0 |                   0 |                    0
25     | D59096   | VALVE       |       0 |                   0 |                    0
26     | D59096   | VALVE       |       2 |                   2 |                    0
1      | TE67401  | Repair Kit  |       1 |                   1 |                    2
21     | TE67401  | REPAIR KIT  |       1 |                   3 |                    0
22     | TE67401  | REPAIR KIT  |       0 |                   1 |                    0
I am branch 23, so the start of the query, as I understand it, would be:
Select * from part_information
Group By part_number
Having IN(Branch) 23 and bin qty > 0 and current_12_mo_sales=0 and previous_12_mo_sales = 0
Can you point me down the right track? This table has approximately 200,000 rows in it, so I really need to learn how to do this; I don't see a better way.
Thank you in advance for your help and/or criticism. -Cody
Select * from part_information
where part_number not in (
    select part_number from part_information
    where branch = 23 and bin_qty > 0 -- etc...
)
(Apologies for the lack of formatting.)
This ended up working the way I wanted:
SELECT pi_Branch, pi_Franchise, pi_Part_No, pi_Description, pi_Bin_Qty,
       pi_Bin, pi_current_12_mo_sales, pi_previous_12_mo_sales,
       pi_Inventory_Cost, pi_Return_Indicator
FROM Part_Information
WHERE pi_Part_No IN (SELECT pi_Part_No
                     FROM Part_Information
                     WHERE pi_Branch = 23
                       AND pi_Bin_Qty > 0
                       AND pi_current_12_mo_sales <= 0
                       AND pi_previous_12_mo_sales <= 0)
I was thinking that this had to be some complex process, but in reality two simple queries were all that was needed.
I would still be interested in anyone's opinion on a better or more efficient way of handling this.
Thanks, Mischa, for getting me there!
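Since the thread asks for opinions on efficiency: for what it's worth, here is a self-contained sketch of the same pattern in Python with sqlite3. The table and column names (branch, part_no, bin_qty, cur_sales, prev_sales) are simplified stand-ins for the pi_ columns above, and the toy rows are trimmed from the sample table; it also shows an equivalent correlated EXISTS form, which many engines optimize at least as well as IN:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE part_information
               (branch INTEGER, part_no TEXT, bin_qty INTEGER,
                cur_sales INTEGER, prev_sales INTEGER)""")
con.executemany("INSERT INTO part_information VALUES (?, ?, ?, ?, ?)", [
    (20, "CA38385", 2, 1, 1),
    (23, "CA38385", 1, 0, 0),   # branch 23 has stock but no sales
    (25, "CA38385", 0, 0, 1),
    (23, "D59096", 10, 0, 0),   # same situation for this part
    (33, "D59096", 11, 14, 18),
    (21, "TE67401", 1, 3, 0),   # no branch-23 row at all
])

# The same shape as the accepted query: IN on a subquery.
rows = con.execute("""
    SELECT * FROM part_information
    WHERE part_no IN (SELECT part_no FROM part_information
                      WHERE branch = 23 AND bin_qty > 0
                        AND cur_sales <= 0 AND prev_sales <= 0)
""").fetchall()

# An equivalent correlated EXISTS form.
rows_exists = con.execute("""
    SELECT * FROM part_information p
    WHERE EXISTS (SELECT 1 FROM part_information q
                  WHERE q.part_no = p.part_no AND q.branch = 23
                    AND q.bin_qty > 0
                    AND q.cur_sales <= 0 AND q.prev_sales <= 0)
""").fetchall()

assert sorted(rows) == sorted(rows_exists)  # all CA38385 and D59096 rows, no TE67401

Adding an index on (branch, part_no) is the usual first step if a query like this is slow on 200,000 rows.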