Group by Regression in TensorFlow


I am very new to TensorFlow - so please bear with me if this is a trivial question.
I'm coding in Python+TensorFlow. I have a dataframe with the following structure -
Y | X_1 | X_2 | ... | X_p | Grp
where Y is the continuous response, X_1 through X_p are features, and Grp is a categorical value indicating the group. I want to fit a separate linear regression of Y on (X_1,...,X_p) for each Grp and save the weights/coefficients. I do not want to use the off-the-shelf tf.estimator.LinearRegressor. Instead I want to go the loss function-optimizer-session.run() route.
The relevant tutorial pages on the internet cover linear regression but not per-group regression. I would appreciate any suggestions. I am thinking of doing this:
For each g in Grps:
1. Call the optimizer, passing the data for group g through the placeholders.
2. Get the estimated weights (for group g) and save them in a dataframe: Grp | weights
Another approach that sounds reasonable is to build a separate graph for each group and run them all together using several "sessions".
Are these reasonable and feasible in TF? Which one is easier, or are there better approaches?
Thank you,
Sai
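The loop in steps 1 and 2 above can be sketched without any framework. The following is a minimal plain-Python illustration, not TensorFlow code: it splits the records by group and runs batch gradient descent on the mean squared error for each group, which is the role the TensorFlow optimizer and session.run() would play. The names fit_group and fit_by_group, the learning rate, the step count and the toy data are all assumptions made for this example.

```python
from collections import defaultdict


def fit_group(rows, lr=0.05, steps=2000):
    """Fit y ~ w[0] + w[1:] . x for one group by batch gradient descent on MSE."""
    p = len(rows[0][1])
    w = [0.0] * (p + 1)  # w[0] is the intercept
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * (p + 1)
        for y, x in rows:
            pred = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
            err = pred - y
            grad[0] += 2.0 * err / n
            for j, xj in enumerate(x):
                grad[j + 1] += 2.0 * err * xj / n
        # One optimizer step: move against the gradient of the loss.
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def fit_by_group(records):
    """records: iterable of (y, [x_1, ..., x_p], grp). Returns {grp: weights}."""
    groups = defaultdict(list)
    for y, x, grp in records:
        groups[grp].append((y, x))
    return {grp: fit_group(rows) for grp, rows in groups.items()}


# Hypothetical toy data: group "a" follows y = 1 + 2x, group "b" follows y = -x.
records = [
    (1.0, [0.0], "a"), (3.0, [1.0], "a"), (5.0, [2.0], "a"),
    (0.0, [0.0], "b"), (-1.0, [1.0], "b"), (-2.0, [2.0], "b"),
]
weights = fit_by_group(records)
for grp, w in sorted(weights.items()):
    print(grp, [round(v, 4) for v in w])
```

The resulting dictionary maps each group to its own weight vector, which corresponds to the Grp | weights table described in step 2. Fixed hyperparameters like these only suit small, well-scaled data; real features would likely need scaling or a tuned learning rate.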

Related

moving from tabular to graph representation of a given data

Suppose that I have the following data t:
activity  teacher  group  students  duration  subject
One       A        a      3         45        Math
One       B        b      2         45        Math
two       A        c      7         60        P.E
One       D        a      3         45        Math
two       C        c      7         60        P.E
I want to construct graph data from this tabular data. I am actually interested in predicting the teacher by applying some kind of graph ML. Is there a way to transform the tabular data into graph data, maybe using NetworkX?
I tried the following code:
import matplotlib.pyplot as plt
import networkx as nx
# df is the pandas DataFrame holding the table above
G = nx.from_pandas_edgelist(df, "subject", "teacher", edge_attr=True, create_using=nx.Graph())
nx.draw_networkx(G)
plt.show()
The output of this looks like a graph, but I don't understand how it works, how I can get the new data, or what the best way is to identify the nodes and the edges.
Thank you in advance for any help.
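To make the node and edge question concrete, here is a plain-Python sketch of what an edge-list conversion of this table amounts to. As far as I understand from_pandas_edgelist, each row becomes one edge between the value in the source column and the value in the target column, and edge_attr=True keeps the remaining columns as attributes of that edge. The helper build_graph is hypothetical and only mimics that behaviour for an undirected graph; it is not the NetworkX implementation.

```python
columns = ["activity", "teacher", "group", "students", "duration", "subject"]
table = [
    ("One", "A", "a", 3, 45, "Math"),
    ("One", "B", "b", 2, 45, "Math"),
    ("two", "A", "c", 7, 60, "P.E"),
    ("One", "D", "a", 3, 45, "Math"),
    ("two", "C", "c", 7, 60, "P.E"),
]
rows = [dict(zip(columns, values)) for values in table]


def build_graph(rows, source, target):
    """Return (nodes, edges) where every row contributes one undirected edge."""
    nodes = set()
    edges = {}
    for row in rows:
        u, v = row[source], row[target]
        nodes.update((u, v))
        # All other columns ride along as edge attributes.
        attrs = {k: val for k, val in row.items() if k not in (source, target)}
        # frozenset makes (u, v) and (v, u) the same edge; a repeated pair
        # overwrites the attributes stored earlier.
        edges[frozenset((u, v))] = attrs
    return nodes, edges


nodes, edges = build_graph(rows, "subject", "teacher")
print(sorted(nodes))
for pair, attrs in edges.items():
    print(sorted(pair), attrs)
```

With "subject" and "teacher" as the two columns, the nodes are the subjects and the teachers, and the graph only says which teacher is linked to which subject. Choosing other column pairs (for example "teacher" and "group") gives a different graph, so the choice depends on which relationships the model should see.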

Multiple Object Tracking (MOT) benchmark data-set format for ground truth tracking

I am trying to evaluate the performance of my object detection and tracking on the 2DMOT Challenge 2015 dataset, a standard benchmark in the industry. I have downloaded the dataset, but I am unable to understand the data fields in the labelled ground truth data.
I understand the first six columns but not the remaining four. The following is sample data from the directory <\2DMOT2015\train\ETH-Bahnhof\gt>:
frame no. object_id bb_left bb_top bb_width bb_height (?) (?) (?) (?)
1 1 212 204 20 57 0 -3.1784 16.34 0.45739
1 2 223 181 36 104 1 -1.407 9.0212 0.68774
Please let me know if you know what these four columns mean.

The seventh field is a score, and the last three fields are the 3D real-world coordinates of the object. The same layout is used for the ETH-Bahnhof, ETH-Sunnyday, PETS09-S2L1 and TUD-Stadtmitte sequences in 2DMOT2015. For ground truth the score is normally 1. When it varies between 0 and 1, it acts as a flag, and a zero means that the line should not be considered for evaluation. So the data fields are in this format:
frame no., object_id, bb_left, bb_top, bb_width, bb_height, score, X, Y, Z
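A short sketch of reading such lines, assuming the ten fields come in exactly the order listed above. The class GtRow and the functions parse_gt_line and rows_for_evaluation are hypothetical names for this example. The parser accepts both whitespace-separated lines, as in the sample from the question, and comma-separated lines, and it drops rows whose score flag is zero.

```python
from typing import NamedTuple


class GtRow(NamedTuple):
    frame: int
    object_id: int
    bb_left: float
    bb_top: float
    bb_width: float
    bb_height: float
    score: float
    x: float
    y: float
    z: float


def parse_gt_line(line):
    """Split one ground-truth line into its ten named fields."""
    parts = line.replace(",", " ").split()
    if len(parts) != 10:
        raise ValueError("expected 10 fields, got %d" % len(parts))
    frame, object_id = int(parts[0]), int(parts[1])
    rest = [float(p) for p in parts[2:]]
    return GtRow(frame, object_id, *rest)


def rows_for_evaluation(lines):
    """Keep only rows whose score flag is non-zero."""
    return [row for row in map(parse_gt_line, lines) if row.score != 0]


sample = [
    "1 1 212 204 20 57 0 -3.1784 16.34 0.45739",
    "1 2 223 181 36 104 1 -1.407 9.0212 0.68774",
]
kept = rows_for_evaluation(sample)
for row in kept:
    print(row)
```

For the two sample lines from the question, only the second one (object 2, flag 1) survives the filter.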

Customizing tables in Stata

Using Stata 14 on Windows, I am wondering how to build customized tables from several regression results. Here is an example. We have
reg y x1
predict resid1, residuals
summarize resid1
Which gives:
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
resid1 | 5,708,529 4.83e-11 .7039736 -3.057633 3.256382
Then I run another regression and similarly obtain the residuals:
reg y x2
predict resid2, residuals
I would like to create a table that has the standard deviations of the two residuals, and ideally output it to LaTeX. I am familiar with the esttab and estout commands for outputting regression results to LaTeX, but these do not work for customized tables as in the above example.

You need to use estpost. This should get you started.
sysuse auto, clear
regress price weight
predict error1, residuals
regress price trunk
predict error2, residuals
eststo clear
estpost summarize error1 error2
eststo
esttab, cells("count mean sd min max") noobs nonum
esttab using so.tex, cells("count mean sd min max") noobs nonum replace
More here.

PIG Item Count and Histogram

This is a two part problem:
PART 1:
I am using the Cloudera Pig editor to transform my data. The data set is derived from the US Patents Citations data set. The first column is the "Cited" patent. The remaining data are the patents that cite the first patent.
3858241 3634889,3557384,3398406,1324234,956203
3858242 3707004,3668705,3319261,1515701
3858243 3684611,3681785,3574238,3221341,3156927,3146465,2949611
3858244 2912700,2838924,2635670,2211676,17445,14040
3858245 3755824,3699969,3621837,3608095,3553737,3176316,2072303
3858246 3601877,3503079,3451067
3858247 3755824,3694819,3621837,2807431,1600859
I need to create PIG code that will count the number of citations that the first patent has. So, I need the output to be:
3858241 5
3858242 4
3858243 7
3858244 6
3858245 7
3858246 3
3858247 5
PART 2:
I need to create a histogram of the output from problem 1 using a PIG script.
Any help would be greatly appreciated.
Thanks

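This script should work.
-- Load each line as (patent id, comma-separated list of citing patents)
X = LOAD 'pigpatient.txt' USING PigStorage(' ') AS (pid:int, str:chararray);
-- Split the list into a tuple with one field per citing patent
X1 = FOREACH X GENERATE pid, STRSPLIT(str, ',') AS (y:tuple());
-- Part 1: the size of the tuple is the citation count
X2 = FOREACH X1 GENERATE pid, SIZE(y) AS numofcitan;
DUMP X2;
-- Part 2: count how many patents share each citation count
X3 = GROUP X2 BY numofcitan;
Histograms = FOREACH X3 GENERATE group AS numofcitan, COUNT(X2.pid);
DUMP Histograms;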
input:
3858241 3634889,3557384,3398406,1324234,956203
3858242 3707004,3668705,3319261,1515701
3858243 3684611,3681785,3574238,3221341,3156927,3146465,2949611
3858244 2912700,2838924,2635670,2211676,17445,14040
3858245 3755824,3699969,3621837,3608095,3553737,3176316,2072303
3858246 3601877,3503079,3451067
3858247 3755824,3694819,3621837,2807431,1600859
Result:
(3858241,5)
(3858242,4)
(3858243,7)
(3858244,6)
(3858245,7)
(3858246,3)
(3858247,5)
Histogram output:
Number of citations, number of patents
(3,1)
(4,1)
(5,2)
(6,1)
(7,2)
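As a cross-check of the two outputs above, the same counting can be reproduced in a few lines of plain Python on the sample input. This is only a verification sketch, not a replacement for the Pig script; the function name citation_counts is made up for the example.

```python
from collections import Counter

data = """\
3858241 3634889,3557384,3398406,1324234,956203
3858242 3707004,3668705,3319261,1515701
3858243 3684611,3681785,3574238,3221341,3156927,3146465,2949611
3858244 2912700,2838924,2635670,2211676,17445,14040
3858245 3755824,3699969,3621837,3608095,3553737,3176316,2072303
3858246 3601877,3503079,3451067
3858247 3755824,3694819,3621837,2807431,1600859
"""


def citation_counts(text):
    """Map each cited patent id to the number of patents citing it."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        pid, citing = line.split(None, 1)
        counts[int(pid)] = len(citing.split(","))
    return counts


counts = citation_counts(data)
# Histogram: how many patents have each citation count.
histogram = Counter(counts.values())

for pid, n in counts.items():
    print(pid, n)
for n, patents in sorted(histogram.items()):
    print(n, patents)
```

On this input the last patent, 3858247, has five citing patents, which agrees with the Pig result.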

@Sravan K Reddy's answer is good enough to be a solution, but it is essential to know what a histogram is.
A histogram is a frequency distribution of a dataset and gives statistical information about the data. The most commonly used histogram types are equi-width and equi-depth (the latter is also called equi-height or height-balanced).
In database tools, the equi-depth histogram is preferred, for example in Oracle.
@Sravan K Reddy intends to create an equi-width histogram of patent citations. However, in order to create a histogram, the data must be sorted. That is vital for histogram construction.
If you want to create a histogram of your big data, read this paper and check Apache Pig Scripts.
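The difference between the two histogram types named above can be illustrated on the seven citation counts from the earlier result. This is a minimal sketch with hypothetical helper names and an arbitrary choice of two buckets: equi_width splits the value range into intervals of equal length and counts the items in each, while equi_depth sorts the values and cuts them into buckets holding nearly the same number of items.

```python
def equi_width(values, k):
    """Count values in k buckets that each span an equal slice of [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # The maximum value would land in bucket k, so clamp it into the last one.
        idx = 0 if width == 0 else min(int((v - lo) / width), k - 1)
        counts[idx] += 1
    return counts


def equi_depth(values, k):
    """Sort the values and cut them into k buckets of nearly equal size."""
    ordered = sorted(values)
    n = len(ordered)
    buckets, start = [], 0
    for i in range(k):
        # The first n % k buckets take one extra item.
        end = start + n // k + (1 if i < n % k else 0)
        buckets.append(ordered[start:end])
        start = end
    return buckets


citations = [5, 4, 7, 6, 7, 3, 5]
width_counts = equi_width(citations, 2)
depth_buckets = equi_depth(citations, 2)
print(width_counts)
print(depth_buckets)
```

With equal-width buckets the counts per bucket can be very uneven, whereas equal-depth buckets keep the counts balanced and let the bucket boundaries move instead. Only the equi-depth variant needs the sort in this sketch.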

Identifying graphs in heap of connected nodes -- how is this called?

I have a SQL table with three columns X, Y, Z. I need to split it into groups in such a way that all records with the same value of X or Y or Z are assigned to the same group. I need to make sure that records with the same value of X or Y or Z are never split across multiple groups.
If you think of records as nodes and values of X, Y, Z as edges, this problem is the same as finding all graphs where the nodes in each graph are connected directly or indirectly via an X, Y, or Z edge, but each graph has no edges in common with other graphs (otherwise it would be part of the same graph).
A few years ago I knew what this was called and even remembered the algorithm, but now it escapes me. Please tell me what this problem is called so I can Google for a solution. If you know a good algorithm, please point me to it. If you have a SQL implementation -- I will marry you :)
Example:
X Y Z BUCKET
--------- ---------------- --------- -----------
1 34 56 1
54 43 45 2
1 12 22 1
2 34 11 1
The last row is in bucket 1 because of the value of Y=34 which is the same as of the first row, which is in bucket 1.

It looks less like a graph and more like a simplicial complex.
But if we treat this complex as its skeletal graph (the numbers are treated as vertices, and a row in the table means that all three of its vertices are connected to each other by edges), then we may just use any algorithm to find the connected components of this graph. I'm not sure whether there is a feasible way to do this in SQL, though; perhaps it would be more prudent to use a graph database.
However, for this specific problem there may be an easy solution in SQL that I didn't look for.
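Outside SQL, the connected-components idea can be sketched with a union-find (disjoint-set) structure. This is one possible approach under the assumption that values are only compared within the same column, as in the question's example (Y=34 in one row matching Y=34 in another); the function name bucket_rows is hypothetical. Rows are merged whenever they share a (column, value) pair, and each resulting component gets a bucket number.

```python
def bucket_rows(rows):
    """Assign a bucket number to each row; rows sharing a value in the
    same column, directly or through a chain of rows, share a bucket."""
    parent = list(range(len(rows)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # (column index, value) -> first row index with that value
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            key = (col, value)
            if key in first_seen:
                union(i, first_seen[key])
            else:
                first_seen[key] = i

    # Number the components 1, 2, ... in order of first appearance.
    labels, buckets = {}, []
    for i in range(len(rows)):
        root = find(i)
        labels.setdefault(root, len(labels) + 1)
        buckets.append(labels[root])
    return buckets


rows = [(1, 34, 56), (54, 43, 45), (1, 12, 22), (2, 34, 11)]
buckets = bucket_rows(rows)
print(buckets)
```

On the four example rows from the question this reproduces the BUCKET column: the first, third and fourth rows end up together and the second row stays alone.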

To find how many rows there are in each group of x:
select x, count(x)
from mytable
group by x
Or, to find the list of distinct values of x:
select distinct x from mytable;

Why don't you initially GROUP BY one of the columns (say X) and make buckets, then do the same for Y and Z, each time merging buckets from the previous step whenever a new group connects them?
Repeat the process for X, Y, and Z until the buckets stop changing.
Are you working for LinkedIn or Facebook? :)
