In the following LINQ example, how can you get a list of index numbers from the original rows that the group is made up of? I would like to show the user where the data comes from.
Dim inputDt As New DataTable
inputDt.Columns.Add("Contractor")
inputDt.Columns.Add("Job_Type")
inputDt.Columns.Add("Cost")
inputDt.Rows.Add({"John Smith", "Roofing", "2408.68"})
inputDt.Rows.Add({"John Smith", "Electrical", "1123.08"})
inputDt.Rows.Add({"John Smith", "Framing", "900.99"})
inputDt.Rows.Add({"John Smith", "Electrical", "892.00"})
Dim results = From rows In inputDt Where rows!Contractor <> ""
Group rows By rows!Job_Type
Into cost_total = Sum(CDec(rows!Cost))
For Each r In results
' Show results.
'r.Job_Type
'r.cost_total
' Show line numbers of original rows... ?
Next
For the result (Job_Type="Electrical", cost_total=2015.08), the original index numbers are 1 and 3.
Thanks
First and perhaps foremost, set Option Strict On. This will not allow the old VB6-style rows!Cost notation, but that is for the better: that notation always returns Object, and the data rarely is one. This is no loss at all, as .NET has better ways to type and convert variables.
Second, and somewhat related, is that all your DataTable columns are text even though one is clearly decimal. Next, your query relates to working with the data in the table but you want to also include a DataRow property which is a bit odd. Better would be to add an Id or Number to the data to act as the identifier. This will also help the results make sense if the View (order of rows) changes.
You did not clarify whether you wanted a CSV of indices (now IDs) or a collection of them. A CSV of them seems simpler, so that's what this does.
The code also uses more idiomatic names and demonstrates casting data to the needed type using other extension methods. It also uses the extension method approach. First the DataTable with non-string Data Types specified:
Dim inputDt As New DataTable
inputDt.Columns.Add("ID", GetType(Int32))
inputDt.Columns.Add("Contractor")
inputDt.Columns.Add("JobType")
inputDt.Columns.Add("Cost", GetType(Decimal))
inputDt.Rows.Add({1, "John Smith", "Roofing", "2408.68"})
inputDt.Rows.Add({5, "John Smith", "Electrical", "1123.08"})
inputDt.Rows.Add({9, "John Smith", "Framing", "900.99"})
inputDt.Rows.Add({17, "John Smith", "Electrical", "892.00"})
then the query:
Dim summary = inputDt.AsEnumerable().GroupBy(Function(g) g.Field(Of String)("JobType"),
Function(k, v) New With {.Job = k,
.Cost = v.Sum(Function(q) q.Field(Of Decimal)("Cost")),
.Indices = String.Join(",", inputDt.AsEnumerable().
Where(Function(q) q.Field(Of String)("JobType") = k).
Select(Function(j) j.Field(Of Int32)("Id")))
}).
OrderBy(Function(j) j.Cost).
ToArray()
' Debug, test:
For Each item In summary
Console.WriteLine("Job: {0}, Cost: {1}, Ids: {2}", item.Job, item.Cost, item.Indices)
Next
The excessive scroll is unfortunate, but I left it to allow the clauses to align with the "level" they are acting at. As you can see, a separate query is run on the DataTable to get the matching indices.
It is a little more typical to write such a thing as
Dim foo = Something.GroupBy(...).Select(...)
But you can skip the Select by using this overload of GroupBy, as the above does:
Dim foo = Something.GroupBy(Function (g) ..., Function (k, v) ... )
Results:
Job: Framing, Cost: 900.99, Ids: 9
Job: Electrical, Cost: 2015.08, Ids: 5,17
Job: Roofing, Cost: 2408.68, Ids: 1
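The group itself already holds its member rows, so the IDs can also be collected in the same pass that sums the cost, without querying the table a second time. The following is a minimal sketch of that idea in Python rather than VB; the function name summarize and the tuple layout are illustrative assumptions, and the rows mirror the sample table above.

```python
from collections import defaultdict
from decimal import Decimal

# Sample rows mirroring the DataTable above: (ID, Contractor, JobType, Cost).
rows = [
    (1, "John Smith", "Roofing", Decimal("2408.68")),
    (5, "John Smith", "Electrical", Decimal("1123.08")),
    (9, "John Smith", "Framing", Decimal("900.99")),
    (17, "John Smith", "Electrical", Decimal("892.00")),
]

def summarize(rows):
    """Group by job type, collecting the cost total and the source IDs in one pass."""
    groups = defaultdict(lambda: {"cost": Decimal("0"), "ids": []})
    for row_id, _contractor, job_type, cost in rows:
        groups[job_type]["cost"] += cost
        groups[job_type]["ids"].append(row_id)
    # Order by total cost, as the OrderBy clause does.
    return sorted(
        ((job, g["cost"], g["ids"]) for job, g in groups.items()),
        key=lambda item: item[1],
    )

summary = summarize(rows)
for job, cost, ids in summary:
    print(f"Job: {job}, Cost: {cost}, Ids: {','.join(map(str, ids))}")
```

Keeping the IDs as a list until the last moment leaves both options open: join them into a CSV for display, or hand the collection to other code.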
I have a dataset like a multiple-choice quiz result. One of the fields is semicolon-delimited. I would like to break it into true/false columns.
Input

Student  Answers
Alice    B;C
Bob      A;B;D
Carol    A;D

Desired Output

Student  A      B      C      D
Alice    False  True   True   False
Bob      True   True   False  True
Carol    True   False  False  True
I've already tried "Split multi-valued cells" and "Split in to several columns", but these don't give me what I would like.
I'm aware that I could do a custom grel/python/jython along the lines of "if value in string: return true" for each value, but I was hoping there would be a more elegant solution.
Can anyone suggest a starting point?
GREL in OpenRefine has a somewhat limited number of data structures, but you can still build simple algorithms with it.
For your encoding you need two data structures:
a list (technically an array) of all available categories.
a list of the categories in the current cell.
With these you can check, for each category, whether it is present in the current cell or not.
Assuming that the number of available categories is manageable,
I will use a hard-coded list ["A", "B", "C", "D"].
The list of categories in the current cell we get via value.split(/\s*;\s*/).
Note that I am using an array instead of string matching
and use splitting with a regular expression considering whitespace.
This is mainly defensive programming and hopefully the algorithm will still be understandable.
So let's wrap this all together into a GREL expression and create a new column (or transform the current one):
with(
value.split(/\s*;\s*/),
cell_categories,
forEach(
["A", "B", "C", "D"],
category,
if(cell_categories.inArray(category), 1, 0)))
.join(";")
You can then split the new column into several columns using ; as separator.
The new column names you have to assign manually (sry ;).
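For comparison, the same membership check can be sketched in plain Python (usable as a starting point for the Python/Jython route mentioned in the question). The function name encode_answers and the dictionary of sample answers are illustrative assumptions; the category list is hard-coded as above.

```python
import re

def encode_answers(cell, categories):
    """Return one True/False flag per category for a delimited cell value."""
    # Split on ';' and tolerate surrounding whitespace, like /\s*;\s*/ in GREL.
    present = set(re.split(r"\s*;\s*", cell.strip()))
    return [category in present for category in categories]

categories = ["A", "B", "C", "D"]
answers = {"Alice": "B;C", "Bob": "A;B;D", "Carol": "A;D"}

encoded = {student: encode_answers(cell, categories)
           for student, cell in answers.items()}

for student, flags in encoded.items():
    print(student, flags)
```

Splitting into a set first, rather than testing for a substring, avoids false matches once category names are longer than one character.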
Update: here is a more elaborate version to automatically extract the categories.
The idea is to create a single record for the whole dataset to be able to access all the entries in the column "Answers" and then extract all available categories from it.
Create a new column "Record" with content "Record".
Move the column "Record" to the beginning.
Blank down the column "Record".
Add a new column "Categories" based on the column "Answers" with the following GREL expression:
if(row.index>0, "",
row.record.cells["Answers"].value
.join(";")
.split(/\s*;\s*/)
.uniques()
.sort()
.join(";"))
Fill down the column "Categories".
Add a new column "Encoding" based on the column "Answers" with the following GREL expression:
with(
value.split(/\s*;\s*/),
cell_categories,
forEach(
cells["Categories"].value.split(";"),
category,
if(cell_categories.inArray(category), 1, 0)))
.join(";")
Split the column "Encoding" on the character ;.
Delete the columns "Record" and "Categories".
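The record trick above boils down to two passes: first collect every distinct category across the whole column, then encode each cell against that list. A rough Python sketch of those two passes follows; extract_categories and the sample list are illustrative assumptions, not OpenRefine API.

```python
import re

SEPARATOR = re.compile(r"\s*;\s*")

def extract_categories(cells):
    """Collect the sorted, de-duplicated categories found across all cells."""
    found = set()
    for cell in cells:
        found.update(part for part in SEPARATOR.split(cell.strip()) if part)
    return sorted(found)

answers = ["B;C", "A;B;D", "A;D"]

# Pass 1: the equivalent of the "Categories" column.
categories = extract_categories(answers)

# Pass 2: the equivalent of the "Encoding" column, joined with ';'.
encoding = [
    ";".join("1" if category in SEPARATOR.split(cell.strip()) else "0"
             for category in categories)
    for cell in answers
]
print(categories)
print(encoding)
```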
I need to read a CSV file from S3. It contains string and double data, but I will read everything as strings, which gives a dynamic frame with only string columns. I want to do the following for each row:
Concatenate a few columns to create new columns
Add new columns
Convert the value in the 3rd column from string to date
Convert the values of columns 4, 5 and 6 individually from string to decimal
Storename,code,created_date,performancedata,accumulateddata,maxmontlydata
GHJ 0,GHJ0000001,2020-03-31,0015.5126-,0024.0446-,0017.1811-
MULT,C000000001,2020-03-31,0015.6743-,0024.4533-,0018.0719-
Below is the code that I have written so far
import datetime
import re
from decimal import Decimal
from pyspark.sql import Row

def ConvertToDec(myString):
    # Up to 4 digits, an optional fraction, and an optional trailing minus (e.g. "0015.5126-").
    pattern = re.compile(r"[0-9]{1,4}(\.[0-9]{1,4})?-?")
    myString = myString.strip()
    if not pattern.fullmatch(myString):
        return Decimal("-9999.9999")  # sentinel for values that do not parse
    if myString.endswith("-"):
        return -Decimal(myString[:-1])  # the sign trails the digits in the source data
    return Decimal(myString)

def rowwise_function(row):
    row_dict = row.asDict()
    # Fall back to fixed codes when a field is empty.
    data = row_dict['code'] if row_dict['code'] else 'CD'
    data += row_dict['performancedata'] if row_dict['performancedata'] else 'HJ'
    # new columns
    row_dict['LC_CODE'] = data
    row_dict['CD_CD'] = 123
    row_dict['GBL'] = 123.345
    if row_dict["created_date"]:
        row_dict["created_date"] = datetime.datetime.strptime(row_dict["created_date"], '%Y-%m-%d')
    if row_dict["performancedata"]:
        row_dict["performancedata"] = ConvertToDec(row_dict["performancedata"])
    return Row(**row_dict)

store_df = spark.read.option("header", "true").csv("C:\\STOREDATA.TXT", sep="|")
ratings_rdd = store_df.rdd
ratings_rdd_new = ratings_rdd.map(rowwise_function)
updatedDF = spark.createDataFrame(ratings_rdd_new)
Basically, I am creating an almost entirely new DataFrame. My questions are:
Is this the right approach?
Since I am changing most of the schema, is there a better approach?
Use Spark DataFrames/SQL; why use an RDD? You don't need any low-level data operations. Everything here is column-level, so DataFrames are easier and more efficient to use.
To create new columns, use .withColumn(<col_name>, <expression/value>).
The if branches can be expressed as column expressions with when(...).otherwise(...); use .filter only where rows should actually be dropped.
The whole ConvertToDec can be written better using strip and the ast module or float.
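As a rough illustration of that last point, the trailing-minus values in the sample data (such as 0015.5126-) can be parsed without a regular expression. This is a plain-Python sketch, not Spark-specific; the names parse_trailing_minus and SENTINEL are assumptions introduced here, and it uses Decimal rather than float to keep the digits exact.

```python
from decimal import Decimal, InvalidOperation

SENTINEL = Decimal("-9999.9999")  # same fallback value the question uses

def parse_trailing_minus(text):
    """Parse strings such as '0015.5126-' where the sign follows the digits."""
    text = text.strip()
    negative = text.endswith("-")
    if negative:
        text = text[:-1]  # drop the trailing sign before converting
    try:
        value = Decimal(text)
    except InvalidOperation:
        return SENTINEL  # empty or non-numeric input
    return -value if negative else value

for raw in ["0015.5126-", "0024.0446-", " 18.0719 ", "n/a"]:
    print(repr(raw), "->", parse_trailing_minus(raw))
```

Note that Decimal accepts a few special spellings (for example "NaN"), so stricter validation may still be wanted if the input is untrusted.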
How to sort somelist As List(of T) by the order set in another list sortorder As List(of Integer)? Both somelist and sortorder are of the same size and are indexed from 0 to n. Integers in the sortorder list determine the sort order: new index of item X in somelist = value of item X in sortorder.
Like this:
somelist = (itemA, itemB, itemC)
sortorder = (3, 1, 2)
somelist.sort()
somelist = (itemB, itemC, itemA)
I am trying to sort several equally sized lists using the predefined sort order.
You could use LINQ, although I hate the ugly method syntax in VB.NET:
somelist = somelist.
Select(Function(t, index) New With {.Obj = t, .Index = index}).
OrderBy(Function(x) sortorder(x.Index)).
Select(Function(x) x.Obj).
ToList()
This uses the overload of Enumerable.Select that projects the index of the item. The object and the index are stored in an anonymous type which is used for the ordering. Finally, I select the object and use ToList to build the ordered list.
Another way is to use Enumerable.Zip to merge both into an anonymous type:
Dim q = From x In somelist.Zip(sortorder, Function(t, sort) New With {.Obj = t, .Sort = sort})
Order By x.Sort Ascending
Select x.Obj
somelist = q.ToList()
If you want to order it descending, so the highest values first, use OrderByDescending in the method syntax and Order By x.Sort Descending in the query.
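The pairing idea behind Zip is language-independent. Here is a minimal Python sketch of it, using the sample values from the question; sort_by_order is a hypothetical helper name.

```python
def sort_by_order(items, order, descending=False):
    """Return items rearranged so that they follow the ranks given in order."""
    # Pair each rank with its item, then sort on the rank only,
    # so the items themselves never need to be comparable.
    pairs = sorted(zip(order, items), key=lambda pair: pair[0], reverse=descending)
    return [item for _rank, item in pairs]

somelist = ["itemA", "itemB", "itemC"]
sortorder = [3, 1, 2]

result = sort_by_order(somelist, sortorder)
print(result)  # ['itemB', 'itemC', 'itemA']
```

Because the function returns a new list and leaves its inputs alone, the same sortorder can be reused to rearrange several equally sized lists.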
But why do you store such related information in two different collections at all?
I have a data table with time-series data. How do I write a query to add daily data together for selected series?
My table looks like this...
Day,Y,Series
1,1,A
1,2,B
1,3,C
2,2,A
2,3,B
2,5,C
3,4,A
3,1,B
3,4,C
etc.
I want to return an array (dY) based on a list e.g. {"A","C"}. e.g. giving the Y value (for A+C) for each day...
dY = {4,7,8}
I have managed to write the query in SQL
SELECT Sum(myTable.Y) AS [Total Of Y]
FROM myTable
WHERE (((myTable.Series) In ('A','C')))
GROUP BY myTable.Day;
and I think it should be something like this in LINQ (VB.NET)
Dim mySeries = {"A", "C"}
Dim Ys = (From myrows In oSubData Where mySeries.Contains(myrows("Series")) Select mycol = Sum(Val(myrows("Y"))))
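The missing piece in the attempt above is the grouping by day: filter the rows to the selected series, then sum Y per day. A minimal sketch of that logic in Python follows, not a LINQ translation. It assumes each day carries one reading for each of the series A, B and C, which is what the expected result {4,7,8} implies; daily_totals is a hypothetical name.

```python
from collections import defaultdict

def daily_totals(rows, selected):
    """Sum Y per day over the rows whose series is in the selected set."""
    totals = defaultdict(int)
    for day, y, series in rows:
        if series in selected:
            totals[day] += y
    # One total per day, in ascending day order.
    return [totals[day] for day in sorted(totals)]

# (Day, Y, Series); assumed layout: one A, B and C reading per day.
rows = [
    (1, 1, "A"), (1, 2, "B"), (1, 3, "C"),
    (2, 2, "A"), (2, 3, "B"), (2, 5, "C"),
    (3, 4, "A"), (3, 1, "B"), (3, 4, "C"),
]

dY = daily_totals(rows, {"A", "C"})
print(dY)  # [4, 7, 8]
```

A day with no rows in the selected series is simply absent from the result, which matches what a SQL GROUP BY after the WHERE filter would return.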
I have a situation where I need to sort arrays and preserve the current key - value pairs.
For example, this array:
(0) = 4
(1) = 3
(2) = 1
(3) = 2
Needs to sort like this
(2) = 1
(3) = 2
(1) = 3
(0) = 4
Retaining the original keys. Array.Sort(myArray) sorts into the right sequence but doesn't keep the indexes. I need a variant that does.
edit
Using the links, this seems close to what I want. Do I just need to remove the extra brackets to convert this to vb.net?
myList.Sort((firstPair,nextPair) =>
{
return firstPair.Value.CompareTo(nextPair.Value);
}
);
(Also, would I integrate this as a function or something else?)
In an array, the order is determined by the indexes (what you call "keys"). Thus, there cannot be an array like this:
(2) = 1
(3) = 2
(1) = 3
(0) = 4
What you need is a data structure that has keys, values and an order (which is independent from the keys). You can use a List(Of KeyValuePair) or (if you use .net 4) List(Of Tuple(Of Integer, Integer)) for this; a few examples are shown in the link provided by Ken in the comment (which I will repeat here for convenience):
How do you sort a C# dictionary by value?
EDIT: Another option would be to use LINQ to automatically create a sorted IEnumerable(Of Tuple(Of Integer, Integer)):
Dim a() As Integer = {4, 3, 1, 2} ' This is your array
Dim tuples = a.Select(Function(value, key) New Tuple(Of Integer, Integer)(key, value))
Dim sorted = tuples.OrderBy(Function(t) t.Item2)
(untested, don't have Visual Studio available right now)
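The same pair-then-sort idea, sketched in Python for illustration: each value is paired with its original index, and the pairs are ordered by value. The helper name sort_keeping_indices is an assumption made for this example.

```python
def sort_keeping_indices(values):
    """Return (original_index, value) pairs ordered by value."""
    return sorted(enumerate(values), key=lambda pair: pair[1])

a = [4, 3, 1, 2]  # the array from the question
pairs = sort_keeping_indices(a)

for index, value in pairs:
    print(f"({index}) = {value}")
```

The result is a list of pairs rather than an array, which is exactly the point made above: an array cannot carry an order that differs from its indexes.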
Since you are using .NET 2.0 (you said in one of the comments that you are using Visual Studio 2005), a SortedDictionary might be an option for you, if every array value appears only once. (An OrderedDictionary keeps insertion order; it is the SortedDictionary that is ordered by key.) Since a SortedDictionary is ordered by the key, you could add your array entries to such a dictionary, using
the array index as the dictionary value and
the array value as the dictionary key (which will be used to order the dictionary).
What you are looking for is storing it as a Dictionary<int, int> and sorting the dictionary by value.
I think the VB.NET syntax for the dictionary is Dictionary(Of Integer, Integer).