Scalding write to JDBCSource having greater than 22 columns - scalding

Is there a way in scalding to write to a SQL table that has greater than 22 columns? The problem I am facing is as follows. I have a table which has 28 columns, each row of which I am representing using a case class. Something like
case class ResultsRow(field1: String, field2: Int, ... field28: String)
I am at the last stage, where I have a TypedPipe[ResultsRow] that I need to serialize to the DB. However, both the JDBCSource provided by scalding and the one from parallelai seem to accept only tuples as input, not case classes. I would like to do something like
val results: TypedPipe[ResultsRow] = getResultRows
val dbOutput: JDBCSource = JDBCSource(...)
and I can't do
.map { row => (row.field1, row.field2, ..., row.field28) }
because you can't define a tuple having more than 22 fields.
Note: the approach prescribed in the Scalding FAQ doesn't work here, because it handles reading more than 22 fields, not serializing them to a database.


How can I display enum string key instead of integer value in SQL query?

I am pretty new to SQL. I assume this is fairly simple, but I haven't been able to find a straightforward answer online.
I am writing a simple SQL query to group database records by an enum column, and display the count of each value. It works fine, but the output is displaying the enum integer, where I want it to display the string key of that enum value.
Here is an example of the SQL query:
SELECT COUNT(a.sound) as "Sound Count", a.sound
FROM animals a
GROUP BY a.sound
Here is the enum definition:
enum sound: {
bark: 0,
meow: 1,
moo: 2
}
And here is the output of the query:
Sound Count Sound
2 0
4 1
3 2
Whereas I really want:
Sound Count Sound
2 bark
4 meow
3 moo
You are asking the DB for info using SQL and so it will not have any knowledge of your Rails enums. You need to use Rails to make the query:
=> {"bark"=>2, "meow"=>4, "moo"=>3}
For a pure sql answer with Postgresql:
SELECT temp.sound_count,
CASE
WHEN temp.sound = 0 THEN 'bark'
WHEN temp.sound = 1 THEN 'meow'
WHEN temp.sound = 2 THEN 'moo'
END AS my_sound
FROM (SELECT COUNT(a.sound) AS sound_count, a.sound FROM animals a
GROUP BY a.sound)
AS temp;
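To try the CASE mapping without a Postgres instance, here is a minimal sketch in Python using the standard-library sqlite3 module. The in-memory table and its sample rows are assumptions made up to match the counts in the question, and SQLite stands in for Postgresql, so treat it as an illustration rather than the exact setup.

```python
import sqlite3

# In-memory stand-in for the animals table. The integer codes follow the
# enum in the question (0 = bark, 1 = meow, 2 = moo); the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (id INTEGER PRIMARY KEY, sound INTEGER)")
sample_sounds = [0] * 2 + [1] * 4 + [2] * 3  # 2 barks, 4 meows, 3 moos
conn.executemany(
    "INSERT INTO animals (sound) VALUES (?)", [(s,) for s in sample_sounds]
)

QUERY = """
SELECT COUNT(a.sound) AS sound_count,
       CASE a.sound
           WHEN 0 THEN 'bark'
           WHEN 1 THEN 'meow'
           WHEN 2 THEN 'moo'
       END AS my_sound
FROM animals a
GROUP BY a.sound
ORDER BY a.sound
"""


def sound_counts(connection):
    """Return (count, label) rows, one per distinct sound code."""
    return connection.execute(QUERY).fetchall()


for count, label in sound_counts(conn):
    print(count, label)
```

The mapping lives in the query itself, so it has to be updated by hand whenever a value is added to the enum.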
If you're not working on a legacy database and are able to change the schema, then I would suggest not using an integer backed enum. Using a string backed enum will make your database readable without the application code. Then when you add new values to your code, you don't need to document what the integers mean.
Instead of defining the enum as you do, define it as strings:
enum sound: {
bark: 'bark',
meow: 'meow',
moo: 'moo'
}
And make sure that the column in the database is also a string.
Now you get all the benefits of enum without the hassle of integers in the database. Your query will also work as-is and produce the result you asked for.
As long as the column is indexed, it's basically just as fast to query as an integer. It will just take a few more bytes of space.
If you want to enforce values on the database level, a postgres enum could also be considered.

Using Rails, what's wrong with this query? It does not return a valid ID.
It returns an ID like this: Store:0x00007f8717546c30
select does not return an array of strings or integers for the given column(s), but rather an active record relation containing objects with just the given field:
Your code is then converting that relation to an array, and taking the first object in that array, which is an instance of the Store class. If you want the ID, then try:
However, I think you're misunderstanding how to structure the queries. Put the where part first, and then find the ID of the first result:
And if there is only 1 store, then:
Store.find_by(user: current_user).id
(or if there are many)

How to count rows based on another column?

I am a beginner at SQL and I am using Microsoft Access. I am trying to create count tables based on Object Def. However, some objects in Object Def have a related column that indicates how many objects are within that row. Most objects in Object Def are singular objects and are represented with a blank field.
I want the output to look something like this:
Object Def Total
Cat 3
Dog 4
Rat 4
You could use
SELECT object_def,
SUM(Nz(object_count, 1)) AS total
FROM table_name
GROUP BY object_def;
For ease of testing for your scenario:!18/1e9d5/3
Otherwise, Group By gets you the objective you are trying to obtain.
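The effect of treating a blank count as 1 can be checked with a small sketch in Python and the standard-library sqlite3 module. SQLite has no Nz, so COALESCE is used here as its closest equivalent; the sample rows are hypothetical, chosen only so the totals match the desired output in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (object_def TEXT, object_count INTEGER)")

# Hypothetical rows: None models the blank field of a singular object.
rows = [
    ("Cat", None), ("Cat", 2),
    ("Dog", None), ("Dog", None), ("Dog", 2),
    ("Rat", 4),
]
conn.executemany("INSERT INTO table_name VALUES (?, ?)", rows)


def totals(connection):
    """Sum the counts per object, counting a blank (NULL) field as 1."""
    return connection.execute(
        """
        SELECT object_def, SUM(COALESCE(object_count, 1)) AS total
        FROM table_name
        GROUP BY object_def
        ORDER BY object_def
        """
    ).fetchall()


for object_def, total in totals(conn):
    print(object_def, total)
```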

Maintaining auto ranking as a column in MongoDB

I am using MongoDB as my database.
I have data which contains rank and name as columns. A new row can be inserted with a rank that either differs from all existing ranks or matches one of them.
If it matches, the ranks of the other rows must be adjusted.
Rows ranked at or below the inserted position (rank number greater than or equal to the new rank) must be incremented by one, and rows ranked above it can remain as they are.
Feature is something like number bulleted list in MS Word type of applications. Where inserting a row in between adjust the numbering of other rows below it.
Rank 1 is the highest rank.
For e.g. there are 3 rows
Name Rank
A 1
B 2
C 3
Now I want to insert a row with D as name and 2 as rank. After the insert, the DB should look like this:
Name Rank
A 1
B 3
C 4
D 2
Probably using Database triggers i can achieve this by updating the other rows.
I have couple of questions
(a) Is there any other better way than using database trigger for achieving this kind of scenario ? Updating all the rows might be a time consuming job.
(b) Does MongoDB support database trigger natively ?
Best Regards,
No, MongoDB does not provide triggers (yet). Also, I don't think a trigger is really a great way to achieve this.
So I would just like to throw some ideas, see if it makes sense.
Approach 1
Maybe instead of disturbing those many documents, you can create a collection with only one document (Let's call the collection ranking). In that document, have an array field call ranks. Since it's an array it's already maintaining a sequence.
{
_id : "RANK",
"ranks" : ["A","B","C"]
}
Now if you want to add D to this rank at 2nd position
db.ranking.update({_id:"RANK"},{$push : {"ranks":{$each : ["D"],$position:1}}});
it would add D to index 1 which is 2nd position considering index starts at 0.
{
_id : "RANK",
"ranks" : ["A","D","B","C"]
}
But there is a catch: what if you want to move C from 4th position to 1st? You need to remove it from the end and put it at the beginning. I am fairly sure both operations can't be achieved in a single update (I didn't dig into the options much), so we can run two queries:
db.ranking.update({_id:"RANK"},{$pull : {"ranks": "C"}});
db.ranking.update({_id:"RANK"},{$push : {"ranks":{$each : ["C"],$position:0}}});
Then it would be like
{
_id : "RANK",
"ranks" : ["C","A","D","B"]
}
maintaining the rest of sequence.
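To make the array semantics concrete, here is a minimal sketch in plain Python that mimics the two kinds of update on an in-memory list. It does not talk to MongoDB; the helper names insert_at, move_to and rank_of are hypothetical and only model what $push with $position and $pull do to the ranks array.

```python
def insert_at(ranks, name, position):
    """Mimic $push with $each/$position: insert name at a 0-based index."""
    updated = list(ranks)
    updated.insert(position, name)
    return updated


def move_to(ranks, name, position):
    """Mimic the $pull update followed by the $push/$position update."""
    updated = [r for r in ranks if r != name]  # $pull removes every match
    updated.insert(position, name)
    return updated


def rank_of(ranks, name):
    """Rank is the 1-based position in the array (rank 1 is the highest)."""
    return ranks.index(name) + 1


ranks = ["A", "B", "C"]
ranks = insert_at(ranks, "D", 1)   # ["A", "D", "B", "C"]
ranks = move_to(ranks, "C", 0)     # ["C", "A", "D", "B"]
print(ranks, rank_of(ranks, "D"))
```

The rank itself is never stored: it is derived from the position, which is why no other entry needs to be rewritten on insert.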
Now you would probably want to store ids instead of A, B, C, etc. One document can be up to 16MB, so this ranks array can hold on the order of a million ids if each id is a 12-byte MongoDB ObjectId (16MB / 12 bytes is about 1.3 million as an upper bound, before BSON's per-element overhead). If that is not enough, we still have the option of follow-up document(s) with further ranking.
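The capacity estimate can be worked through as a rough sketch in Python. It assumes the BSON layout in which an array is stored as a document keyed "0", "1", ..., so every element costs a type byte and a null-terminated index string on top of the 12-byte ObjectId. It ignores the enclosing document's own fields, so the real limit would be slightly lower still.

```python
MAX_BSON_BYTES = 16 * 1024 * 1024  # MongoDB's 16 MB document size limit
OBJECT_ID_BYTES = 12


def raw_capacity():
    """Upper bound that counts only the 12 bytes of each ObjectId."""
    return MAX_BSON_BYTES // OBJECT_ID_BYTES


def bson_array_bytes(n):
    """Approximate size of a BSON array holding n ObjectIds.

    Layout assumed: 4-byte length prefix, then per element a type byte,
    the decimal index as a null-terminated string and the 12-byte value,
    then one terminating null byte.
    """
    size = 4 + 1
    digits, start = 1, 0
    while start < n:
        end = min(n, 10 ** digits)  # indices start..end-1 have `digits` digits
        size += (end - start) * (1 + digits + 1 + OBJECT_ID_BYTES)
        start = end
        digits += 1
    return size


def realistic_capacity():
    """Largest n whose array still fits in one document (binary search)."""
    lo, hi = 0, raw_capacity()
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if bson_array_bytes(mid) <= MAX_BSON_BYTES:
            lo = mid
        else:
            hi = mid - 1
    return lo


print(raw_capacity(), realistic_capacity())
```

Under these assumptions the usable capacity comes out noticeably below the raw 1.3 million figure, which is worth knowing before relying on a single ranking document.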
Approach 2
Instead of storing rank as a number, you can also just have two fields like followedBy and precededBy.
so your user document would look
If you want to add D at the second position, you need to change the document currently in 2nd position and insert the new one, so only two documents change.
"precededBy":"D" //changed from A to D
The downside of this approach is that you cannot sort by rank in a query; you have to fetch all the documents into the application and build a linked-list style structure.
This approach just preserves the ranking with minimal DB changes.
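A minimal sketch of this linked-list idea in plain Python, with a dict standing in for the collection. To keep it short it tracks only precededBy (a followedBy field would need one more update per insert), and the helper names insert_before and ordered_names are hypothetical.

```python
def insert_before(docs, new_name, target_name):
    """Insert new_name directly ahead of target_name.

    Only two documents are written: the new one and the target.
    """
    docs[new_name] = {"precededBy": docs[target_name]["precededBy"]}
    docs[target_name]["precededBy"] = new_name


def ordered_names(docs):
    """Rebuild the ranking in application code by walking the links."""
    successor = {doc["precededBy"]: name for name, doc in docs.items()}
    order, current = [], successor.get(None)  # the head has no predecessor
    while current is not None:
        order.append(current)
        current = successor.get(current)
    return order


# Stand-in for the collection: A is first, then B, then C.
docs = {
    "A": {"precededBy": None},
    "B": {"precededBy": "A"},
    "C": {"precededBy": "B"},
}
insert_before(docs, "D", "B")  # B's precededBy changes from A to D
print(ordered_names(docs))     # ['A', 'D', 'B', 'C']
```

Reading the ranking costs a full pass over the documents, which is the trade-off described above.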

search within an array with a condition

I have two arrays that I'm trying to compare at many levels. Both have the same structure, with 3 "columns".
The first column contains the polygon's ID, the second an area type, and the third the percentage of each area type for a polygon.
So, for many rows, it will compare, for example, ID : 1 Type : aaa % : 100
But for some elements, I have many rows for the same ID. For example, I'll have ID 2, Type aaa, 25% --- ID 2, type bbb, 25% --- ID 2, type ccc, 50%. And in the second array, I'll have ID 2, Type aaa, 25% --- ID 2, type bbb, 10% --- ID 2, type eee, 38% --- ID 2, type fff, 27%.
here's a visual example..
So, my function has to compare these two array and send me an email if there are differences.
(I wont show you the real code because there are 811 lines). The first "if" condition is
if = Then
if array1.type = array2.type Then
if array1.percent = array2.percent Then
zone_verification = True
zone_verification = False
The problem is that there are more than 50 000 rows in each array. So when I run the function, for each "", the function searches through 50 000 rows in array2. That is 50 000 searches for 50 000 rows, so it takes a long time to run!
I'm looking for something to get it running faster. How could I get my search more specific. Example : I have many id "2" in the array1. If there are many id "2" in the array2, find it, and push all the = 3 in a "sub array" or something like that, and search in these specific rows. So I'll have just X rows in array1 to compare with X rows in array 2, not with 50 000. and when each "id 2" in array1 is done, do the same thing for "id 4".. and for "id 5"...
Hope it's clear. it's almost the first time I use, and I have this big function to get running.
Here's what I wanna do.
I have two different layers in a geospatial database. Both layers have the same structure. They are a "spatial join" of the land parcels (55 000), and the land use layer. The first layer is the current one, and the second layer is the next one we'll use after 2015.
So I have, for each "land parcel", the percentage of each land use. For a "land parcel" (ID 7580-80-2532), I can have 50% of farming use (TYPE FAR-23) and 50% of residential use (RES-112). In the first array, I'll have 2 rows with the same ID (7580-80-2532), but each one will have a different type (FAR-23, RES-112) and a different %.
In the second layer, the municipal zoning (land use) has changed. So the same "land parcel" will now be 40% residential use (RES-112), 20% commercial (COM-54) and 40% a new farming use (FAR-33).
So, I wanna know if there are some differences. Some land parcels will be exactly the same. Some parcels will keep the same land use, but not the same percentage of each. But for some land parcel, there will be more or less land use types with different percentage of each.
I want this script to compare these two layers and send me an email when there are differences between these two layers for the same land parcel ID.
The script is already working, but it takes too much time.
The problem is, I think, that the script goes through all of array2 for each row in array1.
What I want is, when there is more than 1 row with the same ID in array1, to take only this ID in both arrays.
Maybe if I order them by ID, I could write a condition, something like "when you find what you're looking for, stop searching as soon as you find a different value"?
It's hard to explain it clearly because I've been using VB since last week.. And english isn't my first language! ;)
If you just want to find out if there are any differences between the first and second array, you could do:
Dim diff = New HashSet(Of Polygon)(array1)
diff.SymmetricExceptWith(array2) 'Keep only the items found in exactly one of the two arrays
diff will contain any Polygon which is unique to array1 or array2. If you want to do other types of comparisons, maybe you should explain what you're trying to do exactly.
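The same set-difference idea, sketched in Python for illustration. The Polygon tuple and the sample rows are assumptions based on the ID / type / percent example in the question; a NamedTuple is used because it compares and hashes by value, which is what the set operation needs.

```python
from typing import NamedTuple


class Polygon(NamedTuple):
    """Hypothetical row type; tuples compare and hash by value."""
    id: int
    type: str
    percent: int


array1 = [
    Polygon(1, "aaa", 100),
    Polygon(2, "aaa", 25), Polygon(2, "bbb", 25), Polygon(2, "ccc", 50),
]
array2 = [
    Polygon(1, "aaa", 100),
    Polygon(2, "aaa", 25), Polygon(2, "bbb", 10),
    Polygon(2, "eee", 38), Polygon(2, "fff", 27),
]


def symmetric_diff(first, second):
    """Rows that appear in exactly one of the two arrays."""
    return set(first) ^ set(second)


diff = symmetric_diff(array1, array2)
changed_ids = sorted({p.id for p in diff})
print(changed_ids)  # [2]
```

Building the two sets is linear in the number of rows, so this avoids scanning array2 once per row of array1.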
You could use grouping and lookups like this:
'Create lookup with first array, for fast access by ID
Dim lookupByID = array1.ToLookup(Function(p)
'Loop through each group of items with same ID in array2
For Each secondArrayValues in array2.GroupBy(Function(p)
Dim currentID As Integer = secondArrayValues.Key 'Current ID is the grouping key
'Retrieve values with same ID in array1
'Use a hashset to easily compare for equality
Dim firstArrayValues As New HashSet(of Polygon)(lookupByID(currentID))
'Check for differences between the two sets of data, for this ID
If Not firstArrayValues.SetEquals(secondArrayValues) Then
'Data has changed, do something
Console.WriteLine("Differences for ID " & currentID)
End If
Next
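For comparison, the grouping approach can be sketched in Python with a dictionary of sets. The rows reuse the land parcel example from the question, plus one unchanged parcel with a made-up ID; the function names are hypothetical.

```python
from collections import defaultdict


def group_by_id(rows):
    """Map each parcel ID to the set of its (type, percent) pairs."""
    groups = defaultdict(set)
    for parcel_id, land_type, percent in rows:
        groups[parcel_id].add((land_type, percent))
    return groups


def changed_ids(current_rows, next_rows):
    """IDs whose set of (type, percent) pairs differs between the layers.

    An ID that exists in only one layer is also reported.
    """
    current, upcoming = group_by_id(current_rows), group_by_id(next_rows)
    return sorted(
        pid for pid in current.keys() | upcoming.keys()
        if current.get(pid) != upcoming.get(pid)
    )


current_layer = [
    ("7580-80-2532", "FAR-23", 50), ("7580-80-2532", "RES-112", 50),
    ("0000-00-0001", "RES-112", 100),  # made-up unchanged parcel
]
next_layer = [
    ("7580-80-2532", "RES-112", 40), ("7580-80-2532", "COM-54", 20),
    ("7580-80-2532", "FAR-33", 40),
    ("0000-00-0001", "RES-112", 100),
]
print(changed_ids(current_layer, next_layer))  # ['7580-80-2532']
```

Each row is visited once while grouping, and each ID is compared once, instead of searching the whole second array for every row of the first.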
I am answering this question based on the first part that you wrote (that is, without the EDIT section). A proper answer should explain a good algorithm, but I suggest using the DB's capabilities, because databases are optimized for queries of this kind.
Put all the records into two DB tables, which takes O(n) time. If the records are static, you don't need to perform this step every time.
Table 1
id type percent
Table 2
id type percent
Then use the DB query, some thing like this
select count(*) from table1 t1, table2 t2 where! and t1.type!=t2.type
(you can use some better queries, what I am trying to say is give the control to DB to perform this operation)
retrieve the result in your code and perform the necessary operation.
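As one possible shape for such a query, here is a sketch in Python with the standard-library sqlite3 module. It is not the query from this answer: it uses EXCEPT in both directions to list the IDs whose rows differ, which is an assumption about what the comparison should report. The table layout follows the id / type / percent columns above, and the sample rows are invented.

```python
import sqlite3


def differing_ids(rows1, rows2):
    """Return the IDs that have at least one row present in only one table."""
    conn = sqlite3.connect(":memory:")
    for name, rows in (("table1", rows1), ("table2", rows2)):
        conn.execute(f"CREATE TABLE {name} (id TEXT, type TEXT, percent INTEGER)")
        conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)
    query = """
        SELECT id FROM (SELECT * FROM table1 EXCEPT SELECT * FROM table2)
        UNION
        SELECT id FROM (SELECT * FROM table2 EXCEPT SELECT * FROM table1)
        ORDER BY id
    """
    result = [row[0] for row in conn.execute(query)]
    conn.close()
    return result


rows1 = [("1", "aaa", 100), ("2", "aaa", 25), ("2", "bbb", 25)]
rows2 = [("1", "aaa", 100), ("2", "aaa", 25), ("2", "bbb", 10)]
print(differing_ids(rows1, rows2))  # ['2']
```

The database does the set comparison, and the application only receives the list of IDs it has to report.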
1) You can sort them in O(n logn) time based on ID + type + Percent and then perform binary search.
2) Store the records of the first array in a hash map with an appropriate key, which could be ID only or ID+type.
This takes O(n) time, and each lookup, if the key is chosen well, takes constant time.
You need to define a structure to store this data. We'll store all the data in a LandParcel class, which will have a HashSet<ParcelData>
public class ParcelData
{
    public ParcelType Type { get; set; } // This can be an enum, string, etc.
    public int Percent { get; set; }
    // Redefine Equals and GetHashCode conveniently
}

public class LandParcel
{
    public ID Id { get; set; } // Whatever the type of the ID is...
    public HashSet<ParcelData> Data { get; set; }
}
Now you have to build your data structure, with something like this:
Dictionary<ID, LandParcel> data1 = new ....
foreach (var item in array1)
LandParcel p;
if (!data1.TryGetValue(, out p)
data1[] = p = new LandParcel(id);
// Can this data be repeated?
p.Data.Add(new ParcelData(item.type, item.percent));
You do the same with a data2 dictionary for the second array. Now you iterate for all items in data1 and compare them with the item with the same id for data2.
foreach (var parcel2 in data2.Values)
{
    var parcel1 = data1[parcel2.Id]; // Beware with exceptions here !!!
    if (!parcel1.Data.SetEquals(parcel2.Data))
    {
        // You have different parcels
    }
}
(Now that I look at it, we are practically doing a small database query here, kind of smelly code ...)
Sorry for the C# code since I don't really feel so comfortable with VB, but it should be fairly straightforward.