Short question: I would like to split a BQ table into multiple small tables, based on the distinct values of a column. So, if column country has 10 distinct values, it should split the table into 10 individual tables, with each having respective country data. Best, if done from within a BQ query (using INSERT, MERGE, etc.).
What I am doing right now is moving the data to Google Cloud Storage, downloading it to local storage, splitting it locally, and then pushing the pieces into tables, which is a very time-consuming process.
Thanks.
If the data has the same schema, just leave it in one table and use the clustering feature: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#creating_a_clustered_table
#standardSQL
CREATE TABLE mydataset.myclusteredtable
PARTITION BY dateCol
CLUSTER BY country
OPTIONS (
description="a table clustered by country"
) AS (
SELECT ....
)
https://cloud.google.com/bigquery/docs/clustered-tables
The feature is in beta though.
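If separate tables are still required, the split itself boils down to one "create empty copy, then INSERT ... SELECT" pair per distinct value. Below is a minimal sketch of that idea, using Python's built-in sqlite3 as a local stand-in for BigQuery. The `sales` table, its columns, and the `split_by_column` helper are hypothetical, and issuing the equivalent statements against BigQuery is not shown.

```python
import re
import sqlite3


def split_by_column(conn, table, column):
    """Create one table per distinct value of `column` and return the new table names."""
    distinct = conn.execute(
        f'SELECT DISTINCT "{column}" FROM "{table}" '
        f'WHERE "{column}" IS NOT NULL ORDER BY 1'
    ).fetchall()
    created = []
    for (value,) in distinct:
        # Table names cannot be bound as parameters, so keep only word characters.
        target = f"{table}_" + re.sub(r"\W", "_", str(value))
        # Copy the column layout without rows, then fill it with one INSERT ... SELECT.
        conn.execute(f'CREATE TABLE "{target}" AS SELECT * FROM "{table}" WHERE 0')
        conn.execute(
            f'INSERT INTO "{target}" SELECT * FROM "{table}" WHERE "{column}" = ?',
            (value,),
        )
        created.append(target)
    return created


# Hypothetical sample data, only to exercise the helper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (country TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)", [("US", 10), ("DE", 20), ("US", 30)]
)
tables = split_by_column(conn, "sales", "country")
```

The value being filtered on is always bound as a parameter; only the table name is built from it, which is why it is sanitized first.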
You can use Dataflow for this. This answer gives an example of a pipeline that queries a BigQuery table, splits the rows based on a column and then outputs them to different PubSub topics (which could be different BigQuery tables instead):
Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());
PCollection<TableRow> weatherData = p.apply(
BigQueryIO.Read.named("ReadWeatherStations").from("clouddataflow-readonly:samples.weather_stations"));
final TupleTag<String> readings2010 = new TupleTag<String>() {
};
final TupleTag<String> readings2000plus = new TupleTag<String>() {
};
final TupleTag<String> readingsOld = new TupleTag<String>() {
};
PCollectionTuple collectionTuple = weatherData.apply(ParDo.named("tablerow2string")
.withOutputTags(readings2010, TupleTagList.of(readings2000plus).and(readingsOld))
.of(new DoFn<TableRow, String>() {
@Override
public void processElement(DoFn<TableRow, String>.ProcessContext c) throws Exception {
if (c.element().getF().get(2).getV().equals("2010")) {
c.output(c.element().toString());
} else if (Integer.parseInt(c.element().getF().get(2).getV().toString()) > 2000) {
c.sideOutput(readings2000plus, c.element().toString());
} else {
c.sideOutput(readingsOld, c.element().toString());
}
}
}));
collectionTuple.get(readings2010)
.apply(PubsubIO.Write.named("WriteToPubsub1").topic("projects/fh-dataflow/topics/bq2pubsub-topic1"));
collectionTuple.get(readings2000plus)
.apply(PubsubIO.Write.named("WriteToPubsub2").topic("projects/fh-dataflow/topics/bq2pubsub-topic2"));
collectionTuple.get(readingsOld)
.apply(PubsubIO.Write.named("WriteToPubsub3").topic("projects/fh-dataflow/topics/bq2pubsub-topic3"));
p.run();
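The routing decision inside the DoFn above (main output for 2010, one side output for later than 2000, another for everything else) can be sketched on its own, without Dataflow. This is a plain-Python illustration of the same three-way split. The `year` field name and the sample rows are assumptions, not part of the pipeline.

```python
from collections import defaultdict


def route(row):
    """Return the name of the output a row belongs to (mirrors the three tags above)."""
    year = int(row["year"])
    if year == 2010:
        return "readings2010"
    if year > 2000:
        return "readings2000plus"
    return "readingsOld"


def split_rows(rows):
    """Group rows into one list per output name."""
    outputs = defaultdict(list)
    for row in rows:
        outputs[route(row)].append(row)
    return dict(outputs)


# Hypothetical rows; in the pipeline these would be BigQuery TableRow objects.
rows = [{"year": "2010"}, {"year": "2005"}, {"year": "1999"}, {"year": "2010"}]
outputs = split_rows(rows)
```

To split on a column such as country instead, `route` would simply return the column value, giving one output per distinct value.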
The following SQLite3 statement is very slow in my Swift-Code (see below)
Any idea why ??
Currently, it takes approx 30 seconds from breakpoint-1 to breakpoint-2
// .. breakpoint-1
while sqlite3_step(statement) == SQLITE_ROW {
// .. breakpoint-2
}
My SQLite-db is 450 MB in size and the query is quite complex.
I am simply not sure if this "while-loop" cannot be made faster somehow ??
Here is the Query:
var query1: String = ""
query1 =
"""
SELECT DISTINCT st.departure_time as time, t.trip_headsign, r.route_desc, t.trip_id
FROM stops s
INNER JOIN calendar c ON t.service_id = c.service_id
INNER JOIN stop_times st ON st.stop_id = s.stop_id
INNER JOIN trips t ON t.trip_id = st.trip_id
INNER JOIN routes r ON r.route_id = t.route_id
WHERE c.\(weekDay) = 1
AND st.departure_time > "\(departureTime)"
AND st.stop_id LIKE '\(stop_id)%'
AND s.stop_name <> t.trip_headsign
ORDER BY st.departure_time ASC
LIMIT 80
"""
Also, after maddy's input, I did create a new DB - this time "indexed" (see on which columns below..) . After indexing - the size of the DB increased to double its original size.
But with the above query and the newly indexed DB, the speed is unfortunately still identically slow (> 30 sec).
How would I need to change the query after having an "indexed" DB ???
How can I increase query-speed ???
These are the indexed columns:
CREATE INDEX departure_time_IDX ON stop_times(departure_time);
CREATE INDEX stop_times_id_IDX ON stop_times(stop_id);
CREATE INDEX stop_times_trip_id_IDX ON stop_times(trip_id);
CREATE INDEX stop_name_IDX ON stops(stop_name);
CREATE INDEX stop_id_IDX ON stops(stop_id);
CREATE INDEX trip_headsign_IDX ON trips(trip_headsign);
CREATE INDEX trip_id_IDX ON trips(trip_id);
CREATE INDEX trip_route_id_IDX ON trips(route_id);
CREATE INDEX service_id_IDX ON trips(service_id);
CREATE INDEX route_desc_IDX ON routes(route_desc);
CREATE INDEX route_id_IDX ON routes(route_id);
CREATE INDEX cal_service_id_IDX ON calendar(service_id);
CREATE INDEX monday_IDX ON calendar(monday);
CREATE INDEX tuesday_IDX ON calendar(tuesday);
CREATE INDEX wednesday_IDX ON calendar(wednesday);
CREATE INDEX thursday_IDX ON calendar(thursday);
CREATE INDEX friday_IDX ON calendar(friday);
CREATE INDEX saturday_IDX ON calendar(saturday);
CREATE INDEX sunday_IDX ON calendar(sunday);
Here is the EXPLAIN QUERY PLAN
I have the query down to 3 seconds if I eliminate the INNER JOIN's of trips and routes. Of course, I need those two for the result I expect out of this. But at least it shows that the INNER JOIN's of those two extra tables are slowing things down.
Also, it has to be noticed that stop_times is the biggest table of all !
(then comes trips - and the others are all smaller...)
The 3-sec query looks like this :
var faster_query: String = ""
faster_query =
"""
SELECT DISTINCT st.departure_time as time
FROM stops s
INNER JOIN stop_times st ON st.stop_id = s.stop_id
WHERE st.departure_time > "19:51:00"
AND st.stop_id LIKE '8505000%'
ORDER BY st.departure_time ASC
LIMIT 80
"""
The EXPLAIN QUERY PLAN for this faster_query looks like this:
Here is the whole code for the SQLite3 query
// Open SQLite database
var db: OpaquePointer? = nil
if let path = self.departure_filePath?.path {
if sqlite3_open(path, &db) == SQLITE_OK {
var statement: OpaquePointer? = nil
// Run SELECT query from db
if sqlite3_prepare_v2(db, query1, -1, &statement, nil) == SQLITE_OK {
// Loop through all results from query1
// .. breakpoint-1
while sqlite3_step(statement) == SQLITE_ROW {
// .. breakpoint-2
let ID = sqlite3_column_text(statement, 0)
if ID != nil {
IDString = String(cString:ID!)
} else {
print("ID not found", terminator: "")
return nil
}
}
}
}
}
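One way to check which indexes SQLite actually picks is to run EXPLAIN QUERY PLAN on the statement and read the `detail` column of each row. Here is a minimal sketch with Python's built-in sqlite3 (not Swift), run against a tiny in-memory copy of two of the tables. The sample rows are made up, and the values are bound as parameters instead of being interpolated into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stops (stop_id TEXT, stop_name TEXT);
    CREATE TABLE stop_times (trip_id TEXT, stop_id TEXT, departure_time TEXT);
    CREATE INDEX stop_times_id_IDX ON stop_times(stop_id);
    CREATE INDEX departure_time_IDX ON stop_times(departure_time);
    -- Hypothetical sample rows
    INSERT INTO stops VALUES ('8505000', 'Central');
    INSERT INTO stop_times VALUES ('t1', '8505000', '19:55:00');
    INSERT INTO stop_times VALUES ('t2', '8505000', '18:00:00');
""")

query = """
    SELECT DISTINCT st.departure_time AS time
    FROM stops s
    INNER JOIN stop_times st ON st.stop_id = s.stop_id
    WHERE st.departure_time > ?
      AND st.stop_id LIKE ?
    ORDER BY st.departure_time ASC
    LIMIT 80
"""
params = ("19:51:00", "8505000%")

# Each plan row ends with a human-readable description of one step (scan or search).
plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query, params)]
times = [row[0] for row in conn.execute(query, params)]

for step in plan:
    print(step)
```

A step that reads "SCAN" walks a whole table or index, while "SEARCH ... USING INDEX" means an index narrowed the rows. Comparing the plan for the slow and the fast query shows which join is not using the indexes listed above.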
I'm currently using IndexedDB to store offline data of some records in a sales store. The store has columns such as id, shopname, and lastsaledate. I would like some help performing the same operation as the following SQL statement using IndexedDB:
SELECT MAX(LastSaleDate) FROM Sales;
Any suggestions?
Ensure you have an index for the lastsaledate property, e.g. when upgrading the database do:
store.createIndex('by_lastsaledate', 'lastsaledate');
When querying, use a reverse cursor ('prev') and null range (i.e. all records):
var store = transaction.objectStore('records');
var index = store.index('by_lastsaledate');
var request = index.openCursor(/*query*/null, /*direction*/'prev');
request.onsuccess = function() {
var cursor = request.result;
if (cursor) {
console.log('max date is: ' + cursor.key);
} else {
console.log('no records!');
}
};
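The reverse cursor works because an index is kept in key order, so the first entry when walking backwards is the maximum. The same idea in SQL terms is `ORDER BY ... DESC LIMIT 1`. Below is a small sketch using Python's built-in sqlite3 to show that both forms agree. The sample rows are made up, and the dates are assumed to be stored as ISO strings so that they sort chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, shopname TEXT, lastsaledate TEXT)"
)
conn.execute("CREATE INDEX by_lastsaledate ON sales(lastsaledate)")
# Hypothetical records
conn.executemany(
    "INSERT INTO sales (shopname, lastsaledate) VALUES (?, ?)",
    [("A", "2020-01-05"), ("B", "2021-03-01"), ("C", "2019-12-31")],
)

# Counterpart of opening the cursor with direction 'prev' and reading the first key.
newest = conn.execute(
    "SELECT lastsaledate FROM sales ORDER BY lastsaledate DESC LIMIT 1"
).fetchone()
max_date = newest[0] if newest else None  # None plays the role of 'no records!'

# The aggregate from the question, for comparison.
aggregate = conn.execute("SELECT MAX(lastsaledate) FROM sales").fetchone()[0]
```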
I'm currently using Entity Framework for my db access but want to have a look at Dapper. I have classes like this:
public class Course{
public string Title{get;set;}
public IList<Location> Locations {get;set;}
...
}
public class Location{
public string Name {get;set;}
...
}
So one course can be taught at several locations. Entity Framework does the mapping for me, so my Course object is populated with a list of locations. How would I go about this with Dapper? Is it even possible, or do I have to do it in several query steps?
Alternatively, you can use one query with a lookup:
var lookup = new Dictionary<int, Course>();
conn.Query<Course, Location, Course>(@"
SELECT c.*, l.*
FROM Course c
INNER JOIN Location l ON c.LocationId = l.Id
", (c, l) => {
Course course;
if (!lookup.TryGetValue(c.Id, out course))
lookup.Add(c.Id, course = c);
if (course.Locations == null)
course.Locations = new List<Location>();
course.Locations.Add(l); /* Add locations to course */
return course;
}).AsQueryable();
var resultList = lookup.Values;
See here https://www.tritac.com/blog/dappernet-by-example/
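The lookup matters because the join returns one row per (course, location) pair, so the same course shows up several times and has to be collapsed by its key. A minimal sketch of that grouping step, written in Python with the built-in sqlite3 instead of Dapper. The column names follow the query above, and the sample rows are hypothetical.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Course:
    id: int
    title: str
    locations: list = field(default_factory=list)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Course (Id INTEGER, Title TEXT, LocationId INTEGER);
    CREATE TABLE Location (Id INTEGER, Name TEXT);
    -- Hypothetical sample rows
    INSERT INTO Location VALUES (1, 'Oslo');
    INSERT INTO Location VALUES (2, 'Bergen');
    INSERT INTO Course VALUES (10, 'SQL Basics', 1);
    INSERT INTO Course VALUES (10, 'SQL Basics', 2);
    INSERT INTO Course VALUES (11, 'Dapper', 1);
""")

lookup = {}
rows = conn.execute("""
    SELECT c.Id, c.Title, l.Name
    FROM Course c
    INNER JOIN Location l ON c.LocationId = l.Id
    ORDER BY c.Id, l.Id
""")
for course_id, title, location_name in rows:
    # Reuse the Course already stored for this key; create it only on first sight.
    course = lookup.setdefault(course_id, Course(course_id, title))
    course.locations.append(location_name)

result = list(lookup.values())
```

Without the dictionary, every joined row would produce its own Course object holding a single location.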
Dapper is not a full-blown ORM; it does not handle magic generation of queries and such.
For your particular example the following would probably work:
Grab the courses:
var courses = cnn.Query<Course>("select * from Courses where Category = 1 Order by CreationDate");
Grab the relevant mapping:
var mappings = cnn.Query<CourseLocation>(
"select * from CourseLocations where CourseId in @Ids",
new {Ids = courses.Select(c => c.Id).Distinct()});
Grab the relevant locations
var locations = cnn.Query<Location>(
"select * from Locations where Id in @Ids",
new {Ids = mappings.Select(m => m.LocationId).Distinct()}
);
Map it all up
Leaving this to the reader, you create a few maps and iterate through your courses populating with the locations.
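For what it's worth, the mapping step could look roughly like this. It is sketched in Python rather than C#, with plain dictionaries standing in for the three query results. The sample data is made up, and the key names simply mirror the columns used in the queries above.

```python
from collections import defaultdict

# Stand-ins for the results of the three queries (hypothetical data).
courses = [{"Id": 1, "Title": "SQL Basics"}, {"Id": 2, "Title": "Dapper"}]
mappings = [
    {"CourseId": 1, "LocationId": 10},
    {"CourseId": 1, "LocationId": 11},
    {"CourseId": 2, "LocationId": 10},
]
locations = [{"Id": 10, "Name": "Oslo"}, {"Id": 11, "Name": "Bergen"}]

# Map 1: location id -> location
locations_by_id = {loc["Id"]: loc for loc in locations}

# Map 2: course id -> list of location ids
location_ids_by_course = defaultdict(list)
for m in mappings:
    location_ids_by_course[m["CourseId"]].append(m["LocationId"])

# Populate each course with its locations.
for course in courses:
    course["Locations"] = [
        locations_by_id[i] for i in location_ids_by_course[course["Id"]]
    ]
```

Both maps are built in a single pass, so the whole step stays linear in the number of rows returned.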
Caveat: the in trick will work only if you have fewer than 2100 lookups (SQL Server). If you have more, you probably want to amend the query to select * from CourseLocations where CourseId in (select Id from Courses ... ). If that is the case, you may as well yank all the results in one go using QueryMultiple.
No need for lookup Dictionary
var coursesWithLocations =
conn.Query<Course, Location, Course>(@"
SELECT c.*, l.*
FROM Course c
INNER JOIN Location l ON c.LocationId = l.Id
", (course, location) => {
course.Locations = course.Locations ?? new List<Location>();
course.Locations.Add(location);
return course;
}).AsQueryable();
I know I'm really late to this, but there is another option. You can use QueryMultiple here. Something like this:
var results = cnn.QueryMultiple(@"
SELECT *
FROM Courses
WHERE Category = 1
ORDER BY CreationDate
;
SELECT A.*
,B.CourseId
FROM Locations A
INNER JOIN CourseLocations B
ON A.LocationId = B.LocationId
INNER JOIN Courses C
ON B.CourseId = C.CourseId
AND C.Category = 1
");
var courses = results.Read<Course>();
var locations = results.Read<Location>(); //(Location will have that extra CourseId on it for the next part)
foreach (var course in courses) {
course.Locations = locations.Where(a => a.CourseId == course.CourseId).ToList();
}
Sorry to be late to the party (like always). For me, it's easier to use a Dictionary, like Jeroen K did, in terms of performance and readability. Also, to avoid header multiplication across locations, I use Distinct() to remove potential dups:
string query = @"SELECT c.*, l.*
FROM Course c
INNER JOIN Location l ON c.LocationId = l.Id";
using (SqlConnection conn = DB.getConnection())
{
conn.Open();
var courseDictionary = new Dictionary<Guid, Course>();
var list = conn.Query<Course, Location, Course>(
query,
(course, location) =>
{
if (!courseDictionary.TryGetValue(course.Id, out Course courseEntry))
{
courseEntry = course;
courseEntry.Locations = courseEntry.Locations ?? new List<Location>();
courseDictionary.Add(courseEntry.Id, courseEntry);
}
courseEntry.Locations.Add(location);
return courseEntry;
},
splitOn: "Id")
.Distinct()
.ToList();
return list;
}
Something is missing. If you do not specify each field from Locations in the SQL query, the object Location cannot be filled. Take a look:
var lookup = new Dictionary<int, Course>();
conn.Query<Course, Location, Course>(@"
SELECT c.*, l.Name, l.otherField, l.secondField
FROM Course c
INNER JOIN Location l ON c.LocationId = l.Id
", (c, l) => {
Course course;
if (!lookup.TryGetValue(c.Id, out course)) {
lookup.Add(c.Id, course = c);
}
if (course.Locations == null)
course.Locations = new List<Location>();
course.Locations.Add(l);
return course;
}).AsQueryable();
var resultList = lookup.Values;
Using l.* in the query, I had the list of locations but without data.
Not sure if anybody needs it, but I have dynamic version of it without Model for quick & flexible coding.
var lookup = new Dictionary<int, dynamic>();
conn.Query<dynamic, dynamic, dynamic>(@"
SELECT A.*, B.*
FROM Client A
INNER JOIN Instance B ON A.ClientID = B.ClientID
", (A, B) => {
// If dict has no key, allocate new obj
// with another level of array
if (!lookup.ContainsKey(A.ClientID)) {
lookup[A.ClientID] = new {
ClientID = A.ClientID,
ClientName = A.Name,
Instances = new List<dynamic>()
};
}
// Add each instance
lookup[A.ClientID].Instances.Add(new {
InstanceName = B.Name,
BaseURL = B.BaseURL,
WebAppPath = B.WebAppPath
});
return lookup[A.ClientID];
}, splitOn: "ClientID,InstanceID").AsQueryable();
var resultList = lookup.Values;
return resultList;
There is another approach using a JSON result. Even though the accepted answer and others are well explained, I thought of another way to get the result.
Create a stored procedure or a select query that returns the result in JSON format, then deserialize the result object into the required class. Please go through the sample code:
using (var db = connection.OpenConnection())
{
var results = await db.QueryAsync("your_sp_name",..);
var result = results.FirstOrDefault();
string Json = result?.your_result_json_row;
if (!string.IsNullOrEmpty(Json))
{
List<Course> Courses= JsonConvert.DeserializeObject<List<Course>>(Json);
}
//map to your custom class and dto then return the result
}
This is just another way of thinking about it. Please review it.
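To make the deserialization step concrete, here is a small sketch in Python (not C#) that turns such a JSON string into nested objects. The JSON shape, the property names `Title`, `Locations` and `Name`, and the `courses_from_json` helper are assumptions for illustration. The real shape depends on what the stored procedure returns.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Location:
    name: str


@dataclass
class Course:
    title: str
    locations: list = field(default_factory=list)


def courses_from_json(payload):
    """Deserialize a JSON array of courses; an empty or missing payload gives []."""
    if not payload:
        return []
    return [
        Course(
            item["Title"],
            [Location(loc["Name"]) for loc in item.get("Locations", [])],
        )
        for item in json.loads(payload)
    ]


# Hypothetical JSON, as it might come back in the single result row.
payload = '[{"Title": "SQL Basics", "Locations": [{"Name": "Oslo"}, {"Name": "Bergen"}]}]'
courses = courses_from_json(payload)
```

The nesting is done by the database here, so no lookup dictionary is needed on the client side.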
I am re-writing a query which is created in response to user's entry into text fields in order to offer some protection against SQL injection attack.
SELECT DISTINCT (FileNameID) FROM SurNames WHERE Surname IN
('Jones','Smith','Armitage')
AND FileNameID IN ( SELECT DISTINCT (FileNameID) FROM FirstNames WHERE FirstName
IN ('John','William') )
There can be up to 3 other tables involved in this process.
The parameter lists can be up to 50-100 entries so building a parameterized query is tedious and cumbersome.
I am trying to create a Linq query which should take care of the parameterization and offer the protection I need.
This gives me what I need
var surnameValues = new[] { "Jones","Smith","Armitage" };
var firstnameValues = new[] { "John","William" };
var result = (from sn in db.Surnames
from fn in db.FirstNames
where surnameValues.Contains(sn.Surname) &&
firstnameValues.Contains(fn.FirstName)
select fn.FileNameID).Distinct().ToArray();
I now need a way to dynamically create this depending upon whether the user has selected/entered values in the surname or firstname text entry boxes?
Any pointers will be gratefully received
Thanks
Roger
You could combine all the logic into the query:
var surnameValues = new[] { "Jones","Smith","Armitage" };
string[] firstnameValues = null;
// Set these two variables to handle null values and use an empty array instead
var surnameCheck = surnameValues ?? new string[0];
var firstnameCheck = firstnameValues ?? new string[0];
var result = (from sn in db.Surnames
from fn in db.FirstNames
where
(!surnameCheck.Any() || surnameCheck.Contains(sn.Surname)) &&
(!firstnameCheck.Any() || firstnameCheck.Contains(fn.FirstName))
select fn.FileNameID).Distinct().ToArray();
Your query doesn't seem to have a join condition between the Surnames table and the FirstNames table?
You could dynamically build the query (as you appear to be doing a cross join, I've used SelectMany):
var query=db.Surnames.SelectMany(sn=>db.FirstNames.Select (fn => new {fn=fn,sn=sn}));
if (surnameValues!=null && surnameValues.Any()) query=query.Where(x=>surnameValues.Contains(x.sn.Surname));
if (firstnameValues !=null && firstnameValues.Any()) query=query.Where(x=>firstnameValues.Contains(x.fn.FirstName));
var result=query.Select(x=>x.fn.FileNameID).Distinct();
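The same "add a filter only when the user supplied values" idea also works for a hand-built parameterized query, and generating the placeholders in a loop removes most of the tedium. Here is a minimal sketch in Python with the built-in sqlite3 (not LINQ). The `build_query` helper and the sample rows are hypothetical, while the table and column names follow the SQL in the question.

```python
import sqlite3


def build_query(surnames=None, firstnames=None):
    """Return (sql, params); a filter is added only when its list is non-empty."""
    clauses, params = [], []
    if surnames:
        marks = ", ".join("?" for _ in surnames)
        clauses.append(f"Surname IN ({marks})")
        params.extend(surnames)
    if firstnames:
        marks = ", ".join("?" for _ in firstnames)
        clauses.append(
            "FileNameID IN (SELECT FileNameID FROM FirstNames "
            f"WHERE FirstName IN ({marks}))"
        )
        params.extend(firstnames)
    sql = "SELECT DISTINCT FileNameID FROM SurNames"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY FileNameID", params


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SurNames (FileNameID INTEGER, Surname TEXT);
    CREATE TABLE FirstNames (FileNameID INTEGER, FirstName TEXT);
    -- Hypothetical sample rows
    INSERT INTO SurNames VALUES (1, 'Jones');
    INSERT INTO SurNames VALUES (2, 'Smith');
    INSERT INTO SurNames VALUES (3, 'Brown');
    INSERT INTO FirstNames VALUES (1, 'John');
    INSERT INTO FirstNames VALUES (2, 'Mary');
    INSERT INTO FirstNames VALUES (3, 'William');
""")


def run(surnames=None, firstnames=None):
    sql, params = build_query(surnames, firstnames)
    return [row[0] for row in conn.execute(sql, params)]


both = run(["Jones", "Smith", "Armitage"], ["John", "William"])
surname_only = run(["Jones", "Smith"])
no_filter = run()
```

Only fixed SQL fragments and `?` placeholders are concatenated; the user's values are always passed separately as parameters, which is what gives the protection against injection.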