I am trying to collocate data based on the SQL given at this link: https://ignite.apache.org/features/collocatedprocessing.html .
I have created 2 caches, 'Country' and 'City', using the following SQL:
-- Cache Country
CREATE TABLE Country (
Code CHAR(3),
Name CHAR(52),
Continent CHAR(50),
Region CHAR(26),
SurfaceArea DECIMAL(10,2),
Population INT(11),
Capital INT(11),
PRIMARY KEY (Code)) WITH "template=partitioned, backups=1";
-- Cache City
CREATE TABLE City (
ID INT(11),
Name CHAR(35),
CountryCode CHAR(3),
District CHAR(20),
Population INT(11),
PRIMARY KEY (ID, CountryCode)
) WITH "template=partitioned, backups=1, affinityKey=CountryCode";
I have inserted some sample records, for example :
insert into Country values('RU','Rusia','Rusia','Rusia',0.0,00,0);
insert into Country values('IND','India','Asia','Asia',0.0,00,0);
insert into City values(101,'Mumbai','IND','NA',00);
insert into City values(102,'Moscow','RU','NA',00);
I have started 2 Ignite nodes (on different machines) to distribute the data across nodes. Then I looked for the records present on node 0 through Visor:
cache -scan -c=#c0 -id8=#n0
I can see that both cities, Mumbai and Moscow, are present on node 0 (n0) as well as on node 1. I was expecting the cities of India to be collocated on node 0 and the cities of Rusia on node 1, not both on the same node.
My questions are:
Am I doing anything wrong while collocating the data?
Is running the Visor cache -scan command the correct way to find collocated data on nodes?
If it is not, how can we find which data is collocated on node 0 and node 1?
Let's say data is collocated on node 0 (cities of India) and node 1 (cities of Rusia). What will happen if one of the nodes is disconnected from the cluster? Will there be data loss? After restarting the node, will the data be collocated again?
Thank you in Advance.
I have checked, and Visor does not run local scan queries, so a scan against any node will return all data present in the cluster.
So you can only check data collocation with code.
When you have backups, data will be failed over and there will be no data loss. When you add the node back, data will be rebalanced to it and the backups restored.
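By the way, once the data really is collocated, the benefit shows up in joins on the affinity key, which Ignite can then execute locally on each node. A minimal sketch against the tables from the question:
SELECT co.Name AS Country, ci.Name AS City, ci.Population
FROM City ci
JOIN Country co ON ci.CountryCode = co.Code;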
So I am an SQL noobie, and I would like to organize a structured database that has metadata and data in SQLite. I am not sure how to do this, and I have looked around at different internet sites but I haven't found anything helpful.
Basically what I would want is something like this (using different data collection stations as an example):
SQL TABLE:
Location
Lat
Long
Other Important info about the station
And then somehow, when I query this table and want to see info about a specific station's data, I would be able to pull up data that would look something like this:
datetime data
1/1/1980 11.6985
1/2/1980 43.6431
1/3/1980 54.9089
1/4/1980 63.1225
1/5/1980 72.4399
1/6/1980 79.1363
1/7/1980 82.2778
1/8/1980 86.0785
1/9/1980 86.8612
1/10/1980 84.3342
1/11/1980 80.4646
1/12/1980 77.1508
1/13/1980 74.827
1/14/1980 73.387
1/15/1980 72.1774
1/16/1980 71.6423
Since I don't know much about table hierarchy, I don't know how to do this, but I feel like it is probably possible. Any help would be appreciated!
using different data collection stations
Immediately indicates that a separate table for stations should be used and that the readings should relate to/associate with/reference the stations table.
For the stations table you could have something like :-
CREATE TABLE IF NOT EXISTS stations (id INTEGER PRIMARY KEY, station_name TEXT, station_latitude REAL, station_longitude REAL);
This will create a table (if it doesn't already exist) that has 4 columns :-
The first column id is a unique identifier that will be generated automatically and is what you would use to reference a specific station.
The second column, station_name, is for the name of the station and is of type TEXT.
The third and fourth columns are for the station's location according to lat and long.
You could add a couple of stations using :-
INSERT INTO stations (station_name, station_latitude, station_longitude) VALUES('Zebra', 100.7892, 60.789);
INSERT INTO stations (station_name, station_latitude, station_longitude) VALUES('Yankee', 200.2967, 95.234);
You could display/return these using :-
SELECT * FROM stations
that is, SELECT all columns (*) FROM the table called stations; the result would be the two stations just inserted.
Next you could create the readings table e.g. :-
CREATE TABLE IF NOT EXISTS readings(recorded_datetime INTEGER DEFAULT (datetime('now')), data_recorded REAL, station_reference INTEGER);
This will create a table named readings (if it doesn't already exist) with 3 columns :-
recorded_datetime, which is of type INTEGER (it can store an integer of up to 8 bytes, i.e. pretty large). This will be used to store a timestamp. Although perhaps not what you want, as an example a default value (the current datetime) will be used if no value is specified for this column.
data_recorded, a REAL, which holds the data itself.
station_reference, which refers to the station's id.
You could then insert a reading for the Zebra station using :-
INSERT INTO readings (data_recorded,station_reference) VALUES(11.6985,1);
As the recorded_datetime column has not been provided, the current datetime will be used.
If instead you ran :-
INSERT INTO readings VALUES(datetime('1980-01-01 10:40'),11.6985,1);
Then this reading would be for 1/1/1980 at 10:40 for station 1.
Using :-
INSERT INTO readings VALUES(datetime('1980-01-01 10:40'),13.6985,2);
INSERT INTO readings VALUES(datetime('1966-03-01 10:40'),15.6985,2);
INSERT INTO readings VALUES(datetime('2000-01-01 10:40'),11.6985,2);
will add some readings for the Yankee station (id 2).
Using SELECT station_reference, recorded_datetime, data_recorded FROM readings; will select all the columns, but with station_reference as the first column in the result.
The obvious progression is to display the data together with the respective station. For this a JOIN will be used: the readings table will be joined with the stations table, matching each reading's station_reference value to a station's id.
However, let's say we wanted the station info to be something like stationname (Long=???? - Lat=????), followed by the date/time and the data, sorted by station name and then by date/time. Then the following could be used :-
SELECT
stations.station_name ||
'(Long='||station_longitude||' - Lat='||station_latitude||')'
AS stationinfo,
readings.recorded_datetime,
readings.data_recorded
FROM readings
JOIN stations ON readings.station_reference = stations.id
ORDER BY stations.station_name ASC, readings.recorded_datetime
Note this is shown more as an example of the quite complex things you can do in SQL than with any expectation of fully understanding the coding.
This would list Yankee's readings first (in datetime order) and then Zebra's, each row prefixed with the stationinfo string.
You may argue (or some would): why can't I just have a single table with reading, datetime, station name, station latitude and station longitude?
Well you could. BUT :-
Say there were a directive to change station Zebra's name: you'd have to trawl through all the rows and change the name numerous times. Easy to code, but relatively costly in terms of resource usage. Using the two tables means that just a single update is required (a one-line sketch follows this list).
For each row you would duplicate data, very likely wasting disk space and thus increasing the resources needed to access the data. That is, 'Zebra' would take at least 5 bytes and the two REALs 8 bytes each, so that's 21 bytes. A reference to one occurrence of that repeated data takes at most 8 bytes for an integer (initially just a single byte). So there would be a cost of at least 13 extra bytes per reading.
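To illustrate the first point, renaming a station in the two-table design touches exactly one row, no matter how many readings reference it ('Zulu' is just a made-up new name):
UPDATE stations SET station_name = 'Zulu' WHERE id = 1; -- renames Zebra everywhere it is displayed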
Most articles about lookup tables deal with its creation, initial population and use (for looking up: id-->value).
My question is about dynamic updating (inserting new values) of the lookup table, as new data is stored in data tables.
For example, we have a table of persons, and one attribute (column) of it is city of residency. Many persons would have the same value, so it makes sense to use a lookup table for it. As the list of cities that would appear is not known beforehand, the lookup table is initially empty.
To clarify, the values of city are:
not known beforehand (we don't know which customer might contact us tomorrow)
there is no "list of all possible cities" (real life cities come and go, get renamed etc)
many persons will share the same value
initially, there will be only a few different values (up to 10), later more (but not very many, a few hundred)
Also, the expected number of person objects will be thousands if not millions.
So the basic algorithm is (pseudocode):
procedure insertPerson(name,age,city)
{
cityId := lookup(city);
if cityId == null
cityId := insertIntoLookupTableAndReturnId(city);
INSERT INTO person_table VALUES (name,age,cityId);
}
What is a good lookup table organization for this problem? What exact code should be used?
The goal is high performance of person insertion (whether the city is already in the lookup table or not).
General answers are welcome, and Oracle 11g specifics would be great.
Note: This is about an OLTP scenario. New persons are inserted in real time. There is no known list of persons that can be used for initialization of the lookup table.
Your basic approach appears to be OK, except for one small change I would make: the function lookup(city) should search for the city and return its ID, and, if the city is not found, insert a new record and return the new ID. This way you further encapsulate the management of the lookup table (cities). Your code would then become:
procedure insertPerson(name,age,city)
{
INSERT INTO person_table VALUES (name,age,lookup(city));
}
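For illustration, a minimal PL/SQL sketch of such a lookup function, assuming a lookup table cities(id, city_name) with a unique index on city_name and a sequence city_seq (all of these names are made up for the example):
create or replace function lookup(p_city in varchar2) return number is
  v_id cities.id%type;
begin
  select id into v_id from cities where city_name = p_city;
  return v_id;
exception
  when no_data_found then
    begin
      insert into cities (id, city_name)
      values (city_seq.nextval, p_city)
      returning id into v_id;
      return v_id;
    exception
      -- another session may insert the same city concurrently; the unique
      -- index then raises dup_val_on_index, so simply re-read the row
      when dup_val_on_index then
        select id into v_id from cities where city_name = p_city;
        return v_id;
    end;
end;
/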
One additional thing you may consider is to create a VIEW that would be used to query for persons' information, including the name of the city.
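A sketch of such a view, using the same made-up names (and assuming person_table stores the lookup key in a city_id column):
create or replace view person_v as
select p.name, p.age, c.city_name as city
from person_table p
join cities c on c.id = p.city_id;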
After some testing, the best performance (fewest block accesses) I could find was with an index-organized table as the lookup table and the SQL below for inserting data.
create table citylookup (key number primary key, city varchar2(100)) organization index;
create unique index cltx1 on citylookup(city);
create sequence lookupkeys;
create sequence datakeys;
create table data (x number primary key, k number references citylookup(key) not null);
-- "Rome" is the city we try to insert
insert all
when oldkey is null then -- if the city is not in the lookup yet
into citylookup values (lookupkeys.nextval, 'Rome') -- then insert it
-- finally, insert the data row with the correct lookup key
when 1=1 then into data values (datakeys.nextval,nvl(oldkey, lookupkeys.nextval))
select (select key from citylookup where city='Rome') as oldkey from dual;
Result: 6+2 blocks for the city-exists case, 10+2 for the city-doesn't-exist-yet case (as reported by SQL*Plus with set autotrace on: the first value is db block gets, the second consistent gets).
Alternatively, as suggested by Dudu Markovitz, the lookup table could be cached in the application, and in the hit case just perform a simple INSERT into the DATA table, which then costs only 6+1 block accesses (for the above test case). Here the problem is keeping the cached lookup table in sync with the database and possibly other instances of the server application.
PS: The above INSERT ALL command "wastes" a sequence value from the lookupkeys sequence on each run, even if no new city is inserted into the lookup table. It is an additional exercise to solve that.
My problem is the following:
How should I represent in a relational model :
An HQ has 0 or more (0,N) companies, and each company depends on 1 and only 1 HQ.
Knowing that an HQ has many fields in common with companies:
A) Should I create 2 tables, one called HQ and another called company?
B) Should it be a recursive relationship on the same table?
C) Is there another way to represent this relation?
Using the same table with a parent field works very well on its own if the HQ has all the same fields as the rest. However, if there are attributes of an HQ that are not shared by a company, as you say, then you'll also need a separate table for the HQ-specific data. So yes, 2 tables. But take jbarker's idea as a starting point, then add an HQ table with a companyID foreign key. An HQ record will have the companyID of the company that is an HQ, which, as he says, will have a NULL parent.
As for your question about recursivity, you'll have recursive relationships or "self joins" for the company data, and not for the HQ-specific data.
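A rough sketch of that design (all names are illustrative):
CREATE TABLE company (
    companyID INTEGER PRIMARY KEY,
    parentID  INTEGER REFERENCES company(companyID), -- NULL when this company is itself the HQ
    name      VARCHAR(100)
    -- ...fields shared by HQs and companies...
);
CREATE TABLE hq (
    companyID INTEGER PRIMARY KEY REFERENCES company(companyID)
    -- ...HQ-specific fields only...
);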
Hi, I just want to sort these fields into tables, but I'm having trouble.
I have details of cities.
These are examples of the main city details:
CityName CityID
Colombo 001
Kandy 002
Kandy is directly connected to these cities:
CityName CityID DistancefromKandy
Katugastota 006 1km
Peradeniya 008 2km
I want to store the distance between every pair of cities, for example Katugastota to Peradeniya, Colombo to Katugastota, Colombo to Kandy and Kandy to Peradeniya.
And for a single city I want to store which cities are directly connected to it and the distances to those cities.
How do I organize this data into tables? Any ideas? I have given the table structure I tried, but I can't fit into it the distance between each pair of cities, nor the directly connected cities and the distances to them.
Appreciate any help with this.
I don't just need the SQL; if someone can suggest a better table design, that would be a great help.
Like #EvilEpidemic suggested above, my first choice would be to store coordinates (latitude & longitude) for each city and calculate the distances between them.
That said, if you need to store your pre-calculated distances for specific pairs of cities, then you may want to try the following:
Add a table that includes two (2) CityID columns (for example, SourceCityId and DestinationCityId) as well as a NOT NULL distance column (of a numeric data type).
For example, in SQL Server you might have a table like this (an oversimplified example that assumes you store distances as int kilometers, but feel free to change the data type as needed):
CREATE TABLE Distances (
    [SourceCityId] int NOT NULL,
    [DestinationCityId] int NOT NULL,
    [DistanceInKilometers] int NOT NULL,
    CONSTRAINT [PK_Distances] PRIMARY KEY CLUSTERED (
        [SourceCityId] ASC,
        [DestinationCityId] ASC
    )
)
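For example, to record that Kandy is 1 km from Katugastota and then list everything directly connected to Kandy (the ids here are made up, since the CityIDs above would need to be numeric for this table):
INSERT INTO Distances (SourceCityId, DestinationCityId, DistanceInKilometers)
VALUES (2, 6, 1); -- Kandy (2) to Katugastota (6), 1 km
SELECT DestinationCityId, DistanceInKilometers
FROM Distances
WHERE SourceCityId = 2; -- all cities directly connected to Kandy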
Two customers are going to merge. They are both using my application, each with their own database. In a few weeks they will merge (they become one organisation), so they want to have all the data in 1 database.
The two database structures are identical. The problem is with the data. For example, I have the tables Locations and Persons (these are just two tables out of 50):
Database 1:
Locations:
Id Name Adress etc....
1 Location 1
2 Location 2
Persons:
Id LocationId Name etc...
1 1 Alex
2 1 Peter
3 2 Lisa
Database 2:
Locations:
Id Name Adress etc....
1 Location A
2 Location B
Persons:
Id LocationId Name etc...
1 1 Mark
2 2 Ashley
3 1 Ben
We see that a person is related to a location (column LocationId). Note that I have more tables that refer to the locations and persons tables.
The databases contain their own locations and persons, but the Ids can be the same. In this case, when I want to import everything into DB2, the locations of DB1 should be inserted into DB2 with the Ids 3 and 4. The persons from DB1 should then get the new Ids 4, 5 and 6, and the LocationId values in the persons table have to be remapped to the new location Ids 3 and 4.
My solution for this problem is to write a query which handles everything, but I don't know where to begin.
What is the best way (in a query) to renumber the Id fields while also cascading the change to the child rows? The databases do not contain referential integrity and foreign keys (foreign keys are NOT defined in the database). Creating FKeys and cascading is not an option.
I'm using SQL Server 2005.
You say that both customers are using your application, so I assume it's some kind of "shrink-wrap" software that is used by more customers than just these two, correct?
If yes, adding special columns to the tables or anything like that will probably cause pain in the future: you would either have to maintain a special version for these two customers that can deal with the additional columns, or you would have to introduce these columns into your main codebase, which means that all your other customers would get them as well.
I can think of an easier way to do this without changing any of your tables or adding any columns.
In order for this to work, you need to find out the largest ID that exists in both databases together (no matter in which table or in which database it is).
This may require some copy & paste to produce a set of queries like these:
select max(id) as maxlocationid from locations
select max(id) as maxpersonid from persons
-- and so on... (one query for each table)
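If you prefer, you can get the overall maximum per database in one statement; just extend the UNION ALL with one SELECT per table:
select max(maxid) as overallmax
from (
    select max(id) as maxid from locations
    union all
    select max(id) from persons
    -- and so on, one SELECT per table
) as m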
When you have found the largest ID after running the queries in both databases, take a number that's larger than that ID and add it to all IDs in all tables in the second database.
It's very important that this number is larger than the largest ID that already exists in either database!
It's a bit difficult to explain, so here's an example:
Let's say that the largest ID in any table in both databases is 8000.
Then you run some SQL that adds 10000 to every ID in every table in the second database:
update Locations set Id = Id + 10000
update Persons set Id = Id + 10000, LocationId = LocationId + 10000
-- and so on, for each table
The queries are relatively simple, but this is the most work, because you have to build a query like this manually for each table in the database, with the correct names of all the ID columns.
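If typing them all out is too tedious, you can let SQL Server's catalog views generate the script for you. A rough sketch (it assumes every ID column's name ends in 'Id' and that none of them are IDENTITY columns, since IDENTITY columns cannot be updated; review the generated statements before running them):
select 'update ' + t.name + ' set ' + c.name + ' = ' + c.name + ' + 10000'
from sys.tables t
join sys.columns c on c.object_id = t.object_id
where c.name like '%Id'
order by t.name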
After running the query on the second database, the example data from your question will look like this:
Database 1: (exactly like before)
Locations:
Id Name Adress etc....
1 Location 1
2 Location 2
Persons:
Id LocationId Name etc...
1 1 Alex
2 1 Peter
3 2 Lisa
Database 2:
Locations:
Id Name Adress etc....
10001 Location A
10002 Location B
Persons:
Id LocationId Name etc...
10001 10001 Mark
10002 10002 Ashley
10003 10001 Ben
And that's it! Now you can import the data from one database into the other, without getting any primary key violations at all.
If this were my problem, I would probably add some columns to the tables in the database I was going to keep. These would be used to store the PK values from the other DB. Then I would insert the records from the other tables; for the ones with foreign keys I would use a known value. Then I would update as required and drop the columns I added.
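A rough sketch of that approach (all names are illustrative, and it assumes Id is an IDENTITY column in the target database, so new keys are generated on insert):
-- remember each imported row's old primary key
ALTER TABLE Locations ADD SourceId int NULL;
-- copy locations from the other database, keeping the old ids
INSERT INTO Locations (Name, SourceId)
SELECT Name, Id FROM OtherDb.dbo.Locations;
-- copy persons, translating LocationId through the mapping
INSERT INTO Persons (Name, LocationId)
SELECT p.Name, l.Id
FROM OtherDb.dbo.Persons p
JOIN Locations l ON l.SourceId = p.LocationId;
-- once everything is migrated, drop the helper column
ALTER TABLE Locations DROP COLUMN SourceId;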