Unable to add relationship in featuretools entity set

I'm new to Featuretools and am getting this error while creating an EntitySet:
Unable to add relationship because child column 'order_id' in 'orders' is also its index
I suspect that Featuretools expects a one-to-many relationship. Is there a way to specify a one-to-one relationship?

Yes, Featuretools generally expects a one-to-many relationship between tables in an EntitySet, which is why the child column cannot be the index of its table.
There is no way to override this when creating the relationship, but you can use a different index column in the child dataframe, which lets order_id be the child column of the relationship.
You could create a new index column in prejoin_foodorder by setting make_index=True and passing, as index, a column name that is not already in the DataFrame when adding the table to the EntitySet. This creates a new integer column in the DataFrame that ranges from 0 to one less than the number of rows. That column is then used as the DataFrame's index, leaving order_id free to be the child column of a relationship.
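import featuretools as ft

es = ft.EntitySet()
# ... add any other dataframes to the EntitySet ...
es.add_dataframe(dataframe_name='prejoin_foodorder',
                 dataframe=prejoin_foodorder,  # the DataFrame being added
                 index='new_index',            # a name not already in the DataFrame
                 make_index=True)              # ... plus any other arguments ...
es.add_relationship(parent_dataframe_name='orders',
                    parent_column_name='id',
                    child_dataframe_name='prejoin_foodorder',
                    child_column_name='order_id')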
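To make the idea concrete without Featuretools itself, here is a minimal plain-Python sketch of what a generated index amounts to, based only on the description above (a new integer column counting up from 0). The rows and the helper add_surrogate_index are hypothetical stand-ins, not part of the Featuretools API.

```python
# Hypothetical rows standing in for the prejoin_foodorder DataFrame.
rows = [
    {"order_id": 10, "item": "soup"},
    {"order_id": 10, "item": "bread"},
    {"order_id": 11, "item": "salad"},
]


def add_surrogate_index(rows, index_name):
    """Return copies of the rows with a new 0..n-1 integer column added."""
    if any(index_name in row for row in rows):
        raise ValueError(f"column {index_name!r} already exists")
    return [{index_name: i, **row} for i, row in enumerate(rows)]


indexed = add_surrogate_index(rows, "new_index")

# new_index is unique per row, so it can serve as the table's index...
index_values = [row["new_index"] for row in indexed]
# ...while order_id is free to repeat, as the "many" side of a relationship does.
order_ids = [row["order_id"] for row in indexed]
```

Once the index is a separate column, order_id no longer has to be unique, which is what the one-to-many relationship needs.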

Related

Pandas and SQLAlchemy: renaming columns during join

I have table A and table B. Both have a column id and a column name.
When I use pd.read_sql() to convert the result of a SQLAlchemy query to a pandas DataFrame, the resulting DataFrame has two columns named id and two columns named name.
The join is executed on the id column, so even though there are two id columns there is no ambiguity: both contain the same values, and I can simply drop one of them.
The two columns named name are a problem because they are not identical: column name of table A holds the name of an entity A, while column name of table B holds the name of an entity B. At this point I can't tell for sure which of the two DataFrame columns comes from table A and which from table B. Is there any way to solve this, for instance by adding a prefix to the column names? More generally, is there any way to keep using the convenient pd.read_sql() in this situation?
my_dataframe = pd.read_sql(
    session.query(TableA, TableB)
    .join(TableB)
    .statement,
    session.bind)
Note: in this question I am trying to simplify the structure of a more complex preexisting Postgres database. Therefore, it won't be possible to alter the structure of the database.
The solution was actually really simple, but you have to label each field individually:
my_dataframe = pd.read_sql(
    session.query(TableA.field1.label('my_new_name1'),
                  TableA.field2.label('my_new_name2'),
                  TableB.field1.label('my_other_name2'))
    .join(TableB)
    .statement,
    session.bind)
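For what it's worth, .label() ends up as a column alias (AS ...) in the generated SELECT, and that alias is what pandas sees as the column name. A minimal stdlib-only sketch of the same idea, assuming two hypothetical tables table_a and table_b that each have id and name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE table_b (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO table_a VALUES (1, 'entity A');
    INSERT INTO table_b VALUES (1, 'entity B');
""")

# Alias each selected column so the result set has unambiguous names.
cursor = conn.execute("""
    SELECT a.id   AS id,
           a.name AS a_name,
           b.name AS b_name
    FROM table_a AS a
    JOIN table_b AS b ON a.id = b.id
""")
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()
conn.close()
```

The aliases a_name and b_name are made up for the example; any names work as long as they are distinct.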

Extending table with another table ... sort of

I have a DB about renting cars.
I created a CarModels table (ModelID as PK).
I want to create a second table with the same primary key as CarModels has.
This table only contains the number of times each model was searched on my website.
So, let's say you visit my website: you can check a list of the most commonly rented cars,
a "Most popular Cars" table.
It's not about a one-to-one relationship, that's for sure.
Is there any SQL code to connect two primary keys together?
select m.ModelID, m.Field1, m.Field2,
       t.TimesSearched
from CarModels m
left outer join Table2 t on m.ModelID = t.ModelID
But why not simply add the field TimesSearched to the CarModels table?
Then you don't need another table.
Easiest is to just use a new primary key on the new table with a foreign key to the CarModels table, like [CarModelID] INT NOT NULL. You can put an index and a unique constraint on the FK.
If you reeeealy want them to be the same, you can jump through a bunch of hoops that will make your life Hell, like creating the table from the CarModels table, then setting that field as the primary key, then whenever you add a new CarModel you'll have to create a trigger that will SET IDENTITY_INSERT ON so you can add the new one, and remember to SET IDENTITY_INSERT OFF when you're done.
Personally, I'd create a CarsSearched table that holds ThisUser selected ThisCarModel on ThisDate: then you can start doing some fun data analysis like [are some cars more popular in certain zip codes or certain times of year?], or [this company rents three cars every year in March, so I'll send them a coupon in January].
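A rough sketch of that CarsSearched idea, using Python's built-in sqlite3 so it can be run as-is. The column names SearchID, UserID, SearchedOn and Name are assumptions made up for the example; the search count is then derived with a query instead of being stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE CarModels (ModelID INTEGER PRIMARY KEY, Name TEXT);
    -- One row per search: its own primary key plus a foreign key to CarModels.
    CREATE TABLE CarsSearched (
        SearchID   INTEGER PRIMARY KEY,
        ModelID    INTEGER NOT NULL REFERENCES CarModels (ModelID),
        UserID     INTEGER NOT NULL,
        SearchedOn TEXT NOT NULL
    );
    INSERT INTO CarModels VALUES (1, 'Compact'), (2, 'SUV'), (3, 'Van');
    INSERT INTO CarsSearched (ModelID, UserID, SearchedOn) VALUES
        (1, 100, '2024-03-01'),
        (2, 100, '2024-03-02'),
        (2, 101, '2024-03-02');
""")

# "Most popular cars": count searches per model, keeping models never searched.
popular = conn.execute("""
    SELECT m.ModelID, m.Name, COUNT(s.SearchID) AS TimesSearched
    FROM CarModels AS m
    LEFT OUTER JOIN CarsSearched AS s ON s.ModelID = m.ModelID
    GROUP BY m.ModelID, m.Name
    ORDER BY TimesSearched DESC, m.ModelID
""").fetchall()
conn.close()
```

Because every search is its own row, the same table can later be grouped by date or user for the kind of analysis mentioned above.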
You are not extending anything (modifying the actual model of the table). You simply need an INNER JOIN of the two tables on their primary keys being equal.
It could be an outer join, as has been suggested, but if it's 1:1 like you said (the second table will have exactly the same keys, all of them I assume), an inner join will be enough, as both tables would have the same set of primary keys.
As a bonus, it will return fewer rows if some keys are unmatched, which is a nice reminder that you failed to match all PKs.
That being said, do you have a strong reason not to keep the said number in the same table? You are basically modeling a 1:1 relationship for one extra column (and a small one too, by data type).
You could extend the table (now this is extending the table's model) with an additional integer attribute that keeps that number for you.
The latter is preferred for simplicity and lower query times.
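A small sqlite3 sketch of both points, with made-up sample data: how an inner join silently drops a model that has no row in the second table (Table2, as named in the earlier query), and the simpler alternative of keeping the counter as a column on CarModels. The Name column and the values are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CarModels (ModelID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Table2 (ModelID INTEGER PRIMARY KEY, TimesSearched INTEGER);
    INSERT INTO CarModels VALUES (1, 'Compact'), (2, 'SUV'), (3, 'Van');
    INSERT INTO Table2 VALUES (1, 5), (2, 9);  -- deliberately no row for ModelID 3
""")

join_sql = """
    SELECT m.ModelID, t.TimesSearched
    FROM CarModels AS m
    {kind} JOIN Table2 AS t ON m.ModelID = t.ModelID
    ORDER BY m.ModelID
"""
inner_rows = conn.execute(join_sql.format(kind="INNER")).fetchall()
left_rows = conn.execute(join_sql.format(kind="LEFT OUTER")).fetchall()

# Alternative: store the counter directly on CarModels, no second table needed.
conn.execute(
    "ALTER TABLE CarModels ADD COLUMN TimesSearched INTEGER NOT NULL DEFAULT 0"
)
conn.execute(
    "UPDATE CarModels SET TimesSearched = TimesSearched + 1 WHERE ModelID = 3"
)
same_table_rows = conn.execute(
    "SELECT ModelID, TimesSearched FROM CarModels ORDER BY ModelID"
).fetchall()
conn.close()
```

The inner join returns two rows and the left join three, so a missing key shows up as a missing row in one case and as a NULL in the other.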

Unique values for two columns in Doctrine

I have an entity called Dimension. It has three attributes: ID, width and height.
ID is the primary key. In the table, each dimension should be unique, so there must be only one record with a given dimension (for example 40x30). What constraints do I need to set?
Is uniqueConstraints={@UniqueConstraint(name="dimension", columns={"width", "height"})} correct?
From the documentation,
@UniqueConstraint annotation is used inside the @Table annotation on
the entity-class level. It allows to hint the SchemaTool to generate a
database unique constraint on the specified table columns. It only has
meaning in the SchemaTool schema generation context.
Required attributes:
name: Name of the Index
columns: Array of columns.
The answer is then YES:
/**
 * @Entity
 * @Table(name="xxx", uniqueConstraints={@UniqueConstraint(name="dimension", columns={"width", "height"})})
 */
class Dimension
should then do the job.
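To see what such a two-column constraint does once it exists in the database, here is a minimal sketch using Python's sqlite3 rather than Doctrine. The table name dimension and the constraint name uq_dimension are assumptions for the example; the DDL is hand-written, not SchemaTool output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dimension (
        id     INTEGER PRIMARY KEY,
        width  INTEGER NOT NULL,
        height INTEGER NOT NULL,
        CONSTRAINT uq_dimension UNIQUE (width, height)
    )
""")

conn.execute("INSERT INTO dimension (width, height) VALUES (40, 30)")
# A different (width, height) pair is fine, even with the same two numbers swapped.
conn.execute("INSERT INTO dimension (width, height) VALUES (30, 40)")

try:
    conn.execute("INSERT INTO dimension (width, height) VALUES (40, 30)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    # The pair (40, 30) already exists, so the database refuses the insert.
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM dimension").fetchone()[0]
conn.close()
```

Note that the constraint is on the ordered pair, so 40x30 and 30x40 count as different dimensions.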

Is it OK to have 2 FKs on a table, which point to different tables, and one of which will only be used?

Let's assume I have 2 tables:
Order -< OrderItem
I have another table with 2 FKs:
Feature
- Id
- FkOrderId
- FkOrderItemId
- Text
UPDATE
This table is linked to another called FeatureReason which is common to both types of record, be they OrderFeatures or OrderItem features.
Feature -< FeatureReason
If I had 2 feature tables to account for both types of records, would this then require 2 FeatureReason tables? It is the same issue here: the FeatureReason table would need 2 FKs, each pointing to a different master table.
An Order can have a Feature record, as can an OrderItem. Therefore either FkOrderId or FkOrderItemId would be populated. Is this fine to do?
I would also seriously think about using views to insert/edit and read either OrderFeatures or OrderItemFeatures.
Thoughts appreciated.
I would recommend using the following structure, because if you have 2 foreign keys and either of them can be null, you can end up with rows where both columns are null or both have a value.
Added the FeatureReason table too
You can do this, but why? What is your reasoning for collating these two distinct items in a single table?
I would suggest having two separate tables, OrderFeatures and OrderItemFeatures, and on those occasions that you need to query both, collate them with a union query.
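A minimal sketch of that two-table layout and the union query, run through Python's sqlite3. The table names come from the suggestion above; the sample rows and the Source and ParentId aliases are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE OrderFeatures (
        Id INTEGER PRIMARY KEY, FkOrderId INTEGER NOT NULL, Text TEXT
    );
    CREATE TABLE OrderItemFeatures (
        Id INTEGER PRIMARY KEY, FkOrderItemId INTEGER NOT NULL, Text TEXT
    );
    INSERT INTO OrderFeatures VALUES (1, 10, 'gift wrap');
    INSERT INTO OrderItemFeatures VALUES (1, 501, 'engraving');
""")

# Collate both kinds of feature, tagging each row with where it came from.
all_features = conn.execute("""
    SELECT 'Order' AS Source, FkOrderId AS ParentId, Text FROM OrderFeatures
    UNION ALL
    SELECT 'OrderItem', FkOrderItemId, Text FROM OrderItemFeatures
    ORDER BY Source
""").fetchall()
conn.close()
```

Each table can then declare its single foreign key as NOT NULL, at the cost of needing the union whenever both kinds are read together.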
It is possible to have 2 foreign keys in one table. As long as the foreign key is mapping with the primary key on another table, it's OK
By not populating FkOrderItemId or FkOrderId, will you not be violating one or other of the FK constraints?
You can populate FkOrderItemId or FkOrderId according to your needs, I'm just not sure about defining an FK where it is not mandatory to supply a FK value.
Just a thought...
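On the question of whether an empty FK violates the constraint: a NULL foreign key value is not checked against the parent table, so leaving one of the two columns NULL is allowed. The sketch below shows this with Python's sqlite3. The CHECK constraint is an assumed addition (not from the question) that guards against both columns being NULL or both being set, and the helper try_insert is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE "Order" (Id INTEGER PRIMARY KEY);
    CREATE TABLE OrderItem (
        Id INTEGER PRIMARY KEY,
        FkOrderId INTEGER NOT NULL REFERENCES "Order" (Id)
    );
    CREATE TABLE Feature (
        Id            INTEGER PRIMARY KEY,
        FkOrderId     INTEGER REFERENCES "Order" (Id),
        FkOrderItemId INTEGER REFERENCES OrderItem (Id),
        Text          TEXT,
        -- Assumed addition: exactly one of the two keys must be set.
        CHECK ((FkOrderId IS NULL) <> (FkOrderItemId IS NULL))
    );
    INSERT INTO "Order" VALUES (1);
    INSERT INTO OrderItem VALUES (7, 1);
""")


def try_insert(order_id, order_item_id):
    """Insert a Feature row; return True if the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO Feature (FkOrderId, FkOrderItemId, Text) "
                "VALUES (?, ?, ?)",
                (order_id, order_item_id, "note"),
            )
        return True
    except sqlite3.IntegrityError:
        return False


order_only = try_insert(1, None)    # NULL FkOrderItemId does not violate its FK
item_only = try_insert(None, 7)     # NULL FkOrderId does not violate its FK
dangling = try_insert(99, None)     # there is no Order with Id 99
both_null = try_insert(None, None)  # blocked by the CHECK
both_set = try_insert(1, 7)         # blocked by the CHECK
conn.close()
```

Without the CHECK, the last two inserts would go through, which is the concern raised in one of the answers above.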

Problems creating a Parent Child reference in SSAS

In my data warehouse, I have the following dimension, for which I want to create a Parent Child hierarchy. My problem is this: the primary key is OfficerPeopleID, which is neither the parent nor the child. The parent is MgrPeopleID, and the child is PeopleID.
If I change the default key to PeopleID when creating the dimension, it appears as if it will work, but then I receive errors while processing because it sees multiple copies of PeopleID. There are multiples because the dimension is SCD type 2 and the primary key (OfficerPeopleID) is the surrogate key for the table. I can't be the only one who has tried creating a parent-child reference on fields other than the primary key.
Thank you!
I don't think you would want to do that. If I understand you correctly, PeopleID is your natural key (your source-system key) and OfficerPeopleID is your DW surrogate key. In this case you need a column that stores the parent's surrogate key, not the parent's natural key. In other words, you should be able to create a foreign key from the table to itself. Based on what you have right now, you could have more than one record for the manager, which would make it ambiguous which record is the correct one. Also, for the parent-child hierarchy to work, the child has to be the key of the table.
If you want to do it properly, you should populate MgrOfficerPeopleID (a new column) in your ETL process. If you do that, make sure you update the manager key value whenever SCD2 creates a new row. However, if you still wish to do it as a named query in the SSAS DSV, you can do something like this:
SELECT
    OfficerPeopleID,
    -- ... insert other columns here
    PeopleID,
    MgrPeopleID,
    (SELECT m.OfficerPeopleID
     FROM dbo.Employee AS m
     WHERE (e.MgrPeopleID = m.PeopleID) AND (m.IsCurrent = 1)) AS MgrOfficerPeopleID
FROM dbo.OfficerPeopleDim AS e
WHERE IsCurrent = 1 -- this is your SCD2 flag. you could also use two date range columns
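Here is a runnable sketch of that lookup using Python's sqlite3, to show how the current manager row's surrogate key gets resolved. It assumes the inner lookup reads the same dimension table (OfficerPeopleDim) and uses three made-up rows: two versions of a manager and one employee.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE OfficerPeopleDim (
        OfficerPeopleID INTEGER PRIMARY KEY,  -- surrogate key
        PeopleID        INTEGER NOT NULL,     -- natural key, repeats under SCD2
        MgrPeopleID     INTEGER,              -- manager's natural key
        IsCurrent       INTEGER NOT NULL      -- SCD2 current-row flag
    );
    INSERT INTO OfficerPeopleDim VALUES
        (1, 100, NULL, 0),  -- manager, expired version
        (2, 100, NULL, 1),  -- manager, current version
        (3, 200, 100,  1);  -- employee reporting to PeopleID 100
""")

# For each current row, look up the surrogate key of the manager's current row.
rows = conn.execute("""
    SELECT e.OfficerPeopleID,
           e.PeopleID,
           e.MgrPeopleID,
           (SELECT m.OfficerPeopleID
            FROM OfficerPeopleDim AS m
            WHERE m.PeopleID = e.MgrPeopleID AND m.IsCurrent = 1)
               AS MgrOfficerPeopleID
    FROM OfficerPeopleDim AS e
    WHERE e.IsCurrent = 1
    ORDER BY e.OfficerPeopleID
""").fetchall()
conn.close()
```

The employee row resolves to surrogate key 2 (the manager's current version), and the manager's own row gets NULL because it has no parent.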
You can't do that if PeopleID contains duplicate records: either make it unique or create the relationship using both fields.
I also advise you to create two separate entries in the DSV, one for managers and another for employees, with queries like this:
Manager:
select PeopleID as ManagerID, name as Name from OfficerPeopleDim
Employee:
select PeopleID as EmployeeID, name, MgrPeopleId as Manager
from OfficerPeopleDim
where MgrPeopleId is not null
So it will look like this (left) and produce the result on the right: