ScaNN weight features for similarity search - cosine-similarity

I am using ScaNN to perform similarity searches and would like to place more emphasis on some features than on others.
For example, if I have the following data
name   | age | country | income
John   | 29  | US      | $47k
Susan  | 28  | US      | $44k
Bill   | 26  | US      | $39k
Sarah  | 35  | UK      | $100k
Jack   | 34  | UK      | $90k
Maggie | 37  | UK      | $95k
and income has more importance, then given the following query:
George, 28, US, $100k
it would return
Sarah, Jack, Maggie
adding more weight to the income feature.
Training data values are normalized before building the similarity index
df_np = preprocessing.normalize(df[features])
and likewise the query values are normalized before performing a search
np_q = preprocessing.normalize([list(query.values())])
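One common way to get this effect, independent of ScaNN itself, is to scale each feature column by a weight before the rows are normalized to unit length. `preprocessing.normalize` rescales each row to unit L2 norm by default, so any weighting has to happen before that step. The sketch below is plain Python and does not call ScaNN. It assumes country is encoded as US=0, UK=1, that features are standardized per column first so that income does not dominate purely because of its units, and that the weight values are hypothetical. With cosine similarity, a weight w on a feature scales its contribution to the dot product by w squared.

```python
import math
from statistics import mean, pstdev

# Rows from the question; country is encoded as US=0, UK=1 (an assumption).
people = {
    "John":   [29, 0, 47],
    "Susan":  [28, 0, 44],
    "Bill":   [26, 0, 39],
    "Sarah":  [35, 1, 100],
    "Jack":   [34, 1, 90],
    "Maggie": [37, 1, 95],
}
query = [28, 0, 100]        # George: 28, US, $100k
weights = [1.0, 1.0, 5.0]   # hypothetical weights for age, country, income

# Per-feature statistics, computed on the training rows only.
columns = list(zip(*people.values()))
means = [mean(col) for col in columns]
stds = [pstdev(col) for col in columns]

def prepare(row):
    """Standardize each feature, apply its weight, then L2-normalize the row."""
    scaled = [w * (x - m) / s for x, m, s, w in zip(row, means, stds, weights)]
    norm = math.sqrt(sum(v * v for v in scaled))
    return [v / norm for v in scaled]

def cosine(a, b):
    """Dot product of two unit vectors, which equals their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

q = prepare(query)
ranked = sorted(people, key=lambda name: cosine(prepare(people[name]), q), reverse=True)
top3 = ranked[:3]
print(top3)
```

In a real pipeline the vectors returned by `prepare` would be what gets passed to the index builder and to the search call, in place of the output of `preprocessing.normalize` alone.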

Related

Change column values to column headers in Postgres SQL replacing with values from another column

I am unable to figure out how to turn the values of one column into column headers and fill them with the corresponding values from another column.
Say I have a Postgres database with the following table:
Name     Subject     Score   Region
======   =========   =====   =======
Joe      Chemistry   20      America
Robert   Math        30      Europe
Jason    Physics     50      Europe
Joe      Math        70      America
Robert   Physics     80      Europe
Jason    Math        40      Europe
Jason    Chemistry   60      Europe
I want to select/fetch data in the following form:
Name     Chemistry   Math   Physics   Region
======   =========   ====   =======   =======
Joe      20          70     null      America
Robert   null        30     80        Europe
Jason    60          40     50        Europe
Consider that there are 80 subjects. How do I write an SQL select statement that returns data in this format?
In Postgres, I recommend using the FILTER syntax for conditional aggregation:
SELECT name,
       MAX(score) FILTER (WHERE subject = 'Chemistry') AS Chemistry,
       MAX(score) FILTER (WHERE subject = 'Math') AS Math,
       MAX(score) FILTER (WHERE subject = 'Physics') AS Physics,
       region
FROM grades
GROUP BY name, region;
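With 80 subjects, writing one FILTER line per subject by hand gets tedious, so the statement can be generated instead. The following is a minimal sketch in Python that only builds the SQL text. The helper name `build_pivot_query` is hypothetical, the table and column names are taken from the answer above, and `region` is included because the desired output lists it. In practice the subject list would first be fetched with something like `SELECT DISTINCT subject FROM grades`.

```python
def build_pivot_query(subjects, table="grades"):
    """Return a SELECT with one FILTER-aggregated column per subject."""
    columns = []
    for subject in subjects:
        literal = subject.replace("'", "''")   # escape for a SQL string literal
        alias = subject.replace('"', '""')     # escape for a quoted identifier
        columns.append(
            f"       MAX(score) FILTER (WHERE subject = '{literal}') AS \"{alias}\""
        )
    return (
        "SELECT name,\n"
        + ",\n".join(columns)
        + ",\n       region\n"
        + f"FROM {table}\n"
        + "GROUP BY name, region"
    )

sql = build_pivot_query(["Chemistry", "Math", "Physics"])
print(sql)
```

The generated text can then be sent to Postgres with whatever client library is in use. The escaping here is deliberately simple and only meant for subject names that come from the table itself.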

Loop through a table based on multiple conditions

Students table
student_id  student_name
1           John
2           Mary
Grades table
student_id  year  grade_level  school      Course    Mark
1           2015  10           Smith High  Algebra   95
1           2015  10           Smith High  English   96
1           2016  11           Smith High  Geometry  85
1           2016  11           Smith High  Science   88
2           2015  10           Smith High  Algebra   98
2           2015  10           Smith High  English   93
2           2016  11           Smith High  Geometry  97
2           2016  11           Smith High  Science   86
I'm trying to show, for each year, which classes a student took and the mark for each.
So the final output I'm looking for is something like:
[student_id1] [year1] [grade1] [school1]
[course1] [mark1]
[course2] [mark2]
[course3] [mark3]...
[student_id1] [year2] [grade2] [school1]
[course1] [mark1]
[course2] [mark2]
[course3] [mark3]...
[student_id2] [year1] [grade1] [school1]
[course1] [mark1]
[course2] [mark2]
[course3] [mark3]...
This would all go in one column/row. So in this particular example, this would be my result:
1 2015 10 Smith High
Algebra 95
English 96
1 2016 11 Smith High
Geometry 85
Science 88
2 2015 10 Smith High
Algebra 98
English 93
2 2016 11 Smith High
Geometry 97
Science 86
So anytime a student id, year, grade, or school name changes, I would have a line for that and loop through the classes taken within that group. And all of this would be in one column/row.
This is what I have so far but I'm not sure how I can properly loop through course and grades for each group. I'd appreciate it if I can be pointed in the right direction.
select s.student_id + '' + year + '' + grade_level + '' + school
from students s
join grades g on s.student_id = g.student_id
If you want to do it in your SQL environment, it depends on the database management system you are using.
For example, if you are using Transact-SQL you can try to look at this link.
Generally, this kind of looping and interaction is done in the programming language that is coupled with the SQL database.
Anyway, you should look at stored procedures and cursors if you really want to do this in SQL.
You are trying to mix presentation with retrieval of data from database tables. Looping through the result set in SQL can be achieved with a cursor, but that isn't advised. You are better off pulling the required data using two queries and then printing it using a language of your choice.
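To illustrate the "print it in a language of your choice" route, here is a minimal Python sketch. It assumes the rows have already been fetched, as a single joined result rather than two queries, and are ordered by student_id and year. The data is copied from the question, and the function name `format_report` is hypothetical.

```python
from itertools import groupby

# Rows as they would come back from joining students and grades,
# ordered by student_id and year (data copied from the question).
rows = [
    (1, 2015, 10, "Smith High", "Algebra", 95),
    (1, 2015, 10, "Smith High", "English", 96),
    (1, 2016, 11, "Smith High", "Geometry", 85),
    (1, 2016, 11, "Smith High", "Science", 88),
    (2, 2015, 10, "Smith High", "Algebra", 98),
    (2, 2015, 10, "Smith High", "English", 93),
    (2, 2016, 11, "Smith High", "Geometry", 97),
    (2, 2016, 11, "Smith High", "Science", 86),
]

def format_report(rows):
    """Emit one header line per (student, year, grade, school) group,
    followed by one line per course in that group."""
    lines = []
    # groupby only merges adjacent rows, so the input must already be sorted by the key.
    for header, group in groupby(rows, key=lambda r: r[:4]):
        lines.append(" ".join(str(v) for v in header))
        for row in group:
            lines.append(f"{row[4]} {row[5]}")
    return lines

report = format_report(rows)
print("\n".join(report))
```

A new header line is started whenever any of the four grouping values changes, which matches the behaviour described in the question.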

Pandas difficult to add new column with condition?

I am trying to group by multiple columns and add the counts as a new column.
My input file:
OrderDate  Region   Rep      Item    Units  Unit Cost  Total
----------------------------------------------------------
1/6/18     East     Jones    Pencil  95     1.99       189.05
1/23/18    Central  Kivell   Binder  50     19.99      999.50
2/9/18     Central  Jardine  Pencil  36     4.99       179.64
2/26/18    Central  Gill     Pen     27     19.99      539.73
3/15/18    West     Sorvino  Pencil  56     2.99       167.44
4/1/18     East     Jones    Binder  60     4.99       299.40
4/18/18    Central  Andrews  Pencil  75     1.99       149.25
4/18/18    West     Jones    Pencil  75     1.99       149.25
I am trying to produce output like this:
Region    Rep       Count   same/diff
-------------------------------------
east      jones     2       2-same
          jones
central   Kivell    4       >3 difference
          Jardine
          Gill
          Andrews
West      Sorvino   2       2-different
West      jones1
My code:
import pandas as pd

df1 = pd.read_excel(excel_path, sheet_name='SalesOrders', index_col=0)
# Count how often each Rep appears within each Region
df3 = df1.groupby('Region')['Rep'].value_counts()
print(df3)
Please help me do this. Thanks.
In the Rep column, I have grouped by Region to find the Rep values. If the Rep members are the same, it is 2 of the same person. The Central region has 4 different people working in it, so it is greater than 3.
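One possible reading of the desired output is: per region, count the rows and check whether they all belong to the same rep. The sketch below follows that interpretation, which is an assumption, and rebuilds only the Region and Rep columns from the sample data so that it runs without the Excel file. The column names `Count`, `Unique` and `same/diff` are chosen for illustration.

```python
import pandas as pd

# Region and Rep columns copied from the sample input.
df = pd.DataFrame({
    "Region": ["East", "Central", "Central", "Central", "West", "East", "Central", "West"],
    "Rep": ["Jones", "Kivell", "Jardine", "Gill", "Sorvino", "Jones", "Andrews", "Jones"],
})

# Count = number of rows per region, Unique = number of distinct reps per region.
summary = df.groupby("Region")["Rep"].agg(Count="size", Unique="nunique")

# Label a region "same" when every row has the same rep, otherwise "different".
summary["same/diff"] = summary["Unique"].eq(1).map({True: "same", False: "different"})
print(summary)
```

If the label should also carry the count, as in "2-same", the two columns can be combined as strings afterwards.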

How do you read two-way tables?

What are two-way tables in SQL?
And how can I read them?
A two-way table is not a way of storing data but a way of displaying it. It doesn't say anything about how the data is stored.
Let's say we store persons along with their IQ and the country they live in. The table may look like this:
name        iq   country
John Smith  125  GB
Mary Jones  150  GB
Juan Lopez  150  ES
Liz Allen   125  GB
The two-way table to show the relation between IQ and country would be:
   | 125 | 150
---+-----+----
GB |   2 |   1
ES |   0 |   1
or
    | GB | ES
----+----+---
125 |  2 |  0
150 |  1 |  1
In order to retrieve this data from the database you might write this query:
select iq, country, count(*)
from persons
group by iq, country;
SQL is meant to retrieve data; it is not really meant to care about its presentation, the layout. So you'd write a program (in PHP, C#, Java, whatever) that sends the query to the database, receives the data, and fills a GUI grid in a loop.
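As a rough illustration of that division of labour, the sketch below uses Python with an in-memory SQLite database (an assumption made only so the example is self-contained), runs the grouped query from above, and fills the grid in a loop. A plain dictionary and `print` stand in for the GUI grid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table persons (name text, iq integer, country text)")
conn.executemany(
    "insert into persons values (?, ?, ?)",
    [("John Smith", 125, "GB"), ("Mary Jones", 150, "GB"),
     ("Juan Lopez", 150, "ES"), ("Liz Allen", 125, "GB")],
)

# The query from the answer: one row per (iq, country) combination that occurs.
counts = conn.execute(
    "select iq, country, count(*) from persons group by iq, country"
).fetchall()

iqs = sorted({iq for iq, _, _ in counts})
countries = sorted({country for _, country, _ in counts})

# Start every cell at 0 so combinations absent from the result still show up.
grid = {country: {iq: 0 for iq in iqs} for country in countries}
for iq, country, n in counts:
    grid[country][iq] = n

# Print the grid: countries as rows, IQ values as columns.
print("   | " + " | ".join(str(iq) for iq in iqs))
for country in countries:
    print(f"{country} | " + " | ".join(f"{grid[country][iq]:3d}" for iq in iqs))
```

The zero for ES at IQ 125 never comes back from the database; it exists only because the presentation code creates the empty cell.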
In rare cases SQL can provide the layout itself, i.e. give you the data in columns and rows. This works when the values of one dimension are known beforehand. This is usually not the case with IQs or countries as in the example given (i.e. you wouldn't have known which countries and which IQs are present in the table if I hadn't shown you). But of course you could have retrieved either the countries or the IQs first and then built the main query dynamically (with conditional aggregation or pivot). Another case where values are known beforehand is booleans, e.g. a flag in the persons table to show whether a person is homeless. If you wanted to know how many homeless persons live in which countries, you could easily write a query with two columns, one for homeless and one for not homeless.
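The "retrieve the countries first, then build the main query dynamically" idea could look roughly like this. It is a sketch using Python and an in-memory SQLite database, both assumptions made only to keep the example runnable, with conditional aggregation written as `sum(case when ...)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table persons (name text, iq integer, country text)")
conn.executemany(
    "insert into persons values (?, ?, ?)",
    [("John Smith", 125, "GB"), ("Mary Jones", 150, "GB"),
     ("Juan Lopez", 150, "ES"), ("Liz Allen", 125, "GB")],
)

# Step 1: find out which column values exist.
countries = [c for (c,) in conn.execute(
    "select distinct country from persons order by country"
)]

# Step 2: build one conditional-aggregation column per country.
# The country values are passed as bound parameters; only the column
# alias is interpolated, with double quotes escaped.
select_list = ", ".join(
    'sum(case when country = ? then 1 else 0 end) as "{}"'.format(c.replace('"', '""'))
    for c in countries
)
query = f"select iq, {select_list} from persons group by iq order by iq"

rows = conn.execute(query, countries).fetchall()
for row in rows:
    print(row)
```

Each result row now holds an IQ value followed by one count per country, which is the second layout shown above with the country columns in alphabetical order.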
As mentioned, the fact that you can display data in a two-way table doesn't say anything about how this data is stored. Above I showed you a one-table example. But let's say you have stores in many cities and want to know in which cities thinner or heavier people live. You decide to check which t-shirt sizes you sold in which cities. So you take your clients' orders and look up the clients and the cities they live in. You also look up the order details and the items they refer to, then take all items of type t-shirt. There are many tables involved, but the result is again a two-way table showing the relation between two attributes. E.g.:
city        | S   | M   | L   | XL
------------+-----+-----+-----+-----
New York    | 5%  | 8%  | 7%  | 10%
Los Angeles | 10% | 7%  | 7%  | 8%
Chicago     | 1%  | 4%  | 6%  | 11%
Houston     | 2%  | 2%  | 5%  | 7%

How to query DBpedia online using SQL?

DBpedia just released their data as tables, suitable to import into a relational database. How can I query this data online using SQL?
Dataset:
http://wiki.dbpedia.org/DBpediaAsTables
I took the raw data, uploaded it to BigQuery, and made it public. So far I've done it with the 'person' and the 'place' table. Check them at https://bigquery.cloud.google.com/table/fh-bigquery:dbpedia.person.
Now it is easy to find the most popular alma maters, for example:
SELECT COUNT(*), almaMater_label
FROM [fh-bigquery:dbpedia.person]
WHERE almaMater_label != 'NULL'
GROUP BY 2
ORDER BY 1 DESC
It's a little more complicated than that, because some people have more than one alma mater and because of the particular way DBpedia encodes that. I left the complete query at http://www.reddit.com/r/bigquery/comments/1rjee7/query_wikipedia_in_bigquery_the_dbpedia_dataset/.
Btw, the top alma maters are:
494 Harvard University
320 University of Cambridge
314 University of Michigan
267 Yale University
216 Trinity College Cambridge
You can also do joins between tables.
For example, for each building (from the place table) that has an architect: What year was that architect born? How many buildings with an architect born that year are listed in DBpedia?
SELECT COUNT(*), LEFT(b.birthDate, 4) birthYear
FROM [fh-bigquery:dbpedia.place] a
JOIN EACH [fh-bigquery:dbpedia.person] b
ON a.architect = b.URI
WHERE a.architect != 'NULL'
AND birthDate != 'NULL'
GROUP BY 2
ORDER BY 2
Results:
...
8 1934
13 1935
9 1937
7 1938
17 1939
7 1941
1 1943
15 1944
10 1945
12 1946
7 1947
9 1950
20 1951
1 1952
...
(Google BigQuery has a free monthly query quota of up to 100 GB each month)
(DBpedia data from version 3.4 on is licensed under the terms of the Creative Commons Attribution-ShareAlike 3.0 license and the GNU Free Documentation License. http://dbpedia.org/Datasets#h338-24)