db design for ad delivery - sql

I am building an ad delivery system similar to the one on Facebook. Basically, my clients can create adverts and target them to particular members, based on location and bio data that we will constantly collect from them.
At the moment it is designed like so:
member
-----
id
fname
lname
country_id
state_id
region_id
postcode_id
etc..
I know I only need the postcode_id and can then look it up in the postcode table, but I have around 50 postcode tables, one for each country; it's just easier for me.
member_profile
-----
id
mem_id
bio_q (id linked to a varchar, e.g. 'profession')
bio_a
advert
-------
id
client_id
ad_title
ad_line
start_date
end_date
status
etc..
advert_target
-----
id
advert_id
fk_id
type
e.g. data for the above:
1, 1, 1, 'state'
2, 1, 2, 'state'
3, 1, 5, 'profession'
One way I thought of doing it is with a whole heap of UNION statements, but I'm not sure that's the most efficient way. Any help with direction would be greatly appreciated. Roughly, the shape I had in mind is sketched below.
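(Sketch only; the :member_... placeholders stand for values taken from the member's own row and profile.)
SELECT advert_id FROM advert_target
WHERE type = 'state' AND fk_id = :member_state_id
UNION
SELECT advert_id FROM advert_target
WHERE type = 'region' AND fk_id = :member_region_id
UNION
SELECT advert_id FROM advert_target
WHERE type = 'profession'
  AND fk_id IN (SELECT bio_a FROM member_profile
                WHERE mem_id = :member_id AND bio_q = :profession_q_id)
-- ...and so on, one branch per target type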
thanks
a

Typically, this kind of application introduces the concept of a "segment", assigns users to segments, and targets ads at segments.
So, you might assign a given user to five segments: "lives in Europe", "lives in Italy", "lives in Rome", "lives within 5 kilometers of postcode x", "is at least 18 years old".
You then might have ads targeting the segments "lives in Italy", "is at least 18 years old", "has visited the site at least 3 times"; our sample user scores 2 out of 3 criteria, so might only be shown those ads if nobody more qualified turns up.
You want to pre-populate your segments with regular database jobs; doing this on the fly will cripple your system at even moderate levels of traffic.
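For concreteness, a minimal sketch of that design (table and column names here are illustrative, not prescriptive):
-- Segments are pre-computed groupings of members.
CREATE TABLE segment (
    id   INT PRIMARY KEY,
    name VARCHAR(100)    -- e.g. 'lives in Italy', 'is at least 18 years old'
);

-- Which members belong to which segments; rebuilt by the regular jobs.
CREATE TABLE member_segment (
    member_id  INT,
    segment_id INT,
    PRIMARY KEY (member_id, segment_id)
);

-- Which segments each advert targets.
CREATE TABLE advert_segment (
    advert_id  INT,
    segment_id INT,
    PRIMARY KEY (advert_id, segment_id)
);

-- At serving time, ranking adverts for one member is then a single join:
SELECT adv.advert_id,
       COUNT(*) AS criteria_met
FROM advert_segment adv
JOIN member_segment mem
  ON mem.segment_id = adv.segment_id
WHERE mem.member_id = 123    -- the member being shown an ad
GROUP BY adv.advert_id
ORDER BY criteria_met DESC;
The point of the member_segment table is that ad serving never touches the raw location/bio data; only the scheduled jobs do.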

Related

match tables with intermediate mapping table (fuzzy joins with similar strings)

I'm using BigQuery.
I have two simple tables with "bad" data quality from our systems. One represents revenue and the other production rows for bus journeys.
I need to match every journey to a revenue transaction but I only have a set of fields and no key and I don't really know how to do this matching.
This is a sample of the data:
Revenue
Year, Agreement, Station_origin, Station_destination, Product
2020, 123123, London, Manchester, Qwerty
Journeys
Year, Agreement, Station_origin, Station_destination, Product
2020, 123123, Kings Cross, Piccadilly Gardens, Qwer
2020, 123123, Kings Cross, Victoria Station, Qwert
2020, 123123, London, Manchester, Qwerty
Every station has a maximum of 9 alternative names and these are stored in a "station" table.
Stations
Station Name, Station Name 2, Station Name 3,...
London, Kings Cross, Euston,...
Manchester, Piccadilly Gardens, Victoria Station,...
I would like to first test matching/joining the tables on the original fields. That will produce some matches, but many journeys will remain unmatched. For the unmatched revenue rows, I would like to change the product name (shorten it to two letters, which may match many production rows) and then the station names, changing station_origin first and then station_destination. When using a shortened product name I could get many matches, but I want the row from the production table with the most common product.
Something like this:
1. Do a direct match. That is, I can use the fields as they are in the tables.
2. Do a match where the revenue.product is changed by shortening it to two letters. substr(product,0,2)
3. Change rev.station_origin to the first alternative, Station Name 2, and then try a join. Neither the product nor the other station is changed.
4. Change rev.station_origin to the first alternative, Station Name 2, and then try a join. The product is shortened as above with substr(product,0,2), but rev.station_destination is not changed.
5. Change rev.station_destination to the first alternative, Station Name 2, and then try a join. Neither the product nor the other station is changed.
I was told that maybe I should create an intermediate table with all combinations of stations and products and let a rank column decide the order. The station names in the stations table are in order of importance, so "Station Name" is more important than "Station Name 2", and so on.
I started writing a query with a subquery per rank combined with UNION ALL, but there are so many combinations that there must be a better way to do this.
I don't know if this makes sense, but I would appreciate any help or ideas on a better way to do this.
Cheers,
Cris
To implement a complex joining strategy with approximate matching, it might make more sense to define the strategy within JavaScript and call the function from a BigQuery SQL query.
For example, the following query performs these steps:
1. Take the top 200 male names in the US.
2. Find if one of the top 200 female names matches.
3. If not, look for the most similar female name within the options.
Note that the logic to choose the closest option is encapsulated within the JavaScript UDF fhoffa.x.fuzzy_extract_one(). See https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83 to learn more about this.
WITH data AS (
  SELECT name, gender, SUM(number) c
  FROM `bigquery-public-data.usa_names.usa_1910_2013`
  GROUP BY 1, 2
), top_men AS (
  SELECT * FROM data WHERE gender = 'M'
  ORDER BY c DESC LIMIT 200
), top_women AS (
  SELECT * FROM data WHERE gender = 'F'
  ORDER BY c DESC LIMIT 200
)
SELECT name AS male_name,
       COALESCE(
         (SELECT name FROM top_women WHERE name = a.name),
         fhoffa.x.fuzzy_extract_one(name, ARRAY(SELECT name FROM top_women))
       ) AS female_version
FROM top_men a
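The ranked-attempt idea from the question can also stay in pure SQL: write each matching rule as its own SELECT with a priority number, UNION ALL them, and keep only the best-ranked match per journey. A rough sketch (BigQuery; table names taken from the question, only the first two rules shown, and the "most common product" tie-break omitted):
WITH attempts AS (
  -- rule 1: exact match on every field
  SELECT j.year, j.agreement, j.station_origin, j.station_destination,
         j.product, r.product AS revenue_product, 1 AS match_rank
  FROM journeys j
  JOIN revenue r
    ON r.year = j.year
   AND r.agreement = j.agreement
   AND r.station_origin = j.station_origin
   AND r.station_destination = j.station_destination
   AND r.product = j.product
  UNION ALL
  -- rule 2: same join, but compare only the first two letters of the product
  SELECT j.year, j.agreement, j.station_origin, j.station_destination,
         j.product, r.product, 2
  FROM journeys j
  JOIN revenue r
    ON r.year = j.year
   AND r.agreement = j.agreement
   AND r.station_origin = j.station_origin
   AND r.station_destination = j.station_destination
   AND SUBSTR(r.product, 1, 2) = SUBSTR(j.product, 1, 2)
  -- rules 3-5 would join through the stations table to swap in
  -- "Station Name 2" for origin and/or destination
)
-- keep only the best-ranked match for each journey
SELECT * EXCEPT(rn)
FROM (
  SELECT a.*,
         ROW_NUMBER() OVER (
           PARTITION BY year, agreement, station_origin,
                        station_destination, product
           ORDER BY match_rank
         ) AS rn
  FROM attempts a
)
WHERE rn = 1;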

How to return most common login location for each user (duplicates possible) in SQL?

This is my first post, so bear with me.
I've been working on a problem with a large data set for about a week now and I am banging my head against the wall. Essentially, I have a database containing a record for each time a user accesses a service; each record has a unique ID associated with the user (user_id), an assigned country tag which may differ between accesses (demo_tag) and is a best guess at the user's geolocation, and a bunch of other information I'm not currently worried about.
What I want to accomplish is to determine which country a user most likely resides in, based on the number of times they've accessed the service with a given assigned country. In the event of a tie, I want to retrieve BOTH regions (say a user has logged in an equal number of times from both France and Belgium; I want to associate the user with both countries). Basically, for each user I want to know the maximum number of times they've logged in from one specific location, and which location(s) it is/they are.
e.g. If I had:
user_id  region
1        USA
1        CAN
1        CAN
2        MEX
2        MEX
2        USA
2        USA
I'd expect to get back:
user_id  region  count
1        CAN     2
2        MEX     2
2        USA     2
Right now I have a very ugly, multi-nested query and I feel there must be a better way to do this. Any advice?
Use group by and rank():
select ur.*
from (select user_id, region, count(*) as cnt,
             rank() over (partition by user_id order by count(*) desc) as seqnum
      from t
      group by user_id, region
     ) ur
where seqnum = 1;
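Two details worth noting: rank() (rather than row_number()) is what makes the tie case work, since two regions sharing the same top count both get seqnum = 1 and both survive the where seqnum = 1 filter; and aliasing count(*) (cnt above) gives the count column a usable name in the output.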

Multiple occurrences of the same column variable in a query

I have two tables
TicketsForSale
ticket_id (PK)
type
category
Transactions
transaction_id (PK)
ticket_id (FK)
I want to get the number of transactions per ticket type. This is what I've tried:
SELECT ticketsforsale.type
, COUNT(transactions.ticket_id)
FROM ticketsforsale
INNER JOIN transactions ON ticketsforsale.ticket_id = transactions.ticket_id
GROUP BY ticketsforsale.type
What I hope for as a result is something like this
{
Sports 5
Theater 7
Cruise 8
Cinema 10
}
But instead I get the following:
{
Theater 2
Cruise 1
Sports 1
Sports 2
Cruise 3
Cinema 5
}
The numbers aren't accurate, just used for demonstration.
(The category column lists the specific show you attend by "purchasing" the ticket. E.g. if the type is "Sports", the category could be Basketball, Football, Volleyball, etc.) I just thought that this column could somehow be the issue here, but maybe I'm wrong.
Try this:
select distinct type,
       encode(type::bytea, 'hex') as hex_type
from TicketsForSale
order by 1;
You'll probably find that you have multiple type values that appear identical but have different hexadecimal representations. Fix those discrepancies, and then you should be good to go.
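If the hex values reveal something mundane like stray leading or trailing whitespace, a cleanup along these lines may be all you need (PostgreSQL, and only on the assumption that whitespace really is the culprit):
UPDATE TicketsForSale
SET type = btrim(type)          -- strip leading/trailing whitespace
WHERE type <> btrim(type);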

Oracle REGEXP_SUBSTR for string matching b/w two columns

The problem
Users frequently input "country name" strings into the "city name" field; heuristically, this appears to be an extremely common practice. For example, a user might put "TAIPEI TAIWAN" in the city name when only "TAIPEI" should be input, with "TAIWAN" going in the country field. I am working to aggregate these instances for this specific field (your help will allow me to expand this to other columns and tables) and then, where possible, identify the risk rankings associated with strictly the "country" names appearing in the "city" field.
I have two tables that I am attempting to leverage to track down the data-validation issues. Tbl1, named "Customer_Address", is comprised of geographic columns (Customer_Num, Address, City_Name, State, Country_Code, Zipcode). Tbl2, named "HR_Countries", is a clean table of 2-digit ISO country codes with their corresponding name values (Lebanon, Taiwan, China, Syria, Russia, Ukraine, etc.) and some other fields not presently used.
The initial step is to query "Customer_Address" for City_Names matching a series of OR'd LIKE conditions (LIKE '%CHINA' OR LIKE '%TAIWAN' OR etc.) and count the occurrences where the City_Name is like the designated country-name string passed in; the results are pretty good. I've coded in some exclusions to deal with things like "Lebanon, OH", so my overall results are satisfactory for the first phase.
Part of the query does a LEFT JOIN from Tbl1 to Tbl2 to attach the risk rating from Tbl2 to the results of the query against Tbl1:
LEFT JOIN tbl2 risk
ON INSTR(addr.CITY_NM, risk.COUNTRY_NAME,1) <> 0
Example of Tbl1 Data Output (head(tbl1), n=7)
CountryNameInCity    CountOfOccurrences  RR
China                15                  High
Taiwan               2000                Medium
Japan                250                 Low
Taipei, Taiwan       25                  NULL
Kabul, Afghanistan   10                  NULL
Shenzen China        100                 NULL
Afghanistan          52                  Very High
Example of Tbl2 Data (head(tbl2), n=6)
CountryName  CountryCode  RR
China        CN           High
Taiwan       TW           High
Iraq         IQ           Very High
Cuba         CU           Medium
Lebanon      LB           Very High
Greece       GR           High
So my questions are as follows:
1) Instead of manually passing in a series of OR statements for country names, is there a better way to use Tbl2 as the matching "LIKE" driver for the query?
2) Can you recommend a better way of comparing the output of the query (see the Tbl1 example) to ensure that multiple strings ("Taipei, Taiwan", etc.) are appropriately aggregated and bring back the correct 'RR' rating?
Thanks for taking the time to review this and respond.
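One direction for question 1: let Tbl2 drive the matching instead of a hand-written OR list, by joining every country name against the city field. A sketch (Oracle; table and column names taken from the description above, with UPPER() added on the assumption that City_Name values are upper-cased while Tbl2's country names are mixed case):
SELECT risk.COUNTRY_NAME,
       risk.RR,
       COUNT(*) AS occurrences
FROM   Customer_Address addr
JOIN   HR_Countries risk
  ON   INSTR(UPPER(addr.CITY_NM), UPPER(risk.COUNTRY_NAME), 1) <> 0
GROUP  BY risk.COUNTRY_NAME, risk.RR
ORDER  BY occurrences DESC;
Grouping by the matched country rather than the raw city string also speaks to question 2: "Taipei, Taiwan" and "Taiwan" then aggregate under the same COUNTRY_NAME and pick up the same RR rating. The "Lebanon, OH"-style exclusions would still need to be layered on top.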

Microsoft Access 2010 - Updating Multiple Rows with Different values in ONE query

I have a question about updating multiple rows with different values in MS Access 2010.
Table 1: Food
ID | Favourite Food
1 | Apple
2 | Orange
3 | Pear
Table 2: New
ID | Favourite Food
1 | Watermelon
3 | Cherries
Right now, it looks deceptively simple to execute them separately (because this is just an example). But how would I execute a whole lot of them at the same time if I had, say, 500 rows to update out of 1000 records?
So what I want to do is to update the "Food" table based on the new values from the "New" table.
Would appreciate if anyone could give me some direction / syntax so that I can test it out on MS Access 2010. If this requires VBA, do provide some samples of how I should carry this out programmatically, not manually statement-by-statement.
Thank you!
ADDENDUM (REAL DATA)
Table: Competitors
Columns: CompetitorNo (PK), FirstName, LastName, Score, Ranking
query: FinalScore
Columns: CompetitorNo, Score, Ranking
Note - this query is a query of another query, which in turn is a query of another query. (Could there be a potential problem here? There are at least 4 queries before this FinalScore query is derived. Should I post them?)
In the competitors table, all the columns except "Score" and "Ranking" are filled. We would need to take the values from the FinalScore query and insert them into the relevant competitor columns.
Addendum (Brief Explanation of Query)
Table: Competitors
Columns: CompetitorNo (PK), FirstName, LastName, Score, Ranking
Sample Data: AX1234, Simpson, Danny, <blank initially>, <blank initially>
Table: CompetitionRecord
Columns: EventNo (PK composite), CompetitorNo (PK composite), Timing, Bonus
Sample Data1: E01, AX1234, 14.4, 1
Sample Data2: E01, AB1938, 12.5, 0
Sample Data3: E01, BB1919, 13.0, 2
Event No specifies unique event ID
Timing measures the time taken to run 200 metres. The lower, the better.
Bonus can take 3 values (0 - Disqualified, 1 - Normal, 2 - Exceptional). Competitors with Exceptional are given bonus points (5% off their timing).
Query: FinalScore
Columns: CompetitorNo (PK), Score, Ranking
Score is calculated by wins. For example, in the above event (E01) there are three competitors. The winner of the event is BB1919. Winners get 1 point. Losers don't get any points. Those that are disqualified do not receive any points either.
This query lists the competitors and their cumulative scores (from a list of many events - E01, E02, E03, etc.) and recalculates their ranking in the Ranking column every time the query is executed. (For example, a person who wins the most 200m events would be at the top of this list.)
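(A simplified sketch of one such building block, for illustration only: the per-event winner, i.e. the lowest bonus-adjusted timing among non-disqualified runners.)
SELECT cr.EventNo, cr.CompetitorNo
FROM CompetitionRecord AS cr
WHERE cr.Bonus <> 0
  AND IIF(cr.Bonus = 2, cr.Timing * 0.95, cr.Timing) =
      (SELECT MIN(IIF(c2.Bonus = 2, c2.Timing * 0.95, c2.Timing))
       FROM CompetitionRecord AS c2
       WHERE c2.EventNo = cr.EventNo AND c2.Bonus <> 0);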
Now, I am required to update the Competitors table with this information. The query is rather complex, with all the grouping, summations, rankings and whatnot. Thus, I had to create multiple queries to achieve the end result.
How about:
UPDATE Food
INNER JOIN [New]
    ON Food.ID = [New].ID
SET Food.[Favourite Food] = [New].[Favourite Food]
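Applying the same pattern to the addendum's tables would look like this (an untested sketch; note that Access often refuses to update through a join to an aggregate query with "Operation must use an updateable query", in which case you would first materialize FinalScore into a temporary table via a make-table query and join to that instead):
UPDATE Competitors
INNER JOIN FinalScore
    ON Competitors.CompetitorNo = FinalScore.CompetitorNo
SET Competitors.Score = FinalScore.Score,
    Competitors.Ranking = FinalScore.Ranking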