Write query to show all unique occurrences numbered, and with variants listed - sql

I am working on past national censuses stored in an Oracle database. My main tools for working with it are MS Access and LibreOffice Base, depending on what kind of task I have to solve. I do not have direct access to the database; I cannot, for instance, run update queries directly on the main tables, but I can do this on subtables I have created in my environment.
I would like to list all unique standardised names from a census, with the number of instances shown as a count, and all variants of the name listed in a separate column. How would such a query be written?
In the example below, the …S following Firstname indicates which standard name the source's first name is encoded under.
Firstname FirstnameS
Tor Tor
Thor Tor
Per Per
Peer Per
Pær Per
Pär Per
Caroline Karoline
Charoline Karoline
Karoliine Karoline
Desired output
FirstnameS Σ Firstname_variants
Tor 2 Tor, Thor
Per 4 Per, Peer, Pær, Pär
Karoline 3 Caroline, Charoline, Karoliine
───
I hope I've provided all information and asked the question in a manner befitting the RoC of Stack Overflow. Be gentle; it's my first question!

SELECT FirstnameS, COUNT(Firstname) AS Num
FROM myTable
GROUP BY FirstnameS
gives you the first two columns.
The third depends on the database system - can you run Oracle queries (directly or pass-through)?
Edit:
Oracle: SQL Query to concatenate column values from multiple rows in Oracle
MS-Access: Combine values from related rows into a single concatenated string value
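If you can pass the query through to Oracle, a minimal sketch using LISTAGG (available from 11gR2 onward; myTable stands in for your real table name) should produce all three columns. The inner query collapses the records down to one row per variant first, so each variant is listed only once:

SELECT FirstnameS,
       SUM(cnt) AS Num,                -- total instances per standard name
       LISTAGG(Firstname, ', ')
         WITHIN GROUP (ORDER BY Firstname) AS Firstname_variants
FROM (
    -- one row per (standard name, variant), with its instance count
    SELECT FirstnameS, Firstname, COUNT(*) AS cnt
    FROM myTable
    GROUP BY FirstnameS, Firstname
)
GROUP BY FirstnameS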

Related

Tableau count values after a GROUP BY in SQL

I'm using Tableau to show some schools data.
My data structure gives a table that has all the school classes in the country. The thing is, I need to count, for example, how many schools have both Primary and Preschool.
A simplified version of my table should look like this:
In that table, if I want to know the number needed in the example, the result should be 1, because only one school has both Primary and Preschool.
I want to have a multiple filter in Tableau that gives me that information.
I was thinking about the SQL query that would be needed, and it requires a GROUP BY statement. An example of the query is here in a fiddle: Database example query
In the SQL query I group by id all the schools that meet either one of the conditions inside the IN (...) and then count how many of them meet both (c = 2).
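Roughly, the query described is the following (Schools, School_ID and School_Type are assumed names, for illustration only):

SELECT COUNT(*) AS schools_with_both
FROM (
    -- schools having at least one class of each selected type
    SELECT School_ID
    FROM Schools
    WHERE School_Type IN ('Primary', 'Preschool')
    GROUP BY School_ID
    HAVING COUNT(DISTINCT School_Type) = 2
) t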
Is there a way to do something like this in Tableau? Either using groups or sets, using advanced filters, or programming a RAW SQL calculated field?
Thanks!
Dubafek
PS: I add a link to my question in Tableau's forum because you can download my testing workbook there: Tableau's forum question
I've solved the issue using LODs (specifically INCLUDE and EXCLUDE statements).
I created two calculated fields having the aggregation I needed:
Then I made a calculated field that keeps only the School IDs whose number of types (given the filtering) matches the number of types selected in the multiple filter (both of the fields shown above):
Finally, I used COUNTD([Condition]) to display the number of schools matching at least the School types selected.
Hope this helps someone with a similar issue.
PS: If someone wants the Workbook with the solution, I've uploaded it in an answer in the Tableau Forum

Merge two CSV and collate data

I have two CSV files, the first like so:
Book1:
ID,TITLE,SUBJECT
0001,BLAH,OIL
0002,BLAH,HAMSTER
0003,BLAH,HAMSTER
0004,BLAH,PLANETS
0005,BLAH,JELLO
0006,BLAH,OIL
0007,BLAH,HAMSTER
0008,BLAH,JELLO
0009,BLAH,JELLO
0010,BLAH,HAMSTER
0011,BLAH,OIL
0012,BLAH,OIL
0013,BLAH,OIL
0014,BLAH,JELLO
0015,BLAH,JELLO
0016,BLAH,HAMSTER
0017,BLAH,PLANETS
0018,BLAH,PLANETS
0019,BLAH,HAMSTER
0020,BLAH,HAMSTER
And then a second CSV with items associated with the first list, with ID being the common attribute between the two.
Book2:
ID,ITEM
0001,PURSE
0001,STEAM
0001,SEASHELL
0002,TRUMPET
0002,TRAMPOLINE
0003,PURSE
0003,DOLPHIN
0003,ENVELOPE
0004,SEASHELL
0004,SERPENT
0004,TRUMPET
0005,CAR
0005,NOODLE
0006,CANNONBALL
0006,NOODLE
0006,ORANGE
0006,SEASHELL
0007,CREAM
0007,CANNONBALL
0007,GUM
0008,SERPENT
0008,NOODLE
0008,CAR
0009,CANNONBALL
0009,SERPENT
0009,GRAPE
0010,SERPENT
0010,CAR
0010,TAPE
0011,CANNONBALL
0011,GRAPE
0012,ORANGE
0012,GUM
0012,SEASHELL
0013,NOODLE
0013,CAR
0014,STICK
0014,ORANGE
0015,GUN
0015,GRAPE
0015,STICK
0016,BASEBALL
0016,SEASHELL
0017,CANNONBALL
0017,ORANGE
0017,TRUMPET
0018,GUM
0018,STICK
0018,GRAPE
0018,CAR
0019,CANNONBALL
0019,TRUMPET
0019,ORANGE
0020,TRUMPET
0020,CHERRY
0020,ORANGE
0020,GUM
The real datasets are millions of records, so I'm sorry in advance for my simple example.
The problem I need to solve is getting the data merged and collated in a way where I can see which item groupings most commonly appear together on the same ID. (e.g. GRAPE,GUM,SEASHELL appear together 340 times, ORANGE and STICK 89 times, etc...)
Then I need to see if there is any change/deviation to the general results in common appearance when grouped by SUBJECT.
Tools I'm familiar with are Excel and SQL, but I also have PowerBI and Alteryx at my disposal.
Full disclosure: Not homework, or work, but a volunteer project, thus my unfamiliarity with this kind of data manipulation.
Thanks in advance.
An Alteryx solution:
Drag the two .csv files onto your canvas (seen as book1.csv and book2.csv in my picture); Alteryx will create "Input" tools for you.
Drag a "Join" tool on and connect the two .csv files to its inputs; select "ID" as the join field; deselect "Right_ID" in the output, since it's merely a duplicate of "ID".
Drag a "Summary" tool on and connect the Join tool's output to the Summary tool's input; select all three of the output fields and add them as a "group by"... then add the ID column with a "count".
Drag a Browse tool on and connect the Summary tool's output to the Browse tool's input.
Run the workflow.
After all that, click on the Browse tool and you should see what is in my screenshot (showing just the first ten rows of output):
+1 for taking on a volunteer project - I think anyone who knows data can have a big impact in support of their favourite group or cause.
I would just pull the 2 files into Power BI as 2 separate tables (Get Data / From File). Create a relationship between the 2 tables based on ID (it might get auto-generated). It should be one-to-many.
Then I would add a Calculated Column to the Book1 table to concatenate the related ITEM values, e.g.:
Items =
CALCULATE (
    CONCATENATEX (
        DISTINCT ( 'Book2'[ITEM] ),
        'Book2'[ITEM],
        ", ",
        'Book2'[ITEM], ASC
    )
)
Now you can use that Items field in visuals (e.g. a Table), along with Count of ID to get the frequency.
Adding Subject to a copy of the table (e.g. to the Columns well of a Matrix) will produce your grouped scenario, or you could add a Subject Slicer.
As you will be comparing subsets of varying size, I would change Count of ID to Show value as - % of grand total.
A slightly different solution using Alteryx.
With this dataset, there are very few repeating 3- or 4-item groups. You can do the two-item affinity analysis and derive a probability for 3- or 4-item groups, or you can count the 3- and 4-item groups individually. I believe what you want is the latter, as your probability of getting grapes with oranges may be altered by whether you have bananas in the cart or not.
Anyway, I did not join in the subject until after finding all of my combinations. I found all the combinations by taking the Cartesian join of two, then three, then four copies of the original set. I then removed all duplicates by ensuring items were always in alphabetical order in each row, and counted occurrences of each combination. More joins can be added in the same pattern to count groups of 5, 6, 7...
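For comparison, the two-item step of that pattern expressed in SQL might look roughly like this (using the question's Book2 layout; the a.ITEM < b.ITEM condition is what keeps each pair once, in alphabetical order):

SELECT a.ITEM AS item1,
       b.ITEM AS item2,
       COUNT(*) AS times_together
FROM Book2 a
JOIN Book2 b
  ON a.ID = b.ID
 AND a.ITEM < b.ITEM   -- dedupe: each unordered pair appears once
GROUP BY a.ITEM, b.ITEM
ORDER BY times_together DESC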
Once you have the counts of occurrences, then I would join back with the subjects and perform this analysis on each group and compare to the overall results.
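And a sketch of the subject-grouped comparison, joining the pairs back to Book1 (again, purely illustrative):

SELECT t.SUBJECT,
       a.ITEM AS item1,
       b.ITEM AS item2,
       COUNT(*) AS times_together
FROM Book2 a
JOIN Book2 b
  ON a.ID = b.ID
 AND a.ITEM < b.ITEM
JOIN Book1 t
  ON t.ID = a.ID         -- pull in the subject for each ID
GROUP BY t.SUBJECT, a.ITEM, b.ITEM
ORDER BY t.SUBJECT, times_together DESC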
I'm supposed to disclose that I work for Alteryx.
First of all, if you are using Windows, just navigate to the directory which contains the CSV files and run the following command:
copy pattern newfileName.csv
REM example
copy *.csv merged.csv
Now you have created one CSV file. If the file is too large to process at once, use an approach appropriate to your programming language: in Python you can use generators to process it line by line, or read it chunk by chunk with pandas.
I hope this helps you.

sql oracle search by multiple terms in business object report

I am writing a report where I would like the end user to be able to search by multiple terms (e.g. UK, CZ),
but my code does not fetch any results:
like #variable('2. COUNTRY (UK, CZ, AT or use % for all)')
It works when using just one term (e.g. UK) but not when the user tries to search for more than one value.
I have tried using different statements before the variable but still get no results.
Is a search like this possible?
I'm writing this for Business Objects 5
Thanks
Matt
You're trying to perform a wildcard search (by using the LIKE keyword) in combination with a prompt (I take it it's a multi-value prompt).
Let's go through a few possible scenarios:
Wildcard
Example: the user enters % in the prompt.
SQL translation: Country LIKE '%'
Result: the query returns all records due to the wildcard
Single-value
Example: the user enters UK in the prompt.
SQL translation: Country LIKE 'UK'
Result: the query returns all records with the Country column matching the value UK
Multiple values
Example: the user selects UK and AT in the prompt.
SQL translation: Country LIKE 'UK,AT'
Result: the query returns no records because there is no record that contains the value UK,AT (literally) for the Country column.
What you're trying to do, as far as I can determine, is to allow the user to select multiple values or skip the selection altogether and return all values (for which you used the combination of the LIKE keyword and % wildcard).
However, with multiple values, you need to use the IN keyword instead. In current versions of BusinessObjects (you're using a very old version), it's possible to make prompts optional.
As you don't have this feature, the only alternative is to create a universe condition in which you build a CASE around your #prompt function, to determine whether the user entered a % or selected multiple values, and then construct your WHERE clause accordingly.
Have a look at this article for an example of how to build such a condition.
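Purely for illustration, an OR-based equivalent of that idea might look roughly like the condition below. This assumes the @Prompt syntax of the universe designer (the exact function name and arguments vary by version), and the prompt text must be repeated identically so the user is asked only once:

(   Country IN @Prompt('2. COUNTRY (UK, CZ, AT or use % for all)','A',,multi,free)
 OR '%' IN @Prompt('2. COUNTRY (UK, CZ, AT or use % for all)','A',,multi,free) )

If the user types %, the second branch is true and every country is returned; otherwise the first branch performs a plain IN match on the selected values.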

Linking two separate sets of data codes without a common identifier

I have two large sets of data. Both sets are a form of structured coding system, and each is used to categorize groups of people based on their occupation. The two sets of data have no common identifier. Besides a column that contains a unique identifier, each table has a description for said identifier, but although they may be describing similar things, the descriptions are not identical.
How do I create a table that connects the two sets of data, without having to go back and manually figure out how to make the connection between the two identifiers? I am not sure if this can be done in Access or SQL. If there is a way to do this, I would like to know what software might be out there.
Here's some example data:
Table 1:
Z Identifier DescriptionA
162000 Pharmacist
3123566 Electronic Repairman
143246 Banker
8444455 Doctor
Table 2:
Q Identifier DescriptionB
XX134556 COPY/PRINT/SCAN EQUIP
666Q1224 DRUGS
722WWYZ Financial Svc
8456435T Medical Services
15666PP Health Services
Desired Output:
Table 3:
Z Identifier DescriptionA Q Identifier DescriptionB
162000 Pharmacist 666Q1224 DRUGS
3123566 Electr Repairman XX134556 COPY/PRINT/SCAN EQUIP
143246 Banker 722WWYZ Financial Svc
8444455 Doctor 8456435T Medical Services
Conventional tools that you are used to (like Access, Excel, and SQL) can only go so far with comparing the meaning and usage of words.
In other words (forgive the pun), in order to do this you need some sort of natural language processing toolkit (NLPT). Along with that, you also need some knowledge of how to program, because I don't think there exist front-end interfaces that can give you the output you want, given only the input you listed, just by filling out some forms.
So with that in mind, in order to solve your problem (I'll assume you know how to program and can pick up an NLPT in a language of your choice), you need to do the following:
Put your two datasets in some tables.
Manipulate DescriptionA and DescriptionB into something meaningful to the NLPT you are using. It won't like a string such as "COPY/PRINT/SCAN EQUIP"; it will want the slashes removed and the words separated.
Compare DescriptionA with DescriptionB in a permutation-style manner by using a path_similarity type of function in the library. For example path_similarity('animal.definition1', 'dog.definition1') should return a high value, say .60, while path_similarity('animal.definition1', 'book.definition1') should return a low value, like .10.
If the path_similarity is above a certain value (up for you to decide), join the two items together and append them as a single row to a results table, while removing them from their respective tables. Continue doing this until no DescriptionA remains with greater than a certain similarity to any DescriptionB. Then do something else with the rows that are left in Table 1 and Table 2.
This should all be fairly easy to do programmatically. You may find you are not getting proper matches in some places with this method because you are randomly choosing two words to compare. Because of that, you may want to find another algorithm other than just permutations, perhaps one that looks at the statistics of the path_similarity of every piece of your data to every other piece and acts more appropriately.
Additionally, you may want to allow more than two words to be paired up. For example, "lumberjack", "tree cutter", and "tree chopper" make more sense grouped in one row (with an additional two columns created) than throwing one of them out to be left without a pair. All of the problems I just listed in this paragraph are surely not new, and you can search around the internet for ways to solve them. Best of luck!

SQL IN statement "inclusiveness"

I'm not a programmer, but trying to learn. I'm a nurse, and need to pull data for medical referral tracking from a database. I have a piece of GUI software which builds JOIN queries for me to pull things from the database. One of the operators I can use in the drop-down is "IN." The referral documentation is stored in the table as codes made up of one to three letters. For example, the code for a completed dental referral is CDF, and the code for a dental referral is D.
I want to build a report to allow other nurses to pull all their outstanding referrals, so I'll want to pull "D" but not "CDF".
If I use IN as the operator, and set my parameters to 'S','D','BP' {etc} will that also pull the records which have the other, longer codes which contain those same letters? (like CDF, CSR, CBP)
I don't want to test it because I only have access to the production database, and I don't want to hose up actual patient records. Thanks in advance for any help!
Assuming that the column that holds the referral code holds one and only one code per record (which is what it sounds like) the query should function as you want and will not attempt to match substrings.
In any event, there's no danger that a query in the form IN ('S', 'D', 'BP') will match substrings. To perform substring matches in SQL you have to use the LIKE operator.
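For illustration (referrals and referral_code are assumed names for your table and column), the difference looks like this:

-- IN matches whole values only: this pulls 'D' but never 'CDF'
SELECT * FROM referrals WHERE referral_code IN ('S', 'D', 'BP')

-- substring matching requires LIKE with wildcards; this WOULD also pull 'CDF'
SELECT * FROM referrals WHERE referral_code LIKE '%D%'

As an aside, both of these are plain SELECT statements, which only read data; a SELECT cannot change patient records, though a very heavy one can slow the server down.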
The situation in which this will not work is if the referral code column holds multiple codes separated by commas. This is an all-too-common mistake in designing databases but if the product you're using is commercial rather than home-grown, I think it's very unlikely to be the case. If it is, searching it is much more difficult.
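If you ever do run into that design, plain IN won't help; one rough workaround (sketched with Oracle-style || concatenation; other systems use + or CONCAT) is to pad both sides with delimiters and fall back to LIKE:

SELECT * FROM referrals
WHERE ',' || referral_code || ',' LIKE '%,D,%'

Padding with commas ensures 'D' is matched only as a whole code, not as a letter inside 'CDF'.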