I'm new to MS Access and am trying to speed up a data-gathering process that is taking forever in PowerShell. In PowerShell I have 10 or so web API calls to get data, and each comes back as an object with multiple properties (fields). Each set of data has fields that relate it to one or more of the other sets of data. Getting the data is very quick, but piping an array of objects through Where-Object and Select-Object takes over an hour, and there's really not that much data. Each object contains 500-1500 "records" and 5 to 10 "fields," so I thought: why not export that data and use something that's intended to search through data to do the job? I exported each object as a separate .CSV file. So, enter MS Access.
I imported each of the CSVs as a separate table (easy enough). I'm going to simplify this down for this example to the following 3 tables:
Tables: https://i.stack.imgur.com/UCH1F.jpg
Every table has fields that relate it to other tables. Pretty much there's some sort of Id field in every table that is related to an Id field in a different table, from which I need to pull a field called "Name". I'm trying to follow the breadcrumbs from the Player name to its Network name, to its Application name, to its Layout name, etc. I want to build a query that I could eventually just export as an Excel file. I'd also prefer to just write out the SQL, unless it's really easier to understand the visual query builder. I'm looking to build a sheet with the following information:
Player's Name would include all names from the Players table, and getting just that data makes sense to me:
SELECT Name AS PlayerName FROM Players
Everything else, not so much. I feel like this will end up being some mega-query as I get deeper into related table after related table. In Excel it would be straightforward using VLOOKUPs across tabs, but that doesn't seem to be the best approach here. Given the info above, I'm trying to achieve the following output:
Result table
Any help with strategy and syntax greatly appreciated!
You're looking for the JOIN clause.
In Access SQL, joins involving more than two tables have to be nested in parentheses, and column aliases need the AS keyword:
SELECT Players.Name AS PlayerName,
       Networks.Name AS PlayerNetwork,
       Applications.Name AS ApplicationName
FROM (Players
LEFT JOIN Networks ON Networks.Id = Players.NetworkId)
LEFT JOIN Applications ON Applications.Id = Players.ApplicationId;
I have two CSV files, the first like so:
Book1:
ID,TITLE,SUBJECT
0001,BLAH,OIL
0002,BLAH,HAMSTER
0003,BLAH,HAMSTER
0004,BLAH,PLANETS
0005,BLAH,JELLO
0006,BLAH,OIL
0007,BLAH,HAMSTER
0008,BLAH,JELLO
0009,BLAH,JELLO
0010,BLAH,HAMSTER
0011,BLAH,OIL
0012,BLAH,OIL
0013,BLAH,OIL
0014,BLAH,JELLO
0015,BLAH,JELLO
0016,BLAH,HAMSTER
0017,BLAH,PLANETS
0018,BLAH,PLANETS
0019,BLAH,HAMSTER
0020,BLAH,HAMSTER
And then a second CSV with items associated with the first list, with ID being the common attribute between the two.
Book2:
ID,ITEM
0001,PURSE
0001,STEAM
0001,SEASHELL
0002,TRUMPET
0002,TRAMPOLINE
0003,PURSE
0003,DOLPHIN
0003,ENVELOPE
0004,SEASHELL
0004,SERPENT
0004,TRUMPET
0005,CAR
0005,NOODLE
0006,CANNONBALL
0006,NOODLE
0006,ORANGE
0006,SEASHELL
0007,CREAM
0007,CANNONBALL
0007,GUM
0008,SERPENT
0008,NOODLE
0008,CAR
0009,CANNONBALL
0009,SERPENT
0009,GRAPE
0010,SERPENT
0010,CAR
0010,TAPE
0011,CANNONBALL
0011,GRAPE
0012,ORANGE
0012,GUM
0012,SEASHELL
0013,NOODLE
0013,CAR
0014,STICK
0014,ORANGE
0015,GUN
0015,GRAPE
0015,STICK
0016,BASEBALL
0016,SEASHELL
0017,CANNONBALL
0017,ORANGE
0017,TRUMPET
0018,GUM
0018,STICK
0018,GRAPE
0018,CAR
0019,CANNONBALL
0019,TRUMPET
0019,ORANGE
0020,TRUMPET
0020,CHERRY
0020,ORANGE
0020,GUM
The real datasets are millions of records, so I'm sorry in advance for my simple example.
The problem I need to solve is getting the data merged and collated in a way that lets me see which item groupings most commonly appear together on the same ID (e.g. GRAPE, GUM, SEASHELL appear together 340 times; ORANGE and STICK 89 times; etc.).
Then I need to see if there is any change/deviation to the general results in common appearance when grouped by SUBJECT.
Tools I'm familiar with are Excel and SQL, but I also have PowerBI and Alteryx at my disposal.
Full disclosure: Not homework, or work, but a volunteer project, thus my unfamiliarity with this kind of data manipulation.
Thanks in advance.
An Alteryx solution:
Drag the two .csv files onto your canvas (seen as book1.csv and book2.csv in my picture); Alteryx will create "Input" tools for you.
Drag a "Join" tool on and connect the two .csv files to its inputs; select "ID" as the join field; unselect "Right_ID" as an output, since it's merely a duplicate of "ID".
Drag a "Summarize" tool on and connect the Join tool's output to its input; add all three of the output fields as a "Group By", then add the ID column with a "Count".
Drag a "Browse" tool on and connect the Summarize tool's output to its input.
Run the workflow.
After all that, click on the Browse tool and you should see what is shown in my screenshot (just the first ten rows of output).
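For anyone working through the same thing without Alteryx, here is a rough pandas equivalent of the workflow above. The file names match the question; which three fields go into the Group By is my reading of the steps, so treat this as a sketch rather than an exact translation:

# Rough pandas equivalent of the Alteryx workflow above; file names match the
# question, and the choice of grouping fields is my reading of the steps.
import pandas as pd

book1 = pd.read_csv("book1.csv", dtype={"ID": str})
book2 = pd.read_csv("book2.csv", dtype={"ID": str})

# Join tool: inner join on ID.
joined = book1.merge(book2, on="ID")

# Summarize tool: group by the descriptive fields, count the ID column.
summary = (joined.groupby(["TITLE", "SUBJECT", "ITEM"])["ID"]
                 .count()
                 .reset_index(name="Count_ID"))

# Browse tool: inspect the first ten rows.
print(summary.head(10))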
+1 for taking on a volunteer project - I think anyone who knows data can have a big impact in support of their favourite group or cause.
I would just pull the 2 files into Power BI as 2 separate tables (Get Data / From File). Create a relationship between the 2 tables based on ID (it might get auto-generated). It should be one to many.
Then I would add a Calculated Column to the Book1 table to concatenate the related ITEM values, e.g.:
Items =
CALCULATE (
CONCATENATEX (
DISTINCT ( 'Book2'[ITEM] ),
'Book2'[ITEM],
", ",
'Book2'[ITEM], ASC
)
)
Now you can use that Items field in visuals (e.g. a Table), along with Count of ID to get the frequency.
Adding Subject to a copy of the table (e.g. to the Columns well of a Matrix) will produce your grouped scenario, or you could add a Subject Slicer.
As you will be comparing subsets of varying size, I would change Count of ID to Show value as - % of grand total.
A slightly different solution using Alteryx.
With this dataset, there are very few repeating 3- or 4-item groups. You can do the two-item affinity analysis and get a probability for 3- or 4-item groups, or you can count the 3- and 4-item groups individually. I believe what you want is the latter, as your probability of getting grapes with oranges may be altered by whether or not you have bananas in the cart.
Anyway, I did not join in the subject until after finding all of my combinations. I found all the combinations by taking the Cartesian join of two, then three, then four of the original set. I then removed all duplicates by ensuring items were always in alphabetical order in each row. I then counted occurrences of each combination. More joins can be added in the same pattern to count groups of 5,6,7...
Once you have the counts of occurrences, then I would join back with the subjects and perform this analysis on each group and compare to the overall results.
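A small Python sketch of the counting step, for reference. Using itertools.combinations over each ID's sorted item list gives every grouping in alphabetical order, which covers the duplicate-removal step described above (file and column names follow the question's Book2 example):

# Counting 2- and 3-item groupings per ID; the same idea as the Cartesian-join
# approach above, done with itertools so each combination is already sorted.
from collections import Counter
from itertools import combinations
import csv

# Collect the set of items for each ID from Book2.
items_by_id = {}
with open("book2.csv", newline="") as f:
    for row in csv.DictReader(f):
        items_by_id.setdefault(row["ID"], set()).add(row["ITEM"])

# Count every 2-item and 3-item combination across IDs.
pair_counts = Counter()
triple_counts = Counter()
for items in items_by_id.values():
    ordered = sorted(items)
    pair_counts.update(combinations(ordered, 2))
    triple_counts.update(combinations(ordered, 3))

print(pair_counts.most_common(10))
print(triple_counts.most_common(10))

For the per-SUBJECT comparison, map each ID to its SUBJECT from Book1, keep a separate Counter per subject, and compare those distributions to the overall counts.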
I'm supposed to disclose that I work for Alteryx.
First of all, if you are using Windows, just navigate to the directory that contains the CSV files and run the following command:
copy pattern newFileName.csv
rem example
copy *.csv merged.csv
Now you have created one CSV file (note that copy simply concatenates the files as-is, header rows included). If the file is too large to process in one go, use an appropriate approach for your programming language: in Python you can use generators to process it line by line, or with pandas you can read it chunk by chunk.
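If you go the pandas route, here is a minimal sketch of the chunk-by-chunk pattern, shown against book2.csv from the question; the chunk size is arbitrary and the aggregation is just an illustration of the mechanism:

# Minimal sketch of chunked reading with pandas, shown on book2.csv;
# the chunk size is arbitrary and the aggregation is only an illustration.
import pandas as pd

counts = None
for chunk in pd.read_csv("book2.csv", chunksize=100_000):
    part = chunk.groupby("ITEM").size()
    counts = part if counts is None else counts.add(part, fill_value=0)

print(counts.sort_values(ascending=False).head(10))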
I hope this helps you.
I have what I consider a real need to create a query with several hundred columns.
We are working on a mailing for our client. In this mailing, they are listing out several locations where their customers can go to get information. As our designers create the template for this mailing, they are setting up "Slots" for each address. The number of slots on the mailing varies from one mailing to the other, from 6 to possibly 50.
My need for the query is to set up the merge of data into the mailing. I need to provide a query where each mailing is 1 record containing all the information they need for that mailing. I am dynamically creating the SQL statement with the max number of slots on that mailing. With up to 50 slots on that mailing, my query needs to look like this:
MailingID,
LogoLocation,
APNCode,
TFN,
CopyVersion,
Slot1_Name,
Slot1_Address,
Slot1_City,
Slot1_State,
Slot1_DateTime,
...
Slot50_Name,
Slot50_Address,
Slot50_City,
Slot50_State,
Slot50_DateTime
My first attempt was to create a table with all these fields, but I got this error:
The table has been created, but its maximum row size exceeds the allowed maximum of 8060 bytes. INSERT or UPDATE to this table will fail if the resulting row exceeds the size limit.
They only want the data in a CSV file, so I don't need to create a temp table for it.
My problem is that I'm trying to create a standard process, and with the number of fields varying like that, I want to set this up in a way that won't blow up the system every time we try to run it.
I've looked at a few pages and found details on the size limitations of SQL Server and several comments saying a table like this shows a bad database design.
http://msdn.microsoft.com/en-us/library/ms143432(v=sql.105).aspx
http://social.msdn.microsoft.com/Forums/en-US/fec1efbb-94ff-4fe9-8d69-12e95c48587d/its-maximum-row-size-exceeds-the-allowed-maximum-of-8060-bytes-insert-or-update-to-this-table-will?forum=transactsql
Work around SQL Server maximum columns limit 1024 and 8kb record size
I'm hoping that someone out there has some experience doing this and can share some insights on how to make this efficient. Is there another way to accomplish this that I don't know about?
UPDATE:
Thanks for all the quick replies.
More detail on my scenario. You get a flyer in the mail and when you turn the flyer over, it lists 50 locations in your county where you could go take a class or attend a meeting or something. All the details for that flyer need to be in 1 record so they can map the fields on the one page. If that county has 50 address/date/time combinations, they need them included in the 1 record so they can properly slot the flyer. Think of a giant mail merge where there might only be 100 counties (100 flyers), but each flyer has tons of information.
When the data is actually stored in the database, I'm storing an id for the specific flyer (MailingID) and each address/date/time combo is its own record. It's just the file they need to merge the details onto the creative piece that has to be denormalized like this.
I haven't been able to find any details on limitations on views. Does a View have the same limitations as a table? Would it work to create a view for them that they can download when they need the data?
"All the details for that flyer need to be in 1 record so they can map the fields on the one page"
That is a questionable assumption. Why can't the data be stored in 50 rows in a 2nd table?
Anyway, if you insist on storing everything in one row, you should probably use XML or JSON. That makes all these problems go away. SQL Server has great support for XML. You can even generate XML on the fly. So you could properly store the 50 items in a 2nd table and only combine them into one XML value for query purposes.
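One other way to sidestep the row-size limit entirely: keep the rows normalized in SQL Server and build the wide CSV outside the database at export time. A rough pandas sketch of that idea; the file and column names here (slots.csv, mailings.csv, SlotNumber, and so on) are assumptions for illustration, not anything from the original post:

# Rough sketch of producing the wide merge file outside SQL Server, so no
# 250+ column row is ever stored. File and column names are assumptions.
import pandas as pd

# One row per MailingID + slot, exported from the normalized table.
slots = pd.read_csv("slots.csv")
# Per-mailing fields: MailingID, LogoLocation, APNCode, TFN, CopyVersion.
mailings = pd.read_csv("mailings.csv")

# Pivot so each slot becomes its own set of columns (Slot1_Name, Slot1_Address, ...).
wide = slots.pivot(index="MailingID", columns="SlotNumber",
                   values=["Name", "Address", "City", "State", "DateTime"])
field_order = ["Name", "Address", "City", "State", "DateTime"]
wide.columns = [f"Slot{num}_{field}" for field, num in wide.columns]
wide = wide[sorted(wide.columns,
                   key=lambda c: (int(c.split("_")[0][4:]),
                                  field_order.index(c.split("_", 1)[1])))]

# Attach the per-mailing fields and write the CSV the designers merge from.
merge_file = mailings.merge(wide.reset_index(), on="MailingID", how="left")
merge_file.to_csv("mailing_merge.csv", index=False)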
I was wondering if it is possible to work on a per-row basis in Kettle?
I am trying to implement a reporting scheme which consists of a table where requests get queued for processing, and a Pentaho job that picks up the records in that table.
My job currently has 3 transformations in it:
1st is to get records from the queued requests table
2nd is to analyze the values on each record and come up with multiple results based on that record. For example, a user would request records of movies in the horror genre; it should then spit out the horror movies.
3rd is to further retrieve information about the movies, such as the year, director, etc., which is to be output to an Excel file.
That is the idea, but it's a bit challenging to do in Pentaho as it does everything all at once. Is there a way I can make my job work on records one by one?
EDIT.
Just to add, I have been trying to extend the implementation of the Pentaho cookbook sample, but if I compare it to my design, it's like step 2 and step 3 only.
I can't seem to make the table input step work one at a time.
I just made it act like the implementation in the cookbook, with some adjustments. Instead of using two transformations to gather all the necessary fields, I just retrieved all the information I need in 1 transformation.
Then I copied that information to the next steps, ran some queries to complete the information, and it is now working.
Passing parameters between transformations is a bit confusing: there are parameters to be set on the transformation itself and also on the job where the transformations live, so I kind of went guessing for some time just to make it work.
I am having a problem trying to achieve the following:
I'd like to have a page with 'infinite' scrolling functionality, with all the fetched results sorted by certain attributes. The way the code currently works is: it places the query, sorts the results, and displays them. The problem is that once the user reaches the bottom of the page and a new query is placed, the results from that query are sorted, but only in their own context. That is, if you have a total of 100 results and the first query displays only 50, those 50 are sorted. But the next query (for the next 50) sorts its results based only on those 50, not on the 100 total results.
So, do I have to fetch all the results at once, sort them, and then apply some pagination logic to them or there's a way for MongoDB to actually have infinite scrolling (AJAX requests) with sorting applying to the results?
There are a few ways to do this with MongoDB. You can use the .skip() and .limit() cursor methods (documented here: http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-CursorMethods) to apply pagination to the query.
Alternatively, you could add a clause to your query like {sorted_field : {$gt : <value from last record>}}. In other words, filter out matches whose sorted value is less than or equal to that of the last item on the current page of results. For example, if page 1 of the results returns documents A through D, then to retrieve page 2 you repeat the same query with the additional filter x > D.
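A minimal pymongo sketch of both approaches; the database and collection names, the sort field name ("sorted_field"), and the page size are placeholders for illustration:

# Minimal pymongo sketch of both approaches; names and page size are placeholders.
from pymongo import MongoClient, ASCENDING

coll = MongoClient()["mydb"]["results"]
PAGE_SIZE = 50

# Approach 1: skip/limit. The sort is applied to the whole result set
# server-side, so each page continues where the previous one left off.
def page_by_skip(page_number):
    return list(coll.find()
                    .sort("sorted_field", ASCENDING)
                    .skip(page_number * PAGE_SIZE)
                    .limit(PAGE_SIZE))

# Approach 2: range ("seek") pagination. Pass the sort value of the last
# document already shown and fetch only documents beyond it.
# Assumes the sort values are unique (otherwise add _id as a tiebreaker).
def page_after(last_value=None):
    query = {} if last_value is None else {"sorted_field": {"$gt": last_value}}
    return list(coll.find(query)
                    .sort("sorted_field", ASCENDING)
                    .limit(PAGE_SIZE))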
Let me preface this by saying that I have no experience with MongoDB (though I am aware that it is a NoSQL database).
This question, however, is somewhat of a general database one (you'd probably get more responses tagging it as such). I've implemented such a feature using Cassandra (another, albeit quite different, NoSQL database), but the same principles apply.
Use the sorted-by attribute of the last retrieved record, and conduct a range search based on it in the database. So, assuming your database consists of the following set of letters:
A
B
C
D
E
F
G
...and you were retrieving 2 letters at a time, you'd retrieve A and B first. When more records are needed, you'd use B to conduct a range search on the set of letters in the database. In plain English this would be something like:
Get the letters that appear after B, limit the results to 2
From a brief look at the MongoDB tutorial, it looks like you have conditional operators to help you implement this.
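With pymongo, that "letters after B" query would look roughly like this; the collection and field names are invented for the letters example:

# pymongo version of "get the letters that appear after B, limit the results to 2";
# the collection and field names are invented for this letters example.
from pymongo import MongoClient, ASCENDING

letters = MongoClient()["demo"]["letters"]

next_two = list(letters.find({"letter": {"$gt": "B"}})
                       .sort("letter", ASCENDING)
                       .limit(2))
# -> the documents for C and D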