Merge two CSVs and collate data - sql
I have two CSV files, the first like so:
Book1:
ID,TITLE,SUBJECT
0001,BLAH,OIL
0002,BLAH,HAMSTER
0003,BLAH,HAMSTER
0004,BLAH,PLANETS
0005,BLAH,JELLO
0006,BLAH,OIL
0007,BLAH,HAMSTER
0008,BLAH,JELLO
0009,BLAH,JELLO
0010,BLAH,HAMSTER
0011,BLAH,OIL
0012,BLAH,OIL
0013,BLAH,OIL
0014,BLAH,JELLO
0015,BLAH,JELLO
0016,BLAH,HAMSTER
0017,BLAH,PLANETS
0018,BLAH,PLANETS
0019,BLAH,HAMSTER
0020,BLAH,HAMSTER
And then a second CSV with items associated with the first list, with ID being the common attribute between the two.
Book2:
ID,ITEM
0001,PURSE
0001,STEAM
0001,SEASHELL
0002,TRUMPET
0002,TRAMPOLINE
0003,PURSE
0003,DOLPHIN
0003,ENVELOPE
0004,SEASHELL
0004,SERPENT
0004,TRUMPET
0005,CAR
0005,NOODLE
0006,CANNONBALL
0006,NOODLE
0006,ORANGE
0006,SEASHELL
0007,CREAM
0007,CANNONBALL
0007,GUM
0008,SERPENT
0008,NOODLE
0008,CAR
0009,CANNONBALL
0009,SERPENT
0009,GRAPE
0010,SERPENT
0010,CAR
0010,TAPE
0011,CANNONBALL
0011,GRAPE
0012,ORANGE
0012,GUM
0012,SEASHELL
0013,NOODLE
0013,CAR
0014,STICK
0014,ORANGE
0015,GUN
0015,GRAPE
0015,STICK
0016,BASEBALL
0016,SEASHELL
0017,CANNONBALL
0017,ORANGE
0017,TRUMPET
0018,GUM
0018,STICK
0018,GRAPE
0018,CAR
0019,CANNONBALL
0019,TRUMPET
0019,ORANGE
0020,TRUMPET
0020,CHERRY
0020,ORANGE
0020,GUM
The real datasets are millions of records, so I'm sorry in advance for my simple example.
The problem I need to solve is getting the data merged and collated in a way that shows which item groupings most commonly appear together on the same ID (e.g. GRAPE, GUM, SEASHELL appear together 340 times; ORANGE and STICK 89 times; etc.).
Then I need to see if there is any change/deviation to the general results in common appearance when grouped by SUBJECT.
Tools I'm familiar with are Excel and SQL, but I also have PowerBI and Alteryx at my disposal.
Full disclosure: this is not homework or work, but a volunteer project, hence my unfamiliarity with this kind of data manipulation.
Thanks in advance.
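Since SQL is among the tools the question mentions, here is a minimal sketch of the pair counting, assuming the two CSVs are loaded into tables book1(ID, TITLE, SUBJECT) and book2(ID, ITEM):

-- Count how often each unordered pair of items shares an ID;
-- a.ITEM < b.ITEM keeps one row per pair and excludes self-pairs.
SELECT a.ITEM AS item1, b.ITEM AS item2, COUNT(*) AS pair_count
FROM book2 a
JOIN book2 b ON b.ID = a.ID AND a.ITEM < b.ITEM
GROUP BY a.ITEM, b.ITEM
ORDER BY pair_count DESC;

-- The same counts broken out by SUBJECT, to spot deviations:
SELECT b1.SUBJECT, a.ITEM AS item1, b.ITEM AS item2, COUNT(*) AS pair_count
FROM book2 a
JOIN book2 b ON b.ID = a.ID AND a.ITEM < b.ITEM
JOIN book1 b1 ON b1.ID = a.ID
GROUP BY b1.SUBJECT, a.ITEM, b.ITEM
ORDER BY pair_count DESC;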
An Alteryx solution:
Drag the two .csv files onto your canvas (seen as book1.csv and book2.csv in my picture); Alteryx will create "Input" tools for you.
Drag a "Join" tool on and connect the two .csv files to its inputs; select "ID" as the join field; deselect "Right_ID" in the output, since it's merely a duplicate of "ID".
Drag a "Summary" tool on and connect the Join tool's output to the Summary tool's input; select all three output fields and add them as a "group by", then add the ID column with a "count".
Drag a "Browse" tool on and connect the Summary tool's output to the Browse tool's input.
Run the workflow.
After all that, click on the Browse tool and you should see what my screenshot shows (just the first ten rows of output).
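For comparison, the SQL equivalent of this join-and-summarize, assuming the files are loaded as tables book1 and book2, is roughly:

-- Join on ID, group by the descriptive fields, count the IDs.
SELECT b1.TITLE, b1.SUBJECT, b2.ITEM, COUNT(b1.ID) AS id_count
FROM book1 b1
JOIN book2 b2 ON b2.ID = b1.ID
GROUP BY b1.TITLE, b1.SUBJECT, b2.ITEM;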
+1 for taking on a volunteer project - I think anyone who knows data can have a big impact in support of their favourite group or cause.
I would just pull the 2 files into Power BI as 2 separate tables (Get Data / From File). Create a relationship between the 2 tables based on ID (it might get auto-generated); it should be one-to-many.
Then I would add a Calculated Column to the Book1 table to concatenate the related ITEM values, e.g.:
Items =
CALCULATE (
    CONCATENATEX (
        DISTINCT ( 'Book2'[ITEM] ),
        'Book2'[ITEM],
        ", ",
        'Book2'[ITEM], ASC
    )
)
Now you can use that Items field in visuals (e.g. a Table), along with Count of ID to get the frequency.
Adding Subject to a copy of the table (e.g. to the Columns well of a Matrix) will produce your grouped scenario, or you could add a Subject Slicer.
As you will be comparing subsets of varying size, I would change Count of ID to Show value as - % of grand total.
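A rough SQL analogue of this concatenate-then-count approach, assuming a dialect with STRING_AGG (e.g. SQL Server 2017+):

-- Build each ID's alphabetized item list, then count identical lists.
SELECT items, COUNT(*) AS id_count
FROM (
    SELECT ID,
           STRING_AGG(ITEM, ', ') WITHIN GROUP (ORDER BY ITEM) AS items
    FROM book2
    GROUP BY ID
) AS per_id
GROUP BY items
ORDER BY id_count DESC;

Sorting the items inside each list makes the grouping key deterministic, mirroring the ASC sort in CONCATENATEX.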
A little different solution using Alteryx.
With this dataset, there are very few repeating 3- or 4-item groups. You can do the two-item affinity analysis and estimate a probability for 3- or 4-item groups, or you can count the 3- and 4-item groups individually. I believe what you want is the latter, as your probability of getting grapes with oranges may be altered by whether you have bananas in the cart or not.
Anyway, I did not join in the subject until after finding all of my combinations. I found all the combinations by taking the Cartesian join of two, then three, then four copies of the original set. I then removed all duplicates by ensuring items were always in alphabetical order in each row, and counted occurrences of each combination. More joins can be added in the same pattern to count groups of 5, 6, 7...
Once you have the counts of occurrences, I would join back to the subjects, perform the same analysis on each group, and compare to the overall results.
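The same alphabetical-order trick can be sketched in SQL for three-item groups, assuming a table book2(ID, ITEM):

-- a.ITEM < b.ITEM < c.ITEM keeps each combination exactly once,
-- already in alphabetical order, with no duplicates to remove.
SELECT a.ITEM AS item1, b.ITEM AS item2, c.ITEM AS item3,
       COUNT(*) AS group_count
FROM book2 a
JOIN book2 b ON b.ID = a.ID AND a.ITEM < b.ITEM
JOIN book2 c ON c.ID = a.ID AND b.ITEM < c.ITEM
GROUP BY a.ITEM, b.ITEM, c.ITEM
ORDER BY group_count DESC;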
I'm supposed to disclose that I work for Alteryx.
First of all, if you are using Windows, just navigate to the directory which contains the CSVs and run the following command:

copy pattern newFileName.csv

For example:

copy *.csv merged.csv

Now you have created one CSV file. The file is too large now to process at once, but depending on your programming language you can use an appropriate approach: in Python you can use generators to process it line by line, or with pandas you can read it chunk by chunk.
I hope this helps.
Related
Is it possible to use SQL to show the average of some values in one column and then in subsequent columns display the individual values?
I have a bunch of data and I want the output to display an average of all the data points but also the individual data points in subsequent columns. Ideally it would look something like this:

Compound | Subject | Avg datapoint | Datapoint Exp 1 | Datapoint Exp 2 | ...
XYZ      | ABC     | 40            | 20              | 60              | ...
TUV      | ABC     | 30            | 20              | 40              | ...
TUV      | DEF     | 20            | 10              | 30              | ...

One problem I'm running into is that I get repetitive lines of information. Another is that I have some rows pulling in info that doesn't apply, such that some of the individual datapoints in, say, row 2 would have info from subject DEF when I only want it to have info from subject ABC. I hope this makes sense! I'm currently using an inner join with a ton of WHERE qualifiers. I'm close but not quite there. Any help is appreciated, and let me know if I can provide additional info to help you help me.
The SQL language has a very strict rule requiring you to know the exact number of columns for your result set in advance, before looking at any data in your tables. Therefore, if this average is based on a known, fixed number of columns, or if the number of potential columns is reasonably small so that you can manually set up placeholders, then this will be possible. The key search term for learning how to do this is "conditional aggregation", where you may also need to join the table to itself for each field. Otherwise, you will need to pivot and aggregate your data in your client code or reporting tool.
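A minimal sketch of conditional aggregation, using hypothetical names (a datapoints table with compound, subject, experiment and value columns):

-- One CASE expression per placeholder column; AVG fills the summary column.
SELECT compound,
       subject,
       AVG(value) AS avg_datapoint,
       MAX(CASE WHEN experiment = 1 THEN value END) AS datapoint_exp1,
       MAX(CASE WHEN experiment = 2 THEN value END) AS datapoint_exp2
FROM datapoints
GROUP BY compound, subject;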
How to populate all possible combination of values in columns, using Spark/normal SQL
I have a scenario where my original dataset looks like below.

Data:

Country,Commodity,Year,Type,Amount
US,Vegetable,2010,Harvested,2.44
US,Vegetable,2010,Yield,15.8
US,Vegetable,2010,Production,6.48
US,Vegetable,2011,Harvested,6
US,Vegetable,2011,Yield,18
US,Vegetable,2011,Production,3
Argentina,Vegetable,2010,Harvested,15.2
Argentina,Vegetable,2010,Yield,40.5
Argentina,Vegetable,2010,Production,2.66
Argentina,Vegetable,2011,Harvested,15.2
Argentina,Vegetable,2011,Yield,40.5
Argentina,Vegetable,2011,Production,2.66
Bhutan,Vegetable,2010,Harvested,7
Bhutan,Vegetable,2010,Yield,35
Bhutan,Vegetable,2010,Production,5
Bhutan,Vegetable,2011,Harvested,2
Bhutan,Vegetable,2011,Yield,6
Bhutan,Vegetable,2011,Production,3

Now there is a very small country lookup table which lists all possible countries the source data can come with (US, Argentina, Bhutan, India, Nepal and Bangladesh, per the expected output below).

I want the output data's number of columns to always be fixed (this is to ensure the reporting/visualisation tool doesn't get a dynamic number of columns with every day's new source-data ingestion, depending on the varying number of distinct countries present). So I have to somehow join the source data with the country_lookup CSV and populate all those columns, with F as the default value. Every country column would be binary, with T or F being the possible values.

The original dataset above has to be converted into the below. (I've kept the Amount field for rows of Type "Derived Yield" as unsolved formulae rather than calculating them, for better understanding and so you can match them against the formula further down.)

Country,Commodity,Year,Type,Amount,US,Argentina,Bhutan,India,Nepal,Bangladesh
US,Vegetable,2010,Harvested,2.44,T,F,F,F,F,F
US,Vegetable,2010,Yield,15.8,T,F,F,F,F,F
US,Vegetable,2010,Production,6.48,T,F,F,F,F,F
US,Vegetable,2010,Derived Yield,(2.44+15.2)/(6.48+2.66),T,T,F,F,F,F
US,Vegetable,2010,Derived Yield,(2.44+7)/(6.48+5),T,F,T,F,F,F
US,Vegetable,2010,Derived Yield,(2.44+15.2+7)/(6.48+2.66+5),T,T,T,F,F,F
US,Vegetable,2011,Harvested,6,T,F,F,F,F,F
US,Vegetable,2011,Yield,18,T,F,F,F,F,F
US,Vegetable,2011,Production,3,T,F,F,F,F,F
US,Vegetable,2011,Derived Yield,(6+10)/(3+9),T,T,F,F,F,F
US,Vegetable,2011,Derived Yield,(6+2)/(3+3),T,F,T,F,F,F
US,Vegetable,2011,Derived Yield,(6+10+2)/(3+9+3),T,T,T,F,F,F
Argentina,Vegetable,2010,Harvested,15.2,F,T,F,F,F,F
Argentina,Vegetable,2010,Yield,40.5,F,T,F,F,F,F
Argentina,Vegetable,2010,Production,2.66,F,T,F,F,F,F
Argentina,Vegetable,2010,Derived Yield,(2.44+15.2)/(6.48+2.66),T,T,F,F,F,F
Argentina,Vegetable,2010,Derived Yield,(15.2+7)/(2.66+5),F,T,T,F,F,F
Argentina,Vegetable,2010,Derived Yield,(2.44+15.2+7)/(6.48+2.66+5),T,T,T,F,F,F
Argentina,Vegetable,2011,Harvested,10,F,T,F,F,F,F
Argentina,Vegetable,2011,Yield,90,F,T,F,F,F,F
Argentina,Vegetable,2011,Production,9,F,T,F,F,F,F
Argentina,Vegetable,2011,Derived Yield,(6+10)/(3+9),T,T,F,F,F,F
Argentina,Vegetable,2011,Derived Yield,(10+2)/(9+3),F,T,T,F,F,F
Argentina,Vegetable,2011,Derived Yield,(6+10+2)/(3+9+3),T,T,T,F,F,F
Bhutan,Vegetable,2010,Harvested,7,F,F,T,F,F,F
Bhutan,Vegetable,2010,Yield,35,F,F,T,F,F,F
Bhutan,Vegetable,2010,Production,5,F,F,T,F,F,F
Bhutan,Vegetable,2010,Derived Yield,(2.44+7)/(6.48+5),T,F,T,F,F,F
Bhutan,Vegetable,2010,Derived Yield,(15.2+7)/(2.66+5),F,T,T,F,F,F
Bhutan,Vegetable,2010,Derived Yield,(2.44+15.2+7)/(6.48+2.66+5),T,T,T,F,F,F
Bhutan,Vegetable,2011,Harvested,2,F,F,T,F,F,F
Bhutan,Vegetable,2011,Yield,6,F,F,T,F,F,F
Bhutan,Vegetable,2011,Production,3,F,F,T,F,F,F
Bhutan,Vegetable,2011,Derived Yield,(2.44+7)/(6.48+5),T,F,T,F,F,F
Bhutan,Vegetable,2011,Derived Yield,(10+2)/(9+3),F,T,T,F,F,F
Bhutan,Vegetable,2011,Derived Yield,(6+10+2)/(3+9+3),T,T,T,F,F,F

Formula for populating the Amount field for the Derived type: Derived Amount = the sum of Harvested of all countries with T (True), grouped by the Year and Commodity columns, divided by the sum of Production of all countries with T (True), grouped by the Year and Commodity columns. So the target is to take every combination of the countries from the source, calculate the sums of the respective Harvested and Production values, and then divide them. There can be more than one commodity for any given country in the actual scenario, but that should not matter, as the summation of Amount happens grouped by Commodity and Year.

Note: the users in the frontend can select any combination of countries. The sole purpose of doing this in the backend, rather than dynamically in the frontend, is that AWS QuickSight (our visualisation tool), even though it can populate sums on selected column filters, doesn't yet support calculations on those derived summed fields. Hence the entire calculation for every combination of countries has to be pre-populated (a very naive approach) in order to make it available in the report on the users' dynamic selection of countries. Also, if you have any better approach than the naive one mentioned in the note, you are most welcome to guide me; I've also posted a question on the same problem, without writing my expected approach, for experts to show the path to solving this kind of problem better. Any help shall be greatly acknowledged.
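For the single-country rows alone (not the derived-yield combinations, which are the harder part), the fixed T/F columns can be sketched in Spark SQL with one CASE expression per lookup country; the table name source_data and the column names are assumptions based on the question:

SELECT d.Country, d.Commodity, d.Year, d.Type, d.Amount,
       -- One fixed output column per row of the country lookup table:
       CASE WHEN d.Country = 'US'         THEN 'T' ELSE 'F' END AS US,
       CASE WHEN d.Country = 'Argentina'  THEN 'T' ELSE 'F' END AS Argentina,
       CASE WHEN d.Country = 'Bhutan'     THEN 'T' ELSE 'F' END AS Bhutan,
       CASE WHEN d.Country = 'India'      THEN 'T' ELSE 'F' END AS India,
       CASE WHEN d.Country = 'Nepal'      THEN 'T' ELSE 'F' END AS Nepal,
       CASE WHEN d.Country = 'Bangladesh' THEN 'T' ELSE 'F' END AS Bangladesh
FROM source_data d;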
Using MS Access to obtain data across linked tables
I'm new to MS Access and am trying to speed up a data-gathering process that is taking forever in PowerShell. In PowerShell I have 10 or so web API calls to get data, and each comes back as an object with multiple properties (fields). Each set of data has fields related to 1 or more of the other sets of data. Getting the data is very quick, but piping an array of objects through Where-Object to Select-Object takes over an hour, and there's really not that much data. Each object contains 500-1500 "records" and 5 to 10 "fields", so I thought: why not export that data and use something that's intended to search through data to do the job? I exported each object as a separate .CSV file.

So enter MS Access. I imported each of the CSVs as a separate table (easy enough). I'm going to simplify this down for this example to the following 3 tables: https://i.stack.imgur.com/UCH1F.jpg

Every table has fields that relate it to other tables. Pretty much there's some sort of Id field in every table that is related to another Id field in a different table, from which I need to pull a field called "name". I'm trying to follow the breadcrumbs from the Player name to its Network name, to its Application name, to its Layout name, etc. I want to build a query that I could eventually just export as an Excel file. I'd also prefer to just write out the SQL, unless it's really easier to understand the visual query builder.

I'm looking to build a sheet with the following information: Player's Name would include all names from the Players table, and getting just that data makes sense to me:

SELECT Name AS PlayerName FROM Players

Everything else, not so much. I feel like this will end up being some mega-query as I get deeper into related table after related table. In Excel it would be straightforward using VLOOKUPs across tabs, but that doesn't seem to be the best approach. Given the info above, I'm trying to achieve the output shown in the "Result table" image. Any help with strategy and syntax greatly appreciated!
You're looking for the JOIN clause. Note that Access SQL requires AS for column aliases and parentheses around each additional join:

SELECT Players.Name AS PlayerName,
       Networks.Name AS PlayerNetwork,
       Applications.Name AS ApplicationName
FROM (Players
LEFT JOIN Networks ON Networks.ID = Players.NetworkId)
LEFT JOIN Applications ON Applications.Id = Players.ApplicationID;

Add one more LEFT JOIN (and one more level of parentheses) per additional lookup table, following the same pattern.
How to fetch data for a news-feed-like system?
I have a few tables, as shown below.

Polls
PollId  Question  Option
1       What      1
2       Why       4

Updates
UpdateId  Text
1         Sleep
2         Play

Polls and Updates are just two sample tables (in reality there are more tables, like photos, videos, links, etc.). When a user visits his home page (like the Facebook news feed) he must be shown data relevant to him (no such data is included in this example). That is, I want to select data from all the tables with a small number of query executions, presenting a mixture of data: polls, photos, videos, etc. Currently I'm fetching only the ids and the type (i.e. which table) from all of the tables, and gathering further data while iterating through this result set (i.e. calling another SqlQuery from C#). Is there a way to query the data from all the tables at once (OUTER JOIN? UNION?)? Or simply: how can I select different types of entities at once in a single SQL query?
You could write your query so that you have one long select list for everything you want, and it all comes back in one result set, but I suspect that wouldn't work too well, because you might have varying numbers of the different types of items per user. If you really must have it all in one hit, then you can issue multiple queries in one go and get multiple result sets back. To handle this you can use an ADO.NET DataSet. See this SO example (but not the accepted answer; see Vikram Dibyal's answer, as that gives a very basic overview of what I think you're asking for). I won't copy and paste the stuff from the linked thread, just head over and take a look.
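As a sketch of that multiple-queries-in-one-go idea, the batch itself is just consecutive SELECTs (the ItemType discriminator column is an assumption, not from the question):

-- One batch, two result sets; the client (e.g. an ADO.NET DataSet)
-- reads each SELECT into its own table.
SELECT PollId AS Id, Question AS Text, 'poll' AS ItemType FROM Polls;
SELECT UpdateId AS Id, Text, 'update' AS ItemType FROM Updates;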
How can I summarize and reuse a complex dataset
How can I re-use a single complex dataset across a number of tables? The dataset has a number of computed columns that need to be reported both in detail and in summary. Here's a very simplified example dataset:

is_food  sale_association  food_type  total_sold  total_associations  percent_total
1        Before Movie      Popcorn    50          3                   x BirtMath.safeDivide(...)
0        Before Movie      Soda       10          2                   x BirtMath.safeDivide(...)
1        During Movie      Jujubee    10          1                   x BirtMath.safeDivide(...)
0        After Movie       Soda       15          2                   x BirtMath.safeDivide(...)

From this one dataset, I'd want to create a detailed summary of all food types while rolling up non-food (using the 'is_food' column), another summary of all food types, another detailed summary of food with rolled-up non-food by sale_association, etc. The report would also contain a number of percentages (6 in the most complex table) that need to be calculated (some across a row, others across all rows in a given group), all of which can have a zero value for the denominator and so need to be guarded with safeDivide (which is a PITA to do in the source SQL query, which is itself doing aggregation -- checking for divide-by-zero when both the numerator and denominator are sums leads to hairy queries).

Obviously I could do this by focusing the SQL query as appropriate, but it seems like a waste of time and effort to create 12 or 15 queries that are very similar when I've already managed to create the monster query for the most detailed table. What doesn't seem straightforward is how to perform the rollups in a table. I managed to hack something together by hiding rows that would later be summed up (e.g. "is_food == 0" in the example) and then creating custom data bindings that are displayed in a footer row. Not only does it feel like a hack, it also interferes with the ability to naturally order rows. Again, going back to the example: if I were ordering by total_sold and summarizing rows with is_food == 0, the natural order should be Popcorn, Non-food, Jujubee. There's nothing in the BIRT wiki about this, nor does "BIRT: A Field Guide, 3rd Ed." really delve into the topic.
This seems like a fairly open-ended question (although I agree that re-using a single dataset makes much more sense than having multiple queries retrieving the same data in slightly different ways). A few general suggestions:

Use the most detailed version of the data required as a common dataset for each BIRT report item (typically BIRT tables).
Where summary-only reporting is required, add groups to the BIRT table at the desired level, add data items as required to the group headers/footers, and delete the detail-level row(s) from the BIRT table.
Where detail-level reporting is required in some cases (e.g. for food items but not for non-food items), add groups to the BIRT table as above, set the visibility of the detail row (in Property Editor - Properties - Visibility) to Hide Element, then specify the appropriate expression to suppress the non-required rows (non-food items, in this example).
Aggregations (i.e. summary expressions) can be added to tables by selecting the whole table, selecting the Binding tab within the Property Editor and clicking the Add Aggregation... button.