I store scraped content in a CSV file.
Each row contains a unique ID and a description of an item.
The ID comes from the website I scrape; it is not generated on the scraper side.
I use Scrapy's feed exporter to generate the CSV file.
When I scrape the website again, I would like my script to check whether the unique ID is already stored in the CSV file: if it's not, add the new row; if it is, just move on to the next item.
As I assume this is a classic thing to do with a scraping framework, I believe there must be a smart way to do it with Scrapy; however, I can't find anything on this topic in Scrapy's documentation.
Should I simply open the CSV file, go through each stored item, and add a new row if the ID is not present, or skip it if it is?
One solution might be to create an empty HashMap. On the first scan, put your items in the HashMap, keyed by ID. HashMaps do not hold duplicate keys, so on your second scan, look up the key: if it exists, move on to the next item; if it doesn't, add it.
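In Scrapy terms, the natural place for that check is an item pipeline. Here is a minimal sketch, assuming the feed file is named items.csv and the ID field is called id (both are assumptions; adjust them to your spider):

import csv
import os

from scrapy.exceptions import DropItem

class DuplicateIDPipeline:
    # Reads the IDs already present in the existing feed when the spider
    # starts, then drops any item whose ID was already exported.
    def open_spider(self, spider):
        self.seen_ids = set()
        if os.path.exists("items.csv"):  # assumed feed file name
            with open("items.csv", newline="") as f:
                for row in csv.DictReader(f):
                    self.seen_ids.add(row["id"])  # assumed ID field name

    def process_item(self, item, spider):
        if item["id"] in self.seen_ids:
            raise DropItem("already exported: %s" % item["id"])
        self.seen_ids.add(item["id"])
        return item

Enable it through the ITEM_PIPELINES setting; dropped items never reach the feed exporter, so only new rows get written (assuming your feed is configured to append rather than overwrite).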
I'm new to PDI and Kettle, and what I thought was a simple experiment to teach myself some basics has turned into a lot of frustration.
I want to check a database to see if a particular record exists (i.e. a vendor). I would like to get the name of the vendor by reading a flat file (.CSV).
My first hurdle is selecting only the vendor name from the 8 fields in the CSV.
The second hurdle is how to use that vendor name as a variable in a database query.
My third issue is what type of step to use for the database lookup.
I tried a dynamic SQL query, but I couldn't determine how to build the query using a variable, nor how to pass the desired value to that variable.
The database table (VendorRatings) has 30 fields, one of which is vendor. The CSV also has 8 fields, one of which is also vendor.
My best effort was to use a dynamic query:
SELECT * FROM VENDORRATINGS WHERE VENDOR = ?
How do I programmatically assign the desired value to "?" in the query? Specifically, how do I link the output of a specific field from Text File Input to the "vendor = ?" SQL query?
The best practice here is a stream lookup. For each record in the main flow (VendorRatings), look up the vendor details (the lookup fields) in the reference file (the CSV), based on its identifier (possibly its number, or name, or firstname+lastname).
First "hurdle" : Once the path of the csv file defined, press the Get field button.
It will take the first line as header to know the field names and explore the first 100 (customizable) record to determine the field types.
If the name is not on the first line, uncheck the Header row present, press the Get field button, and then change the name on the panel.
If there is more than one header row or other complexities, use the Text file input.
The same is valid for the lookup step: use the Get lookup field button and delete the fields you do not need.
Because there is at most one VendorRatings row per vendor, and because you have to do something if there is no match, I suggest the following flow:
Read the CSV, and for each row look up in the table (i.e., the lookup table is the SQL table rather than the CSV file), setting a default upon no match. I suggest something really visible, like "--- NO MATCH ---".
Then, in case of no match, a filter redirects the flow to the alternative action (here: insert into the SQL table). The two flows are then merged back into the downstream flow.
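Outside PDI, the same lookup-with-default-then-insert logic, and the way a value gets bound to the "?" placeholder, can be sketched in a few lines of Python (sqlite3 is used purely for illustration; the table and column names come from the question, while the file names are assumptions):

import csv
import sqlite3

conn = sqlite3.connect("ratings.db")  # hypothetical database file
cur = conn.cursor()

with open("vendors.csv", newline="") as f:  # hypothetical CSV path
    for row in csv.DictReader(f):
        vendor = row["vendor"]  # keep only the vendor field out of the 8
        # The driver binds the value to the "?" placeholder for us:
        cur.execute("SELECT * FROM VENDORRATINGS WHERE VENDOR = ?", (vendor,))
        if cur.fetchone() is None:
            # No match: take the alternative action (here, an insert).
            cur.execute("INSERT INTO VENDORRATINGS (VENDOR) VALUES (?)", (vendor,))

conn.commit()
conn.close()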
I am creating a data-merge document in InDesign.
There are various tables I've created which only show as many rows as there is actual data in the fields, through some creative table and cell styles.
Now I've been asked to have an entirely separate table show only if there is information in any of those fields.
I'm at a total loss. With the way the current structure is set up, I can cause it to not display any text, but it still shows empty header cells and one line of empty row cells.
[Screenshot: pre-data-merge, with the data fields]
[Screenshot: post-data-merge, with the resulting empty cells]
Any creative ideas to hide that table? I was thinking there might be a way to hide the entire text frame, if not the table. Maybe a script? I tried one that deletes blank tables, but it didn't seem to work after the data merge was run.
I am not sure you can get that level of processing with InDesign's data merge. You could consider a script to remove those tables after the merge, or a dedicated plugin such as EasyCatalog, which can take care of such empty items natively.
I am planning to use docx4j for search and replace in a template. I'd like to create a page for each member in a list; basically, I need to replicate the same page from the template. I have done simple search and replace, but this case is a little more complex, and I need some sample examples. Here is my requirement:
I have a docx template whose content contains placeholders.
There is a table with 3 columns in it, and I need to replace each column with different values, like first name, last name, etc. The number of rows may vary anywhere from one to 200, so technically the table may go beyond one page. If it exceeds one page, I need the table header to repeat on the next page too.
I want to copy the same template onto every page and replace the placeholders; basically, create a single document with multiple pages, one page per member.
Please provide me with the example.
Appreciate the help.
Thanks.
We get weekly data files (flat files) from our vendor to import into SQL, and at times the column names change or new columns are added.
What we currently have is an SSIS package that imports the columns that have been defined. Since we've assigned the mapping, SSIS only throws an error when a column is absent. However, when a new column is added (apart from the existing ones), it doesn't get imported at all, as it is not mapped. This is a concern for us.
What we'd like is to get the list of all the columns present in the flat file so that we can check whether any new columns are present before we import the file.
I am relatively new to SSIS, so detailed help would be much appreciated.
Thanks!
Exactly how to code this will depend on the rules for the flat file layout, but I would approach it by writing a script task that reads the flat file using the file system object and a StreamReader object, and looks at the columns, which are hopefully named in the first line of the file.
However, about all you can do if the columns have changed is send an alert. I know of no way to dynamically change your data transformation task to accommodate new columns; it will have to be edited to handle them. And frankly, if all you're going to do is send an alert, you might as well just use the error handler to do it, and save yourself the trouble of pre-reading the column list.
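Inside SSIS the script task would be C# or VB.NET, but the core check is simple; here is a sketch of the logic in Python (the file path and expected column list are assumptions):

EXPECTED_COLUMNS = ["VendorID", "VendorName", "Rating"]  # hypothetical list

with open("weekly_feed.csv") as f:  # hypothetical file path
    header = f.readline().strip()
actual_columns = header.split(",")

if actual_columns != EXPECTED_COLUMNS:
    added = set(actual_columns) - set(EXPECTED_COLUMNS)
    missing = set(EXPECTED_COLUMNS) - set(actual_columns)
    # About all you can do at this point is alert; the data flow's
    # column mapping cannot be changed at run time.
    print("Header changed. Added: %s, missing: %s" % (sorted(added), sorted(missing)))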
I agree with the answer provided by @TabAlleman: SSIS can't natively handle dynamic columns (and neither can your SQL destination).
May I propose an alternative? You can detect a change in headers without using a C# script task. One way to do this is to create a flat file connection that reads the entire row as a single column. Use a Conditional Split to discard anything other than the header row. Save that row to a Recordset object. Any change? Send email.
The "Get Header Row" data flow is just those pieces: flat file source, Conditional Split, and Recordset destination, with a Row Number transformation if needed.
At the control flow level, use a Foreach Loop with an ADO enumerator over the Recordset object to assign the header row value to an SSIS variable, CurrentHeader.
From there, the precedence constraints (the fx icons) with the expressions
@[User::ExpectedHeader] == @[User::CurrentHeader]
@[User::ExpectedHeader] != @[User::CurrentHeader]
determine whether you load the data or send the email.
Hope this helps!
I have worked for banking clients, and for banks, randomly adding columns to a database is not possible due to federal requirements and rules. That said, I get that yours is not a federally regulated business, so here are some steps.
This is not a code issue but more one of soft skills and working with other teams (yours and your vendor's).
Steps you can take are:
(1) Agree on a solid column structure that you always require, because for newer columns, older data rows will carry NULL.
(2) If a new column is going to be sent by the vendor, you or your team need to make the DDL/DML changes to the table where the data will be inserted, of course with the correct data type.
(3) Document this change in the data dictionary, as over time you or another team member will do analysis on this data and will want to know what each attribute or column is used for.
(4) Long-term, you do not want to keep changing the table structure monthly because one of your many vendors decided to change the format in which they send you data. Some clients push back very aggressively, others not so much.
If a third-party tool is an option for you, check out CozyRoc's Data Flow Task Plus. It handles variable columns in sources.
SSIS cannot make the columns dynamic.
One thing I always do is use a script task to read the first and last lines of a file.
If the first line is not the expected list of CSV columns, I mark the file as errored and continue/fail as required.
Headers are obviously important, but so are footers. A file can, through any unknown issue, be partially built. Requesting that the header also be placed at the end of the file gives you a double check.
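As a sketch of that double check (Python for illustration only; in SSIS this would again be a script task, and the vendor repeating the header as the file's last line is the convention described above):

def file_is_complete(path, expected_header):
    with open(path) as f:
        lines = f.read().splitlines()
    if not lines or lines[0] != expected_header:
        return False  # missing or unexpected header
    # A trailing copy of the header shows the file was not truncated
    # while being built.
    return lines[-1] == expected_header

print(file_is_complete("weekly_feed.csv", "VendorID,VendorName,Rating"))  # hypothetical path and header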
I also do not know if SSIS can do this dynamically, but it never ceases to amaze me how people add/change order of columns and assume things will still work.
1. SSIS does not provide dynamic source and destination mapping, but some third-party components, such as Data Flow Task Plus, support this feature.
2. We can achieve the check using an SSIS script task.
3. If the header is correct, proceed with the migration; otherwise, fail the package before the DFT executes.
4. Read the header line using the script task and store it in an array or list object.
5. Then compare those array values to user-defined variables, declared earlier, that contain the expected column names as their default values.
6. If the values match exactly, proceed; otherwise, fail the package.
I have a pretty complex VBA plugin for Word that automatically creates a report for me, using XML input and cycling through the X objects within the report to create the output. It is currently embedded in a Word template file (.DOCM).
I need to insert into the report a static list of text, based on the name of the item within the XML. For example, within my XML I have entries named BLAH1, BLAH2, BLAH3. Every time I see BLAH1, I need to match it with the static INSERT1, BLAH2 with INSERT2, etc.
This seems simple enough, but herein lies the problem...
It appears there are no HashMaps in VBA without external libraries, which I can't really rely on, since I can't install anything on the machines where this will run. As a result, I can't store this reference data in a HashMap as far as I can tell.
I can't seem to concatenate more than about 20 continued lines of strings without hitting a maximum in VBA, so building one big chunk of text and parsing it for what I need won't work either: there are about 1500 "lines" in my reference data, which greatly exceeds 20.
I also haven't found a way to embed a text file, or any other type of file, within the document to hold this information and then parse it.
I really would like to have everything within the single template file, without requiring additional text or other files to be bundled with the document. If there is no other option, I will go that route, but I wanted to see what creative ideas people at Stack Overflow might have first ;-)
Have you considered using Word's document variables? They are name/value pairs stored invisibly within the document. Use ActiveDocument.Variables("BLAH1").Value = "INSERT1" to create one, and Debug.Print ActiveDocument.Variables("BLAH1").Value to retrieve a value (you have to use an error handler to detect non-existent names if you go that route). Word can store (at least) hundreds of thousands of these things.