Creating an ESRI Shapefile, will it overwrite duplicates or rename them? - gdal

I've written a program that creates ESRI shapefiles using the GDAL libraries. While creating the file, I use a basic for loop to populate some points with random integers. If I add a geometry that has the same name as another geometry, will it overwrite that geometry and all its data, or will it create a new one and change its name slightly so it can be added?

I tested this in a basic project that creates an ESRI shapefile and adds 100 identical items with the same name and geometry. GDAL seems to just append a dot and a sequential number, so in short it doesn't overwrite a feature, it just creates new ones.
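For anyone who wants to reproduce that kind of test, here is a minimal sketch of it, assuming GDAL's Python bindings (osgeo); the file, layer and field names are just placeholders:

from osgeo import ogr, osr
import random

driver = ogr.GetDriverByName("ESRI Shapefile")
ds = driver.CreateDataSource("duplicate_test.shp")

srs = osr.SpatialReference()
srs.ImportFromEPSG(4326)

layer = ds.CreateLayer("points", srs, ogr.wkbPoint)
layer.CreateField(ogr.FieldDefn("name", ogr.OFTString))

# Add 100 features that all share the same name attribute.
for _ in range(100):
    feature = ogr.Feature(layer.GetLayerDefn())
    feature.SetField("name", "SamePoint")
    point = ogr.Geometry(ogr.wkbPoint)
    point.AddPoint(float(random.randint(0, 100)), float(random.randint(0, 100)))
    feature.SetGeometry(point)
    layer.CreateFeature(feature)
    feature = None

ds = None  # flush everything to disk

# Reading the file back shows 100 separate features, none overwritten.
ds = ogr.Open("duplicate_test.shp")
print(ds.GetLayer(0).GetFeatureCount())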

Azure Data Factory - How to create multiple datasets and apply different treatments on files in same blob container?

I'm just starting out with Azure Data Factory.
I have a scenario where I gather csv files (different sources and formats/templates) that I store in a single Azure blob container. I would like to extract the data to an SQL DB. I need to apply different treatments to the files before pushing the data to SQL, based on the format. The format is indicated in each file name (for example: Myfile-formatA-20201201).
I am unclear on how to set up my pipeline and datasets. I assume I need to create a new input dataset for each CSV format, but I cannot find a way to create differentiated datasets based on the different naming patterns. If I create a single input dataset instead, I can build a pipeline with differentiated copy activities that all use that one dataset and apply different filtering rules based on the file naming pattern. This seems to work fine for files sharing the same encoding, column delimiters, etc., but, as expected, fails for files that do not.
I could not find any official information on how to apply filters when creating multiple datasets from files contained in the same container. Is it possible at all, or is it a prerequisite to store files of different formats in different containers or directories?
I created a test that copies CSVs of different formats in one pipeline, then selects different copy activities according to the file name. I think this is the answer you want.
1. In my container, I created CSV files in two formats.
2. Create a dataset pointing to the input container. (Edit: do not specify a file in the File path.)
3. Use the Get Metadata1 activity to get the Child items; the output is an array with one entry per file in the container.
4. Then, in the ForEach1 activity, we can traverse this array. Add the dynamic content @activity('Get Metadata1').output.childItems to the Items tab.
5. Inside the ForEach1 activity, we can use a Switch1 activity and add the dynamic content @split(item().name,'-')[1] to the Expression. It extracts the format name, e.g. Myfile-formatA-20201201 -> formatA.
In the default case, we copy the CSV files of formatA. Edit: in order to select only the files with "formatA" in their name, use the Wildcard file path option in the copy activity.
Key in @item().name, so we can specify one CSV file at a time.
Add formatB case:
Then use the same source dataset.
Edit: as in the previous step, use the Wildcard file path option.
That's all. We can set a different sink in each of these Copy activities.
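As a side note, the Switch expression above just splits the file name on hyphens; the routing logic, illustrated in Python with made-up file names:

# Illustration only: what @split(item().name,'-')[1] resolves to per file.
for name in ["Myfile-formatA-20201201.csv", "Otherfile-formatB-20201215.csv"]:
    format_key = name.split("-")[1]
    print(name, "->", format_key)  # -> formatA, formatB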

Excel to CSV Plugin for Kettle

I am trying to develop a reusable component in Pentaho which will take an Excel file and convert it to a CSV with an encoding option.
In short, I need to develop a transformation that has an Excel input and a CSV output.
I don't know the columns in advance, so the columns have to be dynamically injected into the Excel input.
That's a perfect candidate for Pentaho Metadata Injection.
You should have a template transformation which contains the basic workflow (read from the Excel file, write to the text file), but without specifying the input and/or output formats. Then you should store your metadata (the list of columns and their properties) somewhere. In Pentaho's example an Excel spreadsheet is used, but you're not limited to that; I've used a couple of database tables to store the metadata, for example: one for the input format and another for the output format.
You also need a transformation with the Metadata Injection step to "inject" the metadata into the template transformation. What it basically does is create a new transformation at runtime, using the template and the fields you set to be populated, and then run it.
Pentaho's example is pretty clear if you follow it step by step, and from that you can then create a more elaborate solution.
You'll need at least two steps in a transformation:
Input step: Microsoft Excel input
Output step: Text file output
So, here is the solution: in your Excel input step, in the Fields section, specify the maximum number of fields that can appear in any Excel file. Then route the input to the text file output based on the number of fields that are actually present. You need to use a Switch/Case step here.

How to find table region for camelot

As mentioned in the camelot documentation, we can extract a table from a particular region like this:
tables = camelot.read_pdf('table_regions.pdf', table_regions=['170,370,560,270'])
But how can I find these regions for my PDF?
You can detect these regions with some visual debugging:
https://camelot-py.readthedocs.io/en/master/user/advanced.html#visual-debugging
I know it's a late reply - but I just came across a possible solution.
If you're looking for an automated extraction method, you could use the Lattice flavor in a first step, retrieve the table boundaries with tables[0]._bbox, and pass those numbers to the table_areas argument in a second call to camelot.read_pdf().
Be aware that they are in a weirdly sorted format for a bbox.
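A rough sketch of that two-pass idea (note that _bbox is a private attribute and, as mentioned, its ordering may need rearranging, because table_areas expects left-top and right-bottom coordinates in PDF space):

import camelot

# First pass: let the Lattice parser find the table and its bounding box.
tables = camelot.read_pdf("table_regions.pdf", flavor="lattice")
x1, y1, x2, y2 = tables[0]._bbox  # PDF coordinates, origin at bottom-left

# Second pass: reuse the detected area, here with the Stream parser.
# table_areas wants "x1,y1,x2,y2" as left-top / right-bottom, so the
# y values from _bbox may have to be swapped, as below.
area = f"{x1},{y2},{x2},{y1}"
tables = camelot.read_pdf("table_regions.pdf", flavor="stream", table_areas=[area])
print(tables[0].df)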
If you just want to detect the table region you are reading, try doing this in a Jupyter Notebook:
1. Define the table region inside the .read_pdf method: tables = camelot.read_pdf('table_regions.pdf', table_regions=['170,370,560,270'], flavor='lattice'); pay attention to the flavor, because it depends on whether the table has border lines or not (lattice for bordered tables, stream for whitespace-separated ones).
2. Use camelot-py with matplotlib's plot: camelot.plot(tables[index], kind='contour'). You can see how many tables the object holds by simply evaluating its name (e.g. tables) in an .ipynb cell; 'contour' is the visual debugging mode.
3. The plot will show an image of your table with a red rectangle contour. Just repeat steps 1 and 2, adjusting the region, until you get the table region you want to extract.
4. To check that the extracted data is correct, just use tables[index].df.
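Putting those steps together, a small sketch of the notebook workflow described above:

import camelot
import matplotlib.pyplot as plt

# Step 1: read with a guessed region (lattice here, for a bordered table).
tables = camelot.read_pdf("table_regions.pdf",
                          table_regions=["170,370,560,270"],
                          flavor="lattice")

# Step 2: visual debugging - a red contour marks the detected table region.
camelot.plot(tables[0], kind="contour")
plt.show()  # in a notebook cell the figure usually renders without this call

# Step 4: once the region looks right, inspect the extracted data.
print(tables[0].df)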

How to change a dataset name

I created a dataset in BigQuery. Unfortunately, it is unclear how to rename it. I clicked on the arrow on the right side of the dataset name, but I cannot see any option to rename it.
It is not possible to rename a dataset in BigQuery. Instead, you have to recreate the resource and copy the old information into the new dataset, as mentioned in the public documentation:
Currently, you cannot change the name of an existing dataset, and you cannot copy a dataset and give it a new name. If you need to change the dataset name, follow these steps to recreate the dataset:
Create a new dataset and specify the new name.
Copy the tables from the old dataset to the new one.
Recreate the views in the new dataset.
Delete the old dataset to avoid additional storage costs.
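For anyone scripting this, a rough sketch of those four steps with the google-cloud-bigquery Python client (the project and dataset names are placeholders, and views still have to be recreated by hand):

from google.cloud import bigquery

client = bigquery.Client()
old, new = "my-project.old_dataset", "my-project.new_dataset"

# 1. Create the new dataset with the new name.
client.create_dataset(new)

# 2. Copy the tables from the old dataset to the new one.
for item in client.list_tables(old):
    if item.table_type != "TABLE":
        continue  # 3. views cannot be copied; recreate them in the new dataset
    client.copy_table(f"{old}.{item.table_id}",
                      f"{new}.{item.table_id}").result()  # wait for each copy job

# 4. Delete the old dataset to avoid additional storage costs.
client.delete_dataset(old, delete_contents=True)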

Adding two extra columns to input data - Pentaho Kettle

I am working on a transformation step for Pentaho Kettle. It selects several input columns and, based on those, adds two new columns during the transformation. I am unable to understand (based on code from other plugins) how I can add the two new columns so that 1) downstream steps are aware of these columns and 2) I can push the transformed data into these columns.
Thanks in advance.
You might need to override meta.getStepFields() to add new ValueMetaInterface objects to the RowMetaInterface passed in. This is the standard way to add columns at runtime; however, the row's metadata (i.e. list of ValueMetaInterface objects) must be the same from row to row or else the next step in your transformation will complain.
Often when doing data-driven custom plugins, you consume as many rows as you need (using getRow()) in order to figure out what the outgoing row format/metadata will be, then you can construct a RowMetaInterface (usually using meta.getStepFields()) that will be passed into the putRow() call. If you intend to pass through the incoming fields, do something like:
RowMetaInterface outputRowMeta = getInputRowMeta().clone(); // keep the incoming fields, then append the new ones
If you're creating new rows use this:
RowMetaInterface outputRowMeta = new RowMeta(); // start from an empty row layout
Either way when you call meta.getStepFields(outputRowMeta, ...) it should populate outputRowMeta with the appropriate fields, by adding/changing/removing ValueMetaInterface objects from outputRowMeta.
I've got a blog post using Groovy to add/replace fields in the incoming rows here:
http://funpdi.blogspot.com/2014/10/flatten-json-to-key-value-pairs-in-pdi.html
Not sure if that is similar to your use case or not. If you have more questions, feel free to find me on IRC at ##pentaho (my nick is usually mburgess_pdi)
If I have understood your question correctly, I think you are trying to create an output file with dynamic columns. You can do this by checking the "fast dumping" option in the Text file output step. While doing so, do not define any column names in the "Fields" tab.
Hope it helps :)