How to list duplicates based on different criteria in T-SQL

I'm looking for someone to help me with a very specific task I have.
I'm analysing data from computer hard drives and need to be able to list folders which are duplicated after being extracted from .zip files. Here is an example of the data I am working with:
ItemName              | Extension | ItemType
MyZipFolder.zip       | .zip      | File
MyZipFolder           | null      | Folder
PersonalDocuments.zip | .zip      | File
PersonalDocuments     | null      | Folder
As you can see, the extension '.zip' appears in both the 'ItemName' and the 'Extension' columns. When extracted, a .zip file becomes a folder. I need a way of listing either the .zip file or the folder it becomes after extraction (either will do; it just needs to be listed with the knowledge that it is a duplicate).
The caveat is that my data contains plenty of other folders and files with different extensions, e.g. '.docx', '.msg', so the query needs to discount these.
I hope this makes sense - thanks!
Expected output might look something like this:
ItemName          | Extension | ItemType
MyZipFolder       | null      | Folder
PersonalDocuments | null      | Folder
So a list of all the folders which I know have a .zip equivalent in the data.

Not sure yet, but do you mean something like this?
select *
from your_table y
where ItemType = 'Folder'
  and exists (
      select 1
      from your_table yy
      where yy.Extension = '.zip'
        and yy.ItemName = y.ItemName + '.zip'
  )
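To sanity-check the logic of that EXISTS query, here is a small sketch against an in-memory SQLite database with the sample data from the question. SQLite is only used for illustration; note it concatenates strings with `||` where T-SQL uses `+`.

```python
import sqlite3

# In-memory sketch of the EXISTS-based duplicate check.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (ItemName TEXT, Extension TEXT, ItemType TEXT)")
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?, ?)",
    [
        ("MyZipFolder.zip", ".zip", "File"),
        ("MyZipFolder", None, "Folder"),
        ("PersonalDocuments.zip", ".zip", "File"),
        ("PersonalDocuments", None, "Folder"),
        ("Report.docx", ".docx", "File"),   # different extension: ignored
        ("Standalone", None, "Folder"),     # no matching .zip: ignored
    ],
)

rows = conn.execute("""
    SELECT ItemName, Extension, ItemType
    FROM your_table y
    WHERE ItemType = 'Folder'
      AND EXISTS (
          SELECT 1 FROM your_table yy
          WHERE yy.Extension = '.zip'
            AND yy.ItemName = y.ItemName || '.zip'
      )
""").fetchall()

print(rows)
# → [('MyZipFolder', None, 'Folder'), ('PersonalDocuments', None, 'Folder')]
```

Only the two folders with a .zip counterpart come back, matching the expected output in the question.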

I think I got what you need: group on the item name with its extension stripped off, then keep only groups that contain a .zip row plus at least one other row.
select replace(ItemName, isnull(Extension, ''), '') as BaseName
from tablename
group by replace(ItemName, isnull(Extension, ''), '')
having count(case when Extension = '.zip' then 1 end) >= 1
   and count(*) > 1

Related

Databricks: Adding path to table from csv

In Databricks I have several CSV files that I need to load. I would like to add a column to my table with the file path, but I can't seem to find that option.
My data is structured as:
FileStore/subfolders/DATE01/filenameA.csv
FileStore/subfolders/DATE01/filenameB.csv
FileStore/subfolders/DATE02/filenameA.csv
FileStore/subfolders/DATE02/filenameB.csv
I'm using this SQL statement in Databricks, as the wildcard in the path loops through all the dates and loads all filenameA files into clevertablenameA, all filenameB files into clevertablenameB, etc.
DROP view IF EXISTS clevertablenameA;
create temporary view clevertablenameA
USING csv
OPTIONS (path "dbfs:/FileStore/subfolders/*/filenameA.csv", header = true)
My desired outcome is something like this
col1 | col2|....| path
data | data|....| dbfs:/FileStore/subfolders/DATE02/filenameA.csv
data | data|....| dbfs:/FileStore/subfolders/DATE02/filenameA.csv
data | data|....| dbfs:/FileStore/subfolders/DATE02/filenameA.csv
Is there a clever option, or should I load my data another way?
The function input_file_name() could be used to retrieve the file name while reading.
SELECT *, input_file_name() as path FROM clevertablenameA
Note that this does not add a column to the view and merely returns the name of the file being read.
Refer to below link for more information.
https://learn.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/functions/input_file_name
Alternatively you could read the files in a PySpark/Scala cell, add the file name with .withColumn("path", input_file_name()), and then create the view on top of that DataFrame.
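If it helps to see the mechanics outside Spark, here is a plain-Python sketch of what input_file_name() gives you: read every CSV matching a wildcard path and tag each row with the file it came from. The directory layout and column names below are made up for the demo.

```python
import csv
import glob
import os
import tempfile

# Build a tiny DATE01/DATE02 layout like the one in the question.
root = tempfile.mkdtemp()
for date in ("DATE01", "DATE02"):
    os.makedirs(os.path.join(root, date))
    with open(os.path.join(root, date, "filenameA.csv"), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["col1", "col2"])
        writer.writerow(["data", date.lower()])

# Read every matching file and attach its path to each row,
# analogous to SELECT *, input_file_name() AS path.
rows = []
for path in sorted(glob.glob(os.path.join(root, "*", "filenameA.csv"))):
    with open(path, newline="") as f:
        for record in csv.DictReader(f):
            record["path"] = path          # the per-row file path column
            rows.append(record)

for r in rows:
    print(r["col1"], r["col2"], r["path"])
```

Each row ends up carrying the full path of its source file, which is exactly the column the question asks for.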

Query repository file contents from GitLab

I want to retrieve the commit id of a file readmeTest.txt through Invantive SQL, like so:
select * from repository_files(29, file-path, 'master')
But for this to work I need a project-id, file-path and a ref.
I know my project-id (I got it from select * from projects) and my ref (master branch) but I don’t know where I can find the path to the file I want to retrieve information of.
So where can I find the value of file-path and ref?
This is my repository directory tree, where I can see the files exist.
You need to join several entities in GitLab to get the information you need.
The fields from your repository_files table function and their meaning:
project-id can be found as id in the projects entity, as you already knew;
ref-name can be found as name in repositories;
ref is the name of a branch, a tag or a commit, so let's assume you want the master for now.
Given this information, you need the query below to get all repository files and their content in a project (I narrowed it down to a single project for now):
select pjt.name project_name
, rpe.name repository_name
, rpf.content file
from projects pjt
join repositories(pjt.id) rpe
on 1=1
and rpe.name like '%.%'
join repository_files(pjt.id, rpe.name, 'master') rpf
on 1=1
where pjt.id = 1

Is there a way to get the OpenText Content Server node id using a word macro

I'm trying to get the OpenText Content Server node id from a Word macro, after a Content Server user creates a Word doc (by opening Word on their PC and saving via the Enterprise Connect dialog) and before the doc is closed. I'm building a macro to hook the item number and pull some metadata into the doc, allowing the user to insert/update a document footer.
Is there some aspect of the various APIs or SDKs that will allow a word macro to access its own node id (and possibly other metadata) in this scenario?
I've found the file C:\Users\[username]\AppData\Roaming\OpenText\OTEdit\sync.fedb, which seems to hold a mapping between the file location/name and the document in Content Server. But interrogating this directly seems like a bit of a hack, as OTEdit.exe always has a lock on the file, and I wonder if there is a supported way to do this.
I've investigated DPS as a way to stamp the content server node id into the word doc properties, and while this works if the user closes and re-opens the doc, the properties are not available before the doc is closed and so it is not useful in this situation.
I found a different approach because sync.fedb is locked by the OTEdit process, and there doesn't seem to be any way to access the document metadata via the SDK using a word macro. It's a bit of a hack, but I've put the details here in case anyone else is interested in doing this.
Edited documents are stored under a folder in a path like: C:\Users\[username]\AppData\Roaming\OpenText\OTEdit\EC_[servername]\[folder]\[current document name]
[folder] might match a folder in Content Server, or might not - it is better to check the ~otdirinfo.ini file and parse the parent folder id out of the Browse url.
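As a sketch of that parsing step: Content Server Browse URLs typically carry the node id in an objId query parameter, so a regex can pull the parent folder id out of the URL found in ~otdirinfo.ini. The ini line and parameter name below are assumptions modelled on typical Browse URLs; check the actual file on your system before relying on this.

```python
import re

# Hypothetical ~otdirinfo.ini line (made up for the demo); the objId
# parameter is an assumption based on typical Content Server Browse URLs.
sample_ini_line = "BrowseUrl=http://server/OTCS/cs.exe?func=ll&objId=123456&objAction=browse"

match = re.search(r"[?&]objId=(\d+)", sample_ini_line)
parent_id = int(match.group(1)) if match else None
print(parent_id)  # → 123456
```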
From here we can do a database search using something like:
SELECT
t.DataID AS NodeId,
CAST(t.CreateDate AS DATE) AS CreateDate,
CASE WHEN k.FirstName IS NULL
AND k.LastName IS NULL THEN k.Name
ELSE LTRIM(RTRIM(( ISNULL(k.FirstName, '') + ' ' + ISNULL(k.LastName, '') )))
END AS CreatedByFullName,
CASE WHEN kr.FirstName IS NULL
AND kr.LastName IS NULL THEN kr.Name
ELSE LTRIM(RTRIM(( ISNULL(kr.FirstName, '') + ' ' + ISNULL(kr.LastName, '') )))
END AS ReservedByFullName,
t.CreatedBy,
t.ReservedBy,
t.ParentID,
t.Name AS Title,
v.FileName
FROM
DTree t
INNER JOIN KUAF k
ON t.CreatedBy = k.ID
LEFT OUTER JOIN KUAF kr
ON t.ReservedBy = kr.ID
INNER JOIN DVersData v
ON t.DataID = v.DocID AND t.VersionNum = v.Version
WHERE
t.ParentID = @ParentFolderID -- parsed from ~otdirinfo.ini
AND v.FileName = @FileName -- the current document name
In practice, I have written an API to wrap the database lookup that returns the results of interest in JSON, which is slightly easier to deal with than managing database connections and returns results faster than CWS at my site. I use the handy VBA-Web macros to make the call and handle parsing, place the results of the call into the Word doc properties, and then call our existing footer-generation macro.
Note: I'm using Content Server 10.5 for this, apparently the approach for extracting parent id sometimes differs per version.

Concatenation of multiple file in Qlikview

Is it possible in QlikView to concatenate multiple files from different paths?
Suppose I am loading multiple files from one path and want to concatenate files from other paths that have the same number and names of columns as the first path's files. My question is: how can I do that?
Thanks in Advance.
When you say "load a file", I am assuming you mean that you are loading the contents into a table, as you would a QVD, XML, or Excel file.
If this is the case, if the columns are identical in each load, QlikView will attempt to concatenate them by default if they are loaded in sequence.
Otherwise, name your first table, such as TableName:, then preface the following loads of other files with concatenate(TableName).
Ex:
TableName:
LOAD Col1, Col2
from [file.qvd];
CONCATENATE(TableName)
LOAD Col1, Col2
from [file2.qvd];
Note: As I mentioned above, since these are in sequence and have identically named columns, QlikView will attempt to autoconcatenate them in my example, so the CONCATENATE line, though still functional, is not required.
I just want to add an example of how to do it when there is a dynamic number of files with a given name across multiple directories:
SUB LoadFromFolder (RootDir)
TRACE Loading data ...;
TRACE Directory: $(RootDir);
TRACE ;
FOR Each FoundFile in FileList(RootDir & '\FileName.xml')
TRACE Loading data from '$(FoundFile)' ...;
Data:
LOAD Prop1,
Prop2,
Prop3
From [$(FoundFile)] (XmlSimple, Table is [XmlRoot/XmlTag]);
TRACE Loaded.;
NEXT FoundFile
FOR Each SubDirectory in DirList(RootDir & '\*' )
CALL LoadFromFolder(SubDirectory);
NEXT SubDirectory
TRACE ;
END Sub
CALL LoadFromFolder ('C:\Path\To\Dir\WithoutslashAtTheEnd');
As Dickie already explained, each time you load into Data:, the new rows will be concatenated there.
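For readers more comfortable outside QlikView script, the control flow of LoadFromFolder above can be sketched in Python: collect every file with a fixed name from a root directory, then recurse into each subdirectory. The file and directory names here are made up for the demo.

```python
import os
import tempfile

def load_from_folder(root_dir, file_name):
    """Mirror of the QlikView SUB: gather matching files, then recurse."""
    found = []
    for entry in sorted(os.scandir(root_dir), key=lambda e: e.path):
        if entry.is_file() and entry.name == file_name:
            found.append(entry.path)                               # "load" this file
        elif entry.is_dir():
            found.extend(load_from_folder(entry.path, file_name))  # recurse into subdir
    return found

# Build a small tree with the target file at three levels.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub", "deeper"))
for d in ("", "sub", os.path.join("sub", "deeper")):
    open(os.path.join(root, d, "FileName.xml"), "w").close()

paths = load_from_folder(root, "FileName.xml")
print(len(paths))  # → 3
```

As in the QlikView version, every match across the whole tree ends up in one collection, which is what lets the sequential loads auto-concatenate.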

Designing Database for File Structure

We are using the file system to store files within the application. Now, per a new requirement, we need to change this to store the files as BLOBs in SQL Server 2005 instead.
We need advice on the table design. Obviously, it must handle folders, files within folders, size, last date modified, etc., similar to a file system.
I start with:
FileID, ParentFileID, FileName, Size, LastDateModified, DateCreated, LastModifiedBy, ModifiedBy
How can this be modified to handle folders as well?
As Mitch Wheat said, there's a really good system for this already, and it's called the File System - my first recommendation would be to look at your requirements again to see if it is actually required.
However, you may have your reasons, so here's how I'd structure the table:
filesystem (
id, // auto increment
type, // flag field: 1 = file, 2 = folder, 3 = symlink, if needed (?)
parent_id, // id of a folder
filename,
modified,
created,
modified_by,
created_by,
file_data // blob
)
You'd need a unique index on (parent_id, filename) if you wanted to emulate a real system.
If you needed per-file permissions, I'd just duplicate the Unix approach with owner/group/everyone permissions - you'd need to track owner and group_id in that table too. Perhaps you could simplify it to owner/everyone, and you probably could just use read/write (forgoing "execute").
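A quick sketch of that schema in SQLite shows how the unique (parent_id, filename) index emulates a real file system: two entries with the same name cannot share a folder. SQLite stands in for SQL Server 2005 here purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE filesystem (
        id          INTEGER PRIMARY KEY AUTOINCREMENT,
        type        INTEGER NOT NULL,   -- 1 = file, 2 = folder
        parent_id   INTEGER REFERENCES filesystem(id),
        filename    TEXT NOT NULL,
        modified    TEXT,
        created     TEXT,
        modified_by TEXT,
        created_by  TEXT,
        file_data   BLOB
    );
    CREATE UNIQUE INDEX ux_parent_name ON filesystem (parent_id, filename);
""")

# One folder containing one file.
conn.execute("INSERT INTO filesystem (type, parent_id, filename) VALUES (2, NULL, 'docs')")
folder_id = conn.execute("SELECT id FROM filesystem WHERE filename = 'docs'").fetchone()[0]
conn.execute("INSERT INTO filesystem (type, parent_id, filename) VALUES (1, ?, 'readme.txt')",
             (folder_id,))

# A second 'readme.txt' in the same folder violates the unique index.
try:
    conn.execute("INSERT INTO filesystem (type, parent_id, filename) VALUES (1, ?, 'readme.txt')",
                 (folder_id,))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)  # → True
```

One caveat: in SQLite, NULL values are treated as distinct in unique indexes, so root-level entries (parent_id NULL) would need a sentinel root row or a filtered index to get the same guarantee.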
Please find the modified one:
FileSystemObjects(
FileSystemObjectID,
ParentFileSystemObjectID,
FileSystemTypeID, --File, Folder, Shortcut
Data,
DateCreated,
LastModified,
CreatedBy,
LastModifiedBy,
IsActive
)
FileSystemSecurity(
FileSystemObjectID,
GroupOrUserID,
IsAllowFullControl,
IsDenyFullControl,
IsAllowExecute,
IsDenyExecute,
IsAllowListFolder,
IsDenyListFolder,
...
...
)
With IsAllowFullControl, IsDenyFullControl, IsAllowExecute, IsDenyExecute, IsAllowListFolder, and IsDenyListFolder, I know it's not ideal DB design, but it's much quicker to get permissions in one hit.
What do you think?