Python is taking multiple columns as the index while importing a CSV file - pandas

I am trying to import a CSV file; the first column is dates, the second is empty, and the rest are product data. When I import the file, pandas takes my first three columns as the index. I just need the dates as the index.

When you load the CSV file and pass fewer column names than there are columns, every leading column you didn't cover in the names argument becomes part of the index. So you can either:
1) Include the names of the second and third columns in the names argument, which will leave only the dates as the index.
2) Call Prd_data.reset_index(level=[1, 2], inplace=True).
This turns the second and third index levels back into columns and keeps only the dates as the index.
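A minimal sketch of option 2, assuming a file name and two product-column names for illustration (the question only tells us the DataFrame is called Prd_data):
import pandas as pd

# Hypothetical file and column names: the first three columns of the
# file are not covered by names, so they come back as a MultiIndex.
Prd_data = pd.read_csv("products.csv", names=["prodA", "prodB"])

# Option 2: move index levels 1 and 2 back into ordinary columns,
# keeping only the dates (level 0) as the index.
Prd_data.reset_index(level=[1, 2], inplace=True)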

Related

Pandas Identify Column name where Pattern was found using df.iterrows()

I have datasets I am scanning for a certain pattern using regex. Some of these tables have millions of rows, and doing a column-by-column search is time consuming, so I am using iterrows.
That way, at the first row where it finds the matching pattern, it flags it and ends the loop. The problem is that I can't determine the column name; ideally I want the name of the column where it found the match.
Code sample:
for index, row in df.iterrows():
    if row.astype(str).str.contains(r"\b456\d{6}\b").any():  # any 9-digit number starting with 456
        print(index)
        break
Currently my output prints the index of the first row it found a match in and exits. What's a better way to write this so that I can also capture the column name or column index the match was found in? For the data sample above, ideally I want the column "Acc_Number" printed.
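One way to capture the column as well, sketched under the assumption that every cell can safely be cast to string; the function name is made up, and the pattern is the 456-prefixed 9-digit number from the question:
import pandas as pd

def find_first_match(df: pd.DataFrame, pattern: str = r"\b456\d{6}\b"):
    # Build a boolean mask over the whole frame in one vectorized pass.
    mask = df.astype(str).apply(lambda col: col.str.contains(pattern))
    hits = mask.stack()  # Series indexed by (row, column) pairs
    hits = hits[hits]    # keep only the True entries
    return None if hits.empty else hits.index[0]  # (row_index, column_name)

# Hypothetical usage:
# find_first_match(df)  # -> (0, 'Acc_Number') for the sample data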

Is there a quick way to subset columns in PANDAS?

I am trying to set up a pandas project that I can use to compare Excel and CSV files over time and return the differences. Currently I load the Excel/CSV files into pandas and assign each a "version" column. I do this because, in my last step, I want the program to create a file containing only what has changed in the "new" version, so that I do not have to update the entire database, only the data points that have changed.
old = pd.read_excel('landdata20201122.xlsx')
new = pd.read_excel('landdata20210105.xlsx')
old['version'] = "old"
new['version'] = "new"
I merge the sheets into one and then drop duplicate rows based on all the columns in the original files. I have to subset the data because if the program also compared my added version column, no row would ever count as a duplicate. The statement is listed below.
df2 = df1.drop_duplicates(subset=["UWI", "Current DOI Partners", "Encumbrances", "Lease Expiry Date", "Mineral Leases", "Operator", "Attached Land Rights", "Surface Leases"])
df2.shape
I am wondering if there is a quicker way to subset the data. The way I currently have it set up, I have to list each column title, and some of my sheets have 100+ columns, so it is a lot of work when I only want one column ignored. Is there a way to populate all the column titles and remove the ones I do not want considered? Or can I pass the columns I DO NOT want compared to the drop_duplicates command instead of listing every column except one?
If I can just list the columns I do not want to compare, I will be able to use the same script for much more of my data, as I will not have to edit the drop_duplicates statement each time I compare sheets.
Any help is appreciated, thank you in advance!
If I've understood correctly:
Store the headers in a list.
Remove the names you don't want by hand.
Pass the list as the subset argument of drop_duplicates().
If the columns you want to remove outnumber those you want to keep, build the list by hand from the wanted columns instead.
With a list, you won't need to write the names out every time; see the sketch after the loop example below.
How to iterate a list:
names = ['first', 'second', 'third']
for name in names:
    print(name)
# Output: first, second, third (one per line)
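Putting it together, a minimal sketch of the exclude-columns idea, using df1 and the added version column from the question:
# Columns to ignore when checking for duplicates.
exclude = ['version']

# Everything except the excluded columns, in their original order.
subset_cols = [c for c in df1.columns if c not in exclude]

df2 = df1.drop_duplicates(subset=subset_cols)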

How do I load an entire file's content as text into one column of an Azure SQL DW table?

I have some files in an Azure Data Lake Gen2 and I want to load each one as a single nvarchar(max) column value in Azure SQL DW. The table in Azure SQL DW is a heap. I couldn't find any way to do it; all I see is delimited loading, which splits a file into multiple rows and columns instead of one row with a single column. How do I achieve this?
I don't guarantee this will work, but try using COPY INTO and defining row and column delimiters that do not occur in the data. Make your target a single-column table.
I would create a Source Dataset in Azure Data Factory with a single column. You do this by specifying "No delimiter".
Next, go to the "Schema" tab and import the schema, which should create a single column called "Prop_0".
Now the data should come through as a single string instead of delimited columns.
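If the COPY INTO route proves awkward, a fallback sketch in Python that reads the file client-side and inserts it as one value; the connection string, file name, and table are all hypothetical:
import pyodbc  # assumes the Microsoft ODBC driver is installed

with open("somefile.json", encoding="utf-8") as f:  # hypothetical file
    content = f.read()  # the entire file as one string

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=yourserver.database.windows.net;Database=yourdw;UID=user;PWD=secret"
)
cur = conn.cursor()
# Hypothetical single-column heap table:
# CREATE TABLE dbo.FileContents (Content nvarchar(max))
cur.execute("INSERT INTO dbo.FileContents (Content) VALUES (?)", content)
conn.commit()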

Is there any way to exclude columns from a source file/table in Pentaho using "like" or any other function?

I have a CSV file with more than 700 columns, and I want only 175 of them to be inserted into an RDBMS table or a flat file using Pentaho (PDI). The source CSV file has variable columns, i.e. columns can keep being added or deleted, but certain keywords remain constant throughout. I have the list of keywords present in the column names that have to be excluded, e.g. starts_with("avgbal_"), starts_with("emi_"), starts_with("delinq_prin_"), starts_with("total_utilization_"), starts_with("min_overdue_"), starts_with("payment_received_").
Any column whose name contains one of the above keywords has to be excluded and should not pass on to my RDBMS table or flat file. Is there any way to remove these columns by writing some SQL query in PDI? Selecting the specific 175 columns is not possible, as they are variable in nature.
I think your case is a good fit for metadata injection; you can refer to the example shared below:
https://help.pentaho.com/Documentation/7.1/0L0/0Y0/0K0/ETL_Metadata_Injection
Two things you need to be careful about:
Maintain the list of columns you need to push in.
Since the column names change, you may also face issues with the valid columns you want to import. To handle this, regenerate the metadata file every time, so you are sure about the column names you want to push out from the flat file; one way to generate that list is sketched below.
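A small sketch of the prefix filter that could generate that column list; the prefixes come from the question, while the header row and helper name are assumed for illustration:
EXCLUDE_PREFIXES = ("avgbal_", "emi_", "delinq_prin_",
                    "total_utilization_", "min_overdue_", "payment_received_")

def keep_column(name: str) -> bool:
    # str.startswith accepts a tuple, so one call checks every prefix.
    return not name.startswith(EXCLUDE_PREFIXES)

# Example: filter a CSV header row down to the wanted columns.
header = ["cust_id", "avgbal_q1", "emi_2020", "region"]
kept = [c for c in header if keep_column(c)]
# kept -> ['cust_id', 'region']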

Use Google Refine on a CSV without headers and with a varying number of columns per record

I'm attempting to import into OpenRefine a CSV extracted from a NoSQL database (Cassandra), without headers and with a different number of columns per record.
For instance, fields are comma-separated and could look like the records below:
1 - userid:100456, type:specific, status:read, feedback:valid
2 - userid:100456, status:notread, message:"some random stuff here but with quotation marks", language:french
There's a maximum number of columns, and no cleansing is required on their names.
How do I build a big Excel file I could mine using a pivot table?
If you can get JSON instead, Refine will ingest it directly.
If that's not a possibility, I'd probably do something along the lines of:
import as lines of text
split into two columns containing the row ID and the fields
split multi-valued cells on the fields column using comma as a separator
split the fields column into two columns using colon as a separator
use key/value on these two columns to unfold them into columns
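For comparison, a rough pandas version of the same unfold, as an alternative outside Refine; the file name is hypothetical, and the plain comma split assumes no commas inside the quoted values, which holds for the sample above:
import pandas as pd

records = []
with open("export.csv", encoding="utf-8") as f:  # hypothetical file name
    for line in f:
        fields = [p.strip() for p in line.strip().split(",")]
        # Each field is "key:value"; unseen keys simply become new columns.
        records.append(dict(p.split(":", 1) for p in fields if ":" in p))

# Ragged records line up in one table; missing keys become NaN.
df = pd.DataFrame(records)
df.to_excel("refined.xlsx", index=False)  # needs openpyxl; ready for a pivot table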