I'd like to read a .xlsx file using Python pandas. The problem is that the Excel file has some additional data at the beginning, like a title or a description of the table, before the table contents start. That introduces unnamed columns, because pandas takes those rows as the column headers. The actual table contents only start a few lines later.
A                            B     C
this is description
last updated: Mar 18th,2014
Table content
Country                      Year  Product_output
Canada                       2017  3002
Bulgaria                     2016  2201
...
The table content starts at line 4, and the columns must be "Country", "Year", "Product_output" instead of "this is description", "unnamed", "unnamed".
When you use the read_excel function, set the skiprows parameter to 3.
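A minimal sketch, reusing the file and sheet names from the other answer here:

import pandas as pd

# Skip the first three rows (description, last-updated line, "Table content")
# so that the fourth row (Country, Year, Product_output) becomes the header.
df = pd.read_excel('Excel_Sample.xlsx', sheet_name='Sheet1', skiprows=3)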
Try using the index_col=[0] parameter:
pd.read_excel('Excel_Sample.xlsx',sheet_name='Sheet1',index_col=[0])
I have an SSIS package that is to ingest a number of Excel files with similar structures but irregular names and import them into a SQL table. Along with the data from the Excel files, I have a number of variables that are set during execution and differ for each file (User::ExcelFileName, User::VarMonth, User::VarProgram, User::VarYear, etc.). All of the table data from the Excel files goes to the same destination table, but for each row of data I want to insert a column for each variable to pass through into SQL alongside the Excel dataset. An example of my dataset is below:
Excel

ID   Name    Foo   Bar
111  Bob     88yu  117
112  Jim     JKL   A TU
113  George  FTD   19900
SSIS Variables (set during execution)
User::ExcelFileName = c:\temp\excelfile1.xlsx
User::VarMonth = Jan
User::VarProgram = Daily
User::VarYear = 2023
Desired SQL Destination:

ExcelFileName            VarMonth  VarProgram  VarYear  ID   Name    Foo   Bar
c:\temp\excelfile1.xlsx  Jan       Daily       2023     111  Bob     88yu  117
c:\temp\excelfile1.xlsx  Jan       Daily       2023     112  Jim     JKL   A TU
c:\temp\excelfile1.xlsx  Jan       Daily       2023     113  George  FTD   19900
I've tried a few configurations, and I've referenced this post for piping variable data into SQL, but I haven't gotten a working model yet.
Worth noting: the Excel connection is dynamic and set to run within a Foreach Loop container to iterate through my Excel sources. Any advice or guidance would be appreciated!
It sounds like you want a Derived Column task.
In the task, just add the new columns you want, and map the variables to the columns.
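For example, the Derived Column transformation could add one column per variable, along these lines (@[User::...] is the standard SSIS expression syntax for reading a package variable):

Derived Column Name   Expression
ExcelFileName         @[User::ExcelFileName]
VarMonth              @[User::VarMonth]
VarProgram            @[User::VarProgram]
VarYear               @[User::VarYear]

Because the transformation sits in the data flow between the Excel source and the SQL destination, every row passing through picks up the current values of the variables, and the Foreach Loop updates those values for each file.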
I am trying to read a Google Sheet using pandas pd.read_csv(); however, when the columns contain some cells with text and other cells with numeric values, the text is not read. My code is:
import pandas as pd

def build_sheet_url(doc_id, sheet_id):
    return r"https://docs.google.com/spreadsheets/d/{}/gviz/tq?tqx=out:csv&sheet={}".format(doc_id, sheet_id)

sheet_url = build_sheet_url(doc_id, sheet_name)
df = pd.read_csv(sheet_url)
> df
   Column1  Column2
0       12       21
1       13       22
2       14       23
3       15       24
This is what the spreadsheet looks like (screenshot omitted): the same columns contain both numeric cells and text cells.
I have tried using dtype=str and dtype=object, but could not get the text to show in my dataframe. Specifying encoding='utf-8' did not work either.
This is because the query endpoint (gviz/tq) doesn't support mixed data types:
Data type. Supported data types are string, number, boolean, date, datetime and timeofday. All values of a column will have a data type that matches the column type, or a null value. These types are similar, but not identical, to the JavaScript types.
Use the /export endpoint (or the Drive API endpoint) instead:
https://docs.google.com/spreadsheets/d/[SPREADSHEET_ID]/export?format=[FORMAT]&gid=[SHEET_ID]&range=[A1_NOTATION]
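A minimal sketch of the same read against the export endpoint (the spreadsheet ID and gid below are placeholders):

import pandas as pd

# The /export endpoint returns the sheet as a real CSV file, so columns
# that mix text and numbers survive, unlike with the gviz/tq query endpoint.
doc_id = 'your_spreadsheet_id'  # placeholder
gid = 0                         # placeholder: the target sheet's gid
export_url = 'https://docs.google.com/spreadsheets/d/{}/export?format=csv&gid={}'.format(doc_id, gid)
df = pd.read_csv(export_url)

Note that the sheet must be shared as "anyone with the link can view" for this to work without credentials.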
Related:
Google sheet to pandas via shared link without credentials in python
Query is ignoring string (non numeric) value
I am trying to replace substrings in a data frame using the lists "name" and "lemma". As long as I enter the lists manually, the code delivers the result in the dataframe m.
name = ['Charge', 'charge', 'Prepaid']
lemma = ['Hallo', 'hallo', 'Hi']
m = sdf.replace(regex=name, value=lemma)
As soon as I read in both lists from an Excel file, my code no longer replaces the substrings. I need to use an Excel file, since the lists are in one table that is very large.
sdf = pd.read_excel('training_data.xlsx')
synonyms = pd.read_excel('synonyms.xlsx')
lemma = synonyms['lemma'].tolist()
name = synonyms['name'].tolist()
m = sdf.replace(regex=name, value=lemma)
Thanks for your help!
df.replace()
Replace values given in to_replace with value.
Values of the DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.
In short, this method won't make changes at the series level, only on values.
This may achieve what you want (using bracket access and tolist() so the columns are passed as plain lists):

m = sdf.replace(regex=synonyms['name'].tolist(), value=synonyms['lemma'].tolist())
If you are just trying to replace 'Charge' with 'Hallo', 'charge' with 'hallo', and 'Prepaid' with 'Hi', then you can use replace() and pass the list of words to find as the first argument and the list of words to replace them with as the keyword argument value.
Try this:
df = df.replace(name, value=lemma)
Example:
import pandas as pd

name = ['Charge', 'charge', 'Prepaid']
lemma = ['Hallo', 'hallo', 'Hi']
df = pd.DataFrame([['Bob', 'Charge', 'E333', 'B442'],
                   ['Karen', 'V434', 'Prepaid', 'B442'],
                   ['Jill', 'V434', 'E333', 'charge'],
                   ['Hank', 'Charge', 'E333', 'B442']],
                  columns=['Name', 'ID_First', 'ID_Second', 'ID_Third'])
df = df.replace(name, value=lemma)
print(df)
Output:
    Name ID_First ID_Second ID_Third
0    Bob    Hallo      E333     B442
1  Karen     V434        Hi     B442
2   Jill     V434      E333    hallo
3   Hank    Hallo      E333     B442
I have data in a pandas dataframe. I need to extract all the content between the string that starts with "Impact Factor:" and ends with "&#". If the content doesn't contain "Impact Factor:", I want null in that row of the dataframe.
This is sample data from a single row:
Save to EndNote online &# Add to Marked List &# Impact Factor: Journal 2 and Citation Reports 500 &# Other Information &# IDS Number: EW5UR &#
I want the content to look like the below in a dataframe:
Journal 2 and Citation Reports 500
Journal 6 and Citation Reports 120
Journal 50 and Citation Reports 360
Journal 30 and Citation Reports 120
Hi, you can just use a regular expression here:
import re

result = your_df.your_col.apply(lambda x: re.findall(r'Impact Factor:(.*?)&#', x))
You may want to strip whitespace too, in which case you could use:
result = your_df.your_col.apply(lambda x: re.findall(r'Impact Factor:\s*(.*?)\s*&#', x))
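Note that re.findall returns an empty list when the marker is absent, while the question asks for null in that case. A variant with re.search (a sketch, assuming the same your_df / your_col names) returns None instead, which pandas treats as a null value:

import re

def extract_impact(x):
    # re.search returns a match object, or None when the pattern is absent;
    # None is treated as null in the resulting pandas column.
    m = re.search(r'Impact Factor:\s*(.*?)\s*&#', x)
    return m.group(1) if m else None

result = your_df.your_col.apply(extract_impact)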
I'm new to SQL. My question is that I had some trouble importing data, which led to some discrepancies.
Column A  Column B       Column C  Column D  Column E             Column F
WB-002    "Brown Sales"  14A       140000    12/5/2015            12/5/2016
WB-002    "Johnson Inc"  24B       150000    12/5/2015,2/5/2016
WB-005    "Sonoma Inc"   26C       300000    7/30/2015,7/30/2016
How would I be able to shift the data over by one column for the affected rows past column 1? Or would I have to replace each row's data with the next column over, again and again? Final result wanted:
Column A  Column B       Column C  Column D  Column E   Column F
WB-002    "Brown Sales"  14A       140000    12/5/2015  12/5/2016
WB-002    "Johnson Inc"  24B       150000    12/5/2015  2/5/2016
WB-005    "Sonoma Inc"   26C       300000    7/30/2015  7/30/2016
This is too long for a comment.
I don't think SQL Server understands the real CSV format (unless more recent versions have seen improvements in this regard). Alas. You should try re-importing the data (okay fingers, don't type Postgres which does understand CSV).
If the file is small enough, then load it into Excel and save it with tab delimiters -- or something that is not a comma. Then you can bring it into SQL Server correctly.
If it is larger, I'm not sure what to do (I guess when I've faced this problem, Excel has always come to the rescue). Depending on your skills, you could pre-process in a language such as Python, grep, or PowerShell. Or you could load each line into SQL Server as a string and then do all the parsing in SQL (not trivial either).
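For instance, a minimal Python pre-processing sketch (file names here are hypothetical; it assumes a comma-delimited source with quoted fields) that re-emits the file as tab-delimited, which SQL Server can then import cleanly:

import csv

# The csv module parses quoted, comma-containing fields correctly,
# then writes them back out with tab delimiters.
with open('source.csv', newline='') as src, open('clean.tsv', 'w', newline='') as dst:
    writer = csv.writer(dst, delimiter='\t')
    for row in csv.reader(src):
        writer.writerow(row)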
In the meantime, let Microsoft know that the most common export format from their Excel product should be able to be imported into their database product.