How to remove quotes from a column value in a PySpark DataFrame

I have a CSV file with quotes in the column values. How do I remove those quotes from the column values? For example:
+--------+------+------+
|sample  |id    |status|
+--------+------+------+
|00000001|'1111'|'yes' |
|00000002|'1222'|'no'  |
|00000003|'1333'|'yes' |
+--------+------+------+
When I read it, I should get a DataFrame like the one below, without the single quotes:
+--------+------+------+
|sample  |id    |status|
+--------+------+------+
|00000001| 1111 | yes  |
|00000002| 1222 | no   |
|00000003| 1333 | yes  |
+--------+------+------+

While loading the CSV data, you can specify the options below and Spark will automatically parse away the quotes.
Check the code below.
spark \
    .read \
    .option("quote", "'") \
    .option("escape", "'") \
    .csv("<path to directory>")

Related

How to get the part of a string before the last delimiter in AWS Athena

Suppose I have the following table in AWS Athena
+----------------+
| Thread         |
+----------------+
| poll-23        |
| poll-34        |
| pool-thread-24 |
| spartan.error  |
+----------------+
I need to extract the part of the string from the column before the last delimiter (here '-' is the delimiter).
Basically, I need a query which can give me output like:
+----------------+
| Thread         |
+----------------+
| poll           |
| poll           |
| pool-thread    |
| spartan.error  |
+----------------+
Also, I need a GROUP BY query which can generate this:
+---------------+-------+
| Thread        | Count |
+---------------+-------+
| poll          | 2     |
| pool-thread   | 1     |
| spartan.error | 1     |
+---------------+-------+
I tried various forms of MySQL queries using the LEFT(), RIGHT(), LOCATE(), and SUBSTRING_INDEX() functions, but it seems that Athena does not support all of these functions.
You could use regexp_replace() to remove the part of the string that follows the last '-':
select regexp_replace(thread, '-[^-]*$', ''), count(*)
from mytable
group by regexp_replace(thread, '-[^-]*$', '')
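
If it helps to see what the pattern does, here is a quick check of the same regular expression using Python's re module (illustration only; the Athena query above is what you would actually run):

import re

threads = ["poll-23", "poll-34", "pool-thread-24", "spartan.error"]

# '-[^-]*$' matches the last '-' and everything after it; strings without
# a '-' (such as 'spartan.error') are left unchanged.
for t in threads:
    print(re.sub(r"-[^-]*$", "", t))
# poll
# poll
# pool-thread
# spartan.error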

Keep the extra whitespace in the display of a pandas DataFrame in a Jupyter notebook

In a Jupyter notebook, extra whitespace in a DataFrame is removed when it is displayed. But sometimes that is not preferred, e.g.
import pandas as pd

df = pd.DataFrame({'A': ['a   b', 'c'], 'B': [1, 2]})
df
The result I get:
|   | A   | B |
|---|-----|---|
| 0 | a b | 1 |
| 1 | c   | 2 |
But I want:
|   | A     | B |
|---|-------|---|
| 0 | a   b | 1 |
| 1 | c     | 2 |
Is it possible? Thanks
It's actually HTML: pandas dutifully writes all the spaces into the HTML markup (the front-end format used by Jupyter Notebook). HTML, by default, collapses multiple adjacent whitespace characters into one. Use the style object to change this:
df.style.set_properties(**{'white-space': 'pre'})
You unfortunately can't change the default render style of a DataFrame yet. You can write a function to wrap that line:
def print_df(df):
    return df.style.set_properties(**{'white-space': 'pre'})

print_df(df)
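
If you only need to preserve whitespace in certain columns, Styler.set_properties also accepts a subset argument; a minimal sketch reusing the example DataFrame from the question:

import pandas as pd

df = pd.DataFrame({'A': ['a   b', 'c'], 'B': [1, 2]})

# Apply the CSS only to column 'A'; the returned Styler renders in a
# notebook cell with the extra whitespace preserved.
df.style.set_properties(subset=['A'], **{'white-space': 'pre'})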

How to import an Excel table with double headers into an Oracle database

I have this Excel table that I am trying to transfer over to an Oracle database. The thing is that the table has headers that overlap, and I'm not sure if there is a way to import it nicely into an Oracle database.
+-----+-----------+-----------+
|     | 2018-01-01| 2018-01-02|
|Item +-----+-----+-----+-----+
|     | RMB | USD | RMB | USD |
+-----+-----+-----+-----+-----+
|     |     |     |     |     |
+-----+-----+-----+-----+-----+
|     |     |     |     |     |
+-----+-----+-----+-----+-----+
|     |     |     |     |     |
+-----+-----+-----+-----+-----+
|     |     |     |     |     |
+-----+-----+-----+-----+-----+
The top headers are just the dates for the month, followed by their respective data for that date. Is there a way to nicely transfer this to an Oracle table?
EDIT: Date field is an actual date such as 02/19/2018.
If you pre-create a table (as I do), then you can start loading from the 3rd line (i.e. skip the first two), putting every Excel column into the appropriate Oracle table column.
Alternatively (and obviously), rename the column headers so that the file doesn't have two header levels.
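
If you would rather reshape in code than edit the spreadsheet, here is a hedged pandas sketch that flattens the two header rows into single column names before loading (a different tool than the Oracle loader itself; the file name is hypothetical, and the exact header handling depends on how the merged cells come through):

import pandas as pd

# Hypothetical file name; header=[0, 1] reads both header rows as a MultiIndex.
df = pd.read_excel('report.xlsx', header=[0, 1])

# Collapse ('2018-01-01', 'RMB') -> '2018-01-01_RMB' so that each Excel column
# maps onto exactly one Oracle column. The 'Unnamed' placeholders pandas
# generates for blank header cells are dropped from the joined name.
df.columns = ['_'.join(str(part) for part in col if 'Unnamed' not in str(part))
              for col in df.columns]

print(df.columns.tolist())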

Let pandas use 0-based row number as index when reading Excel files

I am trying to use pandas to process a series of XLS files. The code I am currently using looks like:
with pandas.ExcelFile(data_file) as xls:
    data_frame = pandas.read_excel(xls, header=[0, 1], skiprows=2, index_col=None)
And the format of the XLS file looks like
+---------------------------------------------------------------------------+
|                                  REPORT                                   |
+---------------------------------------------------------------------------+
| Unit: 1000000 USD                                                         |
+---------------------------------------------------------------------------+
|        |         |             |               |         Balance          |
+   ID   + Branch  + Customer ID + Customer Name +--------------------------+
|        |         |             |               | Daily | Monthly | Yearly |
+--------+---------+-------------+---------------+-------+---------+--------+
| 111111 | Branch1 | 1           | Company A     | 10    | 5       | 2      |
+--------+---------+-------------+---------------+-------+---------+--------+
| 222222 | Branch2 | 2           | Company B     | 20    | 25      | 20     |
+--------+---------+-------------+---------------+-------+---------+--------+
| 111111 | Branch1 | 3           | Company C     | 30    | 35      | 40     |
+--------+---------+-------------+---------------+-------+---------+--------+
Even though I explicitly passed index_col=None, pandas still takes the ID column as the index. I am wondering what the right way is to make the row numbers the index.
pandas currently doesn't support parsing MultiIndex columns without also parsing a row index. There is a related issue; it probably could be supported, but this gets tricky to define in an unambiguous way.
It's a hack, but the easiest way to work around this right now is to add a blank column on the left side of the data, then read it in like this:
pd.read_excel('file.xlsx', header=[0,1], skiprows=2).reset_index(drop=True)
Edit:
If you can't / don't want to modify the files, a couple options are:
If the data has a known / common header, use pd.read_excel(..., skiprows=4, header=None) and assign the columns yourself, as suggested by @ayhan (see the sketch after this list).
If you need to parse the header, use pd.read_excel(..., skiprows=2, header=0), then munge the second level of labels into a MultiIndex. This will probably mess up dtypes, so you may also need to do some typecasting (pd.to_numeric) as well.
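
A minimal sketch of the first option, assuming the layout shown above (the file name and the column names are illustrative):

import pandas as pd

# Skip the report title, the unit line, and the two header rows (4 rows total),
# then read the remaining rows with no header so pandas keeps a 0-based RangeIndex.
df = pd.read_excel('report.xls', skiprows=4, header=None)

# Assign column labels yourself; names here are taken from the sample layout.
df.columns = ['ID', 'Branch', 'Customer ID', 'Customer Name',
              'Daily', 'Monthly', 'Yearly']

print(df.index)   # RangeIndex(start=0, stop=3, step=1) for the sample data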

Removing special characters using Hive

I have data stored in Cassandra 1.2, as shown below. There is a special character in sValue (the trailing \x00). How can I use a Hive function to remove it?
Date                     | Timestam                 | payload_Timestamp        | actDate    | actHour | actMinute | sDesc | sName                | sValue
-------------------------+--------------------------+--------------------------+------------+---------+-----------+-------+----------------------+-------------------
2014-06-25 00:00:00-0400 | 2014-06-25 08:31:23-0400 | 2014-06-25 08:31:23-0400 | 06-25-2014 | 8       | 31        | lable | /t1/t2/100/200/11/99 | 2743326591.03\x00
You can use the regexp_replace() function.
More details are available at
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
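
In Hive the call would be along the lines of regexp_replace(sValue, '[^\\x20-\\x7E]', ''), which keeps only printable ASCII; treat the exact escaping as an assumption. As a quick illustration of what that pattern does, using Python's re module:

import re

raw = "2743326591.03\x00"            # the value above, with a trailing NUL byte

# Remove anything outside the printable ASCII range (space through tilde).
clean = re.sub(r"[^\x20-\x7E]", "", raw)
print(repr(clean))                    # '2743326591.03'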