Remove rows with today's date in pandas

I have a df with several columns, and I want to remove the rows whose datetime falls on today's date.
col1 col2                    col3 col4
ABC  2022-08-12 00:03:29.872 123  A1B2
BCD  2022-08-12 00:02:08.067 234  B1C2
CDE  2022-08-11 23:57:24.208 345  C1D2
DEF  2022-08-11 23:56:55.257 456  D1E2
Expected result (assuming today's date is 12 August 2022):
col1 col2                    col3 col4
CDE  2022-08-11 23:57:24.208 345  C1D2
DEF  2022-08-11 23:56:55.257 456  D1E2
I tried the following:
df[pd.to_datetime(df.col2, errors='coerce') < pd.to_datetime('today')]
but it does not work; I still get rows from today. Can someone please help me with this?

Your comparison fails because pd.to_datetime('today') includes the current time, so rows from earlier today still compare as less than it. Compare calendar dates instead, using Series.dt.date with != and Timestamp.date():
df = df[pd.to_datetime(df.col2, errors='coerce').dt.date != pd.to_datetime('today').date()]
print(df)

  col1 col2                    col3 col4
2 CDE  2022-08-11 23:57:24.208 345  C1D2
3 DEF  2022-08-11 23:56:55.257 456  D1E2
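For reference, a minimal self-contained sketch of the fix, built from the question's sample data (run on 12 August 2022 it reproduces the expected output; on any other day all four rows survive the filter):

import pandas as pd

df = pd.DataFrame({
    'col1': ['ABC', 'BCD', 'CDE', 'DEF'],
    'col2': ['2022-08-12 00:03:29.872', '2022-08-12 00:02:08.067',
             '2022-08-11 23:57:24.208', '2022-08-11 23:56:55.257'],
    'col3': [123, 234, 345, 456],
    'col4': ['A1B2', 'B1C2', 'C1D2', 'D1E2'],
})

# pd.to_datetime('today') carries the current time, so the original '<'
# comparison still let rows from earlier today through. Comparing the
# .dt.date parts keeps only rows whose calendar date is not today's.
today = pd.to_datetime('today').date()
df = df[pd.to_datetime(df['col2'], errors='coerce').dt.date != today]
print(df)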

Related

Making a list of categorical columns with more than a given number of unique values in pandas

I have a df with categorical, numeric, and date columns. I want to make a list of all categorical columns that have more than 2 unique values. My df is something like this:
date_time1              date_time2              cat_col1 cat_col_2 num_col1 num_col2 cat_col3
2020-10-08 19:09:21.884 2021-11-08 15:18:26.864 ABC      xyz       20       40       PQR
2020-10-08 19:09:21.884 2021-11-08 15:18:26.864 BCD      xyz       30       50       ABC
2020-10-08 19:09:21.884 2021-11-08 15:18:26.864 ABC      yza       40       30       MNO
2020-10-08 19:09:21.884 2021-11-08 15:18:26.864 CDE      xyz       10       80       CDE
2020-10-08 19:09:21.884 2021-11-08 15:18:26.864 BCD      xyz       20       70       MNO
I now want a list of only the categorical column names that have more than 2 unique values. In this case it should be
mylist = ['cat_col1', 'cat_col3']
Can someone please help me with this?
If you want to select the columns just by name:
[col for col in df.columns if col.startswith('cat_') and df[col].nunique() > 2]
Result:
['cat_col1', 'cat_col3']
If you want to select by type:
[col for col in df.select_dtypes(include='category').columns if df[col].nunique() > 2]
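For reference, a minimal self-contained sketch combining both approaches on data shaped like the question's. Note that select_dtypes(include='category') only matches columns explicitly cast to the category dtype, so plain object/string columns must be converted first:

import pandas as pd

df = pd.DataFrame({
    'cat_col1': ['ABC', 'BCD', 'ABC', 'CDE', 'BCD'],
    'cat_col_2': ['xyz', 'xyz', 'yza', 'xyz', 'xyz'],
    'num_col1': [20, 30, 40, 10, 20],
    'num_col2': [40, 50, 30, 80, 70],
    'cat_col3': ['PQR', 'ABC', 'MNO', 'CDE', 'MNO'],
})

# By name prefix: columns starting with 'cat_' with more than 2 unique values.
mylist = [col for col in df.columns
          if col.startswith('cat_') and df[col].nunique() > 2]
print(mylist)  # ['cat_col1', 'cat_col3']

# By dtype: cast the string columns to 'category' first, then filter.
cat_cols = ['cat_col1', 'cat_col_2', 'cat_col3']
df[cat_cols] = df[cat_cols].astype('category')
print([col for col in df.select_dtypes(include='category').columns
       if df[col].nunique() > 2])  # ['cat_col1', 'cat_col3']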

Check sorting by year and quarter in a pandas DataFrame

I have a df that looks like the one below:
      date  col1  col2
0  2000 Q1   123   456
1  2000 Q2   234   567
2  2000 Q3   345   678
3  2000 Q4   456   789
4  2001 Q1   567   890
The df has over 200 rows. I need to:
- check if the data is sorted by date
- if not, sort it by date
Can someone please help me with this?
Many thanks
Use DataFrame.sort_values with the key parameter, converting the values to datetimes:
df = df.sort_values('date', key=lambda x: pd.to_datetime(x.str.replace(r'\s+', '', regex=True)))
print(df)

      date  col1  col2
0  2000 Q1   123   456
1  2000 Q2   234   567
2  2000 Q3   345   678
3  2000 Q4   456   789
4  2001 Q1   567   890
EDIT: You can use Series.is_monotonic_increasing to test whether the values are monotonically increasing (the older Series.is_monotonic alias was removed in pandas 2.0):
if not df['date'].is_monotonic_increasing:
    df = df.sort_values('date', key=lambda x: pd.to_datetime(x.str.replace(r'\s+', '', regex=True)))
You can also convert your date column to a pd.Index (or define it as the index of your dataframe):
if not pd.Index(df['date']).is_monotonic_increasing:
    df = df.sort_values('date')
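Putting the check and the sort together, a minimal sketch with the question's sample data shuffled out of order (pd.to_datetime understands quarter strings such as '2000Q1' once the space is stripped):

import pandas as pd

df = pd.DataFrame({
    'date': ['2000 Q2', '2001 Q1', '2000 Q1', '2000 Q4', '2000 Q3'],
    'col1': [234, 567, 123, 456, 345],
    'col2': [567, 890, 456, 789, 678],
})

# '2000 Q1' -> Timestamp('2000-01-01') after removing the space.
key = lambda x: pd.to_datetime(x.str.replace(r'\s+', '', regex=True))

# Sort only if the parsed dates are not already in increasing order.
if not key(df['date']).is_monotonic_increasing:
    df = df.sort_values('date', key=key)
print(df)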

Remove multiple strings at a time in SQL

I want to remove 'hi', 'by', and 'dy' from col2 in one shot in SQL. I'm very new to SQL Server; if anyone could give an outline of how such problems are solved, that would be really helpful.
Col1 col2     col3
A    hi!abcd  123
B    bypython 678
C    norm     888
D    dupty dy 999
output:
Col1 col2   col3
A    abcd   123
B    python 678
C    norm   888
D    dupty  999

Linux: remove space/column if matching a string pattern

I would like awk or sed, or any other display-filter mechanism available in a native shell, to remove the space from lines that match, leave the space between the two strings (columns) untouched when there is no match, and then display the output.
My file:
# cat test2
col0 col1 col2 col3 col4 col5
ln1 abc def ghi jkl mno
ln2 abc defghi jkl mno pqr
The output I would like is:
# <something to view/grep/filter with> test2
col0 col1 col2 col3 col4 col5
ln1 abc defghi jkl mno
ln2 abc defghi jkl mno pqr
I have tried multiple combinations of grep, awk, and cut, but was not able to do it. I am not good with sed, but I can try. I even tried using an interim file, i.e. echoing the output to a file and then grepping it, but I failed at that too.
Edited with more of my requirements:
My biggest problem is that I can't predict where the space will be or what the contents of the row entries will be, so I would like sed to produce the output based on column counts rather than on a specific string.
My file:
# cat test2
col0 col1 col2 col3 col4 col5
ln1 abc def ghi jkl mno
ln2 abc defghi jkl mno pqr
ln3 aaa bbb ccc ddd eee
ln4 aaa bbbccc ddd eee
Output File:
# <something to view/grep/filter with> test2
col0 col1 col2 col3 col4 col5
ln1 abc def ghi jkl mno
ln2 abc defghi jkl mno pqr
ln3 aaa bbbccc ddd eee
ln4 aaa bbbccc ddd eee
sed 's/def ghi/defghi/' file
If that's not what you wanted then edit your question to clarify your requirements and provide input/output that better demonstrates your problem.

Parse Hive array<string> output into JSON format or something like the following

I have a Hive table with 3 columns; col3 is an array<string>. The output from the table looks like this:
select col1, col2, col3 from testing;
col1 col2 col3
xyx 123 ["xyz","Good investing","123","abc","Bad investing","006","port123","future investing","008","flaf4","good research investing","01"]
xyx 789 ["xyz","Good investing","789","flag1","Bad investing","006","port123","future investing","008"]
I want to parse col3 so that the output looks like the following:
xyx 123 "xyz","Good investing","123"
xyx 123 "abc","Bad investing","006","port123"
xyx 123 "future investing","008","flaf4"
xyx 123 "good research investing","01"
xyx 789 "xyz","Good investing","789",
xyx 789 "flag1","Bad investing","006",
xyx 789 "port123","future investing","008"
Any help will be highly appreciated.
-kb