how to show numeric data in seaborn - matplotlib

I am analyzing COVID-19 data in Seaborn. I have taken a specific state, Maharashtra, and passed x='Dates', y='Deaths', data=maha, and color='g' to the plot call.
But when I run it, the Date axis in my output becomes messed up.
How do I show the dates in a date format like 2020-05-03?
Please suggest how I can achieve this format.
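A minimal sketch of one common fix, assuming a dataframe named maha with 'Dates' and 'Deaths' columns as in the question: parse the dates as real datetimes rather than strings, then let matplotlib place and format the ticks.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# parse the string dates so matplotlib treats the x-axis as a date axis
maha['Dates'] = pd.to_datetime(maha['Dates'])

ax = sns.lineplot(x='Dates', y='Deaths', data=maha, color='g')
ax.xaxis.set_major_locator(mdates.AutoDateLocator())            # pick sensible tick positions
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))  # 2020-05-03 style labels
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()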

Related

Matplotlib output line chart looks "box like" (for lack of a better word) for monthly data sampled over a 30 year period

I am doing a very simple chart with Matplotlib and Python. It is 30 years' worth of monthly sampled data (PMI, the US Purchasing Managers' Index), around 400 monthly observations in all.
Sample monthly data:
Date        PMI
1/03/2022   57.1
1/02/2022   58.6
1/01/2022   57.6
1/12/2021   58.8
1/11/2021   60.6
1/10/2021   60.8
1/09/2021   60.5
I produced a very simple line chart with Matplotlib. The dataframe is named pmi. Date is not set as the index, but the dates are parsed as pandas datetimes.
plt.plot(pmi.Date, pmi.PMI, c='mediumblue', lw=0.8)
Output:
Why does the output look so box-like? It seems to me like it just doesn't capture all the data available in the dataframe. I'm sure it does, though, so is this a formatting issue? How do you smooth this output line out to remove the sharp, edge-like breaks?
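The sample rows above run newest-first, so one likely cause (a guess, not confirmed by the question) is that the dates are unsorted or still strings, which makes the line double back on itself. A minimal sketch under that assumption:

import pandas as pd
import matplotlib.pyplot as plt

# parse the day-first dates explicitly, then sort so the line
# moves monotonically through time instead of zigzagging
pmi['Date'] = pd.to_datetime(pmi['Date'], dayfirst=True)
pmi = pmi.sort_values('Date')

plt.plot(pmi.Date, pmi.PMI, c='mediumblue', lw=0.8)
plt.show()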

Pandas create graph from Date and Time while them being in different columns

My data looks like this:
   Creation Day  Time St1  Time St2
0  28.01.2022    14:18:00  15:12:00
1  28.01.2022    14:35:00  16:01:00
2  29.01.2022    00:07:00  03:04:00
3  30.01.2022    17:03:00  22:12:00
It represents parts being at a given station. What I now need is something that counts how many rows share the same day and hour, e.g. how many parts were at the same station during a given hour.
Here, 2 were at Station 1 on the 28th in the 14:00-15:00 timespan.
In the end I want a bar graph that shows production speed. Additionally, later in the project I want to highlight parts that haven't moved for more than 2 hours.
Is it practical to create a datetime object for every station (I have 5 in total)? Or is there a much simpler way to do this?
FYI, I import this data from an Excel sheet.
I found the solution. As they are just strings, I can simply concatenate them and parse the result with pd.to_datetime().
Example:
df["Time St1"] = pd.to_datetime(
df["Creation Day"] + ' ' + df["Time St1"],
infer_datetime_format=False, format='%d.%m.%Y %H:%M:%S'
)
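For the counting step the question actually asks about, a sketch (column names as in the question): floor each parsed station timestamp to the hour, then count how many parts fall into each day-and-hour bucket.

import matplotlib.pyplot as plt

# bucket each part into its day+hour at Station 1 and count per bucket
per_hour = df["Time St1"].dt.floor("H").value_counts().sort_index()

per_hour.plot(kind="bar")  # bar graph of parts per hour, i.e. production speed
plt.show()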

Retain all dataframe columns when using spark map

I am trying to expand the Body JSON structure using map (as below), but I also need to keep the DateTime column. Currently only the expanded JSON columns are kept.
Do you know how to solve this?
jsonRdd = df.select(df.DateTime, df.Body.cast("string").alias("json"))
jsonRdd = jsonRdd.rdd.map(lambda x : x.json)
data = spark.read.json(jsonRdd)
display(data)
Current output looks like:

name     age
j blogg  21

Expected output should be:

DateTime  name     age
4/6/2020  j blogg  21

Thank you.
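A sketch of one common way to keep DateTime, assuming df as in the question: infer the JSON schema once from the Body strings, then parse Body in place with from_json instead of rebuilding the dataframe from an RDD.

from pyspark.sql.functions import col, from_json

# infer the schema of the JSON payload from the Body column itself
json_schema = spark.read.json(
    df.select(col("Body").cast("string")).rdd.map(lambda r: r[0])
).schema

# parse Body alongside DateTime, then flatten the struct columns
data = (
    df.select("DateTime", from_json(col("Body").cast("string"), json_schema).alias("json"))
      .select("DateTime", "json.*")
)
display(data)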

Dendrograms with SciPy

I have a dataset that I shaped according to my needs, the dataframe is as follows:
Index A B C D ..... Z
Date/Time 1 0 0 0,35 ... 1
Date/Time 0,75 1 1 1 1
The total number of rows is 8878
What I am trying to do is create a time-series dendrogram (example: the whole A column is compared to the whole B column across the whole time range).
I am expecting an output like this: [dendrogram image, source: rsc.org]
I tried to construct the linkage matrix with Z = hierarchy.linkage(X, 'ward')
However, when I plot the dendrogram, it just shows an empty picture.
There is no problem if I compare every time point with every other time point and plot that, but then the dendrogram becomes far too complicated to read, even in truncated form.
Is there a way to handle the data as whole time series and compare the columns in SciPy?
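A minimal sketch of that idea, assuming the dataframe is named df with columns A..Z over 8878 time rows: transpose it so each column becomes a single observation whose feature vector is its whole time series, then cluster those observations.

import matplotlib.pyplot as plt
from scipy.cluster import hierarchy

# if the values use comma decimals (e.g. 0,35), convert them to floats first
X = df.T.values                        # shape: (n_columns, 8878)
Z = hierarchy.linkage(X, method='ward')

plt.figure(figsize=(10, 4))
hierarchy.dendrogram(Z, labels=df.columns.tolist())  # one leaf per column
plt.ylabel('Ward distance')
plt.show()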

Mapping column values to a combination of another csv file's information

I have a dataset that encodes date & time in a 5-digit format: ddd + hm.
The ddd part counts days from 2009 Jan 1. Since the data was collected over a two-year period from then, its [min, max] is [1, 365 x 2 = 730].
Data is observed at 30-minute intervals, so each 24-hour day splits into at most 48 slots; the [min, max] for hm is [1, 48].
The daycode.csv file maps the ddd part of a daycode to a date and the hm part to a time.
I think I agreed not to show the dataset itself, which is from ISSDA, so I will just describe that a daycode in the File1.txt file reads like '63317'.
This link gave me a glimpse of how to approach this problem, and I was in the middle of putting this code together, which of course won't work at this point.
consume = pd.read_csv("data/File1.txt", sep=' ', encoding="utf-8",
                      names=['meter', 'daycode', 'val'])
df1 = pd.read_csv("data/daycode.csv", encoding="cp1252", names=['code', 'print'])

# look at a single meter; .copy() avoids SettingWithCopyWarning below
test = consume[consume['meter'] == 1048].copy()
test['daycode'] = test['daycode'].map(df1.set_index('code')['print'])

plt.plot(test['daycode'], test['val'], '.')
plt.title('test of meter 1048')
plt.xlabel('daycode')
plt.ylabel('energy consumption [kWh]')
plt.show()
Not all units (there are thousands) have been observed for the full length, but 730 x 48 is too large a combination to lay out in Excel by hand. To be honest, dragging cells across is not an elegant solution anyway, and when I tried it, it didn't quite work.
If I could read the first 3 digits of the column values and match them against one column of the other file, and the last 2 digits against another column, then combine them... is there a way?
For the last 2 lines you can just do something like this:
# split the 5-digit daycode into its ddd and hm parts
df['first_3_digits'] = df['col1'].map(lambda x: str(x)[:3])
df['last_2_digits'] = df['col1'].map(lambda x: str(x)[-2:])
For joining the 2 dataframes:
df3 = df.merge(df2,
               left_on=['first_3_digits', 'last_2_digits'],
               right_on=['col1_df2', 'col2_df2'],
               how='left')
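If you then want a real timestamp rather than two string columns, a sketch under the question's stated encoding (ddd counts days from 2009-01-01 with ddd=1 as day one, and hm indexes 30-minute slots with hm=1 as 00:00):

import pandas as pd

# assumes the split columns from above; offsets are zero-based, hence the -1
df['timestamp'] = (
    pd.Timestamp('2009-01-01')
    + pd.to_timedelta(df['first_3_digits'].astype(int) - 1, unit='D')
    + pd.to_timedelta((df['last_2_digits'].astype(int) - 1) * 30, unit='m')
)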