Using Seaborn to plot pH over time with approximately 5000 points

58 views Asked by At

I am trying to learn more about Pandas/Matplotlib/Seaborn, by working with datasets at work. I would like to get to automating reports with Seaborn graphs, granted, that is probably overkill for basic graphs...but I like the way they look. I am struggling to get a simple graph to work. We pull data weekly, and its 4500-5000 datapoints. I want to work with pH and datetime to better understand working with large datasets.

Here is a snapshot of how I have initially set the data up. The csv file had a bunch of crap in top rows, so I set custom names for the columns I want to work with and removed the first 7 so that it starts with my data:

import pandas as pd
import seaborn as sns
import os
from datetime import datetime

df = pd.read_csv(r"Datasets\Effluent data export 12.28.2023 through 01.03.2024.csv")
df = df.drop(df.index[0:8]).rename(columns={df.columns[0]: 'Date', df.columns[1]: 'Level (in)', df.columns[2]: 'Flow Rate (gpm)', df.columns[3]: 'Total Flow (Gal)', df.columns[12]: 'pH', df.columns[13]: 'Temperature (C)'})
df = df.dropna(subset=['pH'])

I added the dropna for pH because using df.info() I noticed that there was one row of NA data and I don't know how to use that so I removed it. I am unsure how to work with NA as far as replacing it with something to maintain true data integrity, but I can figure that out later.

My graph is coming out with black lines all over the X and Y axis, which is probably due to it trying to show ALL of the ticks at once. I tried to solve this by changing the date from an object to datetime using:

df['Date'] = pd.to_datetime(df['Date'])

That got my date column to go from an object to a datetime64 type, and I have equal non-null counts for both of these columns now.

My issue now is that I don't know how to make a graph with this much data. I know that I need to go back and better understand Numpy, Matplotlib, and Pandas...but I am hoping someone can help me understand what it is that I actually need to learn/focus on for how to work with the graphs to show this data. Currently, I have tried these two things, and the result is ultimately the same in both cases with a line that doesn't look right, and no real graph:

sns.lineplot(data=df, x="Date", y="pH")
sns.scatterplot(data=df, x="Date", y="pH")

Here is an example of the scatterplot graph, which due to the pH being fairly tight, should look more like a parallel line and not one with a negative slope as shown:

Seaborn Scatterplot Fail

My ultimate goal is to create graphs for weeks, months, and years, for our facility data. But I don't know how to work with sets this large, or what graphs would be good. I want to look at water level, flow rate, pH, and temperature (see my .rename() for the actual names but I need to figure out how to make a meaningful graph of datetime vs pH for now so I can see how it works.

1

There are 1 answers

1
Daweo On

create graphs for weeks, months, and years

I suggest trying grouping by period, consider following simple example

import pandas as pd
df = pd.DataFrame({"when":[f"2024-1-{i}" for i in range(1,31)],"value":range(100,130)})
df["when"] = pd.to_datetime(df["when"])
df_weekly_median = df.groupby(df["when"].dt.to_period("W"))["value"].median()
print(df_weekly_median)

output

when
2024-01-01/2024-01-07    103.0
2024-01-08/2024-01-14    110.0
2024-01-15/2024-01-21    117.0
2024-01-22/2024-01-28    124.0
2024-01-29/2024-02-04    128.5
Freq: W-SUN, Name: value, dtype: float64

Explanation: I create dataframe with dates from 2024-1-1 to 2024-1-30 with values from 100 to 129 (range is inclusive-exclusive) then I convert it to datetime and I use groupby to aggregate data, in this case into weeks (W, see Period aliases for allowed values) and taking median of it.