What is a good way to identify the influence of independent variables on a dependent variable when using time-series data?


I have a dataset of environmental data collected by 17 sensors (each with its own device_id). The data was collected every 10 minutes over 3 months. I am trying to find the main variables that drive changes in the dependent variable (pm25).

The options I have considered are:

  • GEE (Generalized Estimating Equation) analysis
  • Random Forest
  • PCA

Rather than just analyzing the effect of IndVariableA and IndVariableB on DepVariableC, I would like the analysis to consider the date/time if possible and the device_id as a clustering factor. The Proximity variables are constant for a particular device_id.

I tried these in SPSS and in Python, but I am not able to interpret the results properly, and I am not sure whether the parameters I am entering are correct. This is a snippet of the data I am using.

   device_id       date     time   temp    hum  pm1  pm10  pm25  tvoc  eco2
0      14384  7/11/2021  4:11:00  25.92  68.67    4     7     7    93  1016
1      14384  7/11/2021  4:21:00  26.21  66.66    3     4     4    62   813
2      14389  7/11/2021  4:22:00  29.12  55.52    8    13    13     7   450
3      14392  7/11/2021  4:22:00  24.33  51.44    0     0     0     0   400
4      14389  7/11/2021  4:31:00  28.52  56.60    7    11    11    12   483

   pressure  AQI  ProximitytoPark  ProximitytoAve
0     98.20   23             0.06            0.07
1     98.19   13             0.06            0.07
2     96.97   42             0.52            0.16
3     97.46    0             0.03            1.00
4     96.97   35             0.52            0.16

   Proximitytohighway  ProximitytoTrainTracksBusway
0                0.49                          0.56
1                0.49                          0.56
2                0.32                          1.60
3                0.78                          2.20
4                0.32                          1.60
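
For context on how the date/time could enter an analysis, here is a minimal sketch that combines the `date` and `time` columns into one timestamp and derives per-device time features. It assumes the dates are month/day/year (the snippet alone does not settle this), and `timestamp`, `hour`, and `pm25_lag1` are hypothetical column names introduced only for illustration.

```python
import pandas as pd

# Three rows copied from the snippet above; only the columns needed here.
df = pd.DataFrame({
    "device_id": [14384, 14384, 14389],
    "date": ["7/11/2021", "7/11/2021", "7/11/2021"],
    "time": ["4:11:00", "4:21:00", "4:22:00"],
    "pm25": [7, 4, 13],
})

# Assumption: dates are month/day/year. Use "%d/%m/%Y" if they are day-first.
df["timestamp"] = pd.to_datetime(
    df["date"] + " " + df["time"], format="%m/%d/%Y %H:%M:%S"
)

# Order readings within each sensor before deriving time-based features.
df = df.sort_values(["device_id", "timestamp"]).reset_index(drop=True)

# Hour of day as a simple time-of-day feature.
df["hour"] = df["timestamp"].dt.hour

# Previous pm25 reading from the same sensor (NaN for each sensor's first row).
df["pm25_lag1"] = df.groupby("device_id")["pm25"].shift(1)

print(df)
```

Features like these would let a regression or tree-based model see time of day and short-term persistence, rather than treating every row as an unordered observation.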

I used the statsmodels api to obtain the following results, but I am not confident it is giving me a good result, as I would like them to be clustered by device_id.

Code:

model = smf.gee("pm25 ~ ProximitytoPark + ProximitytoAve + Proximitytohighway + ProximitytoTrainTracksBusway + temp + hum + device_id", "pm25", X, family = sm.families.Gaussian())

Results:

                        GEE Regression Results
Dep. Variable:                  pm25    No. Observations:    230793
Model:                           GEE    No. clusters:           196
Method:                  Generalized    Min. cluster size:        1
                Estimating Equations    Max. cluster size:    26232
Family:                     Gaussian    Mean cluster size:   1177.5
Dependence structure:   Independence    Num. iterations:         60
Date:               Mon, 19 Jun 2023    Scale:              226.346
Covariance type:              robust    Time:              16:57:21

                                  coef   std err        z    P>|z|    [0.025     0.975]
Intercept                     861.5242   357.682    2.409    0.016   160.481   1562.568
ProximitytoPark                 1.2047     4.722    0.255    0.799    -8.050     10.460
ProximitytoAve                  3.9539     1.870    2.114    0.034     0.288      7.619
Proximitytohighway             -2.1691     3.306   -0.656    0.512    -8.650      4.312
ProximitytoTrainTracksBusway   -2.4669     0.577   -4.274    0.000    -3.598     -1.336
temp                            0.7586     0.119    6.384    0.000     0.526      0.992
hum                             0.1702     0.054    3.170    0.002     0.065      0.275
device_id                      -0.0606     0.025   -2.409    0.016    -0.110     -0.011

Skew:             5.9555    Kurtosis:            179.9029
Centered skew:    0.2678    Centered kurtosis:     1.1036

Additionally, I have considered PCA and Random Forest, but I am not sure they are right for this, as they do not account for the time-series structure (if I understand correctly).
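
To make the Random Forest option concrete, here is a minimal sketch of one way time could be brought in: encode hour of day as cyclical features and measure permutation importance on sensors held out by device_id. The data is synthetic, and `hour_sin` and `hour_cos` are hypothetical feature names. This shows which features help prediction, which is not the same as a causal effect.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GroupKFold

# Synthetic stand-in data (hypothetical values).
rng = np.random.default_rng(0)
n_devices, n_obs = 17, 60
n = n_devices * n_obs
device_id = np.repeat(np.arange(n_devices), n_obs)
hour = rng.integers(0, 24, n)

features = pd.DataFrame({
    "temp": rng.normal(26, 2, n),
    "hum": rng.normal(60, 8, n),
    # Cyclical encoding so that hour 23 sits next to hour 0.
    "hour_sin": np.sin(2 * np.pi * hour / 24),
    "hour_cos": np.cos(2 * np.pi * hour / 24),
})
pm25 = (
    0.8 * features["temp"] + 0.2 * features["hum"]
    + 3 * features["hour_sin"] + rng.normal(0, 1, n)
).to_numpy()

# Hold out whole sensors in each fold so importance is scored on unseen devices.
fold_importances = []
for train_idx, test_idx in GroupKFold(n_splits=4).split(features, pm25, groups=device_id):
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(features.iloc[train_idx], pm25[train_idx])
    pi = permutation_importance(
        rf, features.iloc[test_idx], pm25[test_idx], n_repeats=5, random_state=0
    )
    fold_importances.append(pi.importances_mean)

# Average the per-fold importances and rank the features.
mean_importance = pd.Series(
    np.mean(fold_importances, axis=0), index=features.columns
).sort_values(ascending=False)
print(mean_importance)
```

Grouping the folds by sensor matters here because the Proximity variables are constant per device, so a random row-wise split would let the model identify each sensor from them.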

I would appreciate any help in identifying a statistical analysis method that will help me identify the factors that influence the dependent variable (pm25).
