H2O cluster does not share resources across nodes while training an XGBoost model


I am using XGBoost for my model, and I have noticed that the H2O cluster does not share memory during training. RAM utilization on the master A server is very high, while on master B it is very low. I checked the H2O logs on both servers: the master A log file is updated continuously during model training, but the master B log file is not. It shows only the cluster-creation entries.

Sometimes the H2O jar process on master A goes down during model training because of high memory usage.

I am using H2O version 3.36.1.1 with a two-node cluster. The cluster was created successfully, and the cluster details were written to the log file.

I have checked connectivity between master A and master B by running curl from both sides. Everything works, and the cluster forms correctly.

  • H2O_cluster_uptime: 15 mins 14 secs
  • H2O_cluster_timezone: Asia/Colombo
  • H2O_data_parsing_timezone: UTC
  • H2O_cluster_version: 3.36.1.1
  • H2O_cluster_version_age: 11 months and 28 days !!!
  • H2O_cluster_name: XXXXXX
  • H2O_cluster_total_nodes: 2
  • H2O_cluster_free_memory: 43.36 Gb
  • H2O_cluster_total_cores: 30
  • H2O_cluster_allowed_cores: 30
  • H2O_cluster_status: locked, healthy
  • H2O_connection_url: http://localhost:54321
  • H2O_connection_proxy: {"http": null, "https": null}
  • H2O_internal_security: False
  • Python_version: 3.7.11 final
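
The summary above reports only cluster-wide totals, so it cannot show how memory is split between master A and master B. A minimal sketch for comparing the two nodes follows, assuming the `GET /3/Cloud` REST endpoint returns a `nodes` list whose entries carry `h2o`, `free_mem`, `max_mem` and `healthy` fields. These field names are my understanding of the CloudV3 schema and should be checked against the running version. The helper names `fetch_cloud` and `summarize_nodes` are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_cloud(base_url, timeout=10):
    """Fetch cluster status from one node's REST API (GET /3/Cloud)."""
    with urlopen(f"{base_url.rstrip('/')}/3/Cloud", timeout=timeout) as resp:
        return json.load(resp)


def summarize_nodes(cloud):
    """Return (node name, free GiB, max GiB, healthy) for each cluster member.

    Assumes each entry of cloud["nodes"] has h2o, free_mem, max_mem and
    healthy fields; missing fields fall back to None or 0.
    """
    gib = 1024 ** 3
    rows = []
    for node in cloud.get("nodes", []):
        rows.append((
            node.get("h2o"),
            round(node.get("free_mem", 0) / gib, 2),
            round(node.get("max_mem", 0) / gib, 2),
            node.get("healthy"),
        ))
    return rows
```

Calling `summarize_nodes(fetch_cloud("http://localhost:54321"))` before and during training would show whether only one node's free memory drops.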

Could anyone please help me troubleshoot these issues?

Why do the two servers not share resources during model training?

Why is the H2O log on master B not updated?

Why does the H2O jar on master A go down under high memory usage?

Master A log

            main  INFO water.default: Open H2O Flow in your web browser: http://xxx.xxx.xxx.xx:54321
        main  INFO water.default: 
   FJ-126-15  INFO water.default: Cloud of size 2 formed [master01.user.com/xxx.xxx.xxx.xx:54321, master02.user.com/xxx.xxx.xxx.xx:54321]
  058452-166  INFO water.default: GET /3/Metadata/schemas/CloudV3, parms: {}
  058452-166  INFO water.default: Locking cloud to new members, because water.api.schemas3.MetadataV3
  4058452-14  INFO water.default: GET /3/Metadata/schemas/H2OErrorV3, parms: {}
  4058452-15  INFO water.default: GET /3/Metadata/schemas/H2OModelBuilderErrorV3, parms: {}
  4058452-18  INFO water.default: POST /4/sessions, parms: {}
  4058452-16  INFO water.default: POST /99/Rapids, parms: {ast=(setTimeZone "UTC"), session_id=_sid_a391}
  4058452-13  INFO water.default: DELETE /3/DKV, parms: {}
  4058452-13  INFO water.default: Removing all objects
  4058452-13  INFO water.default: Finished removing objects
  4058452-12  INFO water.default: DELETE /3/DKV, parms: {}
  4058452-12  INFO water.default: Removing all objects
  4058452-12  INFO water.default: Finished removing objects
  058452-170  INFO water.default: DELETE /3/DKV, parms: {}
  058452-170  INFO water.default: Removing all objects
  058452-170  INFO water.default: Finished removing objects
  4058452-14  INFO water.default: GET /3/Metadata/schemas/CloudV3, parms: {}
  058452-169  INFO water.default: GET /3/Metadata/schemas/H2OErrorV3, parms: {}
  058452-166  INFO water.default: GET /3/Metadata/schemas/H2OModelBuilderErrorV3, parms: {}
  4058452-19  INFO water.default: POST /4/sessions, parms: {}
  4058452-18  INFO water.default: POST /99/Rapids, parms: {ast=(setTimeZone "UTC"), session_id=_sid_bfac}
  058452-170  INFO water.default: Reading byte InputStream into Frame:
  058452-170  INFO water.default:     frameKey:    upload_bbcd4f6aeb3c1095e63f66a89cdd4756
  058452-170  INFO water.default:     totalChunks: 2
  058452-170  INFO water.default:     totalBytes:  4404663
  058452-170  INFO water.default:     Success.
  058452-167  INFO water.default: POST /3/ParseSetup, parms: {single_quotes=False, source_frames=["upload_bbcd4f6aeb3c1095e63f66a89cdd4756"], check_header=0}
  058452-169  INFO water.default: Total file size: 4.2 MB
  058452-169  INFO water.default: Parse chunk size 4194304
     FJ-1-15  INFO water.default: Parse result for Key_Frame__upload_bbcd4f6aeb3c1095e63f66a89cdd4756.hex (2023 rows, 436 columns):
     FJ-1-15  INFO water.default:                               ColV2    type          min          max         mean        sigma         NAs constant cardinality
     FJ-1-15  INFO water.default:                                COL1:  factor    011022232    YA9854024                                                  1334
     FJ-1-15  INFO water.default:                      COL2: numeric      2019.00      2020.00      2019.70     0.457960                            
     FJ-1-15  INFO water.default:                     COL3: numeric      1.00000      12.0000      6.07860      2.82287                            
     FJ-1-15  INFO water.default:                         COL4:  factor |00011000813 |09988000074                                                  1334
     FJ-1-15  INFO water.default:                       COL5:  factor    CUST NAME     CUSTOMER                                                     2
     FJ-1-15  INFO water.default:                         COL6: numeric  1.14005e+08  4.10024e+08  2.96146e+08  4.57328e+07                            
     FJ-1-15  INFO water.default:                    COL7: numeric      10000.0      30000.0      28294.6      5573.93           3                
     FJ-1-15  INFO water.default:                     COL8:  factor                       USD                                                     4
     FJ-1-15  INFO water.default:                              COL9:  factor          927         RM17                                                    20
     FJ-1-15  INFO water.default:               COL10:  factor           NO          YES                                                     2
     FJ-1-15  INFO water.default: Additional column information only sent to log file...
     FJ-1-15  INFO water.default:                COL11: numeric     -1.00000      175.250      1.07602      5.07740                            
     FJ-1-15  INFO water.default:                COL12: numeric     -1.00000      97.2262     0.447662      3.19167                            
     FJ-1-15  INFO water.default:                COL13: numeric     -1.00000      124.206      1.03933      3.94221                            
     FJ-1-15  INFO water.default:                      response_class:  factor           1A to_be_filled                                                     5
     FJ-1-15  INFO water.default:                    response_class_5:  factor           1B          1B1                                                     2
     FJ-1-15  INFO water.default:                    response_class_4:  factor           1A NON_PERFORME                                                     4
     FJ-1-15  INFO water.default:                    response_class_3:  factor           1A NON_PERFORME                                                     4
     FJ-1-15  INFO water.default:                    response_class_2:  factor           1A NON_PERFORME                                                     4
     FJ-1-15  INFO water.default:                    response_class_1:  factor           1A NON_PERFORME                                                     4
     FJ-1-15  INFO water.default:                              subset:  factor         test        train                                                     2
     FJ-1-15  INFO water.default: Chunk compression summary:
     FJ-1-15  INFO water.default:   Chunk Type                 Chunk Name       Count  Count Percentage        Size  Size Percentage
     FJ-1-15  INFO water.default:          C0L              Constant long          74           8.486 %      5.8 KB          0.207 %
     FJ-1-15  INFO water.default:          CBS                     Binary          19           2.179 %      4.4 KB          0.159 %
     FJ-1-15  INFO water.default:          CXI            Sparse Integers          80           9.174 %     25.0 KB          0.897 %
     FJ-1-15  INFO water.default:          CXF               Sparse Reals          50           5.734 %     48.9 KB          1.753 %
     FJ-1-15  INFO water.default:           C1            1-Byte Integers           7           0.803 %     11.8 KB          0.423 %
     FJ-1-15  INFO water.default:          C1N  1-Byte Integers (w/o NAs)          92          10.550 %    104.0 KB          3.731 %
     FJ-1-15  INFO water.default:          C1S           1-Byte Fractions         142          16.284 %    118.4 KB          4.245 %
     FJ-1-15  INFO water.default:           C2            2-Byte Integers          72           8.257 %    231.7 KB          8.309 %
     FJ-1-15  INFO water.default:          C2S           2-Byte Fractions          18           2.064 %     22.9 KB          0.822 %
     FJ-1-15  INFO water.default:           C4            4-Byte Integers          50           5.734 %    109.1 KB          3.913 %
     FJ-1-15  INFO water.default:          C4S           4-Byte Fractions         127          14.564 %    360.5 KB         12.925 %
     FJ-1-15  INFO water.default:           C8            8-byte Integers           1           0.115 %     15.0 KB          0.539 %
     FJ-1-15  INFO water.default:          CUD               Unique Reals           5           0.573 %     13.2 KB          0.472 %
     FJ-1-15  INFO water.default:          C8D               64-bit Reals         135          15.482 %      1.7 MB         61.606 %
     FJ-1-15  INFO water.default: Frame distribution summary:
     FJ-1-15  INFO water.default:                             Size  Number of Rows  Number of Chunks per Column  Number of Chunks
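
The chunk counts in the compression summary above add up to 872, which for 436 columns is 2 chunks per column. H2O distributes a frame across nodes chunk by chunk, so a frame with this few chunks leaves little to spread. Whether that explains the imbalance seen here is an assumption, not something the logs confirm. A small sketch for repeating this arithmetic on other log files follows. The helper names and the regular expression are hypothetical and match only the row layout shown above.

```python
import re

# Matches the tail of a "Chunk compression summary" row, for example:
#   C8D   64-bit Reals   135   15.482 %   1.7 MB   61.606 %
# and captures the Count column (135 in this row).
ROW = re.compile(r"\s(\d+)\s+[\d.]+ %\s+[\d.]+ [KMG]?B\s+[\d.]+ %\s*$")


def total_chunks(log_lines):
    """Sum the Count column over all chunk-summary rows in the log."""
    total = 0
    for line in log_lines:
        match = ROW.search(line)
        if match:
            total += int(match.group(1))
    return total


def chunks_per_column(log_lines, n_columns):
    """Average number of chunks per column for the parsed frame."""
    return total_chunks(log_lines) / n_columns
```

With the master A log read into a list of lines, `chunks_per_column(lines, 436)` reproduces the figure above.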

Master B log

    main  INFO water.default: H2O started in 4906ms
     main  INFO water.default: 
     main  INFO water.default: Open H2O Flow in your web browser: http://xxx.xxx.xxx.xx:54321
     main  INFO water.default: 
FJ-126-15  INFO water.default: Cloud of size 2 formed [master01.user.com/xxx.xxx.xxx.xx:54321, master02.user.com/xxx.xxx.xxx.xx:54321]
FJ-123-15  INFO water.default: Locking cloud to new members, because Class Id=56
  FJ-2-15  INFO water.default: Key upload_bbcd4f6aeb3c1095e63f66a89cdd4756 will be parsed using method DistributedParse.
  FJ-2-21  INFO water.default: Key upload_902bcdd31a4aea9f65690f1bc6074886 will be parsed using method DistributedParse.