Does scikit-learn's train_test_split function duplicate the data? In other words, if I work with a large dataset X, y, does that mean that after running something like
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2023)
my data will use twice as much memory as the original dataset? Or is there some scikit-learn (or basic Python/NumPy) magic that prevents it? (E.g., calling .to_numpy() on a DataFrame does not necessarily duplicate the data.)
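
For context, this is the kind of no-copy behaviour I have in mind (a minimal sketch; whether a view or a copy comes back depends on the dtype layout and pandas version):

import numpy as np
import pandas as pd

# A single-dtype DataFrame, where pandas can hand back a view of its underlying block.
df = pd.DataFrame(np.arange(6, dtype=float).reshape(3, 2), columns=["a", "b"])
arr = df.to_numpy(copy=False)

# Prints True here if to_numpy() returned a view (no duplication);
# it may be False for mixed dtypes or other configurations.
print(np.shares_memory(arr, df.to_numpy(copy=False)))
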
If the memory use does double, what is the best practical way around this problem? Perhaps something like
X, X_test, y, y_test = train_test_split(X, y, test_size=0.2, random_state=2023)
so that the original X and y are no longer referenced and can be garbage-collected?
Remark
np.shares_memory(X_train, X) returns False, which suggests that the data is indeed duplicated.
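
Concretely, the check looks roughly like this (a minimal sketch with a small random array standing in for the real dataset):

import numpy as np
from sklearn.model_selection import train_test_split

# Small stand-in for the real dataset.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2023)

# Prints False: the split arrays do not share memory with the original X,
# since the row selection is done with NumPy fancy indexing, which returns copies.
print(np.shares_memory(X_train, X))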