I'm in the early stages of learning SkFlow/TensorFlow, so I'll lay out my understanding of what I'm trying to do, incorrect as it may be.
Let's imagine I'm trying to build a model to predict if a car will fail an emissions test.
My training and testing CSV might look something like this:
make, fuel, year, mileage, days since service, passed test
vw, diesel, 2015, 10000, 20, 0
honda, petrol, 2008, 1000000, 234, 1
So the pass/fail column would be my y, and the others my x.
So far, with Baltimore's help in my previous SO question, I've been able to process the Iris dataset from a CSV file. That dataset is all numbers, however.
This example on the TensorFlow website shows a model built with census data, using categorical and continuous data. I'm trying to use SkFlow as I understand it simplifies the process.
Anyway, on to my code:
import tensorflow as tf
from numpy import genfromtxt

# Columns 0-4 are the features, column 5 is the pass/fail label
x_train = genfromtxt('/Users/ben/Desktop/data.csv', dtype=None, delimiter=',', usecols=(0, 1, 2, 3, 4))
y_train = genfromtxt('/Users/ben/Desktop/data.csv', dtype='int', delimiter=',', usecols=(5))
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=1)]
classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[10, 20, 10],
                                            n_classes=2,
                                            model_dir="./tmp/model1")
# Fit model. Add your train data here
classifier.fit(x=x_train, y=y_train, steps=2000)
So my CSV data reads fine into my x_train and y_train objects. The CSV has no headers, but could have them if required.
I believe I need to define which columns hold which kind of data, something like:
make = tf.contrib.layers.sparse_column_with_hash_bucket("make", hash_bucket_size=1000)
fuel = tf.contrib.layers.sparse_column_with_keys(column_name="fuel", keys=["diesel", "petrol"])
How do I build the feature_columns object that gets passed into the classifier?
Here's my shot at it. The input_fn function creates a dict of tensors that is passed into the fit and evaluate methods via wrappers. That dict is used when creating the model. It defines the data that will be used. The other constant-value tensors are the data. They are what's passed in during model fitting with the feature_columns argument: feature_columns=[gear,mpg,cyl...].
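To make the "dict of columns" idea concrete, here is a TensorFlow-free sketch using the emissions CSV from the question. It only shows the shape of the data an input_fn would hand over: one named entry per column, plus a separate list of labels. The column names and the load_columns helper are my own assumptions (the CSV has no header row); in a real input_fn each list would still need to be wrapped in a tensor.

```python
import csv
import io

# Assumed column names: the CSV in the question has no header row.
COLUMNS = ["make", "fuel", "year", "mileage", "days_since_service", "passed_test"]
CATEGORICAL = ["make", "fuel"]
CONTINUOUS = ["year", "mileage", "days_since_service"]
LABEL = "passed_test"

SAMPLE_CSV = """vw, diesel, 2015, 10000, 20, 0
honda, petrol, 2008, 1000000, 234, 1
"""


def load_columns(text):
    """Split CSV text into a dict of named feature columns and a label list."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    features = {name: [] for name in CATEGORICAL + CONTINUOUS}
    labels = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(COLUMNS, (cell.strip() for cell in row)))
        for name in CATEGORICAL:
            features[name].append(record[name])         # keep strings as-is
        for name in CONTINUOUS:
            features[name].append(float(record[name]))  # numbers become floats
        labels.append(int(record[LABEL]))
    return features, labels


features, labels = load_columns(SAMPLE_CSV)
print(features["make"], features["mileage"], labels)
```

The keys of that dict are what the column definitions refer to by name, so "make" and "fuel" here line up with the sparse columns in the question.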
I left out all of the crossed columns stuff, but it could be put in.
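For anyone wondering what a crossed column does conceptually, here is a rough plain-Python sketch: the categorical values of each row are joined into one combined value, which is then hashed into a fixed number of buckets. The bucket_id and cross helpers and the MD5 hash are assumptions made for illustration only; TensorFlow uses its own hashing, so the bucket ids will not match.

```python
import hashlib


def bucket_id(value, num_buckets):
    """Map a string to a stable bucket index in [0, num_buckets)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def cross(columns, num_buckets):
    """Cross several categorical columns row by row into bucket ids."""
    return [bucket_id("_X_".join(values), num_buckets) for values in zip(*columns)]


makes = ["vw", "honda", "vw"]
fuels = ["diesel", "petrol", "diesel"]

# Rows with the same (make, fuel) pair always land in the same bucket.
crossed = cross([makes, fuels], num_buckets=1000)
print(crossed)
```

The point of the cross is that the model can learn a weight for "vw and diesel" together, rather than only separate weights for "vw" and for "diesel".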
I turned off WARNINGS, but if you want them, the switch is there. This also produces a surprising amount of log data, so be sure to check out the graphs with TensorBoard.