Trying to do something fairly straightforward with Blaze and my local Spark instance: loading a csv file with Blaze's into() and then using Blaze's by().
Python 3.4
Spark 1.4.0
Blaze 0.8.0
csv (simple.csv)
id,car
1,Mustang
2,Malibu
3,Mustang
4,Malibu
5,Murano
code
import blaze as bz
rdd = bz.into(sc, "simple.csv")
simple = bz.Data(rdd)
simple.count() #gives me 5 so far so good
bz.by(simple.car, count=simple.id.count()) #throws an error
AttributeError: 'InteractiveSymbol' object has no attribute 'car'
Any ideas on what's going on here?
Side note: this works
simple_csv = bz.Data("simple.csv")
bz.by(simple_csv.car, count=simple_csv.id.count())
car count
0 Malibu 2
1 Murano 1
2 Mustang 2
And so does this
simple_csv.car.count_values()
car count
0 Malibu 2
2 Mustang 2
1 Murano 1
Gotta be the way I'm "loading" it into Spark, right?
You'll want to create a Spark DataFrame (formerly SchemaRDD) using a SQLContext instead of creating a "raw" RDD with the SparkContext. RDDs don't have named columns, which you would need in order for the by operation to succeed. That is why the InteractiveSymbol did not have a car attribute: it was stripped away in the process of creating the RDD. Executing something along these lines in a Jupyter code cell (a sketch, which assumes into() accepts the SQLContext as its target the same way you passed sc):
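import blaze as bz
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[*]", "blaze-by-example")   # app name is arbitrary
sqlContext = SQLContext(sc)

# hand into() the SQLContext rather than the SparkContext, so the result
# is a DataFrame that keeps the csv's named columns
df = bz.into(sqlContext, "simple.csv")
simple = bz.Data(df)
simple.count()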
would produce a pyspark.sql.dataframe.DataFrame object and execute a program on the Spark driver to count the rows. At this point, you should be able to compute the group-by as you were trying to before:
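For example, the same expression from the question, now against the DataFrame-backed simple:

bz.by(simple.car, count=simple.id.count())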
BUT. There is a problem with Blaze, at least for me, as of today, running Blaze 0.9.0 with both Spark 1.6 and Spark 1.4.1. Likely this is not the same problem you had in the first place, but it is preventing me from reaching a working solution. I tried dropping Jupyter and running in a pyspark session directly. To do so yourself, you can omit a few of the lines above, since pyspark automatically creates sc and sqlContext:
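For instance, a sketch along the lines of the cell above, minus the context setup (sc and sqlContext here are the ones the pyspark shell creates for you):

import blaze as bz

df = bz.into(sqlContext, "simple.csv")   # sqlContext is provided by pyspark
simple = bz.Data(df)
bz.by(simple.car, count=simple.id.count())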
This produces an error. Even just trying to get an interactive view of simple (for example, evaluating simple by itself at the prompt) also produces an error. Anyway, there seems to be some activity in the Blaze project on GitHub related to upgrading support for Spark 1.6, so hopefully they'll get this stuff fixed at that point.