I am using Spark 1.3.1. In PySpark, I have created a DataFrame from an RDD and registered the schema, something like this:
dataLen = sqlCtx.createDataFrame(myrdd, ["id", "size"])
dataLen.registerTempTable("tbl")
At this point everything is fine: I can run a "select" query against "tbl", for example "select size from tbl where id='abc'".
Then I define a Python function, something like:
def getsize(id):
    total = sqlCtx.sql("select size from tbl where id='" + id + "'")
    return total.take(1)[0].size
Still no problem at this point: calling getsize("ab") returns a value.
The problem occurs when I invoke getsize inside an RDD transformation. Say I have an RDD named data consisting of (key, value) pairs; when I do
data.map(lambda x: (x, getsize("ab")))
it generates this error:
py4j.protocol.Py4JError: Trying to call a package
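One workaround I am considering is to collect the id-to-size lookup into a plain Python dict on the driver before the map(), so the lambda never touches sqlCtx on the workers. Here is a minimal sketch of that pattern, with plain Python objects standing in for the RDD and the query result (the `sizes` dict, the sample `data` list, and the values in them are hypothetical stand-ins, not output from my actual job):

```python
# Driver-side lookup, simulated with plain Python objects.
# 'sizes' stands in for the result of
# sqlCtx.sql("select id, size from tbl").collect() turned into a dict
# on the driver before any map() is called.
sizes = {"ab": 3, "abc": 7}  # hypothetical id -> size lookup

def getsize(id):
    # A plain dict lookup is safe to ship inside a closure to workers,
    # unlike a call to sqlCtx.sql(), which only works on the driver.
    return sizes.get(id)

data = [("k1", 1), ("k2", 2)]  # hypothetical (key, value) pairs
# mirrors data.map(lambda x: (x, getsize("ab")))
result = [(x, getsize("ab")) for x in data]
```

The idea is that only serializable plain data (the dict) is captured in the closure, not the SQLContext itself.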