Spark RDDs with shared pointers within each partition (and the magic number 200??)

For anybody stumbling upon this in the future, I eventually came up with a super hacky solution (though I'd still be happy to hear a better one). Instead of using rdd.cache(), I define:

def cached[T: ClassTag](rdd:RDD[T]) = {
    rdd.mapPartitions(p => 
    ).cache().mapPartitions(p =>

so that cached(rdd) returns an RDD that is generated from a 'cached' List

How to find the number of pointers in an array of pointers
There is no standard way to do it. Some compilers provide functions that return the size of an allocated block, e.g. _msize in Visual Studio. If you divide that by the size of a pointer, you get the number of elements in your "array". But apart from being non-standard, this only works if the "array" was allocated dynamically and the caller intends to use the whole block for the array
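Since the language gives no portable way to recover the count from a bare pointer, the usual workarounds are to carry the length along or to agree on a terminator. Below is a minimal sketch of both, not a definitive implementation; the helper names count_until_null and array_length are hypothetical, and the nullptr sentinel is a convention the caller has to uphold (the same one argv follows).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: counts the entries of a pointer array that the caller
// has terminated with nullptr. The language does not check this convention;
// if the sentinel is missing, the loop reads past the end of the array.
std::size_t count_until_null(const char* const* items) {
    assert(items != nullptr);
    std::size_t n = 0;
    while (items[n] != nullptr) {
        ++n;
    }
    return n;
}

// Hypothetical helper: for a real array that has not decayed to a pointer,
// the length is part of the type, so the compiler can report it directly.
// This counts every slot, including a trailing nullptr sentinel if present.
template <typename T, std::size_t N>
constexpr std::size_t array_length(T (&)[N]) {
    return N;
}
```

Given const char* names[] = {"a", "b", "c", nullptr};, count_until_null(names) counts the three entries before the sentinel, while array_length(names) reports all four slots. When neither approach fits, a std::vector of pointers keeps the count next to the data.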

Spark Cassandra connector - Range query on partition key
I think the CassandraRDD error is telling you that the query you are trying to run is not allowed in Cassandra, so you have to load the whole table into a CassandraRDD and then apply a Spark filter operation to it. Your code (in Scala) should look something like this: val cassRDD= sc.cassandraTable("keyspace name", "table name").filter(row=> row.getDate("timestamp")>=DateFormat('

new and make_shared for shared pointers
Why is the second one not a shared pointer? Will that not increment a reference count? I believe the quote refers to the original poster's code, which claims to create a smart pointer but in fact does not. ptr_res2 is just a regular pointer: cout << "Create smart_ptr using new..." << endl; auto ptr_res2(new Object("new")); cout << "Create smart_ptr using new: done."
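To make the point concrete, here is a small sketch of what auto deduces from a new expression compared with an explicit std::shared_ptr or std::make_shared. The Object struct and the demo_use_count function are hypothetical stand-ins for the original poster's code, which is not shown in full here.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for the Object class in the question.
struct Object {
    explicit Object(std::string n) : name(std::move(n)) {}
    std::string name;
};

// Returns the reference count of a shared_ptr after one copy has been made.
long demo_use_count() {
    // auto deduces the type of the initializer, and new yields a raw pointer,
    // so no reference counting is involved at this point.
    auto raw(new Object("new"));
    static_assert(std::is_same<decltype(raw), Object*>::value,
                  "auto deduces a raw pointer from a new expression");

    // Naming the smart pointer type explicitly hands ownership to it.
    std::shared_ptr<Object> owned(raw);

    // make_shared creates the object and its control block together.
    auto made = std::make_shared<Object>("make_shared");
    static_assert(std::is_same<decltype(made), std::shared_ptr<Object>>::value,
                  "make_shared returns a shared_ptr");
    assert(made.use_count() == 1);

    // Copying a shared_ptr is what increments the reference count.
    auto copy = owned;
    return owned.use_count();
}
```

Calling demo_use_count() returns 2, because owned and copy share the same object. The raw pointer never takes part in the count, so if it were not handed to a shared_ptr it would have to be deleted manually.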

row number over partition
Use the Dense_Rank() ranking function:
SELECT Dense_rank() OVER (PARTITION BY name ORDER BY cdt) Rn, rvdt, cdt, name, template FROM #temp
OUTPUT:
Rn rvdt                    cdt                     name template
-- ----------------------- ----------------------- ---- ----------
1  2014-11-11 22:56:27.000 2014-10-11 23:56:27.000 Joe  Tempalte 1
1  2014-11-1

Spark Error: Not enough space to cache partition rdd_8_2 in memory! Free memory is 58905314 bytes
You are currently running with the default memory options, as indicated in the logs: 14/11/22 17:07:24 INFO MemoryStore: MemoryStore started with capacity 265.1 MB. If you are running locally, you need to set the option --driver-memory 4G instead.

