Often while writing unit tests in Scala, I need to read from a file in the resources folder. There are a few variations on how this can be done, especially when the contents of the file are used as a DataFrame in Spark. Here are some examples that I want to keep for myself as notes.
All these examples work with JDK 1.8u144, Scala 2.11.8, and Spark 2.3.1. Test them yourself if your software versions differ substantially from these.
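The Spark examples below build a SparkSession from a SparkConf named conf without showing that setup. A minimal sketch of what it might look like for local unit tests (the app name and master setting here are assumptions, not part of the original examples):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Minimal local configuration for unit tests; names are illustrative
val conf = new SparkConf()
  .setAppName("resource-file-tests")
  .setMaster("local[2]")

val session: SparkSession = SparkSession.builder().config(conf).getOrCreate()
```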
- Get the contents of a text file as an Iterator:
// In the path below, "/" refers to the root of the resources folder
val fileStream = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/path/to/file.txt"))
val iterator: Iterator[String] = fileStream.getLines()

// Note: getLines() consumes the underlying stream, so only one of the
// reads below will see the file's contents for a given Source.

// Print the lines
fileStream.getLines().foreach(println)

// To get the complete contents of the file as a single string
val contents = fileStream.getLines().mkString("\n")
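A related caveat: a Source holds the underlying stream open, so in test code I close it after use. A sketch of the try/finally pattern (shown here with a temporary file standing in for a resource, so the snippet is self-contained):

```scala
import java.nio.file.Files

// Create a throwaway file standing in for a resource on the classpath
val tmp = Files.createTempFile("sample", ".txt")
Files.write(tmp, "line1\nline2".getBytes("UTF-8"))

// Read all lines, then close the Source even if reading throws
val source = scala.io.Source.fromInputStream(Files.newInputStream(tmp))
val contents =
  try source.getLines().mkString("\n")
  finally source.close()
```

With an actual resource, the Source would come from getClass.getResourceAsStream as in the example above; only the input stream differs.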
- Get a text file as a Spark RDD[String], with individual lines as rows:
// In the path below, "/" refers to the root of the resources folder
val session: SparkSession = SparkSession.builder().config(conf).getOrCreate()
val fileStream = getClass.getResourceAsStream("/path/to/file.txt")
val input = session.sparkContext.makeRDD(scala.io.Source.fromInputStream(fileStream).getLines().toList)

// Another, shorter alternative:
val input = session.sparkContext.textFile(getClass.getResource("/path/to/file.txt").getPath)
- Get a text file as an [n x 1] Spark DataFrame, with individual lines as rows:
// In the path below, "/" refers to the root of the resources folder
val session: SparkSession = SparkSession.builder().config(conf).getOrCreate()

// toDF requires the session's implicits to be in scope
import session.implicits._
val input: DataFrame = session.sparkContext
  .textFile(getClass.getResource("/path/to/file.txt").getPath)
  .toDF("text")
- Read a line-delimited JSON file into a Spark DataFrame:
// In the path below, "/" refers to the root of the resources folder
val session: SparkSession = SparkSession.builder().config(conf).getOrCreate()

// The line below works when the file is in the resources folder and the program is run
// through the IDE or the command line. However, it doesn't work when packaged in a jar;
// it results in org.apache.spark.sql.AnalysisException: Path does not exist
val input: DataFrame = session.read.json(getClass.getResource("/path/to/json/file").getPath).select("text")

// The following works even when the resource file is packaged in a jar
import session.implicits._
val ds = session.createDataset[String](Source.fromInputStream(getClass.getResourceAsStream("/path/to/json/file")).getLines().toSeq)
val input = session.read.json(ds).select("text")
- Read a binary file from resources (with or without Spark), using the java.nio.file.Files class:
import java.nio.file.{Files, Paths}

val filePath = Paths.get(getClass.getResource("/myfile.zip").toURI)
val fileBytes = Files.readAllBytes(filePath)
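One gotcha with this approach: getClass.getResource returns null when the file is missing, which surfaces as an opaque NullPointerException. Wrapping the lookup in Option makes the failure explicit; a sketch (the helper name is my own, not from the examples above):

```scala
import java.nio.file.{Files, Paths}

// Returns None instead of throwing when the resource is absent
def readResourceBytes(name: String): Option[Array[Byte]] =
  Option(getClass.getResource(name))
    .map(url => Files.readAllBytes(Paths.get(url.toURI)))
```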
- The final one is a niche use case: we have a bunch of events in an Avro file and would like to read those events. In this case we use an iterator style, i.e., we stream the file lazily. For more notes on this, see here.
// In the path below, "/" refers to the root of the resources folder
import scala.collection.JavaConverters._

val resourceURI: URI = getClass.getResource("/path/to/events.avro").toURI
val file: File = Paths.get(resourceURI).toFile
val datumReader = new SpecificDatumReader[Event]()
val dataFileReader = new DataFileReader[Event](file, datumReader)

// DataFileReader behaves like an iterator, i.e., a stream of events.
// E.g., if we wanted to filter a few events into a separate collection:
def eventFilter(event: Event) = ???
val specialSet = dataFileReader.asScala.withFilter(eventFilter).foldLeft(...)
The specific resource file needs to be available in the target folder or in the compiled artifact (e.g., the jar file) for any of the above code to work. I use Maven for all my projects, and use the following snippet in the "build" section of the pom file to control which resources are available in the target folder or the jar file.
Notes:
- IDEs like IntelliJ also use the settings in the pom file to make resource files available.
- You might have to reload the pom and rebuild the project for the changes to take effect.
<build>
  <finalName>${project.name}-${project.version}</finalName>
  <resources>
    <resource>
      <directory>${basedir}/src/main/resources</directory>
    </resource>
  </resources>
  <testResources>
    <testResource>
      <directory>${basedir}/src/test/resources</directory>
      <includes>
        <include>**/*.conf</include>
        <include>**/*.json</include>
        <include>**/*.zip</include>
      </includes>
    </testResource>
  </testResources>
  <plugins>
    <plugin>
      ...
    </plugin>
  </plugins>
</build>
Update (Sep 16, 2018): Added way to read JSON file in Spark through read.json
Update (Jan 05, 2019): Updated title
Update (Aug 11, 2020): Added notes on reading binary files from resource, notes on the maven pom file