Often while writing unit tests in Scala, I need to read from a file in the resources folder. There are a few variations on how this can be done, especially when the contents of the file are used as a DataFrame in Spark. Here are some examples that I want to keep for myself as notes.
All these examples work with JDK 1.8u144, Scala 2.11.8, and Spark 2.3.1. Test them yourself if your software versions differ substantially from these.
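The Spark examples below build a SparkSession from a SparkConf named conf without showing that setup. A minimal sketch of what it might look like for local unit tests (the app name and master setting here are assumptions, not part of the original examples):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Minimal local configuration for unit tests; names are illustrative
val conf = new SparkConf()
  .setAppName("resource-file-tests")
  .setMaster("local[2]")

val session: SparkSession = SparkSession.builder().config(conf).getOrCreate()
```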
- Get the contents of a text file as an Iterator:
// In the path below, "/" refers to the root of the resources folder
val fileStream = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/path/to/file.txt"))
val iterator: Iterator[String] = fileStream.getLines()

// Note: getLines() consumes the underlying stream, so only one of the
// reads below will see the file's contents for a given Source.

// Print the lines
fileStream.getLines().foreach(println)

// To get the complete contents of the file as a single string
val contents = fileStream.getLines().mkString("\n")
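A related caveat: a Source holds the underlying stream open, so in test code I close it after use. A sketch of the try/finally pattern (shown here with a temporary file standing in for a resource, so the snippet is self-contained):

```scala
import java.nio.file.Files

// Create a throwaway file standing in for a resource on the classpath
val tmp = Files.createTempFile("sample", ".txt")
Files.write(tmp, "line1\nline2".getBytes("UTF-8"))

// Read all lines, then close the Source even if reading throws
val source = scala.io.Source.fromInputStream(Files.newInputStream(tmp))
val contents =
  try source.getLines().mkString("\n")
  finally source.close()
```

With an actual resource, the Source would come from getClass.getResourceAsStream as in the example above; only the input stream differs.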
- Get a text file as a Spark RDD[String], with individual lines as rows:
// In the path below, "/" refers to the root of the resources folder
val session: SparkSession = SparkSession.builder().config(conf).getOrCreate()
val fileStream = getClass.getResourceAsStream("/path/to/file.txt")
val input = session.sparkContext.makeRDD(scala.io.Source.fromInputStream(fileStream).getLines().toList)

// Another, shorter alternative:
val input = session.sparkContext.textFile(getClass.getResource("/path/to/file.txt").getPath)
- Get a text file as an [n x 1] Spark DataFrame, with individual lines as rows:
// In the path below, "/" refers to the root of the resources folder
val session: SparkSession = SparkSession.builder().config(conf).getOrCreate()

// toDF requires the session's implicits to be in scope
import session.implicits._
val input: DataFrame = session.sparkContext
  .textFile(getClass.getResource("/path/to/file.txt").getPath)
  .toDF("text")
- Read a line-delimited JSON file into a Spark DataFrame:
// In the path below, "/" refers to the root of the resources folder
val session: SparkSession = SparkSession.builder().config(conf).getOrCreate()

// The line below works when the file is in the resources folder and the program is run
// through the IDE or the command line. However, it doesn't work when packaged in a jar;
// it results in org.apache.spark.sql.AnalysisException: Path does not exist
val input: DataFrame = session.read.json(getClass.getResource("/path/to/json/file").getPath).select("text")

// The following works even when the resource file is packaged in a jar
import session.implicits._
val ds = session.createDataset[String](Source.fromInputStream(getClass.getResourceAsStream("/path/to/json/file")).getLines().toSeq)
val input = session.read.json(ds).select("text")
- Read a binary file from resources (with or without Spark), using the java.nio.file.Files class:
import java.nio.file.{Files, Paths}

val filePath = Paths.get(getClass.getResource("/myfile.zip").toURI)
val fileBytes = Files.readAllBytes(filePath)
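One gotcha with this approach: getClass.getResource returns null when the file is missing, which surfaces as an opaque NullPointerException. Wrapping the lookup in Option makes the failure explicit; a sketch (the helper name is my own, not from the examples above):

```scala
import java.nio.file.{Files, Paths}

// Returns None instead of throwing when the resource is absent
def readResourceBytes(name: String): Option[Array[Byte]] =
  Option(getClass.getResource(name))
    .map(url => Files.readAllBytes(Paths.get(url.toURI)))
```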
- The final one is a niche use case: we have a bunch of events in an Avro file and would like to read those events. In this case we use an iterator style, i.e., we stream the file lazily. For more notes on this, see here.
// In the path below, "/" refers to the root of the resources folder
import scala.collection.JavaConverters._

val resourceURI: URI = getClass.getResource("/path/to/events.avro").toURI
val file: File = Paths.get(resourceURI).toFile
val datumReader = new SpecificDatumReader[Event]()
val dataFileReader = new DataFileReader[Event](file, datumReader)

// DataFileReader behaves like an iterator, i.e., a stream of events.
// E.g., if we wanted to filter a few events into a separate collection:
def eventFilter(event: Event) = ???
val specialSet = dataFileReader.asScala.withFilter(eventFilter).foldLeft(...)
The specific resource file needs to be available in the target folder or in the compiled artifact (e.g., the jar file) for any of the above code to work. I use Maven for all my projects, and use the following snippet in the "build" section of the pom file to control which resources are available in the target folder or the jar file.
Notes:
- IDEs like IntelliJ also use the settings in the pom file to make resource files available.
- You might have to reload the pom and rebuild the project for the changes to take effect.
<build>
  <finalName>${project.name}-${project.version}</finalName>
  <resources>
    <resource>
      <directory>${basedir}/src/main/resources</directory>
    </resource>
  </resources>
  <testResources>
    <testResource>
      <directory>${basedir}/src/test/resources</directory>
      <includes>
        <include>**/*.conf</include>
        <include>**/*.json</include>
        <include>**/*.zip</include>
      </includes>
    </testResource>
  </testResources>
  <plugins>
    <plugin>
      ...
    </plugin>
  </plugins>
</build>
Update (Sep 16, 2018): Added way to read JSON file in Spark through read.json
Update (Jan 05, 2019): Updated title
Update (Aug 11, 2020): Added notes on reading binary files from resource, notes on the maven pom file