Friday, October 07, 2016

Groovy script to load CSV data files into a PostgreSQL database on Amazon RDS . . .

Recently, I worked on a task to migrate a Java 6 Grails 2.2.1 application from Oracle to PostgreSQL. In my recent post (Part 1 of 2), I covered the details of migrating the Grails application itself; I will cover data migration in the next part. In this post, I just want to share some of my findings and learnings from loading migrated data into a PostgreSQL database hosted on Amazon RDS.

As always, for any problem there are multiple solutions. After quickly looking at a couple of commercial tools and an open source tool that offer migrating a database from Oracle to PostgreSQL, and trying out the trial version of one commercial tool, I decided not to use any such tool for data migration. Many of these tools simply migrate the schema and attempt to map data and other database objects to the closest possible match in the target database. This often yields a fully migrated database, but leaves you to deal with the problem of object-relational impedance mismatch.

Since I had already leveraged the Grails database-migration plugin to create the new schema, constraints, and sequences for PostgreSQL, all I needed was to get the data migrated and loaded into the newly created schema. I took the well-known Software Engineering approach: ETL, a three-step process of migrating data. For step 1) Extraction, I used the SQL Developer tool to extract data into delimited files. For step 2) Transformation, I wrote Groovy scripts to transform the data, taking care of the data types that needed transformation. For step 3) Load, I wrote Groovy scripts to load the transformed data files into PostgreSQL using its COPY command, which is very handy for this task.

All looked great with my local PostgreSQL database. I was able to script out both tasks: data transformation and loading. But when I ran my loader script connecting to the Amazon RDS hosted PostgreSQL database to load the transformed data, the COPY command simply failed with a misleading message: "must be superuser for running COPY".

After some investigation and some reading of the Amazon docs, I learned that Amazon RDS doesn't allow the PostgreSQL COPY command for security reasons. However, it does allow the client-side \COPY command, which must be run through psql, the interactive command-line tool that comes with PostgreSQL. After playing with the \COPY command in psql, I was able to manually load one migrated data file into a specific database table on Amazon RDS. Cool!

My next step was to take this into my loader Groovy script. Since Java allows executing an external program/process concurrently from a Java application, and Groovy enhances the java.lang.Process class further, it was easy to execute the psql tool from my script and get all migrated data files loaded into the Amazon RDS hosted PostgreSQL database.

Following is a snippet of the Groovy script code that is of particular interest in this post (certain details are omitted for brevity):
@GrabResolver(name='postgres', root='https://mvnrepository.com/artifact/org.postgresql/postgresql')
@GrabConfig(systemClassLoader=true)
@Grab(group='org.postgresql', module='postgresql', version='9.4.1211.jre6')
import groovy.sql.Sql

//read config groovy file that contains all necessary configurations
//including data-source, list of migrated files in the order to be loaded etc.
def config = new ConfigSlurper().parse(
    new File('postgres-data-migration-config.groovy').toURI().toURL()
)
...
def sql = Sql.newInstance(
    dataSource.url, dataSource.user, dataSource.password, dataSource.driverClassName
)
boolean isLoadingToAmazonAws = dataSource.url.contains('amazonaws.com')

//load each migrated data file into its table
dataFiles.each { String dataFileName ->
    println "loading data: $dataFileName"
    String header = new File("${datafileBaseDir}/${dataFileName}").withReader {
        return it.readLine()
    }
    //transform header: "h1"%"h2"%"h3" into: h1,h2,h3 to form
    //column names needed for COPY or \COPY command
    String columnNames = header.replaceAll(dataDelimiter, ',').replaceAll('"', '')
    ...
    if (isLoadingToAmazonAws) {
        //load data onto amazon aws database
        //build psql command with required arguments to invoke it as an external process from this script
        def copyCommand = [
            'psql',
            "--host=${parsedDataSourceUrl.host}",
            "--port=${parsedDataSourceUrl.port}",
            "--dbname=${parsedDataSourceUrl.dbname}",
            "--username=${dataSource.user}",
            "--no-password", //suppress password prompt
            "--command=\\COPY ${tableName}(${columnNames}) FROM '${datafileBaseDir}/${dataFileName}' DELIMITER '${dataDelimiter}' CSV HEADER"
        ]
        println copyCommand.join(' ') //this is how it would look if it were run from the command line
        //run psql tool as an external process
        def process = copyCommand.execute()
        def out = new StringBuffer()
        def err = new StringBuffer()
        process.waitForProcessOutput(out, err) //wait for the process to finish
        println "exit code: ${process.exitValue()}"
        if (out.size()) { println "Success: $out" }
        if (err.size()) { println "Error: $err" }
    } else {
        //load data into a local database by simply running COPY as a SQL statement
        String copyStatement = """
            COPY ${tableName}(${columnNames})
            FROM '${datafileBaseDir}/${dataFileName}'
            DELIMITER '${dataDelimiter}' CSV HEADER
        """
        println copyStatement
        boolean status = sql.execute(copyStatement)
        println status ? "Error loading..." : "Successfully loaded..."
    }
}

println "Altering sequence values..."
postgresSequences.each { sequenceName, value ->
    String alterStatement = "ALTER SEQUENCE $sequenceName RESTART WITH $value"
    println alterStatement
    boolean status = sql.execute(alterStatement)
    println status ? "Error altering sequence $sequenceName" : "Successfully altered sequence: $sequenceName to start with: $value"
}
println "Done"
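The header transformation in the script above is simple enough to try standalone. A shell equivalent (the sample header line is hypothetical, and '%' is assumed as the data delimiter, matching the comment in the script):

```shell
# Turn a '%'-delimited, quoted header line like "h1"%"h2"%"h3"
# into the comma-separated column list that COPY/\COPY expects.
echo '"h1"%"h2"%"h3"' | tr '%' ',' | tr -d '"'
# prints: h1,h2,h3
```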

A sample psql \COPY command for running from a terminal command prompt looks like:
psql \
  --host=myhost.rds.amazonaws.com --port=5432 --dbname=my-db \
  --username=my-user --no-password \
  --command="\COPY my_table(col1, col2) FROM '/my-file.dsv.migrated' DELIMITER ',' CSV HEADER"

Note that when typed at a shell prompt, the --command argument must be quoted because it contains spaces and quotes.

A sample result of the above psql execution, which also gives the number of records loaded, looks like:
exit code: 0
Success: COPY 43628

Note

No password is passed when executing the \COPY command through the psql tool. The password can be stored in ~/.pgpass, a file under the user's home directory, or in any other file referenced by the PGPASSFILE environment variable, in the format: hostname:port:database:username:password
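As a sketch of setting up such a file (the hostname, database, user, password, and path below are placeholders, not values from the actual migration), note that libpq ignores the file unless its permissions are 0600:

```shell
# Write a pgpass file; each line is hostname:port:database:username:password
cat > /tmp/pgpass <<'EOF'
myhost.rds.amazonaws.com:5432:my-db:my-user:my-secret
EOF
chmod 600 /tmp/pgpass            # required, or libpq silently ignores the file
export PGPASSFILE=/tmp/pgpass    # psql --no-password now reads the password from here
```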

