Please log in to watch this conference skillscast.
Real world data is rarely clean: there are often corrupted and duplicate records, and even corrupted records that are duplicates! One step in data cleaning is entity resolution: connecting all of the duplicate records into the single underlying entity that they represent.
This talk will describe how we approach entity resolution, and look at some of the challenges, solutions and lessons learnt when doing entity resolution on top of Apache Spark, and scaling it to process billions of records.
YOU MAY ALSO LIKE: