
Fixing garbled text when reading HDFS file data in Java

Date: 2022-08-21 09:51:08

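Pitfalls: garbled text when reading HDFS files with the Java API

I wanted to write an endpoint that reads part of a file on HDFS so it can be previewed. After implementing it by following blog posts online, I found that the output was sometimes garbled. For example, when reading a CSV file whose strings are separated by commas:

An English string such as aaa displays correctly.
A Chinese string such as "你好" displays correctly.
A mixed Chinese and English string such as "aaa你好" comes out garbled.

I went through many blog posts, and nearly all of them proposed the same fix: decode with charset xxx. Skeptical, I tried them one by one, and sure enough none of them worked.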

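Approach

HDFS stores a file as a plain byte stream and does not record which charset produced it, and every local file may well use a different encoding. Uploading a local file simply means sending its already-encoded bytes to the file system for storage. So when we later GET file data and face byte streams from different files in different encodings, no single fixed charset can be expected to decode all of them correctly.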
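To make the mismatch concrete, here is a minimal sketch using only the JDK. It assumes the file was written in GBK (a common encoding for Chinese text on Windows) and that the runtime ships the GBK charset; the class name `MojibakeDemo` is made up for illustration. It does not try to reproduce the exact symptom described above, only the general effect of decoding with the wrong charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {

    // Decode the same stored bytes with whatever charset the caller picks.
    static String decode(byte[] stored, Charset charset) {
        return new String(stored, charset);
    }

    public static void main(String[] args) {
        // Assumption: the uploader's file was GBK-encoded.
        Charset gbk = Charset.forName("GBK");
        // "aaa" followed by the two Chinese characters for "hello" (U+4F60 U+597D).
        byte[] stored = "aaa\u4f60\u597d".getBytes(gbk);

        // Matching charset: the original text comes back.
        System.out.println(decode(stored, gbk));
        // Wrong charset: the Chinese part turns into replacement characters.
        System.out.println(decode(stored, StandardCharsets.UTF_8));
    }
}
```

The ASCII prefix survives either way because GBK and UTF-8 encode ASCII identically, which is why files containing only English text never show the problem.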

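That leaves two possible solutions:

1. Standardize on one charset for everything stored in HDFS. For example, pick UTF-8 and normalize at upload time, converting every file's bytes to UTF-8 before storing them. Reading file data then only ever needs UTF-8 decoding. The transcoding step still has plenty of problems of its own, however, and is hard to implement well.
2. Decode dynamically. Pick the decoding charset according to the charset the file was encoded with. This leaves the file's original bytes untouched and almost never produces garbled text.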
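For comparison, the first option boils down to something like the sketch below. It assumes the source charset is already known at upload time, which is exactly the hard part in practice; `Utf8Transcoder` and `toUtf8` are hypothetical names, not part of any HDFS API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf8Transcoder {

    // Re-encode bytes from a known source charset into UTF-8 before storing them.
    static byte[] toUtf8(byte[] raw, Charset source) {
        return new String(raw, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumed source encoding
        byte[] raw = "\u4f60\u597d".getBytes(gbk);
        byte[] utf8 = toUtf8(raw, gbk);
        System.out.println(raw.length + " GBK bytes became " + utf8.length + " UTF-8 bytes");
    }
}
```

If the source charset is guessed wrong here, the stored bytes are damaged permanently, whereas a wrong guess under dynamic decoding only affects one read.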

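Having chosen dynamic decoding, the hard part is deciding which charset to decode with. The material below gave me the solution.

Detecting the encoding of text (a byte stream) in Java

Requirement:

Detect the encoding of a given file or byte stream.

Implementation:

Based on jchardet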

<dependency>
    <groupId>net.sourceforge.jchardet</groupId>
    <artifactId>jchardet</artifactId>
    <version>1.0</version>
</dependency>

The code is as follows:

import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.IOUtils; // assumed: the original does not show which IOUtils it uses
import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;
import org.mozilla.intl.chardet.nsPSMDetector;

public class DetectorUtils {

    private DetectorUtils() {
    }

    // Remembers the charset that jchardet reports once it is confident.
    static class ChineseCharsetDetectionObserver implements nsICharsetDetectionObserver {
        private boolean found;
        private String result;

        public ChineseCharsetDetectionObserver(boolean found, String result) {
            this.found = found;
            this.result = result;
        }

        @Override
        public void Notify(String charset) {
            found = true;
            result = charset;
        }

        public boolean isFound() {
            return found;
        }

        public String getResult() {
            return result;
        }
    }

    // Returns the detected charset, or the list of probable charsets if detection is inconclusive.
    public static String[] detectChineseCharset(InputStream in) throws Exception {
        BufferedInputStream imp = null;
        try {
            nsDetector det = new nsDetector(nsPSMDetector.CHINESE);
            ChineseCharsetDetectionObserver detectionObserver =
                    new ChineseCharsetDetectionObserver(false, StandardCharsets.UTF_8.name());
            det.Init(detectionObserver);

            imp = new BufferedInputStream(in);
            byte[] buf = new byte[1024];
            int len;
            boolean isAscii = true;
            while ((len = imp.read(buf, 0, buf.length)) != -1) {
                // Keep checking for pure ASCII until a non-ASCII byte shows up.
                if (isAscii) {
                    isAscii = det.isAscii(buf, len);
                }
                // Feed non-ASCII data to the detector; stop as soon as it is sure.
                if (!isAscii && det.DoIt(buf, len, false)) {
                    break;
                }
            }
            det.DataEnd();

            if (isAscii) {
                return new String[] { "ASCII" };
            }
            if (detectionObserver.isFound()) {
                return new String[] { detectionObserver.getResult() };
            }
            return det.getProbableCharsets();
        } finally {
            IOUtils.closeQuietly(imp);
            IOUtils.closeQuietly(in);
        }
    }
}

Test:

String file = "C:/3737001.xml";
String[] probableSet = DetectorUtils.detectChineseCharset(new FileInputStream(file));
for (String charset : probableSet) {
    System.out.println(charset);
}

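Ready-made packages exist for detecting the encoding of a byte stream (the UniversalDetector class used below comes from juniversalchardet, which was hosted on Google Code). So the plan is clear: read some of the file's bytes, run them through the detector, and then decode with the charset it reports.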
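The detect-then-decode flow can be tried out without any third-party library. The sketch below is a stand-in for a real detector, not what the detection libraries do internally: it strictly decodes the sample with each candidate charset and returns the first one that reports no malformed input. The candidate list (UTF-8, then GBK) and the class name `StrictDecodeGuess` are assumptions for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class StrictDecodeGuess {

    // Assumed candidates; UTF-8 goes first because it rejects most non-UTF-8 input.
    private static final List<Charset> CANDIDATES =
            Arrays.asList(StandardCharsets.UTF_8, Charset.forName("GBK"));

    // Return the first candidate that decodes the sample without errors.
    static Charset guess(byte[] sample) {
        for (Charset candidate : CANDIDATES) {
            try {
                candidate.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(sample));
                return candidate;
            } catch (CharacterCodingException e) {
                // Not valid in this charset; try the next candidate.
            }
        }
        return StandardCharsets.UTF_8; // fallback when nothing decodes cleanly
    }

    // Detect first, then decode with whatever was detected.
    static String decode(byte[] bytes) {
        return new String(bytes, guess(bytes));
    }

    public static void main(String[] args) {
        byte[] gbkBytes = "aaa\u4f60\u597d".getBytes(Charset.forName("GBK"));
        System.out.println(guess(gbkBytes) + " -> " + decode(gbkBytes));
    }
}
```

One caveat: a sample cut off in the middle of a multi-byte character fails strict decoding even in the correct charset, so this approach works best on whole files or on samples trimmed at a line boundary. A statistical detector copes better with such input.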

Solution code

pom

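<!-- UniversalDetector, used in the code below, ships in juniversalchardet rather than jchardet -->
<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>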

Logic for reading part of a file from HDFS for preview

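import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.commons.lang3.StringUtils; // assumed: the original does not show which StringUtils it uses
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.mozilla.universalchardet.UniversalDetector;

// The methods below belong to a service class that provides `fs` (a Hadoop FileSystem),
// `logger` and `getHdfsPath(...)`; none of these are shown in the original post.

// Fetch part of a file's data for preview.
public List<String> getFileDataWithLimitLines(String filePath, Integer limit) {
    FSDataInputStream fileStream = openFile(filePath);
    return readFileWithLimit(fileStream, limit);
}

// Open the file as a data stream; returns null if the file cannot be opened.
private FSDataInputStream openFile(String filePath) {
    FSDataInputStream fileStream = null;
    try {
        fileStream = fs.open(new Path(getHdfsPath(filePath)));
    } catch (IOException e) {
        logger.error("fail to open file:{}", filePath, e);
    }
    return fileStream;
}

// Read at most `limit` non-empty lines of file data.
private List<String> readFileWithLimit(FSDataInputStream fileStream, Integer limit) {
    byte[] bytes = readByteStream(fileStream);
    String data = decodeByteStream(bytes);
    if (data == null) {
        return null;
    }
    // Accept both Windows (\r\n) and Unix (\n) line endings.
    List<String> rows = Arrays.asList(data.split("\\r?\\n"));
    return rows.stream()
            .filter(StringUtils::isNotEmpty)
            .limit(limit)
            .collect(Collectors.toList());
}

// Read the raw bytes from the file stream, then close the stream.
private byte[] readByteStream(FSDataInputStream fileStream) {
    if (fileStream == null) {
        return null; // openFile failed and has already logged the error
    }
    byte[] bytes = new byte[1024 * 30];
    int len;
    ByteArrayOutputStream stream = new ByteArrayOutputStream();
    try {
        while ((len = fileStream.read(bytes)) != -1) {
            stream.write(bytes, 0, len);
        }
    } catch (IOException e) {
        logger.error("read file bytes stream failed.", e);
        return null;
    } finally {
        IOUtils.closeStream(fileStream);
    }
    return stream.toByteArray();
}

// Decode the bytes using the detected charset.
private String decodeByteStream(byte[] bytes) {
    if (bytes == null) {
        return null;
    }
    String encoding = guessEncoding(bytes);
    String data = null;
    try {
        data = new String(bytes, encoding);
    } catch (Exception e) {
        logger.error("decode byte stream failed.", e);
    }
    return data;
}

// Detect the encoding with UniversalDetector; fall back to UTF-8 if nothing is detected.
private String guessEncoding(byte[] bytes) {
    UniversalDetector detector = new UniversalDetector(null);
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    detector.reset();
    if (StringUtils.isEmpty(encoding)) {
        encoding = "UTF-8";
    }
    return encoding;
}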
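The preview logic above reads the entire file into memory before keeping `limit` rows, which may be costly for large files. Once the charset is known, one possible alternative is to stop reading as soon as enough lines have been collected. The sketch below uses only the JDK and takes a plain `InputStream` (which `FSDataInputStream` is); `PreviewReader` and `readFirstLines` are hypothetical names, and the charset is assumed to have been detected beforehand from a sample of the file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class PreviewReader {

    // Collect up to `limit` non-empty lines, then stop reading and close the stream.
    static List<String> readFirstLines(InputStream in, Charset charset, int limit) {
        List<String> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while (rows.size() < limit && (line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    rows.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        byte[] csv = "a,b\r\n\r\nc,d\ne,f\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readFirstLines(new ByteArrayInputStream(csv), StandardCharsets.UTF_8, 2));
    }
}
```

`BufferedReader.readLine` treats `\r\n`, `\n` and `\r` as line terminators, so no manual splitting is needed.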
