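PHP character encoding hell reading a CSV file with fgets

I have a website that receives a CSV file by FTP once a month. For years it was an ASCII file. Now I receive UTF-8 one month, then UTF-16BE the next and UTF-16LE the month after that. Maybe I'll get UTF-32 next month. fgets returns the byte order mark (BOM) at the beginning of the UTF files. How can I get PHP to recognize the character encoding automatically? I tried mb_detect_encoding, and it returned ASCII regardless of the file type. I then changed my code to read the BOM and pass the character encoding explicitly to mb_convert_encoding. This worked until the latest file, which is UTF-16LE. For this file it reads the first line correctly, and all subsequent lines display as question marks ("?"). What am I doing wrong?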
$fhandle = fopen($file_in, "r");
if ($fhandle === false)
{
    echo "<p class=redbold>Error opening file $file_in.</p>";
    die();
}
$i = 0;
while (($line = fgets($fhandle)) !== false)
{
    $i++;
    // Detect encoding on first line. Actual text always begins with string "Document"
    if ($i == 1)
    {
        $line_start = substr($line, 0, 4);
        $line_start_hex = bin2hex($line_start);
        $utf16_start = 'fffe4400';
        $utf8_start = 'efbbbf44';
        if (strcmp($line_start, 'Docu') == 0)
        {
            $char_encoding = 'ASCII';
        }
        elseif (strcmp($line_start_hex, 'efbbbf44') == 0)
        {
            $char_encoding = 'UTF-8';
            $line = substr($line, 3);
        }
        elseif (strcmp($line_start_hex, 'fffe4400') == 0)
        {
            $char_encoding = 'UTF-16LE';
            $line = substr($line, 2);
        }
        elseif (strcmp($line_start_hex, 'feff4400') == 0)
        {
            $char_encoding = 'UTF-16BE';
            $line = substr($line, 2);
        }
        else
        {
            echo "<p class=redbold>Error, unknown character encoding. Line =<br>", $line_start_hex, '</p>';
            require('../footer.php');
            die();
        }
        echo "<p>char_encoding = $char_encoding</p>";
    }
    // Convert UTF
    if ($char_encoding != 'ASCII')
    {
        $line = mb_convert_encoding($line, 'ASCII', $char_encoding);
    }
    echo '<p>'; var_dump($line); echo '</p>';
}
Output:
char_encoding = UTF-16LE
string(101) "DocumentNumber,RecordedTS,Title,PageCount,City,TransTaxAccountCode,TotalTransferTax,Description,Name
"
string(83) "???????????????????????????????????????????????????????????????????????????????????"
string(88) "????????????????????????????????????????????????????????????????????????????????????????"
string(84) "????????????????????????????????????????????????????????????????????????????????????"
string(80) "????????????????????????????????????????????????????????????????????????????????"
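
One likely cause, offered as a sketch rather than a confirmed diagnosis of this exact file: fgets() looks for the single byte 0x0A, but in UTF-16LE a newline is the two bytes 0A 00. Each returned line therefore ends before the trailing 00, which then becomes the first byte of the next line and shifts every following code unit by one byte. The snippet below reproduces this with a made-up two-line sample held in a php://memory stream; the sample text and variable names are illustrative only.

```php
<?php
// Hypothetical two-line sample, encoded as UTF-16LE (no BOM, to keep the hex short).
$utf16le = mb_convert_encoding("Doc\nAB\n", 'UTF-16LE', 'UTF-8');

// Put the bytes in an in-memory stream so the real fgets() can read them.
$stream = fopen('php://memory', 'w+');
fwrite($stream, $utf16le);
rewind($stream);

// Collect each chunk fgets() returns, as hex, to see where it splits.
$chunks = [];
while (($line = fgets($stream)) !== false) {
    $chunks[] = bin2hex($line);
}
fclose($stream);

print_r($chunks);
// [0] => 44006f0063000a   "Doc" plus only the first byte of the newline
// [1] => 00410042000a     starts with the leftover 00, so byte pairs are misaligned
// [2] => 00               leftover half of the final newline
```

The second chunk begins with 00, so decoding it as UTF-16LE pairs the bytes as 00 41, 00 42, and so on, which yields U+4100, U+4200 and similar code points. None of these has an ASCII equivalent, so mb_convert_encoding substitutes "?" for each one.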
Unfortunately, mb_detect_encoding seems to return "ASCII" for some UTF files. – George
Oops, missed that part of the question... back to the drawing board –
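But ASCII is a subset of Unicode (the first 128 code points), so the files should be easy to convert. Just convert to ASCII and don't use multibyte strings. Oh, and have you thought about yelling at whoever is providing the FTP data? – Amelia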
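
Because fgets() splits on raw 0x0A bytes, one option is to decode the whole file before splitting it into lines. The following is a minimal sketch under a few assumptions: the file fits in memory, every non-ASCII file starts with a BOM, and UTF-8 is an acceptable target (pass 'ASCII' instead if that is what the rest of the code expects). The helper name decode_by_bom and the sample data are hypothetical and not part of the original code.

```php
<?php
// Hypothetical helper: strip a leading BOM and return the payload as UTF-8.
function decode_by_bom(string $bytes): string
{
    // Longer BOMs come first: the UTF-32LE BOM begins with the UTF-16LE BOM.
    $boms = [
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE"         => 'UTF-16LE',
        "\xFE\xFF"         => 'UTF-16BE',
    ];
    foreach ($boms as $bom => $encoding) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            $body = substr($bytes, strlen($bom));
            return $encoding === 'UTF-8'
                ? $body
                : mb_convert_encoding($body, 'UTF-8', $encoding);
        }
    }
    // No BOM found: assume the data is already ASCII or UTF-8.
    return $bytes;
}

// Made-up sample standing in for file_get_contents($file_in).
$sample = "\xFF\xFE" . mb_convert_encoding("Document,Name\n123,Smith\n", 'UTF-16LE', 'UTF-8');

// Decode first, then split into lines on any newline style.
$text  = decode_by_bom($sample);
$lines = preg_split('/\R/', rtrim($text));
print_r($lines);
```

Splitting only after conversion means each line boundary is found in the decoded text, so a two-byte or four-byte newline is never cut in half.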